
IT-Consultant

Lucene indexing process summary

When a Document arrives, the addDocument method of the DocumentWriter class calls invertDocument(doc), sorts the resulting postings in sortPostingTable(), and then stores them with writePostings. To see the term posting list in Lucene, uncomment the following part of the DocumentWriter class.

/*
for (int i = 0; i < postings.length; i++) {
  Posting posting = postings[i];
  System.out.print(posting.term);
  System.out.print(" freq=" + posting.freq);
  System.out.print(" pos=");
  System.out...

quickSort is used to sort the Posting List

private static final void quickSort(Posting[] postings, int lo, int hi) {
  if (lo >= hi)
    return;
  int mid = (lo + hi) / 2;
  if (postings[lo].term.compareTo(postings[mid].term) > 0) {
    Posting tmp = postings[lo];
    postings[lo] = postings[mid];
    postings[mid] = tmp;
  }
  if (postings[mid].term.compareTo(postings[hi].term) > 0) {
    Posting tmp = postings[mid];
    postings[mid] = postings[hi];
    postings[hi] = tmp;..

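The excerpt above is cut off after the second swap, so the rest of the method is not visible here. As a rough illustration only, the following sketch completes it with a conventional median-of-three partition. The class name PostingSortSketch, the simplified Posting type (a String term instead of Lucene's Term), and everything after the second swap are assumptions made for this example, not a quote of the Lucene source.

```java
final class PostingSortSketch {
    /** Hypothetical, simplified posting: term text and its frequency only. */
    static final class Posting {
        final String term;
        final int freq;
        Posting(String term, int freq) { this.term = term; this.freq = freq; }
    }

    /** Sorts the postings in place by term. */
    static void sort(Posting[] postings) {
        quickSort(postings, 0, postings.length - 1);
    }

    private static void swap(Posting[] a, int i, int j) {
        Posting tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }

    private static void quickSort(Posting[] postings, int lo, int hi) {
        if (lo >= hi) return;
        int mid = (lo + hi) / 2;

        // Median-of-three: put postings[lo], postings[mid], postings[hi] in order.
        if (postings[lo].term.compareTo(postings[mid].term) > 0) swap(postings, lo, mid);
        if (postings[mid].term.compareTo(postings[hi].term) > 0) {
            swap(postings, mid, hi);
            if (postings[lo].term.compareTo(postings[mid].term) > 0) swap(postings, lo, mid);
        }

        int left = lo + 1;
        int right = hi - 1;
        if (left >= right) return; // two or three elements are already ordered

        // Assumed continuation: partition around the median term.
        String pivot = postings[mid].term;
        for (;;) {
            while (postings[right].term.compareTo(pivot) > 0) right--;
            while (left < right && postings[left].term.compareTo(pivot) <= 0) left++;
            if (left < right) {
                swap(postings, left, right);
                right--;
            } else {
                break;
            }
        }
        quickSort(postings, lo, left);
        quickSort(postings, left + 1, hi);
    }
}
```

Calling PostingSortSketch.sort(postings) orders the array by term, which is the order writePostings would need when emitting terms sequentially.
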
invertDocument (Tokenizes the fields of a document into Postings)

This seems to be the most important part of the Lucene indexer.

// Tokenizes the fields of a document into Postings.
private final void invertDocument(Document doc) throws IOException {
  Iterator fieldIterator = doc.getFields().iterator();
  while (fieldIterator.hasNext()) {
    Fieldable field = (Fieldable) fieldIterator.next();
    String fieldName = field.name();
    int fieldNumber = fieldInfos.fieldNumber(fieldName);
    int length = fieldLengths[..

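The excerpt stops before the tokens are actually turned into postings. To make the idea concrete, here is a minimal sketch of inverting the text of one field into a term-to-posting table. It is not Lucene's code: InvertSketch, its Posting type, and the whitespace-plus-lowercase tokenization are all assumptions standing in for a real Analyzer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class InvertSketch {
    /** Hypothetical posting: how often a term occurs and at which token positions. */
    static final class Posting {
        int freq;
        final List<Integer> positions = new ArrayList<>();
    }

    /** Builds a term -> posting table for a single field's text. */
    static Map<String, Posting> invert(String fieldText) {
        Map<String, Posting> table = new HashMap<>();
        int position = 0;
        // Assumed tokenization: lowercase, split on whitespace.
        for (String token : fieldText.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.isEmpty()) continue; // happens only for blank input
            Posting p = table.computeIfAbsent(token, t -> new Posting());
            p.freq++;
            p.positions.add(position++);
        }
        return table;
    }
}
```

For the input "to be or not to be" this yields four terms, with "to" at positions 0 and 4 and "be" at positions 1 and 5. The table is unsorted, which is why a separate sorting step is needed before writing.
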
How is the finished Posting List written to a file?

private final void writePostings(Posting[] postings, String segment) throws IOException {
  IndexOutput freq = null, prox = null;
  TermInfosWriter tis = null;
  TermVectorsWriter termVectorWriter = null;
  try {
    //open files for inverse index storage
    freq = directory.createOutput(segment + ".frq");
    prox = directory.createOutput(segment + ".prx");
    tis = new TermInfosWriter(directory, segment, fieldInfos..

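The excerpt opens separate outputs for frequencies (.frq) and positions (.prx) but is truncated before anything is written. The sketch below only illustrates that split into two streams, using in-memory buffers. The byte layout (writeUTF for the term, fixed-width ints) is an assumption chosen for readability and is not Lucene's on-disk format; WritePostingsSketch and its Posting type are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class WritePostingsSketch {
    /** Hypothetical posting: a term and the token positions where it occurs. */
    static final class Posting {
        final String term;
        final int[] positions;
        Posting(String term, int... positions) {
            this.term = term;
            this.positions = positions;
        }
    }

    /** Writes term + frequency to one stream and positions to another; returns {freqBytes, proxBytes}. */
    static byte[][] write(Posting[] sortedPostings) {
        ByteArrayOutputStream freqBuf = new ByteArrayOutputStream();
        ByteArrayOutputStream proxBuf = new ByteArrayOutputStream();
        try (DataOutputStream freq = new DataOutputStream(freqBuf);
             DataOutputStream prox = new DataOutputStream(proxBuf)) {
            for (Posting p : sortedPostings) {
                freq.writeUTF(p.term);             // assumed layout: term text
                freq.writeInt(p.positions.length); // followed by its frequency
                for (int pos : p.positions) {
                    prox.writeInt(pos);            // positions go to the second stream
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new byte[][] { freqBuf.toByteArray(), proxBuf.toByteArray() };
    }
}
```

Keeping positions in their own stream means a reader that only needs frequencies never has to touch the position data.
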
Inverted Index Strategies

batch-based: use file-sorting algorithms (textbook)
  + fastest to build
  + fastest to search
  - slow to update
b-tree based: update in place (http://www.lucene.com/papers/sigir90.ps)
  + fast to search
  - update/build does not scale
  - complex implementation
segment based: lots of small indexes (Verity)
  + fast to build
  + fast to update
  - slower to search
hash-file based (Ultraseek ISTK?)
  + fast to buil..

TF, IDF implementation

/** Implemented as sqrt(freq). */
public float tf(float freq) {
  return (float)Math.sqrt(freq);
}

/** Implemented as log(numDocs/(docFreq+1)) + 1. */
public float idf(int docFreq, int numDocs) {
  return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);
}
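
To get a feel for these two formulas, the following small example evaluates them on a few inputs. The formulas are copied from the excerpt above; the wrapper class TfIdfSketch and the sample numbers are made up for illustration.

```java
final class TfIdfSketch {
    /** sqrt(freq), as in the excerpt. */
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    /** log(numDocs / (docFreq + 1)) + 1, as in the excerpt (natural logarithm). */
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // A term occurring 4 times in a document: sqrt(4) = 2.
        System.out.println("tf(4) = " + tf(4f));
        // A rare term, in 9 of 1000 documents: ln(100) + 1, roughly 5.6.
        System.out.println("idf(9, 1000) = " + idf(9, 1000));
        // A term in nearly every document: ln(1) + 1 = 1.
        System.out.println("idf(999, 1000) = " + idf(999, 1000));
    }
}
```

The square root dampens the effect of repeated occurrences, while the logarithm keeps rare terms from dominating the score by orders of magnitude. The +1 in the denominator avoids division by zero when docFreq is 0.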