The search process with a Query and a Searcher

Hits(Searcher s, Query q, Filter f) throws IOException {
weight = q.weight(s);
searcher = s;
filter = f;
getMoreDocs(50); // retrieve 100 initially
}

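weight = q.weight(s); is where the groundwork for TF and IDF is laid: the IDF is computed here, while TF is applied later, when each document is scored.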

getMoreDocs(50); then fetches the matching documents (getMoreDocs doubles the requested count, hence the "100" in the comment).

So let's take a closer look at weight.

public Weight weight(Searcher searcher)
throws IOException {
Query query = searcher.rewrite(this);
Weight weight = query.createWeight(searcher);
float sum = weight.sumOfSquaredWeights();
float norm = getSimilarity(searcher).queryNorm(sum);
weight.normalize(norm);
return weight;
}
Weight weight = query.createWeight(searcher); is where the IDF is obtained.
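
To make the three steps after createWeight concrete (sumOfSquaredWeights, queryNorm, normalize), here is a minimal sketch for a single-term query. It assumes the classic TermWeight and DefaultSimilarity behavior (query weight = idf * boost, queryNorm = 1 / sqrt(sum)); the class and method names are made up for illustration and are not Lucene API.

```java
// Sketch only: mirrors what Query.weight() does for one term,
// assuming classic TermWeight / DefaultSimilarity formulas.
class QueryNormSketch {

    // Assumed DefaultSimilarity.queryNorm: 1 / sqrt(sumOfSquaredWeights)
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    // Returns the term weight value left after normalization.
    static float normalizedWeight(float idf, float boost) {
        float queryWeight = idf * boost;        // raw query weight
        float sum = queryWeight * queryWeight;  // weight.sumOfSquaredWeights()
        float norm = queryNorm(sum);            // similarity.queryNorm(sum)
        queryWeight *= norm;                    // weight.normalize(norm)
        return queryWeight * idf;               // value used later when scoring
    }

    public static void main(String[] args) {
        System.out.println(normalizedWeight(2.0f, 1.0f));
    }
}
```

With only one term in the query, the normalized query weight collapses to 1, so under these assumptions the resulting value is just the IDF. Normalization only starts to matter once several clauses are combined.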

So createWeight is the next thing to look at.

protected Weight createWeight(Searcher searcher) throws IOException {
return new TermWeight(searcher);
}

createWeight hands off to TermWeight, whose constructor computes the IDF via similarity.idf(term, searcher).
And what is getSimilarity?

public TermWeight(Searcher searcher)
throws IOException {
this.similarity = getSimilarity(searcher);
idf = similarity.idf(term, searcher); // compute idf
}

public float idf(Term term, Searcher searcher) throws IOException {
return idf(searcher.docFreq(term), searcher.maxDoc());
}
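
The two-argument idf(docFreq, numDocs) that this call delegates to is not shown above. As a rough sketch, assuming the default Similarity implementation (DefaultSimilarity) and its usual formula log(numDocs / (docFreq + 1)) + 1, it could look like this; the class name IdfSketch is hypothetical.

```java
// Sketch only: the IDF formula assumed to be used by DefaultSimilarity.
class IdfSketch {

    // docFreq: number of documents containing the term
    // numDocs: total number of documents (searcher.maxDoc() above)
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // A rare term gets a larger IDF than a common one.
        System.out.println("rare term:   " + idf(9, 1000));
        System.out.println("common term: " + idf(499, 1000));
    }
}
```

The only index data this needs is docFreq, which is why the trail leads to docFreq(term) next.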

Where does tis in the docFreq function come from?
public int docFreq(Term t) throws IOException {
TermInfo ti = tis.get(t);
if (ti != null)
return ti.docFreq;
else
return 0;
}

This really does go round and round...
Given a Term, the following fields are read from the tis file.
int docFreq = 0;
long freqPointer = 0;
long proxPointer = 0;
int skipOffset;

/** Returns the TermInfo for a Term in the set, or null. */
TermInfo get(Term term) throws IOException {
if (size == 0) return null;

ensureIndexIsRead();

// optimize sequential access: first try scanning cached enum w/o seeking
SegmentTermEnum enumerator = getEnum();
if (enumerator.term() != null // term is at or past current
&& ((enumerator.prev() != null && term.compareTo(enumerator.prev())> 0)
|| term.compareTo(enumerator.term()) >= 0)) {
int enumOffset = (int)(enumerator.position/enumerator.indexInterval)+1;
if (indexTerms.length == enumOffset // but before end of block
|| term.compareTo(indexTerms[enumOffset]) < 0)
return scanEnum(term); // no need to seek
}

// random-access: must seek
seekEnum(getIndexOffset(term));
return scanEnum(term);
}

How does it fetch the information for the matching term from the TIS file?
This is the only part left to analyze..

origEnum = new SegmentTermEnum(directory.openInput(segment + ".tis"),
fieldInfos, false);
size = origEnum.size;

indexEnum =
new SegmentTermEnum(directory.openInput(segment + ".tii"),
fieldInfos, true);

I need to look further into the relationship between tis and tii.
Then it should become clear how the term information is actually retrieved.

This is it.
It says the index term is found by binary search. Is this a binary search? It doesn't look like anything special.
/** Returns the offset of the greatest index entry which is less than or equal to term.*/
private final int getIndexOffset(Term term) {
int lo = 0; // binary search indexTerms[]
int hi = indexTerms.length - 1;

while (hi >= lo) {
int mid = (lo + hi) >> 1;
int delta = term.compareTo(indexTerms[mid]);
if (delta < 0)
hi = mid - 1;
else if (delta > 0)
lo = mid + 1;
else
return mid;
}
return hi;
}

In the end, there is no special additional index structure for finding a term among the terms; the term is located by a binary search over the sorted index terms.
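
To see what getIndexOffset returns when the term is not itself an index entry, here is a small standalone version of the same loop. It is a sketch that uses plain Strings instead of Term objects, and the sample terms are made up.

```java
// Sketch only: the same binary search as getIndexOffset, on Strings.
class IndexOffsetSketch {

    // Returns the offset of the greatest entry <= term, or -1 if none.
    static int getIndexOffset(String[] indexTerms, String term) {
        int lo = 0;
        int hi = indexTerms.length - 1;
        while (hi >= lo) {
            int mid = (lo + hi) >>> 1;
            int delta = term.compareTo(indexTerms[mid]);
            if (delta < 0)
                hi = mid - 1;       // term sorts before the middle entry
            else if (delta > 0)
                lo = mid + 1;       // term sorts after the middle entry
            else
                return mid;         // exact hit on an index entry
        }
        return hi;                  // greatest entry that sorts before term
    }

    public static void main(String[] args) {
        String[] indexTerms = {"apple", "grape", "melon", "peach"};
        System.out.println(getIndexOffset(indexTerms, "kiwi"));  // prints 1
        System.out.println(getIndexOffset(indexTerms, "melon")); // prints 2
        System.out.println(getIndexOffset(indexTerms, "zebra")); // prints 3
    }
}
```

The important detail is the final return hi: a miss does not mean "not found", it gives the index entry from which the sequential scan should start.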

The tii and tis files play somewhat different roles.
The tis file holds every TermInfo, while the tii file holds only every 128th entry.
You can see this in the add method of the TermInfosWriter class.
if (!isIndex && size % indexInterval == 0)
other.add(lastTerm, lastTi); // add an index term

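So, when a user runs a search, how much work does it take to locate the stored position needed for the term frequency?
Binary-search the tii entries (roughly the number of terms divided by 128), which takes about log2(number of terms / 128) comparisons, then scan at most 128 entries sequentially in the tis file. Adding the two gives the number of operations.
Once this entry is found, TF and IDF can be computed.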
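
The two-level lookup and its cost can be sketched as follows. This is a simplified model, not Lucene code: two in-memory String arrays stand in for the tis and tii files, every interval-th term is copied into the index, and the names SparseIndexSketch, find and worstCaseComparisons are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: sparse index (tii) over a sorted term list (tis).
class SparseIndexSketch {
    final String[] terms;       // stands in for .tis: all terms, sorted
    final String[] indexTerms;  // stands in for .tii: every interval-th term
    final int interval;

    SparseIndexSketch(String[] sortedTerms, int interval) {
        this.terms = sortedTerms;
        this.interval = interval;
        List<String> idx = new ArrayList<>();
        for (int i = 0; i < sortedTerms.length; i += interval)
            idx.add(sortedTerms[i]);
        this.indexTerms = idx.toArray(new String[0]);
    }

    // Returns the position of term in terms, or -1 if absent.
    int find(String term) {
        // Step 1: binary search the sparse index for the greatest entry <= term.
        int lo = 0, hi = indexTerms.length - 1, block = -1;
        while (hi >= lo) {
            int mid = (lo + hi) >>> 1;
            int delta = term.compareTo(indexTerms[mid]);
            if (delta < 0) hi = mid - 1;
            else if (delta > 0) lo = mid + 1;
            else { hi = mid; break; }
        }
        block = hi;
        if (block < 0) return -1;   // term sorts before everything

        // Step 2: sequential scan of at most `interval` entries.
        int start = block * interval;
        int end = Math.min(start + interval, terms.length);
        for (int i = start; i < end; i++) {
            int c = terms[i].compareTo(term);
            if (c == 0) return i;
            if (c > 0) break;       // passed where the term would be
        }
        return -1;
    }

    // Worst-case comparisons: binary search steps plus one full block scan.
    static int worstCaseComparisons(long numTerms, int interval) {
        long indexEntries = (numTerms + interval - 1) / interval;
        int binarySteps = 64 - Long.numberOfLeadingZeros(indexEntries);
        return binarySteps + interval;
    }

    public static void main(String[] args) {
        String[] all = {"a", "b", "c", "d", "e", "f", "g"};
        SparseIndexSketch idx = new SparseIndexSketch(all, 3);
        System.out.println(idx.find("e"));
        System.out.println(worstCaseComparisons(1_000_000L, 128));
    }
}
```

Under this model, a dictionary of one million terms with an interval of 128 needs on the order of 13 binary search steps plus up to 128 scanned entries, so the cost is dominated by the sequential scan rather than by the binary search.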

Well.. with that, the seminar prep is pretty much done.

Time to put together the PPT file.
