How Lucene Handles Global DF Across Multiple IndexReaders

After studying Nutch for a while, I was puzzled by a few points about distributed index construction:
1. How is global information handled in a distributed index? For example, each partial index has its own DF for a term; when searching across multiple indexes, are these DFs merged? This question is resolved by the experiment below.
2. Can the same document end up in more than one index?
I asked this mainly because I did not understand Hadoop's mechanism at first; with a suitably keyed Reducer, the same web page is guaranteed not to be processed twice, so it cannot appear in two indexes.
3. How are DocIDs assigned? I had not found an answer, but after discussing it with a senior classmate I realized I was overcomplicating the problem: there is no need for a global DocID. Each IndexReader ranks its own documents, and the per-reader result lists are then merged, because the score computed by each IndexReader is already a global score (see question 1). As for merging the indexes themselves, my guess is that it is done with a DocID offset (see the sketch after this list); reading the Lucene source is still needed to confirm how it is actually implemented.
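As a rough illustration of that offset idea (my own sketch, not Lucene's actual implementation), a composite reader can map each sub-reader's local doc IDs into a single global numbering by adding up the maxDoc() of the readers that precede it:

// Sketch only: assumes an array of already-opened IndexReaders named readers.
int[] starts = new int[readers.length];
int totalMaxDoc = 0;
for (int i = 0; i < readers.length; i++) {
    starts[i] = totalMaxDoc;            // global ID of the first document in readers[i]
    totalMaxDoc += readers[i].maxDoc(); // local IDs in readers[i] run from 0 to maxDoc()-1
}
// A local doc ID from readers[i] then maps to the global ID starts[i] + localDocId.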

Solving question 1.
Build two indexes and open an IndexReader on each. Each index contains a single Document with a field named content: one document's content is "one two" and the other's is "two". A MultiReader is then used to search both IndexReaders at the same time.
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_33);
IndexWriterConfig config1 = new IndexWriterConfig(Version.LUCENE_33, analyzer);
IndexWriterConfig config2 = new IndexWriterConfig(Version.LUCENE_33, analyzer);

// Two separate in-memory indexes, each written by its own IndexWriter.
Directory directory1 = new RAMDirectory();
Directory directory2 = new RAMDirectory();

IndexWriter writer1 = new IndexWriter(directory1, config1);
IndexWriter writer2 = new IndexWriter(directory2, config2);

// First index: a single document with content "one two".
Document doc = new Document();
doc.add(new Field("content", "one two", Field.Store.YES, Field.Index.ANALYZED));
writer1.addDocument(doc);
writer1.close();

// Second index: a single document with content "two".
doc = new Document();
doc.add(new Field("content", "two", Field.Store.YES, Field.Index.ANALYZED));
writer2.addDocument(doc);
writer2.close();

// Open a reader on each index and search both of them through a MultiReader.
IndexReader[] readers = new IndexReader[2];
readers[0] = IndexReader.open(directory1);
readers[1] = IndexReader.open(directory2);
IndexSearcher searcher = new IndexSearcher(new MultiReader(readers));


Construct the query "content:two". If MultiReader merges the DF from the two IndexReaders automatically, the Explanation of each hit will show docFreq=2 for "content:two"; otherwise it will show 1.
Query query = new TermQuery(new Term("content", "two"));
TopDocs topDocs = searcher.search(query, 10);
for (int i = 0; i < topDocs.scoreDocs.length; ++i) {
    ScoreDoc match = topDocs.scoreDocs[i];
    // Explain the score of this hit; the idf line exposes the docFreq that was used.
    Explanation explanation = searcher.explain(query, match.doc);
    Document doc = searcher.doc(match.doc); // fetch by doc ID, not by loop index
    System.out.println("Document: " + doc.get("content"));
    System.out.println(explanation.toString());
}


The output is as follows:
Document: one two
0.5945348 = (MATCH) weight(content:two in 0), product of:
  0.99999994 = queryWeight(content:two), product of:
    0.5945349 = idf(docFreq=2, maxDocs=2)
    1.681987 = queryNorm
  0.5945349 = (MATCH) fieldWeight(content:two in 0), product of:
    1.0 = tf(termFreq(content:two)=1)
    0.5945349 = idf(docFreq=2, maxDocs=2)
    1.0 = fieldNorm(field=content, doc=0)

Document: two
0.37158427 = (MATCH) weight(content:two in 0), product of:
  0.99999994 = queryWeight(content:two), product of:
    0.5945349 = idf(docFreq=2, maxDocs=2)
    1.681987 = queryNorm
  0.3715843 = (MATCH) fieldWeight(content:two in 0), product of:
    1.0 = tf(termFreq(content:two)=1)
    0.5945349 = idf(docFreq=2, maxDocs=2)
    0.625 = fieldNorm(field=content, doc=0)


The Explanation shows docFreq (DF) = 2 for "two", which confirms that a MultiReader merges the DF of each IndexReader into a global DF.
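Outside of scoring, the same thing can be checked directly (a minimal follow-up, assuming the readers array and searcher built above): each sub-reader reports only its own document frequency, while the MultiReader reports the combined value.

Term term = new Term("content", "two");
System.out.println(readers[0].docFreq(term));                // 1: sees only its own index
System.out.println(readers[1].docFreq(term));                // 1: sees only its own index
System.out.println(searcher.getIndexReader().docFreq(term)); // 2: merged, global DF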

Reposted from nepshi.iteye.com/blog/1242026