How Lucene Handles Global DF Across Multiple IndexReaders

After studying Nutch for a while, I was puzzled by a few points about distributed index construction:
1. How is global information handled in a distributed index? For example, each partial index has its own DF for a term; when searching across multiple indexes, are these DFs merged? This question is resolved by the experiment below.
2. Can the same document end up in more than one index?
I asked this mainly because I did not understand Hadoop's mechanism at first; with a suitably keyed Reducer, the same web page is guaranteed not to be processed twice, so it cannot appear in two indexes.
3. How are DocIDs assigned? I had not found an answer, but after discussing it with a senior classmate I realized I was overcomplicating the problem: there is no need for a global DocID. Each IndexReader ranks its own documents, and the per-reader result lists are then merged, because the score computed by each IndexReader is already a global score (see question 1). As for merging the indexes themselves, my guess is that it is done with a DocID offset (see the sketch after this list); reading the Lucene source is still needed to confirm how it is actually implemented.
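As a rough illustration of that offset idea (my own sketch, not Lucene's actual implementation), a composite reader can map each sub-reader's local doc IDs into a single global numbering by adding up the maxDoc() of the readers that precede it:

// Sketch only: assumes an array of already-opened IndexReaders named readers.
int[] starts = new int[readers.length];
int totalMaxDoc = 0;
for (int i = 0; i < readers.length; i++) {
    starts[i] = totalMaxDoc;            // global ID of the first document in readers[i]
    totalMaxDoc += readers[i].maxDoc(); // local IDs in readers[i] run from 0 to maxDoc()-1
}
// A local doc ID from readers[i] then maps to the global ID starts[i] + localDocId.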

Solving question 1.
Build two indexes and open an IndexReader on each. Each index contains a single Document with a field named content: one document's content is "one two" and the other's is "two". A MultiReader is then used to search both IndexReaders at the same time.
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_33);
IndexWriterConfig config1 = new IndexWriterConfig(Version.LUCENE_33, analyzer);
IndexWriterConfig config2 = new IndexWriterConfig(Version.LUCENE_33, analyzer);

// Two separate in-memory indexes, each written by its own IndexWriter.
Directory directory1 = new RAMDirectory();
Directory directory2 = new RAMDirectory();

IndexWriter writer1 = new IndexWriter(directory1, config1);
IndexWriter writer2 = new IndexWriter(directory2, config2);

// First index: a single document with content "one two".
Document doc = new Document();
doc.add(new Field("content", "one two", Field.Store.YES, Field.Index.ANALYZED));
writer1.addDocument(doc);
writer1.close();

// Second index: a single document with content "two".
doc = new Document();
doc.add(new Field("content", "two", Field.Store.YES, Field.Index.ANALYZED));
writer2.addDocument(doc);
writer2.close();

// Open a reader on each index and search both of them through a MultiReader.
IndexReader[] readers = new IndexReader[2];
readers[0] = IndexReader.open(directory1);
readers[1] = IndexReader.open(directory2);
IndexSearcher searcher = new IndexSearcher(new MultiReader(readers));


Construct the query "content:two". If MultiReader merges the DF from the two IndexReaders automatically, the Explanation of each hit will show docFreq=2 for "content:two"; otherwise it will show 1.
Query query = new TermQuery(new Term("content", "two"));
TopDocs topDocs = searcher.search(query, 10);
for (int i = 0; i < topDocs.scoreDocs.length; ++i) {
    ScoreDoc match = topDocs.scoreDocs[i];
    // Explain the score of this hit; the idf line exposes the docFreq that was used.
    Explanation explanation = searcher.explain(query, match.doc);
    Document doc = searcher.doc(match.doc); // fetch by doc ID, not by loop index
    System.out.println("Document: " + doc.get("content"));
    System.out.println(explanation.toString());
}


The output is as follows:
Document: one two
0.5945348 = (MATCH) weight(content:two in 0), product of:
  0.99999994 = queryWeight(content:two), product of:
    0.5945349 = idf(docFreq=2, maxDocs=2)
    1.681987 = queryNorm
  0.5945349 = (MATCH) fieldWeight(content:two in 0), product of:
    1.0 = tf(termFreq(content:two)=1)
    0.5945349 = idf(docFreq=2, maxDocs=2)
    1.0 = fieldNorm(field=content, doc=0)

Document: two
0.37158427 = (MATCH) weight(content:two in 0), product of:
  0.99999994 = queryWeight(content:two), product of:
    0.5945349 = idf(docFreq=2, maxDocs=2)
    1.681987 = queryNorm
  0.3715843 = (MATCH) fieldWeight(content:two in 0), product of:
    1.0 = tf(termFreq(content:two)=1)
    0.5945349 = idf(docFreq=2, maxDocs=2)
    0.625 = fieldNorm(field=content, doc=0)


The Explanation shows docFreq (DF) = 2 for "two", which confirms that a MultiReader merges the DF of each IndexReader into a global DF.
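Outside of scoring, the same thing can be checked directly (a minimal follow-up, assuming the readers array and searcher built above): each sub-reader reports only its own document frequency, while the MultiReader reports the combined value.

Term term = new Term("content", "two");
System.out.println(readers[0].docFreq(term));                // 1: sees only its own index
System.out.println(readers[1].docFreq(term));                // 1: sees only its own index
System.out.println(searcher.getIndexReader().docFreq(term)); // 2: merged, global DF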

Reposted from nepshi.iteye.com/blog/1242026