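1. How does a distributed index handle global statistics? For example, each index keeps its own DF (document frequency) for every term; when several indexes are searched together, are those DFs merged? The experiment below settles this question.
2. Can the same document end up in more than one index?
I raised this mainly because I did not yet understand how Hadoop works. With the Reducers set up appropriately, the same web page is guaranteed not to be processed twice, so it cannot appear in two indexes.
3. How are DocIDs assigned? At first I had no answer to this. After discussing it with a senior labmate, I realized I had been overcomplicating the problem: there is no need to think about a global DocID. Each IndexReader ranks its own documents, and the ranked results of all IndexReaders are then merged, because the score each IndexReader computes is already a global score (see question 1). As for how indexes are merged, my guess is that it is done with DocID offsets; I still need to read the Lucene source code to find out how it is actually implemented.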
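To make the DocID offset guess in question 3 concrete, here is a minimal sketch in plain Java. It is not taken from the Lucene source. It only assumes that a composite reader lays the sub-indexes end to end, so that a global DocID is the sub-index's starting offset plus the local DocID. The class and method names (DocIdOffsetSketch, docBases, toGlobal, readerIndex) are made up for illustration.

```java
class DocIdOffsetSketch {
    // starts[i] is the global DocID of the first document of sub-index i;
    // starts[n] is the total number of documents across all sub-indexes.
    static int[] docBases(int[] maxDocs) {
        int[] starts = new int[maxDocs.length + 1];
        for (int i = 0; i < maxDocs.length; i++) {
            starts[i + 1] = starts[i] + maxDocs[i];
        }
        return starts;
    }

    // Local DocID inside sub-index readerIndex -> global DocID.
    static int toGlobal(int[] starts, int readerIndex, int localDoc) {
        return starts[readerIndex] + localDoc;
    }

    // Global DocID -> index of the sub-index that holds it.
    // Binary search for the last i with starts[i] <= globalDoc,
    // which also skips over empty sub-indexes.
    static int readerIndex(int[] starts, int globalDoc) {
        int lo = 0;
        int hi = starts.length - 2;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (starts[mid] <= globalDoc) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        // Two sub-indexes with one document each, as in the experiment in this post.
        int[] starts = docBases(new int[] {1, 1});
        int global = toGlobal(starts, 1, 0);      // local doc 0 of the second index
        int reader = readerIndex(starts, global); // map it back to its sub-index
        int local = global - starts[reader];
        System.out.println("global=" + global + " reader=" + reader + " local=" + local);
    }
}
```

Under this assumption, the single document of the second index has local DocID 0 but global DocID 1, so results from several IndexReaders can be merged without ever rewriting the DocIDs stored in each index.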
Answering question 1.
Build two indexes with one Document each. Every Document has a field named content; in one index its value is "one two", in the other it is "two". Then open an IndexReader on each index and search both at once through a MultiReader.
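    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    // directory1, directory2, writer1, writer2 and searcher are assumed to be
    // declared elsewhere, e.g. as fields of the test class.
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_33);
    IndexWriterConfig config1 = new IndexWriterConfig(Version.LUCENE_33, analyzer);
    IndexWriterConfig config2 = new IndexWriterConfig(Version.LUCENE_33, analyzer);
    directory1 = new RAMDirectory();
    directory2 = new RAMDirectory();
    writer1 = new IndexWriter(directory1, config1);
    writer2 = new IndexWriter(directory2, config2);

    // First index: a single document with content "one two".
    Document doc = new Document();
    doc.add(new Field("content", "one two", Field.Store.YES, Field.Index.ANALYZED));
    writer1.addDocument(doc);
    writer1.close();

    // Second index: a single document with content "two".
    doc = new Document();
    doc.add(new Field("content", "two", Field.Store.YES, Field.Index.ANALYZED));
    writer2.addDocument(doc);
    writer2.close();

    // Search both indexes through one MultiReader.
    IndexReader[] readers = new IndexReader[2];
    readers[0] = IndexReader.open(directory1);
    readers[1] = IndexReader.open(directory2);
    searcher = new IndexSearcher(new MultiReader(readers));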
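Build the query "content:two". If MultiReader automatically merges the DFs of the two IndexReaders, the Explanation of each hit will show a DF of 2 for the term in "content:two"; otherwise it will show 1.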
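    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Explanation;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    Query query = new TermQuery(new Term("content", "two"));
    TopDocs topDocs = searcher.search(query, 10);
    for (int i = 0; i < topDocs.scoreDocs.length; ++i) {
        ScoreDoc match = topDocs.scoreDocs[i];
        Explanation explanation = searcher.explain(query, match.doc);
        // Load the stored fields by the hit's DocID (match.doc), not by the
        // loop index i, so that the label belongs to the explained hit.
        Document doc = searcher.doc(match.doc);
        System.out.println("Document: " + doc.get("content"));
        System.out.println(explanation.toString());
    }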
The output is as follows. (The two "Document:" labels in this output appear to be swapped relative to the hits: fieldNorm=1.0 corresponds to the one-term document "two", and fieldNorm=0.625 to the two-term document "one two".)

    Document: one two
    0.5945348 = (MATCH) weight(content:two in 0), product of:
      0.99999994 = queryWeight(content:two), product of:
        0.5945349 = idf(docFreq=2, maxDocs=2)
        1.681987 = queryNorm
      0.5945349 = (MATCH) fieldWeight(content:two in 0), product of:
        1.0 = tf(termFreq(content:two)=1)
        0.5945349 = idf(docFreq=2, maxDocs=2)
        1.0 = fieldNorm(field=content, doc=0)
    Document: two
    0.37158427 = (MATCH) weight(content:two in 0), product of:
      0.99999994 = queryWeight(content:two), product of:
        0.5945349 = idf(docFreq=2, maxDocs=2)
        1.681987 = queryNorm
      0.3715843 = (MATCH) fieldWeight(content:two in 0), product of:
        1.0 = tf(termFreq(content:two)=1)
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.625 = fieldNorm(field=content, doc=0)
The Explanation shows docFreq (DF) = 2 for "two", which means that MultiReader does merge the DFs of its IndexReaders into a global DF.
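As a sanity check on the numbers in the Explanation, the idf value can be recomputed by hand. The sketch below assumes the default idf formula of Lucene 3.x (DefaultSimilarity), idf = 1 + ln(numDocs / (docFreq + 1)). The class name IdfCheck is made up for illustration, and nothing here calls Lucene itself.

```java
class IdfCheck {
    // Assumed formula of Lucene 3.x DefaultSimilarity:
    // idf = 1 + ln(numDocs / (docFreq + 1))
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // Merged statistics: "two" occurs in 2 of 2 documents (about 0.5945).
        float merged = idf(2, 2);
        // What a DF of 1 would have produced with the same maxDocs (exactly 1.0).
        float notMerged = idf(1, 2);
        System.out.println("idf(docFreq=2, maxDocs=2) = " + merged);
        System.out.println("idf(docFreq=1, maxDocs=2) = " + notMerged);

        // fieldWeight = tf * idf * fieldNorm, using the values from the Explanation.
        float tf = 1.0f;
        System.out.println("fieldWeight, fieldNorm=1.0   = " + tf * merged * 1.0f);
        System.out.println("fieldWeight, fieldNorm=0.625 = " + tf * merged * 0.625f);
    }
}
```

With docFreq=2 and maxDocs=2 the formula gives roughly 0.5945, which is the idf reported in the Explanation, whereas a DF of 1 would have given an idf of 1.0. So the reported scores are consistent with a merged, global DF.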