Dynamically loading dictionaries in mmseg4j

 

1:schema.xml:

<!-- Chinese word segmentation with mmseg4j -->
	<fieldtype name="text_mmseg4j_simple" class="solr.TextField" positionIncrementGap="100">
	    <analyzer>
		<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" dicPath="/data1/SolrCloud/WordsConf/mmseg4j/words" />
		<filter class="solr.LowerCaseFilterFactory"/>
	    </analyzer>
	</fieldtype>
	<fieldtype name="text_mmseg4j_complex" class="solr.TextField" positionIncrementGap="100">
	    <analyzer>
		<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="/data1/SolrCloud/WordsConf/mmseg4j/words" />
		<filter class="solr.LowerCaseFilterFactory"/>
	    </analyzer>
	</fieldtype>
	<fieldtype name="text_mmseg4j_maxWord" class="solr.TextField" positionIncrementGap="100">
	    <analyzer>
		<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="/data1/SolrCloud/WordsConf/mmseg4j/words" />
		<filter class="solr.LowerCaseFilterFactory"/>
	    </analyzer>
	</fieldtype>

2: solrconfig.xml:

<!-- mmseg4j reload words handler -->
  <requestHandler name="/mmseg4j/reloadwords" class="com.chenlb.mmseg4j.solr.MMseg4jHandler">
        <lst name="defaults">
        	<str name="dicPath">/data1/SolrCloud/WordsConf/mmseg4j/words</str>
        	<str name="check">true</str>
        	<str name="reload">true</str>
        </lst>
  </requestHandler>

 

3: Place the following in the /data1/SolrCloud/WordsConf/mmseg4j/words directory:

   3.1: chars.dic, units.dic, and words.dic from mmseg4j-core-1.10.0.jar. These three are the official dictionaries. You can edit them to override the official configuration, or leave them unchanged.

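   3.2: Your own dictionary files, named so that they start with words and end with .dic, encoded in UTF-8, with one word per line. If a file is UTF-8 with a BOM, leaving its first line empty is enough.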
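
The rules in 3.2 can be sketched as a small Python helper. This is only an illustration, not part of mmseg4j: the function name write_dictionary and the file name words-custom.dic are hypothetical. It writes UTF-8 without a BOM, so the empty first line is not needed.

```python
from pathlib import Path


def write_dictionary(directory, name, words):
    """Write words to <directory>/<name>, one per line, as UTF-8 without a BOM.

    Hypothetical helper for illustration; it only enforces the naming rule
    described in step 3.2 (name starts with "words" and ends with ".dic").
    """
    if not (name.startswith("words") and name.endswith(".dic")):
        raise ValueError("file name must start with 'words' and end with '.dic'")
    # Drop surrounding whitespace and skip empty entries.
    cleaned = [w.strip() for w in words if w.strip()]
    path = Path(directory) / name
    # encoding="utf-8" (not "utf-8-sig") writes no byte order mark.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for word in cleaned:
            f.write(word + "\n")
    return path
```

For example, write_dictionary("/data1/SolrCloud/WordsConf/mmseg4j/words", "words-custom.dic", ["搜索引擎", "分词"]) would create a two-line dictionary file in the directory from step 3.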

 

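4: Reloading the Chinese segmentation dictionary files: the request below targets a single node. With multiple nodes or SolrCloud, the same request must be sent to every node (the node list can be read from ZooKeeper) before the change takes effect on all of them: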

http://172.28.4.83:11010/solr/common_shard1_1_replica3/mmseg4j/reloadwords

= base path: http://172.28.4.83:11010/solr/common_shard1_1_replica3

+

handler path: /mmseg4j/reloadwords
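
Since every node has to be requested separately, the step can be scripted. The following is a minimal Python sketch, assuming you already have the list of core base URLs. The names reload_url and reload_all are hypothetical, and nothing here reads the node list from ZooKeeper.

```python
from urllib.request import urlopen

# Handler path as registered in solrconfig.xml (step 2).
HANDLER_PATH = "/mmseg4j/reloadwords"


def reload_url(base_url):
    """Join a core's base path and the handler path into the request URL."""
    return base_url.rstrip("/") + HANDLER_PATH


def reload_all(base_urls, timeout=10):
    """Request the reload handler on every given core.

    Returns a dict mapping each request URL to the HTTP status code,
    or to the exception raised if the request failed.
    """
    results = {}
    for base in base_urls:
        url = reload_url(base)
        try:
            with urlopen(url, timeout=timeout) as resp:
                results[url] = resp.status
        except OSError as exc:  # URLError, HTTPError and timeouts are OSErrors
            results[url] = exc
    return results
```

Calling reload_all(["http://172.28.4.83:11010/solr/common_shard1_1_replica3"]) sends the request shown above. Add one base URL per node to cover the whole cluster.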

5: If a node has loaded the dictionary but the change has not taken effect, run the following reload command:

curl 'http://172.28.4.83:11010/solr/admin/collections?action=RELOAD&name=common'
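
For scripting, the same Collections API URL can be built in Python instead of typed by hand. This is a sketch only, and the helper name collection_reload_url is hypothetical. It builds the URL and leaves sending the request to whatever HTTP client you prefer.

```python
from urllib.parse import urlencode


def collection_reload_url(solr_root, collection):
    """Build the Collections API RELOAD URL for one collection.

    solr_root is the Solr root URL, e.g. "http://172.28.4.83:11010/solr".
    """
    query = urlencode({"action": "RELOAD", "name": collection})
    return solr_root.rstrip("/") + "/admin/collections?" + query
```

collection_reload_url("http://172.28.4.83:11010/solr", "common") yields the URL used in the curl command above.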


Reposted from rayoo.iteye.com/blog/2236789