Use cases
In big data scenarios, IK Analyzer is commonly used to count keyword frequencies: the raw text must first be segmented into words so that the keywords we care about can be extracted.
Importing the dependency
<dependency>
    <groupId>com.janeluo</groupId>
    <artifactId>ikanalyzer</artifactId>
    <version>2012_u6</version>
</dependency>
Using the segmenter
ArrayList<String> result = new ArrayList<>();
// Create a reader over the input text
StringReader reader = new StringReader(keyword);
// Create the segmenter (true = smart mode)
IKSegmenter ikSegmenter = new IKSegmenter(reader, true);
Lexeme next = ikSegmenter.next();
while (next != null) {
    // Collect each token's text
    result.add(next.getLexemeText());
    next = ikSegmenter.next();
}
return result;
Segmentation results
With useSmart = true (smart mode), the segmenter produces a coarse-grained split, so the same word does not appear more than once.
With useSmart = false (fine-grained mode), overlapping splits are kept, so the same word can appear multiple times.
For counting keyword frequencies, useSmart = false works better because it preserves every occurrence of a word.
Wrapping it in a utility class
package com.cw.util;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/**
 * @author CW
 * @version 1.0
 * @date 2023/3/1 8:41
 * @desc IK word segmentation utility class
 */
public class IKUtil {

    /**
     * Split the given text into words.
     * @param keyword the text to segment
     * @return the list of tokens
     * @throws IOException if reading the input text fails
     */
    public static List<String> splitKeyWord(String keyword) throws IOException {
        ArrayList<String> result = new ArrayList<>();
        // Create a reader over the input text
        StringReader reader = new StringReader(keyword);
        // Create the segmenter in fine-grained mode (useSmart = false)
        IKSegmenter ikSegmenter = new IKSegmenter(reader, false);
        Lexeme next = ikSegmenter.next();
        while (next != null) {
            // Collect each token's text
            result.add(next.getLexemeText());
            next = ikSegmenter.next();
        }
        return result;
    }
}
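Since the motivating use case is keyword frequency counting, the token list returned by a segmenter can be reduced to a frequency map with nothing but the JDK. The sketch below is illustrative: the class name WordFrequency and the hard-coded token list are assumptions standing in for real output from splitKeyWord, which requires the IK dictionary at runtime.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordFrequency {
    // Count how often each token appears in the segmenter's output.
    public static Map<String, Integer> count(List<String> tokens) {
        Map<String, Integer> freq = new HashMap<>();
        for (String token : tokens) {
            // merge() inserts 1 on first sight, otherwise adds 1 to the stored count
            freq.merge(token, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        // Stand-in token list; in practice this would come from IKUtil.splitKeyWord(text)
        List<String> tokens = Arrays.asList("大数据", "数据", "大数据", "关键词");
        System.out.println(count(tokens));
    }
}
```

Because fine-grained mode (useSmart = false) keeps every occurrence of a word, feeding its output into count() yields the true per-word frequencies.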