Word segmentation (including part-of-speech handling) with HanLP in an Android Studio project
1. Brief introduction
These notes collect some basic Android development operations for later reference.
This section shows how to use HanLP on Android to segment sentences and paragraphs into words, including reading each word's part of speech.
HanLP is not the only option for Chinese word segmentation on Android. The lists below cover some common alternatives and then some advantages of HanLP:
Common Chinese word segmentation algorithms and tools:
ansj_seg: a Chinese word segmentation tool for the Java platform, based on CRF and HMM models. It supports fine-grained and coarse-grained segmentation, and offers custom dictionaries and part-of-speech tagging.
jieba: a Chinese word segmentation library widely used in Python, with a Java port also available. It segments with a prefix dictionary and performs well in both speed and accuracy.
lucene-analyzers-smartcn: the Chinese analyzer in the Apache Lucene project. It splits text into sentences and segments each sentence with a Hidden Markov Model, and it is widely used with the Lucene search engine.
ictclas4j: a Java port of ICTCLAS, the HMM-based Chinese word segmentation system developed by the Institute of Computing Technology, Chinese Academy of Sciences. It supports custom dictionaries and part-of-speech tagging.
Advantages of HanLP word segmentation:
Multi-domain applicability: HanLP is designed as a general Chinese natural language processing toolkit. Beyond word segmentation, it supports part-of-speech tagging, named entity recognition, dependency parsing, and other tasks.
Accuracy and speed: HanLP has been trained and tuned on multiple standard datasets, and it segments both accurately and quickly.
Flexible dictionary support: HanLP supports custom dictionaries, so you can add domain-specific vocabulary to improve segmentation.
Open source: HanLP is open source. You can use, modify, and distribute it freely, which makes it easier to customize and integrate into your projects.
Multi-language support: besides Chinese, HanLP also supports other languages such as English and Japanese, which helps with cross-language processing.
Active community: HanLP has an active community and maintenance team that can help with problems and support.
In short, HanLP is a feature-rich, high-performance Chinese natural language processing tool that fits many application scenarios, especially multi-domain text processing. The final choice still depends on your specific needs and project context.
HanLP Official Website: HanLP | Online Demo
2. Implementation principle
1. Call StandardTokenizer.segment(text) with the text to segment; it returns a list of Term objects
2. Read Term.word to get each segmented word, and Term.nature.toString() to get its part-of-speech tag
3. Matters needing attention
1. Chinese words usually receive an accurate part-of-speech tag, but English words may not
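Since the tag on an English token may be less informative, one option is to detect such tokens before interpreting their tags. The following is a minimal sketch in plain Java that does not depend on HanLP. The class name TokenScript and the method isLatinWord are hypothetical names introduced here for illustration, and "English" is approximated as "ASCII letters only", which is an assumption of this sketch.

```java
final class TokenScript {
    private TokenScript() {}

    /**
     * Returns true when the token is non-empty and consists only of
     * ASCII letters (A-Z, a-z). Hypothetical helper, not part of HanLP.
     */
    static boolean isLatinWord(String token) {
        if (token == null || token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            boolean asciiLetter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
            if (!asciiLetter) {
                return false; // digits, punctuation, and Chinese characters all end up here
            }
        }
        return true;
    }
}
```

A caller could check TokenScript.isLatinWord(term.word) inside the loop over the segmentation result and treat the tag of such tokens with less confidence.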
4. Effect preview
5. Implementation steps
1. Open Android Studio, create an empty project, and add HanLP as a dependency in build.gradle
implementation 'com.hankcs:hanlp:portable-1.7.5'
Remember to click Sync Now afterwards.
2. Create the class ChineseSegmentationExample to implement the word segmentation function
3. Call it from the main activity and pass in the text to be segmented
4. Build the app and run it on an Android device; the result is as shown in the effect preview
6. Key code
1. ChineseSegmentationExample
package com.xxxx.testchinesesegmentationexample;

import com.hankcs.hanlp.seg.common.Term;
import com.hankcs.hanlp.tokenizer.StandardTokenizer;

import java.util.List;

public class ChineseSegmentationExample {

    /**
     * Segments the given text into words and prints each word with its tag.
     * @param wordsContent the text to segment
     */
    public static void SegmentWords(String wordsContent) {
        String text = wordsContent;

        // Run the segmenter
        List<Term> terms = StandardTokenizer.segment(text);

        // Walk through the result, look up each part of speech, and print it
        for (Term term : terms) {
            String word = term.word;
            String pos = term.nature.toString();
            String posInfo = getPosInfo(pos); // map the tag to a readable label
            System.out.println("Word: " + word + ", POS: " + pos + ", Attribute: " + posInfo);
        }
    }

    /**
     * Maps a part-of-speech tag to a readable label.
     * @param pos the part-of-speech tag
     * @return the label for the tag
     */
    static String getPosInfo(String pos) {
        // Add more branches here as needed to cover other tags
        if (pos.equals("n")) {
            return "名词"; // noun
        } else if (pos.equals("v")) {
            return "动词"; // verb
        } else if (pos.equals("ns")) {
            return "地名"; // place name
        } else if (pos.equals("t")) {
            return "时间"; // time word
        } else {
            return "其他"; // other
        }
    }
}
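As the comment in getPosInfo suggests, more tags can be added over time. When the list grows, a lookup table may be easier to maintain than a long if/else chain. The following is a minimal sketch of that alternative in plain Java. The class name PosLabels, the method describe, and the English labels are assumptions of this sketch and are not part of the original code or of HanLP.

```java
import java.util.HashMap;
import java.util.Map;

final class PosLabels {
    private PosLabels() {}

    // Tag-to-label table; extend it with further tags as needed.
    private static final Map<String, String> LABELS = new HashMap<>();

    static {
        LABELS.put("n", "noun");
        LABELS.put("v", "verb");
        LABELS.put("ns", "place name");
        LABELS.put("t", "time word");
    }

    /** Returns the label for a tag, or "other" when the tag is null or not in the table. */
    static String describe(String pos) {
        String label = LABELS.get(pos); // HashMap returns null for unknown or null keys
        return label != null ? label : "other";
    }
}
```

With this in place, the loop in SegmentWords could call PosLabels.describe(pos) where it currently calls getPosInfo(pos). Supporting a new tag then means adding one table entry.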
2. MainActivity
package com.xxxx.testchinesesegmentationexample;

import androidx.appcompat.app.AppCompatActivity;
import android.os.Bundle;

public class MainActivity extends AppCompatActivity {

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);

        // Segment a sample sentence; the result is printed to the log output
        ChineseSegmentationExample.SegmentWords("现在几号,几点钟,今天明天后天昨天北京深圳的天气如何。");
    }
}
Addendum: In HanLP, the nature field of the Term object represents the part of speech
In HanLP, the nature field of a Term object holds the word's part of speech (POS). HanLP uses a standard Chinese part-of-speech tag set in which each part of speech has a unique identifier. Here are some common tags and their meanings:
Noun class:
n: common noun
nr: personal name
ns: place name
nt: organization name
nz: other proper noun
nl: nominal idiom
ng: noun morpheme
Time class:
t: time word
Verb class:
v: verb
vd: adverbial verb (a verb used as an adverb)
vn: nominal verb (a verb used as a noun)
vshi: the verb "to be" (是)
vyou: the verb "to have" (有)
Adjective class:
a: adjective
ad: adverbial adjective (an adjective used as an adverb)
Adverb class:
d: adverb
Pronoun class:
r: pronoun
rr: personal pronoun
rz: demonstrative pronoun
rzt: time demonstrative pronoun
Conjunction class:
c: conjunction
Particle class:
u: particle
Numeral class:
m: numeral
Measure word class:
q: measure word (quantifier)
Modal particle class:
y: modal particle
Interjection class:
e: interjection
Onomatopoeia class:
o: onomatopoeia
Locative class:
f: locative (direction) word
Status word class:
z: status word
Preposition class:
p: preposition
Prefix class:
h: prefix
Suffix class:
k: suffix
Punctuation class:
w: punctuation
Note that these are only some of the common part-of-speech tags; the full tag set is larger and more detailed. Consult the HanLP documentation for the complete list. Based on these tags, you can write code that determines what kind of word each token is (verb, noun, place name, and so on) and handles it accordingly.
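As one illustration of processing words by tag, the sketch below picks words out of a tagged list either by exact tag or by tag prefix. It is plain Java and does not call HanLP: the classes PosFilter and TaggedWord and both methods are hypothetical names introduced here, with TaggedWord standing in for the word and nature fields of a Term. Prefix matching relies on the naming pattern visible in the list above (for example, n, nr, ns, nt, and nz all start with "n"); treat that as a heuristic assumption and not as a documented guarantee.

```java
import java.util.ArrayList;
import java.util.List;

final class PosFilter {
    private PosFilter() {}

    /** A word paired with its tag; stands in for HanLP's Term in this sketch. */
    static final class TaggedWord {
        final String word;
        final String pos;

        TaggedWord(String word, String pos) {
            this.word = word;
            this.pos = pos;
        }
    }

    /** Returns, in order, the words whose tag equals the given tag exactly. */
    static List<String> wordsWithTag(List<TaggedWord> tokens, String tag) {
        List<String> result = new ArrayList<>();
        for (TaggedWord token : tokens) {
            if (token.pos != null && token.pos.equals(tag)) {
                result.add(token.word);
            }
        }
        return result;
    }

    /** Returns, in order, the words whose tag starts with the given prefix. */
    static List<String> wordsWithTagPrefix(List<TaggedWord> tokens, String prefix) {
        List<String> result = new ArrayList<>();
        for (TaggedWord token : tokens) {
            if (token.pos != null && token.pos.startsWith(prefix)) {
                result.add(token.word);
            }
        }
        return result;
    }
}
```

For example, after copying each Term's word and nature.toString() into a TaggedWord, wordsWithTag(tokens, "ns") would collect the place names, while wordsWithTagPrefix(tokens, "n") would collect every token in the noun class.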