Overview of part-of-speech tagging
With the development of information technology, the volume of information on the network has grown geometrically, and this growth has become a defining feature of today's society. Accurately extracting key information from text is a technical foundation for search engines and other fields, and word segmentation is particularly important because it is the first step in extracting information from text.
Word segmentation is a basic research topic in natural language processing, and a wide range of text processing applications build on it.
Part-of-speech tagging segments the text into words and assigns a part of speech (noun, verb, adjective, and so on) to each word in the segmentation result. Developers can choose the granularity of word segmentation.
Operation Mechanism
Part-of-speech tagging provides an interface for automatic word segmentation and part-of-speech tagging. Input text passed to the interface is segmented automatically, and each word in the segmentation result is assigned a part of speech. Several segmentation granularities are available, and developers can select one as needed.
Constraints and Restrictions
- Currently only Chinese text is supported.
- The text to be tagged is limited to 500 characters. If the limit is exceeded, a parameter error is returned. The text must be encoded in UTF-8. Text in another encoding does not raise an error, but the analysis result will be incorrect.
- The engine supports simultaneous access by multiple users, but does not support concurrent invocation of the same feature by the same user. If one process calls the same feature several times at once, a system busy error is returned. If different processes call the same feature, only one process is served at a time and the others are queued.
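Because concurrent calls to the same feature from one process return a system busy error, one possible way to stay within this constraint is to route every call through a single worker thread. The following is a minimal sketch, not part of the NLU API: `SerialCaller` is a hypothetical helper name, and the submitted task stands in for whatever code performs the tagging call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper: runs submitted tasks one at a time, in submission order. */
final class SerialCaller implements AutoCloseable {
    // A single worker thread guarantees that two tasks never overlap.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Queues a task; in an app the task body would wrap the tagging call. */
    <T> Future<T> submit(Callable<T> task) {
        return worker.submit(task);
    }

    /** Stops accepting new tasks; tasks already queued still run. */
    @Override
    public void close() {
        worker.shutdown();
    }
}
```

This only serializes calls made through the same `SerialCaller` instance inside one process. Queuing between different processes is handled by the engine, as described above.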
Part-of-speech tagging development
Scenarios
- Search engine development. For a search engine, finding every matching result among tens of billions of web pages is of little use. What matters is presenting the most relevant results first, which is known as relevance ranking. The accuracy of word segmentation directly affects the relevance ranking of search results.
- Development of semantic analysis software. In semantic analysis, word segmentation helps determine the correct meaning of the text, and part-of-speech tagging identifies whether each word is a noun, verb, adjective, and so on, which makes semantic analysis easier to extend.
Interface Description
Part-of-speech tagging provides the getWordPos() interface, which assigns a part of speech to each word in the segmentation result according to the selected segmentation granularity.
Main interfaces

| Interface name | Description |
|---|---|
| ResponseResult getWordPos(String requestData, int requestType) | Performs part-of-speech tagging synchronously. |
| ResponseResult getWordPos(final String requestData, final int requestType, final OnResultListener<ResponseResult> listener) | Performs part-of-speech tagging asynchronously. |
| void init(Context context, OnResultListener<Integer> listener, boolean isLoadModel) | Initializes the NLU service. Call this interface before calling any NLU functional interface, and call the functional interfaces only after the callback result has been received in the onResult(T) method of OnResultListener. The listener parameter passed in by the developer serves as the callback that reports the initialization process and result. |
| void destroy(Context context) | Cancels all NLU tasks and destroys the NLU engine service. After this method is called, the NLU service can no longer be used. To use the NLU service again, call init(Context, OnResultListener<Integer>, boolean) again to initialize it. |
Interface input value description
- requestType indicates the request type, which is defined by the NluRequestType class as follows:

| Type | Description |
|---|---|
| static int | REQUEST_TYPE_LOCAL = 0: local request |
- requestData is the input text information in JSON format, as described in the following table.

| Parameter name | Required | Type | Description |
|---|---|---|---|
| text | Yes | String | The text to be analyzed, encoded in UTF-8 and limited to 500 characters. |
| type | No | long | The granularity of word segmentation. The default is 0. The supported values are listed after this table. |
| callPkg | No | String | Caller name. |
| callType | No | int | Caller type. 0: normal application (default). 1: Quick App. |
| callVersion | No | String | Caller version number. |
| callState | No | int | Caller state. -1: unknown (default). 0: foreground. 1: background. |

Supported values of type (the examples are English glosses of Chinese input):
- 0: Basic words, the finest granularity. For example, "I want to watch The Fast and the Furious" is segmented into "I/want/watch/speed/and/passion".
- 1: Basic words with entities merged. For example, "I'm going to Jiangning Wanda Plaza to watch The Fast and the Furious" is segmented into "I/want/go/Jiangning Wanda Plaza/watch/speed/and/passion". For text that contains no mergeable entity, the result is the same as with type 0. For example, "Watch a movie together at 3 o'clock tomorrow afternoon" is segmented into "tomorrow/afternoon/3 o'clock/together/watch/movie".
- 9223372036854775807 (2 to the 63rd power minus 1): On top of type 1, whole structures such as entity time and place are merged (parts separated by symbols are not merged), and some common phrases, such as "adjective + of" and "one-character verb + one-character noun", are merged to simplify the sentence components. Under these rules, "Tomorrow I will watch a movie at Jiangning Ruidu Jinyi Cinema from 3:00 to 5:00 p.m." is segmented into "Tomorrow 3:00 p.m./to/5:00 p.m./I am/at/Jiangning Ruidu Jinyi Cinema/watching/movie".
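Writing the requestData string by hand is error-prone once the text contains quotation marks or backslashes. The following minimal sketch assembles the two most common fields, text and type, and rejects text longer than the documented limit before the call is made. `PosRequest` and its methods are hypothetical names introduced for illustration, and counting the 500-character limit in Unicode code points is an assumption, since the table only says "characters".

```java
/** Hypothetical helper that assembles the requestData JSON string. */
final class PosRequest {
    /** Limit stated in the parameter table. */
    static final int MAX_CHARS = 500;

    /** Builds {"text":"...","type":N} after checking the length limit. */
    static String build(String text, long type) {
        // Assumption: the limit counts Unicode code points.
        if (text.codePointCount(0, text.length()) > MAX_CHARS) {
            throw new IllegalArgumentException("text exceeds " + MAX_CHARS + " characters");
        }
        return "{\"text\":\"" + escape(text) + "\",\"type\":" + type + "}";
    }

    /** Escapes the characters that JSON does not allow raw inside a string. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the \\uXXXX form.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }
}
```

For the coarsest granularity, `Long.MAX_VALUE` can be passed as the type argument, since it equals 9223372036854775807.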
Entity categories currently supported by NLU:

| Entity category | Remarks |
|---|---|
| Movie | Relies on dictionaries; requires real, unmodified examples. |
| TV drama | Relies on dictionaries; requires real, unmodified examples. |
| Variety show | Relies on dictionaries; requires real, unmodified examples. |
| Cartoon | Relies on dictionaries; requires real, unmodified examples. |
| Train number | Requires real, unmodified examples. |
| Flight number | Requires real, unmodified examples. |
| Team | Relies on dictionaries; supports identification of NBA, CBA, Premier League, La Liga, Bundesliga, Serie A, Ligue 1, and Chinese Super League teams; requires real, unmodified examples. |
| Person name | Requires real, unmodified examples. |
| Tracking number | Requires real, unmodified examples. |
| Telephone number | Requires real, unmodified examples. |
| URL | Requires real, unmodified examples. |
| Email | Requires real, unmodified examples. |
| League | NBA, CBA, Premier League, La Liga, Bundesliga, Serie A, Ligue 1, and Chinese Super League; requires real, unmodified examples. |
| Time | Requires real, unmodified examples. |
| Place | Includes hotels, restaurants, scenic spots, schools, roads, provinces, cities, counties, districts, towns, and so on; partially relies on dictionaries. |
| Verification code | Requires real, unmodified examples. |
Interface return value description
The responseResult field of the returned ResponseResult is a JSON string that carries the result of part-of-speech tagging:

| Parameter name | Required | Value type | Description |
|---|---|---|---|
| code | Yes | int | The result code of part-of-speech tagging. |
| message | Yes | String | Error message. |
| pos | No | JSONArray | The array of segmented words. Each element of the array is a JSONObject. |
| +word | No | String | Segmented word. |
| +tag | No | String | Part of speech. When type is 1 or 9223372036854775807, person-name entities are tagged nr, time entities are tagged t, place entities are tagged ns, and all other entities are tagged ne. See Table 1 for the specific part-of-speech types. |
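Table 1 Part-of-speech types

| Tag | Description | Tag | Description | Tag | Description |
|---|---|---|---|---|---|
| n | Noun | rr | Personal pronoun | u | Particle |
| nr | Person name | rz | Demonstrative pronoun | uzhe | Particle "着" |
| ns | Place name | rzt | Temporal demonstrative pronoun | ule | Particles "了" and "喽" |
| ne | Used only when entities are merged; all entities other than person names, times, and places are returned as ne | rzs | Locative demonstrative pronoun | uguo | Particle "过" |
| t | Time word | rzv | Predicative demonstrative pronoun | ude1 | Particle "的" |
| tg | Time morpheme | ry | Interrogative pronoun | ude2 | Particle "地" |
| s | Locative word | ryt | Temporal interrogative pronoun | ude3 | Particle "得" |
| f | Localizer (direction or position word) | rys | Locative interrogative pronoun | usuo | Particle "所" |
| v | Verb | ryv | Predicative interrogative pronoun | udeng | Particles "等" and "等等" |
| vd | Adverbial verb | rg | Pronoun morpheme | uyy | Particles "一样", "一般", "似的", and "般" |
| vn | Nominal verb | m | Numeral | udh | Particle "的话" |
| vshi | Verb "是" | mq | Numeral-classifier compound | uls | Particles "来讲", "来说", "而言", and "说来" |
| vyou | Verb "有" | q | Classifier (measure word) | uzhi | Particle "之" |
| vf | Directional verb | qv | Verbal classifier | ulian | Particle "连" |
| a | Adjective | qt | Temporal classifier | e | Interjection |
| ad | Adverbial adjective | d | Adverb | y | Modal particle |
| an | Nominal adjective | p | Preposition | o | Onomatopoeia |
| b | Distinguishing word | pba | Preposition "把" | h | Prefix |
| bl | Distinguishing idiomatic phrase | pbei | Preposition "被" | k | Suffix |
| z | State word | c | Conjunction | x | String |
| r | Pronoun | cc | Coordinating conjunction | idiom | Idiom |
| w | Punctuation | - | - | - | - |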
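In the tag table above, every tag in a family shares its first letter (for example v, vd, vn, vshi, vyou, and vf are all verbs), so an application that only needs coarse categories can group tags by that letter. The sketch below illustrates this observation; `PosTagCategory` is a hypothetical helper, the English category names are this sketch's own labels, and treating ne as a separate "merged entity" category is a choice made here rather than something the table prescribes.

```java
/** Hypothetical helper that maps a fine-grained tag to a coarse category. */
final class PosTagCategory {
    static String coarse(String tag) {
        if (tag == null || tag.isEmpty()) {
            return "unknown";
        }
        // Two tags are handled by full name before falling back to the first letter.
        if (tag.equals("idiom")) {
            return "idiom";
        }
        if (tag.equals("ne")) {
            return "merged entity";
        }
        switch (tag.charAt(0)) {
            case 'n': return "noun";
            case 't': return "time word";
            case 's': return "locative word";
            case 'f': return "localizer";
            case 'v': return "verb";
            case 'a': return "adjective";
            case 'b': return "distinguishing word";
            case 'z': return "state word";
            case 'r': return "pronoun";
            case 'm': return "numeral";
            case 'q': return "classifier";
            case 'd': return "adverb";
            case 'p': return "preposition";
            case 'c': return "conjunction";
            case 'u': return "particle";
            case 'e': return "interjection";
            case 'y': return "modal particle";
            case 'o': return "onomatopoeia";
            case 'h': return "prefix";
            case 'k': return "suffix";
            case 'x': return "string";
            case 'w': return "punctuation";
            default:  return "unknown";
        }
    }
}
```

With this grouping, a filter such as "keep only nouns and verbs" reduces to comparing the result of `PosTagCategory.coarse(tag)` against two strings instead of enumerating every fine-grained tag.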
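Development Steps
When using the part-of-speech tagging interfaces, add the classes that implement part-of-speech tagging to the project.

    import ohos.ai.nlu.NluRequestType;
    import ohos.ai.nlu.NluClient;
    import ohos.ai.nlu.OnResultListener;
    import ohos.ai.nlu.ResponseResult;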
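Use the NluClient static class for initialization. The connection to the service is obtained asynchronously.
- context: application context information. It must be an instance of ohos.aafwk.ability.Ability or ohos.aafwk.ability.AbilitySlice, or an instance of one of their subclasses.
- listener: callback for the initialization result. It can be null.
- isLoadModel: whether to load the model. If true is passed, the model is loaded during initialization; if false is passed, it is not.

    NluClient.getInstance().init(context, new OnResultListener<Integer>() {
        @Override
        public void onResult(Integer result) {
            // Initialization callback, invoked once the service has been initialized successfully
        }
    }, true);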
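Since the functional interfaces may only be called after the initialization callback has fired, the calling code needs some way to wait for that callback. One possible approach is to complete a `CompletableFuture` from inside the listener, as sketched below. `ResultCallback` is a stand-in declared here with the same shape as OnResultListener so that the example compiles on its own; it is not the real ohos.ai.nlu type, and `InitGate` is a hypothetical name. The meaning of the Integer result is not described in this section, so the sketch simply passes it through.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/** Stand-in with the same shape as OnResultListener<T>; not the real NLU type. */
interface ResultCallback<T> {
    void onResult(T result);
}

/** Hypothetical gate that lets callers wait until the init callback has fired. */
final class InitGate {
    private final CompletableFuture<Integer> ready = new CompletableFuture<>();

    /** Returns the callback to hand to the initialization call. */
    ResultCallback<Integer> listener() {
        return result -> ready.complete(result);
    }

    /** True once the callback has delivered a result. */
    boolean isReady() {
        return ready.isDone();
    }

    /** Blocks until the callback fires; throws TimeoutException if it does not fire in time. */
    int await(long timeout, TimeUnit unit) throws Exception {
        return ready.get(timeout, unit);
    }
}
```

In an application, the body of the anonymous OnResultListener would forward its result to the gate, and worker code would call `await` before issuing tagging requests. Blocking in `await` should happen on a background thread, not on the UI thread.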
Call the part-of-speech tagging interface.
Perform part-of-speech tagging synchronously:

    String requestData = "{\"text\":\"我要看速度与激情\",\"type\":0}";
    ResponseResult responseResult = NluClient.getInstance().getWordPos(requestData, NluRequestType.REQUEST_TYPE_LOCAL);

Perform part-of-speech tagging asynchronously:

    NluClient.getInstance().getWordPos(requestData,
        NluRequestType.REQUEST_TYPE_LOCAL, new OnResultListener<ResponseResult>() {
            @Override
            public void onResult(ResponseResult result) {
                // Handle the asynchronous result
            }
        });
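Whichever call style is used, the tagging result arrives as a JSON string containing a pos array whose elements carry the word and tag fields. The following minimal sketch extracts those pairs with regular expressions. It assumes the JSON string has already been obtained from the ResponseResult (the accessor is not described in this section), that each element of pos is a flat object, and that no word contains a curly brace. `PosResponse` and `Token` are hypothetical names, and a full JSON parser is the more robust choice in production code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that pulls word/tag pairs out of the result JSON. */
final class PosResponse {
    /** One segmented word together with its part-of-speech tag. */
    static final class Token {
        final String word;
        final String tag;

        Token(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    // Matches one flat JSON object such as {"word":"...","tag":"..."}.
    private static final Pattern OBJECT = Pattern.compile("\\{[^{}]*\\}");
    private static final Pattern WORD = field("word");
    private static final Pattern TAG = field("tag");

    /** Builds a pattern for "name":"value"; escape sequences in the value are kept as-is. */
    private static Pattern field(String name) {
        return Pattern.compile("\"" + name + "\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");
    }

    /** Returns the tokens in order, or an empty list when "pos" is absent. */
    static List<Token> tokens(String json) {
        List<Token> result = new ArrayList<>();
        int start = json.indexOf("\"pos\"");
        if (start < 0) {
            return result; // "pos" is optional in the return value
        }
        Matcher objects = OBJECT.matcher(json);
        objects.region(start, json.length());
        while (objects.find()) {
            String obj = objects.group();
            Matcher w = WORD.matcher(obj);
            Matcher t = TAG.matcher(obj);
            if (w.find() && t.find()) {
                result.add(new Token(w.group(1), t.group(1)));
            }
        }
        return result;
    }
}
```

Because the two fields are matched independently inside each object, the sketch does not depend on whether word or tag comes first.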
Destroy the NLU service.

    NluClient.getInstance().destroy(context);