NLP Task 2

jieba: https://github.com/fxsjy/jieba

Supports three segmentation modes:

  Accurate mode: tries to split the sentence as precisely as possible; suitable for text analysis.

  Full mode: scans out every fragment of the sentence that could form a word; very fast, but it cannot resolve ambiguity.

  Search-engine mode: on top of accurate mode, long words are split again to improve recall; suitable for search-engine tokenization.

  Supports traditional Chinese segmentation.

  Supports custom dictionaries (see the usage sketch below).

jieba demo: https://github.com/fxsjy/jiebademo
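A minimal usage sketch of the three modes plus a custom dictionary (the sample sentence is illustrative, and the dictionary path "userdict.txt" is only a placeholder):

import jieba

sentence = "我来到北京清华大学"

# Accurate mode (default, cut_all=False): the most precise split, for text analysis
print("/".join(jieba.cut(sentence, cut_all=False)))

# Full mode (cut_all=True): every possible word, fast but ambiguous
print("/".join(jieba.cut(sentence, cut_all=True)))

# Search-engine mode: re-splits long words from accurate mode to raise recall
print("/".join(jieba.cut_for_search(sentence)))

# Custom dictionary: one entry per line ("word frequency POS-tag")
# jieba.load_userdict("userdict.txt")  # illustrative path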

Language models: three types

unigram, bigram, and trigram are concepts in natural language processing (NLP); the parent term is n-gram.
unigram: a single word
bigram: two consecutive words
trigram: three consecutive words
For example, 西安交通大学 split at the character level:
unigram form: 西/安/交/通/大/学
bigram form: 西安/安交/交通/通大/大学
trigram form: 西安交/安交通/交通大/通大学
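A short sketch of how such character-level n-grams can be generated (the helper name char_ngrams is illustrative, not part of any library):

def char_ngrams(text, n):
    # Slide a window of length n over the characters of the text
    return [text[i:i + n] for i in range(len(text) - n + 1)]

text = "西安交通大学"
print("/".join(char_ngrams(text, 1)))  # 西/安/交/通/大/学
print("/".join(char_ngrams(text, 2)))  # 西安/安交/交通/通大/大学
print("/".join(char_ngrams(text, 3)))  # 西安交/安交通/交通大/通大学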
 
Using TF-IDF:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer


# Read the corpus and use its third column (the titles) as the documents
df_path = 'a.csv'
df = pd.read_csv(df_path)
df_title = df.iloc[:, 2]

# Term frequencies: per-document token counts, summed over the corpus
df_tf = CountVectorizer(token_pattern='(?u)\\b\\w+\\b')
df_tf_fit = df_tf.fit_transform(df_title)
words = df_tf.get_feature_names()  # get_feature_names_out() on scikit-learn >= 1.0
words_tf = list(df_tf_fit.toarray().sum(axis=0))

# Inverse document frequencies learned from the count matrix
tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(df_tf_fit.toarray())

words_idf = []
for idx, word in enumerate(words):
    words_idf.append(float(tfidf_transformer.idf_[idx]))

print(words)
print(words_tf)
print(words_idf)

# TF-IDF per word: once with TF and IDF rescaled by their ranges, once raw
words_tfidf_Normalized = []
words_tfidf = []
for i in range(len(words)):
    s = words_tf[i] / (max(words_tf) - min(words_tf))
    n = words_idf[i] / (max(words_idf) - min(words_idf))
    words_tfidf_Normalized.append(s * n)

for i in range(len(words)):
    s = words_tf[i]
    n = words_idf[i]
    words_tfidf.append(s * n)

print(words_tfidf_Normalized)
print(words_tfidf)


sklearn TfidfVectorizer docs: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
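For comparison, TfidfVectorizer combines CountVectorizer and TfidfTransformer into a single step; a minimal sketch on the same title column (df_title from the code above):

from sklearn.feature_extraction.text import TfidfVectorizer

# One-step equivalent of CountVectorizer + TfidfTransformer:
# token counts, IDF weighting and L2 normalization are all applied here
vectorizer = TfidfVectorizer(token_pattern='(?u)\\b\\w+\\b')
tfidf_matrix = vectorizer.fit_transform(df_title)
print(vectorizer.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
print(tfidf_matrix.toarray())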


Reposted from: www.cnblogs.com/willert/p/10864850.html