Reposted from: https://blog.csdn.net/chuchus/article/details/77716545
1. Introduction
gensim is a Python NLP library. It includes tf-idf, word2vec, and doc2vec models, among others.
Official site
2. word2vec
Official tutorial: models.word2vec – Deep learning with word2vec
2.1 Classes and methods
gensim.models.word2vec.Word2Vec(utils.SaveLoad)
Class for training, using, and evaluating word2vec models.

__init__(self, sentences=None, size=100, alpha=0.025, window=5, min_count=5, ...)
- sentences: a list of sentences, where each sentence is itself a list of words: [word1, word2, ..., word_n].
- size: the dimensionality of the feature vectors.
- window: the maximum distance between the current and predicted word within a sentence.
- alpha: the initial learning rate.
- seed: seed for the random number generator.
- min_count: ignore all words with total frequency lower than this.

save(self, *args, **kwargs)
Persist the model, e.g. model.save('/tmp/mymodel').

@classmethod load(cls, *args, **kwargs)
Deserialize a persisted model, e.g. new_model = gensim.models.Word2Vec.load('/tmp/mymodel').

model[word]
e.g. model['computer'] returns that word's vector as a NumPy array.

model.wv.similar_by_word(self, word, topn=10, ...)
Query a word's k nearest neighbors, ranked by cosine similarity.
2.2 Examples
model.wv.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])
# [('queen', 0.71382287), ...]
model.wv.doesnt_match("breakfast cereal dinner lunch".split())
# 'cereal'
model.wv.similarity('woman', 'man')
# 0.73723527
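Both similarity and similar_by_word boil down to cosine similarity between word vectors. A from-scratch sketch of the computation, using made-up 3-d vectors in place of real trained embeddings:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "word vectors" (invented for illustration only):
v_woman = [0.2, 0.9, 0.1]
v_man = [0.3, 0.8, 0.2]
print(cosine_similarity(v_woman, v_man))  # close to 1: similar directions
```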
3. doc2vec
Official tutorial: models.doc2vec – Deep learning with paragraph2vec
In word2vec, the corpus vocabulary typically has hundreds of thousands of entries, so a new sentence rarely contains out-of-vocabulary words.
In doc2vec, by contrast, every new document is by definition unseen. For this, gensim provides gensim.models.doc2vec.Doc2Vec#infer_vector(self, doc_words, alpha=0.1, min_alpha=0.0001, steps=5),
which, once the model has been trained, predicts a vector representation for a new document.
Common classes and methods
gensim.similarities.docsim.SparseMatrixSimilarity(interfaces.SimilarityABC)
A class that measures similarity using cosine similarity.
4. tf-idf model
import logging
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
from gensim import corpora, models, similarities
# First, create a small corpus of 9 documents and 12 features
# a list of list of tuples
# see: https://radimrehurek.com/gensim/tut1.html
corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
[(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
[(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
[(0, 1.0), (4, 2.0), (7, 1.0)],
[(3, 1.0), (5, 1.0), (6, 1.0)],
[(9, 1.0)],
[(9, 1.0), (10, 1.0)],
[(9, 1.0), (10, 1.0), (11, 1.0)],
[(8, 1.0), (10, 1.0), (11, 1.0)]]
tfidf = models.TfidfModel(corpus)
vec = [(0, 1), (4, 1)]
print(tfidf[vec])
# shape=9*12
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=12)
sims = index[tfidf[vec]]
print(list(enumerate(sims)))
"""
[(0, 0.8075244024440723), (4, 0.5898341626740045)]
# Document number zero (the first document) has a similarity score of 0.466=46.6%, the second document has a similarity score of 19.1% etc.
[(0, 0.4662244), (1, 0.19139354), (2, 0.24600551), (3, 0.82094586), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]
"""