Dictionary feature extraction:
from sklearn.feature_extraction import DictVectorizer
alist = [
    {'city': "BJ", 'temp': 33},
    {'city': "GZ", 'temp': 42},
    {'city': "SH", 'temp': 40},
]
d = DictVectorizer(sparse=False)
feature = d.fit_transform(alist)
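# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(d.get_feature_names_out())
print(feature)  # returns a dense matrix, because sparse=False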
Output:
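To make the encoding concrete, here is a minimal plain-Python sketch of what DictVectorizer does with this input, assuming its default behaviour: string values are one-hot encoded as "key=value" features, numeric values pass through unchanged, and feature names are sorted. The helper `to_features` is hypothetical and exists only for illustration; it is not part of scikit-learn.

```python
# Plain-Python sketch of the encoding (not scikit-learn's actual implementation).
alist = [
    {'city': "BJ", 'temp': 33},
    {'city': "GZ", 'temp': 42},
    {'city': "SH", 'temp': 40},
]


def to_features(record):
    """Map one dict to {feature_name: value}.

    String values become one-hot 'key=value' features;
    numeric values keep their key and are stored as floats.
    """
    out = {}
    for key, value in record.items():
        if isinstance(value, str):
            out[f"{key}={value}"] = 1.0
        else:
            out[key] = float(value)
    return out


encoded = [to_features(r) for r in alist]

# Collect every feature name seen in any record and sort them.
feature_names = sorted({name for row in encoded for name in row})

# Features missing from a record are filled with 0.0.
matrix = [[row.get(name, 0.0) for name in feature_names] for row in encoded]

print(feature_names)  # ['city=BJ', 'city=GZ', 'city=SH', 'temp']
for row in matrix:
    print(row)
# [1.0, 0.0, 0.0, 33.0]
# [0.0, 1.0, 0.0, 42.0]
# [0.0, 0.0, 1.0, 40.0]
```

Each city gets its own 0/1 column, while temp stays a single numeric column.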
Text feature extraction:
import jieba
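# jieba.cut returns a generator of tokens for each sentence
jb1 = jieba.cut("人生苦短,我用python")
jb2 = jieba.cut("人生漫长,不用python")
# Join the tokens with spaces so that CountVectorizer sees word boundaries
ct1 = ' '.join(jb1)
ct2 = ' '.join(jb2)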
from sklearn.feature_extraction.text import CountVectorizer
vector = CountVectorizer()
res = vector.fit_transform([ct1,ct2])
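# Single-character tokens are not counted: the default token_pattern
# requires at least two word characters
print(res)  # sparse matrix, printed as (row, column) count entries
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vector.get_feature_names_out())
print(res.toarray())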
Output:
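The reason single characters are dropped can be sketched with the standard library alone. CountVectorizer's documented default token_pattern is `(?u)\b\w\w+\b`, which only matches tokens of two or more word characters. The sketch below applies that pattern by hand; the two space-joined strings are an assumed jieba segmentation (the real segmentation may differ), and `tokenize` is a hypothetical helper, not a scikit-learn function.

```python
import re
from collections import Counter

# Assumed jieba output, already joined with spaces (actual segmentation may differ).
ct1 = "人生 苦短 , 我 用 python"
ct2 = "人生 漫长 , 不用 python"

# Default token_pattern documented for CountVectorizer:
# a token is two or more consecutive word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and keep only tokens matching the default pattern."""
    return TOKEN_PATTERN.findall(text.lower())


docs = [tokenize(ct1), tokenize(ct2)]

# The vocabulary is the sorted set of all tokens across documents.
vocabulary = sorted({tok for doc in docs for tok in doc})

# One row per document, one column per vocabulary entry.
counts = [[Counter(doc)[tok] for tok in vocabulary] for doc in docs]

print(docs[0])      # ['人生', '苦短', 'python']
print(vocabulary)   # ['python', '不用', '人生', '漫长', '苦短']
print(counts)       # [[1, 0, 1, 0, 1], [1, 1, 1, 1, 0]]
```

The single-character tokens 我 and 用 and the comma never reach the vocabulary. In scikit-learn itself, passing `token_pattern=r"(?u)\b\w+\b"` to CountVectorizer is one way to keep single-character tokens.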