Reference: https://blog.csdn.net/sxhlovehmm/article/details/41252125
Why data sparsity is a problem: suppose the vocabulary contains 20,000 words. A bigram (2-gram) model then has 400,000,000 possible 2-grams, and a trigram (3-gram) model has 8,000,000,000,000 possible 3-grams. Most of these word combinations never occur in the corpus, so their maximum-likelihood estimates are 0. That causes real trouble: a sentence's probability is a product of n-gram probabilities, so a single zero factor drives the whole sentence's probability to 0. In the end the model can score only a pitiful handful of sentences, while most sentences get probability 0. This is why we apply data smoothing, which has two goals: make all N-gram probabilities sum to 1, and make every N-gram probability nonzero. For smoothing methods, see page 33 of 《数学之美》 (The Beauty of Mathematics).
Note: in a 2-gram model, (a, b) and (b, a) count as two different bigrams, and a word may follow itself, so the counts are ordered with repetition:
400,000,000 = 20000 × 20000
8,000,000,000,000 = 20000 × 20000 × 20000
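To make the zero-probability problem and its fix concrete, here is a minimal sketch of add-one (Laplace) smoothing on a toy corpus. The corpus, variable names, and helper functions below are invented for illustration, not taken from the referenced blog:

```python
from collections import Counter

# Toy corpus: three tokenized sentences (invented for illustration).
corpus = [["I", "love", "NLP"], ["I", "love", "Spark"], ["you", "love", "NLP"]]

unigram_counts = Counter()
bigram_counts = Counter()
for sent in corpus:
    unigram_counts.update(sent[:-1])           # history counts for P(w2 | w1)
    bigram_counts.update(zip(sent, sent[1:]))  # observed bigram counts

V = len(set(w for s in corpus for w in s))     # vocabulary size

def p_mle(w1, w2):
    # Maximum-likelihood estimate: exactly 0 for any unseen bigram.
    return bigram_counts[(w1, w2)] / unigram_counts[w1] if unigram_counts[w1] else 0.0

def p_laplace(w1, w2):
    # Add-one (Laplace) smoothing: every bigram gets a nonzero probability,
    # and for each history w1 the smoothed probabilities still sum to 1.
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

print(p_mle("you", "Spark"))      # 0.0 — one unseen bigram zeroes a whole sentence
print(p_laplace("you", "Spark"))  # small but nonzero
```

The unseen bigram ("you", "Spark") gets probability 0 under MLE but 1/(1+V) after smoothing, which is exactly the fix the paragraph above motivates.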
--------------
The vocabularies of n-gram models are mutually disjoint. For example, the 2-gram model's vocabulary does not contain the 1-gram model's vocabulary.
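To see this disjointness concretely, a minimal pure-Python sketch (no Spark needed; the `ngrams` helper is an assumed re-implementation that joins tokens with spaces the way Spark's NGram does):

```python
def ngrams(tokens, n):
    # Sliding-window n-grams, space-joined like Spark ML's NGram output.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["Hi", "I", "heard", "about", "Spark"]
unigrams = set(ngrams(tokens, 1))
bigrams = set(ngrams(tokens, 2))

print(bigrams)             # the four space-joined bigrams of the sentence
print(unigrams & bigrams)  # set() — the two vocabularies share no entries
```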
The example below is somewhat flawed; do not treat it as a reference.
-----------------
from __future__ import print_function
from pyspark.sql import SparkSession
from pyspark.ml.feature import NGram, CountVectorizer

spark = SparkSession \
    .builder \
    .appName("NGramExample") \
    .getOrCreate()

wordDataFrame = spark.createDataFrame([
    (0, ["Hi", "I", "heard", "about", "Spark"]),
    (1, ["I", "wish", "Java", "could", "use", "case", "classes"]),
    (2, ["Logistic", "regression", "models", "are", "neat"])
], ["id", "words"])

# Turn each token sequence into a sequence of space-joined bigrams.
ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngramDataFrame = ngram.transform(wordDataFrame)
ngramDataFrame.select("ngrams").show(truncate=False)

# First run: vectorize the original unigram tokens ("words" column).
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=20, minDF=1)
model = cv.fit(ngramDataFrame)
result = model.transform(ngramDataFrame)
result.show(truncate=False)
print(model.vocabulary)
print(len(model.vocabulary))

# Second run: vectorize the bigrams ("ngrams" column) instead.
cv = CountVectorizer(inputCol="ngrams", outputCol="features", vocabSize=20, minDF=1)
model = cv.fit(ngramDataFrame)
result = model.transform(ngramDataFrame)
result.show(truncate=False)
print(model.vocabulary)
print(len(model.vocabulary))
Output:
+------------------------------------------------------------------+
|ngrams |
+------------------------------------------------------------------+
|[Hi I, I heard, heard about, about Spark] |
|[I wish, wish Java, Java could, could use, use case, case classes]|
|[Logistic regression, regression models, models are, are neat] |
+------------------------------------------------------------------+
+---+------------------------------------------+------------------------------------------------------------------+----------------------------------------------------+
|id |words |ngrams |features |
+---+------------------------------------------+------------------------------------------------------------------+----------------------------------------------------+
|0 |[Hi, I, heard, about, Spark] |[Hi I, I heard, heard about, about Spark] |(16,[0,2,3,5,14],[1.0,1.0,1.0,1.0,1.0]) |
|1 |[I, wish, Java, could, use, case, classes]|[I wish, wish Java, Java could, could use, use case, case classes]|(16,[0,4,6,7,8,10,13],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
|2 |[Logistic, regression, models, are, neat] |[Logistic regression, regression models, models are, are neat] |(16,[1,9,11,12,15],[1.0,1.0,1.0,1.0,1.0]) |
+---+------------------------------------------+------------------------------------------------------------------+----------------------------------------------------+
['I', 'are', 'about', 'Spark', 'could', 'heard', 'classes', 'Java', 'use', 'regression', 'wish', 'Logistic', 'neat', 'case', 'Hi', 'models']
16
+---+------------------------------------------+------------------------------------------------------------------+----------------------------------------------+
|id |words |ngrams |features |
+---+------------------------------------------+------------------------------------------------------------------+----------------------------------------------+
|0 |[Hi, I, heard, about, Spark] |[Hi I, I heard, heard about, about Spark] |(14,[1,2,10,13],[1.0,1.0,1.0,1.0]) |
|1 |[I, wish, Java, could, use, case, classes]|[I wish, wish Java, Java could, could use, use case, case classes]|(14,[4,5,6,8,11,12],[1.0,1.0,1.0,1.0,1.0,1.0])|
|2 |[Logistic, regression, models, are, neat] |[Logistic regression, regression models, models are, are neat] |(14,[0,3,7,9],[1.0,1.0,1.0,1.0]) |
+---+------------------------------------------+------------------------------------------------------------------+----------------------------------------------+
['regression models', 'I heard', 'heard about', 'models are', 'case classes', 'use case', 'could use', 'Logistic regression', 'wish Java', 'are neat', 'Hi I', 'I wish', 'Java could', 'about Spark']
14
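As a sanity check, the two vocabulary sizes printed above (16 unigrams, 14 bigrams) can be reproduced without Spark. The `bigrams` helper below is an assumed re-implementation of the space-joined bigram step:

```python
# Same three documents as in the Spark example above.
docs = [
    ["Hi", "I", "heard", "about", "Spark"],
    ["I", "wish", "Java", "could", "use", "case", "classes"],
    ["Logistic", "regression", "models", "are", "neat"],
]

def bigrams(tokens):
    # Space-joined adjacent pairs, mirroring NGram(n=2) output.
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

unigram_vocab = set(w for d in docs for w in d)
bigram_vocab = set(b for d in docs for b in bigrams(d))

print(len(unigram_vocab))  # 16 — matches the first CountVectorizer (inputCol="words")
print(len(bigram_vocab))   # 14 — matches the second (inputCol="ngrams")
```

This also confirms the earlier point: the bigram vocabulary and the unigram vocabulary are entirely separate sets of strings.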