Understanding n-gram models

Reference: https://blog.csdn.net/sxhlovehmm/article/details/41252125

An explanation of data sparsity: suppose the vocabulary contains 20,000 words. Under a bigram model there are 400,000,000 possible 2-grams, and under a trigram model there are 8,000,000,000,000 possible 3-grams! Many of these word combinations never appear in the corpus, so the maximum-likelihood estimates of their probabilities are 0. That causes real trouble: when computing the probability of a sentence, a single zero factor drives the whole sentence probability to 0, so in the end the model can score only a pitiful handful of sentences and assigns probability 0 to most of them. We therefore need data smoothing. Data smoothing has two goals: to keep all the N-gram probabilities summing to 1, and to make every N-gram probability non-zero. For smoothing methods, see page 33 of 《数学之美》 (The Beauty of Mathematics).

Note: in a 2-gram model, (a, b) and (b, a) count as two distinct 2-grams.

400,000,000 = 20000 × 20000

8,000,000,000,000 = 20000 × 20000 × 20000
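
To make the zero-probability problem and the effect of smoothing concrete, here is a minimal plain-Python sketch; the toy corpus, the helper names, and the choice of add-one (Laplace) smoothing are illustrative assumptions, not taken from the original post.

from collections import Counter

# toy corpus (assumption): two tokenized sentences
corpus = [["I", "love", "Spark"], ["I", "love", "Java"]]
V = 20000  # vocabulary size used in the text above

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter((sent[i], sent[i + 1])
                        for sent in corpus for i in range(len(sent) - 1))

def p_mle(w1, w2):
    # maximum-likelihood estimate: 0 for any bigram never seen in the corpus
    return bigram_counts[(w1, w2)] / unigram_counts[w1] if unigram_counts[w1] else 0.0

def p_laplace(w1, w2):
    # add-one smoothing: every possible bigram gets a small non-zero probability
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

print(p_mle("love", "Scala"))      # 0.0 -> would zero out any sentence containing it
print(p_laplace("love", "Scala"))  # ~5e-05, small but non-zero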

--------------

The vocabularies of different n-gram models do not contain one another. For example, the vocabulary of a 2-gram model does not include the vocabulary of a 1-gram model.

The example below is somewhat imperfect; do not treat it as authoritative.

-----------------

from __future__ import print_function
from pyspark.ml.feature import NGram, CountVectorizer
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("NGramExample")\
    .getOrCreate()
    
# Sample documents, each already tokenized into a list of words

wordDataFrame = spark.createDataFrame([
    (0, ["Hi", "I", "heard", "about", "Spark"]),
    (1, ["I", "wish", "Java", "could", "use", "case", "classes"]),
    (2, ["Logistic", "regression", "models", "are", "neat"])
], ["id", "words"])

# NGram turns each word sequence into a sequence of space-separated 2-grams
ngram = NGram(n=2, inputCol="words", outputCol="ngrams")

ngramDataFrame = ngram.transform(wordDataFrame)
ngramDataFrame.select("ngrams").show(truncate=False)

cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=20, minDF=1)

model = cv.fit(ngramDataFrame)

result = model.transform(ngramDataFrame)
result.show(truncate=False)
print(model.vocabulary)
print(len(model.vocabulary))

cv = CountVectorizer(inputCol="ngrams", outputCol="features", vocabSize=20, minDF=1)

model = cv.fit(ngramDataFrame)

result = model.transform(ngramDataFrame)
result.show(truncate=False)
print(model.vocabulary)
print(len(model.vocabulary))

The output is as follows:

+------------------------------------------------------------------+
|ngrams                                                            |
+------------------------------------------------------------------+
|[Hi I, I heard, heard about, about Spark]                         |
|[I wish, wish Java, Java could, could use, use case, case classes]|
|[Logistic regression, regression models, models are, are neat]    |
+------------------------------------------------------------------+

+---+------------------------------------------+------------------------------------------------------------------+----------------------------------------------------+
|id |words                                     |ngrams                                                            |features                                            |
+---+------------------------------------------+------------------------------------------------------------------+----------------------------------------------------+
|0  |[Hi, I, heard, about, Spark]              |[Hi I, I heard, heard about, about Spark]                         |(16,[0,2,3,5,14],[1.0,1.0,1.0,1.0,1.0])             |
|1  |[I, wish, Java, could, use, case, classes]|[I wish, wish Java, Java could, could use, use case, case classes]|(16,[0,4,6,7,8,10,13],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
|2  |[Logistic, regression, models, are, neat] |[Logistic regression, regression models, models are, are neat]    |(16,[1,9,11,12,15],[1.0,1.0,1.0,1.0,1.0])           |
+---+------------------------------------------+------------------------------------------------------------------+----------------------------------------------------+

['I', 'are', 'about', 'Spark', 'could', 'heard', 'classes', 'Java', 'use', 'regression', 'wish', 'Logistic', 'neat', 'case', 'Hi', 'models']
16
+---+------------------------------------------+------------------------------------------------------------------+----------------------------------------------+
|id |words                                     |ngrams                                                            |features                                      |
+---+------------------------------------------+------------------------------------------------------------------+----------------------------------------------+
|0  |[Hi, I, heard, about, Spark]              |[Hi I, I heard, heard about, about Spark]                         |(14,[1,2,10,13],[1.0,1.0,1.0,1.0])            |
|1  |[I, wish, Java, could, use, case, classes]|[I wish, wish Java, Java could, could use, use case, case classes]|(14,[4,5,6,8,11,12],[1.0,1.0,1.0,1.0,1.0,1.0])|
|2  |[Logistic, regression, models, are, neat] |[Logistic regression, regression models, models are, are neat]    |(14,[0,3,7,9],[1.0,1.0,1.0,1.0])              |
+---+------------------------------------------+------------------------------------------------------------------+----------------------------------------------+

['regression models', 'I heard', 'heard about', 'models are', 'case classes', 'use case', 'could use', 'Logistic regression', 'wish Java', 'are neat', 'Hi I', 'I wish', 'Java could', 'about Spark']
14
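
As a quick follow-up, the two vocabularies above illustrate the earlier point that the 1-gram and 2-gram dictionaries do not contain one another. A small check, assuming the two fitted CountVectorizer models were kept in separate variables (word_model and ngram_model are assumed names; the original code reuses model for both):

word_vocab = set(word_model.vocabulary)    # the 16 single words
ngram_vocab = set(ngram_model.vocabulary)  # the 14 bigrams
print(word_vocab & ngram_vocab)            # set() -- no overlap at all
print(len(word_vocab), len(ngram_vocab))   # 16 14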
