Analysis of Bayes Classification in Mahout (2)

2. Model

Once the four jobs of the training phase have finished, the Bayes model is complete. Three directories are generated and saved:

trainer-tfIdf

trainer-weights

trainer-thetaNormalizer

The model is stored as Sequence files on the distributed file system; we can export it to local txt files to inspect it.

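3. Testing

Class invoked: TestClassifier

Package: org.apache.mahout.classifier.bayes

Depending on the command-line arguments, classification runs either sequentially or in parallel as a map/reduce job. Only the parallel map/reduce path is analyzed here; it invokes the BayesClassifierDriver class.

Analysis of the BayesClassifierDriver class

runJob is called first. Inside runJob, BayesClassifierMapper runs first and BayesClassifierReducer second. After the job finishes, the confusion matrix (ConfusionMatrix) is used to display the results.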

⑴ BayesClassifierMapper

First, configure runs. It creates algorithm = new BayesAlgorithm() and datastore = new InMemoryDatastore(params); the datastore is what the model gets loaded into (Sigma_j, Sigma_k, Sigma_kSigma_j, thetaNormalizer, weight = TfIdf, alpha_i = 1.0). It then creates classifier = new ClassifierContext(algorithm, datastore) and calls classifier.initialize(). Initializing the classifier means calling datastore.initialize() and algorithm.initialize(this.datastore).

Datastore initialization:

The loadModel method of SequenceFileModelReader is called; it uses five load methods:

① loadFeatureWeights (loads Sigma_j) builds a hashmap Sigma_j {0: weight, 1: weight, ...}, where 0, 1, ... are feature indices and weight is the Sigma_j value.

② loadLabelWeights (loads Sigma_k) builds a hashmap Sigma_k {0: weight, 1: weight, ...}, where 0, 1, ... are label (class) indices and weight is the Sigma_k value.

③ loadSumWeight (loads Sigma_kSigma_j) sets the datastore member variable Sigma_jSigma_k = value, the sum of all tfidf values obtained in training.

④ loadThetaNormalizer (loads ThetaNormalizer) builds a hashmap thetaNormalizerPerLabel {0: weight, 1: weight, ...}, where weight is the value passed in, and sets the datastore member variable thetaNormalizer = max(1.0, |weight|).

⑤ loadWeightMatrix (loads weight, i.e. tfidf) builds weightMatrix, a SparseMatrix whose rows are feature indices and whose columns are label indices; each cell holds the tfidf for that feature and label.
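The five load steps above can be pictured with a small in-memory structure. This is only a minimal sketch: `DatastoreSketch` and its method names are hypothetical stand-ins, not Mahout's classes, and the sparse matrix is approximated with nested hash maps.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the in-memory datastore; not Mahout's actual class. */
final class DatastoreSketch {
    // Sigma_j: feature index -> weight
    final Map<Integer, Double> sigmaJ = new HashMap<>();
    // Sigma_k: label index -> weight
    final Map<Integer, Double> sigmaK = new HashMap<>();
    // per-label theta normalizer: label index -> weight
    final Map<Integer, Double> thetaNormalizerPerLabel = new HashMap<>();
    // sparse weight matrix: feature index (row) -> label index (column) -> tfidf
    final Map<Integer, Map<Integer, Double>> weightMatrix = new HashMap<>();
    double sigmaJSigmaK = 0.0;    // sum of all tfidf values
    double thetaNormalizer = 1.0; // starts at 1.0, so the result is max(1.0, |weight|)

    void loadFeatureWeight(int feature, double weight) { sigmaJ.put(feature, weight); }

    void loadLabelWeight(int label, double weight) { sigmaK.put(label, weight); }

    void loadSumWeight(double value) { sigmaJSigmaK = value; }

    void loadThetaNormalizer(int label, double weight) {
        thetaNormalizerPerLabel.put(label, weight);
        thetaNormalizer = Math.max(thetaNormalizer, Math.abs(weight));
    }

    void loadWeight(int feature, int label, double tfidf) {
        weightMatrix.computeIfAbsent(feature, k -> new HashMap<>()).put(label, tfidf);
    }

    /** Cells that were never stored read as 0.0, as in a sparse matrix lookup. */
    double getWeight(int feature, int label) {
        Map<Integer, Double> row = weightMatrix.get(feature);
        return row == null ? 0.0 : row.getOrDefault(label, 0.0);
    }

    public static void main(String[] args) {
        DatastoreSketch ds = new DatastoreSketch();
        ds.loadWeight(0, 1, 2.5);
        ds.loadThetaNormalizer(0, -3.0);
        System.out.println(ds.getWeight(0, 1) + " " + ds.getWeight(5, 5) + " " + ds.thetaNormalizer);
    }
}
```

Storing only the non-zero cells is what keeps the feature-by-label matrix small enough to hold in memory in each mapper.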

Algorithm initialization:

datastore.getKeys is called; getKeys returns the keys of labelDictionary, i.e. a collection containing all the labels.

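Next, map runs and classification begins: classifier.classifyDocument() is called, which delegates to algorithm.classifyDocument(). It first creates a result initialized to {unknown, 0} and obtains categories ("label weight"), i.e. the set of all labels. It then loops over every class and calls documentWeight for each one. documentWeight first counts how many times each word occurs in the document (frequency) and stores the counts in a Map called wordList; then, over each pair in wordList, it computes ∑ [frequency × featureWeight(datastore, label, word)]. featureWeight obtains its inputs through datastore.getWeight; its steps are analyzed below: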

① double result = datastore.getWeight, which calls getQuick on the sparse matrix to fetch the Tfidf value;

② double vocabCount = the total number of features;

③ double sumLabelWeight = the value of Sigma_k;

④ double numerator = result + 1.0;

⑤ double denominator = sumLabelWeight + vocabCount;

⑥ double weight = log(numerator/denominator), that is, log[(Tfidf + 1.0)/(Sigma_k + number of features)];

The value returned is result = -weight.
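Steps ① to ⑥ can be condensed into one small function. This is a minimal sketch, assuming the three inputs have already been looked up; `FeatureWeightSketch` is a hypothetical name, and reading the `+ 1.0` as the alpha_i = 1.0 smoothing term is an interpretation rather than something taken from Mahout's source.

```java
/** Hypothetical helper mirroring steps 1 to 6 above; not Mahout's actual code. */
final class FeatureWeightSketch {
    static double featureWeight(double tfidf, double sumLabelWeight, double vocabCount) {
        double numerator = tfidf + 1.0;                     // assumed to be the alpha_i = 1.0 smoothing
        double denominator = sumLabelWeight + vocabCount;   // Sigma_k + number of features
        double weight = Math.log(numerator / denominator);  // log of the smoothed conditional probability
        return -weight;                                     // negated log: smaller means more probable
    }

    public static void main(String[] args) {
        // Toy numbers: Tfidf = 3.0, Sigma_k = 10.0, 6 features
        System.out.println(featureWeight(3.0, 10.0, 6.0));
        // A word never seen with this label has Tfidf = 0 and gets a larger value
        System.out.println(featureWeight(0.0, 10.0, 6.0));
    }
}
```

With the toy numbers the first call evaluates -log(4/16), which equals log 4. The smoothing keeps the ratio above zero, so an unseen word yields a finite value instead of an infinite one.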

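So the value returned by documentWeight measures how likely the test document is to belong to a given class: the sum of frequency × result over all features under that class is compared with the same sum under every other class, and the document is assigned the label of the most probable class. Because each result is the negated log (-weight), the most probable class is the one with the smallest sum.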
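The counting, summing and comparing described above might look like the following. It is a minimal sketch over a toy model held in plain maps: `ClassifySketch`, its parameters and the sample data are all hypothetical, and picking the smallest score follows from the negated log rather than from a reading of Mahout's source.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy classifier illustrating documentWeight; not Mahout's actual code. */
final class ClassifySketch {
    static double featureWeight(double tfidf, double sumLabelWeight, double vocabCount) {
        return -Math.log((tfidf + 1.0) / (sumLabelWeight + vocabCount));
    }

    /** Sum of frequency x featureWeight over the distinct words of the document. */
    static double documentWeight(Map<String, Double> tfidfForLabel, double sumLabelWeight,
                                 double vocabCount, List<String> document) {
        Map<String, Integer> wordList = new HashMap<>();
        for (String word : document) {
            wordList.merge(word, 1, Integer::sum); // count occurrences of each word
        }
        double sum = 0.0;
        for (Map.Entry<String, Integer> pair : wordList.entrySet()) {
            double tfidf = tfidfForLabel.getOrDefault(pair.getKey(), 0.0);
            sum += pair.getValue() * featureWeight(tfidf, sumLabelWeight, vocabCount);
        }
        return sum;
    }

    /** Returns the label with the smallest score, or "unknown" if there are no labels. */
    static String classify(Map<String, Map<String, Double>> model, Map<String, Double> sigmaK,
                           double vocabCount, List<String> document) {
        String bestLabel = "unknown";
        double bestScore = Double.MAX_VALUE;
        for (String label : model.keySet()) {
            double score = documentWeight(model.get(label), sigmaK.get(label), vocabCount, document);
            if (score < bestScore) {
                bestScore = score;
                bestLabel = label;
            }
        }
        return bestLabel;
    }

    public static void main(String[] args) {
        // Toy model: label -> (word -> tfidf); the numbers are made up for illustration
        Map<String, Map<String, Double>> model = Map.of(
            "sports", Map.of("ball", 5.0, "goal", 3.0),
            "tech", Map.of("code", 6.0, "ball", 1.0));
        Map<String, Double> sigmaK = Map.of("sports", 8.0, "tech", 7.0);
        System.out.println(classify(model, sigmaK, 3.0, List.of("ball", "ball", "goal")));
    }
}
```

In the toy data the document "ball ball goal" scores about 2.22 under sports and about 5.52 under tech, so sports is chosen.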

Mapper output: key = _CT, correct label, classified label; value = 1.0

⑵ BayesClassifierReducer

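It simply merges the map results.

Reducer output: key = _CT, correct label, classified label; value = the number of documents with that correct label that were assigned that classified label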
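The merge step amounts to summing the 1.0 values that share a key. A minimal sketch without Hadoop follows; `ReducerSketch` is a hypothetical name, and the comma-separated key layout is an assumption made for illustration only.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for the reduce step; the key format is assumed. */
final class ReducerSketch {
    /** Sums the values of all map outputs that share the same key. */
    static Map<String, Double> reduce(List<Map.Entry<String, Double>> mapOutput) {
        Map<String, Double> totals = new TreeMap<>();
        for (Map.Entry<String, Double> record : mapOutput) {
            totals.merge(record.getKey(), record.getValue(), Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Each mapper record: "_CT,<correct label>,<classified label>" -> 1.0
        List<Map.Entry<String, Double>> mapOutput = List.of(
            Map.entry("_CT,sports,sports", 1.0),
            Map.entry("_CT,sports,tech", 1.0),
            Map.entry("_CT,sports,sports", 1.0));
        System.out.println(reduce(mapOutput));
    }
}
```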

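Looking at the sum ∑ (frequency × result) above in light of the multinomial Bayes model: frequency is the exponent that moves in front of the logarithm, i.e. log[(numerator/denominator)^frequency] = frequency × log(numerator/denominator). weight is the conditional probability, so for a whole document log[(numerator_1/denominator_1)^frequency_1 × (numerator_2/denominator_2)^frequency_2 × ...] = ∑ log[(numerator/denominator)^frequency] = ∑ frequency × log(numerator/denominator).

By Bayes' rule, the posterior probability is proportional to the prior probability × the conditional probability. As I understand it, the prior is not multiplied in here probably because the 20 Newsgroups data used has roughly the same number of documents in each class, so the priors are nearly equal and the multiplication is unnecessary (this is my own guess).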
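Written out for one class c, the relationship between the summed score and the multinomial likelihood is as follows. The symbols f_w (frequency of word w in the document), θ_{w,c} (the smoothed ratio numerator/denominator) and V (number of features) are notation introduced here for illustration.

```latex
\begin{aligned}
\theta_{w,c} &= \frac{\mathrm{Tfidf}_{w,c} + 1.0}{\Sigma_k(c) + V} \\
\mathrm{score}(c) &= \sum_{w} f_w \cdot \bigl(-\log \theta_{w,c}\bigr)
                   = -\log \prod_{w} \theta_{w,c}^{\,f_w}
\end{aligned}
```

Minimizing score(c) over the classes is therefore the same as maximizing the product of conditional probabilities, with every class given the same prior.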

⑶ Displaying the results with ConfusionMatrix

key = correct label, value = {key = classified label, value = count}
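The nested key/value layout above can be built from the reducer totals as shown below. This is a minimal sketch: `ConfusionSketch` is a hypothetical name rather than Mahout's ConfusionMatrix class, the comma-separated key layout is assumed, and the accuracy helper is added only for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical nested-map confusion matrix; not Mahout's ConfusionMatrix class. */
final class ConfusionSketch {
    /** Builds correct label -> (classified label -> count) from "_CT,correct,classified" keys. */
    static Map<String, Map<String, Integer>> build(Map<String, Double> reduced) {
        Map<String, Map<String, Integer>> matrix = new TreeMap<>();
        for (Map.Entry<String, Double> e : reduced.entrySet()) {
            String[] parts = e.getKey().split(","); // parts[1] = correct, parts[2] = classified
            int count = (int) Math.round(e.getValue());
            matrix.computeIfAbsent(parts[1], k -> new TreeMap<>())
                  .merge(parts[2], count, Integer::sum);
        }
        return matrix;
    }

    /** Fraction of documents on the diagonal, i.e. classified label equals correct label. */
    static double accuracy(Map<String, Map<String, Integer>> matrix) {
        int correct = 0;
        int total = 0;
        for (Map.Entry<String, Map<String, Integer>> row : matrix.entrySet()) {
            correct += row.getValue().getOrDefault(row.getKey(), 0);
            for (int count : row.getValue().values()) {
                total += count;
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        Map<String, Double> reduced = Map.of(
            "_CT,a,a", 3.0, "_CT,a,b", 1.0, "_CT,b,b", 4.0);
        Map<String, Map<String, Integer>> matrix = build(reduced);
        System.out.println(matrix + " accuracy=" + accuracy(matrix));
    }
}
```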
