LDA + SVM Text Classification

Some Notes on Understanding LDA

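For each document in the corpus, LDA defines the following generative process:

For each word position in the document, draw a topic from the document's topic distribution;
draw a word from the word distribution of the topic that was just drawn;
repeat until every word in the document has been generated.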
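The generative process can be sketched in a few lines of NumPy. This is only an illustration: the vocabulary, the two topics, and the numbers in `phi` and `theta_d` are made up, and the real model also draws `theta_d` and `phi` from Dirichlet priors, which is skipped here.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["ball", "game", "vote", "party"]      # toy vocabulary (assumed)
# phi[k] is the word distribution of topic k; each row sums to 1
phi = np.array([[0.50, 0.40, 0.05, 0.05],     # a "sports"-like topic
                [0.05, 0.05, 0.50, 0.40]])    # a "politics"-like topic
theta_d = np.array([0.7, 0.3])                 # topic distribution of one document

def generate_document(n_words):
    """Generate a document word by word, returning the words and their topics."""
    words, topics = [], []
    for _ in range(n_words):
        z = rng.choice(len(theta_d), p=theta_d)   # draw a topic for this position
        w = rng.choice(len(vocab), p=phi[z])      # draw a word from that topic
        topics.append(int(z))
        words.append(vocab[w])
    return words, topics

words, topics = generate_document(8)
print(list(zip(words, topics)))
```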

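$P(word|document) = \sum_{topic} P(word|topic)\times P(topic|document)$

The probability that word appears in document equals, summed over all topics, the probability that word appears in topic $\times$ the probability that topic appears in document.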
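Because a word can come from any topic, the document-level word probability sums the product over all topics, which is just a matrix product. A small worked example with made-up numbers (two documents, two topics, four words):

```python
import numpy as np

# P(topic | document): one row per document, one column per topic
theta = np.array([[0.7, 0.3],
                  [0.1, 0.9]])
# P(word | topic): one row per topic, one column per word
phi = np.array([[0.50, 0.40, 0.05, 0.05],
                [0.05, 0.05, 0.50, 0.40]])

# Summing P(word|topic) * P(topic|document) over topics is a matrix product
p_word_given_doc = theta @ phi
print(p_word_given_doc)
print(p_word_given_doc.sum(axis=1))  # each row is a distribution over words
```

For the first document and the first word this gives 0.7 × 0.5 + 0.3 × 0.05 = 0.365.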

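With the topic as an intermediate layer, $P(word|document)$ can be computed from the current $\theta_d$ and $\varphi_t$: $P(topic|document)$ comes from $\theta_d$, and $P(word|topic)$ comes from $\varphi_t$.
In practice, using the current $\theta_d$ and $\varphi_t$, we can compute $P(word|topic) \times P(topic|document)$ for a given word in a document under every candidate topic, and then use these values to update the topic assigned to that word. If the update changes the topic of the word, it in turn changes $\theta_d$ and $\varphi_t$.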
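The topic update for a single word can be sketched as follows. This is a simplified illustration with made-up numbers, not a full sampler: a real collapsed Gibbs sampler works from count tables and the Dirichlet priors, and scikit-learn uses variational Bayes instead of Gibbs sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

theta_d = np.array([0.7, 0.3])                # current P(topic | document d)
phi = np.array([[0.50, 0.40, 0.05, 0.05],
                [0.05, 0.05, 0.50, 0.40]])    # current P(word | topic)

w = 2                                         # index of the word being updated
scores = theta_d * phi[:, w]                  # P(topic|d) * P(w|topic) for every topic
posterior = scores / scores.sum()             # normalize into a distribution over topics
new_topic = rng.choice(len(theta_d), p=posterior)  # resample the topic of this word

print(posterior, new_topic)
```

Although topic 0 dominates the document, word 2 is far more likely under topic 1, so most of the posterior mass moves to topic 1.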

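Background for the formulas:

One function: the Gamma function
- $\Gamma(x) = \int^\infty_0 t^{x-1}e^{-t}dt$
- The Gamma function extends the factorial to the real numbers: $\Gamma(n) = (n-1)!$ for positive integers $n$
Four distributions: binomial, multinomial, Beta, and Dirichlet
One concept and one idea: conjugate priors and the Bayesian framework
- Bayesian framework: posterior probability $\propto$ likelihood function $\times$ prior probability
- Conjugate prior: a prior for which the posterior stays in the same family of distributions (Beta for the binomial, Dirichlet for the multinomial)
Two models: pLSA and LDA
One sampling method: Gibbs sampling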
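Two of the items above are easy to check numerically. The sketch below uses only the standard library: it confirms that the Gamma function matches the factorial on positive integers, and shows the Beta-binomial conjugate update, where the posterior is again a Beta distribution. The helper `beta_posterior` and the prior and counts are made up for illustration.

```python
import math

# Gamma(n) equals (n - 1)! for positive integers n
for n in range(1, 6):
    assert math.isclose(math.gamma(n), math.factorial(n - 1))

def beta_posterior(alpha, beta, successes, failures):
    """Conjugate update: Beta(alpha, beta) prior + binomial data -> Beta posterior."""
    return alpha + successes, beta + failures

# Beta(2, 2) prior, then 7 successes and 3 failures are observed
post_alpha, post_beta = beta_posterior(2.0, 2.0, successes=7, failures=3)
posterior_mean = post_alpha / (post_alpha + post_beta)
print(post_alpha, post_beta, posterior_mean)
```

The Dirichlet-multinomial pair used by LDA follows the same pattern, with one pseudo-count per category instead of two.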

sklearn.decomposition.LatentDirichletAllocation

Methods and parameters of the sklearn LDA implementation:

class sklearn.decomposition.LatentDirichletAllocation(n_components=10, doc_topic_prior=None, topic_word_prior=None, learning_method=None, learning_decay=0.7, learning_offset=10.0, max_iter=10, batch_size=128, evaluate_every=-1, total_samples=1000000.0, perp_tol=0.1, mean_change_tol=0.001, max_doc_update_iter=100, n_jobs=1, verbose=0, random_state=None, n_topics=None)

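n_components: int, optional (default=10)

Number of topics. In older versions this parameter was named n_topics.

doc_topic_prior: float, optional (default=None)

Prior of the document-topic distribution theta. If None, it defaults to 1 / n_components. In the literature this is called alpha.

topic_word_prior: float, optional (default=None)

Prior of the topic-word distribution beta. If None, it defaults to 1 / n_components. In the literature this is called eta.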
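To get a feel for what these priors do, the sketch below draws topic distributions from a symmetric Dirichlet with a small and a large concentration value. The values 0.1 and 10.0 and the number of topics are arbitrary choices for illustration; a small prior tends to give documents dominated by a few topics, while a large one gives nearly uniform mixtures.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics = 5  # illustrative number of topics

# Each row is one sampled document-topic distribution
sparse = rng.dirichlet([0.1] * n_topics, size=1000)   # small prior
dense = rng.dirichlet([10.0] * n_topics, size=1000)   # large prior

# Average weight of the most probable topic in each sample
print(sparse.max(axis=1).mean())
print(dense.max(axis=1).mean())
```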

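learning_method: 'batch' | 'online', default='online'

Method used to update components_. Only used in the fit method. In general, if the data set is large, the online update is much faster than the batch update. The default learning method changes to 'batch' in version 0.20. Valid options:
'batch': batch variational Bayes method. Every EM update uses all of the training data.
    The old components_ is overwritten in each iteration.
'online': online variational Bayes method. Every EM update uses a
    mini-batch of training data to update components_ incrementally; the learning rate is controlled by the learning_decay and learning_offset parameters.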

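learning_decay: float, optional (default=0.7)

A parameter that controls the learning rate in the online learning method. The value should be set in (0.5, 1.0] to guarantee asymptotic convergence. When the value is 0.0 and batch_size equals n_samples, the update is the same as batch learning. In the literature this is called kappa.

learning_offset: float, optional (default=10.)

A (positive) parameter that downweights early iterations in online learning. It should be greater than 1.0. In the literature this is called tau_0.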
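The way kappa and tau_0 interact is easiest to see from the step-size schedule used in the online variational Bayes literature, $\rho_t = (\tau_0 + t)^{-\kappa}$. The sketch below assumes that form; the function name `learning_rate` is made up here, and it is my understanding, not something confirmed by this post, that scikit-learn follows the same schedule.

```python
def learning_rate(t, learning_offset=10.0, learning_decay=0.7):
    """Weight given to mini-batch number t: (tau_0 + t) ** (-kappa)."""
    return (learning_offset + t) ** (-learning_decay)

rates = [learning_rate(t) for t in range(5)]
print(rates)

# A larger offset shrinks the weight of the earliest mini-batches
print(learning_rate(0, learning_offset=10.0), learning_rate(0, learning_offset=100.0))

# With decay 0.0 every update has weight 1, i.e. the old estimate is fully replaced
print(learning_rate(3, learning_decay=0.0))
```

The last line matches the remark that a decay of 0.0 with batch_size equal to n_samples reduces to batch learning.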

max_iter: int, optional (default=10)

Maximum number of iterations.

batch_size: int, optional (default=128)

Number of documents used in each EM iteration. Only used in online learning.

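evaluate_every: int, optional (default=-1)

How often to evaluate perplexity. Only used in the fit method. Set it to 0 or a negative number to skip perplexity evaluation during training entirely. Evaluating perplexity helps you check convergence during training, but it also increases the total training time. Evaluating perplexity in every iteration can increase training time by up to a factor of two.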

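total_samples: int, optional (default=1e6)

Total number of documents. Only used in the partial_fit method.

perp_tol: float, optional (default=1e-1)

Perplexity tolerance in batch learning. Only used when evaluate_every is greater than 0.

mean_change_tol: float, optional (default=1e-3)

Stopping tolerance for updating the document-topic distribution in the E-step.

max_doc_update_iter: int (default=100)

Maximum number of iterations for updating the document-topic distribution in the E-step.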

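n_jobs: int, optional (default=1)

Number of jobs to use in the E-step. If -1, all CPUs are used. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used.

verbose: int, optional (default=0)

Verbosity level.

random_state: int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.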

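Methods

`fit`(X[, y])	Learn a model for the data X with the variational Bayes method.
`fit_transform`(X[, y])	Fit to the data, then transform it.
`get_params`([deep])	Get the parameters of this estimator.
`partial_fit`(X[, y])	Online VB with mini-batch updates.
`perplexity`(X[, doc_topic_distr, sub_sampling])	Compute the approximate perplexity of the data X.
`score`(X[, y])	Compute the approximate log-likelihood as the score.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(X)	Transform the data X according to the fitted model.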

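fit(X, y=None)

Learn a model for the data X with the variational Bayes method. When learning_method is 'online', mini-batch updates are used; otherwise batch updates are used.
Parameters: X: array-like or sparse matrix, shape=(n_samples, n_features), the document-word matrix. y: ignored.
Returns: self.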
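fit_transform(X, y=None, **fit_params)

Fit to the data, then transform it. Fits the transformer to X and y with the optional parameters fit_params and returns a transformed version of X.
Parameters: X: numpy array of shape [n_samples, n_features], the training set. y: numpy array of shape [n_samples], the target values.
Returns: X_new: numpy array of shape [n_samples, n_features_new], the transformed array.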
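get_params(deep=True)

Get the parameters of this estimator.
Parameters: deep: boolean, optional. If True, also returns the parameters of contained sub-objects that are estimators.
Returns: params: a mapping from parameter names (strings) to their values.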
partial_fit(X, y=None)

Online VB with mini-batch updates.
Parameters: X: array-like or sparse matrix, shape=(n_samples, n_features), the document-word matrix. y: ignored.
Returns: self.
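A minimal sketch of feeding mini-batches to `partial_fit` by hand. The count matrix here is random toy data standing in for a real document-word matrix, and the batch size of 20 is arbitrary.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(60, 12))   # toy document-word counts (assumed data)

lda = LatentDirichletAllocation(n_components=3,
                                total_samples=X.shape[0],  # total documents the stream will contain
                                random_state=0)

# Feed the corpus in mini-batches of 20 documents
for start in range(0, X.shape[0], 20):
    lda.partial_fit(X[start:start + 20])

print(lda.components_.shape)  # one row per topic, one column per term
```

This pattern is mainly useful when the corpus does not fit in memory or arrives as a stream.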
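perplexity(X, doc_topic_distr='deprecated', sub_sampling=False)

Compute the approximate perplexity of the data X. Perplexity is defined as exp(-1. * log-likelihood per word).
Changed in version 0.19: the doc_topic_distr argument is deprecated and ignored, because users no longer have access to the unnormalized distribution.
Parameters: X: array-like or sparse matrix, [n_samples, n_features], the document-word matrix. doc_topic_distr: None or array, shape=(n_samples, n_components), the document-topic distribution; deprecated since version 0.19 and currently ignored. sub_sampling: bool, whether to do sub-sampling.
Returns: score: float, the perplexity score.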
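The definition can be checked against `score`, which returns the approximate log-likelihood bound for the whole matrix. The sketch below uses random toy counts and assumes the two methods share the same bound, which is how I read the definition above.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(30, 12))   # toy document-word counts (assumed data)

lda = LatentDirichletAllocation(n_components=3, learning_method="batch",
                                random_state=0).fit(X)

log_likelihood = lda.score(X)            # approximate log-likelihood of all of X
n_words = X.sum()                        # total number of word occurrences
manual_perplexity = np.exp(-log_likelihood / n_words)

print(manual_perplexity, lda.perplexity(X))
```

Lower perplexity means the model assigns higher probability to the observed words.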
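score(X, y=None)

Compute the approximate log-likelihood as the score.
Parameters: X: array-like or sparse matrix, shape=(n_samples, n_features), the document-word matrix. y: ignored.
Returns: score: float, the approximate bound used as the score.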
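set_params(**params)

Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter>, so that each component of a nested object can be updated.
Returns: self.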
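For example, the `<component>__<parameter>` form lets you change the LDA and SVC settings of a pipeline in one call. The step names "lda" and "svc" are chosen here for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([("lda", LatentDirichletAllocation()),
                 ("svc", SVC())])

# Step name, two underscores, then the parameter name of that step
pipe.set_params(lda__n_components=5, svc__kernel="linear")

print(pipe.get_params()["lda__n_components"])
print(pipe.named_steps["svc"].kernel)
```

The same naming scheme is what grid search uses to address parameters inside a pipeline.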
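transform(X)

Transform the data X according to the fitted model.
Changed in version 0.18: doc_topic_distr is now normalized.
Parameters: X: array-like or sparse matrix, shape=(n_samples, n_features), the document-word matrix.
Returns: doc_topic_distr: shape=(n_samples, n_components), the document-topic distribution of X.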
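Because the output is normalized, every row returned by `transform` is a probability distribution over topics. A quick check on random toy counts:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(20, 12))   # toy document-word counts (assumed data)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topic = lda.fit(X).transform(X)

print(doc_topic.shape)          # (n_samples, n_components)
print(doc_topic.sum(axis=1))    # every row sums to 1
```

These rows are the low-dimensional features that get passed to a classifier in the examples that follow.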

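Example 1: Multi-class text classification with LDA + SVM

I could not find any code online that combines LDA with SVM, so the following is my own implementation. I am not sure it is correct, so treat it as a reference only. Because the competition is still running, I am not sharing the complete code for now.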

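Step 1. Feature extraction with CountVectorizer

The training data for an LDA model is not the raw documents but a document-word matrix, which can be an array or a sparse matrix of shape n_samples * n_features, where n_features is the number of terms. So before training the LDA topic model, the term counts have to be computed and stored with CountVectorizer.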

from sklearn.feature_extraction.text import CountVectorizer

# Build the count matrix over the whole corpus to fix the vocabulary
count_v0 = CountVectorizer()
counts_all = count_v0.fit_transform(all_text)  # all_text is the training + test corpus

# Build the training-set count matrix with the shared vocabulary
count_v1 = CountVectorizer(vocabulary=count_v0.vocabulary_)
counts_train = count_v1.fit_transform(train_texts)

# Build the test-set count matrix
# count_v2 = CountVectorizer(vocabulary=count_v0.vocabulary_)
# counts_test = count_v2.fit_transform(test_texts)
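To make the document-word matrix concrete, here is what CountVectorizer produces on a three-sentence toy corpus (the sentences are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat", "the cat saw the dog"]

vec = CountVectorizer()
counts = vec.fit_transform(docs)       # sparse matrix, shape (n_samples, n_features)

print(sorted(vec.vocabulary_))         # terms, in column order
print(counts.toarray())                # one row per document, one column per term
```

The columns are ordered alphabetically (cat, dog, sat, saw, the), so the last document becomes the row [1, 1, 0, 1, 2].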

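Step 2. Build topic features with LDA (I am not sure what to call this step; it still feels like feature engineering)

The core code has three steps: build the model, fit it to the data (fit), and transform the data with the fitted model (transform).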

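from sklearn.decomposition import LatentDirichletAllocation
n_component = 10  # number of topics; 10 is only a placeholder, tune it for your data
lda = LatentDirichletAllocation(n_components=n_component, max_iter=50, learning_method='batch')
X_train = lda.fit(counts_train).transform(counts_train)
# Reuse the model fitted on the training set; refitting on the test set would give incompatible topics
# X_test = lda.transform(counts_test)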

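Step 3. Train a classifier (SVC as an example)

Feed the document-topic features produced by LDA into the classifier.

from sklearn.svm import SVC

svclf = SVC(kernel='linear')
svclf.fit(X_train, y_train)
preds = svclf.predict(X_test)
# ...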
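The three steps can also be chained in a scikit-learn pipeline. The sketch below is a variant, not the code used in the competition: the texts and labels are made up, and the vocabulary is learned from the training texts only instead of from the combined training and test corpus.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy data (assumed), standing in for train_texts / y_train / test_texts
train_texts = ["the team won the game", "the player scored a goal",
               "the senate passed the bill", "the president signed the law",
               "fans cheered the team", "voters elected the president"]
y_train = ["sports", "sports", "politics", "politics", "sports", "politics"]
test_texts = ["the team scored", "the senate voted"]

clf = make_pipeline(
    CountVectorizer(),                                   # Step 1: term counts
    LatentDirichletAllocation(n_components=2,            # Step 2: topic features
                              learning_method="batch",
                              random_state=0),
    SVC(kernel="linear"),                                # Step 3: classifier
)
clf.fit(train_texts, y_train)
preds = clf.predict(test_texts)   # the test set is only transformed, never refitted
print(preds)
```

A pipeline guarantees that the vectorizer and the LDA model fitted on the training data are reused unchanged at prediction time.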

Example 2: Multi-class text classification with LDA

# Load the data, using the fetch_20newsgroups dataset that ships with sklearn
from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
n_samples = 200
data_samples = dataset.data[:n_samples]  # keep only the first n_samples documents

# Count term frequencies with CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import joblib  # sklearn.externals.joblib was removed from newer scikit-learn; pickle also works
n_features = 2500
# Build the count vectorizer and save it; only needed on the first run
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(data_samples)
joblib.dump(tf_vectorizer, 'tf_Model.pkl', compress=3)

#==============================================================================
# Load the saved tf_vectorizer to skip the preprocessing step
# import joblib
# tf_vectorizer = joblib.load('tf_Model.pkl')
# tf = tf_vectorizer.transform(data_samples)
#==============================================================================

from sklearn.decomposition import LatentDirichletAllocation
n_topic = 10
lda = LatentDirichletAllocation(n_components=n_topic,  # named n_topics in old versions
                                max_iter=50,
                                learning_method='batch')
lda.fit(tf)  # tf is the document-word sparse matrix

def print_top_words(model, feature_names, n_top_words):
    # Print the highest-weighted terms of each topic
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))
    # Print the topic-word matrix
    print(model.components_)

n_top_words = 20
tf_feature_names = tf_vectorizer.get_feature_names_out()  # get_feature_names() before scikit-learn 1.0
print_top_words(lda, tf_feature_names, n_top_words)

# Print the trained topic model
for idx, topic in enumerate(lda.components_, start=1):
    print('Topic #%d' % idx)
    print("/".join([tf_feature_names[i] for i in topic.argsort()[:-11:-1]]))  # top 10 terms of the topic-word vector

print(lda.transform(tf)[0])  # document-topic vector of the first document
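The slice `topic.argsort()[:-n_top_words - 1:-1]` used above is compact but easy to misread, so here it is on a small made-up topic row, together with `argmax` for picking the dominant topic of a document-topic vector. All numbers and names below are illustrative.

```python
import numpy as np

feature_names = np.array(["ball", "game", "vote", "party", "team"])
topic = np.array([3.0, 1.5, 0.1, 0.2, 2.0])   # one row of components_ (made-up weights)

n_top_words = 3
# argsort sorts ascending; the slice walks backwards from the end and
# stops after n_top_words entries, giving the largest weights first
top_idx = topic.argsort()[:-n_top_words - 1:-1]
print(feature_names[top_idx])                  # terms ordered by decreasing weight

doc_topic = np.array([0.1, 0.7, 0.2])          # one row of lda.transform(...) (made up)
dominant_topic = int(doc_topic.argmax())       # index of the most probable topic
print(dominant_topic)
```

Here the top three terms are ball, team, and game, and the dominant topic is topic 1.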