Extractive Summarization using Continuous Vector Space Models

Kågebäck M, Mogren O, Tahmasebi N, et al. Extractive Summarization using Continuous Vector Space Models[C]// CVSC at EACL. 2014.
##Abstract

The paper uses continuous vector representations as semantically aware representations of sentences, which serve as the basis for measuring similarity in extractive summarization.
Experiments show that the framework performs well.
##Introduction
Word embedding
A summary of word embeddings on Zhihu
A brief overview of word embeddings on CSDN
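Submodular Optimization
Model optimization
The diminishing-returns property comes from the intuition that adding a sentence to a small set of sentences (i.e., the summary) contributes more than adding the same sentence to a larger set. Formally, a set function $F$ is submodular if $F(A \cup \{v\}) - F(A) \ge F(B \cup \{v\}) - F(B)$ for all $A \subseteq B$ and $v \notin B$.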
This objective function can be formulated as follows:

$$F(S) = L(S) + \lambda R(S)$$

where $S$ is the summary, $L(S)$ is the coverage of the input text, and $R(S)$ is a diversity reward function. The coefficient $\lambda$ is a trade-off parameter that sets the importance of coverage versus diversity in the summary. Finding the summary that maximizes this objective is NP-hard.
However, if the objective function is monotone submodular, there is a fast, scalable greedy algorithm that returns an approximation with a guarantee.
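
The fast approximation algorithm mentioned above is typically a greedy procedure: repeatedly add the sentence with the largest marginal gain. The following is a minimal sketch under simplifying assumptions, not the paper's implementation: the budget is a fixed number of sentences, and the toy objective `coverage` (number of distinct words covered) stands in for $L(S) + \lambda R(S)$. The names `greedy_select`, `coverage`, and the sample sentences are hypothetical.

```python
def greedy_select(candidates, objective, budget):
    """Greedily add the candidate with the largest marginal gain."""
    summary = []
    remaining = list(candidates)
    while remaining and len(summary) < budget:
        base = objective(summary)
        # Marginal gain of adding each remaining candidate to the summary.
        best_gain, best = max(
            ((objective(summary + [c]) - base, c) for c in remaining),
            key=lambda pair: pair[0],
        )
        if best_gain <= 0:
            break  # nothing left that improves the objective
        summary.append(best)
        remaining.remove(best)
    return summary


def coverage(summary):
    """Toy monotone submodular objective (assumption): distinct words covered."""
    return len({w for s in summary for w in s.lower().split()})


sentences = [
    "the cat sat on the mat",
    "the cat sat",
    "dogs chase cats in the park",
]
selected = greedy_select(sentences, coverage, budget=2)
```

The toy objective also shows diminishing returns: "the cat sat" adds three new words to an empty summary, but only two once "dogs chase cats in the park" is already selected.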
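The weights $Sim(i, j)$ used in the $L$ function are tf-idf cosine similarities:

$$Sim(i,j) = \frac{\sum_{w \in i} tf_{w,i} \cdot tf_{w,j} \cdot idf_w^2}{\sqrt{\sum_{w \in i} tf_{w,i}^2 \, idf_w^2} \; \sqrt{\sum_{w \in j} tf_{w,j}^2 \, idf_w^2}}$$

where $tf_{w,i}$ and $tf_{w,j}$ are the number of occurrences of $w$ in sentences $i$ and $j$, and $idf_w$ is the inverse document frequency (idf) of $w$.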
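Sentence similarity is therefore computed from overlapping words with high tf-idf weight, so a pair such as the following is treated as having no similarity at all:
“The US President” and “Barack Obama”
To address this, the paper investigates the use of continuous vector representations for measuring similarity between sentences.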
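
The weakness of word-overlap similarity can be seen in a small sketch. The code below assumes the tf-idf weighted cosine form of $Sim(i, j)$ and, for illustration only, an idf of 1.0 for every word; `tfidf_similarity` and the whitespace tokenization are hypothetical simplifications.

```python
import math
from collections import Counter


def tfidf_similarity(sent_i, sent_j, idf):
    """Cosine similarity between tf-idf weighted bag-of-words vectors."""
    tf_i = Counter(sent_i.lower().split())
    tf_j = Counter(sent_j.lower().split())
    # Only words that occur in both sentences contribute to the numerator.
    numerator = sum(
        tf_i[w] * tf_j[w] * idf.get(w, 1.0) ** 2 for w in tf_i if w in tf_j
    )
    norm_i = math.sqrt(sum((tf * idf.get(w, 1.0)) ** 2 for w, tf in tf_i.items()))
    norm_j = math.sqrt(sum((tf * idf.get(w, 1.0)) ** 2 for w, tf in tf_j.items()))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return numerator / (norm_i * norm_j)


idf = {}  # assumption: every word falls back to idf = 1.0
no_overlap = tfidf_similarity("The US President", "Barack Obama", idf)
identical = tfidf_similarity("The US President", "the US president", idf)
```

Because the two phrases share no words, `no_overlap` is zero even though they refer to the same person, which is the gap that embedding-based similarity is meant to close.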
##Background on Deep Learning
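Feed Forward Neural Network (FFNN)

An FFNN with four input neurons, one hidden layer, and one output neuron. This architecture is suitable for classifying data $x \in \mathbb{R}^4$ (four-dimensional space); depending on the complexity of the input, the number and size of the hidden layers should be scaled accordingly.

Neurons are organized in layers, and connections are only allowed to subsequent layers. The model is similar to logistic regression with non-linear terms.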
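
A forward pass through such a network can be sketched as follows. This is an illustrative example only: the hidden size of 3, the tanh and sigmoid activations, and the random weights are assumptions, and `dense` and `ffnn_forward` are hypothetical helper names.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(weights, bias, inputs, activation):
    """One fully connected layer: activation(W x + b), W given as a list of rows."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, bias)
    ]


def ffnn_forward(x, params):
    """Forward pass: 4 inputs -> one hidden layer -> 1 output neuron."""
    hidden = dense(params["W1"], params["b1"], x, math.tanh)
    return dense(params["W2"], params["b2"], hidden, sigmoid)[0]


random.seed(0)
hidden_size = 3  # assumption: arbitrary hidden-layer size for illustration
params = {
    "W1": [[random.uniform(-1, 1) for _ in range(4)] for _ in range(hidden_size)],
    "b1": [0.0] * hidden_size,
    "W2": [[random.uniform(-1, 1) for _ in range(hidden_size)]],
    "b2": [0.0],
}
probability = ffnn_forward([0.5, -1.2, 3.0, 0.1], params)
```

The sigmoid output neuron is what makes the last layer look like logistic regression applied to the non-linear hidden features.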

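Linear regression, logistic regression, and related topics
An introduction to regression problems (detailed)
A brief discussion of linear regression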
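An auto-encoder (AE) is a type of FFNN with a topology designed for dimensionality reduction.

The figure shows an auto-encoder that compresses four-dimensional data into a two-dimensional code. This is achieved with a bottleneck layer, referred to as the coding layer.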

The input and output layers of an AE are identical in size, since the network is trained to reconstruct its input at the output.
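
The 4-to-2-to-4 shape of such an auto-encoder can be sketched as below. The weights are random and untrained, tanh is an assumed activation, and `layer`, `autoencode`, and `reconstruction_error` are hypothetical names; training would adjust the weights to minimize the reconstruction error.

```python
import math
import random


def layer(weights, bias, inputs, activation):
    """One fully connected layer, with the weight matrix given as a list of rows."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, bias)
    ]


def autoencode(x, enc_w, enc_b, dec_w, dec_b):
    """Compress x into a 2-dimensional code, then reconstruct it."""
    code = layer(enc_w, enc_b, x, math.tanh)  # 4 -> 2 (coding layer)
    reconstruction = layer(dec_w, dec_b, code, math.tanh)  # 2 -> 4
    return code, reconstruction


def reconstruction_error(x, reconstruction):
    """Squared error between the input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, reconstruction))


random.seed(0)
enc_w = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
dec_w = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(4)]
x = [0.2, -0.4, 0.9, 0.1]
code, x_hat = autoencode(x, enc_w, [0.0, 0.0], dec_w, [0.0] * 4)
error = reconstruction_error(x, x_hat)
```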

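Recursive Neural Network
An RNN (here a recursive, not a recurrent, network) is a type of feed-forward neural network that can process data through an arbitrary binary tree structure.

The recursive structure makes variable-length input possible. By using the same dimensionality for all layers, arbitrary binary tree structures can be processed recursively.
The input data is placed in the leaf nodes of the tree, and the structure of the tree guides the recursion up to the root node. A compressed representation is computed recursively at each non-terminal node of the tree, using the same weight matrix at every node. More precisely, the following formula can be used: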
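
A common form of this computation, assumed here rather than taken from the paper, is $p = \tanh(W[x_l; x_r] + b)$, where $[x_l; x_r]$ is the concatenation of the two child vectors. The sketch below applies it over a tree given as nested tuples; `compose`, `encode_tree`, and the toy embeddings are hypothetical.

```python
import math
import random


def compose(left, right, weights, bias):
    """Parent vector p = tanh(W [left; right] + b), same size as each child."""
    children = left + right  # concatenation of the two child vectors
    return [
        math.tanh(sum(w * c for w, c in zip(row, children)) + b)
        for row, b in zip(weights, bias)
    ]


def encode_tree(tree, embeddings, weights, bias):
    """Recursively encode a binary tree given as nested tuples of words."""
    if isinstance(tree, str):
        return embeddings[tree]  # leaf node: look up the input vector
    left, right = tree
    return compose(
        encode_tree(left, embeddings, weights, bias),
        encode_tree(right, embeddings, weights, bias),
        weights,
        bias,
    )


random.seed(0)
dim = 3
embeddings = {
    w: [random.uniform(-1, 1) for _ in range(dim)] for w in ["the", "cat", "sat"]
}
# The same weight matrix (dim x 2*dim) is reused at every non-terminal node.
weights = [[random.uniform(-1, 1) for _ in range(2 * dim)] for _ in range(dim)]
bias = [0.0] * dim
root = encode_tree((("the", "cat"), "sat"), embeddings, weights, bias)
```

Because every node outputs a vector of the same size as its children, the root vector has a fixed dimensionality regardless of sentence length.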
##Word Embeddings
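Continuous distributed vector representations of words are also referred to as word embeddings.
A word embedding is a continuous vector representation that captures semantic and syntactic information about a word, and it can be used to reveal similarities between words.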
Methods for computing word embeddings:

Collobert & Weston (CW) vectors
Continuous Skip-gram, as implemented in Word2Vec

##Phrase Embeddings
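The simplest phrase embedding is the sum of the embeddings of all words in the sentence:

$$x_p = \sum_{x_w \in \{\text{sentence}\}} x_w$$

where $x_p$ is a phrase embedding and $x_w$ is a word embedding. The paper uses this method for computing phrase embeddings as a baseline in its experiments.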
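
A minimal sketch of this additive baseline is shown below. The two-dimensional word vectors are made-up values, and skipping out-of-vocabulary words is an assumption of this example; `phrase_embedding` is a hypothetical helper name.

```python
def phrase_embedding(sentence, word_vectors):
    """Sum the embeddings of all known words in the sentence (order is ignored)."""
    vectors = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not vectors:
        raise ValueError("no known words in sentence")
    # Component-wise sum over all word vectors.
    return [sum(components) for components in zip(*vectors)]


# Toy 2-dimensional word vectors (hypothetical values, for illustration only).
word_vectors = {"the": [0.1, 0.0], "cat": [0.7, 0.2], "sat": [0.2, 0.9]}
x_p = phrase_embedding("The cat sat", word_vectors)
```

Since addition is commutative, "the cat sat" and "sat the cat" receive the same embedding, which is one reason the paper treats this method only as a baseline.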
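Unfolding Recursive Auto-encoder

Structure of an unfolding RAE on a three-word phrase ($[x_1, x_2, x_3]$). The weight matrix $\theta_e$ is used to encode the compressed representations, while $\theta_d$ is used to decode the representations and reconstruct the sentence.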
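
The encode-then-unfold idea can be sketched for a three-word phrase as follows. Several details are assumptions of this example and are not confirmed by the notes above: the tree is left-branching, biases are omitted, tanh is the activation, and the weights are random and untrained. `affine_tanh` and `unfolding_rae_error` are hypothetical names.

```python
import math
import random


def affine_tanh(weights, inputs):
    """tanh(W x), with W given as a list of rows (biases omitted for brevity)."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def unfolding_rae_error(x1, x2, x3, theta_e, theta_d):
    """Encode [x1, x2, x3] to a root vector, unfold it, return (root, error)."""
    dim = len(x1)
    # Encoding: theta_e maps two child vectors (2*dim) to one parent (dim).
    y1 = affine_tanh(theta_e, x1 + x2)
    root = affine_tanh(theta_e, y1 + x3)
    # Decoding: theta_d maps a parent (dim) back to two children (2*dim),
    # and is applied all the way down to the leaves.
    decoded = affine_tanh(theta_d, root)
    y1_hat, x3_hat = decoded[:dim], decoded[dim:]
    decoded = affine_tanh(theta_d, y1_hat)
    x1_hat, x2_hat = decoded[:dim], decoded[dim:]
    originals = x1 + x2 + x3
    reconstructed = x1_hat + x2_hat + x3_hat
    error = sum((a - b) ** 2 for a, b in zip(originals, reconstructed))
    return root, error


random.seed(0)
dim = 2
theta_e = [[random.uniform(-1, 1) for _ in range(2 * dim)] for _ in range(dim)]
theta_d = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(2 * dim)]
root, error = unfolding_rae_error([0.1, 0.3], [-0.2, 0.5], [0.4, -0.1], theta_e, theta_d)
```

Training would minimize `error` over many phrases; the root vector then serves as the phrase embedding.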
##Measuring Similarity
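Phrase embeddings provide semantically aware representations of sentences. For summarization, we need to measure the similarity between two such representations, and the following two vector similarity measures are used. The first is cosine similarity, transformed to the interval $[0, 1]$:

$$Sim(i,j) = \left( \frac{x_i^T x_j}{\|x_i\| \, \|x_j\|} + 1 \right) / 2$$

where $x$ denotes a phrase embedding. The second similarity is based on the complement of the Euclidean distance, normalized by the largest distance between any two phrase embeddings, and is computed as:

$$Sim(i,j) = 1 - \frac{\|x_j - x_i\|}{\max_{k,n} \|x_k - x_n\|}$$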
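
Both measures can be sketched in a few lines. The example assumes the Euclidean distance is normalized by the largest pairwise distance among all sentence vectors so that the complement stays in $[0, 1]$; the function names and the toy vectors are hypothetical.

```python
import math


def cosine_similarity_01(x_i, x_j):
    """Cosine similarity rescaled from [-1, 1] to [0, 1]."""
    dot = sum(a * b for a, b in zip(x_i, x_j))
    norm = math.sqrt(sum(a * a for a in x_i)) * math.sqrt(sum(b * b for b in x_j))
    return (dot / norm + 1.0) / 2.0


def euclidean_similarity(x_i, x_j, all_vectors):
    """One minus the Euclidean distance, normalized by the largest pairwise distance."""
    max_dist = max(math.dist(a, b) for a in all_vectors for b in all_vectors)
    if max_dist == 0:
        return 1.0  # all vectors coincide
    return 1.0 - math.dist(x_i, x_j) / max_dist


vectors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
cos_orthogonal = cosine_similarity_01(vectors[0], vectors[1])
cos_opposite = cosine_similarity_01(vectors[0], vectors[2])
euc_farthest = euclidean_similarity(vectors[0], vectors[2], vectors)
euc_same = euclidean_similarity(vectors[0], vectors[0], vectors)
```

In this toy set, orthogonal vectors get a cosine-based similarity of 0.5 and opposite vectors get 0, while the two most distant vectors get a Euclidean-based similarity of 0.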

##Conclusion
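The results of the paper show great potential for applying word and phrase embeddings to summarization. The authors believe that by using embeddings we move towards more semantically aware summarization systems.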
