Efficient Estimation of Word Representations in Vector Space (Translation)

We propose two novel model architectures for computing continuous vector representations
of words from very large data sets. The quality of these representations
is measured in a word similarity task, and the results are compared to the previously
best performing techniques based on different types of neural networks. We
observe large improvements in accuracy at much lower computational cost, i.e. it
takes less than a day to learn high quality word vectors from a 1.6 billion words
data set. Furthermore, we show that these vectors provide state-of-the-art performance
on our test set for measuring syntactic and semantic word similarities.

We propose two new model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is evaluated on a word similarity task, and the results are compared against the previously best-performing techniques based on different types of neural networks. The comparison shows large improvements in accuracy at much lower computational cost: for example, high-quality word vectors can be learned from a 1.6-billion-word data set in less than a day. These vectors also achieve state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.

1 Introduction
Many current NLP systems and techniques treat words as atomic units - there is no notion of similarity
between words, as these are represented as indices in a vocabulary. This choice has several good
reasons - simplicity, robustness and the observation that simple models trained on huge amounts of
data outperform complex systems trained on less data. An example is the popular N-gram model
used for statistical language modeling - today, it is possible to train N-grams on virtually all available
data (trillions of words [3]).
Many current NLP systems and techniques treat words as atomic units: there is no notion of similarity between words, since they are represented as indices in a vocabulary. This choice has several advantages: simplicity, robustness, and the observation that simple models trained on huge amounts of data outperform complex systems trained on less data. A popular example is the N-gram model used for statistical language modeling; today it is possible to train N-grams on virtually all available data (trillions of words [3]).

However, the simple techniques are at their limits in many tasks. For example, the amount of relevant in-domain data for automatic speech recognition is limited - the performance is usually dominated by the size of high quality transcribed speech data (often just millions of words). In machine translation, the existing corpora for many languages contain only a few billions of words or less. Thus, there are situations where simple scaling up of the basic techniques will not result in any significant progress, and we have to focus on more advanced techniques.
However, these simple techniques are at their limits in many tasks. For example, the amount of relevant in-domain data for automatic speech recognition is limited; performance is usually dominated by the size of the high-quality transcribed speech data (often just millions of words). In machine translation, the existing corpora for many languages contain only a few billion words or less. Thus, there are situations where simply scaling up the basic techniques will not bring any significant progress, and we have to focus on more advanced techniques.
With progress of machine learning techniques in recent years, it has become possible to train more complex models on much larger data set, and they typically outperform the simple models. Probably the most successful concept is to use distributed representations of words [10]. For example, neural network based language models significantly outperform N-gram models [1, 27, 17].

With the progress of machine learning techniques in recent years, it has become possible to train more complex models on much larger data sets, and they typically outperform the simple models. Probably the most successful idea is the distributed representation of words [10]. For example, neural network based language models significantly outperform N-gram models [1, 27, 17].

1.1 Goals of the Paper

The main goal of this paper is to introduce techniques that can be used for learning high-quality word vectors from huge data sets with billions of words, and with millions of words in the vocabulary. As far as we know, none of the previously proposed architectures has been successfully trained on more than a few hundred of millions of words, with a modest dimensionality of the word vectors between 50 - 100.

The main goal of this paper is to introduce techniques for learning high-quality word vectors from huge data sets (billions of words, with millions of words in the vocabulary). As far as we know, none of the previously proposed architectures has been successfully trained on more than a few hundred million words with a modest word vector dimensionality of 50-100.

We use recently proposed techniques for measuring the quality of the resulting vector representations, with the expectation that not only will similar words tend to be close to each other, but that words can have multiple degrees of similarity [20]. This has been observed earlier in the context of inflectional languages - for example, nouns can have multiple word endings, and if we search for similar words in a subspace of the original vector space, it is possible to find words that have similar endings [13, 14].

We use recently proposed techniques to measure the quality of the resulting vector representations, with the expectation that not only will similar words tend to be close to each other, but that words can have multiple degrees of similarity [20]. This was observed earlier in the context of inflectional languages: for example, nouns can have multiple word endings, and if we search for similar words in a subspace of the original vector space, it is possible to find words with similar endings [13, 14].

Somewhat surprisingly, it was found that similarity of word representations goes beyond simple syntactic regularities. Using a word offset technique where simple algebraic operations are performed on the word vectors, it was shown for example that vector(”King”) - vector(”Man”) + vector(”Woman”) results in a vector that is closest to the vector representation of the word Queen [20].

Somewhat more surprisingly, it was found that the similarity of word representations goes beyond simple syntactic regularities. Using a word offset technique, in which simple algebraic operations are performed on the word vectors, it was shown, for example, that vector("King") - vector("Man") + vector("Woman") yields a vector that is closest to the vector representation of the word Queen [20].
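To make the word offset technique concrete, below is a minimal sketch in Python/NumPy. The four-dimensional vectors are invented purely for illustration (real embeddings are learned from data and typically have tens to hundreds of dimensions), and the analogy is answered by a nearest-neighbour search under cosine similarity, in the spirit of [20].

import numpy as np

# Toy embeddings, hand-picked for illustration only; they are NOT vectors
# produced by any trained model.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.9, 0.1, 0.8, 0.2]),
    "man":   np.array([0.1, 0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.1]),
    "apple": np.array([0.0, 0.2, 0.1, 0.9]),
}

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, embeddings):
    # Find the word whose vector is closest to vector(a) - vector(b) + vector(c),
    # excluding the three query words themselves.
    target = embeddings[a] - embeddings[b] + embeddings[c]
    scores = {w: cosine(target, v) for w, v in embeddings.items() if w not in (a, b, c)}
    return max(scores, key=scores.get)

print(analogy("king", "man", "woman", emb))  # -> "queen" on these toy vectors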

In this paper, we try to maximize accuracy of these vector operations by developing new model architectures that preserve the linear regularities among words. We design a new comprehensive test set for measuring both syntactic and semantic regularities, and show that many such regularities can be learned with high accuracy. Moreover, we discuss how training time and accuracy depends on the dimensionality of the word vectors and on the amount of the training data.

In this paper, we try to maximize the accuracy of these vector operations by developing new model architectures that preserve the linear regularities among words. We design a new comprehensive test set for measuring both syntactic and semantic regularities, and show that many such regularities can be learned with high accuracy. Moreover, we discuss how training time and accuracy depend on the dimensionality of the word vectors and on the amount of training data.

1.2 Previous Work

Representation of words as continuous vectors has a long history [10, 26, 8]. A very popular model architecture for estimating neural network language model (NNLM) was proposed in [1], where a feedforward neural network with a linear projection layer and a non-linear hidden layer was used to learn jointly the word vector representation and a statistical language model. This work has been followed by many others.

Representing words as continuous vectors has a long history [10, 26, 8]. A very popular model architecture for estimating a neural network language model (NNLM) was proposed in [1], where a feedforward neural network with a linear projection layer and a non-linear hidden layer is used to jointly learn the word vector representation and a statistical language model. This work has been followed by many others.
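For orientation, here is a minimal sketch (in PyTorch, with made-up hyperparameters) of the kind of feedforward NNLM described in [1]: a shared linear projection of the previous words, a tanh hidden layer, and an output layer over the vocabulary. It only illustrates the architecture and is not the implementation used by the cited authors.

import torch
import torch.nn as nn

class FeedforwardNNLM(nn.Module):
    def __init__(self, vocab_size=10000, n_context=4, proj_dim=100, hidden_dim=500):
        super().__init__()
        # Shared linear projection layer; its rows act as the word vectors.
        self.projection = nn.Embedding(vocab_size, proj_dim)
        # Non-linear hidden layer over the concatenated context projections.
        self.hidden = nn.Linear(n_context * proj_dim, hidden_dim)
        # Output layer producing a score for every word in the vocabulary.
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):              # context_ids: (batch, n_context)
        p = self.projection(context_ids)         # (batch, n_context, proj_dim)
        p = p.flatten(start_dim=1)                # concatenate the projected context words
        h = torch.tanh(self.hidden(p))
        return self.output(h)                    # logits for the next word

# Example: predict the next word from 4 context words for a batch of 8 samples.
logits = FeedforwardNNLM()(torch.randint(0, 10000, (8, 4)))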

Another interesting architecture of NNLM was presented in [13, 14], where the word vectors are first learned using neural network with a single hidden layer. The word vectors are then used to train the NNLM. Thus, the word vectors are learned even without constructing the full NNLM. In this work, we directly extend this architecture, and focus just on the first step where the word vectors are learned using a simple model.

Another interesting NNLM architecture was presented in [13, 14], where the word vectors are first learned using a neural network with a single hidden layer, and are then used to train the NNLM. The word vectors can thus be learned without ever constructing the full NNLM. In this work, we directly extend that architecture and focus only on the first step, in which the word vectors are learned using a simple model.

It was later shown that the word vectors can be used to significantly improve and simplify many NLP applications [4, 5, 29]. Estimation of the word vectors itself was performed using different model architectures and trained on various corpora [4, 29, 23, 19, 9], and some of the resulting word vectors were made available for future research and comparison. However, as far as we know, these architectures were significantly more computationally expensive for training than the one proposed in [13], with the exception of certain version of log-bilinear model where diagonal weight matrices are used [23].

It was later shown that word vectors can be used to significantly improve and simplify many NLP applications [4, 5, 29]. Estimation of the word vectors themselves has been performed with different model architectures trained on various corpora [4, 29, 23, 19, 9], and some of the resulting word vectors were made available for future research and comparison. However, as far as we know, these architectures are significantly more computationally expensive to train than the one proposed in [13], with the exception of certain versions of the log-bilinear model in which diagonal weight matrices are used [23].

2 Model Architectures

Many different types of models were proposed for estimating continuous representations of words, including the well-known Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). In this paper, we focus on distributed representations of words learned by neural networks, as it was previously shown that they perform significantly better than LSA for preserving linear regularities among words [20, 31]; LDA moreover becomes computationally very expensive on large data sets.

Many different types of models have been proposed for estimating continuous representations of words, including the well-known Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). In this paper, we focus on distributed representations of words learned by neural networks: they have been shown to perform significantly better than LSA at preserving linear regularities among words [20, 31], and LDA becomes computationally very expensive on large data sets.

Similar to [18], to compare different model architectures we define first the computational complexity of a model as the number of parameters that need to be accessed to fully train the model. Next, we will try to maximize the accuracy, while minimizing the computational complexity.

Similar to [18], to compare different model architectures we first define the computational complexity of a model as the number of parameters that need to be accessed to fully train the model. Next, we try to maximize accuracy while minimizing the computational complexity.
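In the continuation of this section, the paper makes this notion concrete by writing the training complexity of all the models that follow as proportional to

O = E × T × Q

where E is the number of training epochs, T is the number of words in the training set, and Q is a term defined separately for each model architecture.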

Reposted from blog.csdn.net/newmarui/article/details/91492661