Stanford CS224N - GloVe: Global Vectors for Word Representation

Window-based methods (direct prediction): Skip-gram and CBOW
Skip-gram model:
take one window at a time,
predict the probability of the surrounding words (using the current v and u matrices), and run SGD on each window (~7:00)
Most objective functions are non-convex, so initialization matters; the usual trick to avoid getting stuck in poor local optima is to initialize with small random numbers.
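For reference (the standard formula from the lecture, reproduced here from memory rather than from these notes): the Skip-gram probability of an outside word o given a center word c is a softmax over the whole vocabulary,

    P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}

and training maximizes this (minimizes the negative log) for every (center, outside) pair in every window. The sum in the denominator is what the efficiency discussion below refers to.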

SGD gradient updates are very sparse: in practice only the words that appear in the window get nonzero gradients; everything else is 0.
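A minimal numpy sketch (my own illustration with assumed names, not the lecture's code) of why the update is sparse: with negative sampling, only the rows for the center word, the observed outside word, and the k sampled negatives get nonzero gradients, so only those rows of the parameter matrices are touched.

    import numpy as np

    def sgd_step(V, U, center, outside, negatives, lr=0.025):
        """One sparse SGD step for a single (center, outside) pair with negative sampling.
        V: |vocab| x d center-word vectors, U: |vocab| x d outside-word vectors."""
        sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
        v_c = V[center].copy()
        # positive pair: increase u_o . v_c
        g_pos = sigmoid(np.dot(U[outside], v_c)) - 1.0     # d(-log sigma(u_o . v_c)) / d(u_o . v_c)
        grad_vc = g_pos * U[outside]
        U[outside] -= lr * g_pos * v_c
        # negative samples: decrease u_k . v_c
        for k in negatives:
            g_neg = sigmoid(np.dot(U[k], v_c))             # d(-log sigma(-u_k . v_c)) / d(u_k . v_c)
            grad_vc += g_neg * U[k]
            U[k] -= lr * g_neg * v_c
        V[center] -= lr * grad_vc                          # every other row has zero gradient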


For each window, the denominator requires on the order of 20,000 inner products (one for every word in the vocabulary)? (~10:30)
Not efficient: many words, e.g. aardvark and zebra, never occur together, so why not restrict the computation to the words in one window??
Solution: train binary logistic regressions for true pairs (a center vector v with an observed outside vector u) versus noise pairs (v paired with a random word); the true pairs are the positive samples, the random pairs the negative samples.

The word2vec package keeps the numerator's idea of taking the inner product between the center word and the outside words,
but the denominator becomes: sample a few words at random from the corpus (k negative samples, on the order of 10) and minimize the probability that they co-occur with the center word (i.e. appear in its window), using their sum as the normalizer,
rather than looping over every distinct word: the denominator only sums the inner products of the sampled (mostly high-frequency) words.

The improved objective function: see the iPad notes.
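The notes defer the exact formula to the iPad notes; as a reference point (the standard form from the word2vec paper and the CS224N slides, not copied from these notes), the negative-sampling objective to be maximized for one window position is

    J_t(\theta) = \log \sigma(u_o^\top v_c) + \sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)} \left[ \log \sigma(-u_j^\top v_c) \right], \qquad \sigma(x) = \frac{1}{1 + e^{-x}}

where the first term is the positive sample (the observed outside word o) and the second term covers the k negative samples, drawn from P(w) \propto U(w)^{3/4} (the unigram distribution raised to the 3/4 power).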

Another model:
Continuous Bag of Words (CBOW):
predict the center word from the sum of the surrounding words (instead of predicting the surrounding words)

What actually happens:
set up the objective function, take gradients, and words with similar meanings end up clustered together in the space (PCA visualization).
Go through each word in the corpus and try to predict the surrounding words in its window;
the core is to capture how often words occur together.

Count-based method
Example: small corpus -> windows -> compute word vectors
Window-based co-occurrence matrix (co-occurrence counts); a small sketch follows below
Symmetric window (length 1)
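A minimal sketch (my own illustration; the toy corpus and variable names are assumed, in the spirit of the lecture's example) of building the window-based co-occurrence matrix with a symmetric window of length 1:

    import numpy as np

    corpus = ["I like deep learning .",
              "I like NLP .",
              "I enjoy flying ."]                 # toy corpus in the spirit of the lecture example
    tokens = [sent.split() for sent in corpus]
    vocab = sorted({w for sent in tokens for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}

    window = 1                                    # symmetric window of length 1
    X = np.zeros((len(vocab), len(vocab)))        # co-occurrence counts
    for sent in tokens:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    X[idx[w], idx[sent[j]]] += 1  # count neighbors within the window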

Switch to an SVD factorization of the co-occurrence matrix???

The rows of this matrix cannot really be used as word vectors directly: they grow with the vocabulary, so every word gets a very high-dimensional count vector.
Partial solution: store only the most important information in a fixed number of dimensions.
How to reduce the dimensionality of the co-occurrence matrix?
SVD (Singular Value Decomposition)
Simple Python code to visualize words (reduce the dimensionality, then plot the first two dimensions)
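Roughly the kind of code shown in the lecture (reconstructed from memory, not a verbatim copy), reusing X and vocab from the sketch above: take the SVD of the co-occurrence matrix and plot each word at its first two singular-vector coordinates.

    import numpy as np
    import matplotlib.pyplot as plt

    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X: co-occurrence matrix from above
    for i, word in enumerate(vocab):
        plt.scatter(U[i, 0], U[i, 1])
        plt.annotate(word, (U[i, 0], U[i, 1]))          # words with similar contexts land near each other
    plt.show()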

Cap the counts at, say, 100, ignore a couple of very frequent words (like 'the'), or give the words at different positions in the window different weights.
There are many ways to re-process the co-occurrence matrix, and SVD on the result works surprisingly well.

Count-based methods train quickly on small-to-medium corpora, make efficient use of the statistics, and make the relationships between words easy to visualize.
Prediction-based models, by contrast, have to sweep over every window during training and do not exploit the global co-occurrence statistics effectively.

Combining the advantages of both: the GloVe model (Global Vectors).

 
X_final (the final word vector) = U + V (the row/column vectors of the two matrices, i.e. the word's center-word vector plus its context-word vector)
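The notes do not reproduce the GloVe objective itself; for reference (the standard form from the GloVe paper, not taken from the iPad notes), it is a weighted least-squares fit of inner products to log co-occurrence counts:

    J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where X_{ij} is the co-occurrence count, f(\cdot) caps the influence of very frequent pairs, and w / \tilde{w} are the center and context vectors whose sum gives the final vector mentioned above.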

 

Polysemy (one word with multiple meanings)
Word vectors encode similarity + Polysemy

 

  1. Polysemous vectors are superpositions of their sense vectors

tie-1, tie-2, tie-3 -> final combination (see the formula after this list)

  2. Senses can be recovered by sparse coding (+ noise). How to decompose this vector afterwards?
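The superposition above can be written (roughly, following the Arora et al. result discussed in the lecture; the exact weights are an assumption here) as a frequency-weighted average of the sense vectors:

    v_{\text{tie}} \approx \frac{f_1\, v_{\text{tie-1}} + f_2\, v_{\text{tie-2}} + f_3\, v_{\text{tie-3}}}{f_1 + f_2 + f_3}

where f_i is the relative frequency of sense i; sparse coding then tries to recover the individual sense vectors from this mixture, up to noise.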

  

How to evaluate word vectors?
For these hyperparameters, check whether inner products correlate with human judgements of similarity (people rate how similar two words are, e.g. on a 1 to 10 scale)
How vector distances correlate with these human judgements. GloVe did the best!

Intrinsic evaluations:
Compare the model's output against human-annotated word or sentence similarity on a specific dataset; gains here may not translate into real improvements downstream
on a specific subtask
fast to compute, understand how your system works
Extrinsic evaluations:
Measured by how much the vectors improve a real external application; slower to assess
Use the Pearson correlation coefficient (rather than raw counts) to express the strength of the correlation with the human judgements
take a long time to train

Word vector cosine distance: captures semantic and syntactic analogies
(semantic: man is to woman as king is to ?; syntactic: slow, slower, slowest) via vector subtraction and addition
There is no mathematical proof that this should work.
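A minimal sketch (my own illustration; vectors is an assumed dict mapping words to numpy arrays, not something defined in the lecture) of the analogy evaluation: form b - a + c and return the nearest remaining word by cosine similarity.

    import numpy as np

    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def analogy(a, b, c, vectors):
        """Solve a : b :: c : ?, e.g. man : woman :: king : ? -> ideally queen."""
        target = vectors[b] - vectors[a] + vectors[c]
        candidates = [w for w in vectors if w not in (a, b, c)]   # exclude the query words
        return max(candidates, key=lambda w: cosine(vectors[w], target))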

Comparison of results across models:
High-dimensional vectors do not necessarily perform better; more data does help.
Results sometimes depend on the data quality (Wikipedia, for example, is a very good dataset!)

Do all of these models count as Word2Vec?? Is Word2Vec just Skip-gram?

Hyperparameter tuning: how to choose these hyperparameters (plot accuracy against each parameter as it varies)

  1. Symmetric windows work better
  2. Dimensionality works well from about 200 upward (accuracy is flat between 300 and 600)
  3. Window size: about 8 (left, right) is the best

GloVe computes all the counts and then works on the counts
Skip-Gram goes one window at a time
GloVe did better, and more training iterations help.

Expensive but the best kind of evaluation is an extrinsic evaluation on a real task
(NER is a good task for this; sentiment analysis is a bad one, since words with different sentiment can appear in the same positions)!

Simple single word classification


Given a word vector x, does it belong to class y or not (classification)?
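A minimal sketch (assumed names, illustrative only) of this setup: binary logistic regression on a fixed word vector x, deciding whether the word belongs to class y (for example, "is this word a location name?").

    import numpy as np

    def predict(x, W, b):
        """P(y = 1 | x) for a word vector x under binary logistic regression."""
        return 1.0 / (1.0 + np.exp(-(np.dot(W, x) + b)))

    def train_step(x, y, W, b, lr=0.1):
        """One SGD step on the cross-entropy loss for a single (word vector, label) pair."""
        p = predict(x, W, b)
        grad = p - y          # gradient of the loss w.r.t. the logit
        W = W - lr * grad * x
        b = b - lr * grad
        return W, b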


Reposted from www.cnblogs.com/alexfz/p/10222923.html