
Natural Language Processing and Word Embeddings: Operations on word vectors

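1 - Cosine similarity
2 - Word analogy task
3 - Debiasing word vectors (optional)
- 3.1 - Neutralizing bias for non-gender-specific words
- 3.2 - Equalization algorithm for gender-specific words
4 - Full code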

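Welcome to the first assignment of this week.
Because word embeddings are very expensive to train, most ML practitioners load a pre-trained set of embeddings.

After this assignment you will be able to:

Load pre-trained word vectors and measure similarity using cosine similarity
Use word embeddings to solve word analogy problems such as "man is to woman as king is to ____"
Modify word embeddings to reduce their gender bias

Let's get started! Run the following code to load the required packages.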

import numpy as np
from w2v_utils import *

Next, let's load the word vectors. For this assignment we will use 50-dimensional GloVe vectors to represent words. Run the code below to load word_to_vec_map.

words, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')

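You have loaded:

words: the set of words in the vocabulary
word_to_vec_map: a dictionary mapping each word to its GloVe vector representation

Note: if running the code raises the following error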

UnicodeDecodeError: 'gbk' codec can't decode byte 0x93 in position 3136: illegal multibyte sequence

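you can fix it by opening the file explicitly as UTF-8 in w2v_utils.py (without an encoding argument, open() uses the platform's default encoding, here gbk):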

import numpy as np

def read_glove_vecs(glove_file):
    # Original: with open(glove_file, 'r') as f:
    # Open explicitly as UTF-8 instead of relying on the platform default encoding.
    with open(glove_file, 'r', encoding='utf-8') as f:
        words = set()
        word_to_vec_map = {}

        for line in f:
            # Each line holds a word followed by the components of its vector.
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)

    return words, word_to_vec_map
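
To see the fix in isolation, here is a minimal sketch that writes a tiny GloVe-style file and reads it back with an explicit UTF-8 encoding. The helper name `read_vectors`, the file name, and the three-dimensional values are all made up for illustration; real GloVe files have the same one-word-per-line layout but many more lines and dimensions.

```python
import os
import tempfile

import numpy as np


def read_vectors(path):
    """Parse a GloVe-style text file: one word per line, then its components."""
    vectors = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split()
            if not parts:
                continue  # skip blank lines
            vectors[parts[0]] = np.array(parts[1:], dtype=np.float64)
    return vectors


# Toy file with made-up 3-dimensional vectors (not real GloVe values).
# The non-ASCII words are the kind of content that a non-UTF-8 default
# encoding can fail to decode.
toy_text = "café 0.1 0.2 0.3\nnaïve -0.4 0.5 0.6\n"
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_vectors.txt")
    with open(toy_path, "w", encoding="utf-8") as f:
        f.write(toy_text)
    toy_vectors = read_vectors(toy_path)

print(len(toy_vectors))
print(toy_vectors["café"])
```

Passing `encoding` on both the write and the read keeps the behavior the same regardless of the operating system's locale.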

1 - Cosine similarity

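To measure how similar two words are, we need a way to measure the similarity between their embedding vectors. Given two vectors $u$ and $v$, cosine similarity is defined as: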

$\frac{u \cdot v}{||u||_2 ||v||_2} = cos(\theta)\tag{1}$

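where $u \cdot v$ is the dot product (or inner product) of the two vectors, $||u||_2$ is the norm (or length) of $u$, and $\theta$ is the angle between $u$ and $v$. The similarity depends only on this angle: if $u$ and $v$ are very similar, their cosine similarity is close to 1; if they are dissimilar, it takes a smaller value, down to -1 when they point in opposite directions.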
Figure: the cosine of the angle between two vectors measures how similar they are.
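
As a quick sanity check on formula (1), the sketch below evaluates it on hand-picked two-dimensional vectors for the three extreme cases: same direction, perpendicular, and opposite. The helper name `cos_sim` and the vectors are illustrative choices, not part of the assignment.

```python
import numpy as np


def cos_sim(u, v):
    """Cosine similarity of two 1-D arrays, following formula (1)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


u = np.array([1.0, 2.0])

same_direction = cos_sim(u, 3.0 * u)             # parallel, different length
orthogonal = cos_sim(u, np.array([-2.0, 1.0]))   # 90 degrees apart
opposite = cos_sim(u, -u)                        # 180 degrees apart

print(same_direction, orthogonal, opposite)
```

Note that scaling a vector does not change the result: only the angle matters, not the length.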

Exercise: implement the cosine_similarity() function to evaluate the similarity between word vectors.

Hint: the norm of $u$ is defined as $||u||_2 = \sqrt{\sum_{i=1}^{n} u_i^2}$

# GRADED FUNCTION: cosine_similarity

def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """

    ### START CODE HERE ###
    # Compute the dot product between u and v (≈1 line)
    dot = np.dot(u, v)

    # Compute the L2 norm of u (≈1 line)
    norm_u = np.sqrt(np.sum(np.power(u, 2)))

    # Compute the L2 norm of v (≈1 line)
    norm_v = np.sqrt(np.sum(np.square(v)))

    # Compute the cosine similarity defined by formula (1) (≈1 line)
    cosine_similarity = np.divide(dot, norm_u * norm_v)
    ### END CODE HERE ###

    return cosine_similarity

Let's test it:

father = word_to_vec_map["father"]
mother = word_to_vec_map["mother"]
ball = word_to_vec_map["ball"]
crocodile = word_to_vec_map["crocodile"]
france = word_to_vec_map["france"]
italy = word_to_vec_map["italy"]
paris = word_to_vec_map["paris"]
rome = word_to_vec_map["rome"]

print("cosine_similarity(father, mother) = ", cosine_similarity(father, mother))
print("cosine_similarity(ball, crocodile) = ",cosine_similarity(ball, crocodile))
print("cosine_similarity(france - paris, rome - italy) = ",cosine_similarity(france - paris, rome - italy))

Output:

cosine_similarity(father, mother) =  0.8909038442893615
cosine_similarity(ball, crocodile) =  0.27439246261379424
cosine_similarity(france - paris, rome - italy) =  -0.6751479308174202

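Once you get the expected output, feel free to change the inputs and measure the cosine similarity between other word pairs!
Trying other inputs will give you a better feel for how word vectors behave.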

2 - Word analogy task

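In the word analogy task, we complete sentences of the form "a is to b as c is to ____".
More precisely, we look for a word d such that the word vectors $e_a, e_b, e_c, e_d$ are related by $e_b-e_a \approx e_d-e_c$. We measure the similarity between $e_b-e_a$ and $e_d-e_c$ using cosine similarity.

Exercise: complete the code below to perform word analogies.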

# GRADED FUNCTION: complete_analogy

def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """
    Performs the word analogy task as explained above: a is to b as c is to ____.

    Arguments:
    word_a -- a word, string
    word_b -- a word, string
    word_c -- a word, string
    word_to_vec_map -- dictionary that maps words to their corresponding vectors.

    Returns:
    best_word -- the word such that v_b - v_a is close to v_best_word - v_c, as measured by cosine similarity
    """

    # convert words to lower case
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()

    ### START CODE HERE ###
    # Get the word embeddings v_a, v_b and v_c (≈1-3 lines)
    e_a, e_b, e_c = word_to_vec_map[word_a], word_to_vec_map[word_b], word_to_vec_map[word_c]
    ### END CODE HERE ###

    words = word_to_vec_map.keys()  # all words in the vocabulary

    max_cosine_sim = -100  # Initialize max_cosine_sim to a large negative number
    best_word = None       # Initialize best_word with None, it will help keep track of the word to output

    # loop over the whole word vector set
    for w in words:
        # to avoid best_word being one of the input words, skip them.
        if w in [word_a, word_b, word_c]:
            continue

        ### START CODE HERE ###
        # Compute cosine similarity between (e_b - e_a) and (vector of w - e_c) (≈1 line)
        cosine_sim = cosine_similarity(e_b - e_a, word_to_vec_map[w] - e_c)

        # If the cosine_sim is more than the max_cosine_sim seen so far,
        # then: set the new max_cosine_sim to the current cosine_sim and the best_word to the current word (≈3 lines)
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w
        ### END CODE HERE ###

    return best_word

Let's test it:

triads_to_try = [('italy', 'italian', 'spain'), ('india', 'delhi', 'japan'), ('man', 'woman', 'boy'), ('small', 'smaller', 'large')]
for triad in triads_to_try:
    print('{} -> {} :: {} -> {}'.format(*triad, complete_analogy(*triad, word_to_vec_map)))

Output:

italy -> italian :: spain -> spanish
india -> delhi :: japan -> tokyo
man -> woman :: boy -> girl
small -> smaller :: large -> larger

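Once you get the expected output, feel free to modify the input words above to test your own analogies. Try to find other analogy pairs that work, but also some where the algorithm gives the wrong answer: for example, you can try small->smaller as big->?.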
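
The loop in complete_analogy calls cosine_similarity once per vocabulary word, which is slow on the full GloVe vocabulary. The sketch below shows one way the same search could be vectorized with NumPy. The function name `complete_analogy_vectorized` and the three-dimensional embeddings in `toy_map` are hypothetical, picked by hand so the expected answer is obvious; this is an illustration, not the assignment's reference solution.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, chosen by hand for illustration only.
toy_map = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "apple": np.array([0.0, 0.2, 0.5]),
}


def complete_analogy_vectorized(word_a, word_b, word_c, word_to_vec_map):
    """Return the word d that maximizes cos(e_b - e_a, e_d - e_c)."""
    inputs = {word_a.lower(), word_b.lower(), word_c.lower()}
    e_a, e_b, e_c = (word_to_vec_map[w.lower()] for w in (word_a, word_b, word_c))

    # Candidate words exclude the three inputs, as in the loop version.
    candidates = [w for w in word_to_vec_map if w not in inputs]
    matrix = np.stack([word_to_vec_map[w] for w in candidates])

    target = e_b - e_a     # shape (n,)
    diffs = matrix - e_c   # shape (num_candidates, n)

    # Cosine similarity of every row of diffs with target, computed at once.
    sims = diffs @ target / (np.linalg.norm(diffs, axis=1) * np.linalg.norm(target))
    return candidates[int(np.argmax(sims))]


best = complete_analogy_vectorized("man", "woman", "king", toy_map)
print(best)
```

This sketch assumes no candidate has exactly the same vector as word_c; such a row would have zero norm and would need to be filtered out before dividing.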

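Congratulations!
You have completed the exercises above. Here are the main points to remember:

Cosine similarity is a good way to compare the similarity between pairs of word vectors. (L2 distance works as well.)
For NLP applications, using a set of pre-trained word vectors from the internet is often a good way to get started.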
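
The remark that L2 distance also works can be made concrete: for vectors normalized to unit length, the squared L2 distance and the cosine similarity are tied by $||u - v||_2^2 = 2 - 2\cos(\theta)$, so both rank neighbors identically. The sketch below checks this identity numerically on random 50-dimensional vectors, which stand in for word embeddings here.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=50)
v = rng.normal(size=50)

# Normalize both vectors to unit length.
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)

cosine = np.dot(u_hat, v_hat)
squared_l2 = np.sum((u_hat - v_hat) ** 2)

# For unit vectors: ||u - v||^2 = 2 - 2 * cos(theta)
print(squared_l2, 2.0 - 2.0 * cosine)
```

Without normalization the two measures can disagree, because L2 distance is also sensitive to vector length.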

3 - Debiasing word vectors (optional)

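In the following exercise, you will examine gender biases that can be reflected in word embeddings and explore algorithms for reducing them. Besides teaching you about debiasing, this exercise will also help sharpen your intuition about what word vectors are doing. This section involves a bit of linear algebra; you can probably complete it without being an expert, and we encourage you to give it a try.

Let's first see how the GloVe word embeddings relate to gender.
Start by computing a vector $g=e_{woman}-e_{man}$, where $e_{woman}$ is the word vector for woman and $e_{man}$ is the word vector for man. The resulting vector $g$ roughly encodes the concept of "gender". (You might get a more accurate representation by computing $g_1=e_{mother}-e_{father}$, $g_2=e_{girl}-e_{boy}$, and so on, and averaging them. For now, $e_{woman}-e_{man}$ alone gives good enough results.)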

g = word_to_vec_map['woman'] - word_to_vec_map['man']
print(g)

Output:

[-0.087144    0.2182     -0.40986    -0.03922    -0.1032      0.94165
 -0.06042     0.32988     0.46144    -0.35962     0.31102    -0.86824
  0.96006     0.01073     0.24337     0.08193    -1.02722    -0.21122
  0.695044   -0.00222     0.29106     0.5053     -0.099454    0.40445
  0.30181     0.1355     -0.0606     -0.07131    -0.19245    -0.06115
 -0.3204      0.07165    -0.13337    -0.25068714 -0.14293    -0.224957
 -0.149       0.048882    0.12191    -0.27362    -0.165476   -0.20426
  0.54376    -0.271425   -0.10245    -0.32108     0.2516     -0.33455
 -0.04371     0.01258   ]

Now consider the cosine similarity of different words with $g$. Think about what a positive similarity value means versus a negative one.

print ('List of names and their similarities with constructed vector:')

# girls' and boys' names
name_list = ['john', 'marie', 'sophie', 'ronaldo', 'priya', 'rahul', 'danielle', 'reza', 'katy', 'yasmin']

for w in name_list:
    print (w, cosine_similarity(word_to_vec_map[w], g))

Output:

List of names and their similarities with constructed vector:
john -0.23163356145973724
marie 0.315597935396073
sophie 0.31868789859418784
ronaldo -0.3124479685032943
priya 0.17632041839009402
rahul -0.16915471039231722
danielle 0.24393299216283892
reza -0.07930429672199552
katy 0.2831068659572615
yasmin 0.23313857767928758

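As you can see, female first names tend to have a positive cosine similarity with our constructed vector $g$, while male first names tend to have a negative one. This is not surprising, and the result seems acceptable.

But let's try some other words.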

print('Other words and their similarities:')
word_list = ['lipstick', 'guns', 'science', 'arts', 'literature', 'warrior','doctor', 'tree', 'receptionist', 
             'technology',  'fashion', 'teacher', 'engineer', 'pilot', 'computer', 'singer']
for w in word_list:
    print (w, cosine_similarity(word_to_vec_map[w], g))

Output:

Other words and their similarities:
lipstick 0.2769191625638266
guns -0.1888485567898898
science -0.060829065409296994
arts 0.008189312385880328
literature 0.06472504433459927
warrior -0.20920164641125288
doctor 0.11895289410935041
tree -0.07089399175478091
receptionist 0.33077941750593737
technology -0.13193732447554296
fashion 0.03563894625772699
teacher 0.17920923431825664
engineer -0.08039280494524072
pilot 0.0010764498991916787
computer -0.10330358873850498
singer 0.1850051813649629

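Do you notice anything surprising? It is striking how these results reflect certain unhealthy gender stereotypes. For example, "computer" is closer to "man", while "literature" is closer to "woman".

Below we will see how to reduce the bias of these vectors using an algorithm due to Bolukbasi et al., 2016. Note that some word pairs, such as "actor"/"actress" or "grandmother"/"grandfather", should remain gender-specific, while other words such as "receptionist" or "technology" should be neutralized, that is, made unrelated to gender. You have to treat these two types of words differently when debiasing.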

3.1 - Neutralizing bias for non-gender-specific words

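The figure below should help you visualize what neutralizing does.
Figure: a word vector before and after neutralization, drawn against the $g$ axis and the $g_{\perp}$ axis.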

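If you are using 50-dimensional word embeddings, the 50-dimensional space can be split into two parts: the bias direction $g$ and the remaining 49 dimensions, which we call $g_{\perp}$. In linear algebra, we say that the 49-dimensional $g_{\perp}$ is perpendicular (or "orthogonal") to $g$, meaning it is at 90 degrees to $g$. The neutralization step takes a vector such as $e_{receptionist}$ and zeros out its component in the direction of $g$, giving $e_{receptionist}^{debiased}$.

Even though $g_{\perp}$ is 49-dimensional, given the limitations of what we can draw on a screen, we illustrate it using a one-dimensional axis in the figure above.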

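Exercise: implement neutralize() to remove the bias of non-gender-specific words such as "receptionist" or "scientist". Given an input embedding $e$, you can use the following formulas to compute $e^{debiased}$:
$e^{bias\_component}=\frac{e \cdot g}{||g||_2^2} * g\tag{2}$
$e^{debiased} = e - e^{bias\_component} \tag{3}$
If you are an expert in linear algebra, you may recognize $e^{bias\_component}$ as the projection of $e$ onto the direction $g$. If you're not, don't worry about this.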

def neutralize(word, g, word_to_vec_map):
    """
    Removes the bias of "word" by projecting it on the space orthogonal to the bias axis.
    This function ensures that gender neutral words are zero in the gender subspace.

    Arguments:
        word -- string indicating the word to debias
        g -- numpy-array of shape (50,), corresponding to the bias axis (such as gender)
        word_to_vec_map -- dictionary mapping words to their corresponding vectors.

    Returns:
        e_debiased -- neutralized word vector representation of the input "word"
    """

    ### START CODE HERE ###
    # Select word vector representation of "word". Use word_to_vec_map. (≈ 1 line)
    e = word_to_vec_map[word]

    # Compute e_biascomponent using formula (2) given above. (≈ 1 line)
    e_biascomponent = np.divide(np.dot(e, g), np.square(np.linalg.norm(g))) * g

    # Neutralize e by subtracting e_biascomponent from it, as in formula (3).
    # e_debiased should be equal to its orthogonal projection. (≈ 1 line)
    e_debiased = e - e_biascomponent
    ### END CODE HERE ###

    return e_debiased

Let's test it:

e = "receptionist"
print("cosine similarity between " + e + " and g, before neutralizing: ", cosine_similarity(word_to_vec_map["receptionist"], g))

e_debiased = neutralize("receptionist", g, word_to_vec_map)
print("cosine similarity between " + e + " and g, after neutralizing: ", cosine_similarity(e_debiased, g))

Output:

cosine similarity between receptionist and g, before neutralizing:  0.33077941750593737
cosine similarity between receptionist and g, after neutralizing:  1.1682064664487028e-17

The second result is essentially 0 (on the order of $10^{-17}$), so the neutralized vector leans toward neither man nor woman.
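
Formulas (2) and (3) are easy to follow by hand in two dimensions. The sketch below uses made-up vectors (a bias axis along the first coordinate and an arbitrary vector to neutralize) to show that what remains after subtracting the projection is orthogonal to $g$.

```python
import numpy as np

# Hypothetical 2-D vectors, chosen so the arithmetic is easy to follow.
g = np.array([2.0, 0.0])   # bias axis
e = np.array([3.0, 4.0])   # vector to neutralize

# Formula (2): projection of e onto g. Here (6 / 4) * [2, 0] = [3, 0].
e_bias_component = np.dot(e, g) / np.dot(g, g) * g

# Formula (3): what remains, [0, 4], is orthogonal to g.
e_debiased = e - e_bias_component

print(e_bias_component, e_debiased, np.dot(e_debiased, g))
```

With real 50-dimensional embeddings the dot product is rarely exactly zero because of floating-point rounding, which is why the output above shows a value around $10^{-17}$.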

3.2 - Equalization algorithm for gender-specific words

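Next, let's see how debiasing can also be applied to word pairs such as "actress" and "actor".
Suppose "actress" is closer to "babysitter" than "actor" is. By applying neutralization to "babysitter" (the method of the previous section), we can reduce the gender stereotype associated with babysitting. But this still does not guarantee that "actor" and "actress" are equidistant from "babysitter". The equalization algorithm takes care of this.

The key idea behind equalization is to make sure that a particular pair of words is equidistant from the 49-dimensional $g_{\perp}$. The equalization step also ensures that the two equalized words are now the same distance from $e_{receptionist}^{debiased}$, or from any other word that has been neutralized. In pictures, this is how equalization works:
Figure: a pair of gender-specific word vectors before and after equalization.
The derivation of the linear algebra is a bit more complex (see Bolukbasi et al., 2016 for details), but the key equations are: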
$\mu = \frac{e_{w1} + e_{w2}}{2}\tag{4}$

$\mu_{B} = \frac{\mu \cdot \text{bias\_axis}}{||\text{bias\_axis}||_2^2} * \text{bias\_axis}\tag{5}$

$\mu_{\perp} = \mu - \mu_{B} \tag{6}$

$e_{w1B} = \frac{e_{w1} \cdot \text{bias\_axis}}{||\text{bias\_axis}||_2^2} * \text{bias\_axis} \tag{7}$

$e_{w2B} = \frac{e_{w2} \cdot \text{bias\_axis}}{||\text{bias\_axis}||_2^2} * \text{bias\_axis} \tag{8}$

$e_{w1B}^{corrected} = \sqrt{|1 - ||\mu_{\perp}||_2^2|} * \frac{e_{w1B} - \mu_B}{|(e_{w1} - \mu_{\perp}) - \mu_B|} \tag{9}$

$e_{w2B}^{corrected} = \sqrt{|1 - ||\mu_{\perp}||_2^2|} * \frac{e_{w2B} - \mu_B}{|(e_{w2} - \mu_{\perp}) - \mu_B|} \tag{10}$

$e_1 = e_{w1B}^{corrected} + \mu_{\perp} \tag{11}$

$e_2 = e_{w2B}^{corrected} + \mu_{\perp} \tag{12}$

Exercise: implement the function below. Use the equations above to get the final equalized version of the word pair.

def equalize(pair, bias_axis, word_to_vec_map):
    """
    Debias gender specific words by following the equalize method described in the figure above.

    Arguments:
    pair -- pair of strings of gender specific words to debias, e.g. ("actress", "actor")
    bias_axis -- numpy-array of shape (50,), vector corresponding to the bias axis, e.g. gender
    word_to_vec_map -- dictionary mapping words to their corresponding vectors

    Returns
    e_1 -- word vector corresponding to the first word
    e_2 -- word vector corresponding to the second word
    """

    ### START CODE HERE ###
    # Step 1: Select word vector representation of "word". Use word_to_vec_map. (≈ 2 lines)
    w1, w2 = pair
    e_w1, e_w2 = word_to_vec_map[w1], word_to_vec_map[w2]

    # Step 2: Compute the mean of e_w1 and e_w2 (≈ 1 line)
    mu = (e_w1 + e_w2) / 2.0

    # Step 3: Compute the projections of mu over the bias axis and the orthogonal axis (≈ 2 lines)
    mu_B = np.divide(np.dot(mu, bias_axis), np.square(np.linalg.norm(bias_axis))) * bias_axis
    mu_orth = mu - mu_B

    # Step 4: Use equations (7) and (8) to compute e_w1B and e_w2B (≈2 lines)
    e_w1B = np.divide(np.dot(e_w1, bias_axis), np.square(np.linalg.norm(bias_axis))) * bias_axis
    e_w2B = np.divide(np.dot(e_w2, bias_axis), np.square(np.linalg.norm(bias_axis))) * bias_axis

    # Step 5: Adjust the bias part of e_w1B and e_w2B using formulas (9) and (10) given above (≈2 lines)
    # The single bars in the denominator are taken here as an element-wise absolute value.
    corrected_e_w1B = np.sqrt(np.abs(1 - np.square(np.linalg.norm(mu_orth)))) * np.divide((e_w1B - mu_B), np.abs(e_w1 - mu_orth - mu_B))
    corrected_e_w2B = np.sqrt(np.abs(1 - np.square(np.linalg.norm(mu_orth)))) * np.divide((e_w2B - mu_B), np.abs(e_w2 - mu_orth - mu_B))

    # Step 6: Debias by equalizing e1 and e2 to the sum of their corrected projections (≈2 lines)
    e1 = corrected_e_w1B + mu_orth
    e2 = corrected_e_w2B + mu_orth

    ### END CODE HERE ###

    return e1, e2

Let's test it:

print("cosine similarities before equalizing:")
print("cosine_similarity(word_to_vec_map[\"man\"], gender) = ", cosine_similarity(word_to_vec_map["man"], g))
print("cosine_similarity(word_to_vec_map[\"woman\"], gender) = ", cosine_similarity(word_to_vec_map["woman"], g))
print()
e1, e2 = equalize(("man", "woman"), g, word_to_vec_map)
print("cosine similarities after equalizing:")
print("cosine_similarity(e1, gender) = ", cosine_similarity(e1, g))
print("cosine_similarity(e2, gender) = ", cosine_similarity(e2, g))

Output:

cosine similarities before equalizing:
cosine_similarity(word_to_vec_map["man"], gender) =  -0.1171109576533683
cosine_similarity(word_to_vec_map["woman"], gender) =  0.3566661884627037

cosine similarities after equalizing:
cosine_similarity(e1, gender) =  -0.7165727525843935
cosine_similarity(e2, gender) =  0.7396596474928908

Feel free to modify the input words in the code above to apply equalization to other word pairs.
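
In the output above, the two similarities after equalizing (about -0.717 and 0.740) are close but not exact mirror images. The equalize code takes the bars in the denominators of (9) and (10) as an element-wise absolute value, so each coordinate is divided by a different number. Another reading, which is an assumption here and not something the text above states, is that the bars denote the L2 norm, a single scalar. The sketch below implements that reading on made-up three-dimensional vectors; `equalize_l2` and `cos_sim` are hypothetical helper names.

```python
import numpy as np


def equalize_l2(e_w1, e_w2, bias_axis):
    """Equalize two vectors, reading the bars in (9)-(10) as the L2 norm."""
    axis_sq = np.dot(bias_axis, bias_axis)

    mu = (e_w1 + e_w2) / 2.0                               # (4)
    mu_B = np.dot(mu, bias_axis) / axis_sq * bias_axis     # (5)
    mu_orth = mu - mu_B                                    # (6)

    e_w1B = np.dot(e_w1, bias_axis) / axis_sq * bias_axis  # (7)
    e_w2B = np.dot(e_w2, bias_axis) / axis_sq * bias_axis  # (8)

    scale = np.sqrt(np.abs(1.0 - np.dot(mu_orth, mu_orth)))
    # Assumption: scalar L2 norm in the denominator instead of np.abs.
    corrected_1 = scale * (e_w1B - mu_B) / np.linalg.norm(e_w1 - mu_orth - mu_B)  # (9)
    corrected_2 = scale * (e_w2B - mu_B) / np.linalg.norm(e_w2 - mu_orth - mu_B)  # (10)

    return corrected_1 + mu_orth, corrected_2 + mu_orth    # (11), (12)


def cos_sim(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


# Hypothetical toy vectors, not real GloVe embeddings.
axis = np.array([1.0, 0.0, 0.0])
v1 = np.array([0.5, 0.3, 0.1])
v2 = np.array([-0.2, 0.1, 0.4])

e1, e2 = equalize_l2(v1, v2, axis)
print(cos_sim(e1, axis), cos_sim(e2, axis))
```

Under this reading the corrected parts stay on the bias axis, the two outputs share the same orthogonal component, and their cosine similarities with the axis are exact negatives of each other.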

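These debiasing algorithms are very helpful for reducing bias, but they are not perfect and do not eliminate all traces of bias.
For example, one weakness of this implementation is that the bias direction $g$ was defined using only the pair of words woman and man. As discussed earlier, if $g$ were defined by computing $g_1=e_{woman}-e_{man}$; $g_2=e_{mother}-e_{father}$; $g_3=e_{girl}-e_{boy}$; and so on, and averaging them as $g= avg(g_1, g_2, g_3)$, you would obtain a better estimate of the "gender" dimension in the 50-dimensional word embedding space.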
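
The averaging idea can be sketched in a few lines. The helper name `average_bias_axis` and the three-dimensional vectors in `toy_map` are hypothetical stand-ins; with the real data you would pass word_to_vec_map and use the resulting vector in place of $g$.

```python
import numpy as np


def average_bias_axis(pairs, word_to_vec_map):
    """Average e_female - e_male over a list of (female, male) word pairs."""
    diffs = [word_to_vec_map[f] - word_to_vec_map[m] for f, m in pairs]
    return np.mean(diffs, axis=0)


# Hypothetical 3-D vectors standing in for real GloVe embeddings.
toy_map = {
    "woman": np.array([1.0, 0.5, 0.0]), "man": np.array([0.0, 0.5, 0.0]),
    "mother": np.array([0.9, 0.1, 0.3]), "father": np.array([0.1, 0.1, 0.0]),
    "girl": np.array([0.8, 0.0, 0.2]), "boy": np.array([0.0, 0.3, 0.2]),
}
pairs = [("woman", "man"), ("mother", "father"), ("girl", "boy")]

g_avg = average_bias_axis(pairs, toy_map)
print(g_avg)
```

Averaging keeps the component that the pairs share and dampens the parts that are specific to a single pair.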

Congratulations!
You have completed this exercise and seen several ways to use and modify word vectors.

4 - Full code

Link

2021-1-9 Andrew Ng, C5 Sequence Models, W2 Natural Language Processing and Word Embeddings (programming assignment 1: Operations on word vectors, including a fix for UnicodeDecodeError)
