AmpliGraph is an open-source, TensorFlow-based library developed by Accenture Labs for predicting links between concepts in a knowledge graph. It is a collection of neural machine learning models for statistical relational learning (SRL, also known as relational machine learning), a sub-discipline of AI/ML concerned with supervised learning on knowledge graphs. With AmpliGraph you can: (1) discover new knowledge from an existing knowledge graph; (2) complete a large knowledge graph by predicting missing statements; (3) generate stand-alone knowledge graph embeddings; (4) develop and evaluate new relational models.
1. Introduction and Setup
In this hands-on tutorial we will use the open-source library AmpliGraph.
Let's first install the library and its dependencies, then import the packages used throughout the tutorial.
# Install CUDA
conda install -y cudatoolkit=10.0
# Install cudnn libraries
conda install cudnn=7.6
# Install tensorflow GPU
pip install tensorflow-gpu==1.15.3
Check TensorFlow and the GPU:
import tensorflow as tf
print('TensorFlow version: {}'.format(tf.__version__))
# Confirm TensorFlow can see the GPU (TF 1.x API)
print('GPU available: {}'.format(tf.test.is_gpu_available()))
Install AmpliGraph and the remaining dependencies:
# Install AmpliGraph library
pip install ampligraph
# Required to visualize embeddings with tensorboard projector, comment out if not required!
pip install --user tensorboard
# Required to plot text on embedding clusters, comment out if not required!
pip install --user git+https://github.com/Phlya/adjustText
2. Loading the Knowledge Graph Dataset
First we need a knowledge graph, so let's load a standard one called Freebase-15k-237 (FB15K-237).
AmpliGraph provides a set of APIs for loading standard knowledge graph datasets.
It also provides APIs for loading graphs in CSV, N-Triples, and RDF formats; see the AmpliGraph documentation for details. A loading sketch for the CSV case follows the import below.
from ampligraph.datasets import load_fb15k_237, load_wn18rr, load_yago3_10
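As an illustration of the custom-format loaders, reading a tab-separated triples file might look like this (a minimal sketch; the directory and file name are placeholders, not files shipped with this tutorial):
from ampligraph.datasets import load_from_csv

# Returns an ndarray of shape (n, 3): one (subject, predicate, object) triple per row
X = load_from_csv('./data', 'my_graph.csv', sep='\t')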
The Freebase-15k-237 IDs have been remapped to produce a CSV file that contains human-readable names instead of machine IDs.
import pandas as pd
URL = './data/freebase-237-merged-and-remapped.csv'
dataset = pd.read_csv(URL, header=None)
dataset.columns = ['subject', 'predicate', 'object']
dataset.head(5)
print('Total triples in the KG:', dataset.shape)
Total triples in the KG: (310079, 3)

【Creating the training, validation, and test sets】
We use the train_test_split_no_unseen function provided by AmpliGraph to create the training, validation, and test sets. Unlike a naive random split, it guarantees that every entity and relation in the validation and test sets also appears in the training set (a quick sanity check follows the output below).
from ampligraph.evaluation import train_test_split_no_unseen
# get the validation set of size 500
test_train, X_valid = train_test_split_no_unseen(dataset.values, 500, seed=0)
# get the test set of size 1000 from the remaining triples
X_train, X_test = train_test_split_no_unseen(test_train, 1000, seed=0)
print('Total triples:', dataset.shape)
print('Size of train:', X_train.shape)
print('Size of valid:', X_valid.shape)
print('Size of test:', X_test.shape)
Total triples: (310079, 3)
Size of train: (308579, 3)
Size of valid: (500, 3)
Size of test: (1000, 3)
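Since train_test_split_no_unseen guarantees that the validation and test splits contain no entities or relations unseen during training, we can verify this directly (a quick sanity-check sketch over the arrays created above):
import numpy as np

# Entities and relations seen during training
train_entities = set(X_train[:, 0]) | set(X_train[:, 2])
train_relations = set(X_train[:, 1])

for name, split in [('valid', X_valid), ('test', X_test)]:
    entities = set(split[:, 0]) | set(split[:, 2])
    relations = set(split[:, 1])
    # Both counts should be 0
    print(name, 'unseen entities:', len(entities - train_entities),
          '| unseen relations:', len(relations - train_relations))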
3. Model Training
Now that the dataset is split, let's jump straight into model training.
We create a TransE model and train it on the training split using the fit function.
TransE is one of the earliest embedding models and laid the groundwork for KGE research. It scores triples with simple vector algebra: the subject embedding, translated by the predicate embedding, should land close to the object embedding, so the score is the negative distance -||e_s + e_p - e_o||. Compared with most models it has very few trainable parameters. A toy version of this computation is shown below.
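To make "simple vector algebra" concrete, here is a toy computation of the TransE scoring function (a sketch with made-up 3-dimensional embeddings; a real model uses its learned, higher-dimensional vectors):
import numpy as np

# Toy 3-dimensional embeddings (illustrative values only, not taken from a trained model)
e_s = np.array([0.1, 0.5, -0.2])   # subject embedding
e_p = np.array([0.3, -0.1, 0.4])   # predicate embedding
e_o = np.array([0.4, 0.4, 0.2])    # object embedding

# TransE scores a triple as the negative distance between (s + p) and o;
# the closer the translated subject lands to the object, the higher the score.
score = -np.linalg.norm(e_s + e_p - e_o, ord=1)   # L1 norm here; L2 is also common
print(score)   # 0.0 in this toy case, because s + p equals o exactly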
from ampligraph.latent_features import TransE
model = TransE(k=150,            # embedding size
               epochs=100,       # number of epochs
               batches_count=10, # number of batches
               eta=1,            # number of corruptions to generate during training
               loss='pairwise', loss_params={'margin': 1},                      # loss type and its hyperparameters
               initializer='xavier', initializer_params={'uniform': False},    # initializer type and its hyperparameters
               regularizer='LP', regularizer_params={'lambda': 0.001, 'p': 3}, # regularizer and its hyperparameters
               optimizer='adam', optimizer_params={'lr': 0.001},               # optimizer and its hyperparameters
               seed=0, verbose=True)
model.fit(X_train)
from ampligraph.utils import save_model, restore_model
save_model(model, 'TransE-small.pkl')
Refer to the linked AmpliGraph documentation for a detailed description of these parameters and their values.
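Because restore_model was imported alongside save_model, the pickled model can be reloaded in a later session (a minimal usage sketch; the file name matches the save above):
# Reload the model saved above; the restored object supports predict/evaluate as usual
model = restore_model('TransE-small.pkl')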
【Computing evaluation metrics】 score: the value the model assigns to a triple by applying its scoring function.
test_triple = ['harrison ford',
               '/film/actor/film./film/performance/film',
               'star wars']
triple_score = model.predict(test_triple)
print('Triple of interest:\n', test_triple)
print('Triple Score:\n', triple_score)
Triple of interest:
['harrison ford', '/film/actor/film./film/performance/film', 'star wars']
Triple Score:
[-8.270267]
import numpy as np
list_of_actors = ['salma hayek', 'carrie fisher', 'natalie portman', 'kristen bell',
                  'mark hamill', 'neil patrick harris', 'harrison ford']
# stack horizontally to create (s, p, o) triples
hypothesis = np.column_stack([list_of_actors,
                              ['/film/actor/film./film/performance/film'] * len(list_of_actors),
                              ['star wars'] * len(list_of_actors)])
# score the hypothesis
triple_scores = model.predict(hypothesis)
# append the scores column
scored_hypothesis = np.column_stack([hypothesis, triple_scores])
# sort by score in descending order (sort on the numeric scores, not the string
# column, which would compare lexicographically and misorder values like -10 vs -9)
scored_hypothesis = scored_hypothesis[np.argsort(triple_scores)[::-1]]
scored_hypothesis
Output:
array([['harrison ford', '/film/actor/film./film/performance/film',
        'star wars', '-8.270266'],
       ['carrie fisher', '/film/actor/film./film/performance/film',
        'star wars', '-8.357192'],
       ['natalie portman', '/film/actor/film./film/performance/film',
        'star wars', '-8.739484'],
       ['neil patrick harris', '/film/actor/film./film/performance/film',
        'star wars', '-9.089647'],
       ['mark hamill', '/film/actor/film./film/performance/film',
        'star wars', '-9.17255'],
       ['salma hayek', '/film/actor/film./film/performance/film',
        'star wars', '-9.205964'],
       ['kristen bell', '/film/actor/film./film/performance/film',
        'star wars', '-9.764657']], dtype='<U39')
From the sorted output above, every hypothesis triple now has a score, and scores can be turned into ranks: find where the hypothesis score falls within sub_corr_score, the scores of all subject-side corruptions, to obtain the subject rank (a sketch for building sub_corr_score follows the output below).
# Worst rank (ties broken against us): count the corruption scores that are
# greater than or equal to the test triple's score, then add 1
sub_rank_worst = np.sum(np.greater_equal(sub_corr_score, triple_score[0])) + 1
Assigning the worst rank (to break ties): 1655
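The snippet above assumes sub_corr_score, an array of model scores for all subject-side corruptions of the test triple. A minimal sketch of how it could be built, reusing dataset, model, and test_triple from earlier (variable names are illustrative):
import numpy as np

# Every unique entity in the graph is a candidate corrupted subject
entities = np.unique(np.concatenate([dataset['subject'], dataset['object']]))

# Replace the subject of the test triple with each entity in turn
sub_corruptions = np.column_stack([entities,
                                   [test_triple[1]] * len(entities),
                                   [test_triple[2]] * len(entities)])

# Score all subject corruptions with the trained model
sub_corr_score = model.predict(sub_corruptions)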
from ampligraph.evaluation import evaluate_performance

X_test_small = np.array(
    [['doctorate',
      '/education/educational_degree/people_with_this_degree./education/education/major_field_of_study',
      'computer engineering'],
     ['star wars',
      '/film/film/estimated_budget./measurement_unit/dated_money_value/currency',
      'united states dollar'],
     ['harry potter and the chamber of secrets',
      '/film/film/estimated_budget./measurement_unit/dated_money_value/currency',
      'united states dollar'],
     ['star wars', '/film/film/language', 'english language'],
     ['harrison ford', '/film/actor/film./film/performance/film', 'star wars']])

# known true triples are filtered out when ranking (the "filtered" protocol)
X_filter = np.concatenate([X_train, X_valid, X_test], 0)
ranks = evaluate_performance(X_test_small,
                             model=model,
                             filter_triples=X_filter,
                             corrupt_side='s,o')
print(ranks)
Output:
[[   9    5]
 [   1    1]
 [  77    1]
 [   2    2]
 [1644  833]]
Mean rank (MR): as the name suggests, this is the mean of the ranks of all triples. It ranges from 1 (the ideal case, where every rank is 1) up to the worst case, where every triple ranks last.
from ampligraph.evaluation import mr_score
print('MR :', mr_score(ranks))
MR : 257.5
Mean reciprocal rank (MRR): the mean of the reciprocal ranks of all triples. It ranges from 0 to 1; the higher the value, the better the model.
from ampligraph.evaluation import mrr_score
print('MRR :', mrr_score(ranks))
MRR : 0.4325906876796283
hits@n: the fraction of ranks that are less than or equal to n. It ranges from 0 to 1; the higher the value, the better the model.
from ampligraph.evaluation import hits_at_n_score
print('hits@1 :', hits_at_n_score(ranks, 1))
print('hits@10 :', hits_at_n_score(ranks, 10))
hits@1 : 0.3
hits@10 : 0.7
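As a sanity check, all of these aggregate metrics can be reproduced by hand from the ranks matrix printed earlier (a plain-NumPy sketch):
import numpy as np

ranks_np = np.array([[   9,   5],
                     [   1,   1],
                     [  77,   1],
                     [   2,   2],
                     [1644, 833]])

print('MR     :', ranks_np.mean())           # (9+5+1+1+77+1+2+2+1644+833)/10 = 257.5
print('MRR    :', (1.0 / ranks_np).mean())   # ~0.4326
print('hits@1 :', (ranks_np <= 1).mean())    # 3/10 = 0.3
print('hits@10:', (ranks_np <= 10).mean())   # 7/10 = 0.7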
Wrapping these metrics into a single helper function:
def display_aggregate_metrics(ranks):
    print('Mean Rank:', mr_score(ranks))
    print('Mean Reciprocal Rank:', mrr_score(ranks))
    print('Hits@1:', hits_at_n_score(ranks, 1))
    print('Hits@10:', hits_at_n_score(ranks, 10))
    print('Hits@100:', hits_at_n_score(ranks, 100))

display_aggregate_metrics(ranks)
Mean Rank: 257.5
Mean Reciprocal Rank: 0.4325906876796283
Hits@1: 0.3
Hits@10: 0.7
Hits@100: 0.8
4. Training and Early Stopping
While training a model we want to make sure it neither underfits nor overfits the data. If we train for a fixed number of epochs, we have no way of knowing whether the model has underfit or overfit the training data. We therefore need to test the model's performance at fixed intervals to decide when to stop training. This is called early stopping: rather than letting the model run for a long time, we stop just before its performance starts to degrade.
However, we also do not want the model to overfit the held-out set, which would limit its ability to generalize. We therefore create both a validation set and a test set: early stopping is driven by the validation set, while the test set verifies the model's generalization and ensures we neither underfit nor overfit the data.
early_stopping_params = {'x_valid': X_valid,    # validation set on which early stopping is performed
                         'criteria': 'mrr',     # metric to watch during early stopping
                         'burn_in': 150,        # burn-in time: no early stopping checks until epoch 150
                         'check_interval': 50,  # after burn-in, check every 50 epochs (i.e. 150, 200, 250, ...)
                         'stop_interval': 2,    # stop if the monitored criteria degrades for this many consecutive checks
                         'corrupt_side': 's,o'  # which sides to corrupt during early stopping evaluation (default: both subject and object, as described earlier)
                        }
# create a model as earlier
model = TransE(k=100,
               epochs=10000,
               eta=1,
               loss='multiclass_nll',
               initializer='xavier', initializer_params={'uniform': False},
               regularizer='LP', regularizer_params={'lambda': 0.0001, 'p': 3},
               optimizer='adam', optimizer_params={'lr': 0.001},
               seed=0, batches_count=1, verbose=True)

# call model.fit, passing the early stopping params
model.fit(X_train,                                      # training set
          early_stopping=True,                          # enable early stopping
          early_stopping_params=early_stopping_params)  # early stopping params defined above
# evaluate the model with the filter
X_filter = np.concatenate([X_train, X_valid, X_test], 0)
ranks = evaluate_performance(X_test,
                             model=model,
                             filter_triples=X_filter)

# display the metrics
display_aggregate_metrics(ranks)
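As with the first model, the early-stopped model can be persisted for later reuse (a short sketch; the file name is arbitrary):
# Save the early-stopped model alongside the earlier one
save_model(model, 'TransE-early-stopped.pkl')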