Contents
[Dataset Analysis] TACRED Relation Extraction Dataset Analysis (1): Understanding a Single Instance
[Dataset Analysis] TACRED Relation Extraction Dataset Analysis (2): Counting Relation Classes and Instances
[Dataset Analysis] TACRED Relation Extraction Dataset Analysis (3): Relation Distribution
[Dataset Analysis] TACRED Relation Extraction Dataset Analysis (4): Checking for Duplicates Between the Train Set and Valid Set
I recently got my hands on a relation extraction dataset, TACRED, and analyzed its single instances, relation distribution, and so on. This post shares the analysis approach and the code.
1. Single-Instance Analysis
{
'label': 'org:founded',
'text': 'Zagat Survey , the guide empire that started as a hobby for Tim and Nina Zagat in 1979 as a two-page typed list of New York restaurants compiled from reviews from friends , has been put up for sale , according to people briefed on the decision .',
'ents': [['Zagat', 1, 5, 0.5], ['1979', 82, 86, 0.5]],
'ann': [['Q140258', 0, 12, 0.57093775], ['Q7804542', 60, 78, 0.532475]]}
As you can see, an instance is a JSON object whose fields are as follows (a small sketch for loading and inspecting the raw file comes right after):
{'label': relation label,
'text': sentence,
'ents': [[head entity, head entity start, head entity end, ], [tail entity, tail entity start, tail entity end, ]]}
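Before converting anything, a minimal loading sketch like the one below helps to inspect the raw file. It assumes the original file is a single JSON array of such instances (which is also how the conversion code in Section 2 reads it); the path raw_train_path is a placeholder for your own copy of the data.

import json

raw_train_path = 'train.json'  # placeholder: point this at your copy of the original file

with open(raw_train_path, 'r', encoding='utf-8') as f:
    data = json.load(f)  # the whole file is assumed to be one JSON array

first = data[0]
print('label:', first['label'])
print('text :', first['text'])
print('ents :', first['ents'])     # [[entity string, start, end, ...], ...]
print('ann  :', first.get('ann'))  # extra annotations with Q-prefixed ids, as in the example above
print('total instances:', len(data))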
I converted the data into a format and key naming that I prefer, which makes it more convenient for me to access the data. You may want to convert it too, since the analyses in the next few posts are based on the converted dataset. The keys of the dict are as follows:
{"text": , "relation": , "h": {"id": , "name": , "pos": }, "t": {"id": , "name": , "pos": }}
A converted instance looks like this:
{
    "text": "Zagat Survey , the guide empire that started as a hobby for Tim and Nina Zagat in 1979 as a two-page typed list of New York restaurants compiled from reviews from friends , has been put up for sale , according to people briefed on the decision .",
    "relation": "org:founded",
    "h": {
        "id": "0",
        "name": "Zagat",
        "pos": [1, 5]
    },
    "t": {
        "id": "1",
        "name": "1979",
        "pos": [82, 86]
    }
}
NOTE:
- An instance is made up of three parts: {text, h, t}. Both h and t are themselves dicts, each with three fields: {id, name, pos}.
- The original dataset has no ids for h and t, so I assigned them "0" and "1" respectively. I also added a pos field to h and t, which gives the position of the head or tail entity in the sentence (see the sanity-check sketch right after this note).
- A dict can be converted to and from JSON, which makes saving and loading more standardized.
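Here is the sanity check mentioned above: it prints the slice of text that each pos span selects next to the stored entity name, so you can eyeball whether the offsets line up with the entity surface form on your own copy of the data. The helper check_span and the hard-coded instance are just for illustration; the spans come straight from the 'ents' field of the example above.

def check_span(instance):
    # Print each entity name next to the text slice its pos span selects.
    for key in ('h', 't'):
        ent = instance[key]
        start, end = ent['pos']
        print(key, 'name:', repr(ent['name']),
              'text[{}:{}]:'.format(start, end), repr(instance['text'][start:end]))

converted = {
    "text": "Zagat Survey , the guide empire that started as a hobby for Tim and Nina "
            "Zagat in 1979 as a two-page typed list of New York restaurants compiled "
            "from reviews from friends , has been put up for sale , according to "
            "people briefed on the decision .",
    "relation": "org:founded",
    "h": {"id": "0", "name": "Zagat", "pos": [1, 5]},
    "t": {"id": "1", "name": "1979", "pos": [82, 86]},
}
check_span(converted)

Whether the stored spans are exact character offsets for the entity strings is worth checking this way rather than assuming.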
2. Code
import json

def convert_dataset(old_path, new_path):
    # Convert the original file (a JSON array of instances) into the new format,
    # writing one JSON object per line.
    with open(new_path, 'w', encoding='utf-8') as f_op:
        with open(old_path, 'r', encoding='utf-8') as f:
            for i in json.load(f):
                instance = {
                    'text': i['text'],
                    'relation': i['label'],
                    'h': {  # head entity
                        'id': '0',
                        'name': i['ents'][0][0],
                        'pos': [i['ents'][0][1], i['ents'][0][2]],
                    },
                    't': {  # tail entity
                        'id': '1',
                        'name': i['ents'][1][0],
                        'pos': [i['ents'][1][1], i['ents'][1][2]],
                    },
                }
                json.dump(instance, f_op)
                f_op.write('\n')

# Set these to wherever your original train/valid/test files live.
train_path = 'train.json'
valid_path = 'valid.json'
test_path = 'test.json'

convert_dataset(train_path, 'tacred_train.txt')
convert_dataset(valid_path, 'tacred_valid.txt')
convert_dataset(test_path, 'tacred_test.txt')
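Since convert_dataset writes one JSON object per line rather than a single JSON array, the converted files need to be read back line by line. A minimal sketch, assuming the conversion above has already produced tacred_train.txt:

import json

def load_converted(path):
    # Read a converted file (one JSON object per line) back into a list of dicts.
    instances = []
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                instances.append(json.loads(line))
    return instances

train_instances = load_converted('tacred_train.txt')
print(len(train_instances), train_instances[0]['relation'])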
References & Thanks
[1] TACRED official site: https://nlp.stanford.edu/projects/tacred/