BERT Named Entity Recognition

Reference: https://github.com/xuanzebi/BERT-CH-NER

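1. Training corpus

For Chinese, BERT is trained at the character level. It learns contextual and semantic information from sentences, which gives it good generalization.
Annotation example: characters that are not part of a named entity are tagged O. Entity characters are tagged by entity type, e.g. ORG for organization and PER for person; the types can be defined as needed. The first character of an entity is tagged B-<type> and the remaining characters are tagged I-<type>.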

世 界 动 物 卫 生 组 织 的 英 文 简 称 为 什 么 。
B-ORG I-ORG I-ORG I-ORG I-ORG I-ORG I-ORG I-ORG O O O O O O O O O
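The tagging scheme can be sketched in a few lines of Python. The helper `bio_tags` and its `(start, end, type)` span format are hypothetical names introduced for illustration only; they are not taken from the referenced repository.

```python
def bio_tags(chars, spans):
    """Return one BIO tag per character.

    spans: list of (start, end, entity_type) with `end` exclusive.
    This span format is an assumption made for this example.
    """
    tags = ["O"] * len(chars)
    for start, end, entity_type in spans:
        tags[start] = "B-" + entity_type          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + entity_type          # remaining characters
    return tags


sentence = list("世界动物卫生组织的英文简称为什么。")
tags = bio_tags(sentence, [(0, 8, "ORG")])        # the first 8 characters form the ORG entity
print(" ".join(sentence))
print(" ".join(tags))
```

Every character, including the final punctuation mark, receives exactly one tag, so the two printed lines always have the same number of items.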

2. Dataset description

source: training text; target: training labels
dev: validation text; dev-lable: validation labels
test1: test text; test_tgt: test labels


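The label set has 10 classes.
PAD is the class (id 0) assigned to padding positions when a sentence is shorter than max_seq_length.
[CLS] is the class of the [CLS] marker prepended to every sentence; [SEP] plays the same role for the marker appended at the end.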

def get_labels(self):
    return ["PAD", "B-LOC", "I-LOC", "B-ORG", "I-ORG", 
            "B-PER", "I-PER", "O", "[CLS]", "[SEP]"]
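As a rough illustration of how this list becomes integer classes, the sketch below numbers the labels from 0 in list order. That numbering is an assumption, although it agrees with the label_ids shown in section 6 (8 for [CLS], 7 for O, 9 for [SEP]). The name `label2id` is hypothetical.

```python
def get_labels():
    return ["PAD", "B-LOC", "I-LOC", "B-ORG", "I-ORG",
            "B-PER", "I-PER", "O", "[CLS]", "[SEP]"]


# Assumption: ids are assigned by position in the list, so PAD gets id 0.
label2id = {label: i for i, label in enumerate(get_labels())}
id2label = {i: label for label, i in label2id.items()}

tags = ["[CLS]", "B-LOC", "I-LOC", "I-LOC", "O", "[SEP]"]
label_ids = [label2id[t] for t in tags]
print(label_ids)  # [8, 1, 2, 2, 7, 9]
```

Because PAD sits at index 0, padding a label sequence with zeros and padding it with the PAD class are the same operation.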

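3. Code overview

Named entity recognition is a sequence labeling problem, i.e. a classification problem applied to every token. The main change is to run_classifier.py: add a NerProcessor class and save the modified file as run_NER.py.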

class NerProcessor(DataProcessor):
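The class body is not shown here. As a minimal, TensorFlow-free sketch of the kind of work such a processor has to do, the code below pairs each line of a text file with the matching line of a label file. `NerExample`, `read_parallel_files`, and the `.txt` file names are hypothetical and chosen for illustration; the actual NerProcessor in the repository may be organized differently.

```python
import collections
import os
import tempfile

# Hypothetical stand-in for BERT's InputExample; the field names are assumptions.
NerExample = collections.namedtuple("NerExample", ["guid", "text", "label"])


def read_parallel_files(text_path, label_path, set_type):
    """Pair each line of space-separated characters with its line of labels."""
    examples = []
    with open(text_path, encoding="utf-8") as ft, \
            open(label_path, encoding="utf-8") as fl:
        for i, (text_line, label_line) in enumerate(zip(ft, fl)):
            chars = text_line.split()
            labels = label_line.split()
            if len(chars) != len(labels):
                raise ValueError("line %d: %d characters but %d labels"
                                 % (i + 1, len(chars), len(labels)))
            guid = "%s-%d" % (set_type, i)
            examples.append(NerExample(guid, " ".join(chars), " ".join(labels)))
    return examples


# Demo on two throwaway files (the names source.txt / target.txt are assumed).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "source.txt")
    tgt = os.path.join(tmp, "target.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("南 京 市 。\n")
    with open(tgt, "w", encoding="utf-8") as f:
        f.write("B-LOC I-LOC I-LOC O\n")
    examples = read_parallel_files(src, tgt, "train")

print(examples[0])
```

Checking that both lines have the same length at load time catches misaligned annotations before they reach training.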

4. Training

Edit the shell script:

export BERT_BASE_DIR=../chinese_L-12_H-768_A-12
export NER_DIR=../tmp
python run_NER.py \
         --task_name=NER \
         --do_train=true \
         --do_eval=true \
         --do_predict=true \
         --data_dir=$NER_DIR/ \
         --vocab_file=$BERT_BASE_DIR/vocab.txt \
         --bert_config_file=$BERT_BASE_DIR/bert_config.json \
         --learning_rate=2e-5 \
         --train_batch_size=32 \
         --num_train_epochs=3 \
         --output_dir=$BERT_BASE_DIR/output \
         --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
         --max_seq_length=256     # adjust to the actual sentence length

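run_NER.py also defines these settings through tf.flags, including the location of the dataset and of chinese_L-12_H-768_A-12. The defaults duplicate the arguments in the shell script. Edit them here if you want to step through the source in a Python debugger without passing command-line arguments.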

tf.flags.DEFINE_string(flag_name, default_value, docstring)

flags.DEFINE_string(
    "data_dir", '../tmp',
    "The input data dir. Should contain the .tsv files (or other data files) "
    "for the task.")
flags.DEFINE_string(
    "bert_config_file", '../chinese_L-12_H-768_A-12/bert_config.json',
    "The config json file corresponding to the pre-trained BERT model. "
    "This specifies the model architecture.")
flags.DEFINE_string("vocab_file", '../chinese_L-12_H-768_A-12/vocab.txt',
                    "The vocabulary file that the BERT model was trained on.")
flags.DEFINE_string(
    "output_dir", '../chinese_L-12_H-768_A-12/output',
    "The output directory where the model checkpoints will be written.")
# Other parameters
flags.DEFINE_string(
    "init_checkpoint", '../chinese_L-12_H-768_A-12/bert_model.ckpt',
    "Initial checkpoint (usually from a pre-trained BERT model).")


if __name__ == "__main__":
    flags.mark_flag_as_required("data_dir")
    flags.mark_flag_as_required("task_name")
    flags.mark_flag_as_required("vocab_file")
    flags.mark_flag_as_required("bert_config_file")
    flags.mark_flag_as_required("output_dir")
    tf.app.run()

On Windows, run the script from Git Bash:

sh run.sh

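5. Output

checkpoint: records which model checkpoints are available
events.out.tfevents*: event logs for inspecting the run in TensorBoard
graph.pbtxt: the TensorFlow graph definition in text form
model.ckpt-0*: the files that together make up one saved checkpoint (data, index and meta)
model.ckpt-0.index: index that maps each variable in the graph to its stored weights
model.ckpt-0.meta: the complete computation graph structure
predict.tf_record: the prediction set serialized as a binary file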


6. Debug output explained

The model is run in prediction mode with the following configuration:

do_train=false
do_eval=false
do_predict=true

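The prediction set contains 306 records. The first record looks like this:

Original text: 十一月，我们被称为 “ 南京市首届家庭藏书状元明星户 ” 。
tokens (produced by tokenizer.tokenize(), one token per character): [CLS] 十一月，我们被称为 [UNK] 南京市首届家庭藏书状元明星户 [UNK] 。 [SEP]
input_ids (produced by tokenizer.convert_tokens_to_ids()): 101 1282 671 3299 8024 2769 812 6158 4917 711 100 1298 776 2356 7674 2237 2157 2431 5966 741 4307 1039 3209 3215 2787 100 511 102
label_ids (the entity labels returned by get_labels, encoded as integers): 8 7 7 7 7 7 7 7 7 7 7 1 2 2 7 7 7 7 7 7 7 7 7 7 7 7 7 9
segment_ids (0: sentence A, e.g. a question; 1: sentence B, e.g. an answer; the trailing 0s pad texts shorter than 128): 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Because the maximum sequence length is set to 128, shorter texts are padded with 0 and longer texts are truncated at the tail.
Together these variables form the feature for one text, which is then passed on as tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))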
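The padding and tail truncation described above can be sketched without TensorFlow. This is a simplified illustration, not the repository's conversion function: `build_features` is a hypothetical helper, the toy `vocab` holds only a few ids copied from the example record, and the real pipeline uses BERT's tokenizer and full vocabulary.

```python
LABELS = ["PAD", "B-LOC", "I-LOC", "B-ORG", "I-ORG",
          "B-PER", "I-PER", "O", "[CLS]", "[SEP]"]
label2id = {label: i for i, label in enumerate(LABELS)}

# Toy vocabulary: ids taken from the example record, everything else is [UNK].
vocab = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
         "南": 1298, "京": 776, "市": 2356, "。": 511}


def build_features(tokens, labels, max_seq_length):
    # Reserve two positions for [CLS] and [SEP]; truncate the tail if too long.
    tokens = ["[CLS]"] + tokens[:max_seq_length - 2] + ["[SEP]"]
    labels = ["[CLS]"] + labels[:max_seq_length - 2] + ["[SEP]"]

    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    label_ids = [label2id[label] for label in labels]
    input_mask = [1] * len(input_ids)      # 1 marks real tokens
    segment_ids = [0] * len(input_ids)     # single sentence, so all sentence A

    # Pad every list with 0 up to max_seq_length.
    pad = [0] * (max_seq_length - len(input_ids))
    return {
        "input_ids": input_ids + pad,
        "input_mask": input_mask + pad,
        "segment_ids": segment_ids + pad,
        "label_ids": label_ids + pad,
    }


features = build_features(list("南京市。"),
                          ["B-LOC", "I-LOC", "I-LOC", "O"],
                          max_seq_length=8)
for name, values in features.items():
    print(name, values)
```

With max_seq_length=8 the four characters plus the two markers leave two padded positions, where input_mask is 0 and label_ids holds the PAD class.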


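TensorFlow TFRecord files

TFRecord: TensorFlow's built-in binary file format.
It stores the input data and the labels together in the same file.
tf.train.Int64List: wraps a list of integers as a value that can be stored in a feature.
tf.train.Feature: wraps one such value list as a single feature, which is then stored under a string key such as "input_ids".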

def create_int_feature(values):
    # Wrap a sequence of ints as an int64 feature.
    f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
    return f

features = collections.OrderedDict()
features["input_ids"] = create_int_feature(feature.input_ids)
features["input_mask"] = create_int_feature(feature.input_mask)
features["segment_ids"] = create_int_feature(feature.segment_ids)
features["label_ids"] = create_int_feature(feature.label_id)
features["is_real_example"] = create_int_feature([int(feature.is_real_example)])


The following code converts the test-set examples from text into features and writes them as a binary file to predict.tf_record:

predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record")
file_based_convert_examples_to_features(predict_examples, label_list,
                                        FLAGS.max_seq_length, tokenizer,
                                        predict_file)

Write the test-set predictions:

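result = estimator.predict(input_fn=predict_input_fn)
output_path = os.path.join(FLAGS.output_dir, "test_prediction.txt")
with open(output_path, "w", encoding="utf-8") as fw:
    for prediction in result:
        # Skip label id 0 (PAD) and map the remaining ids back to label strings.
        output = " ".join(id2label[label_id] for label_id in prediction
                          if label_id != 0) + "\n"
        fw.write(output)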
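To show what the id-to-label step yields, here is a small sketch that decodes one predicted id sequence and groups the B-/I- tags back into entities. `decode` and `extract_entities` are hypothetical helpers, and the numbering of `id2label` assumes ids follow the order of get_labels; the repository may post-process predictions differently.

```python
LABELS = ["PAD", "B-LOC", "I-LOC", "B-ORG", "I-ORG",
          "B-PER", "I-PER", "O", "[CLS]", "[SEP]"]
id2label = dict(enumerate(LABELS))   # assumption: ids follow list order


def decode(pred_ids):
    """Map ids to label strings, skipping id 0 (PAD)."""
    return [id2label[i] for i in pred_ids if i != 0]


def extract_entities(chars, tags):
    """Group B-/I- tags into (text, type) pairs; tags are aligned with chars."""
    entities = []
    current, current_type = [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [ch], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current.append(ch)
        else:
            # "O", or an I- tag that does not continue the open entity.
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append(("".join(current), current_type))
    return entities


pred_ids = [8, 1, 2, 2, 7, 9, 0, 0]
tags = decode(pred_ids)
print(tags)
# Drop the [CLS] and [SEP] positions before aligning tags with characters.
entities = extract_entities(list("南京市。"), tags[1:-1])
print(entities)
```

An I- tag that does not follow a matching B- tag is ignored here, which is one of several reasonable conventions.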
