Understanding the Code for BERT-Based Named Entity Recognition

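I have been working on entity recognition tasks for a while. BERT has been popular for some time and I have studied it a little, so today I am writing down my basic understanding of how BERT is used to recognize entities, in the hope of discussing it with others.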

The official BERT GitHub repository is https://github.com/google-research/bert , which describes the model in detail. For more detail, see the original paper: https://arxiv.org/abs/1810.04805

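BERT can be understood simply as a two-stage NLP model. (1) Pre-training: a model is trained on text that carries no labels at all; its role is similar to that of word embeddings. (2) Fine-tuning: starting from the pre-trained model, you adapt the input to the task and modify the output part to handle downstream tasks (such as named entity recognition, text classification, similarity computation, and so on).
This article modifies run_classifier.py from the official repository to perform named entity recognition.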

Below I walk through the main parts of the code, with a brief explanation of each.

1. Main entry point

if __name__ == "__main__":
    flags.mark_flag_as_required("data_dir")
    flags.mark_flag_as_required("task_name")
    flags.mark_flag_as_required("vocab_file")
    flags.mark_flag_as_required("bert_config_file")
    flags.mark_flag_as_required("output_dir")
    tf.app.run()

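The entry point marks several flags as required:
data_dir: path to the folder containing the input data
task_name: name of the task
vocab_file: the vocabulary file; it usually ships with the downloaded model and is named "vocab.txt"
bert_config_file: the configuration parameters of the pre-trained model; this file is also in the downloaded model folder and is named "bert_config.json"
output_dir: where output files are saved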
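
To make the effect of a required flag concrete, here is a minimal sketch that uses Python's standard argparse module as a stand-in for tf.flags (a deliberate substitution so the example runs without TensorFlow). The helper name build_parser and all paths are placeholders, not part of the original script.

```python
import argparse

REQUIRED_FLAGS = ("data_dir", "task_name", "vocab_file",
                  "bert_config_file", "output_dir")

def build_parser():
    """Hypothetical helper: declares the five flags as required options."""
    parser = argparse.ArgumentParser(description="NER fine-tuning (sketch)")
    for name in REQUIRED_FLAGS:
        parser.add_argument("--" + name, required=True)
    return parser

# Placeholder values; substitute your own data and model directories.
args = build_parser().parse_args([
    "--data_dir", "data/",
    "--task_name", "ner",
    "--vocab_file", "model/vocab.txt",
    "--bert_config_file", "model/bert_config.json",
    "--output_dir", "output/",
])
print(args.task_name, args.vocab_file)
```

If any of the five options is missing, argparse stops the program with an error message, which is roughly the behavior that mark_flag_as_required gives the real script.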

2. The main(_) function

processors = {
        "ner": NerProcessor
    }
task_name = FLAGS.task_name.lower()  
processor = processors[task_name]()

In the code above, task_name is used to select the processor.
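
The lookup can be sketched in isolation as follows. The NerProcessor here is only a placeholder class, and select_processor is a hypothetical helper added for illustration; the explicit error for an unknown task is an assumption about how one might guard the lookup.

```python
class NerProcessor(object):
    """Placeholder standing in for the real processor class."""
    def get_labels(self):
        return ["O", "B-dizhi", "I-dizhi"]

def select_processor(task_name, processors):
    """Hypothetical helper: case-insensitive lookup, then instantiation."""
    key = task_name.lower()
    if key not in processors:
        raise ValueError("Task not found: %s" % key)
    return processors[key]()  # the dict stores classes, so call to instantiate

processors = {"ner": NerProcessor}
processor = select_processor("NER", processors)
print(type(processor).__name__)
```

Because the dictionary maps names to classes rather than instances, supporting another task only requires adding one more entry.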
The code for NerProcessor is as follows:

class NerProcessor(DataProcessor):  # reads the data
    def get_train_examples(self, data_dir):
        return self._create_example(
            self._read_data(os.path.join(data_dir, "train.txt")), "train"
        )

    def get_dev_examples(self, data_dir):
        return self._create_example(
            self._read_data(os.path.join(data_dir, "dev.txt")), "dev"
        )

    def get_test_examples(self, data_dir):
        return self._create_example(
            self._read_data(os.path.join(data_dir, "test.txt")), "test")

    def get_labels(self):

        # 9 tagging classes ("O" plus B-/I- tags for four entity types), followed by "X", "[CLS]" and "[SEP]"
        return ["O", "B-dizhi", "I-dizhi", "B-shouduan", "I-shouduan", "B-caiwu", "I-caiwu", "B-riqi", "I-riqi", "X",
                "[CLS]", "[SEP]"]

    def _create_example(self, lines, set_type):
        examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%s" % (set_type, i)
            text = tokenization.convert_to_unicode(line[1])
            label = tokenization.convert_to_unicode(line[0])
            examples.append(InputExample(guid=guid, text=text, label=label))
        return examples

The code above reads the data. NerProcessor inherits from DataProcessor, and _read_data() is implemented in the parent class DataProcessor, shown below:

class DataProcessor(object):
    """Base class for data converters for sequence classification data sets."""

    def get_train_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the train set."""
        raise NotImplementedError()

    def get_dev_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the dev set."""
        raise NotImplementedError()

    def get_labels(self):
        """Gets the list of labels for this data set."""
        raise NotImplementedError()

    @classmethod
    def _read_data(cls, input_file):
        """Reads BIO data: one "token label" pair per line, blank line between sentences."""
        # Requires `import codecs` at the top of the file.
        with codecs.open(input_file, 'r', encoding='utf-8') as f:
            lines = []
            words = []
            labels = []
            for line in f:
                contends = line.strip()
                tokens = contends.split()  # change the split() delimiter to match your corpus
                if len(tokens) == 2:
                    word = tokens[0]
                    label = tokens[-1]
                else:
                    if len(contends) == 0:
                        # A blank line ends the current sentence.
                        l = ' '.join([label for label in labels if len(label) > 0])
                        w = ' '.join([word for word in words if len(word) > 0])
                        lines.append([l, w])
                        words = []
                        labels = []
                        continue
                if contends.startswith("-DOCSTART-"):
                    words.append('')
                    continue
                words.append(word)
                labels.append(label)

            # Note: the final sentence is only emitted if the file ends with a blank line.
            return lines  # each item is [labels, words]

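_read_data() was rewritten for the NER task. It stores the characters of the input in words and the tags in labels. At the end of each sentence it joins all the characters with spaces into one string w, does the same for the tags to get l, and appends the pair [l, w] to lines. The relevant code is: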

l = ' '.join([label for label in labels if len(label) > 0])
w = ' '.join([word for word in words if len(word) > 0])
lines.append([l, w])
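
To see the input and output shapes side by side, here is a simplified, self-contained variant of the same grouping logic. It is a sketch rather than the code above: read_bio is a hypothetical name, it reads from any text stream, it skips lines that do not have exactly two fields, and it also flushes the last sentence when the trailing blank line is missing. The sample sentence is made up, using the tag names from get_labels.

```python
import io

def read_bio(stream, sep=None):
    """Group "token label" lines into [labels, words] pairs, one per sentence."""
    lines, words, labels = [], [], []
    for raw in stream:
        fields = raw.strip().split(sep)
        if len(fields) == 2:
            words.append(fields[0])
            labels.append(fields[1])
        elif not raw.strip() and words:
            # Blank line: close the current sentence.
            lines.append([" ".join(labels), " ".join(words)])
            words, labels = [], []
    if words:  # last sentence when the file lacks a trailing blank line
        lines.append([" ".join(labels), " ".join(words)])
    return lines

sample = "3 B-riqi\n月 I-riqi\n在 O\n北 B-dizhi\n京 I-dizhi\n\n现 B-caiwu\n金 I-caiwu\n"
for sentence_labels, sentence_words in read_bio(io.StringIO(sample)):
    print(sentence_labels, "|", sentence_words)
```

Each returned item keeps the label string first and the text string second, matching the [l, w] order that _create_example later reads as line[0] and line[1].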

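get_labels(self) returns the label list. On top of the original tags it adds three more: "X", "[CLS]", and "[SEP]". [CLS] marks the start of a sentence and [SEP] marks its end. "X" is used when the tokenizer splits an English word (for example, an abbreviation) into several pieces: every piece except the first is tagged "X".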
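
The role of the three extra labels can be sketched as follows. toy_tokenize is a hypothetical stand-in for the real WordPiece tokenizer (it simply cuts any word longer than four characters in two), and align_labels is an illustrative helper, not a function from the script.

```python
def toy_tokenize(word):
    """Hypothetical tokenizer: splits words longer than 4 characters in two."""
    if len(word) <= 4:
        return [word]
    return [word[:4], "##" + word[4:]]

def align_labels(words, labels, tokenize=toy_tokenize):
    """Expand word-level labels to piece level, adding [CLS]/[SEP]/X."""
    tokens, tags = ["[CLS]"], ["[CLS]"]
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        tokens.extend(pieces)
        # The first piece keeps the word's label; the rest are tagged "X".
        tags.extend([label] + ["X"] * (len(pieces) - 1))
    tokens.append("[SEP]")
    tags.append("[SEP]")
    return tokens, tags

tokens, tags = align_labels(["Beijing", "is", "big"], ["B-dizhi", "O", "O"])
for token, tag in zip(tokens, tags):
    print(token, tag)
```

With this scheme the token sequence and the tag sequence always have the same length, which is why the three extra entries must appear in the label list.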

The code uses the InputExample class:

class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text, label=None):
        """Constructs an InputExample, i.e. one input example for the BLSTM_CRF model.
        Args:
          guid: Unique id for the example.
          text: string. The untokenized text of the first sequence. For single
            sequence tasks, only this sequence must be specified.
          label: (Optional) string. The label of the example. This should be
            specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.text = text
        self.label = label

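As I understand it, this class wraps a single input example. Whatever the task, the data goes through this step so that the input format is uniform.
guid is an identifier built from the data split (train, dev, or test) and the example's index, for example "train-0".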
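
A minimal sketch of how the [label, text] pairs become InputExample objects, and what the resulting guid looks like. For brevity it omits the tokenization.convert_to_unicode calls, and create_examples is a standalone stand-in for the _create_example method; the two sample lines are made up.

```python
class InputExample(object):
    """Same fields as the class above: guid, text, label."""
    def __init__(self, guid, text, label=None):
        self.guid = guid
        self.text = text
        self.label = label

def create_examples(lines, set_type):
    """Wrap each [label_string, text_string] pair in an InputExample."""
    examples = []
    for i, (label, text) in enumerate(lines):
        guid = "%s-%s" % (set_type, i)  # e.g. "train-0"
        examples.append(InputExample(guid=guid, text=text, label=label))
    return examples

lines = [["B-riqi I-riqi O", "3 月 在"], ["B-caiwu I-caiwu", "现 金"]]
examples = create_examples(lines, "train")
for ex in examples:
    print(ex.guid, "|", ex.text, "|", ex.label)
```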

That is all for now; I will continue updating this post.