A Hands-On Guide to Named Entity Recognition (NER) with BERT

I. Download the BERT Source Code and Model from GitHub

A detailed introduction to BERT is omitted here. Download the BERT code and pretrained model from the google-research/bert repository on GitHub.

1. Download the BERT code

Run in a terminal: git clone https://github.com/google-research/bert.git

2. Download the Chinese BERT-Base model

Download the BERT-Base Chinese model from the official release and extract it into the prev_trained_model/chinese_L-12_H-768_A-12 directory.
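After extraction, the directory should contain the files from the official release:

bert_config.json
bert_model.ckpt.data-00000-of-00001
bert_model.ckpt.index
bert_model.ckpt.meta
vocab.txt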

II. Download the Data

1. Download the demo dataset

Demo dataset for the BERT NER task; download link: https://download.csdn.net/download/TFATS/23512817

2. Create a directory for the dataset

Create a ner_dataset/ner directory alongside run_classifier.py and extract the downloaded demo dataset into it. Any other location works too, as long as the paths in the launch script below match.
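For reference, judging from the parsing logic in the NerProcessor shown later, the demo data follows the CLUENER JSON-lines format: one JSON object per line, with a text field and a label dict mapping entity type → entity string → list of [start, end] character spans. A representative line might look like:

{"text": "张三在北京工作", "label": {"name": {"张三": [[0, 1]]}, "address": {"北京": [[3, 4]]}}}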

III. Create run_classifier.sh

To make launching and hyperparameter tuning more convenient, create a run_classifier.sh file.
Note: the pretrained-model path, the dataset paths, and the output checkpoint path in this file must match the actual paths on your machine.

#!/usr/bin/env bash
# @Author: nijiahui
# @Date:   2021-09-07 14:19:36

TASK_NAME="ner"
MODEL_NAME="chinese_L-12_H-768_A-12"

CURRENT_DIR=$(cd -P -- "$(dirname -- "$0")" && pwd -P)
# # [horovod] get the number of GPUs
# gpu_num=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)

export PRETRAINED_MODELS_DIR=$CURRENT_DIR/prev_trained_model
export BERT_BASE_DIR=$PRETRAINED_MODELS_DIR/$MODEL_NAME
export DATA_DIR=$CURRENT_DIR/ner_dataset
# [reduce MPI log output]
export OMPI_MCA_btl_vader_single_copy_mechanism='none'

# run task
cd $CURRENT_DIR
echo "Start running..."

python run_classifier.py \
  --task_name=$TASK_NAME \
  --do_train=true \
  --do_eval=true \
  --do_predict=true \
  --data_dir=$DATA_DIR/$TASK_NAME \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=48 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=1.0 \
  --output_dir=$CURRENT_DIR/${TASK_NAME}_output
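Then start training with: bash run_classifier.sh. Checkpoints and evaluation results are written to the ner_output directory created next to run_classifier.py.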

IV. Modify the processors

Because we customized the input data format and the labels, modify the processors dict in run_classifier.py as follows:

1. Register the custom NerProcessor

def main(_):
    tf.logging.set_verbosity(tf.logging.INFO)

    # [Step 1: data pipeline - register NerProcessor]
    processors = {
        "ner": NerProcessor
    }
    ... ...
2. Modify the NerProcessor class
class NerProcessor(DataProcessor):
    """Processor for the NER data set."""
    ... ...
    def get_labels(self):
        clue_labels = ['address', 'book', 'company', 'game', 'government', 'movie', 'name', 'organization', 'position', 'scene']
        res = ['O'] + [p + '-' + l for p in ['B', 'M', 'E', 'S'] for l in clue_labels]
        # The labels could also be read from a label.csv file, which is more
        # convenient for inference later; they are written inline here for clarity.
        return res

    def _create_examples(self, lines):
        """See base class."""
        examples = []
        for (i, line) in enumerate(lines):
            guid = "%s" % (i) if 'id' not in line else line['id']
            text_a = tokenization.convert_to_unicode(line['text'])
            label = ['O'] * len(text_a)
            if 'label' in line:
                for l, words in line['label'].items():
                    for word, indices in words.items():
                        for index in indices:
                            if index[0] == index[1]:
                                label[index[0]] = 'S-' + l
                            else:
                                label[index[0]] = 'B-' + l
                                label[index[1]] = 'E-' + l
                                # use k here, not i, to avoid shadowing the enumerate index
                                for k in range(index[0] + 1, index[1]):
                                    label[k] = 'M-' + l
            examples.append(
                InputExample(guid=guid, text_a=text_a, label=label))
        return examples

    def _create_examples_train(self, lines):
        """Identical to _create_examples; kept as a separate entry point in case
        the training data ever needs different handling."""
        return self._create_examples(lines)
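To make the BMES scheme concrete, here is a small self-contained sketch (my own illustration, using the same parsing logic as _create_examples) of the labels produced for one sample line:

line = {"text": "张三在北京工作", "label": {"name": {"张三": [[0, 1]]}, "address": {"北京": [[3, 4]]}}}

text = line["text"]
label = ["O"] * len(text)
for l, words in line["label"].items():
    for word, indices in words.items():
        for start, end in indices:
            if start == end:
                label[start] = "S-" + l
            else:
                label[start] = "B-" + l
                label[end] = "E-" + l
                for k in range(start + 1, end):
                    label[k] = "M-" + l

print(list(zip(text, label)))
# [('张', 'B-name'), ('三', 'E-name'), ('在', 'O'), ('北', 'B-address'),
#  ('京', 'E-address'), ('工', 'O'), ('作', 'O')]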

V. Modify the Labels

1. Modify the convert_single_example method

Lines marked 【modify】 were changed; lines marked 【add】 are new.

def convert_single_example(ex_index, example, label_list, max_seq_length,
                           tokenizer):
  if isinstance(example, PaddingInputExample):
    return InputFeatures(
        input_ids=[0] * max_seq_length,
        input_mask=[0] * max_seq_length,
        segment_ids=[0] * max_seq_length,
        # label_id=0,   【modify】
        label_id=[0] * max_seq_length,	# 【add】
        is_real_example=False)

  ... ... 
  
  tokens = []
  segment_ids = []
  label_ids = []    # 【add 】
  
  tokens.append("[CLS]")
  segment_ids.append(0)
  label_ids.append(0)	# 【add 】
  
  # for token in tokens_a:	【modify】
  #   tokens.append(token)	
  #   segment_ids.append(0)	
  for i, token in enumerate(tokens_a):	# 【add 】
      tokens.append(token)
      segment_ids.append(0)
      try:
          label_ids.append(label_map[example.label[i]])
      except Exception as e:
          print("[debug1]",e)
          label_ids.append(0)
  label_ids.append(0)	

  ... ...

  # Zero-pad up to the sequence length.
  while len(input_ids) < max_seq_length:
    input_ids.append(0)
    input_mask.append(0)
    segment_ids.append(0)
    label_ids.append(0)		# 【add】

  assert len(input_ids) == max_seq_length
  assert len(input_mask) == max_seq_length
  assert len(segment_ids) == max_seq_length
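  assert len(label_ids) == max_seq_length	# 【add】 keep label_ids aligned with the other padded sequences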

  # label_id = label_map[example.label]		  # 【modify】

  if ex_index < 5:
    tf.logging.info("*** Example ***")
    tf.logging.info("guid: %s" % (example.guid))
    tf.logging.info("tokens: %s" % " ".join(
        [tokenization.printable_text(x) for x in tokens]))
    tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
    tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
    tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
    tf.logging.info("label: %s (id = %s)" % (example.label, label_ids)) # 【modify】

  feature = InputFeatures(
      input_ids=input_ids,
      input_mask=input_mask,
      segment_ids=segment_ids,
      label_id=label_ids,   # 【modify】
      is_real_example=True)
  return feature
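For reference, run_classifier.py builds label_map by enumerating the list returned by get_labels(), which is why label id 0 (our 'O' tag) is safe to reuse for [CLS], [SEP], and padding positions. A quick standalone check:

clue_labels = ['address', 'book', 'company', 'game', 'government', 'movie',
               'name', 'organization', 'position', 'scene']
label_list = ['O'] + [p + '-' + l for p in ['B', 'M', 'E', 'S'] for l in clue_labels]

label_map = {}
for (i, label) in enumerate(label_list):   # same construction as in convert_single_example
    label_map[label] = i

print(len(label_list))   # 41 (1 'O' + 4 prefixes x 10 entity types)
print(label_map['O'])    # 0 -> used for [CLS], [SEP] and zero-padding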
2. Modify the file_based_convert_examples_to_features method
def file_based_convert_examples_to_features(
    examples, label_list, max_seq_length, tokenizer, output_file):
		... ...
        features["segment_ids"] = create_int_feature(feature.segment_ids)
        # features["label_ids"] = create_int_feature([feature.label_id])    # 【modify】
        features["label_ids"] = create_int_feature(feature.label_id) 	# 【add】
		... ...
    writer.close()
3. Modify the file_based_input_fn_builder method
def file_based_input_fn_builder(input_file, seq_length, is_training,
                                drop_remainder):
  name_to_features = {
      "input_ids": tf.FixedLenFeature([seq_length], tf.int64),
      "input_mask": tf.FixedLenFeature([seq_length], tf.int64),
      "segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
      # "label_ids": tf.FixedLenFeature([], tf.int64),  # 【modify】
      "label_ids": tf.FixedLenFeature([seq_length], tf.int64),  # 【add】
      "is_real_example": tf.FixedLenFeature([], tf.int64),
  }
  ... ...

VI. Modify create_model

def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                 labels, num_labels, use_one_hot_embeddings):
  """Creates a classification model."""
  model = modeling.BertModel(
      config=bert_config,
      is_training=is_training,
      input_ids=input_ids,
      input_mask=input_mask,
      token_type_ids=segment_ids,
      use_one_hot_embeddings=use_one_hot_embeddings)

  # output_layer = model.get_pooled_output()	# the NER task does not use the pooled output
  ### output_layer shape: [batch_size, seq_length, hidden_size]
  output_layer = model.get_sequence_output()

  seq_length = output_layer.shape[1].value
  hidden_size = output_layer.shape[2].value

  ### output_layer: [batch_size*seq_length, hidden_size]
  output_layer = tf.reshape(output_layer, shape=(-1, hidden_size))

  # Define the weights and bias of the fully connected output layer.
  output_weights = tf.get_variable(
      "output_weights", [num_labels, hidden_size],
      initializer=tf.truncated_normal_initializer(stddev=0.02))

  output_bias = tf.get_variable(
      "output_bias", [num_labels], initializer=tf.zeros_initializer())

  with tf.variable_scope("loss"):
    if is_training:
      output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

    ### logits.shape : [batch_size*seq_length, num_labels]
    logits = tf.matmul(output_layer, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    
    ### logits.shape : [batch_size, seq_length, num_labels]
    logits = tf.reshape(logits, shape=(-1, seq_length, num_labels))

    # compute probabilities and log probabilities
    probabilities = tf.nn.softmax(logits, axis=-1)
    log_probs = tf.nn.log_softmax(logits, axis=-1)

    # convert the integer labels to one-hot encoding
    one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)

    # loss per token position
    per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
    # mean loss over all positions
    loss = tf.reduce_mean(per_example_loss)

    return (loss, per_example_loss, logits, probabilities)
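One caveat worth noting (my own addition, not part of the original post): the loss above averages over every position, including zero-padding. A common refinement is to weight the per-token loss by input_mask; a toy TF1 sketch:

import tensorflow as tf  # TF 1.x

# Toy shapes: batch=2, seq_len=4, num_labels=3; positions with input_mask == 0
# are padding and should not contribute to the loss.
logits = tf.zeros([2, 4, 3])
labels = tf.zeros([2, 4], tf.int32)
input_mask = tf.constant([[1, 1, 0, 0], [1, 1, 1, 0]], tf.float32)

log_probs = tf.nn.log_softmax(logits, axis=-1)
one_hot_labels = tf.one_hot(labels, depth=3, dtype=tf.float32)
per_token_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)   # [2, 4]
masked_loss = tf.reduce_sum(per_token_loss * input_mask) / tf.reduce_sum(input_mask)

with tf.Session() as sess:
    print(sess.run(masked_loss))   # ~1.0986 = ln(3) for uniform logits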

VII. Multi-GPU Training with Horovod

I have written about this before, so I won't repeat it here; see my earlier post on multi-GPU training of BERT on a single node with Horovod and TensorFlow 1.

VIII. Mixed-Precision Training and Reducing Log Output

1. Enable mixed-precision training

Modify the optimization.py file:

os.environ['TF_AUTO_MIXED_PRECISION_GRAPH_REWRITE_IGNORE_PERFORMANCE'] = '1'  # add this right after the imports

...

  optimizer = AdamWeightDecayOptimizer(
      learning_rate=learning_rate * math.sqrt(hvd.size()),  # scale LR by worker count; needs `import math`
      # learning_rate=learning_rate,
      weight_decay_rate=0.01,
      beta_1=0.9,
      beta_2=0.999,
      epsilon=1e-6,
      exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])

  # wrap the optimizer for mixed-precision training
  optimizer = tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer)  # 【add】
  # wrap the optimizer for Horovod multi-GPU training
  optimizer = hvd.DistributedOptimizer(optimizer)	# 【add】

Note:

  • TensorFlow 1.14 or later is recommended.
  • Setting the TF_ENABLE_AUTO_MIXED_PRECISION environment variable by itself does not enable mixed precision.
  • There is almost no speedup on CUDA 9.0; the speedup appears on CUDA 9.2 and later, whose bundled cuBLAS library includes the mixed-precision optimizations that deep-learning workloads rely on.
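To sanity-check that the wrapper is available in your environment, a minimal standalone snippet (assuming TensorFlow 1.14+):

import tensorflow as tf

opt = tf.train.AdamOptimizer(2e-5)
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
print(type(opt).__name__)   # a loss-scaling optimizer wrapping the original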
2. Reduce log output

Add the following line to run_classifier.sh (it is already included in the script shown above):

export OMPI_MCA_btl_vader_single_copy_mechanism='none'


Reposted from blog.csdn.net/TFATS/article/details/120241412