Using Seq2Seq + Attention for Text Summarization

Seq2Seq

Seq2Seq has become a standard paradigm in NLP, and many NLP tasks are handled with Seq2Seq models, which can be built on RNN, LSTM, GRU, Transformer, BERT, and so on.

For an introduction to Seq2Seq models, see the following articles, all of which are well written:

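淺談神經機器翻譯 & 用 Transformer 與 TensorFlow 2 英翻中 (A Brief Look at Neural Machine Translation & English-to-Chinese Translation with Transformer and TensorFlow 2)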
Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)
Neural machine translation with attention

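After carefully reading the three tutorials above, you should have a basic picture of how Seq2Seq and Seq2Seq + Attention models handle Neural Machine Translation (NMT). If we treat the source document in Text Summarization as the source language in NMT and the target summary as the target language, and train with Teacher Forcing on the reference summaries, the model carries over to text summarization with little change.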
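To make the Teacher Forcing setup concrete, here is a minimal sketch in plain Python. The token ids and the `<start>`/`<end>` ids (1 and 2) are made up for illustration and are not taken from the original code. It only shows how one reference summary is split into decoder inputs and decoder targets.

```python
# Hypothetical special-token ids; real ids come from your own vocabulary.
START_ID, END_ID = 1, 2

def teacher_forcing_pairs(summary_ids):
    """Return (decoder_inputs, decoder_targets) for one reference summary.

    At step t the decoder is fed the ground-truth token t and is trained
    to predict the ground-truth token t + 1.
    """
    full = [START_ID] + list(summary_ids) + [END_ID]
    return full[:-1], full[1:]

dec_in, dec_out = teacher_forcing_pairs([17, 42, 8])
print(dec_in)   # [1, 17, 42, 8]
print(dec_out)  # [17, 42, 8, 2]
```

The decoder never sees its own predictions during training, only the shifted reference summary, which is exactly what happens with a target sentence in NMT.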

So how do we implement Seq2Seq + Attention to generate summaries automatically?


Hand-Building Seq2Seq + Attention

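Once we have a text summarization dataset, some preprocessing comes first, such as removing or replacing unwanted characters, tokenization, stemming, and building a vocabulary.

  • Text preprocessing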

import re

def process_text(text):
    # create a space between a word and the punctuation following it
    text = re.sub(r"([?.!,¿])", r" \1 ", text)
    text = re.sub(r'[" "]+', " ", text)
    # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
    text = re.sub(r"[^a-zA-Z?.!,¿]+", " ", text)
    text = text.strip()
    
    return text 
  • Building the vocabulary
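import collections

def build_vocabulary(texts,max_vocab_size):
    
    word_dict = dict()
    # keep only the max_vocab_size most frequent words
    word_counts = collections.Counter(' '.join(texts).split()).most_common(max_vocab_size)
    # NOTE: ids start at 0 here; if sequences are zero-padded and the loss masks id 0,
    # reserve id 0 for a padding token instead
    for word, _ in word_counts:
        word_dict[word] = len(word_dict)
                            
    reversed_dict = dict(zip(word_dict.values(), word_dict.keys()))
    
    return word_dict,reversed_dict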
  • Tokenization
from nltk.tokenize import word_tokenize

def tokenize_text(text):
    # word_tokenize already returns a list of tokens
    return word_tokenize(process_text(text))

The steps above can be done with your own preprocessing functions and the relevant NLTK APIs; the APIs in tensorflow_datasets are another option.
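As a rough end-to-end illustration of these preprocessing steps, the sketch below goes from raw text to fixed-length id sequences using only the standard library. The `<pad>` and `<unk>` tokens, the helper names `clean`, `build_vocab`, and `encode`, and the plain whitespace split (instead of NLTK) are assumptions made for this example, not part of the original code.

```python
import collections
import re

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens, not from the original code

def clean(text):
    # Same idea as process_text: isolate punctuation, drop other symbols.
    text = re.sub(r"([?.!,])", r" \1 ", text.lower())
    text = re.sub(r"[^a-z?.!,]+", " ", text)
    return text.strip()

def build_vocab(texts, max_vocab_size):
    counts = collections.Counter(w for t in texts for w in clean(t).split())
    vocab = {PAD: 0, UNK: 1}  # id 0 is reserved for padding
    for word, _ in counts.most_common(max_vocab_size - len(vocab)):
        vocab[word] = len(vocab)
    return vocab

def encode(text, vocab, max_len):
    # Unknown words map to <unk>; short sequences are post-padded with <pad>.
    ids = [vocab.get(w, vocab[UNK]) for w in clean(text).split()][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))

texts = ["The cat sat.", "The dog sat, too!"]
vocab = build_vocab(texts, max_vocab_size=6)
print(vocab)
print(encode("The cat ran.", vocab, max_len=6))
```

Reserving id 0 for padding keeps the vocabulary compatible with zero-padded sequences and with a loss that masks id 0.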

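tensorflow_datasets is an important companion package to TF 2.0 for working with datasets. It already bundles many commonly used datasets, such as cnn_dailymail for text summarization.

Tokenization and vocabulary building can also be done automatically with the text preprocessing API in tf.keras: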


import tensorflow as tf

def tokenize(text):
    text_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
    
    text_tokenizer.fit_on_texts(text)
    
    tensor = text_tokenizer.texts_to_sequences(text)
    
    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,padding = 'post')
    
    return tensor,text_tokenizer

Once preprocessing is done, use tf.data.Dataset.from_tensor_slices() to build the dataset in the format the TensorFlow model expects.

dataset = tf.data.Dataset.from_tensor_slices((article_tensor_train,title_tensor_train)).shuffle(buffer_size)

dataset = dataset.batch(batch_size,drop_remainder=True)
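The training loop later divides by `num_batch_per_epoch`, which the post does not define. With `drop_remainder=True` it is presumably the number of examples divided by the batch size, rounded down. The sketch below mimics shuffle-then-batch on plain indices to show where that number comes from. It is a simplification: it shuffles the whole list, whereas tf.data shuffles within a buffer of `buffer_size` elements, and the helper name `batch_indices` is hypothetical.

```python
import random

def batch_indices(num_examples, batch_size, seed=0):
    """Mimic shuffle + batch(drop_remainder=True) on example indices."""
    order = list(range(num_examples))
    random.Random(seed).shuffle(order)
    num_batches = num_examples // batch_size  # incomplete last batch is dropped
    return [order[i * batch_size:(i + 1) * batch_size] for i in range(num_batches)]

batches = batch_indices(num_examples=10, batch_size=4)
num_batch_per_epoch = len(batches)
print(num_batch_per_epoch)  # 2
```

Dropping the remainder keeps every batch the same size, which matters here because the encoder's initial hidden state is created with a fixed `batch_size`.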

Building the model

Here we build the model in the simplest way, Encoder-Decoder + BahdanauAttention, where the Encoder and Decoder each use a single-layer GRU or LSTM.

  • Encoder
from tensorflow import keras
from tensorflow.keras import layers

class Encoder(keras.Model):

    def __init__(self,vocab_size,embedding_dim,encoder_units,batch_size):
        super(Encoder,self).__init__()

        self.batch_size = batch_size
        self.encoder_units = encoder_units
        self.embedding = layers.Embedding(vocab_size,embedding_dim)
        self.gru = layers.GRU(self.encoder_units,
                                    return_sequences = True,
                                    return_state = True,
                                    recurrent_initializer = 'glorot_uniform')

    def call(self,x,hidden):
        x = self.embedding(x)
        output,state = self.gru(x,initial_state = hidden)
        return output,state
    
    def initialize_hidden_state(self):
        return tf.zeros((self.batch_size,self.encoder_units))
  • Attention
class BahdanauAttention(keras.Model):

    def __init__(self,units):
        super(BahdanauAttention,self).__init__()

        self.W1 = layers.Dense(units)
        self.W2 = layers.Dense(units)
        self.V = layers.Dense(1)

    def call(self,query,values):
        # hidden shape == (batch_size,hidden size)
        # hidden_with_time_axis shape == (batch_size,1,hidden size)
        hidden_with_time_axis = tf.expand_dims(query,1)

        # score shape == (batch_size,max_length,1)
        score = self.V(tf.nn.tanh(
            self.W1(values) + self.W2(hidden_with_time_axis)
        ))

        # attention_weights shape == (batch_size,max_length,1)
        attention_weights = tf.nn.softmax(score,axis = 1)

        # context_vector shape after sum == (batch_size,hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector,axis = 1)

        return context_vector,attention_weights
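The attention layer above computes score(s, h_i) = v^T tanh(W1 h_i + W2 s) for each encoder output h_i and decoder state s, applies a softmax over the time axis, and returns the weighted sum of the encoder outputs. The NumPy sketch below reproduces that arithmetic so the shapes can be checked without TensorFlow. It is an illustration under assumptions: the weights are random, and the bias terms that Dense layers add are omitted.

```python
import numpy as np

def additive_attention(query, values, W1, W2, v):
    """query: (batch, hidden); values: (batch, max_len, hidden)."""
    # (batch, max_len, units) + (batch, 1, units): broadcast over the time axis
    hidden = np.tanh(values @ W1 + (query @ W2)[:, None, :])
    score = hidden @ v                                  # (batch, max_len, 1)
    score = score - score.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(score) / np.exp(score).sum(axis=1, keepdims=True)
    context = (weights * values).sum(axis=1)            # (batch, hidden)
    return context, weights

rng = np.random.default_rng(0)
batch, max_len, hidden_size, units = 2, 5, 4, 3
values = rng.normal(size=(batch, max_len, hidden_size))
query = rng.normal(size=(batch, hidden_size))
W1 = rng.normal(size=(hidden_size, units))
W2 = rng.normal(size=(hidden_size, units))
v = rng.normal(size=(units, 1))

context, weights = additive_attention(query, values, W1, W2, v)
print(context.shape, weights.shape)  # (2, 4) (2, 5, 1)
print(weights.sum(axis=1).ravel())   # each entry is 1 up to rounding
```

The softmax runs over axis 1 (time), so each source position gets one weight per example, and the context vector has the same size as an encoder output.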
  • Decoder
class Decoder(keras.Model):
    
    def __init__(self,vocab_size,embedding_dim,decoder_units,batch_size):
        super(Decoder,self).__init__()

        self.batch_size = batch_size
        self.decoder_units = decoder_units
        self.embedding = layers.Embedding(vocab_size,embedding_dim)
        self.gru = layers.GRU(self.decoder_units,
                                    return_sequences = True,
                                    return_state = True,
                                    recurrent_initializer = 'glorot_uniform')
        self.fc = layers.Dense(vocab_size)

        self.attention = BahdanauAttention(self.decoder_units)

    # encoder_output shape = (batch_size,max_length,hidden_size)
    def call(self,x,hidden,encoder_output):
        context_vector,attention_weights = self.attention(hidden,encoder_output)

        # x shape after passing through embedding == (batch_size,1,embedding_dim)
        x = self.embedding(x)

        # x shape after concatenation == (batch_size,1,embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector,1),x],axis = -1)

        # passing the concatenated vector to the GRU
        output,state = self.gru(x)

        # output shape == (batch_size * 1,hidden_size)
        output = tf.reshape(output,(-1,output.shape[2]))

        # output shape == (batch_size,vocab)
        x = self.fc(output)

        return x,state,attention_weights

Training

  • Choosing the loss and optimizer
optimizer = keras.optimizers.Adam()
loss_object = keras.losses.SparseCategoricalCrossentropy(from_logits=True,reduction='none')
  • Loss function
def loss_func(real,pred):
    mask = tf.math.logical_not(tf.math.equal(real,0))
    loss_ = loss_object(real,pred)
    
    mask = tf.cast(mask,dtype=loss_.dtype)
    loss_ *= mask
    
    return tf.reduce_mean(loss_)
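To see what the mask does, here is a small plain-Python sketch of the same idea for a single time step. It works on probabilities rather than logits to stay short, and the `normalize_by_mask` option is an addition for comparison, not something the code above does: `tf.reduce_mean` divides by the full batch size, padded positions included.

```python
import math

def masked_step_loss(real, probs, normalize_by_mask=False):
    """real: target ids for one time step (0 = padding).
    probs: per-example probability distributions over the vocabulary."""
    losses, mask = [], []
    for y, p in zip(real, probs):
        m = 0.0 if y == 0 else 1.0
        losses.append(-math.log(p[y]) * m)  # cross-entropy, zeroed on padding
        mask.append(m)
    denom = sum(mask) if normalize_by_mask else len(real)
    return sum(losses) / max(denom, 1.0)

real = [2, 0]                                # second example is padding at this step
probs = [[0.1, 0.1, 0.8], [0.3, 0.3, 0.4]]
print(masked_step_loss(real, probs))                          # -ln(0.8) / 2
print(masked_step_loss(real, probs, normalize_by_mask=True))  # -ln(0.8)
```

Dividing by the number of non-padding targets makes the loss independent of how much padding a batch contains, which can matter when summary lengths vary a lot.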
  • Single training step
@tf.function
def train_one_step(inp,targ,enc_hidden):
    batch_size,targ_length = targ.shape
    loss = 0
    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(inp,enc_hidden)
        dec_hidden = enc_hidden

        for t in range(targ_length - 1):
            dec_input = tf.expand_dims(targ[:,t],1) # using teacher forcing
            predictions,dec_hidden,_ = decoder(dec_input,dec_hidden,enc_output)
            loss += loss_func(targ[:,t + 1],predictions)

    batch_loss = loss / targ_length
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss,variables)
    optimizer.apply_gradients(zip(gradients,variables))

    return batch_loss
  • Training loop
import time

for epoch in range(num_epochs):
    start = time.time()
    
    enc_hidden = encoder.initialize_hidden_state()
    
    total_loss = 0
    
    for batch,(inp,targ) in enumerate(dataset):
        batch_loss = train_one_step(inp,targ,enc_hidden)
        total_loss += batch_loss
        
        if batch % 10 == 0:
            print ('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,batch,batch_loss.numpy()))
    if (epoch + 1) % 2 ==0:
        ckpt_save_path = ckpt_manager.save()
        print ('Saving checkpoint for epoch {} at {}'.format(epoch + 1,ckpt_save_path))
        
    print('Epoch {} loss {:.4f}'.format(epoch + 1,total_loss / num_batch_per_epoch))
    print('Time taken for one epoch {} sec\n'.format(time.time() - start))

Prediction

Once training is finished, the model can be used to predict the summary of a given text.

def evaluate(sentence):

    sentence = process_text(sentence)

    inputs = [article_word_dict[i] for i in sentence.split(' ')]
    inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs], maxlen=max_length, padding='post')
    inputs = tf.convert_to_tensor(inputs)
    
    result = ''

    # zero initial state for a batch of one; must match the encoder's hidden size
    hidden = tf.zeros((1, encoder.encoder_units))
    enc_out, enc_hidden = encoder(inputs, hidden)

    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([title_vocab_size], 0) # title_vocab_size is the index of <start>

    for t in range(max_title_length):
        predictions, dec_hidden, attention_weights = decoder(dec_input, dec_hidden, enc_out)
        
        predicted_id = tf.argmax(predictions[0]).numpy()

        result += title_reversed_dict[predicted_id] + ' '

        if predicted_id == title_vocab_size + 1:
            return result, sentence
        
        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 0)

    return result, sentence
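The prediction loop above is greedy decoding: at every step it keeps only the highest-scoring token and feeds it back in. Stripped of the TensorFlow details, the control flow can be sketched as follows. `step_fn` and `toy_step` are hypothetical stand-ins for the trained decoder, and unlike `evaluate` this version does not include the end token in its output.

```python
def greedy_decode(step_fn, start_id, end_id, max_len):
    """step_fn(prev_id, state) -> (scores, new_state); returns generated ids."""
    output, prev_id, state = [], start_id, None
    for _ in range(max_len):
        scores, state = step_fn(prev_id, state)
        prev_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if prev_id == end_id:
            break
        output.append(prev_id)
    return output

# Toy "decoder" (hypothetical): always prefers the id after the previous one.
def toy_step(prev_id, state):
    vocab_size = 5
    scores = [0.0] * vocab_size
    scores[(prev_id + 1) % vocab_size] = 1.0
    return scores, state

print(greedy_decode(toy_step, start_id=0, end_id=4, max_len=10))  # [1, 2, 3]
```

The `max_len` cap plays the role of `max_title_length`: it guarantees termination even if the model never emits the end token.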

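Of course, in NMT the source and target sentences usually do not differ much in length, so a simple Seq2Seq + Attention model can learn a good attention distribution between source and target text and often does well at evaluation time. A source text and its summary, however, usually differ greatly in length, especially in long-document and multi-document summarization, so real models often need additional handling.

This applies only to simple NMT datasets; for complex NMT tasks such a simple model is clearly not enough. I am not very familiar with NMT and cannot offer good references here, so interested readers can search Google on their own.

The Encoder in the model above uses only a single-layer unidirectional GRU. To improve results we can also define a bidirectional LSTM (or GRU).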

# Bi-LSTM
class Bi_LSTM(keras.layers.Layer):

    def __init__(self, units):
        super(Bi_LSTM,self).__init__()
        
        self.units = units
        self.lstm = keras.layers.Bidirectional(keras.layers.LSTM(self.units,
                                            return_state = True,
                                            return_sequences = True,
                                            recurrent_initializer = 'glorot_uniform'),
                                            merge_mode = "concat")
        
        self.fc = keras.layers.Dense(self.units,activation = 'relu')
        
    def call(self, x):
        
        output,fw_state_h,fw_state_c,rev_state_h,rev_state_c= self.lstm(x)
      
        state_h = keras.layers.concatenate([fw_state_h, rev_state_h])
        state_c = keras.layers.concatenate([fw_state_c, rev_state_c])
        
        return output, state_h, state_c

# Encoder
class Encoder(keras.Model):
    
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_size,\
                 dropout_rate,is_trainable = True, forward_only = False, glove = False):
        super(Encoder,self).__init__()
        
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.enc_units = enc_units
        self.batch_size = batch_size
        self.forward_only = forward_only
        self.glove = glove
        self.dropout_rate = dropout_rate
        self.is_trainable = is_trainable
        
        self.embedding = Emb(self.vocab_size, self.embedding_dim, self.forward_only, self.glove)
        
        self.dropout = DropoutLayer(self.dropout_rate, is_trainable)
        
        self.lstm = Bi_LSTM(self.enc_units)
            
    def call(self, x):
        x = self.embedding(x) 
        
        x = self.dropout(x)
        
        x, state_h, state_c = self.lstm(x)  
                
        return x, state_h, state_c
    
    def initialize_hidden_state(self):
        
        return tf.zeros((self.batch_size,self.enc_units))

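In addition, to get better word-embedding information, the Embedding layer can be initialized with pretrained weights such as GloVe; a related introduction will follow later.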
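How the `Emb` layer above consumes GloVe is not shown in this post. As a hedged sketch of the usual first step, the code below builds an embedding matrix whose rows line up with the vocabulary ids, reading GloVe's plain-text format (one word per line followed by its vector values). The function name `load_glove_matrix`, the random initialization range, and the tiny in-memory "file" are assumptions made for illustration.

```python
import io
import random

def load_glove_matrix(lines, word_dict, embedding_dim, seed=0):
    """Build a vocab_size x embedding_dim matrix aligned with word_dict ids.

    lines: iterable of GloVe-style text lines, "word v1 v2 ... vd".
    Words missing from the file keep a small random initialization.
    Assumes word_dict ids are contiguous and start at 0.
    """
    rng = random.Random(seed)
    matrix = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
              for _ in range(len(word_dict))]
    found = 0
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if word in word_dict and len(values) == embedding_dim:
            matrix[word_dict[word]] = [float(v) for v in values]
            found += 1
    return matrix, found

# Tiny in-memory stand-in for a real GloVe text file.
fake_glove = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
word_dict = {"the": 0, "cat": 1, "summarization": 2}
matrix, found = load_glove_matrix(fake_glove, word_dict, embedding_dim=3)
print(found)      # 2
print(matrix[1])  # [0.4, 0.5, 0.6]
```

One common way to use such a matrix is to pass it to a Keras Embedding layer through a constant initializer and then decide whether the layer should stay trainable.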

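The Quick and Easy Way: TensorFlow 2.0


Implementing every module of the model by hand gives a better understanding of how it works, but more often we only use it as a baseline for comparison, in which case "reinventing the wheel" is somewhat redundant. We can instead build the model with existing libraries by calling their packaged APIs. Below are three TensorFlow 2.0-related libraries that implement Seq2Seq models.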

tensorflow_addons

github

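TensorFlow 2.0 no longer ships tf.contrib, so Google migrated many of its API implementations to the addons project, and a model can be assembled in a few lines of code.

For example, building a simple Decoder: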

import tensorflow_addons as tfa

# Build RNN
#   encoder_outputs: [batch_size, max_time, num_units]
#   encoder_state: [state_h, state_c], each [batch_size, num_units]
encoder = tf.keras.layers.LSTM(num_units, return_sequences=True, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_emb_inp)
encoder_state = [state_h, state_c]

# Sampler
sampler = tfa.seq2seq.sampler.TrainingSampler()

# Decoder
decoder_cell = tf.keras.layers.LSTMCell(num_units)
projection_layer = tf.keras.layers.Dense(num_outputs)
decoder = tfa.seq2seq.BasicDecoder(
    decoder_cell, sampler, output_layer=projection_layer)

outputs, _, _ = decoder(
    decoder_emb_inp,
    initial_state=encoder_state,
    sequence_length=decoder_lengths)
logits = outputs.rnn_output

Using one of the packaged Attention mechanisms:

import tensorflow_addons as tfa

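# memory must be the per-step encoder outputs, [batch_size, max_time, num_units]
# (an encoder built with return_sequences=True), not the final encoder state
attention_mechanism = tfa.seq2seq.LuongAttention(
    num_units,
    encoder_outputs,
    memory_sequence_length=encoder_sequence_length)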

decoder_cell = tfa.seq2seq.AttentionWrapper(
    decoder_cell, attention_mechanism,
    attention_layer_size=num_units)

And building the BeamSearchDecoder used at inference time:


import tensorflow_addons as tfa

# Replicate encoder infos beam_width times
decoder_initial_state = tfa.seq2seq.tile_batch(
    encoder_state, multiplier=beam_width)

# Define a beam-search decoder
decoder = tfa.seq2seq.BeamSearchDecoder(
    cell=decoder_cell,
    beam_width=beam_width,
    output_layer=projection_layer,
    length_penalty_weight=0.0,
    coverage_penalty_weight=0.0)

# decoding
outputs, _ = decoder(
    embedding_decoder,
    start_tokens=start_tokens,
    end_token=end_token,
    initial_state=decoder_initial_state)
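  • For details, read the API documentation on the official site and the source code on GitHub. I hope to write a short introduction after reading the source; at the moment articles on this topic are hard to find online.
  • Installation on Windows is not supported at the time of writing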
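For readers who want to see what a beam-search decoder does internally, here is a small library-free sketch. It is not the tensorflow_addons implementation: there is no length or coverage penalty (matching the 0.0 weights above), and `step_fn` and `toy_step` are hypothetical stand-ins for a trained decoder. The toy model is built so that the locally best first token leads to a worse overall sequence.

```python
import math

def beam_search(step_fn, start_id, end_id, beam_width, max_len):
    """step_fn(prefix) -> list of log-probabilities for the next token."""
    beams = [([start_id], 0.0)]          # (token ids, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, logp in enumerate(step_fn(tokens)):
                candidates.append((tokens + [tok], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            (finished if tokens[-1] == end_id else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)               # keep unfinished hypotheses as fallback
    return max(finished, key=lambda c: c[1])

# Toy model (hypothetical): vocabulary {0: <start>, 1: "a", 2: "b", 3: <end>}.
def toy_step(tokens):
    if len(tokens) == 1:                 # first step: "a" looks best locally
        probs = [0.0, 0.6, 0.4, 0.0]
    elif tokens[-1] == 1:                # after "a": the model is unsure
        probs = [0.0, 0.3, 0.3, 0.4]
    else:                                # after "b": confidently stop
        probs = [0.0, 0.0, 0.05, 0.95]
    return [math.log(p) if p > 0 else -math.inf for p in probs]

print(beam_search(toy_step, 0, 3, beam_width=1, max_len=4)[0])  # [0, 1, 3]
print(beam_search(toy_step, 0, 3, beam_width=2, max_len=4)[0])  # [0, 2, 3]
```

With a beam width of 1 the search reduces to greedy decoding and returns a sequence with probability 0.24, while a width of 2 keeps the second-best first token alive and finds the sequence with probability 0.38.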

OpenNMT-tf

github

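OpenNMT is an open-source machine translation system from the Harvard NLP group. It currently comes in TensorFlow and PyTorch versions, and the TensorFlow version has been rebuilt on TensorFlow 2.0. NMT tasks are easy to run through the officially documented entry points, but if you want to use its modules to build a custom text summarization model, you need to read the source code and pick the parts you need.

For example, a basic Seq2Seq model can be built in a few lines: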

opennmt.models.SequenceToSequence(
    source_inputter=opennmt.inputters.ParallelInputter(
        [opennmt.inputters.WordEmbedder(embedding_size=256),
         opennmt.inputters.WordEmbedder(embedding_size=256)],
        reducer=opennmt.layers.ConcatReducer(axis=-1)),
    target_inputter=opennmt.inputters.WordEmbedder(embedding_size=512),
    encoder=opennmt.encoders.SelfAttentionEncoder(num_layers=6),
    decoder=opennmt.decoders.AttentionalRNNDecoder(
        num_layers=4,
        num_units=512,
        attention_mechanism_class=tfa.seq2seq.LuongAttention),
    share_embeddings=opennmt.models.EmbeddingsSharingLevel.TARGET)
  • For details, see GitHub and the official site
  • Installation on Windows is not supported at the time of writing

Headliner

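Headliner is a Seq2Seq library I came across by chance today, a nice find during a rough stretch, so I quickly tried out its API for simple summary generation. Headliner currently provides three models: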

  • Basic Seq2Seq model
  • Seq2Seq + Attention model
  • Transformer model

Without further ado, let's see through code how concisely it lets us build a baseline model.

Using the same dataset and some of the same preprocessing:

import json
import re
import unicodedata

def unicode_to_ascii(text):
    return ''.join(c for c in unicodedata.normalize('NFD',text) if unicodedata.category(c) != 'Mn')

# text preprocessing
def process_text(text):
    
    text = unicode_to_ascii(text.lower().strip())
    # create a space between a word and the punctuation following it
    text = re.sub(r"([?.!,¿])", r" \1 ", text)
    text = re.sub(r'[" "]+', " ", text)
    # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
    text = re.sub(r"[^a-zA-Z?.!,¿]+", " ", text)
    text = text.strip()
    
    return text

def create_dataset(path,toy = True):

    with open(path,'r') as f:
        data = json.load(f)
        
    text = [process_text(line) for line in data]
    
    if toy:
        return text[:100]
    else:
        return text

text = create_dataset(article_path)
title = create_dataset(title_path)

The provided Preprocessor can also do the preprocessing automatically:

from headliner.preprocessing import Preprocessor

preprocessor = Preprocessor(lower_case = True)

train_prep = [preprocessor(line) for line in train]

Next, use SubwordTextEncoder and Vectorizer for tokenization and vectorization:

from tensorflow_datasets.core.features.text import SubwordTextEncoder
from headliner.preprocessing import Vectorizer

inputs_prep = [t[0] for t in train_prep]
targets_prep = [t[1] for t in train_prep]

tokenizer_input = SubwordTextEncoder.build_from_corpus(inputs_prep, target_vocab_size=2**13)
tokenizer_target = SubwordTextEncoder.build_from_corpus(targets_prep, 
                                                        target_vocab_size=2**13, 
                                                        reserved_tokens=[preprocessor.start_token,preprocessor.end_token])

vectorizer = Vectorizer(tokenizer_input, tokenizer_target)

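With the steps above done, only four lines of code are needed to build and train the model. Convenient, isn't it?

from headliner.trainer import Trainer
from headliner.model.summarizer_basic import SummarizerBasic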

summarizer = SummarizerBasic(embedding_size=64, max_prediction_len=50)
summarizer.init_model(preprocessor, vectorizer)

trainer = Trainer(batch_size=8,
                  steps_per_epoch=10,
                  max_vocab_size_encoder=10000,
                  max_vocab_size_decoder=10000,
                  tensorboard_dir='/tmp/tensorboard',
                  model_save_path='/tmp/summarizer_tensorboard',
                  steps_to_log=50)

trainer.train(summarizer, train, num_epochs=50, val_data=test)

TensorBoard can also be started to monitor training:

# Start tensorboard
%load_ext tensorboard
%tensorboard --logdir /tmp/tensorboard
Reposted from blog.csdn.net/Forlogen/article/details/102611092