GRU (Gated Recurrent Unit) is a type of recurrent neural network (RNN). Like the LSTM (Long Short-Term Memory), it was proposed to handle long-range dependencies and the vanishing/exploding-gradient problems of backpropagation through time.
Compared with an LSTM, a GRU usually reaches comparable accuracy while being simpler and noticeably faster to train, so in practice it is often the preferred choice. Its input/output interface is the same as that of a plain RNN, and its internal gating idea is similar to the LSTM's.
Internal structure diagram (figure not reproduced in this text version).
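Since the diagram is not reproduced here, the gating logic can be summarized with a minimal NumPy sketch of a single GRU step (the variable names and the omission of bias terms are my own simplifications, not from the original post):

import numpy as np

def gru_step(x, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    """One GRU time step (illustrative sketch; bias terms omitted)."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(W_z @ x + U_z @ h_prev)              # update gate: how much of the old state to keep
    r = sigmoid(W_r @ x + U_r @ h_prev)              # reset gate: how much old state feeds the candidate
    h_cand = np.tanh(W_h @ x + U_h @ (r * h_prev))   # candidate hidden state
    return z * h_prev + (1 - z) * h_cand             # interpolate between old state and candidate

Unlike the LSTM there is no separate cell state and only two gates, which is where the smaller parameter count and faster training come from.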
import random
import jieba
import pandas as pd
import numpy as np
# Import the required modules
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Dense, Input, Flatten, Dropout
from keras.layers import Embedding, GRU
from keras.models import Sequential
# The full pipeline: load the corpus, segment the text and remove stopwords, preprocess the data,
# then classify with LSTM / GRU (the GRU version is shown below)
# Load the stopword list
stopwords = pd.read_csv('./NB_SVM/stopwords.txt', index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='utf-8')
print("stopwords:\n", stopwords.head())
stopwords = stopwords['stopword'].values
# Load the corpus: four csv files, one per (already labeled) category
laogong_df = pd.read_csv('./NB_SVM/beilaogongda.csv', encoding='utf-8', sep=',', index_col=[0])
laopo_df = pd.read_csv('./NB_SVM/beilaopoda.csv', encoding='utf-8', sep=',', index_col=[0])
erzi_df = pd.read_csv('./NB_SVM/beierzida.csv', encoding='utf-8', sep=',', index_col=[0])
nver_df = pd.read_csv('./NB_SVM/beinverda.csv', encoding='utf-8', sep=',', index_col=[0])
# Drop rows containing NaN
laogong_df.dropna(inplace=True)
laopo_df.dropna(inplace=True)
erzi_df.dropna(inplace=True)
nver_df.dropna(inplace=True)
print("laogong_df:\n", laogong_df.head())
print("laopo_df:\n", laopo_df.head())
print("erzi_df:\n", erzi_df.head())
print("nver_df:\n", nver_df.head())
# Convert the segment column of each DataFrame to a Python list
laogong = laogong_df.segment.values.tolist()
laopo = laopo_df.segment.values.tolist()
erzi = erzi_df.segment.values.tolist()
nver = nver_df.segment.values.tolist()
# Define the segmentation-and-labeling function preprocess_text
# content_lines: one of the lists converted above
# sentences: an (initially empty) list that collects the labeled samples
# category: the integer class label
jieba.add_word("报警人")
jieba.add_word("防护装备")
jieba.add_word("防护设备")
jieba.suggest_freq(("人", "称"), tune=True)
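To confirm the custom dictionary takes effect, you can segment a sample phrase and inspect the tokens (the sentence below is a made-up example of my own, not taken from the corpus):

# Made-up sample sentence, only to check whether "报警人" and "防护装备" come out as single tokens:
print(jieba.lcut("报警人称对方使用了防护装备"))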
def preprocess_text(content_lines, sentences, category):
    for line in content_lines:
        try:
            segs = jieba.lcut(line)
            segs = [v for v in segs if not str(v).isdigit()]         # drop tokens that are pure digits
            segs = list(filter(lambda x: x.strip(), segs))           # drop empty / whitespace-only tokens
            segs = list(filter(lambda x: len(x) > 1, segs))          # drop single-character tokens
            segs = list(filter(lambda x: x not in stopwords, segs))  # drop stopwords
            sentences.append((" ".join(segs), category))             # attach the class label
        except Exception:
            print(line)
            continue
# Call the function to build the labeled training data
sentences = []
preprocess_text(laogong, sentences, 0)
preprocess_text(laopo, sentences, 1)
preprocess_text(erzi, sentences, 2)
preprocess_text(nver, sentences, 3)
# Shuffle the samples so the later train/validation/test split is not ordered by class
random.shuffle(sentences)
# Print the first 10 samples to inspect them
for sentence in sentences[:10]:
    print(sentence[0], sentence[1])
# Collect all texts and their corresponding labels
all_texts = [sentence[0] for sentence in sentences]
all_labels = [sentence[1] for sentence in sentences]
# Text classification with a GRU
# Hyperparameters
MAX_SEQUENCE_LENGTH = 100  # maximum sequence length
EMBEDDING_DIM = 200        # embedding dimension
VALIDATION_SPLIT = 0.16    # fraction of data used for validation
TEST_SPLIT = 0.2           # fraction of data used for testing
# Tokenize the texts and pad the sequences with Keras preprocessing utilities
'''
Tokenizer is a class that vectorizes text, i.e. turns each text into a sequence of
integer indices; it is used here for tokenization before the embedding layer.
keras.preprocessing.text.Tokenizer(num_words=None,
                                   filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                                   lower=True,
                                   split=' ',
                                   char_level=False,
                                   oov_token=None,
                                   document_count=0)
Arguments:
    num_words: None (default) keeps every word; an integer keeps only the
        num_words - 1 most frequent words.
    filters: characters stripped from the texts; the default shown above is fine here.
    lower: whether to lowercase the texts.
    split: separator used for word splitting; the default space matches the
        space-joined output of preprocess_text above.
    char_level: if True, every character is treated as a token.
    oov_token: if given, it is added to word_index and used to replace
        out-of-vocabulary words during texts_to_sequences calls.
Relevant methods:
    fit_on_texts(texts): builds the vocabulary from a list of training texts; returns nothing.
    texts_to_sequences(texts): returns a list of integer sequences, one per input text.
Attributes:
    word_index: dict mapping each word to its rank/index; only set after calling fit_on_texts.
'''
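As a quick illustration of the two methods above (a toy example of my own, reusing the Tokenizer import at the top of the script; the variable name toy is hypothetical):

toy = Tokenizer()
toy.fit_on_texts(["我 爱 北京", "我 爱 上海"])
print(toy.word_index)                          # token -> integer index, ordered by frequency
print(toy.texts_to_sequences(["我 爱 上海"]))   # e.g. [[1, 2, 4]]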
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_texts)
sequences = tokenizer.texts_to_sequences(all_texts)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
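Before padding, it can be worth checking the sequence-length distribution to see how much MAX_SEQUENCE_LENGTH = 100 will truncate (an optional check I added, not in the original post):

lengths = [len(s) for s in sequences]
print("max length:", max(lengths), "mean length:", sum(lengths) / len(lengths))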
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(all_labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)
print("data:", data)
print("labels:", labels)
# Split into train / validation / test (with the values above, roughly 64% / 16% / 20% after shuffling)
p1 = int(len(data) * (1 - VALIDATION_SPLIT - TEST_SPLIT))
p2 = int(len(data) * (1 - TEST_SPLIT))
x_train = data[:p1]
y_train = labels[:p1]
x_val = data[p1:p2]
y_val = labels[p1:p2]
x_test = data[p2:]
y_test = labels[p2:]
# Build the model: Embedding -> GRU -> Dense classifier
model = Sequential()
model.add(Embedding(len(word_index) + 1, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))  # (None, 100, 200)
model.add(GRU(200, dropout=0.2, recurrent_dropout=0.2))  # last hidden state only, (None, 200)
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
model.add(Dense(labels.shape[1], activation='softmax'))  # one probability per class
model.summary()
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
# Train, save, then evaluate on the held-out test set
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10, batch_size=128)
model.save('gru.h5')
print(model.metrics_names)
print(model.evaluate(x_test, y_test))
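The introduction mentions classifying with LSTM as well as GRU; the LSTM version is not shown in the original post, but with the same preprocessing it would only require swapping the recurrent layer, roughly like this (a sketch under that assumption, not the author's code):

from keras.layers import LSTM

lstm_model = Sequential()
lstm_model.add(Embedding(len(word_index) + 1, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
lstm_model.add(LSTM(200, dropout=0.2, recurrent_dropout=0.2))
lstm_model.add(Dropout(0.2))
lstm_model.add(Dense(64, activation='relu'))
lstm_model.add(Dense(labels.shape[1], activation='softmax'))
lstm_model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])
lstm_model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10, batch_size=128)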
Original post (in Chinese):
https://soyoger.blog.csdn.net/article/details/108729405