NLP practice: a PyTorch-based introduction to text classification

Table of contents

1. Preliminary preparation

1. Environment preparation

2. Load data

2. Code practice

1. Build the vocabulary

2. Generate data batches and iterators

3. Define the model

4. Define the instance

5. Define training function and evaluation function

6. Split the dataset and run the model

3. Evaluate the model using the test dataset

4. Summary


This is a simple text classification implementation using PyTorch. In this example, we will use the AG News dataset for text classification.

Text classification generally involves five steps: corpus collection, text cleaning, tokenization (word segmentation), text vectorization, and modeling.

1. Preliminary preparation

1. Environment preparation

For NLP projects, it is recommended to create a new environment with Anaconda. First, install the torchtext and portalocker libraries.
The version numbers I use are:
torchtext==0.15.1
portalocker==2.7.0
Note: nearby versions also work; they do not have to match exactly.
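
For example, in a fresh Anaconda environment the two libraries can be installed with pip (the environment name and Python version below are only an illustration):

conda create -n nlp python=3.9
conda activate nlp
pip install torchtext==0.15.1 portalocker==2.7.0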

Installation reference:

AG News (AG's News Topic Classification Dataset) is a dataset widely used for text classification tasks, especially in the news domain. It was compiled from AG's Corpus of News Articles and contains four main categories: World, Sports, Business, and Science/Technology.

2. Load data

As in the previous object detection project, select the GPU if one is available:

import torch
import torch.nn as nn
import torchvision
from torchvision import transforms, datasets
import os,PIL,pathlib,warnings

warnings.filterwarnings("ignore")             # ignore warning messages
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(device)

cuda

from torchtext.datasets import AG_NEWS
train_iter = AG_NEWS(split='train')      # load the AG News training split

You can view the downloaded files in the corresponding cache directory (mine is C:\Users\Chen02\.cache\torch\text\datasets\AG_NEWS).
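
As an optional check (not in the original post), one raw sample can be inspected; the dataset yields (label, text) tuples, with integer labels from 1 to 4:

for label, text in train_iter:
    print(label, text[:80])   # e.g. 3 followed by the start of a business headline
    break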

2. Code practice

1. Build the vocabulary

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer  = get_tokenizer('basic_english') # returns a tokenizer function (here the basic English tokenizer)

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"]) # set the default index, returned when a token is not in the vocabulary

Detailed explanation of the build_vocab_from_iterator() function

torchtext.vocab.build_vocab_from_iterator counts the frequency of tokens in an iterable object and returns a Vocab (vocabulary dictionary).

According to the official API documentation, the function takes five parameters and returns a Vocab instance. The five parameters are:
● iterator: an iterable object used to build the vocab (vocabulary dictionary).
● min_freq: the minimum frequency; only tokens that appear at least min_freq times in the text are kept.
● specials: a list of special token strings to add to the vocabulary, such as the common '<unk>' token that stands in for tokens missing from the vocabulary; any symbol can be used, and its exact meaning depends on the user.
● special_first: whether to place the specials at the front of the vocabulary; defaults to True.
● max_tokens: limits the maximum size of the vocabulary; the length of the specials list counts toward this limit.
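
For reference, here is a sketch of the call with the optional parameters spelled out (parameter names follow the torchtext 0.15 documentation; vocab_small and the chosen min_freq are only illustrative):

# illustrative sketch: build a smaller vocabulary that drops rare tokens
vocab_small = build_vocab_from_iterator(yield_tokens(train_iter),
                                        min_freq=5,                   # keep tokens appearing >= 5 times
                                        specials=["<unk>", "<pad>"],  # special tokens
                                        special_first=True,           # put specials at indices 0 and 1
                                        max_tokens=None)              # no cap on vocabulary size
vocab_small.set_default_index(vocab_small["<unk>"])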

print(vocab(['here', 'is', 'an', 'example']))

[475, 21, 30, 5297]

text_pipeline  = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1
print(text_pipeline('here is the an example'))

[475, 21, 2, 30, 5297]

print(label_pipeline('10'))

9

2. Generate data batches and iterators

from torch.utils.data import DataLoader

def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    
    for (_label, _text) in batch:
        # label list
        label_list.append(label_pipeline(_label))
        
        # text list
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        
        # offset, i.e. the number of tokens in this sentence
        offsets.append(processed_text.size(0))
        
    label_list = torch.tensor(label_list, dtype=torch.int64)
    text_list  = torch.cat(text_list)
    offsets    = torch.tensor(offsets[:-1]).cumsum(dim=0) # cumulative sum along dim, giving each sample's start offset
    
    return label_list.to(device), text_list.to(device), offsets.to(device)

# data loader
dataloader = DataLoader(train_iter,
                        batch_size=8,
                        shuffle   =False,
                        collate_fn=collate_batch)
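
As an optional sanity check (not in the original post), one batch can be pulled from the loader to confirm the shapes match what nn.EmbeddingBag expects:

label, text, offsets = next(iter(dataloader))
print(label.shape)    # torch.Size([8]) -- one label per sample
print(offsets.shape)  # torch.Size([8]) -- one start offset per sample
print(text.shape)     # 1-D tensor: all token ids in the batch concatenated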

3. Define the model

Here we define TextClassificationModel: the text is first embedded, and the embeddings within each sentence are then mean-aggregated (which is what nn.EmbeddingBag does by default).

from torch import nn

class TextClassificationModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        
        self.embedding = nn.EmbeddingBag(vocab_size,   # vocabulary size
                                         embed_dim,    # embedding dimension
                                         sparse=False) # dense gradients for the embedding weights
        
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

self.embedding.weight.data.uniform_(-initrange, initrange) initializes the weights of the network's word embedding layer under the PyTorch framework, using random values drawn from a uniform distribution. Specifically:

1. self.embedding: the word embedding layer of the network. It maps discrete word representations (usually integer indices) into fixed-size continuous vectors; these vectors capture semantic relationships between words and serve as the input to the network.
2. self.embedding.weight: the weight matrix of the embedding layer, with shape (vocab_size, embedding_dim), where vocab_size is the vocabulary size and embedding_dim is the embedding dimension.
3. self.embedding.weight.data: the data part of the weight matrix, through which the underlying tensor can be manipulated directly.
4. uniform_(-initrange, initrange): an in-place operation that fills the weight matrix with values drawn from a uniform distribution over [-initrange, initrange], where initrange is a positive number.

Initializing the embedding weights this way gives the model some randomness at the start of training, which helps avoid problems such as vanishing or exploding gradients. During training, these weights are continuously updated by the optimizer to capture better word representations.
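
A minimal illustration of the in-place initialization, separate from the model code (the tensor below is just a toy example):

w = torch.empty(4, 8)                  # e.g. a tiny 'vocabulary' of 4 tokens, embedding dim 8
w.uniform_(-0.5, 0.5)                  # fill in place with values from U(-0.5, 0.5)
print(w.min().item(), w.max().item())  # both values fall inside [-0.5, 0.5]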

4. Define the instance

num_class  = len(set([label for (label, text) in train_iter]))
vocab_size = len(vocab)
em_size     = 64
model      = TextClassificationModel(vocab_size, em_size, num_class).to(device)
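
A quick, optional check of the instantiated sizes (the exact vocabulary size depends on the vocabulary built above; AG News has 4 classes):

print(vocab_size, num_class)  # vocabulary size and number of classes
print(model)                  # shows the EmbeddingBag and Linear layers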

5. Define training function and evaluation function

import time

def train(dataloader):
    model.train()  # switch to training mode
    total_acc, train_loss, total_count = 0, 0, 0
    log_interval = 500
    start_time   = time.time()

    for idx, (label, text, offsets) in enumerate(dataloader):
        
        predicted_label = model(text, offsets)
        
        optimizer.zero_grad()                    # zero the gradients
        loss = criterion(predicted_label, label) # loss between the network output and the ground-truth label
        loss.backward()                          # backpropagation
        optimizer.step()                         # update the parameters
        
        # accumulate accuracy and loss
        total_acc   += (predicted_label.argmax(1) == label).sum().item()
        train_loss  += loss.item()
        total_count += label.size(0)
        
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print('| epoch {:1d} | {:4d}/{:4d} batches '
                  '| train_acc {:4.3f} train_loss {:4.5f}'.format(epoch, idx, len(dataloader),
                                              total_acc/total_count, train_loss/total_count))
            total_acc, train_loss, total_count = 0, 0, 0
            start_time = time.time()

def evaluate(dataloader):
    model.eval()  # switch to evaluation mode
    total_acc, train_loss, total_count = 0, 0, 0

    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            
            loss = criterion(predicted_label, label)  # compute the loss
            # accumulate evaluation metrics
            total_acc   += (predicted_label.argmax(1) == label).sum().item()
            train_loss  += loss.item()
            total_count += label.size(0)
            
    return total_acc/total_count, train_loss/total_count

6. Split the dataset and run the model

from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset
# hyperparameters
EPOCHS     = 10 # number of epochs
LR         = 5  # learning rate
BATCH_SIZE = 64 # batch size for training

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
total_accu = None

train_iter, test_iter = AG_NEWS() # load the train and test splits
train_dataset = to_map_style_dataset(train_iter)
test_dataset  = to_map_style_dataset(test_iter)
num_train     = int(len(train_dataset) * 0.95)

split_train_, split_valid_ = random_split(train_dataset,
                                          [num_train, len(train_dataset)-num_train])

train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
test_dataloader  = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    val_acc, val_loss = evaluate(valid_dataloader)
    
    if total_accu is not None and total_accu > val_acc:
        scheduler.step()
    else:
        total_accu = val_acc
    print('-' * 69)
    print('| epoch {:1d} | time: {:4.2f}s | '
          'valid_acc {:4.3f} valid_loss {:4.3f}'.format(epoch,
                                           time.time() - epoch_start_time,
                                           val_acc,val_loss))

    print('-' * 69)

 

The function torchtext.data.functional.to_map_style_dataset converts an iterable-style dataset into a map-style dataset. This conversion lets us access elements of the dataset by index (e.g., an integer) more conveniently.

In PyTorch, datasets come in two flavors: iterable-style and map-style. Iterable-style datasets implement the __iter__() method, so their elements can be visited by iteration but not by index. Map-style datasets implement the __getitem__() and __len__() methods, so specific elements can be accessed directly by index and the dataset size can be queried.

TorchText is an extension library for PyTorch that focuses on processing text data. The to_map_style_dataset function in torchtext.data.functional converts an iterable-style dataset into an easy-to-operate map-style dataset. This lets us access specific samples directly by index, which simplifies data handling during training, validation, and testing.
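
A small illustration of map-style access, assuming the train_dataset built above (AG News has 120,000 training samples):

print(len(train_dataset))                     # 120000
sample_label, sample_text = train_dataset[0]  # direct access by index
print(sample_label, sample_text[:60])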

3. Evaluate the model using the test dataset

print('Checking the results of test dataset.')
test_acc, test_loss = evaluate(test_dataloader)
print('test accuracy {:8.3f}'.format(test_acc))

Checking the results of test dataset.
test accuracy    0.905

4. Summary

This article walked through a practical case of text classification with PyTorch, using the AG News dataset. The implementation is divided into four parts: preliminary preparation, code practice, model evaluation on the test dataset, and a summary. In the preliminary preparation we set up the environment and load the dataset; in the code practice we build the vocabulary, generate data batches and iterators, and define and train the model; we then evaluate the trained model on the test dataset; finally we review the whole project. Through this article you can understand the basic workflow of text classification and how to implement a text classification model with PyTorch.

Here are a few learning experiences:

1. Environment preparation: it is recommended to use Anaconda to create a new environment and install the required libraries in it.

2. Data loading: load the data with AG_NEWS(split='train'); the downloaded files can be inspected in the corresponding cache directory.

3. Build the vocabulary: use the get_tokenizer and build_vocab_from_iterator functions to build the vocabulary; a default index can be set so that words not found in the vocabulary map to it.

4. Generate data batches and iterators: use DataLoader to generate data batches and iterators, where the collate_batch function converts a batch of raw samples into model inputs.

5. Define the model: define TextClassificationModel, which first embeds the text and then mean-aggregates the embeddings of each sentence.

6. Model training and evaluation: train and evaluate the model, including steps such as defining the loss function, defining the optimizer, and running the training loop.


Origin blog.csdn.net/m0_62237233/article/details/130477513