iFLYTEK Text Classification and Keyword Extraction Challenge Based on Paper Abstracts (DataWhale-Camp)

1. Competition information

Submission address: https://challenge.xfyun.cn/topic/info?type=abstract-of-the-paper&ch=ymfk4uU

Competition title and tasks:

  • Text Classification and Keyword Extraction Challenge Based on Paper Abstracts
  • Keyword extraction: generate keywords from the abstract
  • Text classification: judge whether the paper is medical literature

2. Solutions

Models provided:

  • Statistical word-frequency features: TF-IDF + logistic regression
  • Machine learning: BERT
  • Large model fine-tuning: ChatGLM + LoRA

Operating environment

  • Python 3.10.12
  • Dependencies: the requirements.txt in the root directory, or the conda export in docx/.yaml

Final result

  • Rank 17: 1200 teams signed up and 121 teams submitted, so 17/121 is about the top 14%.
  • Score: 0.435
    Task 1 uses BERT supplemented with a fine-tuned GLM large model.
    Task 2 uses plain BERT trained for 3 epochs.

2.1 PaddlePaddle Baseline (code provided)

  • To let you run through the whole process quickly, the tutorial's baseline is deployed on Baidu AI Studio. You can fork it, run the code with one click, submit the output, and see the score.
  • There is not much to add: it runs with one click on AI Studio.
  • Locally, an M1 Pro finishes it in about 10 seconds.
  • The real score is only about 0.2. The organizer exposed too much data on leaderboard A, including the keyword answers themselves, so everyone submits the same keywords and scores 0.9x. The leaderboard A score therefore has no reference value.
  • The score you get after submitting is not a genuine one.
Text Classification and Keyword Extraction Challenge Based on Paper Abstracts  
https://challenge.xfyun.cn/topic/info?type=abstract-of-the-paper&ch=ZuoaKcY
![](https://ai-studio-static-online.cdn.bcebos.com/bc8c545638eb4200a68836ed741b6fe7d75108e9009d443b8de5b33fb8e0fa55)


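## 3. Task analysis
Practice task
The task is divided into two subtasks:
1. From the paper's title, abstract, authors, and other information, determine whether the paper is medical literature.
2. From the paper's title, abstract, authors, and other information, extract the paper's keywords.

The first subtask is a binary text classification task: based on its understanding of the abstract and other information, the model must assign each paper to one of two classes, medical or non-medical literature. The second subtask is a keyword recognition task: the model must identify and extract keywords relevant to the content of the given paper.

Dataset analysis
The training and test sets are CSV files whose fields are the title, author, and abstract. Keywords is the label for task 2, and label is the label for task 1. Both the training set and the test set can be read with pandas.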
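As a minimal sketch of reading such a file with pandas, the snippet below parses a tiny in-memory CSV. The two sample rows are invented for illustration; the column names (`uuid`, `title`, `author`, `abstract`, `Keywords`, `label`) are taken from the baseline code in this post, and the real files are assumed to live at `data/train.csv` and `data/test.csv`.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for data/train.csv.
SAMPLE_CSV = io.StringIO(
    "uuid,title,author,abstract,Keywords,label\n"
    "0,Deep learning for tumor detection,A. Author,We detect tumors in MRI scans.,tumor; MRI,1\n"
    "1,Graph algorithms for routing,,We study shortest paths.,routing; graphs,0\n"
)

train = pd.read_csv(SAMPLE_CSV)

# Empty text fields are parsed as NaN, so fill them before concatenating strings.
text_columns = ["title", "author", "abstract"]
train[text_columns] = train[text_columns].fillna("")

# Class balance of the task 1 label.
label_counts = train["label"].value_counts().to_dict()
print(train.columns.tolist())
print(label_counts)
```

Checking the label balance early is useful here because it affects how you split the data and which metric you trust.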

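## 4. Approach & baseline
### Approach
The competition can be divided into two subtasks:
1. From the paper's title, abstract, authors, and other information, determine whether the paper is medical literature.
2. From the paper's title, abstract, authors, and other information, extract the paper's keywords.


#### Task 1: binary text classification
The first subtask is a binary text classification task: based on its understanding of the abstract and other information, the model must assign each paper to one of two classes, medical or non-medical literature.

There are two practical approaches to text classification: traditional feature extraction (such as TF-IDF/BOW) combined with a machine learning model, or modeling with a pretrained BERT model. The steps for feature extraction + machine learning are:
1. Data preprocessing: first preprocess the text, including cleaning (for example removing special characters and punctuation) and tokenization. Common NLP toolkits such as NLTK or spaCy can help.
2. Feature extraction: convert the text into vectors with TF-IDF (term frequency-inverse document frequency) or BOW (bag-of-words). TF-IDF measures how important a word is in a text, while BOW simply counts how often each word occurs. scikit-learn's TfidfVectorizer or CountVectorizer implements both.
3. Build the training and test sets: split the preprocessed text into a training set and a test set, making sure the class distribution is balanced across the split.
4. Choose a machine learning model: pick a suitable model for the situation, such as naive Bayes, support vector machine (SVM), or random forest. These models perform well on text classification. The corresponding scikit-learn classifiers can be used for training and evaluation.
5. Training and evaluation: train the chosen model on the training set, then evaluate it on the test set. Possible metrics include accuracy, precision, recall, and F1.
6. Hyperparameter tuning: if the results are unsatisfactory, adjust the feature extraction parameters (such as the word-frequency threshold or vocabulary size) or the model's parameters to get better performance.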
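The steps above can be sketched end to end as follows. This is an illustration rather than the competition baseline: the toy corpus and labels are made up, TF-IDF is used in place of BOW, and the `test_size`, `random_state`, and `max_iter` values are arbitrary assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = medical, 0 = non-medical.
texts = [
    "clinical trial of a new cancer drug in patients",
    "mri imaging for tumor diagnosis in patients",
    "patients with diabetes receive insulin therapy",
    "surgery outcomes for heart disease patients",
    "vaccine response in clinical patients cohort",
    "tumor cells respond to drug therapy",
    "graph algorithm for shortest path routing",
    "compiler optimization for parallel programs",
    "neural network training on image datasets",
    "distributed database query optimization algorithm",
    "wireless network routing protocol design",
    "parallel algorithm for matrix multiplication",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Step 3: stratified split so both classes appear in train and validation.
X_train, X_valid, y_train, y_valid = train_test_split(
    texts, labels, test_size=0.33, stratify=labels, random_state=0
)

# Steps 2 and 4: TF-IDF features feeding a logistic regression classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Step 5: train, then evaluate with F1 on the held-out split.
clf.fit(X_train, y_train)
valid_pred = clf.predict(X_valid)
valid_f1 = f1_score(y_valid, valid_pred)
print(f"validation F1: {valid_f1:.3f}")
```

Wrapping the vectorizer and the classifier in one pipeline keeps the vocabulary fitted on the training split only, which avoids leaking validation text into the features.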


In the baseline, we use BOW to convert the text into vectors and a logistic regression model for training and evaluation.  
The code is as follows:
# Install the prerequisite dependency
!pip install nltk

# Import pandas to read tabular data
import pandas as pd

# Import BOW (bag-of-words). CountVectorizer can be swapped for TfidfVectorizer (TF-IDF, term frequency-inverse document frequency); remember to change the later usages as well. In my tests the latter works better.
from sklearn.feature_extraction.text import CountVectorizer

# Import the LogisticRegression model
from sklearn.linear_model import LogisticRegression

# Suppress convergence warnings
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)


# Read the datasets
train = pd.read_csv('data/train.csv')
train['title'] = train['title'].fillna('')
train['abstract'] = train['abstract'].fillna('')

test = pd.read_csv('data/test.csv')
test['title'] = test['title'].fillna('')
test['abstract'] = test['abstract'].fillna('')


# Build the text feature for the training and test sets
train['text'] = train['title'].fillna('') + ' ' +  train['author'].fillna('') + ' ' + train['abstract'].fillna('')+ ' ' + train['Keywords'].fillna('')
# The test text uses the test set's own Keywords column, not train['Keywords']
test['text'] = test['title'].fillna('') + ' ' +  test['author'].fillna('') + ' ' + test['abstract'].fillna('')+ ' ' + test['Keywords'].fillna('')

# Fit the vocabulary on the training text, then vectorize both sets
vector = CountVectorizer().fit(train['text'])
train_vector = vector.transform(train['text'])
test_vector = vector.transform(test['text'])


# Create the model
model = LogisticRegression()

# Train the model. LogisticRegression has no batch_size or epoch; consider tuning max_iter or C for better results
model.fit(train_vector, train['label'])

# Predict the label column for the test set
test['label'] = model.predict(test_vector)

# Write the task 1 predictions
test[['uuid', 'Keywords', 'label']].to_csv('submit_task1.csv', index=False)

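#### Task 2: keyword extraction
Paper keywords fall into two categories:
- Keywords that appear in the title and abstract
- Keywords that do not appear in the title and abstract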
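To see how the two categories split on a given paper, a small helper like the one below can be run over the training labels. This is a sketch: `split_keywords` is a hypothetical name, the sample paper is invented, and the `;` separator is an assumption based on how the baseline joins its predicted keywords.

```python
def split_keywords(title, abstract, keywords, sep=";"):
    """Split gold keywords into those present in and absent from the title/abstract."""
    text = f"{title} {abstract}".lower()
    present, absent = [], []
    for keyword in keywords.split(sep):
        keyword = keyword.strip()
        if not keyword:
            continue
        # Case-insensitive substring match; no stemming, so inflected forms count as absent.
        (present if keyword.lower() in text else absent).append(keyword)
    return present, absent


present, absent = split_keywords(
    title="Deep learning for tumor detection",
    abstract="We train a convolutional network on MRI scans.",
    keywords="tumor detection; MRI; oncology",
)
print(present)  # ['tumor detection', 'MRI']
print(absent)   # ['oncology']
```

The share of absent keywords gives a rough ceiling on what any purely extractive method can recover.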

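Keywords that appear in the title and abstract: these keywords are the core content of the text. They usually occur in the title and abstract and summarize the topic or main points. To extract them, you can use:
  - Word frequency statistics: count word frequencies in the title and abstract and pick the most frequent words as keywords. Use a stop-word list to drop words that carry little value or hurt the result.
  - Part-of-speech filtering: use part-of-speech tags to keep nouns, verbs, adjectives, and similar words as keyword candidates.
  - TF-IDF: compute each word's term frequency and inverse document frequency, and pick the words with the highest TF-IDF values as keywords.

Keywords that do not appear in the title and abstract: these keywords may occur in the body of the paper but are not mentioned in the title or abstract. To extract them, you can consider:
  - Text clustering: group texts into topics or categories and extract the keywords of each topic.
  - Context analysis: analyze the context around a keyword to judge its importance and relevance.
  - Machine learning / deep learning methods: train a model with supervised or unsupervised learning to produce keywords that do not appear in the title and abstract.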
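# Import the tokenizer and the n-gram helper.
# word_tokenize needs NLTK's Punkt tokenizer data, which can be fetched once with
# nltk.download('punkt') (newer NLTK releases name the resource 'punkt_tab').
from nltk import word_tokenize, ngrams

# Define stop words: words that occur often but are not important to the article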
stops = [
    'will', 'can', "couldn't", 'same', 'own', "needn't", 'between', "shan't", 'very',
     'so', 'over', 'in', 'have', 'the', 's', 'didn', 'few', 'should', 'of', 'that', 
     'don', 'weren', 'into', "mustn't", 'other', 'from', "she's", 'hasn', "you're",
     'ain', 'ours', 'them', 'he', 'hers', 'up', 'below', 'won', 'out', 'through',
     'than', 'this', 'who', "you've", 'on', 'how', 'more', 'being', 'any', 'no',
     'mightn', 'for', 'again', 'nor', 'there', 'him', 'was', 'y', 'too', 'now',
     'whom', 'an', 've', 'or', 'itself', 'is', 'all', "hasn't", 'been', 'themselves',
     'wouldn', 'its', 'had', "should've", 'it', "you'll", 'are', 'be', 'when', "hadn't",
     "that'll", 'what', 'while', 'above', 'such', 'we', 't', 'my', 'd', 'i', 'me',
     'at', 'after', 'am', 'against', 'further', 'just', 'isn', 'haven', 'down',
     "isn't", "wouldn't", 'some', "didn't", 'ourselves', 'their', 'theirs', 'both',
     're', 'her', 'ma', 'before', "don't", 'having', 'where', 'shouldn', 'under',
     'if', 'as', 'myself', 'needn', 'these', 'you', 'with', 'yourself', 'those',
     'each', 'herself', 'off', 'to', 'not', 'm', "it's", 'does', "weren't", "aren't",
     'were', 'aren', 'by', 'doesn', 'himself', 'wasn', "you'd", 'once', 'because', 'yours',
     'has', "mightn't", 'they', 'll', "haven't", 'but', 'couldn', 'a', 'do', 'hadn',
     "doesn't", 'your', 'she', 'yourselves', 'o', 'our', 'here', 'and', 'his', 'most',
     'about', 'shan', "wasn't", 'then', 'only', 'mustn', 'doing', 'during', 'why',
     "won't", 'until', 'did', "shouldn't", 'which'
]

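# Select keywords by bigram frequency

def extract_keywords_by_freq(title, abstract):
    # Collect all bigrams from the lowercased title and abstract
    ngrams_count = list(ngrams(word_tokenize(title.lower()), 2)) + list(ngrams(word_tokenize(abstract.lower()), 2))
    # No bigrams (for example an empty title and abstract): nothing to extract
    if not ngrams_count:
        return []
    ngrams_count = pd.DataFrame(ngrams_count)
    # Drop bigrams that contain a stop word
    ngrams_count = ngrams_count[~ngrams_count[0].isin(stops)]
    ngrams_count = ngrams_count[~ngrams_count[1].isin(stops)]
    # Drop bigrams that contain a word of 3 characters or fewer
    ngrams_count = ngrams_count[ngrams_count[0].apply(len) > 3]
    ngrams_count = ngrams_count[ngrams_count[1].apply(len) > 3]
    ngrams_count['phrase'] = ngrams_count[0] + ' ' + ngrams_count[1]
    # Keep phrases that occur more than once; return at most the 6 most frequent
    ngrams_count = ngrams_count['phrase'].value_counts()
    ngrams_count = ngrams_count[ngrams_count > 1]
    return list(ngrams_count.index)[:6]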

## Extract keywords for the test set

test_words = []
for _, row in test.iterrows():
    # Read each row's title and abstract and extract keywords
    prediction_keywords = extract_keywords_by_freq(row['title'], row['abstract'])
    # Title-case each extracted keyword phrase
    prediction_keywords = [x.title() for x in prediction_keywords]
    # Fall back to placeholder keywords if none were extracted
    if len(prediction_keywords) == 0:
        prediction_keywords = ['A', 'B']
    test_words.append('; '.join(prediction_keywords))

test['Keywords'] = test_words
test[['uuid', 'Keywords', 'label']].to_csv('submit_task2.csv', index=False)

After running the full baseline, the score on leaderboard A is around 0.97655.

2.2 BERT and parameter tuning

  • Baidu PaddlePaddle (AI Studio) cannot use torch, and neither Aliyun Tianchi nor Google Kaggle could download the BERT model from Hugging Face over the network.

  • In the end I used Google Colab, where transformers ran successfully after a manual installation. Colab's network speed is worth mentioning: it is really fast, at 100-200Mb/s.

  • You can also install the corresponding environment and run it locally. With the number of epochs set to 1, it runs successfully and produces a result.

  • On an M1 Pro, training BERT for 3 epochs took 36 + 18 minutes, and the submission scored 0.42670.

  • I also rented a server with an E5 CPU (36 cores, 72 threads) and ran 10 epochs.

2.3 ChatGLM + LoRA large model

3. About DataWhale-NLP

Registration period: 2023/7/13 - 2023/7/20

My reason for signing up: I am preparing for the postgraduate entrance examination, my school has an internship course next semester, and I wanted to see whether I could get an internship certificate from this camp.

  • But it turns out to be quite troublesome. It requires a first prize, a third prize in 2 tracks, or serving as a publicity ambassador (competing takes time, and recruiting people is a hassle).
  • In addition, the school may require its own internship report to be stamped, so the certificate might not be usable.
  • So I dropped out later. I did not tune the code much and only experimented a little; reviewing for the postgraduate entrance examination is more important.

This is just a brief record; signing up and tagging along with the team took about half a day.


About the publicity ambassador:

  • Method 1: Share our event tweets/event posters to the school’s public account/community/Moments, etc. (recommendation index: ⭐⭐)
  • Method 2: Recreate the event tweets/event posters and share them to the school’s public account/community/Moments, etc. (recommendation index: ⭐⭐⭐)
  • Method 3: Share tweets or posters to personal self-media (recommendation index: ⭐⭐⭐)

Rewards by tier:

  • 10: honorary certificate
  • 20: honorary certificate + stickers + decorative lanyard + keychain
  • 30: honorary certificate + stickers + mouse pad
  • 50: honorary certificate + stickers + stand + resume guidance
  • 60: honorary certificate + stickers + sweater + resume guidance


Origin blog.csdn.net/qq_33957603/article/details/131841264