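Task 4: Paper Category Classification
This part is not finished yet. The author is publishing it early for reference and will keep adding to it, so please be kind.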
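4.1 Task description
Topic: paper classification (a data modeling task). Build a model on existing data and use it to assign categories to new papers.
Content: classify papers by category using their titles (the code below also appends the abstract).
Outcome: learn the basic methods of text classification, such as TF-IDF.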
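4.3 Approaches to text classification
Approach 1: TF-IDF + machine learning classifier
Extract features from the text directly with TF-IDF, then classify them with a classifier such as SVM, LR (logistic regression), or XGBoost.
Approach 2: FastText
FastText is the entry-level word-vector option. The FastText tool released by Facebook makes it quick to build a classifier.
Approach 3: Word2Vec + deep learning classifier
Word2Vec is the intermediate word-vector option. The vectors are fed into a deep learning classifier, whose network can be TextCNN, TextRNN, or BiLSTM.
Approach 4: BERT word vectors
BERT is the high-end word-vector option, with strong modeling and learning capacity.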
import pandas as pd
import numpy as np
import re
import json
import matplotlib.pyplot as plt

data = []  # records collected from the file
# A with statement closes the file handle automatically, even if reading raises an exception
with open(r'arxiv-metadata-oai-2019.json', 'r') as f:
    for idx, line in enumerate(f):
        d = json.loads(line)
        d = {'title': d['title'], 'categories': d['categories'], 'abstract': d['abstract']}
        data.append(d)
        # Use only part of the data
        if idx > 200000:
            break
data = pd.DataFrame(data)  # convert the list to a DataFrame for analysis with pandas
data.head()
| | title | categories | abstract |
|---|---|---|---|
0 | Remnant evolution after a carbon-oxygen white ... | astro-ph | We systematically explore the evolution of t... |
1 | Cofibrations in the Category of Frolicher Spac... | math.AT | Cofibrations are defined in the category of ... |
2 | Torsional oscillations of longitudinally inhom... | astro-ph | We explore the effect of an inhomogeneous ma... |
3 | On the Energy-Momentum Problem in Static Einst... | gr-qc | This paper has been removed by arXiv adminis... |
4 | The Formation of Globular Cluster Systems in M... | astro-ph | The most massive elliptical galaxies show a ... |
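The loading step above reads a JSON Lines file (one JSON object per line) and keeps three fields from each record. Because the check `idx > 200000` runs after `append`, that loop keeps indices 0 through 200001. The sketch below shows the same pattern with an exact cap via `itertools.islice`. The helper name `read_json_lines` and the three sample records are made up for illustration, and an in-memory `io.StringIO` stands in for the real file.

```python
import io
import json
from itertools import islice


def read_json_lines(handle, fields, limit):
    """Parse at most `limit` JSON Lines records, keeping only `fields`."""
    records = []
    for line in islice(handle, limit):
        record = json.loads(line)
        records.append({key: record[key] for key in fields})
    return records


# Made-up records standing in for the arXiv metadata file.
sample = io.StringIO(
    '{"id": "1", "title": "Paper A", "categories": "math.AT", "abstract": "First."}\n'
    '{"id": "2", "title": "Paper B", "categories": "astro-ph", "abstract": "Second."}\n'
    '{"id": "3", "title": "Paper C", "categories": "gr-qc cs.LG", "abstract": "Third."}\n'
)

rows = read_json_lines(sample, fields=("title", "categories", "abstract"), limit=2)
print(rows)
```

The resulting list of dictionaries can be passed to `pd.DataFrame` exactly as in the code above.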
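To simplify processing, we concatenate the title and the abstract and classify on the combined text.
# Join with a space so the last title word and the first abstract word do not fuse
data['text'] = data['title'] + ' ' + data['abstract']
# Replace line breaks with spaces
data['text'] = data['text'].apply(lambda x: x.replace('\n', ' '))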
data['text'].head()
0 Remnant evolution after a carbon-oxygen white ...
1 Cofibrations in the Category of Frolicher Spac...
2 Torsional oscillations of longitudinally inhom...
3 On the Energy-Momentum Problem in Static Einst...
4 The Formation of Globular Cluster Systems in M...
Name: text, dtype: object
# Lowercase the text
data['text'] = data['text'].apply(lambda x: x.lower())
data['text'].head()
0 remnant evolution after a carbon-oxygen white ...
1 cofibrations in the category of frolicher spac...
2 torsional oscillations of longitudinally inhom...
3 on the energy-momentum problem in static einst...
4 the formation of globular cluster systems in m...
Name: text, dtype: object
# The raw columns are no longer needed once 'text' exists
data = data.drop(['abstract', 'title'], axis=1)
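The cleaning steps above (join, replace line breaks, lowercase) can also be collected into one function. This is a minimal sketch, not the author's code: the name `normalize_text` is hypothetical, and it goes slightly further than the original by collapsing every run of whitespace into a single space.

```python
import re


def normalize_text(title, abstract):
    """Join title and abstract, collapse whitespace, and lowercase."""
    combined = f"{title} {abstract}"
    # \s+ matches newlines, tabs, and repeated spaces alike.
    return re.sub(r"\s+", " ", combined).strip().lower()


example = normalize_text(
    "Remnant evolution\n  after a merger",
    "  We explore\nthe evolution. ",
)
print(example)
```

With a DataFrame that still has both columns, such a function could be applied row by row before the raw columns are dropped.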
A paper can belong to more than one category.
# All categories, sub-categories included
data['categories'] = data['categories'].apply(lambda x: x.split(' '))
data['categories'].head()
0 [astro-ph]
1 [math.AT]
2 [astro-ph]
3 [gr-qc]
4 [astro-ph]
Name: categories, dtype: object
# Top-level categories only: drop the sub-category after the dot
data['categories_single'] = data['categories'].apply(lambda x: [xx.split('.')[0] for xx in x])
data['categories_single'].head()
0 [astro-ph]
1 [math]
2 [astro-ph]
3 [gr-qc]
4 [astro-ph]
Name: categories_single, dtype: object
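To make the top-level mapping concrete, here is a small standalone sketch. With the list comprehension above, a paper tagged `math.AT math.CO` yields `['math', 'math']`. The hypothetical helper below additionally removes such repeats and sorts the result, which is optional and shown only to keep the lists tidy.

```python
def top_level_categories(categories):
    """Map a space-separated category string to sorted, unique top-level names."""
    return sorted({name.split(".")[0] for name in categories.split()})


single = top_level_categories("astro-ph")
repeated = top_level_categories("math.AT math.CO")
mixed = top_level_categories("gr-qc cs.LG")
print(single, repeated, mixed)
```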
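Encode the categories. A paper can carry several categories at once, so we need a multi-label (multi-hot) encoding:
import sklearn
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
data_label = mlb.fit_transform(data['categories_single'].iloc[:])  # multi-hot label matrix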
data_label[:5]
array([[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
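Each row above has one column per top-level category, with a 1 wherever the paper carries that category. Comparing the rows with the category lists shows that column 3 is `astro-ph`, column 19 is `math`, and column 14 is `gr-qc`. In sklearn the full column order is available as `mlb.classes_`. The sketch below reproduces the idea of the encoding in plain Python on made-up labels. The function name `multi_hot` is hypothetical and this is not how sklearn implements it internally.

```python
def multi_hot(label_lists):
    """Return (classes, rows): sorted class names and one 0/1 row per sample."""
    classes = sorted({label for labels in label_lists for label in labels})
    column = {name: i for i, name in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1  # repeated labels simply set the same cell again
        rows.append(row)
    return classes, rows


classes, rows = multi_hot([["astro-ph"], ["math"], ["gr-qc", "cs"]])
print(classes)
print(rows)
```

Keeping the class list next to the matrix makes it possible to translate the numeric labels in a later evaluation report back into category names.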
Approach 1
Approach 1 extracts features with TF-IDF, keeping at most 4,000 terms in the vocabulary:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=4000)
data_tfidf = vectorizer.fit_transform(data['text'].iloc[:])
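To show what the vectorizer computes, the sketch below works out TF-IDF weights by hand for three made-up documents. It assumes the default settings of `TfidfVectorizer`: raw term counts, smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, tokens of at least two word characters, and L2 normalization of each document vector. It does not implement the `max_features` cut, which keeps only the most frequent terms across the corpus. The function name `tfidf` is illustrative.

```python
import math
import re
from collections import Counter


def tfidf(documents):
    """Smoothed, L2-normalized TF-IDF weights as one dict per document."""
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


docs = ["galaxy cluster formation", "galaxy spectra", "category theory"]
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in sorted(vec.items())})
```

In the first document, "galaxy" receives a lower weight than "cluster" because it also appears in the second document, which is the behavior that makes TF-IDF useful for separating categories.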
Because this is a multi-label problem, the base classifier is wrapped in sklearn's MultiOutputClassifier, which fits one classifier per label:
# Split into training and validation sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data_tfidf, data_label,
                                                    test_size=0.2, random_state=1)
# Build the multi-label classification model
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
clf = MultiOutputClassifier(MultinomialNB()).fit(x_train, y_train)
from sklearn.metrics import classification_report
print(classification_report(y_test, clf.predict(x_test)))
precision recall f1-score support
0 0.00 0.00 0.00 0
1 0.00 0.00 0.00 1
2 0.00 0.00 0.00 0
3 0.91 0.85 0.88 3625
4 0.00 0.00 0.00 4
5 0.00 0.00 0.00 0
6 0.00 0.00 0.00 1
7 0.00 0.00 0.00 0
8 0.77 0.76 0.77 3801
9 0.84 0.89 0.86 10715
10 0.00 0.00 0.00 0
11 0.00 0.00 0.00 186
12 0.44 0.41 0.42 1621
13 0.00 0.00 0.00 1
14 0.75 0.59 0.66 1096
15 0.61 0.80 0.69 1078
16 0.90 0.19 0.32 242
17 0.53 0.67 0.59 1451
18 0.71 0.54 0.62 1400
19 0.88 0.84 0.86 10243
20 0.40 0.09 0.15 934
21 0.00 0.00 0.00 1
22 0.87 0.03 0.06 414
23 0.48 0.65 0.55 517
24 0.37 0.33 0.35 539
25 0.00 0.00 0.00 1
26 0.60 0.42 0.49 3891
27 0.00 0.00 0.00 0
28 0.82 0.08 0.15 676
29 0.86 0.12 0.21 297
30 0.80 0.40 0.53 1714
31 0.00 0.00 0.00 4
32 0.56 0.65 0.60 3398
33 0.00 0.00 0.00 0
micro avg 0.76 0.70 0.72 47851
macro avg 0.39 0.27 0.29 47851
weighted avg 0.75 0.70 0.71 47851
samples avg 0.74 0.76 0.72 47851
C:\Users\zhoukaiwei\AppData\Roaming\Python\Python38\site-packages\sklearn\metrics\_classification.py:1245: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
C:\Users\zhoukaiwei\AppData\Roaming\Python\Python38\site-packages\sklearn\metrics\_classification.py:1245: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
C:\Users\zhoukaiwei\AppData\Roaming\Python\Python38\site-packages\sklearn\metrics\_classification.py:1245: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in samples with no predicted labels. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
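Many rows in the report have a support of 0 or 1, so their precision, recall, and F1 are undefined or near zero and are reported as 0.00 (the warnings refer to this, and `zero_division` controls the substituted value). The macro average gives each of the 34 labels equal weight, which is why macro F1 (0.29) sits far below micro F1 (0.72). The sketch below recomputes per-label scores in plain Python on a made-up 4x3 example to show the effect. The function name `per_label_scores` is hypothetical and the sketch substitutes 0.0 whenever a denominator is zero.

```python
def per_label_scores(y_true, y_pred):
    """Precision, recall, F1, and support for each column of two 0/1 matrices."""
    scores = []
    for col in range(len(y_true[0])):
        pairs = [(t[col], p[col]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores.append(
            {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}
        )
    return scores


# Made-up labels: the third column never occurs and is never predicted.
y_true = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 0, 0], [1, 1, 0]]

scores = per_label_scores(y_true, y_pred)
macro_f1 = sum(s["f1"] for s in scores) / len(scores)
print(scores)
print(round(macro_f1, 3))
```

The empty third label contributes an F1 of 0.0 to the macro average even though the classifier made no mistake on it, which mirrors the rows with zero support in the report above.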