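Paper Data Analysis 2 (Author Statistics)

Task 2: Paper Author Statistics

Continuing from the previous section, this part computes statistics on the authors of all papers. Without further ado, here is the code.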

import numpy as np
import pandas as pd
import re
import json
import matplotlib.pyplot as plt

# Read the metadata file line by line (one JSON record per line),
# keeping only the fields needed for the author statistics.
data = []
with open(r'arxiv-metadata-oai-2019.json', 'r') as f:
    for line in f:
        d = json.loads(line)
        d = {
            'authors': d['authors'],
            'categories': d['categories'],
            'authors_parsed': d['authors_parsed'],
        }
        data.append(d)
data = pd.DataFrame(data)
data.head()
   authors                                            categories  authors_parsed
0  Sung-Chul Yoon, Philipp Podsiadlowski and Step...  astro-ph    [[Yoon, Sung-Chul, ], [Podsiadlowski, Philipp,...
1  B. Dugmore and PP. Ntumba                          math.AT     [[Dugmore, B., ], [Ntumba, PP., ]]
2  T.V. Zaqarashvili and K Murawski                   astro-ph    [[Zaqarashvili, T. V., ], [Murawski, K, ]]
3  Sezgin Aygun, Ismail Tarhan, Husnu Baysal          gr-qc       [[Aygun, Sezgin, ], [Tarhan, Ismail, ], [Baysa...
4  Antonio Pipino (1,3), Thomas H. Puzia (2,4), a...  astro-ph    [[Pipino, Antonio, ], [Puzia, Thomas H., ], [M...
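
The loading loop above keeps only three fields from each record. As a minimal sketch of the same idea on an in-memory sample, the helper below parses JSON Lines input one line at a time and skips blank lines. The names `iter_records` and `KEEP` are hypothetical and not part of the original post, and the sample records are made up for illustration.

```python
import io
import json

# Fields kept from each record (the same three as in the loop above).
KEEP = ("authors", "categories", "authors_parsed")


def iter_records(lines, keep=KEEP):
    """Yield one dict per non-empty JSON line, restricted to the `keep` fields."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the input
        record = json.loads(line)
        yield {key: record[key] for key in keep}


# Made-up sample standing in for the real metadata file.
sample = io.StringIO(
    '{"id": "1", "authors": "A. Author and B. Writer", "categories": "cs.CV", '
    '"authors_parsed": [["Author", "A.", ""], ["Writer", "B.", ""]]}\n'
    '\n'
    '{"id": "2", "authors": "C. Coder", "categories": "math.AT cs.CV", '
    '"authors_parsed": [["Coder", "C.", ""]]}\n'
)
records = list(iter_records(sample))
print(len(records), sorted(records[0]))
```

With the real file, the same generator could be fed an open file handle and its result passed to `pd.DataFrame(list(...))`.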

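Data statistics

1. Find the Top 10 most frequent author full names;
2. Find the Top 10 most frequent author surnames (the last word of the name);
3. Compute the frequency of the first character of author surnames;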

# Select the papers whose categories include cs.CV
data1 = data[data['categories'].apply(lambda x: 'cs.CV' in x)]
data1.head()
      authors                                            categories                     authors_parsed
531   Mahesh Pal                                         cs.NE cs.CV                    [[Pal, Mahesh, ]]
1408  Serguei A. Mokhov, Stephen Sinclair, Ian Cl\'e...  cs.SD cs.CL cs.CV cs.MM cs.NE  [[Mokhov, Serguei A., , for the MARF R&D Group...
3231  Chris Aholt, Bernd Sturmfels, Rekha Thomas         math.AG cs.CV                  [[Aholt, Chris, ], [Sturmfels, Bernd, ], [Thom...
4120  Jos\'e I. Ronda, Antonio Vald\'es and Guillerm...  cs.CV                          [[Ronda, José I., ], [Valdés, Antonio, ], [Gal...
4378  Tanaya Guha and Rabab K. Ward                      cs.CV                          [[Guha, Tanaya, ], [Ward, Rabab K., ]]
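
The filter above uses a substring test on the space-separated `categories` string. That works for `cs.CV`, but a substring test can in principle also match a longer identifier that merely contains the target text. A minimal sketch of an exact-token variant follows. `has_category` is a hypothetical helper name, and the `math.A` target is only an illustrative prefix, not a real category.

```python
def has_category(categories: str, target: str) -> bool:
    """Return True if `target` is one of the whitespace-separated categories."""
    return target in categories.split()


# Category strings copied from the table above.
print(has_category("cs.NE cs.CV", "cs.CV"))     # True: exact token present
print(has_category("math.AG cs.CV", "math.A"))  # False: only a prefix of a token
print("math.A" in "math.AG cs.CV")              # True: the substring test matches anyway
```

In the DataFrame code above, this could be applied as `data['categories'].apply(lambda x: has_category(x, 'cs.CV'))`.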
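# Flatten the author entries into a single list, one element per author.
Authors = sum(data1['authors_parsed'], [])
# Concatenate the parts of each entry (surname, given names, suffix) into one string.
Authors_names = [''.join(x) for x in Authors]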
Authors_names = pd.DataFrame(Authors_names)
Authors_names.head()
0
0 PalMahesh
1 MokhovSerguei A.for the MARF R&D Group
2 SinclairStephenfor the MARF R&D Group
3 ClémentIanfor the MARF R&D Group
4 NicolacopoulosDimitriosfor the MARF R&D Group
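
The names above run together because the parts are joined with an empty string, and `sum(..., [])` builds a new list on every addition. One possible alternative, sketched below under the assumption that each `authors_parsed` entry is a list of strings such as `[surname, given names, suffix]`, flattens with `itertools.chain` and joins the non-empty parts with spaces. `full_name` is a hypothetical helper. The sample rows follow the layout shown in the table above, with one row repeated on purpose so that a name occurs twice.

```python
from collections import Counter
from itertools import chain


def full_name(parts):
    """Join the non-empty parts of one parsed author entry with single spaces."""
    return " ".join(p.strip() for p in parts if p.strip())


# Sample rows in the authors_parsed layout; the first row is repeated
# deliberately so that the counts are not all equal.
rows = [
    [["Pal", "Mahesh", ""]],
    [["Guha", "Tanaya", ""], ["Ward", "Rabab K.", ""]],
    [["Pal", "Mahesh", ""]],
]

authors = list(chain.from_iterable(rows))  # one entry per author
names = [full_name(a) for a in authors]    # surname first, as in authors_parsed
top = Counter(names).most_common(10)       # (name, count) pairs, most frequent first
print(top)
```

This keeps the surname-first order of `authors_parsed`. Whether to reorder the parts into "given name surname" is a presentation choice.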
# Plot the 10 most frequent author names
plt.figure(figsize=(10,6))
Authors_names[0].value_counts().head(10).plot(kind = 'barh')
<matplotlib.axes._subplots.AxesSubplot at 0x267230dbc40>

[Figure: horizontal bar chart of the 10 most frequent author names]

# Refine the plot: set the tick labels and add axis labels
plt.figure(figsize=(10,6))
Authors_names[0].value_counts().head(10).plot(kind = 'barh')
names = Authors_names[0].value_counts().index.values[:10]
_=plt.yticks(range(0,len(names)),names)
plt.ylabel('Author')
plt.xlabel('Count')
Text(0.5, 0, 'Count')

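[Figure: horizontal bar chart of the 10 most frequent author names, with the axes labeled "Author" and "Count"]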

Count surnames, taking the first element of each author entry in the authors_parsed field:

Authors_lastnames = [x[0] for x in Authors]
Authors_lastnames = pd.DataFrame(Authors_lastnames)
Authors_lastnames.head()
0
0 Pal
1 Mokhov
2 Sinclair
3 Clément
4 Nicolacopoulos
plt.figure(figsize=(10, 6))
Authors_lastnames[0].value_counts().head(10).plot(kind='barh')

names = Authors_lastnames[0].value_counts().index.values[:10]
_ = plt.yticks(range(0, len(names)), names)
plt.ylabel('Author')
plt.xlabel('Count')
Text(0.5, 0, 'Count')

[Figure: horizontal bar chart of the 10 most frequent surnames]
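
The third statistic in the task list, the frequency of the first character of each surname, is not worked through above. A minimal sketch follows, assuming the surname is the first element of each `authors_parsed` entry as in the previous step. `first_char_counts` is a hypothetical helper, upper-casing the character is an added assumption, and the sample surnames are the five printed by `Authors_lastnames.head()` above.

```python
from collections import Counter


def first_char_counts(surnames):
    """Count the upper-cased first character of every non-empty surname."""
    return Counter(s.strip()[0].upper() for s in surnames if s.strip())


# The five surnames shown in the output above.
surnames = ["Pal", "Mokhov", "Sinclair", "Clément", "Nicolacopoulos"]
counts = first_char_counts(surnames)
print(counts.most_common())
```

On the full data this could be called as `first_char_counts(x[0] for x in Authors)`. A pandas equivalent would be `Authors_lastnames[0].str[0].value_counts()`, which does not upper-case the characters.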



Reposted from blog.csdn.net/qq_36559719/article/details/112727361