Task 2: Paper Author Statistics

Continuing from the previous section, this part collects statistics on the authors of all papers. Let's go straight to the code.

import numpy as np
import pandas as pd
import re
import json
import matplotlib.pyplot as plt

# Read the metadata file line by line (one JSON record per line),
# keeping only the fields needed for the author statistics.
data = []
with open(r'arxiv-metadata-oai-2019.json', 'r') as f:
    for idx, line in enumerate(f):
        d = json.loads(line)
        d = {
            'authors': d['authors'],
            'categories': d['categories'],
            'authors_parsed': d['authors_parsed'],
        }
        data.append(d)
data = pd.DataFrame(data)
data.head()

	authors	categories	authors_parsed
0	Sung-Chul Yoon, Philipp Podsiadlowski and Step...	astro-ph	[[Yoon, Sung-Chul, ], [Podsiadlowski, Philipp,...
1	B. Dugmore and PP. Ntumba	math.AT	[[Dugmore, B., ], [Ntumba, PP., ]]
2	T.V. Zaqarashvili and K Murawski	astro-ph	[[Zaqarashvili, T. V., ], [Murawski, K, ]]
3	Sezgin Aygun, Ismail Tarhan, Husnu Baysal	gr-qc	[[Aygun, Sezgin, ], [Tarhan, Ismail, ], [Baysa...
4	Antonio Pipino (1,3), Thomas H. Puzia (2,4), a...	astro-ph	[[Pipino, Antonio, ], [Puzia, Thomas H., ], [M...

Data statistics

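1. Count the Top 10 most frequent full author names;
2. Count the Top 10 most frequent author surnames (the last word of the name);
3. Count the frequency of the first character of author surnames.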

# Select the papers whose categories include cs.CV
data1 = data[data['categories'].apply(lambda x: 'cs.CV' in x)]
data1.head()

	authors	categories	authors_parsed
531	Mahesh Pal	cs.NE cs.CV	[[Pal, Mahesh, ]]
1408	Serguei A. Mokhov, Stephen Sinclair, Ian Cl\'e...	cs.SD cs.CL cs.CV cs.MM cs.NE	[[Mokhov, Serguei A., , for the MARF R&D Group...
3231	Chris Aholt, Bernd Sturmfels, Rekha Thomas	math.AG cs.CV	[[Aholt, Chris, ], [Sturmfels, Bernd, ], [Thom...
4120	Jos\'e I. Ronda, Antonio Vald\'es and Guillerm...	cs.CV	[[Ronda, José I., ], [Valdés, Antonio, ], [Gal...
4378	Tanaya Guha and Rabab K. Ward	cs.CV	[[Guha, Tanaya, ], [Ward, Rabab K., ]]
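
The filter above uses a substring test on the raw `categories` string. Since that field is a space-separated list of tags (as the output shows), a stricter alternative is to split it and compare whole tags. The following is a minimal sketch; the helper name `has_category` and the tag `cs.CVX` are hypothetical and only serve the illustration.

```python
def has_category(categories: str, target: str = "cs.CV") -> bool:
    """Return True if `target` is one of the space-separated category tags."""
    return target in categories.split()


# Toy category strings; "cs.CVX" is a made-up tag that a substring
# test would match but a whole-tag comparison rejects.
rows = ["cs.NE cs.CV", "math.AG cs.CV", "astro-ph", "cs.CVX"]
matches = [c for c in rows if has_category(c)]
print(matches)
```

With the DataFrame from this post, the same helper could be applied as `data[data['categories'].apply(has_category)]`.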

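# Flatten the per-paper author lists into a single list, one element per author.
Authors = sum(data1['authors_parsed'], [])

# Note: joining with '' concatenates the name parts without a separator.
Authors_names = [''.join(x) for x in Authors]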
Authors_names = pd.DataFrame(Authors_names)
Authors_names.head()

	0
0	PalMahesh
1	MokhovSerguei A.for the MARF R&D Group
2	SinclairStephenfor the MARF R&D Group
3	ClémentIanfor the MARF R&D Group
4	NicolacopoulosDimitriosfor the MARF R&D Group
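
Because the name parts are joined without a separator, the labels come out as `PalMahesh`. If readable labels are preferred, one option is to format each entry as "First Last". The sketch below assumes every `authors_parsed` entry starts with the surname followed by the given name (as in the output shown earlier) and ignores any extra fields; `display_name` is a hypothetical helper. It also uses `itertools.chain.from_iterable` for the flattening step, which runs in linear time, whereas `sum(lists, [])` copies the growing list on every addition.

```python
from itertools import chain


def display_name(parsed):
    """Format one authors_parsed entry as 'First Last', ignoring extra fields."""
    last, first = parsed[0], parsed[1]
    return f"{first} {last}".strip()


# Toy data: two papers, taken from the rows shown in the output above.
papers = [
    [["Pal", "Mahesh", ""]],
    [["Guha", "Tanaya", ""], ["Ward", "Rabab K.", ""]],
]

# Flatten the per-paper lists into one list of author entries.
authors = list(chain.from_iterable(papers))
names = [display_name(a) for a in authors]
print(names)  # ['Mahesh Pal', 'Tanaya Guha', 'Rabab K. Ward']
```

The same flattening should work on the pandas column, e.g. `list(chain.from_iterable(data1['authors_parsed']))`.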

# Plot the 10 most frequent author names
plt.figure(figsize=(10,6))
Authors_names[0].value_counts().head(10).plot(kind = 'barh')

<matplotlib.axes._subplots.AxesSubplot at 0x267230dbc40>

[Figure: horizontal bar chart of the 10 most frequent author names]

# Refine the plot: set the tick labels and axis labels
plt.figure(figsize=(10,6))
Authors_names[0].value_counts().head(10).plot(kind = 'barh')
names = Authors_names[0].value_counts().index.values[:10]
_ = plt.yticks(range(0, len(names)), names)
plt.ylabel('Author')
plt.xlabel('Count')

Text(0.5, 0, 'Count')

[Figure: horizontal bar chart of the 10 most frequent author names, y-axis "Author", x-axis "Count"]

Next, count the surnames, i.e. the first element of each author entry in the authors_parsed field:

Authors_lastnames = [x[0] for x in Authors]
Authors_lastnames = pd.DataFrame(Authors_lastnames)
Authors_lastnames.head()

	0
0	Pal
1	Mokhov
2	Sinclair
3	Clément
4	Nicolacopoulos

plt.figure(figsize=(10, 6))
Authors_lastnames[0].value_counts().head(10).plot(kind='barh')

names = Authors_lastnames[0].value_counts().index.values[:10]
_ = plt.yticks(range(0, len(names)), names)
plt.ylabel('Author')
plt.xlabel('Count')

Text(0.5, 0, 'Count')

[Figure: horizontal bar chart of the 10 most frequent author surnames, y-axis "Author", x-axis "Count"]
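
The third item in the task list, the frequency of the first character of each surname, is not covered by the code in this post. A minimal sketch of one way to do it with `collections.Counter` follows. It assumes the surname is the first element of each entry, as in the surname step, and folds letters to upper case, which is a choice made for this example and not something the task specifies. The sample rows are a toy subset of the names shown in the outputs.

```python
from collections import Counter

# Toy subset of authors_parsed entries (surname first).
authors = [
    ["Pal", "Mahesh", ""],
    ["Mokhov", "Serguei A.", ""],
    ["Sinclair", "Stephen", ""],
    ["Sturmfels", "Bernd", ""],
    ["Aholt", "Chris", ""],
]

# Take the first character of each non-empty surname and count occurrences.
first_chars = Counter(a[0][0].upper() for a in authors if a[0])
print(first_chars.most_common(1))  # [('S', 2)]
```

On the real data, the same expression could be run over the flattened `Authors` list, and the counts passed to `pd.Series(first_chars).sort_values()` for plotting.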
