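Task 2: Paper author statistics
Picking up from the previous section, this task gathers statistics on the authors of all papers. Let's go straight to the code.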
import numpy as np
import pandas as pd
import re
import json
import matplotlib.pyplot as plt

# Read the JSON-lines file one record at a time, keeping only the author-related fields
data = []
with open(r'arxiv-metadata-oai-2019.json', 'r') as f:
    for idx, line in enumerate(f):
        d = json.loads(line)
        d = {'authors': d['authors'], 'categories': d['categories'], 'authors_parsed': d['authors_parsed']}
        data.append(d)
data = pd.DataFrame(data)
data.head()
   authors                                            categories  authors_parsed
0  Sung-Chul Yoon, Philipp Podsiadlowski and Step...  astro-ph    [[Yoon, Sung-Chul, ], [Podsiadlowski, Philipp,...
1  B. Dugmore and PP. Ntumba                          math.AT     [[Dugmore, B., ], [Ntumba, PP., ]]
2  T.V. Zaqarashvili and K Murawski                   astro-ph    [[Zaqarashvili, T. V., ], [Murawski, K, ]]
3  Sezgin Aygun, Ismail Tarhan, Husnu Baysal          gr-qc       [[Aygun, Sezgin, ], [Tarhan, Ismail, ], [Baysa...
4  Antonio Pipino (1,3), Thomas H. Puzia (2,4), a...  astro-ph    [[Pipino, Antonio, ], [Puzia, Thomas H., ], [M...
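The loading loop above can be wrapped in a small reusable helper. The following is only a minimal sketch: `load_fields` is a hypothetical name, and the two sample records are made up so the block runs without the real file.

```python
import io
import json


def load_fields(lines, fields):
    """Parse JSON-lines input and keep only the requested keys of each record."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        records.append({key: record[key] for key in fields})
    return records


# Tiny in-memory stand-in for the real file (made-up records with one extra key, "id")
sample = io.StringIO(
    '{"id": "x1", "authors": "A. Author", "categories": "cs.CV", '
    '"authors_parsed": [["Author", "A.", ""]]}\n'
    '{"id": "x2", "authors": "B. Writer and C. Maker", "categories": "math.AT", '
    '"authors_parsed": [["Writer", "B.", ""], ["Maker", "C.", ""]]}\n'
)
rows = load_fields(sample, ["authors", "categories", "authors_parsed"])
print(rows[0])
```

With the real data, passing the open file handle `f` in place of `sample` and wrapping the result in `pd.DataFrame(...)` should give the same kind of table as above.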
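Data statistics
1. Find the Top 10 most frequent full author names;
2. Find the Top 10 most frequent author surnames (the last word of the name);
3. Count the frequency of the first character of each author's surname.
The code below restricts the analysis to papers whose categories include cs.CV.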
# Keep only the papers whose category string contains cs.CV
data1 = data[data['categories'].apply(lambda x: 'cs.CV' in x)]
data1.head()
      authors                                            categories                     authors_parsed
531   Mahesh Pal                                         cs.NE cs.CV                    [[Pal, Mahesh, ]]
1408  Serguei A. Mokhov, Stephen Sinclair, Ian Cl\'e...  cs.SD cs.CL cs.CV cs.MM cs.NE  [[Mokhov, Serguei A., , for the MARF R&D Group...
3231  Chris Aholt, Bernd Sturmfels, Rekha Thomas         math.AG cs.CV                  [[Aholt, Chris, ], [Sturmfels, Bernd, ], [Thom...
4120  Jos\'e I. Ronda, Antonio Vald\'es and Guillerm...  cs.CV                          [[Ronda, José I., ], [Valdés, Antonio, ], [Gal...
4378  Tanaya Guha and Rabab K. Ward                      cs.CV                          [[Guha, Tanaya, ], [Ward, Rabab K., ]]
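The filter above tests whether 'cs.CV' occurs anywhere in the category string, which is a substring match. A stricter variant splits the string on whitespace and compares whole labels. The sketch below assumes categories are space-separated, as in the output above; `has_category` is a hypothetical helper, and "cs.CVX" is a made-up label used only to show the difference.

```python
def has_category(categories, target):
    """Return True if `target` is one of the space-separated category labels."""
    return target in categories.split()


# "cs.CVX" is not a real category; it only illustrates a substring false positive
examples = ["cs.NE cs.CV", "cs.CV", "math.AG", "cs.CVX"]

exact = [c for c in examples if has_category(c, "cs.CV")]
substring = [c for c in examples if "cs.CV" in c]

print(exact)
print(substring)
```

The helper could be dropped into the existing filter as `data[data['categories'].apply(lambda x: has_category(x, 'cs.CV'))]`. For cs.CV itself the two approaches will likely select the same rows; the exact match mainly matters when one label is a prefix of another.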
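# Flatten the per-paper author lists into one list of name-part entries
Authors = sum(data1['authors_parsed'], [])
# Concatenate the parts of each entry with no separator ('Pal' + 'Mahesh' gives 'PalMahesh')
Authors_names = [''.join(x) for x in Authors]
Authors_names = pd.DataFrame(Authors_names)
Authors_names.head()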
   0
0  PalMahesh
1  MokhovSerguei A.for the MARF R&D Group
2  SinclairStephenfor the MARF R&D Group
3  ClémentIanfor the MARF R&D Group
4  NicolacopoulosDimitriosfor the MARF R&D Group
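Because the name parts are joined with an empty string, the resulting names run together. A more readable variant, shown here only as a sketch on a few hand-copied entries (the repeated "Pal" paper is made up for illustration), flattens the lists with `itertools.chain.from_iterable`, joins the non-empty parts with a space, and counts with `collections.Counter`.

```python
from collections import Counter
from itertools import chain

# Small stand-in for data1['authors_parsed']; the third paper is a made-up repeat
authors_parsed = [
    [["Pal", "Mahesh", ""]],
    [["Guha", "Tanaya", ""], ["Ward", "Rabab K.", ""]],
    [["Pal", "Mahesh", ""]],
]

# Flatten in a single pass instead of repeated list concatenation
authors = list(chain.from_iterable(authors_parsed))

# Join only the non-empty parts, separated by a space
names = [" ".join(part for part in entry if part) for entry in authors]

top10 = Counter(names).most_common(10)
print(top10)
```

`sum(list_of_lists, [])` builds a new list at every step, so its cost grows quadratically with the number of papers; `chain.from_iterable` avoids that, which may matter on larger subsets of the data.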
# Horizontal bar chart of the 10 most frequent author names
plt.figure(figsize=(10, 6))
Authors_names[0].value_counts().head(10).plot(kind='barh')
<matplotlib.axes._subplots.AxesSubplot at 0x267230dbc40>
# Same chart, now with explicit tick labels and axis labels
plt.figure(figsize=(10, 6))
Authors_names[0].value_counts().head(10).plot(kind='barh')
names = Authors_names[0].value_counts().index.values[:10]
_ = plt.yticks(range(0, len(names)), names)
plt.ylabel('Author')
plt.xlabel('Count')
Text(0.5, 0, 'Count')
Next, count the surnames, taking the first element of each author entry in the authors_parsed field:
Authors_lastnames = [x[0] for x in Authors]
Authors_lastnames = pd.DataFrame(Authors_lastnames)
Authors_lastnames.head()
   0
0  Pal
1  Mokhov
2  Sinclair
3  Clément
4  Nicolacopoulos
# Horizontal bar chart of the 10 most frequent surnames
plt.figure(figsize=(10, 6))
Authors_lastnames[0].value_counts().head(10).plot(kind='barh')
names = Authors_lastnames[0].value_counts().index.values[:10]
_ = plt.yticks(range(0, len(names)), names)
plt.ylabel('Author')
plt.xlabel('Count')
Text(0.5, 0, 'Count')
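The third statistic, the frequency of the first character of each surname, is not worked out above. One possible approach is sketched here on a handful of surnames taken from the outputs shown earlier; the empty string is a made-up entry added to show why a guard is useful, and upper-casing the character is an assumption about how the counts should be grouped.

```python
from collections import Counter

# Sample surnames from the outputs above, plus one made-up empty entry
lastnames = ["Pal", "Mokhov", "Sinclair", "Clément", "Nicolacopoulos", "", "Puzia"]

# Skip empty surnames, then count the upper-cased first character of each
first_chars = Counter(name[0].upper() for name in lastnames if name)

print(first_chars.most_common())
```

On the full DataFrame, something like `Authors_lastnames[0].str[0].value_counts()` should give a comparable result in pandas, although it would not upper-case the characters.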