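Python for Data Analysis: US Baby Names Practice
This is a practice run of the Chapter 2 project from the book, US baby names from 1880 to 2010. The names data is available on GitHub.
Summary of the workflow:
#1. Derive the base data from the conditions, using e.g. loc, value_counts, T, unstack, reindex, df[df], apply
#2. Build a pivot_table from the base data
#3. Reshape the pivot_table to extract the data of interest, using e.g. loc, value_counts, T, unstack, reindex, df[df], apply
#4. Draw the charts with plot, e.g. subplots, bar, line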
import pandas as pd
import numpy as np
%matplotlib inline
%matplotlib notebook
import matplotlib.pyplot as plt
import sys
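# US Baby Names 1880-2010
# Read the data
# Pivot on year and sex
# Add a prop column: each name's births as a fraction of total births in its year/sex group
# Check that prop sums to 1 within each group
# Take the top 1000 names for each sex/year combination
# Drop the group index from top1000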
pieces = []
years = range(1880,2011)
for year in years:
    location = 'data/names/yob{}.txt'.format(year)
    frame = pd.read_csv(location,names=['name','sex','births'])
    frame['year']=year
    pieces.append(frame)
names = pd.concat(pieces)
table1 = names.pivot_table('births',columns = ['sex'],index='year',aggfunc=sum)
table1.plot(title='group by year,sex')
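# transform('sum') broadcasts each year/sex total back onto its own rows,
# so the index and the year/sex columns of names stay unchanged
group_total = names.groupby(by=['year','sex'])['births'].transform('sum')
names['prop'] = names['births']/group_total
names.head()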
np.allclose(names.groupby(by=['year','sex'])['prop'].sum(),1)
#True
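def get_top1000(group):
    # Sort explicitly rather than relying on the order of rows in the files
    return group.sort_values(by='births',ascending=False)[:1000]
top1000 = names.groupby(['year','sex']).apply(get_top1000)
top1000
# Discard the group index; year and sex are still available as columns
top1000 = top1000.reset_index(drop=True)
top1000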
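The two group-wise steps in this part (a per-group proportion and a per-group top N) can be tried without the data files. The following is only a minimal sketch on made-up numbers: the frame `toy`, its counts, and the cutoff of 2 are illustrative assumptions, not values from the names data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the yobYYYY.txt files.
toy = pd.DataFrame({
    'name':   ['Mary', 'Anna', 'Emma', 'John', 'William', 'James'],
    'sex':    ['F', 'F', 'F', 'M', 'M', 'M'],
    'births': [60, 30, 10, 50, 30, 20],
    'year':   [1880] * 6,
})

# Share of each name within its (year, sex) group.
group_total = toy.groupby(['year', 'sex'])['births'].transform('sum')
toy['prop'] = toy['births'] / group_total

# Each group's shares should add up to 1.
prop_sums = toy.groupby(['year', 'sex'])['prop'].sum()
assert np.allclose(prop_sums, 1)

# Top 2 names per group (the article keeps 1000).
toy_top = (toy.sort_values(by='births', ascending=False)
              .groupby(['year', 'sex'])
              .head(2)
              .reset_index(drop=True))
print(toy_top)
```

Sorting once and calling `head` on the grouped frame is one alternative to a custom function passed to `apply`; either way the result keeps `year` and `sex` as ordinary columns.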
#2.
# Analyze naming trends
# Split into boys and girls
# Build a pivot table by year and name
# 'John','Harry','Mary','Marilyn'
boys = top1000[top1000['sex']=='M']
girls = top1000[top1000['sex']=='F']
table2 = top1000.pivot_table('births',columns=['name'],index='year',aggfunc=sum)
table2
several_names = ['John','Harry','Mary','Marilyn']
table3 = table2[several_names]
table3.plot(subplots=True)
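# 3.
# Measure the increase in naming diversity
# Compute the share of births covered by the top 1000 names
# Number of names needed to reach 50% of births, for boys in 2010 only
# Number of names needed to reach 50% of births, for every year/sex group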
top1000_prop = top1000.pivot_table('prop',index='year',columns='sex',aggfunc=sum)
top1000_prop
top1000_prop.plot(kind='line')
top_boy = boys[boys['year']==2010].sort_values(by='births',ascending=False)
top_boy_sort = top_boy['prop'].cumsum()
print(top_boy_sort.searchsorted(0.5)+1)
top_boy_sort.head()
def get_sort(group):
    group = group.sort_values(by='prop',ascending=False)
    search = group['prop'].cumsum().searchsorted(0.5)+1
    return search
table4 = top1000.groupby(by=['year','sex']).apply(get_sort)
table4.head()
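The diversity count above rests on one idea: sort the shares in descending order, take the running total, and find where it first reaches 0.5. A small sketch with hypothetical shares (the values in `prop` are invented for illustration) shows why 1 is added to the result of `searchsorted`.

```python
import pandas as pd

# Hypothetical shares for one year/sex group, already sorted descending.
prop = pd.Series([0.30, 0.15, 0.10, 0.05, 0.05])

# Running total: roughly 0.30, 0.45, 0.55, 0.60, 0.65
cumulative = prop.cumsum()

# searchsorted gives the zero-based position at which 0.5 would be inserted
# to keep the running total sorted; adding 1 turns it into a count of names.
names_needed = int(cumulative.searchsorted(0.5)) + 1
print(names_needed)  # 3
```

Here the first two names cover only about 45% of births, so a third name is needed to pass the halfway mark.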
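# 4.
# The "last letter" revolution
# Extract the last letter from the name column
# Pick three representative years
# Letter proportions
# Time series of a few letters for boys' names
# Boy names that became girl names (and vice versa)
# Find the names containing 'lesl'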
def get_last_letter(name):
    return name[-1]
top1000['last_letter']=top1000['name'].apply(get_last_letter)
top1000.head()
table5 = top1000.pivot_table('prop',columns=['sex','year'],index=['last_letter'],aggfunc=sum)
table5.head()
selected_table5 = table5.reindex(columns=[1910,1960,2010],level='year')
selected_table5.head()
selected_table5['F'].plot(kind='bar')
selected_table5['M'].plot(kind='bar')
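The last-letter table can also be reproduced on a handful of rows. This sketch assumes a made-up frame `toy` with invented `prop` values, and it uses the `.str` accessor as an alternative to `apply` with a helper function; both should give the same column.

```python
import pandas as pd

# Hypothetical rows; the prop values are invented for illustration.
toy = pd.DataFrame({
    'name': ['Mary', 'Anna', 'John', 'Ethan'],
    'sex':  ['F', 'F', 'M', 'M'],
    'year': [2010, 2010, 2010, 2010],
    'prop': [0.20, 0.10, 0.30, 0.25],
})

# Vectorized string indexing: last character of every name.
toy['last_letter'] = toy['name'].str[-1]

# Rows are letters, columns are (sex, year) pairs, cells are summed shares.
by_letter = toy.pivot_table('prop', index='last_letter',
                            columns=['sex', 'year'], aggfunc='sum')
print(by_letter)
```

Letters that never occur for a given sex and year show up as NaN in the pivoted table, which is why some bars are missing in the bar charts.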
select_letter = ['d','n','y']
boy_letter = table5.reindex(columns=['M'],level='sex')
boy_letter
boy_letter2 = boy_letter.T
boy_letter2.index = boy_letter2.index.droplevel()
boy_letter2
boy_letter2 = boy_letter2.loc[:,select_letter]
boy_letter2
boy_letter2.plot(kind='line')
all_names = top1000.name.unique()
mask = np.array(['lesl' in x.lower() for x in all_names])
lesl_like = all_names[mask]
lesl_like
# array(['Leslie', 'Lesley', 'Leslee', 'Lesli', 'Lesly'], dtype=object)
lesl_like_table = top1000[top1000.name.isin(lesl_like)]
lesl_like_table
lesl_like_table.groupby('name').births.sum()
table6 = lesl_like_table.pivot_table('births',columns=['sex'],index=['year'],aggfunc=sum)
table6
table7 = table6.div(table6.sum(axis=1),axis=0)
table7
table7.plot()
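The final normalization divides every row by its own total, so each year's F and M values become shares that sum to 1. A minimal sketch with hypothetical counts (the numbers in `counts` are not taken from the names data):

```python
import pandas as pd

# Hypothetical births by year and sex for a single name.
counts = pd.DataFrame({'F': [10, 50, 90], 'M': [90, 50, 10]},
                      index=pd.Index([1900, 1950, 2000], name='year'))

# sum(axis=1) gives one total per year; div(..., axis=0) aligns those totals
# with the row index, so each row is divided by its own total.
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares)
```

Passing `axis=0` to `div` matters here: without it pandas would try to align the yearly totals with the column labels F and M instead of with the rows.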