Data Analysis Day 03

4. Data cleaning with pandas

Handling missing data

  • There are two kinds of missing values:
    • None
    • np.nan (NaN)

In [1]:

import numpy as np
import pandas as pd
from pandas import Series,DataFrame
import tushare as ts  # financial data API package
import matplotlib.pyplot as plt
  • The difference between the two: np.nan is a float, so arithmetic with it propagates nan, while None is NoneType

In [2]:

type(np.nan)

Out[2]:

float

In [5]:

np.nan + 3

Out[5]:

nan

In [3]:

type(None)

Out[3]:

NoneType
  • None and NaN in pandas (pandas stores None as np.nan, so both display as NaN)

In [10]:

df = DataFrame(data=np.random.randint(0,100,size=(8,5)))
df

Out[10]:

0 1 2 3 4
0 44 91 92 51 55
1 23 22 92 35 83
2 21 52 40 63 29
3 94 51 24 70 59
4 27 78 1 21 17
5 94 57 5 43 22
6 87 31 58 30 82
7 93 28 54 7 93

In [12]:

df.iloc[1,2] = None
df.iloc[3,4] = None
df.iloc[4,1] = None
df.iloc[7,4] = np.nan

In [13]:

df

Out[13]:

0 1 2 3 4
0 44 91.0 92.0 51 55.0
1 23 22.0 NaN 35 83.0
2 21 52.0 40.0 63 29.0
3 94 51.0 24.0 70 NaN
4 27 NaN 1.0 21 17.0
5 94 57.0 5.0 43 22.0
6 87 31.0 58.0 30 82.0
7 93 28.0 54.0 7 NaN

Null-handling operations in pandas

  • isnull
  • notnull
  • any
  • all
  • dropna
  • fillna

In [16]:

df.isnull()

Out[16]:

0 1 2 3 4
0 False False False False False
1 False False True False False
2 False False False False False
3 False False False False True
4 False True False False False
5 False False False False False
6 False False False False False
7 False False False False True
  • Detect which rows of the original data contain null values

  • any and all can detect which rows/columns of the df contain null values
  • isnull -> any(axis=1)
  • notnull -> all(axis=1)

In [24]:

~df.isnull().any(axis=1)
df.loc[~df.isnull().any(axis=1)]

Out[24]:

0 1 2 3 4
0 44 91.0 92.0 51 55.0
2 21 52.0 40.0 63 29.0
5 94 57.0 5.0 43 22.0
6 87 31.0 58.0 30 82.0

In [28]:

df.notnull().all(axis=1)
df.loc[df.notnull().all(axis=1)]

Out[28]:

0 1 2 3 4
0 44 91.0 92.0 51 55.0
2 21 52.0 40.0 63 29.0
5 94 57.0 5.0 43 22.0
6 87 31.0 58.0 30 82.0

In [29]:

df.dropna(axis=0)  # drop the rows that contain null values

Out[29]:

0 1 2 3 4
0 44 91.0 92.0 51 55.0
2 21 52.0 40.0 63 29.0
5 94 57.0 5.0 43 22.0
6 87 31.0 58.0 30 82.0

In [32]:

df

Out[32]:

0 1 2 3 4
0 44 91.0 92.0 51 55.0
1 23 22.0 NaN 35 83.0
2 21 52.0 40.0 63 29.0
3 94 51.0 24.0 70 NaN
4 27 NaN 1.0 21 17.0
5 94 57.0 5.0 43 22.0
6 87 31.0 58.0 30 82.0
7 93 28.0 54.0 7 NaN

In [33]:

# fillna overwrites null values with replacements
df.fillna(method='bfill',axis=0) # fill each null with the neighboring (next) value along the rows

Out[33]:

0 1 2 3 4
0 44 91.0 92.0 51 55.0
1 23 22.0 92.0 35 83.0
2 21 52.0 40.0 63 29.0
3 94 51.0 24.0 70 29.0
4 27 51.0 1.0 21 17.0
5 94 57.0 5.0 43 22.0
6 87 31.0 58.0 30 82.0
7 93 28.0 54.0 7 82.0

Interview question

  • Data description:
    • The data are temperature readings from one cold-storage warehouse; columns 1-7 correspond to 7 temperature sensors, sampled once per minute.
  • Processing goal:
    • Using the 4 required sensors (columns 1-4), build a model of the warehouse's temperature field and estimate the readings for columns 5-7.
    • In the end, each warehouse only needs 4 sensors installed instead of 7.
    • f(1-4) --> y(5-7)
  • Processing steps:
    • 1. The raw data has dropped frames and needs preprocessing;
    • 2. plot with matplotlib;
    • 3. build a logistic regression model (a sketch follows the cleaning cells below).
  • There is no standard answer; work from your own understanding and briefly describe your steps in writing. Thanks for your cooperation.
  • Test data: testData.xlsx

In [5]:

df = pd.read_excel('./testData.xlsx')
df.drop(labels=['none','none1'],axis=1,inplace=True)  # drop the two junk columns

In [7]:

df.dropna(axis=0)

In [10]:

#isnull notnull any all
df.notnull().all(axis=1)
df.loc[df.notnull().all(axis=1)]

In [15]:

df.fillna(method='ffill',axis=0).fillna(method='bfill',axis=0)  # forward-fill, then back-fill any leading NaNs

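The plotting and modeling steps from the prompt are not shown in the original cells. Below is a minimal sketch, assuming the cleaned df holds the seven sensor columns in order; a plain linear regression stands in for the "logistic regression" the prompt names, since the temperatures being predicted are continuous.

In [ ]:

# Sketch only: assumes the first 4 data columns are sensors 1-4 and the
# next 3 are sensors 5-7; a linear model replaces the prompt's "logistic
# regression" because the targets are continuous temperatures
from sklearn.linear_model import LinearRegression

data = df.fillna(method='ffill',axis=0).fillna(method='bfill',axis=0)
X = data.iloc[:, 0:4]  # the 4 required sensors
y = data.iloc[:, 4:7]  # the 3 sensors to be estimated
model = LinearRegression().fit(X, y)
pred = model.predict(X)  # estimated readings for sensors 5-7
X.plot()  # matplotlib line plot of the four kept sensors
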
Handling duplicate data

In [20]:

df = DataFrame(data=np.random.randint(0,100,size=(8,5)))
df.iloc[1] = [6,6,6,6,6]
df.iloc[3] = [6,6,6,6,6]
df.iloc[5] = [6,6,6,6,6]
df

Out[20]:

0 1 2 3 4
0 44 68 53 32 24
1 6 6 6 6 6
2 79 86 73 14 25
3 6 6 6 6 6
4 72 1 73 67 89
5 6 6 6 6 6
6 69 32 94 91 18
7 47 7 77 11 67

In [25]:

df.drop_duplicates(keep='first')

Out[25]:

0 1 2 3 4
0 44 68 53 32 24
1 6 6 6 6 6
2 79 86 73 14 25
4 72 1 73 67 89
6 69 32 94 91 18
7 47 7 77 11 67
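
keep also accepts other values; a quick sketch of the alternatives:

In [ ]:

df.drop_duplicates(keep='last')  # keep the last occurrence of each duplicate group
df.drop_duplicates(keep=False)   # drop every row that has a duplicate anywhere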

Handling outlier data

  • Build a data source of 1000 rows and 3 columns (A, B, C) with values in [0, 1), then clean out the outliers in column C: values greater than twice the column's standard deviation

In [27]:

df = DataFrame(data=np.random.random(size=(1000,3)),columns=['A','B','C'])
df.head()

Out[27]:

A B C
0 0.886005 0.831529 0.822956
1 0.391742 0.104039 0.434260
2 0.539745 0.950540 0.948526
3 0.392029 0.904831 0.680343
4 0.513764 0.979957 0.600518

In [30]:

# define the criterion for judging outliers
twice_std = df['C'].std() * 2
twice_std

Out[30]:

0.570731429850527

In [34]:

# flag which values in column C are outliers
df['C'] > twice_std
df.loc[df['C'] > twice_std]
indexs = df.loc[df['C'] > twice_std].index # row indices of the outliers

In [35]:

df.drop(labels=indexs)

Out[35]:

A B C
1 0.391742 0.104039 0.434260
5 0.512951 0.897948 0.245320
6 0.473968 0.979213 0.271424
7 0.843319 0.038657 0.184559
8 0.982931 0.881284 0.208147
10 0.784656 0.314898 0.089802
13 0.624428 0.252411 0.327818
15 0.213042 0.969693 0.567275
18 0.710010 0.534330 0.559441
19 0.026479 0.736259 0.120431
20 0.990039 0.982449 0.017151
23 0.158157 0.183678 0.428155
25 0.604838 0.950466 0.294742
26 0.304136 0.822809 0.388579
28 0.671559 0.726631 0.196907
30 0.811249 0.751182 0.467697
31 0.376243 0.805516 0.287484
33 0.570442 0.797945 0.026182
35 0.467125 0.062123 0.439725
36 0.861741 0.413997 0.543973
38 0.955328 0.817003 0.293787
47 0.458014 0.228608 0.285172
49 0.931513 0.403981 0.239329
51 0.008178 0.484172 0.021373
53 0.253882 0.300069 0.561118
55 0.752559 0.685649 0.451692
56 0.003363 0.486893 0.154598
57 0.859653 0.569252 0.007432
58 0.327716 0.419704 0.452710
59 0.068403 0.029346 0.226587
... ... ... ...
953 0.247954 0.072558 0.038834
954 0.199553 0.193049 0.027725
956 0.513195 0.175896 0.254432
957 0.080261 0.476756 0.521142
958 0.944795 0.550317 0.336043
961 0.464895 0.592027 0.195383
962 0.127469 0.300982 0.309427
963 0.595242 0.139702 0.450026
964 0.520342 0.639537 0.209403
965 0.372687 0.117984 0.262849
966 0.007270 0.044250 0.533105
967 0.854830 0.512720 0.173844
968 0.247666 0.972284 0.227422
970 0.047074 0.714412 0.392280
974 0.112649 0.483324 0.125105
975 0.307405 0.875641 0.432340
978 0.520662 0.003040 0.412422
979 0.337178 0.540283 0.257443
981 0.877978 0.842195 0.448030
982 0.273752 0.063285 0.291012
985 0.765849 0.974933 0.253099
988 0.139305 0.570496 0.535778
989 0.597190 0.973190 0.177517
990 0.817945 0.183825 0.330112
991 0.738457 0.578425 0.032489
992 0.159229 0.544980 0.242586
994 0.300998 0.352331 0.434336
996 0.609123 0.491735 0.045738
998 0.839935 0.181189 0.121180
999 0.798840 0.939869 0.150332

5. Concatenating and merging DataFrames

Concatenation

  • pd.concat, pd.append

pandas's pd.concat works like np.concatenate, but with a few extra parameters:

  • objs
  • axis=0
  • keys: labels each input piece with an outer index level (see the sketch after the matched-concatenation example below)
  • join='outer' / 'inner': the concatenation mode; outer keeps every label (matched or not), while inner keeps only the labels shared by all inputs
  • ignore_index=False
  • Matched concatenation

In [1]:

import numpy as np
import pandas as pd
from pandas import DataFrame

In [2]:

df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                'group':['Accounting','Engineering','Engineering'],
                })
df1

Out[2]:

employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering

In [6]:

pd.concat((df1,df1,df1),axis=1)

Out[6]:

employee group employee group employee group
0 Bob Accounting Bob Accounting Bob Accounting
1 Jake Engineering Jake Engineering Jake Engineering
2 Lisa Engineering Lisa Engineering Lisa Engineering
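
The keys parameter from the list above labels each input piece with an outer index level, which keeps the pieces distinguishable; a quick sketch:

In [ ]:

# keys adds an outer index level naming each concatenated piece
pd.concat((df1,df1),keys=['left','right'])
# the row index becomes ('left', 0) ... ('right', 2)
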
  • Mismatched concatenation
    • Mismatched means the indices along the concatenation axis differ, e.g. the column indices differ when concatenating vertically, or the row indices differ when concatenating horizontally
    • There are 2 join modes:
      • Outer join: fill the gaps with NaN (the default)
      • Inner join: keep only the matching labels

In [7]:

df1

Out[7]:

employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering

In [12]:

df2 = df1.copy()
df2.columns = ['employee','groups']
df2

Out[12]:

employee groups
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering

In [14]:

pd.concat((df1,df2),axis=0)

Out[14]:

employee group groups
0 Bob Accounting NaN
1 Jake Engineering NaN
2 Lisa Engineering NaN
0 Bob NaN Accounting
1 Jake NaN Engineering
2 Lisa NaN Engineering

In [16]:

pd.concat((df1,df2),axis=0,join='inner')

Out[16]:

employee
0 Bob
1 Jake
2 Lisa
0 Bob
1 Jake
2 Lisa
  • Using the append function

In [17]:

df1.append(df2)

Out[17]:

employee group groups
0 Bob Accounting NaN
1 Jake Engineering NaN
2 Lisa Engineering NaN
0 Bob NaN Accounting
1 Jake NaN Engineering
2 Lisa NaN Engineering

Merging: concatenation stitches tables together, while merging combines tables by their contents

  • merge differs from concat in that merge joins two tables on a shared column
  • pd.merge() automatically uses the column whose name appears in both tables as the join key.
  • The order of elements within each column does not have to match

One-to-one merge

In [18]:

df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                'group':['Accounting','Engineering','Engineering'],
                })
df1

Out[18]:

employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering

In [19]:

df2 = DataFrame({'employee':['Lisa','Bob','Jake'],
                'hire_date':[2004,2008,2012],
                })
df2

Out[19]:

employee hire_date
0 Lisa 2004
1 Bob 2008
2 Jake 2012

In [20]:

pd.merge(df1,df2)

Out[20]:

employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004

One-to-many merge

In [21]:

df3 = DataFrame({
    'employee':['Lisa','Jake'],
    'group':['Accounting','Engineering'],
    'hire_date':[2004,2016]})
df3

Out[21]:

employee group hire_date
0 Lisa Accounting 2004
1 Jake Engineering 2016

In [22]:

df4 = DataFrame({'group':['Accounting','Engineering','Engineering'],
                       'supervisor':['Carly','Guido','Steve']
                })
df4

Out[22]:

group supervisor
0 Accounting Carly
1 Engineering Guido
2 Engineering Steve

In [23]:

pd.merge(df3,df4)

Out[23]:

employee group hire_date supervisor
0 Lisa Accounting 2004 Carly
1 Jake Engineering 2016 Guido
2 Jake Engineering 2016 Steve

Many-to-many merge

In [24]:

df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                 'group':['Accounting','Engineering','Engineering']})
df1

Out[24]:

employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering

In [25]:

df5 = DataFrame({'group':['Engineering','Engineering','HR'],
                'supervisor':['Carly','Guido','Steve']
                })
df5

Out[25]:

group supervisor
0 Engineering Carly
1 Engineering Guido
2 HR Steve

In [28]:

pd.merge(df1,df5)

Out[28]:

employee group supervisor
0 Jake Engineering Carly
1 Jake Engineering Guido
2 Lisa Engineering Carly
3 Lisa Engineering Guido

Normalizing the key

  • When columns collide (several columns share a name), use on= to specify which column is the key; suffixes renames the colliding columns

In [29]:

df1 = DataFrame({'employee':['Jack',"Summer","Steve"],
                 'group':['Accounting','Finance','Marketing']})
df1

Out[29]:

employee group
0 Jack Accounting
1 Summer Finance
2 Steve Marketing

In [30]:

df2 = DataFrame({'employee':['Jack','Bob',"Jake"],
                 'hire_date':[2003,2009,2012],
                'group':['Accounting','sell','ceo']})
df2

Out[30]:

employee group hire_date
0 Jack Accounting 2003
1 Bob sell 2009
2 Jake ceo 2012

In [32]:

pd.merge(df1,df2,on='group')

Out[32]:

employee_x group employee_y hire_date
0 Jack Accounting Jack 2003
  • When the two tables have no shared column to join on, use left_on and right_on to specify which column on each side serves as the join key

In [33]:

df1 = DataFrame({'employee':['Bobs','Linda','Bill'],
                'group':['Accounting','Product','Marketing'],
               'hire_date':[1998,2017,2018]})
df1

Out[33]:

employee group hire_date
0 Bobs Accounting 1998
1 Linda Product 2017
2 Bill Marketing 2018

In [34]:

df5 = DataFrame({'name':['Lisa','Bobs','Bill'],
                'hire_dates':[1998,2016,2007]})
df5

Out[34]:

hire_dates name
0 1998 Lisa
1 2016 Bobs
2 2007 Bill

In [35]:

pd.merge(df1,df5,left_on='employee',right_on='name')

Out[35]:

employee group hire_date hire_dates name
0 Bobs Accounting 1998 2016 Bobs
1 Bill Marketing 2018 2007 Bill

Inner and outer merges: how='outer' takes the union of the keys, how='inner' (the default) takes the intersection

In [37]:

df6 = DataFrame({'name':['Peter','Paul','Mary'],
               'food':['fish','beans','bread']}
               )
df7 = DataFrame({'name':['Mary','Joseph'],
                'drink':['wine','beer']})

In [38]:

df6

Out[38]:

food name
0 fish Peter
1 beans Paul
2 bread Mary

In [39]:

df7

Out[39]:

drink name
0 wine Mary
1 beer Joseph

In [43]:

pd.merge(df6,df7,how='right')

Out[43]:

food name drink
0 bread Mary wine
1 NaN Joseph beer
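
For comparison, the same merge under the other join modes (how='inner' is the default):

In [ ]:

pd.merge(df6,df7,how='outer')  # union of keys: Joseph gets NaN food, Peter/Paul get NaN drink
pd.merge(df6,df7,how='inner')  # intersection of keys: only Mary survives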

In [ ]:

# Exercise: merge df1 and df2 (df2 is indexed by name, so there is no shared column; a sketch solution follows)
dic1={
    
    'name':['tom','jay','helly'],
    'age':[11,12,33],
    'classRoom':[1,2,3]
}
df1=DataFrame(data=dic1)
df2=DataFrame(data=np.random.randint(60,100,size=(3,3)),
              index=['jay','tom','helly'],
             columns=['java','python','c'])
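
One possible solution (a sketch): df2 keeps the names in its row index, so join df1's name column against df2's index with right_index:

In [ ]:

pd.merge(df1,df2,left_on='name',right_index=True)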

6. Population analysis case study

  • Requirements:
    • Load the files and inspect the raw data
    • Merge the population data with the state-abbreviation data
    • Drop the redundant abbreviation column from the merged data
    • Find which columns contain missing data
    • Find which state/region values leave state as NaN, and deduplicate them
    • Fill in the correct state values for those state/region entries, removing every NaN from the state column
    • Merge in the state area data (areas)
    • The area (sq. mi) column has missing data; find which rows
    • Drop the rows with missing data
    • Extract the total population data for 2010
    • Compute each state's population density
    • Sort, and find the five states with the highest population density

In [1]:

import pandas as pd
from pandas import DataFrame
import numpy as np

In [10]:

abb = pd.read_csv('./data/state-abbrevs.csv') # full state names and abbreviations
area = pd.read_csv('./data/state-areas.csv') # full state names and areas
pop = pd.read_csv('./data/state-population.csv') # population data

In [7]:

pop.head(2)

Out[7]:

state/region ages year population
0 AL under18 2012 1117489.0
1 AL total 2012 4817528.0

In [8]:

abb.head(2)

Out[8]:

state abbreviation
0 Alabama AL
1 Alaska AK

In [12]:

# combine (merge) the data sets
abb_pop = pd.merge(abb,pop,how='outer',left_on='abbreviation',right_on='state/region')
abb_pop.head()

Out[12]:

state abbreviation state/region ages year population
0 Alabama AL AL under18 2012 1117489.0
1 Alabama AL AL total 2012 4817528.0
2 Alabama AL AL under18 2010 1130966.0
3 Alabama AL AL total 2010 4785570.0
4 Alabama AL AL under18 2011 1125763.0

In [13]:

abb_pop.drop(labels='abbreviation',axis=1,inplace=True)

In [15]:

# find the columns that contain missing data
abb_pop.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2544 entries, 0 to 2543
Data columns (total 5 columns):
state           2448 non-null object
state/region    2544 non-null object
ages            2544 non-null object
year            2544 non-null int64
population      2524 non-null float64
dtypes: float64(1), int64(1), object(3)
memory usage: 119.2+ KB

In [16]:

# the same check per column, via isnull + any
abb_pop.isnull().any(axis=0)

Out[16]:

state            True
state/region    False
ages            False
year            False
population       True
dtype: bool

In [19]:

# find which state/region abbreviations leave state (the full name) as NaN, then deduplicate those abbreviations
abb_pop.head()

Out[19]:

state state/region ages year population
0 Alabama AL under18 2012 1117489.0
1 Alabama AL total 2012 4817528.0
2 Alabama AL under18 2010 1130966.0
3 Alabama AL total 2010 4785570.0
4 Alabama AL under18 2011 1125763.0

In [24]:

#1. flag the rows where the full name is null
abb_pop['state'].isnull()
#2. pull out the rows containing those nulls
abb_pop.loc[abb_pop['state'].isnull()]
#3. take the abbreviations from those rows
abb_pop.loc[abb_pop['state'].isnull()]['state/region']
#4. deduplicate the resulting Series of abbreviations
abb_pop.loc[abb_pop['state'].isnull()]['state/region'].unique() # unique() deduplicates the elements of a Series

Out[24]:

array(['PR', 'USA'], dtype=object)

In [ ]:

# fill in the correct state value for these state/region entries, removing every NaN from the state column
# locate the null full names under the PR abbreviation and assign them PR's full name, PUERTO RICO

In [29]:

abb_pop['state/region'] == 'PR'
# pull the rows whose abbreviation is PR
abb_pop.loc[abb_pop['state/region'] == 'PR']
# assign PR's full name to the nulls in the state column of those rows
indexs = abb_pop.loc[abb_pop['state/region'] == 'PR'].index # row indices where PR's full name is null
abb_pop.loc[indexs,'state'] = 'PUERTO RICO'

In [31]:

# likewise, set the full name for USA to United States
abb_pop['state/region'] == 'USA'
abb_pop.loc[abb_pop['state/region'] == 'USA']
indexs = abb_pop.loc[abb_pop['state/region'] == 'USA'].index
abb_pop.loc[indexs,'state'] = 'United States'

In [39]:

# merge in the state area data (areas)
abb_pop_area = pd.merge(abb_pop,area,how='outer')
abb_pop_area.head()

Out[39]:

state state/region ages year population area (sq. mi)
0 Alabama AL under18 2012.0 1117489.0 52423.0
1 Alabama AL total 2012.0 4817528.0 52423.0
2 Alabama AL under18 2010.0 1130966.0 52423.0
3 Alabama AL total 2010.0 4785570.0 52423.0
4 Alabama AL under18 2011.0 1125763.0 52423.0

In [45]:

# the area (sq. mi) column has missing data; find which rows
indexs = abb_pop_area.loc[abb_pop_area['area (sq. mi)'].isnull()].index
indexs

In [43]:

abb_pop_area.drop(labels=indexs,inplace=True)

In [48]:

# extract the total population data for 2010 with query('condition')
abb_pop_area.query('year==2010 & ages == "total"')

In [ ]:

# compute each state's population density

In [51]:

abb_pop_area['midu'] = abb_pop_area['population'] / abb_pop_area['area (sq. mi)'] # midu = population density
abb_pop_area.head()

Out[51]:

state state/region ages year population area (sq. mi) midu
0 Alabama AL under18 2012.0 1117489.0 52423.0 21.316769
1 Alabama AL total 2012.0 4817528.0 52423.0 91.897221
2 Alabama AL under18 2010.0 1130966.0 52423.0 21.573851
3 Alabama AL total 2010.0 4785570.0 52423.0 91.287603
4 Alabama AL under18 2011.0 1125763.0 52423.0 21.474601

In [53]:

# sort, then find the five states with the highest population density (see the follow-up cell below)
abb_pop_area.sort_values(by='midu',axis=0,ascending=False)

Out[53]:

state state/region ages year population area (sq. mi) midu
391 District of Columbia DC total 2013.0 646449.0 68.0 9506.602941
385 District of Columbia DC total 2012.0 633427.0 68.0 9315.102941
387 District of Columbia DC total 2011.0 619624.0 68.0 9112.117647
431 District of Columbia DC total 1990.0 605321.0 68.0 8901.779412
389 District of Columbia DC total 2010.0 605125.0 68.0 8898.897059
426 District of Columbia DC total 1991.0 600870.0 68.0 8836.323529
429 District of Columbia DC total 1992.0 597567.0 68.0 8787.750000
422 District of Columbia DC total 1993.0 595302.0 68.0 8754.441176
392 District of Columbia DC total 2009.0 592228.0 68.0 8709.235294
425 District of Columbia DC total 1994.0 589240.0 68.0 8665.294118
420 District of Columbia DC total 1995.0 580519.0 68.0 8537.044118
396 District of Columbia DC total 2008.0 580236.0 68.0 8532.882353
406 District of Columbia DC total 2001.0 574504.0 68.0 8448.588235
394 District of Columbia DC total 2007.0 574404.0 68.0 8447.117647
408 District of Columbia DC total 2002.0 573158.0 68.0 8428.794118
419 District of Columbia DC total 1996.0 572379.0 68.0 8417.338235
412 District of Columbia DC total 2000.0 572046.0 68.0 8412.441176
400 District of Columbia DC total 2006.0 570681.0 68.0 8392.367647
410 District of Columbia DC total 1999.0 570220.0 68.0 8385.588235
402 District of Columbia DC total 2003.0 568502.0 68.0 8360.323529
404 District of Columbia DC total 2004.0 567754.0 68.0 8349.323529
417 District of Columbia DC total 1997.0 567739.0 68.0 8349.102941
398 District of Columbia DC total 2005.0 567136.0 68.0 8340.235294
415 District of Columbia DC total 1998.0 565232.0 68.0 8312.235294
421 District of Columbia DC under18 1995.0 123620.0 68.0 1817.941176
424 District of Columbia DC under18 1994.0 122170.0 68.0 1796.617647
418 District of Columbia DC under18 1996.0 121210.0 68.0 1782.500000
423 District of Columbia DC under18 1993.0 120471.0 68.0 1771.632353
416 District of Columbia DC under18 1997.0 119531.0 68.0 1757.808824
428 District of Columbia DC under18 1992.0 118636.0 68.0 1744.647059
... ... ... ... ... ... ... ...
53 Alaska AK total 1994.0 603308.0 656425.0 0.919081
56 Alaska AK total 1993.0 599434.0 656425.0 0.913180
50 Alaska AK total 1992.0 588736.0 656425.0 0.896882
55 Alaska AK total 1991.0 570193.0 656425.0 0.868634
48 Alaska AK total 1990.0 553290.0 656425.0 0.842884
63 Alaska AK under18 1998.0 192636.0 656425.0 0.293462
66 Alaska AK under18 1999.0 191422.0 656425.0 0.291613
69 Alaska AK under18 2000.0 190615.0 656425.0 0.290384
71 Alaska AK under18 2001.0 188771.0 656425.0 0.287574
73 Alaska AK under18 2002.0 188482.0 656425.0 0.287134
92 Alaska AK under18 2011.0 188329.0 656425.0 0.286901
62 Alaska AK under18 1997.0 188280.0 656425.0 0.286826
94 Alaska AK under18 2012.0 188162.0 656425.0 0.286647
86 Alaska AK under18 2013.0 188132.0 656425.0 0.286601
90 Alaska AK under18 2010.0 187902.0 656425.0 0.286251
54 Alaska AK under18 1994.0 187439.0 656425.0 0.285545
57 Alaska AK under18 1993.0 187190.0 656425.0 0.285166
75 Alaska AK under18 2003.0 186843.0 656425.0 0.284637
89 Alaska AK under18 2009.0 186351.0 656425.0 0.283888
77 Alaska AK under18 2004.0 186335.0 656425.0 0.283863
81 Alaska AK under18 2006.0 185580.0 656425.0 0.282713
61 Alaska AK under18 1996.0 185360.0 656425.0 0.282378
79 Alaska AK under18 2005.0 185304.0 656425.0 0.282293
59 Alaska AK under18 1995.0 184990.0 656425.0 0.281814
52 Alaska AK under18 1992.0 184878.0 656425.0 0.281644
83 Alaska AK under18 2007.0 184344.0 656425.0 0.280830
85 Alaska AK under18 2008.0 183124.0 656425.0 0.278972
51 Alaska AK under18 1991.0 182180.0 656425.0 0.277534
49 Alaska AK under18 1990.0 177502.0 656425.0 0.270407
2544 Puerto Rico NaN NaN NaN NaN 3515.0 NaN
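
To finish the last requirement, take the first five rows of the sorted result; nlargest is an equivalent shortcut:

In [ ]:

# the five states with the highest population density
abb_pop_area.sort_values(by='midu',axis=0,ascending=False).head(5)
# equivalently: abb_pop_area.nlargest(5,'midu')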

7. Advanced pandas operations

In [3]:

import pandas as pd
from pandas import DataFrame
import numpy as np

Replacement

  • Replacement works on both Series and DataFrame
  • Single-value replacement
    • Plain replacement, which replaces every matching element: to_replace=15, value='e'
    • Column-scoped single-value replacement: to_replace={column label: value to replace}, value='value'
  • Multi-value replacement
    • List replacement: to_replace=[] value=[] (see the sketch after the cells below)
    • Dict replacement (recommended): to_replace={to_replace:value, to_replace:value}

In [4]:

df = DataFrame(data=np.random.randint(0,100,size=(6,7)))
df

Out[4]:

0 1 2 3 4 5 6
0 44 62 3 85 26 47 14
1 15 78 32 98 85 4 51
2 53 75 87 21 45 8 18
3 54 31 67 49 77 25 49
4 18 21 18 31 93 11 0
5 21 54 76 95 70 77 49

In [5]:

df.replace(to_replace=3,value='Three')

Out[5]:

0 1 2 3 4 5 6
0 44 62 Three 85 26 47 14
1 15 78 32 98 85 4 51
2 53 75 87 21 45 8 18
3 54 31 67 49 77 25 49
4 18 21 18 31 93 11 0
5 21 54 76 95 70 77 49

In [6]:

df.replace(to_replace={3:'aaa'})

Out[6]:

0 1 2 3 4 5 6
0 44 62 aaa 85 26 47 14
1 15 78 32 98 85 4 51
2 53 75 87 21 45 8 18
3 54 31 67 49 77 25 49
4 18 21 18 31 93 11 0
5 21 54 76 95 70 77 49

In [8]:

# replace a value only within the specified column (here, 77 in column 5)
df.replace(to_replace={5:77},value='6666666')

Out[8]:

0 1 2 3 4 5 6
0 44 62 3 85 26 47 14
1 15 78 32 98 85 4 51
2 53 75 87 21 45 8 18
3 54 31 67 49 77 25 49
4 18 21 18 31 93 11 0
5 21 54 76 95 70 6666666 49
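
The list form from the summary above (to_replace=[] value=[]) is not demonstrated in these cells; a quick sketch, where each value in the first list is replaced by the value at the same position in the second:

In [ ]:

df.replace(to_replace=[3,18],value=['three','eighteen'])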

Mapping

  • Concept: build a mapping table that binds values to particular labels or strings (giving each value an alternate representation)

  • Build a df with two columns, name and salary, then map each name to its Chinese name

In [10]:

dic = {
    'name':['jay','tom','jay'],
    'salary':[10000,20000,10000]
}
df = DataFrame(data=dic)
df

Out[10]:

name salary
0 jay 10000
1 tom 20000
2 jay 10000

In [14]:

# the mapping table
dic = {
    'jay':'张三',
    'tom':'李四'
}
df['c_name'] = df['name'].map(dic)
df

Out[14]:

name salary c_name
0 jay 10000 张三
1 tom 20000 李四
2 jay 10000 张三

Computation tools

  • The portion of a salary above 3000 is taxed at 50%; compute everyone's after-tax salary

In [16]:

def after_sal(s):
    return s - (s-3000)*0.5  # 50% tax on the portion above 3000

In [18]:

df['after_salary'] = df['salary'].map(after_sal)
df

Out[18]:

name salary c_name after_salary
0 jay 10000 张三 6500.0
1 tom 20000 李四 11500.0
2 jay 10000 张三 6500.0

Mapping the index

  • Use rename() to replace row and column index labels
  • Parameters:
    • index replaces the row index
    • columns replaces the column index

In [19]:

df4 = DataFrame({'color':['white','gray','purple','blue','green'],'value':np.random.randint(10,size = 5)})
df4

Out[19]:

color value
0 white 2
1 gray 5
2 purple 9
3 blue 0
4 green 1

In [20]:

new_index = {0:'first',1:'two',2:'three',3:'four',4:'five'}
new_col={'color':'cc','value':'vv'}
df4.rename(new_index,columns=new_col)

Out[20]:

cc vv
first white 2
two gray 5
three purple 9
four blue 0
five green 1

Random sampling via reordering

  • take()
  • np.random.permutation()

In [22]:

df = DataFrame(data=np.random.randint(0,100,size=(100,3)),columns=['A','B','C'])
df

In [24]:

# df.take(['B','A','C'],axis=1)  # take() only accepts positional indices, not labels
df.take([1,0,2],axis=1)

In [32]:

np.random.permutation(3) # returns a random permutation of 0..2

Out[32]:

array([0, 1, 2])

In [31]:

# shuffle both the row order and the column order
df.take(np.random.permutation(100),axis=0).take(np.random.permutation(3),axis=1)

In [35]:

df.take(np.random.permutation(100),axis=0).take(np.random.permutation(3),axis=1)[0:50] # a random sample of 50 rows

Grouped processing of data

  • The core of grouped processing:
    • the groupby() function
    • the groups attribute, which shows how the rows were grouped

In [36]:

df = DataFrame({'item':['Apple','Banana','Orange','Banana','Orange','Apple'],
                'price':[4,3,3,2.5,4,2],
               'color':['red','yellow','yellow','green','green','green'],
               'weight':[12,20,50,30,20,44]})
df

Out[36]:

color item price weight
0 red Apple 4.0 12
1 yellow Banana 3.0 20
2 yellow Orange 3.0 50
3 green Banana 2.5 30
4 green Orange 4.0 20
5 green Apple 2.0 44

In [37]:

# group by fruit type
df.groupby(by='item')

Out[37]:

<pandas.core.groupby.DataFrameGroupBy object at 0x0000019782507F60>

In [38]:

# inspect the grouping via groups
df.groupby(by='item').groups

Out[38]:

{'Apple': Int64Index([0, 5], dtype='int64'),
 'Banana': Int64Index([1, 3], dtype='int64'),
 'Orange': Int64Index([2, 4], dtype='int64')}

In [40]:

# compute the average price of each fruit
df.groupby(by='item').mean()['price']

Out[40]:

item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64

In [41]:

df.groupby(by='item')['price'].mean() # preferred: select the column before aggregating

Out[41]:

item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64

In [42]:

# compute the average weight of each fruit color
df.groupby(by='color')['weight'].mean()

Out[42]:

color
green     31.333333
red       12.000000
yellow    35.000000
Name: weight, dtype: float64

In [44]:

# compute each fruit's average price and fold it back into the original data
df

Out[44]:

color item price weight
0 red Apple 4.0 12
1 yellow Banana 3.0 20
2 yellow Orange 3.0 50
3 green Banana 2.5 30
4 green Orange 4.0 20
5 green Apple 2.0 44

In [47]:

series_price = df.groupby(by='item')['price'].mean() 
dic = series_price.to_dict()
dic # the mapping table

Out[47]:

{'Apple': 3.0, 'Banana': 2.75, 'Orange': 3.5}

In [49]:

df['mean_price'] = df['item'].map(dic)
df

Out[49]:

color item price weight mean_price
0 red Apple 4.0 12 3.00
1 yellow Banana 3.0 20 2.75
2 yellow Orange 3.0 50 3.50
3 green Banana 2.5 30 2.75
4 green Orange 4.0 20 3.50
5 green Apple 2.0 44 3.00

Advanced aggregation

  • After a groupby, transform and apply accept custom functions for richer computations
  • df.groupby('item')['price'].sum() <==> df.groupby('item')['price'].apply(sum)
  • Both transform and apply run the computation; just pass the function in
  • transform and apply also accept a lambda expression

In [56]:

def myMean(s):
    total = 0
    for i in s:
        total += i
    return total/len(s)

In [57]:

df.groupby(by='item')['price'].apply(myMean) # apply acts as the aggregation tool

Out[57]:

item
Apple     3.00
Banana    2.75
Orange    3.50
Name: price, dtype: float64

In [58]:

df.groupby(by='item')['price'].transform(myMean) # transform broadcasts each group's result back onto the original rows

Out[58]:

0    3.00
1    2.75
2    3.50
3    2.75
4    3.50
5    3.00
Name: price, dtype: float64

Data loading

  • Read the file type-.txt

In [1]:

import pandas as pd
from pandas import DataFrame,Series
data=pd.read_csv('./data/type-.txt')
data

Out[1]:

你好-我好-他也好
0 也许-大概-有可能
1 然而-未必-不见得

In [2]:

data.shape

Out[2]:

(2, 1)
  • Store each word in the file as its own element of the DataFrame

In [4]:

data=pd.read_csv('./data/type-.txt',sep='-',header=None)
data

Out[4]:

0 1 2
0 你好 我好 他也好
1 也许 大概 有可能
2 然而 未必 不见得
  • Read data from a database

In [6]:

# connect to the database and get a connection object
import sqlite3
conn=sqlite3.connect('./data/weather_2012.sqlite')

In [7]:

# read the table's rows into a DataFrame
sql_df=pd.read_sql('select * from weather_2012',conn)
sql_df

In [ ]:

# write a DataFrame's contents into the db
data_1.to_sql('sql_data',conn)  # data_1: a DataFrame defined elsewhere

Pivot tables

  • A pivot table is a table format that dynamically rearranges and summarizes data by category. Most people have used pivot tables in Excel and felt how powerful they are; in pandas the equivalent is called pivot_table.

  • Advantages of pivot tables:
    • Highly flexible: the analysis can be customized at will
    • Clear structure that makes the data easy to understand
    • Operationally powerful: a reporting workhorse

In [16]:

import pandas as pd
import numpy as np
df = pd.read_csv('./data/透视表-篮球赛.csv',encoding='utf8')
df.head()

Out[16]:

对手 胜负 主客场 命中 投篮数 投篮命中率 3分命中率 篮板 助攻 得分
0 勇士 胜 客 10 23 0.435 0.444 6 11 27
1 国王 胜 客 8 21 0.381 0.286 3 9 27
2 小牛 胜 主 10 19 0.526 0.462 3 7 29
3 灰熊 负 主 8 20 0.400 0.250 5 8 22
4 76人 胜 客 10 20 0.500 0.250 3 13 27

pivot_table has four key parameters: index, values, columns, and aggfunc

  • index: the grouping criteria for the summary
    • Every pivot_table needs an index. To see Harden's scoring against each team, group by opponent (对手) and average each stat:

In [14]:

df.pivot_table(index='对手')
  • To see the stats against the same opponent at home vs. away, group by both 对手 (opponent) and 主客场 (home/away)

In [17]:

df.pivot_table(index=['对手','主客场'])
  • values: selects which columns get computed
    • If we only need Harden's points, rebounds, and assists (得分/篮板/助攻) by home/away and win/loss:

In [18]:

df.pivot_table(index=['主客场','胜负'],values=['得分','助攻','篮板'])

Out[18]:

助攻 得分 篮板
主客场 胜负
主 胜 10.555556 34.222222 5.444444
主 负 8.666667 29.666667 5.000000
客 胜 9.000000 32.000000 4.916667
客 负 8.000000 20.000000 4.000000
  • aggfunc: the aggregation function applied to the data
    • When aggfunc is not set, it defaults to aggfunc='mean', the average.
  • To also get James Harden's total points, rebounds, and assists by home/away and win/loss:

In [20]:

df.pivot_table(index=['主客场','胜负'],values=['得分','助攻','篮板'],aggfunc=['sum','mean'])

Out[20]:

sum mean
助攻 得分 篮板 助攻 得分 篮板
主客场 胜负
主 胜 95 308 49 10.555556 34.222222 5.444444
主 负 26 89 15 8.666667 29.666667 5.000000
客 胜 108 384 59 9.000000 32.000000 4.916667
客 负 8 20 4 8.000000 20.000000 4.000000
  • columns: adds a column-level grouping
    • it further splits the values fields by category

In [33]:

# total points at home (主) and away (客), across all opponents
df.pivot_table(index='主客场',values='得分',aggfunc='sum',fill_value=0)

Out[33]:

得分
主客场
主 397
客 404

In [34]:

# total points at home and away against each opponent (the previous totals, further split by opponent)
df.pivot_table(index='主客场',values='得分',columns='对手',aggfunc='sum',fill_value=0)

Out[34]:

对手 76人 勇士 国王 太阳 小牛 尼克斯 开拓者 掘金 步行者 湖人 灰熊 爵士 猛龙 篮网 老鹰 骑士 鹈鹕 黄蜂
主客场
主 29 0 0 0 29 37 0 21 29 0 60 56 38 37 0 35 26 0
客 27 27 27 48 0 31 48 0 26 36 49 29 0 0 29 0 0 27

Crosstab

  • A special pivot used to count group frequencies; it summarizes the data
  • pd.crosstab(index, columns)
    • index: the grouping data, used as the crosstab's row index
    • columns: the crosstab's column index

In [35]:

df = DataFrame({'sex':['man','man','women','women','man','women','man','women','women'],
               'age':[15,23,25,17,35,57,24,31,22],
               'smoke':[True,False,False,True,True,False,False,True,False],
               'height':[168,179,181,166,173,178,188,190,160]})
  • Count how many of each sex smoke

In [36]:

pd.crosstab(df.smoke,df.sex)

Out[36]:

sex man women
smoke
False 2 3
True 2 2
  • Count smokers in each age group

In [14]:

pd.crosstab(df.age,df.smoke)

Out[14]:

smoke False True
age
15 0 1
17 0 1
22 1 0
23 1 0
24 1 0
25 1 0
31 0 1
35 0 1
57 1 0

Reposted from www.cnblogs.com/bky20061005/p/12233237.html