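pandas Series basics and the ix, loc, and iloc indexers

Copyright notice: This is an original article by the blogger; do not reproduce it without the blogger's permission. https://blog.csdn.net/lilong117194/article/details/82499329

1. Basic usage of the pandas Series:

The examples below show each operation together with its output.

1. Three ways to construct/initialize a Series:

(1) Build a Series from a list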

import pandas as pd
my_list=[7,'Beijing','19大',3.1415,-10000,'Happy']
s=pd.Series(my_list)
print(type(s))
print(s)
<class 'pandas.core.series.Series'>
0           7
1     Beijing
2        19大
3      3.1415
4      -10000
5       Happy
dtype: object

By default pandas uses the integers 0 to n-1 as the Series index, but you can also supply your own index. The index plays a role similar to the keys of a dict.

s=pd.Series([7,'Beijing','19大',3.1415,-10000,'Happy'],
index=['A','B','C','D','E','F'])
print(s)
A           7
B     Beijing
C        19大
D      3.1415
E      -10000
F       Happy
dtype: object
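
To make the dict analogy concrete, here is a minimal sketch (reusing the list from above) that looks up a value by its label and inspects the labels separately from the values. It is an illustration only, not part of the original listing.

```python
import pandas as pd

s = pd.Series([7, 'Beijing', '19大', 3.1415, -10000, 'Happy'],
              index=['A', 'B', 'C', 'D', 'E', 'F'])

# Look up a single value by its label, much like a dict key
print(s['B'])          # Beijing

# The labels live in s.index; they can be listed like dict keys
print(list(s.index))   # ['A', 'B', 'C', 'D', 'E', 'F']

# A Series converts to a plain dict when a real dict is needed
as_dict = s.to_dict()
print(as_dict['E'])    # -10000
```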

(2) Build a Series from a dict, since a Series is itself essentially a key-value structure

cities={'Beijing':55000,'Shanghai':60000,'shenzhen':50000,'Hangzhou':20000,'Guangzhou':45000,'Suzhou':None}
apts=pd.Series(cities,name='income')
print(apts)
Beijing      55000.0
Shanghai     60000.0
shenzhen     50000.0
Hangzhou     20000.0
Guangzhou    45000.0
Suzhou           NaN
Name: income, dtype: float64
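
Two details of the output above are worth checking explicitly: the None value becomes NaN (which forces the dtype to float64), and recent pandas versions keep the dict's insertion order for the index. The following sketch uses a shortened, hypothetical version of the cities dict.

```python
import pandas as pd

# Shortened dict for illustration; 'Suzhou' has no known value
cities = {'Beijing': 55000, 'Shanghai': 60000, 'Suzhou': None}
apts = pd.Series(cities, name='income')

print(apts.dtype)           # float64, because NaN is a float
print(apts.isna().sum())    # 1 missing value (the None)
print(list(apts.index))     # ['Beijing', 'Shanghai', 'Suzhou']
print(apts.name)            # income
```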

(3) Build a Series from a NumPy array

import numpy as np
d=pd.Series(np.random.randn(5),index=['a','b','c','d','e'])
print(d)
a   -0.329401
b   -0.435921
c   -0.232267
d   -0.846713
e   -0.406585
dtype: float64

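The three constructors above are straightforward. Because the values in the last example are drawn at random, the numbers will differ on every run.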
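
If reproducible numbers are wanted, one option is a seeded NumPy generator. This is a sketch, and the seed value 0 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c', 'd', 'e']

# A seeded generator yields the same "random" values on every run
rng = np.random.default_rng(0)
d = pd.Series(rng.standard_normal(5), index=labels)

# Re-creating the generator with the same seed reproduces the Series
d_again = pd.Series(np.random.default_rng(0).standard_normal(5), index=labels)

print(d.equals(d_again))   # True
print(d.dtype)             # float64
```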

2. Selecting data from a Series

(1) A Series can be sliced much like a list; use .iloc when selecting by integer position

import pandas as pd
cities={'Beijing':55000,'Shanghai':60000,'shenzhen':50000,'Hangzhou':20000,'Guangzhou':45000,'Suzhou':None}
apts=pd.Series(cities,name='income')
print('apts:\n',apts)
# plain apts[3] on a string-labelled Series is deprecated in pandas 2.x, so use .iloc for positions
print('apts.iloc[3]:\n',apts.iloc[3])
print('apts.iloc[[3,4,1]]:\n',apts.iloc[[3,4,1]])
print('apts[:-1]:\n',apts[:-1])
print('apts[1:]+apts[:-1]:\n',apts[1:]+apts[:-1])
apts:
 Beijing      55000.0
Shanghai     60000.0
shenzhen     50000.0
Hangzhou     20000.0
Guangzhou    45000.0
Suzhou           NaN
Name: income, dtype: float64
apts.iloc[3]:
 20000.0
apts.iloc[[3,4,1]]:
 Hangzhou     20000.0
Guangzhou    45000.0
Shanghai     60000.0
Name: income, dtype: float64
apts[:-1]:
 Beijing      55000.0
Shanghai     60000.0
shenzhen     50000.0
Hangzhou     20000.0
Guangzhou    45000.0
Name: income, dtype: float64
apts[1:]+apts[:-1]:
 Beijing           NaN
Guangzhou     90000.0
Hangzhou      40000.0
Shanghai     120000.0
Suzhou            NaN
shenzhen     100000.0
Name: income, dtype: float64
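
The NaN for Beijing in the last result comes from index alignment: adding two Series matches values by label, not by position, and a label present on only one side yields NaN. A small sketch with a shortened, hypothetical three-city Series shows the difference between label-aligned and purely positional addition.

```python
import pandas as pd

apts = pd.Series({'Beijing': 55000, 'Shanghai': 60000, 'shenzhen': 50000},
                 name='income')

head = apts.iloc[:-1]   # Beijing, Shanghai
tail = apts.iloc[1:]    # Shanghai, shenzhen

# Series addition aligns on labels; only Shanghai exists on both sides
aligned = tail + head
print(aligned)

# Dropping to NumPy arrays adds element by element, ignoring labels
positional = tail.to_numpy() + head.to_numpy()
print(positional)       # [115000 110000]
```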

(2) Select a value by its label and test whether a label exists

import pandas as pd
cities={'Beijing':55000,'Shanghai':60000,'shenzhen':50000,'Hangzhou':20000,'Guangzhou':45000,'Suzhou':None}
apts=pd.Series(cities,name='income')
print(apts['Shanghai'])
print('Hangzhou' in apts)
print('Chongqing' in apts)
60000.0
True
False
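
Note that the in operator tests the labels, not the values. The sketch below (with a shortened, hypothetical Series) contrasts the two, and also shows Series.get, which returns a default instead of raising when a label is missing.

```python
import pandas as pd

apts = pd.Series({'Beijing': 55000, 'Shanghai': 60000}, name='income')

label_hit = 'Beijing' in apts          # True: membership checks the index
value_as_label = 55000 in apts         # False: 55000 is a value, not a label
value_hit = 55000 in apts.to_numpy()   # True: search the values explicitly

# .get avoids a KeyError for a missing label
fallback = apts.get('Chongqing', 0)

print(label_hit, value_as_label, value_hit, fallback)
```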

(3) Much like NumPy: boolean masks filter the data, and reductions such as mean, median, max, and min are available

import pandas as pd
cities={'Beijing':55000,'Shanghai':60000,'shenzhen':50000,'Hangzhou':20000,'Guangzhou':45000,'Suzhou':None}
apts=pd.Series(cities,name='income')
less_than_50000=(apts<=50000) 
print(apts[less_than_50000])
print(apts.mean()) 
shenzhen     50000.0
Hangzhou     20000.0
Guangzhou    45000.0
Name: income, dtype: float64
46000.0
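
The heading mentions median, max, and min, while the listing only calls mean. As a sketch, the remaining reductions can be tried on the same data. By default the pandas reductions skip NaN, whereas a plain NumPy mean over the raw values propagates it.

```python
import numpy as np
import pandas as pd

cities = {'Beijing': 55000, 'Shanghai': 60000, 'shenzhen': 50000,
          'Hangzhou': 20000, 'Guangzhou': 45000, 'Suzhou': None}
apts = pd.Series(cities, name='income')

# pandas reductions ignore the NaN for Suzhou by default
print(apts.mean())     # 46000.0
print(apts.median())   # 50000.0
print(apts.max())      # 60000.0
print(apts.min())      # 20000.0
print(apts.idxmax())   # Shanghai (label of the maximum)

# NumPy's plain mean does not skip NaN
raw_mean = np.mean(apts.to_numpy())
print(np.isnan(raw_mean))   # True
```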
3. Assigning to Series elements

Assign directly by index label. Boolean indexing also works on the left-hand side of an assignment.

import pandas as pd
cities={'Beijing':55000,'Shanghai':60000,'shenzhen':50000,'Hangzhou':20000,'Guangzhou':45000,'Suzhou':None}
apts=pd.Series(cities,name='income')
print(apts)
print('Old income of shenzhen:{}'.format(apts['shenzhen']))
apts['shenzhen']=70000  
print('New income of shenzhen:{}'.format(apts['shenzhen']),'\n')
less_than_50000=(apts<50000)  
print(less_than_50000)
apts[less_than_50000]=40000  
print(apts)
Beijing      55000.0
Shanghai     60000.0
shenzhen     50000.0
Hangzhou     20000.0
Guangzhou    45000.0
Suzhou           NaN
Name: income, dtype: float64
Old income of shenzhen:50000.0
New income of shenzhen:70000.0 

Beijing      False
Shanghai     False
shenzhen     False
Hangzhou      True
Guangzhou     True
Suzhou       False
Name: income, dtype: bool
Beijing      55000.0
Shanghai     60000.0
shenzhen     70000.0
Hangzhou     40000.0
Guangzhou    40000.0
Suzhou           NaN
Name: income, dtype: float64
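
The masked assignment above modifies apts in place. When the original should be left untouched, Series.mask is one alternative: it returns a new Series with the masked positions replaced. A sketch with a shortened, hypothetical Series:

```python
import pandas as pd

apts = pd.Series({'Beijing': 55000, 'Hangzhou': 20000,
                  'Guangzhou': 45000, 'Suzhou': None}, name='income')

# mask() replaces values where the condition is True and returns a copy.
# NaN < 50000 evaluates to False, so Suzhou stays NaN.
raised = apts.mask(apts < 50000, 40000)

print(raised)
print(apts['Hangzhou'])   # 20000.0, the original Series is unchanged
```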
4. Handling missing data in a Series: a simple example

import pandas as pd
cities={'Beijing':55000,'Shanghai':60000,'shenzhen':50000,'Hangzhou':20000,'Guangzhou':45000,'Suzhou':None}
apts=pd.Series(cities,name='income')
apts['shenzhen']=70000
less_than_50000=(apts<50000)
apts[less_than_50000]=40000
print('apts:\n',apts,'\n')
print(apts.notnull()) # boolean mask: True where a value is present
print(apts.isnull())  # boolean mask: True where a value is missing
print(apts[apts.isnull()])   # use the mask to select the missing entries
apts2=pd.Series({'Beijing':10000,'Shanghai':8000,'shenzhen':6000,'Tianjin':40000,'Guangzhou':7000,'Chongqing':30000})
print('apts2:\n',apts2)
apts3=apts+apts2   # a label missing from either Series produces NaN
print('apts3:\n',apts3)
apts3[apts3.isnull()]=300 # fill the missing positions with a placeholder value (300 here, not the median)
print(apts3)
apts:
 Beijing      55000.0
Shanghai     60000.0
shenzhen     70000.0
Hangzhou     40000.0
Guangzhou    40000.0
Suzhou           NaN
Name: income, dtype: float64 

Beijing       True
Shanghai      True
shenzhen      True
Hangzhou      True
Guangzhou     True
Suzhou       False
Name: income, dtype: bool
Beijing      False
Shanghai     False
shenzhen     False
Hangzhou     False
Guangzhou    False
Suzhou        True
Name: income, dtype: bool
Suzhou   NaN
Name: income, dtype: float64
apts2:
 Beijing      10000
Shanghai      8000
shenzhen      6000
Tianjin      40000
Guangzhou     7000
Chongqing    30000
dtype: int64
apts3:
 Beijing      65000.0
Chongqing        NaN
Guangzhou    47000.0
Hangzhou         NaN
Shanghai     68000.0
Suzhou           NaN
Tianjin          NaN
shenzhen     76000.0
dtype: float64
Beijing      65000.0
Chongqing      300.0
Guangzhou    47000.0
Hangzhou       300.0
Shanghai     68000.0
Suzhou         300.0
Tianjin        300.0
shenzhen     76000.0
dtype: float64
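
pandas also has dedicated helpers for these two steps. The sketch below uses small hypothetical Series: Series.add with fill_value treats a label missing on one side as 0, and fillna fills whatever is still missing, here with the median.

```python
import pandas as pd

a = pd.Series({'Beijing': 55000, 'Hangzhou': 40000, 'Suzhou': None})
b = pd.Series({'Beijing': 10000, 'Tianjin': 40000})

# Plain + gives NaN wherever a label is missing on either side
plain = a + b

# fill_value=0 substitutes 0 when only one side is missing;
# Suzhou is missing on both sides, so it stays NaN
filled = a.add(b, fill_value=0)

# Fill the remaining gap with the median of the known values
median_filled = filled.fillna(filled.median())

print(plain)
print(filled)
print(median_filled)
```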

2. Differences between ix, loc, and iloc in pandas

import pandas as pd
import numpy as np

data = pd.Series(np.arange(10), index=[49,48,47,46,45, 1, 2, 3, 4, 5])

print('data:\n',data,'\n')
print('data.iloc[:3]:\n',data.iloc[:3],'\n')
print('data.loc[:3]:\n',data.loc[:3],'\n')
print('data.ix[:3]:\n',data.ix[:3],'\n')  # needs pandas < 1.0; .ix was removed in pandas 1.0
data:
 49    0
48    1
47    2
46    3
45    4
1     5
2     6
3     7
4     8
5     9
dtype: int64 

data.iloc[:3]:
 49    0
48    1
47    2
dtype: int64 

data.loc[:3]:
 49    0
48    1
47    2
46    3
45    4
1     5
2     6
3     7
dtype: int64 

data.ix[:3]:
 49    0
48    1
47    2
46    3
45    4
1     5
2     6
3     7
dtype: int64 

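loc: indexes by label (it looks up the labels in the index, not positions); a slice includes both start and end.
iloc: indexes by integer position (ordinary zero-based offsets); a slice excludes end.
ix: tries labels first and falls back to positions only when the index is not made up entirely of integers. Label-based slices include end, while positional fallback slices exclude it. ix was deprecated in pandas 0.20.0 and removed in pandas 1.0.
To avoid ambiguity, prefer loc and iloc.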
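
The inclusive/exclusive difference can be checked directly. This sketch reuses the integer-labelled Series from the example above and relies only on loc and iloc, so it should also run on pandas versions without ix.

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(10), index=[49, 48, 47, 46, 45, 1, 2, 3, 4, 5])

# iloc slices by position and excludes the end position
by_position = data.iloc[:3]
print(list(by_position.index))   # [49, 48, 47]

# loc slices by label and includes the end label 3
by_label = data.loc[:3]
print(list(by_label.index))      # [49, 48, 47, 46, 45, 1, 2, 3]

# The same integer means different things to the two indexers
print(data.loc[3])    # 7, the value stored under label 3
print(data.iloc[3])   # 3, the value at position 3
```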

>>> data = pd.Series(np.arange(10), index=[49,48,47,46,45, 1, 2, 3, 4, 5])
>>> data
49    0
48    1
47    2
46    3
45    4
1     5
2     6
3     7
4     8
5     9
dtype: int64
>>> data.iloc[:6] # positions 0 through 5; the element at position 6 is excluded
49    0
48    1
47    2
46    3
45    4
1     5
dtype: int64
>>> data.loc[:6] # the index is unsorted and has no label 6, so this raises an error
...
...
KeyError: 6
>>> data.ix[:6] # the index is all integers, so ix is label-only here (no positional fallback) and label 6 is missing
...
...
KeyError: 6
>>> data= pd.Series(np.arange(10), index=['a','b','c','d','e', 1, 2, 3, 4, 5])
>>> data
a    0
b    1
c    2
d    3
e    4
1    5
2    6
3    7
4    8
5    9
dtype: int64
>>> data.ix[:6] # no error here: the index is not all integers, so ix falls back to positions
a    0
b    1
c    2
d    3
e    4
1    5
dtype: int64
>>> data.loc[:6]
TypeError: cannot do slice indexing
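
On current pandas, where ix is no longer available, the same mixed-label Series can be handled by stating the intent explicitly. The following is a sketch of one way to do that, not the only option: iloc for positions, loc for labels that actually exist in the index.

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(10),
                 index=['a', 'b', 'c', 'd', 'e', 1, 2, 3, 4, 5])

# What data.ix[:6] did on this index: the first six positions
first_six = data.iloc[:6]
print(list(first_six.index))      # ['a', 'b', 'c', 'd', 'e', 1]

# Label lookups go through loc and need labels present in the index
print(data.loc['c'])              # 2
print(data.loc[3])                # 7

# A label slice between two existing labels includes both ends
letters = data.loc['a':'e']
print(list(letters.index))        # ['a', 'b', 'c', 'd', 'e']
```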

Treat this as a quick set of notes on pandas indexing syntax.

References:
https://blog.csdn.net/cymy001/article/details/78268721
https://blog.csdn.net/zeroder/article/details/54319021
