[Pandas Data Analysis-02] Pandas Data Structures: DataFrame

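# A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, etc.).
# A DataFrame has both a row index and a column index; it can be thought of as a dict of Series that all share the same index.
# There are many ways to construct a DataFrame. One of the most common is to pass a dict of equal-length lists or NumPy arrays: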
import pandas as pd
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
       'year': [2000, 2001, 2002, 2001, 2001],
       'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)
frame
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2001
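# As a minimal sketch of the same construction with NumPy arrays instead of lists (the names data_np, frame_np and length_error are illustrative and not part of the original post), the following also checks that columns of unequal length are rejected:

```python
import numpy as np
import pandas as pd

# Same data as above, but each value is a NumPy array rather than a list.
data_np = {
    'state': np.array(['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada']),
    'year': np.array([2000, 2001, 2002, 2001, 2001]),
    'pop': np.array([1.5, 1.7, 3.6, 2.4, 2.9]),
}
frame_np = pd.DataFrame(data_np)

print(frame_np.shape)   # (number of rows, number of columns)
print(frame_np.dtypes)  # one dtype per column

# Columns of unequal length cannot be combined into one DataFrame.
length_error = None
try:
    pd.DataFrame({'a': [1, 2, 3], 'b': [1, 2]})
except ValueError as exc:
    length_error = exc
    print('ValueError:', exc)
```

# The column order in the printed frame may depend on the pandas version, so the sketch does not rely on it.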
# If you specify a sequence of columns, the DataFrame's columns are arranged in that order:
pd.DataFrame(data, columns=['year', 'state', 'pop'])
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2001  Nevada  2.9
# As with Series, if you pass a column that is not found in the data, it appears with NA values:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                 index=['one', 'two', 'three', 'four', 'five'])
frame2
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2001  Nevada  2.9  NaN
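# One way to confirm that the unmatched 'debt' column is entirely missing is to test it with isna(). This is only a sketch: it rebuilds the frame so that it runs on its own, and the variable name debt_all_missing is illustrative.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2001],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

# 'debt' is not a key of data, so that column is created with missing values only.
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])

debt_all_missing = frame2['debt'].isna().all()
print(debt_all_missing)

# The columns that were found in data contain no missing values.
print(frame2[['year', 'state', 'pop']].isna().sum())
```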
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
# A column can be retrieved as a Series either by dict-like notation or by attribute:
frame2['state']
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
frame2.year
one      2000
two      2001
three    2002
four     2001
five     2001
Name: year, dtype: int64
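# Attribute access only works when the column name is a valid Python identifier that does not collide with an existing DataFrame attribute. The 'pop' column used here is such a collision, because DataFrame already has a pop() method. A small sketch (the frame name demo is illustrative):

```python
import pandas as pd

demo = pd.DataFrame({'year': [2000, 2001], 'pop': [1.5, 1.7]})

# Dict-like notation always returns the column.
print(type(demo['pop']))

# Attribute access works for 'year' ...
print(type(demo.year))

# ... but demo.pop is the DataFrame.pop method, not the 'pop' column.
print(callable(demo.pop))
```

# For this reason dict-like notation is the safer default.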
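# The returned Series has the same index as the DataFrame, and its name attribute is set accordingly. Rows can also be retrieved by label or by position, for example with the loc indexer (the older ix indexer was deprecated in pandas 0.20 and removed in pandas 1.0):
frame2.loc['three']
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object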
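# A short sketch of label-based versus position-based row access. The frame rows and the variable names are illustrative; loc selects by index label and iloc by integer position.

```python
import pandas as pd

rows = pd.DataFrame({'year': [2000, 2001, 2002],
                     'state': ['Ohio', 'Ohio', 'Ohio'],
                     'pop': [1.5, 1.7, 3.6]},
                    index=['one', 'two', 'three'])

row_by_label = rows.loc['three']   # select by index label
row_by_position = rows.iloc[2]     # select by integer position (0-based)

# Both return the same row as a Series whose name is the row label.
print(row_by_label.name)
print(row_by_label.equals(row_by_position))

# loc also accepts a (row label, column label) pair to fetch a single value.
print(rows.loc['three', 'pop'])
```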
# Columns can be modified by assignment. For example, the empty 'debt' column can be assigned a scalar value or an array of values:
frame2['debt'] = 16.5
frame2
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2001  Nevada  2.9  16.5
import numpy as np
frame2['debt'] = np.arange(5.)
frame2
       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2001  Nevada  2.9   4.0
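# When you assign a list or array to a column, its length must match the length of the DataFrame.
# If you assign a Series, its labels are aligned exactly to the DataFrame's index, and any holes are filled with missing values: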
val = pd.Series([-1.2, -1.5, -1.7], index = ['two', 'four', 'five'])
frame2['debt'] = val
frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2001  Nevada  2.9  -1.7
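# The two assignment rules above can be sketched as follows. The frame small, the label 'six' and the column 'bad' are made up for illustration: Series labels missing from the frame are dropped, frame labels missing from the Series become NaN, and a list of the wrong length raises an error.

```python
import pandas as pd

small = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

# 'two' exists in the frame; 'six' does not and is silently ignored.
val = pd.Series([-1.2, 9.9], index=['two', 'six'])
small['debt'] = val
print(small)

# A plain list is matched by position, so its length must equal len(small).
mismatch_error = None
try:
    small['bad'] = [1, 2]
except ValueError as exc:
    mismatch_error = exc
    print('ValueError:', exc)
```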
# Assigning to a column that does not exist creates a new column. The del keyword deletes a column:
frame2['eastern'] = frame2.state == 'Ohio'
frame2
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2001  Nevada  2.9  -1.7    False
del frame2['eastern']
frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2001  Nevada  2.9  -1.7
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
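# del removes the column from the frame in place. When the original frame should stay untouched, DataFrame.drop(columns=...) returns a new frame instead. A minimal sketch, with the illustrative names states, without_eastern and still_present:

```python
import pandas as pd

states = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})
states['eastern'] = states['state'] == 'Ohio'   # creates a new boolean column

# drop returns a new DataFrame and leaves states unchanged.
without_eastern = states.drop(columns=['eastern'])
still_present = 'eastern' in states.columns
print(still_present)

# del modifies states itself.
del states['eastern']
print(list(states.columns))
```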
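# Another common form of data is a nested dict (a dict of dicts). The outer keys become the columns and the inner keys become the row index: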
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
frame3
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
# The result can be transposed:
frame3.T
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6
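# The keys of the inner dicts are combined and sorted to form the index of the result.
# This is not the case if an index is specified explicitly: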
pd.DataFrame(pop, index=[2001, 2002, 2003])
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN
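# A sketch of how the inner keys are combined, and of reindex as another way to select a specific set of row labels after construction. The order of the combined index may differ between pandas versions, so the example calls sort_index() rather than relying on it; the names nested, combined and subset are illustrative.

```python
import pandas as pd

nested = {'Nevada': {2001: 2.4, 2002: 2.9},
          'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# The row index is the union of all inner keys; sort it explicitly.
combined = pd.DataFrame(nested).sort_index()
print(combined)

# Nevada has no entry for 2000, so that cell is missing.
print(pd.isna(combined.loc[2000, 'Nevada']))

# reindex keeps the requested labels and fills unknown ones (2003) with NaN.
subset = combined.reindex([2001, 2002, 2003])
print(subset)
```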
# A dict of Series is treated in much the same way:
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
pd.DataFrame(pdata)
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
# If a DataFrame's index and columns have their name attributes set, these are displayed as well:
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6
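# The same axis names can also be set in a single call with rename_axis, which returns a new frame instead of mutating the existing one. A minimal sketch (the names nested, base and named are illustrative):

```python
import pandas as pd

nested = {'Nevada': {2001: 2.4, 2002: 2.9},
          'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
base = pd.DataFrame(nested)

# rename_axis names the row axis and the column axis in one step.
named = base.rename_axis(index='year', columns='state')

print(named.index.name)
print(named.columns.name)
print(base.index.name)   # the original frame is left unnamed
```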
# As with Series, the values attribute returns the data contained in the DataFrame as a two-dimensional ndarray:
frame3.values
array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])
# If the DataFrame's columns have different dtypes, the dtype of the values array is chosen to accommodate all of the columns:
frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2001  Nevada  2.9  -1.7
frame2.values
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2001, 'Nevada', 2.9, -1.7]], dtype=object)
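# The dtype rule can be checked directly. The sketch below uses to_numpy(), the method form of the same conversion, on an illustrative frame named mixed: integer and float columns together yield a float64 array, while adding a string column forces the object dtype.

```python
import numpy as np
import pandas as pd

mixed = pd.DataFrame({'year': [2000, 2001],
                      'state': ['Ohio', 'Nevada'],
                      'pop': [1.5, 2.4]})

# Integer and float columns share float64 as a common dtype.
numeric_array = mixed[['year', 'pop']].to_numpy()
print(numeric_array.dtype)

# With the string column included, only object can hold every value.
mixed_array = mixed.to_numpy()
print(mixed_array.dtype)
print(mixed_array.shape)
```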

Reposted from blog.csdn.net/caicaiatnbu/article/details/78536719