Python Data Analysis with NumPy and pandas (Part 12: Basic pandas Functionality, Indexing Data in a DataFrame)

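1. Indexing into a DataFrame

Indexing a DataFrame with [] retrieves one or more columns: a single column label returns a Series, and a sequence of labels returns a DataFrame. Slices and boolean arrays passed to [] select rows instead.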

import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

print(data)

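# Select one column by its column label
print(data["two"])

# Select several columns by passing a list of column labels
print(data[["three", "one"]])

# Slice by row position: returns the first two rows (positions 0 and 1)
print(data[:2])

# Select rows with a boolean array (produced by a scalar comparison):
# keeps every row whose value in column "three" is greater than 5
print(data[data["three"] > 5])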

Note the comments in the code above. The output is as follows.

Output of print(data):

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

Output of data["two"]:

Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32

Output of data[["three", "one"]]:

	three	one
Ohio	2	0
Colorado	6	4
Utah	10	8
New York	14	12

Output of data[:2]:

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7

Output of data[data["three"] > 5]:

	one	two	three	four
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15
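The four selections above return different kinds of objects, which is easy to overlook when reading printed output. The following is a minimal sketch that rebuilds the same frame and checks the return types; the variable names one_col, two_cols, first_two, three_gt_5 and selected are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

one_col = data["two"]               # single label -> Series
two_cols = data[["three", "one"]]   # list of labels -> DataFrame
first_two = data[:2]                # slice -> rows at positions 0 and 1

# The comparison itself yields a boolean Series indexed by row label.
three_gt_5 = data["three"] > 5
selected = data[three_gt_5]         # boolean Series -> matching rows

print(type(one_col).__name__, type(two_cols).__name__)
print(three_gt_5.tolist())
print(selected.index.tolist())
```

Storing the boolean Series in a variable first makes it possible to inspect or reuse the condition before applying it as a filter.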

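Applying a scalar comparison to every element of the DataFrame data produces a new object of the same shape whose values are booleans. For example, data < 5 outputs: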

	one	two	three	four
Ohio	True	True	True	True
Colorado	True	False	False	False
Utah	False	False	False	False
New York	False	False	False	False

We can use such a boolean DataFrame to assign the value 0 to every position where it is True:

data[data < 5] = 0

Printing data afterwards shows:

	one	two	three	four
Ohio	0	0	0	0
Colorado	0	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

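Every value in data that was less than 5 has been set to 0. Because this assignment modifies data in place, the examples that follow use the modified values.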
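The masking step can be reproduced on a separately built frame, which keeps the experiment apart from the data object used in the rest of the article. This is a minimal sketch; the names fresh and mask are hypothetical and not part of the original example.

```python
import numpy as np
import pandas as pd

fresh = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=["Ohio", "Colorado", "Utah", "New York"],
                     columns=["one", "two", "three", "four"])

# Element-wise comparison returns a boolean DataFrame of the same shape.
mask = fresh < 5

# Count the True positions before assigning through the mask.
n_true = int(mask.sum().sum())

# Only the positions where mask is True are overwritten.
fresh[mask] = 0

print(n_true)
print(fresh)
```

Counting the True positions first is a cheap way to confirm how many cells an assignment like this is going to touch.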

2. Selection on a DataFrame with loc and iloc

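Like Series, DataFrame has the special attributes loc and iloc for label-based and integer-based indexing, respectively. Since a DataFrame is two-dimensional, we can select a subset of rows and columns with NumPy-like notation, using either axis labels (loc) or integers (iloc).

For the data object above, data.loc["Colorado"] selects a single row by its label. Selecting a single row returns a Series whose index contains the DataFrame's column labels: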

one 0
two 5
three 6
four 7
Name: Colorado, dtype: int32

Passing a sequence of row labels to loc selects multiple rows and returns a new DataFrame. For example, data.loc[["Colorado", "New York"]] outputs:

	one	two	three	four
Colorado	0	5	6	7
New York	12	13	14	15

We can also combine row and column selection in loc by separating the two selections with a comma. For example, data.loc["Colorado", ["two", "three"]] outputs:

two 5
three 6
Name: Colorado, dtype: int32
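The three loc forms shown so far can be put side by side to compare what each one returns. The sketch below rebuilds the frame in the state the article uses at this point (values below 5 already set to 0); the names row, rows and pair are illustrative assumptions.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
data[data < 5] = 0  # same state as in the article at this point

row = data.loc["Colorado"]                       # one label -> Series
rows = data.loc[["Colorado", "New York"]]        # list of labels -> DataFrame
pair = data.loc["Colorado", ["two", "three"]]    # row label + column labels -> Series

print(row.name, row.tolist())
print(rows.shape)
print(pair.tolist())
```

The name of the returned Series is the row label, and its index holds the selected column labels.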

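We can also retrieve or slice data with iloc, but iloc takes integers (the positions of the rows and columns, counted from 0) rather than labels.

data.iloc[2] outputs: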

one 8
two 9
three 10
four 11
Name: Utah, dtype: int32

data.iloc[[2, 1]] outputs:

	one	two	three	four
Utah	8	9	10	11
Colorado	0	5	6	7

data.iloc[2, [3, 0, 1]] outputs:

four 11
one 8
two 9
Name: Utah, dtype: int32

data.iloc[[1, 2], [3, 0, 1]] outputs:

	four	one	two
Colorado	7	0	5
Utah	11	8	9
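Every iloc selection has a label-based counterpart in loc. As a minimal sketch, the code below checks that the last iloc example and the corresponding loc call select the same data; by_position and by_label are hypothetical names used only here.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
data[data < 5] = 0  # same state as in the article at this point

# Rows at positions 1 and 2, columns at positions 3, 0 and 1.
by_position = data.iloc[[1, 2], [3, 0, 1]]

# The same rows and columns, addressed by their labels.
by_label = data.loc[["Colorado", "Utah"], ["four", "one", "two"]]

same = by_position.equals(by_label)
print(same)
```

Which form to use depends on whether the code should keep working when rows are reordered (labels) or when labels change (positions).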

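Besides single labels and lists of labels, the loc and iloc indexers also accept slices.

In data.loc[:"Utah", "two"], the slice :"Utah" selects every row from the first row up to and including the row labeled "Utah" (label slices with loc include the end label), and "two" selects that column. The output is: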

Ohio 0
Colorado 5
Utah 9
Name: two, dtype: int32
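One detail worth checking by hand is that a label slice in loc includes its end label, while a position slice in iloc excludes its end position, as in ordinary Python slicing. The following sketch illustrates this; label_slice and pos_slice are names made up for the example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
data[data < 5] = 0  # same state as in the article at this point

# loc: the end label "Utah" is part of the result.
label_slice = data.loc[:"Utah", "two"]

# iloc: position 2 (the "Utah" row) is NOT part of the result.
pos_slice = data.iloc[:2, 1]

print(label_slice.index.tolist())
print(pos_slice.index.tolist())
```

To get the same three rows by position, the slice would have to be written as data.iloc[:3, 1].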

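data.iloc[:, :3][data.three > 5] first keeps the first three columns by position and then filters the rows with a boolean Series. It outputs: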

	one	two	three
Colorado	0	5	6
Utah	8	9	10
New York	12	13	14

A boolean Series can be used with loc but not with iloc; iloc accepts a boolean mask only as a plain array without an index.

data.loc[data.three >= 2] outputs:

	one	two	three	four
Colorado	0	5	6	7
Utah	8	9	10	11
New York	12	13	14	15
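The difference between the two indexers for boolean masks can be tried out directly. This is a sketch under the assumption of a reasonably recent pandas release, where passing a boolean Series to iloc raises an error; the names mask, with_loc, with_iloc and iloc_rejected_series are introduced for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
data[data < 5] = 0  # same state as in the article at this point

mask = data["three"] >= 2          # boolean Series, indexed by row label

with_loc = data.loc[mask]          # loc aligns the mask on the row labels

# iloc works by position only, so the mask must be a plain NumPy array.
with_iloc = data.iloc[mask.to_numpy()]

# Passing the Series itself to iloc is expected to be rejected.
iloc_rejected_series = False
try:
    data.iloc[mask]
except (ValueError, NotImplementedError):
    iloc_rejected_series = True

print(with_loc.equals(with_iloc), iloc_rejected_series)
```

Converting with to_numpy() drops the index, so the caller is responsible for the mask being in the same row order as the frame.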

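There are many ways to select and rearrange the data contained in a pandas object. For DataFrame, the following table lists a number of commonly used indexing options.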

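3. Integer indexing pitfalls

Working with pandas objects indexed by integers can be a stumbling block for newcomers, because they behave differently from built-in Python data structures such as lists and tuples. For example, you might not expect the following code to raise an error:

ser = pd.Series(np.arange(3.))
ser[-1]

Here ser has the default integer labels 0, 1 and 2, so it is ambiguous whether -1 is meant as a label or as a position. pandas treats it as a label, no such label exists, and the lookup fails with the following error: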

ValueError: -1 is not in range

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "d:\Softs\python-workspace\pydata_ana\test.py", line 6, in <module>
ser[-1]
~~~^^^^
File "D:\Softs\Python311\Lib\site-packages\pandas\core\series.py", line 1121, in __getitem__
return self._get_value(key)
^^^^^^^^^^^^^^^^^^^^
File "D:\Softs\Python311\Lib\site-packages\pandas\core\series.py", line 1237, in _get_value
loc = self.index.get_loc(label)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\Softs\Python311\Lib\site-packages\pandas\core\indexes\range.py", line 415, in get_loc
raise KeyError(key) from err
KeyError: -1

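With a non-integer index, there is no such ambiguity. For example:

ser2 = pd.Series(np.arange(3.), index=["a", "b", "c"])

Here ser2[-1] returns 2.0, because -1 cannot be a label and is therefore read as the last position. Recent pandas 2.x releases emit a FutureWarning for this positional fallback and recommend ser2.iloc[-1] instead. For the Series ser above, ser.iloc[-1] likewise returns the last element, 2.0, without raising the error shown earlier.

Slicing with integers, on the other hand, is position-based here, so it works as expected on ser.

ser[-3:-1] outputs: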

0 0.0
1 1.0
dtype: float64

Because of these pitfalls with integer indexes, it is best to always prefer loc and iloc for indexing, so that the intent is unambiguous.
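The recommendation can be illustrated with the same two Series. The sketch below is a hedged example rather than a complete treatment; the names plain_lookup_failed, last_by_position, first_by_label, label_slice_len and position_slice_len are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))                          # integer labels 0, 1, 2
ser2 = pd.Series(np.arange(3.), index=["a", "b", "c"])  # string labels

# Plain [] with -1 on an integer index is a label lookup and fails.
plain_lookup_failed = False
try:
    ser[-1]
except KeyError:
    plain_lookup_failed = True

# iloc always means position, regardless of the index type.
last_by_position = ser.iloc[-1]
last_by_position2 = ser2.iloc[-1]

# loc always means label.
first_by_label = ser.loc[0]

# The same slice bounds give different results: loc includes the end
# label 1, iloc excludes the end position 1.
label_slice_len = len(ser.loc[:1])
position_slice_len = len(ser.iloc[:1])

print(plain_lookup_failed, last_by_position, last_by_position2)
print(first_by_label, label_slice_len, position_slice_len)
```

Spelling out loc or iloc makes the code independent of whether the index happens to hold integers.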

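4. Pitfalls of chained indexing

In the sections above, we saw how to make flexible selections on a DataFrame with loc and iloc. These indexing attributes can also be used to modify a DataFrame in place, but doing so requires some care.

For example, in the DataFrame data from above, we can assign to a column or a row by label or by integer position.

After data.loc[:, "one"] = 1, data is: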

	one	two	three	four
Ohio	1	0	0	0
Colorado	1	5	6	7
Utah	1	9	10	11
New York	1	13	14	15

Setting the row at position 2 to 5 with data.iloc[2] = 5 gives:

	one	two	three	four
Ohio	1	0	0	0
Colorado	1	5	6	7
Utah	5	5	5	5
New York	1	13	14	15

Setting every row whose value in column four is greater than 5 to 3 with data.loc[data["four"] > 5] = 3 gives:

	one	two	three	four
Ohio	1	0	0	0
Colorado	3	3	3	3
Utah	5	5	5	5
New York	3	3	3	3

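Now suppose you want to select the rows where column three equals 5 and set the value of column three in those rows to 6. Written as a chained index,

data.loc[data.three == 5]["three"] = 6

the assignment does not stop the program, but it raises a warning and leaves data unchanged, because the value is written to a temporary copy returned by the first indexing step:

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data.loc[data.three == 5]["three"] = 6

As the warning suggests, use the .loc[row_indexer,col_indexer] = value form instead.

After data.loc[data.three == 5, "three"] = 6, data is: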

	one	two	three	four
Ohio	1	0	0	0
Colorado	3	3	3	3
Utah	5	5	6	5
New York	3	3	3	3
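The effect of the two forms can be compared on a frame that starts in the state shown before the assignment. This is a minimal sketch: the frame is typed in directly instead of being derived step by step, warnings are silenced only to keep the output short, and the names before, after_chained and after_loc are hypothetical.

```python
import warnings

import pandas as pd

data = pd.DataFrame([[1, 0, 0, 0],
                     [3, 3, 3, 3],
                     [5, 5, 5, 5],
                     [3, 3, 3, 3]],
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

before = data.loc["Utah", "three"]

# Chained form: the first [] builds a temporary copy, and the
# assignment lands on that copy instead of on data.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    data.loc[data.three == 5]["three"] = 6
after_chained = data.loc["Utah", "three"]

# Single loc call with row and column indexers: modifies data itself.
data.loc[data.three == 5, "three"] = 6
after_loc = data.loc["Utah", "three"]

print(before, after_chained, after_loc)
```

Only the single loc call changes the value, which is why the chained form is best avoided for assignments.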

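A good rule of thumb is to avoid chained indexing when doing assignments. If you want to dig deeper, I recommend the section on this topic in the online pandas documentation.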

Next time we will continue with arithmetic and data alignment.
