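Python Data Analysis with NumPy and pandas (Part 13: Data Alignment, Fill Values, and Operations Between DataFrame and Series in pandas)

1. Arithmetic and Data Alignment

pandas makes it simple to work with objects that have different indexes. When two Series objects are added and their indexes differ, the index of the result is the union of the two indexes.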

import numpy as np
import pandas as pd

# s1 is indexed by a, c, d, e; s2 is indexed by a, c, e, f, g,
# so their sum is indexed by the union a, c, d, e, f, g.
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])
s = s1 + s2

# Print s1, s2, and their sum s
print(s1)
print(s2)
print(s)

The output of s1, s2, and the sum is:

a 7.3
c -2.5
d 3.4
e 1.5
dtype: float64
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
dtype: float64
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64

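The output shows that the internal data alignment introduces missing values (NaN) at the labels that do not overlap. These missing values then propagate through any further arithmetic.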
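To make the propagation concrete, here is a minimal sketch that reuses s1 and s2 from above. The variable names total, missing_labels, and doubled are introduced only for this illustration.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])

total = s1 + s2

# Labels present in only one operand become NaN in the result.
missing_labels = total[total.isna()].index.tolist()
print(missing_labels)

# NaN propagates through further element-wise arithmetic...
doubled = total * 2
print(doubled.isna().sum())

# ...but reductions such as sum() skip NaN by default (skipna=True).
print(total.sum())
```

Element-wise operations keep the NaN values, while reductions ignore them unless skipna=False is passed.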

For a DataFrame, alignment is performed on both the rows and the columns.

import numpy as np
import pandas as pd

# For a DataFrame, alignment is performed on both rows and columns
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list("bcd"), 
                   index=["Ohio", "Texas", "Colorado"])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"), 
                   index=["Utah", "Ohio", "Texas", "Oregon"])

# df1 + df2 returns a DataFrame whose index and columns are the unions of those in df1 and df2
df = df1 + df2
print(df1)
print(df2)
print(df)

Output:

	b	c	d
Ohio	0.0	1.0	2.0
Texas	3.0	4.0	5.0
Colorado	6.0	7.0	8.0

	b	d	e
Utah	0.0	1.0	2.0
Ohio	3.0	4.0	5.0
Texas	6.0	7.0	8.0
Oregon	9.0	10.0	11.0

	b	c	d	e
Colorado	NaN	NaN	NaN	NaN
Ohio	3.0	NaN	6.0	NaN
Oregon	NaN	NaN	NaN	NaN
Texas	9.0	NaN	12.0	NaN
Utah	NaN	NaN	NaN	NaN

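The output shows that the "c" and "e" columns appear as missing in the result, because neither column is present in both df1 and df2. The same applies to row labels that are not common to both objects: the rows for Colorado (only in df1) and for Utah and Oregon (only in df2) are all missing in the result.

If you add DataFrame objects that have no column or row labels in common, the result contains only NaN: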

import numpy as np
import pandas as pd

# Adding DataFrame objects with no common column or row labels yields all NaN
df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"B": [3, 4]})
df = df1 + df2
print(df1)
print(df2)
print(df)

Output:

	A
0	1
1	2

	B
0	3
1	4

	A	B
0	NaN	NaN
1	NaN	NaN

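2. Arithmetic Methods with Fill Values

In arithmetic between differently indexed objects, we may want to fill with a special value, such as 0, when an axis label is found in one object but not in the other. In the example below, we first set one particular value to NaN by assigning np.nan.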

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), 
                   columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), 
                   columns=list("abcde"))

# Set the element of df2 at [1, "b"] to NaN
df2.loc[1, "b"] = np.nan
print(df1)
print(df2)

# Compute df1 + df2
df = df1 + df2
print(df)

Output:

	a	b	c	d
0	0.0	1.0	2.0	3.0
1	4.0	5.0	6.0	7.0
2	8.0	9.0	10.0	11.0

	a	b	c	d	e
0	0.0	1.0	2.0	3.0	4.0
1	5.0	NaN	7.0	8.0	9.0
2	10.0	11.0	12.0	13.0	14.0
3	15.0	16.0	17.0	18.0	19.0

	a	b	c	d	e
0	0.0	2.0	4.0	6.0	NaN
1	9.0	NaN	13.0	15.0	NaN
2	18.0	20.0	22.0	24.0	NaN
3	NaN	NaN	NaN	NaN	NaN

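The output shows that the element of df2 at [1, "b"] is now NaN. In the result of df1 + df2, every position whose row or column label is not present in both objects is NaN. In data analysis we often do not want NaN there and would rather keep the value that exists in one of the objects. In that case, use the add method with the fill_value argument: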

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), 
                   columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), 
                   columns=list("abcde"))

# Set the element of df2 at [1, "b"] to NaN
df2.loc[1, "b"] = np.nan

# Compute df1 + df2 with the add method and the fill_value argument
df = df1.add(df2, fill_value=0)
print(df)

The result is:

	a	b	c	d	e
0	0.0	2.0	4.0	6.0	4.0
1	9.0	5.0	13.0	15.0	9.0
2	18.0	20.0	22.0	24.0	14.0
3	15.0	16.0	17.0	18.0	19.0
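One detail worth checking is what fill_value does when a position is missing in both operands. The sketch below reuses df1 and df2 from above; the names filled and both_missing are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list("abcde"))
df2.loc[1, "b"] = np.nan

# Missing on one side only: the NaN in df2 is treated as 0, so 5.0 + 0 = 5.0.
filled = df1.add(df2, fill_value=0)
print(filled.loc[1, "b"])

# Missing on both sides: fill_value is not applied and the result stays NaN.
df1.loc[1, "b"] = np.nan
both_missing = df1.add(df2, fill_value=0)
print(both_missing.loc[1, "b"])
```

So fill_value substitutes for a value that is missing in one operand only; it does not invent a result where neither operand has data.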

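pandas provides the following flexible arithmetic methods, each with a reversed counterpart whose name starts with r: add/radd, sub/rsub, div/rdiv, floordiv/rfloordiv, mul/rmul, and pow/rpow. The reversed version swaps the two operands.

For example, 1 / df1 is equivalent to df1.rdiv(1), which outputs: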

	a	b	c	d
0	inf	1.000000	0.500000	0.333333
1	0.250	0.200000	0.166667	0.142857
2	0.125	0.111111	0.100000	0.090909

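Using the df1 and df2 above, we can also reindex df1: df1.reindex(columns=df2.columns, fill_value=0)

This reindexes the columns of df1 with the columns of df2, which adds a column e; fill_value specifies that the new column is filled with 0. Output:

	a	b	c	d	e
0	0.0	1.0	2.0	3.0	0
1	4.0	5.0	6.0	7.0	0
2	8.0	9.0	10.0	11.0	0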
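As a quick check of how the reversed methods behave, the following sketch compares operator expressions with their method equivalents on df1. It is an illustration only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list("abcd"))

# Reversed methods swap the operands: df1.rdiv(1) computes 1 / df1.
print((1 / df1).equals(df1.rdiv(1)))

# Likewise df1.rsub(10) computes 10 - df1, while df1.sub(10) computes df1 - 10.
print((10 - df1).equals(df1.rsub(10)))
print((df1 - 10).equals(df1.sub(10)))
```

The method forms are mainly useful when you need extra arguments such as fill_value or axis, which the operators cannot take.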

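3. Operations Between DataFrame and Series

We start with a NumPy example to serve as a comparison for the operations between DataFrame and Series that follow. The example subtracts one row of a NumPy array from the array itself.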

import numpy as np
import pandas as pd

# Build a NumPy array and use reshape to give it three rows and four columns
arr = np.arange(12.).reshape((3, 4))
print(arr)
# Print row 0
print(arr[0])

# Subtract the first row of arr from arr and inspect the result
a = arr - arr[0]
print(a)

Output:

[[ 0. 1. 2. 3.]
[ 4. 5. 6. 7.]
[ 8. 9. 10. 11.]]
[0. 1. 2. 3.]
[[0. 0. 0. 0.]
[4. 4. 4. 4.]
[8. 8. 8. 8.]]

The output shows that arr - arr[0] subtracts arr[0] from every row of arr. This is called broadcasting. Operations between a DataFrame and a Series work in a similar way:

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"), 
                     index=["Utah", "Ohio", "Texas", "Oregon"])
print(frame)
series = frame.iloc[0]
print(series)
res = frame - series
print(res)

Output of frame:

	b	d	e
Utah	0.0	1.0	2.0
Ohio	3.0	4.0	5.0
Texas	6.0	7.0	8.0
Oregon	9.0	10.0	11.0

Output of series:

b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64

Output of frame - series:

	b	d	e
Utah	0.0	0.0	0.0
Ohio	3.0	3.0	3.0
Texas	6.0	6.0	6.0
Oregon	9.0	9.0	9.0

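By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts down the rows. If an index value is not found in either the DataFrame's columns or the Series's index, the objects are reindexed to form the union. For example: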

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"), 
                     index=["Utah", "Ohio", "Texas", "Oregon"])
print(frame)
series2 = pd.Series(np.arange(3), index=["b", "e", "f"])
print(series2)

res = frame + series2
print(res)

Output of frame:

	b	d	e
Utah	0.0	1.0	2.0
Ohio	3.0	4.0	5.0
Texas	6.0	7.0	8.0
Oregon	9.0	10.0	11.0

Output of series2:

b 0
e 1
f 2
dtype: int32

Output of frame + series2:

	b	d	e	f
Utah	0.0	NaN	3.0	NaN
Ohio	3.0	NaN	6.0	NaN
Texas	6.0	NaN	9.0	NaN
Oregon	9.0	NaN	12.0	NaN

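The result shows that the values for the unmatched labels are filled with NaN.

To broadcast over the columns instead, matching on the rows, use one of the arithmetic methods and specify that matching is on the index. For example: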

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"), 
                     index=["Utah", "Ohio", "Texas", "Oregon"])
print(frame)
series3 = frame["d"]
print(series3)

res = frame.sub(series3, axis="index")
print(res)

Output of frame:

	b	d	e
Utah	0.0	1.0	2.0
Ohio	3.0	4.0	5.0
Texas	6.0	7.0	8.0
Oregon	9.0	10.0	11.0

Output of series3:

Utah 1.0
Ohio 4.0
Texas 7.0
Oregon 10.0
Name: d, dtype: float64

frame.sub(series3, axis="index") subtracts series3 from frame, broadcasting across the columns. Output:

	b	d	e
Utah	-1.0	0.0	1.0
Ohio	-1.0	0.0	1.0
Texas	-1.0	0.0	1.0
Oregon	-1.0	0.0	1.0

The axis passed in the example above is the axis to match on. Here we mean to match on the DataFrame's row index (axis="index") and broadcast across the columns.
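A common use of matching on the row index is centering each row around its own mean. The sketch below is one way to do it; row_means and centered are names chosen for this illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

# One mean per row; the resulting Series is indexed by the row labels.
row_means = frame.mean(axis="columns")

# Match on the row index and broadcast the subtraction across the columns.
centered = frame.sub(row_means, axis="index")
print(centered)
```

Writing frame - row_means instead would try to match the row labels against the column labels b, d, e and produce only NaN, which is why the axis argument matters here.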

4. Function Application and Mapping

NumPy ufuncs (element-wise array methods) also work with pandas objects.

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.random.standard_normal((4, 3)), 
                     columns=list("bde"), 
                     index=["Utah", "Ohio", "Texas", "Oregon"])
print(frame)
print(np.abs(frame))

The code above uses NumPy's abs function to take the absolute value of every element of frame. Output of frame:

	b	d	e
Utah	-0.821878	0.615772	-1.658571
Ohio	1.185280	-1.522521	-0.606501
Texas	2.082847	-0.657700	-2.026195
Oregon	-0.255501	-0.292001	-0.511458

Output after taking the absolute value:

	b	d	e
Utah	0.821878	0.615772	1.658571
Ohio	1.185280	1.522521	0.606501
Texas	2.082847	0.657700	2.026195
Oregon	0.255501	0.292001	0.511458

Another common operation is applying a function that works on one-dimensional arrays to each column or each row.

import numpy as np
import pandas as pd

def f1(x):
    return x.max() - x.min()

frame = pd.DataFrame(np.random.standard_normal((4, 3)), 
                     columns=list("bde"), 
                     index=["Utah", "Ohio", "Texas", "Oregon"])
print(frame)

res = frame.apply(f1)
print(res)

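The code above first defines a function f1, then uses NumPy to create a 4×3 array of samples from the standard normal distribution and builds a DataFrame, frame, from it. Finally, it applies f1 to frame with apply. f1 is called once per column and returns that column's maximum minus its minimum, so the result is a Series indexed by the columns of frame. Output:

	b	d	e
Utah	1.077379	0.857446	1.449707
Ohio	-0.887761	0.443201	1.345005
Texas	-2.337501	-1.673620	0.499202
Oregon	1.481608	0.508004	-0.944054

b 3.819109
d 2.531066
e 2.393761
dtype: float64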

If we also pass axis="columns", the function is called once per row instead:

import numpy as np
import pandas as pd

def f1(x):
    return x.max() - x.min()


frame = pd.DataFrame(np.random.standard_normal((4, 3)), 
                     columns=list("bde"), 
                     index=["Utah", "Ohio", "Texas", "Oregon"])
print(frame)

res = frame.apply(f1, axis="columns")
print(res)

Output (a Series whose values are each row's maximum minus its minimum, indexed by the row index of frame):

	b	d	e
Utah	-1.187664	1.646772	-0.937064
Ohio	-2.126515	0.017911	1.342567
Texas	-0.069987	0.486575	-0.564158
Oregon	1.895647	0.295122	-1.274326

Utah 2.834436
Ohio 3.469081
Texas 1.050733
Oregon 3.169973
dtype: float64
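For a simple reduction like this one, the built-in DataFrame methods give the same result without apply. The sketch below compares the two approaches; the fixed seed is an assumption added only so the run is reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
frame = pd.DataFrame(rng.standard_normal((4, 3)), columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

def f1(x):
    return x.max() - x.min()

# Column-wise range: apply versus the built-in reductions.
by_apply = frame.apply(f1)
by_builtin = frame.max() - frame.min()
print(np.allclose(by_apply, by_builtin))

# Row-wise range: pass the same axis to both approaches.
rows_apply = frame.apply(f1, axis="columns")
rows_builtin = frame.max(axis="columns") - frame.min(axis="columns")
print(np.allclose(rows_apply, rows_builtin))
```

apply remains the right tool when the per-column or per-row logic has no built-in equivalent.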

The function passed to apply does not have to return a scalar; it can also return a Series with multiple values. For example:

import numpy as np
import pandas as pd

def f1(x):
    return x.max() - x.min()

def f2(x):
    return pd.Series([x.min(), x.max()], index=["min", "max"])


frame = pd.DataFrame(np.random.standard_normal((4, 3)), 
                     columns=list("bde"), 
                     index=["Utah", "Ohio", "Texas", "Oregon"])
print(frame)

res = frame.apply(f2)
print(res)

The code above defines a function f2 that returns a Series holding the minimum and maximum of the column passed in, with the index labels min and max. Output:

	b	d	e
Utah	-0.786485	0.711256	-0.505295
Ohio	-0.165102	1.032220	1.672411
Texas	0.300449	-0.079302	0.247862
Oregon	0.917424	1.520003	-1.321421

	b	d	e
min	-0.786485	-0.079302	-1.321421
max	0.917424	1.520003	1.672411

Element-wise Python functions can be used as well. Suppose we want to format every element of frame as a floating-point string. applymap does this:

import numpy as np
import pandas as pd

def f1(x):
    return x.max() - x.min()

def f2(x):
    return pd.Series([x.min(), x.max()], index=["min", "max"])

def my_format(x):
    return f"{x:.2f}"

frame = pd.DataFrame(np.random.standard_normal((4, 3)), 
                     columns=list("bde"), 
                     index=["Utah", "Ohio", "Texas", "Oregon"])
print(frame)

res = frame.applymap(my_format)
print(res)

The code above adds a formatting function, my_format, which renders a float with 2 decimal places. Output:

	b	d	e
Utah	0.864358	-0.430545	0.520036
Ohio	-1.044976	-0.466799	0.635771
Texas	-0.251223	-0.169812	0.530249
Oregon	-0.352798	1.006282	0.712523

d:\Softs\python-workspace\pydata_ana\test.py:18: FutureWarning: DataFrame.applymap has been deprecated. Use DataFrame.map instead.
res = frame.applymap(my_format)

	b	d	e
Utah	0.86	-0.43	0.52
Ohio	-1.04	-0.47	0.64
Texas	-0.25	-0.17	0.53
Oregon	-0.35	1.01	0.71

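applymap called my_format on each element to produce the formatted values. The output includes a warning that recommends map in place of applymap, because applymap is deprecated in recent pandas versions. So we can write this instead:

res = frame.map(my_format)

Series also has a map method that applies a function element-wise: frame["e"].map(my_format)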
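If the same script has to run on both older and newer pandas versions, one option is a small wrapper that prefers DataFrame.map (added in pandas 2.1) and falls back to applymap. The helper name elementwise is hypothetical and is not part of pandas.

```python
import numpy as np
import pandas as pd

def elementwise(frame, func):
    """Apply func to every element of frame.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    if hasattr(pd.DataFrame, "map"):   # pandas 2.1 and later
        return frame.map(func)
    return frame.applymap(func)        # older pandas versions

frame = pd.DataFrame(np.arange(6.).reshape((2, 3)) / 3, columns=list("bde"))
formatted = elementwise(frame, lambda x: f"{x:.2f}")
print(formatted)
```

Note that the formatted result holds strings rather than floats, so it is suited to display, not to further arithmetic.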

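5. Sorting and Ranking

Sorting a dataset by some criterion is an important built-in operation. To sort lexicographically by row or column label, use the sort_index method, which returns a new, sorted object.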

import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4), index=["d", "a", "b", "c"])
obj_sort = obj.sort_index()
print(obj)
print(obj_sort)

Output before and after sorting:

d 0
a 1
b 2
c 3
dtype: int32
a 1
b 2
c 3
d 0
dtype: int32

With a DataFrame, you can sort by index on either axis.

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(8).reshape((2, 4)), 
                     index=["three", "one"], 
                     columns=["d", "a", "b", "c"])
# Sort by row index
sort_row = frame.sort_index()
# Sort by column labels
sort_column = frame.sort_index(axis="columns")

print(frame)
print(sort_row)
print(sort_column)

Output:

	d	a	b	c
three	0	1	2	3
one	4	5	6	7

	d	a	b	c
one	4	5	6	7
three	0	1	2	3

	a	b	c	d
three	1	2	3	0
one	5	6	7	4

Data is sorted in ascending order by default, but it can be sorted in descending order too. For example:

frame.sort_index(axis="columns", ascending=False)

Output:

	d	c	b	a
three	0	3	2	1
one	4	7	6	5

To sort a Series by its values, use its sort_values method:

obj = pd.Series([4, 7, -3, 2])
obj.sort_values()

Output after sorting:

2 -3
3 2
0 4
1 7
dtype: int64

By default, any missing values are sorted to the end of the Series.

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

Output after sorting:

4 -3.0
5 2.0
0 4.0
2 7.0
1 NaN
3 NaN
dtype: float64

The na_position argument can place missing values at the start instead: obj.sort_values(na_position="first"). Output:

1 NaN
3 NaN
4 -3.0
5 2.0
0 4.0
2 7.0
dtype: float64

When sorting a DataFrame, you can use the data in one or more columns as the sort keys. To do so, pass one or more column names to sort_values:

import numpy as np
import pandas as pd

frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

frame1 = frame.sort_values("b")
frame2 = frame.sort_values(["a", "b"])
print(frame)
print(frame1)
print(frame2)

Output of frame:

	b	a
0	4	0
1	7	1
2	-3	0
3	2	1

Output of frame.sort_values("b"):

	b	a
2	-3	0
3	2	1
0	4	0
1	7	1

Output of frame.sort_values(["a", "b"]):

	b	a
2	-3	0
0	4	0
3	2	1
1	7	1
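When sorting by several columns, the ascending argument also accepts a list, so each key can have its own direction. A minimal sketch using the same frame follows; the name mixed is chosen for this illustration.

```python
import pandas as pd

frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# Sort by a ascending, then by b descending within each value of a.
mixed = frame.sort_values(["a", "b"], ascending=[True, False])
print(mixed)

# sort_values returns a new object; the original row order is unchanged.
print(frame.index.tolist())
```

The list passed to ascending must have one entry per sort key, in the same order as the column names.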

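Use rank to rank a Series or DataFrame.

By default, rank assigns ranks in ascending order, from smallest to largest, and breaks ties by giving each tied value the mean of the ranks that the tied group occupies. For example: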

import numpy as np
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj_rank = obj.rank()
print(obj)
print(obj_rank)

Output of the Series obj:

0 7
1 -5
2 7
3 4
4 2
5 0
6 4
dtype: int64

Output of the ranks obj_rank:

0 6.5
1 1.0
2 6.5
3 4.5
4 3.0
5 2.0
6 4.5
dtype: float64

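To walk through the output above: rank() ranks from smallest to largest by default, and tied values receive the average of their ranks. So -5, the smallest value in obj, ranks 1st; 0 ranks 2nd; and so on. obj contains two 4s, which occupy ranks 4 and 5, so both receive the average (4 + 5) / 2 = 4.5.

Alternatively, passing method="first" to rank assigns ranks to equal values according to the order in which they appear. For the two 4s in obj, the one that appears first ranks 4th and the later one ranks 5th, with no averaging. For example:

obj.rank(method="first") outputs: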

0 6.0
1 1.0
2 7.0
3 4.0
4 3.0
5 2.0
6 5.0
dtype: float64

You can also rank in descending order with rank(ascending=False); ties still receive their average rank by default. For example, obj.rank(ascending=False) outputs:

0 1.5
1 7.0
2 1.5
3 3.5
4 5.0
5 6.0
6 3.5
dtype: float64

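In descending order, obj has 7 elements and -5 is the smallest, so it ranks 7th; 0 ranks 6th, and so on.

The method argument of rank accepts the following tie-breaking options: "average" (the default; assign the mean rank to each value in the tied group), "min" (use the lowest rank for the whole group), "max" (use the highest rank for the whole group), "first" (assign ranks in the order the values appear in the data), and "dense" (like "min", but ranks always increase by 1 between groups).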
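To compare the tie-breaking options of rank side by side, the sketch below ranks the same obj once per method and collects the results in a single DataFrame. The names methods and comparison are introduced only for this illustration.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method, for side-by-side comparison.
methods = ["average", "min", "max", "first", "dense"]
comparison = pd.DataFrame({m: obj.rank(method=m) for m in methods})
comparison.insert(0, "value", obj)
print(comparison)
```

The rows for the two 7s and the two 4s are where the columns differ; every untied value gets the same rank under all methods except "dense", which renumbers the groups consecutively.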

Ranking in a DataFrame:

import numpy as np
import pandas as pd

frame = pd.DataFrame({"b": [4.3, 7, -3, 2], "a": [0, 1, 0, 1], "c": [-2, 5, 8, -2.5]})
frame_rank = frame.rank(axis="columns")

print(frame)
print(frame_rank)

Output of frame:

	b	a	c
0	4.3	0	-2.0
1	7.0	1	5.0
2	-3.0	0	8.0
3	2.0	1	-2.5

Output of frame.rank(axis="columns"), which ranks the values within each row:

	b	a	c
0	3.0	2.0	1.0
1	3.0	1.0	2.0
2	1.0	2.0	3.0
3	3.0	2.0	1.0

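6. Axis Indexes with Duplicate Labels

Up to now, almost all of the examples we have looked at have had unique axis labels (index values). Although many pandas functions (such as reindex) require the labels to be unique, uniqueness is not mandatory. Consider a Series with duplicate index labels:

obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"]) outputs: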

a 0
a 1
b 2
b 3
c 4
dtype: int32

The is_unique property of the index tells you whether its labels are unique. obj.index.is_unique outputs False.

obj['a'] returns a Series:

a 0
a 1
dtype: int32

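obj['c'], by contrast, returns the scalar 4. This can make code more complicated, because the type returned by indexing varies depending on whether the label is repeated.

The same logic extends to indexing rows (or columns) in a DataFrame: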

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.standard_normal((5, 3)), 
                  index=["a", "a", "b", "b", "c"])

print(df)
print(df.loc["b"])
print(df.loc["c"])

Output:

	0	1	2
a	0.056530	0.342535	-1.499590
a	0.256863	-0.302280	0.188917
b	1.782416	-1.480863	1.460984
b	-0.874985	-1.009648	1.495692
c	-0.173825	0.619799	-0.582545

	0	1	2
b	1.782416	-1.480863	1.460984
b	-0.874985	-1.009648	1.495692

0 -0.173825
1 0.619799
2 -0.582545
Name: c, dtype: float64
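When duplicate labels are possible, two habits help keep the return type predictable: select with a list of labels, and check or remove duplicates up front. The sketch below shows both; keeping the first occurrence is an assumption made for illustration, and the right policy depends on the data.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])

# A scalar label returns a Series or a scalar depending on duplication...
print(type(obj["a"]).__name__, type(obj["c"]).__name__)

# ...while a list of labels always returns a Series.
print(type(obj.loc[["c"]]).__name__)

# index.duplicated() flags repeats; keep="first" retains the first occurrence.
if not obj.index.is_unique:
    deduped = obj[~obj.index.duplicated(keep="first")]
    print(deduped)
```

After deduplication the index is unique, so a scalar label always returns a scalar.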

That is a lot of material for one session, so we will stop here. Next time: summarizing and computing descriptive statistics.
