I. Preface
[Data Mining] The Machine-Learning Trio: Summary of Common NumPy Usage
[Data Mining] The Machine-Learning Trio: Summary of Common Pandas Usage (Part 1)
II. Summary of Common Usage
1. Setting values
import pandas as pd
import numpy as np

dates = pd.date_range('20191222', periods=3)
df = pd.DataFrame(np.arange(12).reshape((3, 4)), index=dates, columns=['A', 'B', 'C', 'D'])
print(df)
"""
            A  B   C   D
2019-12-22  0  1   2   3
2019-12-23  4  5   6   7
2019-12-24  8  9  10  11
"""
# index
df.iloc[1, 1] = 1111
print(df)
"""
            A     B   C   D
2019-12-22  0     1   2   3
2019-12-23  4  1111   6   7
2019-12-24  8     9  10  11
"""
# label
df.loc['20191224', 'C'] = 2222
print(df)
"""
            A     B     C   D
2019-12-22  0     1     2   3
2019-12-23  4  1111     6   7
2019-12-24  8     9  2222  11
"""
# mix (df.ix was removed in pandas 1.0; use loc and look the column up by position)
df.loc['20191222', df.columns[1]] = 3333
print(df)
"""
            A     B     C   D
2019-12-22  0  3333     2   3
2019-12-23  4  1111     6   7
2019-12-24  8     9  2222  11
"""


dates2 = pd.date_range('20191222', periods=6)
df2 = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates2, columns=['A', 'B', 'C', 'D'])
print(df2)
"""
             A   B   C   D
2019-12-22   0   1   2   3
2019-12-23   4   5   6   7
2019-12-24   8   9  10  11
2019-12-25  12  13  14  15
2019-12-26  16  17  18  19
2019-12-27  20  21  22  23
"""
df2[df2.A > 12] = 0
print(df2)
df2.A[df2.A > 8] = 0
print(df2)
df2.B[df2.A > 2] = 0
print(df2)
# batch processing
"""
             A   B   C   D
2019-12-22   0   1   2   3
2019-12-23   4   5   6   7
2019-12-24   8   9  10  11
2019-12-25  12  13  14  15
2019-12-26   0   0   0   0
2019-12-27   0   0   0   0
             A   B   C   D
2019-12-22   0   1   2   3
2019-12-23   4   5   6   7
2019-12-24   8   9  10  11
2019-12-25   0  13  14  15
2019-12-26   0   0   0   0
2019-12-27   0   0   0   0
             A   B   C   D
2019-12-22   0   1   2   3
2019-12-23   4   0   6   7
2019-12-24   8   0  10  11
2019-12-25   0  13  14  15
2019-12-26   0   0   0   0
2019-12-27   0   0   0   0
"""
# add a column
df2['F'] = np.nan
print(df2)
"""
             A   B   C   D   F
2019-12-22   0   1   2   3 NaN
2019-12-23   4   0   6   7 NaN
2019-12-24   8   0  10  11 NaN
2019-12-25   0  13  14  15 NaN
2019-12-26   0   0   0   0 NaN
2019-12-27   0   0   0   0 NaN
"""
# add a column using a Series
df2['E'] = pd.Series([1, 2, 3, 4, 5, 6], index=dates2)
print(df2)
"""
             A   B   C   D   F  E
2019-12-22   0   1   2   3 NaN  1
2019-12-23   4   0   6   7 NaN  2
2019-12-24   8   0  10  11 NaN  3
2019-12-25   0  13  14  15 NaN  4
2019-12-26   0   0   0   0 NaN  5
2019-12-27   0   0   0   0 NaN  6
"""
Some closing notes
The first method is just what you'd guess: select values exactly as before, then assign to the selection, whether by index, by label, or by a mix of the two.
For numeric data, comparisons let you batch-process values: several rows and columns at once, or a single row or column at a time.
Setting values also covers adding a new column, i.e. a new Series. A scalar broadcasts to the whole column, but if the input is more complex I recommend adding it as a Series, so it really comes down to how well you know Series; the pattern is dataframe['column name'] = pd.Series([...]).
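To make the last point concrete, here is a small runnable sketch (the frame and the new column names E and F are just illustrative): a scalar broadcasts to every row of a new column, while a Series aligns on the index and leaves NaN wherever it has no value.

```python
import pandas as pd
import numpy as np

dates = pd.date_range('20191222', periods=3)
df = pd.DataFrame(np.arange(12).reshape((3, 4)), index=dates,
                  columns=['A', 'B', 'C', 'D'])

# a scalar broadcasts to the whole new column
df['E'] = 7

# a Series aligns on the index: rows missing from the Series become NaN
partial = pd.Series([10, 20], index=dates[:2])
df['F'] = partial

print(df)
```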
2. Handling missing values
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np

dates = pd.date_range('20191222', periods=4)
df = pd.DataFrame(np.arange(16).reshape((4, 4)), index=dates, columns=['A', 'B', 'C', 'D'])
print(df)
"""
             A   B   C   D
2019-12-22   0   1   2   3
2019-12-23   4   5   6   7
2019-12-24   8   9  10  11
2019-12-25  12  13  14  15
"""
# add NaN
df.iloc[0, 1] = np.nan
df.iloc[2, 2] = np.nan
print(df)
# drop NaN (by index or by column)
print(df.dropna(axis=0, how='any'))
print(df.dropna(axis=1, how='any'))
"""
             A     B     C   D
2019-12-22   0   NaN   2.0   3
2019-12-23   4   5.0   6.0   7
2019-12-24   8   9.0   NaN  11
2019-12-25  12  13.0  14.0  15
             A     B     C   D
2019-12-23   4   5.0   6.0   7
2019-12-25  12  13.0  14.0  15
             A   D
2019-12-22   0   3
2019-12-23   4   7
2019-12-24   8  11
2019-12-25  12  15
"""

# compare before and after
print(df.dropna(axis=1, how='all'))
"""
             A     B     C   D
2019-12-22   0   NaN   2.0   3
2019-12-23   4   5.0   6.0   7
2019-12-24   8   9.0   NaN  11
2019-12-25  12  13.0  14.0  15
"""
df.iloc[1, 1] = np.nan
df.iloc[2, 1] = np.nan
df.iloc[3, 1] = np.nan
print(df)
print(df.dropna(axis=1, how='all'))
"""
             A   B     C   D
2019-12-22   0 NaN   2.0   3
2019-12-23   4 NaN   6.0   7
2019-12-24   8 NaN   NaN  11
2019-12-25  12 NaN  14.0  15
             A     C   D
2019-12-22   0   2.0   3
2019-12-23   4   6.0   7
2019-12-24   8   NaN  11
2019-12-25  12  14.0  15
"""
# fillna has many parameters; see the docs for details
# fill NaN with a value
print(df.fillna(value=222.22))
"""
             A       B       C   D
2019-12-22   0  222.22    2.00   3
2019-12-23   4  222.22    6.00   7
2019-12-24   8  222.22  222.22  11
2019-12-25  12  222.22   14.00  15
"""
# detect NaN
# useful when the table is very large
# NaN cells come back as True
print(df.isnull())
# np.any returns True if any element of df.isnull() is True,
# so together they check whether the whole DataFrame contains any NaN
print(np.any(df.isnull()))

print(np.any(df.isnull()) == True)

"""
                A     B      C      D
2019-12-22  False  True  False  False
2019-12-23  False  True  False  False
2019-12-24  False  True   True  False
2019-12-25  False  True  False  False
True
True
"""
Dropping NaN: how='all' drops only when everything is NaN, how='any' drops when any value is NaN; axis decides whether an index (row) or a column is dropped.
Filling NaN with a value: fillna.
For a huge DataFrame, check whether it contains any NaN with isnull plus np.any.
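fillna really does take more than a single value; as a hedged sketch (frame and column names invented here), it also accepts a per-column dict, and ffill propagates the last valid value downward:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, 5.0, np.nan]})

# a different fill value per column, via a dict
filled = df.fillna(value={'A': 0.0, 'B': -1.0})

# forward-fill: propagate the last valid value downward
ffilled = df.ffill()

print(filled)
print(ffilled)
```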
3. Concatenation with concat
①
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.ones((2, 3))*0, columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.ones((2, 3))*1, columns=['a', 'b', 'c'])
df3 = pd.DataFrame(np.ones((2, 3))*22, columns=['a', 'b', 'c'])
print(df1)
print(df2)
print(df3)
"""
     a    b    c
0  0.0  0.0  0.0
1  0.0  0.0  0.0
     a    b    c
0  1.0  1.0  1.0
1  1.0  1.0  1.0
      a     b     c
0  22.0  22.0  22.0
1  22.0  22.0  22.0
"""
# concatenate all three; the result index is just the original indexes stacked;
# axis=0 means vertical concatenation
df_vertical = pd.concat([df1, df2, df3], axis=0)
print(df_vertical)
"""
      a     b     c
0   0.0   0.0   0.0
1   0.0   0.0   0.0
0   1.0   1.0   1.0
1   1.0   1.0   1.0
0  22.0  22.0  22.0
1  22.0  22.0  22.0
"""
# renumber the index
df_vertical_rightindex = pd.concat([df1, df2, df3], axis=0, ignore_index=True)
print(df_vertical_rightindex)
"""
      a     b     c
0   0.0   0.0   0.0
1   0.0   0.0   0.0
2   1.0   1.0   1.0
3   1.0   1.0   1.0
4  22.0  22.0  22.0
5  22.0  22.0  22.0
"""
# horizontal
df_horizontal = pd.concat([df1, df2, df3], axis=1)
print(df_horizontal)
"""
     a    b    c    a    b    c     a     b     c
0  0.0  0.0  0.0  1.0  1.0  1.0  22.0  22.0  22.0
1  0.0  0.0  0.0  1.0  1.0  1.0  22.0  22.0  22.0
"""
df_horizontal_rightindex = pd.concat([df1, df2, df3], axis=1, ignore_index=True)
print(df_horizontal_rightindex)
"""
     0    1    2    3    4    5     6     7     8
0  0.0  0.0  0.0  1.0  1.0  1.0  22.0  22.0  22.0
1  0.0  0.0  0.0  1.0  1.0  1.0  22.0  22.0  22.0
"""
df_horizontal_rightindex.rename(
    columns={0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e',
             5: 'f', 6: 'g', 7: 'h', 8: 'i'}, inplace=True)
print(df_horizontal_rightindex)
"""
     a    b    c    d    e    f     g     h     i
0  0.0  0.0  0.0  1.0  1.0  1.0  22.0  22.0  22.0
1  0.0  0.0  0.0  1.0  1.0  1.0  22.0  22.0  22.0
"""
A few notes:
axis=0 concatenates vertically, axis=1 horizontally.
For vertical concatenation, use ignore_index=True to renumber the index.
For horizontal concatenation, the columns play the role the index plays vertically, so ignore_index=True turns all the column names into 0, 1, 2, 3, ...; use rename(columns={...}) to give them names again.
②
# different column labels and index labels; use the 'join' parameter
# (and, for the horizontal case, reindex — join_axes was removed in pandas 1.0)
df4 = pd.DataFrame(np.ones((3, 4))*0, columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
df5 = pd.DataFrame(np.ones((3, 4))*1, columns=['b', 'c', 'd', 'e'], index=[2, 3, 4])
print(df4)
print(df5)
"""
     a    b    c    d
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  0.0  0.0  0.0  0.0

     b    c    d    e
2  1.0  1.0  1.0  1.0
3  1.0  1.0  1.0  1.0
4  1.0  1.0  1.0  1.0
"""
# missing entries are filled with NaN
df_outer = pd.concat([df4, df5], join='outer', ignore_index=True)
print(df_outer)
"""
     a    b    c    d    e
0  0.0  0.0  0.0  0.0  NaN
1  0.0  0.0  0.0  0.0  NaN
2  0.0  0.0  0.0  0.0  NaN
3  NaN  1.0  1.0  1.0  1.0
4  NaN  1.0  1.0  1.0  1.0
5  NaN  1.0  1.0  1.0  1.0
"""
df_inner = pd.concat([df4, df5], join='inner')
print(df_inner)
"""
     b    c    d
1  0.0  0.0  0.0
2  0.0  0.0  0.0
3  0.0  0.0  0.0
2  1.0  1.0  1.0
3  1.0  1.0  1.0
4  1.0  1.0  1.0
"""
df_inner_index = pd.concat([df4, df5], join='inner', ignore_index=True)
print(df_inner_index)
"""
     b    c    d
0  0.0  0.0  0.0
1  0.0  0.0  0.0
2  0.0  0.0  0.0
3  1.0  1.0  1.0
4  1.0  1.0  1.0
5  1.0  1.0  1.0
"""


"""
df4 and df5 again, for reference:
     a    b    c    d
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  0.0  0.0  0.0  0.0

     b    c    d    e
2  1.0  1.0  1.0  1.0
3  1.0  1.0  1.0  1.0
4  1.0  1.0  1.0  1.0
"""
# missing entries are filled with NaN
df_h_different_index = pd.concat([df4, df5], axis=1)
print(df_h_different_index)
"""
     a    b    c    d    b    c    d    e
1  0.0  0.0  0.0  0.0  NaN  NaN  NaN  NaN
2  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
3  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
4  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0
"""
# use df4's index as the result index, dropping rows that are not in df4
# (index 4 is dropped); this used to be join_axes=[df4.index], removed
# in pandas 1.0 — reindex gives the same result
df_h_different_index_df4 = pd.concat([df4, df5], axis=1).reindex(df4.index)
print(df_h_different_index_df4)
"""
     a    b    c    d    b    c    d    e
1  0.0  0.0  0.0  0.0  NaN  NaN  NaN  NaN
2  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
3  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
"""

df_h_different_index_df5 = pd.concat([df4, df5], axis=1).reindex(df5.index)
print(df_h_different_index_df5)
"""
     a    b    c    d    b    c    d    e
2  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
3  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
4  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0
"""
A few notes:
These examples concatenate frames whose index and column labels differ.
For vertical concatenation use join='outer' or join='inner'; for the horizontal case, restrict the result to one frame's index (formerly join_axes=[df.index], removed in pandas 1.0 in favor of .reindex(df.index)).
Vertically, 'outer' keeps everything and fills the gaps with NaN, while 'inner' keeps only the shared column labels.
Horizontally, the plain call behaves like join='outer'; reindexing to a given frame's index keeps exactly those rows and drops the rest.
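Since join_axes is gone from recent pandas, the horizontal case can be checked with a reindex-based sketch (reusing the df4/df5 frames defined above):

```python
import pandas as pd
import numpy as np

df4 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
df5 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['b', 'c', 'd', 'e'], index=[2, 3, 4])

# horizontal concat, then keep only df4's index rows:
# the same frame join_axes=[df4.index] used to produce
res = pd.concat([df4, df5], axis=1).reindex(df4.index)
print(res)
```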
4. Concatenation with append
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((2, 3))*0, columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.ones((2, 3))*11, columns=['a', 'b', 'c'])
df3 = pd.DataFrame(np.ones((2, 3))*2, columns=['a', 'b', 'c'])
print(df1)
print(df2)
print(df3)
"""
     a    b    c
0  0.0  0.0  0.0
1  0.0  0.0  0.0
      a     b     c
0  11.0  11.0  11.0
1  11.0  11.0  11.0
     a    b    c
0  2.0  2.0  2.0
1  2.0  2.0  2.0
"""
# note: DataFrame.append was deprecated and removed in pandas 2.0;
# on recent versions use pd.concat instead
df_append1 = df1.append(df2, ignore_index=True)
print(df_append1)
"""
      a     b     c
0   0.0   0.0   0.0
1   0.0   0.0   0.0
2  11.0  11.0  11.0
3  11.0  11.0  11.0
"""
df_append2 = df1.append([df2, df3], ignore_index=True)
print(df_append2)
"""
      a     b     c
0   0.0   0.0   0.0
1   0.0   0.0   0.0
2  11.0  11.0  11.0
3  11.0  11.0  11.0
4   2.0   2.0   2.0
5   2.0   2.0   2.0
"""
s1 = pd.Series([12, 24, 33], index=['a', 'b', 'c'])
df_append_s = df1.append(s1, ignore_index=True)
print(df_append_s)
"""
      a     b     c
0   0.0   0.0   0.0
1   0.0   0.0   0.0
2  12.0  24.0  33.0
"""
A few notes:
append is a familiar operation on Python lists; here it adds data to the DataFrame in the vertical direction. (DataFrame.append was deprecated and removed in pandas 2.0; use pd.concat on recent versions.)
A Series corresponds to one row of a DataFrame, so appending a Series adds a row at the bottom.
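Because DataFrame.append no longer exists in pandas 2.0+, the same results come from pd.concat; a sketch reusing the frames above (a Series is first turned into a one-row frame):

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.ones((2, 3)) * 0, columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.ones((2, 3)) * 11, columns=['a', 'b', 'c'])
s1 = pd.Series([12, 24, 33], index=['a', 'b', 'c'])

# df1.append(df2, ignore_index=True) becomes:
res_frames = pd.concat([df1, df2], ignore_index=True)

# appending a Series: transpose it into a one-row DataFrame first
res_series = pd.concat([df1, s1.to_frame().T], ignore_index=True)

print(res_frames)
print(res_series)
```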
5. Merging with merge
1. The values in the on column match exactly, and only one column is used
# -*- coding: utf-8 -*-
import pandas as pd

left = pd.DataFrame({'connect': ['con0', 'con1', 'con2', 'con3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'connect': ['con0', 'con1', 'con2', 'con3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

print(left)
print(right)
"""
    A   B connect
0  A0  B0    con0
1  A1  B1    con1
2  A2  B2    con2
3  A3  B3    con3
    C   D connect
0  C0  D0    con0
1  C1  D1    con1
2  C2  D2    con2
3  C3  D3    con3
"""


# merge two DataFrames on a shared column
# the simple case: the values in 'connect' are identical on both sides
merge_l_r_same = pd.merge(left=left, right=right, on='connect')
print(merge_l_r_same)
merge_l_r_same = pd.merge(left=right, right=left, on='connect')
print(merge_l_r_same)
"""
    A   B connect   C   D
0  A0  B0    con0  C0  D0
1  A1  B1    con1  C1  D1
2  A2  B2    con2  C2  D2
3  A3  B3    con3  C3  D3

    C   D connect   A   B
0  C0  D0    con0  A0  B0
1  C1  D1    con1  A1  B1
2  C2  D2    con2  A2  B2
3  C3  D3    con3  A3  B3

This comparison shows what the on, left, and right parameters do:
the two frames are joined on the column named by on;
the left frame's remaining columns are all placed on the left,
and the right frame's remaining columns on the right.
"""
2. on spans multiple columns, and the corresponding values do not match exactly
# on with multiple columns, where the key values overlap but are not identical

left = pd.DataFrame({'key1': ['K1', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K0', 'K1', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K1'],
                      'key2': ['K0', 'K1', 'K1', 'K1'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
print(left)
print(right)
"""
    A   B key1 key2
0  A0  B0   K1   K0
1  A1  B1   K0   K0
2  A2  B2   K1   K1
3  A3  B3   K2   K1

    C   D key1 key2
0  C0  D0   K0   K0
1  C1  D1   K0   K1
2  C2  D2   K1   K1
3  C3  D3   K1   K1
"""
# when on covers multiple columns, pass a list
# how is one of 'inner', 'outer', 'left', 'right'
# default is 'inner'
merge_l_r_different = pd.merge(left, right, on=['key1', 'key2'])
print(merge_l_r_different)
"""
    A   B key1 key2   C   D
0  A1  B1   K0   K0  C0  D0
1  A2  B2   K1   K1  C2  D2
2  A2  B2   K1   K1  C3  D3
"""
merge_inner = pd.merge(left, right, on=['key1', 'key2'], how='inner')
print(merge_inner)
"""
    A   B key1 key2   C   D
0  A1  B1   K0   K0  C0  D0
1  A2  B2   K1   K1  C2  D2
2  A2  B2   K1   K1  C3  D3
"""

"""
left and right again, for reference:
left:
    A   B key1 key2
0  A0  B0   K1   K0
1  A1  B1   K0   K0
2  A2  B2   K1   K1
3  A3  B3   K2   K1
right:
    C   D key1 key2
0  C0  D0   K0   K0
1  C1  D1   K0   K1
2  C2  D2   K1   K1
3  C3  D3   K1   K1
"""

merge_outer = pd.merge(left, right, on=['key1', 'key2'], how='outer')
print(merge_outer)
"""
     A    B key1 key2    C    D
0   A0   B0   K1   K0  NaN  NaN
1   A1   B1   K0   K0   C0   D0
2   A2   B2   K1   K1   C2   D2
3   A2   B2   K1   K1   C3   D3
4   A3   B3   K2   K1  NaN  NaN
5  NaN  NaN   K0   K1   C1   D1
"""
merge_left = pd.merge(left, right, on=['key1', 'key2'], how='left')
print(merge_left)
"""
    A   B key1 key2    C    D
0  A0  B0   K1   K0  NaN  NaN
1  A1  B1   K0   K0   C0   D0
2  A2  B2   K1   K1   C2   D2
3  A2  B2   K1   K1   C3   D3
4  A3  B3   K2   K1  NaN  NaN
"""

"""
left and right again, for reference:
left:
    A   B key1 key2
0  A0  B0   K1   K0
1  A1  B1   K0   K0
2  A2  B2   K1   K1
3  A3  B3   K2   K1
right:
    C   D key1 key2
0  C0  D0   K0   K0
1  C1  D1   K0   K1
2  C2  D2   K1   K1
3  C3  D3   K1   K1
"""
merge_right = pd.merge(left, right, on=['key1', 'key2'], how='right')
print(merge_right)
"""
     A    B key1 key2   C   D
0   A1   B1   K0   K0  C0  D0
1   A2   B2   K1   K1  C2  D2
2   A2   B2   K1   K1  C3  D3
3  NaN  NaN   K0   K1  C1  D1
"""
This part is mainly about the how parameter of merge and the operations behind it.
Let's analyze two examples in detail.
"""
    A   B key1 key2
0  A0  B0   K1   K0
1  A1  B1   K0   K0
2  A2  B2   K1   K1
3  A3  B3   K2   K1

    C   D key1 key2
0  C0  D0   K0   K0
1  C1  D1   K0   K1
2  C2  D2   K1   K1
3  C3  D3   K1   K1
"""
result of on=['key1', 'key2'], how='inner':
"""
    A   B key1 key2   C   D
0  A1  B1   K0   K0  C0  D0
1  A2  B2   K1   K1  C2  D2
2  A2  B2   K1   K1  C3  D3
"""
Try to follow the analysis with me: since the merge is on the two columns key1 and key2, look for rows where the [key1, key2] pair is identical in both frames. Clearly [K0, K0] and [K1, K1] match, and with 'inner' only the matching rows need to be considered.

First take the left row for [K0, K0] and place its values on the left, with the matching right row's values on the right. The same goes for [K1, K1], except that left's one [K1, K1] row matches two rows in right: place the left values on the left and the first right row's values on the right, then, because right has a second [K1, K1] row, repeat the identical left values once more for it.

If how='outer', all key combinations are kept; where left or right has no match, that side is simply NaN. Go back to the code output above and it should be clear.

how='left' you can probably guess by now: keep every [key1, key2] combination from left, with right merely conforming to left's structure:
"""
    A   B key1 key2    C    D
0  A0  B0   K1   K0  NaN  NaN
1  A1  B1   K0   K0   C0   D0
2  A2  B2   K1   K1   C2   D2
3  A2  B2   K1   K1   C3   D3
4  A3  B3   K2   K1  NaN  NaN
"""
The analysis: left's [K1, K0] values go on the left, and since right has no match, C and D are NaN; [K0, K0] does have a matching right row; [K1, K1] works exactly as analyzed above; and finally [K2, K1] behaves like [K1, K0].

With this, you can work through the 'outer' and 'right' cases yourself.
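The row counts derived in the walkthrough can be checked directly; this sketch reuses the left/right frames from the code above and prints how many rows each how produces:

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K1', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K0', 'K1', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K1'],
                      'key2': ['K0', 'K1', 'K1', 'K1'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

# per the walkthrough: inner keeps only matching key pairs,
# outer keeps every pair, left/right keep every pair of the named side
inner = pd.merge(left, right, on=['key1', 'key2'], how='inner')
outer = pd.merge(left, right, on=['key1', 'key2'], how='outer')
left_m = pd.merge(left, right, on=['key1', 'key2'], how='left')
right_m = pd.merge(left, right, on=['key1', 'key2'], how='right')
print(len(inner), len(outer), len(left_m), len(right_m))
```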
3. indicator
import pandas as pd

# the indicator parameter
left = pd.DataFrame({'key': [0, 1], 'left': ['a', 'b']})
right = pd.DataFrame({'key': [1, 2, 2], 'right': [2, 2, 2]})
print(left)
print(right)
"""
   key left
0    0    a
1    1    b
   key  right
0    1      2
1    2      2
2    2      2
"""
res_indicator = pd.merge(left=left, right=right, on='key', how='outer', indicator=True)
print(res_indicator)
"""
   key left  right      _merge
0    0    a    NaN   left_only
1    1    b    2.0        both
2    2  NaN    2.0  right_only
3    2  NaN    2.0  right_only
"""
# give the indicator column a custom name
# (the default name is "_merge")
res_indicator2 = pd.merge(left, right, on='key', how='outer', indicator="idct_name")
print(res_indicator2)
"""
   key left  right   idct_name
0    0    a    NaN   left_only
1    1    b    2.0        both
2    2  NaN    2.0  right_only
3    2  NaN    2.0  right_only
"""
The indicator parameter of merge records where each merged row came from: left_only means only left had a value for that key and right did not, both means both sides did, and right_only is the mirror of left_only.
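One common use of indicator, sketched here on the same left/right frames, is filtering by the _merge column, e.g. keeping only the rows that exist solely in left (an "anti-join"):

```python
import pandas as pd

left = pd.DataFrame({'key': [0, 1], 'left': ['a', 'b']})
right = pd.DataFrame({'key': [1, 2, 2], 'right': [2, 2, 2]})

res = pd.merge(left, right, on='key', how='outer', indicator=True)

# keep only rows whose key appears in left but not in right
left_only = res[res['_merge'] == 'left_only']
print(left_only)
```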
4. Merging on the index (left_index, right_index), optionally combined with (left_on, right_on)
# define left and right again, this time with explicit indexes
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['K0', 'K1', 'K2'])

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])
print(left)
print(right)
"""
     A   B
K0  A0  B0
K1  A1  B1
K2  A2  B2
     C   D
K0  C0  D0
K2  C2  D2
K3  C3  D3
"""
# the (left_index, right_index) combination
res = pd.merge(left, right, left_index=True, right_index=True, how='outer')
print(res)
"""
      A    B    C    D
K0   A0   B0   C0   D0
K1   A1   B1  NaN  NaN
K2   A2   B2   C2   D2
K3  NaN  NaN   C3   D3
"""
# the (left_index, right_on) combination
res = pd.merge(left, right, left_index=True, how='outer', right_on='D')
print(res)
"""
      A    B    C   D
K3   A0   B0  NaN  K0
K3   A1   B1  NaN  K1
K3   A2   B2  NaN  K2
K0  NaN  NaN   C0  D0
K2  NaN  NaN   C2  D2
K3  NaN  NaN   C3  D3
"""
First, make sure you have the difference between a DataFrame's columns and its index straight, or this gets confusing.
left_index and right_index serve the same purpose as on; they simply tell merge to use the left frame's index and the right frame's index as the keys.
An index can also be merged against a column, as with left_index and right_on in the code above.
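One way to see that left_index/right_index play the same role as on: resetting the index into an ordinary column and merging on that column gives the same frame. A sketch on the frames above ('index' is simply the column name reset_index produces for an unnamed index):

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])

# merge on the indexes directly
by_index = pd.merge(left, right, left_index=True, right_index=True, how='outer')

# equivalently: turn the index into a column, merge on it, restore it
by_column = pd.merge(left.reset_index(), right.reset_index(),
                     on='index', how='outer').set_index('index')

print(by_index)
print(by_column)
```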
5. Overlapping column names (suffixes)
import pandas as pd

boys = pd.DataFrame({'people': ['name1', 'name2', 'name3'], 'age': [1, 2, 3]})
girls = pd.DataFrame({'people': ['name1', 'name4', 'name5'], 'age': [4, 5, 6]})
print(boys)
print(girls)
"""
   age people
0    1  name1
1    2  name2
2    3  name3
   age people
0    4  name1
1    5  name4
2    6  name5
"""
res = pd.merge(boys, girls, on='people', suffixes=['boys', 'girls'], how='inner')
print(res)
"""
   ageboys people  agegirls
0        1  name1         4
"""
The scenario: one table records boys' names and ages, another girls', and both use a people column for the names. When the two tables are merged, two different students turn out to share a name, so their rows cannot be treated as one person's. This is the overlapping-column problem, and the solution is the suffixes parameter, which renames the two age columns to ageboys and agegirls.
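For reference, if suffixes is omitted, pandas falls back to its defaults '_x' and '_y' for overlapping column names; a quick sketch on the same frames:

```python
import pandas as pd

boys = pd.DataFrame({'people': ['name1', 'name2', 'name3'], 'age': [1, 2, 3]})
girls = pd.DataFrame({'people': ['name1', 'name4', 'name5'], 'age': [4, 5, 6]})

# no suffixes given: the overlapping 'age' columns become age_x / age_y
res_default = pd.merge(boys, girls, on='people', how='inner')
print(res_default.columns.tolist())
```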