Foreword
A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, etc.). A DataFrame has both a row index and a column index.
1. DataFrame creation
1. Function creation
The code is as follows:
import pandas as pd
import numpy as np
frame=pd.DataFrame(np.random.randn(3,3),index=list('abc'),columns=list('ABC'))
frame
Output result:
A B C
a -0.391570 0.182729 1.010572
b 0.455405 0.418206 0.134341
c -0.491456 -0.527641 0.868909
2. Direct creation
The code is as follows:
import pandas as pd
import numpy as np
frame= pd.DataFrame([[1, 2, 3],
[2, 3, 4],
[3, 4, 5]],
index=list('abc'), columns=list('ABC'))
frame
# The column labels (columns) and row labels (index) can also be assigned separately
frame1=pd.DataFrame([[1, 2, 3],
[2, 3, 4],
[3, 4, 5]])
frame1.columns=list('ABC')
frame1.index=list('abc')
frame1
Output result:
>>frame
A B C
a 1 2 3
b 2 3 4
c 3 4 5
>>frame1
A B C
a 1 2 3
b 2 3 4
c 3 4 5
3. Dictionary creation
The code is as follows:
import pandas as pd
data={
'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
'year':[2000,2001,2002,2001,2002],
'pop':[1.5,1.7,3.6,2.4,2.9]}
frame=pd.DataFrame(data)
frame
Output result:
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
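For reference, the columns and index parameters can also be passed when creating from a dict, to control the column order and set row labels (a small sketch extending the example above; the 'one'...'five' labels are illustrative):

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
# columns= controls column order; index= sets the row labels
frame = pd.DataFrame(data, columns=['year', 'state', 'pop'],
                     index=['one', 'two', 'three', 'four', 'five'])
print(frame)
```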
2. DataFrame properties
1. View the column data types
- Use "DataFrame.dtypes" to view the data type of each column.
The code is as follows:
frame.dtypes
Output result:
A float64
B float64
C float64
dtype: object
2. View the first and last few rows of a DataFrame
- Use "head(n)" to view the first n rows of data; the default is the first 5 rows.
- Use "tail(n)" to view the last n rows of data; the default is the last 5 rows.
The code is as follows:
frame = pd.DataFrame(np.arange(36).reshape(6, 6), index=list('abcdef'), columns=list('ABCDEF'))
frame.head() # the first 5 rows by default
Output result:
A B C D E F
a 0 1 2 3 4 5
b 6 7 8 9 10 11
c 12 13 14 15 16 17
d 18 19 20 21 22 23
e 24 25 26 27 28 29
The first 2 rows. The code is as follows:
frame.head(2)
Output result:
A B C D E F
a 0 1 2 3 4 5
b 6 7 8 9 10 11
The last 5 rows (the default). The code is as follows:
frame.tail()
Output result:
A B C D E F
b 6 7 8 9 10 11
c 12 13 14 15 16 17
d 18 19 20 21 22 23
e 24 25 26 27 28 29
f 30 31 32 33 34 35
The last 2 rows. The code is as follows:
frame.tail(2)
Output result:
A B C D E F
e 24 25 26 27 28 29
f 30 31 32 33 34 35
3. View row and column names
- Use "DataFrame.columns" to see column names
The code is as follows:
frame.columns # view the column names
Output result:
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
- Use "DataFrame.index" to see the row names
The code is as follows:
frame.index # view the row names
Output result:
Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')
4. View data values
- Use "values" to view the data in the DataFrame; it returns an array.
The code is as follows:
frame.values
Output result:
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35]])
- View all data values in a column
The code is as follows:
print(frame['B'].values)
Output result:
[ 1 7 13 19 25 31]
- View all data values in a row
- Use iloc to select by positional index (0-based, so 0 is the first row).
- Use loc to select by row label.
The code is as follows:
frame.iloc[0]
frame.loc['a']
Output result:
A 0
B 1
C 2
D 3
E 4
F 5
5. View the number of rows and columns
- Use shape: shape[0] is the number of rows and shape[1] is the number of columns.
The code is as follows:
frame.shape[0]
frame.shape[1]
Output result:
6
6
3. DataFrame slicing and indexing
Slicing selects rows; indexing selects columns.
Rows
- Slicing with a colon
- Using loc and iloc
The code is as follows:
# slicing with a colon
>> frame['a':'b']
> A B C D E F
a 0 1 2 3 4 5
b 6 7 8 9 10 11
# using loc and iloc
#loc
>>frame.loc['a':'c','A':'C'] # ':' slices a range
> A B C
a 0 1 2
b 6 7 8
c 12 13 14
>>frame.loc[['a','b'],['A','C']] # '[]' selects specific rows and columns
> A C
a 0 2
b 6 8
#iloc
>>frame.iloc[1:] # row slice: all rows from the 2nd row on
> A B C D E F
b 6 7 8 9 10 11
c 12 13 14 15 16 17
d 18 19 20 21 22 23
e 24 25 26 27 28 29
f 30 31 32 33 34 35
>>frame[frame['B']==13].index # the row labels where B equals 13
> Index(['c'], dtype='object')
Columns
- Select directly by column name.
- Use loc/iloc.
The code is as follows:
>>frame['A'] # select the column named 'A'
> a 0
b 6
c 12
d 18
e 24
f 30
>>frame.loc[:,'A':'C'] # select columns A through C
> A B C
a 0 1 2
b 6 7 8
c 12 13 14
d 18 19 20
e 24 25 26
f 30 31 32
>>frame.iloc[:,1] # select the second column
> a 1
b 7
c 13
d 19
e 25
f 31
Rows + columns
The code is as follows:
>> frame.iloc[1:,-2:] # rows: from the 2nd row on; columns: from the 2nd-to-last column on
> E F
b 10 11
c 16 17
d 22 23
e 28 29
f 34 35
>> frame[frame['A']>7] # all rows where A > 7
> A B C D E F
c 12 13 14 15 16 17
d 18 19 20 21 22 23
e 24 25 26 27 28 29
f 30 31 32 33 34 35
>> frame['B'][frame['A']>7] # column 'B' of all rows where A > 7
> c 13
d 19
e 25
f 31
Name: B, dtype: int32
4. DataFrame operations
1. Transpose
- Use the ".T" attribute.
The code is as follows:
frame.T
Output result:
a b c d e f
A 0 6 12 18 24 30
B 1 7 13 19 25 31
C 2 8 14 20 26 32
D 3 9 15 21 27 33
E 4 10 16 22 28 34
F 5 11 17 23 29 35
2. Descriptive statistics
- Use "describe()" to compute descriptive statistics for each column; non-numeric columns are skipped. To describe rows instead, transpose first and then call "describe()".
The code is as follows:
frame.describe()
Output result:
A B C D E F
count 6.000000 6.000000 6.000000 6.000000 6.000000 6.000000
mean 15.000000 16.000000 17.000000 18.000000 19.000000 20.000000
std 11.224972 11.224972 11.224972 11.224972 11.224972 11.224972
min 0.000000 1.000000 2.000000 3.000000 4.000000 5.000000
25% 7.500000 8.500000 9.500000 10.500000 11.500000 12.500000
50% 15.000000 16.000000 17.000000 18.000000 19.000000 20.000000
75% 22.500000 23.500000 24.500000 25.500000 26.500000 27.500000
max 30.000000 31.000000 32.000000 33.000000 34.000000 35.000000
3. Calculation
Arithmetic operations
- add(other): add a value element-wise.
The code is as follows:
frame['A'].add(100)
Output result:
a 100
b 106
c 112
d 118
e 124
f 130
- sub(other): compute the element-wise difference of two columns.
The code is as follows:
frame['A-B']=frame['A'].sub(frame['B'])
frame
Output result:
A B C D E F A-B
a 0 1 2 3 4 5 -1
b 6 7 8 9 10 11 -1
c 12 13 14 15 16 17 -1
d 18 19 20 21 22 23 -1
e 24 25 26 27 28 29 -1
f 30 31 32 33 34 35 -1
- round(decimals): round to a given number of decimal places.
Keep two decimal places. The code is as follows:
frame2=pd.DataFrame({'col1':[1.234,2.34,4.5678],'col2':[1.0987,0.9876,3.45]})
frame2.round(2)
Output result:
col1 col2
0 1.23 1.10
1 2.34 0.99
2 4.57 3.45
Different columns can be given different numbers of decimal places. The code is as follows:
frame2.round({'col1':1,'col2':2})
Output result:
col1 col2
0 1.2 1.10
1 2.3 0.99
2 4.6 3.45
Logical operations
- Comparison operators: >, >=, <, <=, ==, !=
- Compound logical operators: & (and), | (or), ~ (not)
Filter on B > 2. The code is as follows:
frame['B']>2 # returns a boolean Series
Output result:
a False
b True
c True
d True
e True
f True
Name: B, dtype: bool
The boolean result can then be used as a filter. The code is as follows:
frame[frame['B']>2]
Output result:
A B C D E F A-B
b 6 7 8 9 10 11 -1
c 12 13 14 15 16 17 -1
d 18 19 20 21 22 23 -1
e 24 25 26 27 28 29 -1
f 30 31 32 33 34 35 -1
Multiple conditions can be combined; filter on B > 8 and C > 10. The code is as follows:
frame[(frame['B']>8)& (frame['C']>10)]
Output result:
A B C D E F A-B
c 12 13 14 15 16 17 -1
d 18 19 20 21 22 23 -1
e 24 25 26 27 28 29 -1
f 30 31 32 33 34 35 -1
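The negation operator ~ from the list above works the same way; a minimal sketch on the original 6x6 frame (not from the article's own outputs):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(36).reshape(6, 6),
                     index=list('abcdef'), columns=list('ABCDEF'))
# ~ inverts a boolean condition: rows where B is NOT greater than 8
result = frame[~(frame['B'] > 8)]
print(result)
```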
Logical operation functions
- DataFrame.query(expr): returns the filtered result directly; expr is a query string.
- DataFrame.B.isin([3,6,4]): generates a boolean Series, which must then be used as a filter to get the data.
query makes "frame[(frame['B']>8)&(frame['C']>10)]" more convenient and concise. The code is as follows:
frame.query("B>8 & C>10")
Output result:
A B C D E F A-B
c 12 13 14 15 16 17 -1
d 18 19 20 21 22 23 -1
e 24 25 26 27 28 29 -1
f 30 31 32 33 34 35 -1
- isin(values)
Check whether C is 20, 26, or 32. The code is as follows:
frame[frame['C'].isin([20,26,32])]
Output result:
A B C D E F A-B
d 18 19 20 21 22 23 -1
e 24 25 26 27 28 29 -1
f 30 31 32 33 34 35 -1
Statistical functions
Statistical functions operate on each column by default (axis=0); to operate on rows, specify axis=1.
- count(): number of non-NA observations.
- sum(): sum; each column by default, "sum(1)" sums each row.
- mean(): mean.
- median(): median.
- min(): minimum.
- max(): maximum.
- mode(): mode (the most frequent value).
- abs(): absolute value.
- std(): standard deviation.
- var(): variance.
The code is as follows:
frame.sum() # sum each column
Output result:
A 90
B 96
C 102
D 108
E 114
F 120
A-B -6
The code is as follows:
frame.sum(1) # sum each row
Output result:
a 14
b 50
c 86
d 122
e 158
f 194
dtype: int64
The code is as follows:
frame.count() # count non-NA values in each column
Output result:
A 6
B 6
C 6
D 6
E 6
F 6
A-B 6
dtype: int64
The code is as follows:
frame.count(1) # count non-NA values in each row
Output result:
a 7
b 7
c 7
d 7
e 7
f 7
dtype: int64
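The remaining statistical functions follow the same pattern; a small sketch of mean() and std() on the same 6x6 frame used above:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(36).reshape(6, 6),
                     index=list('abcdef'), columns=list('ABCDEF'))
col_means = frame.mean()        # mean of each column (axis=0 by default)
row_means = frame.mean(axis=1)  # mean of each row
col_std = frame.std()           # sample standard deviation of each column
print(col_means)
print(row_means)
print(col_std)
```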
Cumulative statistics functions
- cumsum(): running sum of the first 1/2/3/.../n values
- cummax(): running maximum of the first 1/2/3/.../n values
- cummin(): running minimum of the first 1/2/3/.../n values
- cumprod(): running product of the first 1/2/3/.../n values
The code is as follows:
frame.cumsum()
Output result:
A B C D E F A-B
a 0 1 2 3 4 5 -1
b 6 8 10 12 14 16 -2
c 18 21 24 27 30 33 -3
d 36 40 44 48 52 56 -4
e 60 65 70 75 80 85 -5
f 90 96 102 108 114 120 -6
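cummax() and cumprod() work the same way; a brief sketch on the same 6x6 frame (without the derived A-B column):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(36).reshape(6, 6),
                     index=list('abcdef'), columns=list('ABCDEF'))
running_max = frame.cummax()          # running maximum down each column
row_products = frame.cumprod(axis=1)  # running product along each row
print(running_max)
print(row_products)
```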
Custom calculations
- Use "DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)" to apply a custom function.
- func: the function to apply.
- axis: 0 applies the function to each column (the default); 1 applies it to each row.
Apply a cumulative sum. The code is as follows:
frame.apply(np.cumsum,axis=0,result_type=None)
Output result:
A B C D E F A-B
a 0 1 2 3 4 5 -1
b 6 8 10 12 14 16 -2
c 18 21 24 27 30 33 -3
d 36 40 44 48 52 56 -4
e 60 65 70 75 80 85 -5
f 90 96 102 108 114 120 -6
Apply a max-minus-min function to selected columns. The code is as follows:
frame[['A','B']].apply(lambda x : x.max()-x.min())
Output result:
A 30
B 30
dtype: int64
Apply a multiplication to selected columns. The code is as follows:
frame[['A','B']].apply(lambda x:x*2)
Output result:
A B
a 0 2
b 12 14
c 24 26
d 36 38
e 48 50
f 60 62
4. Add
Columns
- Add an empty column
The code is as follows:
frame['price']=''
Output result:
A B C D E F A-B price
a 0 1 2 3 4 5 -1
b 6 7 8 9 10 11 -1
c 12 13 14 15 16 17 -1
d 18 19 20 21 22 23 -1
e 24 25 26 27 28 29 -1
f 30 31 32 33 34 35 -1
The code is as follows:
frame['price'] = pd.Series(dtype='int',index=['a','b','c','d','e','f'])
frame['price']=0
Output result:
A B C D E F A-B price
a 0 1 2 3 4 5 -1 0
b 6 7 8 9 10 11 -1 0
c 12 13 14 15 16 17 -1 0
d 18 19 20 21 22 23 -1 0
e 24 25 26 27 28 29 -1 0
f 30 31 32 33 34 35 -1 0
- Add a column with values
A new column can be assigned dictionary-style: the column name maps to a list. Note that the list must have the same length as the index.
The code is as follows:
frame['G']=['999','999','999','999','999','999']
Output result:
A B C D E F A-B price G
a 0 1 2 3 4 5 -1 0 999
b 6 7 8 9 10 11 -1 0 999
c 12 13 14 15 16 17 -1 0 999
d 18 19 20 21 22 23 -1 0 999
e 24 25 26 27 28 29 -1 0 999
f 30 31 32 33 34 35 -1 0 999
- If the index order matters, add the column as a Series.
Note: when initializing with a Series you must specify index, because a Series defaults to the index 0, 1, 2, ...; if the DataFrame's index differs, the whole column becomes NaN.
The code is as follows:
frame['H']=pd.Series([1,2,3])
frame
Output result:
A B C D E F A-B price G H
a 0 1 2 3 4 5 -1 0 999 NaN
b 6 7 8 9 10 11 -1 0 999 NaN
c 12 13 14 15 16 17 -1 0 999 NaN
d 18 19 20 21 22 23 -1 0 999 NaN
e 24 25 26 27 28 29 -1 0 999 NaN
f 30 31 32 33 34 35 -1 0 999 NaN
The code is as follows:
frame['H']=pd.Series([1,2,3,4,5,6],index=['a','b','c','d','e','f'])
frame
Output result:
A B C D E F A-B price G H
a 0 1 2 3 4 5 -1 0 999 1
b 6 7 8 9 10 11 -1 0 999 2
c 12 13 14 15 16 17 -1 0 999 3
d 18 19 20 21 22 23 -1 0 999 4
e 24 25 26 27 28 29 -1 0 999 5
f 30 31 32 33 34 35 -1 0 999 6
- Use insert() to place a new column at a specified position; the remaining columns shift right.
The code is as follows:
# insert a column named 'QQ' with the values ['999','999','999','999','999','999'] as the first column; the other columns shift right
frame.insert(0, 'QQ', ['999','999','999','999','999','999'])
Output result:
QQ A B C D E F A-B price G
a 999 0 1 2 3 4 5 -1 0 999
b 999 6 7 8 9 10 11 -1 0 999
c 999 12 13 14 15 16 17 -1 0 999
d 999 18 19 20 21 22 23 -1 0 999
e 999 24 25 26 27 28 29 -1 0 999
f 999 30 31 32 33 34 35 -1 0 999
Rows
- Add a row
Assign a new row directly with loc. The code is as follows:
new_data_list=['666','999','555','3','4','8','0','0','1']
frame.loc[6]=new_data_list
frame
Output result:
B C D E F A-B price H J
a 1 2 3 4 5 -1 0 1 1
b 7 8 9 10 11 -1 0 2 2
c 13 14 15 16 17 -1 0 3 3
d 19 20 21 22 23 -1 0 4 4
e 25 26 27 28 29 -1 0 5 5
f 31 32 33 34 35 -1 0 6 6
6 666 999 555 3 4 8 0 0 1
Assign a new row directly with a loc label. The code is as follows:
frame.loc['g']=['666','999','555','3','4','8','0','0','1']
Output result:
B C D E F A-B price H J
a 1 2 3 4 5 -1 0 1 1
b 7 8 9 10 11 -1 0 2 2
c 13 14 15 16 17 -1 0 3 3
d 19 20 21 22 23 -1 0 4 4
e 25 26 27 28 29 -1 0 5 5
f 31 32 33 34 35 -1 0 6 6
6 666 999 555 3 4 8 0 0 1
g 666 999 555 3 4 8 0 0 1
5. Modify
- Use DataFrame.rename(mapper=None, *, index=None, columns=None, axis=None, copy=True, inplace=False, level=None, errors='ignore')
Rename a column. The code is as follows:
frame.rename(columns={'A':'key'},inplace=False)
Output result:
key B C D E F price
a 0 1 2 3 4 5 0
b 6 7 8 9 10 11 0
c 12 13 14 15 16 17 0
d 18 19 20 21 22 23 0
e 24 25 26 27 28 29 0
f 30 31 32 33 34 35 0
6. Delete
- drop() function
- Deletes rows by default and returns a new object; the original data is not modified.
- Specify axis=1 to delete columns.
- Specify inplace=True to operate directly on the original data.
The code is as follows:
frame.drop('c') # returns a new copy; frame itself is unchanged
frame
Output result:
A B C D E F A-B price G H
a 0 1 2 3 4 5 -1 0 999 1
b 6 7 8 9 10 11 -1 0 999 2
c 12 13 14 15 16 17 -1 0 999 3
d 18 19 20 21 22 23 -1 0 999 4
e 24 25 26 27 28 29 -1 0 999 5
f 30 31 32 33 34 35 -1 0 999 6
The code is as follows:
frame.drop('A',axis=1,inplace=True)
frame
Output result:
B C D E F A-B price G H
a 1 2 3 4 5 -1 0 999 1
b 7 8 9 10 11 -1 0 999 2
c 13 14 15 16 17 -1 0 999 3
d 19 20 21 22 23 -1 0 999 4
e 25 26 27 28 29 -1 0 999 5
f 31 32 33 34 35 -1 0 999 6
- del removes the column from the original data directly. The code is as follows:
del frame['G']
frame
Output result:
B C D E F A-B price H
a 1 2 3 4 5 -1 0 1
b 7 8 9 10 11 -1 0 2
c 13 14 15 16 17 -1 0 3
d 19 20 21 22 23 -1 0 4
e 25 26 27 28 29 -1 0 5
f 31 32 33 34 35 -1 0 6
7. Deduplication
- DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
- subset: the columns used to identify duplicates.
- keep: which duplicate to keep, one of {'first', 'last', False}, default 'first'; False removes all duplicated rows.
- inplace: whether to act on the original DataFrame.
# original data
B C D E F A-B price H J
a 1 2 3 4 5 -1 0 1 1
b 7 8 9 10 11 -1 0 2 2
c 13 14 15 16 17 -1 0 3 3
d 19 20 21 22 23 -1 0 4 4
e 25 26 27 28 29 -1 0 5 5
f 31 32 33 34 35 -1 0 6 6
6 666 999 555 3 4 8 0 0 6
g 666 999 555 3 4 8 0 0 6
Remove duplicate rows, keeping the last row of each group of duplicates. The code is as follows:
# remove duplicate rows
frame.drop_duplicates(keep='last')
Output result:
B C D E F A-B price H J
a 1 2 3 4 5 -1 0 1 1
b 7 8 9 10 11 -1 0 2 2
c 13 14 15 16 17 -1 0 3 3
d 19 20 21 22 23 -1 0 4 4
e 25 26 27 28 29 -1 0 5 5
f 31 32 33 34 35 -1 0 6 6
g 666 999 555 3 4 8 0 0 6
Deduplicate rows with duplicate values in column 'J'. The code is as follows:
frame.drop_duplicates(subset=('J',))
Output result:
B C D E F A-B price H J
a 1 2 3 4 5 -1 0 1 1
b 7 8 9 10 11 -1 0 2 2
c 13 14 15 16 17 -1 0 3 3
d 19 20 21 22 23 -1 0 4 4
e 25 26 27 28 29 -1 0 5 5
f 31 32 33 34 35 -1 0 6 6
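keep=False, mentioned above, removes every row that has a duplicate rather than keeping one; a minimal sketch with illustrative values (not the article's frame):

```python
import pandas as pd

# rows '6' and 'g' are duplicates of each other
df = pd.DataFrame({'B': [1, 666, 666], 'C': [2, 999, 999]},
                  index=['a', '6', 'g'])
# keep=False drops every duplicated row, leaving only unique rows
deduped = df.drop_duplicates(keep=False)
print(deduped)
```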
8. Sorting
- sort_values()
- by: the column to sort by
- ascending: whether to sort in ascending order
The code is as follows:
frame.sort_values(by='J',ascending=False)
Output result:
B C D E F A-B price H J
f 31 32 33 34 35 -1 0 6 6
6 666 999 555 3 4 8 0 0 6
g 666 999 555 3 4 8 0 0 6
e 25 26 27 28 29 -1 0 5 5
d 19 20 21 22 23 -1 0 4 4
c 13 14 15 16 17 -1 0 3 3
b 7 8 9 10 11 -1 0 2 2
a 1 2 3 4 5 -1 0 1 1
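by= also accepts a list of columns, with a per-column ascending flag to break ties; a small sketch with illustrative values:

```python
import pandas as pd

df = pd.DataFrame({'J': [6, 6, 5], 'H': [6, 0, 5]}, index=['f', 'g', 'e'])
# sort by 'J' descending first, then break ties with 'H' ascending
result = df.sort_values(by=['J', 'H'], ascending=[False, True])
print(result)
```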
9. Merge
- The merge method joins two dataframes mainly on their common columns.
- The join method joins two dataframes mainly on their indexes.
- The concat method concatenates Series or DataFrames by rows or by columns.
The merge method
Joins on a single column
Inner join. The code is as follows:
import pandas as pd
import numpy as np
# define df1
df1=pd.DataFrame({'A':[1,2,2,4,5],'freatrue1':np.arange(0,1,0.2),'feature2':['one','two','three','four','five']})
# define df2
df2=pd.DataFrame({'A':[1,1,2,7,8],'color':['red','blue','orange','purple','pink'],'fruits':['apple','grades','watermelon','pear','mango']})
# inner join on the common column A
df3=pd.merge(df1,df2,how='inner',on='A')
print(df1)
print(df2)
print(df3)
Output result:
>>df1
A freatrue1 feature2
0 1 0.0 one
1 2 0.2 two
2 2 0.4 three
3 4 0.6 four
4 5 0.8 five
>>df2
A color fruits
0 1 red apple
1 1 blue grades
2 2 orange watermelon
3 7 purple pear
4 8 pink mango
>>df3
A freatrue1 feature2 color fruits
0 1 0.0 one red apple
1 1 0.0 one blue grades
2 2 0.2 two orange watermelon
3 2 0.4 three orange watermelon
Outer join
- Join on the union of the key values, with how='outer', on=the common column name. Rows with no match in the other dataframe get NaN in that dataframe's columns.
The code is as follows:
import pandas as pd
import numpy as np
# define df1
df1=pd.DataFrame({'A':[1,2,2,4,5],'freatrue1':np.arange(0,1,0.2),'feature2':['one','two','three','four','five']})
# define df2
df2=pd.DataFrame({'A':[1,1,2,7,8],'color':['red','blue','orange','purple','pink'],'fruits':['apple','grades','watermelon','pear','mango']})
# outer join on the common column A
df3=pd.merge(df1,df2,how='outer',on='A')
print(df1)
print(df2)
print(df3)
Output result:
>>df1
A freatrue1 feature2
0 1 0.0 one
1 2 0.2 two
2 2 0.4 three
3 4 0.6 four
4 5 0.8 five
>>df2
A color fruits
0 1 red apple
1 1 blue grades
2 2 orange watermelon
3 7 purple pear
4 8 pink mango
>>df3
A freatrue1 feature2 color fruits
0 1 0.0 one red apple
1 1 0.0 one blue grades
2 2 0.2 two orange watermelon
3 2 0.4 three orange watermelon
4 4 0.6 four NaN NaN
5 5 0.8 five NaN NaN
6 7 NaN NaN purple pear
7 8 NaN NaN pink mango
Left join
- Join keyed on the rows of the left dataframe, with how='left', on=the common column name. Rows with no match in the right dataframe get NaN in its columns.
The code is as follows:
import pandas as pd
import numpy as np
# define df1
df1=pd.DataFrame({'A':[1,2,2,4,5],'freatrue1':np.arange(0,1,0.2),'feature2':['one','two','three','four','five']})
# define df2
df2=pd.DataFrame({'A':[1,1,2,7,8],'color':['red','blue','orange','purple','pink'],'fruits':['apple','grades','watermelon','pear','mango']})
# left join on the common column A
df3=pd.merge(df1,df2,how='left',on='A')
print(df1)
print(df2)
print(df3)
Output result:
>>df1
A freatrue1 feature2
0 1 0.0 one
1 2 0.2 two
2 2 0.4 three
3 4 0.6 four
4 5 0.8 five
>>df2
A color fruits
0 1 red apple
1 1 blue grades
2 2 orange watermelon
3 7 purple pear
4 8 pink mango
>>df3
A freatrue1 feature2 color fruits
0 1 0.0 one red apple
1 1 0.0 one blue grades
2 2 0.2 two orange watermelon
3 2 0.4 three orange watermelon
4 4 0.6 four NaN NaN
5 5 0.8 five NaN NaN
Right join
- Join keyed on the rows of the right dataframe, with how='right', on=the common column name. Rows with no match in the left dataframe get NaN in its columns.
The code is as follows:
import pandas as pd
import numpy as np
# define df1
df1=pd.DataFrame({'A':[1,2,2,4,5],'freatrue1':np.arange(0,1,0.2),'feature2':['one','two','three','four','five']})
# define df2
df2=pd.DataFrame({'A':[1,1,2,7,8],'color':['red','blue','orange','purple','pink'],'fruits':['apple','grades','watermelon','pear','mango']})
# right join on the common column A
df3=pd.merge(df1,df2,how='right',on='A')
print(df1)
print(df2)
print(df3)
Output result:
>>df1
A freatrue1 feature2
0 1 0.0 one
1 2 0.2 two
2 2 0.4 three
3 4 0.6 four
4 5 0.8 five
>>df2
A color fruits
0 1 red apple
1 1 blue grades
2 2 orange watermelon
3 7 purple pear
4 8 pink mango
>>df3
A freatrue1 feature2 color fruits
0 1 0.0 one red apple
1 1 0.0 one blue grades
2 2 0.2 two orange watermelon
3 2 0.4 three orange watermelon
4 7 NaN NaN purple pear
5 8 NaN NaN pink mango
Joins on multiple columns
Inner join (intersection) on multiple columns. The code is as follows:
df1=pd.DataFrame({'A':[1,2,2,4,5],'B':['a','b','a','d','c'],'freatrue1':np.arange(0,1,0.2),'feature2':['one','two','three','four','five']})
df2=pd.DataFrame({'A':[1,1,2,7,8],'B':['e','g','a','d','c'],'color':['red','blue','orange','purple','pink'],'fruits':['apple','grades','watermelon','pear','mango']})
df3=pd.merge(df1,df2,how='inner',on=['A','B'])
print(df1)
print(df2)
print(df3)
Output result:
>>df1
A B freatrue1 feature2
0 1 a 0.0 one
1 2 b 0.2 two
2 2 a 0.4 three
3 4 d 0.6 four
4 5 c 0.8 five
>>df2
A B color fruits
0 1 e red apple
1 1 g blue grades
2 2 a orange watermelon
3 7 d purple pear
4 8 c pink mango
>>df3
A B freatrue1 feature2 color fruits
0 2 a 0.4 three orange watermelon
Outer join (union) on multiple columns. The code is as follows:
df1=pd.DataFrame({'A':[1,2,2,4,5],'B':['a','b','a','d','c'],'freatrue1':np.arange(0,1,0.2),'feature2':['one','two','three','four','five']})
df2=pd.DataFrame({'A':[1,1,2,7,8],'B':['e','g','a','d','c'],'color':['red','blue','orange','purple','pink'],'fruits':['apple','grades','watermelon','pear','mango']})
df3=pd.merge(df1,df2,how='outer',on=['A','B'])
print(df1)
print(df2)
print(df3)
Output result:
>>df1
A B freatrue1 feature2
0 1 a 0.0 one
1 2 b 0.2 two
2 2 a 0.4 three
3 4 d 0.6 four
4 5 c 0.8 five
>>df2
A B color fruits
0 1 e red apple
1 1 g blue grades
2 2 a orange watermelon
3 7 d purple pear
4 8 c pink mango
>>df3
A B freatrue1 feature2 color fruits
0 1 a 0.0 one NaN NaN
1 2 b 0.2 two NaN NaN
2 2 a 0.4 three orange watermelon
3 4 d 0.6 four NaN NaN
4 5 c 0.8 five NaN NaN
5 1 e NaN NaN red apple
6 1 g NaN NaN blue grades
7 7 d NaN NaN purple pear
8 8 c NaN NaN pink mango
Left join on multiple columns. The code is as follows:
df1=pd.DataFrame({'A':[1,2,2,4,5],'B':['a','b','a','d','c'],'freatrue1':np.arange(0,1,0.2),'feature2':['one','two','three','four','five']})
df2=pd.DataFrame({'A':[1,1,2,7,8],'B':['e','g','a','d','c'],'color':['red','blue','orange','purple','pink'],'fruits':['apple','grades','watermelon','pear','mango']})
df3=pd.merge(df1,df2,how='left',on=['A','B'])
print(df1)
print(df2)
print(df3)
Output result:
>>df1
A B freatrue1 feature2
0 1 a 0.0 one
1 2 b 0.2 two
2 2 a 0.4 three
3 4 d 0.6 four
4 5 c 0.8 five
>>df2
A B color fruits
0 1 e red apple
1 1 g blue grades
2 2 a orange watermelon
3 7 d purple pear
4 8 c pink mango
>>df3
A B freatrue1 feature2 color fruits
0 1 a 0.0 one NaN NaN
1 2 b 0.2 two NaN NaN
2 2 a 0.4 three orange watermelon
3 4 d 0.6 four NaN NaN
4 5 c 0.8 five NaN NaN
Index-based joins
The code is as follows:
import numpy as np
import pandas as pd
df1=pd.DataFrame({'A':[1,2,2,4,5],'B':['a','b','a','d','c'],'freatrue1':np.arange(0,1,0.2),'feature2':['one','two','three','four','five']})
df2=pd.DataFrame({'A':[1,1,2,7,8],'B':['e','g','a','d','c'],'color':['red','blue','orange','purple','pink'],'fruits':['apple','grades','watermelon','pear','mango']},index=[4,5,6,7,8])
# join df1's column A against df2's index
df3=pd.merge(df1,df2,how='inner',left_on='A',right_index=True)
# set suffixes to change the suffixes added to overlapping (non-key) column names
df4=pd.merge(df1,df2,how='inner',left_on='A',right_index=True,suffixes=('_df1','_df2'))
print(df1)
print(df2)
print(df3)
print(df4)
Output result:
>>df1
A B freatrue1 feature2
0 1 a 0.0 one
1 2 b 0.2 two
2 2 a 0.4 three
3 4 d 0.6 four
4 5 c 0.8 five
>>df2
A B color fruits
4 1 e red apple
5 1 g blue grades
6 2 a orange watermelon
7 7 d purple pear
8 8 c pink mango
>>df3
A_x B_x freatrue1 feature2 A_y B_y color fruits
3 4 d 0.6 four 1 e red apple
4 5 c 0.8 five 1 g blue grades
>>df4
A_df1 B_df1 freatrue1 feature2 A_df2 B_df2 color fruits
3 4 d 0.6 four 1 e red apple
4 5 c 0.8 five 1 g blue grades
The join method
- Joins dataframes on their index; the join types are the same as merge: inner, outer, left, and right.
Index-to-index join
The code is as follows:
df1=pd.DataFrame({'A':[1,2,3,4,5],'B':['red','blue','orange','purple','pink']})
df2=pd.DataFrame({'A':[1,2,3],'fruits':['apple','grades','watermelon']})
# lsuffix and rsuffix set the suffixes for overlapping column names
df3=df1.join(df2,lsuffix='_caller',rsuffix='_other',how='inner')
print(df1)
print(df2)
print(df3)
Output result:
>>df1
A B
0 1 red
1 2 blue
2 3 orange
3 4 purple
4 5 pink
>>df2
A fruits
0 1 apple
1 2 grades
2 3 watermelon
>>df3
A_caller B A_other fruits
0 1 red 1 apple
1 2 blue 2 grades
2 3 orange 3 watermelon
Join based on columns
The code is as follows:
df1=pd.DataFrame({'A':[1,2,3,4,5],'B':['red','blue','orange','purple','pink']})
df2=pd.DataFrame({'A':[1,2,3],'fruits':['apple','grades','watermelon']})
# join on column A
df3=df1.set_index('A').join(df2.set_index('A'),how='inner')
print(df1)
print(df2)
print(df3)
Output result:
>>df1
A B
0 1 red
1 2 blue
2 3 orange
3 4 purple
4 5 pink
>>df2
A fruits
0 1 apple
1 2 grades
2 3 watermelon
>>df3
B fruits
A
1 red apple
2 blue grades
3 orange watermelon
The concat method
- concat is a concatenation function that supports both row-wise and column-wise concatenation. The default is row-wise concatenation, and the default join is outer (union). It operates on pandas objects.
Concatenating Series
Row-wise concatenation
The code is as follows:
df1=pd.Series([1,2,3],index=['a','b','c'])
df2=pd.Series([4,5,6],index=['b','c','d'])
df3=pd.concat([df1,df2])
df4=pd.concat([df1,df2],keys=['fea1','fea2']) # if row concatenation yields duplicate index labels, keys adds an outer index level to distinguish the groups
print(df1)
print(df2)
print(df3)
print(df4)
Output result:
>>df1
a 1
b 2
c 3
dtype: int64
>>df2
b 4
c 5
d 6
dtype: int64
>>df3
a 1
b 2
c 3
b 4
c 5
d 6
dtype: int64
>>df4
fea1 a 1
b 2
c 3
fea2 b 4
c 5
d 6
dtype: int64
Column-wise concatenation
- The default joins on the union of the indexes.
The code is as follows:
pd.concat([df1,df2],axis=1)
Output result:
0 1
a 1.0 NaN
b 2.0 4.0
c 3.0 5.0
d NaN 6.0
- Concatenating on the intersection
- keys: sets the column names of the result
- join: 'inner' joins on the intersection of the indexes
The code is as follows:
pd.concat([df1,df2],axis=1,join='inner',keys=['fea1','fea2'])
Output result:
fea1 fea2
b 2 4
c 3 5
Concatenating DataFrames
Row-wise concatenation
The code is as follows:
df1=pd.DataFrame({'A':[1,2,3],'fea1':['b','c','d']})
df2=pd.DataFrame({'A':[4,5,6],'fea1':['a','b','c']})
df3=pd.concat([df1,df2]) # row concatenation
print(df1)
print(df2)
print(df3)
Output result:
>>df1
A fea1
0 1 b
1 2 c
2 3 d
>>df2
A fea1
0 4 a
1 5 b
2 6 c
>>df3
A fea1
0 1 b
1 2 c
2 3 d
0 4 a
1 5 b
2 6 c
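If the duplicated row labels above (0, 1, 2, 0, 1, 2) are unwanted, ignore_index=True renumbers the result; a short sketch using the same df1 and df2:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'fea1': ['b', 'c', 'd']})
df2 = pd.DataFrame({'A': [4, 5, 6], 'fea1': ['a', 'b', 'c']})
# ignore_index=True renumbers the rows 0..n-1 instead of keeping 0,1,2,0,1,2
df3 = pd.concat([df1, df2], ignore_index=True)
print(df3)
```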
Column-wise concatenation. The code is as follows:
df1=pd.DataFrame({'A':[1,2,3],'fea1':['b','c','d']})
df2=pd.DataFrame({'A':[4,5,6],'fea1':['a','b','c']})
df3=pd.concat([df1,df2],axis=1) # column concatenation
print(df1)
print(df2)
print(df3)
Output result:
>>df1
A fea1
0 1 b
1 2 c
2 3 d
>>df2
A fea1
0 4 a
1 5 b
2 6 c
>>df3
A fea1 A fea1
0 1 b 4 a
1 2 c 5 b
2 3 d 6 c