Table of contents
Foreword
Handling of missing data
Duplicate data processing
Data type conversion
Changes to column names and indexes
Grouping and aggregation operations
Summary
Foreword
This article introduces common Pandas operations for data processing and cleaning: handling missing data, handling duplicate data, converting data types, changing column names and indexes, and grouping and aggregating data. Each operation is accompanied by a code example. These operations are fundamental to data analysis and modeling, and help us understand and prepare data.
Handling of missing data
Real-world data often contains missing values, which usually have to be filled in or removed before analysis. Pandas provides the fillna() and dropna() methods for this.
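import pandas as pd
import numpy as np
# Create a DataFrame that contains missing values
df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [5, np.nan, 7, 8],
                   'C': [9, 10, 11, np.nan]})
# Fill missing values with 0 using fillna() (returns a new DataFrame)
df.fillna(0)
# Drop rows that contain missing values using dropna() (returns a new DataFrame)
df.dropna()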
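Filling with a constant or dropping every incomplete row is not always what is wanted. The following is a minimal sketch, not a recommendation for any particular dataset, that counts missing values with isna(), fills each column with its own mean, and drops rows only when a chosen column is missing. The variable names missing_counts, filled and dropped are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [5, np.nan, 7, 8],
                   'C': [9, 10, 11, np.nan]})

# Count missing values per column before deciding how to handle them
missing_counts = df.isna().sum()

# Fill each column with its own mean instead of a single constant
filled = df.fillna(df.mean())

# Drop only the rows where column 'A' is missing
dropped = df.dropna(subset=['A'])

print(missing_counts)
print(filled)
print(dropped)
```

Both fillna() and dropna() return new objects here, so df itself still contains its missing values afterwards.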
Duplicate data processing
Duplicate rows can distort analysis results, so they usually need to be removed. Pandas provides the drop_duplicates() method, which by default drops every row that is identical to an earlier row in all columns.
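import pandas as pd
# Create a DataFrame that contains a duplicate row (rows 0 and 1 are identical)
df = pd.DataFrame({'A': [1, 1, 2, 3],
                   'B': [4, 4, 6, 6]})
# Remove duplicate rows using drop_duplicates() (returns a new DataFrame)
df.drop_duplicates()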
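Rows that repeat a value in only some columns are not treated as duplicates by default. The sketch below uses a small frame in which no two rows match in every column, and shows how the subset and keep parameters of drop_duplicates(), together with duplicated(), can restrict the comparison to chosen columns. The variable names are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3],
                   'B': [4, 5, 6, 6]})

# No two rows are identical in every column, so nothing is removed
full = df.drop_duplicates()

# Boolean mask: True for rows whose 'A' value already appeared earlier
dup_mask = df.duplicated(subset=['A'])

# Keep the first row for each distinct value of 'A'
first_kept = df.drop_duplicates(subset=['A'], keep='first')

# Keep the last row for each distinct value of 'B'
last_kept = df.drop_duplicates(subset=['B'], keep='last')

print(dup_mask)
print(first_kept)
print(last_kept)
```

Which row to keep depends on the data; keep='first' is the default, and keep=False drops every row that has a duplicate.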
Data type conversion
Columns are sometimes loaded with the wrong type, for example numbers stored as strings, and must be converted before use. Pandas provides the astype() method to convert data types.
import pandas as pd
# Create a DataFrame whose columns have different data types
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': ['4', '5', '6']})
# Convert column 'B' from strings to integers using astype()
df['B'] = df['B'].astype(int)
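astype(int) raises a ValueError when a string cannot be parsed as an integer. As a hedged alternative for messy input, the sketch below uses pd.to_numeric() with errors='coerce', which replaces unparseable entries with NaN. The value 'five' is made-up sample data, and the variable name converted is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': ['4', 'five', '6']})

# astype(int) would raise ValueError on 'five';
# to_numeric with errors='coerce' turns unparseable entries into NaN
df['B'] = pd.to_numeric(df['B'], errors='coerce')

# astype() also accepts a dict to convert selected columns only
converted = df.astype({'A': 'float64'})

print(df.dtypes)
print(converted.dtypes)
```

Because NaN is a floating-point value, column 'B' ends up as float64 rather than an integer type.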
Changes to column names and indexes
Column names and index labels often need to be changed during processing. Pandas provides the rename() method, which accepts a mapping from old labels to new ones for both columns and the index.
import pandas as pd
# Create a DataFrame with explicit column names and index labels
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6]},
                  index=['a', 'b', 'c'])
# Rename one column and one index label using rename()
df = df.rename(columns={'A': 'new_A'}, index={'a': 'new_a'})
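Besides a dictionary, rename() also accepts a function that is applied to every label. The sketch below shows that form, along with two related operations: assigning to .columns to replace all column names at once, and reset_index() to move the index into an ordinary column. The names lowered, relabeled and flat, and the labels 'col_a' and 'col_b', are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6]},
                  index=['a', 'b', 'c'])

# Pass functions instead of dicts: lower-case columns, upper-case index labels
lowered = df.rename(columns=str.lower, index=str.upper)

# Replace all column names at once by assigning to .columns
relabeled = df.copy()
relabeled.columns = ['col_a', 'col_b']

# reset_index() moves the index into a regular column named 'index'
flat = df.reset_index()

print(lowered)
print(relabeled)
print(flat)
```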
Grouping and aggregation operations
Data often needs to be split into groups by one or more keys and then aggregated within each group. Pandas provides the groupby() and agg() methods for grouping and aggregation.
import pandas as pd
# Create a DataFrame with two key columns and one value column
df = pd.DataFrame({'A': ['a', 'a', 'b', 'b'],
                   'B': ['x', 'y', 'x', 'y'],
                   'C': [1, 2, 3, 4]})
# Group the rows by columns 'A' and 'B' using groupby()
grouped = df.groupby(['A', 'B'])
# Sum column 'C' within each group using agg()
grouped.agg({'C': 'sum'})
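agg() can also compute several statistics at once. The sketch below groups by a single key and uses named aggregation, where each keyword gives the output column name and a (column, function) pair. The output names total and average are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b', 'b'],
                   'B': ['x', 'y', 'x', 'y'],
                   'C': [1, 2, 3, 4]})

# Group by 'A' and compute the sum and mean of 'C' with named aggregation
summary = df.groupby('A').agg(total=('C', 'sum'), average=('C', 'mean'))

# reset_index() turns the group key back into a regular column
summary_flat = summary.reset_index()

print(summary)
print(summary_flat)
```

The result of groupby().agg() is indexed by the group keys, so reset_index() is a common follow-up when a flat table is easier to work with.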
Summary
This article covered common Pandas operations for data processing and cleaning: handling missing data, handling duplicate data, converting data types, changing column names and indexes, and grouping and aggregating data, each with a code example.