1. Handling duplicate values

DataFrame.duplicated: flags each row that duplicates an earlier row
DataFrame.duplicated(self, subset: Union[Hashable, Sequence[Hashable], NoneType] = None, keep: Union[str, bool] = 'first')
DataFrame.drop_duplicates: removes duplicate rows

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
DataFrame.drop_duplicates(self, subset: Union[Hashable, Sequence[Hashable], NoneType] = None, keep: Union[str, bool] ='first', inplace: bool = False, ignore_index: bool = False)
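
To make the keep parameter in the two signatures above concrete, here is a minimal sketch on a made-up three-row DataFrame (the toy data is my own assumption, it is not taken from MotorcycleData.csv). It only shows which rows get flagged under each setting.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical.
toy = pd.DataFrame({
    "Make": ["BMW", "Honda", "BMW"],
    "Price": [3872.0, 5000.0, 3872.0],
})

# keep='first' (default): the first occurrence is not flagged, later copies are.
first_mask = toy.duplicated(keep="first").tolist()
# keep='last': the last occurrence is not flagged, earlier copies are.
last_mask = toy.duplicated(keep="last").tolist()
# keep=False: every member of a duplicate group is flagged.
all_mask = toy.duplicated(keep=False).tolist()

print(first_mask)  # [False, False, True]
print(last_mask)   # [True, False, False]
print(all_mask)    # [True, False, True]
```

drop_duplicates accepts the same keep values and removes exactly the rows that duplicated would flag.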

Below is the complete record of the steps I followed along with the video, only lightly tidied up, for later reference.

import pandas as pd
import numpy as np
import os

Change to the directory that contains the data file

os.chdir(r'C:\代码和数据')
# Without the r prefix, every single backslash \ in the path has to be written as a double backslash \\
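
A quick check of the comment above: the raw string and the escaped string are the same path. This is only a sketch, it compares the two strings and does not touch the file system.

```python
# The raw string keeps the backslash literally; the plain string needs it doubled.
raw_path = r'C:\代码和数据'
escaped_path = 'C:\\代码和数据'

same = raw_path == escaped_path
print(same)  # True
```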

Read the file

df = pd.read_csv('MotorcycleData.csv', encoding='gbk', na_values='Na')
# Treat cells whose content is 'Na' as missing values; note that the parameter is na_values, not na_value
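
The effect of na_values can be seen without the real file. The sketch below uses a hypothetical two-row CSV held in memory (io.StringIO), so the column names and values are assumptions made up for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the real file.
csv_text = "Make,Mileage\nBMW,Na\nHonda,120\n"

# Without na_values, 'Na' is kept as an ordinary string.
raw = pd.read_csv(io.StringIO(csv_text))
# With na_values='Na', the same cell is parsed as a missing value.
parsed = pd.read_csv(io.StringIO(csv_text), na_values='Na')

print(raw['Mileage'].tolist())            # ['Na', '120']
print(parsed['Mileage'].isna().tolist())  # [True, False]
```

Because the 'Na' cell becomes NaN, the rest of the column can be parsed as numbers instead of text.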

Look at the first three rows

df.head(3)

	Condition	Condition_Desc	Price	Location	Model_Year	Mileage	Exterior_Color	Make	Warranty	Model	...	Vehicle_Title	OBO	Feedback_Perc	Watch_Count	N_Reviews	Seller_Status	Vehicle_Tile	Auction	Buy_Now	Bid_Count
0	Used	mint!!! very low miles	$11,412	McHenry, Illinois, United States	2013.0	16,000	Black	Harley-Davidson	Unspecified	Touring	...	NaN	FALSE	8.1	NaN	2427	Private Seller	Clear	True	FALSE	28.0
1	Used	Perfect condition	$17,200	Fort Recovery, Ohio, United States	2016.0	60	Black	Harley-Davidson	Vehicle has an existing warranty	Touring	...	NaN	FALSE	100	17	657	Private Seller	Clear	True	TRUE	0.0
2	Used	NaN	$3,872	Chicago, Illinois, United States	1970.0	25,763	Silver/Blue	BMW	Vehicle does NOT have an existing warranty	R-Series	...	NaN	FALSE	100	NaN	136	NaN	Clear	True	FALSE	26.0

3 rows × 22 columns

Define a function that strips the non-numeric characters from Price and Mileage, keeps the digits, and converts the value to float

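def f(x):
    # Strip the '$' from Price values and remove the ',' thousands
    # separators from both Price and Mileage values.
    x = str(x).strip('$').replace(',', '')
    # Missing values arrive as NaN: str(NaN) is 'nan' and float('nan') is NaN again.
    return float(x)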

Apply the function f to the Price and Mileage columns

df['Price']=df['Price'].apply(f)
df['Mileage']=df['Mileage'].apply(f)
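
As an alternative to apply, the same cleaning can probably be done with vectorized string methods. This is a sketch under my own assumptions: to_number and the sample values are hypothetical, and unlike float(x) in f, errors='coerce' turns unparseable text into NaN instead of raising an error.

```python
import numpy as np
import pandas as pd

def to_number(series):
    """Remove '$' and ',' from every value, then convert to float (NaN stays NaN)."""
    cleaned = series.astype(str).str.replace(r'[$,]', '', regex=True)
    return pd.to_numeric(cleaned, errors='coerce')

# Hypothetical sample values shaped like the Price and Mileage columns.
sample = pd.Series(['$11,412', '16,000', '60', np.nan])
result = to_number(sample)
print(result.tolist())  # [11412.0, 16000.0, 60.0, nan]
```

On the real data this would be used as df['Price'] = to_number(df['Price']), and likewise for Mileage.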

Check the values of the processed columns

df[['Price','Mileage']].head(3)

	Price	Mileage
0	11412.0	16000.0
1	17200.0	60.0
2	3872.0	25763.0

Check the dtypes of the processed columns

df[['Price','Mileage']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7493 entries, 0 to 7492
Data columns (total 2 columns):
Price      7493 non-null float64
Mileage    7467 non-null float64
dtypes: float64(2)
memory usage: 117.2 KB

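df.duplicated() returns a boolean Series with one entry per row: True if the row is identical to an earlier row, False otherwise. With the default keep='first', the first occurrence of a repeated row is marked False. The method has no axis parameter; it always compares rows.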

any(df.duplicated())  # True as soon as df contains at least one duplicated row

True

df[df.duplicated()].head(3)  # show the first three duplicated rows of df

	Condition	Condition_Desc	Price	Location	Model_Year	Mileage	Exterior_Color	Make	Warranty	Model	...	Vehicle_Title	OBO	Feedback_Perc	Watch_Count	N_Reviews	Seller_Status	Vehicle_Tile	Auction	Buy_Now	Bid_Count
57	Used	NaN	4050.0	Gilberts, Illinois, United States	2006.0	6650.0	Black	Harley-Davidson	Vehicle does NOT have an existing warranty	Softail	...	NaN	FALSE	NaN	7<	58	Private Seller	Clear	True	TRUE	3.0
63	Used	NaN	7300.0	Rolling Meadows, Illinois, United States	1997.0	20000.0	Black	Harley-Davidson	Vehicle does NOT have an existing warranty	Sportster	...	NaN	TRUE	100	5<	111	Private Seller	Clear	False	TRUE	NaN
64	Used	Dent and scratch free. Paint and chrome in exc...	5000.0	South Bend, Indiana, United States	2003.0	1350.0	Black	Harley-Davidson	Vehicle does NOT have an existing warranty	Sportster	...	NaN	FALSE	100	14	37	Private Seller	Clear	False	TRUE	NaN

3 rows × 22 columns

np.sum(df.duplicated())  # count the duplicated rows
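
The two checks above can also be written with the Series methods any() and sum(). A small sketch on made-up data (the toy frame is an assumption, not the motorcycle data), where rows 2 and 3 repeat row 0:

```python
import pandas as pd

# Hypothetical toy data: rows 2 and 3 repeat row 0.
toy = pd.DataFrame({
    "Make": ["BMW", "Honda", "BMW", "BMW"],
    "Price": [3872.0, 5000.0, 3872.0, 3872.0],
})

mask = toy.duplicated()
has_duplicates = bool(mask.any())  # same answer as any(mask)
n_duplicates = int(mask.sum())     # same answer as np.sum(mask)
print(has_duplicates, n_duplicates)  # True 2
```

Note that the count is 2, not 3: with the default keep='first', the first occurrence is not counted as a duplicate.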

drop_duplicates() removes duplicate rows

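df.drop_duplicates().head(3)  # drop the duplicated rows and return the result as a new DataFrame (a copy, not a view)
# The original df is modified only when inplace=True is passed
# Note that the method is named drop_duplicates, not drop_duplicated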

	Condition	Condition_Desc	Price	Location	Model_Year	Mileage	Exterior_Color	Make	Warranty	Model	...	Vehicle_Title	OBO	Feedback_Perc	Watch_Count	N_Reviews	Seller_Status	Vehicle_Tile	Auction	Buy_Now	Bid_Count
0	Used	mint!!! very low miles	11412.0	McHenry, Illinois, United States	2013.0	16000.0	Black	Harley-Davidson	Unspecified	Touring	...	NaN	FALSE	8.1	NaN	2427	Private Seller	Clear	True	FALSE	28.0
1	Used	Perfect condition	17200.0	Fort Recovery, Ohio, United States	2016.0	60.0	Black	Harley-Davidson	Vehicle has an existing warranty	Touring	...	NaN	FALSE	100	17	657	Private Seller	Clear	True	TRUE	0.0
2	Used	NaN	3872.0	Chicago, Illinois, United States	1970.0	25763.0	Silver/Blue	BMW	Vehicle does NOT have an existing warranty	R-Series	...	NaN	FALSE	100	NaN	136	NaN	Clear	True	FALSE	26.0

3 rows × 22 columns
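
To see that drop_duplicates leaves the original untouched unless inplace=True is passed, here is a minimal sketch on a hypothetical three-row frame. The ignore_index argument comes from the signature quoted at the top and, as far as I know, needs pandas 1.0 or later.

```python
import pandas as pd

# Hypothetical toy data: row 1 repeats row 0.
toy = pd.DataFrame({
    "Make": ["BMW", "BMW", "Honda"],
    "Price": [3872.0, 3872.0, 5000.0],
})

deduped = toy.drop_duplicates()                     # new DataFrame, toy is unchanged
rows_before = len(toy)                              # still 3
reindexed = toy.drop_duplicates(ignore_index=True)  # same rows, index renumbered from 0

returned = toy.drop_duplicates(inplace=True)        # modifies toy itself, returns None
rows_after = len(toy)                               # now 2

print(rows_before, rows_after)                      # 3 2
print(deduped.index.tolist())                       # [0, 2]
print(reindexed.index.tolist())                     # [0, 1]
```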

Check the number of rows and columns

df.shape

(7493, 22)

List the column names

df.columns

Index(['Condition', 'Condition_Desc', 'Price', 'Location', 'Model_Year',
       'Mileage', 'Exterior_Color', 'Make', 'Warranty', 'Model', 'Sub_Model',
       'Type', 'Vehicle_Title', 'OBO', 'Feedback_Perc', 'Watch_Count',
       'N_Reviews', 'Seller_Status', 'Vehicle_Tile', 'Auction', 'Buy_Now',
       'Bid_Count'],
      dtype='object')

Drop the rows that repeat an earlier row's values in the columns 'Condition', 'Condition_Desc', 'Price', and 'Location'

df.drop_duplicates(subset=['Condition', 'Condition_Desc', 'Price', 'Location'],inplace=True)

Check the shape again: the number of rows has clearly dropped (from 7493 to 5356); without inplace=True, df would not have changed

df.shape

(5356, 22)
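
Why subset removes so many more rows: only the listed columns are compared, so rows that differ elsewhere still count as duplicates. A sketch on made-up data (column values are assumptions chosen for illustration):

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 agree on Condition and Price but differ in Mileage.
toy = pd.DataFrame({
    "Condition": ["Used", "Used", "New"],
    "Price": [4050.0, 4050.0, 4050.0],
    "Mileage": [6650.0, 7000.0, 10.0],
})

# Comparing all columns: no two rows are fully identical, so nothing is dropped.
full = toy.drop_duplicates()
# Comparing only Condition and Price: row 1 repeats row 0 and is dropped.
by_subset = toy.drop_duplicates(subset=["Condition", "Price"])

print(full.shape)                 # (3, 3)
print(by_subset.shape)            # (2, 3)
print(by_subset.index.tolist())   # [0, 2]
```

The dropped row keeps none of its other values (here the Mileage of 7000.0 is lost), so the subset should be chosen with care.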

For testing: take the first two rows and use a boolean mask to keep only the first one

df.head(2)[[True,False]]

	Condition	Condition_Desc	Price	Location	Model_Year	Mileage	Exterior_Color	Make	Warranty	Model	...	Vehicle_Title	OBO	Feedback_Perc	Watch_Count	N_Reviews	Seller_Status	Vehicle_Tile	Auction	Buy_Now	Bid_Count
0	Used	mint!!! very low miles	11412.0	McHenry, Illinois, United States	2013.0	16000.0	Black	Harley-Davidson	Unspecified	Touring	...	NaN	FALSE	8.1	NaN	2427	Private Seller	Clear	True	FALSE	28.0

1 rows × 22 columns
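
This test shows the mechanism behind df[df.duplicated()]: a list (or Series) of booleans with one entry per row selects the rows marked True. A minimal sketch with a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical two-row frame.
toy = pd.DataFrame({
    "Make": ["Harley-Davidson", "BMW"],
    "Price": [11412.0, 3872.0],
})

# One boolean per row: True keeps the row, False drops it.
first_only = toy[[True, False]]
# duplicated() produces such a mask; here no row repeats, so nothing is selected.
dupes = toy[toy.duplicated()]

print(first_only.index.tolist())  # [0]
print(len(dupes))                 # 0
```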

w.ang.jie

