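Python Data Cleaning (Part 4)

Part 4: Cleaning Data for Analysis
Take a deeper look at some important aspects of data cleaning. Learn string manipulation and pattern matching to handle unstructured data, then explore techniques for dealing with missing or duplicate data. You will also learn to check your data programmatically for consistency, which gives you confidence that your code runs correctly and that your analysis results are reliable.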

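I. Data Types

1. Converting data types
Learn how to make sure that every categorical variable in a DataFrame has the category dtype, which can reduce memory usage.

The tips dataset has been loaded into a DataFrame named tips. It records how much each customer tipped, whether the customer was male or female, whether they smoked, and so on. Look at the output of tips.info() in the IPython Shell. You will notice that two columns that should be categorical, sex and smoker, are instead of type object, which is how pandas stores arbitrary strings. The task is to convert these two columns to the category dtype and observe the reduced memory usage.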

# Convert the sex column to type 'category'
tips['sex'] = tips['sex'].astype('category')

# Convert the smoker column to type 'category'
tips['smoker'] = tips['smoker'].astype('category')

# Print the info of tips (.info() prints directly and returns None)
tips.info()

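By converting sex and smoker to categorical variables, the DataFrame's memory usage dropped from 13.4 KB to 10.1 KB. That may not look like much, but on large datasets the savings can be very significant!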
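The effect can be reproduced on any low-cardinality string column. The following is a minimal sketch on a made-up DataFrame (the data and the name df are assumptions for illustration, not the real tips dataset); it measures memory with memory_usage(deep=True) before and after the conversion.

```python
import pandas as pd

# Hypothetical stand-in for the tips data: two string columns with few distinct values.
df = pd.DataFrame({
    "sex": ["Male", "Female"] * 500,
    "smoker": ["Yes", "No", "No", "Yes"] * 250,
})

# deep=True also counts the bytes used by the string values themselves.
before = df.memory_usage(deep=True).sum()

# A category column stores each distinct label once plus small integer codes.
df["sex"] = df["sex"].astype("category")
df["smoker"] = df["smoker"].astype("category")

after = df.memory_usage(deep=True).sum()

print(df.dtypes)
print(f"before: {before} bytes, after: {after} bytes")
```

The saving grows with the number of rows and shrinks as the number of distinct values approaches the number of rows.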

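2. Working with numeric data
If you expect a column to be numeric (int or float) but it shows up as object, that usually means the column contains non-numeric values, which also signals bad data.

The pd.to_numeric() function converts a column to a numeric dtype. If the function raises an error, you can be sure the column contains a bad value. At that point you can either use the techniques from Part 1 to do some exploratory data analysis and find the bad values, or choose to ignore them or force them to the missing value NaN. Passing the keyword argument errors='coerce' coerces unparseable values to NaN.

A modified version of the tips dataset has been preloaded into a DataFrame named tips. It was preprocessed to introduce some "bad" data for us to clean. The 'total_bill' and 'tip' columns in this DataFrame are stored as object because the string 'missing' is used in them to encode missing values. Coercing the values to a numeric type turns them into proper NaN values. Use the .info() method to explore it.

You can see that total_bill and tip are of type object, although they should be numeric. The task is to convert both columns to a numeric dtype.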

# Convert 'total_bill' to a numeric dtype
tips['total_bill'] = pd.to_numeric(tips['total_bill'], errors='coerce')

# Convert 'tip' to a numeric dtype
tips['tip'] = pd.to_numeric(tips['tip'], errors='coerce')

# Print the info of tips (.info() prints directly and returns None)
tips.info()
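To see both behaviours of pd.to_numeric() side by side, here is a small sketch on a hand-made Series (the values and the names raw and clean are illustrative assumptions). Without errors='coerce' the call raises a ValueError on the first unparseable string; with it, the bad entries become NaN.

```python
import pandas as pd

# Illustrative column where 'missing' and 'abc' stand in for bad data.
raw = pd.Series(["16.99", "missing", "10.34", "abc"], dtype=object)

# The default (errors='raise') fails as soon as a value cannot be parsed.
try:
    pd.to_numeric(raw)
    raised = False
except ValueError as err:
    raised = True
    print("to_numeric failed:", err)

# errors='coerce' turns unparseable values into NaN instead of failing.
clean = pd.to_numeric(raw, errors="coerce")
print(clean)
print("missing after coercion:", clean.isna().sum())
```

Counting the NaN values after coercion is a quick way to see how many entries were bad.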

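II. Cleaning Strings with Regular Expressions

1. Parsing strings with regular expressions
This exercise covers the basics of regular expressions, an effective way to define patterns that match strings.

When working with data, you sometimes need to write a regular expression to find correctly entered values. Phone numbers are a common field whose validity needs checking. The job in this exercise is to define a regular expression that matches US phone numbers of the form xxx-xxx-xxxx.

Python's regular expression module is re. When pattern matching over data, the pattern will be applied to many rows, so it is best to compile it first with re.compile() and then match values with the compiled pattern.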

# Import the regular expression module
import re

# Compile the pattern: prog (raw string so the backslashes reach re unchanged)
prog = re.compile(r'\d{3}-\d{3}-\d{4}')

# See if the pattern matches
result = prog.match('123-456-7890')
print(bool(result))

# See if the pattern matches
result2 = prog.match('1123-456-7890')
print(bool(result2))
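One caveat worth knowing: match() only anchors the pattern at the start of the string, so a value with extra trailing characters still matches. If the goal is to validate the whole field, fullmatch() is one option. A short sketch, with the candidate strings chosen purely for illustration:

```python
import re

# Same phone pattern as in the exercise, written as a raw string.
prog = re.compile(r"\d{3}-\d{3}-\d{4}")

candidates = [
    "123-456-7890",       # valid
    "1123-456-7890",      # extra leading digit
    "123-456-78901",      # extra trailing digit
    "call 123-456-7890",  # number embedded in other text
]

for s in candidates:
    # match() checks only the beginning; fullmatch() requires the whole string to fit.
    print(f"{s!r}: match={bool(prog.match(s))}, fullmatch={bool(prog.fullmatch(s))}")
```

Only the value with a trailing extra digit is treated differently by the two methods: it passes match() but fails fullmatch().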

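2. Extracting numeric values from strings
Extracting numbers from strings is a common task, especially when working with unstructured data or log files.

Suppose you have the following string: 'the recipe calls for 10 strawberries and 1 banana'. You want to extract 10 and 1 from it and save them, so that you can later compare the ratio of strawberries to bananas. To extract several numbers (or, more precisely, several pattern matches) with a regular expression, use the re.findall() function. It is easy to use: pass it two arguments, the pattern followed by the string, and it returns a list of matches. \d is the pattern for a single digit; \d+ matches one or more digits, which ensures that 10 is treated as one number rather than as 1 and 0.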

# Import the regular expression module
import re

# Find the numeric values: matches (raw string so \d is passed to re unchanged)
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')

# Print the matches
print(matches)

['10', '1']
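Note that re.findall() returns strings, so the matches need converting before any arithmetic. The sketch below (variable names are illustrative) also shows what happens with \d alone, and computes the strawberry-to-banana ratio mentioned above.

```python
import re

text = "the recipe calls for 10 strawberries and 1 banana"

# Without the + quantifier every digit is a separate match.
single_digits = re.findall(r"\d", text)

# With + each run of digits is one match; convert the strings to integers.
strawberries, bananas = (int(m) for m in re.findall(r"\d+", text))

ratio = strawberries / bananas

print(single_digits)
print(strawberries, bananas, ratio)
```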

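3. Pattern matching
In this exercise you keep practicing regular expressions. For each string provided, write a suitable pattern to match it.

(1) A phone number of the form xxx-xxx-xxxx.
(2) A string of the form: dollar sign, any number of digits, decimal point, 2 digits.
Use \$ to match the dollar sign, \d* to match any number of digits, \. to match the decimal point, and \d{x} to match x digits.
(3) A capital letter followed by any number of alphanumeric characters.
Use [A-Z] to match any capital letter, followed by \w* to match any number of alphanumeric characters.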

# Write the first pattern
pattern1 = bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890'))
print(pattern1)

# Write the second pattern
pattern2 = bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45'))
print(pattern2)

# Write the third pattern
pattern3 = bool(re.match(pattern=r'[A-Z]\w*', string='Australia'))
print(pattern3)

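III. Cleaning Data with Functions

1. Writing a custom function to clean data

The tips dataset has been preloaded into a DataFrame named tips. It has a 'sex' column containing the values 'Male' or 'Female'. Write a function that recodes 'Female' as 0 and 'Male' as 1, and returns np.nan for every 'sex' entry that is neither 'Female' nor 'Male'.

Recoding variables like this is a common data cleaning task. Functions provide a mechanism to abstract away complex code and to reuse it, which makes code more readable and less error-prone. The .apply() method applies a function across the rows or columns of a DataFrame. Note, however, that each column of a DataFrame is a pandas Series, and functions can be applied to a Series as well. Here the function is applied to the 'sex' column.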

# numpy provides np.nan for unrecognized entries
import numpy as np

# Define recode_gender()
def recode_gender(gender):

    # Return 1 if gender is 'Male'
    if gender == 'Male':
        return 1

    # Return 0 if gender is 'Female'
    elif gender == 'Female':
        return 0

    # Return np.nan for anything else
    else:
        return np.nan

# Apply the function to the sex column
tips['recode'] = tips.sex.apply(recode_gender)

# Print the first five rows of tips
print(tips.head())

For simple recoding you can also use the replace method, or convert the column to the category dtype.
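As a hedged illustration of a function-free alternative, the sketch below uses Series.map() with a dictionary, which behaves like recode_gender() in that any value missing from the dictionary becomes NaN. (replace() differs here: it leaves unmatched values unchanged.) The sample Series is made up for the example.

```python
import pandas as pd

# Illustrative 'sex' values, including a miscapitalized entry and a missing one.
sex = pd.Series(["Male", "Female", "male", None], dtype=object)

# Values not found in the mapping ('male', None) come back as NaN.
recoded = sex.map({"Female": 0, "Male": 1})

print(recoded)
```

Because NaN is a float, the resulting column has a float dtype rather than an integer one.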

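2. Lambda functions
Lambda functions are a powerful Python feature that helps you clean data more efficiently. Rather than using the def syntax, a lambda defines a simple function in a single expression.

The tips dataset has been preloaded into a DataFrame named tips. Clean its 'total_dollar' column by removing the dollar sign. Do this in two different ways: with the .replace() method and with a regular expression.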

# Write the lambda function using replace
tips['total_dollar_replace'] = tips.total_dollar.apply(lambda x: x.replace('$', ''))

# Write the lambda function using regular expressions (raw string pattern; requires import re)
tips['total_dollar_re'] = tips.total_dollar.apply(lambda x: re.findall(r'\d+\.\d+', x)[0])

# Print the head of tips
print(tips.head())
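Both lambda versions leave the cleaned values as strings. As a sketch of an alternative, pandas' vectorized string methods can do the same job without apply(), and pd.to_numeric() then yields real numbers. The sample values and variable names below are assumptions for illustration.

```python
import pandas as pd

# Illustrative stand-in for the 'total_dollar' column.
total_dollar = pd.Series(["$16.99", "$10.34", "$21.01"])

# Literal replacement: regex=False treats '$' as a plain character, not an anchor.
stripped = total_dollar.str.replace("$", "", regex=False)

# Convert the cleaned strings to floats.
as_float = pd.to_numeric(stripped)

# Regex route: extract the first number with a decimal part from each string.
extracted = total_dollar.str.extract(r"(\d+\.\d+)", expand=False)

print(as_float)
print(extracted)
```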

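IV. Duplicate and Missing Data

1. Dropping duplicate data
Duplicate data causes all kinds of problems. From a performance standpoint, duplicates consume unnecessary memory and cause unnecessary computation when the data is processed. They can also bias the results of any analysis.

A dataset of song performance on the Billboard charts has been preloaded into a DataFrame named billboard. The job in this exercise is to subset this DataFrame and then drop all duplicate rows.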

# Create the new DataFrame: tracks
tracks = billboard[['year','artist','track','time']]

# Print info of tracks (.info() prints directly and returns None)
tracks.info()

# Drop the duplicates: tracks_no_duplicates
tracks_no_duplicates = tracks.drop_duplicates()

# Print info of tracks_no_duplicates
tracks_no_duplicates.info()

After dropping duplicates, the DataFrame went from 24092 entries to only 317!
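Before dropping anything it can help to count how many rows are duplicates. The following small sketch uses an invented DataFrame (not the Billboard data) to show duplicated() and drop_duplicates() together.

```python
import pandas as pd

# Made-up track listing in which each song appears once per chart week.
tracks = pd.DataFrame({
    "artist": ["A", "A", "A", "B", "B"],
    "track": ["x", "x", "x", "y", "y"],
    "time": ["3:30", "3:30", "3:30", "4:05", "4:05"],
})

# duplicated() flags every repeat of a row after its first occurrence.
n_dupes = tracks.duplicated().sum()

# drop_duplicates() keeps the first occurrence; reset_index tidies the row labels.
deduped = tracks.drop_duplicates().reset_index(drop=True)

print("duplicate rows:", n_dupes)
print(deduped)
```

To treat rows as duplicates based on only some columns, drop_duplicates() accepts a subset argument.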

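2. Filling in missing data
The air quality dataset used in Part 2 has been preloaded into the DataFrame airquality, and it has missing values for you to practice filling in. Explore airquality in the IPython Shell to check which columns have missing values. Few (real-world) datasets are free of missing values, and handling them matters because some computations cannot cope with missing values while others skip them by default. In addition, knowing how much of your data is missing, and thinking about where the gaps come from, is essential for an unbiased interpretation of the data.

Use the .mean() method on airquality.Ozone to compute the mean of the Ozone column.
Use the fillna() method to replace all missing values in the Ozone column of airquality with the mean, oz_mean.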

# Calculate the mean of the Ozone column: oz_mean
oz_mean = airquality.Ozone.mean()

# Replace all the missing values in the Ozone column with the mean
airquality['Ozone'] = airquality.Ozone.fillna(oz_mean)

# Print the info of airquality (.info() prints directly and returns None)
airquality.info()
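The same steps on a tiny hand-made Series make the arithmetic easy to check. This is only a sketch with invented readings; note that .mean() skips NaN by default, so the mean is taken over the observed values only.

```python
import numpy as np
import pandas as pd

# Illustrative ozone readings with two gaps.
ozone = pd.Series([41.0, np.nan, 12.0, np.nan, 28.0])

# Count the missing values before filling.
n_missing = ozone.isna().sum()

# mean() ignores NaN: (41 + 12 + 28) / 3
oz_mean = ozone.mean()

# Replace each gap with the mean of the observed values.
filled = ozone.fillna(oz_mean)

print("missing:", n_missing, "mean:", oz_mean)
print(filled)
```

Filling with the mean keeps the column's average unchanged but reduces its variance, which is worth keeping in mind when interpreting later results.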

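V. Testing Data with Assertions
Here you practice writing assert statements against the Ebola dataset from the earlier parts, to check programmatically for missing values and to confirm that all values are non-negative. The dataset has been preloaded into a DataFrame named ebola. Use the .all() method together with the .notnull() DataFrame method to check for missing values in a column. The .all() method returns True if all values are True. When used on a DataFrame, it returns a Series of Booleans, one for each column. So when you use it on a DataFrame, as in this exercise, you need to chain another .all() method so that a single True or False value is returned. When these expressions are used in an assert statement, nothing is returned if the condition is true: that is how you can confirm that the data being checked is valid.

Note: pd.notnull(df) can be used as an alternative to df.notnull().

(1) Write an assert statement to confirm that there are no missing values in ebola.
Use the pd.notnull() function on ebola (or the .notnull() method of ebola) and chain two .all() methods (that is, .all().all()). The first .all() returns True or False for each column, and the second .all() returns a single True or False.
(2) Write an assert statement to confirm that all values in ebola are greater than or equal to 0.
Chain two .all() methods onto the Boolean condition (ebola >= 0).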

# Assert that there are no missing values
assert pd.notnull(ebola).all().all()

# Assert that all values are >= 0
assert (ebola >= 0).all().all()

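Since the assert statements did not raise any errors, you can be sure that there are no missing values in the data and that all values are >= 0!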
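For contrast, here is a sketch of what a failing check looks like, using a small invented DataFrame with one missing value (the column names are assumptions, not the real ebola columns). It also shows the intermediate per-column result of the first .all().

```python
import numpy as np
import pandas as pd

# Made-up data with a single missing value in 'cases'.
df = pd.DataFrame({"cases": [1.0, np.nan, 3.0], "deaths": [0.0, 1.0, 2.0]})

# First .all(): one Boolean per column. Second .all(): one Boolean overall.
per_column = df.notnull().all()
overall = per_column.all()

# Comparisons with NaN are False, so the >= 0 check also fails on missing data.
nonneg = (df >= 0).all().all()

# An assert with a false condition raises AssertionError; the message is optional.
try:
    assert df.notnull().all().all(), "found missing values"
    failed = False
except AssertionError as err:
    failed = True
    print("AssertionError:", err)

print(per_column)
```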
