Python Machine Learning (96): The Pandas apply Function

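As with Python lists, you can iterate over a DataFrame or Series with a for loop, but doing so (especially on large datasets) is very slow.

Pandas provides a more efficient alternative: the apply() method.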

Syntax

DataFrame.apply(func)

Series.apply(func)

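func: the function to apply. Series.apply calls it on each element of the Series; DataFrame.apply calls it on each column (or on each row when axis=1 is passed).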
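As a minimal sketch of the two call forms, the toy data below (the column names a and b and the helper double are made up for illustration) shows that Series.apply calls the function once per element, while DataFrame.apply passes it a whole column at a time by default, or a whole row when axis=1:

```python
import pandas as pd

# Hypothetical toy data, not the movie dataset used in this article
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def double(x):
    return x * 2

# Series.apply: double() is called once for each element of column "a"
doubled = df["a"].apply(double)

# DataFrame.apply: by default the function receives each column as a Series
column_max = df.apply(lambda col: col.max())

# With axis=1 the function receives each row as a Series instead
row_sum = df.apply(lambda row: row["a"] + row["b"], axis=1)

print(doubled.tolist())     # [2, 4, 6]
print(column_max.tolist())  # [3, 30]
print(row_sum.tolist())     # [11, 22, 33]
```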

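In the example below, every movie in the DataFrame with a rating of 8.0 or higher is labelled "good", and every other movie is labelled "bad".

First, create a function that returns "good" if the rating is >= 8.0 and "bad" otherwise: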

def rating_function(x):
    if x >= 8.0:
        return "good"
    else:
        return "bad"

Now use apply() to run this function on every element of the "rating" column:

import pandas as pd

# Load the data, using the movie title as the index
movies_df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title")
movies_df.columns = ['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime', 
                     'rating', 'votes', 'revenue_millions', 'metascore']

# Apply rating_function to the "rating" column
movies_df["rating_category"] = movies_df["rating"].apply(rating_function)

movies_df.head(2)

Output

                         rank                     genre  ... metascore rating_category
Title                                                    ...
Guardians of the Galaxy     1   Action,Adventure,Sci-Fi  ...      76.0            good
Prometheus                  2  Adventure,Mystery,Sci-Fi  ...      65.0             bad

[2 rows x 12 columns]

The apply() method runs rating_function on every element of the rating column and returns a new Series. That Series is then assigned to a new column named rating_category.
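The same pattern can be tried without the CSV file. The sketch below uses a small hand-made DataFrame (the titles and ratings are placeholder values chosen for illustration, not taken from the dataset) and restates rating_function so that it runs on its own. It suggests that apply() returns a new Series aligned on the original index, which can then be assigned as a column:

```python
import pandas as pd

def rating_function(x):
    # Same rule as above: 8.0 or higher counts as "good"
    return "good" if x >= 8.0 else "bad"

# Placeholder titles and ratings, for illustration only
toy_df = pd.DataFrame(
    {"rating": [8.1, 7.0, 8.0]},
    index=pd.Index(["Movie A", "Movie B", "Movie C"], name="Title"),
)

# apply() returns a new Series; the "rating" column itself is not modified
categories = toy_df["rating"].apply(rating_function)

print(type(categories).__name__)              # Series
print(categories.index.equals(toy_df.index))  # True
print(categories.tolist())                    # ['good', 'bad', 'good']

# Assigning the result creates the new column
toy_df["rating_category"] = categories
```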

apply() also accepts anonymous functions. This lambda does the same thing as rating_function:

movies_df["rating_category"] = movies_df["rating"].apply(lambda x: 'good' if x >= 8.0 else 'bad')

movies_df.head(2)

Output

                         rank                     genre  ... metascore rating_category
Title                                                    ...
Guardians of the Galaxy     1   Action,Adventure,Sci-Fi  ...      76.0            good
Prometheus                  2  Adventure,Mystery,Sci-Fi  ...      65.0             bad

[2 rows x 12 columns]

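In general, apply() is faster than iterating over rows by hand, for example with iterrows(). Note, however, that Series.apply() still calls the Python function once per element, so it is not itself vectorized; for simple arithmetic or comparisons, Pandas' built-in vectorized operations are usually faster still.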

Vectorization: a style of computer programming in which operations are applied to whole arrays instead of individual elements (Wikipedia)
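To make the definition concrete, here is a sketch of the same good/bad labelling written as a whole-array operation with numpy.where instead of a per-element function call. The ratings are placeholder values for illustration; the comparison ratings >= 8.0 is evaluated on the entire column at once:

```python
import numpy as np
import pandas as pd

# Placeholder ratings, for illustration only
ratings = pd.Series([8.1, 7.0, 8.0, 6.5])

# Element by element: the lambda is called once per value
with_apply = ratings.apply(lambda x: "good" if x >= 8.0 else "bad")

# Whole array: one comparison and one selection over the full column
vectorized = pd.Series(np.where(ratings >= 8.0, "good", "bad"), index=ratings.index)

print(vectorized.tolist())                         # ['good', 'bad', 'good', 'bad']
print(with_apply.tolist() == vectorized.tolist())  # True
```

On large columns the whole-array form tends to be noticeably faster, although the exact difference depends on the data and the Pandas version.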

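A good example of heavy apply() use is natural language processing (NLP), where various text-cleaning functions have to be applied to strings to prepare them for machine learning.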
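A text-cleaning step of this kind might look like the sketch below. The clean_text helper and the sample strings are hypothetical, chosen only to illustrate applying a cleaning function to a column of strings; real pipelines usually need more careful rules:

```python
import re
import pandas as pd

def clean_text(text):
    """Lowercase, drop punctuation, and collapse repeated whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical sample strings, for illustration only
descriptions = pd.Series(["A GROUP of criminals...", "  Space,   the final frontier! "])

# Run the cleaning function on every string in the Series
cleaned = descriptions.apply(clean_text)

print(cleaned.tolist())  # ['a group of criminals', 'space the final frontier']
```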
