4.1 Choosing Parameters
Before mining the association rules used in affinity analysis, we first generate frequent itemsets with the Apriori algorithm. Association rules are then produced by testing combinations of premises and conclusions drawn from those frequent itemsets.
In the first stage, the Apriori algorithm needs a minimum support value that an itemset must reach to be considered frequent; any itemset below the minimum support is discarded. If the minimum support is set too low, Apriori has to test a huge number of itemsets and slows down; if it is set too high, very few frequent itemsets are found.
After the frequent itemsets are found, the second stage selects association rules by confidence. We can either set a minimum confidence and return only the rules that reach it, or return all rules and let the user choose. Setting the confidence threshold too low yields rules with high support but low accuracy; setting it too high yields accurate rules but returns fewer of them.
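To make the two thresholds concrete, here is a minimal sketch (using made-up transactions, not the MovieLens data) of how support and confidence are computed for a single candidate rule:

```python
# Hypothetical transactions: each set is the items one customer bought.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

premise, conclusion = {"bread"}, {"milk"}

# Support of the rule: how many transactions contain premise AND conclusion.
support = sum(1 for t in transactions if (premise | conclusion) <= t)

# Confidence: rule support divided by how often the premise appears.
premise_count = sum(1 for t in transactions if premise <= t)
confidence = support / premise_count

print(support)      # 2
print(confidence)   # 0.666...
```

With a minimum support of 3 this rule would already be filtered out in the first stage; with a minimum confidence above 0.7 it would be filtered out in the second.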
4.2 The Movie Recommendation Problem
4.2.1 Obtaining the Dataset
The dataset can be downloaded from https://grouplens.org/datasets/movielens/. This chapter uses the MovieLens dataset with 100,000 ratings (ml-100k).
4.2.2 Loading the Data
import pandas as pd
ratings_filename = r"E:\ml-100k\u.data"
all_ratings = pd.read_csv(ratings_filename, delimiter="\t", header=None,
                          names=["UserID", "MovieID", "Rating", "Datetime"])
# Parse the Unix timestamps into datetimes
all_ratings["Datetime"] = pd.to_datetime(all_ratings["Datetime"], unit="s")
print(all_ratings[:5])
The first five records look like this:
UserID MovieID Rating Datetime
0 196 242 3 1997-12-04 15:55:49
1 186 302 3 1998-04-04 19:22:22
2 22 377 1 1997-11-07 07:18:36
3 244 51 2 1997-11-27 05:02:03
4 166 346 1 1998-02-02 05:33:16
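As a quick sanity check on the parsing step, `unit="s"` tells pandas to interpret the raw values as seconds since the Unix epoch. Feeding in the raw timestamp of the first record reproduces the datetime shown above:

```python
import pandas as pd

# Raw ml-100k timestamp of the first rating row (seconds since the epoch).
ts = pd.to_datetime(pd.Series([881250949]), unit="s")
print(ts.iloc[0])  # 1997-12-04 15:55:49
```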
4.3 Implementing the Apriori Algorithm
The goal of this mining task is to generate rules of the form: if a user likes a certain set of movies, they will also like another particular movie. We start by creating a new feature and preparing the data.
import pandas as pd
ratings_filename = r"E:\ml-100k\u.data"
all_ratings = pd.read_csv(ratings_filename, delimiter="\t", header=None,
                          names=["UserID", "MovieID", "Rating", "Datetime"])
# Parse the Unix timestamps into datetimes
all_ratings["Datetime"] = pd.to_datetime(all_ratings["Datetime"], unit="s")
# New feature: a rating above 3 means the user liked the movie
all_ratings["Favorable"] = all_ratings["Rating"] > 3
# Keep only the ratings from the first 200 users
ratings = all_ratings[all_ratings["UserID"].isin(range(200))]
# New dataset containing only the rows where a user liked the movie
favorable_ratings = ratings[ratings["Favorable"]]
# Group each user's liked movies by UserID
favorable_reviews_by_users = dict((k, frozenset(v.values))
                                  for k, v in favorable_ratings.groupby("UserID")["MovieID"])
# Build a data frame counting how many fans each movie has
num_favorable_by_movie = ratings[["MovieID", "Favorable"]].groupby("MovieID").sum()
# Sort by Favorable in descending order to find the five most popular movies
num_favorable_by_movie_sort = num_favorable_by_movie.sort_values("Favorable", ascending=False)
print(num_favorable_by_movie_sort[:5])
The output is:
         Favorable
MovieID
50           100.0
100           89.0
258           83.0
181           79.0
174           74.0
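The `Favorable` counts above come from summing a boolean column. A minimal sketch (with hypothetical data, not the MovieLens ratings) shows that summing booleans per group counts the `True` values:

```python
import pandas as pd

# Hypothetical ratings: two movies, boolean "liked" flags.
df = pd.DataFrame({
    "MovieID": [1, 1, 2, 2, 2],
    "Favorable": [True, False, True, True, False],
})

# Summing a boolean column per group counts the True entries.
counts = df.groupby("MovieID")["Favorable"].sum()
print(counts[1], counts[2])  # 1 2
```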
4.3.1 The Apriori Algorithm
Apriori is the part of affinity analysis that finds the frequent itemsets in a dataset. The basic idea is to build new candidate itemsets from the frequent itemsets found in the previous step, and then test whether those candidates occur frequently enough. The algorithm iterates as follows:
(1) Create the initial frequent itemsets by putting each item into an itemset of its own, keeping only the items that reach the minimum support.
(2) Build supersets of the existing frequent itemsets to discover new candidate itemsets.
(3) Test how frequent the new candidates are, discarding those that are not frequent enough. If no new frequent itemsets were found, jump to the last step.
(4) Store the newly discovered frequent itemsets and go back to step (2).
(5) Return all the frequent itemsets discovered.
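The five steps above can be sketched end to end on a toy transaction list (hypothetical data; note that this sketch counts each candidate at most once per transaction):

```python
from collections import defaultdict

# Hypothetical transactions (sets of items).
transactions = [frozenset(t) for t in
                ({"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"})]
min_support = 2

# Step (1): count single items and keep those reaching min_support.
counts = defaultdict(int)
for t in transactions:
    for item in t:
        counts[frozenset((item,))] += 1
frequent = [{s: c for s, c in counts.items() if c >= min_support}]

# Steps (2)-(4): grow supersets from the previous level until none qualify.
while frequent[-1]:
    counts = defaultdict(int)
    for t in transactions:
        # Collect candidates first so each is counted once per transaction.
        candidates = set()
        for itemset in frequent[-1]:
            if itemset <= t:
                for other in t - itemset:
                    candidates.add(itemset | frozenset((other,)))
        for candidate in candidates:
            counts[candidate] += 1
    frequent.append({s: c for s, c in counts.items() if c >= min_support})

frequent.pop()  # the final level is always empty

# Step (5): `frequent` holds all frequent itemsets, grouped by size.
print(frequent)
```

On this data every single item appears 3 times and every pair appears exactly twice, so the algorithm stops after the pair level: the only triple, {a, b, c}, appears once and misses the minimum support.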
4.3.2 Implementation
# -*- coding: UTF-8 -*-
import sys
from collections import defaultdict
from operator import itemgetter

import pandas as pd

ratings_filename = r"E:\ml-100k\u.data"
all_ratings = pd.read_csv(ratings_filename, delimiter="\t", header=None,
                          names=["UserID", "MovieID", "Rating", "Datetime"])
# Parse the Unix timestamps into datetimes
all_ratings["Datetime"] = pd.to_datetime(all_ratings["Datetime"], unit="s")
# New feature: a rating above 3 means the user liked the movie
all_ratings["Favorable"] = all_ratings["Rating"] > 3
# Keep only the ratings from the first 200 users
ratings = all_ratings[all_ratings["UserID"].isin(range(200))]
# New dataset containing only the rows where a user liked the movie
favorable_ratings = ratings[ratings["Favorable"]]
# Group each user's liked movies by UserID
favorable_reviews_by_users = dict((k, frozenset(v.values))
                                  for k, v in favorable_ratings.groupby("UserID")["MovieID"])
# Build a data frame counting how many fans each movie has
num_favorable_by_movie = ratings[["MovieID", "Favorable"]].groupby("MovieID").sum()
# Sort by Favorable in descending order
num_favorable_by_movie_sort = num_favorable_by_movie.sort_values("Favorable", ascending=False)

frequent_itemsets = {}
# Set the minimum support
min_support = 50
# Generate the initial frequent itemsets: one movie per itemset
frequent_itemsets[1] = dict((frozenset((movie_id,)), row["Favorable"])
                            for movie_id, row in num_favorable_by_movie.iterrows()
                            if row["Favorable"] > min_support)

# Take the frequent itemsets of the previous size, build their supersets,
# and test how frequent those supersets are
def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    # Initialize the counting dictionary
    counts = defaultdict(int)
    # Iterate over every user and the movies they liked
    for user, reviews in favorable_reviews_by_users.items():
        # Check whether each previously found itemset is a subset of the
        # current user's liked movies; if so, the user has rated every
        # movie in that itemset
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                # For each liked movie not yet in the itemset, build a
                # superset and update its count
                for other_reviewed_movie in reviews - itemset:
                    # Set union produces the superset
                    current_superset = itemset | frozenset((other_reviewed_movie,))
                    counts[current_superset] += 1
    # Finally, return only the itemsets that reach the minimum support
    return dict([(itemset, frequency) for itemset, frequency in counts.items()
                 if frequency >= min_support])

for k in range(2, 20):
    cur_frequent_itemsets = find_frequent_itemsets(favorable_reviews_by_users,
                                                   frequent_itemsets[k-1], min_support)
    frequent_itemsets[k] = cur_frequent_itemsets
    if len(cur_frequent_itemsets) == 0:
        print("Did not find any frequent itemsets of length {}".format(k))
        sys.stdout.flush()
        break
    else:
        print("I found {} frequent itemsets of length {}".format(len(cur_frequent_itemsets), k))
        sys.stdout.flush()
del frequent_itemsets[1]
# Generate candidate rules: each movie in a frequent itemset can serve as
# the conclusion, with the remaining movies forming the premise
candidate_rules = []
for itemset_length, itemset_counts in frequent_itemsets.items():
    for itemset in itemset_counts.keys():
        for conclusion in itemset:
            premise = itemset - set((conclusion,))
            candidate_rules.append((premise, conclusion))
# Create dictionaries counting how often each rule holds and fails
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
# Iterate over every user and their liked movies, testing each rule
for user, reviews in favorable_reviews_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
# Confidence: the number of times the rule holds divided by the number of
# times its premise appears
rule_confidence = {candidate_rule: correct_counts[candidate_rule]
                   / float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
                   for candidate_rule in candidate_rules}
# Load the movie titles
movie_name_filename = r"E:\ml-100k\u.item"
movie_name_data = pd.read_csv(movie_name_filename, delimiter="|", header=None,
                              encoding="mac-roman")
movie_name_data.columns = ["MovieID", "Title", "Release Date", "Video Release",
                           "IMDB", "<UNK>", "Action", "Adventure", "Animation",
                           "Children's", "Comedy", "Crime", "Documentary", "Drama",
                           "Fantasy", "Film-Noir", "Horror", "Musical", "Mystery",
                           "Romance", "Sci-Fi", "Thriller", "War", "Western"]

def get_movie_name(movie_id):
    title_object = movie_name_data[movie_name_data["MovieID"] == movie_id]["Title"]
    title = title_object.values[0]
    return title
# Sort the confidence dictionary and print the five rules with the
# highest confidence
sorted_confidence = sorted(rule_confidence.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_confidence[index][0]
    premise_names = ", ".join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print("Rule: If a person recommends {0} they will also recommend {1}"
          .format(premise_names, conclusion_name))
    print(" - Confidence: {0:.3f}".format(rule_confidence[(premise, conclusion)]))
    print("")
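The rule-generation and confidence stages above can also be exercised in isolation on hypothetical data, which makes the counting logic easy to verify by hand:

```python
from collections import defaultdict

# Hypothetical liked-movie sets per user (made-up IDs).
reviews_by_user = {
    1: frozenset({10, 20}),
    2: frozenset({10, 20, 30}),
    3: frozenset({10, 30}),
}

# Suppose {10, 20} was found frequent; build its candidate rules.
itemset = frozenset({10, 20})
candidate_rules = [(itemset - {c}, c) for c in itemset]

correct = defaultdict(int)
incorrect = defaultdict(int)
for reviews in reviews_by_user.values():
    for premise, conclusion in candidate_rules:
        if premise <= reviews:
            if conclusion in reviews:
                correct[(premise, conclusion)] += 1
            else:
                incorrect[(premise, conclusion)] += 1

confidence = {rule: correct[rule] / (correct[rule] + incorrect[rule])
              for rule in candidate_rules}

# Rule {10} -> 20 holds for users 1 and 2 but fails for user 3 (2/3);
# rule {20} -> 10 holds for both users who liked movie 20 (1.0).
print(confidence[(frozenset({10}), 20)])
print(confidence[(frozenset({20}), 10)])
```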