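Affinity analysis example
Affinity analysis measures how closely related samples are, based on the similarity between them. Typical applications include:
- Offering website users a wider range of services and targeted advertising
- Selling related items, such as small toys, alongside the movies and products recommended to users
- Finding people who are related to each other, based on their genes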
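Product recommendation
The idea behind product recommendation: two items that people have often bought together are likely to be bought together again. That is:
If a person buys product X, they are also likely to buy product Y.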
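Worked example
The dataset used here is affinity_dataset.txt, a set of shopping transactions. Read across, each row is one transaction and records the items it contains. Read down, each column represents one item.
In this example the five items are bread, milk, cheese, apples, and bananas. The first transaction shows that the customer bought cheese, apples, and bananas, but no bread or milk.
Each feature takes one of two values, 1 or 0, indicating whether the item was bought, not how many units were bought. 1 means the customer bought at least one unit of the item; 0 means the customer did not buy it.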
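As a minimal sketch of this 0/1 encoding, the snippet below turns a basket of item names into one row of the dataset. The helper `encode_basket` is a hypothetical name introduced only for illustration; it is not part of the original code.

```python
features = ["bread", "milk", "cheese", "apples", "bananas"]

def encode_basket(basket, features):
    """Return a 0/1 row: 1 if the item appears in the basket at least once."""
    items = set(basket)
    return [1 if name in items else 0 for name in features]

# The first transaction described above: cheese, apples and bananas.
# Two units of apples still produce a single 1, because quantity is ignored.
row = encode_basket(["cheese", "apples", "apples", "bananas"], features)
print(row)  # [0, 0, 1, 1, 1]
```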
The first few rows of affinity_dataset.txt look like this:
bread | milk | cheese | apples | bananas |
---|---|---|---|---|
0 | 0 | 1 | 1 | 1 |
1 | 1 | 0 | 1 | 0 |
1 | 0 | 1 | 1 | 0 |
Experiment
Loading the dataset
import numpy as np
dataset_filename = "affinity_dataset.txt"
X = np.loadtxt(dataset_filename)
n_samples, n_features = X.shape
print("This dataset has {0} samples and {1} features".format(n_samples, n_features))
#This dataset has 100 samples and 5 features
print(X[:5])
#[[ 0. 0. 1. 1. 1.]
#[ 1. 1. 0. 1. 0.]
#[ 1. 0. 1. 1. 0.]
#[ 0. 0. 1. 1. 1.]
#[ 0. 1. 0. 0. 1.]]
# The names of the features, for your reference.
features = ["bread", "milk", "cheese", "apples", "bananas"]
# First, how many rows contain our premise: that a person is buying apples
num_apple_purchases = 0
for sample in X:
    if sample[3] == 1:  # This person bought Apples
        num_apple_purchases += 1
print("{0} people bought Apples".format(num_apple_purchases))
#36 people bought Apples
# How many of the cases that a person bought Apples involved the people purchasing Bananas too?
# Record both cases where the rule is valid and is invalid.
rule_valid = 0
rule_invalid = 0
for sample in X:
    if sample[3] == 1:  # This person bought Apples
        if sample[4] == 1:
            # This person bought both Apples and Bananas
            rule_valid += 1
        else:
            # This person bought Apples, but not Bananas
            rule_invalid += 1
print("{0} cases of the rule being valid were discovered".format(rule_valid))
print("{0} cases of the rule being invalid were discovered".format(rule_invalid))
#21 cases of the rule being valid were discovered
#15 cases of the rule being invalid were discovered
# Now we have all the information needed to compute Support and Confidence
support = rule_valid # The Support is the number of times the rule is discovered.
confidence = rule_valid / num_apple_purchases
print("The support is {0} and the confidence is {1:.3f}.".format(support, confidence))
# Confidence can be thought of as a percentage using the following:
print("As a percentage, that is {0:.1f}%.".format(100 * confidence))
#The support is 21 and the confidence is 0.583.
#As a percentage, that is 58.3%.
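The same counts can also be obtained without an explicit loop. The following is a minimal sketch using NumPy boolean masks, assuming a 0/1 array shaped like X. To stay self-contained it uses `X_demo`, which holds only the five rows printed earlier rather than the full 100-row dataset, so its numbers differ from the output above. The names `X_demo`, `APPLES` and `BANANAS` are introduced here for illustration.

```python
import numpy as np

# Toy data: the five rows printed earlier, not the full dataset.
X_demo = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
])
APPLES, BANANAS = 3, 4  # column indices of the two items

# Boolean mask of transactions that satisfy the premise (bought apples).
bought_apples = X_demo[:, APPLES] == 1
# Transactions that satisfy both premise and conclusion (apples and bananas).
bought_both = bought_apples & (X_demo[:, BANANAS] == 1)

num_apple_purchases = int(bought_apples.sum())
support = int(bought_both.sum())
confidence = support / num_apple_purchases
print("support={0}, confidence={1:.3f}".format(support, confidence))
```

On these five rows, four transactions contain apples and two of those also contain bananas, so the support is 2 and the confidence is 0.5.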
from collections import defaultdict
# Now compute for all possible rules
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)
for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0:
            continue
        # Record that the premise was bought in another transaction
        num_occurences[premise] += 1
        for conclusion in range(n_features):
            if premise == conclusion:  # It makes little sense to measure if X -> X.
                continue
            if sample[conclusion] == 1:
                # This person also bought the conclusion item
                valid_rules[(premise, conclusion)] += 1
            else:
                # This person bought the premise, but not the conclusion
                invalid_rules[(premise, conclusion)] += 1
support = valid_rules
confidence = defaultdict(float)
for premise, conclusion in valid_rules.keys():
    confidence[(premise, conclusion)] = valid_rules[(premise, conclusion)] / num_occurences[premise]
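def print_rule(premise, conclusion, support, confidence, features):
    """Print one rule together with its confidence and support."""
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")

# Print every rule that was found
for premise, conclusion in confidence:
    print_rule(premise, conclusion, support, confidence, features)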
premise = 1
conclusion = 3
print_rule(premise, conclusion, support, confidence, features)
#Rule: If a person buys milk they will also buy apples
# - Confidence: 0.196
# - Support: 9
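For larger datasets, the nested loops over all rules can be replaced by a single matrix product. This is a sketch of an alternative technique, not the method used in the code above: for a 0/1 matrix, the product of its transpose with itself counts how often each pair of items appears together. It again uses the five printed rows as `X_demo` (an illustrative name), and assumes every item is bought at least once so that no division by zero occurs.

```python
import numpy as np

# Toy data: the five rows printed earlier, not the full dataset.
X_demo = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
])

# co_occurrence[i, j] = number of transactions containing both item i and item j,
# which is the support of the rule "i -> j".
co_occurrence = X_demo.T @ X_demo

# The diagonal holds how many transactions contain each single item.
num_occurrences = np.diag(co_occurrence)

# confidence_matrix[i, j] = support(i -> j) / occurrences of premise i.
# Assumes every item occurs at least once; the diagonal (X -> X) should be ignored.
confidence_matrix = co_occurrence / num_occurrences[:, None]

print(co_occurrence[2, 3], confidence_matrix[2, 3])  # cheese -> apples
print(co_occurrence[3, 2], confidence_matrix[3, 2])  # apples -> cheese
```

In this toy data cheese and apples appear together in three transactions, giving a confidence of 1.0 for cheese -> apples and 0.75 for apples -> cheese.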
from operator import itemgetter
sorted_support = sorted(support.items(), key=lambda x:x[1], reverse=True)
print(sorted_support)
#[((2, 4), 27), ((4, 2), 27), ((2, 3), 25), ((3, 2), 25), ((3, 4), 21), ((4, 3), 21), ((1, 4), 19), ((4, 1), 19), ((0, 4), 17), ((4, 0), 17), ((0, 1), 14), ((1, 0), 14), ((1, 3), 9), ((3, 1), 9), ((1, 2), 7), ((2, 1), 7), ((0, 3), 5), ((3, 0), 5), ((0, 2), 4), ((2, 0), 4)]
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_support[index][0]
    print_rule(premise, conclusion, support, confidence, features)
#Rule #1
#Rule: If a person buys cheese they will also buy bananas
# - Confidence: 0.659
# - Support: 27
#Rule #2
#Rule: If a person buys bananas they will also buy cheese
# - Confidence: 0.458
# - Support: 27
#Rule #3
#Rule: If a person buys cheese they will also buy apples
# - Confidence: 0.610
# - Support: 25
#Rule #4
#Rule: If a person buys apples they will also buy cheese
# - Confidence: 0.694
# - Support: 25
#Rule #5
#Rule: If a person buys apples they will also buy bananas
# - Confidence: 0.583
# - Support: 21
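The ranking above is by support. A minimal sketch of ranking by confidence instead is shown below. To keep it self-contained, `confidence_demo` (a hypothetical name) contains only the rounded confidence values quoted in this post's outputs, which is a subset of all rules, so the resulting order is illustrative rather than the true ranking over the full dataset.

```python
from operator import itemgetter

features = ["bread", "milk", "cheese", "apples", "bananas"]

# Rounded confidence values copied from the outputs in this post (subset of all rules).
confidence_demo = {
    (2, 4): 0.659,  # cheese -> bananas
    (4, 2): 0.458,  # bananas -> cheese
    (2, 3): 0.610,  # cheese -> apples
    (3, 2): 0.694,  # apples -> cheese
    (3, 4): 0.583,  # apples -> bananas
    (1, 3): 0.196,  # milk -> apples
}

# Sort (rule, confidence) pairs by confidence, highest first.
sorted_confidence = sorted(confidence_demo.items(), key=itemgetter(1), reverse=True)

for rank, ((premise, conclusion), value) in enumerate(sorted_confidence[:3], start=1):
    print("Rule #{0}: {1} -> {2} (confidence {3:.3f})".format(
        rank, features[premise], features[conclusion], value))
```

Among these six rules, apples -> cheese comes first, followed by cheese -> bananas and cheese -> apples.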
Tips
Sorting a dictionary by value: two approaches
sorted(d.items(),key = lambda x:x[1],reverse = True)
from operator import itemgetter
sorted(d.items(), key=itemgetter(1),reverse=True)
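A quick check that the two approaches give the same result, using a hypothetical toy dictionary `d` that is not taken from the dataset. Because Python's sort is stable, entries with equal values keep their original insertion order, even with reverse=True.

```python
from operator import itemgetter

# Hypothetical toy dictionary, for illustration only.
d = {"a": 2, "b": 3, "c": 2}

by_lambda = sorted(d.items(), key=lambda x: x[1], reverse=True)
by_itemgetter = sorted(d.items(), key=itemgetter(1), reverse=True)

print(by_lambda)                   # [('b', 3), ('a', 2), ('c', 2)]
print(by_lambda == by_itemgetter)  # True
```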
Download the affinity_dataset.txt dataset:
Link: https://pan.baidu.com/s/1iCRmzw9rGDBNMixc6VkD1g Password: pdny