Abstract

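This article implements the calculation of entropy gain (information gain) and entropy gain ratio in Python, builds a decision tree model for discrete variables, and wraps the code in classes so that readers can call it easily.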

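Computing entropy gain and entropy gain ratio

This object computes the entropy, conditional entropy, entropy gain (mutual information), and entropy gain ratio of discrete variables.
.cal_entropy(): computes the entropy
.cal_conditional_entropy(): computes the conditional entropy
.cal_entropy_gain(): computes the entropy gain (mutual information)
.cal_entropy_gain_ratio(): computes the entropy gain ratio
Usage: create the object by passing in the features and the labels, then call the method you need.
The features and labels should preferably be a DataFrame, Series, or list.
To compute the entropy of a single variable, pass the same value as both the features and the labels.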
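
For reference, the four quantities can be written as follows. The notation is my own shorthand rather than the author's: Y is the label, X is a single feature, p_k is the relative frequency of the k-th label value, and the logarithm is base 2, matching the np.log2 calls in the code below.

```latex
\begin{aligned}
H(Y) &= -\sum_{k} p_k \log_2 p_k \\
H(Y \mid X) &= \sum_{v} P(X = v)\, H(Y \mid X = v) \\
G(Y, X) &= H(Y) - H(Y \mid X) \\
G_R(Y, X) &= \frac{G(Y, X)}{H(X)}
\end{aligned}
```

The gain ratio divides by the entropy of the feature itself, so it is undefined for a feature that takes only one value.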

import numpy as np
import pandas as pd
import copy
from sklearn.preprocessing import LabelEncoder
from sklearn.datasets import  load_wine,load_breast_cancer

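class CyrusEntropy(object):
    """
    Computes the entropy, conditional entropy, entropy gain (mutual information),
    and entropy gain ratio of discrete variables.
    .cal_entropy(): computes the entropy
    .cal_conditional_entropy(): computes the conditional entropy
    .cal_entropy_gain(): computes the entropy gain (mutual information)
    .cal_entropy_gain_ratio(): computes the entropy gain ratio
    Usage: create the object by passing in the features and the labels, then call the method you need.
           The features and labels should preferably be a DataFrame, Series, or list.
           To compute the entropy of a single variable, pass the same value as both the features and the labels.
    """
    def __init__(self,x,y):
        # Label-encode each feature column and the labels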
        x = pd.DataFrame(x)
        y = pd.Series(y)
        x0 = copy.copy(x)
        y = copy.copy(y)
        for i in range(x.shape[1]):
            x0.iloc[:,i] = LabelEncoder().fit_transform(x.iloc[:,i])
        self.X = x0
        self.Y = pd.Series(LabelEncoder().fit_transform(y))
        
    def cal_entropy(self):
        x_entropy = []
        for i in range(self.X.shape[1]):
            number = np.array(self.X.iloc[:,i].value_counts())
            p = number/number.sum()
            x_entropy.append(np.sum(-p*np.log2(p)))
        number = np.array(self.Y.value_counts())
        p = number/number.sum()
        y_entropy = np.sum(-p*np.log2(p))
        return x_entropy,y_entropy
    
    def cal_conditional_entropy(self):
        y_x_conditional_entropy = []
        for i in range(self.X.shape[1]):
            dict_flag = {}
            list_flag = []
            for j in range(self.X.shape[0]):
                dict_flag[self.X.iloc[j,i]] = dict_flag.get(self.X.iloc[j,i],list_flag) + [self.Y.iloc[j]]
            condition_value = 0
            for y_value in dict_flag.values():
                number = np.array(pd.Series(y_value).value_counts())
                p = number/number.sum()
                condition_value += np.sum(-p*np.log2(p))*len(y_value)/(self.Y.shape[0])
            y_x_conditional_entropy.append(condition_value)
        return y_x_conditional_entropy
                
    def cal_entropy_gain(self):
        return list(np.array(self.cal_entropy()[1])-np.array(self.cal_conditional_entropy()))
        
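    def cal_entropy_gain_ratio(self):
        # Gain divided by the feature's own entropy; a constant feature has entropy 0, so its ratio is nan (0/0)
        return list(np.array(self.cal_entropy_gain())/np.array(self.cal_entropy()[0]))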
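
To make the quantities concrete, here is a minimal standard-library sketch of the same calculations on a tiny made-up dataset. The function names and the toy data (weather, coin, play) are hypothetical and are not part of the class above; the sketch only mirrors its logic of grouping the labels by feature value and weighting each group's entropy by its size.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def conditional_entropy(feature, labels):
    """Entropy of the labels within each feature value, weighted by group size."""
    groups = defaultdict(list)
    for value, label in zip(feature, labels):
        groups[value].append(label)
    total = len(labels)
    return sum(len(g) / total * entropy(g) for g in groups.values())

def entropy_gain(feature, labels):
    """Entropy gain (mutual information) between one feature and the labels."""
    return entropy(labels) - conditional_entropy(feature, labels)

# Hypothetical toy data: weather determines the label, coin is unrelated to it
weather = ["sunny", "sunny", "rainy", "rainy"]
coin = ["h", "t", "h", "t"]
play = ["yes", "yes", "no", "no"]

print(entropy(play))                # entropy of a 50/50 label
print(entropy_gain(weather, play))  # the feature fully determines the label
print(entropy_gain(coin, play))     # the feature carries no information
```

A feature that fully determines the label has a gain equal to the label entropy, while an unrelated feature has a gain of zero.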

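Entropy gain and entropy gain ratio: results

A dataset of discrete variables from Kaggle is used to validate the model. The Kaggle description of the data follows: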
The Lifetime reality television show and social experiment, Married at First Sight, features men and women who sign up to marry a complete stranger they’ve never met before. Experts pair couples based on tests and interviews. After marriage, couples have only a few short weeks together to decide if they want to stay married or get a divorce. There have been 10 full seasons so far which provides interesting data to look at what factors may or may not play a role in their decisions at the end of eight weeks as well as longer-term outcomes since the show aired.

if __name__ == "__main__":
    data = pd.read_csv("./mafs.csv",header=0)
    Y = data.Status
    X = data.drop(labels="Couple",axis=1)
    X = X.drop(labels="Status",axis=1)
    print(X.head(2))

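Create the entropy object
Compute the entropy of each feature and of the labels

# Create the entropy object
entropy_model = CyrusEntropy(X,Y)
# Compute the entropy of each feature and of the labels
entropy = entropy_model.cal_entropy()
print(entropy)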

([3.29646716508619, 3.1199965768508955, 6.087462841250342, 3.520444587294042, 1.0, 6.087462841250342, 0.8739810481273578, 0.0, 0.833764907210665, 0.833764907210665, 0.833764907210665, 0.833764907210665, 0.6722948170756379, 0.8739810481273578, 0.833764907210665], 0.833764907210665)

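Compute the conditional entropy of the labels given each feature

# Compute the conditional entropy of the labels given each feature
condition_entropy = entropy_model.cal_conditional_entropy()
print(condition_entropy)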

[0.6655644259732555, 0.699248162082863, 0.0, 0.7352336969711815, 0.833764907210665, 0.0, 0.67371811971174, 0.833764907210665, 0.7982018075321516, 0.7982018075321516, 0.7982018075321516, 0.7982018075321516, 0.8255150132281116, 0.8067159627055736, 0.8276667497383372]

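Compute the entropy gain (mutual information) of the labels with respect to each feature

# Compute the entropy gain (mutual information) of the labels with respect to each feature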
entropy_gain = entropy_model.cal_entropy_gain()
print(entropy_gain)

[0.1682004812374095, 0.13451674512780198, 0.833764907210665, 0.09853121023948352, 0.0, 0.833764907210665, 0.16004678749892498, 0.0, 0.035563099678513344, 0.035563099678513344, 0.035563099678513344, 0.035563099678513344, 0.00824989398255338, 0.027048944505091432, 0.006098157472327781]

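Compute the entropy gain ratio of the labels with respect to each feature

# Compute the entropy gain ratio of the labels with respect to each feature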
entropy_gain_rate = entropy_model.cal_entropy_gain_ratio()
print(entropy_gain_rate)

[0.05102446735064376, 0.04311438869063559, 0.1369642704939145, 0.02798828608042902, 0.0, 0.1369642704939145, 0.183123865033287, nan, 0.04265362978335057, 0.04265362978335057, 0.04265362978335057, 0.04265362978335057, 0.012271244360381867, 0.03094912019322165, 0.007314001128602282]
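
The nan in the eighth position comes from a feature whose entropy is 0 (it takes a single value), so the ratio is 0/0. One possible way to avoid it is to guard the division, as in the sketch below. The helper name, the fallback value of 0.0, and the rounded sample numbers are my own assumptions, not part of the original class.

```python
def safe_gain_ratio(gains, feature_entropies, fallback=0.0):
    """Divide each gain by the feature's entropy; use `fallback` where the entropy is 0."""
    return [g / h if h > 0 else fallback
            for g, h in zip(gains, feature_entropies)]

# Illustrative values only: the middle feature is constant (entropy 0, gain 0)
gains = [0.16, 0.0, 0.03]
feature_entropies = [0.87, 0.0, 0.83]

ratios = safe_gain_ratio(gains, feature_entropies)
print(ratios)
```

Returning 0.0 treats a constant feature as uninformative, which keeps it from being selected when features are ranked by gain ratio.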

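Decision tree model for discrete variables

This object builds a decision tree model for classification problems with discrete variables.
.fit(): fits (trains) the model
.predict(): predicts with the model
.tree_net: the fitted decision tree, stored as a nested dict
Usage: create an instance of the class, call fit to train the model,
then call predict to make predictions. The tree can be inspected through the tree_net attribute.
The features and labels should preferably be a DataFrame, Series, or list.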
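
As the class below is written, tree_net is a nested dict of the form {feature position: {feature value: label or subtree}}. The sketch here walks a small hand-written tree of that shape; toy_tree, its values, and the walk helper are hypothetical and only illustrate the structure.

```python
# Hypothetical tree in the same shape as tree_net:
# {feature position: {feature value: label or nested subtree}}
toy_tree = {
    1: {
        "low": "no",
        "high": {0: {"young": "yes", "old": "no"}},
    }
}

def walk(tree, row):
    """Follow the branches matching `row` (a list of feature values) down to a leaf label."""
    node = tree
    while isinstance(node, dict):
        position = next(iter(node))           # each subtree has exactly one feature key
        node = node[position][row[position]]  # descend along the branch for this row's value
    return node

print(walk(toy_tree, ["young", "high"]))
print(walk(toy_tree, ["old", "low"]))
```

A feature value that never appeared during training has no branch, so a lookup like this raises a KeyError for it.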

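class CyrusDecisionTreeDiscrete(object):
    """
    Builds a decision tree model for classification problems with discrete variables.
    .fit(): fits (trains) the model
    .predict(): predicts with the model
    .tree_net: the fitted decision tree, stored as a nested dict
    Usage: create an instance of the class, call fit to train the model,
           then call predict to make predictions. The tree can be inspected through the tree_net attribute.
           The features and labels should preferably be a DataFrame, Series, or list.
    """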
    X = None
    Y = None
    def __init__(self,algorithm = "ID3"):
        self.method = algorithm
        self.tree_net = {}
    def tree(self,x,y,dict_):
        # Pick the feature (by column position) with the largest entropy gain
        entropy_model = CyrusEntropy(x,y)
        index = int(np.argmax(entropy_model.cal_entropy_gain()))
        dict_[index] = {}
        # Group the rows and the labels by the value of the chosen feature
        dict_x_flag = {}
        dict_y_flag = {}
        for i in range(x.shape[0]):
            key = x.iloc[i,index]
            dict_x_flag.setdefault(key,[]).append(list(x.iloc[i,:]))
            dict_y_flag.setdefault(key,[]).append(y.iloc[i])
        for key,labels in dict_y_flag.items():
            if len(set(labels)) == 1:
                # Pure branch: store the class label as a leaf
                dict_[index][key] = labels[0]
            elif len(labels) == x.shape[0]:
                # The split made no progress: use the majority label to avoid infinite recursion
                dict_[index][key] = pd.Series(labels).mode()[0]
            else:
                # Mixed branch: recurse directly (no eval), so non-string keys also work
                dict_[index][key] = {}
                self.tree(pd.DataFrame(dict_x_flag[key]),pd.Series(dict_y_flag[key]),dict_[index][key])
    def fit(self,x,y):
        self.X = pd.DataFrame(x)
        self.Y = pd.Series(y)
        self.tree(self.X,self.Y,self.tree_net)
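    def cal_label(self,x,dict_):
        # Each (sub)tree has a single key: the position of the feature it splits on
        index = list(dict_.keys())[0]
        # Look the feature up by position, because the tree stores column positions
        branch = dict_[index][x.iloc[index]]
        if not isinstance(branch,dict):
            return branch
        else:
            return self.cal_label(x,branch)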
        
    def predict(self,x):
        x = pd.DataFrame(x)
        y = []
        for i in range(x.shape[0]):
            se = pd.Series(x.iloc[i,:])
            y.append(self.cal_label(se,self.tree_net))
        return y

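Decision tree model: results

Create the decision tree model
Train (fit) the model
Predict with the model

# Create the decision tree model
tree_model = CyrusDecisionTreeDiscrete()

# Train (fit) the model
tree_model.fit(X,Y)

# Predict with the model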
y_pre = tree_model.predict(X)
print(y_pre)

['Married', 'Married', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Married', 'Married', 'Married', 'Married', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Married', 'Married', 'Divorced', 'Divorced', 'Married', 'Married', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Married', 'Married', 'Divorced', 'Divorced', 'Married', 'Married', 'Married', 'Married', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Married', 'Married', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced', 'Divorced']

Checking the accuracy

# Check the accuracy
result = [1 if y_pre[i] == Y[i] else 0 for i in range(len(y_pre))]
print("Accuracy:",np.array(result).sum()/len(result))

Accuracy: 1.0
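
Note that this accuracy is measured on the same rows the tree was fitted on, so it describes how well the tree reproduces its training data rather than how it would perform on unseen couples. The check itself can be wrapped in a small helper, sketched below; the function name and the sample lists are hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(predicted) == 0:
        raise ValueError("cannot compute accuracy on empty input")
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(predicted)

# Hypothetical labels: two of the three predictions match
print(accuracy(["Married", "Divorced", "Divorced"],
               ["Married", "Divorced", "Married"]))
```

Iterating with zip avoids indexing the labels by position, so the helper works the same for lists and for a pandas Series with a non-default index.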

by CyrusMay 2020 05 20


Machine Learning, Decision Trees: Solving Classification Problems with Discrete Variables