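The Road to Becoming a Data Analyst, Python Edition: Learning Machine Learning from Scratch (Decision Trees: Entropy, Information Gain, and Code for Building the Tree)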

Implementing entropy in Python
A quick review of the previous lesson:

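The entity is a person, who has many attributes: looks, height, and so on.
Information: the feature (value) of an attribute. For example, looks (attribute): very handsome (feature).
Entropy: computed over the information (features) of all attributes taken together (height: very tall, looks: very handsome, finances: very rich).
Conditional entropy: the entropy that remains once the information (feature) of one particular attribute is known (for example, height (attribute): very tall (feature)).
Information gain: entropy minus conditional entropy, which gives the information gain of that attribute.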

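Note: I use the 'entity, attribute, feature' picture to understand how information, entropy, and information gain relate. Do not confuse 'feature' in this picture with 'feature' in the dataset sense of 'features and target variable'! A dataset feature is a variable, which corresponds to what is called an attribute here.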
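For reference, the three quantities reviewed above are usually written as shown below. This is a sketch in common textbook notation, not necessarily the notation of the previous lesson. The symbols are assumptions introduced here: D is the dataset, A is an attribute, D_v is the subset of D in which A takes the value v, and p_k is the fraction of samples in class k.

```latex
\begin{aligned}
H(D) &= -\sum_{k} p_k \log_2 p_k \\
H(D \mid A) &= \sum_{v} \frac{|D_v|}{|D|}\, H(D_v) \\
\mathrm{Gain}(D, A) &= H(D) - H(D \mid A)
\end{aligned}
```

The first line is the entropy, the second the conditional entropy given A, and the third the information gain of A.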

from sklearn import datasets
import numpy as np
from math import log

np.random.seed(0)
iris=datasets.load_iris()
data_x=iris.data
data_y=iris.target
index0=np.random.permutation(len(data_x))
data_x=data_x[index0]
data_y=data_y[index0]

numEntries=len(data_x)
# Count the occurrences of each label (the labels stand in for features here)
labelCounts={}
for i in range(len(data_x)):
    currentLabel=data_y[i]
    if currentLabel not in labelCounts:
        labelCounts[currentLabel]=0
    labelCounts[currentLabel]+=1
# Compute the entropy
shannonEnt=0.0
for key in labelCounts:
    # P(x): count of this label / total count
    prob=float(labelCounts[key])/numEntries
    # Entropy: sum of P(x)*I(x), where I(x) = -log2 P(x) is the information
    shannonEnt-=prob*log(prob,2)
print(shannonEnt)
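As a cross-check on the loop above, the same quantity can be computed in vectorized form. This is only a sketch: the name entropy_np is hypothetical, and the labels are generated directly instead of being loaded from iris, relying on the fact that iris has 50 samples in each of its 3 classes.

```python
import numpy as np

def entropy_np(labels):
    """Shannon entropy in bits of a 1-D array of class labels."""
    # Count how often each distinct label occurs
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

# Three equally frequent classes, mirroring the iris target (50 per class)
labels = np.repeat([0, 1, 2], 50)
print(entropy_np(labels))
```

With three equally likely classes the entropy is log2(3), about 1.585 bits, which is the largest value three classes can give. A dataset containing a single class has entropy 0.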

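Splitting the dataset:
Above, we implemented information entropy, a measure of how disordered a dataset is. A classification algorithm also needs to split the dataset.
Measuring the entropy of the resulting subsets lets us judge whether a given split is a good one.
How do we measure it? We split the dataset on each attribute in turn, compute the information entropy of each result, and choose the split that works best.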
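The idea of scoring each candidate split by the entropy it leaves behind can be sketched in plain Python. The helper names (entropy, split_entropy) and the six-row toy dataset are assumptions made for illustration and are not the data used in this article. The toy label is YES exactly when the two attributes are equal.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def split_entropy(rows, col):
    """Weighted entropy of the labels after splitting rows on column col.

    Each row is a tuple whose last element is the class label.
    """
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row[-1])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

# Hypothetical toy data: (attr_a, attr_b, label)
rows = [
    (0, 0, "YES"), (0, 1, "NO"),
    (1, 0, "NO"), (1, 1, "YES"),
    (2, 0, "NO"), (2, 1, "NO"),
]

base = entropy([row[-1] for row in rows])
for col in (0, 1):
    after = split_entropy(rows, col)
    print(f"column {col}: entropy after split = {after:.4f}, reduction = {base - after:.4f}")
```

On this toy data, splitting on column 0 leaves less entropy than splitting on column 1, so column 0 would be the better first split.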

from sklearn import datasets
import numpy as np
import pandas as pd
import random

np.random.seed(0)
iris=datasets.load_iris()
data_x=iris.data
data_y=iris.target
index0=np.random.permutation(len(data_x))
data_x=data_x[index0]
data_y=data_y[index0]

# Build our own dataset (the original data, the data to compare against, and the comparison result)
df=pd.DataFrame({'data_1':data_y})
# Randomly generate the data to compare against (values 0 or 1)
new_data=np.random.randint(low=0,high=2,size= 150)
df['data_2']=new_data
# random.shuffle(charr)
# charr holds the comparison result: 'YES' if the two values are equal, otherwise 'NO'
charr=[]
for i in range(150):
    if df['data_1'][i]==df['data_2'][i]:
        charr.append('YES')
    else:
        charr.append('NO')
df['class']=charr

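# Splitting the dataset: this version is a bit cumbersome; a simpler one follows later
retDataSet=[]
for x in range(len(df['data_1'])):
    # 1 is one of the values of data_1 (0, 1, 2): we split the dataset on the feature 'data_1' (what we called an attribute earlier). The 1 here could equally be 0 or 2.
    if df['data_1'][x]==1:
        reducedFeatVec=[]
        # append (not extend) keeps the value as a single element even if it has several digits
        reducedFeatVec.append(str(df['data_2'][x]))
        reducedFeatVec.append(df['class'][x])
        retDataSet.append(reducedFeatVec)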

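Choosing the best way to split the dataset:
We wrap the Shannon entropy computation and the dataset split in functions so they are easier to reuse.
Note: for convenience, they differ slightly from the code above!
Shannon entropy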

from math import log

def shannonEnt(data_x):
    numEntries=len(data_x)
    # Count the occurrences of each label (the labels stand in for features here)
    labelCounts={}
    for i in range(len(data_x)):
        # The label is read from the 'class' column; rows must be indexed 0..n-1
        currentLabel=data_x['class'][i]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel]=0
        labelCounts[currentLabel]+=1
    # Compute the entropy
    entropy=0.0
    for key in labelCounts:
        # P(x): count of this label / total count
        prob=float(labelCounts[key])/numEntries
        # Entropy: sum of P(x)*I(x), where I(x) = -log2 P(x) is the information
        entropy-=prob*log(prob,2)
    return entropy

Splitting the dataset:
(All I will say is: whatever is most convenient.)

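def splitDataset(df,feat,feat_sub):
    # Keep the rows whose column at position feat equals feat_sub
    retDataSet=df[df.iloc[:,feat]==feat_sub]
    # Renumber the rows from 0 so that shannonEnt can index them by position
    retDataSet=retDataSet.reset_index(drop=True)
    return retDataSet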

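Here is the formula for conditional entropy, to make the code that follows easier to implement!
H(D|A) = P1*H(D1) + P2*H(D2) + ... + Pn*H(Dn), where Di is the subset of the data in which attribute A takes its i-th value and Pi is the fraction of samples that fall in Di.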

Conditional entropy and information gain

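def chooseBestFeature(df):
    # Entropy of the whole dataset
    baseEntropy=shannonEnt(df)
    # Information gain of each feature, in column order
    infoGain=[]
    # Number of features (attributes); the last column is 'class'
    numFeatures=len(df.columns)-1
    # Loop over the features
    for i in range(numFeatures):
        newEntropy=0.0
        # Loop over the values of this feature, split the dataset on each one, and accumulate the conditional entropy
        for value in list(set(df.iloc[:,i])):
            subSet=splitDataset(df,i,value)
            # P(x): fraction of samples with this value
            prob=len(subSet)/float(len(df))
            # P1*H(x1)+P2*H(x2)+...+Pn*H(xn)
            newEntropy +=prob* shannonEnt(subSet)
        # Information gain of this feature
        infoGain.append(baseEntropy-newEntropy)
    # Pick the feature with the largest information gain
    bestInfoGain=max(infoGain)
    bestFeature=infoGain.index(bestInfoGain)
    return bestFeature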

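Steps for building a decision tree:
Obtain the original dataset.
Split the dataset on a feature.
When there are more than two features, the subsets produced by the first split must be split again, and so on recursively. The recursion ends when every sample in a branch belongs to the same class, or when the program has used up every attribute available for splitting.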

import operator

# Return the class label that occurs most often
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # Sort the (label, count) pairs by count, largest first
    sortedClassCount = sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]
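The same majority vote can be written more compactly with the standard library. This is a sketch of an alternative, not the article's own function, and the name majority_label is hypothetical.

```python
from collections import Counter

def majority_label(class_list):
    """Return the most frequent label in class_list."""
    # most_common(1) returns a list with one (label, count) pair
    return Counter(class_list).most_common(1)[0][0]

print(majority_label(['YES', 'NO', 'YES', 'YES', 'NO']))  # YES
```

Counter.most_common sorts by count, so this matches the dictionary-and-sort approach while avoiding the manual counting loop.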

Building the decision tree!

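labels=df.columns
def createTree(df,labels):
    # Only one class left (for example, all YES or all NO)
    if len(set(df['class']))==1:
        return set(df['class'])
    # More than one class left, but every feature has been used up (only the 'class' column remains in df)
    if len(df.columns)==1:
        # Return the most frequent class (for example, with YES:80 and NO:50, return YES)
        return majorityCnt(df['class'])
    # Position of the best feature
    bestFeat = chooseBestFeature(df)
    # Column name of the best feature
    bestFeatLabel=df.columns[bestFeat]
    # Remove this feature name from labels, since it is excluded from later splits
    labels=labels.drop(bestFeatLabel)
    # Store the feature name in the tree
    myTree={bestFeatLabel:{}}
    featValue=df.iloc[:,bestFeat]
    uniqueVals=set(featValue)
    # Split on each value of the best feature and build the subtrees recursively
    for value in uniqueVals:
        subLabels = labels
        # Drop the used column so that it cannot be chosen again further down
        subSet=splitDataset(df,bestFeat,value).drop(columns=bestFeatLabel)
        myTree[bestFeatLabel][value] = createTree(subSet,subLabels)
    return myTree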

Classification function that uses the decision tree


Tree_x=createTree(df,labels)
Tree_x

The output is as follows:

{'data_1': {
            0: {'data_2': {0: {'YES'}, 1: {'NO'}}},
            1: {'data_2': {0: {'NO'}, 1: {'YES'}}},
            2: {'NO'}
            }
    }

From the output, 'data_1' has three values (0, 1, 2) and 'data_2' has two values (0, 1).
Consider the following lookups:
'data_1' -> 0 -> 'data_2' -> 0: we get 'YES'
'data_1' -> 1 -> 'data_2' -> 0: we get 'NO'
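Because the tree is a nested dictionary, the two lookups above can be reproduced by indexing it directly. In this minimal sketch the tree literal is copied from the printed output, and the variable name tree is an assumption.

```python
# Tree literal copied from the printed output; leaves are one-element sets
tree = {'data_1': {
    0: {'data_2': {0: {'YES'}, 1: {'NO'}}},
    1: {'data_2': {0: {'NO'}, 1: {'YES'}}},
    2: {'NO'},
}}

# 'data_1' -> 0 -> 'data_2' -> 0
print(tree['data_1'][0]['data_2'][0])  # {'YES'}
# 'data_1' -> 1 -> 'data_2' -> 0
print(tree['data_1'][1]['data_2'][0])  # {'NO'}
```

Each pair of brackets alternates between a feature name and one of that feature's values, which is the pattern a general classification function has to follow.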

How do we implement the lookups above?

Create two new lists

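# Visit 'data_1' first, then 'data_2'
featLabels=['data_1','data_2']
# Take value 1 under 'data_1', then value 0 under 'data_2'
testVec=[1,0]
# testVec=[2]

def classify(Tree,featLabels,testVec):
    # First key of the dictionary (the feature tested at this node)
    firstStr=list(Tree.keys())[0]
    # The matching value: a dictionary of branches
    secondDict=Tree.get(firstStr)
    # Position of this feature in featLabels; featLabels and testVec share the same indices!
    featIndex=featLabels.index(firstStr)
    # Stays None if testVec holds a value the tree has no branch for
    classlabel=None
    for key in secondDict:
        # The value in testVec matches one of the branches
        if testVec[featIndex]==key:
            if isinstance(secondDict[key],dict):
                # Internal node: keep descending into the subtree
                classlabel=classify(secondDict[key],featLabels,testVec)
            else:
                # Leaf: this is the class
                classlabel=secondDict[key]
    return classlabel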
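An alternative to the two parallel lists is to pass each sample as a dictionary from feature name to value. The sketch below shows that design. The function name classify_sample and the sample format are assumptions, and the tree literal is copied from the printed output earlier in the article.

```python
def classify_sample(tree, sample):
    """Walk a nested-dict tree using a {feature_name: value} sample.

    Returns the leaf that is reached, or None when the tree has no
    branch for one of the sample's values.
    """
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))   # the feature tested at this node
        branches = node[feature]     # maps each feature value to a subtree or leaf
        node = branches.get(sample.get(feature))
    return node

tree = {'data_1': {
    0: {'data_2': {0: {'YES'}, 1: {'NO'}}},
    1: {'data_2': {0: {'NO'}, 1: {'YES'}}},
    2: {'NO'},
}}

print(classify_sample(tree, {'data_1': 1, 'data_2': 0}))  # {'NO'}
print(classify_sample(tree, {'data_1': 2}))               # {'NO'}
```

Looking features up by name removes the need to keep featLabels and testVec in the same order, at the cost of building a small dictionary for every sample.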
