Andrew Ng 471 Machine Learning Introductory Course 2, Week 4 - Decision Trees


Implement a decision tree from scratch and apply it to the task of classifying whether a mushroom is edible or poisonous.

1 Import packages

import numpy as np
import matplotlib.pyplot as plt
from public_tests import *
%matplotlib inline

2 Description of the problem

Let's say you're starting a business that grows and sells wild mushrooms. Since not all mushrooms are edible, you want to be able to tell whether a given mushroom is edible or poisonous based on its physical properties. You have some existing data that you can use for this task. Can you use this data to help you determine which mushrooms are safe to sell? Note: The dataset used is for illustration purposes only. It is not meant as a guide to identifying edible mushrooms.

3 One-hot encoded dataset

| Brown Cap | Tapering Stalk Shape | Solitary | Edible |
|-----------|----------------------|----------|--------|
| 1 | 1 | 1 | 1 |
| 1 | 0 | 1 | 1 |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 |
| 0 | 1 | 1 | 0 |
| 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 |
| 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 |
Therefore:
  • X_train contains three features for each example:
    • Cap color (1 for a brown cap, 0 for a red cap)
    • Stalk shape (1 for a "tapering" stalk, 0 for an "enlarging" stalk)
    • Solitary (1 for yes, 0 for no)
  • y_train indicates whether the mushroom is edible
    • y = 1 means edible
    • y = 0 means poisonous
X_train = np.array([[1,1,1],[1,0,1],[1,0,0],[1,0,0],[1,1,1],[0,1,1],[0,0,0],[1,0,1],[0,1,0],[1,0,0]])
y_train = np.array([1,1,0,0,1,0,0,1,1,0])
X_train[:5]
array([[1, 1, 1],
       [1, 0, 1],
       [1, 0, 0],
       [1, 0, 0],
       [1, 1, 1]])

4 Decision trees

In this hands-on lab, you will build a decision tree based on the dataset provided above.

  • Recall the steps for building a decision tree:
    • Start with all examples at the root node
    • Calculate the information gain for splitting on each possible feature, and pick the feature with the highest information gain
    • Split the dataset according to the selected feature, creating the left and right branches of the tree
    • Keep repeating the splitting process until a stopping criterion is met
  • In this lab, you will implement the following functions, which let you split a node into left and right branches using the feature with the highest information gain:
    • Calculate the entropy at a node
    • Split the dataset at a node into left and right branches based on a given feature
    • Calculate the information gain from splitting on a given feature
    • Select the feature that maximizes information gain
  • We will then use the helper functions you implement to build the decision tree, repeating the splitting process until the stopping criterion is met.
    • For this lab, the stopping criterion we chose is a maximum depth of 2.

4.1 Calculating entropy

First, you'll write a helper function called compute_entropy that computes the entropy (a measure of impurity) at a node.

  • The function accepts a numpy array (y) that indicates whether the examples at this node are edible (1) or poisonous (0)

Please complete the compute_entropy() function below to:

  • Calculate $p_1$, the fraction of edible examples (i.e., the fraction of examples where y has the value 1)
  • Then calculate the entropy

$$H(p_1) = -p_1 \log_2(p_1) - (1 - p_1)\log_2(1 - p_1)$$

  • Note:
    • The logarithm uses base 2
    • For convenience, $0\log_2(0) = 0$. That is, if p_1 = 0 or p_1 = 1, set the entropy to 0
    • Make sure to check that the data at the node is not empty (i.e., len(y) != 0), and return 0 if it is
def compute_entropy(y):
    """Compute the entropy (impurity) at a node, given its labels y."""
    entropy = 0
    if len(y) != 0:
        # Fraction of edible (y == 1) examples at this node
        p1 = len(y[y == 1]) / len(y)
        # Entropy is 0 when the node is pure (p1 is 0 or 1)
        if p1 != 0 and p1 != 1:
            entropy = -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)
        else:
            entropy = 0
    return entropy
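
As a quick sanity check (an illustrative call, not part of the original lab text), the root node of the dataset above contains 5 edible and 5 poisonous examples, so p_1 = 0.5 and the entropy should be exactly 1:

print("Entropy at root node:", compute_entropy(y_train))   # expected: 1.0, since p1 = 5/10 = 0.5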

4.2 Splitting the dataset

Next, you'll write a helper function called split_dataset that takes the data at a node and a feature to split on, and splits the data into left and right branches. Later in the lab, you'll implement code that calculates how good the split is.

  • The function takes the training data, the list of indices of the data points at this node, and the feature to split on.
  • It splits the data and returns the subsets of indices for the left and right branches.
  • For example, say we start at the root node (so node_indices = [0,1,2,3,4,5,6,7,8,9]) and choose to split on feature 0, i.e., whether the example has a brown cap.
    • Then the output of the function is left_indices = [0,1,2,3,4,7,9] and right_indices = [5,6,8]
| Index | Brown Cap | Tapering Stalk Shape | Solitary | Edible |
|-------|-----------|----------------------|----------|--------|
| 0 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 1 |
| 2 | 1 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 |
| 4 | 1 | 1 | 1 | 1 |
| 5 | 0 | 1 | 1 | 0 |
| 6 | 0 | 0 | 0 | 0 |
| 7 | 1 | 0 | 1 | 1 |
| 8 | 0 | 1 | 0 | 1 |
| 9 | 1 | 0 | 0 | 0 |
def split_dataset(X, node_indices, feature):
    """Split the examples at this node into a left branch (feature == 1) and a right branch (feature == 0)."""
    left_indices = []
    right_indices = []
    for i in node_indices:
        if X[i][feature] == 1:
            left_indices.append(i)
        else:
            right_indices.append(i)
    return left_indices, right_indices
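
To check the function against the example above (splitting the root node on feature 0, the brown cap), a minimal call would look like this; the expected index lists are the ones given earlier:

root_indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
left_indices, right_indices = split_dataset(X_train, root_indices, 0)
print("Left indices: ", left_indices)    # expected: [0, 1, 2, 3, 4, 7, 9]
print("Right indices:", right_indices)   # expected: [5, 6, 8]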

4.3 Calculating information gain

Next, you'll write a function called compute_information_gain that takes the training data, the indices at a node, and the feature to split on, and returns the information gain from the split.

Complete the compute_information_gain() function below to calculate

$$\text{Information Gain} = H(p_1^\text{node}) - \left(w^\text{left} H(p_1^\text{left}) + w^\text{right} H(p_1^\text{right})\right)$$

where:

  • $H(p_1^\text{node})$ is the entropy at the node
  • $H(p_1^\text{left})$ and $H(p_1^\text{right})$ are the entropies of the left and right branches after the split
  • $w^\text{left}$ and $w^\text{right}$ are the fractions of examples in the left and right branches, respectively

Notice:

  • You can calculate the entropy using the compute_entropy() function
  • Starter code is provided that splits the dataset using the split_dataset() function
def compute_information_gain(X, y, node_indices, feature):
    """Compute the information gain from splitting the node on the given feature."""
    # Split the node's examples into left and right branches
    left_indices, right_indices = split_dataset(X, node_indices, feature)

    X_node, y_node = X[node_indices], y[node_indices]
    X_left, y_left = X[left_indices], y[left_indices]
    X_right, y_right = X[right_indices], y[right_indices]

    # Entropy at the node and at each branch
    node_entropy = compute_entropy(y_node)
    left_entropy = compute_entropy(y_left)
    right_entropy = compute_entropy(y_right)

    # Fraction of examples that go to each branch
    left_w = len(X_left) / len(X_node)
    right_w = len(X_right) / len(X_node)

    # Information gain = node entropy minus the weighted entropy of the branches
    weighted_entropy = left_w * left_entropy + right_w * right_entropy
    information_gain = node_entropy - weighted_entropy
    return information_gain
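
As a rough check (an illustrative loop, with approximate values worked out by hand from the toy table above), computing the information gain at the root node for each of the three features should show that feature 2, "Solitary", gives the largest gain:

root_indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
for feature in range(X_train.shape[1]):
    gain = compute_information_gain(X_train, y_train, root_indices, feature)
    print("Feature %d: information gain = %.4f" % (feature, gain))
# expected (approximately): 0.0349, 0.1245, 0.2781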

4.4 Getting the best split

Now, let's write a function to get the best feature to split on by computing the information gain for each feature, as we did above, and returning the feature that gives the largest information gain.

Please complete the get_best_split() function below.

  • The function accepts the training data and the indices of the data points at the node
  • The output of the function is the feature that provides the largest information gain
    • You can iterate over the features and calculate the information gain for each feature using the compute_information_gain() function
def get_best_split(X, y, node_indices):
    """Return the index of the feature with the highest information gain at this node."""
    num_features = X.shape[1]
    best_feature = -1
    max_info_gain = 0
    for feature in range(num_features):
        info_gain = compute_information_gain(X, y, node_indices, feature)
        # Keep the feature with the largest information gain seen so far
        if info_gain > max_info_gain:
            max_info_gain = info_gain
            best_feature = feature
    return best_feature
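
Calling it on the root node should therefore return feature 2 ("Solitary"), consistent with the information-gain values above and with the tree printed in the next section (the call below is illustrative):

root_indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print("Best feature to split on: %d" % get_best_split(X_train, y_train, root_indices))   # expected: 2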

5 Build the tree

In this section, we use the functions you implemented above to generate a decision tree by iteratively selecting the best feature to split on, until a stopping criterion is reached (maximum depth of 2).

tree = []
root_indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
def build_tree_recursive(X, y, node_indices, branch_name, max_depth, current_depth):
    # Stopping criterion: maximum depth reached, so this node becomes a leaf
    if current_depth == max_depth:
        formatting = " "*current_depth + "-"*current_depth
        print(formatting, "%s leaf node with indices" % branch_name, node_indices)
        return

    # Otherwise, pick the feature with the highest information gain at this node
    best_feature = get_best_split(X, y, node_indices)
    tree.append((current_depth, branch_name, best_feature, node_indices))

    formatting = "-"*current_depth
    print("%s Depth %d, %s: Split on feature: %d" % (formatting, current_depth, branch_name, best_feature))

    # Split the dataset at the best feature
    left_indices, right_indices = split_dataset(X, node_indices, best_feature)

    # continue splitting the left and the right child. Increment current depth
    build_tree_recursive(X, y, left_indices, "Left", max_depth, current_depth+1)
    build_tree_recursive(X, y, right_indices, "Right", max_depth, current_depth+1)
build_tree_recursive(X_train, y_train, root_indices, "Root", max_depth=2, current_depth=0)
 Depth 0, Root: Split on feature: 2
- Depth 1, Left: Split on feature: 0
  -- Left leaf node with indices [0, 1, 4, 7]
  -- Right leaf node with indices [5]
- Depth 1, Right: Split on feature: 1
  -- Left leaf node with indices [8]
  -- Right leaf node with indices [2, 3, 6, 9]
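
The printed tree splits on "Solitary" at the root, then on "Brown Cap" for the left branch and "Tapering Stalk Shape" for the right branch. As a sketch of how the learned tree would be used for prediction (this helper is not part of the lab; the splits and leaf labels are hard-coded from the printed tree, with each leaf labeled by the majority class of its indices):

def predict_mushroom(x):
    # Hypothetical helper: hard-codes the depth-2 tree learned above
    if x[2] == 1:                       # Solitary?
        return 1 if x[0] == 1 else 0    # Brown cap -> edible, otherwise poisonous
    else:
        return 1 if x[1] == 1 else 0    # Tapering stalk -> edible, otherwise poisonous

print([predict_mushroom(x) for x in X_train])   # expected: [1, 1, 0, 0, 1, 0, 0, 1, 1, 0] (matches y_train)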

Source: blog.csdn.net/fayoung3568/article/details/131196724