Decision tree
Implement a decision tree from scratch and apply it to the task of classifying whether a mushroom is edible or poisonous.
1 Import packages
import numpy as np
import matplotlib.pyplot as plt
from public_tests import *
%matplotlib inline
2 Description of the problem
Let's say you're starting a business that grows and sells wild mushrooms. Since not all mushrooms are edible, you want to be able to tell whether a given mushroom is edible or poisonous based on its physical properties. You have some existing data that you can use for this task. Can you use this data to help you determine which mushrooms are safe to sell? Note: The dataset used is for illustration purposes only. It is not meant as a guide to identifying edible mushrooms.
3 One-hot encoded dataset
Brown Cap | Tapering Stalk Shape | Solitary | Edible |
---|---|---|---|
1 | 1 | 1 | 1 |
1 | 0 | 1 | 1 |
1 | 0 | 0 | 0 |
1 | 0 | 0 | 0 |
1 | 1 | 1 | 1 |
0 | 1 | 1 | 0 |
0 | 0 | 0 | 0 |
1 | 0 | 1 | 1 |
0 | 1 | 0 | 1 |
1 | 0 | 0 | 0 |
Therefore:
- `X_train` contains three features for each sample:
  - Brown Cap (`1` for a brown cap, `0` for a red cap)
  - Tapering Stalk Shape (`1` for a tapering stalk, `0` for an enlarging stalk)
  - Solitary (`1` for yes, `0` for no)
- `y_train` indicates whether each mushroom is edible:
  - `y = 1` means edible
  - `y = 0` means poisonous
X_train = np.array([[1,1,1],[1,0,1],[1,0,0],[1,0,0],[1,1,1],[0,1,1],[0,0,0],[1,0,1],[0,1,0],[1,0,0]])
y_train = np.array([1,1,0,0,1,0,0,1,1,0])
X_train[:5]
array([[1, 1, 1],
       [1, 0, 1],
       [1, 0, 0],
       [1, 0, 0],
       [1, 1, 1]])
4 Decision trees
In this hands-on lab, you will build a decision tree based on a provided dataset.
- Recall the steps to build a decision tree:
  - Start with all samples at the root node
  - Calculate the information gain for splitting on each possible feature, and select the feature with the highest information gain
  - Split the dataset on the selected feature, creating the left and right branches of the tree
  - Keep repeating the splitting process until the stopping criterion is met
- In this lab, you will implement the following functions, which let you split a node into left and right branches using the feature with the highest information gain:
  - Calculate the entropy at a node
  - Split the dataset at a node into left and right branches based on a given feature
  - Calculate the information gain from splitting on a given feature
  - Select the feature that maximizes information gain
- We will then use the helper functions you implemented to build the decision tree by repeating the splitting process until the stopping criterion is met.
  - For this lab, the stopping criterion is a maximum depth of 2.
4.1 Calculating entropy
First, you'll write a helper function called `compute_entropy` that computes the entropy (a measure of impurity) at a node.
- The function accepts a numpy array (`y`) that indicates whether the examples in this node are edible (`1`) or poisonous (`0`)

Please complete the function `compute_entropy()` below to:
- Calculate $p_1$, the proportion of examples that are edible (i.e. have value `1` in `y`)
- Then calculate the entropy

$$H(p_1) = -p_1 \log_2(p_1) - (1 - p_1) \log_2(1 - p_1)$$

- Note:
  - The logarithm uses base $2$
  - For the purposes of this calculation, $0 \log_2(0) = 0$. That is, if `p_1 = 0` or `p_1 = 1`, set the entropy to `0`
  - Make sure to check that the data at the node is not empty (i.e. `len(y) != 0`), and return `0` if it is empty
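def compute_entropy(y):
    entropy = 0
    # An empty node has entropy 0 by convention
    if len(y) != 0:
        # Fraction of examples at this node that are edible
        p1 = len(y[y == 1]) / len(y)
        # A pure node (p1 is 0 or 1) also has entropy 0
        if p1 != 0 and p1 != 1:
            entropy = -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)
        else:
            entropy = 0
    return entropy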
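As a quick sanity check of the entropy rules above, the sketch below evaluates a compact, equivalent formulation on a few label arrays. The `y_train` values are copied from section 3; the extra arrays (`[1, 1, 1]`, the empty array, and `[1, 0, 0, 0]`) are made-up inputs chosen only to exercise the special cases.

```python
import numpy as np

def compute_entropy(y):
    """Entropy of a binary label array; 0 for an empty or pure node."""
    if len(y) == 0:
        return 0.0
    p1 = np.mean(y == 1)  # fraction of edible examples
    if p1 == 0 or p1 == 1:
        return 0.0
    return float(-p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1))

y_train = np.array([1, 1, 0, 0, 1, 0, 0, 1, 1, 0])

root_entropy = compute_entropy(y_train)                    # 5 of 10 edible
pure_entropy = compute_entropy(np.array([1, 1, 1]))        # all edible
empty_entropy = compute_entropy(np.array([]))              # no examples
mixed_entropy = compute_entropy(np.array([1, 0, 0, 0]))    # p1 = 0.25

print(root_entropy, pure_entropy, empty_entropy, round(mixed_entropy, 4))
```

The root node is split evenly between edible and poisonous, so its entropy is the maximum value of 1; the pure and empty nodes both return 0.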
4.2 Divide the dataset
Next, you'll write a helper function called `split_dataset` that takes the data at a node and a feature to split on, and splits the data into left and right branches. Later in the lab, you'll implement code that calculates how good the split is.
- The function takes the training data, a list of indices of the data points at this node, and the feature to split on.
- It splits the data and returns the subsets of indices for the left and right branches. Examples whose feature value is `1` go to the left branch; the rest go to the right.
- For example, suppose we start at the root node (so `node_indices = [0,1,2,3,4,5,6,7,8,9]`) and choose to split on feature `0`, i.e. whether the example has a brown cap.
  - Then the output of the function is `left_indices = [0,1,2,3,4,7,9]` and `right_indices = [5,6,8]`

Index | Brown Cap | Tapering Stalk Shape | Solitary | Edible |
---|---|---|---|---|
0 | 1 | 1 | 1 | 1 |
1 | 1 | 0 | 1 | 1 |
2 | 1 | 0 | 0 | 0 |
3 | 1 | 0 | 0 | 0 |
4 | 1 | 1 | 1 | 1 |
5 | 0 | 1 | 1 | 0 |
6 | 0 | 0 | 0 | 0 |
7 | 1 | 0 | 1 | 1 |
8 | 0 | 1 | 0 | 1 |
9 | 1 | 0 | 0 | 0 |
def split_dataset(X, node_indices, feature):
    left_indices = []
    right_indices = []
    for i in node_indices:
        # Examples with feature value 1 go to the left branch
        if X[i][feature] == 1:
            left_indices.append(i)
        else:
            right_indices.append(i)
    return left_indices, right_indices
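The sketch below runs the same splitting logic on the training data from section 3, first at the root and then on a smaller node. The second call is an illustrative choice (the node `[0, 1, 4, 5, 7]` is simply the set of solitary mushrooms); it is there to show that the returned indices always refer to rows of the original `X_train`, not to positions within the node.

```python
import numpy as np

X_train = np.array([[1, 1, 1], [1, 0, 1], [1, 0, 0], [1, 0, 0], [1, 1, 1],
                    [0, 1, 1], [0, 0, 0], [1, 0, 1], [0, 1, 0], [1, 0, 0]])

def split_dataset(X, node_indices, feature):
    """Send examples with feature value 1 left, all others right."""
    left_indices, right_indices = [], []
    for i in node_indices:
        if X[i][feature] == 1:
            left_indices.append(i)
        else:
            right_indices.append(i)
    return left_indices, right_indices

# Split the root node on feature 0 (Brown Cap)
root_indices = list(range(10))
root_left, root_right = split_dataset(X_train, root_indices, feature=0)
print(root_left)   # [0, 1, 2, 3, 4, 7, 9]
print(root_right)  # [5, 6, 8]

# Split a smaller node (the solitary mushrooms) on the same feature
sub_left, sub_right = split_dataset(X_train, [0, 1, 4, 5, 7], feature=0)
print(sub_left)    # [0, 1, 4, 7]
print(sub_right)   # [5]
```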
Next, you'll write a function called `compute_information_gain` that takes the training data, the indices at the node, and the feature to split on, and returns the information gain from the split.

Please complete the `compute_information_gain()` function to calculate

$$\text{Information gain} = H(p_1^\text{node}) - \left(w^{\text{left}} H(p_1^\text{left}) + w^{\text{right}} H(p_1^\text{right})\right)$$

where:
- $H(p_1^\text{node})$ is the entropy at the node
- $H(p_1^\text{left})$ and $H(p_1^\text{right})$ are the entropies of the left and right branches after the split
- $w^{\text{left}}$ and $w^{\text{right}}$ are the proportions of examples in the left and right branches, respectively

Note:
- You can use the `compute_entropy()` function to calculate the entropy
- We provide some starter code that uses the `split_dataset()` function to split the dataset
def compute_information_gain(X, y, node_indices, feature):
    # Split the node's examples on the given feature
    left_indices, right_indices = split_dataset(X, node_indices, feature)
    X_node, y_node = X[node_indices], y[node_indices]
    X_left, y_left = X[left_indices], y[left_indices]
    X_right, y_right = X[right_indices], y[right_indices]
    # Entropy before the split and in each branch
    node_entropy = compute_entropy(y_node)
    left_entropy = compute_entropy(y_left)
    right_entropy = compute_entropy(y_right)
    # Fraction of the node's examples that fall in each branch
    w_left = len(X_left) / len(X_node)
    w_right = len(X_right) / len(X_node)
    weighted_entropy = w_left * left_entropy + w_right * right_entropy
    information_gain = node_entropy - weighted_entropy
    return information_gain
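As a worked check of the formula, consider splitting the root node on feature `0` (Brown Cap), using the counts from the table in section 4.2. The root holds 10 examples, 5 of them edible; the left branch holds 7 examples, 4 of them edible; the right branch holds 3 examples, 1 of them edible. Values are rounded to four decimal places.

$$
\begin{aligned}
H(p_1^\text{node}) &= H(5/10) = 1 \\
H(p_1^\text{left}) &= H(4/7) \approx 0.9852 \\
H(p_1^\text{right}) &= H(1/3) \approx 0.9183 \\
\text{Information gain} &\approx 1 - (0.7 \times 0.9852 + 0.3 \times 0.9183) \approx 0.0349
\end{aligned}
$$

The gain is small because both branches remain almost as mixed as the root.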
Now, let's write a function that finds the best feature to split on by computing the information gain for each feature, as we did above, and returning the feature that gives the largest information gain.
Please complete the `get_best_split()` function.
- The function accepts the training data and the indices of the data points at the node
- The output of the function is the feature that gives the greatest information gain
  - You can iterate over the features and use the `compute_information_gain()` function to calculate the information gain for each feature
def get_best_split(X, y, node_indices):
    num_features = X.shape[1]
    best_feature = -1  # stays -1 if no feature gives a positive gain
    max_info_gain = 0
    for feature in range(num_features):
        info_gain = compute_information_gain(X, y, node_indices, feature)
        if info_gain > max_info_gain:
            max_info_gain = info_gain
            best_feature = feature
    return best_feature
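To see which feature wins at the root, the sketch below computes the information gain of all three features with a compact, mask-based formulation. The helper names `entropy` and `information_gain` are hypothetical stand-ins written for this example, not the lab's functions, and `np.argmax` is used here as a shortcut for the selection loop.

```python
import numpy as np

X_train = np.array([[1, 1, 1], [1, 0, 1], [1, 0, 0], [1, 0, 0], [1, 1, 1],
                    [0, 1, 1], [0, 0, 0], [1, 0, 1], [0, 1, 0], [1, 0, 0]])
y_train = np.array([1, 1, 0, 0, 1, 0, 0, 1, 1, 0])

def entropy(y):
    """Binary entropy of a label array; 0 for an empty or pure node."""
    if len(y) == 0:
        return 0.0
    p1 = np.mean(y == 1)
    if p1 == 0 or p1 == 1:
        return 0.0
    return float(-p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1))

def information_gain(X, y, node_indices, feature):
    """Entropy at the node minus the weighted entropy of its two branches."""
    idx = np.asarray(node_indices, dtype=int)
    goes_left = X[idx, feature] == 1  # feature value 1 goes left
    y_node = y[idx]
    y_left, y_right = y_node[goes_left], y_node[~goes_left]
    w_left = len(y_left) / len(y_node)
    w_right = len(y_right) / len(y_node)
    return entropy(y_node) - (w_left * entropy(y_left) + w_right * entropy(y_right))

root_indices = list(range(10))
gains = [information_gain(X_train, y_train, root_indices, f)
         for f in range(X_train.shape[1])]
best_feature = int(np.argmax(gains))

for f, g in enumerate(gains):
    print(f"feature {f}: gain = {g:.4f}")
print("best feature:", best_feature)
```

At the root the gains are roughly 0.0349, 0.1245, and 0.2781, so feature `2` (Solitary) is selected. One difference to keep in mind: `np.argmax` returns `0` when every gain is zero, whereas the loop-based `get_best_split` returns `-1` in that case.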
5 Build the tree
In this section, we use the functions you implemented above to generate a decision tree by successively selecting the best feature to split on until the stopping criterion (a maximum depth of 2) is reached.
tree = []
root_indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

def build_tree_recursive(X, y, node_indices, branch_name, max_depth, current_depth):
    # Maximum depth reached: stop splitting and report a leaf
    if current_depth == max_depth:
        formatting = " "*current_depth + "-"*current_depth
        print(formatting, "%s leaf node with indices" % branch_name, node_indices)
        return
    # Otherwise, find the best feature to split on at this node
    best_feature = get_best_split(X, y, node_indices)
    tree.append((current_depth, branch_name, best_feature, node_indices))
    formatting = "-"*current_depth
    print("%s Depth %d, %s: Split on feature: %d" % (formatting, current_depth, branch_name, best_feature))
    # Split the dataset at the best feature
    left_indices, right_indices = split_dataset(X, node_indices, best_feature)
    # Continue splitting the left and the right child. Increment current depth
    build_tree_recursive(X, y, left_indices, "Left", max_depth, current_depth+1)
    build_tree_recursive(X, y, right_indices, "Right", max_depth, current_depth+1)

build_tree_recursive(X_train, y_train, root_indices, "Root", max_depth=2, current_depth=0)
 Depth 0, Root: Split on feature: 2
- Depth 1, Left: Split on feature: 0
  -- Left leaf node with indices [0, 1, 4, 7]
  -- Right leaf node with indices [5]
- Depth 1, Right: Split on feature: 1
  -- Left leaf node with indices [8]
  -- Right leaf node with indices [2, 3, 6, 9]
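One way to read this output is to look up the labels that end up in each leaf. The sketch below does that with the leaf indices copied from the printed tree and `y_train` from section 3; the dictionary keys such as `"Left/Right"` are labels invented here for readability.

```python
import numpy as np

y_train = np.array([1, 1, 0, 0, 1, 0, 0, 1, 1, 0])

# Leaf index lists copied from the printed tree
leaves = {
    "Left/Left": [0, 1, 4, 7],
    "Left/Right": [5],
    "Right/Left": [8],
    "Right/Right": [2, 3, 6, 9],
}

leaf_is_pure = {}
for name, indices in leaves.items():
    labels = y_train[indices].tolist()
    # A leaf is pure when all of its examples share one label
    leaf_is_pure[name] = len(set(labels)) == 1
    print(name, labels)
```

On this small training set every leaf is pure: each contains only edible (`1`) or only poisonous (`0`) examples, so the depth-2 tree separates all ten training examples. This says nothing about how the tree would behave on unseen data.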