[GreedyAI Assignment 2] Lessons from the fourth week

Decision tree and random forest

A decision tree is a flowchart-like tree structure (similar to a binary tree) that can model nonlinear relationships.

1. An example of a decision tree

How does a decision tree decide where to split?

Method one: information entropy

Information entropy: the degree of uncertainty in a piece of information, H(Y)

Conditional entropy: the uncertainty that remains once a condition is fixed, H(Y|X)

Information gain: the reduction in uncertainty achieved by conditioning, IG(Y, X) = H(Y) - H(Y|X)

Information entropy example: coin flip
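A minimal sketch of the coin-flip example (plain Python; the probability values are illustrative). With base-2 logarithms, entropy is measured in bits, and a fair coin has the maximum uncertainty of exactly 1 bit:

    import math

    def entropy(probabilities):
        """Shannon entropy in bits: H(X) = -sum(p * log2(p))."""
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit (maximum uncertainty)
    print(entropy([0.9, 0.1]))  # biased coin: ~0.47 bits (less uncertain)
    print(entropy([1.0]))       # two-headed coin: 0.0 bits (no uncertainty)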

How are decisions made? Through the relationship between information entropy, conditional entropy, and information gain.

For example, predicting whether to play golf:

First, examine how each condition relates to whether golf is played, and visualize the relationships among the factors that affect the decision.

Then calculate the information entropy of the target (play or not) and the conditional entropy given each factor.

At each node, choose the factor whose split yields the largest information gain, that is, the smallest conditional entropy.

Repeat until the division is complete and a full decision tree is formed.
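A sketch of the split-selection step described above, using a small hypothetical play-golf table (the records below are illustrative, not the lesson's actual data): compute the entropy of the label, subtract the conditional entropy given each factor, and split on the factor with the largest gain:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(rows, labels, feature):
        """IG(Y, X) = H(Y) - H(Y | X)."""
        n = len(labels)
        conditional = 0.0
        for value in set(row[feature] for row in rows):
            subset = [y for row, y in zip(rows, labels) if row[feature] == value]
            conditional += (len(subset) / n) * entropy(subset)
        return entropy(labels) - conditional

    # Hypothetical records: weather conditions -> played golf?
    rows = [{"outlook": "sunny", "windy": False}, {"outlook": "sunny", "windy": True},
            {"outlook": "rainy", "windy": True}, {"outlook": "overcast", "windy": False},
            {"outlook": "rainy", "windy": False}, {"outlook": "overcast", "windy": True}]
    labels = ["no", "no", "no", "yes", "yes", "yes"]

    for feature in ("outlook", "windy"):
        print(feature, round(information_gain(rows, labels, feature), 3))
    # outlook: 0.667, windy: 0.082 -> outlook becomes the root split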

One shortcoming of decision trees is overfitting. In the example above, every record is used to grow the tree, so classification of the training samples is very accurate, yet it often diverges from reality; the tree ends up answering the training questions for their own sake rather than generalizing.

As a result, slight changes in the sample data can cause large changes in the whole tree, and the method does not suit data whose classes are unevenly distributed.
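One common remedy is pre-pruning: limiting how far the tree can grow so it cannot memorize every training sample. A minimal sketch with scikit-learn (assuming the library is available; the parameter values are illustrative, not tuned):

    from sklearn.tree import DecisionTreeClassifier

    # Unconstrained tree: grows until every leaf is pure, so it can
    # memorize the training data and overfit.
    overfit_tree = DecisionTreeClassifier(criterion="entropy")

    # Constrained tree: depth and leaf-size limits act as pre-pruning,
    # trading some training accuracy for better generalization.
    pruned_tree = DecisionTreeClassifier(criterion="entropy",
                                         max_depth=4,
                                         min_samples_leaf=5)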

2. Random forest

Random forest is an ensemble learning method: a forest is created in a random way, and it contains many decision trees with no correlation between them. Once the forest is built, each new input sample is judged separately by every tree in the forest (for a classification task), and the category that receives the most votes is predicted as that sample's category.

Random forest is a popular algorithm with many advantages:

  • It performs well on many data sets. Two sources of randomness (random sampling of the data and random selection of features) make a random forest hard to overfit and give it good noise resistance
  • It can handle very high-dimensional data (many features) and does not require feature selection. It adapts well to different data sets: it handles both discrete and continuous data, and the data does not need to be standardized
  • It can generate a proximity matrix P = (p_ij) to measure similarity between samples: p_ij = a_ij / N, where a_ij is the number of times samples i and j fall into the same leaf node and N is the number of trees in the forest (see the sketch after this list)
  • While the forest is being built, an unbiased estimate of the generalization error is obtained
  • Training is fast, and variable importance can be ranked (two measures: the increase in OOB misclassification rate, and the decrease in Gini impurity at splits)
  • Interactions between features can be detected during training
  • Easy to parallelize
  • Relatively simple to implement
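A minimal sketch of the proximity computation mentioned in the list above (assuming scikit-learn and NumPy; the data set is synthetic). forest.apply reports which leaf each sample lands in for every tree, and the proximity of two samples is the fraction of trees in which they share a leaf:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=100, n_features=8, random_state=0)
    forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

    # leaves[i, t] = index of the leaf that sample i reaches in tree t
    leaves = forest.apply(X)

    # proximity[i, j] = a_ij / N: the fraction of the N trees in which
    # samples i and j fall into the same leaf node
    n_trees = leaves.shape[1]
    proximity = np.equal(leaves[:, None, :], leaves[None, :, :]).sum(axis=2) / n_trees
    print(proximity.shape, proximity[0, 0])  # (100, 100) 1.0 (self-proximity)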

A random forest is not much different from a single decision tree; what it adds is a process of multi-tree construction and voting.

Multi-tree construction: the decision trees in a random forest are uncorrelated; each tree is built in the same way as an ordinary decision tree.

Voting process: majority vote, weighted averaging, and so on.
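A minimal sketch of this construct-then-vote process (assuming scikit-learn and NumPy; in practice sklearn.ensemble.RandomForestClassifier performs both steps internally):

    import numpy as np
    from collections import Counter
    from sklearn.tree import DecisionTreeClassifier

    def fit_forest(X, y, n_trees=25, seed=0):
        """Multi-tree construction: each tree sees a bootstrap sample of the data."""
        rng = np.random.default_rng(seed)
        trees = []
        for _ in range(n_trees):
            idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
            # max_features="sqrt" adds the second source of randomness:
            # each split considers only a random subset of the features
            trees.append(DecisionTreeClassifier(max_features="sqrt").fit(X[idx], y[idx]))
        return trees

    def predict_forest(trees, x):
        """Voting process: each tree judges independently; the majority wins."""
        votes = [tree.predict(x.reshape(1, -1))[0] for tree in trees]
        return Counter(votes).most_common(1)[0][0]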

K-means:

The k-means clustering algorithm is an iterative cluster-analysis algorithm. It randomly selects K objects as initial cluster centers, computes the distance between every object and each cluster center, and assigns each object to the nearest center; the centers are then recomputed, and the process repeats until the assignments no longer change.

K-means clustering is a method commonly used to automatically divide a data set into K groups; it is an unsupervised learning algorithm.

Business purpose

This is a general algorithm that can be used for any type of grouping. Some use cases are as follows:

  • Behavioral segmentation: segment customers by purchase history, or by activity on an application, website, or purchasing platform.
  • Inventory classification: group inventory according to sales activity (stock preparation).
  • Sensor measurements: detect activity types in motion-sensor data, group images.
  • Bot or anomaly detection: separate groups of genuinely active users from bots.

The k-means clustering algorithm (a code sketch follows these steps):

  • Step 1: Choose the number of clusters, K.
  • Step 2: Randomly select K points as the initial centroids (they need not come from your data set).
  • Step 3: Assign each data point to the nearest of the K centroids.
  • Step 4: Compute the new centroid of each cluster and move the centroid there.
  • Step 5: Reassign each data point to its nearest centroid. If any assignment changed, return to step 4; otherwise, stop.
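A minimal NumPy sketch of these five steps (for simplicity the initial centroids are drawn from the data itself, which is allowed even though step 2 notes it is not required; empty clusters are not handled):

    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # Step 2: pick K initial centroids (here: K distinct data points)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            # Step 3: assign each point to its nearest centroid
            distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            assignments = distances.argmin(axis=1)
            # Step 4: recompute each centroid as the mean of its cluster
            new_centroids = np.array([X[assignments == j].mean(axis=0)
                                      for j in range(k)])
            # Step 5: stop once the centroids (and hence assignments) are stable
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return centroids, assignments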

Origin blog.csdn.net/u010472858/article/details/96629118