Machine Learning Knowledge Review II

Common Models and Algorithms

 

1. The linear model y = Wx + b is simple and easy to interpret (each weight shows how its factor contributes to the result), and it can serve as a building block for more complex algorithms

1. Solution method: the least squares method (this gives linear regression); the perceptron is a separate linear model used for classification

2. Generalized linear model: a function of the linear model, y = g(Wx + b), where g is a monotone, differentiable link function

3. The unit step function is approximated by the sigmoid y = 1/(1 + e^-z); this gives log-odds (logistic) regression, which is itself a kind of log-linear model (see the sketch below)
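A minimal numpy sketch of the sigmoid and a plain gradient-descent fit for logistic regression; the function names, learning rate and epoch count are illustrative assumptions, not anything from the original notes.

```python
import numpy as np

def sigmoid(z):
    # y = 1 / (1 + e^{-z}), the smooth surrogate for the unit step function
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, lr=0.1, epochs=1000):
    """Fit w, b by gradient descent on the average log-loss (illustrative sketch)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)        # predicted probabilities
        grad_w = X.T @ (p - y) / n    # gradient of the average log-loss w.r.t. w
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```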

4. Linear discriminant analysis (LDA): project the samples onto a line so that same-class samples fall close together and different classes fall far apart; the direction is computed from estimates of the class means and covariances (sketch below)
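A two-class LDA sketch, assuming labels are coded 0/1; it simply implements w ∝ Sw^-1 (mu0 - mu1) using sample means and the within-class scatter matrix.

```python
import numpy as np

def lda_direction(X, y):
    """Two-class LDA projection direction: w = Sw^{-1} (mu0 - mu1)."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # within-class scatter: sum of the two per-class scatter matrices
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    w = np.linalg.solve(Sw, mu0 - mu1)
    return w / np.linalg.norm(w)
```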

5. The principle of maximum entropy: when the distribution is unknown, first satisfy the known constraints and otherwise treat it as uniform; the uncertainty of the random variable, i.e. its entropy, is then maximal (the largest information content in information-theoretic terms), and the risk of the prediction is minimal

6. The maximum entropy model is also a log-linear model, and the quasi-Newton method can be used to solve it

2. Decision trees, i.e. tree models (the result can be read as a symbolic, rule-based representation of conditional probabilities)

1. The essence is to summarize the key cases into rules: when certain attributes satisfy certain conditions, take a certain decision (a very intuitive specification over symbolic data)

2. Construction process: recursively select an attribute, split on its values (branching), then prune

3. The criterion for selecting attributes is information gain: prefer the features that have the greatest effect on the classification (sketch below)
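A small sketch of entropy and information gain for a discrete feature, assuming the labels and feature values are plain Python sequences; all names here are illustrative.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    """Gain(D, a) = Ent(D) - sum_v |D_v|/|D| * Ent(D_v)."""
    total, n, cond = entropy(labels), len(labels), 0.0
    for v in set(feature_values):
        subset = [l for l, fv in zip(labels, feature_values) if fv == v]
        cond += len(subset) / n * entropy(subset)
    return total - cond
```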

4. ID3: at each step, recurse on the feature with the largest information gain; it tends to favour attributes with many distinct values (such as an ID column), which leads to overfitting

5. C4.5: improves ID3 by using the information gain ratio as the criterion for selecting features

6. CART algorithm: uses the Gini index for classification trees (and squared error for regression trees)

3. Neural Networks, i.e. mesh and hierarchical models

1. A neural network is an extensively parallel, interconnected network composed of simple adaptive units, whose organization can simulate the interactive responses of a biological nervous system to real-world objects.

2. A single neuron has multiple inputs and a single output, and fires when the weighted input exceeds a threshold. To allow differentiable, gradient-based methods, the step function is replaced by the sigmoid y = 1/(1 + e^-z)

3. Perceptron: only an input layer and an output layer (only the output layer contains functional neurons)

4. Multi-layer feedforward neural network: one input layer, one or more hidden layers, and one output layer; there are no connections within a layer, each layer is fully connected to the next, and there are no cross-layer connections

(Figure: multi-layer feedforward neural network)

5. Error backpropagation (BP) algorithm: a generalization of the perceptron learning rule applied layer by layer, so that every parameter is updated in each iteration.

The process of one iteration (standard BP): take one sample, compute forward from the input layer to the output layer, propagate the error backwards starting from the output layer, and adjust the parameters of each earlier layer according to that error (see the sketch below).
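A sketch of one standard-BP update for a single-hidden-layer network with sigmoid units and squared error; the shapes and learning rate are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bp_step(x, y, W1, b1, W2, b2, lr=0.1):
    """One standard-BP update on a single sample (single hidden layer, sigmoid units)."""
    # forward pass: input -> hidden -> output
    h = sigmoid(W1 @ x + b1)
    o = sigmoid(W2 @ h + b2)
    # backward pass: error terms from the output layer back to the hidden layer
    g = (o - y) * o * (1 - o)        # output-layer gradient (squared error)
    e = (W2.T @ g) * h * (1 - h)     # hidden-layer gradient
    # gradient-descent updates
    W2 -= lr * np.outer(g, h); b2 -= lr * g
    W1 -= lr * np.outer(e, x); b1 -= lr * e
    return W1, b1, W2, b2
```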

6. The accumulated (batch) BP algorithm processes all samples before making a single update in each iteration.

7. It can be proven that a feedforward network with a single hidden layer containing enough neurons can approximate any continuous function to arbitrary precision; how many hidden neurons are needed is not known in advance

8. The expressive power is sufficient, but overfitting occurs. Solution: add a penalty term such as the sum of wi^2 (weight decay) to the error

9. Ways to escape a local minimum: multiple sets of random initial values; stochastic gradient descent; simulated annealing; genetic algorithms (all lack theoretical guarantees)

10. RBF (radial basis function) network: a single-hidden-layer feedforward network whose hidden-unit activation functions are radial basis functions

With enough hidden neurons, it can approximate any continuous function

ART (adaptive resonance theory): competitive learning, winner takes all

SOM: self-organizing map

Cascade-correlation network: the network structure itself also grows during training (prone to overfitting when data are scarce)

Elman network: a recurrent neural network that feeds the hidden layer's output back into the input, used for sequence modelling

11. Compared with SVM, the theory is less clean, and practice relies on many tricks

4. Deep learning (feature or representation learning): DBN, RBM, CNN; works well on images, sound, etc.

1. More parameters require more data and heavier computation, which is supported by cloud computing and big data.

2. When the number of layers is very large, plain BP cannot be used (it fails to converge)

3. Pre-training (training only one layer at a time) + BP: first find a good local solution layer by layer, then fine-tune the whole network

4. Convolutional neural networks (CNNs) use weight sharing to speed up learning. The essence is layer-by-layer abstraction and feature expression until the final representation becomes very simple; it can be viewed as automated feature extraction (see the convolution sketch below).
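A toy "valid" 2-D convolution in numpy to make the weight-sharing point concrete: the same small kernel is reused at every position, so the number of parameters does not depend on the image size. This is only a sketch, not an efficient implementation.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (correlation form): one shared kernel slid over the image."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # the same kh x kw weights are applied at every location
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out
```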

 

5. Support vector machine (works very well for text processing)

1. When the data are linearly separable, all the data can be split by a hyperplane

2. Learning strategy: margin maximization, gamma = min_i { y_i (w · x_i + b) / ||w|| } (a small computation sketch follows)
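A one-line sketch of that margin formula, assuming labels in {-1, +1} and a given (w, b); names are illustrative.

```python
import numpy as np

def geometric_margin(X, y, w, b):
    """gamma = min_i y_i (w . x_i + b) / ||w||, for labels y_i in {-1, +1}."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)
```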

3. Support vectors: the points with the smallest geometric margin, i.e. those closest to the separating hyperplane

4. When the data are not linearly separable, they can be mapped into a higher-dimensional feature space in which they become linearly separable (such a space can always be found).

The function that computes inner products in that feature space directly from the original inputs is called the kernel function, and the choice of kernel largely determines the overall effectiveness and efficiency.

5. Common kernel functions: linear, polynomial, Gaussian (RBF), Laplacian, sigmoid (sketches below)
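Minimal sketches of a few of these kernels for single input vectors; the hyperparameters (degree, c, sigma) are arbitrary illustrative defaults.

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, degree=3, c=1.0):
    return (x @ z + c) ** degree

def gaussian_kernel(x, z, sigma=1.0):
    # also known as the RBF kernel
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))
```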

6. If the data are almost linearly separable except for a few outliers, slack variables with a penalty term can be added (soft margin)

7. Support vector regression: build an epsilon-insensitive zone around the regression function; only points outside it contribute to the loss

8. The optimal model learned by an SVM can be expressed as a linear combination of kernel functions evaluated at the training points. Solving and extending SVMs in this way is called the kernel trick

 

6. Bayesian classifier

1. Classification using Bayes' theorem; the difficulty is that the class-conditional probability involves the joint effect of many attributes and is hard to estimate

2. Naive Bayes: assume the attributes are conditionally independent given the class (sketch below)
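A sketch of categorical naive Bayes with Laplace smoothing; the data layout (a list of attribute tuples plus a label list) and all names are assumptions made for illustration.

```python
import numpy as np
from collections import Counter, defaultdict

def train_nb(X, y):
    """Categorical naive Bayes with Laplace smoothing (illustrative sketch)."""
    n = len(y)
    class_counts = Counter(y)
    # number of distinct values each attribute can take (estimated from the data)
    n_values = [len({x[j] for x in X}) for j in range(len(X[0]))]
    priors = {c: (cnt + 1) / (n + len(class_counts)) for c, cnt in class_counts.items()}
    joint = defaultdict(int)                      # joint[(c, j, v)] = count of attr j == v in class c
    for xi, c in zip(X, y):
        for j, v in enumerate(xi):
            joint[(c, j, v)] += 1
    return priors, joint, class_counts, n_values

def predict_nb(x, priors, joint, class_counts, n_values):
    def score(c):
        s = np.log(priors[c])
        for j, v in enumerate(x):
            s += np.log((joint[(c, j, v)] + 1) / (class_counts[c] + n_values[j]))
        return s
    return max(priors, key=score)
```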

3. Semi-naive Bayes: each attribute is allowed to depend on at most one other attribute (one-dependent estimators)

4. Bayesian network, using a directed acyclic graph to describe the dependencies between attributes

5. EM algorithm: handles incomplete observations, i.e. cases where the values of some attributes (latent variables) are unobserved

 

7. Ensemble learning

1. The basic idea is to combine several simple classifiers; the base classifiers must be both accurate and diverse. In practice, because they are trained on the same samples, it is hard to satisfy both requirements at once.

2. Boosting trains one classifier first, lets the next one focus on the samples the previous ones misclassified, and finally combines all the classifiers linearly

3. Bagging: diversity is obtained on the data side by bootstrap sampling, i.e. drawing samples with replacement (each draw is put back); see the sketch below
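A bootstrap-sampling and majority-vote sketch for bagging; `make_classifier` in the usage comment is a hypothetical base learner, not something defined here.

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw n samples with replacement (each draw is 'put back')."""
    n = len(y)
    idx = rng.integers(0, n, size=n)
    return X[idx], y[idx]

def bagging_predict(models, x):
    """Relative-majority vote over the base classifiers' predictions."""
    votes = [m(x) for m in models]
    return max(set(votes), key=votes.count)

# usage sketch (make_classifier is a hypothetical base learner):
# rng = np.random.default_rng(0)
# models = [make_classifier(*bootstrap_sample(X, y, rng)) for _ in range(T)]
```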

4. Random forest: when building each decision tree, the attribute chosen at a node is not the globally best one but the best within a random subset of attributes; works very well in practice

5. Combination strategies: simple averaging, weighted averaging; absolute-majority voting, relative-majority voting; re-learning (Stacking)

 

8. Clustering

1. Group the data into clusters, i.e. generate the notion of categories from the data itself

2. For ordered attributes (e.g. numerical ones), the usual distance measures can be used; for unordered (categorical) attributes, the VDM (Value Difference Metric) is used

3. Prototype-based clustering: k-means (random initialization, then reassign points and recompute the centers each iteration; sketch below); LVQ (assumes the samples are labeled); Gaussian mixture models
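A plain k-means sketch: random initial centers, then alternate assignment and center recomputation; k, the iteration count and the seed are illustrative choices.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means: random initial centers, then alternate assign / recompute."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each point to its nearest center
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # recompute each center as the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```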

4. Density-based clustering: DBSCAN

5. Hierarchical clustering: use a tree structure

 

9. Dimensionality reduction and metric learning

1. k-nearest-neighbor learning: classify a test point by the majority class among its k nearest neighbors in the training set (assuming all categories are represented there); sketch below
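A k-NN sketch assuming the training data and labels are numpy arrays; k is an arbitrary default.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    """Label x by a majority vote among its k nearest training neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]
```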

2. Dimensionality reduction: the data live in a high-dimensional space, but a low-dimensional subspace often carries the useful structure. Low dimensions make distances easier to compute and patterns easier to find; the (distance-preserving) mapping can be expressed with transformation matrices, eigenvectors, etc.

3. PCA (principal component analysis): take the top-k eigenvectors of the covariance matrix, i.e. those with the largest eigenvalues (sketch below)
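A PCA sketch via the eigendecomposition of the sample covariance matrix; it returns the projected data and the top-k directions.

```python
import numpy as np

def pca(X, k):
    """Project onto the top-k eigenvectors of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)                    # center the data
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :k]                # top-k principal directions
    return Xc @ W, W
```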

4. Kernelized Linear Dimensionality Reduction

5. Manifold learning: exploits local geometric/topological structure (e.g. Isomap, LLE)

6. Metric Learning

 

10. Feature selection and sparse learning

1. There are too many features (attributes); which ones actually matter must be determined first

2. Filter methods, wrapper methods (and embedded methods)

 

11. Deep learning as unsupervised feature learning

http://blog.csdn.net/zouxy09/article/details/8775524

1. Human cognitive mode: one theory holds that a person's cognition and behavior patterns are stored in the connections between neurons, the "connection weights". The neural layout of the brain resembles a network: neurons are the nodes and the weighted connections are the edges, some stronger and some weaker. Learning means continually changing these weights, and with them one's cognition and way of doing things. After each connection amplifies or attenuates its input, the final signal ends up emphasizing different things

2. The human brain contains an enormous number of neurons organized in many layers (it is a collection of specialized processors; specific tasks such as face recognition use only a particular region). A natural question is whether more hidden layers give a better learning effect, and this turns out to be the case: as the number of hidden layers grows, recognition rates on images and speech keep improving (each layer expresses more abstract knowledge).

3. Features: learning them is a process of repeated iteration and abstraction; high-level features are combinations of low-level ones. 1) A multi-hidden-layer neural network has excellent feature-learning ability, and the learned features describe the data more essentially, which helps visualization and classification; 2) the difficulty of training deep neural networks can be effectively overcome by layer-wise pre-training. The "deep model" is the means, "feature learning" is the end.

4. Neural network algorithms simulate human learning. Their characteristic is not to find an explicit function mapping or joint distribution, but to record something like that mapping in the node weights and biases of the network; in many cases the true analytic expression cannot be written down directly (hence the "AI black box")

5. Advantages: the hardest parts of other machine learning methods are preprocessing steps such as dimensionality reduction, feature selection and labeling; deep learning instead automatically extracts the low- and high-level features needed for classification (learning features from big data). The trained model represents the original data well, and can then be further trained on supervised data to meet new needs (well suited to problems with non-obvious features such as images and speech).

6. Disadvantages: the more hidden layers, the more complicated training becomes; the error signal is attenuated as it propagates through many layers (the gradient vanishing problem), so training tends to converge to a local optimum or fails to converge at all.

7. Relationship with neural networks: the novelty of deep learning lies in auto-encoding and sparsity. Auto-encoding means treating the input itself as the training target: the output should reproduce the input, so if the intermediate layer has relatively few nodes it effectively compresses the data (imagine the input is a file, the middle layer is a rar archive, and the output is the file again; knowing the middle-layer code and the model structure is enough to recover the input). The other idea is sparsity: if we want more intermediate nodes without overfitting, we can add a constraint that each training example activates as few intermediate nodes as possible (mimicking the brain, where only a small fraction of neurons respond to any given input). A toy auto-encoder sketch follows.
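A toy one-hidden-layer auto-encoder in numpy, assuming inputs scaled to [0, 1]; a sparsity variant would simply add a penalty on the average hidden activation to this reconstruction loss. Everything here (layer sizes, learning rate, epochs) is an illustrative assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, hidden, lr=0.1, epochs=500, seed=0):
    """Tiny auto-encoder: reconstruct the input through a narrow hidden layer,
    so the hidden code acts as a compressed representation (illustrative sketch).
    Assumes X has values in [0, 1] since the output unit is a sigmoid."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W1 = rng.normal(0, 0.1, (hidden, d)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.1, (d, hidden)); b2 = np.zeros(d)
    for _ in range(epochs):
        H = sigmoid(X @ W1.T + b1)               # encode
        R = sigmoid(H @ W2.T + b2)               # decode (reconstruction)
        # backprop of the squared reconstruction error
        g = (R - X) * R * (1 - R)                # output-layer gradient
        e = (g @ W2) * H * (1 - H)               # hidden-layer gradient
        W2 -= lr * g.T @ H / len(X); b2 -= lr * g.mean(axis=0)
        W1 -= lr * e.T @ X / len(X); b1 -= lr * e.mean(axis=0)
    return W1, b1, W2, b2
```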

8. Problems of the BP algorithm: plain BP cannot be used because, for a deep network (more than about 7 layers), the residual propagated back to the front layers becomes too small, the so-called gradient diffusion.

(1) The gradient becomes sparser and sparser: the further from the top layer, the weaker the error-correction signal;

(2) Convergence to a local minimum: especially when starting far from the optimal region (random initialization can cause this);

(3) Usually only labeled data can be used for training, but most data are unlabeled, whereas the brain can learn from unlabeled data.

 

9. Deep learning training process

One step is to train one layer of the network at a time; the other is fine-tuning, so that the high-level representation r generated upwards from the original input x stays as consistent as possible with the x' generated downwards from r. Concretely:

1) First build single layers of neurons one by one, so that each step trains only a single-layer network.

2) When all layers are trained, Hinton uses the wake-sleep algorithm for tuning.

Make the weights between all layers except the top one bidirectional, so the top layer remains a single-layer neural network while the other layers become a graphical model. Upward weights are used for "cognition" and downward weights for "generation". All weights are then adjusted with the wake-sleep algorithm so that cognition and generation agree, i.e. the representation generated at the top can restore the lower-level nodes as accurately as possible. For example, if a top-level node represents a face, all face images should activate it, and the image generated downwards from it should look roughly like a face. The wake-sleep algorithm has two phases: wake and sleep.

1) Wake phase: the cognitive process. External inputs and the upward (cognitive) weights produce an abstract representation (node states) at each layer, while gradient descent modifies the downward (generative) weights between layers. That is: "if reality differs from what I imagined, change my weights so that what I imagine becomes like this".

2) Sleep phase: the generative process. The top-level representation (the concepts learned while awake) and the downward weights generate the bottom-level states, while the upward weights between layers are modified. That is: "if the image in my dream does not correspond to the concept in my head, change my cognitive weights so that this image maps to that concept".

10. Common models or methods

Idea: Add a process of encode and decode

1" Sparse Coding sparse coding: find sparse basis vectors, use EM algorithm

2"Denoising AutoEncoders: Add noise to the input data to enhance the generalization ability of encode

3" Restricted Boltzmann Machine (RBM) Restricted Boltzmann Machine: The hidden layer and the input layer satisfy the probability distribution

4" Deep Belief Networks Deep Belief Network

5" Convolutional Neural Networks (CNN) Convolutional neural network is a multi-layer perceptron specially designed for recognizing two-dimensional shapes. This network structure is highly invariant to translation, scaling, tilting or other forms of deformation.

 

 

12. Summary

1. Relationship:

Linear -> log-linear -> maximum entropy

Decision tree -> ID3 -> C4.5 -> CART

Neural Network->BP->Deep Learning->DBN RBM CNN

Linearly separable -> SVM -> Kernel trick 

Naive Bayes->Semi-Naive Bayes->Bayesian Network->EM algorithm

Boosting; Bagging -> random forest (random forest extends bagging)

Dimensionality reduction->feature selection and sparse learning->clustering->deep learning

2. A brief summary

svm, bayesian, deep learning

SVM: the algorithm is complex, but the dependence on data is small. Neural networks: the algorithm is not complicated, but they depend heavily on data (less like formula manipulation, more like image/intuitive thinking)

 
