- 1. The difference between supervised and unsupervised learning
- 2. Regularization
- 3. Overfitting
- 4. Generalization
- 5. Generative and discriminative models
- 6. The difference between linear and nonlinear classifiers, and their merits
- 7. Ill-conditioning and ill-posed problems
- 8. The difference between L1 and L2 regularization, and how to choose between them
- 9. Feature normalization methods
- 10. Handling outliers in feature vectors
- 11. Why smaller parameters mean a simpler model
- 12. Comparison of the SVM RBF kernel and the Gaussian kernel
- 13. Choosing initial cluster centers for KMeans
- 14. ROC and AUC
- 15. The difference between the test set and the training set
- 16. Optimizing KMeans
- 17. The difference between data mining and machine learning
- 18. Remarks
The difference between supervised and unsupervised learning
- Supervised learning: learning from labeled training samples, then classifying or predicting data outside the training set. (LR, SVM, BP, RF, GBRT)
- Unsupervised learning: learning from unlabeled training samples to discover structural knowledge in the data. (KMeans, DL)
Regularization
Regularization was proposed to combat overfitting. The classical view is that the optimal model minimizes the empirical risk; regularization adds a model-complexity term to this objective (the regularization term is typically a norm of the model's parameter vector), weighted by a coefficient that trades off model complexity against empirical risk. The more complex the model, the larger the structural risk, so optimizing the structural risk instead of the empirical risk keeps the trained model from becoming overly complex and effectively reduces the risk of overfitting.
Occam's razor applies here: among models that explain the data well, the simplest is the best.
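The structural-risk objective above can be sketched in a few lines. This is a minimal illustration with a one-parameter linear model and an L2 penalty; all names are illustrative, not from the original text.

```python
# Structural risk = empirical risk + lambda * model complexity.

def empirical_risk(w, xs, ys):
    # Mean squared error of a 1-D linear model y = w * x.
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def structural_risk(w, xs, ys, lam):
    # Add an L2 penalty on the parameter; larger |w| -> larger risk.
    return empirical_risk(w, xs, ys) + lam * w ** 2

# Data perfectly fit by w = 2: without regularization w = 2 minimizes
# the risk; with lam > 0 the optimum shrinks toward 0, trading a bit
# of fit for a simpler model.
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
```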
Overfitting
If we keep improving the model's predictive ability on the training data, the complexity of the chosen model tends to grow; this phenomenon is called overfitting. An overfit model has very small error during training but large error at test time.
Causes
- Too many parameters, which raise the model's complexity and make it easy to overfit
- Too many weight-learning iterations (overtraining), which fits the noise and unrepresentative features in the training data
Solutions
- Cross-validation
- Feature reduction
- Regularization
- Weight decay
- A validation set
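Of the solutions above, cross-validation is the most mechanical to demonstrate. Below is a minimal k-fold split sketch (illustrative only; real projects would use a library implementation such as scikit-learn's KFold).

```python
def kfold_indices(n, k):
    """Partition indices 0..n-1 into k contiguous folds; each fold is
    held out once as the validation set, the rest is the training set."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val
```

Each of the k (train, val) pairs covers all n indices exactly once, so every sample is used for validation exactly one time.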
Generalization
Generalization refers to the model's ability to predict unseen data.
Generative and discriminative models
- Generative models: learn the joint probability distribution P(X, Y) of the data, then derive the conditional probability distribution P(Y | X) as the prediction model, i.e. the generative model P(Y | X) = P(X, Y) / P(X). (Naive Bayes)
A generative model can recover the joint distribution P(X, Y), converges quickly during learning, and can be used to study hidden variables.
- Discriminative models: learn the decision function Y = f(X) or the conditional probability distribution P(Y | X) directly from the data as the prediction model, i.e. the discriminative model. (k-nearest neighbors, decision trees)
Predicting directly often yields higher accuracy, and abstracting away from the data to varying degrees can simplify the model.
The difference between linear and nonlinear classifiers, and their merits
If the model is a linear function of its parameters and has a linear decision surface, it is a linear classifier; otherwise it is nonlinear.
Common linear classifiers: LR, (naive) Bayes classifier, single-layer perceptron, linear regression
Common nonlinear classifiers: decision trees, RF, GBDT, multilayer perceptron
SVM can be either (linear with a linear kernel, nonlinear with a Gaussian kernel)
- Linear classifiers are fast and easy to program, but may not fit the data well
- Nonlinear classifiers are harder to program, but have strong fitting power
When there are more features than data points, which classifier should you choose?
A linear classifier, because in a high-dimensional, sparse space the data is more likely to be linearly separable.
For high-dimensional features, linear or nonlinear?
Same as above.
For low-dimensional features, linear or nonlinear?
A nonlinear classifier, because in a low-dimensional space many samples may crowd together, making the data linearly inseparable.
Ill-conditioning and ill-posed problems
If, after training, slightly modifying the test samples produces wildly different results, the problem is ill-posed (put simply, the model is unstable, which is no good).
The difference between L1 and L2 regularization, and how to choose between them
Both prevent overfitting by reducing model complexity.
- L1 adds the 1-norm of the model parameters to the loss function (i.e. sigma(|xi|))
- L2 adds the squared 2-norm of the model parameters to the loss function (i.e. sigma(xi^2)); note that the 2-norm is actually defined as sqrt(sigma(xi^2)), but the sqrt is dropped in the regularization term to make optimization easier
- L1 produces sparse features
- L2 produces features that are mostly small and close to 0
L1 tends to keep a small number of features and drive the rest exactly to 0, while L2 keeps more features, all close to zero. L1 is therefore very useful for feature selection, while L2 is only a regularizer.
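Why L1 yields exact zeros while L2 only shrinks can be seen from a one-dimensional problem min_w (w - a)^2 / 2 + penalty(w), which has closed-form solutions. This derivation is an illustrative addition, not from the original text.

```python
def l1_solution(a, lam):
    # argmin (w - a)^2 / 2 + lam * |w|  ->  soft-thresholding:
    # the solution is exactly zero whenever |a| <= lam.
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0

def l2_solution(a, lam):
    # argmin (w - a)^2 / 2 + (lam / 2) * w^2  ->  plain shrinkage:
    # the solution is never exactly zero unless a is.
    return a / (1.0 + lam)
```

For example, with a = 0.3 and lam = 0.5 the L1 solution is exactly 0.0 (the feature is dropped), while the L2 solution is 0.2 (the feature is merely shrunk).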
Feature normalization methods
- Linear (min-max) scaling: y = (x - MinValue) / (MaxValue - MinValue)
- Logarithmic scaling: y = log10(x)
- Inverse tangent scaling: y = arctan(x) * 2 / PI
- Standardization (subtract the mean, divide by the standard deviation): y = (x - mean) / std
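The four transforms above, written as plain functions (function names are illustrative):

```python
import math

def min_max(x, min_value, max_value):
    # Linear scaling into [0, 1].
    return (x - min_value) / (max_value - min_value)

def log_scale(x):
    # Logarithmic scaling; assumes x > 0.
    return math.log10(x)

def arctan_scale(x):
    # Maps x >= 0 into [0, 1).
    return math.atan(x) * 2 / math.pi

def z_score(x, mean, std):
    # Standardization: note we divide by the standard deviation,
    # not the variance.
    return (x - mean) / std
```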
Handling outliers in feature vectors
- Replace them with the mean or another statistic
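A minimal sketch of this idea, replacing values far from the mean with a robust statistic (here the median); the n-sigma cutoff is one common, illustrative choice, not something specified in the original text.

```python
def replace_outliers(values, n_sigma=3.0):
    """Replace values more than n_sigma standard deviations from the
    mean with the median of the data."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    median = sorted(values)[n // 2]
    return [median if abs(v - mean) > n_sigma * std else v
            for v in values]
```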
Why smaller parameters mean a simpler model
An overfit curve passes through every data point, so it tends to have large curvature, i.e. large derivatives within small intervals. In a linear model the derivatives are the weights, so smaller parameters describe a simpler model.
Note: this may actually be better understood through VC dimension and related theory.
Comparison of the SVM RBF kernel and the Gaussian kernel
The Gaussian kernel is a radial basis function (RBF) kernel; in practice the RBF kernel usually refers to the Gaussian kernel.
Choosing initial cluster centers for KMeans
Choose K points that are as far apart from each other as possible:
First pick a point at random as the initial center, then choose the point farthest from it as the second center, then the point farthest from the first two as the third center, and so on until K centers have been chosen.
Alternatively, use hierarchical clustering or Canopy clustering to pick the initial centers.
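The farthest-point procedure above can be sketched as follows. (This is a deterministic sketch; the related k-means++ method instead samples each new center with probability proportional to squared distance.)

```python
def farthest_point_init(points, k):
    """Pick k initial centers that are mutually far apart."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centers = [points[0]]  # in practice the first center is random
    while len(centers) < k:
        # Pick the point whose distance to its nearest chosen center
        # is largest.
        next_center = max(points,
                          key=lambda p: min(dist2(p, c) for c in centers))
        centers.append(next_center)
    return centers
```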
ROC and AUC
ROC and AUC are often used to evaluate the quality of a binary classifier.
ROC curve
Curve axes:
- The X axis is the FPR (false positive rate: predicted positive but actually negative, FP / N)
- The Y axis is the TPR (true positive rate: predicted positive and actually positive, TP / P)
For points (X, Y) in the plane:
- (0, 1) means every positive sample is predicted correctly and no negative is misclassified: the best classifier
- (0, 0) means everything is predicted negative
- (1, 0) means every prediction is wrong: the worst classifier
- (1, 1) means everything is predicted positive
Points falling on the line x = y correspond to random guessing.
Building the ROC curve
A classifier usually outputs a probability p after prediction; the higher the probability, the more likely the sample is positive.
Now choose a threshold: if p > threshold, predict positive, otherwise negative. By setting several thresholds we get several sets of positive/negative predictions, i.e. several (FPR, TPR) value pairs.
Plotting these (FPR, TPR) points and connecting them gives the ROC curve.
Thresholds of 1 and 0 give the points (0, 0) and (1, 1) respectively (threshold = 1 predicts every sample negative; threshold = 0 predicts every sample positive).
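The threshold sweep above can be sketched directly (an illustrative implementation; names are my own):

```python
def roc_points(probs, labels, thresholds):
    """labels: 1 = positive, 0 = negative.
    Returns one (FPR, TPR) pair per threshold."""
    p = sum(labels)           # number of positives
    n = len(labels) - p       # number of negatives
    points = []
    for t in thresholds:
        tp = sum(1 for pr, y in zip(probs, labels) if pr > t and y == 1)
        fp = sum(1 for pr, y in zip(probs, labels) if pr > t and y == 0)
        points.append((fp / n, tp / p))
    return points
```

For a classifier that ranks all positives above all negatives, the swept points go from (0, 0) straight up to (0, 1) and then across to (1, 1).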
AUC
AUC (Area Under Curve) is defined as the area under the ROC curve; this area is clearly no greater than 1, and since the ROC curve generally lies above x = y, 0.5 < AUC < 1.
The larger the AUC, the better the classifier.
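Given a list of (FPR, TPR) points sorted by FPR, the AUC is just the area under the polyline, computed here with the trapezoid rule (an illustrative sketch):

```python
def auc(points):
    """Area under a polyline of (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

For the perfect classifier's ROC [(0, 0), (0, 1), (1, 1)] this gives 1.0; for the random-guess diagonal [(0, 0), (1, 1)] it gives 0.5.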
Why use ROC and AUC
Because when the balance of positive and negative samples in the test set changes, the ROC curve stays essentially unchanged, whereas precision and recall may fluctuate considerably.
http://www.douban.com/note/284051363/?type=like
The difference between the test set and the training set
The training set is used to build the model; the test set is used to evaluate the model's predictive ability, etc.
Optimizing KMeans
Use a kd-tree or ball tree (understand these tree structures).
Build all the observations into a kd-tree. Previously, the distance from every cluster center to every observation had to be computed one by one; with a kd-tree, each cluster center only needs distance computations against the observations in a nearby local region.
The difference between data mining and machine learning
Machine learning is an important tool for data mining, but data mining is not limited to machine learning: besides machine learning there are many other methods, such as graph mining and frequent itemset mining. Roughly speaking, data mining is defined by its purpose, while machine learning is defined by its methods.
Remarks
The questions come mainly from the web; the answers come mainly from the web or from "Statistical Learning Methods", plus a small amount of my own summarizing. Please point out anything that is wrong.
If you want to learn about the common models, see my article "A personal summary of common machine learning algorithms (for interviews)".