Machine Learning Common interview questions

Article Directory
  1. Differentiating supervised learning and unsupervised learning
  2. Regularization
  3. Overfitting
    3.1. Causes
    3.2. Solutions
  4. Generalization
  5. Generative models and discriminative models
  6. The difference between linear and nonlinear classifiers, and their merits
    6.1. When there are more features than data points, what kind of classifier should you choose?
    6.2. For high-dimensional features, do you choose a linear or a non-linear classifier?
    6.3. For very low-dimensional features, do you choose a linear or a non-linear classifier?
  7. Ill-conditioned (ill-posed) problems
  8. The difference between L1 and L2 regularization, and how to choose between them
  9. Feature vector normalization methods
  10. Handling outliers in feature vectors
  11. Why smaller parameters mean a simpler model
  12. Comparison of the SVM RBF kernel and the Gaussian kernel
  13. Selecting the initial cluster centers for KMeans
    13.1. Select K points as far apart as possible
    13.2. Use hierarchical clustering or Canopy to pick the initial clusters
  14. ROC and AUC
    14.1. ROC curve
    14.2. AUC
    14.3. Why use ROC and AUC
  15. The difference between the test set and the training set
  16. Optimizing KMeans
  17. The difference between data mining and machine learning
  18. Remarks

Differentiating supervised learning and unsupervised learning

  • Supervised learning: learn from labeled training samples, then use the model to classify or predict data outside the training set. (LR, SVM, BP, RF, GBRT)
  • Unsupervised learning: learn from unlabeled training samples in order to discover the structural knowledge hidden in the data. (KMeans, DL)

Regularization

Regularization is proposed to address over-fitting. The usual view is that the optimal model is the one that minimizes the empirical risk; regularization adds a model-complexity term to this objective (the regularization term is a norm of the model's parameter vector), together with a coefficient that trades off model complexity against empirical risk. The more complex the model, the larger this structural risk becomes, so optimizing the structural risk instead of the plain empirical risk keeps the trained model from becoming overly complex and effectively reduces the risk of over-fitting.

This follows the spirit of Occam's razor: among all models that explain the data well, the simplest one is the best.
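As a concrete (if simplified) illustration, not taken from the original article, the structured risk for a linear regression model with an L2 penalty could be written as follows in NumPy; the function name `structural_risk` and the coefficient `lam` are made up for this sketch:

```python
import numpy as np

def structural_risk(w, X, y, lam=0.1):
    """Empirical risk (mean squared error) plus an L2 complexity penalty.

    lam trades off model complexity against empirical risk: the larger
    lam is, the more heavily large weights are penalized.
    """
    empirical_risk = np.mean((X @ w - y) ** 2)  # fit to the training data
    complexity = lam * np.sum(w ** 2)           # squared 2-norm of the parameters
    return empirical_risk + complexity
```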

Overfitting

If we keep pushing to improve the predictive ability on the training data, the complexity of the chosen model tends to become too high; this phenomenon is called over-fitting. The model's error is very small on the training data but large at test time.

Causes

  1. Too many parameters increase the model's complexity, which makes it easy to over-fit
  2. Too many weight-learning iterations (overtraining) fit the noise in the training data and the non-representative features of the training examples

Solutions

  1. Cross-validation (see the sketch after this list)
  2. Reduce the number of features
  3. Regularization
  4. Weight decay
  5. Use a validation set
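As a minimal sketch of item 1 (assuming scikit-learn is available; the dataset and model here are just placeholders), cross-validation scores held-out folds, so a large gap between training error and the cross-validation score is a sign of over-fitting:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# 5-fold cross-validation: each fold is held out once and scored,
# giving an estimate of generalization rather than training performance.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```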

Generalization

Generalization refers to the model's ability to make predictions on unseen data.

Generative models and discriminative models

  1. Generative models: learn the joint probability distribution P(X, Y) from the data, then obtain the conditional probability distribution P(Y | X) as the prediction model, i.e. the generative model P(Y | X) = P(X, Y) / P(X); see the sketch after this list. (Naive Bayes)
    A generative model can recover the joint distribution P(X, Y), has a fast rate of convergence during learning, and can be used to learn models with hidden variables.
  2. Discriminative models: learn the decision function Y = f(X) or the conditional probability distribution P(Y | X) directly from the data as the prediction model, i.e. the discriminative model. (k-nearest neighbors, decision trees)
    Predicting directly often gives higher accuracy, and abstracting away from the data to varying degrees makes it possible to simplify the model.
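To make the generative route concrete, here is my own toy sketch (not from the article) that estimates the joint distribution P(X, Y) from counts and then derives P(Y | X) = P(X, Y) / P(X):

```python
import numpy as np

# Toy discrete data: feature X in {0, 1}, label Y in {0, 1}.
X = np.array([0, 0, 1, 1, 1, 0, 1, 0])
Y = np.array([0, 0, 1, 1, 0, 0, 1, 1])

# Generative step 1: estimate the joint distribution P(X, Y) from counts.
joint = np.zeros((2, 2))
for x, y in zip(X, Y):
    joint[x, y] += 1
joint /= len(X)

# Generative step 2: derive the conditional P(Y | X) = P(X, Y) / P(X).
p_x = joint.sum(axis=1, keepdims=True)
p_y_given_x = joint / p_x
print(p_y_given_x)  # row x holds [P(Y=0 | X=x), P(Y=1 | X=x)]
```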

The difference between linear and nonlinear classifiers, and their merits

If the model is a linear function of its parameters and there exists a linear classification surface, it is a linear classifier; otherwise it is not.
Common linear classifiers: LR, Bayes classifier, single-layer perceptron, linear regression.
Common non-linear classifiers: decision trees, RF, GBDT, multi-layer perceptron.

SVM can be either (depending on whether the kernel is linear or Gaussian).

  • Linear classifiers are fast and easy to program, but their fitting power may be limited
  • Non-linear classifiers are more complex to program, but they fit the data better and have stronger modeling power (the sketch below makes this contrast concrete)
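A small sketch of that trade-off (my own illustration, assuming scikit-learn): on a toy dataset that is not linearly separable, a linear classifier trains quickly but fits poorly, while a non-linear one fits much better:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A toy dataset that is not linearly separable.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear = LogisticRegression().fit(X_train, y_train)                       # linear
nonlinear = RandomForestClassifier(random_state=0).fit(X_train, y_train)  # non-linear

print("linear classifier accuracy:    ", linear.score(X_test, y_test))
print("non-linear classifier accuracy:", nonlinear.score(X_test, y_test))
```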

When there are more features than data points, what kind of classifier should you choose?

A linear classifier, because with high-dimensional data the samples are generally sparse in the feature space and are more likely to be linearly separable.

For high-dimensional features, do you choose a linear or a non-linear classifier?

Same as above.

For very low-dimensional features, do you choose a linear or a non-linear classifier?

A non-linear classifier, because in a low-dimensional space many samples may end up crowded together, which often makes the data linearly inseparable.

Ill-conditioned (ill-posed) problems

If, after training, slightly modifying a test sample gives a wildly different result, the problem is ill-posed (simply put, such a model is unusable).

The difference between L1 and L2 regularization, and how to choose between them

Both are used to prevent over-fitting by reducing the complexity of the model.

  • L1 adds the 1-norm of the model parameters to the loss function (i.e. sigma(|xi|))
  • L2 adds the squared 2-norm of the model parameters to the loss function (i.e. sigma(xi^2)). Note that the L2 norm is defined as sqrt(sigma(xi^2)); the sqrt is dropped in the regularization term to make the optimization easier

  • L1 produces sparse features

  • L2 produces features that are mostly close to 0 but not exactly 0

L1 tends to keep a small number of features and set the rest to 0, while L2 keeps more features whose values are all close to zero. L1 is very useful for feature selection, whereas L2 is just a general regularizer.
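A quick way to see this contrast (my own sketch, assuming scikit-learn; Lasso is L1-regularized and Ridge is L2-regularized linear regression):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Regression problem where only 5 of the 20 features are informative.
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 regularization
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 regularization

print("L1 non-zero coefficients:", np.sum(lasso.coef_ != 0))  # typically few survive
print("L2 non-zero coefficients:", np.sum(ridge.coef_ != 0))  # all kept, just shrunk
```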

Feature vector normalization methods

  1. Linear scaling (min-max): y = (x - MinValue) / (MaxValue - MinValue)
  2. Logarithmic transformation: y = log10(x)
  3. Arctangent transformation: y = arctan(x) * 2 / PI
  4. Z-score standardization (subtract the mean, divide by the standard deviation): y = (x - mean) / std
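The four transformations above, written out as a minimal NumPy sketch (my own illustration):

```python
import numpy as np

x = np.array([1.0, 5.0, 10.0, 100.0])

min_max = (x - x.min()) / (x.max() - x.min())  # 1. linear scaling to [0, 1]
log10   = np.log10(x)                          # 2. logarithmic transformation
arctan  = np.arctan(x) * 2 / np.pi             # 3. arctangent transformation
z_score = (x - x.mean()) / x.std()             # 4. subtract mean, divide by std
```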

Handling outliers in feature vectors

  1. Replace them with the mean or another statistic

Why smaller parameters mean a simpler model

An over-fitted surface tries to pass through every data point, so it may have large curvature within small intervals, which means the derivatives there are large. In a linear model the derivatives are the weights, so the smaller the parameters, the simpler the model.

Note: it may be more appropriate to look at this in terms of the VC dimension.

Comparison of the SVM RBF kernel and the Gaussian kernel

It seems the RBF kernel is simply the Gaussian kernel.

Selecting the initial cluster centers for KMeans

Select K points as far apart as possible

First pick a point at random as the initial center; then choose the point farthest from it as the second center; then choose the point farthest from the first two as the third center; and so on, until K centers have been selected.
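A minimal NumPy sketch of this farthest-point initialization (my own illustration; the function name is made up):

```python
import numpy as np

def farthest_point_init(X, k, seed=0):
    """Pick k initial centers for KMeans that are as far apart as possible."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]          # 1. a random point as the first center
    for _ in range(k - 1):
        # Distance from every point to its nearest already-chosen center.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])          # 2. the farthest point becomes the next center
    return np.array(centers)
```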

Use hierarchical clustering or Canopy to pick the initial clusters

ROC and AUC

ROC and AUC are often used to evaluate the quality of a binary classifier.

ROC curve

The curve's coordinates:

  • The X axis is the FPR (false positive rate: the prediction is positive but the actual label is negative; FP / N)
  • The Y axis is the TPR (true positive rate: the prediction is positive and the actual label is indeed positive; TP / P)

For a point (X, Y) in the plane:

  • (0, 1) means every sample is predicted correctly; the best classifier
  • (0, 0) means every sample is predicted negative
  • (1, 0) means every prediction is wrong; the worst classifier
  • (1, 1) means every sample is predicted positive

    Points falling on the line x = y correspond to random guessing

Building the ROC curve:
after training, a classifier usually outputs a probability p; the higher the probability, the more likely the sample is positive.
Now suppose we choose a threshold: if p > threshold the prediction is positive, otherwise negative. Following this idea, if we set several different thresholds we obtain several sets of positive/negative predictions, that is, several pairs of (FPR, TPR) values.
Plotting these (FPR, TPR) points in the plane and connecting them gives the ROC curve.

When the threshold is 1 or 0, we get the points (0, 0) and (1, 1) respectively (threshold = 1: all samples are predicted negative; threshold = 0: all samples are predicted positive).
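A small sketch of this threshold-sweeping procedure in plain NumPy (my own illustration; `labels` and `scores` stand in for the ground truth and the classifier's probability outputs):

```python
import numpy as np

labels = np.array([1, 1, 0, 1, 0, 0, 1, 0])                   # ground truth (1 = positive)
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.1])  # predicted probabilities

P, N = labels.sum(), (1 - labels).sum()
points = []
for t in np.linspace(1.0, 0.0, 21):                 # sweep the threshold from 1 down to 0
    pred = (scores > t).astype(int)
    tpr = np.sum((pred == 1) & (labels == 1)) / P   # TP / P
    fpr = np.sum((pred == 1) & (labels == 0)) / N   # FP / N
    points.append((fpr, tpr))
# Plotting the sorted (FPR, TPR) points and connecting them gives the ROC curve.
```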

AUC

AUC (Area Under Curve) is defined as the area under the ROC curve; this area is obviously no greater than 1 (and since the ROC curve generally lies above the line x = y, 0.5 < AUC < 1).

The larger the AUC, the better the classifier.
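Continuing the sketch from the ROC section (same hypothetical `points`, `labels`, and `scores`), the area can be approximated with the trapezoid rule, or computed directly with scikit-learn's roc_auc_score:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

fpr, tpr = zip(*sorted(points))       # (FPR, TPR) pairs from the ROC sketch above
auc_trapezoid = np.trapz(tpr, fpr)    # trapezoid-rule approximation of the area
auc_exact = roc_auc_score(labels, scores)
print(auc_trapezoid, auc_exact)
```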

Why use ROC and AUC

Because when the proportion of positive and negative samples in the test set changes, the ROC curve stays basically unchanged, while precision and recall may fluctuate considerably.
http://www.douban.com/note/284051363/?type=like

The difference between the test set and the training set

The training set is used to build the model; the test set is used to evaluate the model's predictive performance.

Optimizing KMeans

Use a kd-tree or a ball tree.
Build all of the observations into a kd-tree. Previously, the distance between each cluster center and every observation had to be computed one by one; with the kd-tree, each cluster center only needs to compute distances to the observations in a nearby local region.
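As an illustration of the idea (my own sketch using SciPy's cKDTree, not the article's exact scheme), the nearest-center assignment step can be answered by tree queries instead of computing every center-to-point distance explicitly:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 2))   # observations
centers = X[:5]                   # current cluster centers (placeholder choice)

# Brute force would compute all 10000 x 5 distances; the kd-tree answers each
# "which center is nearest?" query by searching only a local region of space.
tree = cKDTree(centers)
_, assignment = tree.query(X)     # index of the nearest center for every observation
```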

The difference between data mining and machine learning

Machine learning is an important tool for data mining, but data mining is not limited to machine-learning methods; it also includes many non-machine-learning techniques such as graph mining and frequent itemset mining. Data mining is defined in terms of its purpose, while machine learning is defined in terms of its methods.

Remarks

The questions mainly come from the web; the answers mainly come from the web or from "Statistical Learning Methods", plus a small part of my own summary. If anything is wrong, please point it out.
For more about the common models, see the article "Machine learning common algorithms personal summary (for interviews)".
