Andrew Ng Machine Learning course notes -- feature selection

VC dimension: let us say your hypothesis class is the class of all linear decision boundaries, so say script H is parameterized by d real numbers. If you are applying logistic regression over n features, then d would be n plus one: logistic regression finds a linear decision boundary parameterized by n plus one real numbers.

Tips: think about how your hypothesis class is really represented in a computer. Computers use zero-one bits to represent real numbers, and a normal standard computer will usually represent real numbers by what are called double precision floating point numbers. What that means is that each real number is represented by a 64-bit representation. So computers can't represent real numbers exactly; they can only approximate them with a finite number of bits.

Shatters: given any set of labels on the points, you can find a hypothesis that perfectly separates the positive and negative examples. It turns out that you can show that in two dimensions, there is no set of four points that the class of all linear classifiers can shatter.
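To make the idea of shattering concrete, here is a small sketch (my own illustration, not code from the course; the function name `can_shatter`, the use of `LinearSVC` as a rough linear-separability test, and the specific test points are all assumptions) that enumerates every labeling of a point set and checks whether some linear classifier can realize it:

```python
# Sketch: empirically checking whether a set of 2-D points can be shattered by
# linear classifiers, by enumerating every labeling and testing linear
# separability. LinearSVC with a very large C is only a heuristic separability test.
from itertools import product
import numpy as np
from sklearn.svm import LinearSVC

def can_shatter(points):
    points = np.asarray(points, dtype=float)
    for labels in product([0, 1], repeat=len(points)):
        y = np.array(labels)
        if len(set(labels)) < 2:
            continue  # a single-class labeling is trivially realizable
        clf = LinearSVC(C=1e6, max_iter=100000).fit(points, y)
        if clf.score(points, y) < 1.0:   # some labeling is not linearly separable
            return False
    return True

# Three points in general position can be shattered; four square corners cannot
print(can_shatter([[0, 0], [1, 0], [0, 1]]))              # expect True
print(can_shatter([[0, 0], [1, 1], [1, 0], [0, 1]]))      # expect False (XOR labeling)
```

The four corners of a square fail because the XOR labeling (opposite corners sharing a label) is not linearly separable.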

Definition of Vapnik-Chervonenkis dimension: given a hypothesis class, the VC dimension of script H is the size of the largest set that is shattered by H, and if a hypothesis class can shatter arbitrarily large sets, then the VC dimension is infinite. It turns out the earlier result holds more generally: in n dimensions, the VC dimension of the class of linear classifiers is equal to n plus one. And it turns out that for most reasonable hypothesis classes, the VC dimension is roughly similar to the number of parameters in your model.

Why the SVM does not overfit: it turns out that if I consider data points that all lie within some sphere of radius R, and if I consider only the class of linear separators that separate the data with a margin of at least gamma, then the VC dimension of this class is less than or equal to R squared over four gamma squared, plus one. There are some strange caveats about this result that I am deliberately not going to talk about, but it turns out you can prove that the VC dimension of the class of linear classifiers with large margins is actually bounded, and this bound on the VC dimension has no dependence on the dimension of the points x.
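Written out as a formula, following the statement above (the symbol $\mathcal{H}_{\gamma}$ for the class of margin-$\gamma$ linear separators is my own notation):

$$\mathrm{VC}(\mathcal{H}_{\gamma}) \;\le\; \frac{R^{2}}{4\gamma^{2}} + 1$$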

Indicator function: each training example will be classified either correctly or incorrectly. What you would really like to do is choose parameters theta so as to minimize this step function: you would like to choose parameters theta so that you end up with a correct classification on your training example, i.e. so that the indicator of h(x) not equal to y is zero. It turns out that this step function is clearly a non-convex function, and it turns out that finding the linear classifier that minimizes training error is an NP-hard problem. Both logistic regression and the support vector machine can be viewed as using a convex approximation to this problem. This is maybe what you are really trying to minimize -- you want to minimize training error -- so you can actually think of logistic regression as trying to approximate empirical risk minimization: instead of using this step function, which is non-convex and gives you a hard optimization problem, it uses a curve lying above the step function to approximate it, so you have a convex optimization problem and can find the maximum likelihood estimate of the parameters for logistic regression.
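As a quick illustration of those convex surrogates (a sketch of my own, not code from the course), here is the 0-1 loss alongside the logistic loss and the hinge loss, written as functions of the margin z = y * theta^T x with labels y in {-1, +1}:

```python
# 0-1 loss vs. two convex surrogates, as a function of the margin z = y * theta^T x.
# z > 0 means the example is classified correctly.
import numpy as np

def zero_one_loss(z):
    return (z <= 0).astype(float)          # the non-convex step function

def logistic_loss(z):
    return np.log(1.0 + np.exp(-z))        # convex surrogate used by logistic regression

def hinge_loss(z):
    return np.maximum(0.0, 1.0 - z)        # convex surrogate used by the SVM

z = np.linspace(-2, 2, 9)
for name, loss in [("0-1", zero_one_loss), ("logistic", logistic_loss), ("hinge", hinge_loss)]:
    print(name, np.round(loss(z), 3))
```

Both surrogates are convex in z, which is what turns the hard combinatorial problem into a tractable optimization.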

Model selection: we saw that there is often a trade-off between bias and variance, and in particular it is important not to choose a hypothesis that is either too simple or too complex. So if your data has sort of a quadratic structure to it, then if you choose a linear function to try to approximate it, you would underfit, so you have a hypothesis with high bias. Conversely, if you choose a hypothesis that is too complex, you have high variance and will also fail to generalize -- you would overfit the data. Another example of a model selection problem is choosing the parameter tau, which was the bandwidth parameter in locally weighted linear regression. Yet another is choosing the parameter C in the SVM: in the soft-margin formulation of the optimization objective, the parameter C controls the trade-off between how much you want a large margin versus how much you want to penalize examples that violate it.

Train and test: it's commonly done by training on 70 percent of the data and then testing all of your models on the remaining 30 percent; you then take whichever model has the smallest hold-out cross validation error. After this you actually have a choice: having trained all of these hypotheses on 70 percent of the data, you can simply output the hypothesis that has the lowest error on your hold-out cross validation set, or, optionally, you can take the model that you selected, go back, and retrain it on all 100 percent of the data. In many machine-learning applications, though, we have very little data, which is what motivates the variations below that use the data more efficiently.
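A minimal sketch of that 70/30 hold-out procedure (my own code, not from the course; the synthetic quadratic data and the choice of polynomial-degree models are just for illustration):

```python
# Hold-out cross validation for model selection: models of different polynomial
# degree are trained on 70% of the data, the degree with the lowest error on the
# held-out 30% is selected, and the winner is optionally retrained on all the data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 1.0 + 2.0 * X[:, 0] ** 2 + rng.normal(0, 1, size=200)   # quadratic data + noise

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_degree, best_err = None, np.inf
for degree in range(1, 6):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    err = mean_squared_error(y_val, model.predict(X_val))    # hold-out error
    if err < best_err:
        best_degree, best_err = degree, err

# Optionally retrain the selected model on 100% of the data
final_model = make_pipeline(PolynomialFeatures(best_degree), LinearRegression()).fit(X, y)
print("selected degree:", best_degree)
```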

Cross validation: there are a couple of other variations on hold-out cross validation that sometimes make slightly more efficient use of the data. One is called k-fold cross validation. Draw a box S to denote the entirety of the data you have, and divide it into k pieces (five pieces in what I have drawn). Then what I will do is repeatedly train on k minus one pieces, test on the remaining piece, and then average over the k results. The disadvantage relative to simple hold-out cross validation is that this can be much more computationally expensive. When you set k equal to m, so that you split your training set into as many pieces as you have training examples, this is called leave-one-out cross validation.
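And a corresponding sketch of k-fold cross validation (again my own code; scikit-learn's KFold and LogisticRegression are just convenient stand-ins for whatever learner you are evaluating):

```python
# k-fold cross validation: split the data into k pieces, repeatedly train on k-1
# pieces and test on the remaining piece, then average the k test errors.
# Setting k = m (one example per fold) gives leave-one-out cross validation.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

def k_fold_error(X, y, k=5):
    errors = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        errors.append(1.0 - clf.score(X[test_idx], y[test_idx]))   # error on held-out fold
    return np.mean(errors)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(int)
print("5-fold CV error:", k_fold_error(X, y, k=5))
print("leave-one-out error:", k_fold_error(X, y, k=len(X)))        # k = m
```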

A good way to think of these learning theory bounds: this is why I quite often use big-O notation and just absolutely ignore the constant factors, because the bounds seem to be very loose.

Feature selection: we would like to select a subset of the features that are, or hopefully are, the most relevant ones for a specific learning problem, so as to give ourselves a simpler hypothesis class to choose from. In feature selection, what we most commonly do is use various search heuristics -- sort of simple search algorithms -- to try to search through the space of two to the n possible subsets of features, to try to find a good subset of features.

Forward search algorithm: start by initializing the set script F to be the empty set, and then repeat: for i equals one to n, try adding feature i to the set script F and evaluate the resulting model using cross validation, then keep the single feature whose addition helps the most. You terminate this, say, when you have added all the features to F, so F is the entire set of features, or when F reaches some preset size.
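A rough sketch of forward search (my own code; scoring each candidate feature set with cross_val_score and a logistic regression is just one possible choice):

```python
# Forward search wrapper feature selection: F starts empty; at each step we try
# adding every remaining feature, score each candidate set with cross validation,
# and keep the single best addition.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

def forward_search(X, y, max_features=None):
    n = X.shape[1]
    max_features = max_features or n
    selected = []                                   # this is script F
    while len(selected) < max_features:
        best_feat, best_score = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            candidate = selected + [i]
            score = cross_val_score(LogisticRegression(max_iter=1000),
                                    X[:, candidate], y, cv=5).mean()
            if score > best_score:
                best_feat, best_score = i, score
        selected.append(best_feat)                  # keep the most helpful feature
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] - 2 * X[:, 3] > 0).astype(int)         # only features 0 and 3 matter
print(forward_search(X, y, max_features=3))
```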

Wrapper feature selection: forward search and backward search are both instances of wrapper feature selection, where you "wrap" your learning algorithm inside the search and call it repeatedly as a subroutine to evaluate each candidate subset of features.

Backward search selection: you start with F equal to the entire set of features, and you delete features one at a time, each time removing the feature whose deletion hurts cross validation performance the least.

Filter feature selection method: the basic idea is that for each feature i, you compute some measure of how informative x_i is about y. One way is to just compute the correlation between x_i and y -- for each of your features, see how correlated it is with your class label y -- and then pick the top k most correlated features. Another informative measure that is used very commonly for choosing these k features is mutual information: the mutual information between feature x_i and y. You then pick the top k features, meaning you compute the correlation between x_i and y, or the mutual information between x_i and y, for all the features, and include in your learning algorithm the k features with the largest correlation with the label, or the largest mutual information with the label, whichever measure you are using. To choose k, you can actually use cross validation as well: take all your features and sort them in decreasing order of mutual information, then try using just the top one feature, the top two features, the top three features, and so on, and decide how many features to include using cross validation.
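A sketch of that filter approach (my own code; scikit-learn's mutual_info_classif is an estimator of mutual information and stands in for whichever scoring measure you pick):

```python
# Filter feature selection with mutual information: rank features by MI with the
# label, then choose the number k of top features to keep by cross validation.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 1] + 2 * X[:, 4] > 0).astype(int)           # only features 1 and 4 matter

mi = mutual_info_classif(X, y, random_state=0)         # one score per feature
ranking = np.argsort(mi)[::-1]                         # decreasing order of MI

best_k, best_score = None, -np.inf
for k in range(1, X.shape[1] + 1):
    top_k = ranking[:k]
    score = cross_val_score(LogisticRegression(max_iter=1000),
                            X[:, top_k], y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

print("feature ranking by MI:", ranking)
print("k chosen by cross validation:", best_k)
```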

K-L divergence: a formal measure of how different two probability distributions are.
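For reference, here is a small sketch of the standard discrete definitions (these formulas are not spelled out in the notes above; the connection is the usual one, where the mutual information between x_i and y is the KL divergence between the joint p(x_i, y) and the product of the marginals p(x_i) p(y)):

```python
# KL divergence between two discrete distributions, and mutual information
# written as the KL divergence between the joint p(x, y) and p(x) p(y).
import numpy as np

def kl_divergence(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                                   # terms with p(x) = 0 contribute 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def mutual_information(joint):
    joint = np.asarray(joint, float)               # joint[i, j] = p(x = i, y = j)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    return kl_divergence(joint.ravel(), (px * py).ravel())

# A toy joint distribution over a binary feature x and binary label y
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))       # how different two distributions are
print(mutual_information(joint))                   # large when x is informative about y
```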


Reposted from blog.csdn.net/weixin_43218659/article/details/88543013