Andrew Ng Machine Learning Course Notes -- Sequential Minimal Optimization (SMO)

Kernels: suppose x is the living area of a house that you are trying to make a prediction about, like whether it will be sold in the next six months. Quite often we will take this feature x and map it to a richer set of features; for example, we might map x to four polynomial features. Let me call this mapping phi, so phi(x) denotes the mapping from your original features to some higher-dimensional set of features. If you want to use the features phi(x), all you need to do is go back to the learning algorithm and, everywhere you see the inner product <x_i, x_j>, replace it with the inner product <phi(x_i), phi(x_j)>. This corresponds to running a support vector machine with the features given by phi(x) rather than with your original one-dimensional input feature x.

In the scenario I want to consider, phi(x) will sometimes be very high dimensional; for example, phi(x) can contain very high degree polynomial features, and sometimes phi(x) will even be an infinite-dimensional vector of features. The question is: if phi(x) is extremely high dimensional, it seems you can't compute these inner products efficiently, because the computer would need to represent an extremely high dimensional feature vector and then take an expensive inner product. It turns out that in many important special cases we can write down what we will call the kernel function, denoted K, which is the inner product between those feature vectors: K(x, z) = phi(x)^T phi(z). There will be important special cases where computing phi(x) is computationally very expensive — maybe impossible, if phi(x) is an infinite-dimensional vector, since you can't represent an infinite-dimensional vector — but where the kernel itself is cheap to compute.

Let's say you have two inputs, x and z (normally I would write these as x_i and x_j, but I am going to write x and z to save on writing), and let's say my kernel is K(x, z) = (x^T z)^2. Expanding the square, (x^T z)^2 = (sum_i x_i z_i)(sum_j x_j z_j) = sum_i sum_j (x_i x_j)(z_i z_j). So this kernel corresponds to the feature mapping where phi(x) is the vector of all pairwise products x_i x_j — for the case n = 3, phi(x) = (x_1 x_1, x_1 x_2, x_1 x_3, x_2 x_1, ..., x_3 x_3). You can verify for yourself that (x^T z)^2 is the inner product between phi(x) and phi(z): to take an inner product between two vectors you sum the products of corresponding elements, and doing that for phi(x) and phi(z) gives exactly the expansion above. The cool thing about this is the cost. If n is the dimension of x and z, then phi(x) is the vector of all pairs x_i x_j, so phi(x) has n^2 elements and takes order n^2 time just to write down. But to compute the kernel K, all you need is order n time, because the kernel is defined as (x^T z)^2: you take the inner product between x and z, which is order n time, and square it. So you have computed the inner product between two vectors that each have n^2 elements, but you did it in only order n time.
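To make that concrete, here is a quick Python sanity check (not from the original notes) showing that the order-n kernel computation agrees with the order-n^2 explicit feature map; the helper names phi and poly_kernel are just illustrative.

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map: all pairwise products x_i * x_j (length n^2)."""
    return np.outer(x, x).ravel()

def poly_kernel(x, z):
    """Kernel K(x, z) = (x^T z)^2, computed in O(n) time."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

explicit = np.dot(phi(x), phi(z))   # O(n^2) work: build both feature vectors
via_kernel = poly_kernel(x, z)      # O(n) work: one dot product, then square
print(explicit, via_kernel)         # both print 20.25
```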

Generalizations: if you define K(x, z) to be (x^T z + c)^2, you can again compute this kernel in order n time, and it turns out to correspond to a feature vector where, in addition to all the pairwise products x_i x_j, you add a few more elements at the bottom: sqrt(2c) x_1, sqrt(2c) x_2, sqrt(2c) x_3, and c. So this is a way of creating a feature vector with both the first-order terms and the quadratic terms x_i x_j, and the parameter c lets you control the relative weighting between the first-order terms and the quadratic terms. Again, this is still an inner product between vectors of length about n^2, computed in order n time.

More generally, here are some other examples of kernels. A generalization of the one I just derived is the kernel K(x, z) = (x^T z + c)^d, which corresponds to using all of the roughly (n + d choose d) monomial features — all products of the form x_i x_j x_k and so on, up to degree d. This feature space grows extremely fast in d, so it is a very high dimensional feature vector, but again you are only implicitly constructing that feature vector when you take inner products. It is computationally efficient because you just compute the inner product between x and z, add c, and take that real number to the power d; by plugging this in as a kernel you are implicitly working in an extremely high dimensional feature space.

So what I have given are a few specific examples of how to create kernels. Let me ask the more general question: if you are faced with a new machine learning problem, how do you come up with a kernel? There are many ways to think about it, but here is one intuition that is sort of useful. Given an input x you are going to use a feature vector phi(x), and given an input z you are going to use a feature vector phi(z), and the kernel computes the inner product between phi(x) and phi(z). So one intuition — a partial intuition, not a rigorous one — is that if x and z are very similar, then phi(x) and phi(z) will be pointing in roughly the same direction, and therefore the inner product will be large; whereas if x and z are very dissimilar, then phi(x) and phi(z) may be pointing in different directions, and the inner product may be small. That intuition is not rigorous, but it is a useful one to keep in mind. So if you are faced with a new learning problem — if I give you some random thing to classify and you want to decide how to come up with a kernel — one way is to try to come up with a function K(x, z) that is large when you want the learning algorithm to think of x and z as similar, and small when you want it to think of them as dissimilar. Again, this isn't always valid, but it is one of several useful intuitions. For example, say I write down K(x, z) = exp(-||x - z||^2 / (2 sigma^2)) (the Gaussian kernel), and I think this is a good measure of how similar x and z are for my problem.
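As a small illustration (a sketch, not from the notes), here are those two kernels in Python; poly_kernel and gaussian_kernel are illustrative names, with sigma defaulting to 1. The Gaussian kernel behaves like a similarity measure: close to 1 for nearby inputs, close to 0 for distant ones.

```python
import numpy as np

def poly_kernel(x, z, c=1.0, d=3):
    """Polynomial kernel (x^T z + c)^d -- O(n) despite a huge implicit feature space."""
    return (np.dot(x, z) + c) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: near 1 when x and z are similar, near 0 when far apart."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
print(gaussian_kernel(x, np.array([1.1, 2.1])))   # ~0.99: very similar inputs
print(gaussian_kernel(x, np.array([5.0, -3.0])))  # ~0: very dissimilar inputs
```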
But does there really exist a phi such that K(x, z) equals the inner product phi(x)^T phi(z)? It turns out there is a result that characterizes the necessary and sufficient conditions for a function you might choose to be a valid kernel, and I should go ahead and show part of that result now. Suppose K is a valid kernel — and when I say K is a kernel, what I mean is that there does indeed exist some function phi for which K(x, z) = phi(x)^T phi(z). Then let any set of points x_1 up to x_m be given, and define the kernel matrix to be the m-by-m matrix K whose entry K_ij is the kernel function applied to two of my examples, K(x_i, x_j). Now consider z^T K z for any vector z. By the definition of matrix multiplication, z^T K z = sum_i sum_j z_i K_ij z_j, and since K_ij = phi(x_i)^T phi(x_j), this equals ||sum_i z_i phi(x_i)||^2, which is greater than or equal to zero. So if K is a valid kernel — a function for which there exists some phi such that K(x_i, x_j) is the inner product between phi(x_i) and phi(x_j) — then we have shown that the kernel matrix must be positive semidefinite. It turns out the converse is true as well, and so this gives you a test for whether a function K is a valid kernel: K is valid if and only if the corresponding kernel matrix is symmetric positive semidefinite for any finite set of points. This is a theorem due to Mercer, and so kernels are also sometimes called Mercer kernels.
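Here is a small numerical sketch of that test (not from the notes): sample some points, build the kernel matrix, and check that its eigenvalues are nonnegative up to floating-point error. It checks a necessary condition on one random sample rather than proving validity.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

# Empirical Mercer check: the kernel matrix on any finite point set
# should be (numerically) symmetric positive semidefinite.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                     # 20 random points in R^3
K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])
eigvals = np.linalg.eigvalsh(K)                  # eigenvalues of the symmetric matrix K
print(eigvals.min() >= -1e-10)                   # True: no significantly negative eigenvalues
```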

Application: to apply a support vector machine with a kernel, you choose one of these functions, and the choice depends on your problem — on what is a good measure of how similar or how different two examples are for your problem. Everywhere you see the inner products <x_i, x_j>, you replace them with K(x_i, x_j), and then you run exactly the same support vector machine algorithm. What you have just done is take a support vector machine and implicitly replace each of your feature vectors x with a very high dimensional feature vector. It turns out that the Gaussian kernel corresponds to a feature vector that is infinite dimensional; nonetheless, you can run a support vector machine in a finite amount of time even though you are working with infinite dimensional feature vectors, because all you ever need to do is compute these kernel values — you never need to represent the infinite dimensional feature vectors explicitly. I said that we wanted to start to develop non-linear learning algorithms, so here is one useful picture to keep in mind. Say your original data is one dimensional and not linearly separable; the kernel implicitly maps it into a much higher dimensional (possibly infinite dimensional) space, and you then run the SVM in that space and find the optimal margin classifier there — a linear classifier in the new space, even though the data is not linearly separable in your original space. One way to choose the kernel parameter sigma is to set aside a small amount of your data: train an SVM using, say, two thirds of your data, try different values of sigma, and see which works best on a separate hold-out cross-validation set — a separate set that you test on. We will say more about this kind of model selection for learning algorithms later.
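The notes don't name any library; here is a hedged sketch of that hold-out procedure using scikit-learn's SVC, whose gamma parameter relates to sigma via gamma = 1/(2 sigma^2), on made-up toy data.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Toy non-linearly-separable data: label is 1 inside a ring, 0 outside.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (np.linalg.norm(X, axis=1) < 1.0).astype(int)

# Hold out a third of the data, train on the rest for several sigma values.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=1/3, random_state=0)
for sigma in [0.1, 0.5, 1.0, 3.0]:
    clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma**2)).fit(X_train, y_train)
    print(f"sigma={sigma}: hold-out accuracy = {clf.score(X_val, y_val):.3f}")
```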

Kernels: it turns out that the idea of kernels is actually more general than the support vector machine. In particular, we took the SVM algorithm and derived a dual, and that was what let us write the entire algorithm in terms of inner products of the inputs. It turns out you can take many of the other algorithms you have seen in this class — in fact most of the linear algorithms, such as linear regression and logistic regression — and rewrite them entirely in terms of these inner products. That means you can replace those inner products with K(x_i, x_j), and so you can take any of these algorithms, implicitly map the feature vectors into very high dimensional feature spaces, and have the algorithm still work. The idea of kernels is perhaps most widely used with SVMs, but it is actually more general than that: you can take many of the algorithms you have seen, and many of the algorithms we will see later this quarter as well, write them in terms of inner products, and thereby "kernelize" them and apply them to infinite dimensional feature spaces.
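As one concrete illustration the notes don't work out: L2-regularized linear regression can be written entirely in terms of inner products, and replacing them with a kernel gives kernel ridge regression. A minimal sketch, assuming the standard dual-form solution alpha = (K + lambda I)^{-1} y:

```python
import numpy as np

def gaussian_kernel_matrix(A, B, sigma=1.0):
    """K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

# Kernelized ridge regression: f(x) = sum_i alpha_i K(x_i, x),
# with alpha = (K + lambda I)^{-1} y -- only inner products, never an explicit phi(x).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

lam = 0.1
K = gaussian_kernel_matrix(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
y_pred = gaussian_kernel_matrix(X_test, X) @ alpha
print(y_pred)  # approximately sin(x) at the test points
```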

L1 norm soft margin SVM: let's say I have a data set that is linearly separable, but what do I do if I have a couple of other examples in there that make the data non-linearly separable? In fact, sometimes even when the data is separable, a single outlier can swing the maximum-margin boundary dramatically, so we may not want to separate it exactly. The L1-norm soft margin formulation allows examples to violate the margin at a cost controlled by a parameter C. When you derive the dual of this optimization problem and simplify, you have to maximize W(alpha), which is actually the same objective as before; it turns out the only way the dual changes compared to the previous one is that rather than the constraint that the alphas are greater than or equal to zero, we now have the constraint that the alphas are between zero and C.
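For reference, the standard form of the L1-norm soft margin dual (not written out in the notes, with <x^{(i)}, x^{(j)}> being the inner product that a kernel would replace) is:

```latex
\max_{\alpha}\; W(\alpha) \;=\; \sum_{i=1}^{m} \alpha_i
 \;-\; \frac{1}{2} \sum_{i=1}^{m}\sum_{j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \,\langle x^{(i)}, x^{(j)} \rangle
\qquad \text{s.t.}\quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0 .
```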

KKT: the Karush-Kuhn-Tucker conditions give the necessary conditions for a point to be an optimal solution to a constrained optimization problem. From them you can actually derive convergence conditions for the problem we want to solve: how do you know when the alphas have converged to the global optimum?
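The notes don't list them here, but the usual convergence test is the KKT dual-complementarity conditions, checked to within a numerical tolerance:

```latex
\alpha_i = 0 \;\Rightarrow\; y^{(i)}\big(w^{T} x^{(i)} + b\big) \ge 1, \qquad
\alpha_i = C \;\Rightarrow\; y^{(i)}\big(w^{T} x^{(i)} + b\big) \le 1, \qquad
0 < \alpha_i < C \;\Rightarrow\; y^{(i)}\big(w^{T} x^{(i)} + b\big) = 1 .
```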

SMO: an algorithm for actually solving this optimization problem. We wrote down the dual optimization problem and the convergence criteria, so now let us come up with an efficient algorithm. We will change two alphas at a time. The algorithm is called sequential minimal optimization; "minimal" refers to choosing the smallest number of alpha_i to change at a time, which in this case means changing at least two at a time, because the constraint sum_i alpha_i y_i = 0 means a single alpha would be completely determined by the others. In the step where we update with respect to alpha_i and alpha_j: if you optimize the two-variable quadratic function and the optimum lies inside the box [0, C] x [0, C], you are done; but the optimum may end up with a value outside the box, and if that happens, you clip your solution to map it back inside the box.
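Here is a minimal sketch of that clipping step, following the standard SMO box bounds; clip_alpha_j and its arguments are illustrative names, and alpha_j_new is assumed to be the unconstrained optimum from the two-variable solve.

```python
def clip_alpha_j(alpha_i_old, alpha_j_old, alpha_j_new, y_i, y_j, C):
    """Clip the unconstrained two-variable optimum back into the feasible box.

    The equality constraint sum_i alpha_i y_i = 0 confines (alpha_i, alpha_j)
    to a line segment inside the box [0, C] x [0, C]; L and H are the ends of
    that segment along the alpha_j axis.
    """
    if y_i != y_j:
        L = max(0.0, alpha_j_old - alpha_i_old)
        H = min(C, C + alpha_j_old - alpha_i_old)
    else:
        L = max(0.0, alpha_i_old + alpha_j_old - C)
        H = min(C, alpha_i_old + alpha_j_old)
    return min(H, max(L, alpha_j_new))

# Example: the unconstrained optimum 1.7 lies outside the box and gets clipped to H.
print(clip_alpha_j(0.3, 0.8, 1.7, +1, -1, C=1.0))  # prints 1.0
```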

Coordinate ascent: say we want to maximize some function W(alpha_1, ..., alpha_m). Coordinate ascent loops for i = 1 to m, and on each inner step it holds all the parameters except alpha_i fixed and maximizes the function with respect to just that one parameter. Coordinate ascent may take many more iterations than, say, Newton's method to converge, but it turns out there are many optimization problems for which it is very inexpensive to fix all but one of the parameters and optimize with respect to just that one. When that is true, the inner loop of coordinate ascent — optimizing with respect to a single alpha_i — can be done very quickly, and the overall method can be efficient. It turns out this will be true when we modify this algorithm to solve the SVM optimization problem.
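A toy sketch of coordinate ascent (not the SVM dual itself): maximize a simple concave quadratic by repeatedly optimizing one coordinate at a time in closed form; A and b are made-up example data.

```python
import numpy as np

# Maximize W(a) = -0.5 a^T A a + b^T a (concave, since A is positive definite)
# by closed-form updates of one coordinate at a time.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

a = np.zeros(2)
for _ in range(50):                       # outer loop
    for i in range(len(a)):               # inner loop: optimize a[i] with the rest fixed
        rest = A[i] @ a - A[i, i] * a[i]  # sum_{j != i} A_ij a_j
        a[i] = (b[i] - rest) / A[i, i]    # coordinate-wise maximizer

print(a, np.linalg.solve(A, b))  # coordinate ascent converges to the exact maximizer [0.2, 0.4]
```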

It turns out that either the polynomial kernel or the Gaussian kernel works fine for this problem of recognizing handwritten digits, and just by writing down the kernel and throwing an SVM at it, the SVM gave performance comparable to the very best neural networks. This was surprising, because the SVM does not take into account any knowledge about the spatial layout of the pixels; in particular, it does not know that one pixel is next to another, since it just represents the pixel intensity values as a vector.

Dynamic programming algorithm:


Reposted from blog.csdn.net/weixin_43218659/article/details/88408158