Andrew Ng Machine Learning Course Notes -- Newton's Method

In the last lecture, I talked about the logistic regression model. Given a training set, you can write down the log likelihood and derive a gradient ascent rule for finding the maximum likelihood estimate.

 

These algorithms tend not to have convergence problems; all of them will generally converge, unless you choose too large a learning rate for gradient ascent or something like that. But the speeds of convergence of these algorithms are very different.

 

It turns out that Newton's method is an algorithm that enjoys extremely fast convergence; the technical term is that it enjoys a property called quadratic convergence. It means that every iteration of Newton's method will double the number of significant digits to which your solution is accurate, give or take constant factors.

Suppose that on a certain iteration your solution is within 0.01 of the optimum, so you have 0.01 error. Then after one iteration, your error will be on the order of 0.0001, and after another iteration, your error will be on the order of 0.00000001. This is called quadratic convergence because you essentially get to square the error on every iteration of Newton's method. This result holds only when you are close to the optimum anyway, so it is a theoretical result that is true, but because of constant factors and so on, it may paint a slightly rosier picture than is accurate.
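To make the squaring concrete, here is a minimal Python sketch (the function and starting point are my own illustration, not from the lecture). It maximizes f(theta) = log(theta) - theta, whose maximum is at theta = 1, and prints the error after each Newton step:

```python
# A minimal sketch (my own illustration): Newton's method applied to
# maximizing f(theta) = log(theta) - theta, whose maximum is at theta = 1.
# Maximizing f means finding a zero of f', so each step is
#   theta := theta - f'(theta) / f''(theta).

def newton_1d(theta, iterations=5):
    for i in range(iterations):
        f_prime = 1.0 / theta - 1.0          # f'(theta) = 1/theta - 1
        f_double_prime = -1.0 / theta ** 2   # f''(theta) = -1/theta^2
        theta = theta - f_prime / f_double_prime
        print(f"iteration {i + 1}: error = {abs(1.0 - theta):.1e}")
    return theta

newton_1d(0.5)
# error = 2.5e-01, 6.2e-02, 3.9e-03, 1.5e-05, 2.3e-10:
# the error squares on every iteration, i.e. quadratic convergence.
```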

 

When you implement Newton's method for logistic regression, it usually converges in a dozen iterations or so for most reasonably sized problems with tens or hundreds of features.

 

The generalization of Newton's method to when theta is a vector, rather than just a real number, is the following.
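The update referred to here (restating the standard vector-form Newton update for maximizing the log likelihood) is

\theta := \theta - H^{-1} \nabla_\theta \ell(\theta)

where \ell(\theta) is the log likelihood, \nabla_\theta \ell(\theta) is its gradient vector, and H is the Hessian matrix of second partial derivatives, H_{ij} = \partial^2 \ell(\theta) / (\partial \theta_i \, \partial \theta_j).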

 

So for logistic regression, again, for a reasonable number of features and training examples, when I run this algorithm you usually see convergence within anywhere from a handful to a dozen or so iterations. Compared to gradient ascent, and batch gradient ascent in particular, it is fast: it usually needs far fewer iterations to converge. The disadvantage of Newton's method is that on every iteration you need to invert the Hessian. The Hessian will be an n-by-n matrix, or an (n+1)-by-(n+1) matrix, if n is the number of features.

 

If you have a large number of features in your learning problem, say tens of thousands of features, then inverting H can be a computationally expensive step. But for smaller, more reasonable numbers of features, this is usually a very fast algorithm.
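As a concrete illustration, here is a minimal Python sketch of Newton's method for logistic regression (variable names, shapes, and structure are my own convention; it follows the update rule above):

```python
import numpy as np

# A minimal sketch: Newton's method for fitting logistic regression by
# maximum likelihood. X is an m-by-(n+1) design matrix whose first column
# is all ones for the intercept; y is a length-m vector of 0/1 labels.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_newton(X, y, iterations=10):
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        h = sigmoid(X @ theta)            # predicted probabilities
        gradient = X.T @ (y - h)          # gradient of the log likelihood
        W = h * (1.0 - h)
        H = -(X.T * W) @ X                # Hessian: -X^T diag(W) X
        # Newton update: theta := theta - H^{-1} gradient. Solving the
        # linear system avoids forming the inverse explicitly.
        theta = theta - np.linalg.solve(H, gradient)
    return theta
```

Solving the linear system with np.linalg.solve, rather than explicitly computing H^{-1}, is the usual numerically preferable way to apply the inverse.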

 

I wrote down this algorithm to find the maximum likelihood estimate of the parameters for logistic regression; that is, I wrote it down for maximizing a function.

 

Let's talk about generalized linear models. The sigmoid function turns out to be a natural default choice, one that leads us to logistic regression.

The Bernoulli and the Gaussian: suppose we have data that is zero-one valued, and we want to model it with a Bernoulli random variable parameterized by phi. So the Bernoulli distribution has the probability of y equals one, which just equals phi.

If you consider the Gaussian distribution, then as you vary the mean you get different Gaussian distributions. Both of these are special cases of the class of distributions called the exponential family.

 

A distribution is in the exponential family if it can be written in the form p(y; \eta) = b(y) \exp(\eta^T T(y) - a(\eta)). It turns out that most of the "textbook distributions" can be written in the form of an exponential family distribution. The multivariate normal distribution, which is the generalization of the Gaussian random variable to high-dimensional vectors, is in the exponential family. You may have heard of the Poisson distribution; the Poisson distribution is often used for modeling counts, things like the number of radioactive decays in a sample, the number of customers to your website, or the number of visitors arriving in a store. The Poisson distribution is also in the exponential family. The gamma and the exponential distributions are distributions over the positive numbers, so they are often used to model intervals: if you're standing at the bus stop and you want to ask when the next bus is likely to arrive, you often model that with a gamma distribution or an exponential distribution. Those are also in the exponential family. There are also probability distributions over fractions, which are themselves probability distributions over probabilities, and things like the Wishart distribution, which is a distribution over covariance matrices. All of these can be written in the form of exponential family distributions.
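As a concrete example of such a rewriting (this is the standard derivation), the Bernoulli distribution with parameter \phi can be massaged into exponential family form:

p(y; \phi) = \phi^y (1 - \phi)^{1-y} = \exp\left( y \log\frac{\phi}{1-\phi} + \log(1 - \phi) \right)

which matches the exponential family form with T(y) = y, \eta = \log(\phi / (1 - \phi)), a(\eta) = \log(1 + e^\eta), and b(y) = 1. Solving \eta = \log(\phi / (1 - \phi)) for \phi gives \phi = 1 / (1 + e^{-\eta}), which is exactly the sigmoid function.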

 

Generalized linear models

Three assumptions: 1. Assume that given my input x and my parameters theta, the output y, or the response variable y that I'm trying to predict, is distributed according to an exponential family distribution, with some specific choice of the functions a, b, and T determining the conditional distribution of y given x, parameterized by theta. So if you want to predict how many visitors arrive at your website, you may choose to model the number of hits on your website with a Poisson distribution, since the Poisson distribution is natural for modeling count data. In that case you would choose the exponential family distribution here to be the Poisson distribution.

 

2. Given x, the goal is to output E[T(y)|x]; that is, I want my learning algorithm's hypothesis h(x) to output the expected value E[T(y)|x].

3. This is the one you may want to think of as a design choice: assume that the distribution of y given x is an exponential family distribution with some natural parameter eta. So the number of visitors to the website on any given day will be Poisson with some parameter. The last decision I need to make is the relationship between my input features and the parameter eta parameterizing my Poisson distribution, or whatever distribution it is. I'm going to make the assumption, or really the design choice, that the relationship between eta and my input features is linear; in particular, eta = theta^T x.

And the reason I make this design choice is that it will allow me to turn the crank of the generalized linear model machinery and come up with very nice algorithms for fitting, say, Poisson regression models, or for performing regression with gamma-distributed or exponentially distributed outputs, and so on; a worked instance follows below.
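As one worked instance of turning that crank (standard Poisson regression, following the recipe above): for the Poisson distribution the natural parameter turns out to be \eta = \log \lambda, so the design choice \eta = \theta^T x gives

h_\theta(x) = E[y \mid x; \theta] = \lambda = e^{\eta} = e^{\theta^T x}

that is, the Poisson regression hypothesis predicts the expected count as the exponential of a linear function of the features.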

 

And so that's how you come up with the logistic regression algorithm: when you have a target variable y, also called the response variable y, that takes on two values, you choose to model it with a Bernoulli distribution.
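Putting the three assumptions together for the Bernoulli case recovers the hypothesis (using the Bernoulli derivation above):

h_\theta(x) = E[y \mid x; \theta] = \phi = \frac{1}{1 + e^{-\eta}} = \frac{1}{1 + e^{-\theta^T x}}

which is exactly the logistic regression model, with the sigmoid appearing as a consequence of the exponential family form rather than as an arbitrary choice.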

 

 

One tiny bit of notation: the function g that relates the natural parameter eta to the expected value of y is called the canonical response function, and g inverse is called the canonical link function. Actually, many texts use the names the other way around, calling this one g inverse and that one g, but this notation seems consistent with other algorithms in machine learning.

 

I'm going to skip over the Gaussian example. But again, just as choosing y Bernoulli gives a derivation of logistic regression, you can do the same thing with the Gaussian distribution and end up with the ordinary least squares model. The problem with the Gaussian is that it's almost so simple that, when you see it for the first time, it's sometimes more confusing than the more complex examples, because it looks like it ought to be more complicated.

 

How do we choose what theta will be? What you have there is the logistic regression model, which is a probabilistic model that assumes the probability of y given x takes a certain form. What you do is write down the log likelihood of your training set, and find the value of theta that maximizes the log likelihood of the parameters.
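Concretely, for a training set of m examples, the log likelihood being maximized is the standard one for logistic regression:

\ell(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]

where h_\theta(x) = 1 / (1 + e^{-\theta^T x}).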

 

What I want to do next is talk about the multinomial. The multinomial is a distribution over k possible outcomes. Imagine you're now in a machine learning problem where the value of y that you're trying to predict can take on k possible outcomes, rather than only two. For example, you might want a learning algorithm to automatically sort emails into the right email folder for you, and you may have a dozen email folders you want your algorithm to classify emails into. Or take predicting whether a patient has a disease or does not have a disease, which is a binary classification problem: if instead you think the patient may have one of k diseases, you want the learning algorithm to figure out which of the k diseases your patient has. So there are lots of multi-class classification problems where you have more than two classes, and you model those with the multinomial.

 

By choosing a and b this way, I can take the multinomial distribution and write it out in the form of an exponential family distribution.
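Working through that derivation, the response function you get for the multinomial is the softmax function:

\phi_i = \frac{e^{\eta_i}}{\sum_{j=1}^{k} e^{\eta_j}}, \quad \text{with } \eta_i = \theta_i^T x

so each class probability \phi_i is a normalized exponential of a linear function of the inputs. (The lecture's exact parameterization uses k-1 parameter vectors and fixes \eta_k = 0, but the functional form is the same.)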

 

And so, just to give this algorithm a name: this algorithm is called softmax regression, and it is widely thought of as the generalization of logistic regression, which handles two classes, to the case of k classes. So if you have a machine learning problem with more than two classes, you can apply softmax regression to it; the same general recipe carries through the entire derivation.
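A minimal Python sketch of the softmax regression hypothesis (names and shapes are my own convention):

```python
import numpy as np

# A minimal sketch: the softmax regression hypothesis. Theta is a
# k-by-(n+1) matrix with one parameter vector per class; x is a feature
# vector with a leading 1 for the intercept.

def softmax_hypothesis(Theta, x):
    eta = Theta @ x                   # eta_i = theta_i^T x for each class
    eta = eta - np.max(eta)           # shift for numerical stability
    exp_eta = np.exp(eta)
    return exp_eta / exp_eta.sum()    # phi_i = e^{eta_i} / sum_j e^{eta_j}
```

Subtracting the maximum of eta before exponentiating leaves the probabilities unchanged but avoids overflow for large eta values.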

 

How to fit parameters.

Let's say you have a machine learning problem where y takes on one of k classes. The parameters theta are then obtained by maximum likelihood estimation.
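As with logistic regression, you write down the log likelihood of the training set under the softmax model and maximize it; in the notation above (a standard way to write it):

\ell(\theta) = \sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}; \theta) = \sum_{i=1}^{m} \log \phi_{y^{(i)}}(x^{(i)})

which can be maximized with gradient ascent or Newton's method, just as before.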

 


Reposted from blog.csdn.net/weixin_43218659/article/details/87912920