What is likelihood? ~ maximum likelihood

Reposted from: https://www.douban.com/note/640290683/

Note 0: Chapter 5 of "Deep Learning" covers maximum likelihood in part; the explanation there is clearer, so I suggest going straight to that material.

Note 1: Today, while walking down the street, I suddenly understood what likelihood is about: it measures how well a model matches the data, which is why it is called likelihood. If our model describes our data well, the likelihood is large; if not, it is small. But what should the likelihood function actually look like? I thought about it for a long while on the road, without result. Back home, I googled and found a blog post. Only after reading it did I realize how clever the person who first came up with the concept of likelihood was. I have copied the original post below.

Note 2: Original link: https://codesachin.wordpress.com/2016/07/24/the-basics-of-likelihood/

Anyone who has done a course in statistics or data science must have come across the term 'likelihood'. In non-technical language, likelihood is synonymous with probability. But ask any mathematician, and their interpretation of the two concepts is significantly different. I went digging into likelihood for a bit this week, so I thought of putting down the basics of what I revisited.

Whenever we talk about understanding data, we talk about models. In statistics, a model is usually some sort of parameterized function – like a probability density function (pdf) or a regression model. But the effectiveness of the model's outputs will only be as good as its fit to the data. If the characteristics of the data are very different from those of the assumed model, the bias turns out to be pretty high. So from a probability perspective, how do we quantify this fit?

Defining the Likelihood Function

The two entities of interest are the data X and the parameters θ. Now consider a function F(X, θ) that returns a number proportional to the degree of 'fit' between the two – essentially quantifying their relationship with each other.

There are two practical ways you could deal with this function. If you kept θ constant and varied the data X being analyzed, you would get a function F1(X) whose only argument is the data. The output would basically be a measure of how well your input data satisfies the assumptions made by the model.

But in real life, you rarely know your model with certainty. What you do have is a bunch of observed data. So shouldn't we also think of it the other way round? Suppose you kept X constant, but tried varying θ instead. What you get now is a function F2(θ) that computes how well different sets of parameters describe your data (which you have for real).

Mind you, in both cases the underlying mathematical definition is the same; only the input 'variable' has changed. This is how probability and likelihood are related. The function F1 is what we call the probability function (or pdf, for the continuous case), and F2 is called the likelihood function. While F1 assumes you know your model and tries to analyze data according to it, F2 keeps the data in perspective while figuring out how well different sets of parameters describe it.
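If it helps to see this in code, here is a minimal sketch in Python, assuming the biased-coin (binomial) model worked through in the example that follows; the helper names F, F1, and F2 just mirror the text.

```python
# A minimal sketch of the two views of F(X, theta), assuming a biased-coin
# (binomial) model; the names F, F1, F2 mirror the text.
from math import comb  # requires Python 3.8+

def F(n, r, p):
    # Degree of 'fit' between data X = (n tosses, r heads) and theta = p.
    return comb(n, r) * p**r * (1 - p)**(n - r)

def F1(n, r, p=0.7):
    # Probability function: the parameter is fixed, the data varies.
    return F(n, r, p)

def F2(p, n=10, r=5):
    # Likelihood function: the data is fixed, the parameter varies.
    return F(n, r, p)

print(F1(10, 7))  # probability of seeing 7 heads in 10 tosses when p = 0.7
print(F2(0.5))    # likelihood of p = 0.5, given 5 heads in 10 tosses
```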

The above definition might make you think that the likelihood is nothing but a rewording of probability. But keeping the data constant and varying the parameters has huge consequences for the way you interpret the resultant function.

Let's take a simple example. Suppose you have a set of n coin tosses, where r of them came up heads and the rest tails. Let's say the coin used for tossing is biased, and the probability of a head coming up is p. In this case,

F(X, θ) = C(n, r) · p^r · (1 − p)^(n − r)

where C(n, r) denotes the binomial coefficient "n choose r".

Now suppose you made the coin yourself, so you know p = 0.7. In that case,

F1(X) = F1(n, r) = C(n, r) · (0.7)^r · (0.3)^(n − r)

On the other hand, let's say you don't know much about the coin, but you do have a bunch of toss outcomes from it. You made 10 different tosses, out of which 5 were heads. From this data, you want to measure how likely it is that your guess of p is correct. Then,

F2(θ) = F2(p) = C(10, 5) · p^5 · (1 − p)^5
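To get a feel for what F2 looks like, here is a quick, illustrative tabulation of the likelihood over a few guesses of p, with the data fixed at the 10 tosses and 5 heads from the text:

```python
# Tabulate the likelihood L(p) = C(10, 5) * p^5 * (1 - p)^5 over a grid of
# candidate p values (data fixed at n = 10 tosses, r = 5 heads).
from math import comb

def likelihood(p, n=10, r=5):
    return comb(n, r) * p**r * (1 - p)**(n - r)

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p = {p:.1f} -> L(p) = {likelihood(p):.4f}")
# The values peak at p = 0.5, the proportion of heads actually observed.
```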

There is a very, very important distinction between probability and likelihood functions: the value of the probability function sums (or integrates, for continuous data) to 1 over all possible values of the input data. However, the value of the likelihood function does not integrate to 1 over all possible combinations of the parameters.
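For the coin example above, you can verify both halves of this claim numerically. The sketch below sums the probability function over every possible outcome, then integrates the likelihood over p with a crude midpoint rule (for this particular likelihood the exact integral is 1/11):

```python
# Numerical check: the probability function sums to 1 over all data,
# but the likelihood does not integrate to 1 over the parameter.
from math import comb

n, r, p = 10, 5, 0.7

# Sum of P(k heads in n tosses) over every possible outcome k = 0..n:
prob_sum = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))
print(prob_sum)  # -> 1.0 (up to floating-point error)

# Midpoint-rule integral of L(q) = C(10, 5) q^5 (1 - q)^5 over q in [0, 1]:
steps = 100_000
like_int = sum(
    comb(n, r) * q**r * (1 - q)**(n - r)
    for q in ((i + 0.5) / steps for i in range(steps))
) / steps
print(like_int)  # -> ~0.0909, i.e. 1/11, not 1
```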

The above statement leads to the second important thing to note: DO NOT interpret the value of a likelihood function as the probability of the model parameters. If your probability function gave a value of 0.7 (say) for a discrete data point, you could be pretty sure that no other point is as likely, because the probabilities of all the other points would have to sum to 0.3. However, if you got 0.99 as the output of your likelihood function, it wouldn't necessarily mean that the parameters you plugged in are the most likely ones. Some other set of parameters might give 0.999, 0.9999, or something even higher.

The only thing you can be sure of is this: if F2(θ1) > F2(θ2), then θ1 is more likely than θ2 to denote the parameters of the underlying model.
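That comparison, and the natural next step of maximizing F2, is easy to sketch in code; the grid search below is just an illustrative stand-in for a proper optimizer:

```python
# Comparing two candidate parameters by likelihood, then maximizing it
# over a grid (same data as before: n = 10 tosses, r = 5 heads).
from math import comb

def likelihood(p, n=10, r=5):
    return comb(n, r) * p**r * (1 - p)**(n - r)

theta1, theta2 = 0.5, 0.7
print(likelihood(theta1) > likelihood(theta2))  # True: 0.5 fits the data better

# A crude grid search recovers the maximum-likelihood estimate p_hat = r / n.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=likelihood)
print(p_hat)  # -> 0.5
```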


Reposted from www.cnblogs.com/yuyongsheng1990/p/9336734.html