Chapter 8 (Bayesian Statistical Inference): The MAP Rule, Point Estimation, and Hypothesis Testing

This post contains reading notes for Introduction to Probability.

The MAP Rule

The Maximum a Posteriori Probability (MAP) Rule

  • Given the observation value $x$, the MAP rule selects a value $\hat\theta$ that maximizes over $\theta$ the posterior distribution $p_{\Theta|X}(\theta|x)$ (if $\Theta$ is discrete) or $f_{\Theta|X}(\theta|x)$ (if $\Theta$ is continuous):
    $$\begin{aligned}\hat\theta&=\arg\max_{\theta}\,p_{\Theta|X}(\theta|x) &&(\Theta\ \text{discrete})\\ \hat\theta&=\arg\max_{\theta}\,f_{\Theta|X}(\theta|x) &&(\Theta\ \text{continuous})\end{aligned}$$
  • The form of the posterior distribution, as given by Bayes' rule, allows an important computational shortcut: the denominator is the same for all $\theta$ and depends only on the observed value $x$. Thus, to maximize the posterior, we only need to choose a value of $\theta$ that maximizes the numerator (calculation of the denominator is unnecessary):
    $$\begin{aligned}&p_\Theta(\theta)\,p_{X|\Theta}(x|\theta) &&(\Theta\ \text{and}\ X\ \text{discrete})\\ &p_\Theta(\theta)\,f_{X|\Theta}(x|\theta) &&(\Theta\ \text{discrete},\ X\ \text{continuous})\\ &f_\Theta(\theta)\,p_{X|\Theta}(x|\theta) &&(\Theta\ \text{continuous},\ X\ \text{discrete})\\ &f_\Theta(\theta)\,f_{X|\Theta}(x|\theta) &&(\Theta\ \text{and}\ X\ \text{continuous})\end{aligned}$$
  • If $\Theta$ takes only a finite number of values, the MAP rule minimizes (over all decision rules) the probability of selecting an incorrect hypothesis. This is true both for the unconditional probability of error and for the conditional one, given any observation value $x$.

  • If $\Theta$ is continuous, the actual evaluation of the MAP estimate $\hat\theta$ can sometimes be carried out analytically;
    • for example, if there are no constraints on $\theta$, by setting to zero the derivative of $f_{\Theta|X}(\theta|x)$, or of $\log f_{\Theta|X}(\theta|x)$, and solving for $\theta$.
    • In other cases, however, a numerical search may be required, as in the sketch below.
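When no closed form is available, the shortcut above reduces the search to maximizing prior × likelihood over a grid. Here is a minimal numerical sketch (not from the book), assuming for illustration a Beta(2, 2) prior on $\Theta$ and a binomial observation model:

```python
import numpy as np
from scipy.stats import beta, binom

# Hypothetical model (for illustration only): Theta ~ Beta(2, 2) prior,
# X | Theta = theta ~ Binomial(n, theta). We maximize prior * likelihood over a grid;
# the denominator of Bayes' rule is constant in theta, so it never needs to be computed.
def map_estimate(x, n=10, grid=np.linspace(1e-6, 1 - 1e-6, 10_001)):
    prior = beta.pdf(grid, 2, 2)          # f_Theta(theta)
    likelihood = binom.pmf(x, n, grid)    # p_{X|Theta}(x | theta)
    return grid[np.argmax(prior * likelihood)]

# For this conjugate setup the posterior is Beta(x + 2, n - x + 2), whose mode is (x + 1)/(n + 2),
# so the grid search should return roughly 8/12 for x = 7, n = 10.
print(map_estimate(x=7))
```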

Point Estimation

  • In an estimation problem, given the observed value $x$ of $X$, the posterior distribution captures all the relevant information provided by $x$. On the other hand, we may be interested in certain quantities that summarize properties of the posterior. For example, we may select a point estimate, which is a single numerical value that represents our best guess of the value of $\Theta$.

Concepts and Terminology

  • We use the term estimate to refer to the numerical value $\hat\theta$ that we choose to report on the basis of the actual observation $x$.
  • The value of $\hat\theta$ is determined by applying some function $g$ to the observation $x$, resulting in $\hat\theta = g(x)$. The random variable $\hat\Theta = g(X)$ is called an estimator.

  • We can use different functions $g$ to form different estimators; we have already seen two of the most popular ones:
    • (a) The Maximum a Posteriori Probability (MAP) estimator. Here, having observed $x$, we choose an estimate $\hat\theta$ that maximizes the posterior distribution over all $\theta$, breaking ties arbitrarily.
    • (b) The Conditional Expectation estimator, also called the Least Mean Squared Error (LMS) estimator. Here, we choose the estimate $\hat\theta=E[\Theta\mid X=x]$. It has an important property: it minimizes the mean squared error over all estimators.

  • In the absence of additional assumptions, a point estimate carries no guarantees on its accuracy. For example, the MAP estimate may lie quite far from the bulk of the posterior distribution.
  • Thus, it is usually desirable to also report some additional information, such as the conditional mean squared error $E[(\hat\Theta - \Theta)^2 \mid X = x]$; see the sketch below.
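To make these summaries concrete, here is a minimal sketch (my own, not from the book) that, given a discretized posterior PMF, reports the MAP estimate, the LMS estimate $E[\Theta\mid X=x]$, and the conditional mean squared error of any candidate estimate; the grid and probabilities are made-up numbers.

```python
import numpy as np

def summarize_posterior(theta_grid, posterior):
    """posterior[i] = p_{Theta|X}(theta_grid[i] | x); assumed already normalized."""
    map_est = theta_grid[np.argmax(posterior)]        # MAP estimate
    lms_est = np.sum(theta_grid * posterior)          # LMS estimate: E[Theta | X = x]
    def cond_mse(theta_hat):                          # E[(theta_hat - Theta)^2 | X = x]
        return np.sum((theta_hat - theta_grid) ** 2 * posterior)
    return map_est, lms_est, cond_mse

# Made-up posterior over four values of theta: the LMS estimate has the smaller conditional MSE.
theta_grid = np.array([0.0, 1.0, 2.0, 3.0])
posterior = np.array([0.1, 0.2, 0.3, 0.4])
map_est, lms_est, cond_mse = summarize_posterior(theta_grid, posterior)
print(map_est, lms_est, cond_mse(map_est), cond_mse(lms_est))
```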

Example 8.7.
Consider Example 8.2, in which Juliet is late on the first date by a random amount $X$. The distribution of $X$ is uniform over the interval $[0,\Theta]$, and $\Theta$ is an unknown random variable with a uniform prior PDF $f_\Theta$ over the interval $[0, 1]$. In that example, we saw that for $x \in [0, 1]$, the posterior PDF is
    $$f_{\Theta|X}(\theta|x)=\begin{cases}\dfrac{1}{\theta\,|\log x|}, & \text{if } x\le\theta\le 1,\\ 0, & \text{otherwise}.\end{cases}$$

  • For a given $x$, the posterior PDF $f_{\Theta|X}(\theta|x)$ is decreasing in $\theta$ over the range $[x, 1]$ of possible values of $\theta$. Thus, the MAP estimate is equal to $x$. Note that this is an "optimistic" estimate.
  • The conditional expectation estimate turns out to be less "optimistic." In particular, we have
    $$E[\Theta\mid X=x]=\int_x^1\theta\cdot\frac{1}{\theta\,|\log x|}\,d\theta=\frac{1-x}{|\log x|}$$
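As a sanity check (mine, not part of the example), the sketch below evaluates the posterior of Example 8.7 on a grid and confirms numerically that the MAP estimate is $x$ and that the conditional expectation matches $(1-x)/|\log x|$.

```python
import numpy as np

def juliet_estimates(x, num_points=200_000):
    # Posterior from Example 8.7: f_{Theta|X}(theta | x) = 1 / (theta * |log x|) for x <= theta <= 1.
    theta = np.linspace(x, 1.0, num_points)
    posterior = 1.0 / (theta * abs(np.log(x)))
    weights = posterior / posterior.sum()             # discretized (normalized) posterior
    map_est = theta[np.argmax(posterior)]             # the PDF is decreasing, so this is theta = x
    lms_est = np.sum(theta * weights)                 # E[Theta | X = x]
    return map_est, lms_est

x = 0.3
map_est, lms_est = juliet_estimates(x)
print(map_est, lms_est, (1 - x) / abs(np.log(x)))     # lms_est ~ (1 - x) / |log x|
```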

Hypothesis Testing

  • In a hypothesis testing problem, $\Theta$ takes one of $m$ values, $\theta_1, \dots, \theta_m$, where $m$ is usually a small integer; we refer to the event $\{\Theta = \theta_i\}$ as the $i$-th hypothesis and denote it by $H_i$.

  • Given the observation value $x$, the MAP rule selects the hypothesis $H_i$ with the largest posterior probability $P(\Theta=\theta_i\mid X=x)$, i.e., the one for which $p_\Theta(\theta_i)\,p_{X|\Theta}(x|\theta_i)$ (if $X$ is discrete) or $p_\Theta(\theta_i)\,f_{X|\Theta}(x|\theta_i)$ (if $X$ is continuous) is largest.


  • Once we have derived the MAP rule, we may also compute the corresponding probability of a correct decision (or error), as a function of the observation value $x$. In particular, if $g_{MAP}(x)$ is the hypothesis selected by the MAP rule when $X = x$, the probability of a correct decision is
    $$P\big(\Theta=g_{MAP}(x)\mid X=x\big)$$
  • Furthermore, if $S_i$ is the set of all $x$ such that the MAP rule selects hypothesis $H_i$, the overall probability of a correct decision is
    $$P\big(\Theta=g_{MAP}(X)\big)=\sum_iP(\Theta=\theta_i,\ X\in S_i)$$
    and the corresponding probability of error is
    $$\sum_iP(\Theta\neq\theta_i,\ X\in S_i)$$
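A minimal sketch of these computations for a made-up two-hypothesis problem (a coin whose bias is either 0.5 or 0.7, observed through the number of heads $X$ in $n$ tosses; the prior probabilities are assumed for illustration):

```python
import numpy as np
from scipy.stats import binom

# Made-up two-hypothesis problem: Theta is theta_1 = 0.5 or theta_2 = 0.7 (a coin's bias),
# and X is the number of heads in n independent tosses.
n = 10
thetas = [0.5, 0.7]
prior = np.array([0.8, 0.2])                    # assumed prior p_Theta

x_values = np.arange(n + 1)
# joint[i, x] = p_Theta(theta_i) * p_{X|Theta}(x | theta_i); the Bayes denominator is not needed.
joint = np.array([prior[i] * binom.pmf(x_values, n, thetas[i]) for i in range(2)])

g_map = np.argmax(joint, axis=0)                # MAP decision g_MAP(x) for every possible x
S = [x_values[g_map == i] for i in range(2)]    # decision regions S_1, S_2

# Overall probability of a correct decision: sum_i P(Theta = theta_i, X in S_i).
p_correct = sum(joint[i, g_map == i].sum() for i in range(2))
print("decision regions:", S)
print("P(correct) =", p_correct, " P(error) =", 1 - p_correct)
```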

Example 8.10. Signal Detection and the Matched Filter.

  • A transmitter sends one of two possible messages. Let $\Theta = 1$ or $\Theta = 2$, depending on whether the first or the second message is transmitted. We assume that the two messages are equally likely, i.e., the prior probabilities are $p_\Theta(1) = p_\Theta(2) = 1/2$.
  • In order to enhance the resiliency of the transmission with respect to noise, the transmitter sends a signal $S = (S_1, \dots, S_n)$, where each $S_i$ is a real number. If $\Theta = 1$ (respectively, $\Theta = 2$), then $S$ is a fixed sequence $(a_1, \dots, a_n)$ [respectively, $(b_1, \dots, b_n)$]. We assume that the two candidate signals have the same "energy," i.e., $a_1^2 + \cdots + a_n^2 = b_1^2 + \cdots + b_n^2$. The receiver observes the transmitted signal, corrupted by additive noise. More specifically, it obtains the observations
    $$X_i = S_i + W_i,\qquad i = 1, \dots, n,$$
    where we assume that the $W_i$ are standard normal random variables, independent of each other and independent of the signal.
  • Under the hypothesis $\Theta = 1$, the $X_i$ are independent normal random variables with mean $a_i$ and unit variance. Thus,
    $$f_{X|\Theta}(x|1)=\frac{1}{(2\pi)^{n/2}}\,e^{-\left((x_1-a_1)^2+\cdots+(x_n-a_n)^2\right)/2}$$
    Similarly,
    $$f_{X|\Theta}(x|2)=\frac{1}{(2\pi)^{n/2}}\,e^{-\left((x_1-b_1)^2+\cdots+(x_n-b_n)^2\right)/2}$$
    From Bayes' rule (the equal priors $1/2$ cancel), the probability that the first message was transmitted is
    $$\frac{\exp\big\{-\left((x_1-a_1)^2+\cdots+(x_n-a_n)^2\right)/2 \big\}}{\exp\big\{-\left((x_1-a_1)^2+\cdots+(x_n-a_n)^2\right)/2 \big\}+\exp\big\{-\left((x_1-b_1)^2+\cdots+(x_n-b_n)^2\right)/2 \big\}}$$
    Expanding each square as $x_i^2 - 2a_ix_i + a_i^2$ (or $x_i^2 - 2b_ix_i + b_i^2$), the $x_i^2$ terms are common to numerator and denominator, and by the assumption $a_1^2 + \cdots + a_n^2 = b_1^2 + \cdots + b_n^2$ so are the squared-signal terms; the expression therefore simplifies to
    $$P(\Theta=1\mid X=x)=p_{\Theta|X}(1|x)=\frac{e^{\,a_1x_1+\cdots+a_nx_n}}{e^{\,a_1x_1+\cdots+a_nx_n}+e^{\,b_1x_1+\cdots+b_nx_n}}$$
    The formula for $P(\Theta = 2\mid X = x)$ is similar, with the $a_i$ in the numerator replaced by $b_i$.
  • According to the MAP rule, we should choose the hypothesis with maximum posterior probability, which yields: select $\Theta = 1$ if $\sum_{i=1}^na_ix_i > \sum_{i=1}^nb_ix_i$, and select $\Theta = 2$ otherwise. This particular structure for deciding which signal was transmitted is called a matched filter: we "match" the received signal $(x_1, \dots, x_n)$ with each of the two candidate signals by forming the inner products $\sum_{i=1}^na_ix_i$ and $\sum_{i=1}^nb_ix_i$; we then select the hypothesis that yields the higher value (the "best match"). A short sketch follows.
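A minimal sketch of the matched filter (my own illustration, with made-up equal-energy signals): it forms the two inner products, selects the better match, and reports the posterior $P(\Theta=1\mid X=x)$ from the simplified formula above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up candidate signals with equal energy, as the example assumes.
a = np.array([1.0, -1.0, 1.0, -1.0])
b = np.array([1.0, 1.0, -1.0, -1.0])
assert np.isclose(np.sum(a ** 2), np.sum(b ** 2))

def transmit(theta):
    """Send the signal for hypothesis theta (1 or 2) through the additive standard-normal channel."""
    s = a if theta == 1 else b
    return s + rng.standard_normal(s.shape)       # X_i = S_i + W_i

def matched_filter(x):
    """MAP decision: compare the inner products <a, x> and <b, x> (the two 'matches')."""
    score_1, score_2 = a @ x, b @ x
    posterior_1 = np.exp(score_1) / (np.exp(score_1) + np.exp(score_2))   # P(Theta = 1 | X = x)
    return (1 if score_1 >= score_2 else 2), posterior_1

x = transmit(theta=1)
decision, posterior_1 = matched_filter(x)
print("decided:", decision, " P(Theta=1 | X=x) =", posterior_1)
```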
