Chapter 8 (Bayesian Statistical Inference): The MAP Rule, Point Estimation, and Hypothesis Testing

This post contains reading notes for Introduction to Probability.

The MAP Rule

The Maximum a Posteriori Probability (MAP) Rule

  • Given the observation value $x$, the MAP rule selects a value $\hat\theta$ that maximizes over $\theta$ the posterior distribution $p_{\Theta|X}(\theta|x)$ (if $\Theta$ is discrete) or $f_{\Theta|X}(\theta|x)$ (if $\Theta$ is continuous):
    $$\begin{aligned}\hat\theta&=\arg\max_{\theta}\,p_{\Theta|X}(\theta|x) &&(\Theta\ \text{discrete})\\ \hat\theta&=\arg\max_{\theta}\,f_{\Theta|X}(\theta|x) &&(\Theta\ \text{continuous})\end{aligned}$$
  • The form of the posterior distribution, as given by Bayes' rule, allows an important computational shortcut: the denominator is the same for all $\theta$ and depends only on the observed value $x$. Thus, to maximize the posterior, we only need to choose a value of $\theta$ that maximizes the numerator (calculation of the denominator is unnecessary):
    $$\begin{aligned}&p_\Theta(\theta)\,p_{X|\Theta}(x|\theta) &&(\Theta\ \text{and}\ X\ \text{discrete})\\ &p_\Theta(\theta)\,f_{X|\Theta}(x|\theta) &&(\Theta\ \text{discrete},\ X\ \text{continuous})\\ &f_\Theta(\theta)\,p_{X|\Theta}(x|\theta) &&(\Theta\ \text{continuous},\ X\ \text{discrete})\\ &f_\Theta(\theta)\,f_{X|\Theta}(x|\theta) &&(\Theta\ \text{and}\ X\ \text{continuous})\end{aligned}$$
  • If $\Theta$ takes only a finite number of values, the MAP rule minimizes (over all decision rules) the probability of selecting an incorrect hypothesis. This is true both for the unconditional probability of error and for the conditional one, given any observation value $x$.

  • If $\Theta$ is continuous, the actual evaluation of the MAP estimate $\hat\theta$ can sometimes be carried out analytically;
    • for example, if there are no constraints on $\theta$, by setting to zero the derivative of $f_{\Theta|X}(\theta|x)$, or of $\log f_{\Theta|X}(\theta|x)$, and solving for $\theta$.
    • In other cases, however, a numerical search may be required, as in the sketch below.
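When no closed form is available, the shortcut above reduces the search to maximizing prior × likelihood over a grid. Here is a minimal numerical sketch (not from the book), assuming for illustration a Beta(2, 2) prior on $\Theta$ and a binomial observation model:

```python
import numpy as np
from scipy.stats import beta, binom

# Hypothetical model (for illustration only): Theta ~ Beta(2, 2) prior,
# X | Theta = theta ~ Binomial(n, theta). We maximize prior * likelihood over a grid;
# the denominator of Bayes' rule is constant in theta, so it never needs to be computed.
def map_estimate(x, n=10, grid=np.linspace(1e-6, 1 - 1e-6, 10_001)):
    prior = beta.pdf(grid, 2, 2)          # f_Theta(theta)
    likelihood = binom.pmf(x, n, grid)    # p_{X|Theta}(x | theta)
    return grid[np.argmax(prior * likelihood)]

# For this conjugate setup the posterior is Beta(x + 2, n - x + 2), whose mode is (x + 1)/(n + 2),
# so the grid search should return roughly 8/12 for x = 7, n = 10.
print(map_estimate(x=7))
```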

Point Estimation

  • In an estimation problem, given the observed value $x$ of $X$, the posterior distribution captures all the relevant information provided by $x$. On the other hand, we may be interested in certain quantities that summarize properties of the posterior. For example, we may select a point estimate, which is a single numerical value that represents our best guess of the value of $\Theta$.

Concepts and Terminology

  • We use the term estimate to refer to the numerical value $\hat\theta$ that we choose to report on the basis of the actual observation $x$.
  • The value of $\hat\theta$ is determined by applying some function $g$ to the observation $x$, resulting in $\hat\theta = g(x)$. The random variable $\hat\Theta = g(X)$ is called an estimator.

  • We can use different functions $g$ to form different estimators; we have already seen two of the most popular ones:
    • (a) The Maximum a Posteriori Probability (MAP) estimator. Here, having observed $x$, we choose an estimate $\hat\theta$ that maximizes the posterior distribution over all $\theta$, breaking ties arbitrarily.
    • (b) The Conditional Expectation estimator, also called the Least Mean Squared Error (LMS) estimator. Here, we choose the estimate $\hat\theta=E[\Theta\mid X=x]$. It has an important property: it minimizes the mean squared error over all estimators.

  • In the absence of additional assumptions, a point estimate carries no guarantees on its accuracy. For example, the MAP estimate may lie quite far from the bulk of the posterior distribution.
  • Thus, it is usually desirable to also report some additional information, such as the conditional mean squared error $E[(\hat\Theta - \Theta)^2 \mid X = x]$; see the sketch below.
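To make these summaries concrete, here is a minimal sketch (my own, not from the book) that, given a discretized posterior PMF, reports the MAP estimate, the LMS estimate $E[\Theta\mid X=x]$, and the conditional mean squared error of any candidate estimate; the grid and probabilities are made-up numbers.

```python
import numpy as np

def summarize_posterior(theta_grid, posterior):
    """posterior[i] = p_{Theta|X}(theta_grid[i] | x); assumed already normalized."""
    map_est = theta_grid[np.argmax(posterior)]        # MAP estimate
    lms_est = np.sum(theta_grid * posterior)          # LMS estimate: E[Theta | X = x]
    def cond_mse(theta_hat):                          # E[(theta_hat - Theta)^2 | X = x]
        return np.sum((theta_hat - theta_grid) ** 2 * posterior)
    return map_est, lms_est, cond_mse

# Made-up posterior over four values of theta: the LMS estimate has the smaller conditional MSE.
theta_grid = np.array([0.0, 1.0, 2.0, 3.0])
posterior = np.array([0.1, 0.2, 0.3, 0.4])
map_est, lms_est, cond_mse = summarize_posterior(theta_grid, posterior)
print(map_est, lms_est, cond_mse(map_est), cond_mse(lms_est))
```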

Example 8.7.
Consider Example 8.2, in which Juliet is late on the first date by a random amount $X$. The distribution of $X$ is uniform over the interval $[0,\Theta]$, and $\Theta$ is an unknown random variable with a uniform prior PDF $f_\Theta$ over the interval $[0, 1]$. In that example, we saw that for $x \in [0, 1]$, the posterior PDF is
    $$f_{\Theta|X}(\theta|x)=\begin{cases}\dfrac{1}{\theta\,|\log x|}, & \text{if } x\le\theta\le 1,\\ 0, & \text{otherwise}.\end{cases}$$

  • For a given $x$, the posterior PDF $f_{\Theta|X}(\theta|x)$ is decreasing in $\theta$ over the range $[x, 1]$ of possible values of $\theta$. Thus, the MAP estimate is equal to $x$. Note that this is an "optimistic" estimate.
  • The conditional expectation estimate turns out to be less "optimistic." In particular, we have
    $$E[\Theta\mid X=x]=\int_x^1\theta\cdot\frac{1}{\theta\,|\log x|}\,d\theta=\frac{1-x}{|\log x|}$$
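As a sanity check (mine, not part of the example), the sketch below evaluates the posterior of Example 8.7 on a grid and confirms numerically that the MAP estimate is $x$ and that the conditional expectation matches $(1-x)/|\log x|$.

```python
import numpy as np

def juliet_estimates(x, num_points=200_000):
    # Posterior from Example 8.7: f_{Theta|X}(theta | x) = 1 / (theta * |log x|) for x <= theta <= 1.
    theta = np.linspace(x, 1.0, num_points)
    posterior = 1.0 / (theta * abs(np.log(x)))
    weights = posterior / posterior.sum()             # discretized (normalized) posterior
    map_est = theta[np.argmax(posterior)]             # the PDF is decreasing, so this is theta = x
    lms_est = np.sum(theta * weights)                 # E[Theta | X = x]
    return map_est, lms_est

x = 0.3
map_est, lms_est = juliet_estimates(x)
print(map_est, lms_est, (1 - x) / abs(np.log(x)))     # lms_est ~ (1 - x) / |log x|
```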

Hypothesis Testing

  • In a hypothesis testing problem, $\Theta$ takes one of $m$ values, $\theta_1, \dots, \theta_m$, where $m$ is usually a small integer; we refer to the event $\{\Theta = \theta_i\}$ as the $i$-th hypothesis and denote it by $H_i$.

  • Given the observation value $x$, the MAP rule selects the hypothesis $H_i$ with the largest posterior probability $P(\Theta=\theta_i\mid X=x)$, i.e., the one for which $p_\Theta(\theta_i)\,p_{X|\Theta}(x|\theta_i)$ (if $X$ is discrete) or $p_\Theta(\theta_i)\,f_{X|\Theta}(x|\theta_i)$ (if $X$ is continuous) is largest.


  • Once we have derived the MAP rule, we may also compute the corresponding probability of a correct decision (or error), as a function of the observation value $x$. In particular, if $g_{MAP}(x)$ is the hypothesis selected by the MAP rule when $X = x$, the probability of a correct decision is
    $$P\big(\Theta=g_{MAP}(x)\mid X=x\big)$$
  • Furthermore, if $S_i$ is the set of all $x$ such that the MAP rule selects hypothesis $H_i$, the overall probability of a correct decision is
    $$P\big(\Theta=g_{MAP}(X)\big)=\sum_iP(\Theta=\theta_i,\ X\in S_i)$$
    and the corresponding probability of error is
    $$\sum_iP(\Theta\neq\theta_i,\ X\in S_i)$$
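A minimal sketch of these computations for a made-up two-hypothesis problem (a coin whose bias is either 0.5 or 0.7, observed through the number of heads $X$ in $n$ tosses; the prior probabilities are assumed for illustration):

```python
import numpy as np
from scipy.stats import binom

# Made-up two-hypothesis problem: Theta is theta_1 = 0.5 or theta_2 = 0.7 (a coin's bias),
# and X is the number of heads in n independent tosses.
n = 10
thetas = [0.5, 0.7]
prior = np.array([0.8, 0.2])                    # assumed prior p_Theta

x_values = np.arange(n + 1)
# joint[i, x] = p_Theta(theta_i) * p_{X|Theta}(x | theta_i); the Bayes denominator is not needed.
joint = np.array([prior[i] * binom.pmf(x_values, n, thetas[i]) for i in range(2)])

g_map = np.argmax(joint, axis=0)                # MAP decision g_MAP(x) for every possible x
S = [x_values[g_map == i] for i in range(2)]    # decision regions S_1, S_2

# Overall probability of a correct decision: sum_i P(Theta = theta_i, X in S_i).
p_correct = sum(joint[i, g_map == i].sum() for i in range(2))
print("decision regions:", S)
print("P(correct) =", p_correct, " P(error) =", 1 - p_correct)
```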

Example 8.10. Signal Detection and the Matched Filter.

  • A transmitter sends one of two possible messages. Let $\Theta = 1$ or $\Theta = 2$, depending on whether the first or the second message is transmitted. We assume that the two messages are equally likely, i.e., the prior probabilities are $p_\Theta(1) = p_\Theta(2) = 1/2$.
  • In order to enhance the resiliency of the transmission with respect to noise, the transmitter sends a signal $S = (S_1, \dots, S_n)$, where each $S_i$ is a real number. If $\Theta = 1$ (respectively, $\Theta = 2$), then $S$ is a fixed sequence $(a_1, \dots, a_n)$ [respectively, $(b_1, \dots, b_n)$]. We assume that the two candidate signals have the same "energy," i.e., $a_1^2 + \cdots + a_n^2 = b_1^2 + \cdots + b_n^2$. The receiver observes the transmitted signal, corrupted by additive noise. More specifically, it obtains the observations
    $$X_i = S_i + W_i,\qquad i = 1, \dots, n,$$
    where we assume that the $W_i$ are standard normal random variables, independent of each other and independent of the signal.
  • Under the hypothesis $\Theta = 1$, the $X_i$ are independent normal random variables with mean $a_i$ and unit variance. Thus,
    $$f_{X|\Theta}(x|1)=\frac{1}{(2\pi)^{n/2}}\,e^{-\left((x_1-a_1)^2+\cdots+(x_n-a_n)^2\right)/2}$$
    Similarly,
    $$f_{X|\Theta}(x|2)=\frac{1}{(2\pi)^{n/2}}\,e^{-\left((x_1-b_1)^2+\cdots+(x_n-b_n)^2\right)/2}$$
    From Bayes' rule (the equal priors $1/2$ cancel), the probability that the first message was transmitted is
    $$\frac{\exp\big\{-\left((x_1-a_1)^2+\cdots+(x_n-a_n)^2\right)/2 \big\}}{\exp\big\{-\left((x_1-a_1)^2+\cdots+(x_n-a_n)^2\right)/2 \big\}+\exp\big\{-\left((x_1-b_1)^2+\cdots+(x_n-b_n)^2\right)/2 \big\}}$$
    Expanding each square as $x_i^2 - 2a_ix_i + a_i^2$ (or $x_i^2 - 2b_ix_i + b_i^2$), the $x_i^2$ terms are common to numerator and denominator, and by the assumption $a_1^2 + \cdots + a_n^2 = b_1^2 + \cdots + b_n^2$ so are the squared-signal terms; the expression therefore simplifies to
    $$P(\Theta=1\mid X=x)=p_{\Theta|X}(1|x)=\frac{e^{\,a_1x_1+\cdots+a_nx_n}}{e^{\,a_1x_1+\cdots+a_nx_n}+e^{\,b_1x_1+\cdots+b_nx_n}}$$
    The formula for $P(\Theta = 2\mid X = x)$ is similar, with the $a_i$ in the numerator replaced by $b_i$.
  • According to the MAP rule, we should choose the hypothesis with maximum posterior probability, which yields: select $\Theta = 1$ if $\sum_{i=1}^na_ix_i > \sum_{i=1}^nb_ix_i$, and select $\Theta = 2$ otherwise. This particular structure for deciding which signal was transmitted is called a matched filter: we "match" the received signal $(x_1, \dots, x_n)$ with each of the two candidate signals by forming the inner products $\sum_{i=1}^na_ix_i$ and $\sum_{i=1}^nb_ix_i$; we then select the hypothesis that yields the higher value (the "best match"). A short sketch follows.
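A minimal sketch of the matched filter (my own illustration, with made-up equal-energy signals): it forms the two inner products, selects the better match, and reports the posterior $P(\Theta=1\mid X=x)$ from the simplified formula above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up candidate signals with equal energy, as the example assumes.
a = np.array([1.0, -1.0, 1.0, -1.0])
b = np.array([1.0, 1.0, -1.0, -1.0])
assert np.isclose(np.sum(a ** 2), np.sum(b ** 2))

def transmit(theta):
    """Send the signal for hypothesis theta (1 or 2) through the additive standard-normal channel."""
    s = a if theta == 1 else b
    return s + rng.standard_normal(s.shape)       # X_i = S_i + W_i

def matched_filter(x):
    """MAP decision: compare the inner products <a, x> and <b, x> (the two 'matches')."""
    score_1, score_2 = a @ x, b @ x
    posterior_1 = np.exp(score_1) / (np.exp(score_1) + np.exp(score_2))   # P(Theta = 1 | X = x)
    return (1 if score_1 >= score_2 else 2), posterior_1

x = transmit(theta=1)
decision, posterior_1 = matched_filter(x)
print("decided:", decision, " P(Theta=1 | X=x) =", posterior_1)
```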
