Chapter 2 (Discrete Random Variables): Independence

These are reading notes on *Introduction to Probability*.

Independence of a Random Variable from an Event

  • The idea is that knowing the occurrence of the conditioning event provides no new information on the value of the random variable. More formally, we say that the random variable $X$ is independent of the event $A$ if
    $$P(X = x \text{ and } A) = P(X = x)\,P(A) = p_X(x)\,P(A), \qquad \text{for all } x.$$
    As long as $P(A) > 0$, independence is the same as the condition
    $$p_{X|A}(x) = p_X(x), \qquad \text{for all } x.$$
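As a quick numerical sketch (not from the book), the following Python snippet enumerates a small experiment, a fair die $X$ together with an independent fair coin, and checks both forms of the definition for the event $A = \{\text{the coin lands heads}\}$. The variable names (`p_X`, `P_A`, and so on) are illustrative choices.

```python
from fractions import Fraction
from itertools import product

# Experiment: a fair die X and an independent fair coin; A = {coin shows heads}.
outcomes = list(product(range(1, 7), ["H", "T"]))     # 12 equally likely outcomes
P = Fraction(1, len(outcomes))                        # probability of each outcome

p_X = {x: sum(P for d, c in outcomes if d == x) for x in range(1, 7)}
P_A = sum(P for d, c in outcomes if c == "H")

for x in range(1, 7):
    joint = sum(P for d, c in outcomes if d == x and c == "H")   # P(X = x and A)
    assert joint == p_X[x] * P_A          # P(X = x and A) = p_X(x) P(A)
    assert joint / P_A == p_X[x]          # p_{X|A}(x) = p_X(x), since P(A) > 0
print("X is independent of A for every value x")
```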

Independence of Random Variables

  • We say that two random variables $X$ and $Y$ are independent if
    $$p_{X,Y}(x, y) = p_X(x)\,p_Y(y), \qquad \text{for all } x, y,$$
    which is equivalent to the condition
    $$p_{X|Y}(x \mid y) = p_X(x), \qquad \text{for all } y \text{ with } p_Y(y) > 0 \text{ and all } x$$
    (a small enumeration check of these conditions appears after this list).
  • $X$ and $Y$ are said to be conditionally independent, given a positive probability event $A$, if
    $$P(X = x, Y = y \mid A) = P(X = x \mid A)\,P(Y = y \mid A), \qquad \text{for all } x \text{ and } y,$$
    or, in this chapter's notation,
    $$p_{X,Y|A}(x, y) = p_{X|A}(x)\,p_{Y|A}(y), \qquad \text{for all } x \text{ and } y.$$
    Once more, this is equivalent to
    $$p_{X|Y,A}(x \mid y) = p_{X|A}(x), \qquad \text{for all } x \text{ and } y \text{ such that } p_{Y|A}(y) > 0.$$
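A minimal enumeration check of the two equivalent independence conditions, using a toy joint PMF built as a product of marginals (the particular PMFs are illustrative choices, not from the text):

```python
from fractions import Fraction

# Marginal PMFs of two independent random variables X and Y.
p_X = {0: Fraction(1, 2), 1: Fraction(1, 2)}
p_Y = {0: Fraction(1, 3), 1: Fraction(2, 3)}
# Their joint PMF factorizes by construction: p_{X,Y}(x, y) = p_X(x) p_Y(y).
p_XY = {(x, y): p_X[x] * p_Y[y] for x in p_X for y in p_Y}

# Defining condition: the joint PMF is the product of the marginals.
assert all(p_XY[x, y] == p_X[x] * p_Y[y] for x in p_X for y in p_Y)

# Equivalent condition: p_{X|Y}(x | y) = p_X(x) whenever p_Y(y) > 0.
assert all(p_XY[x, y] / p_Y[y] == p_X[x]
           for y in p_Y if p_Y[y] > 0 for x in p_X)

# A joint PMF that does not factorize fails the test, e.g. perfectly correlated bits:
q_XY = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2), (0, 1): 0, (1, 0): 0}
q_X = {0: Fraction(1, 2), 1: Fraction(1, 2)}
q_Y = {0: Fraction(1, 2), 1: Fraction(1, 2)}
assert q_XY[0, 0] != q_X[0] * q_Y[0]      # 1/2 != 1/4, so those X and Y are dependent
print("independence conditions verified on the toy PMFs")
```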

  • If $X$ and $Y$ are independent random variables, then
    $$E[XY] = E[X]\,E[Y], \qquad E[g(X)\,h(Y)] = E[g(X)]\,E[h(Y)].$$

In fact, the second formulation follows immediately once we realize that if $X$ and $Y$ are independent, then the same is true for $g(X)$ and $h(Y)$.

  • Consider now the sum $X + Y$ of two independent random variables $X$ and $Y$, and let us calculate its variance. Since the variance of a random variable is unchanged when the random variable is shifted by a constant, it is convenient to work with the zero-mean random variables $\tilde X = X - E[X]$ and $\tilde Y = Y - E[Y]$. We have
    $$\begin{aligned}
    \mathrm{var}(X + Y) &= \mathrm{var}(\tilde X + \tilde Y) \\
    &= E[(\tilde X + \tilde Y)^2] \\
    &= E[\tilde X^2] + 2E[\tilde X \tilde Y] + E[\tilde Y^2] \\
    &= E[\tilde X^2] + E[\tilde Y^2] \\
    &= \mathrm{var}(\tilde X) + \mathrm{var}(\tilde Y) \\
    &= \mathrm{var}(X) + \mathrm{var}(Y),
    \end{aligned}$$
    where the cross term drops out because $\tilde X$ and $\tilde Y$ are independent and zero-mean, so $E[\tilde X \tilde Y] = E[\tilde X]\,E[\tilde Y] = 0$. In conclusion, the variance of the sum of two independent random variables is equal to the sum of their variances.
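Both identities, $E[XY] = E[X]E[Y]$ and $\mathrm{var}(X+Y) = \mathrm{var}(X) + \mathrm{var}(Y)$, are easy to check by simulation. A rough Monte Carlo sketch (the particular distributions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
X = rng.integers(1, 7, size=n)        # fair die: E[X] = 3.5, var(X) = 35/12
Y = rng.binomial(1, 0.3, size=n)      # independent Bernoulli(0.3): var(Y) = 0.21

print(np.mean(X * Y), np.mean(X) * np.mean(Y))   # both close to 3.5 * 0.3 = 1.05
print(np.var(X + Y), np.var(X) + np.var(Y))      # both close to 35/12 + 0.21
```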

Independence of Several Random Variables

  • The preceding discussion extends naturally to the case of more than two random variables. For example, three random variables $X$, $Y$, and $Z$ are said to be independent if
    $$p_{X,Y,Z}(x, y, z) = p_X(x)\,p_Y(y)\,p_Z(z), \qquad \text{for all } x, y, z.$$
  • If $X$, $Y$, and $Z$ are independent random variables, then any three random variables of the form $f(X)$, $g(Y)$, and $h(Z)$ are also independent. Similarly, any two random variables of the form $g(X, Y)$ and $h(Z)$ are independent. On the other hand, two random variables of the form $g(X, Y)$ and $h(Y, Z)$ are usually not independent, because they are both affected by $Y$ (see the sketch after this list).
    • Properties such as the above are intuitively clear if we interpret independence in terms of noninteracting (sub)experiments. They can be formally verified, but this is sometimes tedious.
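The last claim, that $g(X, Y)$ and $h(Y, Z)$ are usually not independent, can be seen by brute-force enumeration. A small sketch with three independent fair bits and the arbitrarily chosen functions $g(x, y) = x + y$ and $h(y, z) = y + z$:

```python
from fractions import Fraction
from itertools import product

P = Fraction(1, 8)                     # X, Y, Z independent fair bits
joint = {}                             # joint PMF of (g, h) = (X + Y, Y + Z)
for x, y, z in product([0, 1], repeat=3):
    g, h = x + y, y + z
    joint[g, h] = joint.get((g, h), 0) + P

p_g = {g: sum(v for (gg, _), v in joint.items() if gg == g) for g in range(3)}
p_h = {h: sum(v for (_, hh), v in joint.items() if hh == h) for h in range(3)}

# Independence would require joint[g, h] == p_g[g] * p_h[h] for every pair; it fails:
print(joint.get((0, 2), Fraction(0)), "vs", p_g[0] * p_h[2])   # 0 vs 1/16
```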

Variance of the Sum of Independent Random Variables

  • If $X_1, X_2, \ldots, X_n$ are independent random variables, then
    $$\mathrm{var}(X_1 + X_2 + \cdots + X_n) = \mathrm{var}(X_1) + \mathrm{var}(X_2) + \cdots + \mathrm{var}(X_n).$$

Example 2.20. Variance of the Binomial and the Poisson.
We consider $n$ independent coin tosses, with each toss having probability $p$ of coming up a head. For each $i$, we let $X_i$ be the Bernoulli random variable which is equal to 1 if the $i$th toss comes up a head, and 0 otherwise.

  • Then $X = X_1 + X_2 + \cdots + X_n$ is a binomial random variable. Its mean is $E[X] = np$. By the independence of the coin tosses, the random variables $X_1, \ldots, X_n$ are independent, and
    $$\mathrm{var}(X) = \sum_{i=1}^n \mathrm{var}(X_i) = np(1-p).$$
  • Let $Y$ be a Poisson random variable with parameter $\lambda$, so that $E[Y] = \lambda$. Then
    $$\begin{aligned}
    E[Y^2] &= \sum_{k=1}^\infty k^2 e^{-\lambda}\frac{\lambda^k}{k!}
    = \lambda \sum_{k=1}^\infty k\,\frac{e^{-\lambda}\lambda^{k-1}}{(k-1)!} \\
    &= \lambda \sum_{m=0}^\infty (m+1)\,\frac{e^{-\lambda}\lambda^m}{m!}
    = \lambda\bigl(E[Y] + 1\bigr) = \lambda(\lambda + 1),
    \end{aligned}$$
    from which
    $$\mathrm{var}(Y) = E[Y^2] - (E[Y])^2 = \lambda(\lambda + 1) - \lambda^2 = \lambda.$$
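Both variance formulas can be checked numerically. A small sketch computing the moments directly from the PMFs (the parameter values are arbitrary):

```python
import math

# Binomial(n, p): moments from the PMF, compared with np and np(1 - p).
n, p = 10, 0.3
pmf = [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum(k**2 * pk for k, pk in enumerate(pmf)) - mean**2
print(mean, n * p)                       # 3.0  3.0
print(var, n * p * (1 - p))              # 2.1  2.1

# Poisson(lam): build P(Y = k) recursively and truncate deep in the tail.
lam, K = 4.0, 100
pk, m1, m2 = math.exp(-lam), 0.0, 0.0    # P(Y = 0) and running first/second moments
for k in range(1, K):
    pk *= lam / k                        # P(Y = k) = P(Y = k-1) * lam / k
    m1 += k * pk
    m2 += k**2 * pk
print(m1, m2 - m1**2, lam)               # mean ≈ 4, variance ≈ 4 = lam
```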

Example 2.21. Mean and Variance of the Sample Mean.

  • Let $X_1, \ldots, X_n$ be independent, identically distributed random variables with common mean $E[X]$ and variance $\mathrm{var}(X)$. The sample mean $S_n$ is defined as
    $$S_n = \frac{X_1 + \cdots + X_n}{n},$$
    so that
    $$E[S_n] = \sum_{i=1}^n \frac{1}{n}E[X_i] = E[X], \qquad \mathrm{var}(S_n) = \sum_{i=1}^n \frac{1}{n^2}\mathrm{var}(X_i) = \frac{\mathrm{var}(X)}{n}.$$
  • The sample mean $S_n$ can be viewed as a "good" estimate of the true mean $E[X]$: it has the correct expected value, and its accuracy, as reflected by its variance, improves as the sample size $n$ increases (see the simulation below).
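The $1/n$ decay of $\mathrm{var}(S_n)$ is easy to see in simulation. A rough sketch using a fair die for the $X_i$ (an arbitrary choice with $\mathrm{var}(X) = 35/12$):

```python
import numpy as np

rng = np.random.default_rng(1)
var_X = 35 / 12                                  # variance of a single fair die
for n in (1, 10, 100):
    rolls = rng.integers(1, 7, size=(100_000, n))
    S_n = rolls.mean(axis=1)                     # 100,000 realizations of the sample mean
    print(n, round(S_n.mean(), 3), round(S_n.var(), 4), round(var_X / n, 4))
```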

Problem 40.
A particular professor is known for his arbitrary grading policies. Each paper receives a grade from the set $\{A, A-, B+, B, B-, C+\}$, with equal probability, independent of other papers. How many papers do you expect to hand in before you receive each possible grade at least once?

SOLUTION

  • Associate a success with a paper that receives a grade that has not been received before. Let $X_i$ be the number of papers between the $i$th success and the $(i+1)$st success. Then we have $X = 1 + \sum_{i=1}^5 X_i$, and hence
    $$E[X] = 1 + \sum_{i=1}^5 E[X_i].$$
  • After receiving $i$ different grades so far ($i$ successes), each subsequent paper has probability $(6-i)/6$ of receiving a grade that has not been received before. Therefore, the random variable $X_i$ is geometric with parameter $p_i = (6-i)/6$, so $E[X_i] = 6/(6-i)$. It follows that
    $$E[X] = 1 + \sum_{i=1}^5 \frac{6}{6-i} = 1 + 6\sum_{i=1}^5 \frac{1}{i} = 14.7.$$
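A quick check of the answer, both exactly and by simulation (the function name is just an illustration):

```python
import random

# Exact value: E[X] = 1 + 6 * (1/1 + 1/2 + 1/3 + 1/4 + 1/5) = 14.7
print(1 + 6 * sum(1 / i for i in range(1, 6)))

def average_papers(trials=100_000):
    """Monte Carlo estimate of the expected number of papers until all 6 grades appear."""
    total = 0
    for _ in range(trials):
        seen, count = set(), 0
        while len(seen) < 6:
            seen.add(random.randrange(6))   # each paper gets one of 6 equally likely grades
            count += 1
        total += count
    return total / trials

print(average_papers())                     # ≈ 14.7
```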

Problem 41.
You drive to work 250 days a year, and with probability $p$ you get a traffic ticket on any given day, independent of other days. Let $X$ be the total number of tickets you get in the year. Suppose you don't know the probability $p$ of getting a ticket, but you got 5 tickets during the year, and you estimate $p$ by the sample mean
$$\hat p = \frac{5}{250} = 0.02.$$
What is the range of possible values of $p$, assuming that the difference between $p$ and the sample mean $\hat p$ is within 5 times the standard deviation of the sample mean?

SOLUTION

  • The variance of the sample mean is
    $$\frac{p(1-p)}{250}.$$
    Then we have
    $$(p - 0.02)^2 \leq \frac{25\,p(1-p)}{250},$$
    which gives, approximately, $p \in [0.0029, 0.1243]$.
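Expanding the inequality gives the quadratic condition $11p^2 - 1.4p + 0.004 \leq 0$, whose two roots delimit the allowed range of $p$. A short sketch computing them:

```python
import math

# (p - 0.02)^2 <= 25 p(1 - p)/250  is equivalent to  11 p^2 - 1.4 p + 0.004 <= 0.
a, b, c = 11.0, -1.4, 0.004
disc = math.sqrt(b * b - 4 * a * c)
lo, hi = (-b - disc) / (2 * a), (-b + disc) / (2 * a)
print(lo, hi)        # roughly 0.0029 and 0.1243
```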

Problem 45.
Let $X_1, \ldots, X_n$ be independent random variables, and let $X = X_1 + \cdots + X_n$ be their sum.

  • (a) Suppose that each $X_i$ is Bernoulli with parameter $p_i$, and that $p_1, \ldots, p_n$ are chosen so that the mean of $X$ is a given $\mu > 0$. Show that the variance of $X$ is maximized if the $p_i$ are chosen to be all equal to $\mu/n$.
  • (b) Suppose that each $X_i$ is geometric with parameter $p_i$, and that $p_1, \ldots, p_n$ are chosen so that the mean of $X$ is a given $\mu > 0$. Show that the variance of $X$ is minimized if the $p_i$ are chosen to be all equal to $n/\mu$.

SOLUTION

  • (a) We have
    $$\mathrm{var}(X) = \sum_{i=1}^n \mathrm{var}(X_i) = \sum_{i=1}^n p_i(1 - p_i) = \mu - \sum_{i=1}^n p_i^2,$$
    so maximizing the variance is equivalent to minimizing $\sum_{i=1}^n p_i^2$. Using the constraint $\sum_{i=1}^n p_i = \mu$, it can be seen that
    $$\sum_{i=1}^n p_i^2 = \sum_{i=1}^n (\mu/n)^2 + \sum_{i=1}^n (p_i - \mu/n)^2,$$
    so $\sum_{i=1}^n p_i^2$ is minimized when $p_i = \mu/n$ for all $i$.
  • (b) We have
    $$\mu = \sum_{i=1}^n E[X_i] = \sum_{i=1}^n \frac{1}{p_i}$$
    and
    $$\mathrm{var}(X) = \sum_{i=1}^n \mathrm{var}(X_i) = \sum_{i=1}^n \frac{1 - p_i}{p_i^2}.$$
    Introducing the change of variables $y_i = 1/p_i = E[X_i]$, we see that the constraint becomes
    $$\sum_{i=1}^n y_i = \mu,$$
    and that we must minimize
    $$\sum_{i=1}^n y_i(y_i - 1) = \sum_{i=1}^n y_i^2 - \mu,$$
    subject to that constraint. This is the same problem as the one of part (a), so the method of proof given there applies: the minimum is attained when $y_i = \mu/n$ for all $i$, i.e., when $p_i = n/\mu$ (a numerical spot check follows).
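As a sanity check of both parts (not a proof), one can compare the variance at the equal-parameter choice with the variance at randomly drawn parameters satisfying the same mean constraint. A rough sketch with the arbitrary values $n = 5$, $\mu = 2$ for part (a) and $\mu = 10$ for part (b):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5

# Part (a): Bernoulli parameters p_i >= 0 with sum mu_a; variance = sum p_i (1 - p_i).
mu_a = 2.0
p_eq = np.full(n, mu_a / n)
var_eq = np.sum(p_eq * (1 - p_eq))
for _ in range(10_000):
    p = rng.dirichlet(np.ones(n)) * mu_a              # random nonnegative p_i summing to mu_a
    if np.all(p <= 1):                                # keep only valid Bernoulli parameters
        assert np.sum(p * (1 - p)) <= var_eq + 1e-12
print("equal p_i = mu/n maximize the Bernoulli variance:", var_eq)

# Part (b): means y_i = 1/p_i >= 1 with sum mu_b; variance = sum (y_i^2 - y_i).
mu_b = 10.0
y_eq = np.full(n, mu_b / n)
var_eq_b = np.sum(y_eq**2 - y_eq)
for _ in range(10_000):
    y = 1.0 + rng.dirichlet(np.ones(n)) * (mu_b - n)  # random y_i >= 1 summing to mu_b
    assert np.sum(y**2 - y) >= var_eq_b - 1e-12
print("equal p_i = n/mu minimize the geometric variance:", var_eq_b)
```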

Problem 46. Entropy and uncertainty.

  • Consider a random variable $X$ that can take $n$ values, $x_1, \ldots, x_n$, with corresponding probabilities $p_1, \ldots, p_n$. The entropy of $X$ is defined to be
    $$H(X) = -\sum_{i=1}^n p_i \log p_i.$$
    (All logarithms in this problem are with respect to base two.)
    • The entropy $H(X)$ provides a measure of the uncertainty about the value of $X$. To get a sense of this, note that $H(X) \geq 0$ and that $H(X)$ is very close to 0 when $X$ is "nearly deterministic," i.e., takes one of its possible values with probability very close to 1 (since we have $p \log p \approx 0$ if either $p \approx 0$ or $p \approx 1$).
    • The notion of entropy is fundamental in information theory. For example, it can be shown that $H(X)$ is a lower bound to the average number of yes-no questions (such as "is $X = x_1$?" or "is $X < x_5$?") that must be asked in order to determine the value of $X$. Furthermore, if $k$ is the average number of questions required to determine the value of a string of independent identically distributed random variables $X_1, X_2, \ldots, X_n$, then, with a suitable strategy, $k/n$ can be made as close to $H(X)$ as desired, when $n$ is large.
  • (a) Show that if $q_1, \ldots, q_n$ are nonnegative numbers such that $\sum_{i=1}^n q_i = 1$, then
    $$H(X) \leq -\sum_{i=1}^n p_i \log q_i,$$
    with equality if and only if $p_i = q_i$ for all $i$. As a special case, show that $H(X) \leq \log n$, with equality if and only if $p_i = 1/n$ for all $i$.
    [Hint: Use the inequality $\ln\alpha \leq \alpha - 1$, for $\alpha > 0$, which holds with equality if and only if $\alpha = 1$.]
  • (b) Let $X$ and $Y$ be random variables taking a finite number of values, and having joint PMF $p_{X,Y}(x, y)$. Define
    $$I(X, Y) = \sum_x \sum_y p_{X,Y}(x, y) \log\left(\frac{p_{X,Y}(x, y)}{p_X(x)\,p_Y(y)}\right).$$
    Show that $I(X, Y) \geq 0$, and that $I(X, Y) = 0$ if and only if $X$ and $Y$ are independent.
  • (c) Show that
    $$I(X, Y) = H(X) + H(Y) - H(X, Y),$$
    where
    $$H(X, Y) = -\sum_x \sum_y p_{X,Y}(x, y) \log p_{X,Y}(x, y), \qquad H(X) = -\sum_x p_X(x) \log p_X(x), \qquad H(Y) = -\sum_y p_Y(y) \log p_Y(y).$$
  • (d) Show that
    $$I(X, Y) = H(X) - H(X \mid Y),$$
    where
    $$H(X \mid Y) = -\sum_y p_Y(y) \sum_x p_{X|Y}(x \mid y) \log p_{X|Y}(x \mid y).$$
    [Note that $H(X \mid Y)$ may be viewed as the conditional entropy of $X$ given $Y$, that is, the entropy of the conditional distribution of $X$, given that $Y = y$, averaged over all possible values $y$. Thus, the quantity $I(X, Y) = H(X) - H(X \mid Y)$ is the reduction in the entropy (uncertainty) of $X$ when $Y$ becomes known. It can therefore be interpreted as the information about $X$ that is conveyed by $Y$, and is called the mutual information of $X$ and $Y$.] A numerical check of parts (b)-(d) appears below.
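Parts (b)-(d) can be verified numerically on any small joint PMF. A sketch with an arbitrary two-by-two joint PMF (the numbers are illustrative only):

```python
import math

def entropy(pmf):
    """Entropy in bits of a PMF given as a dict of probabilities."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

p_XY = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
p_X = {x: sum(v for (xx, _), v in p_XY.items() if xx == x) for x in (0, 1)}
p_Y = {y: sum(v for (_, yy), v in p_XY.items() if yy == y) for y in (0, 1)}

# Mutual information I(X, Y), directly from the definition in part (b).
I = sum(v * math.log2(v / (p_X[x] * p_Y[y])) for (x, y), v in p_XY.items() if v > 0)
print(I >= 0)                                                 # part (b): nonnegative

# Part (c): I(X, Y) = H(X) + H(Y) - H(X, Y).
print(abs(I - (entropy(p_X) + entropy(p_Y) - entropy(p_XY))) < 1e-12)

# Part (d): I(X, Y) = H(X) - H(X | Y), with H(X | Y) written as a double sum.
H_X_given_Y = -sum(v * math.log2(v / p_Y[y]) for (x, y), v in p_XY.items() if v > 0)
print(abs(I - (entropy(p_X) - H_X_given_Y)) < 1e-12)
```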

The problems themselves are not difficult, but the conclusions they lead to are quite interesting.


Reposted from blog.csdn.net/weixin_42437114/article/details/113573990