Machine Learning and High-Dimensional Information Retrieval - Note 2 - Statistical Decision Making and Machine Learning

2. Statistical decision-making and machine learning

The basic problem is the following: given some observation of a random variable $X \in \mathbb{R}^{p}$, we want to obtain the "most likely" value of a random variable $Y$. For simplicity, we assume that $Y$ takes values in $\mathbb{R}$. We further assume that the joint probability density $p_{X,Y}(x,y)$ is given.

Example 2.1
$X$ is a (vectorized) image of a person. In Figure 2.1 below, the observation is an image consisting of a single pixel with varying shades of gray.

$$Y= \begin{cases}1: & X \text{ is a picture of person A} \\ 0: & X \text{ is not}\end{cases}$$

The goal is to predict $Y$ given $X$. Suppose that person A mainly wears dark clothes.


Figure 2.1: Joint probability distribution of the two discrete random variables $X$ and $Y$.

It is then reasonable to expect that the event probabilities satisfy $p_{42}>p_{32}>p_{22}>p_{12}$ and $p_{11}>p_{21}>p_{31}>p_{41}$.

To make precise what "most likely" means, we consider the squared loss function¹

¹ Note that other loss functions are possible and do make sense; the squared distance was chosen simply for convenience.

$$L(Y, f(X))=(Y-f(X))^{2}. \tag{2.1}$$

We want to choose $f$ such that the expected value of the loss function, the so-called expected prediction error
$$\operatorname{EPE}(f)=\mathbb{E}[L(Y, f(X))],$$
is minimized. A simple calculation using the tools from Subsection 1.2 gives

$$\begin{aligned} \operatorname{EPE}(f) &=\mathbb{E}[L(Y, f(X))]=\mathbb{E}\left[(Y-f(X))^{2}\right]\\ &=\int_{\mathbb{R}^{p} \times \mathbb{R}}(y-f(x))^{2}\, p_{X, Y}(x, y)\, \mathrm{d}x\, \mathrm{d}y\\ &=\int_{\mathbb{R}^{p}} \underbrace{\int_{\mathbb{R}}(y-f(x))^{2}\, p_{Y \mid X=x}(y)\, \mathrm{d}y}_{=: \mathbb{E}_{Y \mid X=x}\left[(Y-f(X))^{2}\right]}\, p_{X}(x)\, \mathrm{d}x\\ &=\mathbb{E}_{X}\, \mathbb{E}_{Y \mid X=x}\left[(Y-f(X))^{2}\right] \end{aligned}$$
Therefore $f(X)=\underset{c}{\arg\min }\, \mathbb{E}_{Y \mid X=x}\left[(Y-c)^{2}\right]$. By the linearity of $\mathbb{E}[\cdot]$, this can be simplified to

$$f(X)=\underset{c}{\arg \min }\left(\mathbb{E}_{Y \mid X=x}\left[Y^{2}\right]-2 c\, \mathbb{E}_{Y \mid X=x}[Y]+c^{2}\right). \tag{2.2}$$

The quadratic expression is easily minimized, and the optimum is

$$f(X)=\mathbb{E}_{Y \mid X=x}[Y]. \tag{2.3}$$

Theorem 2.2

Given $X$, if prediction quality is measured by the squared loss, the best predictor of $Y$ is the conditional mean.
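As a quick numerical illustration of this result, the following Python sketch (with an arbitrary synthetic sample standing in for $Y \mid X=x$) scans candidate constants $c$ and confirms that the average squared loss is minimized at the sample mean, mirroring (2.2) and (2.3).

```python
# Illustrative sketch: the sample mean minimizes the average squared loss.
# The distribution and sample size below are arbitrary choices for the demo.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=10_000)   # samples standing in for Y | X = x

candidates = np.linspace(-2.0, 6.0, 801)          # candidate constants c
avg_sq_loss = [np.mean((y - c) ** 2) for c in candidates]

c_best = candidates[np.argmin(avg_sq_loss)]
print(f"empirical minimizer c* = {c_best:.3f}")   # close to the sample mean
print(f"sample mean            = {y.mean():.3f}")
```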

As mentioned above, the conditional mean being the best predictor relies on the fact that we use the squared loss as quality measure. One can show that if we instead take the absolute value $|Y-f(X)|$ ($\ell_1$ loss), the best predictor is the conditional median. While a rigorous proof of this statement is beyond the scope of these lecture notes, we can easily see that the median is the value minimizing the empirical $\ell_1$ loss

$$\underset{c}{\arg \min }\ \frac{1}{n} \sum_{i}\left|y_{i}-c\right|. \tag{2.4}$$

We only need a result from non-smooth convex optimization: a convex function with a subgradient attains its minimum at $c^{*}$ if and only if the subgradient at $c^{*}$ contains 0. In our case, the subgradient of the empirical $\ell_{1}$ loss with respect to $c$ is $\frac{1}{n} \sum_{i} \operatorname{sign}\left(y_{i}-c\right)$ for $c \neq y_{i}$, and $\left\{\frac{1}{n} \sum_{i \neq j} \operatorname{sign}\left(y_{i}-c\right)+t \mid -1 \leq t \leq 1\right\}$ for $c=y_{j}$. Hence the subgradient at $c$ contains zero only if the number of $y_{i}$ smaller than $c$ equals the number of $y_{i}$ greater than $c$. For odd $n$, $c$ must therefore coincide with the $(n+1)/2$-th largest of the $y_{i}$, which is exactly the definition of the median.
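The subgradient argument can also be checked numerically. The short sketch below (with arbitrary data values, including an outlier) scans candidate constants $c$ and confirms that the empirical $\ell_1$ loss (2.4) is minimized at the median.

```python
# Illustrative sketch: the median minimizes the empirical ell_1 loss (2.4).
# Data values are arbitrary; the outlier 50.0 highlights the robustness of the median.
import numpy as np

y = np.array([1.0, 2.0, 3.0, 7.0, 50.0])            # odd n, so the median is unique

candidates = np.linspace(0.0, 60.0, 6001)
avg_abs_loss = np.mean(np.abs(y[None, :] - candidates[:, None]), axis=1)

c_best = candidates[np.argmin(avg_abs_loss)]
print(f"empirical ell_1 minimizer = {c_best:.2f}")  # approximately 3.0
print(f"median                    = {np.median(y):.2f}")
```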

2.1 General settings for supervised decision-making and generalization error

Let us revisit the preceding discussion from a higher-level perspective. We introduce a loss function and, based on training samples $\left(x_{i}, y_{i}\right), i=1, \ldots, N$, learn the best prediction function $f$ from a given function class $\mathcal{F}$. Such methods are called supervised learning methods, because they require a training set in which every sample $x_{i}$ comes with a corresponding $y_{i}$. The general problem statement is:

Let $(X, Y)$ be random variables in $\mathbb{R}^{p} \times \mathbb{R}$, and let $\mathcal{F}$ be a class of functions from $\mathbb{R}^{p}$ to $\mathbb{R}$. These functions may be parameterized by finitely many parameters, e.g. $\Theta \in \mathbb{R}^{M}$. Furthermore, let $L: \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}_{0}^{+}$ be a loss function that measures the deviation between $Y$ and $f(X)$. The aim of a supervised prediction method is to find the $\hat{f} \in \mathcal{F}$ that minimizes the expected prediction error $\mathbb{E}[L(Y, f(X))]$.

To build a supervised machine learning method, we thus need to settle three questions.

  • We must specify the loss function $L$.

  • We must specify the function class $\mathcal{F}$.

  • Since in practice we do not know the distribution of $(X, Y)$, we need training samples $\left(x_{i}, y_{i}\right), i=1, \ldots, N$ to approximate the expected prediction error.

So we get an optimization problem

$$\hat{f}=\underset{f \in \mathcal{F}}{\arg \min }\ \frac{1}{N} \sum_{i=1}^{N} L\left(y_{i}, f\left(x_{i}\right)\right) \tag{2.5}$$

Such optimization problems are more or less difficult to solve, but in principle they can be fed to an optimization toolbox in the form stated above.
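As an illustration of handing (2.5) to a generic solver, the following sketch minimizes the empirical risk of a small parametric function class with `scipy.optimize.minimize`; the cubic function class, the squared loss, and the synthetic data are arbitrary choices made only for this example.

```python
# Illustrative sketch: feeding the empirical risk (2.5) to a generic optimizer.
# Function class (cubic polynomials), loss, and data are arbitrary assumptions.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=200)
y = np.sin(x) + 0.1 * rng.normal(size=200)          # unknown (X, Y) relationship

def f(theta, x):
    """A small parametric function class F: cubic polynomials in one variable."""
    return theta[0] + theta[1] * x + theta[2] * x ** 2 + theta[3] * x ** 3

def empirical_risk(theta):
    """(1/N) * sum_i L(y_i, f(x_i)) with the squared loss."""
    return np.mean((y - f(theta, x)) ** 2)

result = minimize(empirical_risk, x0=np.zeros(4))   # generic solver, no closed form used
print("fitted parameters:", np.round(result.x, 3))
print("empirical risk   :", round(float(result.fun), 4))
```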

Example: Linear Regression
For linear regression, we choose $L$ to be the quadratic loss, i.e. $L(Y, f(X))=(Y-f(X))^{2}$, and we choose $\mathcal{F}$ to be the class of affine functions

$$f(x)=\theta_{0}+\sum_{k=1}^{p} \theta_{k}\, x^{(k)}, \tag{2.6}$$

where $x^{(k)}$ denotes the $k$-th entry of the vector $x$. Given $N$ training samples, the optimization problem becomes

$$\hat{\Theta}=\underset{\theta_{0}, \ldots, \theta_{p}}{\arg \min }\ \frac{1}{N}\sum_{i=1}^{N}\left(y_{i}-\theta_{0}-\sum_{k=1}^{p}\theta_{k}\, x_{i}^{(k)}\right)^{2}. \tag{2.7}$$
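For this particular choice of loss and function class, (2.7) can be solved in closed form by least squares. A minimal sketch with synthetic data (dimensions, true parameters, and noise level chosen arbitrarily) prepends a column of ones to the data matrix to account for $\theta_0$ and solves the problem with numpy:

```python
# Illustrative sketch: solving the linear regression problem (2.7) by least squares.
# Synthetic data; true parameters and noise level are arbitrary choices.
import numpy as np

rng = np.random.default_rng(2)
N, p = 100, 3
X = rng.normal(size=(N, p))
theta_true = np.array([1.0, 2.0, -1.0, 0.5])        # [theta_0, theta_1, ..., theta_p]
y = theta_true[0] + X @ theta_true[1:] + 0.1 * rng.normal(size=N)

X_design = np.hstack([np.ones((N, 1)), X])          # column of ones for the intercept theta_0
theta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("estimated parameters:", np.round(theta_hat, 3))
```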

Definition:
For a given function $f$, the difference between the expected loss and the empirical loss,

$$G_{N}(f)=\mathbb{E}\left[L(Y, f(X))\right]-\frac{1}{N} \sum_{i=1}^{N} L\left(y_{i}, f\left(x_{i}\right)\right), \tag{2.8}$$

is called the generalization error of $f$.

By the law of large numbers, $G_{N}(f)$ tends to 0 as $N$ tends to infinity. In practice, however, the number of training samples is limited, and the speed of convergence depends heavily on the (unknown) probability distribution of $(X, Y)$. Therefore, one goal of machine learning theory is to bound the generalization error with high probability.

In practice, we use cross-validation to check how well $f$ fits the data.
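A minimal sketch of such a check (with a synthetic data set, the affine model of (2.7), and an arbitrary choice of 5 folds): the data are split into folds, the model is fit on all folds but one, and the average held-out loss serves as an estimate of the expected prediction error.

```python
# Illustrative k-fold cross-validation sketch; data, model, and k = 5 are arbitrary choices.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -0.5]) + 0.2 * rng.normal(size=100)

def fit(X_tr, y_tr):
    """Least-squares fit of an affine model, as in Equation (2.7)."""
    A = np.hstack([np.ones((len(X_tr), 1)), X_tr])
    theta, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return theta

def predict(theta, X_te):
    return np.hstack([np.ones((len(X_te), 1)), X_te]) @ theta

k = 5
folds = np.array_split(rng.permutation(len(y)), k)
held_out_losses = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    theta = fit(X[train_idx], y[train_idx])
    held_out_losses.append(np.mean((y[test_idx] - predict(theta, X[test_idx])) ** 2))

print("cross-validated squared loss:", round(float(np.mean(held_out_losses)), 4))
```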

2.2 k nearest neighbors

In practice, the problem we face is that we do not know the joint distribution of $X$ and $Y$, so we must estimate the conditional mean in Equation (2.3). Typically this would be done via the relative frequencies of $Y$ given a specific $X$. In the continuous case, however, a problem arises: even with many observations $\left(x_{i}, y_{i}\right)$, there may be no $x_{i}=x$. This problem can be addressed by enlarging the region around $x$ and considering all observations whose $x_{i}$ are close to the desired $x$. This leads to the concept of $k$ nearest neighbors.


Figure 2.2: Bivariate distribution of $(X, Y)$ with the events $(1,1),(2,5),(3,3),(4,1),(5,4)$. While $\mathbb{E}_{Y\mid X=4}[Y]=5$, averaging over the 3 nearest neighbors gives $\hat{f}_{k=3}(4)=(2+3+5)/3=10/3$.

The $k$ nearest neighbor method estimates $f$ by

$$\hat{f}_{k}(x)=\text{average}\left(y_{i} \mid x_{i} \in N_{k}(x)\right), \tag{2.9}$$

where $N_{k}(x)$ denotes the set of the $k$ nearest neighbors of $x$. There are two approximations involved here.

  1. Expectations are approximated by averaging.

  2. Conditioning at a point is approximated by conditioning on a region around that point.
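As a concrete illustration of the estimator (2.9) and the two approximations above, the following sketch implements $k$-nearest-neighbor regression with plain numpy; the synthetic two-dimensional data, the toy target function, and $k=10$ are arbitrary choices.

```python
# Illustrative k-nearest-neighbor regression, cf. Equation (2.9).
# Synthetic data, the toy target function, and k = 10 are arbitrary choices.
import numpy as np

def knn_regress(x_query, X_train, y_train, k):
    """Average the y_i whose x_i belong to N_k(x_query)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances to x_query
    neighbors = np.argsort(dists)[:k]                    # indices of the k nearest neighbors
    return y_train[neighbors].mean()

rng = np.random.default_rng(4)
X_train = rng.uniform(-3, 3, size=(500, 2))
y_train = np.sin(X_train[:, 0]) + X_train[:, 1] ** 2 + 0.1 * rng.normal(size=500)

x0 = np.array([0.5, -1.0])
print("k-NN estimate at x0   :", round(knn_regress(x0, X_train, y_train, k=10), 3))
print("conditional mean at x0:", round(np.sin(0.5) + 1.0, 3))  # E[Y | X = x0] in this toy model
```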

Let $N$ denote the number of observations and note the following: as $N$ increases, the $x_{i}$ in the neighborhood get closer to $x$; furthermore, as $k$ increases, the average tends to the expected value. More precisely, under mild conditions on the joint probability measure, we have

$$\lim _{N, k \rightarrow \infty,\ \frac{k}{N} \rightarrow 0} \hat{f}(x)=\mathbb{E}_{Y \mid X=x}[Y]. \tag{2.10}$$

However, the sample size $N$ is usually limited and therefore not large enough to satisfy these conditions.

There are two ways to overcome this problem.

  • Impose model assumptions on $f$, e.g. $f(x) \approx x^{\mathrm{T}} \beta$, which leads to linear regression.

  • By "lowering" XXDimensions of X. This leads to the motivation of dimensionality reduction.

Find $f:\left[\begin{array}{c}x_{1} \\ \vdots \\ x_{p} \end{array}\right] \mapsto \left[\begin{array}{c} s_{1}\\ \vdots \\ s_{k} \end{array}\right]$ such that $f$ preserves the "intrinsic" distances between observations.
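One way to make this distance-preservation requirement tangible is to compare pairwise distances before and after applying a candidate map. In the sketch below, a random linear projection serves merely as a stand-in for $f$, and the dimensions $p=100$, $k=20$ as well as the synthetic data are arbitrary choices; distortion ratios close to 1 indicate that distances are roughly preserved.

```python
# Illustrative check of distance preservation under a dimension-reducing map f.
# A random linear projection stands in for f; all dimensions are arbitrary choices.
import numpy as np

rng = np.random.default_rng(5)
p, k, n = 100, 20, 50
X = rng.normal(size=(n, p))                          # n observations in R^p

F = rng.normal(size=(p, k)) / np.sqrt(k)             # candidate linear map f(x) = F^T x
S = X @ F                                            # images s_i in R^k

def pairwise_dists(A):
    diff = A[:, None, :] - A[None, :, :]
    return np.linalg.norm(diff, axis=-1)

mask = ~np.eye(n, dtype=bool)                        # ignore zero self-distances
ratios = pairwise_dists(S)[mask] / pairwise_dists(X)[mask]
print("distance ratios: mean %.2f, min %.2f, max %.2f"
      % (ratios.mean(), ratios.min(), ratios.max()))
```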

2.3 Curse of Dimensionality

Let $X\in \mathbb{R}^{p}$ be a random variable. A few observations are in order.

First, the absolute noise of $X$ increases with $p$, because noise accumulates across all dimensions.

Second, the number of observations required to estimate the density of $X$ grows sharply with $p$. A density is usually estimated via the relative frequency of occurrences. As an example, suppose $N$ observations in a one-dimensional space are needed to estimate a density to a prescribed accuracy. If the observation space grows to 2 dimensions, the number of observations must grow to the order of $N^{2}$ to maintain the same accuracy; for a 3-dimensional space, approximately $N^{3}$ observations are required, and so on. Thus, for the density to be estimated with the same accuracy, the number of required observations grows exponentially with the dimension of the observation space.

One cause of the curse of dimensionality is the empty space phenomenon (Scott), which states that high-dimensional spaces are inherently sparse. If data points are uniformly distributed in a 10-dimensional unit ball, the probability that a point lies within distance 0.9 of the center is less than 35% ($0.9^{10} \approx 0.35$); most points sit near the boundary. Consequently, the tails of a high-dimensional distribution are far more important than the tails of a one-dimensional distribution: for a high-dimensional multivariate random variable, it suffices that a single component falls into the tail of its distribution for the entire sample to lie in the tail of the joint density. Figure 2.3 illustrates the one-dimensional case.


Figure 2.3: For a one-dimensional random variable, most observations are around the mean. This is not the case in higher dimensions.
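The 35% figure follows from the volume ratio $0.9^{10} \approx 0.35$ of the two balls, and it can be checked with a short Monte Carlo sketch (the sampling scheme and the sample size are arbitrary choices).

```python
# Illustrative Monte Carlo check of the empty space phenomenon in 10 dimensions.
# Uniform points in the unit ball: Gaussian direction, radius distributed as U**(1/d).
import numpy as np

rng = np.random.default_rng(6)
dim, n = 10, 100_000

directions = rng.normal(size=(n, dim))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
radii = rng.uniform(size=(n, 1)) ** (1.0 / dim)
points = radii * directions

frac_inner = np.mean(np.linalg.norm(points, axis=1) <= 0.9)
print(f"fraction within radius 0.9 of the center: {frac_inner:.3f}")  # about 0.35
print(f"theoretical value 0.9**{dim}: {0.9**dim:.3f}")
```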
