From a probabilistic model to logistic classification

What we will look at today is how to derive a classification rule from a probabilistic model.

Roughly speaking, given a sample, we want to determine which class that sample should belong to.

First, let's review some of the basic probability theory involved.

We restrict the problem to two classes: suppose we have classes \(C_1\) and \(C_2\), and for a sample \(X_i\) we want to judge which class \(X_i\) belongs to. In terms of probabilities, our problem is to find \(P(C_1 \mid X_i)\); since \(P(C_1 \mid X_i) = 1 - P(C_2 \mid X_i)\), it is enough to find one of the two probabilities.

Using Bayes' formula, we find \(P(C_1 \mid X_i) = \frac{P(X_i \mid C_1)P(C_1)}{P(X_i \mid C_1)P(C_1) + P(X_i \mid C_2)P(C_2)}\)

A simple transformation: dividing both numerator and denominator by the numerator gives \(P(C_1 \mid X_i) = \frac{1}{1 + \frac{P(X_i \mid C_2)P(C_2)}{P(X_i \mid C_1)P(C_1)}}\)

Now let \(Z = \ln\left(\frac{P(X_i \mid C_1)P(C_1)}{P(X_i \mid C_2)P(C_2)}\right)\), which gives \(P(C_1 \mid X_i) = \frac{1}{1 + \exp(-Z)}\) — the sigmoid function.
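To make this mapping concrete, here is a minimal sketch (assuming NumPy; the values are just illustrative) of how \(Z\) is turned into \(P(C_1 \mid X_i)\):

```python
import numpy as np

def sigmoid(z):
    """Map the log-odds Z to P(C_1 | X_i) = 1 / (1 + exp(-Z))."""
    return 1.0 / (1.0 + np.exp(-z))

# The larger Z is (the more the evidence favours C_1), the closer the
# posterior is to 1; Z = 0 corresponds to P(C_1 | X_i) = 0.5.
print(sigmoid(np.array([-3.0, 0.0, 3.0])))  # ~[0.047, 0.5, 0.953]
```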

We can further split \(Z\) into two terms: \(Z = \ln\frac{P(X_i \mid C_1)}{P(X_i \mid C_2)} + \ln\frac{P(C_1)}{P(C_2)}\)

The ratio \(\frac{P(C_1)}{P(C_2)}\) can be obtained simply by counting the training samples: if the number of \(C_1\) samples is \(N_1\) and the number of \(C_2\) samples is \(N_2\), then
\[\frac{P(C_1)}{P(C_2)} = \frac{\frac{N_1}{N_1+N_2}}{\frac{N_2}{N_1+N_2}} = \frac{N_1}{N_2}\]
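As a small illustrative snippet (the label array is hypothetical), this prior term \(\ln\frac{N_1}{N_2}\) can be estimated directly from the training labels:

```python
import numpy as np

labels = np.array([1, 1, 1, 0, 1, 0, 1, 0])   # hypothetical labels: 1 -> C_1, 0 -> C_2
N1, N2 = np.sum(labels == 1), np.sum(labels == 0)
log_prior_ratio = np.log(N1 / N2)              # ln(N_1 / N_2)
print(N1, N2, log_prior_ratio)                 # 5 3 ~0.51
```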

Substituting this into \(Z\) gives \[Z = \ln\frac{P(X_i \mid C_1)}{P(X_i \mid C_2)} + \ln\frac{N_1}{N_2}\]

We assume the samples in each class follow a Gaussian (normal) distribution, whose probability density is
\[f(x) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\]

That is:
\[P(X_i \mid C_1) = \frac{1}{\sqrt{2\pi}\sigma_1}\exp\left(-\frac{(x_i-\mu_1)^2}{2\sigma_1^2}\right)\]
\[P(X_i \mid C_2) = \frac{1}{\sqrt{2\pi}\sigma_2}\exp\left(-\frac{(x_i-\mu_2)^2}{2\sigma_2^2}\right)\]
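Here is a short sketch (NumPy, with made-up one-dimensional samples) of estimating \(\mu_k\) and \(\sigma_k\) for each class and evaluating these two densities:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Gaussian density f(x) = 1 / (sqrt(2*pi)*sigma) * exp(-(x - mu)^2 / (2*sigma^2))."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# Hypothetical 1-D training samples for each class.
x_c1 = np.array([1.8, 2.1, 2.4, 1.9, 2.2])
x_c2 = np.array([4.0, 3.6, 4.3, 3.9])

mu1, sigma1 = x_c1.mean(), x_c1.std()    # class-conditional mean / std for C_1
mu2, sigma2 = x_c2.mean(), x_c2.std()    # class-conditional mean / std for C_2

x_new = 2.5
print(gaussian_pdf(x_new, mu1, sigma1))  # P(X_i | C_1)
print(gaussian_pdf(x_new, mu2, sigma2))  # P(X_i | C_2)
```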

Substituting these in, the expression for \(Z\) becomes
\[Z = \ln\frac{\frac{1}{\sigma_1}}{\frac{1}{\sigma_2}} + \ln\frac{\exp\left(-\frac{(x_i-\mu_1)^2}{2\sigma_1^2}\right)}{\exp\left(-\frac{(x_i-\mu_2)^2}{2\sigma_2^2}\right)} + \ln\frac{N_1}{N_2}\]

Simplifying further,

\[Z = \ln\frac{\sigma_2}{\sigma_1} - \frac{(x-\mu_1)^2}{2\sigma_1^2} + \frac{(x-\mu_2)^2}{2\sigma_2^2} + \ln\frac{N_1}{N_2}\]

Now assume the two classes share the same variance, i.e. \(\sigma_1 = \sigma_2 = \sigma\). Then the \(\ln\frac{\sigma_2}{\sigma_1}\) term vanishes and we get

\[Z = \frac{-(x-\mu_1)^2 + (x-\mu_2)^2}{2\sigma^2} + \ln\frac{N_1}{N_2} = \frac{2\mu_1 x - 2\mu_2 x + \mu_2^2 - \mu_1^2}{2\sigma^2} + \ln\frac{N_1}{N_2} = \frac{\mu_1-\mu_2}{\sigma^2}x + \frac{\mu_2^2-\mu_1^2}{2\sigma^2} + \ln\frac{N_1}{N_2}\]

Since \(\mu_1\), \(\mu_2\) and \(\sigma\) are all statistics obtained from the training samples, we do not care about their specific values here, so we can define \(W = \frac{\mu_1-\mu_2}{\sigma^2}\) and \(b = \frac{\mu_2^2-\mu_1^2}{2\sigma^2} + \ln\frac{N_1}{N_2}\)

and we obtain \[Z = Wx + b\]
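Putting the pieces together, here is a minimal sketch (NumPy, shared variance as assumed above, with hypothetical 1-D samples) that computes \(W\) and \(b\) from the class statistics and then evaluates \(P(C_1 \mid x) = \frac{1}{1+\exp(-(Wx+b))}\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 1-D samples for the two classes.
x_c1 = np.array([1.8, 2.1, 2.4, 1.9, 2.2])   # class C_1
x_c2 = np.array([4.0, 3.6, 4.3, 3.9])        # class C_2

N1, N2 = len(x_c1), len(x_c2)
mu1, mu2 = x_c1.mean(), x_c2.mean()
# Shared variance, pooled over both classes (sigma_1 = sigma_2 = sigma).
sigma2 = np.concatenate([x_c1 - mu1, x_c2 - mu2]).var()

# W and b exactly as in the derivation above.
W = (mu1 - mu2) / sigma2
b = (mu2 ** 2 - mu1 ** 2) / (2 * sigma2) + np.log(N1 / N2)

x_new = 2.5
p_c1 = sigmoid(W * x_new + b)   # P(C_1 | x_new)
print(W, b, p_c1)               # p_c1 close to 1: x_new looks like a C_1 sample
```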

So in machine learning we could use maximum likelihood to estimate the mean and variance of the samples and then decide which class a sample belongs to. But we can also skip estimating the sample mean and variance from the features and their distribution altogether: we only need to find suitable \(W\) and \(b\), and we can still determine which class a sample belongs to.

All we need is a suitable evaluation function that measures how good a given \(W\) and \(b\) are; by repeatedly adjusting \(W\) and \(b\) we can then find them. The function we use is the following:

\[Loss = -\left[y\ln\hat{y} + (1-y)\ln(1-\hat{y})\right]\]

Why choose this loss function? Andrew Ng's machine learning course mentions that it can be split into two cases and computed separately:
\[loss = \begin{cases} -\ln\hat{y} & y = 1 \\ -\ln(1-\hat{y}) & y = 0 \end{cases}\]
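A short sketch (NumPy; function name and sample values are my own) of this loss, showing both branches of the piecewise form:

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy loss = -[y*ln(y_hat) + (1-y)*ln(1-y_hat)]."""
    y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# y = 1 penalises small y_hat, y = 0 penalises large y_hat.
print(cross_entropy(1, 0.9))   # -ln(0.9) ~ 0.105
print(cross_entropy(0, 0.9))   # -ln(0.1) ~ 2.303
```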

Combining these two formulas into one gives the loss function above. Why don't we use \(loss = (y-\hat{y})^2\) as the loss function instead? Andrew Ng's explanation is roughly that in logistic regression it is not a convex function, so gradient descent does not work well. Professor Chang's course analyzes this in detail.

Specifically: suppose we choose \(loss = (y-\hat{y})^2\) as the loss function, where \(\hat{y} = \frac{1}{1+\exp(-(wx+b))}\). When we do gradient descent, we need the derivative of this function:

\(\frac{d\,loss}{dw} = 2(\hat{y}-y)\,\hat{y}(1-\hat{y})\,x\)

When the output \(\hat{y}\) is 0 or 1, this derivative is 0 regardless of whether the label \(y\) is 0 or 1, so gradient descent cannot make progress. That is the fundamental reason.
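A small numerical sketch (NumPy, single sample, values are hypothetical) of the problem: with the squared loss the gradient collapses to zero when \(\hat{y}\) saturates at 0 or 1, while the cross-entropy gradient \((\hat{y}-y)x\) does not:

```python
import numpy as np

def grad_mse(y, y_hat, x):
    """d/dw of (y - y_hat)^2 with y_hat = sigmoid(w*x + b): 2*(y_hat - y)*y_hat*(1 - y_hat)*x."""
    return 2 * (y_hat - y) * y_hat * (1 - y_hat) * x

def grad_cross_entropy(y, y_hat, x):
    """d/dw of the cross-entropy loss with y_hat = sigmoid(w*x + b): (y_hat - y)*x."""
    return (y_hat - y) * x

x, y = 1.0, 1.0                   # hypothetical single sample with label 1
for y_hat in [0.5, 0.99, 1e-6]:  # moderate, near-correct, and badly wrong predictions
    print(y_hat, grad_mse(y, y_hat, x), grad_cross_entropy(y, y_hat, x))
# When y_hat ~ 0 (a badly wrong prediction), the MSE gradient is ~0 and learning stalls,
# while the cross-entropy gradient stays close to -x and keeps pushing w the right way.
```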
