[Introduction to Machine Learning] Cross-entropy loss function and MLE criterion

0 Summary

The conclusions are stated first for quick reference; the detailed derivations are given in the sections that follow.

0.1 MLE and cross entropy

The loss function derived from the MLE criterion is equivalent to the cross-entropy loss function:

$$J(\pmb\theta)=-\mathbb{E}_{\pmb x,\pmb y\sim P_{data}}[\log P_{model}(\pmb y|\pmb x;\pmb\theta)]$$

Supervised learning trains a model given inputs $\pmb x$ (samples) and outputs $\pmb y$ (labels).

  • From the perspective of the maximum likelihood estimation criterion, this process can be regarded as maximum likelihood estimation of the conditional probability $P(Y|X;\pmb\theta)$.
  • From the perspective of relative entropy and cross entropy, this process can be seen as adjusting $Q(X)$ so that it approximates $P(X)$. Since the minimum of the cross entropy $H(P,Q)$ is $H(P)$, the $Q(X)$ obtained by minimizing $H(P,Q)$ should approximate $P(X)$.

See Section 4 for details

0.2 Specific forms of the cross-entropy loss function

In the following formulas, $N$ is the number of samples and $M$ is the number of categories.

1. Linear regression problem

$$\begin{aligned} J(\pmb\theta)&=-\mathbb{E}_{\pmb x,\pmb y \sim P_{data}}[\log P_{model}(\pmb y| \pmb x;\pmb\theta)]\\ &=\frac{1}{N}||\pmb y -\pmb{\hat y}||^2_2 \end{aligned}$$

2. Logistic regression (binary classification)

$$J(\pmb\theta)=\frac{1}{N}\sum_{i=1}^N\left[-y_i\log\hat{y}_i-(1-y_i)\log(1-\hat{y}_i)\right]$$
where:

  • $y_i$ is 0 or 1: it is 1 if the sample belongs to class A and 0 if it belongs to class B
  • $\hat{y}_i$ is the probability, output by the model, that the sample belongs to class A, so $(1-\hat{y}_i)$ is the probability of class B, with $\hat{y}_i=\sigma_{sigmoid}(\pmb\theta^T\pmb{x}_i+b)$

3. Multi-class classification

$$J(\pmb\theta)=-\frac{1}{N}\sum_{i=1}^N\sum_{j=1}^{M}y_{i,j}\log\hat{y}_{i,j}$$
where:

  • $y_{i,j}$ is 0 or 1: it takes the value 1 if sample $i$ belongs to category $j$, and 0 otherwise
  • $\hat{y}_{i,j}$ is the probability, output by the model, that sample $i$ belongs to category $j$, with $\hat{y}_{i,j}=\sigma_{softmax}([\pmb\theta^T\pmb{x}_i+\pmb b]_{j})$
  • $[\pmb\theta^T\pmb{x}_i+\pmb b]_{j}$ is the $j$-th output of the output layer, which can be interpreted as the unnormalized log probability (logit) of category $j$
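
A minimal numerical sketch of the two classification losses above, using toy labels and probabilities (all values here are made up for illustration):

```python
import numpy as np

# Binary classification: N = 4 samples
y = np.array([1, 0, 1, 1])               # true labels y_i in {0, 1}
y_hat = np.array([0.9, 0.2, 0.7, 0.6])   # model probabilities for class A

bce = np.mean(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))
print(f"binary cross-entropy: {bce:.4f}")

# Multi-class classification: N = 2 samples, M = 3 categories
Y = np.array([[0, 1, 0],
              [1, 0, 0]])                # one-hot labels y_{i,j}
Y_hat = np.array([[0.2, 0.6, 0.2],
                  [0.7, 0.2, 0.1]])      # model probabilities, rows sum to 1

ce = -np.mean(np.sum(Y * np.log(Y_hat), axis=1))
print(f"multi-class cross-entropy: {ce:.4f}")
```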

1 Maximum likelihood estimation (MLE)

1.1 Likelihood function and log-likelihood function

Given a probability distribution $P(x)$, assume the distribution is determined by a set of parameters $\pmb\theta$; the distribution can then be written in the form $P(x;\pmb\theta)$.

  • If the parameter $\pmb\theta$ is held fixed and $x$ is regarded as the variable, then $P(x;\pmb\theta)$ is called the probability distribution; it can be regarded as the specific distribution obtained when $\pmb\theta$ takes a particular set of values.
  • If $x$ is held fixed and $\pmb\theta$ is regarded as the variable, then $L(\pmb\theta)=P(x;\pmb\theta)$ is called the likelihood function. The likelihood function can be viewed as the probability of the observed $x$ occurring under different values of the parameter $\pmb\theta$.

For a set of independent and identically distributed data $\pmb{x}=(x_1,x_2,...,x_n)^T$, the joint distribution can usually be written as a product:
$$L(\pmb\theta)=P(\pmb{x};\pmb\theta)=\prod_{i=1}^{n}P(x_i;\pmb\theta)$$
A more commonly used form takes the logarithm of the likelihood function, which turns the product into a sum. This is called the log-likelihood function:
$$LL(\pmb\theta)=\log P(\pmb{x};\pmb\theta)=\sum_{i=1}^{n}\log P(x_i;\pmb\theta)$$
Since the logarithm is a monotonically increasing function, taking the logarithm does not change the location of the extremum.
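
A quick sketch of why the log form is preferred in practice, assuming i.i.d. Gaussian samples (the data here are simulated): the product of many small densities underflows to zero, while the sum of log-densities stays well-behaved.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=1000)   # i.i.d. samples

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

likelihood = np.prod(gaussian_pdf(x, 2.0, 1.0))             # L(theta): underflows to 0.0
log_likelihood = np.sum(np.log(gaussian_pdf(x, 2.0, 1.0)))  # LL(theta): finite and usable
print(likelihood, log_likelihood)
```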

1.2 Maximum Likelihood Estimation

As mentioned above, "the likelihood function can be viewed as the probability of the observed $x$ occurring under different values of the parameter $\pmb\theta$." The idea of the maximum likelihood estimation criterion is therefore:
to estimate $\pmb\theta$, given that we already have a set of observations $\pmb{x}=(x_1,x_2,...,x_n)^T$, maximize the probability of these observations occurring and take the corresponding $\pmb\theta$ as the estimate.
Expressed as a formula:
$$\hat{\pmb{\theta}}_{ML}=\arg\max_{\pmb\theta}L(\pmb\theta)=\arg\max_{\pmb\theta}\sum_{i=1}^{n}\log P(x_i;\pmb\theta)$$
Dividing by $n$ (which does not change the maximizer) turns the sum into an empirical expectation, and flipping the sign turns the maximization into a minimization, so the above can be written equivalently as:
$$\hat{\pmb{\theta}}_{ML}=\arg\min_{\pmb\theta}\left[-\mathbb{E}[\log P(x_i;\pmb\theta)]\right]$$
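
The arg-min form can be checked numerically. A minimal sketch, assuming a Gaussian model with simulated data: estimate $(\mu,\sigma)$ by minimizing the average negative log-likelihood with a generic optimizer.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)    # observed data (simulated)

def neg_log_likelihood(theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)                   # parameterize sigma > 0
    # -E[log P(x_i; theta)] for a Gaussian density
    return np.mean(0.5 * np.log(2 * np.pi) + log_sigma
                   + 0.5 * ((x - mu) / sigma) ** 2)

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)   # close to the sample mean and standard deviation
```
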
When estimating parameters under the maximum likelihood criterion, the likelihood function is differentiated with respect to the parameter $\pmb\theta$ and the maximum is located; the details of that solution process are not repeated here.
As can be seen from this outline, the premise of using the data $\pmb x$ to solve for $\pmb\theta$ is that the function $P(x;\pmb\theta)$ is correct; otherwise the function cannot correctly reflect the true relationship between $\pmb x$ and $\pmb\theta$, and $\pmb\theta$ cannot be estimated. This property is described in "Deep Learning" as follows:

  • The true distribution $p_{data}$ must lie within the model family $p_{model}(\cdot;\pmb\theta)$. Otherwise, no estimator can recover $p_{data}$.
  • The true distribution $p_{data}$ must correspond to exactly one value of $\pmb\theta$. Otherwise, maximum likelihood estimation can recover the true distribution $p_{data}$, but cannot determine which value of $\pmb\theta$ was used by the data-generating process.

In layman's terms, the functional form of the model must be chosen correctly, and it must be possible to determine uniquely which set of parameters to use.

2 Relative entropy and cross entropy

2.1 Relative entropy

Relative entropy, also known as Kullback-Leibler divergence or information divergence, is an asymmetric measure of the difference between two probability distributions. In information theory, relative entropy is equivalent to the difference in information entropy of two probability distributions. ——Baidu Encyclopedia "Relative Entropy"

Definition of relative entropy:
$$\begin{aligned} D(P||Q)&=\mathbb{E}_{x\sim P(X)}\left[\log\frac{P(x)}{Q(x)}\right] \\ &=\sum_{x\in X}P(x)\log\frac{P(x)}{Q(x)} \\ &=\sum_{x\in X}P(x)\log P(x)-\sum_{x\in X}P(x)\log Q(x) \end{aligned}$$
where $\mathbb{E}_{x\sim P(X)}$ denotes the mathematical expectation over $x$ following the distribution $P(X)$. Let
$$H(P)=-\sum_{x\in X}P(x)\log P(x),\qquad H(P,Q)=-\sum_{x\in X}P(x)\log Q(x)$$
Then the relative entropy can be written as:
$$D(P||Q)=H(P,Q)-H(P)$$
When the logarithm is taken to base 2, these quantities have the following physical meanings:

  • $H(P)$ is the information entropy of a source $X$ that follows the distribution $P(X)$, i.e., the average number of bits required to encode $X$
  • $H(P,Q)$ is called the cross entropy; it is the average number of bits required to encode a source $X$ that follows the distribution $P(X)$ using a code designed for the distribution $Q(X)$. It measures how hard it is for $Q(X)$ to represent a source that follows $P(X)$
  • $D(P||Q)$ is the relative entropy, the average number of extra bits required to encode the source $X$ when $Q(X)$ is used instead of $P(X)$
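
These three quantities and the identity $D(P||Q)=H(P,Q)-H(P)$ can be checked directly. A minimal sketch with two made-up discrete distributions, using base-2 logarithms (bits):

```python
import numpy as np

P = np.array([0.5, 0.25, 0.25])
Q = np.array([0.4, 0.4, 0.2])

H_P  = -np.sum(P * np.log2(P))        # entropy of the source
H_PQ = -np.sum(P * np.log2(Q))        # cross entropy: coding P with a code built for Q
D_PQ =  np.sum(P * np.log2(P / Q))    # relative entropy (KL divergence)

print(H_P, H_PQ, D_PQ, np.isclose(D_PQ, H_PQ - H_P))   # identity holds
```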

2.2 Cross entropy

The definition of cross entropy follows from the relative entropy above:
$$\begin{aligned} H(P,Q)&=-\sum_{x\in X}P(x)\log Q(x)\\ &=-\mathbb{E}_{x\sim P(X)}[\log Q(x)] \end{aligned}$$
Obviously, when the distribution $Q(X)$ approaches $P(X)$, $H(P,Q)$ tends to $H(P)$ and the relative entropy $D(P||Q)$ tends to 0, which means that encoding with $Q(X)$ introduces few redundant bits.

It can be proved that $D(P||Q)\geq 0$, that is, $\min_Q H(P,Q)=H(P)$.

In machine learning, model training is in fact a parameter estimation process: adjusting the parameters of the model is an optimization process that makes the model distribution $Q(X)$ approximate the true data distribution $P(X)$.
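
A minimal sketch of this optimization view, with a made-up target distribution: parameterize $Q$ as a softmax over logits $z$ and minimize $H(P,Q)$ by gradient descent. As $H(P,Q)$ falls toward its lower bound $H(P)$, $Q$ approaches $P$.

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])        # target distribution
z = np.zeros(3)                      # logits of Q, initially uniform
lr = 0.5

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(200):
    Q = softmax(z)
    # gradient of H(P, Q) = -sum_x P(x) log Q(x) with respect to the logits z is (Q - P)
    z -= lr * (Q - P)

H_P  = -np.sum(P * np.log(P))
H_PQ = -np.sum(P * np.log(softmax(z)))
print(softmax(z), H_PQ, H_P)         # Q is close to P and H(P,Q) is close to H(P)
```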

3 Give an example of a classifier

The cross-entropy loss function is often used in classification problems. The following is an example of classification.

  • There are $N$ input samples $\pmb{x}_i,\ i=1,2,...,N$ (for example, $N$ pictures)
  • There are $M$ categories $y_j,\ j=1,2,...,M$ (for example, $M$ kinds of animals such as cats and dogs)
  • $P_{data}(y_j|\pmb{x}_i)$ is the distribution to be learned; one-hot encoding is generally used in multi-class classification (for example, $P_{data}(\text{cat}\mid\text{photo of a cat})=1$, $P_{data}(\text{dog}\mid\text{photo of a cat})=0$)
  • $P_{model}(y_j|\pmb{x}_i;\pmb\theta)$ approximates $P_{data}(y_j|\pmb{x}_i)$ through learning
According to the above description, the cross entropy of the $i$-th sample can be expressed as:
$$H_i(P_{data},P_{model})=-\sum_{j=1}^{M}P_{data}(y_j|\pmb{x}_i)\log P_{model}(y_j|\pmb{x}_i;\pmb\theta)$$
From this we obtain the cross-entropy loss function of this sample:
$$\begin{aligned} J_i(\pmb\theta)&=H_i(P_{data},P_{model})\\ &=-\sum_{j=1}^{M}P_{data}(y_j|\pmb{x}_i)\log P_{model}(y_j|\pmb{x}_i;\pmb\theta) \end{aligned}$$
In fact, when one-hot encoding is used, only one term in the sum is non-zero. For example, suppose the categories are $(\text{cat},\text{dog},\text{mouse})^T$; then for a photo of a dog, the data distribution over the three animals is:
$$P_{data}(y=\text{cat},\text{dog},\text{mouse}\mid\pmb{x}_i=\text{photo of a dog})=\begin{bmatrix} 0\\1\\0 \end{bmatrix}$$
Suppose the model's inference result is:
$$P_{model}(y=\text{cat},\text{dog},\text{mouse}\mid\pmb{x}_i=\text{photo of a dog};\pmb\theta)=\begin{bmatrix} 0.2\\0.6\\0.2 \end{bmatrix}$$
Then the cross-entropy loss of this sample evaluates to:
$$\begin{aligned} J_i(\pmb\theta)&=-\sum_{y_j\in\{\text{cat},\text{dog},\text{mouse}\}}P_{data}(y_j\mid\pmb{x}_i=\text{photo of a dog})\log P_{model}(y_j\mid\pmb{x}_i=\text{photo of a dog};\pmb\theta)\\ &=-P_{data}(y=\text{dog}\mid\pmb{x}_i=\text{photo of a dog})\log P_{model}(y=\text{dog}\mid\pmb{x}_i=\text{photo of a dog};\pmb\theta)\\ &=-1\cdot\log 0.6 \end{aligned}$$
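
A one-line check of this worked example: the one-hot label selects exactly one term of the sum.

```python
import numpy as np

p_data  = np.array([0.0, 1.0, 0.0])   # one-hot label: (cat, dog, mouse), photo of a dog
p_model = np.array([0.2, 0.6, 0.2])   # model's predicted probabilities

loss = -np.sum(p_data * np.log(p_model))
print(loss, -np.log(0.6))             # both are approximately 0.5108
```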

From this, the cross-entropy loss of a single sample in a multi-class problem can be written as:
$$J_i(\pmb\theta)=-\sum_{j=1}^{M}y_{j,i}\log P_{j,i}$$

  • $y_{j,i}$ is 0 or 1: it takes the value 1 if sample $i$ belongs to category $j$, and 0 otherwise
  • $P_{j,i}$ is the probability, output by the model, that sample $i$ belongs to category $j$, which can be obtained with a softmax unit

When the problem is binary classification, this simplifies to:
$$J_i(\pmb\theta)=-y_i\log P_i-(1-y_i)\log(1-P_i)$$

  • $y_i$ is 0 or 1: it is 1 if the sample belongs to class A and 0 if it belongs to class B
  • $P_i$ is the probability, output by the model, that the sample belongs to class A, so $(1-P_i)$ is the probability of class B; it can be obtained with a sigmoid unit

4 Cross-entropy loss function and MLE criterion

Once we specify a model's output distribution $P_{model}(\pmb y|\pmb x)$, the maximum likelihood criterion gives the cost function
$$J(\pmb\theta)=-\mathbb{E}[\log P_{model}(\pmb y|\pmb x;\pmb\theta)]$$
The mathematical expectation (mean) is taken over the samples and labels drawn from the training data distribution $P_{data}$, so it can be written as
$$J(\pmb\theta)=-\mathbb{E}_{\pmb x,\pmb y\sim P_{data}}[\log P_{model}(\pmb y|\pmb x;\pmb\theta)]$$
Supervised learning trains a model given inputs $\pmb x$ (samples) and outputs $\pmb y$ (labels).

  • From the perspective of the maximum likelihood estimation criterion, this process can be regarded as maximum likelihood estimation of the conditional probability $P(Y|X;\pmb\theta)$.
  • From the perspective of relative entropy and cross entropy, this process can be seen as adjusting $Q(X)$ so that it approximates $P(X)$. Since the minimum of the cross entropy $H(P,Q)$ is $H(P)$, the $Q(X)$ obtained by minimizing $H(P,Q)$ should approximate $P(X)$.

Different models have different $P_{model}(\pmb y|\pmb x)$, so different specific forms of the loss function are derived.

4.1 Linear regression

Given the features $\pmb h$, a linear unit outputs $\pmb{\hat y} = \pmb\theta^T\pmb h +\pmb b$, and the model probability is $P_{model}(\pmb y|\pmb x;\pmb\theta)=\mathcal{N}(\pmb y;\pmb{\hat y},\pmb I)$, that is,
$$\begin{aligned} P_{model}(\pmb y| \pmb x;\pmb\theta)&=\frac {1}{\sqrt{2\pi}^N}e^{-\frac{1}{2}(\pmb y-\pmb{\hat y})^T(\pmb y - \pmb{\hat y})}\\ &=\frac {1}{\sqrt{2\pi}^N}e^{-\frac{1}{2}\sum_{i=1}^{N}(y_i-\hat y_i)^2} \end{aligned}$$
Its negative log-likelihood is:
$$\begin{aligned} -\ln P_{model}(\pmb y|\pmb x;\pmb\theta)&=\frac{N}{2}\ln 2\pi+\frac {1}{2}\sum_{i=1}^{N}(y_i-\hat y_i)^2\\ &=\frac{N}{2}\ln 2\pi+\frac{1}{2}||\pmb y-\pmb{\hat y}||^2_2 \end{aligned}$$
The cost function can then be derived (ignoring the constant term and the coefficient $\frac{1}{2}$):
$$\begin{aligned} J(\pmb\theta)&=-\mathbb{E}_{\pmb x,\pmb y \sim P_{data}}[\log P_{model}(\pmb y| \pmb x;\pmb\theta)]\\ &=\frac{1}{N}||\pmb y -\pmb{\hat y}||^2_2 \end{aligned}$$
Obviously, this cost function is the MSE. So for linear regression, the specific cost function derived from the maximum likelihood criterion (equivalently, the cross-entropy loss) is the mean squared error.
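
A small numerical check of this derivation, with made-up targets and predictions: the Gaussian negative log-likelihood with unit variance equals $\frac{N}{2}\ln 2\pi+\frac{1}{2}||\pmb y-\pmb{\hat y}||^2_2$, i.e. the MSE up to a constant and a factor of $\frac{1}{2}$.

```python
import numpy as np

y     = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.8, 3.2, 3.9])
N = len(y)

# negative log-likelihood of y under N(y_hat, I), summed over dimensions
nll = -np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (y - y_hat) ** 2)
expected = N / 2 * np.log(2 * np.pi) + 0.5 * np.sum((y - y_hat) ** 2)
mse = np.mean((y - y_hat) ** 2)
print(np.isclose(nll, expected), mse)   # MSE = (2/N) * (nll - constant)
```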

4.2 Logistic regression (binary classification)

The sigmoid function compresses its input into the interval $(0,1)$, so its output can be interpreted as a probability. Its form is
$$\sigma(z)=\frac{1}{1+e^{-z}}$$
From the example in Section 3, the cost function is
$$\begin{aligned} J(\pmb\theta)&=\frac{1}{N}\sum_{i=1}^N J_i(\pmb\theta)\\ &=\frac{1}{N}\sum_{i=1}^N\left[-y_i\log P_{model}(1|\pmb{x}_i;\pmb\theta)-(1-y_i)\log P_{model}(0|\pmb{x}_i;\pmb\theta)\right]\\ &=\frac{1}{N}\sum_{i=1}^N\left[-y_i\log\hat{y}_i-(1-y_i)\log(1-\hat{y}_i)\right] \end{aligned}$$
where:

  • $y_i$ is 0 or 1: it is 1 if the sample belongs to class A and 0 if it belongs to class B
  • $\hat{y}_i$ is the probability, output by the model, that the sample belongs to class A, so $(1-\hat{y}_i)$ is the probability of class B, with $\hat{y}_i=\sigma(\pmb\theta^T\pmb{x}_i+b)$
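
A minimal sketch of evaluating this loss, with randomly generated features, labels, and parameters (shapes and values are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))            # N = 8 samples, 3 features
y = rng.integers(0, 2, size=8)         # labels in {0, 1}
theta, b = rng.normal(size=3), 0.0     # model parameters

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y_hat = sigmoid(X @ theta + b)         # predicted probability of class A
loss = np.mean(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))
print(loss)
```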

4.3 Multi-class classification

The softmax function is a generalization of the sigmoid function. Its form is
$$\sigma(z_i)=\frac{e^{z_i}}{\sum_{j=1}^{M}e^{z_j}}$$
From the example in Section 3, the cost function is
$$\begin{aligned} J(\pmb\theta)&=\frac{1}{N}\sum_{i=1}^N J_i(\pmb\theta)\\ &=-\frac{1}{N}\sum_{i=1}^N\sum_{j=1}^{M}y_{i,j}\log P_{model}(j|\pmb{x}_i;\pmb\theta)\\ &=-\frac{1}{N}\sum_{i=1}^N\sum_{j=1}^{M}y_{i,j}\log\hat{y}_{i,j} \end{aligned}$$
where:

  • $y_{i,j}$ is 0 or 1: it takes the value 1 if sample $i$ belongs to category $j$, and 0 otherwise
  • $\hat{y}_{i,j}$ is the probability, output by the model, that sample $i$ belongs to category $j$, with $\hat{y}_{i,j}=\sigma([\pmb\theta^T\pmb{x}_i+\pmb b]_{j})$
  • $[\pmb\theta^T\pmb{x}_i+\pmb b]_{j}$ is the $j$-th output of the output layer, which can be interpreted as the unnormalized log probability (logit) of category $j$
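
A minimal sketch of the softmax cross-entropy, with random logits and labels (all values illustrative). Evaluating the log-softmax as $z_j-\operatorname{logsumexp}(z)$ with the maximum subtracted first is the usual numerically stable way to compute this loss.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 3
logits = rng.normal(size=(N, M))       # [theta^T x_i + b]_j for each sample
labels = rng.integers(0, M, size=N)    # class indices
Y = np.eye(M)[labels]                  # one-hot labels y_{i,j}

# numerically stable log-softmax: z - logsumexp(z)
z_max = logits.max(axis=1, keepdims=True)
log_softmax = logits - (z_max + np.log(np.sum(np.exp(logits - z_max), axis=1, keepdims=True)))

loss = -np.mean(np.sum(Y * log_softmax, axis=1))
print(loss)
```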

Origin blog.csdn.net/jasonso97/article/details/112726403