The use of the softmax layer

What is softmax?

In a classification task, the model produces an output vector $vec = (v_1, v_2, \cdots, v_n)$ that is usually not a probability distribution; that is, $\sum_{i=1}^n v_i \neq 1$ (the components may even be negative). This is an unintuitive result. To solve this problem, we use the exponential function to obtain a new vector:

$$distribution = \left( \frac{\exp v_i}{\sum_{j=1}^n \exp v_j} \right)_{i=1}^n$$
The distribution vector now clearly satisfies the conditions of a probability distribution: every component is positive and the components sum to 1.
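A minimal NumPy sketch of this computation (the function name `softmax` and the max-subtraction trick for numerical stability are illustrative additions, not part of the original text):

```python
import numpy as np

def softmax(v: np.ndarray) -> np.ndarray:
    """Map an arbitrary score vector to a probability distribution."""
    # Subtracting the max before exponentiating avoids overflow;
    # the shared factor cancels, so the result is unchanged.
    e = np.exp(v - np.max(v))
    return e / e.sum()

vec = np.array([2.0, -1.0, 0.5])
dist = softmax(vec)
print(dist)        # approximately [0.79 0.04 0.18]
print(dist.sum())  # 1.0
```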

Why do this?

I think one reason is that the exponential function grows very quickly, and we want the probability of the "principal" component to stand out as much as possible. This design concentrates most of the probability mass on the larger components of the original vector. In other words, softmax widens the gaps between the components.
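A small worked example makes this concrete. For $v = (1, 2, 3)$, plain normalization by the sum gives

$$\left(\frac{1}{6}, \frac{2}{6}, \frac{3}{6}\right) \approx (0.17, 0.33, 0.50),$$

while softmax gives

$$\left(\frac{e^{v_i}}{e^1 + e^2 + e^3}\right)_{i=1}^3 \approx (0.09, 0.24, 0.67).$$

The largest component's share grows from one half to about two thirds: the gaps between components have widened.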

Benefits

Intuitive: the output can be read directly as class probabilities.
For single-label classification, the predicted category can be read off immediately as the component with the highest probability.

Shortcomings

During training, especially when the number of categories is very large (say, 200 classes), the model parameters are randomly initialized at first, so the probability mass is concentrated on essentially random categories. At the same time, it can be seen from the definition that when the softmax value of a component approaches 0, its gradient also tends to 0, making it difficult to update the corresponding parameters.
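This gradient behavior follows from the standard derivative of softmax. Writing $s_i = \frac{\exp v_i}{\sum_{j=1}^n \exp v_j}$, one can check that

$$\frac{\partial s_i}{\partial v_j} = s_i(\delta_{ij} - s_j),$$

where $\delta_{ij}$ equals 1 if $i = j$ and 0 otherwise. Every entry of this gradient carries a factor of $s_i$, so once $s_i \approx 0$, almost no gradient flows back through that component.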

Suggestions

When the number of classes is small, a softmax can be added at the end of the classification layer; when there are many classes, it is best not to. In addition, because of the shortcoming above, adding a softmax at the end of the model may slow down training in many cases.
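As a minimal sketch of this advice, assuming a PyTorch workflow (the original post does not name a framework): PyTorch's `nn.CrossEntropyLoss` applies a log-softmax internally, so the model can output raw logits during training and apply softmax only at inference time, when probabilities are actually needed.

```python
import torch
import torch.nn as nn

# Toy classifier that outputs raw logits (no softmax at the end).
model = nn.Linear(128, 200)         # 200 classes, as in the example above
criterion = nn.CrossEntropyLoss()   # applies log-softmax + NLL internally

x = torch.randn(32, 128)                 # batch of 32 feature vectors
labels = torch.randint(0, 200, (32,))    # random target classes

logits = model(x)                   # raw scores, not probabilities
loss = criterion(logits, labels)    # no explicit softmax needed here
loss.backward()

# Apply softmax only at inference time, when probabilities are wanted.
with torch.no_grad():
    probs = torch.softmax(model(x), dim=-1)
```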

Origin: blog.csdn.net/Petersburg/article/details/126618654