Machine Learning (7) Neural Network —— Optimization Techniques I

Copyright notice: https://blog.csdn.net/qq_26386707/article/details/79364579



Chenjing Ding
2018/02/27


| notation | meaning |
| --- | --- |
| $g(x)$ | activation function |
| $x^n$ | the $n$-th input vector (written $x$ when $n$ is not specified) |
| $x^n_i$ | the $i$-th entry of $x^n$ (written $x_i$ when $n$ is not specified) |
| $N$ | the number of input vectors |
| $K$ | the number of classes |
| $t^n$ | a $K$-dimensional vector whose $k$-th entry is 1 only when the $n$-th input vector belongs to the $k$-th class, $t^n = (0,0,\dots,1,\dots,0)$ |
| $y_j(x)$ | the output of the $j$-th output neuron |
| $y(x)$ | the output vector for input $x$; $y(x) = (y_1(x), \dots, y_K(x))$ |
| $W_{ji}^{\tau+1}$ | the $(\tau+1)$-th update of weight $W_{ji}$ |
| $W_{ji}^{\tau}$ | the $\tau$-th update of weight $W_{ji}$ |
| $\frac{\partial E(W)}{\partial W_{ij}^{(m)}}$ | the gradient with respect to the $m$-th layer weights |
| $l_i$ | the number of neurons in the $i$-th layer (written $l$ when $i$ is not specified) |
| $W_{ji}^{(mn)}$ | the weight between layer $m$ and layer $n$ |

1. Regularization

To avoid overfitting:

$$E(W) = \sum_{n=1}^{N} L(t^n, y(x^n)) + \lambda\,\Omega(W)$$
$L(t^n, y(x^n))$ is a loss function;
$\Omega(W)$ is the regularizer: the L2 regularizer is $\|W\|^2 = \sum_{i,j} w_{ji}^2$; the L1 regularizer is $|W| = \sum_{i,j} |w_{ji}|$;
$\lambda$ is the regularization parameter.
This penalty keeps every weight $w_{ji}$ from growing too large, so the model cannot become overly complex by relying on many useless features.
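
As a minimal illustration (not from the original post), here is a NumPy sketch of this regularized objective; the names `losses`, `weights`, and `lam` are placeholders chosen for this example.

```python
import numpy as np

def regularized_error(losses, weights, lam, kind="l2"):
    """Sum of per-sample losses plus a weight penalty.

    losses  : array of per-sample loss values L(t^n, y(x^n))
    weights : list of weight matrices W of the network
    lam     : regularization parameter lambda
    """
    data_term = np.sum(losses)
    if kind == "l2":
        penalty = sum(np.sum(W ** 2) for W in weights)     # ||W||^2
    else:  # "l1"
        penalty = sum(np.sum(np.abs(W)) for W in weights)  # |W|
    return data_term + lam * penalty
```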

1. What is L1, L2 regularization:
https://www.youtube.com/watch?v=TmzzQoO8mr4 (Chinese)

2. Regularization and Cost Function
https://www.youtube.com/watch?v=KvtGD37Rm5I&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=40

2. Normalizing the Inputs

Convergence is faster if:

  • the mean of all input data is 0
    Since $\frac{\partial E(W)}{\partial w_{ji}} = y_i\, g'\, \frac{\partial E(W)}{\partial y_j}$, weights feeding the same neuron can only change together when the inputs are all positive or all negative, which leads to slow convergence.
  • the variance of all input data is the same
  • the input features are decorrelated if possible (using PCA to decorrelate them; see the sketch after this list)
    If the inputs are correlated, the direction of steepest descent is not optimal and may be nearly perpendicular to the direction towards the minimum.
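
A minimal NumPy sketch of this preprocessing (zero mean, unit variance, optional PCA decorrelation); the function name and variable names are illustrative, not from the original post.

```python
import numpy as np

def preprocess_inputs(X, decorrelate=False, eps=1e-8):
    """X has shape (N, d): N input vectors with d features."""
    X = X - X.mean(axis=0)                 # zero mean per feature
    if decorrelate:
        cov = np.cov(X, rowvar=False)      # feature covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)
        X = X @ eigvecs                    # rotate onto principal axes (PCA)
        X = X / np.sqrt(eigvals + eps)     # equalize variances (whitening)
    else:
        X = X / (X.std(axis=0) + eps)      # unit variance per feature
    return X
```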

3. Commonly Used Nonlinearities

The activation function is usually nonlinear; some commonly used choices are listed below (a NumPy sketch of these functions follows the list).

  • logistic sigmoid

    $\sigma(a) = \frac{1}{1+\exp(-a)}$;  $\sigma'(a) = \sigma(a)\,(1 - \sigma(a))$

  • tanh

    $\tanh(a) = 2\sigma(2a) - 1$;  $\tanh'(a) = 1 - \tanh^2(a)$

    Advantages compared with the logistic sigmoid:

    $\tanh(a)$ is already centred at zero, so it often converges faster than the standard logistic sigmoid.



figure 1: nonlinear activation functions (left: logistic sigmoid; right: tanh)

  • softmax

    $g_i(a) = \frac{\exp(a_i)}{\sum_j \exp(a_j)}$; the softmax is invariant to adding the same constant to every entry: $\mathrm{softmax}(a + c\mathbf{1}) = \mathrm{softmax}(a)$

  • ReLU

    $g(a) = \max\{0, a\}$; $g'(a) = \begin{cases} 1, & a > 0 \\ 0, & \text{otherwise} \end{cases}$

    Advantages:

    1. Since $g'(a) = 1$ for $a > 0$, the gradient is passed on with a constant factor ($\frac{\partial E(W)}{\partial w_{ji}} = y_i\, g'\, \frac{\partial E(W)}{\partial y_j}$), which makes it easier to propagate gradients through deep networks. (Imagine $g' < 1$: then $\frac{\partial E(W)}{\partial w_{ji}}$ becomes smaller and smaller as the network gets deeper, and the gradient eventually gets close to zero.)

    2. the ReLU output does not need to be stored separately
      This reduces the required memory by half compared with tanh!
      Because of these two features, ReLU has become the de-facto standard for deep networks.

    Disadvantages:

    1. stuck at zero: if the ReLU output is zero for an input, the corresponding gradient is zero as well and cannot be passed on to the layers below.

    2. offset bias, since the output is always non-negative.

  • Leaky ReLU

    $g(a) = \max\{\beta a, a\}$

    Advantages:

    1. avoid “stuck at zero”

    2. weaker offset bias.

  • ELU

    $g(a) = \begin{cases} a, & a \ge 0 \\ e^a - 1, & a < 0 \end{cases}$

    no offset bias, but the activation needs to be stored.



    figure 2: left: ReLU; middle: Leaky ReLU; right: ELU

  • usage of nonlinear functions

    1. Output nodes
      2-class classification: sigmoid
      multi-class classification: softmax
      regression tasks: tanh

    2. Internal nodes
      tanh is better than sigmoid for internal nodes since it is already centered at 0;
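
As an illustration only (not part of the original post), here is a NumPy sketch of the activations listed above and their derivatives; the function names and the default $\beta = 0.01$ for Leaky ReLU are assumptions made for this example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))              # sigma(a)

def sigmoid_grad(a):
    s = sigmoid(a)
    return s * (1.0 - s)                         # sigma'(a) = sigma * (1 - sigma)

def tanh_grad(a):
    return 1.0 - np.tanh(a) ** 2                 # tanh'(a) = 1 - tanh^2(a)

def softmax(a):
    e = np.exp(a - np.max(a))                    # shift-invariant, avoids overflow
    return e / np.sum(e)

def relu(a):
    return np.maximum(0.0, a)                    # max{0, a}

def leaky_relu(a, beta=0.01):                    # beta value is an assumed default
    return np.maximum(beta * a, a)               # max{beta*a, a}

def elu(a):
    return np.where(a >= 0, a, np.exp(a) - 1.0)  # a for a >= 0, e^a - 1 otherwise
```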

4. Weight Initialization

If we normalize all the input data, we also want to preserve its variance through each layer, because the output of one layer becomes the input of the next; if every layer keeps the same variance, convergence is faster.
Thus our goal is to make the variance of a layer's output equal to the variance of its input.

$$y_j(x) = \sum_{i=1}^{l} w_{ji} x_i \qquad\Rightarrow\qquad \mathrm{Var}(w_{ji} x_i) = E(x_i)^2\,\mathrm{Var}(w_{ji}) + E(w_{ji})^2\,\mathrm{Var}(x_i) + \mathrm{Var}(w_{ji})\,\mathrm{Var}(x_i)$$

If the means of the inputs and of the weights are zero and they are independent and identically distributed, then
$$\mathrm{Var}(w_{ji} x_i) = \mathrm{Var}(w_{ji})\,\mathrm{Var}(x_i) \quad\Rightarrow\quad \mathrm{Var}(y_j(x)) = \sum_{i=1}^{l} \mathrm{Var}(w_{ji})\,\mathrm{Var}(x_i) = l\,\mathrm{Var}(w_{ji})\,\mathrm{Var}(x_i)$$
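
A quick numerical check of this relation (purely illustrative; the sizes and variances below are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
l, N = 512, 10000                       # fan-in and number of samples (arbitrary)
var_w, var_x = 0.05, 1.0

W = rng.normal(0.0, np.sqrt(var_w), size=(N, l))
X = rng.normal(0.0, np.sqrt(var_x), size=(N, l))
y = np.sum(W * X, axis=1)               # y_j = sum_i w_ji * x_i for each sample

print(y.var())                          # close to l * var_w * var_x = 25.6
```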

4.1 Glorot Initialization

If $\mathrm{Var}(w_{ji}) = \frac{1}{l_{in}}$, then $\mathrm{Var}(y_j(x)) = \mathrm{Var}(x_i)$, where $l_{in}$ is the number of input neurons connected to the $j$-th output neuron. Doing the same for the backpropagated gradient (with $l = l_{out}$) gives $\mathrm{Var}(w_{ji}) = \frac{1}{l_{out}}$.
The Glorot initialization is a compromise between the two:

$$\mathrm{Var}(w_{ji}) = \frac{2}{l_{in} + l_{out}}$$

4.2 He Initialization

The Glorot initialization was derived for tanh (centred at 0). He et al. repeated the derivation for ReLU and proposed to use instead:

$$\mathrm{Var}(W) = \frac{2}{l_{in}}$$
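
A minimal sketch of both schemes with NumPy (Gaussian variants; the function names are illustrative, not from the original post):

```python
import numpy as np

def glorot_normal(l_in, l_out, rng=None):
    """Weights with Var(w) = 2 / (l_in + l_out), for tanh-like activations."""
    rng = np.random.default_rng() if rng is None else rng
    std = np.sqrt(2.0 / (l_in + l_out))
    return rng.normal(0.0, std, size=(l_out, l_in))

def he_normal(l_in, l_out, rng=None):
    """Weights with Var(w) = 2 / l_in, for ReLU activations."""
    rng = np.random.default_rng() if rng is None else rng
    std = np.sqrt(2.0 / l_in)
    return rng.normal(0.0, std, size=(l_out, l_in))
```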

5. Stochastic and Batch Learning

In gradient descent, the last step is to adjust the weights in the direction of the negative gradient. The update equations are given below.

5.1 Batch learning

Process the full dataset at once to compute the gradient:

$$E(W) = \sum_{i=1}^{N} E_i(W), \qquad w_{ji}^{\tau+1} = w_{ji}^{\tau} - \eta\,\frac{\partial E(W)}{\partial w_{ji}^{\tau}}$$

5.2 Stochastic learning

Choose a single training sample $x^n$ to obtain $E_n(W)$:

$$w_{ji}^{\tau+1} = w_{ji}^{\tau} - \eta\,\frac{\partial E_n(W)}{\partial w_{ji}^{\tau}}$$

5.3 Stochastic vs. Batch Learning

5.3.1 Batch learning advantages
  • Many acceleration techniques (e.g., conjugate gradients) only operate in batch learning.
  • Theoretical analysis of the weight dynamics and convergence rates is simpler.
5.3.2 Stochastic learning advantages
  • Usually much faster than batch learning.
  • Often results in better solutions.
  • Can be used for tracking changes.

5.4 Minibatch

Minibatch learning combines the two methods above: it processes only a small batch of training examples at a time.

5.4.1 Advantages
  • more stable than stochastic learning but faster than batch learning
  • takes advantage of redundancies in the training data (the same training sample can appear in different minibatches)
  • each update processes a matrix (a stack of input vectors), and matrix operations are more efficient than repeated vector operations
5.4.2 Caveat

The error function needs to be normalized by the minibatch size so that the same learning rate behaves consistently across different minibatch sizes. Suppose $M$ is the minibatch size:

$$E(W) = \frac{1}{M}\sum_{i=1}^{M} E_i(W) + \frac{\lambda}{M}\,\Omega(W)$$
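
To make the three variants concrete, here is a hedged sketch of a minibatch gradient-descent loop; the `grad_fn`/`reg_grad_fn` interface and all names are assumptions made for this example (batch and stochastic learning are roughly the special cases M = N and M = 1).

```python
import numpy as np

def minibatch_sgd(W, X, T, grad_fn, eta=0.01, M=32, epochs=10,
                  lam=0.0, reg_grad_fn=None):
    """grad_fn(W, X_batch, T_batch) returns the summed data gradient over the batch;
    reg_grad_fn(W) returns the gradient of Omega(W)."""
    N = X.shape[0]
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(N)                      # reshuffle each epoch
        for start in range(0, N, M):
            idx = order[start:start + M]
            g = grad_fn(W, X[idx], T[idx]) / len(idx)   # 1/M * sum of E_i gradients
            if reg_grad_fn is not None:
                g += (lam / len(idx)) * reg_grad_fn(W)  # lambda/M * dOmega/dW
            W = W - eta * g                             # w^{tau+1} = w^tau - eta * grad
    return W
```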
