Machine Learning(7) Neural network —— Perceptrons

Copyright notice: https://blog.csdn.net/qq_26386707/article/details/79343441



Chenjing Ding
2018/02/21


| notation | meaning |
| --- | --- |
| $g(x)$ | activation function |
| $x_n$ | the $n$-th input vector (written as $x$ when $n$ is not specified) |
| $x_{ni}$ | the $i$-th entry of $x_n$ (written as $x_i$ when $n$ is not specified) |
| $N$ | the number of input vectors |
| $K$ | the number of classes |
| $t_n$ | a $K$-dimensional vector whose $k$-th entry is 1 only when the $n$-th input vector belongs to the $k$-th class, $t_n = (0,0,\dots,1,\dots,0)$ |
| $y_j(x)$ | the output of the $j$-th output neuron |
| $y(x)$ | the output vector for input $x$; $y(x) = (y_1(x), \dots, y_K(x))$ |
| $W_{ji}^{\tau+1}$ | the $(\tau+1)$-th update of weight $W_{ji}$ |
| $W_{ji}^{\tau}$ | the $\tau$-th update of weight $W_{ji}$ |
| $\frac{\partial E(W)}{\partial W_{ij}^{(m)}}$ | the gradient of the error with respect to the $m$-th layer weights |
| $l_i$ | the number of neurons in the $i$-th layer |
| $W_{ji}^{(mn)}$ | the weight between layer $m$ and layer $n$ |

1. Two-layer perceptron

1.1 construction

"Two layers" refers to the input layer and the output layer; the basic structure of a two-layer perceptron is as follows:



Figure 1: the structure of a two-layer perceptron

Input layer:
$d$ neurons, where $d$ is the dimension of an input vector $x$. Non-linear basis functions $\phi(x)$ can be applied to the input layer.

Weights:
$W_{ji}$, where $j$ is the index of a neuron in the output layer and $i$ is the index of a neuron in the input layer.

Output layer:
There are $K$ classes, so there are $K$ output functions. An activation function $g(\cdot)$ can be applied to the output layer.

$$y_i(x) > y_j(x) \;\; \forall j \neq i \;\Rightarrow\; \text{input data } x \in C_i$$

$$\text{linear output: } y_j(x) = \sum_{i=0}^{d} W_{ji} x_i \;\;\text{ or }\;\; \sum_{i=0}^{d} W_{ji}\,\phi(x_i)$$

$$\text{nonlinear output: } y_j(x) = g\Big(\sum_{i=0}^{d} W_{ji} x_i\Big) \;\;\text{ or }\;\; g\Big(\sum_{i=0}^{d} W_{ji}\,\phi(x_i)\Big)$$
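As a concrete illustration of the output equations above, here is a minimal NumPy sketch (the function name `two_layer_output` and the random weights are illustrative assumptions, not part of the original post): it computes the linear outputs $y_j(x)$ and assigns $x$ to the class with the largest output.

```python
import numpy as np

def two_layer_output(W, x, g=None):
    """Linear outputs y_j(x) = sum_i W[j, i] * x_i; optionally apply an activation g."""
    y = W @ x                       # shape (K,): one output per class
    return g(y) if g is not None else y

# illustrative sizes: d = 3 features plus a bias input x_0 = 1, K = 2 classes
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))            # W[j, i]: output neuron j, input neuron i
x = np.array([1.0, 0.5, -1.2, 3.0])    # x[0] = 1 is the bias input

y = two_layer_output(W, x)             # linear output
print("outputs:", y, "-> predicted class:", np.argmax(y))
```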

1.2 Learning: how to obtain $W_{ji}$

Gradient descent with sequential updating can be used to minimize the error function $E(W)$ and thereby adjust the weights.

step 1: set up an error function $E(W)$;
if we use the L2 loss,

$$E_n(W) = \frac{1}{2}\big(y(x_n) - t_n\big)^2 = \frac{1}{2}\sum_{l=1}^{K}\big(y_l(x_n) - t_{nl}\big)^2 = \frac{1}{2}\sum_{l=1}^{K}\Big(\sum_{m=0}^{d} W_{lm}\,\phi(x_m) - t_{nl}\Big)^2$$

step 2: calculate $\frac{\partial E_n(W)}{\partial W_{ji}}$;

$$\frac{\partial E_n(W)}{\partial W_{ji}} = \sum_{l=1}^{K}\Big[\big(y_l(x_n) - t_{nl}\big)\frac{\partial y_l(x_n)}{\partial W_{ji}}\Big] = \big(y_j(x_n) - t_{nj}\big)\,\phi(x_i)$$

step 3: sequential update, where $\eta$ is the learning rate;

$$W_{ji}^{\tau+1} = W_{ji}^{\tau} - \eta\,\frac{\partial E_n(W)}{\partial W_{ji}} = W_{ji}^{\tau} - \eta\,\big(y_j(x_n) - t_{nj}\big)\,\phi(x_i) \qquad (\text{Delta rule / LMS rule})$$

Thus, perceptron learning corresponds to gradient descent on a quadratic error function.

  1. error function (more details)
  2. sequential updating and the delta rule
  3. gradient descent
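A minimal end-to-end sketch of steps 1–3 above (the sequential delta-rule/LMS update), assuming NumPy, the identity basis function $\phi(x) = x$ with a prepended bias 1, one-hot targets, and an illustrative function name `delta_rule_epoch`:

```python
import numpy as np

def delta_rule_epoch(W, X, T, eta=0.1):
    """One pass of sequential (online) updates:
    W_ji <- W_ji - eta * (y_j(x_n) - t_nj) * phi(x_i), with phi(x) = x here."""
    for x, t in zip(X, T):
        phi = np.concatenate(([1.0], x))   # bias term x_0 = 1
        y = W @ phi                        # linear outputs, shape (K,)
        W -= eta * np.outer(y - t, phi)    # delta / LMS rule
    return W

# toy data: N = 4 points, d = 2 features, K = 2 classes (one-hot targets)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
W = np.zeros((2, 3))                       # K x (d + 1)

for _ in range(50):
    W = delta_rule_epoch(W, X, T)
print(np.argmax(X @ W[:, 1:].T + W[:, 0], axis=1))   # predicted classes
```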

1.3 Properties of a two-layer perceptron

  1. It can only represent linear functions, since

    $$y_j(x) = \sum_{i=0}^{d} W_{ji} x_i \;\;\text{ or }\;\; \sum_{i=0}^{d} W_{ji}\,\phi(x_i)$$

    The discriminant boundary is always linear in the input space $x$, or in $\phi(x)$ when the basis functions are applied to the input layer; specifically, the boundary can be a line or a plane, but never a curve. However, a multi-layer perceptron with hidden units can represent any continuous function.

  2. $\phi(x)$ and $g(x)$ are specified beforehand; they are fixed functions.

  3. There is always a bias term in the linear discriminant function (in $y = ax + b$, $b$ is the bias term and does not depend on the input $x$). Therefore the input layer always has $d+1$ input neurons, with $x_0$ fixed to 1, so that $y = a x_1 + b x_0$ with $x_1 = x$ and $d = 1$.
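A tiny illustrative snippet of this bias convention (assuming NumPy):

```python
import numpy as np

x = np.array([0.5, -1.2, 3.0])        # original d = 3 features
x_aug = np.concatenate(([1.0], x))    # prepend x_0 = 1, so weight W_j0 plays the role of the bias b
```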

2. Multi-layer perceptron

There are one or more hidden layers between the input layer and the output layer.
For example, a perceptron with one hidden layer looks as follows:



Figure 2: the structure of a multi-layer perceptron

output:

$$y_k(x) = g^{(2)}\Big[\sum_{i=0}^{h} W_{ki}^{(2)}\, g^{(1)}\Big(\sum_{j=0}^{d} W_{ij}^{(1)} x_j\Big)\Big]$$
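A minimal sketch of this forward pass for one hidden layer, assuming NumPy, a sigmoid for $g^{(1)}$ and the identity for $g^{(2)}$; the layer sizes, function names and random weights are illustrative choices, not prescribed by the post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(W1, W2, x):
    """Forward pass of a one-hidden-layer perceptron.
    W1: (h, d+1) input->hidden weights, W2: (K, h+1) hidden->output weights."""
    x = np.concatenate(([1.0], x))            # bias input x_0 = 1
    hidden = sigmoid(W1 @ x)                  # g^(1) applied to hidden pre-activations
    hidden = np.concatenate(([1.0], hidden))  # hidden bias unit
    return W2 @ hidden                        # g^(2) = identity (linear output)

rng = np.random.default_rng(1)
W1 = rng.normal(scale=0.5, size=(5, 4))       # h = 5 hidden units, d = 3 inputs
W2 = rng.normal(scale=0.5, size=(2, 6))       # K = 2 outputs
print(mlp_forward(W1, W2, np.array([0.2, -1.0, 0.7])))
```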

In section 1.2 we saw how to learn the weights of a two-layer perceptron. In the same way, for a multi-layer perceptron we also need to set up an error function and use gradient descent to update all the weights, but computing the gradient is more complex. There are two main steps:

step 1: compute the gradient (section 2.1, Backpropagation);
step 2: adjust the weights in the direction of the negative gradient, as in step 3 of section 1.2; later we will focus on some optimization techniques to improve performance (Machine Learning(7) Neural network – optimization techniques).

2.1 Backpropagation

2.1.1 How to use backpropagation



Figure 3: the structure of a multi-layer perceptron

If the layer indices are $m$, $n$ and $q$ from top to bottom, the numbers of neurons in these layers are $l_m$, $l_n$ and $l_q$. Between two adjacent layers, the upper layer always acts as the output layer, with neuron index $j$, and the lower layer always acts as the input layer, with neuron index $i$.

Our goal is to obtain the gradient $\frac{\partial E(W)}{\partial W_{ji}^{(mn)}}$:

$$y_j^{(m)} = g\big(z_j^{(n)}\big), \qquad z_j^{(n)} = \sum_{i=1}^{l_n} W_{ji}^{(mn)}\, y_i^{(n)}$$

$$\frac{\partial E(W)}{\partial W_{ji}^{(mn)}} = \frac{\partial E(W)}{\partial y_j^{(m)}}\,\frac{\partial y_j^{(m)}}{\partial W_{ji}^{(mn)}} = \frac{\partial E(W)}{\partial y_j^{(m)}}\,\frac{\partial y_j^{(m)}}{\partial z_j^{(n)}}\,\frac{\partial z_j^{(n)}}{\partial W_{ji}^{(mn)}}$$
Thus there are three gradients that need to be computed to get the result:
$$\frac{\partial E(W)}{\partial z_j^{(n)}} = \frac{\partial E(W)}{\partial y_j^{(m)}}\,\frac{\partial y_j^{(m)}}{\partial z_j^{(n)}} = \frac{\partial E(W)}{\partial y_j^{(m)}}\, g'\big(z_j^{(n)}\big)$$

$$\frac{\partial z_j^{(n)}}{\partial W_{ji}^{(mn)}} = y_i^{(n)} \;\Rightarrow\; \frac{\partial E(W)}{\partial W_{ji}^{(mn)}} = \frac{\partial E(W)}{\partial y_j^{(m)}}\, g'\big(z_j^{(n)}\big)\, y_i^{(n)}$$

$$\frac{\partial E(W)}{\partial y_i^{(n)}} = \sum_{j=1}^{l_m}\frac{\partial E(W)}{\partial z_j^{(n)}}\,\frac{\partial z_j^{(n)}}{\partial y_i^{(n)}} = \sum_{j=1}^{l_m} W_{ji}^{(mn)}\,\frac{\partial E(W)}{\partial z_j^{(n)}}$$
Thus the three gradients above need to be calculated between every two adjacent layers. Once we have $\frac{\partial E(W)}{\partial y_j^{(m)}}$ from the layer above, we can obtain $\frac{\partial E(W)}{\partial W_{ji}^{(mn)}}$ and compute $\frac{\partial E(W)}{\partial y_i^{(n)}}$, which prepares the calculation of $\frac{\partial E(W)}{\partial z_j^{(q)}}$ for the next layer down. This is called reverse-mode differentiation.
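The three gradients above translate almost line by line into code. Below is a minimal NumPy sketch for the one-hidden-layer network of section 2, assuming an L2 loss, a sigmoid hidden activation and a linear output (these concrete choices and the name `backprop` are illustrative assumptions).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(W1, W2, x, t):
    """Return dE/dW1 and dE/dW2 for E = 0.5 * ||y(x) - t||^2."""
    # forward pass, keeping the intermediate values needed later
    x = np.concatenate(([1.0], x))            # bias input x_0 = 1
    z1 = W1 @ x                               # hidden pre-activations z_j^(n)
    h = np.concatenate(([1.0], sigmoid(z1)))  # hidden outputs with bias unit
    y = W2 @ h                                # linear outputs y_j^(m)

    # output layer: g is the identity, so dE/dz_j = dE/dy_j = y_j - t_j
    delta2 = y - t
    dW2 = np.outer(delta2, h)                 # dE/dW_ji = dE/dz_j * y_i

    # propagate down: dE/dy_i = sum_j W_ji * dE/dz_j (skip the bias column of W2)
    dE_dh = W2[:, 1:].T @ delta2
    delta1 = dE_dh * sigmoid(z1) * (1.0 - sigmoid(z1))   # dE/dz_j for the hidden layer
    dW1 = np.outer(delta1, x)
    return dW1, dW2
```

A gradient-descent step is then `W1 -= eta * dW1` and `W2 -= eta * dW2`, as in step 3 of section 1.2.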

2.1.2 Why use backpropagation with reverse-mode differentiation

For any pair of adjacent layers $m$ and $n$, there are two ways to calculate $\frac{\partial E(W)}{\partial W_{ji}^{(mn)}}$. To simplify, suppose we want to calculate $\frac{\partial Z}{\partial X}$: one way is to apply the operator $\frac{\partial}{\partial X}$ to every node, which is called forward-mode differentiation; the other way is to apply the operator $\frac{\partial Z}{\partial \cdot}$ to every node, which is called reverse-mode differentiation.

Figure 4: computation graph

1: forward-mode differentiation

Figure 5: forward-mode differentiation computation graph

$$\frac{\partial e}{\partial b} = \frac{\partial e}{\partial c}\,\frac{\partial c}{\partial b} + \frac{\partial e}{\partial d}\,\frac{\partial d}{\partial b} = 5$$
Visiting all the red edges, we only get one gradient (the derivative with respect to a single input), which is not efficient.

Forward-mode differentiation applies the operator $\frac{\partial}{\partial b}$ to every node; in our case the operator is $\frac{\partial}{\partial y_j^{(m)}}$ when the goal is to obtain $\frac{\partial E(W)}{\partial W_{j_m i}^{(m,\,m-1)}}$; the index of the lowest layer is 0.

$$\frac{\partial E(W)}{\partial W_{j_1 i}^{(10)}} = \frac{\partial E(W)}{\partial y_{j_1}^{(1)}}\,\frac{\partial y_{j_1}^{(1)}}{\partial W_{j_1 i}^{(10)}} = \Big[\sum_{j_2}\frac{\partial E(W)}{\partial y_{j_2}^{(2)}}\,\frac{\partial y_{j_2}^{(2)}}{\partial y_{j_1}^{(1)}}\Big]\frac{\partial y_{j_1}^{(1)}}{\partial W_{j_1 i}^{(10)}} = \Big[\sum_{j_3}\sum_{j_2}\frac{\partial E(W)}{\partial y_{j_3}^{(3)}}\,\frac{\partial y_{j_3}^{(3)}}{\partial y_{j_2}^{(2)}}\,\frac{\partial y_{j_2}^{(2)}}{\partial y_{j_1}^{(1)}}\Big]\frac{\partial y_{j_1}^{(1)}}{\partial W_{j_1 i}^{(10)}} = \Big[\sum_{j_{q-1}}\cdots\sum_{j_3}\sum_{j_2}\frac{\partial E(W)}{\partial y_{j_{q-1}}^{(q-1)}}\,\frac{\partial y_{j_{q-1}}^{(q-1)}}{\partial y_{j_{q-2}}^{(q-2)}}\cdots\frac{\partial y_{j_2}^{(2)}}{\partial y_{j_1}^{(1)}}\Big]\frac{\partial y_{j_1}^{(1)}}{\partial W_{j_1 i}^{(10)}}$$

Thus we have to visit every layer just to obtain $\frac{\partial E(W)}{\partial W_{j_1 i}^{(10)}}$; when it comes to $\frac{\partial E(W)}{\partial W_{j_1, i+1}^{(10)}}$, we need to visit every layer again!

2: reverse-mode differentiation



Figure 6: reverse-mode differentiation computation graph

From the graph above, a single pass gives the derivative of $e$ with respect to every node, which is more efficient than forward-mode differentiation.
Reverse-mode differentiation applies the operator $\frac{\partial e}{\partial \cdot}$ to every node; in our case this is $\frac{\partial E(W)}{\partial \cdot}$. That is to say, $\frac{\partial E(W)}{\partial y_{j_{q-1}}^{(q-1)}}, \frac{\partial E(W)}{\partial y_{j_{q-2}}^{(q-2)}}, \dots, \frac{\partial E(W)}{\partial y_{j_1}^{(1)}}$ are calculated in order.
Then $\frac{\partial E(W)}{\partial W_{j_{q-1} i_{q-1}}^{(q-1,\,q-2)}}, \frac{\partial E(W)}{\partial W_{j_{q-2} i_{q-2}}^{(q-2,\,q-3)}}, \dots, \frac{\partial E(W)}{\partial W_{j_1 i_1}^{(1,\,0)}}$ are obtained along the way. As mentioned above, $i_m$ is the index of a neuron in the $m$-th layer when that layer acts as the input layer, and it ranges from 0 to $l_m$; $j_m$ is defined similarly.

In summary, reverse-mode differentiation can compute all the derivatives in a single pass, and that is why backpropagation uses reverse-mode differentiation.
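To make this concrete, here is a tiny Python sketch of reverse-mode differentiation on the computation graph of figures 4–6, assuming the common example $c = a + b$, $d = b + 1$, $e = c \cdot d$ with $a = 2$, $b = 1$ (these concrete values are an assumption, chosen to be consistent with $\frac{\partial e}{\partial b} = 5$ above). One backward sweep yields the derivative of $e$ with respect to every node.

```python
# forward pass through the graph
a, b = 2.0, 1.0
c = a + b            # c = 3
d = b + 1.0          # d = 2
e = c * d            # e = 6

# reverse-mode sweep: start from de/de = 1 and walk the graph once
de_dc = d            # e = c * d  ->  de/dc = d
de_dd = c            #            ->  de/dd = c
de_da = de_dc * 1.0                 # c = a + b  ->  dc/da = 1
de_db = de_dc * 1.0 + de_dd * 1.0   # b feeds both c and d
print(de_da, de_db)  # 2.0 5.0 -- all derivatives from a single pass
```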

The next post will introduce some optimization techniques and show how to implement these ideas in Python.
