Thus, perceptron learning corresponds to gradient descent on a quadratic error function.
More details on the error function: sequential updating and the delta rule; gradient descent.
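As a concrete illustration, here is a minimal Python sketch of the sequential (online) delta-rule update for a single linear unit, assuming a sum-of-squares error; the function name `delta_rule_update`, the learning rate `eta` and the bias convention are illustrative choices, not fixed by the text.

```python
import numpy as np

def delta_rule_update(w, x, t, eta=0.1):
    """One sequential (online) delta-rule step for a linear unit.

    w   : weight vector, shape (d+1,), including the bias weight
    x   : input vector, shape (d+1,), with x[0] = 1 as the bias input
    t   : target value for this pattern
    eta : learning rate (illustrative value)
    """
    y = w @ x              # linear output y = sum_i w_i * x_i
    error = y - t          # signed error for this single pattern
    grad = error * x       # gradient of 0.5 * (y - t)^2 with respect to w
    return w - eta * grad  # step in the direction of the negative gradient
```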
1.3 Properties of the two-layer perceptron
It can only represent linear functions, since
$$y_j(x) = \sum_{i=0}^{d} W_{ji}\, x_i \quad\text{or}\quad y_j(x) = \sum_{i=0}^{d} W_{ji}\, \phi(x_i)$$
The discriminant boundary is always linear in the input space x, or in the feature space ϕ(x) when the inputs are transformed by ϕ. To be specific, the boundary can be a line, a plane or a hyperplane, but never a curve. However, a multi-layer perceptron with hidden units can represent any continuous function. ⇒ 2. Multi-layer perceptron
ϕ(x) and g(x) are given in advance; they are fixed functions.
There is always a bias term in the linear discriminant function (in y = ax + b, b is the bias term and does not depend on the input x). Thus the input layer always has d + 1 input neurons, with x_0 fixed to 1; as a result y = a x_1 + b x_0, where x_1 = x and d = 1.
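A tiny sketch of this bias convention, assuming d = 1 and the illustrative values a = 2.0, b = -0.5: prepending the constant input x_0 = 1 turns the bias into an ordinary weight.

```python
import numpy as np

# Illustrative example with d = 1: y = a*x1 + b*x0, where x0 is fixed to 1.
a, b = 2.0, -0.5            # slope and bias (made-up values)
x = 3.0                     # the actual scalar input

x_aug = np.array([1.0, x])  # prepend the constant bias input x0 = 1
w = np.array([b, a])        # bias weight first, then the input weight

y = w @ x_aug               # y = a*x + b: the bias is just another weight
print(y)                    # 5.5
```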
2 Multi-layer perceptron
There are one or more hidden layers between the input layer and the output layer. For example, a perceptron with one hidden layer is shown below.
Figure 2: the structure of a multi-layer perceptron
output:
$$y_k(x) = g^{(2)}\!\left[\sum_{i=0}^{h} W^{(2)}_{ki}\, g^{(1)}\!\left[\sum_{j=0}^{d} W^{(1)}_{ij}\, x_j\right]\right]$$
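A minimal NumPy sketch of this forward pass, assuming a sigmoid for g^{(1)} and the identity for g^{(2)} (both are illustrative choices; the text only says they are fixed functions), with the biases handled by prepending a constant 1 to the inputs and to the hidden-unit outputs.

```python
import numpy as np

def forward(x, W1, W2):
    """Forward pass of a perceptron with one hidden layer.

    x  : input vector, shape (d,)
    W1 : hidden-layer weights, shape (h, d+1)  (column 0 is the bias weight)
    W2 : output-layer weights, shape (k, h+1)
    """
    g1 = lambda a: 1.0 / (1.0 + np.exp(-a))  # hidden activation (sigmoid, assumed)
    g2 = lambda a: a                         # output activation (identity, assumed)

    x_aug = np.concatenate(([1.0], x))       # x0 = 1 for the bias
    z = g1(W1 @ x_aug)                       # hidden-unit outputs
    z_aug = np.concatenate(([1.0], z))       # z0 = 1 for the output-layer bias
    y = g2(W2 @ z_aug)                       # network outputs y_k(x)
    return y, z

# Example with d = 2 inputs, h = 3 hidden units, k = 1 output
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 3))
W2 = rng.normal(size=(1, 4))
y, _ = forward(np.array([0.5, -1.0]), W1, W2)
```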
In Section 1.2 we saw how to learn the weights of a two-layer perceptron. In the same way, for a multi-layer perceptron we also need to define an error function and use gradient descent to update all the weights, but computing the gradient is more complex. There are two main steps:
Step 1: compute the gradient ⇒ 2.1 Backpropagation
Step 2: adjust the weights in the direction of the negative gradient, the same as step 3 in Section 1.2; we will later focus on some optimization techniques to improve performance ⇒ Machine Learning (7) Neural network – optimization techniques
2.1 Backpropagation
2.1.1 How to use backpropagation
Figure 3: the structure of a multi-layer perceptron
Let the layers be indexed m, n and q from top to bottom, with $l_m$, $l_n$ and $l_q$ neurons respectively. Between any two adjacent layers, the upper layer always plays the role of the output layer, with neuron index j, and the lower layer always plays the role of the input layer, with neuron index i.
Our goal is to obtain the gradient $\frac{\partial E(W)}{\partial W^{(mn)}_{ji}}$:
Thus, the three gradients above need to be calculated between every two adjacent layers. Once we have $\frac{\partial E(W)}{\partial y^{(m)}_j}$ from layer $m+1$ above, we can obtain $\frac{\partial E(W)}{\partial W^{(mn)}_{ji}}$ and compute $\frac{\partial E(W)}{\partial y^{(n)}_i}$, which prepares the calculation for the next layer down, $\frac{\partial E(W)}{\partial z^{(q)}_j}$. This is called reverse-mode differentiation.
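To make this concrete, here is a hedged NumPy sketch of one reverse-mode (backward) pass for the one-hidden-layer network above, assuming a sum-of-squares error, a sigmoid hidden activation and an identity output (all illustrative choices): the error signal ∂E/∂y from the top layer yields ∂E/∂W for that layer and the quantity needed by the layer below.

```python
import numpy as np

def backward(x, t, W1, W2):
    """One reverse-mode (backward) pass for a one-hidden-layer network,
    assuming E = 0.5 * ||y - t||^2, sigmoid hidden units, identity output.

    W1 : shape (h, d+1), W2 : shape (k, h+1), column 0 holds the bias weights.
    """
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

    # Forward pass (the intermediate values are needed for the gradients)
    x_aug = np.concatenate(([1.0], x))
    z = sigmoid(W1 @ x_aug)
    z_aug = np.concatenate(([1.0], z))
    y = W2 @ z_aug

    # Top layer: dE/dy, then dE/dW2
    dE_dy = y - t                                # dE/dy_k for the output layer
    dE_dW2 = np.outer(dE_dy, z_aug)              # dE/dW2_ki = dE/dy_k * z_i

    # Propagate down: dE/dz for the hidden layer, then dE/dW1
    dE_dz = W2[:, 1:].T @ dE_dy                  # dE/dz_i from the layer above
    delta_hidden = dE_dz * z * (1 - z)           # chain rule through the sigmoid
    dE_dW1 = np.outer(delta_hidden, x_aug)       # dE/dW1_ij = delta_i * x_j

    return dE_dW1, dE_dW2

# Usage with the same illustrative sizes as before (d = 2, h = 3, k = 1)
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 3))
W2 = rng.normal(size=(1, 4))
gW1, gW2 = backward(np.array([0.5, -1.0]), np.array([1.0]), W1, W2)
```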
2.1.2 Why use backpropagation with reverse-mode differentiation
For any pair of adjacent layers m and n, there are two ways to compute $\frac{\partial E(W)}{\partial W^{(mn)}_{ij}}$. To simplify, suppose we want to compute $\frac{\partial Z}{\partial X}$: one way is to apply the operator $\frac{\partial}{\partial X}$ to every node, which is called forward-mode differentiation; the other is to apply the operator $\frac{\partial Z}{\partial}$ to every node, which is called reverse-mode differentiation.
Visiting all the red edges in the figure yields only one gradient, which is not efficient.
Forward-mode differentiation applies the operator $\frac{\partial}{\partial b}$ to every node; in our case the operator is $\frac{\partial}{\partial y^{(m)}_{j}}$ if the goal is to obtain $\frac{\partial E(W)}{\partial W^{(m,m-1)}_{j_m i}}$; the index of the bottom-most layer is 0.
From the graph above, a single pass gives $\frac{\partial e}{\partial}$ for all nodes, which is more efficient than forward-mode differentiation. Reverse-mode differentiation applies $\frac{\partial e}{\partial}$ to every node; in our case the operator is $\frac{\partial E(W)}{\partial}$. That is, $\frac{\partial E(W)}{\partial y^{(q-1)}_{j_{q-1}}}, \frac{\partial E(W)}{\partial y^{(q-2)}_{j_{q-2}}}, \dots, \frac{\partial E(W)}{\partial y^{(1)}_{j_1}}$ are calculated in order, and then $\frac{\partial E(W)}{\partial W^{(q-1,q-2)}_{j_{q-1} i_{q-1}}}, \frac{\partial E(W)}{\partial W^{(q-2,q-3)}_{j_{q-2} i_{q-2}}}, \dots, \frac{\partial E(W)}{\partial W^{(1,0)}_{j_1 i_1}}$ are also obtained. As mentioned above, $i_m$ is the index of a neuron in the m-th layer when that layer acts as the input layer and ranges from 0 to $l_m$; $j_m$ is defined in the same way.
From all of the above, reverse-mode differentiation can compute all the derivatives in a single pass, which is why backpropagation uses reverse-mode differentiation.
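A small self-contained sketch of this difference on a toy chain e = c(b(a(x))): forward mode (approximated here by a finite difference) needs one pass per input or weight, while a single reverse pass propagates ∂e/∂(·) to every node at once. The functions a, b, c and the point x0 are made up purely for illustration.

```python
import numpy as np

# A tiny scalar chain e = c(b(a(x))) to contrast the two modes.
a = lambda x: x ** 2
b = lambda u: np.sin(u)
c = lambda v: 3.0 * v

x0 = 1.5

# Forward-mode flavour: perturb the input and re-run the whole chain.
# With many weights, this means one full pass per weight.
eps = 1e-6
de_dx_forward = (c(b(a(x0 + eps))) - c(b(a(x0 - eps)))) / (2 * eps)

# Reverse mode: one forward pass stores the intermediate values, then one
# backward sweep applies dE/d(.) to every node in turn.
u = a(x0)
v = b(u)
e = c(v)                       # forward value (not needed for the gradient)
de_dv = 3.0                    # dE/dv for c(v) = 3v
de_du = de_dv * np.cos(u)      # chain rule through b
de_dx = de_du * 2 * x0         # chain rule through a

print(de_dx_forward, de_dx)    # both are approximately -5.65
```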
The next topic will introduce some optimization techniques and show how to implement these ideas in Python.