Table of Contents
1. Mini-batch stochastic gradient descent (SGD)
Train by mini-batch SGD
- $w$: model parameters, $b$: batch size, $\eta_t$: learning rate at time $t$
- randomly initialize $w_1$
- repeat $t = 1, 2, \ldots$ until convergence
  - randomly sample $I_t \subseteq \{1, \ldots, n\}$ with $|I_t| = b$
  - update $w_{t+1} = w_t - \eta_t \nabla_{w_t} \ell(x_{I_t}, y_{I_t}, w_t)$
- SGD is very sensitive to the hyper-parameters $b$ (batch size) and $\eta_t$ (learning rate)
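A minimal sketch of this update loop in PyTorch; the synthetic linear-regression data, the squared loss, and the fixed learning rate are illustrative assumptions, not part of the notes above:

```python
import torch

n, b, eta = 1000, 32, 0.01                 # dataset size, batch size b, learning rate (kept fixed here)
X = torch.randn(n, 2)                      # illustrative features
y = X @ torch.tensor([2.0, -3.4]) + 4.2    # labels from a known linear model
w = torch.randn(2, requires_grad=True)     # randomly initialize w_1
bias = torch.zeros(1, requires_grad=True)

for t in range(500):                       # repeat until converged (fixed step count here)
    idx = torch.randint(0, n, (b,))        # randomly sample I_t with |I_t| = b
    loss = ((X[idx] @ w + bias - y[idx]) ** 2).mean()   # mini-batch squared loss
    loss.backward()                        # gradient of the loss w.r.t. w_t
    with torch.no_grad():
        w -= eta * w.grad                  # w_{t+1} = w_t - eta_t * gradient
        bias -= eta * bias.grad
        w.grad.zero_()
        bias.grad.zero_()
```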
2. Linear Methods --> Multilayer Perceptron (MLP)
Common terms in MLPs
(1) a dense (fully connected, or linear) layer has parameters $W \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^{m}$
it computes output $y = Wx + b \in \mathbb{R}^{m}$
(2) linear regression: dense layer with 1 output
(3) softmax regression: dense layer with m outputs + softmax
(4) an activation is an element-wise non-linear function
$\mathrm{sigmoid}(x) = \frac{1}{1+\exp(-x)}$
$\mathrm{relu}(x) = \max(x, 0)$
(5) stack multiple hidden layers (dense + activation) to get deeper models (see the sketch below)
(6) hyper-parameters: the number of hidden layers and the number of outputs for each hidden layer
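A minimal sketch of such a stack in PyTorch; the layer sizes here are arbitrary illustrative choices:

```python
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(784, 256),  # dense layer: W in R^{256x784}, b in R^256
    nn.ReLU(),            # element-wise non-linear activation
    nn.Linear(256, 64),   # a second hidden layer (dense + activation)
    nn.ReLU(),
    nn.Linear(64, 10),    # output layer; adding softmax on top gives softmax regression
)
```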
3. Dense Layer --> Convolution Layer (CNN)
The problem with dense layers
(1) learning ImageNet (300*300 images with 1k classes) with an MLP with a single hidden layer of 10k outputs (far too many parameters)
- it leads to ~1 billion learnable parameters, which is too big!
- fully connected: each output is a weighted sum over all inputs
(2) recognizing objects in images (locality and translation matter):
- Translation invariance: similar output no matter where the object is
- Locality: pixels are more related to their near neighbors
(3) build this prior knowledge into the model structure to reduce the parameter count
- achieve the same model capacity with fewer parameters
Convolution Layer
(1) locality: an output is computed from a k*k input window (the receptive field)
(2) translation invariance: outputs use the same k*k weights (the kernel)
(3) the model parameters of a convolution layer do not depend on the input/output sizes
(4) a kernel may learn to identify a pattern (one kernel learns one pattern)
"""Convolution with single input and output channels"""
# both input 'X' and weight 'K' are matrix
h,w = K.shape # the size of kernel:height and width
Y = torch.zeros((X.shape[0] - h + 1,X.shape[1] -w +1)) # the convolution result
for i in range(Y.shape[0]):
for j in range(Y.shape[1]):
Y[i,j] = (X[i:i+h,j:j+w]*K).sum()
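A quick usage example; the edge-detection input and kernel are illustrative choices:

```python
X = torch.ones(6, 8)
X[:, 2:6] = 0                      # a dark block creates two vertical edges
K = torch.tensor([[1.0, -1.0]])    # 1x2 kernel that responds to horizontal change
print(corr2d(X, K))                # 1 and -1 at the edges, 0 everywhere else
```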
Pooling Layer
(1) convolution is sensitive to location
- a pixel shift in the input results in a pixel shift in the output
(2) a pooling layer computes the mean/max over k*k windows
```python
import torch

def pool2d(X, h, w, mode='max'):
    # h, w: pooling window height and width
    # mode: 'max' or 'avg'
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == 'max':
                Y[i, j] = X[i:i+h, j:j+w].max()
            elif mode == 'avg':
                Y[i, j] = X[i:i+h, j:j+w].mean()
    return Y
```
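A quick usage example with an illustrative 4x4 input:

```python
X = torch.arange(16.0).reshape(4, 4)
print(pool2d(X, 2, 2, mode='max'))   # 3x3 output: the max of each 2x2 window
print(pool2d(X, 2, 2, mode='avg'))   # 3x3 output: the mean of each 2x2 window
```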
Convolutional Neural Network (CNN): see the references below
- a neural network uses a stack of convolution layers to extract features
- an activation is applied after each convolution layer
- pooling is used to reduce location sensitivity
- modern CNNs are deep neural networks with various hyper-parameters and layer connections (a LeNet-style sketch follows this list):
  - AlexNet 动手学深度学习(十九)——AlexNet:CNN经典网络(二)更深+更大
  - VGG 动手学深度学习(二十)——VGG网络(2014年ILSVRC竞赛第二名模型)
  - Inception 动手学深度学习(二十二)——GoogLeNet:CNN经典模型(五)
  - ResNet 动手学深度学习(二十四)——公式详解ResNet
  - MobileNet
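A minimal LeNet-style sketch of the convolution + activation + pooling stack; the channel counts and sizes are illustrative, assuming 1x28x28 inputs:

```python
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2),  # convolution extracts local features
    nn.ReLU(),                                  # activation after each convolution layer
    nn.MaxPool2d(2),                            # pooling reduces location sensitivity
    nn.Conv2d(6, 16, kernel_size=5),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 10),                  # dense layer maps features to 10 classes
)
```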
4. Dense Layer --> Recurrent Network (RNN)
The problem with dense layers
- language model: predict the next word
  - hello --> world
  - Hello world --> !
- using an MLP naively doesn't handle sequence information well
  - the inputs/outputs don't have the same length
RNN and Gated RNN: see the references below
- simple RNN (see the sketch after this list):
$h_t = \phi(W_{hh} h_{t-1} + W_{hx} x_t + b_h)$
- Gated RNN (LSTM and GRU): finer control of information flow
  - forget input: suppress $x_t$ when computing $h_t$
  - forget past: suppress $h_{t-1}$ when computing $h_t$
- 动手学深度学习(三十九)——门控循环单元GRU
- 动手学深度学习(四十)——长短期记忆网络(LSTM)
- 动手学深度学习(四十一)——深度循环神经网络(Deep-RNN)
- 动手学深度学习(四十二)——双向循环神经网络(bi-RNN)
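A minimal sketch of the simple-RNN update above, using tanh as the activation $\phi$; the dimensions and random inputs are illustrative assumptions:

```python
import torch

n_in, n_hid = 8, 16
W_hx = torch.randn(n_hid, n_in) * 0.01    # input-to-hidden weights
W_hh = torch.randn(n_hid, n_hid) * 0.01   # hidden-to-hidden weights
b_h = torch.zeros(n_hid)

h = torch.zeros(n_hid)                    # initial hidden state h_0
xs = torch.randn(5, n_in)                 # a sequence of 5 input vectors
for x_t in xs:                            # temporal information flows through h
    h = torch.tanh(W_hh @ h + W_hx @ x_t + b_h)
```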
5. Summary
- MLP: stack dense layers with non-linear activations
- CNN: stack convolution, activation, and pooling layers to efficiently extract spatial information
- RNN: stack recurrent layers to pass temporal information through the hidden state