Table of Contents
1. Mini-batch stochastic gradient descent (SGD)
Train by mini-batch SGD
- $w$: model parameters, $b$: batch size, $\eta_t$: learning rate at time $t$
- randomly initialize $w_1$
- repeat $t = 1, 2, \ldots$ until convergence
  - randomly sample $I_t \subseteq \{1, \ldots, n\}$ with $|I_t| = b$
  - update $w_{t+1} = w_t - \eta_t \nabla_{w_t} \ell(x_{I_t}, y_{I_t}, w_t)$
- SGD is very sensitive to the hyper-parameters $b$ (batch size) and $\eta_t$ (learning rate)
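A minimal sketch of this update loop in PyTorch; the synthetic linear-regression data, the squared loss, and the fixed learning rate are illustrative assumptions, not part of the notes above:

```python
import torch

n, b, eta = 1000, 32, 0.01                 # dataset size, batch size b, learning rate (kept fixed here)
X = torch.randn(n, 2)                      # illustrative features
y = X @ torch.tensor([2.0, -3.4]) + 4.2    # labels from a known linear model
w = torch.randn(2, requires_grad=True)     # randomly initialize w_1
bias = torch.zeros(1, requires_grad=True)

for t in range(500):                       # repeat until converged (fixed step count here)
    idx = torch.randint(0, n, (b,))        # randomly sample I_t with |I_t| = b
    loss = ((X[idx] @ w + bias - y[idx]) ** 2).mean()   # mini-batch squared loss
    loss.backward()                        # gradient of the loss w.r.t. w_t
    with torch.no_grad():
        w -= eta * w.grad                  # w_{t+1} = w_t - eta_t * gradient
        bias -= eta * bias.grad
        w.grad.zero_()
        bias.grad.zero_()
```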
2. Linear Methods --> Multilayer Perceptron (MLP)
Common terms in MLPs
(1) a dense (fully connected, or linear) layer has parameters $W \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^{m}$
it computes output $y = Wx + b \in \mathbb{R}^{m}$
(2) linear regression: dense layer with 1 output
(3) softmax regression: dense layer with m outputs + softmax
(4) an activation is an element-wise non-linear function
$\mathrm{sigmoid}(x) = \frac{1}{1+\exp(-x)}$
$\mathrm{relu}(x) = \max(x, 0)$
(5) stack multiple hidden layers (dense + activation) to get deeper models (see the sketch below)
(6) hyper-parameters: the number of hidden layers and the number of outputs for each hidden layer
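A minimal sketch of such a stack in PyTorch; the layer sizes here are arbitrary illustrative choices:

```python
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(784, 256),  # dense layer: W in R^{256x784}, b in R^256
    nn.ReLU(),            # element-wise non-linear activation
    nn.Linear(256, 64),   # a second hidden layer (dense + activation)
    nn.ReLU(),
    nn.Linear(64, 10),    # output layer; adding softmax on top gives softmax regression
)
```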
3. Dense Layer --> Convolution Layer (CNN)
The problem with dense layers
(1) learning ImageNet (300*300 images with 1k classes) with an MLP with a single hidden layer of 10k outputs (far too many parameters)
- it leads to ~1 billion learnable parameters, which is too big!
- fully connected: each output is a weighted sum over all inputs
(2) recognizing objects in images (locality and translation matter):
- Translation invariance: similar output no matter where the object is
- Locality: pixels are more related to their near neighbors
(3) build this prior knowledge into the model structure to reduce the parameter count
- achieve the same model capacity with fewer parameters
Convolution Layer
(1) locality: an output is computed from a k*k input window (the receptive field)
(2) translation invariance: outputs use the same k*k weights (the kernel)
(3) the model parameters of a convolution layer do not depend on the input/output sizes
(4) a kernel may learn to identify a pattern (one kernel learns one pattern)
"""Convolution with single input and output channels"""
# both input 'X' and weight 'K' are matrix
h,w = K.shape # the size of kernel:height and width
Y = torch.zeros((X.shape[0] - h + 1,X.shape[1] -w +1)) # the convolution result
for i in range(Y.shape[0]):
for j in range(Y.shape[1]):
Y[i,j] = (X[i:i+h,j:j+w]*K).sum()
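A quick usage example; the edge-detection input and kernel are illustrative choices:

```python
X = torch.ones(6, 8)
X[:, 2:6] = 0                      # a dark block creates two vertical edges
K = torch.tensor([[1.0, -1.0]])    # 1x2 kernel that responds to horizontal change
print(corr2d(X, K))                # 1 and -1 at the edges, 0 everywhere else
```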
Pooling Layer
(1) convolution is sensitive to location
- a pixel shift in the input results in a pixel shift in the output
(2) a pooling layer computes the mean/max over k*k windows
```python
import torch

def pool2d(X, h, w, mode='max'):
    # h, w: pooling window height and width
    # mode: 'max' or 'avg'
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == 'max':
                Y[i, j] = X[i:i+h, j:j+w].max()
            elif mode == 'avg':
                Y[i, j] = X[i:i+h, j:j+w].mean()
    return Y
```
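A quick usage example with an illustrative 4x4 input:

```python
X = torch.arange(16.0).reshape(4, 4)
print(pool2d(X, 2, 2, mode='max'))   # 3x3 output: the max of each 2x2 window
print(pool2d(X, 2, 2, mode='avg'))   # 3x3 output: the mean of each 2x2 window
```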
Convolutional Neural Network (CNN): see the references below
- a neural network uses a stack of convolution layers to extract features
- an activation is applied after each convolution layer
- pooling is used to reduce location sensitivity
- modern CNNs are deep neural networks with various hyper-parameters and layer connections (a LeNet-style sketch follows this list):
  - AlexNet 动手学深度学习(十九)——AlexNet:CNN经典网络(二)更深+更大
  - VGG 动手学深度学习(二十)——VGG网络(2014年ILSVRC竞赛第二名模型)
  - Inception 动手学深度学习(二十二)——GoogLeNet:CNN经典模型(五)
  - ResNet 动手学深度学习(二十四)——公式详解ResNet
  - MobileNet
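A minimal LeNet-style sketch of the convolution + activation + pooling stack; the channel counts and sizes are illustrative, assuming 1x28x28 inputs:

```python
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2),  # convolution extracts local features
    nn.ReLU(),                                  # activation after each convolution layer
    nn.MaxPool2d(2),                            # pooling reduces location sensitivity
    nn.Conv2d(6, 16, kernel_size=5),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 10),                  # dense layer maps features to 10 classes
)
```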
4. Dense Layer --> Recurrent Network (RNN)
The problem with dense layers
- language model: predict the next word
  - hello --> world
  - Hello world --> !
- using an MLP naively doesn't handle sequence information well
  - the inputs/outputs don't have the same length
RNN and Gated RNN: see the references below
- simple RNN (see the sketch after this list):
$h_t = \phi(W_{hh} h_{t-1} + W_{hx} x_t + b_h)$
- Gated RNN (LSTM and GRU): finer control of information flow
  - forget input: suppress $x_t$ when computing $h_t$
  - forget past: suppress $h_{t-1}$ when computing $h_t$
- 动手学深度学习(三十九)——门控循环单元GRU
- 动手学深度学习(四十)——长短期记忆网络(LSTM)
- 动手学深度学习(四十一)——深度循环神经网络(Deep-RNN)
- 动手学深度学习(四十二)——双向循环神经网络(bi-RNN)
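A minimal sketch of the simple-RNN update above, using tanh as the activation $\phi$; the dimensions and random inputs are illustrative assumptions:

```python
import torch

n_in, n_hid = 8, 16
W_hx = torch.randn(n_hid, n_in) * 0.01    # input-to-hidden weights
W_hh = torch.randn(n_hid, n_hid) * 0.01   # hidden-to-hidden weights
b_h = torch.zeros(n_hid)

h = torch.zeros(n_hid)                    # initial hidden state h_0
xs = torch.randn(5, n_in)                 # a sequence of 5 input vectors
for x_t in xs:                            # temporal information flows through h
    h = torch.tanh(W_hh @ h + W_hx @ x_t + b_h)
```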
5. Summary
- MLP: stack dense layers with non-linear activations
- CNN: stack convolution, activation, and pooling layers to efficiently extract spatial information
- RNN: stack recurrent layers to pass temporal information through the hidden state