Optimization algorithms (Week 2)
Batch vs. mini-batch gradient descent

(1) A large training set can be split into, say, 5000 mini-batches. Then:

for t = 1, …, 5000: Forward prop on $x^{\{t\}}$:
$Z^{[1]} = W^{[1]} x^{\{t\}} + b^{[1]}$
$A^{[1]} = g^{[1]}(Z^{[1]})$
…

$A^{[l]} = g^{[l]}(Z^{[l]})$
(2) Compute the cost on the mini-batch (each mini-batch here contains 1000 examples):

$J^{\{t\}} = \frac{1}{1000} \sum_{i=1}^{1000} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big) + \frac{\lambda}{2 \cdot 1000} \sum_{l} \lVert W^{[l]} \rVert_F^2$
(3) Backprop to compute gradients of $J^{\{t\}}$ (using $(x^{\{t\}}, y^{\{t\}})$):
$W^{[l]} = W^{[l]} - \alpha\, dW^{[l]}$
$b^{[l]} = b^{[l]} - \alpha\, db^{[l]}$
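As an illustration of the whole loop, here is a minimal NumPy sketch; a toy single-layer linear model stands in for the deep network's forward and backward prop, and all data and constants are made up for the example:

```python
import numpy as np

np.random.seed(0)
m, batch_size = 10_000, 1_000            # e.g. 10 mini-batches of 1000 examples
X = np.random.randn(2, m)                # examples stacked column-wise
Y = (3.0 * X[0] - 2.0 * X[1] + 0.5).reshape(1, m)   # synthetic targets

W, b = np.zeros((1, 2)), 0.0
alpha = 0.1

for epoch in range(5):
    for t in range(m // batch_size):                 # "for t = 1, ..., num_batches"
        sl = slice(t * batch_size, (t + 1) * batch_size)
        Xt, Yt = X[:, sl], Y[:, sl]                  # mini-batch X^{t}, Y^{t}
        # Forward prop on X^{t} (one linear layer here; a deep net would
        # chain Z^[l] = W^[l] A^[l-1] + b^[l], A^[l] = g^[l](Z^[l]))
        Yhat = W @ Xt + b
        # Backprop: gradients of the mean squared error on this mini-batch
        dY = 2.0 * (Yhat - Yt) / batch_size
        dW, db = dY @ Xt.T, float(dY.sum())
        # Gradient step: W^[l] -= alpha*dW^[l], b^[l] -= alpha*db^[l]
        W -= alpha * dW
        b -= alpha * db

print(W, b)   # approaches [[3, -2]] and 0.5
```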
Choosing mini-batch size

(1) If mini-batch size = m (the size of the training set): this is batch gradient descent, which takes a very long time per step on a large training set.
(2) If mini-batch size = 1: this is stochastic gradient descent; every example is its own mini-batch. It is noisy, and in the end it keeps oscillating around the minimum instead of converging.
(3) In practice, choose something in between (a mini-batch size that is neither too big nor too small).
Some guidelines for choosing your mini-batch size:
(1) If the training set is small (m ≤ 2000): use batch gradient descent.
(2) Typical mini-batch sizes: 64, 128, 256, 512 (sizes of the form $2^n$ reportedly make the code run faster).
(3) Make sure each mini-batch $X^{\{t\}}, Y^{\{t\}}$ fits in CPU/GPU memory.
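A common way to build such mini-batches is to shuffle the examples (columns) and slice them. The helper below is a sketch; its name random_mini_batches and its signature are assumptions for this example, not from the notes:

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the m examples (columns) and cut them into mini-batches of
    the requested, typically power-of-2, size; the last one may be smaller."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[1])
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [(X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
            for k in range(0, X.shape[1], batch_size)]

batches = random_mini_batches(np.random.randn(3, 1000), np.random.randn(1, 1000))
print(len(batches), batches[0][0].shape)   # 16 batches; the first is (3, 64)
```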
Exponentially weighted moving averages
$V_t = \beta V_{t-1} + (1 - \beta)\,\theta_t$
$\beta = 0.9$: roughly an average over $\frac{1}{1-\beta} = 10$ days' temperature.
$\beta = 0.98$: roughly an average over $\frac{1}{1-\beta} = 50$ days' temperature.
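A quick sketch of the recursion on a synthetic daily-temperature series; the data is made up, and only the update rule comes from the notes:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 20 + 10 * np.sin(np.arange(365) / 58) + rng.normal(0, 3, 365)  # fake daily temps

def ewa(theta, beta):
    """V_t = beta * V_{t-1} + (1 - beta) * theta_t, starting from V_0 = 0."""
    v, out = 0.0, []
    for th in theta:
        v = beta * v + (1 - beta) * th
        out.append(v)
    return np.array(out)

v_09  = ewa(theta, 0.9)    # roughly a 1/(1-0.9)  = 10-day average
v_098 = ewa(theta, 0.98)   # roughly a 1/(1-0.98) = 50-day average: smoother, lags more
```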
Bias correction in the exponentially weighted average
$V_t = \beta V_{t-1} + (1 - \beta)\,\theta_t$

Because $V_0 = 0$, the early estimates are biased toward zero; use $\dfrac{V_t}{1 - \beta^t}$ instead. The correction only matters while $t$ is small: as $t$ grows, $\beta^t \to 0$ and it disappears.
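A standalone sketch of the effect (the temperature values are made up): with $V_0 = 0$ and $\beta = 0.98$, the raw estimate starts near zero while the corrected one tracks the data immediately:

```python
theta = [40.0, 42.0, 41.0, 39.0, 45.0]    # first few days' temperatures (made up)
beta, v = 0.98, 0.0
for t, th in enumerate(theta, start=1):
    v = beta * v + (1 - beta) * th
    # raw V_t starts near zero; V_t / (1 - beta**t) tracks the data at once
    print(f"t={t}: raw={v:.2f}  corrected={v / (1 - beta**t):.2f}")
```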
Gradient descent with momentum:

(1) Compute $dW, db$ on the current mini-batch.
$V_{dW} = \beta V_{dW} + (1 - \beta)\, dW$

$V_{db} = \beta V_{db} + (1 - \beta)\, db$

(Each is the exponentially weighted average $V_\theta = \beta V_\theta + (1 - \beta)\theta_t$ applied to the gradients.)
(2) Update $W, b$:

$W = W - \alpha V_{dW}$

$b = b - \alpha V_{db}$
This damps the oscillations of gradient descent in the vertical direction while allowing faster movement in the horizontal direction, so gradient descent converges more quickly.

(3) Implementation details:
$V_{dW} = 0,\; V_{db} = 0$

On iteration t: compute $dW, db$ on the current mini-batch.
$V_{dW} = \beta V_{dW} + (1 - \beta)\, dW$

$V_{db} = \beta V_{db} + (1 - \beta)\, db$
$W = W - \alpha V_{dW},\quad b = b - \alpha V_{db}$
Hyperparameters: $\alpha$ and $\beta$; a common default is $\beta = 0.9$.
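Putting (1)-(3) together, a minimal sketch of one momentum step; the helper name momentum_update and the stand-in gradients are assumptions for the example:

```python
import numpy as np

def momentum_update(W, b, dW, db, vW, vb, beta=0.9, alpha=0.01):
    """One gradient-descent-with-momentum step, as in the notes:
    V := beta*V + (1-beta)*grad, then param := param - alpha*V."""
    vW = beta * vW + (1 - beta) * dW
    vb = beta * vb + (1 - beta) * db
    return W - alpha * vW, b - alpha * vb, vW, vb

# Initialise V_dW = 0, V_db = 0 with the same shapes as the parameters.
W, b = np.random.randn(3, 2), np.zeros((3, 1))
vW, vb = np.zeros_like(W), np.zeros_like(b)
dW, db = np.ones_like(W), np.ones_like(b)       # stand-in gradients
W, b, vW, vb = momentum_update(W, b, dW, db, vW, vb)
```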
(4) RMSprop (root mean square prop)

On iteration t: compute $dW, db$ on the current mini-batch.
$S_{dW} = \beta S_{dW} + (1 - \beta)\,(dW)^2$ (the square $(dW)^2$ is element-wise)

$S_{db} = \beta S_{db} + (1 - \beta)\,(db)^2$
Update:
$W = W - \alpha \dfrac{dW}{\sqrt{S_{dW}} + \varepsilon},\quad b = b - \alpha \dfrac{db}{\sqrt{S_{db}} + \varepsilon}$
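A sketch of one RMSprop step for a single parameter matrix; the function name and the default β = 0.9 are choices for this example (the notes leave β unspecified here):

```python
import numpy as np

def rmsprop_update(W, dW, sW, beta=0.9, alpha=0.001, eps=1e-8):
    """One RMSprop step: S := beta*S + (1-beta)*dW**2 (element-wise square),
    then W := W - alpha * dW / (sqrt(S) + eps)."""
    sW = beta * sW + (1 - beta) * dW ** 2
    return W - alpha * dW / (np.sqrt(sW) + eps), sW

W = np.random.randn(3, 2)
sW = np.zeros_like(W)                 # S_dW = 0 at the start
W, sW = rmsprop_update(W, 2 * W, sW)  # with a stand-in gradient
```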
(5) Adam optimization algorithm
$V_{dW} = 0,\; S_{dW} = 0,\; V_{db} = 0,\; S_{db} = 0$
On iteration t: compute $dW, db$ using the current mini-batch (the mini-batch gradient).

“momentum”:
$V_{dW} = \beta_1 V_{dW} + (1 - \beta_1)\, dW,\quad V_{db} = \beta_1 V_{db} + (1 - \beta_1)\, db$
“RMSprop”:
$S_{dW} = \beta_2 S_{dW} + (1 - \beta_2)\,(dW)^2,\quad S_{db} = \beta_2 S_{db} + (1 - \beta_2)\,(db)^2$
Bias corrected:

$V_{dW}^{\mathrm{corrected}} = \dfrac{V_{dW}}{1 - \beta_1^t},\quad V_{db}^{\mathrm{corrected}} = \dfrac{V_{db}}{1 - \beta_1^t}$

$S_{dW}^{\mathrm{corrected}} = \dfrac{S_{dW}}{1 - \beta_2^t},\quad S_{db}^{\mathrm{corrected}} = \dfrac{S_{db}}{1 - \beta_2^t}$
$W = W - \alpha \dfrac{V_{dW}^{\mathrm{corrected}}}{\sqrt{S_{dW}^{\mathrm{corrected}}} + \varepsilon},\quad b = b - \alpha \dfrac{V_{db}^{\mathrm{corrected}}}{\sqrt{S_{db}^{\mathrm{corrected}}} + \varepsilon}$
Hyperparameter choices:

$\alpha$: needs to be tuned

$\beta_1$: 0.9 (for the $dW$ average)

$\beta_2$: 0.999 (for the element-wise $(dW)^2$ average)

$\varepsilon$: $10^{-8}$
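Combining the momentum and RMSprop terms with bias correction gives the full update. Below is a minimal sketch; the helper name adam_update and the toy objective are assumptions, but the defaults match the hyperparameters listed above:

```python
import numpy as np

def adam_update(W, dW, vW, sW, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with bias correction, following the notes."""
    vW = beta1 * vW + (1 - beta1) * dW          # "momentum" term
    sW = beta2 * sW + (1 - beta2) * dW ** 2     # "RMSprop" term (element-wise square)
    v_corr = vW / (1 - beta1 ** t)              # bias-corrected first moment
    s_corr = sW / (1 - beta2 ** t)              # bias-corrected second moment
    return W - alpha * v_corr / (np.sqrt(s_corr) + eps), vW, sW

# V_dW = 0, S_dW = 0 at the start; t counts iterations from 1.
W = np.random.randn(4, 3)
vW, sW = np.zeros_like(W), np.zeros_like(W)
for t in range(1, 101):
    dW = 2 * W                                  # gradient of ||W||_F^2, a toy objective
    W, vW, sW = adam_update(W, dW, vW, sW, t)
```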
Learning rate decay (epoch_num = the number of passes over the training set):

$\alpha = \dfrac{1}{1 + \text{decay\_rate} \cdot \text{epoch\_num}} \cdot \alpha_0$
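A tiny sketch of how the rate shrinks per epoch, with made-up values $\alpha_0 = 0.2$ and decay_rate = 1:

```python
alpha0, decay_rate = 0.2, 1.0    # made-up values
for epoch_num in range(1, 6):
    alpha = alpha0 / (1 + decay_rate * epoch_num)
    print(f"epoch {epoch_num}: alpha = {alpha:.4f}")
```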
Local optima in neural networks

In a high-dimensional parameter space, most points where the gradient is zero are saddle points rather than bad local optima; the more practical concern is plateaus, where gradients stay small for a long stretch, which is part of why algorithms like momentum, RMSprop, and Adam speed up training.
Reposted from blog.csdn.net/qq_31805127/article/details/79788571