1. optimizer.minimize(loss, var_list)

TensorFlow provides a rich set of optimizers, such as GradientDescentOptimizer. The minimize() method automatically computes the gradients of loss with respect to the given variables. For example:
```python
loss = ...
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
train_op = opt.minimize(loss)
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for step in range(10):
        sess.run(train_op)
```
Let's look at the source code of minimize() (some parameters removed for clarity):
```python
def minimize(self, loss, global_step=None, var_list=None, name=None):

    grads_and_vars = self.compute_gradients(loss, var_list=var_list)

    vars_with_grad = [v for g, v in grads_and_vars if g is not None]
    if not vars_with_grad:
        raise ValueError(
            "No gradients provided for any variable, check your graph for ops"
            " that do not support gradients, between variables %s and loss %s." %
            ([str(v) for _, v in grads_and_vars], loss))

    return self.apply_gradients(grads_and_vars, global_step=global_step,
                                name=name)
```
From the source we can see that minimize() actually consists of two steps: compute_gradients and apply_gradients. The former computes the gradients, and the latter uses those gradients to update the corresponding variables; gradient clipping, applied between the two steps, is mainly used to avoid exploding and vanishing gradients during training. Both functions are described in detail below.
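As a quick illustration of this decomposition (plain Python rather than TensorFlow; the function names merely mirror the optimizer's methods), one "minimize" step for the toy loss (w - 3)^2 can be sketched as:

```python
# Sketch of minimize() as two steps, for the toy loss (w - 3)^2.

def compute_gradients(w):
    # Analytic gradient of (w - 3)^2 w.r.t. w, paired with the variable,
    # mirroring the [(gradient, variable), ...] structure TensorFlow returns.
    return [(2.0 * (w - 3.0), w)]

def apply_gradients(grads_and_vars, lr=0.1):
    # Plain gradient-descent update: v <- v - lr * g for each pair.
    return [v - lr * g for g, v in grads_and_vars]

w = 10.0
for step in range(100):          # each iteration plays the role of one minimize() call
    (w,) = apply_gradients(compute_gradients(w))

print(round(w, 6))  # -> 3.0, the minimum of the loss
```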
1.1 compute_gradients(loss, var_list)
Parameters:

- loss: the Tensor to be minimized.
- var_list: Optional list or tuple of tf.Variable to update to minimize loss. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.
In short, this function computes the gradients of loss with respect to the variables in var_list, and returns a list of tuples, i.e. [(gradient, variable), ...].
```python
x = tf.Variable(initial_value=50., dtype='float32')
w = tf.Variable(initial_value=10., dtype='float32')
y = w * x

opt = tf.train.GradientDescentOptimizer(0.1)
grad = opt.compute_gradients(y, [w, x])
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad))
```
>>> [(50.0, 10.0), (10.0, 50.0)]
As you can see, it returns a list whose elements are tuples. In the first tuple, the first element is 50, the value of ∂y/∂w, and the second element is the value of w. The second tuple is analogous.
The function tf.gradients(loss, variables) behaves similarly, but it returns only the computed gradients, not the corresponding variables.
```python
tf.gradients(
    ys,
    xs,
    grad_ys=None,
    name='gradients',
    colocate_gradients_with_ops=False,
    gate_gradients=False,
    aggregation_method=None,
    stop_gradients=None,
    unconnected_gradients=tf.UnconnectedGradients.NONE
)
```
Returns: A list of sum(dy/dx) for each x in xs.
```python
a = tf.constant(0.)
b = 2 * a
g = tf.gradients(a + b, [a, b], stop_gradients=[a, b])
# Result: [1.0, 1.0]
# stop_gradients treats the listed tensors as constants, so they are
# not differentiated through.
```
Comparing the two:
```python
with tf.Graph().as_default():
    x = tf.Variable(initial_value=3., dtype='float32')
    w = tf.Variable(initial_value=4., dtype='float32')
    y = w * x

    grads = tf.gradients(y, [w])
    print(grads)

    opt = tf.train.GradientDescentOptimizer(0.1)
    grads_vals = opt.compute_gradients(y, [w])
    print(grads_vals)
```

>>> [<tf.Tensor 'gradients/mul_grad/Mul:0' shape=() dtype=float32>]
[(<tf.Tensor 'gradients_1/mul_grad/tuple/control_dependency:0' shape=() dtype=float32>, <tf.Variable 'Variable_1:0' shape=() dtype=float32_ref>)]
1.2 apply_gradients(grads_and_vars, global_step=None, name=None)
```python
apply_gradients(
    grads_and_vars,
    global_step=None,
    name=None
)
```

- grads_and_vars: List of (gradient, variable) pairs as returned by compute_gradients().
- global_step: Optional Variable to increment by one after the variables have been updated.
- name: Optional name for the returned operation. Defaults to the name passed to the Optimizer constructor.
This function takes the (gradient, variable) pairs returned by compute_gradients() as input and uses them to update the variables.
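For vanilla gradient descent, the update it performs can be sketched in plain Python (a sketch, not the real implementation; global_step is emulated here with a one-element list standing in for a mutable variable):

```python
def apply_gradients(grads_and_vars, lr, global_step=None):
    # Vanilla gradient-descent update: v <- v - lr * g for each pair.
    updated = [v - lr * g for g, v in grads_and_vars]
    # If global_step is supplied, it is incremented by one after the update.
    if global_step is not None:
        global_step[0] += 1
    return updated

step = [0]  # stand-in for a non-trainable global_step variable
new_vars = apply_gradients([(50.0, 10.0), (10.0, 50.0)], lr=0.1, global_step=step)
print(new_vars, step[0])  # [5.0, 49.0] 1
```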
So why does minimize() split the work into two steps? Because in some cases we need to adjust the gradients before applying them: for example, to guard against vanishing gradients (gradient vanishing) or exploding gradients (gradient explosion), we may want to intervene in advance so training doesn't end up producing NaNs; or we may want to multiply the computed gradients by some weight, among other reasons. Exposing the two steps separately makes such interventions possible.
```python
with tf.Graph().as_default():
    x = tf.Variable(initial_value=3., dtype='float32')
    w = tf.Variable(initial_value=4., dtype='float32')
    y = w * x

    opt = tf.train.GradientDescentOptimizer(0.1)
    grads_vals = opt.compute_gradients(y, [w])
    for i, (g, v) in enumerate(grads_vals):
        if g is not None:
            grads_vals[i] = (tf.clip_by_norm(g, 5), v)  # clip gradients
    train_op = opt.apply_gradients(grads_vals)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(grads_vals))
        print(sess.run([x, w, y]))
```

>>> [(3.0, 4.0)]
[3.0, 4.0, 12.0]
2. tf.clip_by_global_norm

```python
tf.clip_by_global_norm(
    t_list,
    clip_norm,
    use_norm=None,
    name=None
)
```

Args:

- t_list: A tuple or list of mixed Tensors, IndexedSlices, or None.
- clip_norm: A 0-D (scalar) Tensor > 0. The clipping ratio.
- use_norm: A 0-D (scalar) Tensor of type float (optional). The global norm to use. If not provided, global_norm() is used to compute the norm.
- name: A name for the operation (optional).

Returns:

- list_clipped: A list of Tensors of the same type as list_t.
- global_norm: A 0-D (scalar) Tensor representing the global norm.
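The math it implements is simple and can be sketched in plain Python (lists stand in for tensors; this is an illustration, not the real implementation): the global norm is sqrt of the sum of all squared entries across t_list, and every tensor is scaled by clip_norm / max(global_norm, clip_norm), so nothing changes when the global norm is already within the limit.

```python
import math

def clip_by_global_norm(t_list, clip_norm):
    # global_norm = sqrt(sum of squared entries over all tensors in t_list)
    global_norm = math.sqrt(sum(x * x for t in t_list for x in t))
    # Scale factor clip_norm / max(global_norm, clip_norm): equals 1 when
    # global_norm <= clip_norm, so small gradients pass through unchanged.
    scale = clip_norm / max(global_norm, clip_norm)
    return [[x * scale for x in t] for t in t_list], global_norm

clipped, gnorm = clip_by_global_norm([[3.0, 4.0], [12.0]], clip_norm=6.5)
print(gnorm)     # 13.0  (sqrt(9 + 16 + 144))
print(clipped)   # [[1.5, 2.0], [6.0]]  scaled by 6.5 / 13 = 0.5
```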
3. Using tf.trainable_variables(), tf.all_variables(), and tf.global_variables()
3.1 tf.trainable_variables()
tf.trainable_variables(scope=None)
This function returns the trainable variables, and only those. Whenever we create a variable, whether with tf.Variable() or tf.get_variable(), there is a trainable parameter, which defaults to True. Taking tf.Variable() as an example:
```python
__init__(
    initial_value=None,
    trainable=True,
    collections=None,
    validate_shape=True,
    ...
)
```
For variables we do not need to train, typically a learning rate or a step counter, we set trainable to False, and tf.trainable_variables() will then not list them. As a simple example, the code below defines four variables: a weight matrix, a bias vector, a learning rate, and a step counter. The first two need training; the last two do not.
```python
# Note: in w1 below, 'w1' is passed positionally to tf.Variable's second
# parameter (trainable), not to name, which is why the variable appears
# as 'Variable:0' in the output.
w1 = tf.Variable(tf.random_normal([255, 2000]), 'w1')
b1 = tf.get_variable('b1', [2000])
learning_rate = tf.Variable(0.5, trainable=False)
global_step = tf.Variable(0, trainable=False)
```
```python
tf.trainable_variables()
Out[3]:
[<tf.Variable 'Variable:0' shape=(255, 2000) dtype=float32_ref>,
 <tf.Variable 'b1:0' shape=(2000,) dtype=float32_ref>]

# Compare with tf.global_variables(), which also includes the two
# non-trainable variables:
tf.global_variables()
[<tf.Variable 'Variable:0' shape=(255, 2000) dtype=float32_ref>,
 <tf.Variable 'b1:0' shape=(2000,) dtype=float32_ref>,
 <tf.Variable 'Variable_1:0' shape=() dtype=float32_ref>,
 <tf.Variable 'Variable_2:0' shape=() dtype=int32_ref>]
```