1. Introduction to the solver file
The solver file is required for training a network. It defines, among other things, the solver type, the learning rate, and the learning-rate decay policy. It is typically invoked from the command line as:
caffe train --solver=*_solver.prototxt
Here is an example solver configuration file:
train_net: "train.prototxt"
test_net: "val.prototxt"
test_iter: 100
test_interval: 938
base_lr: 0.00999999977648
display: 1000
max_iter: 9380
lr_policy: "step"
gamma: 0.10000000149
momentum: 0.899999976158
weight_decay: 0.000500000023749
stepsize: 3000
snapshot: 4000
snapshot_prefix: "./snapshot/"
solver_mode: GPU
type: "SGD"
The sections below explain these parameters in detail and describe how to use them.
2. Detailed description of the solver file
2.1 How the solver runs
The solver proceeds through the following steps:
1. Define the objective to optimize, along with a training network for learning and a test network for evaluation (both referenced as separate prototxt files).
2. Iteratively optimize by running forward and backward passes and updating the parameters.
3. Periodically evaluate the test network (you can configure how many training iterations pass between tests).
4. Display the state of the model and the solver throughout the optimization.
In each iteration, the solver does the following:
1. Runs the forward pass to compute the output and the corresponding loss.
2. Runs the backward pass to compute the gradients of each layer.
3. Updates the parameters from the gradients, according to the chosen solver method.
4. Records and saves the learning rate, snapshots, and solver state for the iteration.
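The per-iteration steps above can be sketched in plain Python. This is a toy 1-D model, not the Caffe API; the function names and the quadratic loss are hypothetical, but the update rule is the momentum SGD of Caffe's default solver:

```python
def forward(w, x, y):
    """Forward pass: squared-error loss of a 1-D linear model."""
    pred = w * x
    return (pred - y) ** 2

def backward(w, x, y):
    """Backward pass: gradient of the loss with respect to w."""
    return 2 * (w * x - y) * x

def solve(w, x, y, base_lr=0.01, momentum=0.9, n_iter=300):
    """SGD with momentum: the update rule used by Caffe's default solver."""
    v = 0.0  # momentum history
    for it in range(n_iter):
        grad = backward(w, x, y)
        v = momentum * v - base_lr * grad  # v <- mu * v - lr * grad
        w += v                             # w <- w + v
    return w

w = solve(0.0, x=1.0, y=3.0)
print(round(w, 3))  # converges towards 3.0
```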
2.2 参数解释
这里就使用上面的例子进行说明,首先指定训练和测试网络模型的文件
Part 1
train_net: "train.prototxt"
test_net: "val.prototxt"
Part 2
The next parameter is test_iter, whose value is computed as follows.
test_iter: 100
This parameter is determined by the total number of test samples N together with the batch_size defined in the test network prototxt: the test_iter iterations of one test phase should together cover the entire test set, i.e. one epoch over the test data. Its value is therefore:

test_iter = N / batch_size_test
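As a sketch, assuming a 10,000-image test set (MNIST-sized; the dataset is my assumption, not stated in the example) and a test-net batch_size of 100:

```python
import math

def solver_test_iter(num_test_samples, test_batch_size):
    # test_iter forward passes must cover the whole test set once
    return math.ceil(num_test_samples / test_batch_size)

print(solver_test_iter(10000, 100))  # → 100, matching the example solver
```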
Part 3
The next parameter is test_interval, the testing interval: the number of training iterations between two test phases.
test_interval: 938
This value is determined by the total number of training samples M together with the batch_size defined in the training network prototxt, so that the network is tested once per training epoch:

test_interval = M / batch_size_train
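The value 938 in the example is consistent with (though not confirmed by the source to be) a 60,000-image MNIST training set and a training batch_size of 64:

```python
import math

def solver_test_interval(num_train_samples, train_batch_size):
    # test once per epoch over the training data
    return math.ceil(num_train_samples / train_batch_size)

print(solver_test_interval(60000, 64))  # → 938
```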
Part 4
The display parameter specifies how often, in iterations, training information is printed; set it to 0 to disable output.
display: 1000
Here, information is printed once every 1000 iterations.
Part 5
Next comes the important part: the learning rate and its decay policy.
base_lr: 0.00999999977648 # base learning rate
lr_policy: "step" # learning-rate decay policy
gamma: 0.10000000149 # decay constant; each update multiplies the learning rate by this factor
momentum: 0.899999976158 # momentum (forgetting factor)
weight_decay: 0.000500000023749 # weight penalty coefficient, used to prevent overfitting
stepsize: 3000 # step size for the decay; if set too small, the learning rate shrinks too quickly and the model cannot converge adequately
For the step policy, each update multiplies the current learning rate by the constant gamma. The decay policy itself is selected in Caffe via lr_policy, which takes one of the following values:
- fixed: keep base_lr unchanged
- step: requires a stepsize; returns base_lr * gamma ^ (floor(iter / stepsize)), where iter is the current iteration
- exp: returns base_lr * gamma ^ iter, where iter is the current iteration
- inv: requires a power; returns base_lr * (1 + gamma * iter) ^ (-power)
- multistep: requires one or more stepvalue entries; similar to step, but where step changes at uniform intervals, multistep changes at the given stepvalue iterations
- poly: polynomial decay; returns base_lr * (1 - iter/max_iter) ^ power
- sigmoid: sigmoid decay; returns base_lr * (1 / (1 + exp(-gamma * (iter - stepsize))))
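The policies can be written out directly from the formulas above; the sketch below follows those formulas (the function name is mine, not Caffe's):

```python
import math

def caffe_lr(policy, base_lr, it, gamma=None, power=None,
             stepsize=None, max_iter=None, stepvalue=()):
    """Learning rate at iteration `it` under each documented lr_policy."""
    if policy == "fixed":
        return base_lr
    if policy == "step":
        return base_lr * gamma ** (it // stepsize)
    if policy == "exp":
        return base_lr * gamma ** it
    if policy == "inv":
        return base_lr * (1 + gamma * it) ** (-power)
    if policy == "multistep":
        passed = sum(1 for s in stepvalue if it >= s)  # stepvalues already reached
        return base_lr * gamma ** passed
    if policy == "poly":
        return base_lr * (1 - it / max_iter) ** power
    if policy == "sigmoid":
        return base_lr * (1.0 / (1 + math.exp(-gamma * (it - stepsize))))
    raise ValueError("unknown lr_policy: " + policy)

# With the example settings (base_lr=0.01, gamma=0.1, stepsize=3000),
# the rate drops by 10x at iterations 3000, 6000, ...
print(caffe_lr("step", 0.01, 2999, gamma=0.1, stepsize=3000))  # ~0.01
print(caffe_lr("step", 0.01, 3000, gamma=0.1, stepsize=3000))  # ~0.001
```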
The policies above vary the learning rate as a function of the iteration count alone; with multistep, the change points can instead be set at fixed, explicitly listed iteration values, for example:
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
# The learning rate policy
lr_policy: "multistep"
gamma: 0.9
stepvalue: 5000
stepvalue: 7000
stepvalue: 8000
stepvalue: 9000
stepvalue: 9500
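With the multistep settings above, the learning rate is multiplied by gamma each time training passes one of the stepvalue iterations. A minimal sketch:

```python
def multistep_lr(base_lr, gamma, stepvalues, it):
    # the rate decays by a factor of gamma at every stepvalue passed so far
    passed = sum(1 for s in stepvalues if it >= s)
    return base_lr * gamma ** passed

steps = [5000, 7000, 8000, 9000, 9500]
print(multistep_lr(0.01, 0.9, steps, 0))     # base_lr until iteration 5000
print(multistep_lr(0.01, 0.9, steps, 6000))  # ~0.009 after the first drop
```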
Part 6
This parameter sets the maximum number of iterations. Note that when training through the Python interface by calling step() inside a for loop, the loop itself controls how long training runs, so max_iter has no effect there.
max_iter: 9380
Part 7
Next come the snapshot interval and save path for the model:
snapshot: 4000 # snapshot interval in iterations
snapshot_prefix: "./snapshot/" # prefix for saved model files
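Caffe forms snapshot filenames by appending "_iter_<N>" plus an extension to snapshot_prefix, so with the prefix above the files land inside ./snapshot/. A sketch of the resulting names:

```python
def snapshot_filenames(prefix, it):
    # Caffe saves the learned weights (.caffemodel) and the solver state
    # (.solverstate) as separate files at each snapshot
    model = "%s_iter_%d.caffemodel" % (prefix, it)
    state = "%s_iter_%d.solverstate" % (prefix, it)
    return model, state

print(snapshot_filenames("./snapshot/", 4000))
# ('./snapshot/_iter_4000.caffemodel', './snapshot/_iter_4000.solverstate')
```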
Part 8
The device used for solving; the default is GPU.
solver_mode: GPU
Part 9
The solver type; the default is SGD.
type: "SGD"
The available solver types:
enum SolverType {
SGD = 0;
NESTEROV = 1;
ADAGRAD = 2;
RMSPROP = 3;
ADADELTA = 4;
ADAM = 5;
}
3. The SolverParameter definition
The complete SolverParameter definition from caffe.proto is reproduced below for reference.
// NOTE
// Update the next available ID when you add a new SolverParameter field.
//
// SolverParameter next available ID: 42 (last added: layer_wise_reduce)
message SolverParameter {
//////////////////////////////////////////////////////////////////////////////
// Specifying the train and test networks
//
// Exactly one train net must be specified using one of the following fields:
// train_net_param, train_net, net_param, net
// One or more test nets may be specified using any of the following fields:
// test_net_param, test_net, net_param, net
// If more than one test net field is specified (e.g., both net and
// test_net are specified), they will be evaluated in the field order given
// above: (1) test_net_param, (2) test_net, (3) net_param/net.
// A test_iter must be specified for each test_net.
// A test_level and/or a test_stage may also be specified for each test_net.
//////////////////////////////////////////////////////////////////////////////
// Proto filename for the train net, possibly combined with one or more
// test nets.
optional string net = 24;
// Inline train net param, possibly combined with one or more test nets.
optional NetParameter net_param = 25;
optional string train_net = 1; // Proto filename for the train net.
repeated string test_net = 2; // Proto filenames for the test nets.
optional NetParameter train_net_param = 21; // Inline train net params.
repeated NetParameter test_net_param = 22; // Inline test net params.
// The states for the train/test nets. Must be unspecified or
// specified once per net.
//
// By default, train_state will have phase = TRAIN,
// and all test_state's will have phase = TEST.
// Other defaults are set according to the NetState defaults.
optional NetState train_state = 26;
repeated NetState test_state = 27;
// The number of iterations for each test net.
repeated int32 test_iter = 3;
// The number of iterations between two testing phases.
optional int32 test_interval = 4 [default = 0];
optional bool test_compute_loss = 19 [default = false];
// If true, run an initial test pass before the first iteration,
// ensuring memory availability and printing the starting value of the loss.
optional bool test_initialization = 32 [default = true];
optional float base_lr = 5; // The base learning rate
// the number of iterations between displaying info. If display = 0, no info
// will be displayed.
optional int32 display = 6;
// Display the loss averaged over the last average_loss iterations
optional int32 average_loss = 33 [default = 1];
optional int32 max_iter = 7; // the maximum number of iterations
// accumulate gradients over `iter_size` x `batch_size` instances
optional int32 iter_size = 36 [default = 1];
// The learning rate decay policy. The currently implemented learning rate
// policies are as follows:
// - fixed: always return base_lr.
// - step: return base_lr * gamma ^ (floor(iter / step))
// - exp: return base_lr * gamma ^ iter
// - inv: return base_lr * (1 + gamma * iter) ^ (- power)
// - multistep: similar to step but it allows non uniform steps defined by
// stepvalue
// - poly: the effective learning rate follows a polynomial decay, to be
// zero by the max_iter. return base_lr (1 - iter/max_iter) ^ (power)
// - sigmoid: the effective learning rate follows a sigmod decay
// return base_lr ( 1/(1 + exp(-gamma * (iter - stepsize))))
//
// where base_lr, max_iter, gamma, step, stepvalue and power are defined
// in the solver parameter protocol buffer, and iter is the current iteration.
optional string lr_policy = 8;
optional float gamma = 9; // The parameter to compute the learning rate.
optional float power = 10; // The parameter to compute the learning rate.
optional float momentum = 11; // The momentum value.
optional float weight_decay = 12; // The weight decay.
// regularization types supported: L1 and L2
// controlled by weight_decay
optional string regularization_type = 29 [default = "L2"];
// the stepsize for learning rate policy "step"
optional int32 stepsize = 13;
// the stepsize for learning rate policy "multistep"
repeated int32 stepvalue = 34;
// Set clip_gradients to >= 0 to clip parameter gradients to that L2 norm,
// whenever their actual L2 norm is larger.
optional float clip_gradients = 35 [default = -1];
optional int32 snapshot = 14 [default = 0]; // The snapshot interval
optional string snapshot_prefix = 15; // The prefix for the snapshot.
// whether to snapshot diff in the results or not. Snapshotting diff will help
// debugging but the final protocol buffer size will be much larger.
optional bool snapshot_diff = 16 [default = false];
enum SnapshotFormat {
HDF5 = 0;
BINARYPROTO = 1;
}
optional SnapshotFormat snapshot_format = 37 [default = BINARYPROTO];
// the mode solver will use: 0 for CPU and 1 for GPU. Use GPU in default.
enum SolverMode {
CPU = 0;
GPU = 1;
}
optional SolverMode solver_mode = 17 [default = GPU];
// the device_id will that be used in GPU mode. Use device_id = 0 in default.
optional int32 device_id = 18 [default = 0];
// If non-negative, the seed with which the Solver will initialize the Caffe
// random number generator -- useful for reproducible results. Otherwise,
// (and by default) initialize using a seed derived from the system clock.
optional int64 random_seed = 20 [default = -1];
// type of the solver
optional string type = 40 [default = "SGD"];
// numerical stability for RMSProp, AdaGrad and AdaDelta and Adam
optional float delta = 31 [default = 1e-8];
// parameters for the Adam solver
optional float momentum2 = 39 [default = 0.999];
// RMSProp decay value
// MeanSquare(t) = rms_decay*MeanSquare(t-1) + (1-rms_decay)*SquareGradient(t)
optional float rms_decay = 38 [default = 0.99];
// If true, print information about the state of the net that may help with
// debugging learning problems.
optional bool debug_info = 23 [default = false];
// If false, don't save a snapshot after training finishes.
optional bool snapshot_after_train = 28 [default = true];
// DEPRECATED: old solver enum types, use string instead
enum SolverType {
SGD = 0;
NESTEROV = 1;
ADAGRAD = 2;
RMSPROP = 3;
ADADELTA = 4;
ADAM = 5;
}
// DEPRECATED: use type instead of solver_type
optional SolverType solver_type = 30 [default = SGD];
// Overlap compute and communication for data parallel training
optional bool layer_wise_reduce = 41 [default = true];
}