Adding learning rate decay to TensorFlow training
Whether to apply learning rate decay, and how large a learning rate to choose, depends on the optimizer, the batch size, and the task itself.
In general, tf.train.MomentumOptimizer is used together with lr decay, while tf.train.AdamOptimizer usually does not need it.
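As a rough illustration of the two typical setups (TF 1.x graph-mode API; the hyperparameter values below are arbitrary placeholders, not recommendations):

import tensorflow as tf  # TF 1.x graph-mode API

global_step = tf.train.get_or_create_global_step()

# Momentum: typically paired with a decay schedule.
momentum_lr = tf.train.exponential_decay(0.1, global_step,
                                         decay_steps=10000, decay_rate=0.96)
momentum_opt = tf.train.MomentumOptimizer(learning_rate=momentum_lr, momentum=0.9)

# Adam: often used with a fixed learning rate.
adam_opt = tf.train.AdamOptimizer(learning_rate=1e-3)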
That said, there are many differing views on whether Adam needs learning rate decay:
See, for example, the question "Should we do learning rate decay for adam optimizer?"
An excerpt from a highly rated answer:
It depends. ADAM updates any parameter with an individual learning rate. This means that every parameter in the network have a specific learning rate associated. But the single learning rate for parameter is computed using lambda (the initial learning rate) as upper limit. This means that every single learning rate can vary from 0 (no update) to lambda (maximum update). The learning rates adapt themselves during train steps, it's true, but if you want to be sure that every update step do not exceed lambda you can than lower lambda using exponential decay or whatever. It can help to reduce loss during the latest step of training, when the computed loss with the previously associated lambda parameter has stopped to decrease.
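To make the "lambda as upper limit" point concrete, here is a minimal NumPy sketch of the Adam update rule (illustrative only, not TensorFlow's implementation; all names are made up):

import numpy as np

# Minimal sketch of one Adam update, showing how lambda scales every
# per-parameter step.
def adam_step(theta, grad, m, v, t, lam=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    step = lam * m_hat / (np.sqrt(v_hat) + eps)  # roughly |step| <= lam in the common case
    return theta - step, m, v

# Usage: keep m, v as running per-parameter state, with t starting at 1.
theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 4):
    grad = np.array([0.5, -1.0, 2.0])   # dummy gradient
    theta, m, v = adam_step(theta, grad, m, v, t)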
def _optimize(self, loss, global_):
    # optimizer = tf.train.MomentumOptimizer(
    #     learning_rate=0.001, momentum=0.9)
    # Decay the lr exponentially as a function of the fed-in step tensor.
    learning_rate = tf.train.exponential_decay(self.config.init_lr, global_step=global_,
                                               decay_steps=self.config.decay_step,
                                               decay_rate=self.config.decay_rate,
                                               staircase=True)
    tf.summary.scalar("learning_rate", learning_rate)
    optimizer = tf.train.AdamOptimizer(learning_rate, beta1=0.9, beta2=0.999, epsilon=1e-8,
                                       use_locking=False, name="Adam")
    trainable_var = tf.trainable_variables()
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)  # e.g. batch-norm moving averages
    grads_and_vars = optimizer.compute_gradients(loss, trainable_var)
    # Clip each gradient's norm before applying the update.
    grads_and_vars = [(tf.clip_by_norm(g, self.config.GRADIENT_CLIP_NORM), v)
                      for g, v in grads_and_vars]
    with tf.control_dependencies(update_ops):
        # apply_gradient_op = optimizer.minimize(loss)
        apply_gradient_op = optimizer.apply_gradients(grads_and_vars, name='train_op')
    placeholder_float32 = tf.constant(0, dtype=tf.float32)
    # tf.summary.scalar("accuracy", rate)
    return placeholder_float32, apply_gradient_op, apply_gradient_op, learning_rate
Here global_ should be a feedable scalar tensor: the iteration step count is updated on the Python side during training, and the current step is fed to global_ via feed_dict, which is enough to get the lr decay effect.
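A minimal end-to-end sketch of this feed_dict pattern (with a dummy loss and made-up hyperparameters, not the author's model):

import tensorflow as tf  # TF 1.x

# Feedable scalar tensor standing in for global_ above (names are illustrative).
global_ = tf.placeholder(tf.int64, shape=[], name="global_step")
w = tf.Variable(1.0)
loss = tf.square(w)  # dummy loss just to have something to optimize

learning_rate = tf.train.exponential_decay(0.01, global_step=global_,
                                           decay_steps=100, decay_rate=0.9,
                                           staircase=True)
train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(300):
        # The Python-side counter drives the decay schedule.
        _, lr = sess.run([train_op, learning_rate], feed_dict={global_: step})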
With staircase set to True here, the lr decreases in a step-shaped (staircase) curve.
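For reference, this is a plain-Python restatement of the documented exponential_decay formula (the example numbers are made up):

def decayed_lr(init_lr, decay_rate, decay_steps, global_step, staircase=True):
    p = global_step / decay_steps
    if staircase:
        p = global_step // decay_steps   # floor -> piecewise-constant (staircase) lr
    return init_lr * decay_rate ** p

# e.g. init_lr=0.01, decay_rate=0.9, decay_steps=100:
#   steps 0-99    -> 0.01
#   steps 100-199 -> 0.009
#   steps 200-299 -> 0.0081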