Lecture 6 | Acceleration, Regularization

2019-10-20  Ysgc

Off-policy analogy: at each step there is a suggested move (step size × gradient), but instead of taking that move directly, we move along a running average of the history of suggested moves.

Vanilla GD vs. Momentum
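A minimal sketch of the two updates (variable names and hyperparameter values are illustrative, not from the lecture):

```python
import numpy as np

def vanilla_gd_step(w, grad, lr=0.1):
    # Vanilla GD: move directly along the current gradient.
    return w - lr * grad

def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    # Momentum: accumulate a running average of past (scaled) gradients
    # in v, and move along v instead of the raw current gradient.
    v = beta * v - lr * grad
    return w + v, v

# Toy usage on f(w) = w^2, whose gradient is 2w:
w, v = np.array([5.0]), np.array([0.0])
for _ in range(20):
    w, v = momentum_step(w, v, grad=2 * w)
print(w)
```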

Nesterov: assume the weights take one more step along the momentum direction, compute the gradient at that look-ahead point, and use that predicted gradient as the current gradient for the update.

blue: momentum; green: Nesterov
(This doesn't seem quite right: inside each do-until loop, if the look-ahead value is taken first, how can W_k still be updated on line 5 of the slide's pseudocode?)
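A sketch of the Nesterov look-ahead update described above (names are my own, not from the slide):

```python
def nesterov_step(w, v, grad_fn, lr=0.1, beta=0.9):
    # Nesterov: take the momentum step first (look-ahead), evaluate the
    # gradient at that predicted point, then use it for the actual update.
    lookahead = w + beta * v
    grad = grad_fn(lookahead)
    v = beta * v - lr * grad
    return w + v, v
```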

Momentum makes no assumption that the Hessian is diagonal.


A brief summary of different optimization methods in DNNs


GD -> SGD

In each iteration, as the number of instances or batches processed increases, the learning rate should shrink; otherwise the iterates keep bouncing around the optimum. A decay schedule is sketched below.
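A minimal sketch of SGD with a shrinking learning rate (the 1/(1+t) schedule and names are illustrative assumptions):

```python
def sgd_with_decay(w, grad_fn, instances, lr0=0.1):
    # SGD where the learning rate shrinks as more instances are processed:
    # lr_t = lr0 / (1 + t). With a fixed lr the iterates would keep
    # bouncing around the optimum instead of settling.
    for t, x in enumerate(instances):
        lr = lr0 / (1.0 + t)
        w = w - lr * grad_fn(w, x)
    return w
```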

Convergence criterion: how far the current solution is from the optimum.

Batch GD converges faster in iteration count: roughly O(log(1/\epsilon)) iterations vs. O(1/\epsilon) for SGD (in the strongly convex setting),
but SGD updates with each instance, so each of its iterations is far cheaper.
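As a concrete illustration, to reach accuracy $\epsilon = 10^{-4}$ in that setting the iteration counts scale roughly as

$$\log(1/\epsilon) = \log(10^{4}) \approx 9.2 \ \text{(batch GD, natural log)} \qquad \text{vs.} \qquad 1/\epsilon = 10^{4} \ \text{(SGD)},$$

though each SGD iteration looks at only one instance instead of the whole dataset.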

Batch GD converges faster, and the variance of its error is lower.

f(x) -> the target function
g(x; w) -> the current NN
Sample from f(x) -> minimize the empirical error (see the formulation below)
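Written out (a standard empirical-risk formulation; div denotes a generic divergence/loss and N the sample count, both my notation):

$$\widehat{w} = \arg\min_{w} \frac{1}{N} \sum_{i=1}^{N} \mathrm{div}\big(g(x_i; w),\, f(x_i)\big),$$

where the $x_i$ are the training inputs and $f(x_i)$ their target outputs.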

Having different sets of samples -> different updating behavior; this is what the variance of the estimate means (illustrated below).
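A toy illustration, on made-up 1-D least-squares data, of how different sample sets give different gradient estimates; the spread across minibatches is the estimator's variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D least-squares data: y = 3x + noise.
x = rng.normal(size=1000)
y = 3.0 * x + rng.normal(scale=0.5, size=1000)
w = 0.0  # current weight estimate

def minibatch_grad(w, idx):
    # Gradient of the mean squared error over the sampled minibatch.
    return np.mean(2.0 * (w * x[idx] - y[idx]) * x[idx])

# Different random sample sets give different gradient estimates.
grads = [minibatch_grad(w, rng.choice(1000, size=16, replace=False))
         for _ in range(100)]
print("mean:", np.mean(grads), "std across minibatches:", np.std(grads))
```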

https://www.stat.cmu.edu/~ryantibs/convexopt/lectures/stochastic-gd.pdf
