Lecture 6 | Acceleration, Regularization
2019-10-20
Ysgc
off-policy ~> at each step the gradient suggests a move (step size * grad), but instead of taking it directly, move along the average of the gradient history
(figure: Vanilla GD vs. Momentum)

Nesterov: assume the weights take one more step with momentum, calculate the gradient there, and use that look-ahead (predicted) gradient as the current gradient for updating
blue: momentum; green: Nesterov. (This seems off to me: inside each do-until loop, if the look-ahead gradient is taken first, how can W_k still be updated on line 5 of the pseudocode?)
Momentum => no assumption that the Hessian is diagonal
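A minimal sketch of the two update rules just described (grad_fn, lr, and beta are my own placeholder names, and the hyperparameter values are arbitrary; the lecture's pseudocode may differ):

```python
def momentum_step(w, v, grad_fn, lr=0.01, beta=0.9):
    """Classical momentum: move along a decayed average of past
    gradients instead of the raw current gradient."""
    v = beta * v - lr * grad_fn(w)              # velocity = gradient history
    return w + v, v

def nesterov_step(w, v, grad_fn, lr=0.01, beta=0.9):
    """Nesterov: take one more step with momentum first, evaluate the
    gradient at that look-ahead point, and use it as the current gradient."""
    v = beta * v - lr * grad_fn(w + beta * v)   # grad at predicted position
    return w + v, v

# toy usage on f(w) = w^2, whose gradient is 2w
w, v = 5.0, 0.0
for _ in range(100):
    w, v = nesterov_step(w, v, lambda w: 2 * w)
print(w)  # near the optimum w* = 0
```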
A brief summary of different optimization methods in DNNs
GD -> SGD
in each iteration, as the number of instances or batches seen grows, the LR should shrink!! otherwise the iterate keeps bouncing around the optimum (see the sketch after the list below)

criteria for how far we are from the optimum:
- ||w - w*||, the Euclidean distance to the optimum
- how much the loss still decreases per step
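A minimal sketch of SGD with a shrinking learning rate, assuming a simple 1/t decay (the schedule, lr0, and grad_fn are my own placeholders; the notes only require that the LR shrink over iterations):

```python
import numpy as np

def sgd_decaying_lr(w, data, grad_fn, lr0=0.1, n_epochs=10):
    """SGD whose learning rate shrinks as more instances are seen,
    so late updates stop bouncing around the optimum."""
    t = 0
    for _ in range(n_epochs):
        np.random.shuffle(data)                 # visit instances in random order
        for x in data:
            t += 1
            w = w - (lr0 / t) * grad_fn(w, x)   # lr ~ 1/t: shrinks every step
    return w
```

A 1/t schedule satisfies the classical conditions for SGD convergence (the step sizes sum to infinity while their squares stay summable), which is one standard way to make the bouncing die out.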
converges faster: batch GD converges in far fewer updates, and the variance of its error is lower; but SGD updates with each single instance, so every update is much cheaper
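To make the trade-off concrete, a toy least-squares comparison (the data, loss, and step sizes are all placeholders chosen for illustration): one pass over the data gives batch GD a single update, while SGD gets one update per instance:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5)               # toy linear-regression data

def grad(w, Xb, yb):
    """Gradient of the mean squared error on the (mini)batch (Xb, yb)."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

w_batch = np.zeros(5)
w_batch -= 0.1 * grad(w_batch, X, y)     # batch GD: one update per pass

w_sgd = np.zeros(5)
for i in range(len(y)):                  # SGD: 100 cheap updates per pass
    w_sgd -= 0.01 * grad(w_sgd, X[i:i+1], y[i:i+1])
```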
f(x) -> target func
g(x;w) -> current NN
sample from f(x) -> min empirical error
having a different set of samples -> different updating behavior -> this is the meaning of the variance of the estimate
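A toy illustration of that last point (the target f, the sample size, and the trial count are placeholders): drawing many different small training sets and computing the empirical gradient on each shows the spread across sample sets, i.e. the variance of the estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 3.0 * x                    # toy target function to sample from

def empirical_grad(w, xs):
    """Gradient of the empirical squared error (1/N) * sum (w*x - f(x))^2."""
    return np.mean(2 * (w * xs - f(xs)) * xs)

# each trial draws a fresh 10-sample training set -> a different estimate
grads = [empirical_grad(0.0, rng.normal(size=10)) for _ in range(1000)]
print(np.mean(grads), np.var(grads))     # spread = variance of the estimate
```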
https://www.stat.cmu.edu/~ryantibs/convexopt/lectures/stochastic-gd.pdf