Lecture 4 | The Backpropagation
2019-10-19
vector activation vs scalar activation
sigmoid output -> prob of classification
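A minimal sketch of the scalar vs. vector distinction, assuming sigmoid as the scalar activation and softmax as the vector one (softmax is not named in these notes; it is used here only as a familiar example). The input values are made up.

```python
import math

def sigmoid(z):
    """Scalar activation: output i depends only on its own input z[i]."""
    return [1.0 / (1.0 + math.exp(-zi)) for zi in z]

def softmax(z):
    """Vector activation: every output depends on all of the inputs."""
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [1.0, 2.0, 3.0]
y_sig = sigmoid(z)
y_soft = softmax(z)

# Change only z[0]: the other sigmoid outputs stay the same,
# while every softmax output shifts.
z2 = [5.0, 2.0, 3.0]
sig_unchanged = sigmoid(z2)[1] == y_sig[1]
soft_changed = abs(softmax(z2)[1] - y_soft[1]) > 1e-6
print(sig_unchanged, soft_changed)
```

This is why the backward step through a vector activation has to sum over all outputs, while a scalar activation only needs its own derivative.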
how to define the error???
first choice: squared Euclidean distance
L2 divergence -> differentiation is just y_i - d_i (assuming Div = 1/2 * ||y - d||^2, with d the desired output)
gradient < 0 => y_i should increase to reduce the divergence
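A small sketch of the L2 divergence and its gradient, checked against a finite difference. The 1/2 factor, the name `d` for the desired output, and the numbers are assumptions for illustration.

```python
def l2_divergence(y, d):
    """Div(y, d) = 1/2 * sum_i (y_i - d_i)^2 (the 1/2 is an assumed convention)."""
    return 0.5 * sum((yi - di) ** 2 for yi, di in zip(y, d))

def l2_gradient(y, d):
    """dDiv/dy_i = y_i - d_i under the convention above."""
    return [yi - di for yi, di in zip(y, d)]

y = [0.2, 0.7]  # network outputs (made-up values)
d = [1.0, 0.0]  # desired outputs
grad = l2_gradient(y, d)

# Central finite difference as a sanity check on the analytic gradient.
eps = 1e-6
numeric = []
for i in range(len(y)):
    y_plus = list(y)
    y_minus = list(y)
    y_plus[i] += eps
    y_minus[i] -= eps
    numeric.append((l2_divergence(y_plus, d) - l2_divergence(y_minus, d)) / (2 * eps))

print(grad, numeric)
```

Here `grad[0]` is negative because `y[0]` is below its target, so nudging `y[0]` upward lowers the divergence.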
label smoothing: arithmetically wrong (the targets are no longer exactly 0 or 1), but it helps gradient descent!
avoid overshooting
it's a heuristic
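A sketch of one common form of label smoothing, assuming the target 1 is replaced by 1 - (K-1)*eps and each target 0 by eps, where K is the number of classes. The function name `smooth_labels` and the value of `eps` are illustrative, not taken from the lecture.

```python
def smooth_labels(one_hot, eps=0.1):
    """Move a one-hot target slightly away from exact 0/1 values.

    The smoothed target still sums to 1: the true class gets
    1 - (K - 1) * eps and every other class gets eps.
    """
    k = len(one_hot)
    return [1.0 - (k - 1) * eps if t == 1 else eps for t in one_hot]

d = [0, 0, 1, 0]
d_smooth = smooth_labels(d, eps=0.05)
print(d_smooth)
```

A sigmoid or softmax output can only approach 0 or 1, never reach it, so smoothed targets are values the network can actually hit.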
forward NN
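A minimal forward-pass sketch, assuming a fully connected network with sigmoid activations in every layer. The layer sizes and weights are arbitrary, and `forward` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights, biases):
    """Run the network layer by layer.

    Returns (zs, ys): zs[k] holds the pre-activations of layer k+1,
    ys[0] is the input and ys[-1] is the network output.
    """
    zs, ys = [], [x]
    for W, b in zip(weights, biases):
        y_prev = ys[-1]
        # z_j = sum_i W[j][i] * y_prev[i] + b[j]
        z = [sum(w_ji * y_i for w_ji, y_i in zip(row, y_prev)) + b_j
             for row, b_j in zip(W, b)]
        zs.append(z)
        ys.append([sigmoid(z_j) for z_j in z])
    return zs, ys

# 2 inputs -> 2 hidden units -> 1 output, arbitrary weights
weights = [[[0.5, -0.5], [0.25, 0.75]], [[1.0, -1.0]]]
biases = [[0.0, 0.0], [0.0]]
zs, ys = forward([1.0, 1.0], weights, biases)
print(ys[-1])
```

The intermediate `zs` and `ys` are kept because the backward pass reuses them.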
backward NN
(1) trivial: grad of output
(2) grad of the final activation layer
(3) grad of the last group of weights
(4) grad of the second-to-last group of y (the outputs of the previous layer)
(5) in summary: pseudocode & backward vs. forward comparison
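Steps (1) to (4) can be sketched as one loop that mirrors the forward pass in reverse. This is a rough illustration, not the lecture's own pseudocode: it assumes sigmoid activations in every layer and the L2 divergence with a 1/2 factor, and the names `forward`, `backward`, and `divergence` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights, biases):
    """Forward pass: store every z and y, going from input to output."""
    zs, ys = [], [x]
    for W, b in zip(weights, biases):
        z = [sum(w * y for w, y in zip(row, ys[-1])) + bj
             for row, bj in zip(W, b)]
        zs.append(z)
        ys.append([sigmoid(zj) for zj in z])
    return zs, ys

def divergence(y, d):
    return 0.5 * sum((yi - di) ** 2 for yi, di in zip(y, d))

def backward(x, d, weights, biases):
    """Backward pass: same loop as forward, but from output to input."""
    zs, ys = forward(x, weights, biases)
    # (1) gradient w.r.t. the network output
    dy = [yi - di for yi, di in zip(ys[-1], d)]
    dWs, dbs = [], []
    for k in reversed(range(len(weights))):
        # (2) through the activation: dDiv/dz_j = dDiv/dy_j * y_j * (1 - y_j)
        dz = [dyj * yj * (1.0 - yj) for dyj, yj in zip(dy, ys[k + 1])]
        # (3) weights of this layer: dDiv/dW[j][i] = dDiv/dz_j * y_prev[i]
        dWs.insert(0, [[dzj * yi for yi in ys[k]] for dzj in dz])
        dbs.insert(0, dz)
        # (4) outputs of the previous layer: dDiv/dy_prev[i] = sum_j W[j][i] * dDiv/dz_j
        dy = [sum(weights[k][j][i] * dz[j] for j in range(len(dz)))
              for i in range(len(ys[k]))]
    return dWs, dbs

x, d = [1.0, 1.0], [1.0]
weights = [[[0.5, -0.5], [0.25, 0.75]], [[1.0, -1.0]]]
biases = [[0.0, 0.0], [0.0]]
dWs, dbs = backward(x, d, weights, biases)
print(dWs)
```

The forward loop multiplies by W and applies the activation; the backward loop multiplies by the activation derivative and then by W transposed, which is the comparison the summary step refers to.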