Lecture 4 | Backpropagation

2019-10-19  Ysgc

vector activation (e.g. softmax) vs scalar activation (e.g. sigmoid)

\frac{\partial y_i}{\partial z_i} = \frac{\exp(z_i)}{\sum_k \exp(z_k)} - \frac{\exp(z_i)^2}{(\sum_k \exp(z_k))^2} = y_i(1 - y_i)

\frac{\partial y_i}{\partial z_j} = - \frac{\exp(z_i)\exp(z_j)}{(\sum_k \exp(z_k))^2} = -y_i y_j \quad (i \neq j)
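A quick numerical check of these two softmax-derivative formulas (a minimal numpy sketch; the helper names are mine, not from the lecture):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    # dy_i/dz_i = y_i(1 - y_i), dy_i/dz_j = -y_i y_j for i != j
    y = softmax(z)
    return np.diag(y) - np.outer(y, y)

z = np.array([1.0, 2.0, 0.5])
J = softmax_jacobian(z)

# finite-difference check of the analytic Jacobian
eps = 1e-6
J_num = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3); dz[j] = eps
    J_num[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(J, J_num, atol=1e-6))   # True
```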

sigmoid output -> prob of classification

how to define the error???

first choice: squared Euclidean distance

L2 divergence -> the derivative w.r.t. y_i is just y_i - d_i
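As a small sketch (assuming the usual 1/2 factor in front of the squared distance, so the derivative is exactly y_i - d_i):

```python
import numpy as np

def l2_div(y, d):
    # Div = 1/2 * ||y - d||^2
    return 0.5 * np.sum((y - d) ** 2)

def l2_div_grad(y, d):
    # dDiv/dy_i = y_i - d_i
    return y - d

y = np.array([0.7, 0.2, 0.1])   # network output
d = np.array([1.0, 0.0, 0.0])   # one-hot target
print(l2_div(y, d), l2_div_grad(y, d))
```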

gradient < 0 => y_i should increase to reduce the divergence

label smoothing makes the targets "arithmetically wrong" (no longer exactly 0/1), but it helps gradient descent!

avoid overshooting

https://leimao.github.io/blog/Label-Smoothing/

it's a heuristic
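A minimal sketch of label smoothing as described in the linked post (the mixing weight eps = 0.1 is just an assumed value):

```python
import numpy as np

def smooth_labels(d_onehot, eps=0.1):
    # mix the one-hot target with a uniform distribution over the K classes
    K = d_onehot.shape[-1]
    return (1.0 - eps) * d_onehot + eps / K

d = np.array([1.0, 0.0, 0.0])
print(smooth_labels(d))   # [0.9333..., 0.0333..., 0.0333...]
```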

forward NN

backward NN

(1) trivial: grad of output


(2) grad of the final activation layer



(3) grad of the last group of weights

[grad(W_{ij}^n)] = [y_0^{n-1}, y_1^{n-1}, \dots, y_i^{n-1}]^T \cdot [grad(z_0^n), grad(z_1^n), \dots, grad(z_j^n)]
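In code this is a single outer product (a sketch with made-up numbers; indices follow the equation above, with W_{ij}^n connecting y_i^{n-1} to z_j^n):

```python
import numpy as np

y_prev = np.array([0.3, 0.8, 0.1])   # y^{n-1}, 3 units
dz = np.array([0.5, -0.2])           # grad w.r.t. z^n, 2 units
dW = np.outer(y_prev, dz)            # shape (3, 2); dW[i, j] = y_prev[i] * dz[j]
```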

(4) grad of the second-to-last group of y


[grad(y_0^{n-1}), grad(y_1^{n-1}), \dots, grad(y_i^{n-1})]^T = [W_{ij}^n] \cdot [grad(z_0^n), grad(z_1^n), \dots, grad(z_j^n)]^T
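And step (4) is a matrix-vector product with the same W (again a sketch; with z_j^n = \sum_i W_{ij}^n y_i^{n-1}, the backward map applies W to grad(z)):

```python
import numpy as np

W = np.array([[0.1, -0.3],
              [0.2,  0.4],
              [0.0,  0.5]])          # W[i, j] = W_ij^n, shape (3, 2)
dz = np.array([0.5, -0.2])           # grad w.r.t. z^n
dy_prev = W @ dz                     # grad(y_i^{n-1}) = sum_j W[i, j] * dz[j]
```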

(5) In summary: pseudocode & backward/forward comparison


backward: in each loop, apply an affine transformation (the transpose of the forward W) to the derivative, then multiply by the derivative of the activation function

forward: in each iteration, apply an affine transformation to the input, then an activation function
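A compact sketch of the two loops side by side, assuming scalar (sigmoid) activations, an L2 divergence, and the convention z^k = (W^k)^T y^{k-1} + b^k (all names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws, bs):
    ys, zs = [x], []
    for W, b in zip(Ws, bs):
        z = W.T @ ys[-1] + b                 # affine transformation
        zs.append(z)
        ys.append(sigmoid(z))                # activation function
    return ys, zs

def backward(ys, zs, Ws, d):
    dy = ys[-1] - d                          # step (1): grad of output (L2 divergence)
    dWs, dbs = [], []
    for k in reversed(range(len(Ws))):
        y = ys[k + 1]                        # output of layer k
        dz = dy * y * (1 - y)                # step (2): times sigmoid'(z) = y(1 - y)
        dWs.insert(0, np.outer(ys[k], dz))   # step (3): outer product
        dbs.insert(0, dz)
        dy = Ws[k] @ dz                      # step (4): transpose of the forward affine map
    return dWs, dbs

Ws = [np.random.randn(4, 3), np.random.randn(3, 2)]
bs = [np.zeros(3), np.zeros(2)]
x, d = np.random.randn(4), np.array([1.0, 0.0])
ys, zs = forward(x, Ws, bs)
dWs, dbs = backward(ys, zs, Ws, d)
```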
for vector activations, step (2) is no longer an element-wise multiplication:

[grad(z_0), grad(z_1), \dots, grad(z_i)]^T = [\frac{\partial y_j}{\partial z_i}] \cdot [grad(y_0), grad(y_1), \dots, grad(y_j)]^T
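A small sketch of this Jacobian-vector product for the softmax case (names are mine; the softmax Jacobian happens to be symmetric, so the transpose is cosmetic here):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])
y = softmax(z)
J = np.diag(y) - np.outer(y, y)   # J[i, j] = dy_i/dz_j
dy = np.array([0.2, -0.1, 0.4])   # incoming grad w.r.t. y
dz = J.T @ dy                     # grad(z_i) = sum_j (dy_j/dz_i) * grad(y_j)
```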

