Lecture 4 | Backpropagation

2019-10-19  Ysgc

vector activation (e.g. softmax) vs scalar activation (e.g. sigmoid)

\frac{\partial y_i}{\partial z_i} = \frac{\exp(z_i)}{\sum_k \exp(z_k)} - \frac{\exp(z_i)^2}{(\sum_k \exp(z_k))^2} = y_i(1 - y_i)

\frac{\partial y_i}{\partial z_j} = - \frac{\exp(z_i)\exp(z_j)}{(\sum_k \exp(z_k))^2} = -y_i y_j \quad (i \neq j)
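A quick numerical check of these two softmax-derivative formulas (a minimal numpy sketch; the helper names are mine, not from the lecture):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    # dy_i/dz_i = y_i(1 - y_i), dy_i/dz_j = -y_i y_j for i != j
    y = softmax(z)
    return np.diag(y) - np.outer(y, y)

z = np.array([1.0, 2.0, 0.5])
J = softmax_jacobian(z)

# finite-difference check of the analytic Jacobian
eps = 1e-6
J_num = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3); dz[j] = eps
    J_num[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(J, J_num, atol=1e-6))   # True
```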

sigmoid output -> prob of classification

how to define the error???

first choice: squared Euclidean distance

L2 divergence -> the derivative w.r.t. y_i is just y_i - d_i
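As a small sketch (assuming the usual 1/2 factor in front of the squared distance, so the derivative is exactly y_i - d_i):

```python
import numpy as np

def l2_div(y, d):
    # Div = 1/2 * ||y - d||^2
    return 0.5 * np.sum((y - d) ** 2)

def l2_div_grad(y, d):
    # dDiv/dy_i = y_i - d_i
    return y - d

y = np.array([0.7, 0.2, 0.1])   # network output
d = np.array([1.0, 0.0, 0.0])   # one-hot target
print(l2_div(y, d), l2_div_grad(y, d))
```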

gradient < 0 => y_i should increase to reduce the divergence

label smoothing makes the targets "arithmetically wrong" (no longer exactly 0/1), but it helps gradient descent!

avoid overshooting

https://leimao.github.io/blog/Label-Smoothing/

it's a heuristic
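A minimal sketch of label smoothing as described in the linked post (the mixing weight eps = 0.1 is just an assumed value):

```python
import numpy as np

def smooth_labels(d_onehot, eps=0.1):
    # mix the one-hot target with a uniform distribution over the K classes
    K = d_onehot.shape[-1]
    return (1.0 - eps) * d_onehot + eps / K

d = np.array([1.0, 0.0, 0.0])
print(smooth_labels(d))   # [0.9333..., 0.0333..., 0.0333...]
```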

forward NN

backward NN

(1) trivial: grad of output


(2) grad of the final activation layer



(3) grad of the last group of weights

[grad(W_{ij}^n)] = [y_0^{n-1}, y_1^{n-1}, \dots, y_i^{n-1}]^T \cdot [grad(z_0^n), grad(z_1^n), \dots, grad(z_j^n)]
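In code this is a single outer product (a sketch with made-up numbers; indices follow the equation above, with W_{ij}^n connecting y_i^{n-1} to z_j^n):

```python
import numpy as np

y_prev = np.array([0.3, 0.8, 0.1])   # y^{n-1}, 3 units
dz = np.array([0.5, -0.2])           # grad w.r.t. z^n, 2 units
dW = np.outer(y_prev, dz)            # shape (3, 2); dW[i, j] = y_prev[i] * dz[j]
```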

(4) grad of the second-to-last group of y


[grad(y_0^{n-1}), grad(y_1^{n-1}), \dots, grad(y_i^{n-1})]^T = [W_{ij}^n] \cdot [grad(z_0^n), grad(z_1^n), \dots, grad(z_j^n)]^T
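And step (4) is a matrix-vector product with the same W (again a sketch; with z_j^n = \sum_i W_{ij}^n y_i^{n-1}, the backward map applies W to grad(z)):

```python
import numpy as np

W = np.array([[0.1, -0.3],
              [0.2,  0.4],
              [0.0,  0.5]])          # W[i, j] = W_ij^n, shape (3, 2)
dz = np.array([0.5, -0.2])           # grad w.r.t. z^n
dy_prev = W @ dz                     # grad(y_i^{n-1}) = sum_j W[i, j] * dz[j]
```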

(5) In summary: pseudocode & backward/forward comparison


backward: in each loop, apply an affine transformation (the transpose of the forward W) to the derivative, then multiply by the derivative of the activation function

forward: in each iteration, apply an affine transformation to the input, then an activation function
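A compact sketch of the two loops side by side, assuming scalar (sigmoid) activations, an L2 divergence, and the convention z^k = (W^k)^T y^{k-1} + b^k (all names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws, bs):
    ys, zs = [x], []
    for W, b in zip(Ws, bs):
        z = W.T @ ys[-1] + b                 # affine transformation
        zs.append(z)
        ys.append(sigmoid(z))                # activation function
    return ys, zs

def backward(ys, zs, Ws, d):
    dy = ys[-1] - d                          # step (1): grad of output (L2 divergence)
    dWs, dbs = [], []
    for k in reversed(range(len(Ws))):
        y = ys[k + 1]                        # output of layer k
        dz = dy * y * (1 - y)                # step (2): times sigmoid'(z) = y(1 - y)
        dWs.insert(0, np.outer(ys[k], dz))   # step (3): outer product
        dbs.insert(0, dz)
        dy = Ws[k] @ dz                      # step (4): transpose of the forward affine map
    return dWs, dbs

Ws = [np.random.randn(4, 3), np.random.randn(3, 2)]
bs = [np.zeros(3), np.zeros(2)]
x, d = np.random.randn(4), np.array([1.0, 0.0])
ys, zs = forward(x, Ws, bs)
dWs, dbs = backward(ys, zs, Ws, d)
```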
for vector activations, step (2) is no longer an element-wise multiplication:

[grad(z_0), grad(z_1), \dots, grad(z_i)]^T = [\frac{\partial y_j}{\partial z_i}] \cdot [grad(y_0), grad(y_1), \dots, grad(y_j)]^T
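A small sketch of this Jacobian-vector product for the softmax case (names are mine; the softmax Jacobian happens to be symmetric, so the transpose is cosmetic here):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])
y = softmax(z)
J = np.diag(y) - np.outer(y, y)   # J[i, j] = dy_i/dz_j
dy = np.array([0.2, -0.1, 0.4])   # incoming grad w.r.t. y
dz = J.T @ dy                     # grad(z_i) = sum_j (dy_j/dz_i) * grad(y_j)
```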

