
9. Neural Networks: Learning

2020-08-19  玄语梨落


Cost Function

L = total number of layers in the network
s_l = number of units (not counting the bias unit) in layer l; K = s_L = number of output units

J(\Theta) = -\frac{1}{m}[\sum_{i=1}^m\sum_{k=1}^K y_k^{(i)}log(h_\Theta(x^{(i)}))_k+(1-y_k^{(i)})log(1-(h_\Theta(x^{(i)}))_k)]+\frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\Theta_{ji}^{(l)})^2 \newline h_\Theta(x)\in R^K \qquad (h_\Theta(x))_i = i^{th}\text{ output}
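As a rough Octave sketch of this cost (an editor's illustration, not part of the original notes): assume Y is a K×m matrix of one-hot labels, H is the K×m matrix of outputs h_\Theta(x^{(i)}), Thetas is a cell array of the weight matrices, and m and lambda are already defined; the names are only for illustration.

% unregularized part: sum over all m examples and all K output units
J = (-1/m) * sum(sum(Y .* log(H) + (1 - Y) .* log(1 - H)));

% regularization: squares of all weights except the bias column (first column)
reg = 0;
for l = 1:numel(Thetas)
  T = Thetas{l};
  reg = reg + sum(sum(T(:, 2:end) .^ 2));
end
J = J + (lambda / (2*m)) * reg;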

Backpropagation algorithm

\delta_j^{(l)} = "error" of node j in layer l.

For a network with L = 4 layers, the errors are computed backwards, starting from the output layer:

  1. \delta_j^{(4)} = a_j^{(4)} - y_j (for each output unit)
  2. \delta^{(4)} = a^{(4)}-y (vectorized)
  3. \delta^{(3)} = (\Theta^{(3)})^T\delta^{(4)}.*g'(z^{(3)}), where g'(z^{(3)}) = a^{(3)}.*(1-a^{(3)})
  4. \delta^{(2)} = (\Theta^{(2)})^T\delta^{(3)}.*g'(z^{(2)}); there is no \delta^{(1)}, since the inputs carry no error (see the sketch after this list)
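A minimal Octave sketch of these steps for a single training example (an editor's illustration): a2, a3, a4 are the activation vectors without their bias units, y is the label vector, and the bias column of each Theta matrix is skipped when the error is propagated backwards.

delta4 = a4 - y;                                            % error at the output layer
delta3 = (Theta3(:, 2:end)' * delta4) .* (a3 .* (1 - a3));  % g'(z3) = a3 .* (1 - a3)
delta2 = (Theta2(:, 2:end)' * delta3) .* (a2 .* (1 - a2));  % no delta1 for the input layer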

Backpropagation algorithm

Set \Delta_{ij}^{(l)} = 0 for all l, i, j; then, looping over the m training examples, compute the deltas as above and accumulate

\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)}+a_j^{(l)}\delta_i^{(l+1)}

Vectorized implementation:

\Delta^{(l)} := \Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T

D_{ij}^{(l)} = \frac{1}{m}\Delta_{ij}^{(l)}+\frac{\lambda}{m}\Theta_{ij}^{(l)} \ (j\ne 0), \qquad D_{ij}^{(l)} = \frac{1}{m}\Delta_{ij}^{(l)} \ (j = 0)
\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta) = D_{ij}^{(l)}
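A hedged Octave sketch of the whole loop for the L = 4 example (variable names are illustrative; the per-example deltas are computed as in the sketch above, and a1, a2, a3 are activations without bias units):

Delta1 = zeros(size(Theta1));  Delta2 = zeros(size(Theta2));  Delta3 = zeros(size(Theta3));
for i = 1:m
  % ... forward propagation for example i, then delta4, delta3, delta2 as above ...
  Delta3 = Delta3 + delta4 * [1; a3]';   % a^(l) with the bias unit prepended
  Delta2 = Delta2 + delta3 * [1; a2]';
  Delta1 = Delta1 + delta2 * [1; a1]';
end
D1 = Delta1 / m;  D2 = Delta2 / m;  D3 = Delta3 / m;
% add regularization to every column except the bias column (j = 0)
D1(:, 2:end) = D1(:, 2:end) + (lambda/m) * Theta1(:, 2:end);
D2(:, 2:end) = D2(:, 2:end) + (lambda/m) * Theta2(:, 2:end);
D3(:, 2:end) = D3(:, 2:end) + (lambda/m) * Theta3(:, 2:end);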

Backpropagation intuition

Forward Propagation

Intuition: \delta_j^{(l)} is (formally) \frac{\partial}{\partial z_j^{(l)}}cost(i), i.e. how much the cost of example i changes if the weighted input z_j^{(l)} of node j in layer l is perturbed. Backpropagation computes these errors layer by layer, from the output back towards the input, mirroring the way forward propagation computes the activations from the input to the output.

Implementation note: Unrolling parameters

function [jVal, gradient] = costFunction(theta)

The parameters 'theta' and 'gradient' passed to and returned from an advanced optimizer must be vectors. In a neural network, however, the parameters \Theta^{(1)}, \Theta^{(2)}, ... and the gradients D^{(1)}, D^{(2)}, ... are matrices, so we need a way to unroll them into vectors (and to reshape the vectors back into matrices).

s_1=10,s_2=10,s_3=1 \newline \Theta^{(1)}\in R^{10\times11},\Theta^{(2)}\in R^{10\times11},\Theta^{(3)}\in R^{1\times11} \newline D^{(1)}\in R^{10\times11},D^{(2)}\in R^{10\times11},D^{(3)}\in R^{1\times11}

thetaVec = [Theta1(:); Theta2(:); Theta3(:)];   % unroll all parameter matrices into one long vector
DVec = [D1(:); D2(:); D3(:)];                   % same for the gradient matrices
Theta1 = reshape(thetaVec(1:110), 10, 11);      % recover the original matrices
Theta2 = reshape(thetaVec(111:220), 10, 11);
Theta3 = reshape(thetaVec(221:231), 1, 11);
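Inside costFunction, the incoming vector is reshaped back into matrices before forward/backpropagation, and the gradient matrices are unrolled again before being returned. A rough sketch (the forward/backpropagation step itself is omitted here; the sizes match the 10/10/1 example above):

function [jVal, gradientVec] = costFunction(thetaVec)
  % recover the weight matrices from the unrolled vector
  Theta1 = reshape(thetaVec(1:110), 10, 11);
  Theta2 = reshape(thetaVec(111:220), 10, 11);
  Theta3 = reshape(thetaVec(221:231), 1, 11);
  % ... forward propagation / backpropagation with Theta1, Theta2, Theta3
  %     to compute jVal and the gradient matrices D1, D2, D3 ...
  gradientVec = [D1(:); D2(:); D3(:)];   % unroll the gradients for the optimizer
end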

Gradient checking

Gradient checking numerically verifies that the forward propagation and backpropagation implementations are correct.

Implement: gradApprox = (J(theta+EPSILON)-J(theta-EPSILON))/(2*EPSILON)

Parameter vector \theta

\frac{\partial}{\partial\theta_1}J(\theta)\approx\frac{J(\theta_1+\epsilon,\theta_2,\theta_3,...,\theta_n)-J(\theta_1-\epsilon,\theta_2,\theta_3,...,\theta_n)}{2\epsilon} \newline ...
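In Octave this check can be implemented with a loop such as the following (assuming thetaVec is the unrolled parameter vector and costFunction is the routine above; an EPSILON around 10^{-4} usually works well):

EPSILON = 1e-4;
n = length(thetaVec);
gradApprox = zeros(n, 1);
for i = 1:n
  thetaPlus = thetaVec;   thetaPlus(i) = thetaPlus(i) + EPSILON;
  thetaMinus = thetaVec;  thetaMinus(i) = thetaMinus(i) - EPSILON;
  gradApprox(i) = (costFunction(thetaPlus) - costFunction(thetaMinus)) / (2*EPSILON);
end
% gradApprox should then be very close to DVec computed by backpropagation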

Implementation Note: once gradApprox has been verified to be close to DVec, be sure to disable the gradient checking code before training. The numerical gradient is very slow to compute compared with backpropagation, so running it on every iteration would make learning extremely slow.

Random initialization

With zero initialization (\Theta_{ij}^{(l)} = 0 for all i, j, l), after each update the parameters corresponding to the inputs going into each of the hidden units remain identical, so all hidden units compute the same, highly redundant feature (the symmetry problem).

Random initialization: Symmetry breaking

Initialize each \Theta_{ij}^{(l)} to a random value in [-\epsilon, \epsilon]
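For the 10/10/1 example above this can be done in Octave with rand, which returns values in [0, 1] (INIT_EPSILON is a small constant chosen by the user; 0.12 below is only an example value):

INIT_EPSILON = 0.12;                                     % example value
Theta1 = rand(10, 11) * (2*INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10, 11) * (2*INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1, 11)  * (2*INIT_EPSILON) - INIT_EPSILON;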

Putting it together

Training a neural network

Pick a network architecture (connectivity pattern between neurons): the number of input units is the dimension of the features x^{(i)}, the number of output units is the number of classes, and a reasonable default for the hidden part is a single hidden layer (or several hidden layers with the same number of units).

  1. Randomly initialize the weights
  2. Implement forward propagation to get h_\Theta(x^{(i)}) for any x^{(i)}
  3. Implement code to compute the cost function J(\Theta)
  4. Implement backpropagation to compute the partial derivatives \frac{\partial}{\partial\Theta_{jk}^{(l)}}J(\Theta)
  5. Use gradient checking to compare \frac{\partial}{\partial\Theta_{jk}^{(l)}}J(\Theta) computed using backpropagation vs. the numerical estimate of the gradient of J(\Theta). Then disable the gradient checking code.
  6. Use gradient descent or an advanced optimization method with backpropagation to try to minimize J(\Theta) as a function of the parameters \Theta (see the sketch after this list)
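Putting the steps together, a rough Octave sketch of the whole training call, reusing costFunction and the 10/10/1 example sizes (the iteration count and INIT_EPSILON are only example values):

% step 1: random initialization (symmetry breaking)
INIT_EPSILON = 0.12;
initTheta1 = rand(10, 11) * (2*INIT_EPSILON) - INIT_EPSILON;
initTheta2 = rand(10, 11) * (2*INIT_EPSILON) - INIT_EPSILON;
initTheta3 = rand(1, 11)  * (2*INIT_EPSILON) - INIT_EPSILON;
initialThetaVec = [initTheta1(:); initTheta2(:); initTheta3(:)];

% steps 2-6: minimize J(Theta) with an advanced optimization method
options = optimset('GradObj', 'on', 'MaxIter', 100);
[optThetaVec, jVal] = fminunc(@costFunction, initialThetaVec, options);

% reshape the optimized vector back into the weight matrices
Theta1 = reshape(optThetaVec(1:110), 10, 11);
Theta2 = reshape(optThetaVec(111:220), 10, 11);
Theta3 = reshape(optThetaVec(221:231), 1, 11);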

Autonomous driving example

In the course this is illustrated with the ALVINN project, in which a neural network learns to steer a car by watching a human driver.
