Lecture 7 | Optimization and Gen
Convergence rate:
- SGD: one sample, one update => then decay the LR
- mini-batch: one update (and one LR decay) every several samples
let's first look back
smoothing by averaging: core of momentum and its variants
scale the movement in different dims, according to the average and variation
be careful of the notation: not the 2nd-order derivative, but the square of the 1st-order derivative
RMSprop: in the ideal case, the grad and the denominator cancel out, so only the sign of the grad remains.
momentum only looks at the average of the grad, but RMSprop looks at the mean square of the grad.
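To make the contrast concrete, here is a minimal NumPy sketch of the two update rules. The function names and the hyperparameters (lr, beta, gamma, eps) are my own assumptions, not the lecture's notation. The demo pretends the running mean square has already converged, which is the "ideal case" above: the step then has the same size in every dim and only the sign of the grad survives.

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.1, beta=0.9):
    # running average of past steps smooths the direction of movement
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity

def rmsprop_step(w, grad, mean_sq, lr=0.1, gamma=0.9, eps=1e-8):
    # running mean of the SQUARED 1st-order gradient (not a 2nd derivative)
    mean_sq = gamma * mean_sq + (1.0 - gamma) * grad ** 2
    return w - lr * grad / (np.sqrt(mean_sq) + eps), mean_sq

# two dims with gradients of very different scale
grad = np.array([100.0, -0.01])
w = np.zeros(2)
mean_sq = grad ** 2  # assume the running mean square has already converged
w_new, mean_sq = rmsprop_step(w, grad, mean_sq)
step = w_new - w  # roughly -lr * sign(grad) in both dims

# one momentum step from rest, for comparison
w_mom, vel = momentum_step(np.zeros(1), np.ones(1), np.zeros(1))
```

Plain momentum would move 10^4 times further in the first dim than in the second; the RMSprop normalization removes that scale difference.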
any method using both the first and second moments of the grad history? -> Adam
initially the running averages start at 0, and delta is close to 1, so the averages stay near 0 for a while and the early steps will be too small -> bias correction.
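A hedged sketch of one Adam step with bias correction, to see what the zero initialization does. Only delta comes from these notes; the names m, v, gamma, k, lr, eps and the default values are assumptions following the common Adam formulation. The step counter k starts at 1.

```python
import numpy as np

def adam_step(w, grad, m, v, k, lr=0.001, delta=0.9, gamma=0.999, eps=1e-8):
    # first moment: running average of the gradient (momentum-like)
    m = delta * m + (1.0 - delta) * grad
    # second moment: running average of the squared gradient (RMSprop-like)
    v = gamma * v + (1.0 - gamma) * grad ** 2
    # m and v start at 0, so they underestimate early on; rescale to correct
    m_hat = m / (1.0 - delta ** k)
    v_hat = v / (1.0 - gamma ** k)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

grad = np.array([4.0])
w, m, v = adam_step(np.zeros(1), grad, np.zeros(1), np.zeros(1), k=1)
```

After the first step the raw average m is only (1 - delta) * grad, i.e. 0.4 for a gradient of 4; dividing by (1 - delta^k) recovers the gradient itself.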
want in adam ???
Beale's function: adadelta is the fastest in this case, SGD is slow
SGD is still the slowest
if the case is more complex, SGD may not be that bad.
methods considering the second moment of the grad are usually fast enough and don't oscillate
designing the objective func!!! hopefully use the right one
both L2 and KL div are convex
KL div => prior knowledge that output's range is (0,1)
regression => L2
classification => KL
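The two objectives written out, as a rough sketch. The function names, the 1/2 factor in L2, and the clipping constant eps are my assumptions. With a one-hot target, the KL div reduces to minus the log of the probability assigned to the correct class.

```python
import numpy as np

def l2_loss(y, d):
    # regression: half the squared Euclidean distance between output y and target d
    return 0.5 * np.sum((y - d) ** 2)

def kl_div(y, d, eps=1e-12):
    # classification: KL(d || y), with y and d both probability vectors
    y = np.clip(y, eps, 1.0)   # avoid log(0)
    mask = d > 0               # terms with d = 0 contribute nothing
    return np.sum(d[mask] * np.log(d[mask] / y[mask]))

y = np.array([0.7, 0.2, 0.1])   # predicted class probabilities
d = np.array([1.0, 0.0, 0.0])   # one-hot target
kl = kl_div(y, d)
l2 = l2_loss(np.array([1.0, 2.0]), np.array([0.0, 0.0]))
```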
batch normalization???
mini-batch => assumption is that every batch covers the same region
however, they can be apart from each other
two steps of batch normalization:
- move to the origin, normalization
- shift to the common location
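The two steps above, sketched for a single mini-batch. The names scale and shift (the learned per-unit parameters), eps, and the toy data are assumptions for illustration.

```python
import numpy as np

def batch_norm_forward(z, scale, shift, eps=1e-5):
    # step 1: move the batch to the origin and normalize to unit variance
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    u = (z - mu) / np.sqrt(var + eps)
    # step 2: scale and shift every batch to the same learned location
    return scale * u + shift, mu, var

rng = np.random.default_rng(0)
z = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # a batch far from the origin
out, mu, var = batch_norm_forward(z, scale=np.ones(4), shift=np.zeros(4))
```

With scale = 1 and shift = 0 the output of every unit has zero mean and (almost exactly) unit variance over the batch, wherever the batch originally sat.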
batch normalization sits between the affine transformation and the activation
can't be done in SGD => only makes sense with mini-batches (not with the whole batch either: all the data sits in the same region)
derivative computing can be painful now...
already given two slides back
https://www.quora.com/How-does-batch-normalization-behave-differently-at-training-time-and-test-time
my question: will two groups of inputs with different labels produce the same outputs, if the two groups have dramatically different means but the same distribution shape?
for test data ->
- batch -> mean and var of batch????? (not sure)
- single input -> from historical mean of mean and var
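One possible reading of "historical mean of mean and var", as a sketch: keep running averages of the batch statistics during training and use them at test time. The exponential moving average, the class name RunningBatchNorm, and the momentum value are my assumptions (a plain average over all training batches would also fit the note); the learned scale and shift are left out for brevity.

```python
import numpy as np

class RunningBatchNorm:
    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.mean = np.zeros(dim)  # historical mean of batch means
        self.var = np.ones(dim)    # historical mean of batch variances

    def train_step(self, z):
        # training: normalize with THIS batch's statistics, update the history
        mu, var = z.mean(axis=0), z.var(axis=0)
        self.mean = self.momentum * self.mean + (1 - self.momentum) * mu
        self.var = self.momentum * self.var + (1 - self.momentum) * var
        return (z - mu) / np.sqrt(var + self.eps)

    def predict(self, z):
        # test: a single input has no batch statistics, so use the history
        return (z - self.mean) / np.sqrt(self.var + self.eps)

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(32, 2))
bn = RunningBatchNorm(dim=2)
for _ in range(200):  # same batch repeated, so the history converges to its stats
    bn.train_step(batch)
single = bn.predict(batch.mean(axis=0))  # one input sitting at the batch mean
```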
regularization and overfitting
10^30 possible inputs -> full description of these points
even if we have 10^15 samples, the space is still nearly empty
sigmoid permits these steep curves
another way to smooth output -> Deeper!
rearrange the structures: 660 params -> 3 layers 220 NNs -> 4 layers 165 NN -> ... -> prefer narrow but deep NN
Dropout
drop out is similar to bagging
different inputs may see different NNs
pseudo code
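A runnable stand-in for the pseudo code, under assumptions: this is the "inverted dropout" variant (rescale at training time, do nothing at test time), which may differ from the version on the slide; the names keep_prob, rng, and train are mine.

```python
import numpy as np

def dropout(a, keep_prob, rng, train=True):
    # test time: the full network is used, no masking
    if not train:
        return a
    # training: each unit is kept with probability keep_prob, so every
    # input effectively sees a different thinned network
    mask = rng.random(a.shape) < keep_prob
    # divide by keep_prob so the expected activation matches test time
    return a * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((1000, 50))            # toy activations
out = dropout(a, keep_prob=0.8, rng=rng)
```

Every entry of out is either 0 (dropped) or 1.25 (kept and rescaled), and the average stays close to 1.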
with dropout -> force NN to learn more robust model
grad is high at some region -> blow up