Sharp Minima Can Generalize For Deep Nets, https://arxiv.org/abs/1703.04933
Conventionally, many researchers hold the view that the flatness of the minima found when training deep neural networks contributes to their generalization ability. This paper argues that common flatness measures fail to describe the flatness of DNN minima in a meaningful way: exploiting the non-negative homogeneity of ReLU networks, one can reparameterize a minimum (e.g., by rescaling adjacent layers) to make it arbitrarily sharp or flat without changing the function the network computes.
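To convince myself of the reparameterization argument, a minimal sketch (my own, not the paper's code): for a two-layer ReLU net, scaling the first layer by $\alpha$ and the second by $1/\alpha$ leaves the output, and hence the loss, unchanged, while the scale of the parameters (and any neighborhood- or Hessian-based flatness quantity) changes drastically.

```python
# Sketch of the layer-wise rescaling trick: ReLU is non-negatively homogeneous,
# so f(x) = W2 @ relu(W1 @ x) is unchanged when W1 -> a*W1 and W2 -> W2/a,
# even though the point in parameter space moves and stretches by a factor a.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 8))   # first layer weights
W2 = rng.normal(size=(1, 16))   # second layer weights
x = rng.normal(size=(8,))

def f(W1, W2, x):
    return W2 @ np.maximum(W1 @ x, 0.0)

alpha = 1000.0                  # layer-wise rescaling factor
out_original = f(W1, W2, x)
out_rescaled = f(alpha * W1, W2 / alpha, x)

print(np.allclose(out_original, out_rescaled))          # True: same function
print(np.linalg.norm(W1), np.linalg.norm(alpha * W1))   # parameter scale differs by 1000x
```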
Three kinds of flatness measures are investigated: volume $\epsilon$-flatness, Hessian-based curvature, and $\epsilon$-sharpness.
Given $\epsilon > 0$, a minimum $\theta$, and a loss $L$, we define $C(L, \theta, \epsilon)$ as the largest (using inclusion as the partial order over the subsets of $\Theta$) connected set containing $\theta$ such that $\forall \theta' \in C(L, \theta, \epsilon), L(\theta') < L(\theta) + \epsilon$. The $C(L, \theta, \epsilon)$-flatness is defined as the volume of $C(L, \theta, \epsilon)$. We will call this measure the volume $\epsilon$-flatness.
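In one dimension the definition is easy to picture. A hypothetical sketch (the toy loss is my own choosing, not from the paper): the volume $\epsilon$-flatness is just the length of the connected interval around the minimum on which the loss stays below $L(\theta) + \epsilon$.

```python
# 1-D illustration of volume epsilon-flatness: the "volume" of the largest
# connected set around the minimum theta where L stays below L(theta) + eps.
# Here it is the length of an interval, found by scanning a dense grid.
import numpy as np

def L(theta):
    # toy loss with its minimum at theta = 0
    return theta**2 / (1.0 + theta**2)

theta_star = 0.0
eps = 0.1
grid = np.linspace(-5, 5, 100001)
below = L(grid) < L(theta_star) + eps

# walk outwards from the minimum while staying in the sublevel set,
# so only the *connected* component containing theta_star is counted
i0 = np.argmin(np.abs(grid - theta_star))
lo = i0
while lo > 0 and below[lo - 1]:
    lo -= 1
hi = i0
while hi < len(grid) - 1 and below[hi + 1]:
    hi += 1

volume_flatness = grid[hi] - grid[lo]
print(volume_flatness)   # length of the interval C(L, theta, eps), about 2/3 here
```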
Let $B_2(\epsilon, \theta)$ be a Euclidean ball centered on a minimum $\theta$ with radius $\epsilon$. Then, for a non-negative valued loss function $L$, the $\epsilon$-sharpness will be defined as proportional to

$$\frac{\max_{\theta' \in B_2(\epsilon, \theta)} \bigl(L(\theta') - L(\theta)\bigr)}{1 + L(\theta)}$$
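A toy sketch of how I read this definition (the quadratic loss and the random-sampling approximation are my own assumptions; the paper, following Keskar et al., takes the exact maximum over the ball): random points on the ball give a lower bound on the $\epsilon$-sharpness.

```python
# Hypothetical epsilon-sharpness estimate for a toy quadratic loss.  The exact
# definition maximizes L(theta') - L(theta) over the ball B_2(eps, theta); here
# the max is approximated by random samples, which only gives a lower bound.
# For a quadratic centered at theta the maximum lies on the ball's boundary.
import numpy as np

rng = np.random.default_rng(0)

H = np.diag([100.0, 1.0])     # toy Hessian: one sharp and one flat direction
theta = np.zeros(2)           # minimum of the quadratic loss

def L(p):
    return 0.5 * p @ H @ p

def epsilon_sharpness(L, theta, eps, n_samples=10000):
    best = 0.0
    for _ in range(n_samples):
        u = rng.normal(size=theta.size)
        u *= eps / np.linalg.norm(u)          # point on the sphere of radius eps
        best = max(best, L(theta + u) - L(theta))
    return best / (1.0 + L(theta))

print(epsilon_sharpness(L, theta, eps=0.1))   # close to 0.5 * 100 * 0.1**2 = 0.5
```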
New terms
- Lipschitz constant: the smallest $K \ge 0$ such that $|f(x) - f(y)| \le K \, \|x - y\|$ for all $x, y$ (see the sketch below).
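My own quick numerical illustration (not from the paper): sampling pairs of points gives a lower bound on the smallest such $K$.

```python
# Numerical lower bound on the Lipschitz constant of a 1-D function:
# max over sampled pairs of |f(x) - f(y)| / |x - y|.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(3.0 * x)        # true Lipschitz constant is 3

x = rng.uniform(-np.pi, np.pi, size=100000)
y = rng.uniform(-np.pi, np.pi, size=100000)
ratio = np.abs(f(x) - f(y)) / np.abs(x - y)
print(ratio.max())                # lower bound, approaches 3 with enough samples
```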
To be honest, I don't understand all the details in the paper, but it is worth re-reading.