A bunch of tips and tricks for training deep neural networks
Training deep neural networks is difficult. It requires knowledge and experience to train a model properly and obtain an optimal one. In this post, I would like to share what I have learned from training deep neural networks. The following tips and tricks could be beneficial for your research and could help you speed up an architecture or hyper-parameter search.
Now, let’s jump into it…
1). Before you start building your network architecture, the first thing you need to do is verify that the data you feed into the network is correct, i.e. that each input (x) corresponds to its label (y). In the case of dense prediction, make sure that the ground-truth label (y) is properly encoded as label indexes (or one-hot encodings). If not, the training won't work. A quick sanity check is sketched below.
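A minimal sketch of such a check: x_train, y_train and num_classes below are placeholders for your own data and number of classes.
import numpy as np
from keras.utils import to_categorical

# inputs and labels must stay aligned
assert len(x_train) == len(y_train), "inputs and labels are misaligned"

# encode integer labels as one-hot vectors for classification / dense prediction
y_onehot = to_categorical(y_train, num_classes=num_classes)
print(x_train.shape, y_onehot.shape)

# eyeball a few (x, y) pairs to confirm they really match
for i in np.random.choice(len(x_train), 3, replace=False):
    print("sample", i, "label:", y_train[i])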
2). Decide whether to use a pre-trained model or to train your network from scratch:
- If the dataset in your problem domain is similar to the ImageNet dataset, use a model pre-trained on ImageNet. The most widely used pre-trained models are VGG, ResNet, DenseNet, Xception, etc. There are many depth variants, for instance VGG (19 and 16 layers), ResNet (152, 101, 50 layers or fewer) and DenseNet (201, 169 and 121 layers). Note: do not search for hyper-parameters with the deeper nets (e.g. VGG-19, ResNet-152 or DenseNet-201), because that is computationally expensive; use the shallower variants instead (e.g. VGG-16, ResNet-50 or DenseNet-121). Pick one pre-trained model that you think gives the best performance with your hyper-parameters (say ResNet-50). After you have obtained the optimal hyper-parameters, switch to the same family with more layers (say ResNet-101 or ResNet-152) to increase the accuracy.
- Fine-tune a few layers, or only train the classifier, if you have a small dataset. You can also try inserting Dropout layers after the convolutional layers you are going to fine-tune, because it can help combat overfitting in your network (a minimal sketch follows this list).
- If your dataset is not similar to the ImageNet dataset, you may consider building and training your network from scratch.
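A minimal sketch of the pre-trained route described above, assuming a small dataset; num_classes and the input size are placeholders. It freezes a ResNet-50 backbone, adds Dropout and trains only the classifier head.
from keras.applications import ResNet50
from keras.layers import GlobalAveragePooling2D, Dropout, Dense
from keras.models import Model

base = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False          # freeze the backbone, train only the classifier head

x = GlobalAveragePooling2D()(base.output)
x = Dropout(0.5)(x)                  # Dropout to help combat overfitting
out = Dense(num_classes, activation='softmax')(x)
model = Model(base.input, out)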
3). Always use normalization layers in your network. If you train the network with a large batch size (say 10 or more), use a BatchNormalization layer. Otherwise, if you train with a small batch size (say 1), use an InstanceNormalization layer instead. Note that several authors have found that BatchNormalization improves performance when the batch size increases and degrades it when the batch size is small, whereas InstanceNormalization gives slight performance improvements with a small batch size. You may also try GroupNormalization.
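A minimal sketch: BatchNormalization ships with Keras, while InstanceNormalization is assumed to come from an add-on package such as keras-contrib or tensorflow-addons, so adjust that import to whatever you have installed.
from keras.layers import Conv2D, BatchNormalization
# from keras_contrib.layers import InstanceNormalization  # add-on package, if installed

model.add(Conv2D(64, (3, 3), padding='same'))
model.add(BatchNormalization())        # large batch size (say 10 or more)
# model.add(InstanceNormalization())   # small batch size (say 1)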
4). Use SpatialDropout after a feature concatenation if you have two or more convolution layers (say Li) operating on the same input (say F). Since those convolutional layers operate on the same input, the output features are likely to be correlated. SpatialDropout removes those correlated features and helps prevent overfitting in the network. Note: it is mostly used in lower layers rather than higher layers.
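A minimal sketch in the Keras functional API; F below is a placeholder for the shared input tensor and the filter counts are illustrative.
from keras.layers import Conv2D, SpatialDropout2D, concatenate

l1 = Conv2D(32, (3, 3), padding='same', activation='relu')(F)
l2 = Conv2D(32, (5, 5), padding='same', activation='relu')(F)
merged = concatenate([l1, l2])          # features from the same input tend to be correlated
merged = SpatialDropout2D(0.2)(merged)  # drops whole feature maps rather than single units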
5). To determine your network capacity, try to overfit your network on a small subset of the training examples (a note from Andrej Karpathy). If it doesn't overfit, increase your network capacity. Once it overfits, apply regularization techniques such as L1, L2, Dropout or others to combat the overfitting (a sketch of this check follows).
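A minimal sketch of that sanity check, assuming model is already compiled and x_train / y_train are your training arrays:
x_small, y_small = x_train[:100], y_train[:100]

# with enough capacity, training accuracy on this tiny subset should approach 1.0
model.fit(x_small, y_small, epochs=200, batch_size=16, verbose=0)
loss, acc = model.evaluate(x_small, y_small, verbose=0)
print("training accuracy on the small subset:", acc)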
6). Another regularization technique is to constrain or bound your network weights. This can also help prevent the gradient explosion problem in your network, since the weights are always bounded. In contrast to L2 regularization, where you penalize large weights in your loss function, this constraint regularizes the weights directly. You can easily set a weight constraint in Keras:
from keras.models import Sequential
from keras.layers import Dense, Conv2D
from keras.constraints import max_norm

model = Sequential()
# add to Dense layers
model.add(Dense(64, kernel_constraint=max_norm(2.)))
# or add to Conv layers
model.add(Conv2D(64, (3, 3), kernel_constraint=max_norm(2.)))
7). Mean subtraction from the data sometimes gives really poor performance, especially subtraction from grayscale images (I personally faced this problem in the foreground segmentation domain).
8). Always shuffle your training data, both before training and during training, unless you need to exploit temporal ordering. This may help improve your network performance; a minimal sketch follows.
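A minimal sketch, assuming x_train / y_train NumPy arrays and an already-compiled model; note that Keras already re-shuffles batches each epoch when shuffle=True, which is the default.
import numpy as np

perm = np.random.permutation(len(x_train))             # shuffle once before training
x_train, y_train = x_train[perm], y_train[perm]

model.fit(x_train, y_train, epochs=50, shuffle=True)   # and re-shuffle every epoch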
9). If your problem domain is related to dense prediction (e.g. semantic segmentation), I recommend using Dilated Residual Networks as a pre-trained model, since they are optimized for dense prediction.
10). To capture contextual information around objects, use a multi-scale feature pooling module. This can further help improve the accuracy, and the idea has been used successfully in semantic segmentation and foreground segmentation (a rough sketch follows).
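A rough sketch of such a module, in the spirit of pyramid pooling; it assumes the feature map size is divisible by each pool size, and the filter counts and pool sizes are illustrative.
from keras.layers import AveragePooling2D, Conv2D, UpSampling2D, concatenate

def multi_scale_pool(features, pool_sizes=(2, 4, 8)):
    branches = [features]
    for p in pool_sizes:
        x = AveragePooling2D(pool_size=(p, p))(features)  # coarser contextual view
        x = Conv2D(64, (1, 1), activation='relu')(x)      # squeeze channels
        x = UpSampling2D(size=(p, p))(x)                  # back to the input resolution
        branches.append(x)
    return concatenate(branches)                          # multi-scale feature map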
11). Exclude void labels (or ambiguous regions) from your loss and accuracy computations, if there are any. This can help your network become more confident in its predictions.
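A minimal sketch of such a masked loss, assuming one-hot encoded labels where void or ambiguous pixels are encoded as an all-zero vector:
import keras.backend as K

def masked_categorical_crossentropy(y_true, y_pred):
    # 1 for valid pixels, 0 where the one-hot label is all zeros (void)
    mask = K.cast(K.any(K.not_equal(y_true, 0), axis=-1), K.floatx())
    loss = K.categorical_crossentropy(y_true, y_pred)
    return K.sum(loss * mask) / (K.sum(mask) + K.epsilon())

model.compile(optimizer='adam', loss=masked_categorical_crossentropy)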
12). Apply class weights during training if you have a highly imbalanced data problem. In other words, give more weight to the rare class and less weight to the majority class. The class weights can easily be computed using sklearn (see below). Alternatively, try resampling your training set using over-sampling and under-sampling techniques. This can also help improve the accuracy of your predictions.
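A minimal sketch with sklearn, assuming y_train holds integer class labels and the model is already compiled:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)
weights = compute_class_weight('balanced', classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))   # more weight for the rare classes

model.fit(x_train, y_train, class_weight=class_weight, epochs=50)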
13). Choose the right optimizer. There are many popular adaptive optimizers such as Adam, Adagrad, Adadelta and RMSprop, and SGD+momentum is widely used across problem domains. There are two things to consider: first, if you care about fast convergence, use an adaptive optimizer such as Adam, but be aware that it may get stuck in a local minimum and provide poor generalization. Second, SGD+momentum can find a better (possibly global) minimum, but it relies on a robust initialization and may take longer to converge than the adaptive optimizers. I recommend using SGD+momentum, since it tends to reach better optima.
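A minimal sketch; the learning rate and momentum values below are common defaults, not values prescribed in this post.
from keras.optimizers import SGD

opt = SGD(lr=1e-2, momentum=0.9, nesterov=True)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])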
14). There are three learning-rate starting points to play with (i.e. 1e-1, 1e-3 and 1e-6). If you fine-tune a pre-trained model, consider a low learning rate, less than 1e-3 (say 1e-4). If you train your network from scratch, consider a learning rate greater than or equal to 1e-3. You can try these starting points, adjust them to see which one works best, and pick that one. One more thing: you may consider slowing down the learning rate as training progresses by using a learning rate scheduler, which can also help improve the network performance (a step-decay sketch follows).
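A minimal step-decay sketch; the decay factor and step size are illustrative, and x_train / y_train are placeholders.
import keras

def step_decay(epoch, lr):
    # halve the learning rate every 10 epochs
    return lr * 0.5 if epoch > 0 and epoch % 10 == 0 else lr

scheduler = keras.callbacks.LearningRateScheduler(step_decay)
model.fit(x_train, y_train, epochs=50, callbacks=[scheduler])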
15). Besides a learning rate schedule, which reduces the learning rate over time, there is another option: reduce the learning rate by some factor (say 10) if the validation loss stops improving for some number of epochs (say 5), and stop the training process altogether if the validation loss stops improving for some number of epochs (say 10). This can be done easily by using ReduceLROnPlateau together with EarlyStopping in Keras:
import keras

reduce = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5, mode='auto')
early = keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=1e-4, patience=10, mode='auto')
# monitoring 'val_loss' requires validation data
model.fit(X, Y, validation_split=0.2, callbacks=[reduce, early])
16). If you work in a dense prediction domain such as foreground segmentation or semantic segmentation, you should use skip connections, since object boundaries and other useful information are lost through max-pooling operations or strided convolutions. Skip connections can also help your network learn the mapping from feature space back to image space more easily, and they help alleviate the vanishing gradient problem in the network.
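A minimal encoder-decoder skip-connection sketch in the Keras functional API; inputs is a placeholder image tensor and the filter counts are illustrative.
from keras.layers import Conv2D, MaxPooling2D, UpSampling2D, concatenate

enc = Conv2D(32, (3, 3), padding='same', activation='relu')(inputs)
down = MaxPooling2D((2, 2))(enc)                  # resolution halved, fine details lost
mid = Conv2D(64, (3, 3), padding='same', activation='relu')(down)
up = UpSampling2D((2, 2))(mid)
dec = concatenate([up, enc])                      # skip connection brings the details back
dec = Conv2D(32, (3, 3), padding='same', activation='relu')(dec)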
17). More data beats a clever algorithm! Always use data augmentation such as horizontal flipping, rotation, zoom-cropping, etc. This can help increase the accuracy by a large margin.
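A minimal augmentation sketch using Keras' built-in generator; the parameter values are illustrative and x_train / y_train are placeholders.
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(horizontal_flip=True,
                             rotation_range=15,
                             zoom_range=0.2,
                             width_shift_range=0.1,
                             height_shift_range=0.1)

model.fit_generator(datagen.flow(x_train, y_train, batch_size=32), epochs=50)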
18). You need a fast GPU for training, but GPUs are a bit costly. If you wish to use a free cloud GPU, I recommend Google Colab. If you don't know where to start with it, see my previous post, or try other cloud GPU platforms such as FloydHub or Paperspace.
19). Use max-pooling before ReLU to save some computation. Since ReLU thresholds values at zero, f(x) = max(0, x), and max-pooling pools only the maximum activations, f(x) = max(x1, x2, ..., xi), use Conv > MaxPool > ReLU rather than Conv > ReLU > MaxPool.
E.g. assume we have two activations from a Conv layer (i.e. 0.5 and -0.5):
- MaxPool > ReLU = max(0, max(0.5, -0.5)) = 0.5
- ReLU > MaxPool = max(max(0, 0.5), max(0, -0.5)) = 0.5
See? The output of the two orderings is still 0.5. In this case, using MaxPool > ReLU can save us one max operation.
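A minimal sketch of that ordering, so ReLU runs on the smaller, pooled feature map:
from keras.layers import Conv2D, MaxPooling2D, Activation

model.add(Conv2D(64, (3, 3), padding='same'))
model.add(MaxPooling2D((2, 2)))   # pool first: 4x fewer activations to threshold
model.add(Activation('relu'))     # then apply ReLU on the pooled map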
20). Consider using a Depthwise Separable Convolution operation, which is fast and greatly reduces the number of parameters compared to the normal convolution operation.
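A minimal sketch; SeparableConv2D performs a depthwise convolution followed by a pointwise (1x1) convolution and acts roughly as a drop-in replacement for a regular Conv2D.
from keras.layers import SeparableConv2D

# far fewer parameters than Conv2D(64, (3, 3)) at a similar receptive field
model.add(SeparableConv2D(64, (3, 3), padding='same', activation='relu'))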
21). And last but not least, don't give up 💪. Trust yourself, you can do it! If you still don't get the high accuracy you've been looking for, tweak your hyper-parameters, network architecture or training data until you get it 👏.
Final words…
If you like this post, feel free to clap or share it with the world. If you have any questions, please drop them in the comments below. You can connect with me on LinkedIn or follow me on Twitter. Have a nice day 🎊.