The post Plotting Activation Functions & Gradients in Neural Networks appeared first on govind.tech.
This post tries to serve as a central place to get a graphical understanding of the most common activation functions in neural networks. I am focusing on the ones used in Computer Vision.
I also plotted the gradient in the same plot so you can get an idea about that too. Remember that when back-propagating through an activation function, the incoming gradient gets multiplied by the derivative of that activation function (the chain rule).
Hence, the value of the activation function’s gradient matters.
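For reference, the chain-rule step being described can be written out like this (the symbols are mine, chosen to match the description rather than copied from the original):

```latex
% For a layer that applies activation y = f(z), back-propagation scales
% the upstream gradient by the activation's derivative:
\frac{\partial L}{\partial z} \;=\; \frac{\partial L}{\partial y} \cdot f'(z)
```

So if \(f'(z)\) is small everywhere, every such layer shrinks the gradient on its way back.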
Let’s start with the simplest activation function: Linear. This activation is equivalent to skipping the activation function altogether. If a neural network with only Dense layers uses Linear activation everywhere, then the hidden layers serve no purpose and the network can be replaced by just one output layer connected to the inputs. Still, linear activation is sometimes used in output layers or in intermediate layers. For example, MobileNet-v2 uses linear activation in the last convolution of each inverted bottleneck block.
Sigmoid activations used to be the most common before they were replaced by ReLU, because the Sigmoid scales down any gradient that passes through it during back-propagation. Even in the best case, the gradient gets scaled down by a factor of 0.25 (see the peak value of the yellow curve in the picture below). This caused problems in deep neural networks: by the time the gradient reached the first layer, it would have almost completely vanished due to the scaling by the layers in between, hence the name vanishing gradient. Still, many neural networks use it in output units when the expected output must stay between 0 and 1.
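The 0.25 factor is easy to verify numerically. A minimal sketch (NumPy, names are mine) that computes the sigmoid and its derivative over the plotted range:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 with value 0.25, so each sigmoid layer
# shrinks the back-propagated gradient by at least a factor of 4.
x = np.linspace(-6, 6, 1001)
peak = sigmoid_grad(x).max()
```

Plotting `sigmoid(x)` and `sigmoid_grad(x)` on the same axes reproduces the curves shown below.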
Tanh served as an alternative to Sigmoid because it does not suffer from the vanishing gradient problem as much. However, computing the Tanh function takes more time due to the exponential terms in it. Compared to that, ReLU is much quicker and works just as well.
All plots below are clipped at y = 7 for better visibility.
ReLU (Rectified Linear Unit) is simpler to compute and allows the gradient to simply pass through during back-propagation (the derivative of ReLU is 1) whenever the activation was +ve. It is by far the most common activation function used. If your output needs to fall outside [0, ∞), then you may consider using a different activation for the output unit while keeping ReLU for the internal layers.
ReLU is non-differentiable at 0. To overcome this, the gradient at 0 is conventionally forced to be 0.
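In code, both ReLU and this subgradient convention fit in a couple of lines (a NumPy sketch, names are mine):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Subgradient convention: the gradient at exactly 0 is taken as 0,
    # since (x > 0) is False at x = 0.
    return (x > 0).astype(float)
```

Note that `relu_grad` is exactly the indicator of the +ve region, which is why positive activations let the gradient pass through unchanged.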
Softplus’ derivative is the sigmoid function. Softplus is meant to handle the non-differentiability of ReLU. Its output and gradient values closely follow ReLU’s, except that it is smooth at zero. This smoothness comes at the cost of extra compute due to the logarithm and exponential. You would have to try and see for yourself whether the accuracy improvements (if any) are worth replacing ReLU with it.
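The derivative claim can be checked numerically. A small sketch (my own helper names; the `max + log1p` form is a standard numerically stable rewrite of log(1 + eˣ)):

```python
import numpy as np

def softplus(x):
    # log(1 + e^x), written stably as max(x, 0) + log(1 + e^{-|x|})
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Central finite difference of softplus should match sigmoid
x, h = 1.3, 1e-6
numeric_grad = (softplus(x + h) - softplus(x - h)) / (2 * h)
```

The finite-difference gradient agrees with `sigmoid(x)` to several decimal places, confirming that plotting Softplus’ gradient is the same as plotting the sigmoid.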
α is a constant. The plots below assume it to be 1 for ELU.
The plots below take α ~ 1.6732 and λ ~ 1.0507 as in the original paper for SELU.
One major limitation of ReLU is that when the activation is -ve, no gradient flows through it. So even if the output values were outright bad, no updates would be performed through that ReLU just because the activation was -ve.
To overcome this, ELU replaces the -ve half of ReLU with an exponential function. SELU simply multiplies ELU by a constant factor, hoping to get slightly better results.
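Both functions are short to write down. A NumPy sketch using the SELU constants quoted above (α ≈ 1.6732, λ ≈ 1.0507, from the original paper):

```python
import numpy as np

SELU_ALPHA = 1.6732632423543772   # α from the SELU paper
SELU_LAMBDA = 1.0507009873554805  # λ from the SELU paper

def elu(x, alpha=1.0):
    # Positive half is identity; negative half decays to -alpha
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x):
    # SELU = λ · ELU(x; α) with the fixed constants above
    return SELU_LAMBDA * elu(x, alpha=SELU_ALPHA)
```

Unlike ReLU, `elu` has a nonzero gradient for -ve inputs (its derivative there is `alpha * exp(x)`), so some signal always flows back.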
This is used when we want the network to run with quantized weights. During quantization, if all the activation outputs are limited to a shorter range, then we can use an integer/char to store each value with less loss due to rounding off. I don’t think ReLU6 helps with reduced-precision floating point operations (float16); it’s just that we use 1 byte to “store” values in the trained model file, which saves disk space or bandwidth in case the model needs to be transmitted over-the-air to its destination. Ultimately, the model will convert the byte back to float (float32, float16) during inference.
We choose 6 specifically because 6 (110 in binary) fits in 3 bits. So, the integer portion can be stored in 3 bits and the remaining 5 bits can be used to store the fraction (no sign bit because the output is always +ve). This allows us to store a fractional value in a single byte, although with limited granularity. We could also do ReLU3 (3 fits in 2 bits), which would give us 6 bits for the fraction, but I guess it wasn’t done because the results weren’t as good as ReLU6.
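A sketch of this 3-integer-bit / 5-fraction-bit scheme (the exact quantize/dequantize helpers are my illustration of the idea, not from any particular framework):

```python
import numpy as np

def relu6(x):
    return np.minimum(np.maximum(x, 0.0), 6.0)

# 8-bit fixed point: 5 fraction bits means a step of 1/32, so any value
# in [0, 6] maps to an integer in [0, 192], which fits in an unsigned byte.
def quantize(x):
    return np.round(relu6(x) * 32).astype(np.uint8)

def dequantize(q):
    return q.astype(np.float32) / 32.0
```

The round-trip error is at most half a step (1/64 ≈ 0.016), which is the “limited granularity” mentioned above.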
I think that when we choose ReLU6 instead of ReLU, performance may drop due to the restricted output values, even if we use float32 for both.
When using a byte to store ReLU6 output, the performance would always drop (unless your ReLU model was overfitting). But you may be OK with that trade-off if it’s the model-compression that you’re after.
Swish and Mish are newer activation functions, introduced in 2017 and 2019 respectively. I don’t know the motivation behind the naming of Swish, but Mish may be named after the last name of its author (Mishra). Both are non-monotonic and try to stay close to the ReLU activation in both output value and gradient.
Swish experiments with different scaling factors, but the results are not that different from what we get without any scaling. Mish claims slightly better results and better performance with deeper networks than Swish, but it comes with a heavier compute requirement. The choice may depend on how much 1% of accuracy matters to your use case.
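For reference, both functions have simple closed forms: Swish is x·sigmoid(βx) and Mish is x·tanh(softplus(x)). A NumPy sketch (helper names are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    # Numerically stable log(1 + e^x)
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x); beta = 1 is the common default
    return x * sigmoid(beta * x)

def mish(x):
    # Mish: x * tanh(softplus(x))
    return x * np.tanh(softplus(x))
```

Both pass through the origin and approach the identity for large +ve inputs, which is why their plots hug ReLU so closely.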
The post TridentNet Explained appeared first on govind.tech.
TridentNet attempts to tackle the problem of multi-scale objects in 2D images through dilated convolutions. The changes are applied on Faster-RCNN, hence one must have at least a basic understanding of two-stage object detectors (e.g. Faster-RCNN) to understand TridentNet.
Note: all figures (editable) in this post can be downloaded from this Google Drive.
Link to the paper: Scale-Aware Trident Networks for Object Detection
Let’s say we would like to detect the Giraffes in this image:
Let’s take a classroom scenario. Assume that we can only use a neural network of depth 1: a single-channel 2×2 dilated Conv2D. We would like this layer to learn the appearance of a Giraffe.
In the picture below, each red rectangle represents the receptive field of our 2×2 dilated Conv2D filter at three different locations during the convolution process. Each filled-green rectangle represents the receptive field of one single weight (float32) of the dilated Conv2D kernel (in reality, each weight gets multiplied with one single pixel value, but for the sake of explanation, please just go with it).
While sliding our 2×2 dilated convolution kernel over the image, it is hard to find a single dilation rate that fits the size of all the Giraffes. In the picture below, the Conv2D has a dilation rate that fails to interact with enough pixels of the smallest Giraffe and hence cannot detect it. If we used a smaller dilation rate, it would succeed in capturing the smallest Giraffe but would fail to capture the largest one.
FYI: This is an example of why objects with varying scales are always a problem in Computer Vision.
The proposed solution: Have three parallel Dilated Conv2D layers (the depth is still 1, channel count is still 1, kernel size is still 2×2) with 3 different dilation rates. Also, share the weights, i.e. still have only 4 weights (for 2×2 kernel) overall.
As is evident now, a depth-1 neural network can fit all three Giraffes by just using 3 dilation rates. Also, notice that the 4 quadrants of each Giraffe look the same, hence the weight sharing makes sense.
It’s a fairly simple idea. This is similar to using 3 neural networks, except that here we share all the weights across the 3 branches. The image below shows how the authors modified a ResNet block into a Trident block. The dilation rates are used only for the 3×3 convolutions; the 1×1 convolutions stay the same.
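The toy setup above can be sketched in a few lines of NumPy: one naive dilated convolution routine, applied three times with the same 2×2 kernel (the 4 shared weights) and three dilation rates. The helper names and the all-ones kernel are my own stand-ins:

```python
import numpy as np

def dilated_conv2d(img, kernel, dilation):
    """Naive 'valid' dilated cross-correlation of a 2-D image with a small kernel."""
    kh, kw = kernel.shape
    eh = dilation * (kh - 1) + 1  # effective kernel extent in rows
    ew = dilation * (kw - 1) + 1  # effective kernel extent in cols
    H, W = img.shape
    out = np.zeros((H - eh + 1, W - ew + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Strided slice picks the dilated sampling locations
            patch = img[i:i + eh:dilation, j:j + ew:dilation]
            out[i, j] = np.sum(patch * kernel)
    return out

# Three parallel "branches": the SAME 2x2 kernel (4 shared weights),
# three different dilation rates.
img = np.arange(64, dtype=float).reshape(8, 8)
kernel = np.ones((2, 2))  # stand-in for the 4 shared, learned weights
branches = [dilated_conv2d(img, kernel, d) for d in (1, 2, 3)]
```

Larger dilation rates cover a larger receptive field with the same 4 weights, which is exactly the trick that lets one set of weights match objects at three scales.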
FAQs
TridentNet is a modification of Faster-RCNN. This is the only two-stage detector that the authors used in their work. Both stages (RPN and R-CNN) of Faster-RCNN were converted into the 3-branch mode. Some more info:
Input images’ shorter side is scaled to 800 pixels before feeding it to the network.
This scale-aware training scheme could be applied to both RPN and R-CNN. For RPN, we select ground truth boxes that are valid for each branch according to Eq. 1 during anchor label assignment. Similarly, we remove all invalid proposals for each branch during the training of R-CNN
Authors describing modifications in Faster-RCNN
During the RPN stage in Faster-RCNN, the entire image passes through the RPN, which generates region proposals that are likely to contain objects. In TridentNet, the image passes through all 3 branches. While computing the loss, for each branch we only take the ground-truth boxes whose scale (area) falls within the branch’s responsibility:
Here, the scale of a ground-truth box is defined as (this is Eq. 1 mentioned in the authors’ quote above):
scale = √(box_width × box_height)
Each branch produces 12,000 proposals. These are then filtered by NMS (hard or soft), which gives out 500 proposals. Out of these 500 proposals, 128 are sampled (I don’t know what sampling strategy is used here) and sent to the second stage (R-CNN) of the object detector.
During the R-CNN stage, for each branch, we use the ROIs whose scales fall within the correct range. ROIs outside the branch’s responsibility are ignored.
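The per-branch filtering boils down to computing Eq. 1 and checking it against each branch’s valid range. A sketch of that check; note the range numbers below are illustrative placeholders, not necessarily the paper’s exact values:

```python
import math

# Hypothetical per-branch valid scale ranges in pixels (illustrative only;
# consult the paper for the actual numbers). Ranges overlap on purpose.
BRANCH_RANGES = {
    "small": (0, 90),
    "medium": (30, 160),
    "large": (90, float("inf")),
}

def box_scale(w, h):
    # Eq. 1: scale = sqrt(box_width * box_height)
    return math.sqrt(w * h)

def valid_boxes(boxes, branch):
    """Keep (x, y, w, h) boxes whose scale falls in the branch's range."""
    lo, hi = BRANCH_RANGES[branch]
    return [b for b in boxes if lo <= box_scale(b[2], b[3]) <= hi]
```

The same filter is applied to ground-truth boxes during RPN anchor assignment and to proposals during R-CNN training.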
Two inference modes are available: Default and Fast.
In the default mode, the image is passed through all three branches of the network. An NMS is performed on the predictions and that’s it.
Below is the snapshot of their performance results from the paper. The “deformable” backbone indicates the experiment where dilated convolutions (mediocre results) were replaced by deformable convolutions (better results).
In the fast mode, we recognize that the default mode uses 3 times the computational power, which we would rather avoid. Hence, once we have trained our model with all 3 branches, we delete the Small and Large branches for inference and derive our predictions from the Medium branch only. NMS is performed as usual.
This brings the compute-hunger back to the same as Faster-RCNN. Below table lists the performance change vs. branches:
| | Average Precision |
| --- | --- |
| Small branch only | 31.5 |
| Medium only | 37.8 |
| Large branch only | 31.9 |
| All branches enabled | 40.6 |
As listed, the performance does take a minor hit with this “optimization”. This mode of inference is termed TridentNet Fast.
Thank you for reading this post. I haven’t looked into their code yet, so some details may be incomplete. Please leave a comment or feel free to reach out to me if you find any issue with this post.
The post Understanding ‘stateful’ option in Keras LSTM appeared first on govind.tech.
I would advise taking my conclusions with a grain of salt as I don’t have extensive experience with Keras and could be mistaken. Also, this blog by Philippe Remy cleared up many of my questions, and if you’re also striving to understand the ‘stateful’ option, I highly recommend going through it.
For the purpose of this post, the following terminology holds:
State = the cell state that gets passed to the next time-step, not the hidden state.
Sample = sequence. If your problem is to predict a char given the 5 characters preceding it, then the sample size (aka sequence length) would be 5.
What we already know and agree upon:
For all experimentation, this is how the simple network has been defined:
For a stateful LSTM, I just add the flag like this:
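The original code snippets were not preserved in this copy of the post; here is a minimal reconstruction that matches the setup described in the text (num_units = 3, batch_size = 4, time_steps = 5, one binary feature per step, MSE loss). Everything beyond those numbers is my assumption:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(stateful=False, num_units=3, batch_size=4, time_steps=5):
    # Stateful LSTMs need a fixed batch size, hence batch_shape
    inputs = keras.Input(batch_shape=(batch_size, time_steps, 1))
    x = layers.LSTM(num_units, stateful=stateful)(inputs)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(loss="mse", optimizer="adam", metrics=["accuracy"])
    return model

model = build_model()                        # the stateless baseline
stateful_model = build_model(stateful=True)  # just flips the flag
```

For the stateful variant, remember that `fit()` must be called with `shuffle=False` so that sample i of one batch really precedes sample i of the next.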
num_units = 3, batch_size = 4. The dataset determines the time_steps, which is 5 in our case. There are 1024 samples in the training set and 256 in the test set. ‘MSE’ is used as the loss function. Training is run with the shuffle = False flag, which is necessary when using the ‘stateful‘ option.

| | Sample | Label |
| --- | --- | --- |
| Sample 0 | [1, X, X, X, X] | 1 |
| Sample 1 | [0, X, X, X, X] | 0 |
| Sample 2 | [0, X, X, X, X] | 0 |
| Sample 3 | [1, X, X, X, X] | 1 |
| … and so on | … | … |
Each sample is 5 time-steps long and the Label is always the same as the value of the sample at the first time-step. The ‘X’s represent values that could be either 0 or 1 and are set randomly while generating the training set.
For the sake of completion here is code to generate this dataset:
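The original snippet was not preserved in this copy; here is a minimal NumPy reconstruction that follows the description (label = first time-step, remaining steps random; 1024 training / 256 test samples). The function name and seed are my own:

```python
import numpy as np

def make_dataset(n_samples, time_steps=5, seed=0):
    """Each sample is `time_steps` binary values; the label equals the first one."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n_samples, time_steps, 1)).astype("float32")
    y = X[:, 0, 0].copy()
    return X, y

X_train, y_train = make_dataset(1024)
X_test, y_test = make_dataset(256, seed=1)
```

The shape `(samples, time_steps, features)` is what Keras LSTM layers expect as input.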
Given the trivial nature of this problem, I started by training a ‘stateless’ LSTM first. The network learns the pattern in the first epoch itself and achieves 100% accuracy on test set:
This is expected as the problem is easy. Next, I use ‘stateful’ LSTM on the same setup. Here is the console log for training for 5 epochs:
Keras documentation describes ‘stateful’ as “Boolean (default False). If True, the last state for each sample at index i in a batch will be used as initial state for the sample of index i in the following batch“. Philippe’s blog states, “If the model is stateless, the cell states are reset at each sequence. With the stateful model, all the states are propagated to the next batch. It means that the state of the sample located at index i, Xi will be used in the computation of the sample (Xi+bs) in the next batch, where bs is the batch size (no shuffling).“
Seems clear enough: the state from the current batch will be used for the collocated samples in the next batch. Below is a pictorial representation of a stateless LSTM. The state of the last time-step is discarded.
‘b‘ stands for batch size. I have shown two samples belonging to two successive batches. On top the sample index (in entire training set) is ‘k‘ and on bottom sample index is ‘k + b’. x0 to x4 represents the 5 time-steps present in our problem. ‘C‘ is cell-state.
In case of a stateful LSTM, the state of last time step is given to the collocated sample in next batch:
To confirm this, I created a new problem in which the current label is determined by the collocated sample in the previous batch. Here is what the dataset looks like:
| | | Sample | Label |
| --- | --- | --- | --- |
| Batch 0 | Sample 0 | [1, X, X, X, X] | 1 |
| Batch 0 | Sample 1 | [0, X, X, X, X] | 1 |
| Batch 0 | Sample 2 | [0, X, X, X, X] | 0 |
| Batch 0 | Sample 3 | [1, X, X, X, X] | 0 |
| Batch 1 | Sample 4 | [0, X, X, X, X] | 1 |
| Batch 1 | Sample 5 | [0, X, X, X, X] | 0 |
| Batch 1 | Sample 6 | [0, X, X, X, X] | 0 |
| Batch 1 | Sample 7 | [1, X, X, X, X] | 1 |
| Batch 2 | Sample 8 | [1, X, X, X, X] | 0 |
| Batch 2 | Sample 9 | [1, X, X, X, X] | 0 |
| Batch 2 | Sample 10 | [0, X, X, X, X] | 0 |
| Batch 2 | Sample 11 | [1, X, X, X, X] | 1 |
| … and so on | … | … | … |
The first time-step of the previous batch’s collocated sample determines the current sample’s label. So in theory, if the state is able to propagate across batches, then a stateless LSTM should fail to solve this problem while a stateful one should succeed.
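A hedged NumPy reconstruction of this second dataset (the function name is mine; the labels for the first batch, which has no predecessor, are set randomly here since the table doesn’t pin them down):

```python
import numpy as np

def make_cross_batch_dataset(n_samples=1024, time_steps=5, batch_size=4, seed=0):
    """Label of sample i = first time-step of sample (i - batch_size)."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n_samples, time_steps, 1)).astype("float32")
    y = np.empty(n_samples, dtype="float32")
    # First batch has no collocated predecessor; assign random labels
    y[:batch_size] = rng.integers(0, 2, size=batch_size)
    # Every later sample's label comes from the previous batch's collocated sample
    y[batch_size:] = X[:-batch_size, 0, 0]
    return X, y
```

With `shuffle=False` and a fixed batch size of 4, sample i of one batch lines up exactly with sample i of the next batch, which is what the stateful mechanism would need to exploit.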
Stateless LSTM network training console output
The model fails to learn anything and is probably predicting either all 0s or all 1s for the entire training/test set. Surprisingly, the stateful LSTM also fails on this problem:
It didn’t make any sense to me and I stayed stuck on this result for a while. I tried keeping the batch size at 1 while changing the data such that the current sample’s label is present in the previous sample, but I couldn’t get accurate results even then. Since stateful LSTMs can pass information across batches (only to collocated samples), it should have been able to perform well on this problem. And then, I read through the ‘Don’t use `stateful` LSTM unless you know what it does’ page on GitHub once again. And it made sense. Here are some key points:
“Cell t+1 will do its best to do something with state t, but state t will be random and untrained. It might learn something it might not.” and “The hidden states are randomly initialized and untrained. I don’t think most people understand that part and end up with some weird models”.
Ben points out that although making the LSTM stateful does allow transfer of information across batches, back-propagation cannot pass through a batch boundary. Hence it is unable to train the network to produce useful information after the last time-step, resulting in a potentially garbage state being carried into the next batch.
I don’t yet know when using the ‘stateful’ option would be helpful. But certainly, if I were just starting with LSTMs, I would not need stateful LSTMs that much.
At this point, to make sure that we’re on the same page, I have given some example problems and an approach to solving them:
(One-step time-series prediction refers to the problem of predicting just the next value. It has nothing to do with ‘time-steps’ in an LSTM. Once the prediction is made and evaluated, we have the true value available, and from then onward this true value will be used for making predictions, e.g. predicting the global population for next year. On the other hand, in multi-step time-series problems, the machine predicts many future values without getting a true future value, e.g. predicting the global population for each year in the next century.)
At the end, I am open to suggestions, if you find something wrong with these conclusions, please do put your thoughts in comments.