Activation Functions: The output of every neural network layer is passed through an activation function. These are usually non-linear, and they are what allow a neural network to learn highly non-linear mappings from input to output.
This post aims to be a central place to get a graphical understanding of the most common activation functions in neural networks. I focus on the ones used in Computer Vision.
I also plotted the gradient in the same plot so you can get an idea of its behavior too. Remember that when the gradient is back-propagated through an activation function, it gets multiplied by the derivative of that activation function. For example, during back-propagation,
- In case of a linear activation, the gradient gets multiplied by 1.
- In case of ReLU, if the input to the activation was +ve, then the gradient gets multiplied by 1.
- In case of ReLU, if the input to the activation was -ve, then the gradient gets multiplied by 0.
Hence, the value of the activation function's gradient matters.
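The multiplication described in the bullets above can be sketched in a few lines of Python. This is only an illustration of the chain rule for a single scalar value. The helper names and the upstream gradient of 0.8 are made up for the example.

```python
def linear_grad(x):
    # Derivative of f(x) = x is 1 everywhere.
    return 1.0

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    # (The value at exactly 0 is a convention.)
    return 1.0 if x > 0 else 0.0

def backprop_through(activation_grad, pre_activation, upstream_grad):
    # Chain rule: the incoming gradient is multiplied by the local derivative.
    return upstream_grad * activation_grad(pre_activation)

upstream = 0.8  # hypothetical gradient arriving from the next layer
print(backprop_through(linear_grad, 2.5, upstream))  # 0.8
print(backprop_through(relu_grad, 2.5, upstream))    # 0.8
print(backprop_through(relu_grad, -2.5, upstream))   # 0.0
```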
Linear Activation Function
Let’s start with the simplest activation function: Linear. This activation function is equivalent to skipping the activation function altogether. If a neural network with only Dense layers uses Linear activation everywhere, then the hidden layers serve no purpose and the network can be replaced by just one output layer connected to the inputs. Still, linear activation is sometimes used for output layers or for intermediate layers. For example, MobileNet-v2 uses linear activation in the last convolution of each inverted bottleneck block.
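To see why stacked linear layers collapse into one, here is a small sketch with made-up weights and no biases. Two Dense layers with linear activation give the same output as a single layer whose weight matrix is the product of the two. (Biases do not change the conclusion, they only add a merged bias term.)

```python
def matmul(a, b):
    # Multiply two matrices given as lists of rows.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Hypothetical weights: 2 inputs -> 3 hidden units -> 1 output.
W1 = [[1.0, 2.0, -1.0],
      [0.5, 0.0, 3.0]]
W2 = [[2.0], [-1.0], [0.5]]

x = [[4.0, -2.0]]  # one input row

two_layers = matmul(matmul(x, W1), W2)  # hidden layer with linear activation
one_layer = matmul(x, matmul(W1, W2))   # single merged weight matrix

print(two_layers, one_layer)  # [[-7.0]] [[-7.0]]
```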
Sigmoid Activation Function
Sigmoid activations used to be the most common choice before they were replaced by ReLU. The reason is that Sigmoid scales down any gradient that passes through it during back-propagation. Even in the best case, the gradient gets multiplied by 0.25 (see the peak value of the yellow curve in the picture below). This caused problems in deep neural networks: by the time the gradient reached the first layer, it would have all but vanished due to the repeated scaling by the layers in between. Hence the name vanishing gradient. Still, many neural networks use Sigmoid in output units when the expected output must stay between 0 and 1.
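A quick way to convince yourself of the 0.25 figure is to compute the derivative directly. The sketch below assumes the standard definition sigmoid(x) = 1 / (1 + e^(-x)), whose derivative is sigmoid(x) * (1 - sigmoid(x)). The loop shows the best-case scaling after a gradient passes through several sigmoid layers (weights are ignored here to keep the example simple).

```python
import math

def sigmoid(x):
    # Plain definition; not hardened against overflow for very negative x.
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))  # 0.25, the peak of the derivative

# Best-case gradient scaling after passing through n sigmoid activations.
for n in (1, 5, 10, 20):
    print(n, 0.25 ** n)
```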
Tanh Activation Function
Tanh served as an alternative to Sigmoid activations because it did not suffer from the vanishing gradient problem as much. However, computing the Tanh function takes more time because of the exponential terms in it. In comparison, ReLU is much quicker to compute and works just as well.
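The difference from Sigmoid is easy to check numerically. The sketch below uses the standard identity that the derivative of tanh(x) is 1 - tanh(x)^2. It peaks at 1 instead of 0.25, but it still shrinks towards 0 for large inputs, so Tanh reduces the vanishing gradient problem rather than removing it.

```python
import math

def tanh_grad(x):
    t = math.tanh(x)
    return 1.0 - t * t

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

# Peak values of the two derivatives, both reached at x = 0.
print(tanh_grad(0.0), sigmoid_grad(0.0))  # 1.0 0.25

# Both derivatives still saturate for inputs far from zero.
print(tanh_grad(3.0), sigmoid_grad(3.0))
```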
All plots below are shown till y=7 for better visibility.
ReLU and Softplus Activation Functions
ReLU (Rectified Linear Unit) is simpler to compute and allows the gradient to simply pass through during back-propagation (the derivative of ReLU is 1) if the input was +ve. This is by far the most commonly used activation function. If your output needs a range other than [0, ∞), then you may consider using a different activation for the output unit while keeping ReLU activation for the internal layers.
ReLU is not differentiable at 0. In practice, the gradient at 0 is simply defined to be 0.
The derivative of Softplus is the sigmoid function. Softplus is meant to address the non-differentiability of ReLU. Its output and gradient values are close to those of ReLU away from zero, but it is smooth at zero. This smoothness comes at the cost of the extra compute required for the logarithm and the exponential. You would have to try and see for yourself whether the accuracy improvements (if any) are worth replacing ReLU with it.
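Here is a minimal sketch of both functions and their gradients, assuming the standard definitions ReLU(x) = max(0, x) and Softplus(x) = log(1 + e^x). The Softplus here is written in a rearranged but equivalent form to avoid overflow for large inputs, which is a choice made for this example.

```python
import math

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Gradient at exactly 0 is set to 0 by convention.
    return 1.0 if x > 0 else 0.0

def softplus(x):
    # log(1 + e^x), rearranged so exp() never sees a large positive argument.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def softplus_grad(x):
    # The derivative of softplus is the sigmoid function.
    return 1.0 / (1.0 + math.exp(-x))

for x in (-5.0, 0.0, 5.0):
    print(x, relu(x), softplus(x), relu_grad(x), softplus_grad(x))
```

Far from zero the two functions nearly coincide. Around zero, Softplus bends smoothly where ReLU has a corner.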
SELU and ELU Activation Functions
α is a constant. The plots below assume α = 1 for ELU.
For SELU, the plots below take α ≈ 1.6732 and λ ≈ 1.0507, as in the original paper.
One major limitation of ReLU is that when the input is -ve, no gradient flows through it. So even if the output values were outright bad, no updates would be performed through ReLU just because the input was -ve.
To overcome this, ELU replaces the -ve half of ReLU with an exponential function. SELU multiplies ELU by a constant factor λ and uses a specific α. The two constants are chosen so that the activations tend to normalize themselves as they pass through the layers, hoping to get slightly better results.
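The sketch below assumes the usual definitions: ELU(x) = x for x > 0 and α(e^x - 1) otherwise, and SELU(x) = λ · ELU(x) with the α and λ values quoted above (rounded to four decimals). Unlike ReLU, the gradient for -ve inputs is small but not zero.

```python
import math

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def elu_grad(x, alpha=1.0):
    return 1.0 if x > 0 else alpha * math.exp(x)

# Rounded constants as quoted in the text.
SELU_ALPHA = 1.6732
SELU_LAMBDA = 1.0507

def selu(x):
    return SELU_LAMBDA * elu(x, SELU_ALPHA)

def selu_grad(x):
    return SELU_LAMBDA * elu_grad(x, SELU_ALPHA)

for x in (-3.0, -1.0, 0.5, 2.0):
    print(x, elu(x), elu_grad(x), selu(x), selu_grad(x))
```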
ReLU6 Activation Function
This is used when we want the network to run with quantized weights. During quantization, if all the activation outputs are limited to a shorter range, then we can use an integer/char to store the weight with less loss due to rounding off. I don’t think ReLU6 helps with reduced precision floating point operations (float16). It’s just that we use 1 byte to “store” the trained model file, and that saves disk space or bandwidth in case the model needs to be transmitted over-the-air to its destination. Ultimately, the model will always convert the byte to float (float32, float16) during inference.
We choose 6 specifically because 6 is 110 in binary, so it fits in 3 bits. The integer portion can therefore be stored in 3 bits and the remaining 5 bits can be used to store the fraction (no sign bit, because the output is always +ve). This allows us to store a fractional value in a single byte, although with limited granularity. We could also use ReLU3, which would give us 6 bits for the fraction, but I guess it wasn’t done because the results weren’t as good as with ReLU6.
I think that when we choose ReLU6 instead of ReLU, the accuracy may drop because of the restricted output values, even if we use float32 for both.
When using a byte to store the ReLU6 output, the accuracy would always drop (unless your ReLU model was overfitting). But you may be OK with that trade-off if model compression is what you’re after.
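The byte layout described above (3 bits for the integer part, 5 bits for the fraction) can be sketched as follows. This only illustrates the idea. It is not how any particular framework implements quantization, and the helper names are made up.

```python
def relu6(x):
    return min(max(0.0, x), 6.0)

FRACTION_BITS = 5
SCALE = 1 << FRACTION_BITS  # 32 steps per unit

def to_byte(value):
    # Round to the nearest 1/32 and clamp into the range of one unsigned byte.
    return min(255, max(0, round(value * SCALE)))

def from_byte(b):
    return b / SCALE

for x in (-1.0, 0.013, 2.5, 3.14159, 7.5):
    y = relu6(x)
    b = to_byte(y)
    print(x, y, b, from_byte(b))
```

With 5 fraction bits, the round-off error is at most 1/64, and the largest ReLU6 output (6.0) maps to the byte value 192.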
Swish and Mish Activation Functions
Swish is x · sigmoid(βx), where β is either fixed or learnable. There’s not a big difference in performance when we make it learnable. The plot below assumes it to be 1.
Swish and Mish are newer activation functions, introduced in 2017 and 2019 respectively. I don’t know the motivation behind the naming of Swish, but Mish may be named after the last name of its author (Misra). Both are non-monotonic and try to stay close to the ReLU activation, both in output value and in gradient.
Swish experiments with different scaling factors, but the results are not that different from what we get without any scaling. Mish claims slightly better results and better performance with deeper networks than Swish, but it comes with a heavier compute requirement. The choice may depend on how much 1% accuracy matters to your use case.
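For reference, here is a minimal sketch of both functions, assuming the commonly cited definitions Swish(x) = x · sigmoid(βx) and Mish(x) = x · tanh(softplus(x)). The printed values show the small dip below zero for -ve inputs (the non-monotonic part) and how both approach ReLU for large +ve inputs.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    # log(1 + e^x), rearranged to avoid overflow for large inputs.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)

def mish(x):
    return x * math.tanh(softplus(x))

for x in (-5.0, -1.0, 0.0, 1.0, 10.0):
    print(x, swish(x), mish(x))
```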