What are Activation Functions?

Ravinder ram
5 min read · Mar 20, 2022

Activation function is a term widely used in deep learning, especially in the context of neural networks. It is essentially a function that decides whether a neuron (node) should fire a value or not. An artificial neural network is a network of many neurons, and each neuron transmits a value, or signal, to the neurons in the next layer. The first layer is commonly referred to as the input layer, where we feed in the input values; it connects to one or more hidden layers, and the final layer is the output layer. Let's look at the image below for a better understanding.

Neural Network

As we can see in this image, each neuron has its own activation function, which helps the neuron decide what value to fire; that value is passed to the next neuron, and this continues until it reaches the output layer. Suppose we want to build a model that classifies a loan approval as yes or no based on input values such as age = 50 and salary = 35k.

Example: input values (age, salary) flowing through the network to an output probability

As we can see, the model outputs a probability of 0.8, or 80%, for the particular input values discussed above.

Why do we need Activation functions?

To understand the need for activation functions like sigmoid, ReLU, leaky ReLU, step, etc., we first have to understand the problems we face with the perceptron, which is a linear classifier, and keep in mind that most real-world problems are non-linear in nature. In a perceptron, a small change in the weights can cause a big change in the output (and vice versa). The perceptron also relies on a hard threshold, some scalar value like 5 or 6, while the raw output itself can end up being a huge value like 24. With a non-linear activation function such as sigmoid, on the other hand, we are guaranteed that the output lies between 0 and 1, so we can simply take 0.5 as the threshold. A minimal sketch of this idea follows.
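Here is a minimal NumPy sketch of that idea. The weights and bias below are made up purely for illustration (they are not from any real model); the point is only that the raw weighted sum is unbounded, while the sigmoid output always lands in (0, 1) where a 0.5 threshold makes sense.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([50.0, 35.0])   # hypothetical inputs: age, salary in thousands
w = np.array([0.3, 0.25])    # hypothetical weights
b = -1.0                     # hypothetical bias

z = np.dot(w, x) + b         # raw weighted sum: unbounded (here about 22.75)
p = sigmoid(z)               # squashed into (0, 1), so a 0.5 threshold makes sense

print(f"raw score z = {z:.2f}")
print(f"sigmoid(z)  = {p:.4f}")   # greater than 0.5, so predict 'yes'
```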

Different Types of Non-Linear Activation Functions

1. Sigmoid function: Also called the logistic function, it takes any real value as input and outputs a value in the range 0 to 1. The formula is σ(x) = 1/(1 + exp(-x)). It is normally used in the output layer for binary classification. It solves the problem with the perceptron discussed in the section above, but it has problems of its own: its outputs are not zero-centered, and when we backpropagate, the derivative we get is very small, close to 0; this is called the vanishing gradient problem. So it is rarely used in complex problems today, but it laid the groundwork for all the activation functions that exist now, and that is its importance. (See the sketch after the graph below for its derivative.)
2. Tanh function: It generally performs better than the sigmoid function. The difference from sigmoid is its range: [-1, 1] instead of [0, 1]. It also alleviates the vanishing gradient problem we face with sigmoid, though only slightly, not completely. As the graph below shows, the curve passes through zero, which means the outputs are zero-centered, and its peak derivative is larger than sigmoid's. The drawback is that it is a bit more expensive to compute. (See the sketch after the graph for a comparison of the two derivatives.)
Graph credit: Avinash Sharma V
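Below is a small NumPy sketch comparing the derivatives of sigmoid and tanh at a few sample points (the points are arbitrary, chosen only for illustration), to make the vanishing-gradient and zero-centering points above concrete.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # peaks at 0.25 when x = 0

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2  # peaks at 1.0 when x = 0

for x in [0.0, 2.0, 5.0]:
    print(f"x = {x:>3}: sigmoid' = {d_sigmoid(x):.4f}   tanh' = {d_tanh(x):.4f}")

# Both derivatives shrink toward 0 as |x| grows (vanishing gradient), but tanh's
# peak derivative is larger and its outputs are centered around 0, which is why
# it usually trains a bit better than sigmoid.
```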

3. ReLU: ReLU is a non-linear activation function used in multi-layer and deep neural networks, especially in the hidden layers. The formula is f(x) = max(0, x), where x is the input value: if x < 0 then f(x) = 0, and if x ≥ 0 then f(x) = x.

ReLU

It is less computationally intensive, and it largely solves the vanishing gradient problem: when we backpropagate, any positive pre-activation, even one very close to 0 such as 0.002, gives a derivative of 1, which is not possible with the sigmoid function, so the network can keep learning even for tiny positive values. Still, there is a problem with the ReLU function: if the pre-activation is negative, say -0.01, the derivative is 0, meaning no update flows back and the weight stays exactly where it is; this state is called the dying ReLU (or dead activation) problem. This is the drawback of the ReLU function, illustrated in the sketch below.
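A tiny sketch of this behaviour in plain Python, reusing the example values 0.002 and -0.01 from above:

```python
def relu(x):
    return max(0.0, x)

def d_relu(x):
    # Gradient is 1 for positive inputs and 0 for negative ones.
    return 1.0 if x > 0 else 0.0

for z in [0.002, 3.0, -0.01]:
    print(f"z = {z:>6}: relu(z) = {relu(z):.3f}   gradient = {d_relu(z):.1f}")

# z =  0.002 -> gradient 1.0: even a tiny positive pre-activation still learns.
# z = -0.01  -> gradient 0.0: no update flows back at all ("dying ReLU").
```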

4. Leaky ReLU: Leaky ReLU is nothing more than an improved version of the ReLU activation. With plain ReLU there is a chance we run into the dying ReLU problem described above; otherwise ReLU works fine for most cases, and Leaky ReLU keeps its benefits while fixing that one issue.

https://paperswithcode.com/

Formula: f(x) = max(0.1x, x)

As we can see, for negative inputs the output is 0.1 times x, which means the gradient there is 0.1 rather than zero, so the neuron keeps learning instead of dying. The only drawback is that if many pre-activations are negative during training, the derivatives on that side are still very small, so that part of the network learns slowly. A minimal sketch follows.
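A minimal sketch using the 0.1 slope from the formula above (many libraries default to a smaller slope such as 0.01):

```python
def leaky_relu(x, alpha=0.1):
    return x if x > 0 else alpha * x

def d_leaky_relu(x, alpha=0.1):
    # Gradient is 1 for positive inputs and alpha (not 0) for negative ones,
    # so a neuron with a negative pre-activation still receives a small update.
    return 1.0 if x > 0 else alpha

for z in [3.0, -0.01, -5.0]:
    print(f"z = {z:>6}: leaky_relu(z) = {leaky_relu(z): .4f}   gradient = {d_leaky_relu(z):.2f}")
```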

5. Swish: Swish is a lesser-known activation function proposed by researchers at Google. It is defined as f(x) = x · σ(x), is about as computationally efficient as ReLU, and shows better performance than ReLU on deeper models. Swish is unbounded above but bounded below (its minimum is a small negative value, around -0.28).

Swish is a smooth, non-monotonic function that consistently matches or outperforms ReLU on deep networks across a variety of challenging domains, such as image classification and machine translation. It is unbounded above and bounded below, and it is the non-monotonic shape, the small dip below zero for negative inputs, that really makes the difference, as the sketch below shows.
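A short sketch of Swish with beta = 1, i.e. f(x) = x · σ(x), evaluated at a few arbitrary points to show the dip below zero:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    return x * sigmoid(x)

# Unbounded above, bounded below, and non-monotonic: the function dips slightly
# below zero for moderately negative inputs before flattening out near zero.
for z in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"swish({z:>4}) = {swish(z): .4f}")
```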

6. Softmax: It is widely used in output layers. Like sigmoid, it returns a probability, but for every class at once: if we want to classify 'Dog' or 'Cat', it gives a probability for each class, e.g. 0.75 for dog and 0.25 for cat, and the probabilities always sum to 1. A small sketch follows the formula below.

formula
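A minimal NumPy sketch of softmax for the dog-vs-cat example; the logits here are made up just so the output lands near the 0.75 / 0.25 split mentioned above:

```python
import numpy as np

def softmax(logits):
    exps = np.exp(logits - np.max(logits))   # subtract the max for numerical stability
    return exps / exps.sum()

logits = np.array([1.1, 0.0])                # made-up scores for [dog, cat]
probs = softmax(logits)

for label, p in zip(["dog", "cat"], probs):
    print(f"P({label}) = {p:.2f}")
print("sum of probabilities =", round(float(probs.sum()), 2))   # always 1.0
```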
