Few activation functions handling various problems - neural networks

How can a few activation functions in neural networks handle so many different problems?

I know some basic theory behind ANNs, but I can't see what functions like the sigmoid have to do with, for example, image classification.

Topic: activation-function, deep-learning, neural-network

Category: Data Science


To answer your question of why activations are needed for tasks like image classification (I had the same feeling at first: why in the world is an activation function necessary, and why is it really there in a neural network?):

There are a few things you should know beforehand:

  1. Neural networks are simply very deep composite functions, and they use gradient descent (via backpropagation) to optimize an objective or loss function.

  2. The way neural networks are designed is extraordinarily simple; as Geoffrey Hinton has remarked, they are actually quite dumb algorithms, and the fact that they work so well suggests the loose analogy with how the human brain makes decisions has some merit.

  3. A neural network consists of neurons (perceptrons) that take a weighted sum of their inputs and pass that sum through an activation function; the result is then passed on to the next layer. That is what a feedforward pass through a neural network looks like (see the sketch after this list).

  4. As stated earlier, neural nets use gradient descent through backprop to minimize the loss function. There are two main reasons for using an activation function:

    4.1 Gradient descent through backprop, the optimization method popularized by Hinton and his collaborators, exploits the composite nature of neural networks to calculate the rate of change of the loss function w.r.t. the weights and biases (the parameters of your neural network), and the layer-by-layer structure helps make this optimization tractable through a somewhat greedy, divide-and-conquer kind of strategy. To calculate these derivatives you need functions that are continuous and differentiable, and that give a one-to-one mapping (so that any particular input yields a specific output value unique to that input, which is why periodic functions don't work). Passing your raw weighted sum of inputs directly to the next layer gives you no such guarantees, so it is first passed through an activation whose outputs are continuous and monotonically increasing (as is the case with sigmoid and tanh), keeping the whole network differentiable for gradient descent.

    4.2 The second reason comes from the analogy with the human brain that inspired neural networks. When we as humans see things or make decisions, certain parts of our brain are more activated than others. Using this analogy, Hinton argued that neural networks have a universal representation capability and can represent many kinds of data: while making predictions, certain neurons fire and certain neurons do not. To measure and compare the firing strength of neurons, we need a scale that puts every neuron's output on a comparable footing, and that is where activation functions come in, scaling outputs to the range of that activation. Also, the strictly increasing nature of the activation allows a neural net to learn a better representation of the data in the multidimensional space where the data manifold lives.
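To make points 3 and 4.1 concrete, here is a minimal sketch (NumPy, with made-up weights and inputs) of one feedforward step: a weighted sum passed through a sigmoid activation, together with the sigmoid's derivative that gradient descent would use during backprop. It is an illustration only, not a full network or training loop.

```python
import numpy as np

def sigmoid(z):
    # Bounded, continuous, monotonically increasing: output lies in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z)); cheap to compute, which is
    # exactly what backprop needs when applying the chain rule.
    s = sigmoid(z)
    return s * (1.0 - s)

# Made-up example: 3 inputs feeding a single neuron.
x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.4, 0.1, -0.6])   # weights (learnable parameters)
b = 0.2                          # bias

z = np.dot(w, x) + b             # weighted sum of inputs
a = sigmoid(z)                   # activation passed on to the next layer

print("weighted sum z:", z)
print("activation a:", a)
print("d(activation)/dz:", sigmoid_derivative(z))
```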

Now to answer the second part:

Sine is a periodic function. Hence, a low and a high input value can produce the same output (it is a many-to-one mapping); in other words, low and high input values are seen alike through a sine, which throws away the magnitude of the input signal, meaning your neural network may not learn anything useful.
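As a quick illustration of that many-to-one behaviour (the numbers are arbitrary, chosen only to make the point):

```python
import numpy as np

# Two very different pre-activation values...
low, high = 0.3, 0.3 + 2 * np.pi

# ...give (numerically) the same output under sine, so the
# magnitude of the input signal is lost.
print(np.sin(low), np.sin(high))    # both ~0.2955

# Sigmoid, by contrast, is one-to-one: different inputs, different outputs.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(sigmoid(low), sigmoid(high))  # ~0.5744 vs ~0.9986
```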


Two key things work in favour of the sigmoid and tanh functions:

  1. They are bounded functions over their entire domain (which happens to be the entire real line).
  2. Their derivatives are easy to derive and computationally inexpensive.

Also, the purposes of sigmoid and tanh are different. Their ranges are (0, 1) and (-1, 1) respectively, so sigmoid is good for estimating the probability of an event, while tanh is convenient for separating two classes around zero.
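For reference, these are the standard definitions, ranges and derivatives (textbook formulas, not specific to this answer):

$$\sigma(x) = \frac{1}{1 + e^{-x}} \in (0, 1), \qquad \sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr)$$

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \in (-1, 1), \qquad \tanh'(x) = 1 - \tanh^{2}(x)$$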

A basic problem in image classification is predicting whether an image contains a cat or not. Using the tanh function, the final output lies in (-1, 1), so you can decide based on whether that value is negative or positive to reach the final classification.
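A minimal sketch of that decision rule, assuming a hypothetical network has already produced a pre-activation score per image which is then squashed with tanh:

```python
import numpy as np

def classify_cat(score: float) -> str:
    """Map a tanh output in (-1, 1) to a label by its sign."""
    return "cat" if score > 0 else "no cat"

# Hypothetical pre-activation scores from a trained network for three images.
scores = np.tanh([2.1, -0.7, 0.05])
print([classify_cat(float(s)) for s in scores])   # ['cat', 'no cat', 'cat']
```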


To respond to your additional question of why sigmoid or tanh instead of sin or cos: while all of these functions have bounded output, only sigmoid and tanh are one-to-one functions.


Image classification and other tasks can be expressed as function approximation and, in theory, neural networks can approximate (almost) any function, given a few assumptions on the activation function (see the Universal Approximation Theorem).

However, in practice, not all functions fulfilling these assumptions work equally well. Popular activation functions usually share some properties that allow neural networks to learn efficiently in practice, e.g. they are continuously differentiable (for gradient descent), close to the identity near zero (accelerates initial learning from small random weights) and so on.
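As a small illustration of the "close to the identity near zero" property, the standard Taylor expansion of tanh shows it behaves almost exactly like the identity for small inputs:

$$\tanh(x) = x - \frac{x^{3}}{3} + \frac{2x^{5}}{15} - \cdots \approx x \quad \text{for small } x$$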

Activation functions like the sigmoid function are not directly related to image classification or any other tasks. Rather, they allow for efficient training of neural networks, which, in turn, can represent a wide variety of tasks using different architectures and cost functions.
