Gini Impurity $G$ is a first order approximation to Information Gain $IG$.

To see this, first note that the probabilities $p_i$ over $N$ classes sum to one: $\sum_{i=1}^N p_i=1$. So we can rewrite the Gini Impurity as

$$ G = 1 - \sum_{i=1}^N p_i^2 = \sum_{i=1}^N p_i - \sum_{i=1}^N p_i^2 = \sum_{i=1}^N p_i(1-p_i) $$

To rewrite Information Gain, note that the natural logarithm can be expressed as an infinite series:

$$ \ln(1-x) = -x - \frac{1}{2}x^2 - \frac{1}{3}x^3 - \ldots $$

The Information Gain can then be expressed as

$$ \begin{align} IG &= - \sum_{i=1}^N p_i \ln(p_i) = - \sum_{i=1}^N p_i \ln(1 - (1-p_i)) \\ &= - \sum_{i=1}^N p_i \left[-(1-p_i) - \frac{1}{2}(1-p_i)^2 - \frac{1}{3}(1-p_i)^3 - \ldots \right] \\ &= \sum_{i=1}^N \color{red}{p_i } \left[\color{red}{(1-p_i)} + \frac{1}{2}(1-p_i)^2 + \frac{1}{3}(1-p_i)^3 + \ldots \right] \end{align} $$

The first term (in red) of this infinite series for the Information Gain is exactly the Gini Impurity $G$.

So which metric should you use? If you need maximum speed, use the Gini Impurity. If you need the theoretically exact solution, use Information Gain. As a compromise you can, for example, use the second order approximation to Information Gain:

$$ IG \approx \sum_{i=1}^N \left[ p_i (1-p_i) + \frac{1}{2} p_i (1-p_i)^2 \right] = \frac{1}{2}\sum_{i=1}^N p_i(1-p_i)(3-p_i) $$

The plot (a screenshot from Wolfram Alpha with $x=p_i$) shows Gini Impurity in red, Information Gain in blue and the second order approximation in green. The second order approximation to Information Gain sits roughly in the middle between the Gini Impurity and Information Gain.
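The series argument above can be checked numerically. The following is a minimal sketch, not part of the original answer: the helper names `gini`, `entropy` and `series_approx` and the example distribution are assumptions chosen for illustration.

```python
import math

def gini(probs):
    """First-order term of the series: sum of p_i * (1 - p_i)."""
    return sum(p * (1 - p) for p in probs)

def entropy(probs):
    """Exact value: -sum of p_i * ln(p_i), skipping zero-probability classes."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def series_approx(probs, order):
    """Truncate sum_i p_i * sum_{k=1..order} (1 - p_i)^k / k after `order` terms."""
    return sum(
        p * sum((1 - p) ** k / k for k in range(1, order + 1))
        for p in probs
    )

probs = [0.5, 0.3, 0.2]  # hypothetical class distribution

print("order 1 (Gini):", series_approx(probs, 1))
print("order 2       :", series_approx(probs, 2))
print("order 50      :", series_approx(probs, 50))
print("exact entropy :", entropy(probs))
```

Every term of the series is non-negative, so each truncation sits below the exact entropy and moves closer to it as the order grows.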
For the case of a variable with two values, appearing with fractions $f$ and $(1-f)$, the Gini impurity and entropy are given by:

$\mathrm{gini} = 2f(1-f)$

$\mathrm{entropy} = f\ln\big(\tfrac{1}{f}\big) + (1-f)\ln\big(\tfrac{1}{1-f}\big)$

These measures are very similar if scaled to a maximum of $1.0$ (plotting $2\cdot\mathrm{gini}$ and $\mathrm{entropy}/\ln(2)$):

Gini (y4,purple) and Entropy (y3,green) values scaled for comparison
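To reproduce the scaled comparison without the plot, here is a small sketch that tabulates both curves for a few values of $f$. The function names and the sample values of $f$ are assumptions made for this illustration.

```python
import math

def binary_gini(f):
    """Gini impurity for two classes with fractions f and 1 - f."""
    return 2 * f * (1 - f)

def binary_entropy(f):
    """Entropy (natural log) for two classes; defined as 0 at f = 0 or f = 1."""
    if f <= 0.0 or f >= 1.0:
        return 0.0
    return f * math.log(1 / f) + (1 - f) * math.log(1 / (1 - f))

print("   f  2*gini  entropy/ln(2)")
for f in [0.05, 0.1, 0.25, 0.5, 0.75, 0.9]:
    scaled_gini = 2 * binary_gini(f)                  # peaks at 1 when f = 0.5
    scaled_entropy = binary_entropy(f) / math.log(2)  # peaks at 1 when f = 0.5
    print(f"{f:4.2f}  {scaled_gini:6.3f}  {scaled_entropy:13.3f}")
```

Both scaled curves reach their maximum of 1 at $f = 0.5$ and are symmetric around it.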

Gini impurity and entropy (the measure behind Information Gain) are pretty much the same, and people do use the values interchangeably. Below are the formulae of both:

  1. $\textit{Gini}: \mathit{Gini}(E) = 1 - \sum_{j=1}^{c}p_j^2$
  2. $\textit{Entropy}: H(E) = -\sum_{j=1}^{c}p_j\log p_j$

Given a choice, I would use the Gini impurity, as it doesn't require computing logarithms, which are more computationally expensive than multiplications. A closed-form solution can also be found for it.
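If you want to check the cost claim on your own machine, a rough sketch with the standard `timeit` module could look like the following. The function names, the example distribution and the iteration count are assumptions; timings depend heavily on hardware and implementation, so no result is claimed here.

```python
import math
import timeit

def gini(probs):
    """Gini(E) = 1 - sum of p_j^2."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """H(E) = -sum of p_j * log(p_j), skipping zero-probability classes."""
    return -sum(p * math.log(p) for p in probs if p > 0)

probs = [0.1, 0.2, 0.3, 0.4]  # hypothetical class probabilities

# Time many repeated evaluations of each measure on the same input.
t_gini = timeit.timeit(lambda: gini(probs), number=100_000)
t_entropy = timeit.timeit(lambda: entropy(probs), number=100_000)

print(f"gini:    {t_gini:.4f} s")
print(f"entropy: {t_entropy:.4f} s")
```

In a real tree learner the impurity is only one part of the split search, so any difference measured this way is an upper bound on what you would notice in practice.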

Which metric is better to use in different scenarios while using decision trees?

The Gini impurity, for the reasons stated above.

So, they are pretty much the same when it comes to CART analytics.

Helpful reference for computational comparison of the two methods

By the principle of parsimony, Gini outperforms entropy in terms of computational ease (a logarithm obviously involves more computation than a plain multiplication at the processor/machine level).

But entropy definitely has an edge in some cases involving highly imbalanced data.

Since entropy takes the log of each probability and multiplies it by the probability of the event, what happens in the background is that the contribution of lower probabilities gets scaled up.

If your data's probability distribution is exponential or Laplace (as in deep learning, where we need a probability distribution with a sharp peak), entropy outperforms Gini.

To give an example, suppose you have $2$ events, one with probability $0.01$ and the other with probability $0.99$.

In Gini, the sum of squared probabilities is $0.01^2 + 0.99^2 = 0.0001 + 0.9801$, which means that the lower probability plays almost no role, as everything is governed by the majority probability.

In the case of entropy (using base-10 logarithms), $0.01\log_{10}(0.01) + 0.99\log_{10}(0.99) = 0.01(-2) + 0.99(-0.00436) = -0.02 - 0.00432$, so here the lower probability is clearly given much more weight.
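The arithmetic in this example can be reproduced with a few lines of Python. This is only a sketch of the calculation above; the variable names are assumptions, and base-10 logarithms are used because that is what the numbers in the example imply.

```python
import math

probs = [0.01, 0.99]  # minority and majority event from the example

# Gini side: squared probabilities, dominated by the majority class.
squares = [p * p for p in probs]
gini = 1 - sum(squares)

# Entropy side: p * log10(p) per event, as in the worked example.
log_terms = [p * math.log10(p) for p in probs]
entropy10 = -sum(log_terms)

# Share of each sum that comes from the minority event.
minority_share_squares = squares[0] / sum(squares)
minority_share_entropy = log_terms[0] / sum(log_terms)

print("squared terms:", squares, "-> Gini =", gini)
print("log terms    :", log_terms, "-> entropy (base 10) =", entropy10)
print("minority share of squared sum:", minority_share_squares)
print("minority share of entropy sum:", minority_share_entropy)
```

The minority event contributes a tiny fraction of the squared sum but most of the entropy sum, which is the weighting effect described above.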

Gini is intended for continuous attributes and Entropy is for attributes that occur in classes

Gini is to minimize misclassification
Entropy is for exploratory analysis

Entropy is a little slower to compute

I've been doing optimizations on binary classification for the past week+, and in every case, entropy significantly outperforms gini. This may be data set specific, but it would seem like trying both while tuning hyperparameters is a rational choice, rather than making assumptions about the model ahead of time.

You never know how data will react until you've run the statistics.

Entropy takes slightly more computation time than the Gini index because of the log calculation; maybe that's why the Gini index has become the default option for many ML algorithms. But, from Tan et al.'s book Introduction to Data Mining:

"Impurity measure are quite consistent with each other... Indeed, the strategy used to prune the tree has a greater impact on the final tree than the choice of impurity measure."

So, it looks like the selection of impurity measure has little effect on the performance of single decision tree algorithms.

Also: "Gini method works only when the target variable is a binary variable." - Learning Predictive Analytics with Python.

Generally, your performance will not change whether you use Gini impurity or Entropy.

Laura Elena Raileanu and Kilian Stoffel compared both in "Theoretical comparison between the gini index and information gain criteria". The most important remarks were:

  • It only matters in 2% of the cases whether you use gini impurity or entropy.
  • Entropy might be a little slower to compute (because it makes use of the logarithm).

I was once told that both metrics exist because they emerged in different disciplines of science.

To add to the fact that they are more or less the same, consider also that:

$$ \begin{split} \forall \; 0 < u < 1,\; \log (1-u) &= -u - u^2/2 - u^3/3 \, - \, \cdots\\ \forall \; 0 < p < 1,\; \log (p) &= p-1 - (1-p)^2/2 - (1-p)^3/3 \, - \, \cdots\\ \end{split} $$

so that:

$$ \forall \; 0 < p < 1,\; -p \log (p) = p(1-p) + p(1-p)^2/2 + p(1-p)^3/3 \, + \, \cdots $$

See the following plot of the two functions, normalised to have a maximum value of 1: the red curve is for Gini while the black one is for entropy.

Normalised Gini and Entropy criteria

In the end, as explained by @NIMISHAN, Gini is more suitable for minimising misclassification as it is symmetric around 0.5, while entropy penalises small probabilities more.


