Decision trees are widely used when implementing machine learning algorithms. The hierarchical structure of a decision tree leads us to the final outcome by traversing the nodes of the tree. In simple terms, the 'knowledge' a decision tree learns during training is formulated directly into a hierarchical structure. This structure holds and displays the knowledge in such a way that it can easily be understood, even by non-experts.
Each node holds an attribute or feature and is further split into more nodes as we move down the tree. But how do we decide:
- Which attribute/feature should be placed at the root node?
- Which features will act as internal nodes or leaf nodes?
To decide this, and to determine how to split the tree, we use splitting measures such as the Gini Index, Information Gain, etc.
Gini Index
The Gini index, also called the Gini coefficient or Gini impurity, measures the probability of a specific element being classified incorrectly when it is picked at random. It works on categorical target variables, treats each outcome as either a "success" or a "failure", and hence performs binary splits only. It is not as computationally intensive as its counterpart, Information Gain. From the Gini index, another quantity called Gini Gain is derived, which the decision tree maximizes at each split when building a CART model.
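For reference, the Gini Gain of a candidate split is commonly defined as the parent node's impurity minus the weighted impurity of its children (here $N_k$ denotes the number of samples in child $k$ and $N$ the number of samples in the parent):

$$\text{Gini Gain} = Gini(\text{parent}) - \sum_{k} \frac{N_k}{N}\, Gini(\text{child}_k)$$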
The value of the Gini index ranges from 0 to 1, where:
- 0 means that all the elements belong to a single class, i.e. the node is pure,
- 1 means that the elements are randomly distributed across various classes, and
- 0.5 means that the elements are uniformly distributed over some classes.
It was proposed by Leo Breiman in 1984 as an impurity measure for decision tree learning.
Mathematically, the Gini index is represented by
$$Gini = 1 - \sum_{i=1}^{C} p(i)^2$$
where,
- $C$ is the total number of classes,
- $p(i)$ is the probability of picking a data point with class $i$.
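As a quick illustration (with made-up numbers), consider a node containing 4 samples of class A and 6 samples of class B. Then $p(A) = 0.4$ and $p(B) = 0.6$, so

$$Gini = 1 - (0.4^2 + 0.6^2) = 1 - 0.52 = 0.48.$$

A pure node containing only one class would instead give $Gini = 1 - 1^2 = 0$.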
While building the decision tree, we would prefer choosing the attribute/feature with the least Gini index as the root node.
The Gini index is used in the classic CART algorithm and is very easy to calculate.
Gini Index:
for each branch in split:
    Calculate the percent of samples the branch represents  #Used for weighting
    for each class in branch:
        Calculate the probability of the class in the given branch.
        Square the class probability.
    Sum the squared class probabilities.
    Subtract the sum from 1.  #This is the Gini index for the branch
Weight each branch's Gini index by the percent of samples it represents.
Sum the weighted Gini indices to obtain the Gini index for the split.
Implementation in Python and R
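Below is a minimal Python sketch of the pseudocode above, assuming each branch of a split is given as a list of class labels; the function names `gini_index` and `weighted_gini` are illustrative, not from a specific library. An R version would follow the same steps.

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity of a single branch: 1 - sum of squared class probabilities."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

def weighted_gini(branches):
    """Gini index of a split: branch impurities weighted by branch size."""
    total = sum(len(branch) for branch in branches)
    return sum(len(branch) / total * gini_index(branch) for branch in branches)

# Example: a candidate split that separates the classes fairly well.
left  = ["A", "A", "A", "B"]         # mostly class A
right = ["B", "B", "B", "B", "A"]    # mostly class B
print(gini_index(left))              # 0.375
print(gini_index(right))             # 0.32
print(weighted_gini([left, right]))  # ~0.344
```

The attribute whose split yields the lowest weighted Gini index (equivalently, the highest Gini Gain) would be the one chosen for the node.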