Feature selection with information gain (KL divergence) and mutual information yields different results
I'm comparing different techniques for feature selection / feature ranking. Two of the techniques under scrutiny are mutual information (MI) and information gain (IG) as used in decision trees, i.e. the Kullback-Leibler divergence.
My data (class and features) is all binary.
All the sources I could find state that MI and IG are basically "two sides of the same coin", i.e. that one can be transformed into the other via mathematical manipulation (for example [source 1, source 2]).
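As far as I understand it, the equivalence comes from the fact that the information gain of a split on a feature $X$ with respect to the class $Y$ is exactly the mutual information between $X$ and $Y$, which in turn is the KL divergence between the joint distribution and the product of the marginals:

$$
IG(Y, X) \;=\; H(Y) - H(Y \mid X) \;=\; \sum_{x,\,y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \;=\; D_{\mathrm{KL}}\!\big(p(X, Y) \,\|\, p(X)\, p(Y)\big) \;=\; I(X; Y).
$$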
Yet when I rank my features using the two measures, they do not produce the same ranking order. If the two measures are equivalent, shouldn't the rankings be the same?
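For concreteness, here is a minimal sketch of how I compute the two quantities for a single binary feature (Python; the toy data and the `information_gain` helper are made up purely for illustration, and `mutual_info_score` is scikit-learn's MI, both measured in nats):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy(y):
    """Shannon entropy (in nats) of a discrete label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def information_gain(x, y):
    """Decision-tree style IG: H(y) minus the weighted entropy
    of y within each value of the feature x."""
    ig = entropy(y)
    for v in np.unique(x):
        mask = (x == v)
        ig -= mask.mean() * entropy(y[mask])
    return ig

# hypothetical binary toy data: x is a noisy copy of the class y
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
x = np.where(rng.random(1000) < 0.8, y, 1 - y)

print(information_gain(x, y))   # IG in nats
print(mutual_info_score(y, x))  # sklearn MI in nats
```

On data like this the two numbers should coincide up to floating-point error, which is why the differing rankings on my real data confuse me.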
Can someone help me understand why the rankings are different?
Topic: mutual-information, information-theory, ranking, feature-selection
Category: Data Science