VIF Vs Mutual Info

I was searching for the best ways to do feature selection in a regression problem and came across a post suggesting mutual information for regression, so I tried it on the Boston dataset:

    # feature selection
    f_selector = SelectKBest(score_func=mutual_info_regression, k='all')
    # learn the relationship from the training data
    f_selector.fit(X_train, y_train)
    # transform the train input data
    X_train_fs = f_selector.transform(X_train)
    # transform the test input data
    X_test_fs = f_selector.transform(X_test)

The scores were as follows:

        Features    Scores
    12  LSTAT       0.651934
    5   RM          …
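
For the VIF side of the comparison, a minimal sketch (assuming X_train is a numeric pandas DataFrame and that statsmodels is available) could look like:

    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # VIF needs an intercept column to measure collinearity against
    X_vif = X_train.copy()
    X_vif["const"] = 1.0

    # one VIF per column: how well that feature is explained by all the others
    vif = pd.Series(
        [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
        index=X_vif.columns,
    )
    print(vif.drop("const").sort_values(ascending=False))

Keep in mind the two measure different things: VIF flags redundancy among the features themselves, while mutual information scores each feature's relevance to the target, so their rankings need not agree.
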
Category: Data Science

Feature selection with information gain (KL divergence) and mutual information yields different results

I'm comparing different techniques for feature selection / feature ranking. Two of the techniques under scrutiny are mutual information (MI) and the information gain (IG) used in decision trees, i.e. the Kullback-Leibler divergence. My data (class and features) is all binary. All sources I could find state that MI and IG are basically "two sides of the same coin", i.e. that one can be transformed into the other via mathematical manipulation. (For example [source 1, source 2]) Yet, …
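
For binary data the two should agree numerically, since $IG = H(Y) - H(Y \mid X) = I(X;Y)$; a quick check (a sketch with synthetic binary data):

    import numpy as np
    from scipy.stats import entropy
    from sklearn.metrics import mutual_info_score

    rng = np.random.default_rng(0)
    x = rng.integers(0, 2, size=1000)
    y = (x ^ (rng.random(1000) < 0.2)).astype(int)   # noisy binary copy of x

    # information gain: H(y) - H(y | x), both in nats
    h_y = entropy(np.bincount(y))
    h_y_given_x = sum(
        (x == v).mean() * entropy(np.bincount(y[x == v])) for v in np.unique(x)
    )
    ig = h_y - h_y_given_x

    mi = mutual_info_score(x, y)   # also in nats
    print(ig, mi)                  # identical up to floating-point error
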
Category: Data Science

How to fix my CSV files? (ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required)

I have imported two CSV files into df1 and df2 and concatenated them to make df3. When I call mutual_info_regression on them I get a value error: ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required. I have checked the dimensions of X, y, and discrete_features, and they all seem okay. Since the code works with other CSV files (I have tested), I think the problem is with my csv …
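
That particular ValueError means sklearn was handed an array with zero rows, so a few shape checks before the call usually locate the problem. A sketch (the "target" column name is a placeholder):

    import pandas as pd
    from sklearn.feature_selection import mutual_info_regression

    # "target" is a placeholder for the actual label column in df3
    X = df3.drop(columns=["target"]).select_dtypes("number")
    y = df3["target"]

    # the error means sklearn received 0 rows, so inspect shapes first
    print(df3.shape, X.shape, y.shape)
    print(X.isna().sum())      # all-NaN columns often show up after a bad concat
    X = X.dropna()             # dropping NaNs can silently empty the frame...
    y = y.loc[X.index]         # ...so check again before fitting
    assert len(X) > 0, "no rows left after cleaning"

    scores = mutual_info_regression(X, y)
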
Category: Data Science

Pipelines with categorical and nan values

I am building a regression model on a dataset that has categorical and numerical variables along with NaN values. I want to use Pipelines for imputation and encoding. I have a few conditions that must be satisfied in building the model, which are as follows: 1.) Use of Pipelines is a must for the imputation and encoding (one-hot encoding) steps. 2.) Imputation should be done AFTER the train test split. 3.) For feature selection (should be done AFTER train …
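
A minimal sketch of such a pipeline (num_cols and cat_cols are hypothetical lists of the numeric and categorical column names; because the whole pipeline is fit on X_train only, the imputation statistics are learned after the split):

    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.linear_model import LinearRegression

    numeric = Pipeline([("impute", SimpleImputer(strategy="median"))])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])

    pre = ColumnTransformer([
        ("num", numeric, num_cols),   # num_cols / cat_cols: placeholder column lists
        ("cat", categorical, cat_cols),
    ])

    model = Pipeline([("pre", pre), ("reg", LinearRegression())])
    model.fit(X_train, y_train)       # imputation statistics come from X_train only
    print(model.score(X_test, y_test))
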
Category: Data Science

A measure of redundancy in mutual information

Mutual information quantifies to what degree $X$ decreases the uncertainty about $Y$. However, to my understanding, it does not quantify "in how many ways" $X$ decreases the uncertainty. E.g., consider the case where $X$ is a 3D vector, and consider $X_1=[Y,0,0]$ vs. $X_2 = [Y,Y^2, 3.5Y]$. Intuitively, $X_2$ contains "more information" about $Y$, or is more redundant with respect to $Y$, than $X_1$; but if I understand correctly, both have the same mutual information. Is there an alternative information-theoretic measure …
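
Indeed, under the usual discrete-entropy reading, since $Y$ is recoverable from the first component of both vectors, the identity

$$ I(X_i; Y) = H(Y) - H(Y \mid X_i) = H(Y) - 0 = H(Y), \qquad i \in \{1, 2\}, $$

gives the same value in both cases, so mutual information alone cannot tell the redundant encoding $X_2$ apart from $X_1$.
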
Category: Data Science

Visualizing mutual information of each convolution layer for image classification problem

I recently came across this paper, where the authors propose a compression-based theory of understanding the layers of a DNN. In order to visualize what is going on, the authors show Figure 2 of the paper, which is also available as a video here. For my image classification problem I want to visualize the mutual information in exactly this format. Can someone kindly explain how to calculate this numerically for images passing through conv layers in …
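
One common recipe for producing such information-plane plots is a binning estimator: discretize the activations of a layer into a manageable number of states and compute $I(X;T)$ and $I(T;Y)$ from joint counts. A rough sketch (acts, x_ids and y are hypothetical placeholders for one layer's activations, a per-image id, and the labels):

    import numpy as np

    def discrete_mi(a, b):
        """I(A;B) in nats from two 1-D arrays of non-negative integer labels."""
        joint = np.zeros((a.max() + 1, b.max() + 1))
        np.add.at(joint, (a, b), 1)
        p = joint / joint.sum()
        pa, pb = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
        nz = p > 0
        return float((p[nz] * np.log(p[nz] / (pa @ pb)[nz])).sum())

    # acts: (n_samples, n_units) activations of one layer, x_ids: per-image id,
    # y: integer class labels -- all hypothetical placeholders
    edges = np.linspace(acts.min(), acts.max(), 30)
    t = np.digitize(acts, edges)                     # discretize activations
    t_ids = np.unique(t, axis=0, return_inverse=True)[1].ravel()  # one state per sample

    mi_xt = discrete_mi(x_ids, t_ids)   # I(X;T): x-axis of the information plane
    mi_ty = discrete_mi(t_ids, y)       # I(T;Y): y-axis of the information plane

Repeating this for every layer at several points during training gives the trajectories shown in that kind of figure.
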
Category: Data Science

Does Sample Size affect Mutual Information for Feature Selection?

There is a dataset with n rows (samples) and p columns (variables/features), and the objective is to predict a certain target variable (y). Should n (the sample size) matter to the results of pairwise mutual information tests between every feature and y? Meaning, if n is too small or too large, can the results not be trusted? My intuition says no, but I'm not fully confident. And is there a good reason, besides domain knowledge, not to exclude a variable that …
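
In practice, sample size does matter for the estimate itself: the kNN-based estimator in sklearn has bias and variance that shrink with n, so at small n you can see clearly non-zero scores even for independent variables. A quick experiment sketching this (all names are placeholders):

    import numpy as np
    from sklearn.feature_selection import mutual_info_regression

    rng = np.random.default_rng(0)
    for n in (30, 300, 3000):
        scores = []
        for _ in range(20):
            x = rng.normal(size=(n, 1))
            y = rng.normal(size=n)     # independent of x, so the true MI is 0
            scores.append(mutual_info_regression(x, y, random_state=0)[0])
        print(n, round(np.mean(scores), 4), round(np.std(scores), 4))
    # bias and spread of the estimate shrink as n grows
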
Category: Data Science

Understanding math notation in infoGAN paper

I'm reading this paper about mutual information in InfoGAN (infoGAN_paper_link) and already have the code to run it. I found code for it, which is fine and dandy except that I don't quite understand some of the code in the cost function. So I looked at the paper to dissect it for better understanding and came across some math notation that I don't understand (pic below). The notation I'm trying to figure …
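
For reference (quoted from memory of the commonly cited form, so it is worth double-checking against the paper itself), the expression at the heart of the InfoGAN cost function is a variational lower bound on the mutual information between the latent code $c$ and the generated sample $G(z,c)$:

$$ L_I(G, Q) = \mathbb{E}_{c \sim P(c),\ x \sim G(z,c)}\big[\log Q(c \mid x)\big] + H(c) \;\le\; I\big(c;\, G(z, c)\big), $$

and this term is added to the standard GAN objective with a weighting hyperparameter $\lambda$, which is what the auxiliary network $Q$ in the code is optimizing.
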
Category: Data Science

Mutual Information in sklearn

I expected sklearn's mutual_info_classif to give a value of 1 for the mutual information of a series of values with itself, but instead I'm seeing results ranging between about 1.0 and 1.5. What am I doing wrong? This video on mutual information (from 4:56 to 6:53) says that when one variable perfectly predicts another, the mutual information score should be log_2(2) = 1. However I do not get that result:

    import pandas as pd
    from sklearn.metrics import confusion_matrix
    y …
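
Part of the discrepancy is units: sklearn's mutual information estimators return values in nats (natural log), not bits, and if the series has many distinct values and discrete_features is left at its default, a kNN estimator is used, which adds estimation noise. A small sketch of the exact, discrete case (placeholder data):

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    y = np.array([0, 1] * 50)                        # binary series
    mi = mutual_info_classif(y.reshape(-1, 1), y, discrete_features=True)[0]
    print(mi)              # ~ln(2) = 0.693 nats
    print(mi / np.log(2))  # ~1.0 bit, the value the video describes

    y3 = np.array([0, 1, 2] * 40)                    # three equally likely classes
    mi3 = mutual_info_classif(y3.reshape(-1, 1), y3, discrete_features=True)[0]
    print(mi3, np.log(3))  # I(Y;Y) = H(Y) = ln(3), comfortably above 1 nat
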
Category: Data Science

How does Mutual Information handle background overlap

I have been reading about mutual information in image registration. The literature states that MI handles cases with a large background, where anatomical structures are not aligned, better than entropy does. Can someone provide an intuitive explanation of how MI handles such cases? Thanks in advance
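
For experimenting with that intuition, mutual information between two (aligned) images can be computed from their joint grey-level histogram; a minimal sketch assuming img_a and img_b are 2-D numpy arrays of the same shape:

    import numpy as np

    def image_mi(img_a, img_b, bins=64):
        """Mutual information (nats) from the joint grey-level histogram."""
        joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
        p = joint / joint.sum()
        pa = p.sum(axis=1, keepdims=True)   # marginal of image A
        pb = p.sum(axis=0, keepdims=True)   # marginal of image B
        nz = p > 0                          # skip empty histogram cells
        return float((p[nz] * np.log(p[nz] / (pa @ pb)[nz])).sum())

One common intuition is that, unlike joint entropy alone, MI also includes the marginal entropies of the overlapping region ($I = H(A) + H(B) - H(A,B)$), and those marginals drop when only uniform background overlaps, so background-on-background alignments are not rewarded as strongly.
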
Category: Data Science

When should mutual information be used for feature selection over other feature selection methods like correlation, ANOVA, etc.?

I have a data set with categorical and continuous/ordinal explanatory variables and a continuous target variable. I tried to filter features using one-way ANOVA for the categorical variables and Spearman's correlation coefficient for the continuous/ordinal variables, using the p-value to filter. I then also used mutual information regression to select features. The results from the two techniques do not match. Can someone please explain the discrepancy and which should be used when?
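
A discrepancy is expected, because the filters measure different kinds of association: the F-test and Spearman's correlation only detect linear or monotonic relationships, while mutual information also picks up non-monotonic dependence (and returns a score rather than a p-value, so the rankings are not directly comparable). A small sketch of a relationship that correlation-based filters miss (placeholder data):

    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.feature_selection import f_regression, mutual_info_regression

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, size=(1000, 1))
    y = x[:, 0] ** 2 + rng.normal(scale=0.1, size=1000)   # strong but non-monotonic

    print(f_regression(x, y)[1])       # linear F-test p-value: typically not significant
    rho, _ = spearmanr(x[:, 0], y)
    print(rho)                         # near 0: a monotonic measure misses it too
    print(mutual_info_regression(x, y, random_state=0))   # clearly positive
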
Category: Data Science

Difference between Information Gain and Mutual Information for feature selection

What is the difference between information gain and mutual information? At this point, I understand that information gain is calculated between a random variable and the target class for classification, while mutual information is calculated between two random variables. Does mutual information become the same as information gain when it is calculated between a random variable and the target class?
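
For reference, the information gain used when splitting on a feature $X$ with respect to the class $Y$ is exactly the mutual information between them:

$$ IG(Y, X) = H(Y) - H(Y \mid X) = \sum_{x, y} p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)} = I(X; Y). $$

The practical difference is only in how they are applied: a decision tree evaluates this quantity on the samples reaching a node, for a candidate split of $X$, rather than on the raw variable over the whole dataset.
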
Category: Data Science

Several independent variables based on the same underlying data

I have data containing, among others, two feature variables that are derived from the same underlying data (i.e. they share mutual information), but they convey different information/messages. How should such cases be handled? Since, logically, they will be highly correlated, it would make sense to use only one of them, preferably the one that conveys more information. But: is this the correct approach, or do we actually lose valuable information by not including it? If including it is the correct …
Category: Data Science

Upper bound on 'relatedness'?

We have ~100 answers to a questionnaire with five questions (Q5). Independently from that, we have about 50, somewhat overlapping, features describing the people who answer the questions (F50). After having thrown an impressive number of 'black box' regression models at trying to predict any of the 5 answers from the 50 features, we are approaching the conclusion that the features are simply orthogonal to the topic of the questionnaire. This is interesting, and a little surprising, and it …
Category: Data Science

Conditional Entropy and Mutual Information - Clustering evaluation

First of all, I am doing clustering and I have the true labels for my data. For evaluation, I am using the weighted average of the entropy values for each predicted cluster. While going over the alternatives, I also came across Mutual Information as a similar approach. On my data, they seem to give similar results. However, there is one issue that puzzles me. Given the predicted cluster set $U$ and true clusters $V$, mutual information was defined as: …
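
The relationship between the two can be checked directly: the weighted average of per-cluster entropies is the conditional entropy $H(V \mid U)$, and mutual information is $I(U;V) = H(V) - H(V \mid U)$. A small sketch with sklearn (labels are placeholders):

    import numpy as np
    from scipy.stats import entropy
    from sklearn.metrics import mutual_info_score

    true = np.array([0, 0, 1, 1, 2, 2, 2, 1])    # V: ground-truth labels
    pred = np.array([0, 0, 0, 1, 1, 1, 1, 1])    # U: predicted clusters

    # weighted average of the entropy of true labels within each predicted cluster
    h_v_given_u = sum(
        (pred == c).mean() * entropy(np.bincount(true[pred == c]))
        for c in np.unique(pred)
    )
    h_v = entropy(np.bincount(true))

    print(h_v - h_v_given_u)              # H(V) - H(V|U)
    print(mutual_info_score(true, pred))  # the same quantity, in nats
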
Category: Data Science

Concept of Mutual Information

I want to compute mutual information on the iris dataset to select the best features, but I am confused about mutual information. What is the concept of mutual information for selecting features? Can anyone explain it in a simple way? "You do not really understand something unless you can explain it to your grandmother." - Albert Einstein
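
In feature-selection terms, the mutual information of a feature with the class measures how much knowing that feature reduces your uncertainty about the class (0 means the feature tells you nothing; higher means it tells you more). A minimal sketch of what that looks like on iris with sklearn:

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import mutual_info_classif

    data = load_iris()
    scores = mutual_info_classif(data.data, data.target, random_state=0)

    for name, score in zip(data.feature_names, scores):
        print(f"{name}: {score:.3f}")
    # a higher score means knowing that feature removes more uncertainty
    # about the species; 0 would mean the feature tells you nothing
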
Category: Data Science

PMI between lemma and surface form

I was wondering whether it's possible to compute some sort of pointwise mutual information between a lemma and its surface form. First, if we assume

    p('to go') = count('to go') / sum(all lemmas)
    p('went')  = count('went')  / sum(all words)

Breakpoint here: since every word comes with its respective lemma, we have the condition that sum(all lemmas) == sum(all words). The joint probability is also a little hard to normalize:

    # count of "went" being lemmatized to "to go"
    p('went', 'to …
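
Under those definitions, the PMI can be computed straight from co-occurrence counts of (surface form, lemma) pairs; a sketch with a hypothetical toy token stream:

    import math
    from collections import Counter

    # hypothetical token stream: each running word paired with its lemma
    pairs = [("went", "to go"), ("goes", "to go"), ("went", "to go"), ("ran", "to run")]

    n = len(pairs)                       # sum(all words) == sum(all lemmas)
    surface_counts = Counter(w for w, _ in pairs)
    lemma_counts = Counter(l for _, l in pairs)
    joint_counts = Counter(pairs)

    def pmi(surface, lemma):
        p_joint = joint_counts[(surface, lemma)] / n
        p_s = surface_counts[surface] / n
        p_l = lemma_counts[lemma] / n
        return math.log2(p_joint / (p_s * p_l))

    print(pmi("went", "to go"))          # log2((2/4) / ((2/4) * (3/4))) ≈ 0.415
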
Category: Data Science
