train_test_split ValueError: Input contains NaN

I tried to do stratified sampling with train_test_split to save myself some trouble later, so I wrote the following lines:

from sklearn.model_selection import train_test_split

X = data_df
# pop() removes 'class' from data_df in place, so X no longer contains the target
y = data_df.pop('class')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.125, stratify=y)

I got the error:

ValueError: Input contains NaN

Any help is welcome!



(Upgrading comment to answer.)

This error message is generally pretty straightforward: you have missing values (generally one of np.nan, pd.NA, None), and whatever method you're trying to use cannot handle that.

Now train_test_split doesn't usually care about missing values: it's just splitting up the rows, so why should it care what values are in there? But in this case you're asking it to stratify on y (so that the train and test sets have the same proportion of each class in y), and so it does care about the values in y. The error is therefore raised because you have missing values in y.
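To see the difference for yourself, here is a minimal sketch on made-up data (the column name "feature" and the values are assumptions for illustration, not the asker's data). The plain split goes through even though y has missing values, while the stratified one raises a ValueError:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 12 rows, the last two targets are missing.
X = pd.DataFrame({"feature": np.arange(12, dtype=float)})
y = pd.Series([0.0, 1.0] * 5 + [np.nan, np.nan], name="class")

# Without stratify, the rows are just shuffled and split; NaN in y is not checked.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# With stratify=y, the values of y are validated and the NaN is rejected.
try:
    train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
    stratified_error = None
except ValueError as exc:
    stratified_error = str(exc)

print(stratified_error)
```

The exact wording of the message can differ slightly between scikit-learn versions, but it should mention NaN.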

Missing the target variable is problematic. The best thing to do is probably to drop those rows, unless there's some additional context (e.g. if your data is time-series, maybe you can impute based on the adjacent rows).
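Dropping the rows with a missing target could look roughly like this. It's a sketch on a made-up stand-in for data_df (the "feature" column and the values are assumptions), using dropna with subset so that only a missing 'class' causes a row to be removed:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for data_df: 16 rows, the last two have no label.
data_df = pd.DataFrame({
    "feature": np.arange(16, dtype=float),
    "class": ["a", "b"] * 7 + [np.nan, None],
})

# Keep only the rows whose target is present, then separate features and target.
clean_df = data_df.dropna(subset=["class"])
X = clean_df.drop(columns="class")
y = clean_df["class"]

# Stratifying now works because y has no missing values left.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

print(y_train.value_counts())
print(y_test.value_counts())
```

Note that this only removes rows where the target is missing; missing values in the feature columns are left alone and can be handled separately.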


Check whether you have any null or NaN values. This selects the rows of X that contain at least one missing value:

X[X.isnull().any(axis=1)]
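A slightly fuller check, sketched here on made-up data (the columns "a" and "b" and all values are assumptions), counts the missing values per column and also looks at y, since y is what stratify actually validates:

```python
import numpy as np
import pandas as pd

# Hypothetical features and target with a few gaps.
X = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})
y = pd.Series([0, 1, np.nan], name="class")

# Number of missing values in each feature column.
missing_per_column = X.isna().sum()

# Rows of X that have at least one missing value.
rows_with_missing = X[X.isna().any(axis=1)]

# Number of missing targets; anything above 0 will break stratify=y.
missing_in_y = int(y.isna().sum())

print(missing_per_column)
print(rows_with_missing)
print(missing_in_y)
```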

Then you have to decide what to do with those NaN values. One common option is to forward fill in place, which copies the last valid value above into each gap:

# fillna(method='ffill') is deprecated in recent pandas; ffill() does the same thing
X.ffill(inplace=True)
y.ffill(inplace=True)
