train_test_split ValueError: Input contains NaN

Question

cyanide

March 24, 2022, 07:02

I tried to do a stratified sampling by way of train_test_split in order to save myself some trouble later. So I wrote the following lines:

from sklearn.model_selection import train_test_split

X=data_df
y=data_df.pop('class')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.125, stratify=y)

I got the error:

ValueError: Input contains NaN

Any help is welcome!

Topic sampling python

Category Data Science

Ben Reiniger · Accepted Answer · January 17, 2021, 16:51

(Upgrading comment to answer.)

This error message is fairly straightforward: you have missing values (typically one of np.nan, pd.NA, None), and whatever method you're trying to use cannot handle them.

Now, train_test_split doesn't usually care about missing values: it's just splitting up the rows, so why should it care what values are in there? But in this case you're asking it to stratify on y (making the train and test sets have the same proportion of each class in y), so it does care about the values in y. The error therefore means you have missing values in y.
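To see the difference described above, here is a minimal sketch on made-up data (the `feature` column and the toy labels are assumptions, not the asker's data). It counts the missing labels, then compares a plain split with a stratified one.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 12 rows, the last two labels are missing.
X = pd.DataFrame({"feature": np.arange(12, dtype=float)})
y = pd.Series([0.0, 1.0] * 5 + [np.nan, np.nan], name="class")

# Count the missing labels before splitting.
n_missing = int(y.isna().sum())
print("missing labels:", n_missing)

# A plain split only shuffles row indices, so it goes through.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A stratified split has to inspect the labels, so it rejects NaN.
try:
    train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
    stratify_failed = False
except ValueError as err:
    stratify_failed = True
    print("stratified split failed:", err)
```

The exact wording of the error message may differ between scikit-learn versions.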

Missing values in the target variable are a problem. The best thing to do is probably to drop those rows, unless there's some additional context (e.g. if your data is a time series, maybe you can impute based on the adjacent rows).
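Dropping the rows with a missing target could look roughly like this. The toy `data_df` is a stand-in for the asker's frame (only the column name `class` comes from the question), and `test_size=0.25` is chosen just to suit the small example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for data_df; 'class' has two missing labels.
data_df = pd.DataFrame({
    "feature": np.arange(20, dtype=float),
    "class": ["a", "b"] * 9 + [np.nan, None],
})

# Keep only the rows whose target is present.
clean_df = data_df.dropna(subset=["class"])

X = clean_df.drop(columns="class")
y = clean_df["class"]

# With no missing labels left, stratifying on y works.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
```

Using `drop(columns=...)` instead of `pop` leaves `data_df` itself unchanged, which makes it easier to rerun the cell.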

rigo · Accepted Answer · August 6, 2019, 08:50

Check whether you have any null or NaN values:

X[X.isnull().any(axis=1)]  # rows of X with at least one missing value
y.isnull().sum()           # number of missing values in y

Then you have to decide what to do with those NaN values. One common option is to forward fill in place.

X.ffill(inplace=True)  # fillna(method='ffill') is deprecated in recent pandas
y.ffill(inplace=True)
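One caveat with forward filling: a missing value at the very start has nothing before it to copy, so it stays missing. A small sketch on a made-up series:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading and an interior missing value.
s = pd.Series([np.nan, 1.0, np.nan, 3.0])

# Forward fill copies the last seen value into each gap.
filled = s.ffill()

# The leading NaN has no earlier value, so it is still there.
remaining = int(filled.isna().sum())
print(filled.tolist(), "remaining NaN:", remaining)
```

It is worth checking for missing values again after filling and before calling train_test_split with stratify.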
