train_test_split with stratify integer overflow

Question


tk78

June 2, 2022, 23:07

I'm trying to do a stratified split for a skewed dataset with target variable 'b'. The target variable is a bit value (either 0 or 1). Here's an example:

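import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# 100,000 rows; roughly 1% of them are flagged with b = 1
df = pd.DataFrame(data={'a': np.random.rand(100000), 'b': 0})
df.loc[np.random.randint(0, 100000, 1000), 'b'] = 1
tr, ts = train_test_split(df, test_size=.2, stratify=df['b'])
print(tr.shape, ts.shape)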

This code returns the following:

(93105, 2) (38, 2)

My problem is that the returned train and test sets do not match the requested 20% test fraction. Together they do not even cover all 100,000 rows.

My setup:

Python 3.7.0 (32bit)
Sklearn 0.20.3
Pandas 0.23.4

I found that the problem results from an integer overflow in the underlying split function.
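To illustrate the kind of overflow meant here, the following is a minimal sketch rather than scikit-learn's actual code. It assumes the split multiplies the per-class counts by the number of rows to draw using a 32-bit integer type, which is NumPy's default on a 32-bit Python build. The counts 99000/1000 and the draw size 80000 are illustrative values based on the example above.

```python
import numpy as np

# Illustrative values (assumption): about 99,000 zeros and 1,000 ones,
# with 80,000 rows to draw for an 80% training set.
class_counts = np.array([99000, 1000], dtype=np.int32)
n_draws = np.int32(80000)

# int32 array arithmetic wraps around silently when the product exceeds 2**31 - 1.
wrapped = class_counts * n_draws

# The same product computed in 64-bit integers stays exact.
exact = class_counts.astype(np.int64) * np.int64(n_draws)

print("int32 product:", wrapped)
print("int64 product:", exact)
```

The first entry of the 32-bit product comes out negative, so any per-class allocation derived from it would be wrong, which is consistent with the odd shapes reported above.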

How can I resolve this issue, and is it a known bug? I couldn't find anything helpful.

Topic scikit-learn python

Category Data Science

ASH · Accepted Answer · February 24, 2020, 19:51

How about this?

from sklearn.model_selection import train_test_split

# Split the data into training and test sets (X: features, y: target labels)
xTrain, xTest, yTrain, yTest = train_test_split(X, y,
                                                test_size=0.30,
                                                random_state=0,
                                                stratify=y)  # <-- stratify on the target
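If the stratified call still returns the wrong sizes on a 32-bit build, one possible workaround is to do the per-class split by hand. The helper below, `stratified_split_indices`, is a hypothetical name and not part of scikit-learn. It is a rough sketch that shuffles the row positions of each class separately and sends a `test_size` share of each class to the test set, using plain Python numbers for the size arithmetic.

```python
import numpy as np

def stratified_split_indices(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class at roughly test_size."""
    labels = np.asarray(labels)
    rng = np.random.RandomState(seed)
    train_parts, test_parts = [], []
    for cls in np.unique(labels):
        # Row positions belonging to this class, in random order
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Python float/int arithmetic, so no fixed-width integer product is formed
        n_test = int(round(test_size * len(idx)))
        test_parts.append(idx[:n_test])
        train_parts.append(idx[n_test:])
    # Mix the classes again so rows are not grouped by label
    train_idx = rng.permutation(np.concatenate(train_parts))
    test_idx = rng.permutation(np.concatenate(test_parts))
    return train_idx, test_idx

# Example labels (assumption): 1,000 ones among 100,000 rows
labels = np.zeros(100000, dtype=np.int64)
labels[:1000] = 1
train_idx, test_idx = stratified_split_indices(labels, test_size=0.2)
print(len(train_idx), len(test_idx))
```

With a DataFrame such as `df` from the question, the positions could then be applied as `df.iloc[train_idx]` and `df.iloc[test_idx]`.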
