How to use stratified splitting in sklearn to divide data into multiple files
I have a CSV data file for binary classification. I want to divide it into 5 files using stratification, so that the class label has the same proportion in every file, but I am getting this error:
ValueError: Found input variables with inconsistent numbers of samples:
even though the total number of rows is divisible by 5. As far as I understand, the splitter takes a pandas DataFrame as input, and I am asking it to stratify by a specific column. I suspect the output is a NumPy array that does not have column names. How can I do this?
from sklearn.model_selection import train_test_split
import pandas as pd
df = pd.read_csv('C:/data1.csv')
train1, val1 = train_test_split(df , random_state=1, stratify=df['label'])
train2, val2 = train_test_split(train1, test_size=0.20, random_state=1, stratify=df['label'])
train3, val3 = train_test_split(train2, test_size=0.25, random_state=1, stratify=df['label'])
train4, val4 = train_test_split(train3, test_size=0.33, random_state=1, stratify=df['label'])
train5, val5 = train_test_split(train4, test_size=0.50, random_state=1, stratify=df['label'])
val1.to_csv('1.csv', index=False)
val2.to_csv('2.csv', index=False)
val3.to_csv('3.csv', index=False)
val4.to_csv('4.csv', index=False)
val5.to_csv('5.csv', index=False)
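A likely cause of the error is that `stratify` must have the same number of rows as the data being split. From the second call onward, the data is `train1` (a subset), while `stratify=df['label']` still has the full length. Note also that `train_test_split` returns DataFrames when it is given a DataFrame, so the column names are kept. One possible way to avoid the chained splits altogether is `StratifiedKFold`, whose test folds are disjoint and together cover every row. The following is only a sketch under assumptions: the helper name `stratified_parts` is hypothetical, and a small synthetic DataFrame stands in for the real CSV.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold


def stratified_parts(df, label_col, n_parts=5, seed=1):
    """Split df into n_parts disjoint DataFrames with similar class ratios.

    Hypothetical helper for illustration, not part of scikit-learn.
    """
    skf = StratifiedKFold(n_splits=n_parts, shuffle=True, random_state=seed)
    # split() yields (train_idx, test_idx) pairs; the test folds are disjoint
    # and together cover every row, so each test fold becomes one output part.
    return [df.iloc[test_idx] for _, test_idx in skf.split(df, df[label_col])]


# Synthetic stand-in for the CSV (assumption): 100 rows, 30% positive class.
df = pd.DataFrame({"feature": range(100), "label": [1] * 30 + [0] * 70})

parts = stratified_parts(df, "label")
for i, part in enumerate(parts, start=1):
    # Show the size and positive-class fraction of each part.
    print(i, len(part), part["label"].mean())
    # part.to_csv(f"{i}.csv", index=False)  # uncomment to write the files
```

With the real data, `df` would come from `pd.read_csv(...)` as in the question. If the chained `train_test_split` approach is preferred instead, each call would presumably need to stratify on the label column of the frame actually being split (for example `stratify=train1['label']` in the second call).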
Topic sampling cross-validation scikit-learn machine-learning
Category Data Science