Inconsistent numbers of samples in x and y when splitting into train and test sets
I am new to data science and need help with the following. I am working on a data set that contains both categorical and numerical values. I first concatenated the two files (train and test) so that I could apply the EDA steps to both at once, then ran the EDA steps on the resulting data set, applied one-hot encoding, and split the data. I am getting the error message shown below. It seems there is an inconsistency between the number of y entries and the number of rows in the full data set, which is logical, but how can I deal with this problem?
y_train
y
target
0 1
1 1
2 1
3 0
4 1
... ...
17252 0
17253 0
17254 0
17255 0
17256 1
17257 rows × 1 column
ohe_data = ohe_data.drop(['ind'], axis='columns')
ohe_data.columns
Index(['experience', 'last_new_job', 'training_hours',
'relevent_experience_Has relevent experience',
'relevent_experience_No relevent experience',
'enrolled_university_Full time course',
'enrolled_university_Part time course',
'enrolled_university_no_enrollment', 'education_level_Graduate',
'education_level_High School', 'education_level_Masters',
'education_level_Phd', 'education_level_Primary School',
'major_discipline_Arts', 'major_discipline_Business Degree',
'major_discipline_Humanities', 'major_discipline_No Major',
'major_discipline_Other', 'major_discipline_STEM'],
dtype='object')
ohe_data.shape
(28762, 19)
y = y_train
x = ohe_data
# Split into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y,
test_size=0.33,
random_state=1)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-167-aa834d5164c8> in <module>()
4 X_train, X_test, y_train, y_test = train_test_split(x, y,
5 test_size=0.33,
----> 6 random_state=1)
2 frames
/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
210 if len(uniques) > 1:
211 raise ValueError("Found input variables with inconsistent numbers of"
--> 212 " samples: %r" % [int(l) for l in lengths])
213
214
ValueError: Found input variables with inconsistent numbers of samples: [28762, 17257]
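The two numbers in the message are the row counts of x (28762, train and test rows together) and y (17257, labels for the train rows only), so x has 28762 - 17257 = 11505 rows with no label. A minimal sketch of one way to line them up again, on toy data: it assumes each row is tagged with its origin before concatenation and that the tag is used to select the labelled rows after encoding. The marker column is called `ind` here on the assumption that the dropped `ind` column played that role, which the post does not confirm. The frames `train` and `test`, their values, and the name `x_unlabelled` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the two files (hypothetical values).
train = pd.DataFrame({
    "training_hours": [10, 20, 30, 40, 50, 60],
    "education_level": ["Graduate", "Masters", "Phd",
                        "Graduate", "Masters", "Graduate"],
    "target": [1, 0, 1, 0, 0, 1],
})
test = pd.DataFrame({
    "training_hours": [15, 25, 35, 45],
    "education_level": ["Phd", "Graduate", "Masters", "Phd"],
})

# Set the labels aside and mark where each row came from before concatenating.
y = train["target"]
combined = pd.concat(
    [train.drop(columns="target").assign(ind="train"),
     test.assign(ind="test")],
    ignore_index=True,
)

# One-hot encode the combined frame so both parts get the same columns.
ohe_data = pd.get_dummies(combined, columns=["education_level"])

# Select the labelled rows using the marker, then drop the marker column.
is_train = ohe_data["ind"] == "train"
x = ohe_data[is_train].drop(columns="ind")
x_unlabelled = ohe_data[~is_train].drop(columns="ind")

# x and y now have the same number of rows, so the split no longer fails.
assert len(x) == len(y)
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=1
)
```

The rows in `x_unlabelled` are not used in the split; they stay available for prediction once a model has been fitted on the labelled part.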
Topic dummy-variables data-science-model one-hot-encoding python
Category Data Science