Inconsistent numbers of samples in x and y when splitting into train and test sets

I am new to data science and need help with the following. I am working on a data set that contains both categorical and numerical values. I first concatenated the two files (train and test) so that I could apply the EDA steps to the combined data, then applied one-hot encoding and split the data. I am getting the error message below. It seems that there is an inconsistency between the number of y entries and the full data set, which is logical, but how can I deal with this problem?

y_train
y

target
0   1
1   1
2   1
3   0
4   1
... ...
17252   0
17253   0
17254   0
17255   0
17256   1
17257 rows × 1 column

ohe_data = ohe_data.drop(['ind'], axis='columns')
ohe_data.columns

Index(['experience', 'last_new_job', 'training_hours',
       'relevent_experience_Has relevent experience',
       'relevent_experience_No relevent experience',
       'enrolled_university_Full time course',
       'enrolled_university_Part time course',
       'enrolled_university_no_enrollment', 'education_level_Graduate',
       'education_level_High School', 'education_level_Masters',
       'education_level_Phd', 'education_level_Primary School',
       'major_discipline_Arts', 'major_discipline_Business Degree',
       'major_discipline_Humanities', 'major_discipline_No Major',
       'major_discipline_Other', 'major_discipline_STEM'],
      dtype='object')

ohe_data.shape
(28762, 19)

y = y_train
x = ohe_data

# Split into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, 
                                                    test_size=0.33, 
                                                    random_state=1)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-167-aa834d5164c8> in <module>()
      4 X_train, X_test, y_train, y_test = train_test_split(x, y, 
      5                                                     test_size=0.33,
----> 6                                                     random_state=1)

2 frames
/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    210     if len(uniques) > 1:
    211         raise ValueError("Found input variables with inconsistent numbers of"
--> 212                          " samples: %r" % [int(l) for l in lengths])
    213 
    214 

ValueError: Found input variables with inconsistent numbers of samples: [28762, 17257]



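Sklearn's train_test_split requires X and y to have the same number of rows.

In your case y has shape (17257, 1) and X has shape (28762, 19).

Both must have the same number of rows (observations) before you split. This means selecting rows rather than reshaping, because a reshape cannot add or remove observations:

  • Restrict X to the 17257 rows that have a target, giving shape (17257, 19)

    (OR)

  • Supply a y with 28762 rows, shape (28762, 1), which is only possible if a target exists for every row of X

Since you concatenated the train and test files before encoding, the 11505 extra rows in X are most likely the test rows, which have no target, so the first option is usually the one you want.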
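
The first option can be sketched as follows, assuming the extra rows in X are the unlabeled test rows that were concatenated before encoding. The toy frames, their values and the names `train_df`, `test_df`, `combined` and `X_unlabeled` are hypothetical stand-ins, not the asker's data. The `ind` marker column is used here to tell the two parts apart; the `ind` column dropped in the question may have served the same purpose, but that is a guess.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the two files (hypothetical data, not the asker's).
train_df = pd.DataFrame({
    "training_hours": [10, 20, 30, 40, 50, 60],
    "education_level": ["Graduate", "Masters", "Phd",
                        "Graduate", "Masters", "Graduate"],
    "target": [1, 0, 1, 0, 0, 1],
})
test_df = pd.DataFrame({
    "training_hours": [15, 25, 35, 45],
    "education_level": ["Phd", "Graduate", "Primary School", "Masters"],
})

# Set the labels aside and tag each row with its origin before concatenating.
y = train_df["target"]
combined = pd.concat(
    [train_df.drop(columns="target").assign(ind="train"),
     test_df.assign(ind="test")],
    ignore_index=True,
)

# One-hot encode the combined frame so both parts get identical columns.
ohe = pd.get_dummies(combined, columns=["education_level"])

# Split back into labeled and unlabeled rows, then drop the marker.
is_train = ohe["ind"] == "train"
X = ohe.loc[is_train].drop(columns="ind")
X_unlabeled = ohe.loc[~is_train].drop(columns="ind")

# X and y now have the same number of rows, so the split succeeds.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.33, random_state=1
)
```

Encoding the combined frame keeps the columns of `X` and `X_unlabeled` identical, which is presumably why the files were concatenated in the first place. The unlabeled rows are only set aside for later prediction and never reach `train_test_split`.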


This issue occurs because your x and y variables do not contain the same number of observations. As you can see, your x variable (which is the same as ohe_data) has 28762 observations, whereas your y variable has only 17257. Since we can't see the code that comes before this, we can't say for certain what causes the difference.
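
A quick check before the split makes the size of the mismatch visible. In the question the difference is 28762 - 17257 = 11505 rows. The sketch below uses small placeholder arrays with hypothetical sizes and calls `check_consistent_length` from `sklearn.utils`, the same validation that appears in the traceback.

```python
import numpy as np
from sklearn.utils import check_consistent_length

# Placeholder arrays with hypothetical sizes; substitute your own x and y.
x = np.zeros((10, 3))
y = np.zeros((6, 1))

# Number of rows in x that have no matching entry in y.
extra_rows = x.shape[0] - y.shape[0]
print("rows in x:", x.shape[0], "rows in y:", y.shape[0], "extra:", extra_rows)

# Run the same check train_test_split performs, without splitting anything.
try:
    check_consistent_length(x, y)
    consistent = True
except ValueError as err:
    consistent = False
    print(err)
```

If the number of extra rows equals the size of a file that was appended earlier, such as the test file, that file is the likely source of the difference.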
