Pandas get_dummies() rows dropping after joining back with X
I'm having an issue that I can't explain and am hoping I am missing something simple.
I have a large dataset of shape (45 million+, 51) and am loading it in for some analyses (classifiers, deep learning, basically just trying a few different things as research for work).
I take a few steps when I load it in:
- Use dropna() to get rid of all rows with an NA (only about 6K out of the 45M)
- Use pandas get_dummies() to change a categorical variable with about a dozen classes into dummy variables (I have also used sklearn's OneHotEncoder for this and had the same issue outlined below)
When I ran a RandomForest on a subset of the data (about 4 million rows, made using train_test_split), I got the following error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Since I dropped NAs at the start, this confused me, so I went back and checked the length of each output.
When I drop NAs, I get a length of 45356082.
When I split off the categorical variable and one-hot encode it, the result (and every column within it) also has a length of 45356082. We'll call this Dummy.
Here's where it gets weird: when I join Dummy back to my original X as Xnew, Xnew has the same length as above, but the dummy variable columns now have a length of only 45351726.
The join process is dropping about 4,400 rows from the dummy columns.
Any idea why this would happen?
Here's the code I'm using:
import pandas as pd

choice_data_sub = pd.read_csv('predData.csv')
# Drop NAs
choice_data_sub = choice_data_sub.dropna()
X = choice_data_sub[['Column1', 'Column2', 'Column3', 'Categorical']]
y = choice_data_sub[['NextPurchase']]
choice_data_sub = choice_data_sub.reset_index()
gametype_df = pd.get_dummies(choice_data_sub.Categorical, prefix='Game')
# merge with X
gametype_df = gametype_df.reset_index()
# X = X.reset_index() -- This breaks it in a different way, was tried as a fix
X = X.join(gametype_df)
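The symptom can be reproduced on a toy frame. This is only a minimal sketch, assuming the cause is that join aligns on index labels: X keeps the gapped labels left behind by dropna(), while the dummies are built after a reset_index() and so are labeled 0..N-1. The data values are made up, and reset_index(drop=True) is used instead of the plain reset_index() above just to keep the columns tidy.

```python
import pandas as pd

# Made-up toy data standing in for the real file; row 2 has a missing value.
df = pd.DataFrame({
    "Column1": [1.0, 2.0, None, 4.0, 5.0],
    "Categorical": ["a", "b", "a", "c", "b"],
})
df = df.dropna()                      # surviving index labels: 0, 1, 3, 4
X = df[["Column1", "Categorical"]]    # X keeps those gapped labels

# Dummies are built from a reset frame, so their labels are 0, 1, 2, 3.
dummies = pd.get_dummies(df.reset_index(drop=True)["Categorical"], prefix="Game")

joined = X.join(dummies)              # left join on index labels
print(len(joined))                    # row count is unchanged
print(joined["Game_a"].count())       # but fewer non-null dummy values
print(joined)                         # label 4 has no partner in dummies
```

In this toy case the row labeled 4 gets NaN in every dummy column, and the row labeled 3 is paired with the dummies of a different original row, so the mismatch is not limited to the missing values.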
Also! I just discovered that the following code works, but I'd still like to know why the original didn't.
X = X.reset_index()
X = X.merge(gametype_df, left_index=True, right_index=True)
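A possible reading of why this version behaves differently, sketched on made-up toy data: once both frames have been reset, they share the same 0..N-1 labels, so an index-on-index merge lines up row for row. Both frames also carry a leftover "index" column; merge renames clashing columns with _x/_y suffixes by default, whereas a plain join raises on overlapping column names unless a suffix is given, which may be what "breaks it in a different way" refers to. The last two lines show an alternative that is my own assumption, not something from the code above: encoding straight from X so the labels match by construction.

```python
import pandas as pd

# Made-up toy data; row 2 has a missing value and is dropped.
df = pd.DataFrame({
    "Column1": [1.0, 2.0, None, 4.0, 5.0],
    "Categorical": ["a", "b", "a", "c", "b"],
}).dropna()

# Both frames are reset, so both carry a fresh 0..N-1 index
# plus a leftover column named "index".
X = df[["Column1", "Categorical"]].reset_index()
dummies = pd.get_dummies(df.reset_index()["Categorical"], prefix="Game").reset_index()

# merge disambiguates the clashing "index" columns with _x/_y suffixes.
merged = X.merge(dummies, left_index=True, right_index=True)
print(merged.columns.tolist())
print(merged.isna().sum().sum())      # total count of missing values after the merge

# Alternative sketch: build the dummies from X itself, keeping its labels.
X_orig = df[["Column1", "Categorical"]]
alt = X_orig.join(pd.get_dummies(X_orig["Categorical"], prefix="Game"))
```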