Pandas get_dummies() rows dropping after joining back with X

I'm having an issue that I can't explain and am hoping I am missing something simple.

I have a large dataset of shape (45 million+, 51) and am loading it in for some analyses (classifiers, deep learning; basically trying a few different things as research for work).

I take a few steps when I load it in:

  • dropna() to get rid of all rows with an na (only about 6K out of the 45M)
  • Use pandas get_dummies() to change a categorical variable with about a dozen classes into dummy variables (I have also used sklearn's OneHotEncoder for this and hit the same issue outlined below)

When I would run a RandomForest on a subset of the data (about 4 million rows, made using train_test_split), I would get the following error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Since I dropped NAs at the start, this confused me, so I went back and checked the length of each output.

When I drop NAs, I get a length of 45,356,082. When I split off the categorical variable and one-hot encode it, it (and every column within it) also has a length of 45,356,082. We'll call this Dummy.

Here's where it gets weird: when I join Dummy back to my original X as Xnew, Xnew still has the same length as above, but the dummy variable columns now have only 45,351,726 non-null values.

The join is effectively dropping about 4,400 rows' worth of values from the dummy columns.

Any idea why this would happen?

Here's the code I'm using:

choice_data_sub = pd.read_csv('predData.csv')
# Drop NAs
choice_data_sub = choice_data_sub.dropna()

X = choice_data_sub[['Column1', 'Column2', 'Column3', 'Categorical']]
y = choice_data_sub[['NextPurchase']]

choice_data_sub = choice_data_sub.reset_index()

gametype_df = pd.get_dummies(choice_data_sub.Categorical, prefix='Game')

# merge with X
gametype_df = gametype_df.reset_index()
# X = X.reset_index()  -- tried as a fix, but this breaks it in a different way
X = X.join(gametype_df)

Also! I just discovered that the following code works, but I'd still like to know why the original didn't.

X = X.reset_index()
X = X.merge(gametype_df, left_index=True, right_index=True)
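For anyone wanting to poke at this, here is a minimal sketch of the same pipeline on toy data (hypothetical column names); it reproduces the NaN rows, because dropna() leaves gaps in X's index while the dummies frame built from the reset-index data carries a fresh 0..n-1 index, and join() aligns on those mismatched labels:

```python
import pandas as pd

df = pd.DataFrame({'Categorical': ['a', 'b', None, 'a', 'b'],
                   'Column1': [1, 2, 3, 4, 5]})
df = df.dropna()                        # index is now 0, 1, 3, 4 (gap at 2)

X = df[['Column1']]                     # X keeps the gapped index

# get_dummies on the reset-index frame: its index is a fresh 0, 1, 2, 3
dummies = pd.get_dummies(df.reset_index(drop=True).Categorical,
                         prefix='Game')

Xnew = X.join(dummies)                  # join aligns on index labels, not position
print(len(Xnew))                        # 4 -- same length as X
print(Xnew.isna().any(axis=1).sum())    # 1 -- the row labeled 4 got NaN dummies
```

Note that besides the NaN row, the surviving dummy rows are silently matched to the wrong original rows, since label 3 in X and label 3 in dummies refer to different records.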

Topic dummy-variables pandas

Category Data Science


I'll answer this with the solution that worked for me (OP here), but I should point out that I still do not understand why the original code did not work. (But hey, as long as it works, right?)

Using merge instead of join, and merging on the freshly reset indexes, sorted things out.

X = X.reset_index()
X = X.merge(gametype_df, left_index=True, right_index=True)

Would still love some feedback from anyone that can explain the above join error!
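One plausible reading (sketched here on toy data, not the original 45M-row set) is that X.reset_index() gives X a fresh 0..n-1 RangeIndex that matches the dummies frame's index, so the index-on-index merge lines up positionally:

```python
import pandas as pd

df = pd.DataFrame({'Categorical': ['a', 'b', None, 'a'],
                   'Column1': [1, 2, 3, 4]}).dropna()
X = df[['Column1']]                       # gapped index: 0, 1, 3
dummies = pd.get_dummies(df.reset_index(drop=True).Categorical,
                         prefix='Game')   # fresh index: 0, 1, 2

X = X.reset_index()                       # X now also has index 0, 1, 2
Xnew = X.merge(dummies, left_index=True, right_index=True)
print(Xnew.isna().sum().sum())            # 0 -- no NaN dummy values
```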


Can you post your code for when you join the datasets? It's possible you're joining in a different way than you intended. See the following helpful tutorial for figuring out exactly how you want to join datasets: https://pandas.pydata.org/docs/user_guide/merging.html

Additionally, I would check that X and the dummy dataframe have the same indexes. It's possible that in your join the missing entries simply don't share the same index labels in both dataframes.
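A quick diagnostic along these lines (illustrative frames, not your real data) is to compare the two indexes before joining; any labels present on one side only will come through as NaN in a left join:

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2, 3]}, index=[0, 1, 3])   # gapped index
right = pd.DataFrame({'b': [4, 5, 6]})                   # default RangeIndex 0, 1, 2

print(left.index.equals(right.index))             # False -- indexes differ
print(list(left.index.difference(right.index)))   # [3] -- label that would get NaN
```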
