How to deal with NaN values after merging / joining two dataframes?

A lot of the time after merging two pandas dataframes, I end up with NaNs in the new dataframe. That is expected, because one CSV does not have all the IDs that the other has (the two dataframes have different sizes, for example). Those NaNs were not present before the merge; it is simply the nature of a left join in pandas to mark the missing data as NaN. So some rows have NaN values in some columns. My question is how to deal with those values from a data science point of view. Should I remove them? Should I replace them? What should I do if I cannot replace them with the mean or median? What are the best practices for this? Am I even doing the right thing by merging the two dataframes if I end up with missing values? Should the missing values resulting from merging two dataframes be treated like ordinary missing data?

Topic data-engineering pandas dataset data-cleaning

Category Data Science


You probably should

  1. Conduct a missing-values analysis to see what percentage of each column is missing (figure below, from the dataprep package).
  2. Decide on a threshold above which you drop a column entirely (this also depends on how your analysis or model treats NaNs).
  3. For the columns that are not dropped, impute the missing values, experimenting with relevant techniques, e.g. mean, median, etc. (this also depends on the type of the data and the feature). https://scikit-learn.org/stable/modules/impute.html

[Figure: per-column missing-value percentages, as plotted by the dataprep package]
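The three steps above can be sketched in plain pandas; this is a minimal illustration on a made-up frame (the column names and the 50% threshold are arbitrary choices, not recommendations), with mean imputation standing in for whichever scikit-learn imputer you end up using:

```python
import pandas as pd
import numpy as np

# Toy frame standing in for a post-merge result (hypothetical data).
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [25.0, np.nan, 31.0, np.nan],
    "mostly_missing": [np.nan, np.nan, np.nan, 7.0],
})

# 1. Percentage of missing values per column.
pct_missing = df.isna().mean() * 100
print(pct_missing)

# 2. Drop columns above a chosen threshold (here 50% missing).
threshold = 50
keep = pct_missing[pct_missing <= threshold].index
df = df[keep]

# 3. Impute the remaining NaNs, e.g. with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())
print(df)
```

For more elaborate strategies (median, most-frequent, k-NN imputation), scikit-learn's `SimpleImputer` and `KNNImputer` from the page linked above follow the same drop-then-impute workflow.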


NaN values in pandas and other Python packages represent missing data. In other languages they are often called NULL, NA or similar. They can arise when you left join two tables and a key in the left table has no corresponding row in the right table, or they can be entered manually. The interpretation is simply "missing data", so ideally you want to keep them in order to keep track of what was missing.
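A tiny sketch of how a left join produces these NaNs (the table names and values here are invented for illustration):

```python
import pandas as pd

customers = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"id": [1, 3], "total": [9.99, 4.50]})

# Left join: every row of `customers` is kept. Customer 2 has no
# matching row in `orders`, so its `total` is filled with NaN.
merged = customers.merge(orders, on="id", how="left")
print(merged)
```

Passing `indicator=True` to `merge` adds a `_merge` column showing which side each row came from, which helps track exactly where the NaNs were introduced.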

Unlike some other languages, Python does not have a null element for each type. pandas uses the float NaN for missing data, which was actually only meant to represent "not a number": the floating-point result of undefined mathematical operations like 0/0, inf/inf, etc.

This is a frequent cause of trouble: you are processing strings, and once in a while you hit one of these NaNs of an entirely different type. For this reason you might want to use, for example, the isna function to filter them out, or the fillna function to replace them with some other value, such as "" for strings.
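A short sketch of that pitfall and the two remedies just mentioned, on an invented string column:

```python
import pandas as pd
import numpy as np

s = pd.Series(["foo", np.nan, "bar"])

# The missing entry is a float NaN sitting in a column of strings,
# so string methods applied naively to raw values would fail on it.
assert isinstance(s[1], float)

# Remedy 1: filter missing entries out with a notna/isna mask.
non_missing = s[s.notna()]

# Remedy 2: replace them with a sentinel value, e.g. "" for strings.
filled = s.fillna("")
print(filled.tolist())
```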

Pandas itself has a page on dealing with missing data.
