Multiple merges make the pandas data frame explode, causing a memory issue in Jupyter Notebook

I have performed multiple merges using pandas data frames (see the example script below). The merges make the data frame explode and consume ever more memory: the result reaches about 18 billion records after merging with df3, and then has to be merged with roughly 500,000 (5 lakh) records in df4.

This causes a memory issue: it consumes all of the available RAM (140 GB) and the session gets killed.

    df = df1[df1_columns].\
        merge(
            df2[df2_columns],
            how='left',
            left_on='col1',
            right_on='col2'
        ).\
        merge(df3[df3_columns], how='left', on='ID').\
        merge(df4[df4_columns], how='left', on='ID')

I would appreciate any solutions to tackle the issue caused by these joins.

Topic: jupyter, azure-ml, pandas, python

Category: Data Science


First things first:

  1. Check that you are joining on primary keys only, i.e. that none of the columns you join on contain duplicate values; otherwise you will end up with a huge and unnecessary number of duplicated rows (see the duplicate-check sketch after this list).

  2. When joining multiple tables I always go for functional programming using reduce; this will also boost the performance of the merge operations:

    import pandas as pd
    from functools import reduce

    # create a list that contains all the data frames to be merged
    ldf = [df1, df2, df3, df4]
    # all frames need a common key column (here "id") to merge on
    df = reduce(lambda x, y: pd.merge(x, y, on="id", how="left"), ldf)
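As a quick check for point 1, a small sketch like the following can report whether any of the join columns contain duplicated values before you run the merges. The helper name and the (frame, key) pairs are just placeholders based on the column names in the question:

    import pandas as pd

    # Hypothetical helper: report how many duplicated key values each frame has.
    # The (name, frame, key) triples follow the columns used in the question.
    def report_duplicate_keys(frames_and_keys):
        for name, frame, key in frames_and_keys:
            dup_count = frame[key].duplicated().sum()
            print(f"{name}: {dup_count} duplicated values in '{key}'")

    report_duplicate_keys([
        ("df1", df1, "col1"),
        ("df2", df2, "col2"),
        ("df3", df3, "ID"),
        ("df4", df4, "ID"),
    ])

Any frame that reports a non-zero count for its join key will multiply rows during the merge.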

Hope it helps!


Without any information about the data, it's hard to answer.

First, you should be able to estimate whether the merge should indeed produce 18 billion rows: is that the expected result or is it a mistake? You can investigate this by using a subset of your real data in one or both of the dataframes and seeing how fast the resulting dataframe grows.
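For instance, a rough sizing sketch like the one below can estimate how many rows a left merge would produce before actually running it. The helper name is hypothetical and the column names follow the question's example:

    import pandas as pd

    # Rough estimate of the row count a left merge would produce, without
    # materialising the result. Column names follow the question's example.
    def estimate_left_merge_rows(left, right, left_on, right_on):
        right_counts = right[right_on].value_counts()
        # each left row contributes max(1, number of matching right rows);
        # unmatched keys map to NaN, which a left join keeps as one row
        matches = left[left_on].map(right_counts).fillna(1)
        return int(matches.sum())

    print(estimate_left_merge_rows(df1, df2, "col1", "col2"))

Applying the same estimate step by step to each merge in the chain shows where the row count blows up.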

If this merge operation is a full Cartesian product of two huge dataframes, then it could be the normal result and the problem is one of design. Obviously there is a physical limit on the amount of memory, so with data of this size it's not surprising that you hit it. In that case a solution could be to implement specific code with a more efficient data structure, or to use disk storage instead of memory storage.
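As a sketch of the disk-based approach, and assuming df1 can be processed row-wise while the other frames fit in memory, one option is to merge chunk by chunk and append each result to a file instead of holding everything in RAM. The chunk size and output file name below are arbitrary:

    import pandas as pd

    CHUNK_SIZE = 1_000_000  # arbitrary; tune to the available RAM

    # Merge df1 against the other frames chunk by chunk and stream the
    # result to disk instead of keeping the full merged frame in memory.
    for start in range(0, len(df1), CHUNK_SIZE):
        chunk = df1.iloc[start:start + CHUNK_SIZE]
        merged = (
            chunk.merge(df2, how="left", left_on="col1", right_on="col2")
                 .merge(df3, how="left", on="ID")
                 .merge(df4, how="left", on="ID")
        )
        merged.to_csv("merged_result.csv", mode="a",
                      header=(start == 0), index=False)

This keeps peak memory bounded by one chunk's merged output rather than the full 18 billion rows.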
