Merge two dataframes on multiple columns, only if not NaN

Question

Merge two dataframes on multiple columns, only if not NaN

krbcaypwd

2022年1月22日 15:00

Given two Pandas dataframes, how can I use the second dataframe to fill in missing values, given multiple key columns?

Col1  Col2 Key1  Key2  Extra1                   Col1     Col2    Key1  Key2         Col1  Col2 Key1  Key2  Extra1
-------------------------------------------------------------------------------------------------------------------
[A, B, 1.10, 1.11, Alice]                                                     [A, B, 1.10, 1.11, Alice]         # left df has more non-NaNs, so leave it
[C, D, 2.10, 2.11, Bob]                   [np.nan, np.nan, 1.10, 1.11]        [C, D, 2.10, 2.11, Bob]           # unmatched row should still exist
[np.nan, np.nan, 3.10, 3.11, Charlie]     +   [E, F, 3.10, 3.11]          =   [E, F, 3.10, 3.11, Charlie]       # left df has NaN, so fill in values
[I, np.nan, 5.10, 5.11, Destiny]            [G, H, 4.10, 4.11]              [I, np.nan, 5.10, 5.11, Destiny]    # no matching values in second df, so leave it
[np.nan, J, 6.10, 6.11, Evan]                                                   [np.nan, J, 6.10, 6.11, Evan]       # no matching values in second df, so leave it

My attempt:

import pandas as pd
import numpy as np

data = [
    [A, B, 1.10, 1.11, Alice],
    [C, D, 2.10, 2.11, Bob],
    [np.nan, np.nan, 3.10, 3.11, Charlie],
    [I, np.nan, 5.10, 5.11, Destiny],
    [np.nan, J, 6.10, 6.11, Evan],
]
df1 = pd.DataFrame(data, columns = ['common_1', 'common_2', 'common_3', 'common_4', 'extra_1'])

data = [
    [np.nan, np.nan, 1.10, 1.11],
    [E, F, 3.10, 3.11],
    [G, H, 4.10, 4.11],
]
df2 = pd.DataFrame(data, columns = ['common_1_other', 'common_2_other', 'common_3_other', 'common_4_other'])


df3 = pd.merge(
    df1,
    df2,
    how=left,
    left_on=[common_3, common_4],
    right_on=[common_3_other, common_4_other],
)


Want df3 to be:
    [A, B, 1.10, 1.11, Alice]
    [C, D, 2.10, 2.11, Bob]
    [E, F, 3.10, 3.11, Charlie]
    [I, np.nan, 5.10, 5.11, Destiny]
    [np.nan, J, 6.10, 6.11, Evan]

Topic data-imputation pandas python

Category Data Science

Rose · Accepted Answer · 2022年1月22日 15:00

A simple way to do it is:

for col in df1:
    try:
        df1[col].fillna(df2[col+"_other"], inplace=True)
        
    except:
        pass

df1 output:

    common_1 common_2 common_3 common_4 extra_1
0   A           B       1.1      1.11   Alice
1   C           D       2.1      2.11   Bob
2   G           H       3.1      3.11   Charlie
3   I          NaN      5.1      5.11   Destiny
4   NaN        J        6.1      6.11   Evan

Raymond Kwok · Accepted Answer · 2022年1月21日 05:39

If you remove all the "_other" from the column names of your df2, then you can do

df1.set_index(['common_3', 'common_4']).fillna(df2.set_index(['common_3', 'common_4'])).reset_index()

This should fill nan in any of the Col1 and Col2 if there is a match in both Key1 and Key2

Merge two dataframes on multiple columns, only if not NaN

About