Compare multiple values from a DataFrame against a single row from another

I'm trying to compare address values for inaccuracies, for example, given multiple records like:

Reference Apartment Address PostCode
AS097 NaN 00 Name Road BH1 4HB
AS097 Flat 1 Building Name 00 Name Road BH1 4HB
AS097 Flat 2 Building Name 00 Name Rd BH1 4HB
AS097 Flat 3 Building Name 00 Name Road BH1 4HB
AS097 Flat 4 Building Name 00 Name Road BH1 4HB
AS097 Flat 5 Building Name 00 Name Road BH1 4HX
HO056 NaN 23 Street Road XG9 9GX

I've a dataframe where I store all the main addresses by checking if the Apartment column is empty like so:

main_address = df['Apartment'].isnull()

df_st = pd.DataFrame({'Reference':df[main_address].Reference, 'Address':df[main_address].Address, 'PostCode':df[main_address].PostCode})

df_st will look like this:

Reference Apartment Address PostCode
AS097 NaN 00 Name Road BH1 4HB
HO056 NaN 23 Street Road XG9 9GX

df has over 1K records, but df_st containing the main address ends up with approx. 200 records.

I'm trying to create a new DataFrame where I can identify where the records don't match by domparing df against df_st.

THE PROBLEM

I try the below:

# Clean the Reference values
refs_list = df['Reference'].str.split('/').str[0]
df['Reference'] = refs_list

# Create a new column titled issues and flag if the references match
df['issues'] = np.where(df['Reference'] == df_st['Reference'], 'True', 'False')

I want the above for the Address and PostCode unfortunately this does not work since df and df_st don't have the same shape.

I'm struggling to find a way to achieve a comparison between the two DataFrames df against df_st.

I want to compare all matching Reference row values from df against it's matching from df_st and if one of them don't match create a new column title Issues and store the conflicting column there.

MY DESIRED OUTCOME

Given the data above, after comparing df data against df_st results in a new DataFrame, like below

Reference Apartment Address PostCode Issues
AS097 NaN 00 Name Road BH1 4HB None
AS097 Flat 1 Building Name 00 Name Road BH1 4HB None
AS097 Flat 2 Building Name 00 Name Rd BH1 4HB Address
AS097 Flat 3 Building Name 00 Name Road BH1 4HB None
AS097 Flat 4 Building Name 00 Name Road BH1 4HB None
AS097 Flat 5 Building Name 00 Name Road BH1 4HX PostCode
HO056 NaN 23 Street Road XG9 9GX None

Where Address appears as an issue in the column Issues since the address don't match against df_st, same for PostCode since it differs from df_st

IN A NUT SHELL

All I want to to know how to compare matching rows by Reference from a DataFrame against another and compare the other values Address and PostCode.

Hope that makes sense.

Topic data-wrangling pandas python

Category Data Science


You can join df and df_st on Reference:

df_merged = pd.merge(df, df_st, on="Reference", how="left")

Note: The how="left" would really depend on what you want in the joined table. You can then compare the values in the joined columns:

df_merged['Address_df'] == df_merged['Address_df_st']

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.