Compare multiple values from a DataFrame against a single row from another

Question

Compare multiple values from a DataFrame against a single row from another

Ricardo Sanchez

2022年2月25日 01:04

I'm trying to compare address values for inaccuracies, for example, given multiple records like:

Reference	Apartment	Address	PostCode
AS097	NaN	00 Name Road	BH1 4HB
AS097	Flat 1 Building Name	00 Name Road	BH1 4HB
AS097	Flat 2 Building Name	00 Name Rd	BH1 4HB
AS097	Flat 3 Building Name	00 Name Road	BH1 4HB
AS097	Flat 4 Building Name	00 Name Road	BH1 4HB
AS097	Flat 5 Building Name	00 Name Road	BH1 4HX
HO056	NaN	23 Street Road	XG9 9GX

I've a dataframe where I store all the main addresses by checking if the Apartment column is empty like so:

main_address = df['Apartment'].isnull()

df_st = pd.DataFrame({'Reference':df[main_address].Reference, 'Address':df[main_address].Address, 'PostCode':df[main_address].PostCode})

df_st will look like this:

Reference	Apartment	Address	PostCode
AS097	NaN	00 Name Road	BH1 4HB
HO056	NaN	23 Street Road	XG9 9GX

df has over 1K records, but df_st containing the main address ends up with approx. 200 records.

I'm trying to create a new DataFrame where I can identify where the records don't match by domparing df against df_st.

THE PROBLEM

I try the below:

# Clean the Reference values
refs_list = df['Reference'].str.split('/').str[0]
df['Reference'] = refs_list

# Create a new column titled issues and flag if the references match
df['issues'] = np.where(df['Reference'] == df_st['Reference'], 'True', 'False')

I want the above for the Address and PostCode unfortunately this does not work since df and df_st don't have the same shape.

I'm struggling to find a way to achieve a comparison between the two DataFrames df against df_st.

I want to compare all matching Reference row values from df against it's matching from df_st and if one of them don't match create a new column title Issues and store the conflicting column there.

MY DESIRED OUTCOME

Given the data above, after comparing df data against df_st results in a new DataFrame, like below

Reference	Apartment	Address	PostCode	Issues
AS097	NaN	00 Name Road	BH1 4HB	None
AS097	Flat 1 Building Name	00 Name Road	BH1 4HB	None
AS097	Flat 2 Building Name	00 Name Rd	BH1 4HB	Address
AS097	Flat 3 Building Name	00 Name Road	BH1 4HB	None
AS097	Flat 4 Building Name	00 Name Road	BH1 4HB	None
AS097	Flat 5 Building Name	00 Name Road	BH1 4HX	PostCode
HO056	NaN	23 Street Road	XG9 9GX	None

Where Address appears as an issue in the column Issues since the address don't match against df_st, same for PostCode since it differs from df_st

IN A NUT SHELL

All I want to to know how to compare matching rows by Reference from a DataFrame against another and compare the other values Address and PostCode.

Hope that makes sense.

Topic data-wrangling pandas python

Category Data Science

Daren · Accepted Answer · 2021年5月11日 02:34

You can join df and df_st on Reference:

df_merged = pd.merge(df, df_st, on="Reference", how="left")

Note: The how="left" would really depend on what you want in the joined table. You can then compare the values in the joined columns:

df_merged['Address_df'] == df_merged['Address_df_st']

Compare multiple values from a DataFrame against a single row from another

About