Compare multiple values from a DataFrame against a single row from another
I'm trying to compare address values for inaccuracies, for example, given multiple records like:
Reference | Apartment | Address | PostCode |
---|---|---|---|
AS097 | NaN | 00 Name Road | BH1 4HB |
AS097 | Flat 1 Building Name | 00 Name Road | BH1 4HB |
AS097 | Flat 2 Building Name | 00 Name Rd | BH1 4HB |
AS097 | Flat 3 Building Name | 00 Name Road | BH1 4HB |
AS097 | Flat 4 Building Name | 00 Name Road | BH1 4HB |
AS097 | Flat 5 Building Name | 00 Name Road | BH1 4HX |
HO056 | NaN | 23 Street Road | XG9 9GX |
I've a dataframe where I store all the main addresses by checking if the Apartment column is empty like so:
main_address = df['Apartment'].isnull()
df_st = pd.DataFrame({'Reference':df[main_address].Reference, 'Address':df[main_address].Address, 'PostCode':df[main_address].PostCode})
df_st
will look like this:
Reference | Apartment | Address | PostCode |
---|---|---|---|
AS097 | NaN | 00 Name Road | BH1 4HB |
HO056 | NaN | 23 Street Road | XG9 9GX |
df
has over 1K records, but df_st
containing the main address ends up with approx. 200 records.
I'm trying to create a new DataFrame where I can identify where the records don't match by domparing df
against df_st
.
THE PROBLEM
I try the below:
# Clean the Reference values
refs_list = df['Reference'].str.split('/').str[0]
df['Reference'] = refs_list
# Create a new column titled issues and flag if the references match
df['issues'] = np.where(df['Reference'] == df_st['Reference'], 'True', 'False')
I want the above for the Address
and PostCode
unfortunately this does not work since df
and df_st
don't have the same shape.
I'm struggling to find a way to achieve a comparison between the two DataFrames df
against df_st
.
I want to compare all matching Reference
row values from df
against it's matching from df_st
and if one of them don't match create a new column title Issues
and store the conflicting column there.
MY DESIRED OUTCOME
Given the data above, after comparing df
data against df_st
results in a new DataFrame, like below
Reference | Apartment | Address | PostCode | Issues |
---|---|---|---|---|
AS097 | NaN | 00 Name Road | BH1 4HB | None |
AS097 | Flat 1 Building Name | 00 Name Road | BH1 4HB | None |
AS097 | Flat 2 Building Name | 00 Name Rd | BH1 4HB | Address |
AS097 | Flat 3 Building Name | 00 Name Road | BH1 4HB | None |
AS097 | Flat 4 Building Name | 00 Name Road | BH1 4HB | None |
AS097 | Flat 5 Building Name | 00 Name Road | BH1 4HX | PostCode |
HO056 | NaN | 23 Street Road | XG9 9GX | None |
Where Address
appears as an issue in the column Issues
since the address don't match against df_st
, same for PostCode
since it differs from df_st
IN A NUT SHELL
All I want to to know how to compare matching rows by Reference
from a DataFrame against another and compare the other values Address
and PostCode
.
Hope that makes sense.
Topic data-wrangling pandas python
Category Data Science