Compare multiple values from a DataFrame against a single row from another
I'm trying to compare address values for inaccuracies, for example, given multiple records like:
| Reference | Apartment | Address | PostCode |
|---|---|---|---|
| AS097 | NaN | 00 Name Road | BH1 4HB |
| AS097 | Flat 1 Building Name | 00 Name Road | BH1 4HB |
| AS097 | Flat 2 Building Name | 00 Name Rd | BH1 4HB |
| AS097 | Flat 3 Building Name | 00 Name Road | BH1 4HB |
| AS097 | Flat 4 Building Name | 00 Name Road | BH1 4HB |
| AS097 | Flat 5 Building Name | 00 Name Road | BH1 4HX |
| HO056 | NaN | 23 Street Road | XG9 9GX |
I've a dataframe where I store all the main addresses by checking if the Apartment column is empty like so:
main_address = df['Apartment'].isnull()
df_st = pd.DataFrame({'Reference':df[main_address].Reference, 'Address':df[main_address].Address, 'PostCode':df[main_address].PostCode})
df_st will look like this:
| Reference | Apartment | Address | PostCode |
|---|---|---|---|
| AS097 | NaN | 00 Name Road | BH1 4HB |
| HO056 | NaN | 23 Street Road | XG9 9GX |
df has over 1K records, but df_st containing the main address ends up with approx. 200 records.
I'm trying to create a new DataFrame where I can identify where the records don't match by domparing df against df_st.
THE PROBLEM
I try the below:
# Clean the Reference values
refs_list = df['Reference'].str.split('/').str[0]
df['Reference'] = refs_list
# Create a new column titled issues and flag if the references match
df['issues'] = np.where(df['Reference'] == df_st['Reference'], 'True', 'False')
I want the above for the Address and PostCode unfortunately this does not work since df and df_st don't have the same shape.
I'm struggling to find a way to achieve a comparison between the two DataFrames df against df_st.
I want to compare all matching Reference row values from df against it's matching from df_st and if one of them don't match create a new column title Issues and store the conflicting column there.
MY DESIRED OUTCOME
Given the data above, after comparing df data against df_st results in a new DataFrame, like below
| Reference | Apartment | Address | PostCode | Issues |
|---|---|---|---|---|
| AS097 | NaN | 00 Name Road | BH1 4HB | None |
| AS097 | Flat 1 Building Name | 00 Name Road | BH1 4HB | None |
| AS097 | Flat 2 Building Name | 00 Name Rd | BH1 4HB | Address |
| AS097 | Flat 3 Building Name | 00 Name Road | BH1 4HB | None |
| AS097 | Flat 4 Building Name | 00 Name Road | BH1 4HB | None |
| AS097 | Flat 5 Building Name | 00 Name Road | BH1 4HX | PostCode |
| HO056 | NaN | 23 Street Road | XG9 9GX | None |
Where Address appears as an issue in the column Issues since the address don't match against df_st, same for PostCode since it differs from df_st
IN A NUT SHELL
All I want to to know how to compare matching rows by Reference from a DataFrame against another and compare the other values Address and PostCode.
Hope that makes sense.
Topic data-wrangling pandas python
Category Data Science