How to find duplicate rows in one column and then check whether two cells in another column sum to a third cell in an Excel tab, in Python?

I need to find all duplicate rows (string values) in the Name column and then check whether two numerical values in the Amount column sum up to a third value, also in the Amount column, in an Excel tab using pandas (Python). There are two tabs in this workbook; I'm referring to the second tab, called Table2.

For example, in the table below, I have several duplicates in the Name column. For the Richard Madden duplicates, the corresponding values in the Amount column satisfy (-4000) + (-6000) = (-10000). I need to delete the entire rows for -4000 and -6000 and keep only the row for -10000.

Here’s the Excel table: https://i.stack.imgur.com/3n2vZ.png
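To make the goal concrete, here's a tiny mock-up in pandas (names and amounts are made up to mirror the screenshot) of the before and after for the Madden rows:

import pandas as pd

# Before: three rows for the same name, where two amounts sum to the third
before = pd.DataFrame({
    'Name':   ['Richard Madden', 'Richard Madden', 'Richard Madden'],
    'Amount': [-4000, -6000, -10000],
})

# After: only the row holding the total (-10000) should remain
after = pd.DataFrame({
    'Name':   ['Richard Madden'],
    'Amount': [-10000],
})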

Here's my code so far:


import pandas as pd

# Open the workbook and read the second tab (Table2)
excel = pd.ExcelFile('/Users/user/Downloads/DSR-Table.xlsx')
df1 = pd.read_excel(excel, 'Table2')

# Take just the Name column and flag the rows that repeat an earlier Name
dfObj = pd.DataFrame(df1, columns=['Name'])
duplicateRowsDF = dfObj[dfObj.duplicated()]
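One thing I noticed about duplicated(): by default it only flags occurrences after the first one, so the first Richard Madden row is not caught. Passing keep=False marks every row of a repeated name, for example:

# Flag every row whose Name appears more than once, including the first occurrence
all_dup_names = df1[df1['Name'].duplicated(keep=False)]
print(all_dup_names)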



Here's my answer, with the help of Jorge N's code (shown below):

import pandas as pd
import itertools

excel = pd.ExcelFile('/Users/user/Downloads/DSR-Table.xlsx')
# df1 already contains the Name and Amount columns, so it is used directly below
df1 = pd.read_excel(excel, 'Table2')

def get_duplicates_idxs(group):
   # group holds the rows that share one Name value
   idxs = []
   if len(group) == 3:
      amount = group.Amount
      indexes = amount.index
      idx1 = indexes[0]
      idx2 = indexes[1]
      idx3 = indexes[2]
      a1 = amount[idx1]
      a2 = amount[idx2]
      a3 = amount[idx3]
      # keep the indexes of the two rows whose amounts add up to the third
      if a1 + a2 == a3:
         idxs = [idx1, idx2]
      if a1 + a3 == a2:
         idxs = [idx1, idx3]
      if a2 + a3 == a1:
         idxs = [idx2, idx3]
   return idxs

idxs_series = df1.groupby("Name").apply(lambda x: get_duplicates_idxs(x))
idxs_duplicates = list(itertools.chain.from_iterable(idxs_series))
df_filtered = df1[~df1.index.isin(idxs_duplicates)]

# Caution: this overwrites the original workbook and writes only this one sheet
df_filtered.to_excel("/Users/user/Downloads/DSR-Table.xlsx", index=False)
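One caveat worth adding: to_excel with a plain file path rewrites the whole workbook, so the first tab (Table1) is lost. If you need to keep it, a sketch using ExcelWriter in append mode (assuming pandas >= 1.3 with openpyxl installed) could look like this:

# Replace only the Table2 sheet and leave the other tabs untouched
with pd.ExcelWriter('/Users/user/Downloads/DSR-Table.xlsx',
                    mode='a', engine='openpyxl',
                    if_sheet_exists='replace') as writer:
    df_filtered.to_excel(writer, sheet_name='Table2', index=False)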

This is the quickest way to do it in my opinion.

Some made-up data based on your image:

import pandas as pd
import itertools

# Made-up data mirroring the image, plus an extra "Smith" triple to test the logic
columns = ["Name", "Amount"]
name = ["Humming", "Stanley", "James", "Humming", "Igo", "Madden", "Madden",
        "Samuels", "McCallister", "Samuels", "Madden"]
amount = [478028, 333543, 294376, 199793, 224,
          -4000, -6000, -7886, -9331, -15043, -10000]
extra_name = ["Smith", "Smith", "Smith"]
extra_amount = [3000, 2000, -1000]
name += extra_name
amount += extra_amount

Table2 = list(zip(name, amount))
df1 = pd.DataFrame(Table2, columns=columns)

Here df1 plays the role of your real dataframe.

We define an aggregation function that, for each name, retrieves the list of indexes of the duplicate rows to remove according to your criteria:

def get_duplicates_idxs(group):
   # group holds the rows that share one Name value
   idxs = []
   if len(group) == 3:
      amount = group.Amount
      indexes = amount.index
      idx1 = indexes[0]
      idx2 = indexes[1]
      idx3 = indexes[2]
      a1 = amount[idx1]
      a2 = amount[idx2]
      a3 = amount[idx3]
      # keep the indexes of the two rows whose amounts add up to the third
      if a1 + a2 == a3:
         idxs = [idx1, idx2]
      if a1 + a3 == a2:
         idxs = [idx1, idx3]
      if a2 + a3 == a1:
         idxs = [idx2, idx3]
   return idxs

There are two assumptions:

  1. A duplicated name implies exactly 3 rows for that name (see the sketch after this list for a way to relax this).
  2. The order of the 3 rows is irrelevant.

a1, a2 and a3 are the amounts of the possible duplicate rows, in that order.
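If you need to relax the first assumption (a name may appear more than 3 times), one possible generalization, not part of the original answer, is to test every pair of amounts in the group with itertools.combinations:

import itertools

def get_duplicates_idxs_any_size(group):
   # Hypothetical variant: for a group of any size, collect every pair of rows
   # whose amounts sum to some other amount in the same group
   idxs = []
   amount = group.Amount
   for i, j in itertools.combinations(amount.index, 2):
      pair_sum = amount[i] + amount[j]
      if (amount.drop([i, j]) == pair_sum).any():
         idxs.extend([i, j])
   return idxs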

Then we apply the get_duplicates_idxs function to df1 grouped by Name:

idxs_series = df1.groupby("Name").apply(lambda x: get_duplicates_idxs(x))
idxs_duplicates = list(itertools.chain.from_iterable(idxs_series))

df_filtered = df1[~df1.index.isin(idxs_duplicates)]

df_filtered has the desired output.
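As a quick sanity check with the made-up data above, the filter should drop the Madden rows for -4000 and -6000 and the Smith rows for 3000 and -1000 (since 3000 + (-1000) = 2000), keeping the -10000 and 2000 rows:

print(df_filtered)
# The -4000/-6000 Madden rows and the 3000/-1000 Smith rows should be gone;
# the -10000 and 2000 rows remain.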

Tools used:

  - Pandas groupby
  - Pandas apply
  - itertools.chain
  - Python lambda functions
