Set value for column based on two other columns in pandas dataframe

I have a dataframe of contracts with different order dates, and I need to create a new column that assigns a number to each contract when it has more than one order date. For example, my sample dataframe looks something like this:

import pandas as pd

df = pd.DataFrame({
    'contract': ['123A', '123A', '123A', '123A', '123B', '123B', '123C'],
    'prod': ['X1', 'M1', 'V1', 'D1', 'A1', 'B1', 'C1'],
    'date': ['2019-04-17', '2019-07-02', '2019-04-17', '2019-07-02', '2019-04-17', '2019-09-01', '2019-08-02'],
    'revenue': [5688, 113932, 5688, 49157, 5002, 892, 9000],
})

I need my final table to have another column with a unique contract id for each date. My final table from above should look something like this:

contract  date        header_contract
123A      2019-04-17  123A_0
123A      2019-07-02  123A_1
123A      2019-04-17  123A_0
123A      2019-08-02  123A_2

I have the following code that does what I need on a smaller dataset:

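# assumes df['date'] has been converted with pd.to_datetime
contracts_num = df['contract'].unique()
for cm in contracts_num:
    # number the distinct dates of this contract in order of first appearance
    for idx, val in enumerate(df[df['contract'] == cm]['date'].dt.date.unique()):
        df.loc[(df['contract'] == cm) & (df['date'] == str(val)), 'header_contract'] = df['contract'] + '_' + str(idx)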

I'm trying to run it on a much larger dataset (around 50,000 contracts) and it's taking a really long time. Is there any way to make it more efficient?



You can use groupby together with shift and cumsum as follows:

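df['header_contract'] = df['contract'] + '_' + df.sort_values(['contract', 'date']).\
  groupby('contract')['date'].\
  transform(lambda x: (x.shift() != x).cumsum()).astype(str)

Inside the lambda, x.shift() != x produces a boolean series that is True wherever the date differs from the previous row of the same contract (the rows are sorted first, so equal dates are adjacent). cumsum then takes a running total, counting each True as 1, which gives the numeric suffix for each distinct date within a contract. transform is used here rather than apply because recent pandas versions can add the group key to the index of an apply result, whereas transform keeps the original row index, so the suffixes line up with df['contract'] when the two are concatenated into the new column. The suffixes start at 1, as the result shows; subtract 1 from the cumulative sum if they should start at 0 as in the question.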
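To make the shift and cumsum step concrete, here is a minimal sketch on the dates of a single contract. The toy series is made up for illustration and is assumed to be sorted already, as it would be after sort_values:

```python
import pandas as pd

# Hypothetical dates for one contract, already sorted
dates = pd.Series(['2019-04-17', '2019-04-17', '2019-07-02', '2019-07-02'])

# True wherever the value differs from the previous row;
# the first row is compared against NaN, so it is always True
changed = dates.shift() != dates

# Running count of the changes: every new date bumps the counter by 1
suffix = changed.cumsum()

print(changed.tolist())  # [True, False, True, False]
print(suffix.tolist())   # [1, 1, 2, 2]
```

Because the first row always counts as a change, the numbering starts at 1 rather than 0.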

Result:

  contract prod       date  revenue header_contract
0     123A   X1 2019-04-17     5688          123A_1
1     123A   M1 2019-07-02   113932          123A_2
2     123A   V1 2019-04-17     5688          123A_1
3     123A   D1 2019-07-02    49157          123A_2
4     123B   A1 2019-04-17     5002          123B_1
5     123B   B1 2019-09-01      892          123B_2
6     123C   C1 2019-08-02     9000          123C_1
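
If the suffixes should start at 0, as in the table in the question, one possible variant is to label the distinct dates of each contract with pd.factorize. This is only a sketch of an alternative, not part of the answer above; the helper name date_codes is made up for illustration, and the dates are assumed to be ISO-formatted strings (or datetimes) so that sorting them gives chronological order:

```python
import pandas as pd

df = pd.DataFrame({
    'contract': ['123A', '123A', '123A', '123A', '123B', '123B', '123C'],
    'date': ['2019-04-17', '2019-07-02', '2019-04-17', '2019-07-02',
             '2019-04-17', '2019-09-01', '2019-08-02'],
})

def date_codes(dates):
    # Hypothetical helper: factorize labels each distinct value 0, 1, 2, ...
    # and sort=True assigns the labels in sorted order of the values
    codes, _ = pd.factorize(dates, sort=True)
    return pd.Series(codes, index=dates.index)

# transform keeps the original row index, so the codes align with df
codes = df.groupby('contract')['date'].transform(date_codes)
df['header_contract'] = df['contract'] + '_' + codes.astype(str)

print(df['header_contract'].tolist())
# ['123A_0', '123A_1', '123A_0', '123A_1', '123B_0', '123B_1', '123C_0']
```

No explicit sort_values is needed here, because factorize assigns the same code to equal dates regardless of row order.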
