How can I calculate total days past due between billing events?

I am working with a dataframe of subscription events, where each row holds a username, a subscription event name, and the event's timestamp. Over time, a subscription can become past due and later renew, like this:

username subscription_events_name subscription_events_timestamp
A subscription_charged_unsuccess 2021-01-08
A subscription_past_due 2021-01-08
A subscription_past_due 2021-01-15
A subscription_charged_successfully 2021-01-16
A subscription_renew 2021-01-16

Say a customer enters past-due status and, 15 days later, their subscription is billed in full and becomes active again. I want to store the number of days spent past due in a new column, days_past_due:

username subscription_events_name subscription_events_timestamp days_past_due
A subscription_charged_unsuccess 2021-01-08 0
A subscription_past_due 2021-01-08 0
A subscription_past_due 2021-01-15 7
A subscription_charged_successfully 2021-01-16 0
A subscription_renew 2021-01-16 0

I created an expanding window for this, but it only counts values of subscription_events_name. How can I modify this code or produce new code to achieve this?

df['days_past_due'] = df.groupby([
    'username',
    'subscription_events_name'
])[['subscription_events_timestamp']].transform(lambda x: x.expanding().count())
df.sort_values(by=['subscription_events_timestamp'])



I think a different approach would work better here. I propose the following:

  1. Convert the subscription_events_timestamp column to datetime.

    import pandas as pd

    df['subscription_events_timestamp'] = pd.to_datetime(df['subscription_events_timestamp'], format='%Y-%m-%d')
    
  2. Use the .diff() method within each group to get the difference between each date and the date in the row immediately above it in the same group, then use .dt.days to express that difference as a number of days. The first row of each group has no previous row, so it gets NaN.

    df['days_past_due'] = df.groupby(['username', 'subscription_events_name'])['subscription_events_timestamp'].diff().dt.days
    
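  3. Use the fillna() method to replace the NaN values produced in the previous step with 0. Assigning the result back to the column avoids calling fillna() in place on a selected column, which recent pandas versions warn about.

    df['days_past_due'] = df['days_past_due'].fillna(0)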
    

Output:

username subscription_events_name subscription_events_timestamp days_past_due
0 A subscription_charged_unsuccess 2021-01-08 0.0
1 A subscription_past_due 2021-01-08 0.0
2 A subscription_past_due 2021-01-15 7.0
3 A subscription_charged_successfully 2021-01-16 0.0
4 A subscription_renew 2021-01-16 0.0
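
The three steps can be combined into one self-contained sketch. It rebuilds the sample data from the question, and it makes two assumptions that are not in the original answer: the rows are sorted by username and timestamp first (because diff() depends on row order), and the result is cast to int so the column shows 7 rather than 7.0.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "username": ["A"] * 5,
    "subscription_events_name": [
        "subscription_charged_unsuccess",
        "subscription_past_due",
        "subscription_past_due",
        "subscription_charged_successfully",
        "subscription_renew",
    ],
    "subscription_events_timestamp": [
        "2021-01-08", "2021-01-08", "2021-01-15", "2021-01-16", "2021-01-16",
    ],
})

# Step 1: parse the timestamps.
df["subscription_events_timestamp"] = pd.to_datetime(
    df["subscription_events_timestamp"], format="%Y-%m-%d"
)

# Assumption: sort first, since diff() compares each row with the
# previous row of the same group.
df = df.sort_values(["username", "subscription_events_timestamp"])

# Steps 2 and 3: per-group day difference, with NaN replaced by 0.
df["days_past_due"] = (
    df.groupby(["username", "subscription_events_name"])["subscription_events_timestamp"]
    .diff()
    .dt.days
    .fillna(0)
    .astype(int)
)

print(df)
```

On this sample, only the subscription_past_due row dated 2021-01-15 gets a non-zero value (7). Because the grouping is by event name, any event that repeats for the same user, not only subscription_past_due, would also get a non-zero difference, so it may be worth filtering to past-due rows if that matters for your data.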
