How can I calculate total days past due between billing events?

Question

datadummy

May 15, 2022, 02:05

I am working with a dataframe of subscription events, with one row per username, event name, and timestamp. Over time, a user's subscription can become past due and later renew, for example:

username	subscription_events_name	subscription_events_timestamp
A	subscription_charged_unsuccess	2021-01-08
A	subscription_past_due	2021-01-08
A	subscription_past_due	2021-01-15
A	subscription_charged_successfully	2021-01-16
A	subscription_renew	2021-01-16

Say a customer enters past-due status, and 15 days later their subscription is billed in full and returns to active status. I want to store the number of days spent past due in a new column, days_past_due:

username	subscription_events_name	subscription_events_timestamp	days_past_due
A	subscription_charged_unsuccess	2021-01-08	0
A	subscription_past_due	2021-01-08	0
A	subscription_past_due	2021-01-15	7
A	subscription_charged_successfully	2021-01-16	0
A	subscription_renew	2021-01-16	0

I created an expanding window for this, but it only counts the rows for each subscription_events_name instead of the days between them. How can I modify this code, or write new code, to achieve this?

df['days_past_due'] = df.groupby([
    'username',
    'subscription_events_name',
])[['subscription_events_timestamp']].transform(lambda x: x.expanding().count())
df.sort_values(by=['subscription_events_timestamp'])

Topic pandas python data-cleaning

Category Data Science

Carmoreno · Accepted Answer · November 10, 2021, 00:38

I think a different approach works better here. I propose the following:

Convert the subscription_events_timestamp column to datetime.

df['subscription_events_timestamp'] = pd.to_datetime(df['subscription_events_timestamp'], format='%Y-%m-%d')
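
As a quick check of this step, the minimal sketch below converts a small column of date strings and confirms that the result is a datetime column. The dates are taken from the question; the frame name `sample` is only illustrative.

```python
import pandas as pd

# Illustrative frame; the dates come from the question's sample data.
sample = pd.DataFrame({
    'subscription_events_timestamp': ['2021-01-08', '2021-01-15', '2021-01-16'],
})

# pd.to_datetime accepts a whole Series at once.
sample['subscription_events_timestamp'] = pd.to_datetime(
    sample['subscription_events_timestamp'], format='%Y-%m-%d'
)

# True once the column holds datetimes rather than strings.
is_datetime = pd.api.types.is_datetime64_any_dtype(
    sample['subscription_events_timestamp']
)
print(is_datetime)
```

Date arithmetic such as .diff() only yields time differences once the column is a real datetime column, which is why this step comes first.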

Using the .diff() method on the grouped column, you can get the difference between each date and the previous date in the same group (same username and event name). Once you have the difference, use dt.days to express it as a number of days.
```
df['days_past_due'] = df.groupby(['username', 'subscription_events_name'])['subscription_events_timestamp'].diff().dt.days
```
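
To see what the grouped diff() produces before any NaN handling, here is a minimal sketch on the question's sample rows. The frame is rebuilt by hand, so treat its shape as an assumption about the real data. The first row of each (username, event name) group has no earlier row to compare against, so its result is NaN.

```python
import pandas as pd

# Sample rows from the question, rebuilt by hand for illustration.
df = pd.DataFrame({
    'username': ['A'] * 5,
    'subscription_events_name': [
        'subscription_charged_unsuccess',
        'subscription_past_due',
        'subscription_past_due',
        'subscription_charged_successfully',
        'subscription_renew',
    ],
    'subscription_events_timestamp': pd.to_datetime([
        '2021-01-08', '2021-01-08', '2021-01-15', '2021-01-16', '2021-01-16',
    ]),
})

# Within each group, diff() subtracts the previous row's timestamp.
gaps = (
    df.groupby(['username', 'subscription_events_name'])
      ['subscription_events_timestamp']
      .diff()
      .dt.days
)
print(gaps.tolist())
```

Only the second subscription_past_due row has a predecessor in its group, so it is the only row with a numeric gap (7 days).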
Using the fillna() method, you can replace the NaN values produced in the previous step with 0.
```
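# Assign the result back; in-place fillna on a selected column is unreliable in recent pandas versions
df['days_past_due'] = df['days_past_due'].fillna(0)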
```

Output:

	username	subscription_events_name	subscription_events_timestamp	days_past_due
0	A	subscription_charged_unsuccess	2021-01-08	0.0
1	A	subscription_past_due	2021-01-08	0.0
2	A	subscription_past_due	2021-01-15	7.0
3	A	subscription_charged_successfully	2021-01-16	0.0
4	A	subscription_renew	2021-01-16	0.0
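
The values above are floats because the intermediate NaN values force a float column. As a sketch of the whole recipe in one place, the hypothetical helper `add_days_past_due` below applies the three steps and casts the result to integers. The helper name, the second user `B`, and the up-front sort (in case rows arrive out of order) are illustrative assumptions, not part of the question. Like the steps above, it measures the gap between consecutive events with the same name for the same user.

```python
import pandas as pd

def add_days_past_due(events: pd.DataFrame) -> pd.DataFrame:
    """Return a sorted copy with a days_past_due column (hypothetical helper)."""
    out = events.copy()
    out['subscription_events_timestamp'] = pd.to_datetime(
        out['subscription_events_timestamp'], format='%Y-%m-%d'
    )
    # Assumption: rows may be unordered, so sort before taking differences.
    out = out.sort_values(['username', 'subscription_events_timestamp'])
    gaps = (
        out.groupby(['username', 'subscription_events_name'])
           ['subscription_events_timestamp']
           .diff()
           .dt.days
    )
    # The first event of each group has no predecessor: treat it as 0 days.
    out['days_past_due'] = gaps.fillna(0).astype(int)
    return out

# User A is the question's sample; user B is made up for illustration.
events = pd.DataFrame({
    'username': ['A', 'A', 'A', 'A', 'A', 'B', 'B'],
    'subscription_events_name': [
        'subscription_charged_unsuccess',
        'subscription_past_due',
        'subscription_past_due',
        'subscription_charged_successfully',
        'subscription_renew',
        'subscription_past_due',
        'subscription_past_due',
    ],
    'subscription_events_timestamp': [
        '2021-01-08', '2021-01-08', '2021-01-15', '2021-01-16', '2021-01-16',
        '2021-02-01', '2021-02-04',
    ],
})

result = add_days_past_due(events)
print(result[['username', 'subscription_events_name', 'days_past_due']])
```

Because username is one of the grouping keys, user B's first past-due event starts from 0 instead of being compared with user A's last one.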
