Group a Spark DataFrame from a starting event to an ending event

Given a series of events (with datetime) such as:

failed, failed, passed, failed, passed, passed

I want to retrieve the time from when the test first failed to when it next passed, resetting every time it fails again after passing, since I want to measure the recovery time.

I have only managed to do this with a for loop: when I groupBy the event and take the min of the date, I lose the order of the events, whereas I want to group by failed-passed pairs.

Ultimately I want to measure the average recovery time of this test.

Example data:

from pyspark.sql import Row
from datetime import datetime

df = spark.createDataFrame([
  Row(event="failed", date=datetime(2021, 8, 11, 0, 0)),
  Row(event="failed", date=datetime(2021, 8, 12, 0, 0)),
  Row(event="passed", date=datetime(2021, 8, 13, 0, 0)),
  Row(event="failed", date=datetime(2021, 8, 14, 0, 0)),
  Row(event="failed", date=datetime(2021, 8, 15, 0, 0)),
  Row(event="passed", date=datetime(2021, 8, 16, 0, 0)),
  Row(event="passed", date=datetime(2021, 8, 17, 0, 0)),
  Row(event="passed", date=datetime(2021, 8, 18, 0, 0)),
  Row(event="failed", date=datetime(2021, 8, 19, 0, 0)),
  Row(event="passed", date=datetime(2021, 8, 20, 0, 0))
])

df.show()

+------+-------------------+
| event|               date|
+------+-------------------+
|failed|2021-08-11 00:00:00|
|failed|2021-08-12 00:00:00|
|passed|2021-08-13 00:00:00|
|failed|2021-08-14 00:00:00|
|failed|2021-08-15 00:00:00|
|passed|2021-08-16 00:00:00|
|passed|2021-08-17 00:00:00|
|passed|2021-08-18 00:00:00|
|failed|2021-08-19 00:00:00|
|passed|2021-08-20 00:00:00|
+------+-------------------+

Expected result:

df = spark.createDataFrame([
  Row(failed=datetime(2021, 8, 11, 0, 0), recovered=datetime(2021, 8, 13, 0, 0)),
  Row(failed=datetime(2021, 8, 14, 0, 0), recovered=datetime(2021, 8, 16, 0, 0)),
  Row(failed=datetime(2021, 8, 19, 0, 0), recovered=datetime(2021, 8, 20, 0, 0)),
])

df.show()

+-------------------+-------------------+
|             failed|          recovered|
+-------------------+-------------------+
|2021-08-11 00:00:00|2021-08-13 00:00:00|
|2021-08-14 00:00:00|2021-08-16 00:00:00|
|2021-08-19 00:00:00|2021-08-20 00:00:00|
+-------------------+-------------------+
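One window-function sketch of how this could be done (not from the original post; the column names prev_event, new_episode, and episode are illustrative): use lag to flag each failed row that starts a new failure episode, turn a running sum of that flag into an episode id, then aggregate the first failed and first passed timestamps per episode.

from pyspark.sql import Window
import pyspark.sql.functions as F

# Single unpartitioned window ordered by time; fine for small data,
# but Spark will warn that all rows move to one partition.
w = Window.orderBy("date")

recovery = (
    df
    .withColumn("prev_event", F.lag("event").over(w))
    # A new failure episode starts at a 'failed' row that follows a
    # 'passed' row (or is the very first row).
    .withColumn(
        "new_episode",
        ((F.col("event") == "failed")
         & ((F.col("prev_event") == "passed") | F.col("prev_event").isNull())
        ).cast("int"),
    )
    # A running sum of the start flags assigns an episode id to every row.
    .withColumn("episode", F.sum("new_episode").over(w))
    .where(F.col("episode") > 0)  # drop any 'passed' rows before the first failure
    .groupBy("episode")
    .agg(
        F.min(F.when(F.col("event") == "failed", F.col("date"))).alias("failed"),
        F.min(F.when(F.col("event") == "passed", F.col("date"))).alias("recovered"),
    )
    .where(F.col("recovered").isNotNull())  # keep only episodes that have recovered
    .orderBy("failed")
    .select("failed", "recovered")
)

recovery.show()

For the sample data this yields the three failed/recovered pairs shown above. The running sum works because an ordered window's default frame runs from the start of the partition to the current row, so each row sees the count of episode starts up to and including itself.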
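The average recovery time then follows from a timestamp difference. A minimal sketch, assuming second-level granularity is enough (casting a timestamp to long gives epoch seconds):

recovery.select(
    F.avg(
        F.col("recovered").cast("long") - F.col("failed").cast("long")
    ).alias("avg_recovery_seconds")
).show()

For the sample data the episodes last 2, 2, and 1 days, so this returns 144000.0 seconds (1 day 16 hours).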

