Group a Spark DataFrame from a starting event to an ending event
Given a series of events (with datetime) such as:
failed, failed, passed, failed, passed, passed
I want to retrieve the time from when it first failed to when it first passed, resetting every time it fails again, as I want to measure the recovery time.
I have only managed to do this with a for loop: when I groupBy the event and take the min of the date, I lose the order of the events, but what I actually need is to group by failed-passed pairs.
Ultimately I want to measure the average recovery time of this test.
Example data:
from pyspark.sql import Row
from datetime import datetime
df = spark.createDataFrame([
    Row(event="failed", date=datetime(2021, 8, 11, 0, 0)),
    Row(event="failed", date=datetime(2021, 8, 12, 0, 0)),
    Row(event="passed", date=datetime(2021, 8, 13, 0, 0)),
    Row(event="failed", date=datetime(2021, 8, 14, 0, 0)),
    Row(event="failed", date=datetime(2021, 8, 15, 0, 0)),
    Row(event="passed", date=datetime(2021, 8, 16, 0, 0)),
    Row(event="passed", date=datetime(2021, 8, 17, 0, 0)),
    Row(event="passed", date=datetime(2021, 8, 18, 0, 0)),
    Row(event="failed", date=datetime(2021, 8, 19, 0, 0)),
    Row(event="passed", date=datetime(2021, 8, 20, 0, 0)),
])
df.show()
+------+-------------------+
| event| date|
+------+-------------------+
|failed|2021-08-11 00:00:00|
|failed|2021-08-12 00:00:00|
|passed|2021-08-13 00:00:00|
|failed|2021-08-14 00:00:00|
|failed|2021-08-15 00:00:00|
|passed|2021-08-16 00:00:00|
|passed|2021-08-17 00:00:00|
|passed|2021-08-18 00:00:00|
|failed|2021-08-19 00:00:00|
|passed|2021-08-20 00:00:00|
+------+-------------------+
Expected result:
df = spark.createDataFrame([
Row(failed=datetime(2021, 8, 11, 0, 0), recovered=datetime(2021, 8, 13, 0, 0)),
Row(failed=datetime(2021, 8, 14, 0, 0), recovered=datetime(2021, 8, 16, 0, 0)),
Row(failed=datetime(2021, 8, 19, 0, 0), recovered=datetime(2021, 8, 20, 0, 0)),
])
df.show()
+-------------------+-------------------+
| failed| recovered|
+-------------------+-------------------+
|2021-08-11 00:00:00|2021-08-13 00:00:00|
|2021-08-14 00:00:00|2021-08-16 00:00:00|
|2021-08-19 00:00:00|2021-08-20 00:00:00|
+-------------------+-------------------+
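For reference, one way to get these pairs without a for loop would be a window over the date ordering: flag each failed row that starts a new failure streak, turn the flags into a group id with a running sum, and aggregate each group. This is only a minimal sketch against the example DataFrame above; the helper columns prev_event, new_group, and group_id are names I made up, not from any API:

from pyspark.sql import Window
import pyspark.sql.functions as F

# Ordering the whole frame in one window is fine for small data,
# but note it pulls everything into a single partition.
w = Window.orderBy("date")

result = (
    df
    # A failure streak starts on a failed row whose previous row
    # passed (or that has no previous row at all).
    .withColumn("prev_event", F.lag("event").over(w))
    .withColumn(
        "new_group",
        (
            (F.col("event") == "failed")
            & ((F.col("prev_event") == "passed") | F.col("prev_event").isNull())
        ).cast("int"),
    )
    # A running sum of the start flags gives one id per failed-passed cycle.
    .withColumn("group_id", F.sum("new_group").over(w))
    .groupBy("group_id")
    .agg(
        F.min(F.when(F.col("event") == "failed", F.col("date"))).alias("failed"),
        F.min(F.when(F.col("event") == "passed", F.col("date"))).alias("recovered"),
    )
    .orderBy("failed")
    .drop("group_id")
)
result.show()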
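The average recovery time would then be one more aggregation over those pairs, casting the timestamps to epoch seconds before subtracting (again only a sketch; avg_recovery_seconds is a made-up column name):

result.select(
    F.avg(F.col("recovered").cast("long") - F.col("failed").cast("long"))
    .alias("avg_recovery_seconds")
).show()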
Tags: dataframe, pyspark, apache-spark, data-cleaning
Category: Data Science