Computing number of business days between start/end columns

I have two DataFrames:

  • facts:
    • columns: data, start_date and end_date
  • holidays:
    • column: holiday_date
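
For concreteness, a minimal setup matching those schemas (the sample values here are made up):

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    facts = (
        spark.createDataFrame(
            [("x", "2023-05-01", "2023-05-12")],
            ["data", "start_date", "end_date"],
        )
        .withColumn("start_date", F.to_date("start_date"))
        .withColumn("end_date", F.to_date("end_date"))
    )

    holidays = (
        spark.createDataFrame([("2023-05-08",)], ["holiday_date"])
        .withColumn("holiday_date", F.to_date("holiday_date"))
    )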

What I want is a way to produce another DataFrame with columns: data, start_date, end_date, and num_holidays,

where num_holidays is the number of days between start_date and end_date (inclusive) that are neither weekends nor holidays (as listed in the holidays table).

There's an existing solution for doing this in T-SQL (the code below uses SQL Server's DATEDIFF/DATENAME, not PL/SQL). The crux is this part of the code:

    --Calculate and return the number of workdays using the input parameters.
    --This is the meat of the function.
    --This is really just one formula with a couple of parts that are listed on separate lines for documentation purposes.
    RETURN (
        SELECT
        --Start with total number of days including weekends
        (DATEDIFF(dd,@StartDate, @EndDate)+1)
        --Subtract 2 days for each full weekend
        -(DATEDIFF(wk,@StartDate, @EndDate)*2)
        --If StartDate is a Sunday, Subtract 1
        -(CASE WHEN DATENAME(dw, @StartDate) = 'Sunday'
            THEN 1
            ELSE 0
        END)
        --If EndDate is a Saturday, Subtract 1
        -(CASE WHEN DATENAME(dw, @EndDate) = 'Saturday'
            THEN 1
            ELSE 0
        END)
        --Subtract all holidays
        -(Select Count(*) from [dbo].[tblHolidays]
          where  [HolDate] between @StartDate and @EndDate )
        )
    END
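
A non-UDF way to express the same logic in PySpark is to explode each row into its individual days, drop weekends, anti-join against the holidays table, and count what remains. A minimal sketch, assuming Spark 2.4+ (for F.sequence) and that facts and holidays are the two DataFrames described above:

    from pyspark.sql import functions as F

    # One row per calendar day in each [start_date, end_date] interval (inclusive)
    days = facts.withColumn("day", F.explode(F.sequence("start_date", "end_date")))

    result = (
        days
        # dayofweek: 1 = Sunday, 7 = Saturday
        .where(~F.dayofweek("day").isin(1, 7))
        # Drop days listed in the holidays table; broadcast it since it is small
        .join(F.broadcast(holidays), F.col("day") == F.col("holiday_date"), "left_anti")
        .groupBy("data", "start_date", "end_date")
        .agg(F.count("*").alias("num_holidays"))
    )

One caveat: rows whose range contains no business days drop out of the aggregation; a left join back to facts would restore them with a count of 0.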

I'm new to PySpark and was wondering: what's an efficient way to do this?

  • Is there a better way than creating a UDF that reads the holidays table into a DataFrame and joins with it to count the holidays?
  • Is there a way to write a pandas_udf instead? Would it be fast enough? (One possible approach is sketched after this list.)
  • Are there optimizations I can apply, like caching or broadcasting the holidays table to every worker?
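
On the pandas_udf question: since the holidays table is typically small, one option is to collect it to the driver and lean on NumPy's vectorized np.busday_count, which handles weekends and a holiday list in a single call. A sketch, assuming Spark 3.x type-hinted pandas_udfs; count_business_days and holiday_dates are illustrative names:

    import numpy as np
    import pandas as pd
    from pyspark.sql import functions as F
    from pyspark.sql.types import LongType

    # Collect the (small) holidays table once; Spark ships it to workers in the UDF's closure
    holiday_dates = np.array(
        [row.holiday_date for row in holidays.select("holiday_date").collect()],
        dtype="datetime64[D]",
    )

    @F.pandas_udf(LongType())
    def count_business_days(start: pd.Series, end: pd.Series) -> pd.Series:
        begin = pd.to_datetime(start).values.astype("datetime64[D]")
        # busday_count treats the end date as exclusive, so shift it by one day
        finish = (pd.to_datetime(end) + pd.Timedelta(days=1)).values.astype("datetime64[D]")
        return pd.Series(np.busday_count(begin, finish, holidays=holiday_dates))

    result = facts.withColumn("num_holidays", count_business_days("start_date", "end_date"))

Because the holiday list rides along in the UDF's closure, every worker already gets its own copy; an explicit broadcast variable is only worth considering if the table is large.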

Tags: pyspark, apache-spark, sql
