How to use multiple cross-section observations per subject for churn prediction?

Question

How to use multiple cross-section observations per subject for churn prediction?

Márton Szekeres

2020年7月13日 20:18

Recently I have started to teach myself about machine learning and I have ran into a dataset, which got me a bit confused.

Dataset: The subjects of the dataset are university students (student ID == Key feature), and each observation is a summary of their semester (grade averages, ECTS taken and completed, etc.) plus their general programme-related data (enrollment and scholarship status, date of enrollment, programme code, etc.). The data is in hungarian, but in the context of the issue, it is not important to understand the meaning of the feature names and values. Below is an example of an observation:

My goal: I want to build a model, which predicts student churn.

Problem: The dataset contains a single or multiple observations per student, based on the number of university semesters, and the observation periods are not consistent between the students, since it is based on the individual date of enrollment.

In the example picture above, you can see, that student no.1 has 7 observations (=7 semesters completed) and started his programme on 2009.09.10 (Képzés jogviszony kezdete == Date of programme enrollment), while student no.2 has 3 observations and started his programme on 2008.09.12.

I am wondering, should I use only one observation (e.g.: the last completed university semester) per student ,or does it make sense to use all observations per student?

Thank you for the feedbacks in advance!

(Also, I am new on the forum, so if you have any constructive criticism regarding the content and the format of my question, please, share with me.)

Topic churn python machine-learning

Category Data Science

Josh · Accepted Answer · 2020年7月13日 20:18

If you know SQL, then my method might work. I'll break it out into subqueries to make it a little more understandable. I don't know your columns so this might not translate perfectly.

The point is to have 1 query identify all active students at a particular point in time. The point of time you choose is arbitrary, but it must have enough time forwards enough to the end of your data so that you can establish a forward-looking window to satisfy churn or not. I am calling this the snapshot_date. Don't have students who churned before the shapshot_date in this resultset at all.

Then, you join this to a subquery which uses that shapshot_date and looks forward for the churn condition. This returns only students who churned, but churned within the window that you decide. If they churned in the future beyond your window, treat them as if they have not churned.

The very first query that returns a list of active students might still return multiple rows for each student. You handle this by deciding how to aggregate those rows into a single row for each customer. You might end up creating a lot of good variables for your model this way. I added a few as an example.

Then, your SQL would look similar to this:

SELECT active_students.*, 
case when student_churned is not null then 1 else 0 end as target_student_churn
FROM
    (
        SELECT Key,
        DateDiff(snapshot_date, date_of_enrollment, "Days") as time_since_enrollment,
        min(date_of_enrollment) as date_of_enrollment,
        max(felev) as max_felev,
        COUNT(DISTINCT course) as distinct_courses_taken,
        ...
        FROM table
        WHERE date_of_enrollment < snapshot_date
        AND {student has not churned as of snapshot date}
        GROUP BY Key
    ) as active_students
LEFT OUTER JOIN 
    (
    SELECT Key
    FROM table
    WHERE {student churned within hard coded time period AFTER snapshot date}
    GROUP BY Key
    ) as student_churned
ON active_students.Key = student_churned.Key;

You may wonder then how this accounts for seasonality over time if you chose your snapshot date randomly. Well, just change the snapshot_date and repeat while you continue to stack the results on top of one another. Having the snapshot date as a column will help you extract the "season" from each resultset.

Now the students might be repeated again in the final dataset. So, when you build the model you want to make sure that the students' rows fall into the same validation or holdout dataset. So use GroupKFold or something like that. I hope this helps.

How to use multiple cross-section observations per subject for churn prediction?

About