How should I use multiple cross-sectional observations per subject for churn prediction?
I recently started teaching myself machine learning and have run into a dataset that has me a bit confused.
Dataset: The subjects of the dataset are university students (student ID == Key feature), and each observation is a summary of one semester (grade averages, ECTS taken and completed, etc.) plus the student's general programme-related data (enrollment and scholarship status, date of enrollment, programme code, etc.). The data is in Hungarian, but for this question it is not important to understand the meaning of the feature names and values. Below is an example of an observation:
My goal: I want to build a model that predicts student churn.
Problem: The dataset contains one or more observations per student, one for each semester they spent at the university, and the observation periods are not aligned between students, because each student's periods depend on their individual date of enrollment.
In the example picture above, you can see that student no.1 has 7 observations (= 7 semesters completed) and started their programme on 2009.09.10 (Képzés jogviszony kezdete == Date of programme enrollment), while student no.2 has 3 observations and started their programme on 2008.09.12.
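To make the structure concrete, here is a minimal sketch of how I picture the table in pandas. The column names (`student_id`, `calendar_term`, `gpa`, `semester_no`) and all grade values are made up for illustration; only the observation counts (7 and 3) and the relative start dates mirror the example above. It counts the observations per student and numbers each student's semesters relative to their own start, which is one possible way to line students up despite different enrollment dates.

```python
import pandas as pd

# Toy stand-in for the real table: one row per student per semester.
# "calendar_term" is a hypothetical counter of calendar semesters
# (0 = autumn 2008, 1 = spring 2009, 2 = autumn 2009, ...).
records = []
for student_id, first_term, n_semesters in [(1, 2, 7), (2, 0, 3)]:
    for k in range(n_semesters):
        records.append({
            "student_id": student_id,
            "calendar_term": first_term + k,
            "gpa": 3.0 + 0.1 * k,  # made-up grade averages
        })
panel = pd.DataFrame(records)

# Number the semesters relative to each student's own first semester,
# so "semester 1" means the same thing for every student.
panel = panel.sort_values(["student_id", "calendar_term"])
panel["semester_no"] = panel.groupby("student_id").cumcount() + 1

# How many observations does each student have?
obs_per_student = panel.groupby("student_id").size()
print(obs_per_student)
```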
Should I use only one observation per student (e.g. the last completed university semester), or does it make sense to use all observations per student?
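To show what I mean by the two options, here is a rough sketch on toy data. All column names and values are invented, and I am not claiming either option is the right approach. Option A keeps only the last semester per student; option B condenses every semester into a single summary row per student.

```python
import pandas as pd

# Toy data, invented for illustration: one row per student per semester.
panel = pd.DataFrame({
    "student_id":     [1, 1, 1, 2, 2],
    "semester_no":    [1, 2, 3, 1, 2],
    "gpa":            [3.0, 3.5, 4.0, 2.0, 3.0],
    "ects_completed": [30, 25, 30, 20, 10],
})
panel = panel.sort_values(["student_id", "semester_no"])

# Option A: keep only the last completed semester of each student.
last_only = panel.groupby("student_id").tail(1).set_index("student_id")

# Option B: use all semesters, condensed into one row per student.
summary = panel.groupby("student_id").agg(
    n_semesters=("semester_no", "max"),
    gpa_mean=("gpa", "mean"),
    gpa_last=("gpa", "last"),
    ects_total=("ects_completed", "sum"),
)

print(last_only)
print(summary)
```

Either table has exactly one row per student, so it could be joined with a churn label and passed to a standard classifier.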
Thank you in advance for any feedback!
(Also, I am new to the forum, so if you have any constructive criticism regarding the content or format of my question, please share it with me.)
Topic churn python machine-learning
Category Data Science