Predicting High-School test scores after a disciplinary action

I'm somewhat new to machine learning and have learned to apply many of the basic regression and classification methods using python and various packages. However, approaching this problem has me stumped. To illustrate the problem, I created a fictitious scenario where a guidance counselor wants to predict test scores for a student after disciplinary action. Suppose they have data available like the mock-up below:

Column definition:

Student - Student Identification #

Gender - Male/Female

Age - Current Age

Athlete - Sport that student plays (0-None, 1-Basketball, 2-Football, 3-Soccer)

Online - Student takes classes online (0-No, 1-Yes)

Before_Disciplinary_Action_Scores - Sequence of their last n test scores prior to discipline (in date order)

Discplinary_Action - Action Counselor Took (0-None, 1-Assigned Tutor, 2-Guardian Meeting, 3-Study Program, 4-Ineligible for game)

After_Disciplinary_Action_Scores - Sequence of their next n test scores after discipline (in date order)

Assume the fictitious school system is quite large and there are about 80k records in total. All students have a total of 12 test scores from Before/After discipline action, but the number of test scores varies based on when the disciplinary action was given. For simplicity, you can assume the scores are given on a weekly basis for a single course.

I can compute mean before/after scores and create a fairly good classification model, but I'd like to take it a step further and predict the post-discipline scores.

I've tried using Prophet and LSTM models to predict the after scores from a time series approach with poor results and the varying number of scores makes it difficult. I can see that the features around Athlete, Transfer, and Online are important components and I tried adding them as regressors, but that too has failed. I appreciate any guidance you can give.

Topic methods python machine-learning

Category Data Science


I take it you are really interested in the prediction of new test scores rather than answering which intervention method is the most succesfull. If this is not the case we can apply statistical methods rather than machine learning models.

One approach could be to align the test score sequences (input and output) so that they have the same length and a 0 for missing values. This transposes your data into a regular seq2seq problem feedable to an LSTM. Each record is now comparable. Do keep in mind that your model predicts based on how it was trained. It will predict missing values (missed tests) as well like this.

You could then either train on all 80k (minus validation set) sequences or apply windowing to generate more sequences.

Other (one hot encoded) features can be added to each sample as well. Beginning or end.


If your data is correct then i see a factor of gender also. I hope that is taken into account and also take Disp_action as categorical variable (one hot encoding).
Also, try to club pre and post action scores in one fields as output variable. Another field can simply tell if the output score is pre action or post action.
This feature engineering, I hope will give better results.

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.