How can I improve an NLP model trained on imbalanced data?
I want to predict, as a probability, how ill each patient is, and then pick the 10 most ill patients in a hospital. For each patient and each day I have condition notes, medical notes, diagnosis notes, and lab notes.
Current approach:
- Vectorize all the notes using spaCy's scispaCy model and sum the vectors, grouped by patient ID and day (200 columns).
- Normalize the summed vectors to unit length (200 columns).
- Apply a moving average to the vectors, grouped by patient ID and day (200 columns).
- Normalize the moving-average vectors to unit length (200 columns).
- Concatenate all of the above columns and use them as independent features.
- Train a LightGBM (lgbm) classifier on those features.
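A minimal sketch of the feature steps above, in plain Python so it runs without scispaCy or LightGBM installed. It starts from note vectors that have already been summed per patient and day. The names `unit` and `build_features`, the trailing `window` size, and the toy 2-dimensional vectors are all illustrative assumptions. It also assumes the moving average is taken over the summed vectors (not the unit vectors) and looks backward in time only.

```python
import math
from collections import defaultdict


def unit(vec):
    """Scale vec to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def build_features(rows, window=3):
    """rows: iterable of (patient_id, day, summed_note_vector).

    Returns {(patient_id, day): features}, where features concatenates
    the summed vector, its unit vector, a trailing moving average of the
    summed vectors, and the unit vector of that moving average.
    `window` is a hypothetical parameter; the post does not state one.
    """
    by_patient = defaultdict(list)
    for patient_id, day, vec in rows:
        by_patient[patient_id].append((day, list(vec)))

    features = {}
    for patient_id, days in by_patient.items():
        days.sort(key=lambda item: item[0])  # chronological order per patient
        for i, (day, vec) in enumerate(days):
            # Trailing window: the current day plus up to window - 1 earlier days.
            recent = [v for _, v in days[max(0, i - window + 1): i + 1]]
            avg = [sum(col) / len(recent) for col in zip(*recent)]
            features[(patient_id, day)] = vec + unit(vec) + avg + unit(avg)
    return features


# Toy data: 2-dimensional vectors stand in for the 200-dimensional ones.
rows = [
    ("p1", 1, [3.0, 4.0]),
    ("p1", 2, [1.0, 0.0]),
    ("p2", 1, [0.0, 2.0]),
]
feats = build_features(rows, window=2)
```

Each row of `feats` has four times the embedding width, which matches the four 200-column blocks listed above.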
The data is imbalanced, and the current AUC-ROC is around 0.78.
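For reference, AUC-ROC can be read as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, which is why it depends only on the ranking of the predicted probabilities. The sketch below shows that pairwise definition together with the top-10 selection step. The function names `auc_roc` and `top_k` and all labels and scores are made up for illustration; they are not results from my data.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs where the
    positive has the higher score; tied pairs count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def top_k(patient_scores, k=10):
    """Return the k patient ids with the highest predicted probability."""
    return sorted(patient_scores, key=patient_scores.get, reverse=True)[:k]


# Made-up, imbalanced toy labels (4 negatives, 2 positives) and scores.
labels = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
auc = auc_roc(labels, scores)

# Made-up predicted probabilities keyed by a hypothetical patient id.
patient_scores = {"p1": 0.2, "p2": 0.9, "p3": 0.5}
most_ill = top_k(patient_scores, k=2)
```

The pairwise loop is quadratic in the number of examples, so it is only meant to illustrate the definition on small inputs.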
What else can I do to improve the AUC-ROC? Can I use BERT for this problem, and if so, how should I use it?
I'm currently using a moving average because a patient's health deteriorates over time.
Any suggestions, answers, or feedback would be appreciated.
Topic allennlp bert language-model nlp
Category Data Science