How can I improve my NLP model on imbalanced data?

I want to predict a probability that each patient is ill and retrieve the top 10 most ill patients in a hospital. For each patient I have condition notes, medical notes, diagnosis notes, and lab notes for each day.

Current approach:

  1. Vectorize all the notes using spaCy's scispacy model and sum the vectors grouped by patient ID and day (200 columns).
  2. Compute the unit vectors of the summed vectors (200 columns).
  3. Apply a moving average to those vectors, grouped by patient ID and day (200 columns).
  4. Compute the unit vectors of the moving-average vectors (200 columns).
  5. Concatenate all of the above columns and use them as independent features.
  6. Train a LightGBM classifier.
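The feature-building steps above can be sketched with NumPy and pandas. This is a toy illustration, not the asker's actual code: the random vectors stand in for summed scispacy note embeddings (4 dimensions here instead of 200), and the window size of 2 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# One row per (patient, day); vecs stands in for the summed scispacy
# note embeddings (4 dims for illustration, 200 in the real pipeline).
df = pd.DataFrame({"patient_id": [1, 1, 1, 2, 2],
                   "day":        [1, 2, 3, 1, 2]})
vecs = rng.normal(size=(len(df), 4))

# Step 2: unit vectors (L2-normalise each row).
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

# Step 3: moving average over days within each patient (window of 2 is
# an arbitrary choice here).
vec_cols = [f"v{i}" for i in range(unit.shape[1])]
frame = pd.DataFrame(unit, columns=vec_cols)
frame["patient_id"] = df["patient_id"].to_numpy()
ma = (frame.groupby("patient_id")[vec_cols]
           .rolling(window=2, min_periods=1).mean()
           .to_numpy())

# Step 4: unit vectors of the moving averages.
ma_unit = ma / np.linalg.norm(ma, axis=1, keepdims=True)

# Step 5: concatenate into the feature matrix fed to LightGBM.
X = np.hstack([unit, ma_unit])
print(X.shape)  # (5, 8)
```

Note that `groupby(...).rolling(...)` keeps rows in their within-group order, so the moving-average rows stay aligned with the original rows as long as the frame is sorted by patient and day.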

The data is imbalanced, and the current AUC-ROC is around 0.78.
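One cheap thing to try for the imbalance is LightGBM's built-in class weighting. A minimal sketch, using toy labels (the 90/10 split is illustrative, not from the question):

```python
import numpy as np

# Toy imbalanced labels: 90 negatives, 10 positives (illustrative only).
y = np.array([0] * 90 + [1] * 10)

# A common heuristic: scale_pos_weight = n_negative / n_positive.
scale_pos_weight = float((y == 0).sum()) / (y == 1).sum()
print(scale_pos_weight)  # 9.0

# Pass it to the classifier (commented out so the sketch runs without lightgbm):
# from lightgbm import LGBMClassifier
# clf = LGBMClassifier(scale_pos_weight=scale_pos_weight)
# ...or set is_unbalance=True and let LightGBM compute the ratio itself.
```

Since the goal is a ranking (top-10 most ill patients) and the metric is AUC-ROC, reweighting mainly shifts the probability scale; it is still worth trying alongside resampling.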

What else can I do to improve the AUC-ROC? Can I use BERT for this problem, and if so, how?

I'm currently using a moving average because a patient's health deteriorates over time.

Any suggestions or feedback?

Topic allennlp bert language-model nlp

Category Data Science


Your best bet is to use the ktrain Python module. There are example notebooks for every NLP task, and they are simple and easy to follow.

I'm not sure whether your data is labeled. Assuming it is not, I'd go with the text regression example. Alternatively, you can choose the text classification example and rephrase your problem so that you incrementally reach the final probability you're after.

I would also encourage you to look through all the examples for inspiration on how to tackle your specific use case. The module supports AutoNLP with a range of data-preprocessing tools, and you can choose any model from the Hugging Face library.
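To answer the BERT part concretely: a minimal ktrain sketch for binary text classification might look like the following. Everything here is an assumption about your setup: `train_notes`/`val_notes` are hypothetical lists where each entry is a patient-day's notes concatenated into one string, the labels are binary, and `maxlen`, `batch_size`, and the learning-rate schedule are illustrative defaults, not tuned values.

```python
import ktrain
from ktrain import text

# Hypothetical data: one concatenated note string per patient-day,
# with a binary ill/healthy label for each.
train_notes, train_labels = ["...note text..."], [0]   # placeholders
val_notes, val_labels = ["...note text..."], [1]       # placeholders

# Preprocess the raw texts for BERT (truncating/padding to 256 tokens).
trn, val, preproc = text.texts_from_array(
    x_train=train_notes, y_train=train_labels,
    x_test=val_notes, y_test=val_labels,
    class_names=["healthy", "ill"],
    preprocess_mode="bert", maxlen=256)

# Build a BERT classifier and wrap it in a Learner.
model = text.text_classifier("bert", train_data=trn, preproc=preproc)
learner = ktrain.get_learner(model, train_data=trn, val_data=val,
                             batch_size=16)

# Fine-tune with a one-cycle schedule (2e-5 for 3 epochs is a common
# starting point, not a tuned value).
learner.fit_onecycle(2e-5, 3)

# Predicted probabilities, which you can sort to rank the top-10
# most ill patients.
predictor = ktrain.get_predictor(learner.model, preproc)
probs = predictor.predict(val_notes, return_proba=True)
```

Since the data is imbalanced, you would still want to evaluate the fine-tuned model with AUC-ROC on a held-out set rather than accuracy.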
