How can I improve my NLP model on imbalanced data?

I want to predict a probability that each patient is ill and retrieve the top 10 most ill patients in a hospital. For each patient I have condition notes, medical notes, diagnosis notes, and lab notes for each day.

Current approach:

  1. Vectorize all the notes using spaCy's scispacy model and sum the vectors grouped by patient ID and day (200 columns).
  2. Compute the unit vectors of the summed vectors (200 columns).
  3. Apply a moving average to those vectors, grouped by patient ID and day (200 columns).
  4. Compute the unit vectors of the moving-average vectors (200 columns).
  5. Concatenate all of the above columns and use them as independent features.
  6. Train a LightGBM classifier.
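The feature-building steps above can be sketched with NumPy and pandas. This is a toy illustration, not the asker's actual code: the random vectors stand in for summed scispacy note embeddings (4 dimensions here instead of 200), and the window size of 2 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# One row per (patient, day); vecs stands in for the summed scispacy
# note embeddings (4 dims for illustration, 200 in the real pipeline).
df = pd.DataFrame({"patient_id": [1, 1, 1, 2, 2],
                   "day":        [1, 2, 3, 1, 2]})
vecs = rng.normal(size=(len(df), 4))

# Step 2: unit vectors (L2-normalise each row).
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

# Step 3: moving average over days within each patient (window of 2 is
# an arbitrary choice here).
vec_cols = [f"v{i}" for i in range(unit.shape[1])]
frame = pd.DataFrame(unit, columns=vec_cols)
frame["patient_id"] = df["patient_id"].to_numpy()
ma = (frame.groupby("patient_id")[vec_cols]
           .rolling(window=2, min_periods=1).mean()
           .to_numpy())

# Step 4: unit vectors of the moving averages.
ma_unit = ma / np.linalg.norm(ma, axis=1, keepdims=True)

# Step 5: concatenate into the feature matrix fed to LightGBM.
X = np.hstack([unit, ma_unit])
print(X.shape)  # (5, 8)
```

Note that `groupby(...).rolling(...)` keeps rows in their within-group order, so the moving-average rows stay aligned with the original rows as long as the frame is sorted by patient and day.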

The data is imbalanced, and the current AUC-ROC is around 0.78.
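One cheap thing to try for the imbalance is LightGBM's built-in class weighting. A minimal sketch, using toy labels (the 90/10 split is illustrative, not from the question):

```python
import numpy as np

# Toy imbalanced labels: 90 negatives, 10 positives (illustrative only).
y = np.array([0] * 90 + [1] * 10)

# A common heuristic: scale_pos_weight = n_negative / n_positive.
scale_pos_weight = float((y == 0).sum()) / (y == 1).sum()
print(scale_pos_weight)  # 9.0

# Pass it to the classifier (commented out so the sketch runs without lightgbm):
# from lightgbm import LGBMClassifier
# clf = LGBMClassifier(scale_pos_weight=scale_pos_weight)
# ...or set is_unbalance=True and let LightGBM compute the ratio itself.
```

Since the goal is a ranking (top-10 most ill patients) and the metric is AUC-ROC, reweighting mainly shifts the probability scale; it is still worth trying alongside resampling.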

What else can I do to improve the AUC-ROC? Can I use BERT for this problem, and if so, how?

I'm currently using a moving average because a patient's health deteriorates over time.

Any suggestions or feedback?

Topic allennlp bert language-model nlp

Category Data Science


Your best bet is to use the ktrain Python module. There are example notebooks for every NLP task, and they are simple and easy to follow.

I'm not sure whether your data is labeled. Assuming it is not, I'd go with the text regression example. Alternatively, you can choose the text classification example and rephrase your problem so that you incrementally reach the final probability you're after.

I would also encourage you to look through all the examples for inspiration on how to tackle your specific use case. The module supports AutoNLP with a range of data-preprocessing tools, and you can choose any model from the Hugging Face library.
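To answer the BERT part concretely: a minimal ktrain sketch for binary text classification might look like the following. Everything here is an assumption about your setup: `train_notes`/`val_notes` are hypothetical lists where each entry is a patient-day's notes concatenated into one string, the labels are binary, and `maxlen`, `batch_size`, and the learning-rate schedule are illustrative defaults, not tuned values.

```python
import ktrain
from ktrain import text

# Hypothetical data: one concatenated note string per patient-day,
# with a binary ill/healthy label for each.
train_notes, train_labels = ["...note text..."], [0]   # placeholders
val_notes, val_labels = ["...note text..."], [1]       # placeholders

# Preprocess the raw texts for BERT (truncating/padding to 256 tokens).
trn, val, preproc = text.texts_from_array(
    x_train=train_notes, y_train=train_labels,
    x_test=val_notes, y_test=val_labels,
    class_names=["healthy", "ill"],
    preprocess_mode="bert", maxlen=256)

# Build a BERT classifier and wrap it in a Learner.
model = text.text_classifier("bert", train_data=trn, preproc=preproc)
learner = ktrain.get_learner(model, train_data=trn, val_data=val,
                             batch_size=16)

# Fine-tune with a one-cycle schedule (2e-5 for 3 epochs is a common
# starting point, not a tuned value).
learner.fit_onecycle(2e-5, 3)

# Predicted probabilities, which you can sort to rank the top-10
# most ill patients.
predictor = ktrain.get_predictor(learner.model, preproc)
probs = predictor.predict(val_notes, return_proba=True)
```

Since the data is imbalanced, you would still want to evaluate the fine-tuned model with AUC-ROC on a held-out set rather than accuracy.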
