Choosing an explainable embedding and classifier when each document has only one sentence

I have a dataset with a corpus of 20K documents.

Each document is a single short sentence.

I need to classify each sentence into one of two (0/1) classes, and also be able to point out exactly which words are responsible for that decision.

To make it more concrete, one of the tasks is Unclear vs. Clear. The user makes a request, and we try to guess whether the request is clear enough to be understood and processed by someone else.

Then we want to show them which words were responsible for that feedback.

"Please take the red thing into the bin." → Unclear

"Please be at work at 8 A.M., tomorrow." → Clear

"Try to be less strange." → Unclear

Our current design ideas are:

  • Bag of words
  • With binary weighting of tokens (0/1)
  • Using both words and the most common n-grams as tokens
  • Univariate feature selection
  • Logistic regression

Or:

  • Bag of words
  • With binary weighting of tokens (0/1)
  • Using only words as tokens
  • Using a two-way ANOVA as the predictive model

Or:

  • Embedding each part of speech as a fastText/word2vec average
  • Creating a fixed-length vector over parts of speech, filled with zeros where a POS is absent
  • Feature selection with Univariate
  • Logistic regression
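A NumPy-only sketch of the POS-averaging step; here `VECTORS` stands in for real fastText/word2vec lookups and the `POS` dict for a real tagger (spaCy, NLTK) — both are hypothetical placeholders:

```python
# Option 3 sketch: average word vectors per part of speech, then
# concatenate the per-POS averages into one fixed-length vector.
import numpy as np

DIM = 4  # toy dimension; real fastText/word2vec would be 100-300
POS_SLOTS = ["NOUN", "VERB", "ADJ", "ADV"]  # fixed slot order

# Hypothetical embeddings; in practice, load pretrained vectors.
rng = np.random.default_rng(0)
VECTORS = {w: rng.normal(size=DIM) for w in ["take", "red", "thing", "bin", "strange"]}

# Hypothetical tagger output; in practice, use spaCy or NLTK.
POS = {"take": "VERB", "red": "ADJ", "thing": "NOUN", "bin": "NOUN", "strange": "ADJ"}

def sentence_vector(tokens):
    """One averaged embedding per POS slot, zero-filled when a POS
    (or its vectors) is absent from the sentence."""
    out = []
    for pos in POS_SLOTS:
        vecs = [VECTORS[t] for t in tokens if POS.get(t) == pos and t in VECTORS]
        out.append(np.mean(vecs, axis=0) if vecs else np.zeros(DIM))
    return np.concatenate(out)

v = sentence_vector(["take", "the", "red", "thing", "into", "the", "bin"])
print(v.shape)  # (16,) = 4 POS slots x 4 dims
```

Note that with this design the explanation is coarser than in the bag-of-words options: the classifier's weights attach to POS-slot dimensions, not to individual words.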

Or:

  • BERT classifier
  • SHAP for explainability
  • Tested; works well but is very compute-intensive

If you have experience in this field, do you have any ideas that could be relevant?

Thanks

