Choosing an explainable embedding and classifier when each document has only one sentence

I have a dataset with a corpus of 20K documents.

Each document is a single short sentence.

I need to classify each sentence into one of two (0/1) classes, and also be able to point out exactly which words are responsible for that decision.

To make it more concrete, one of the tasks is Unclear vs. Clear. The user makes a request, and we try to guess whether the request is clear enough to be understood and processed by someone else.

Then we want to show them which words were responsible for that feedback.

"Please take the red thing into the bin." → Unclear

"Please be at work at 8 A.M., tomorrow." → Clear

"Try to be less strange." → Unclear

Our current design ideas are:

  • Bag of words
  • With binary weighting of tokens (0/1)
  • Using both words and the most common n-grams as tokens
  • Univariate feature selection
  • Logistic regression

Or:

  • Bag of words
  • With binary weighting of tokens (0/1)
  • Using only words as tokens
  • Using a two-way ANOVA as the predictive model

Or:

  • Embedding each part of speech as a fastText/word2vec average
  • Creating a fixed-length vector over parts of speech, filled with zeros where a POS is absent
  • Feature selection with Univariate
  • Logistic regression
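A NumPy-only sketch of the POS-averaging step; here `VECTORS` stands in for real fastText/word2vec lookups and the `POS` dict for a real tagger (spaCy, NLTK) — both are hypothetical placeholders:

```python
# Option 3 sketch: average word vectors per part of speech, then
# concatenate the per-POS averages into one fixed-length vector.
import numpy as np

DIM = 4  # toy dimension; real fastText/word2vec would be 100-300
POS_SLOTS = ["NOUN", "VERB", "ADJ", "ADV"]  # fixed slot order

# Hypothetical embeddings; in practice, load pretrained vectors.
rng = np.random.default_rng(0)
VECTORS = {w: rng.normal(size=DIM) for w in ["take", "red", "thing", "bin", "strange"]}

# Hypothetical tagger output; in practice, use spaCy or NLTK.
POS = {"take": "VERB", "red": "ADJ", "thing": "NOUN", "bin": "NOUN", "strange": "ADJ"}

def sentence_vector(tokens):
    """One averaged embedding per POS slot, zero-filled when a POS
    (or its vectors) is absent from the sentence."""
    out = []
    for pos in POS_SLOTS:
        vecs = [VECTORS[t] for t in tokens if POS.get(t) == pos and t in VECTORS]
        out.append(np.mean(vecs, axis=0) if vecs else np.zeros(DIM))
    return np.concatenate(out)

v = sentence_vector(["take", "the", "red", "thing", "into", "the", "bin"])
print(v.shape)  # (16,) = 4 POS slots x 4 dims
```

Note that with this design the explanation is coarser than in the bag-of-words options: the classifier's weights attach to POS-slot dimensions, not to individual words.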

Or:

  • BERT classifier
  • SHAP for explainability
  • Tested; works well but is very compute-intensive

If you have experience in this field, do you have any ideas that could be relevant?

Thanks

