Text classification model misclassifying new text?

I am trying to solve a binary classification problem. My labels are abusive (1) and non-abusive (0). My dataset was imbalanced (more 0s than 1s) and I used oversampling of the minority label (i.e. 1) to balance it. I have also done pre-processing and feature engineering using TF-IDF, then fed the dataset into a pipeline with three classification algorithms: Logistic Regression, SVM, and Decision Tree.
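Roughly, the setup looks like this (a minimal sketch with a toy corpus as a placeholder, shown for one of the three classifiers; the exact parameters are illustrative):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.utils import resample
    import pandas as pd

    # Toy stand-in for the real corpus: 1 = abusive (minority), 0 = non-abusive
    df = pd.DataFrame({
        "text": ["you are awful", "have a nice day", "idiot", "great work",
                 "lovely weather", "thanks a lot", "see you soon", "what a fool"],
        "label": [1, 0, 1, 0, 0, 0, 0, 1],
    })
    train, test = train_test_split(df, test_size=0.25, stratify=df["label"],
                                   random_state=0)

    # Oversample the minority class (1) in the training split only,
    # so duplicated rows cannot leak into the test set
    minority = train[train["label"] == 1]
    majority = train[train["label"] == 0]
    minority_up = resample(minority, replace=True, n_samples=len(majority),
                           random_state=0)
    train_bal = pd.concat([majority, minority_up])

    # TF-IDF features feeding one of the three classifiers
    pipe = Pipeline([("tfidf", TfidfVectorizer()),
                     ("clf", LogisticRegression(max_iter=1000))])
    pipe.fit(train_bal["text"], train_bal["label"])
    print(pipe.score(test["text"], test["label"]))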

My evaluation metrics are:

    Logistic Regression:
    [[376  33]
     [ 18  69]]
                  precision    recall  f1-score   support

               0       0.95      0.92      0.94       409
               1       0.68      0.79      0.73        87

        accuracy                           0.90       496
       macro avg       0.82      0.86      0.83       496
    weighted avg       0.91      0.90      0.90       496

    SVM:
    [[383  26]
     [ 23  64]]
                  precision    recall  f1-score   support

               0       0.94      0.94      0.94       409
               1       0.71      0.74      0.72        87

        accuracy                           0.90       496
       macro avg       0.83      0.84      0.83       496
    weighted avg       0.90      0.90      0.90       496

    Decision Tree:
    [[383  26]
     [ 28  59]]
                  precision    recall  f1-score   support

               0       0.93      0.94      0.93       409
               1       0.69      0.68      0.69        87

        accuracy                           0.89       496
       macro avg       0.81      0.81      0.81       496
    weighted avg       0.89      0.89      0.89       496

The issue I'm facing is that certain new abusive text is being predicted as non-abusive. I think my false positive (FP) and false negative (FN) rates are too high and need to be reduced. Do you have any suggestions on how to reduce FP and FN, or any other suggestions for my issue? Thanks.


All three algorithms give very similar results, and judging from the evaluation sample size, the training set is not very large. That suggests the opportunities, if any, lie in: a) feature engineering, b) not predicting on low-confidence cases, and c) getting more data so you can train more complex algorithms.

a) Feature engineering - TF-IDF or count vectorizers have a real-world problem: test-time words that fall outside the training vocabulary. If you can embed both the train and test sets using a general-language vocabulary, results should improve. Open-source pretrained embeddings such as USE (Universal Sentence Encoder) and GloVe exist for exactly this.
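A minimal sketch of the averaged-embedding approach, assuming gensim is installed (the model name, toy texts, and classifier choice are illustrative; USE would instead be loaded via TensorFlow Hub):

    import numpy as np
    import gensim.downloader
    from sklearn.linear_model import LogisticRegression

    glove = gensim.downloader.load("glove-wiki-gigaword-100")  # pretrained vectors

    def embed(text):
        # Average the vectors of in-vocabulary tokens; zeros if none are known
        tokens = [t for t in text.lower().split() if t in glove]
        if not tokens:
            return np.zeros(glove.vector_size)
        return np.mean([glove[t] for t in tokens], axis=0)

    texts = ["you are awful", "have a nice day"]   # toy stand-in for the corpus
    labels = [1, 0]
    X = np.vstack([embed(t) for t in texts])
    clf = LogisticRegression(max_iter=1000).fit(X, labels)

Because the vectors come from a general corpus, unseen test words still map near related training words instead of being dropped entirely.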

b) Prediction confidence - Along with the class prediction you can also get the probability of each class. Check below which probability cutoff your F1 score becomes too low, and don't predict for those low-probability cases. Most practical systems accept this limitation of AI.
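A sketch of that abstention rule, assuming a scikit-learn classifier with predict_proba (the synthetic data and the 0.7 cutoff are illustrative stand-ins; tune the cutoff on a validation set):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, weights=[0.8], random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

    proba = clf.predict_proba(X_val)     # shape (n_samples, 2)
    confidence = proba.max(axis=1)       # probability of the predicted class
    threshold = 0.7                      # illustrative; pick where F1 degrades

    keep = confidence >= threshold       # predict only on confident cases
    print("coverage:", keep.mean())      # fraction of cases actually predicted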

c) More data will allow you to train more complex algorithms such as boosting, which may improve results. Hopefully you are already cross-validating.
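For instance, a cross-validated check of a boosted model might look like this, sketched on synthetic stand-in data (scikit-learn's GradientBoostingClassifier is one option; XGBoost or LightGBM are common alternatives):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, weights=[0.8], random_state=0)
    # 5-fold cross-validated F1, a more robust estimate than a single split
    scores = cross_val_score(GradientBoostingClassifier(random_state=0),
                             X, y, cv=5, scoring="f1")
    print(scores.mean(), scores.std())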

Also, based on the cost of each type of error, you can decide whether FPs or FNs should be prioritized, and optimize precision or recall accordingly.
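Two common scikit-learn knobs for that trade-off, sketched on synthetic stand-in data (the class weight of 3 and the 0.35 threshold are illustrative values to tune):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, weights=[0.8], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # Knob 1: class weights -- make a missed positive (FN) cost more than an FP
    clf = LogisticRegression(class_weight={0: 1, 1: 3},
                             max_iter=1000).fit(X_tr, y_tr)

    # Knob 2: decision threshold -- below 0.5 trades precision for recall on class 1
    preds = (clf.predict_proba(X_te)[:, 1] >= 0.35).astype(int)
    print(classification_report(y_te, preds))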
