Text Classification misclassifying?
I am trying to solve a binary classification problem. My labels are abusive (1) and non-abusive (0). My dataset was imbalanced (more 0s than 1s), so I oversampled the minority label (i.e. 1) to balance it. I have also done pre-processing and feature engineering using TF-IDF, and then fed the dataset into a pipeline with three classification algorithms: Logistic Regression, SVM, and Decision Tree. A minimal sketch of my setup is below.
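For context, this is roughly what my setup looks like (a simplified sketch: the toy data and default hyperparameters stand in for my real corpus and settings, and imbalanced-learn's RandomOverSampler and scikit-learn's LinearSVC are assumptions for the oversampling and SVM steps I described):

    # Minimal sketch of the pipeline described above; toy data and default
    # hyperparameters stand in for the real dataset and settings.
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import classification_report, confusion_matrix
    from imblearn.over_sampling import RandomOverSampler

    # Hypothetical toy data standing in for the real corpus
    texts = [
        "you are an idiot", "I hate you", "shut up loser", "you disgust me",
        "have a great day", "thanks for the help", "see you tomorrow",
        "nice work on this",
    ]
    labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = abusive, 0 = non-abusive

    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, stratify=labels, random_state=42
    )

    # TF-IDF features: fit on training text only, reuse for test text
    vectorizer = TfidfVectorizer()
    X_train_tfidf = vectorizer.fit_transform(X_train)
    X_test_tfidf = vectorizer.transform(X_test)

    # Oversample the minority class (1) on the training split only
    ros = RandomOverSampler(random_state=42)
    X_train_bal, y_train_bal = ros.fit_resample(X_train_tfidf, y_train)

    for name, clf in [
        ("Logistic Regression", LogisticRegression(max_iter=1000)),
        ("SVM", LinearSVC()),
        ("Decision Tree", DecisionTreeClassifier(random_state=42)),
    ]:
        clf.fit(X_train_bal, y_train_bal)
        y_pred = clf.predict(X_test_tfidf)
        print(name)
        print(confusion_matrix(y_test, y_pred))
        print(classification_report(y_test, y_pred))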
My evaluation metrics are:
Logistic Regression:

Confusion matrix:
[[376  33]
 [ 18  69]]

              precision    recall  f1-score   support

           0       0.95      0.92      0.94       409
           1       0.68      0.79      0.73        87

    accuracy                           0.90       496
   macro avg       0.82      0.86      0.83       496
weighted avg       0.91      0.90      0.90       496

SVM:

Confusion matrix:
[[383  26]
 [ 23  64]]

              precision    recall  f1-score   support

           0       0.94      0.94      0.94       409
           1       0.71      0.74      0.72        87

    accuracy                           0.90       496
   macro avg       0.83      0.84      0.83       496
weighted avg       0.90      0.90      0.90       496

Decision Tree:

Confusion matrix:
[[383  26]
 [ 28  59]]

              precision    recall  f1-score   support

           0       0.93      0.94      0.93       409
           1       0.69      0.68      0.69        87

    accuracy                           0.89       496
   macro avg       0.81      0.81      0.81       496
weighted avg       0.89      0.89      0.89       496
The issue I'm facing is that certain new abusive texts are being predicted as non-abusive. I think my false-positive (FP) and false-negative (FN) rates are too high and need to be reduced. Do you have any suggestions on how to reduce FP and FN, or any other suggestions to address my issue? Thanks.
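For completeness, this is roughly how I score new text (again a sketch: vectorizer is the fitted TF-IDF vectorizer and clf is one of the fitted classifiers from the snippet above, and the example string is hypothetical):

    # Predicting on unseen text: transform with the already-fitted
    # vectorizer (no re-fitting), then predict with the trained model
    new_texts = ["example of a new abusive message"]  # hypothetical input
    new_tfidf = vectorizer.transform(new_texts)
    print(clf.predict(new_tfidf))  # 1 = abusive, 0 = non-abusive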
Topic binary-classification text-classification scikit-learn python
Category Data Science