Naive Bayes with TfidfVectorizer predicts everything as one class
I'm running a Multinomial Naive Bayes classifier on datasets with different class balances, comparing two vectorizers: TfidfVectorizer and CountVectorizer. I have 3 classes: NEG, NEU and POS. I have 10000 documents: 2474 NEG, 5894 NEU and 1632 POS. From these I built 3 differently balanced datasets like this:
Text counts:           NEU    NEG    POS    Total
NEU-balanced dataset   5894   2474   1632   10000
NEG-balanced dataset   2474   2474   1632    6580
POS-balanced dataset   1632   1632   1632    4896
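For context, a downsampled variant like the POS-balanced one can be produced with a group-wise pandas sample. This is a minimal sketch on a toy DataFrame standing in for my real `svietimas_data` (which has `text` and `sentiment` columns):

```python
import pandas as pd

# Toy stand-in for svietimas_data: a DataFrame with 'text' and 'sentiment' columns.
df = pd.DataFrame({
    'text': [f'doc {i}' for i in range(100)],
    'sentiment': ['NEU'] * 50 + ['NEG'] * 30 + ['POS'] * 20,
})

# POS-balanced variant: cap every class at the size of the smallest one.
cap = df['sentiment'].value_counts().min()
pos_balanced = (df.groupby('sentiment', group_keys=False)
                  .sample(n=cap, random_state=42))

print(pos_balanced['sentiment'].value_counts())
```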
The problem appears when I classify. Everything is fine on every dataset except the NEU-balanced one. When I classify the NEU-balanced dataset with CountVectorizer it works okay; here is the confusion matrix:
[[ 231 247 17]
[ 104 1004 71]
[ 24 211 91]]
But when I use TfidfVectorizer, the model predicts everything as the NEU class:
[[ 1 494 0]
[ 0 1179 0]
[ 0 326 0]]
Here is some of my code:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

sentences = svietimas_data['text']
y = svietimas_data['sentiment']
#vectorizer = CountVectorizer()
vectorizer = TfidfVectorizer(lowercase=False)
vectorizer.fit(sentences)
sentences = vectorizer.transform(sentences)
X_train, X_test, y_train, y_test = train_test_split(sentences, y, test_size=0.2, random_state=42, stratify=y)
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
My guess is that this happens because the NEU-balanced dataset is the most imbalanced one. But then why does the model predict fine when I use CountVectorizer?
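To sanity-check the prior's influence, I tried a small synthetic experiment (not my real data). MultinomialNB scores each class as a likelihood term weighted by the feature values plus the class log-prior; tf-idf values are typically well below 1, so dividing count features by a constant mimics that shrinkage and lets the learned prior (which favours NEU here) dominate:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Synthetic 2-class data with a 3:1 class prior; POS gets extra mass on feature 1.
rng = np.random.default_rng(0)
X = rng.poisson(lam=[[3.0, 1.0]], size=(400, 2)).astype(float)
y = np.array(['NEU'] * 300 + ['POS'] * 100)
X[y == 'POS', 1] += rng.poisson(lam=4.0, size=100)

clf_counts = MultinomialNB().fit(X, y)
# Dividing by 50 mimics tf-idf's small fractional values: the likelihood
# term shrinks while the learned log-prior (log 0.75 vs log 0.25) does not.
clf_scaled = MultinomialNB().fit(X / 50.0, y)

print(np.mean(clf_counts.predict(X) == 'NEU'))
print(np.mean(clf_scaled.predict(X / 50.0) == 'NEU'))  # drifts toward predicting NEU for everything
```

With raw counts the classifier still separates the classes; with the shrunken features nearly every sample falls back to the majority class, which looks like exactly my symptom.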
Topic text-classification tfidf naive-bayes-classifier classification python
Category Data Science