Naive Bayes with TfidfVectorizer predicts everything as one class
I'm running a Multinomial Naive Bayes classifier on datasets with different class balances, comparing two vectorizers: TfidfVectorizer and CountVectorizer. I have 3 classes: NEG, NEU and POS. I have 10000 documents: 2474 NEG, 5894 NEU and 1632 POS. From these I built 3 differently balanced datasets like this:
Text counts:           NEU    NEG    POS    Total
NEU-balanced dataset   5894   2474   1632   10000
NEG-balanced dataset   2474   2474   1632    6580
POS-balanced dataset   1632   1632   1632    4896
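For context, a downsampled variant like the POS-balanced one can be produced with a group-wise pandas sample. This is a minimal sketch on a toy DataFrame standing in for my real `svietimas_data` (which has `text` and `sentiment` columns):

```python
import pandas as pd

# Toy stand-in for svietimas_data: a DataFrame with 'text' and 'sentiment' columns.
df = pd.DataFrame({
    'text': [f'doc {i}' for i in range(100)],
    'sentiment': ['NEU'] * 50 + ['NEG'] * 30 + ['POS'] * 20,
})

# POS-balanced variant: cap every class at the size of the smallest one.
cap = df['sentiment'].value_counts().min()
pos_balanced = (df.groupby('sentiment', group_keys=False)
                  .sample(n=cap, random_state=42))

print(pos_balanced['sentiment'].value_counts())
```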
The problem appears when I classify. Everything is fine on every dataset except the NEU-balanced one. When I classify the NEU-balanced dataset with CountVectorizer it works okay; here is the confusion matrix:
[[ 231 247 17]
[ 104 1004 71]
[ 24 211 91]]
But when I use TfidfVectorizer, the model predicts everything as the NEU class:
[[ 1 494 0]
[ 0 1179 0]
[ 0 326 0]]
Here is some of my code:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

sentences = svietimas_data['text']
y = svietimas_data['sentiment']
#vectorizer = CountVectorizer()
vectorizer = TfidfVectorizer(lowercase=False)
vectorizer.fit(sentences)
sentences = vectorizer.transform(sentences)
X_train, X_test, y_train, y_test = train_test_split(sentences, y, test_size=0.2, random_state=42, stratify=y)
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
My guess is that this happens because the NEU-balanced dataset is the most imbalanced one. But then why does the model predict fine when I use CountVectorizer?
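To sanity-check the prior's influence, I tried a small synthetic experiment (not my real data). MultinomialNB scores each class as a likelihood term weighted by the feature values plus the class log-prior; tf-idf values are typically well below 1, so dividing count features by a constant mimics that shrinkage and lets the learned prior (which favours NEU here) dominate:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Synthetic 2-class data with a 3:1 class prior; POS gets extra mass on feature 1.
rng = np.random.default_rng(0)
X = rng.poisson(lam=[[3.0, 1.0]], size=(400, 2)).astype(float)
y = np.array(['NEU'] * 300 + ['POS'] * 100)
X[y == 'POS', 1] += rng.poisson(lam=4.0, size=100)

clf_counts = MultinomialNB().fit(X, y)
# Dividing by 50 mimics tf-idf's small fractional values: the likelihood
# term shrinks while the learned log-prior (log 0.75 vs log 0.25) does not.
clf_scaled = MultinomialNB().fit(X / 50.0, y)

print(np.mean(clf_counts.predict(X) == 'NEU'))
print(np.mean(clf_scaled.predict(X / 50.0) == 'NEU'))  # drifts toward predicting NEU for everything
```

With raw counts the classifier still separates the classes; with the shrunken features nearly every sample falls back to the majority class, which looks like exactly my symptom.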
Topic text-classification tfidf naive-bayes-classifier classification python
Category Data Science