I can't figure out how to improve accuracy for tweet sentiment analysis
I'm making a first attempt at tweet sentiment analysis (positive, neutral, negative). So far I have cleaned the data and used a bag-of-words (BoW) representation to get a feel for the data (2.5k tweets). I also built bigrams to try to get clearer insight into the sentiment.
The data is severely skewed, so I tried both upsampling and downsampling to see the difference.
Finally, I passed it all through a Random Forest classifier and got an accuracy of 0.7 on the upsampled data and 0.3 on the downsampled data.
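For reference, the upsample-then-classify step looks roughly like the sketch below. It is not my exact code: it assumes scikit-learn, uses made-up toy tweets in place of the real data, and resamples only the training split so the test set stays untouched (that split is an assumption of the sketch). All variable names are illustrative.

```python
from collections import Counter

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Made-up toy tweets with a skewed label distribution (illustration only).
texts = (
    ["great phone love it"] * 4
    + ["it is a phone"] * 12
    + ["terrible battery hate it"] * 4
)
labels = ["positive"] * 4 + ["neutral"] * 12 + ["negative"] * 4

# Hold out a stratified test set before any resampling.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Upsample every class in the training split to the size of the largest class.
counts = Counter(y_train)
target = max(counts.values())
up_texts, up_labels = [], []
for cls in counts:
    cls_texts = [t for t, y in zip(X_train, y_train) if y == cls]
    sampled = resample(cls_texts, replace=True, n_samples=target, random_state=0)
    up_texts.extend(sampled)
    up_labels.extend([cls] * target)

# Bag-of-words features, fitted on the (upsampled) training data only.
vectorizer = CountVectorizer()
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(vectorizer.fit_transform(up_texts), up_labels)

# Accuracy on the untouched, still-skewed test split.
test_accuracy = clf.score(vectorizer.transform(X_test), y_test)
print(Counter(up_labels), test_accuracy)
```

Downsampling would be the same loop with `replace=False` and `n_samples` set to the size of the smallest class.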
I visualized the results in a confusion matrix and can see that the model is bad at assigning the correct labels. I also computed precision, recall, and F1, and the main problems are with the positive and negative sentiments (values are 0.45).
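The metrics above were computed along the lines of this sketch. It assumes scikit-learn; the `y_true` and `y_pred` lists are made-up placeholders, not my real predictions, so the numbers it prints are not the ones I reported.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Placeholder labels for illustration only (not real model output).
class_names = ["negative", "neutral", "positive"]
y_true = ["negative", "negative", "neutral", "neutral", "neutral",
          "neutral", "positive", "positive"]
y_pred = ["neutral", "negative", "neutral", "neutral", "neutral",
          "positive", "neutral", "positive"]

# Rows are true classes, columns are predicted classes, in class_names order.
cm = confusion_matrix(y_true, y_pred, labels=class_names)

# Per-class precision, recall, and F1, which are more informative than
# overall accuracy when the classes are skewed.
report = classification_report(
    y_true, y_pred, labels=class_names, zero_division=0
)

print(cm)
print(report)
```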
I have tried going back to cleaning the data, but at this point I can't think of anything else to do to it. I've run stemming, lemmatization, tokenization, and stopword removal (including adding stopwords that were still left in), and removed special characters (@, #, etc.) and hyperlinks.
I also gave my CountVectorizer n-gram ranges of (1, 1), (2, 2), and (3, 3), but no big change was detected.
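The n-gram settings correspond to the `ngram_range` parameter of scikit-learn's `CountVectorizer`. A minimal sketch of the comparison, using three made-up sentences rather than my tweets (the names `docs` and `vocab_sizes` are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up example sentences (illustration only).
docs = ["not good at all", "really good phone", "not bad at all"]

# Fit one vectorizer per n-gram range and record the vocabulary size.
vocab_sizes = {}
for ngram_range in [(1, 1), (2, 2), (3, 3)]:
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    vectorizer.fit(docs)
    vocab_sizes[ngram_range] = len(vectorizer.vocabulary_)
    print(ngram_range, sorted(vectorizer.vocabulary_))
```

Each range here uses only one n-gram length, so (2, 2) drops all single words and (3, 3) drops both single words and bigrams.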
This is my first time doing this. Can anybody point me in the right direction, please?
Topic beginner random-forest sentiment-analysis
Category Data Science