Naive Bayes predict with type = 'raw' returns NA

I have built a naive Bayes model for text classification, and it predicts the classes correctly. However, it returns NA in the prediction results when I set type = "raw". I have seen some answers on Stack Overflow that suggest adding some noise. When I do that, I get 0 for every category A value and 1 for every category B value. How can I get correct probabilities from naive Bayes?

library('tm')
library('e1071')
library('SparseM')
Sample_data <- read.csv("products.csv")
traindata <- as.data.frame(Sample_data[1:60, c(1, 2)])
testdata <- as.data.frame(Sample_data[61:80, c(1, 2)])
trainvector <- as.vector(traindata$Description)
testvector <- as.vector(testdata$Description)
trainsource <- VectorSource(trainvector)
testsource <- VectorSource(testvector)
traincorpus <- Corpus(trainsource)
testcorpus <- Corpus(testsource)
traincorpus <- tm_map(traincorpus, stripWhitespace)
traincorpus <- tm_map(traincorpus, tolower)
traincorpus <- tm_map(traincorpus, removeWords, stopwords("english"))
traincorpus <- tm_map(traincorpus, removePunctuation)
testcorpus <- tm_map(testcorpus, stripWhitespace)
testcorpus <- tm_map(testcorpus, tolower)
testcorpus <- tm_map(testcorpus, removeWords, stopwords("english"))
testcorpus <- tm_map(testcorpus, removePunctuation)
trainmatrix <- t(TermDocumentMatrix(traincorpus))
testmatrix <- t(TermDocumentMatrix(testcorpus))
model <- naiveBayes(as.matrix(trainmatrix), as.factor(traindata$Group))
results <- predict(model, as.matrix(testmatrix))

Topic naive-bayes-classifier r machine-learning

Category Data Science


I am assuming that you are referring to the Stack Overflow post that suggests adding noise to the data, since the error seems to occur when a class has only one instance (or very few instances) in the dataset. Is that the case with your training data? If what you are trying to predict is a rare event, one suggestion is to balance the training data by oversampling the rare class (which is one way of adding noise).
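To illustrate the point about a class with a single instance, here is a minimal base R sketch. The data frame, its column names and the seed are hypothetical and are not taken from the question. It shows that the per-class standard deviation a Gaussian naive Bayes would need is NA when a class has only one row, and then balances the classes by resampling rows with replacement.

```r
# Hypothetical toy data: class "B" has a single training row.
train <- data.frame(
  count = c(2, 0, 1, 3, 5),
  Group = factor(c("A", "A", "A", "A", "B"))
)

# Per-class standard deviation of the feature.
# sd() of a single value is NA, so class "B" has no usable estimate.
class_sd <- tapply(train$count, train$Group, sd)
print(class_sd)

# Oversample: draw rows of each class with replacement until every
# class has as many rows as the largest class.
set.seed(1)
target <- max(table(train$Group))
rows_by_class <- split(seq_len(nrow(train)), train$Group)
idx <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), target, replace = TRUE)]
}))
balanced <- train[idx, ]
print(table(balanced$Group))
```

Note that copying a single row only turns the NA into a standard deviation of zero, so the rare class may still need some variation (more real examples, or noise as in the linked post) before the probabilities become meaningful.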

If that does not work, another suggestion is to remove infrequent terms from your term-document matrix using the function removeSparseTerms.
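As a rough sketch of that step, assuming the tm package is installed: the four documents below are made up for illustration, and the 0.6 threshold is an arbitrary choice that you would need to tune for your own data.

```r
library(tm)

# Hypothetical product descriptions, not taken from the question.
docs <- c("green cotton shirt", "white cotton shirt",
          "steel kitchen knife", "steel cooking bowl")
corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus)

# Drop terms that are absent from 60% or more of the documents.
# With four documents this keeps only terms that occur in at least two.
dtm_small <- removeSparseTerms(dtm, sparse = 0.6)
kept_terms <- Terms(dtm_small)
print(kept_terms)
```

A lower value of sparse removes more terms, so it is worth checking how many columns survive before training the model.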

Beyond that, given how little training data you have, it would be good to check whether the words in the term-document matrix, or the frequencies of specific words, are enough to differentiate the classes. If not, you should consider adding new features to describe the dataset.

A few suggestions:
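One simple way to look at this, sketched here in base R with a made-up count matrix and made-up labels, is to add up the term counts per class and see which terms occur in only one class.

```r
# Hypothetical document-term counts: rows are documents, columns are terms.
counts <- matrix(
  c(2, 0, 1,
    1, 0, 0,
    0, 3, 1,
    0, 2, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("cotton", "steel", "new"))
)
group <- factor(c("A", "A", "B", "B"))

# Total count of each term within each class.
per_class <- rowsum(counts, group)
print(per_class)

# Terms that appear in exactly one class are candidates for separating them.
one_class_terms <- colnames(per_class)[colSums(per_class > 0) == 1]
print(one_class_terms)
```

If very few terms stand out in a table like this, the word counts alone are probably not enough to separate the classes.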

  • count of positive/negative words or a sentiment index that ranges from -1 to 1, if relevant for your data
  • types of words in dataset (index or count of adjectives or nouns or verbs), again depending on your problem & data
  • rather than using term-document-matrix, try noun-phrases
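The first suggestion in the list could look roughly like the following base R sketch. The word lists, the function name sentiment_features and the example descriptions are all hypothetical; in practice you would take the lexicon from a sentiment resource that suits your data.

```r
# Hypothetical lexicons, for illustration only.
positive_words <- c("good", "great", "durable")
negative_words <- c("bad", "broken", "cheap")

# Count positive and negative words in one text and derive an index
# in [-1, 1]; the index is 0 when no lexicon word occurs.
sentiment_features <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[tokens != ""]
  pos <- sum(tokens %in% positive_words)
  neg <- sum(tokens %in% negative_words)
  total <- pos + neg
  index <- if (total == 0) 0 else (pos - neg) / total
  c(pos = pos, neg = neg, index = index)
}

descriptions <- c("Great durable bag",
                  "Cheap broken toy, bad value",
                  "Plain wooden table")

# One row per description, with columns pos, neg and index.
features <- t(sapply(descriptions, sentiment_features, USE.NAMES = FALSE))
print(features)
```

The resulting columns can be bound to the term counts (for example with cbind) before training the model.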

Finally, I'm assuming that your test data contains records for both classes. If not, it is difficult to evaluate the model.
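A quick check along these lines, using made-up label vectors in place of your real test labels and predictions, might be:

```r
# Hypothetical true and predicted labels for five test records.
actual <- factor(c("A", "A", "B", "A", "B"), levels = c("A", "B"))
predicted <- factor(c("A", "B", "B", "A", "B"), levels = c("A", "B"))

# Confirm that every class occurs in the test labels.
class_counts <- table(actual)
if (any(class_counts == 0)) {
  warning("At least one class is missing from the test data")
}

# Confusion table and overall accuracy.
confusion <- table(actual = actual, predicted = predicted)
accuracy <- sum(diag(confusion)) / sum(confusion)
print(confusion)
print(accuracy)
```

Fixing the factor levels up front keeps the confusion table square even when the model never predicts one of the classes.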

Hope that helps. It would also help if you could state the data problem more clearly in your question and provide some examples of the data.
