Bag-of-words and Spam classifiers

I implemented a spam classifier using Bernoulli Naive Bayes, Logistic Regression, and SVM. All three models are trained on the full Enron spam email dataset using the bag-of-words (BoW) approach, and prediction is done on the UCI SMS Spam Collection dataset (a minimal sketch of this setup is shown after the questions). I have three questions:

  1. At test time, while creating the term-frequency matrix, what if none of the words from my training BoW are found in some of my test emails/SMSes? Wouldn't the document vectors be zero vectors for those data points? How should I tackle this?

  2. What if a new word from my test email/SMS doesn't exist in the BoW?

  3. How do I choose my BoW vocabulary so as to improve prediction accuracy?
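
For reference, here is a minimal sketch of my setup. The two toy corpora are placeholders standing in for the actual Enron and SMS files, and only the Bernoulli Naive Bayes model is shown:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

enron_texts = ["cheap meds buy now", "meeting at noon tomorrow"]  # placeholder for Enron emails
enron_labels = [1, 0]                                             # 1 = spam, 0 = ham
sms_texts = ["win a free prize now", "see you at noon"]           # placeholder for UCI SMS data

# binary=True gives presence/absence features, which matches Bernoulli NB
vectorizer = CountVectorizer(binary=True)
X_train = vectorizer.fit_transform(enron_texts)  # vocabulary is fixed here

clf = BernoulliNB()
clf.fit(X_train, enron_labels)

# transform() reuses the training vocabulary; test words not seen during
# training are silently dropped (this is what questions 1 and 2 are about)
X_test = vectorizer.transform(sms_texts)
print(clf.predict(X_test))
```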

Tags: bag-of-words, naive-bayes-classifier, deep-learning, nlp, machine-learning

Category: Data Science


As answered by Erwan, the training data needs to be similar to the testing data.

But what if your training and testing data contain similar emails, just not the exact same words? This scenario cannot be addressed with BoW term frequencies, because BoW incorporates no semantic knowledge: it is unable to learn that texts like 'old car' and 'used vehicle' are similar.

I would suggest using word embeddings (e.g., Word2Vec) or more advanced language models (e.g., BERT).
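
To illustrate the Word2Vec route, here is a hedged sketch using gensim: a model is trained on a toy tokenized corpus (in practice you would train on your real corpus or load a pretrained model for better coverage), and each message is represented as the average of its word vectors, so semantically related messages end up with similar dense vectors:

```python
import numpy as np
from gensim.models import Word2Vec

corpus = [["old", "car", "for", "sale"],
          ["used", "vehicle", "for", "sale"],
          ["meeting", "at", "noon"]]          # placeholder tokenized texts

model = Word2Vec(sentences=corpus, vector_size=50, min_count=1, seed=0)

def embed(tokens, wv):
    """Average the vectors of in-vocabulary tokens; zeros if none are known."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

# dense vector usable as input features for any classifier
doc_vector = embed(["old", "car"], model.wv)
print(doc_vector.shape)
```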

This article would be handy if you are a beginner.


Supervised ML works on the assumption that the test data follows the same distribution as the training data. This does not always hold in real-world applications, but it is at least necessary that the test data is "mostly similar" to the training data.

As a consequence, a BoW model can only be applied to data which uses mostly the same vocabulary, with a mostly similar distribution over the words. It is true that out-of-vocabulary (OOV) words frequently appear in the test data, because words in natural languages follow a Zipf distribution, so there are many words which occur rarely. The general assumption in BoW ML models is that since these words occur rarely, they can reasonably be ignored.
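
This ignoring happens automatically in most BoW implementations. A small demonstration, assuming scikit-learn's CountVectorizer is used to build the matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(["free money offer", "meeting schedule"])        # fixes the training vocabulary
row = vec.transform(["free crypto giveaway"]).toarray()  # "crypto", "giveaway" are OOV
print(row)  # only the "free" column is non-zero; the OOV words contribute nothing
```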

  1. At test time, while creating the term-frequency matrix, what if none of the words from my training BoW are found in some of my test emails/SMSes? Wouldn't the document vectors be zero vectors for those data points? How should I tackle this?

This event is supposed to be unlikely: a sentence usually contains at least a few common words (again due to the Zipf distribution). However, it could happen with a very short text message. In any case, there is nothing special to do about it: all the words are ignored, the vector indeed contains only zeros, and the model gives a prediction for this vector of features.
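
To make this concrete, here is a toy sketch: a Bernoulli Naive Bayes model happily scores an all-zero row, since for this model the absence of each word is itself evidence, combined with the class priors.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

X = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1]])  # toy presence/absence features
y = np.array([1, 0, 1])

clf = BernoulliNB().fit(X, y)
zero_doc = np.zeros((1, 3))  # a message containing no known words at all
print(clf.predict(zero_doc), clf.predict_proba(zero_doc))
```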

  2. What if a new word from my test email/SMS doesn't exist in the BoW?

This is the traditional case of OOV words mentioned above. The simplest (and probably most common) option is to completely ignore the unknown word. With some probabilistic models, smoothing can be used to account for the existence of OOV words, but as far as I know this is used only with n-grams.
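
For completeness, here is an illustrative sketch of what such smoothing looks like for a unigram model: add-one (Laplace) smoothing with a reserved <UNK> token, so that OOV words at test time receive a small non-zero probability instead of breaking the product of probabilities. The token name and toy counts are my own choices, not from any particular library:

```python
from collections import Counter

train_tokens = ["free", "money", "free", "offer"]  # placeholder training tokens
counts = Counter(train_tokens)
vocab = set(counts) | {"<UNK>"}                    # reserve a slot for unseen words
total = sum(counts.values())

def unigram_prob(word):
    """P(word) with add-one smoothing; unseen words map to <UNK>."""
    w = word if word in counts else "<UNK>"
    return (counts[w] + 1) / (total + len(vocab))

print(unigram_prob("free"))    # seen word
print(unigram_prob("crypto"))  # OOV word -> small non-zero <UNK> probability
```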

  3. How do I choose my BoW vocabulary so as to improve prediction accuracy?

Experimentally: use a validation set to evaluate several methods, then select the one which performs best and apply only that one to the test set.
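
One way to organize this, sketched here with scikit-learn's Pipeline and GridSearchCV and a placeholder toy corpus: cross-validate over vocabulary choices such as min_df, stop words, and the n-gram range, and touch the test set only once at the very end.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

texts = ["win free money now", "meeting at noon today",
         "win a free prize now", "see you at noon"]  # placeholder training data
labels = [1, 0, 1, 0]

pipe = Pipeline([("bow", CountVectorizer(binary=True)), ("nb", BernoulliNB())])
grid = GridSearchCV(pipe, {
    "bow__min_df": [1, 2],                      # drop words seen in too few documents
    "bow__stop_words": [None, "english"],       # keep or drop common function words
    "bow__ngram_range": [(1, 1), (1, 2)],       # unigrams only vs. unigrams + bigrams
}, cv=2)
grid.fit(texts, labels)
print(grid.best_params_)  # apply only this configuration to the test set
```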
