Need help to increase classification accuracy for classified ads posting

I have to predict the category under which an ad was posted using the provided data; I cannot get my model's accuracy above 74%. I am not sure what I am missing.

What I have done so far:

  1. Cleaned the text using re and nltk
  2. Used a stemmer
  3. CountVectorizer and TfidfTransformer
  4. Used MultinomialNB, LinearSVC, and RandomForestClassifier

Following is my code:

import json
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC, SVC

x_train = []
y_train = []

with open("training-2.json", "r", encoding="utf-8") as file:
    file.readline()  # skip the first line before reading the JSON records
    for line in file:
        data = json.loads(line)
        joined_data = data["city"] + " " + data["section"] + " " + data["heading"]
        x_train.append(joined_data)
        y_train.append(data["category"])

import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

corpus = []

ps = PorterStemmer()
stop_words = set(stopwords.words("english"))  # build the set once, not per document

for i in range(len(x_train)):
    feature = re.sub(r"[^a-zA-Z]", " ", x_train[i])  # keep letters only
    feature = feature.lower()
    feature = feature.split()
    feature = [ps.stem(word) for word in feature if word not in stop_words]
    feature = " ".join(feature)
    corpus.append(feature)

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LinearSVC())])

text_clf.fit(corpus, y_train)
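
Roughly, the accuracy is measured on a held-out split like this (a sketch using the imports above; the split parameters are placeholders):

X_tr, X_te, y_tr, y_te = train_test_split(corpus, y_train,
                                          test_size=0.2, random_state=42)

text_clf.fit(X_tr, y_tr)
print("accuracy:", text_clf.score(X_te, y_te))  # mean accuracy on the held-out set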

After doing all the above steps I get at most 74% accuracy, even though I have tried different models in the pipeline.

Sample Data :

{"city": "newyork", "category": "cell-phones", "section": "for-sale", "heading": "New batteries C-S2 for Blackberry 7100/7130/8700/Curve/Pearl"}
{"city": "newyork", "category": "cell-phones", "section": "for-sale", "heading": "*******   Brand New Original SAMSUNG GALAXY NOTE 2 BATTERY ******"}



Here are a few things I'd look into:

  1. Are the categories balanced in training-2.json? Class imbalance is a well-known issue in ML development, particularly when the class distribution in the training set does not match the distribution in the test set. A quick way to check is sketched after this list.
  2. More interestingly, even if the classes are balanced (which, again, is a strong assumption that I recommend verifying), the input texts might not be: given that you're concatenating city, section, and heading, you may face issues if some cities or sections have many more datapoints than the rest, as the model may incorrectly correlate particular sections or cities with ad categories, and then unseen ads for those sections or cities would be misclassified in bulk. Are you sure you want the model to consider city and section? Should the category of the ad depend on the city? For instance, should the same ad go to "Sports" in San Francisco but to "Politics" in Atlanta? As far as I can tell, this is not the case (an ad about politics will always be about politics, regardless of the city where it is served), so adding that to the input is likely to act as a confounder for the model. I'd recommend using only the heading for this task, given the available information.
  3. Apply the data clean-up basics, like removing duplicates and near-duplicates (also covered in the sketch after this list).
  4. More generally, do some exploratory data analysis (EDA) to detect potential issues. Related to point 2, sometimes there are well-defined clusters of documents that may bias the model towards specific latent sub-categories. For instance, if a category like Sports has 100 ads belonging to 3 clusters with 80, 15, and 5 documents respectively, you can be quite certain that the first cluster will dominate the classification. That means you're not really training a classifier for "Sports", but rather for the first cluster, and ads in that cluster, regardless of their actual category, will be assigned to Sports, which can be another important source of noise. Again, I'd recommend balancing your dataset as much as possible over the full range of legitimate variance exhibited by your target domain.
  5. Are there any issues with feature covariance or low-frequency features? Maybe you need to apply regularization to avoid overfitting?
  6. What happens if you disable the stemmer? Stemmers are a rather crude form of preprocessing and often introduce more errors than correct stemmings. If your dataset is big enough, I would not use one. Consider using character-level features in the vectorizers instead (the analyzer parameter can be set explicitly, and character n-grams are a much better way of accounting for morphological variants; see the second sketch after this list).
  7. Using CountVectorizer and TfidfVectorizer with short texts like these is tricky because their output becomes somewhat meaningless: short titles tend to contain a single occurrence of most of their words (at least, of content words), which means that CountVectorizer has essentially no relevant input it can take advantage of (it will return a [0, 0, 0, 1, ...] vector for most datapoints, basically a dictionary encoding applying the identity function), and TfidfVectorizer is likewise missing the TF term of the TF-IDF equation, which basically ends up giving you an inverse-frequency matrix that penalizes exactly the relevant category-defining words. So I would probably either 1) use dictionary-based encoding, or 2) fit the vectorizer over a modified X_train object where I have added a document for every category, each containing the concatenated heading text of all the ads in that category (remember you can only fit the vectorizer like this at training time, but nothing prevents you from using it to transform test inputs once it has been fitted). This way, the TF term is significant again and is boosted according to each term's strength in its category (= in the Sports category, "see the game" will be frequent terms), and the IDF term is now fairer (= terms that are frequent across all categories are probably unrelated to any particular one of them). A sketch of option 2 appears after this list.
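
For points 1 and 3, a minimal sketch of the balance check and de-duplication (assuming the x_train/y_train lists built in the question):

import pandas as pd

df = pd.DataFrame({"text": x_train, "category": y_train})

# Point 1: inspect the class distribution; heavy skew suggests rebalancing
print(df["category"].value_counts(normalize=True))

# Point 3: drop exact duplicates (near-duplicates need fuzzier matching)
df = df.drop_duplicates()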
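
For point 6, character n-grams can replace the stemmer entirely; a minimal sketch (the n-gram range is just a starting point to tune):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# 'char_wb' builds character n-grams within word boundaries, capturing
# morphological variants (e.g. "battery"/"batteries") without a stemmer
char_clf = Pipeline([("vect", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))),
                     ("clf", LinearSVC())])
char_clf.fit(x_train, y_train)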
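
And for option 2 of point 7, a sketch of fitting the vectorizer on one concatenated document per category and then transforming the individual ads with it:

from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer

# One long document per category, built from the training texts
per_category = defaultdict(list)
for text, category in zip(x_train, y_train):
    per_category[category].append(text)
category_docs = [" ".join(texts) for texts in per_category.values()]

# Fit on the per-category documents so the TF term is meaningful again...
vectorizer = TfidfVectorizer()
vectorizer.fit(category_docs)

# ...then transform the individual ads (train or test) with the fitted vocabulary/IDF
X = vectorizer.transform(x_train)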

You can't say you are not getting better performance after checking just 3 models. There is a whole range of models you can try on your dataset to find the best-performing one; a quick comparison loop is sketched below.
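
A minimal sketch of such a comparison with cross-validation (the candidate models and settings here are only examples):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

for name, clf in [("LinearSVC", LinearSVC()),
                  ("MultinomialNB", MultinomialNB()),
                  ("LogisticRegression", LogisticRegression(max_iter=1000))]:
    pipe = Pipeline([("vect", TfidfVectorizer()), ("clf", clf)])
    scores = cross_val_score(pipe, x_train, y_train, cv=5)
    print(name, round(scores.mean(), 3))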

Also, the data cleaning can be done with different libraries (depending on the data). I don't know what your dataset looks like, but I am sure you can try many more techniques than just CountVectorizer and TF-IDF.
