Need help to increase classification accuracy for classified ads posting
I have to predict the category under which an ad was posted using the provided data, but I cannot get my model's accuracy above 74%. I am not sure what I am missing.
What I have done so far:
- Cleaned the text using re and nltk
- Used a stemmer
- Used CountVectorizer and TfidfTransformer
- Tried MultinomialNB, LinearSVC, and RandomForestClassifier
Here is my code:
import json
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC,SVC
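x_train = []
y_train = []
with open('training-2.json', 'r', encoding='utf-8') as file:
    l = file.readline()  # read and discard the first line of the file
    for line in file:
        data = json.loads(line)
        # Combine city, section and heading into a single text feature
        joined_data = data['city'] + ' ' + data['section'] + ' ' + data['heading']
        x_train.append(joined_data)
        y_train.append(data['category'])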
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
for i in range(0, len(x_train)):
    # Replace everything except letters with a space
    feature = re.sub('[^a-zA-Z]', ' ', x_train[i])
    feature = feature.lower()
    feature = feature.split()
    ps = PorterStemmer()
    # Drop English stopwords and stem the remaining words
    feature = [ps.stem(word) for word in feature if not word in set(stopwords.words('english'))]
    feature = ' '.join(feature)
    corpus.append(feature)
text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', LinearSVC())])
After all of the above steps, the best accuracy I get is 74%, even though I have tried different models in the pipeline.
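For reference, a minimal sketch of how such a model comparison could be scored on a held-out split. This is an illustration under assumptions, not the exact evaluation behind the 74% figure: the helper name `compare_models`, the 25% test split, the `n_estimators=50` setting, and the toy texts and labels are all made up for the example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def compare_models(texts, labels, test_size=0.25, seed=0):
    """Return {model name: held-out accuracy} for the three classifiers.

    Hypothetical helper: the name and defaults are assumptions.
    """
    # Stratify so every category appears in both the train and test parts.
    x_tr, x_te, y_tr, y_te = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    models = {
        "MultinomialNB": MultinomialNB(),
        "LinearSVC": LinearSVC(),
        "RandomForest": RandomForestClassifier(n_estimators=50, random_state=seed),
    }
    scores = {}
    for name, clf in models.items():
        pipe = Pipeline([
            ("vect", CountVectorizer()),
            ("tfidf", TfidfTransformer()),
            ("clf", clf),
        ])
        pipe.fit(x_tr, y_tr)
        # For classifiers, Pipeline.score returns mean accuracy.
        scores[name] = pipe.score(x_te, y_te)
    return scores


# Toy data, invented for illustration only (not from the real training file).
toy_texts = [
    "newyork for-sale new batteries for blackberry curve",
    "newyork for-sale samsung galaxy note battery",
    "london for-sale iphone charger and case",
    "newyork for-sale nokia phone unlocked",
    "newyork for-sale oak dining table and chairs",
    "london for-sale leather sofa like new",
    "newyork for-sale wooden bookshelf",
    "london for-sale queen bed frame",
]
toy_labels = ["cell-phones"] * 4 + ["furniture"] * 4

scores = compare_models(toy_texts, toy_labels)
for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```

Scoring on data the model was not fitted on keeps the numbers comparable across the three classifiers; with the real data, `texts` would be the cleaned corpus and `labels` the category list.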
Sample data:
{"city":"newyork","category":"cell-phones","section":"for-sale","heading":"New batteries C-S2 for Blackberry 7100/7130/8700/Curve/Pearl"}
{"city":"newyork","category":"cell-phones","section":"for-sale","heading":"******* Brand New Original SAMSUNG GALAXY NOTE 2 BATTERY ******"}
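To make the input concrete, here is a small sketch of how one such record turns into a text feature and a label. It assumes each line of the file is a valid JSON object with double-quoted keys and values, which `json.loads` requires; the helper name `to_example` is hypothetical.

```python
import json

# One record in the same shape as the sample data above.
sample_line = (
    '{"city":"newyork","category":"cell-phones","section":"for-sale",'
    '"heading":"New batteries C-S2 for Blackberry 7100/7130/8700/Curve/Pearl"}'
)


def to_example(line):
    """Parse one JSON line into (input text, label). Hypothetical helper."""
    record = json.loads(line)
    # Same feature construction as in the question: city, section, heading.
    text = " ".join([record["city"], record["section"], record["heading"]])
    return text, record["category"]


text, label = to_example(sample_line)
print(text)   # newyork for-sale New batteries C-S2 for Blackberry 7100/7130/8700/Curve/Pearl
print(label)  # cell-phones
```

Note that city and section are identical in both sample records, so for rows like these only the heading carries information that separates categories.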
Topic nltk classification machine-learning
Category Data Science