Creating training data

Question

gogasca

June 9, 2019 15:47

My task is to classify free text originating from customer complaints about our product.

I have created a taxonomy with around 10 different categories. I've realized that each of these categories is associated with certain keywords.

Example:

"Customer doesn't understand how to use the product".

Keywords: understand, knowledge, know, aware.

Record:

Training, Customer doesn't understand how to use the product

I'm using the Google Prediction API. When training the model, I would categorize the previous text as: "Customer doesn't understand how to use the product" - Training.

How can I add keywords to free text/training data to help the model perform better and provide a better confidence level?

Data in training set:

Training, understand knowledge know aware
Training, Customer doesn't understand how to use the product

Right now, I'm adding the keywords to the same training data, but I'm looking for a better approach.

Topic google-prediction-api nltk nlp data-cleaning

Category Data Science

Brandon Loudermilk · Accepted Answer · March 5, 2016 17:27

Assuming you are doing supervised learning to train a model that, when deployed, takes text as input and outputs a label (e.g., topic) or class probability, you probably want balanced, stratified sampling. Given sufficient labelled data, ensure that your final training set has a balanced number of text examples for each class/label. Depending on your situation, you may need to over- or under-sample, or otherwise deal with the problem of highly imbalanced classes (see 8 tactics to combat imbalanced classes).
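
As a rough illustration of the under-sampling idea, here is a minimal base-R sketch that downsamples every class to the size of the smallest one. The data frame, its column names (label, text), and the category names other than "Training" are made up for the example; they are assumptions, not part of the original data.

```r
# Hypothetical labelled data: 'label' and 'text' are illustrative column names.
set.seed(42)  # make the random sampling reproducible
train <- data.frame(
  label = c(rep("Training", 6), rep("Billing", 3), rep("Shipping", 2)),
  text  = paste("complaint", 1:11),
  stringsAsFactors = FALSE
)

# Size of the smallest class; every class is downsampled to this size.
n_min <- min(table(train$label))

# Split row indices by class, then draw n_min indices from each class.
idx_by_class <- split(seq_len(nrow(train)), train$label)
keep <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), n_min)]
}))

balanced <- train[keep, ]
table(balanced$label)  # every class now has n_min rows
```

Under-sampling throws data away, so with small classes over-sampling (drawing with replacement up to the largest class size) may be the better trade-off.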

The simplest NLP approach is to use a bag-of-words technique, simply indicating the presence/absence of a word in the sentence. Each sentence is then represented as a vector of length n, where n = the number of unique words in your data set.

data_set <- c("the big dog", "the small cat", "the big and fat cow")
words <- strsplit(data_set, split = " ") #tokenize sentences
words
##[[1]]
##[1] "the" "big" "dog"
##
##[[2]]
##[1] "the"   "small" "cat"  
##
##[[3]]
##[1] "the" "big" "and" "fat" "cow"


vec <- unique(unlist(words)) #vocabulary: unique words across all sentences
vec
##[1] "the"   "big"   "dog"   "small" "cat"   "and"  
##[7] "fat"   "cow" 

m <- matrix(nrow = length(data_set), ncol = length(vec))

for (i in seq_along(words)) { #iterate the index of tokenized sentences
  vec_rep <- vec %in% words[[i]] #create binary word-feature vector
  m[i,] <- vec_rep #update matrix
}

df <- data.frame(m, row.names = NULL)
names(df) <- vec
df
##   the   big   dog small   cat   and   fat   cow
##1 TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
##2 TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
##3 TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE

Sometimes you can increase performance by adding bi-gram and tri-gram features.

##given: "the big dog"
##unigrams <- {the, big, dog}
##bigrams <- {the big, big dog}
##trigrams <- {the big dog}
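
The n-gram sets listed above could be generated with a small helper along these lines. This is only a sketch in base R; make_ngrams is a hypothetical function name, not something from a package.

```r
# Build all n-grams (as space-joined strings) from a vector of tokens.
# 'make_ngrams' is a hypothetical helper name, not from any package.
make_ngrams <- function(tokens, n) {
  # A sentence shorter than n tokens has no n-grams.
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens   <- strsplit("the big dog", split = " ")[[1]]
unigrams <- make_ngrams(tokens, 1)
bigrams  <- make_ngrams(tokens, 2)
trigrams <- make_ngrams(tokens, 3)
```

Each n-gram can then be treated as one more column in the feature matrix, exactly like a single word.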

Sometimes weighting words by their frequency, or by their tf-idf score, improves performance.
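
For reference, one way to compute such weights in base R is sketched below, reusing the three example sentences. It assumes the plain log(N / df) form of idf; other variants (smoothed, normalized) exist and may suit your data better.

```r
data_set <- c("the big dog", "the small cat", "the big and fat cow")
words <- strsplit(data_set, split = " ")  # tokenize sentences
vocab <- unique(unlist(words))            # unique words across all sentences

# Term frequency: count of each vocabulary word in each sentence
# (one row per sentence, one column per word).
tf <- t(vapply(words,
               function(w) as.numeric(table(factor(w, levels = vocab))),
               numeric(length(vocab))))
colnames(tf) <- vocab

# Document frequency: number of sentences containing each word.
doc_freq <- colSums(tf > 0)

# Inverse document frequency, assuming the plain log(N / df) variant.
idf <- log(nrow(tf) / doc_freq)

# tf-idf: scale each column of tf by that word's idf.
tfidf <- sweep(tf, 2, idf, "*")
round(tfidf, 2)
```

A word that occurs in every sentence, such as "the" here, gets a weight of zero, which is the point: it carries no information about the class.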

Another way to increase performance, in my experience, has been custom language feature engineering, especially if the data is from social media sources replete with spelling errors, acronyms, slang, and other word variants. Standard NLP approaches typically remove stop words (e.g., the, a/an, this/that) from vector representations, because closed-class, high-frequency words often don't help discriminate among classes/labels. Because vector representations are typically high-dimensional (roughly the number of unique words in the corpus/data set), dimensionality-reduction techniques can increase performance. For example, one can compute chi-squared, information gain, etc. on a word feature's distribution across classes, and keep only those features/words above some threshold (or below some pre-established p-value).
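
To make the last two ideas concrete, here is a base-R sketch of stop-word removal followed by chi-squared feature selection. The texts, the "Billing" label, the tiny stop-word list, and the 0.5 cutoff are all invented for illustration; with only six texts the test has very little power, so the cutoff is deliberately loose.

```r
# Hypothetical toy data: labels and texts are illustrative only.
labels <- c("Training", "Training", "Training", "Billing", "Billing", "Billing")
texts  <- c("customer does not understand the product",
            "customer is not aware of the feature",
            "customer does not know the setup steps",
            "customer was charged twice on the invoice",
            "the invoice shows a wrong charge",
            "customer wants a refund of the charge")

# A tiny hand-picked stop-word list (an assumption; real lists are much longer).
stop_words <- c("the", "a", "is", "of", "on", "was", "does", "not")

# Tokenize and drop stop words (setdiff also de-duplicates within a text).
tokens <- lapply(strsplit(texts, split = " "),
                 function(w) setdiff(w, stop_words))
vocab  <- unique(unlist(tokens))

# Binary presence/absence matrix: one row per text, one column per word.
m <- t(vapply(tokens, function(w) vocab %in% w, logical(length(vocab))))
colnames(m) <- vocab

# Chi-squared p-value of each word's presence against the class label.
p_values <- apply(m, 2, function(present) {
  tab <- table(factor(present, levels = c(FALSE, TRUE)), labels)
  suppressWarnings(chisq.test(tab)$p.value)  # small counts trigger a warning
})

# Keep words whose p-value falls below the (illustrative) cutoff.
alpha    <- 0.5
selected <- names(p_values)[p_values < alpha]
```

On a real data set you would use a proper stop-word list and a much stricter cutoff, and possibly correct for the large number of words being tested.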
