Understanding text conversion into SVM input
In Support Vector Machines, when used for sentiment analysis, text gets converted into a set of data points. How does this happen, usually?
Topic svm nlp libsvm machine-learning
Category Data Science
Well, the text itself doesn't become the data points directly. Say we are doing sentence-level opinion mining: features are extracted from each sentence, and which features to use varies from case to case. A common choice is the bag-of-words model, in which each distinct word in the corpus becomes a feature and its value is the number of times that word appears in the sentence. Those frequency vectors are your data points.
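A minimal sketch of the bag-of-words step described above, using only the standard library (the function name and the example sentences are my own, not from any particular library):

```python
from collections import Counter

def bag_of_words(sentences):
    """Build a shared vocabulary over all sentences, then map each
    sentence to a word-frequency vector -- these vectors are the
    data points an SVM would train on."""
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({w for toks in tokenized for w in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # One column per vocabulary word; value = frequency in this sentence
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

vocab, X = bag_of_words(["the movie was great", "the plot was bad bad"])
# vocab: ['bad', 'great', 'movie', 'plot', 'the', 'was']
# X[0]:  [0, 1, 1, 0, 1, 1]
# X[1]:  [2, 0, 0, 1, 1, 1]
```

Each row of `X` could then be fed to an SVM trainer (e.g. in LIBSVM's sparse `index:value` format) along with a sentiment label.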
Text can be converted to data via concept clusters (after stemming and stop-word removal), or to counts (frequencies) via character n-grams. A character n-gram is a contiguous sequence of n characters: 1-grams are single letters (a through z), 2-grams are pairs (aa through zz), 3-grams are triples (aaa through zzz), and so on up to about 5-grams (aaaaa through zzzzz). Beyond 5-grams, the data becomes sparse and less informative. A dataset can thus be constructed in which rows represent documents, columns represent n-grams, and each value is the total number of occurrences of that n-gram in that document.
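A small sketch of character n-gram counting as described above (the helper name and sample document are illustrative, not from any specific toolkit):

```python
from collections import Counter

def char_ngrams(text, n):
    """Count the character n-grams in one document.
    Non-letters are dropped and case is folded, matching the
    a-to-z tabulation described above."""
    cleaned = "".join(c for c in text.lower() if c.isalpha())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))

bigrams = char_ngrams("Banana bandana", 2)
# bigrams["an"] == 4, bigrams["ba"] == 2, bigrams["na"] == 3
```

Running this over every document in a corpus and aligning the counts by n-gram yields the documents-by-n-grams matrix the answer describes.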
FYI - character n-grams have proven to be one of the most effective techniques for identifying a document's language from its characters alone.
Regarding SVMs, focus on the SVM literature.