Application of bag-of-ngrams in feature engineering of texts

I've got a few questions about using bag-of-ngrams for feature engineering of text:

  1. How can we (if at all) apply word2vec on top of a bag-of-ngrams?
  2. Since the feature space of a bag-of-ngrams grows exponentially with N, which techniques (if any) are commonly used together with bag-of-ngrams to improve computational and storage efficiency?
  3. Or, more generally, is bag-of-ngrams used alongside other feature engineering techniques when transforming a text field into text features?

Topic ngrams feature-engineering word-embeddings

Category Data Science


I will answer all three questions together. Embedding models operate on tokens, i.e. the smallest meaningful text units, and you define what those units are: you may call characters the smallest meaningful piece, or words, phrases, or whatever your creativity allows. Word2vec is usually applied to words, so if you feed n-grams in as tokens, you get the same kind of embedding space in which your n-grams are also represented.
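
A minimal sketch of that idea, assuming gensim 4.x (the tiny corpus and the `tokenize_with_bigrams` helper are mine, purely for illustration): adjacent words are joined with an underscore and passed to word2vec as ordinary tokens, so the bigrams get vectors just like the unigrams do.

```python
from gensim.models import Word2Vec

corpus = [
    "data science is an interdisciplinary field",
    "machine learning is part of data science",
]

def tokenize_with_bigrams(text):
    words = text.lower().split()
    # keep the unigrams and add underscore-joined bigrams as extra tokens
    bigrams = ["_".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams

sentences = [tokenize_with_bigrams(doc) for doc in corpus]

# word2vec does not care what a "token" is, so the bigram tokens get vectors too
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=1)

print(model.wv["data_science"])  # embedding of the bigram token
print(model.wv["data"])          # embedding of the unigram token
```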

If you use embeddings, the high dimensionality of the bag-of-ngrams feature space is automatically taken care of (question 2), and the same idea answers question 3 as well (if not, please update your question with the exact feature engineering techniques you mean and I will update my answer too).
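
To make question 2 concrete, the previous sketch could be extended like this (again just an illustration, not part of the original answer): each document is reduced to one dense vector of fixed size, for example by averaging its token vectors, instead of a sparse vector with one column per distinct n-gram.

```python
import numpy as np

def document_vector(text, model):
    # reuse tokenize_with_bigrams and model from the sketch above;
    # tokens missing from the vocabulary are simply skipped
    tokens = [t for t in tokenize_with_bigrams(text) if t in model.wv]
    return np.mean([model.wv[t] for t in tokens], axis=0)

doc_vec = document_vector("data science is fun", model)
print(doc_vec.shape)  # (50,) regardless of how many distinct n-grams exist
```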

Disclaimer: However intuitive the output looks, it can contain subtle artifacts, so you need to be careful. For example, the embedding algorithm sees "data" and "science" individually and, within the same context, also tries to embed "data science"; this can distort the semantic map. If you never need the individual tokens of your n-gram, that is fine, but you probably don't want to skip two semantically strong words like "data" and "science". So, be careful.
