Is there a process flow to follow for text analytics?

I am trying to draw up a process flow (like a template) to follow when working on text analysis projects. So far, I've come up with this.

Text Analytics Steps

  1. Data Collection
    • Acquire data
    • Convert data into plain text
  2. Remove Duplicate Entries
  3. Text Parsing and Extracting Features
    • Tokenization
    • Parsing
      1. Remove HTML characters
      2. Decode complex symbols to UTF-8
      3. Spell check
      4. Apostrophe look-up
      5. Remove punctuation marks
      6. Remove expressions / emojis
      7. Split attached words
      8. Slang look-up
      9. Remove URLs
    • Lemmatization / Stemming (Normalization of Tokens)
    • Parts-of-Speech Tagging
  4. Text Filtering
    • Remove start-words
    • Remove stop-words
    • Remove irrelevant words based on frequency
  5. Text Transformation
    • Bag of Words Representation
    • TF-IDF
  6. Text Mining / Analysis (whichever analysis needed)
    • Text Categorization
    • Text Classification (supervised)
    • Topic Modeling (unsupervised)
    • Text Clustering
    • Similarity Analysis
    • Sentiment Analysis
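
To make steps 2 through 5 concrete, here is a minimal sketch of how they might fit together using only the Python standard library. The function names (`deduplicate`, `clean`, `tokenize`, `bag_of_words`, `tf_idf`) and the tiny `STOP_WORDS` set are my own placeholders, not part of any library, and steps such as spell checking, slang look-up, and lemmatization are left out because they need external resources.

```python
import html
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real projects use a fuller one).
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "of", "to", "in", "it"}

def deduplicate(docs):
    """Step 2: drop exact duplicate documents, keeping first occurrences in order."""
    return list(dict.fromkeys(docs))

def clean(text):
    """Step 3 (parsing): strip markup, URLs, and punctuation, then lowercase."""
    text = html.unescape(text)                 # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^\w\s']", " ", text)      # remove punctuation and emojis, keep apostrophes
    return text.lower()

def tokenize(text):
    """Steps 3 and 4: whitespace tokenization followed by stop-word removal."""
    return [tok for tok in clean(text).split() if tok not in STOP_WORDS]

def bag_of_words(docs):
    """Step 5: one term-count mapping per document."""
    return [Counter(tokenize(doc)) for doc in docs]

def tf_idf(bags):
    """Step 5: weight each term by term frequency times log inverse document frequency."""
    n_docs = len(bags)
    # Iterating over a Counter yields each distinct term once per document.
    doc_freq = Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return weights

docs = deduplicate([
    "<p>The cat sat.</p>",
    "The dog sat! http://example.com",
    "<p>The cat sat.</p>",
])
weights = tf_idf(bag_of_words(docs))
```

With this unsmoothed TF-IDF variant, a term that occurs in every document (here `sat`) gets a weight of zero, which is one way the "remove irrelevant words based on frequency" step can fall out of the transformation itself.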

Is this flow in the right order of steps?
What are the steps/sub-steps that I am missing?
Could this process flow serve as a template or go-to flow chart for any text analytics project?

Edit: Updated process flow



This is a great place to start! Although it does not lay out a "process flow", Daniel Jurafsky and James H. Martin's book "Speech and Language Processing" walks through the calculations and steps involved in analyzing text, which you will find useful.

The reason no process flow is provided is that Jurafsky explains, in great detail, the pros and cons of the methods applied at each stage of a pipeline and how each choice can change the results. For example, when calculating perplexity (an inverse metric of how well a language model predicts the next word in a statement, so lower is better), you should keep sentence beginnings, sentence ends, and stop words, whereas other methods require removing stop words.
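
As a rough illustration of the perplexity point, here is a minimal sketch of a bigram language model with add-one smoothing. The helper names (`pad`, `train_bigram`, `perplexity`) and the toy sentences are assumptions of mine, not code from the book. Each sentence is padded with `<s>` and `</s>` markers, and stop words such as "the" are kept; the end marker counts as a predicted token while the start marker does not.

```python
import math
from collections import Counter

def pad(sentence):
    """Lowercase, split on whitespace, and add sentence boundary markers."""
    return ["<s>"] + sentence.lower().split() + ["</s>"]

def train_bigram(sentences):
    """Count unigrams and bigrams over padded sentences (stop words are kept)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = pad(sentence)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def perplexity(sentences, unigrams, bigrams):
    """exp of the average negative log-probability per predicted token."""
    vocab_size = len(unigrams)  # simplification: the vocabulary includes both markers
    log_prob, n_tokens = 0.0, 0
    for sentence in sentences:
        tokens = pad(sentence)
        for prev, word in zip(tokens, tokens[1:]):
            # Add-one (Laplace) smoothing so unseen bigrams get non-zero probability.
            p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
            log_prob += math.log(p)
            n_tokens += 1  # every token after <s> is predicted, including </s>
    return math.exp(-log_prob / n_tokens)

unigrams, bigrams = train_bigram(["the cat sat", "the dog sat"])
seen = perplexity(["the cat sat"], unigrams, bigrams)
unseen = perplexity(["sat the dog"], unigrams, bigrams)
print(seen < unseen)  # a word order seen in training should score lower
```

If stop words or boundary markers were stripped first, the model would be scored on a different token sequence than the one it is meant to predict, which is why this step conflicts with the usual filtering stage.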
