How to structure unstructured data

I am analysing tweets and have collected them in an unstructured format. What is the best way to structure this data so I can begin the data mining process? Somebody suggested using Python packages such as spaCy, but I am not sure how to go about using them.

Topic structured-data nlp data-mining

Category Data Science


In Natural Language Processing it's crucial to choose the representation of the data and the design of the system based on the intended task; there is no generic method to represent text data which fits every application. This is not a simple technical problem: it's an important part of designing the system.

The simplest method to structure text data is to represent the sentence or document as a bag of words (BoW), i.e. a set containing all the tokens in the sentence or document. Such a set can be represented with one-hot encoding (OHE) over the full vocabulary (all the words in all the documents), giving one binary column per vocabulary word, in order to obtain structured data (features). Many preprocessing variants can be applied: remove stop words, replace words with their lemma, filter out rare words, etc. Don't neglect them: these preprocessing options can have a huge impact on performance.
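
As a rough illustration of the idea above, here is a minimal sketch in plain Python. The example tweets, the stop-word list and the naive regex tokenizer are all assumptions made for this example; in practice a library such as spaCy or scikit-learn would usually handle tokenization and lemmatization.

```python
import re

# Hypothetical example tweets and stop-word list (assumptions for this sketch)
tweets = [
    "I love this phone",
    "This phone is not good",
    "Love the new camera",
]
STOP_WORDS = {"i", "this", "is", "the"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (deliberately naive)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(text):
    """Return the set of tokens in the text, without stop words."""
    return {tok for tok in tokenize(text) if tok not in STOP_WORDS}


# One bag (set of tokens) per tweet
bags = [bag_of_words(t) for t in tweets]

# Full vocabulary: every distinct token across all tweets, in a fixed order
vocabulary = sorted(set().union(*bags))

# Binary feature matrix: one row per tweet, one column per vocabulary word
features = [[1 if word in bag else 0 for word in vocabulary] for bag in bags]

print(vocabulary)
for row in features:
    print(row)
```

Each row of `features` is a structured record that could be passed to a standard data mining algorithm. Changing `STOP_WORDS` or the tokenizer changes the vocabulary, and therefore the columns, which is one reason the preprocessing choices matter.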

Despite their simplicity, BoW models usually preserve the semantic information of the document reasonably well. However, because word order is discarded, they cannot handle any complex linguistic structure: negations, multiword expressions, etc.
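
The negation problem can be seen with a small sketch. The two sentences below are made up for the example, and the tokenizer is again a naive regex: the sentences have opposite meanings but produce exactly the same bag of words.

```python
import re


def token_set(text):
    """Return the set of lowercase word tokens in the text (naive tokenizer)."""
    return set(re.findall(r"[a-z']+", text.lower()))


# Two hypothetical sentences with opposite meanings
positive = token_set("The film was good, not bad")
negative = token_set("The film was bad, not good")

# Word order is lost, so both sentences map to the same set of tokens
same_bag = positive == negative
print(same_bag)
```

A model that only sees these sets has no way to tell the two sentences apart.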
