Format for storing textual data

For an upcoming project, I'm mining textual posts from an online forum, using Scrapy. What is the best way to store this text data? I'm thinking of simply exporting it into a JSON file, but is there a better format? Or does it not matter?

Tags: crawling, text-mining

Category: Data Science


Let me assume you intend to use Python libraries to analyze the data, since you are using Scrapy to gather it.

If this is true, then a factor to consider for storage is compatibility with other Python libraries. Of course, plain text is compatible with anything. But, for example, Pandas has a host of IO tools that simplify reading from certain formats. If you intend to use scikit-learn for modeling, Pandas can still read the data in for you; you then convert the DataFrame to a NumPy array as an intermediate step.
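A minimal sketch of that path, assuming the posts were exported to a hypothetical posts.json file with a content field:

```python
import pandas as pd

# Hypothetical file and field names; assumes one JSON record per post.
df = pd.read_json("posts.json")

# Most scikit-learn estimators work on NumPy arrays, so the relevant
# column can be pulled out of the DataFrame as an intermediate step.
texts = df["content"].to_numpy()
```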

These tools allow you to read not only CSV and JSON but also HDF5. In particular, I would draw your attention to the experimental support for msgpack, which is essentially a binary version of JSON. Binary here means the stored files are smaller and therefore faster to read and write. A somewhat similar alternative is BSON, which has a standalone Python implementation, with no Pandas or NumPy involved.
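For a rough sense of the size difference, here is a sketch using the standalone msgpack package (an assumption: installed via pip, unrelated to Pandas) to round-trip one record in both formats:

```python
import json
import msgpack  # pip install msgpack; standalone, no Pandas or NumPy needed

post = {"title": "Hello", "content": "First post!", "score": 42}

packed = msgpack.packb(post)           # compact binary encoding
as_json = json.dumps(post).encode()    # the same record as JSON bytes

print(len(packed), "<", len(as_json))  # the binary form is smaller
restored = msgpack.unpackb(packed)     # round-trips back to a dict
```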

Considering these formats only makes sense if you intend to give the stored text at least some structure, e.g. storing the post title separately from the post content, keeping all posts in a thread in order, or recording the timestamp. If you considered JSON at all, then I suppose this is what you intended. If you just want to store the raw post contents, use plain text.
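For illustration, a structured record might look like the following; the field names are made up, not taken from the question. Scrapy's feed exports can also write this JSON Lines format directly (e.g. scrapy crawl myspider -o posts.jl):

```python
import json

# Hypothetical fields for one forum post.
post = {
    "thread_id": "12345",
    "title": "Format for storing textual data",
    "content": "For an upcoming project, I'm mining textual posts...",
    "timestamp": "2016-05-07T12:34:56Z",
}

# JSON Lines: one object per line, easy to append to while crawling.
with open("posts.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(post, ensure_ascii=False) + "\n")
```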


I think one has to be very careful when storing textual data. If the posts are user comments, then for security reasons it is better to encode them in some format before storage; a protobuf message can then be defined to handle the encoding and decoding.

The choice of database should depend on your query pattern and the acceptable retrieval latency. As a recommendation: if the idea is to store comments for each user over a period of time, consider HBase or Cassandra, which are optimized for time-range queries.
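A sketch of what that could look like in Cassandra via the cassandra-driver package; the keyspace, table, and column names are invented, and a running local cluster is assumed:

```python
from datetime import datetime
from cassandra.cluster import Cluster  # pip install cassandra-driver

session = Cluster(["127.0.0.1"]).connect("forum")  # hypothetical keyspace

# Clustering by timestamp keeps each user's comments ordered on disk,
# which is what makes time-range scans cheap.
session.execute("""
    CREATE TABLE IF NOT EXISTS comments (
        user_id   text,
        posted_at timestamp,
        content   text,
        PRIMARY KEY (user_id, posted_at)
    ) WITH CLUSTERING ORDER BY (posted_at DESC)
""")

rows = session.execute(
    "SELECT posted_at, content FROM comments "
    "WHERE user_id = %s AND posted_at >= %s AND posted_at < %s",
    ("alice", datetime(2016, 1, 1), datetime(2016, 2, 1)),
)
```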

Recommended reading: http://info.mapr.com/rs/mapr/images/Time_Series_Databases.pdf


Suppose that a forum post has, on average, 2,000 characters, which is more or less the equivalent of a page of text. Then for roughly 5,000 posts the total memory needed is about 10 MB if the text is ASCII. Even if the text is Unicode-encoded in an Asian language, at up to 4 bytes per character it will take about 40 MB.
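The back-of-envelope calculation, with the post count taken as an assumption:

```python
# Assumes ~5,000 posts of ~2,000 characters each.
posts, chars = 5_000, 2_000
print(posts * chars * 1 / 1e6)  # 10.0 MB at 1 byte per ASCII character
print(posts * chars * 4 / 1e6)  # 40.0 MB at up to 4 bytes per character
```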

This is trivially small for modern computers, so a simple text format is best: it can be parsed the fastest and loaded into RAM all at once.


In general, use a storage method that allows you to query it quickly. If your collection is huge, you might need something Lucene-based, like Elasticsearch. If you are a SQL expert and your favorite database supports it, a full-text index might do the trick. For small sizes like your 5,000 documents, even Linux's locate database plus grep, or macOS's Spotlight, could be enough.

The important point is to be able to quickly verify assumptions about the content of your data: how many documents contain X and Y, does any document contain W but not V, and so on.
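As a sketch of such checks with a SQL full-text index, here is SQLite's FTS5 extension (bundled with most Python builds); the table, columns, and sample rows are made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE posts USING fts5(title, content)")
con.executemany(
    "INSERT INTO posts VALUES (?, ?)",
    [("storage", "JSON or plain text?"), ("parsing", "Scrapy and JSON tips")],
)

# How many documents contain X and Y?
print(con.execute(
    "SELECT count(*) FROM posts WHERE posts MATCH 'json AND text'"
).fetchone()[0])

# Does any document contain W but not V?
print(con.execute(
    "SELECT count(*) FROM posts WHERE posts MATCH 'scrapy NOT plain'"
).fetchone()[0])
```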

This will be useful both at the whole-set level and for analyzing your topic clusters. Finally, a few GNU tools or some SQL mastery can also help you profile your document sets more efficiently (n-gram counts/ranks, collocations, concordances, etc.).
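For example, a quick bigram rank with nothing but the standard library, assuming the corpus is a list of raw post strings:

```python
from collections import Counter

# Toy corpus; in practice, the list of scraped post bodies.
corpus = ["json is fine", "plain text is fine too"]

bigrams = Counter()
for doc in corpus:
    words = doc.lower().split()
    bigrams.update(zip(words, words[1:]))

for ngram, count in bigrams.most_common(10):
    print(count, " ".join(ngram))
```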

Edit: that is, for the above reasons and given your collection size, good old plain text (in a file system or a database) may well be more efficient than any "fancy" format.
