Reading and Processing Twitter network data

I have a large dataset collected from here. What I would like to ask is:

  1. How do I perform file I/O on this file? The download page says it uses the .tsv format, but after unpacking I got a .twitter file, which is unfamiliar to me, and so far I haven't found any reliable documentation on file I/O for this file type.
  2. The file is huge (23 gigabytes), so even if I could do file I/O, it is still impossible to load everything on a single machine. Which tool is well suited for this, say for graph processing? Is PySpark the right tool?

Topic twitter bigdata

Category Data Science


I haven't downloaded the 5 GB TSV tar file, but there is also a small file, "Extraction code: download twitter.tar.bz2 (14.46 KiB)", on the konect.uni-koblenz.de website you mentioned.

This twitter.tar.bz2 file is a small archive of several binary executables designed to run on Linux. The README says the following:

Usage

To build the datasets, execute "make" or "stu" inside the directory. The code downloads the datasets from their online sources and converts them to the KONECT format.

Dependencies

You may need to install the following additional software packages:

  • unzip
  • aria2
  • tofrodos (providing fromdos/todos)

So I assume they want you to convert the data to the KONECT format (whatever that is) and use their MATLAB toolbox to process the dataset.
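If the unpacked .twitter file turns out to be a plain-text edge list, you may not need a special reader at all. The following is a minimal sketch under assumptions I have not verified against this dataset: one edge per line, the first two whitespace-separated columns are integer node IDs, and header or comment lines start with `%`. The function names `iter_edges` and `out_degrees` are made up for illustration. It streams the file line by line, so the 23 GB never has to fit in memory at once.

```python
from collections import Counter
from typing import Iterator, TextIO, Tuple


def iter_edges(handle: TextIO) -> Iterator[Tuple[int, int]]:
    """Yield (source, target) pairs from a whitespace-separated edge list."""
    for line in handle:
        line = line.strip()
        # Assumption: blank lines and lines starting with '%' carry no edge.
        if not line or line.startswith("%"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        # Any extra columns (weights, timestamps) are ignored here.
        yield int(fields[0]), int(fields[1])


def out_degrees(handle: TextIO) -> Counter:
    """Count outgoing edges per node in a single pass over the file."""
    degrees = Counter()
    for source, _target in iter_edges(handle):
        degrees[source] += 1
    return degrees
```

You would call it with something like `with open(path) as f: out_degrees(f).most_common(10)`, where `path` points at the unpacked file. Note that the counter still holds one entry per source node, so this only helps when the number of nodes (unlike the number of edges) fits in memory.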

Alternatively, the third file linked on their website, twitter.n3.bz2, is an RDF file in N3 format, which I haven't looked at. You can process the file with Blazegraph, a Java-based web application that runs on Tomcat, for instance, but you need to know a lot to set up this software, and it is complicated to use even for small datasets (say 100 KB).
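Before setting up a full graph database, it may be worth simply looking at the first few lines of the compressed file to see what the data looks like. Here is a small sketch using only the Python standard library; `peek_lines` is a hypothetical helper name, and the UTF-8 encoding is an assumption on my part.

```python
import bz2
from itertools import islice
from typing import List


def peek_lines(path: str, count: int = 5) -> List[str]:
    """Return the first `count` lines of a bz2-compressed text file."""
    # In text mode, bz2.open decompresses incrementally as lines are read,
    # so only the beginning of the archive is actually processed.
    with bz2.open(path, mode="rt", encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, count)]
```

For example, `peek_lines("twitter.n3.bz2")` would show the first five lines without unpacking the whole file to disk.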

Maybe other graph databases (besides Blazegraph) can read RDF files, but I don't know of any. But see this on Software Recommendations SE.
