How big is big data?

Lots of people use the term big data in a rather commercial way, as a means of indicating that large datasets are involved in the computation, and therefore potential solutions must have good performance. Of course, big data always carries associated terms, like scalability and efficiency, but what exactly defines a problem as a big data problem?

Does the computation have to be related to some set of specific purposes, like data mining/information retrieval, or could an algorithm for general graph problems be labeled big data if the dataset was big enough? Also, how big is big enough (if this is possible to define)?

To me (coming from a relational database background), "Big Data" is not primarily about the data size (which is what most of the other answers so far focus on).

"Big Data" and "Bad Data" are closely related. Relational Databases require 'pristine data'. If the data is in the database, it is accurate, clean, and 100% reliable. Relational Databases require "Great Data" and a huge amount of time, money, and accountability is put on to making sure the data is well prepared before loading it in to the database. If the data is in the database, it is 'gospel', and it defines the system understanding of reality.

"Big Data" tackles this problem from the other direction. The data is poorly defined, much of it may be inaccurate, and much of it may in fact be missing. The structure and layout of the data is linear as opposed to relational.

Big Data has to have enough volume so that the amount of bad data, or missing data becomes statistically insignificant. When the errors in your data are common enough to cancel each other out, when the missing data is proportionally small enough to be negligible and when your data access requirements and algorithms are functional even with incomplete and inaccurate data, then you have "Big Data".
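To make the "errors cancel each other out" point concrete, here is a minimal sketch. It assumes the errors are independent and zero-mean, which is a strong assumption (a systematic bias does not average away), and the names `noisy_mean`, `true_value` and `noise_sd` are made up for this illustration.

```python
import random


def noisy_mean(true_value, noise_sd, n, seed=0):
    """Average n readings of true_value, each corrupted by independent
    zero-mean Gaussian noise with standard deviation noise_sd."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        total += true_value + rng.gauss(0.0, noise_sd)
    return total / n


if __name__ == "__main__":
    # Print the absolute error of the estimate for growing sample sizes.
    for n in (10, 1_000, 100_000):
        estimate = noisy_mean(true_value=5.0, noise_sd=1.0, n=n)
        print(n, abs(estimate - 5.0))
```

Under these assumptions the typical error of the average shrinks roughly like `noise_sd / sqrt(n)`, which is why sheer volume can make individually unreliable records usable.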

"Big Data" is not really about the volume, it is about the characteristics of the data.

Data is "Big Data" if it is of such volume that it is less expensive to analyze it on two or more commodity computers, than on one high-end computer.

This is essentially how Google's "BigFiles" file system originated. Page and Brin could not afford a fancy Sun server to store and search their web index, so they hooked up several commodity computers.

"Big data" is literally just a lot of data. While it's more of a marketing term than anything, the implication is usually that you have so much data that you can't analyze all of the data at once because the amount of memory (RAM) it would take to hold the data in memory to process and analyze it is greater than the amount of available memory.

This means that analyses usually have to be done on random segments of data, which allows models to be built to compare against other parts of the data.
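One common way to get such a random segment without ever loading the full dataset is reservoir sampling. The sketch below is only an illustration of that idea (the function name `reservoir_sample` is my own), and it assumes the data can be read as a stream, one record at a time.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown
    length, keeping only k items in memory (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample
```

For example, `reservoir_sample(open("huge.csv"), 10_000)` would keep 10,000 lines in memory regardless of the file size (`huge.csv` is a placeholder name).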

There is something special about graph algorithms, which your original question mentions, and it essentially comes down to the ability to partition the data.

For some things, like sorting numbers in an array, it is not too difficult to partition the problem on the data structure into smaller disjoint pieces; see, e.g., Parallel in-place merge sort.

For graph algorithms, however, there is the challenge that finding an optimal partitioning with respect to a given graph metric is known to be NP-hard.

So while sorting 10GB of numbers might be a very approachable problem on a normal PC (you can just do it via divide and conquer and have very good predictability about the program flow), working with a 10GB graph data structure can already be challenging.
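As a rough illustration of why sorting partitions so easily, here is a minimal external merge sort sketch: sort chunks that fit in memory, spill each sorted run to a temporary file, then merge the runs lazily. The function names and the one-integer-per-line file format are assumptions made for this example only.

```python
import heapq
import tempfile
from itertools import islice


def _read_run(run_file):
    # Lazily yield the integers stored one per line in a sorted run file.
    run_file.seek(0)
    for line in run_file:
        yield int(line)


def external_sort(numbers, chunk_size):
    """Yield the integers from `numbers` in ascending order while sorting
    only `chunk_size` of them in memory at a time."""
    numbers = iter(numbers)
    runs = []
    while True:
        chunk = sorted(islice(numbers, chunk_size))
        if not chunk:
            break
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(f"{n}\n" for n in chunk)
        runs.append(run)
    try:
        # heapq.merge keeps just one pending value per run in memory.
        yield from heapq.merge(*(_read_run(r) for r in runs))
    finally:
        for r in runs:
            r.close()
```

Each chunk is completely independent of the others until the final merge, which is exactly the property that a general graph lacks. A real implementation would also cap the number of simultaneously open run files.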

There are a number of specialized frameworks such as GraphX using methods and special computing paradigms to somewhat circumvent the inherent challenges of graphs.

So to answer your question briefly: as others have mentioned, when your data does not fit into main memory on a normal PC but you need all of it to answer your problem, that is a good hint that your data is already somewhat big. The exact labeling, though, depends I think a bit on the data structure and the question asked.

I tend to agree with what @Dan Levin has already said. Ultimately, since we want to draw useful insights from the data rather than just store it, it's the ability of learning algorithms/systems that should determine what is called "Big Data". As ML systems evolve, what is Big Data today will no longer be Big Data tomorrow.

One way of defining Big data could be:

  • Big data: Data on which you can't build ML models in reasonable time (1-2 hours) on a typical workstation (with, say, 4GB RAM)
  • Non-Big data: complement of the above

Assuming this definition, as long as the memory occupied by an individual row (all variables for a single data point) does not exceed machine RAM, we should be in the Non-big data regime.

Note: Vowpal Wabbit (by far the fastest ML system as of today) can learn on any data set as long as an individual row (data point) is < RAM (say 4GB). The number of rows is not a limitation because it uses SGD on multiple cores. Speaking from experience, you can train a model with 10k features and 10 million rows on a laptop in a day.
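To show why the number of rows stops mattering with SGD, here is a minimal plain-Python sketch of online learning. This is not Vowpal Wabbit's API; `sgd_linear_fit` and `synthetic_rows` are hypothetical names, and the model is a one-feature linear regression chosen only to keep the example short.

```python
def sgd_linear_fit(rows, learning_rate=0.1):
    """Fit y = w * x + b by stochastic gradient descent on squared error,
    consuming (x, y) pairs one at a time from any iterable."""
    w, b = 0.0, 0.0
    for x, y in rows:
        error = (w * x + b) - y
        # Gradient step for this single row; the row can then be discarded.
        w -= learning_rate * error * x
        b -= learning_rate * error
    return w, b


def synthetic_rows(n):
    """Hypothetical data source: yields n noise-free points on y = 2x + 1.
    In practice this would read rows lazily from disk."""
    for i in range(n):
        x = (i % 100) / 100.0
        yield x, 2.0 * x + 1.0


if __name__ == "__main__":
    print(sgd_linear_fit(synthetic_rows(20_000)))
```

Only the current row and the model parameters are ever held in memory, so the memory footprint is independent of how many rows stream past.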

I'll share what Big Data is like in genomics, in particular de-novo assembly.

When we sequence your genome (eg: detect novel genes), we take billions of next-generation short reads. Look at the image below, where we try to assemble some reads.

[image: assembling a few short reads]

Looks simple? But what if you have billions of those reads? What if those reads contain sequencing errors? What if you don't have enough RAM to hold the reads? What about repetitive DNA regions, such as the very common Alu element?

De-novo assembly is done by constructing a de Bruijn graph:

[image: de Bruijn graph built from the reads]

The graph is a cleverly designed data structure for representing overlapping reads. It's not perfect, but it's better than generating all possible overlaps and storing them in an array.
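For readers who haven't seen one, here is a minimal sketch of how such a graph can be built from reads. It is a toy illustration, not what a production assembler does: the function name `build_de_bruijn` is my own, and using sets means k-mer multiplicities (which real assemblers track) are thrown away.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph from reads: each k-mer contributes an edge from
    its (k-1)-length prefix to its (k-1)-length suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    # Two overlapping toy reads; the shared k-mer "CGT" is stored only once.
    print(build_de_bruijn(["ACGT", "CGTA"], k=3))
```

Because overlapping reads share k-mers, the overlap is represented implicitly by shared nodes instead of by an explicit all-against-all comparison.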

The assembly process could take days to complete, because there are quite a number of paths that an assembler would need to traverse and collapse.

In genomics, you have big data when:

  • You can't brute force all combinations
  • Your computer doesn't have enough physical memory to store the data
  • You need to reduce the dimensions (eg: collapsing redundant graph paths)
  • You get pissed off because you'd have to wait days to do anything
  • You need a special data structure to represent the data
  • You need to filter your data-set for errors (eg: sequencing errors)

I think that big data starts at the point where the size prevents you from doing what you want to. In most scenarios, there is a limit on the running time that is considered feasible. In some cases it is an hour, in some cases it might be a few weeks. As long as the data is not so big that only O(n) algorithms can run in the feasible time frame, you haven't reached big data.
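A back-of-the-envelope estimate shows how quickly anything above O(n) becomes infeasible. The sketch below is only that, an estimate: the record count and the operations-per-second figure are assumptions, and constant factors are ignored.

```python
import math


def estimated_seconds(n, cost, ops_per_second):
    """Rough running time for an algorithm that performs cost(n) basic
    operations on a machine doing ops_per_second of them."""
    return cost(n) / ops_per_second


COSTS = {
    "O(n)": lambda n: n,
    "O(n log n)": lambda n: n * math.log2(n),
    "O(n^2)": lambda n: n ** 2,
}

if __name__ == "__main__":
    n = 10 ** 9   # assumed number of records
    speed = 1e9   # assumed basic operations per second
    for name, cost in COSTS.items():
        print(name, estimated_seconds(n, cost, speed), "seconds")
```

With these assumed numbers a linear pass takes about a second and an n log n algorithm about half a minute, while a quadratic one needs on the order of 10^9 seconds, which is roughly three decades.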

I like this definition since it is agnostic to volume, technology level and specific algorithms. It is not agnostic to resources so a grad student will reach the point of big data way before Google.

To quantify how big the data is, I like to consider the time needed to back it up. Since technology advances, volumes that were considered big some years ago are now moderate. Backup time improves as the technology improves, just like the running time of learning algorithms. I feel it is more sensible to talk about a dataset that takes X hours to back up than about a dataset of Y bytes.
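The conversion from bytes to backup hours is simple arithmetic; a tiny sketch follows. The throughput figures are assumptions chosen for illustration, and real backups are rarely limited by a single sustained rate.

```python
def backup_hours(size_bytes, throughput_bytes_per_second):
    """Hours needed to copy size_bytes at a sustained throughput."""
    return size_bytes / throughput_bytes_per_second / 3600


if __name__ == "__main__":
    one_terabyte = 10 ** 12
    print(backup_hours(one_terabyte, 100 * 10 ** 6))   # assumed 100 MB/s disk
    print(backup_hours(one_terabyte, 1000 * 10 ** 6))  # assumed 1 GB/s link
```

The same terabyte is a bit under three hours on the slower setup and well under half an hour on the faster one, which is the sense in which "big" moves with the technology.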


It is important to note that even if you have reached the big data point and cannot run algorithms of complexity higher than O(n) in the straightforward way, there is plenty you can do to still benefit from such algorithms.

For example, feature selection can reduce the number of features, on which the running time of many algorithms depends. In many long-tailed distributions, focusing on the few items in the head might be of benefit. You can also take a sample and run the slower algorithms on it.
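As one illustration of the long-tail trick, the sketch below picks the smallest set of most frequent items that covers a chosen share of all occurrences. The function name `head_items` and the coverage threshold are my own choices for this example.

```python
from collections import Counter


def head_items(items, coverage):
    """Return the most frequent distinct items, in order, until together they
    account for at least `coverage` (a fraction in (0, 1]) of all occurrences."""
    counts = Counter(items)
    total = sum(counts.values())
    head, covered = [], 0
    for item, count in counts.most_common():
        head.append(item)
        covered += count
        if covered / total >= coverage:
            break
    return head
```

Counting is a single O(n) pass, after which the expensive algorithm only has to deal with the handful of items in the head.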

Data becomes "big" when a single commodity computer can no longer handle the amount of data you have. It denotes the point at which you need to start thinking about building supercomputers or using clusters to process your data.

To me Big Data is primarily about the tools (after all, that's where it started); a "big" dataset is one that's too big to be handled with conventional tools - in particular, big enough to demand storage and processing on a cluster rather than a single machine. This rules out a conventional RDBMS, and demands new techniques for processing; in particular, various Hadoop-like frameworks make it easy to distribute a computation over a cluster, at the cost of restricting the form of this computation. I'll second the reference to ; Big Data techniques are a last resort for datasets which are simply too big to handle any other way. I'd say any dataset for any purpose could qualify if it was big enough - though if the shape of the problem is such that existing "big data" tools aren't appropriate, then it would probably be better to come up with a new name.
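To show what "restricting the form of this computation" looks like in practice, here is a single-process sketch of the map/shuffle/reduce shape using word counting. It is not the Hadoop API; all function names are made up, and nothing here is actually distributed.

```python
from collections import defaultdict


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key, as the framework would do between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values seen for one key."""
    return key, sum(values)


def word_count(documents):
    # Each map_phase call is independent, so a cluster could run them on
    # different machines; here they simply run one after another.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

The restriction is that the mapper sees one record at a time and the reducer sees one key at a time; in exchange, the framework is free to spread both phases across as many machines as it likes.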

Of course there is some overlap; when I (briefly) worked at, we worked on the same 50TB dataset using Hadoop and also in an SQL database on a fairly ridiculous server (I remember it had 1TB RAM, and this is a few years ago). Which in a sense meant it both was and wasn't big data, depending on which job you were working on. But I think that's an accurate characterization; the people who worked on the Hadoop jobs found it useful to go to Big Data conferences and websites, while the people who worked on the SQL jobs didn't.

Total amount of data in the world: 2.8 zettabytes in 2012, estimated to reach 8 zettabytes by 2015 (source), with a doubling time of 40 months. Can't get bigger than that :)
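For anyone who wants to play with these growth figures, here is a small sketch of the arithmetic. It assumes plain exponential growth and treats 2012 to 2015 as 36 months; both function names are made up for this example.

```python
import math


def project(initial, months, doubling_months):
    """Size after `months` of exponential growth with the given doubling time."""
    return initial * 2 ** (months / doubling_months)


def implied_doubling_months(initial, final, months):
    """Doubling time that takes `initial` to `final` in `months`."""
    return months / math.log2(final / initial)


if __name__ == "__main__":
    # Zettabytes in 2015 if 2.8 ZB doubles every 40 months.
    print(project(2.8, 36, 40))
    # Doubling time implied by going from 2.8 ZB to 8 ZB in 36 months.
    print(implied_doubling_months(2.8, 8, 36))
```

Under those assumptions a 40-month doubling time gives a little over 5 ZB by 2015, while growing from 2.8 to 8 ZB in three years implies a doubling time of roughly two years, so the quoted figures may well come from different estimates.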

As an example of a single large organization, Facebook pulls in 500 terabytes per day into a 100 petabyte warehouse, and runs 70k queries per day on it as of 2012 (source). Their current warehouse is >300 petabytes.

Big data is probably something that is a good fraction of the Facebook numbers (1/100 probably yes, 1/10000 probably not: it's a spectrum not a single number).

In addition to size, some of the features that make it "big" are:

  • it is actively analyzed, not just stored (quote: "If you aren’t taking advantage of big data, then you don’t have big data, you have just a pile of data", Jay Parikh @ Facebook)

  • building and running a data warehouse is a major infrastructure project

  • it is growing at a significant rate

  • it is unstructured or has irregular structure

Gartner definition: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing" (the 3Vs). So they also think "bigness" isn't entirely about the size of the dataset, but also about the velocity, the structure, and the kind of tools needed.

Big Data is defined by the volume of data, that's right, but not only by that. The particularity of big data is that you need to store lots of varied and sometimes unstructured data, all the time and from tons of sensors, usually for years or decades.

Furthermore, you need something scalable, so that it doesn't take you half a year to find a piece of data again.

So here comes Big Data, where traditional methods won't work anymore. SQL databases do not scale easily, and SQL works with very structured and linked data (with all that primary and foreign key mess, inner joins, nested queries...).

Basically, because storage becomes cheaper and cheaper and data becomes more and more valuable, senior managers ask engineers to record everything. Add to this tons of new sensors with all the mobile, social network, and embedded stuff, etc. Since classic methods won't work, they have to find new technologies (storing everything in files, in JSON format, with big indexes, which is what we call NoSQL).

So Big Data may be very big, but it can also be not so big yet complex, unstructured, or varied data that has to be stored quickly and on the fly in a raw format. We focus on storing first, and then we look at how to link everything together.

As you rightly note, these days "big data" is something everyone wants to say they've got, which entails a certain looseness in how people define the term. Generally, though, I'd say you're certainly dealing with big data if the scale is such that it's no longer feasible to manage with more traditional technologies such as RDBMS, at least without complementing them with big data technologies such as Hadoop.

How big your data has to actually be for that to be the case is debatable. Here's a (somewhat provocative) blog post that claims that it's not really the case for less than 5 TB of data. (To be clear, it doesn't claim "Less than 5 TB isn't big data", but just "Less than 5 TB isn't big enough that you need Hadoop".)

But even on smaller datasets, big data technologies like Hadoop can have other advantages, including being well suited to batch operations, playing well with unstructured data (as well as data whose structure isn't known in advance or could change), horizontal scalability (scaling by adding more nodes instead of beefing up your existing servers), and (as one of the commenters on the above-linked post notes) the ability to integrate your data processing with external data sets (think of a map-reduce where the mapper makes a call to another server). Other technologies associated with big data, like NoSQL databases, emphasize fast performance and consistent availability while dealing with large sets of data, as well as being able to handle semi-structured data and to scale horizontally.

Of course, traditional RDBMS have their own advantages including ACID guarantees (Atomicity, Consistency, Isolation, Durability) and better performance for certain operations, as well as being more standardized, more mature, and (for many users) more familiar. So even for indisputably "big" data, it may make sense to load at least a portion of your data into a traditional SQL database and use that in conjunction with big data technologies.
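As a small reminder of what the "A" in ACID buys you, here is a sketch using Python's built-in `sqlite3` module. The `accounts` table, the `make_demo_db` helper and the `transfer` function are all invented for this illustration; the point is only that a failed multi-statement update leaves no partial changes behind.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("alice", 100), ("bob", 50)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically: both updates or neither."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                     (amount, dst))
        (balance,) = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                                  (src,)).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
```

If `transfer` raises, both `UPDATE` statements are rolled back together, which is the kind of guarantee many big data stores relax in exchange for horizontal scalability.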

So, a more generous definition would be that you have big data so long as it's big enough that big data technologies provide some added value for you. But as you can see, that can depend not just on the size of your data but on how you want to work with it and what sort of requirements you have in terms of flexibility, consistency, and performance. How you're using your data is more relevant to the question than what you're using it for (e.g. data mining). That said, uses like data mining and machine learning are more likely to yield useful results if you have a big enough data set to work with.


