Time Complexity notation in Big Data platforms

I am redesigning some classical algorithms for the Hadoop/MapReduce framework. I was wondering whether there is any established approach for denoting Big-O style expressions to measure time complexity in this setting.

For example, hypothetically, a simple average calculation of n (= 1 billion) numbers is an O(n) + C operation using a simple for loop (I am assuming division to be a constant-time operation for the sake of simplicity). If I break this massively parallelizable algorithm down for MapReduce by dividing the data over k nodes, my time complexity would simply become O(n/k) + C + C'. Here, C' can be taken as the job planning overhead. Note that no shuffling was involved, and the reducer's job was nearly trivial.
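
As an illustration only, here is a minimal sketch in plain Java that simulates this k-way split with threads on one machine (it is not Hadoop code; the class name, K = 8 workers, and n = 10 million are made-up stand-ins for the question's k nodes and 1 billion numbers). Each worker does the O(n/k) summation, and the combine step plays the role of the near-trivial reducer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.LongStream;

public class ParallelAverage {
    static final int K = 8;  // hypothetical stand-in for the k nodes

    public static void main(String[] args) throws Exception {
        final long n = 10_000_000L;  // stand-in for the 1 billion numbers
        final long chunk = n / K;
        ExecutorService pool = Executors.newFixedThreadPool(K);

        // "Map" phase: each worker sums its own slice, O(n/k) work apiece.
        List<Future<Long>> partials = new ArrayList<>();
        for (int w = 0; w < K; w++) {
            final long lo = w * chunk + 1;
            final long hi = (w == K - 1) ? n : (w + 1) * chunk;
            partials.add(pool.submit(() -> LongStream.rangeClosed(lo, hi).sum()));
        }

        // Near-trivial "reduce" phase: combine K partial sums, then divide once.
        long total = 0;
        for (Future<Long> f : partials) total += f.get();
        pool.shutdown();

        // The average of 1..n should come out to (n + 1) / 2.
        System.out.println("average = " + (double) total / n);
    }
}
```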

I am interested in a more complete analysis of algorithms that iterate over the data in multiple passes and involve heavy shuffling and reducer operations. I would also like to incorporate, if possible, the I/O operations and the network transfer of data.
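
For what it is worth, one informal way to write down such a cost, purely as an illustration (the symbols r, n_i, d_i, s_i, R_i, B_io, and B_net below are my own labels, not established notation), is a per-round sum over the iterations:

```latex
T(n, k) \;=\; \sum_{i=1}^{r} \left(
    O\!\left(\frac{n_i}{k}\right)      % map work in round i
  + \frac{d_i}{B_{\mathrm{io}}}        % local disk reads/writes
  + \frac{s_i}{B_{\mathrm{net}}}       % shuffle bytes over the network
  + R_i                                % reduce work in round i
\right) + r \cdot C'
```

where r is the number of iterations (MapReduce rounds), n_i the records processed in round i, d_i and s_i the bytes of disk I/O and shuffle traffic in that round, B_io and B_net the disk and network bandwidths, R_i the reducer cost, and C' the per-job planning overhead from above.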

Tags: map-reduce, algorithms, bigdata

Category: Data Science


Big O time complexity is designed to analyze abstract algorithms, independent of implementation.

If you are interested in the performance of an actual implemented system, including shuffling, I/O operations, and network transfers, it would be more useful to look at benchmarking/profiling, which measures empirical performance by recording the observed time for specific operations.
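
As a rough sketch of what such a measurement looks like in code (the class name and data size are made up, and a real harness such as JMH handles warm-up and statistics far more carefully), one can time the operation of interest directly:

```java
import java.util.Random;

public class AverageBenchmark {
    public static void main(String[] args) {
        // 10 million random doubles as a made-up workload.
        double[] data = new Random(42).doubles(10_000_000).toArray();

        // Warm-up runs so the JIT compiles the hot path before we measure.
        for (int i = 0; i < 3; i++) average(data);

        // Timed runs: report the observed wall-clock time per pass.
        for (int i = 0; i < 5; i++) {
            long t0 = System.nanoTime();
            double avg = average(data);
            long t1 = System.nanoTime();
            System.out.printf("avg=%.6f took %.1f ms%n", avg, (t1 - t0) / 1e6);
        }
    }

    static double average(double[] data) {
        double sum = 0;
        for (double x : data) sum += x;
        return sum / data.length;
    }
}
```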
