Time Complexity notation in Big Data platforms

I am redesigning some classical algorithms for the Hadoop/MapReduce framework. I was wondering if there is any established approach for denoting Big-O-style expressions to measure time complexity? For example, hypothetically, a simple average calculation of n (= 1 billion) numbers is an O(n) + C operation using a simple for loop (or O(log n) with a parallel reduction); I am assuming division to be a constant-time operation for the sake of simplicity. If I break this massively parallelizable algorithm down for MapReduce, by dividing the data over …
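A minimal local sketch of that decomposition, assuming m mappers that each receive roughly n/m numbers: the map phase costs O(n/m) per mapper in parallel, and merging the m partial (sum, count) pairs costs O(m). The chunking and function names here are illustrative, not an established notation.

from functools import reduce

def mapper(chunk):
    # one map task: emit a (partial_sum, count) pair for its input split
    return (sum(chunk), len(chunk))

def combine(a, b):
    # reduce step: merge two partial results
    return (a[0] + b[0], a[1] + b[1])

data = list(range(1_000_000))             # stand-in for the n numbers
m = 8                                     # number of simulated mappers
chunks = [data[i::m] for i in range(m)]   # simulate m input splits
partials = [mapper(c) for c in chunks]    # O(n/m) each, parallel on a cluster
total, count = reduce(combine, partials)  # O(m) to merge
print(total / count)                      # 499999.5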
Category: Data Science

Timing sequence in MapReduce

I'm running tests on a MapReduce algorithm in different environments, like Hadoop and MongoDB, and with different types of data. What are the different methods or techniques to find out the execution time of a query? If I'm inserting a huge amount of data, say 2-3 GB, what are the methods to find out how long the process takes to complete?
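One straightforward approach is wall-clock timing around the job submission; Hadoop's JobHistory UI and job counters also report per-job elapsed time. A hedged sketch of the timing wrapper (the streaming jar path and the mapper/reducer script names are hypothetical):

import subprocess
import time

# Wall-clock timing around a Hadoop Streaming job submission.
start = time.time()
subprocess.run(
    ["hadoop", "jar", "hadoop-streaming.jar",
     "-input", "/data/in", "-output", "/data/out",
     "-mapper", "mapper.py", "-reducer", "reducer.py"],
    check=True,
)
print(f"elapsed: {time.time() - start:.1f} s")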
Category: Data Science

Would it be possible/practical to build a distributed deep learning engine by tapping into ordinary PCs' unused resources?

I started thinking about this in the context of Apple's new line of desktop CPUs with dedicated neural engines. From what I hear, these chips are quite adept at solving deep learning problems (as the name would imply). Since I can only imagine the average user wouldn't necessarily be optimizing cost functions on a regular basis, I was wondering if it would be theoretically possible to use those extra resources to set up some type of distributed network similar to a …
Category: Data Science

Word count with map reduce

Suppose we use an input file that contains the following lyrics from a famous song:

We're up all night to the sun
We're up all night to get some

The input pairs for the Map phase will be the following:

(0, "We're up all night to the sun")
(31, "We're up all night to get some")

The key is the byte offset starting from the beginning of the file. While we won't need this value in Word Count, it is …
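For concreteness, a minimal pure-Python simulation of the two phases on exactly these input pairs (no cluster required; function names are illustrative):

from collections import defaultdict

def map_phase(pairs):
    # the byte-offset key is ignored, exactly as the text notes
    for _, line in pairs:
        for word in line.split():
            yield word, 1

def reduce_phase(mapped):
    counts = defaultdict(int)
    for word, one in mapped:
        counts[word] += one
    return dict(counts)

pairs = [(0, "We're up all night to the sun"),
         (31, "We're up all night to get some")]
print(reduce_phase(map_phase(pairs)))   # e.g. {"We're": 2, 'up': 2, ...}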
Category: Data Science

KMeans using Mapreduce in Python

I wrote a mapreduce code in Python which works locally, i.e., cat test_mapper | python mapper.py, sort the result, and cat sorted_map_output | python reducer.py produces the desired result. As soon as this code is submitted to the mapreduce engine, it fails:

21/08/09 11:03:11 INFO mapreduce.Job:  map 50% reduce 0%
21/08/09 11:03:11 INFO mapreduce.Job: Task Id : attempt_1628505794323_0001_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
    ...
21/08/09 11:03:21 INFO mapreduce.Job:  map 100% reduce 100% …
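"subprocess failed with code 1" means the streaming task's Python process itself exited abnormally before Hadoop could consume its output. Since the failing scripts aren't shown, these are assumptions, but the usual suspects are a missing shebang line, scripts not shipped to the nodes via -files, or an uncaught exception on an unexpected input line. A defensively written streaming mapper skeleton:

#!/usr/bin/env python3
# Hadoop Streaming runs this script as a subprocess, so any uncaught
# exception (or missing shebang / execute permission) surfaces as
# "subprocess failed with code 1".
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue                    # skip blank lines instead of crashing
    try:
        values = [float(x) for x in line.split(",")]
    except ValueError:
        continue                    # skip malformed records
    # emit <key, value>; this key choice is hypothetical
    print(f"point\t{','.join(map(str, values))}")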
Category: Data Science

What is the MapReduce application master?

From Hadoop: The Definitive Guide:

The whole process is illustrated in Figure 7-1. At the highest level, there are five independent entities:
• The client, which submits the MapReduce job.
• The YARN resource manager, which coordinates the allocation of compute resources on the cluster.
• The YARN node managers, which launch and monitor the compute containers on machines in the cluster.
• The MapReduce application master, which coordinates the tasks running the MapReduce job.

The application master …
Category: Data Science

How to make k-means distributed?

After setting up a 2-node Hadoop cluster, understanding Hadoop and Python, and based on this naive implementation, I ended up with this code:

def kmeans(data, k, c=None):
    if c is not None:
        centroids = c
    else:
        centroids = []
        centroids = randomize_centroids(data, centroids, k)

    old_centroids = [[] for i in range(k)]
    iterations = 0
    while not (has_converged(centroids, old_centroids, iterations)):
        iterations += 1
        clusters = [[] for i in range(k)]

        # assign data points to clusters
        clusters = euclidean_dist(data, centroids, clusters)
…
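Each k-means iteration maps cleanly onto one MapReduce pass: the mapper assigns each point to its nearest centroid (it needs only the current centroids, so it parallelizes freely), and the reducer averages each cluster to produce the next centroids. A local simulation of that split, with hypothetical helper names rather than the question's euclidean_dist/has_converged helpers:

from collections import defaultdict
import math

def nearest(point, centroids):
    # index of the closest centroid
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def map_step(points, centroids):
    # mapper: emit (centroid_index, point)
    for p in points:
        yield nearest(p, centroids), p

def reduce_step(mapped):
    # reducer: average the points assigned to each centroid
    sums, counts = {}, defaultdict(int)
    for i, p in mapped:
        sums[i] = p if i not in sums else [a + b for a, b in zip(sums[i], p)]
        counts[i] += 1
    return [[v / counts[i] for v in sums[i]] for i in sorted(sums)]

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids = [[0.0, 0.0], [10.0, 10.0]]
for _ in range(5):          # driver loop: re-broadcast centroids each round
    centroids = reduce_step(map_step(points, centroids))
print(centroids)            # [[0.0, 0.5], [10.0, 10.5]]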
Category: Data Science

Dataset map function error : TypeError: Expected list for 'input' argument to 'EagerPyFunc' Op, not Tensor

I am currently trying to write a script to create a TFRecord file. Therefore, I am following the instructions on the official TensorFlow website: https://www.tensorflow.org/tutorials/load_data/tfrecord#writing_a_tfrecord_file However, when applying the map function to each element of the Dataset, I get an error that I do not understand. This is my code (it should be copy-and-pasteable):

import numpy as np
import tensorflow as tf
from tensorflow.data import Dataset

def generate_random_img_data(n_count=10, patch_size=5):
    return np.random.randint(low=0, high=256, size=(n_count, patch_size, patch_size, 3))

def as_int64_feature(value):
    return …
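The error in the title typically means tf.py_function received a bare tensor where its inp parameter expects a list of tensors. Since the rest of the script is cut off, the serializer below is a hypothetical reconstruction of the tutorial's pattern; the key line is wrapping the argument as inp=[x]:

import numpy as np
import tensorflow as tf

def serialize_example(x):
    # runs eagerly inside tf.py_function, so .numpy() is available
    feature = {
        "patch": tf.train.Feature(
            int64_list=tf.train.Int64List(value=x.numpy().flatten())
        )
    }
    proto = tf.train.Example(features=tf.train.Features(feature=feature))
    return proto.SerializeToString()

def tf_serialize_example(x):
    # inp must be a LIST of tensors; passing x bare raises
    # "Expected list for 'input' argument to 'EagerPyFunc' Op, not Tensor"
    serialized = tf.py_function(serialize_example, inp=[x], Tout=tf.string)
    return tf.reshape(serialized, ())

dataset = tf.data.Dataset.from_tensor_slices(
    np.random.randint(0, 256, size=(10, 5, 5, 3))
)
serialized_dataset = dataset.map(tf_serialize_example)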
Category: Data Science

Loading file into and out of HDFS via system call/cmd line vs using libhdfs

I am trying to implement a simple C/C++ program for the HDFS file system, like word count: it takes a file from the input path, puts it into HDFS (where it gets split), processes it with my map-reduce function, and produces an output file that I place back in the local file system. My question is: which makes the better design choice for loading the files into HDFS: calling bin/hdfs dfs -put ../inputFile /someDirectory from a C program, or making use of libhdfs?
Category: Data Science

How to read in all text files from UNIX bash directory in Cloudera's Python API

I'm still pretty new to Cloudera and the UNIX environment. I have written a mapper that reads in .txt files from a directory on my Windows system, which works just fine. I read files in like this:

import glob
files = glob.glob("*.txt")

Is there an equivalent way to do this in the UNIX environment? I know I can read in one file via infile = sys.stdin, but I'm not sure how to read in all the files from one directory. Thanks!
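glob is part of the Python standard library and works the same on UNIX as on Windows; only the path convention changes. A small sketch (the directory path is hypothetical):

import glob
import sys

# Forward-slash paths work on UNIX; this directory is a placeholder.
for path in glob.glob("/home/cloudera/input/*.txt"):
    with open(path) as infile:
        for line in infile:
            sys.stdout.write(line)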
Category: Data Science

Which one of these tasks will benefit the most from SPARK?

My company processes data (I am only an intern). We primarily use Hadoop, and we're starting to deploy Spark in production. Currently we have two jobs, and we will choose just one to begin with on Spark. The tasks are: the first job analyzes a large quantity of text to look for ERROR messages (grep); the second job does machine learning and computes model predictions on some data in an iterative way. My question is: which one of the two …
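For context on why iterative workloads are usually the better Spark candidates: Spark can cache a dataset in memory across iterations, while a single-pass grep-style scan gains little over MapReduce. A hedged PySpark sketch (the HDFS paths and the per-iteration computation are placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="spark-candidate-demo")

# Single-pass job: one scan over the logs, little benefit from caching.
errors = sc.textFile("hdfs:///logs/app.log").filter(lambda l: "ERROR" in l)
print(errors.count())

# Iterative job: cache() keeps the parsed points in memory, so every
# iteration after the first skips the disk read MapReduce would repeat.
points = (sc.textFile("hdfs:///data/points.csv")
            .map(lambda l: [float(v) for v in l.split(",")])
            .cache())
for _ in range(10):                       # placeholder iterative loop
    total = points.map(lambda p: p[0]).sum()
print(total)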
Category: Data Science

Nearest neighbors search for very high dimensional data

I have a big sparse matrix of users and the items they like (on the order of 1M users and 100K items, with very low density). I'm exploring ways in which I could perform kNN search on it. Given the size of my dataset and some initial tests I performed, my assumption is that the method I will use will need to be either parallel or distributed. So I'm considering two classes of possible solutions: one that is …
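One non-distributed baseline worth measuring first: scikit-learn's brute-force kNN accepts sparse input with cosine distance and parallelizes across cores, which may suffice before reaching for a cluster. A sketch on a smaller hypothetical stand-in matrix:

from scipy.sparse import random as sparse_random
from sklearn.neighbors import NearestNeighbors

# Hypothetical smaller stand-in for the ~1M x 100K user-item matrix.
X = sparse_random(10_000, 1_000, density=0.001, format="csr", random_state=0)

# Brute force works directly on sparse input and uses all cores with
# n_jobs=-1; space-partitioning trees degrade in very high dimensions.
knn = NearestNeighbors(n_neighbors=10, metric="cosine",
                       algorithm="brute", n_jobs=-1).fit(X)
distances, indices = knn.kneighbors(X[:5])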
Category: Data Science

Pig is not able to read the complete data

I am trying to load a huge dataset of around 3.4 TB, comprising approximately 1.4 million files, into Pig on Amazon EMR. The operations on the data are simple (JOIN and STORE), but the data is not getting loaded completely, and the program terminates with a Java OutOfMemory exception. I've tried increasing the Pig heap size to 8192 MB, but that hasn't worked; however, my code works fine if I use only 25% of the dataset. This is the last …
Category: Data Science

Suggestions on what patterns/analysis to derive from Airlines Big Data

I recently started learning Hadoop. I found this data set: http://stat-computing.org/dataexpo/2009/the-data.html (2009 data). I'd like suggestions on what types of patterns or analyses I could do in Hadoop MapReduce; I just need something to get started with (one starter idea is sketched below). If anyone has a better data set link that I can use for learning, please share it. The attributes are:

1. Year: 1987-2008
2. Month: 1-12
3. DayofMonth: 1-31
4. DayOfWeek: 1 (Monday) - 7 (Sunday)
5. DepTime: actual departure …
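A concrete starter analysis, sketched as a hedged local simulation of a Hadoop Streaming job (a real deployment would split the mapper and reducer into separate scripts): count flights per day of week, using column 4 from the attribute list above.

import sys
from collections import defaultdict

def mapper(lines):
    # emit (DayOfWeek, 1) per flight; DayOfWeek is column 4 in the list
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) > 3 and fields[0] != "Year":   # skip the header row
            yield fields[3], 1

def reducer(pairs):
    totals = defaultdict(int)
    for day, n in pairs:
        totals[day] += n
    return dict(totals)

if __name__ == "__main__":
    print(reducer(mapper(sys.stdin)))   # e.g. {'1': 8423, '2': 8519, ...}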
Category: Data Science

Data produced as an output to Dumbo API of Python not getting distributed to all the nodes of cluster

On the node from which I run Dumbo commands, all the files produced as output end up on that same node. For example, suppose there is a node named hvs on which I ran the script:

dumbo start matrix2seqfile.py -input hdfs://hm1/user/trainf1.csv -output hdfs://hm1/user/train_hdfs5.mseq -numreducetasks 25 -hadoop $HADOOP_INSTALL

When I inspect my file system, I find that all the files produced have accumulated on the hvs node only. Ideally, I'd like the files to be distributed throughout the cluster; my data …
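One way to verify where the output actually lives is to ask HDFS for the block locations: a path browsed through HDFS can appear to sit on whichever node you query from, while its block replicas may be spread across datanodes. A small sketch shelling out to the hdfs fsck command against the job's output path:

import subprocess

# Reports the datanodes holding each block replica of the output file.
subprocess.run(
    ["hdfs", "fsck", "/user/train_hdfs5.mseq",
     "-files", "-blocks", "-locations"],
    check=True,
)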
Category: Data Science
