Time Complexity notation in Big Data platforms

I am redesigning some classical algorithms for the Hadoop/MapReduce framework. I was wondering if there is any established approach for denoting Big-O-style expressions to measure time complexity? For example, hypothetically, a simple average calculation of n (= 1 billion) numbers is an O(n) + C operation using a simple for loop (or O(log n) with a parallel reduction); I am assuming division to be a constant-time operation for the sake of simplicity. If I break this massively parallelizable algorithm down for MapReduce, by dividing the data over …
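A minimal local sketch of that decomposition, assuming m mappers that each receive roughly n/m numbers: the map phase costs O(n/m) per mapper in parallel, and merging the m partial (sum, count) pairs costs O(m). The chunking and function names here are illustrative, not an established notation.

from functools import reduce

def mapper(chunk):
    # one map task: emit a (partial_sum, count) pair for its input split
    return (sum(chunk), len(chunk))

def combine(a, b):
    # reduce step: merge two partial results
    return (a[0] + b[0], a[1] + b[1])

data = list(range(1_000_000))             # stand-in for the n numbers
m = 8                                     # number of simulated mappers
chunks = [data[i::m] for i in range(m)]   # simulate m input splits
partials = [mapper(c) for c in chunks]    # O(n/m) each, parallel on a cluster
total, count = reduce(combine, partials)  # O(m) to merge
print(total / count)                      # 499999.5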
Category: Data Science

Timing sequence in MapReduce

I'm running tests on a MapReduce algorithm in different environments, like Hadoop and MongoDB, and with different types of data. What are the different methods or techniques to find out the execution time of a query? If I'm inserting a huge amount of data, say 2-3 GB, what are the methods to find out how long the process takes to complete?
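One straightforward approach is wall-clock timing around the job submission; Hadoop's JobHistory UI and job counters also report per-job elapsed time. A hedged sketch of the timing wrapper (the streaming jar path and the mapper/reducer script names are hypothetical):

import subprocess
import time

# Wall-clock timing around a Hadoop Streaming job submission.
start = time.time()
subprocess.run(
    ["hadoop", "jar", "hadoop-streaming.jar",
     "-input", "/data/in", "-output", "/data/out",
     "-mapper", "mapper.py", "-reducer", "reducer.py"],
    check=True,
)
print(f"elapsed: {time.time() - start:.1f} s")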
Category: Data Science

Would it be possible/practical to build a distributed deep learning engine by tapping into ordinary PCs' unused resources?

I started thinking about this in the context of Apple's new line of desktop CPUs with dedicated neural engines. From what I hear, these chips are quite adept at solving deep learning problems (as the name would imply). Since I can only imagine the average user wouldn't necessarily be optimizing cost functions on a regular basis, I was wondering if it would be theoretically possible to use those extra resources to set up some type of distributed network similar to a …
Category: Data Science

Word count with map reduce

Suppose we use an input file that contains the following lyrics from a famous song:

We're up all night to the sun
We're up all night to get some

The input pairs for the Map phase will be the following:

(0, "We're up all night to the sun")
(31, "We're up all night to get some")

The key is the byte offset starting from the beginning of the file. While we won't need this value in Word Count, it is …
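For concreteness, a minimal pure-Python simulation of the two phases on exactly these input pairs (no cluster required; function names are illustrative):

from collections import defaultdict

def map_phase(pairs):
    # the byte-offset key is ignored, exactly as the text notes
    for _, line in pairs:
        for word in line.split():
            yield word, 1

def reduce_phase(mapped):
    counts = defaultdict(int)
    for word, one in mapped:
        counts[word] += one
    return dict(counts)

pairs = [(0, "We're up all night to the sun"),
         (31, "We're up all night to get some")]
print(reduce_phase(map_phase(pairs)))   # e.g. {"We're": 2, 'up': 2, ...}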
Category: Data Science

KMeans using Mapreduce in Python

I wrote a mapreduce code in Python which works locally, i.e., cat test_mapper | python mapper.py, sort the result, and cat sorted_map_output | python reducer.py produces the desired result. As soon as this code is submitted to the mapreduce engine, it fails:

21/08/09 11:03:11 INFO mapreduce.Job:  map 50% reduce 0%
21/08/09 11:03:11 INFO mapreduce.Job: Task Id : attempt_1628505794323_0001_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
    ...
21/08/09 11:03:21 INFO mapreduce.Job:  map 100% reduce 100% …
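"subprocess failed with code 1" means the streaming task's Python process itself exited abnormally before Hadoop could consume its output. Since the failing scripts aren't shown, these are assumptions, but the usual suspects are a missing shebang line, scripts not shipped to the nodes via -files, or an uncaught exception on an unexpected input line. A defensively written streaming mapper skeleton:

#!/usr/bin/env python3
# Hadoop Streaming runs this script as a subprocess, so any uncaught
# exception (or missing shebang / execute permission) surfaces as
# "subprocess failed with code 1".
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue                    # skip blank lines instead of crashing
    try:
        values = [float(x) for x in line.split(",")]
    except ValueError:
        continue                    # skip malformed records
    # emit <key, value>; this key choice is hypothetical
    print(f"point\t{','.join(map(str, values))}")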
Category: Data Science

What is the MapReduce application master?

From Hadoop: The Definitive Guide:

The whole process is illustrated in Figure 7-1. At the highest level, there are five independent entities:
• The client, which submits the MapReduce job.
• The YARN resource manager, which coordinates the allocation of compute resources on the cluster.
• The YARN node managers, which launch and monitor the compute containers on machines in the cluster.
• The MapReduce application master, which coordinates the tasks running the MapReduce job.

The application master …
Category: Data Science

How to make k-means distributed?

After setting up a 2-node Hadoop cluster, understanding Hadoop and Python, and based on this naive implementation, I ended up with this code:

def kmeans(data, k, c=None):
    if c is not None:
        centroids = c
    else:
        centroids = []
        centroids = randomize_centroids(data, centroids, k)

    old_centroids = [[] for i in range(k)]
    iterations = 0
    while not (has_converged(centroids, old_centroids, iterations)):
        iterations += 1
        clusters = [[] for i in range(k)]

        # assign data points to clusters
        clusters = euclidean_dist(data, centroids, clusters)
…
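Each k-means iteration maps cleanly onto one MapReduce pass: the mapper assigns each point to its nearest centroid (it needs only the current centroids, so it parallelizes freely), and the reducer averages each cluster to produce the next centroids. A local simulation of that split, with hypothetical helper names rather than the question's euclidean_dist/has_converged helpers:

from collections import defaultdict
import math

def nearest(point, centroids):
    # index of the closest centroid
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def map_step(points, centroids):
    # mapper: emit (centroid_index, point)
    for p in points:
        yield nearest(p, centroids), p

def reduce_step(mapped):
    # reducer: average the points assigned to each centroid
    sums, counts = {}, defaultdict(int)
    for i, p in mapped:
        sums[i] = p if i not in sums else [a + b for a, b in zip(sums[i], p)]
        counts[i] += 1
    return [[v / counts[i] for v in sums[i]] for i in sorted(sums)]

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids = [[0.0, 0.0], [10.0, 10.0]]
for _ in range(5):          # driver loop: re-broadcast centroids each round
    centroids = reduce_step(map_step(points, centroids))
print(centroids)            # [[0.0, 0.5], [10.0, 10.5]]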
Category: Data Science

Dataset map function error : TypeError: Expected list for 'input' argument to 'EagerPyFunc' Op, not Tensor

I am currently trying to write a script to create a TFRecord file. Therefore, I am following the instructions on the official TensorFlow website: https://www.tensorflow.org/tutorials/load_data/tfrecord#writing_a_tfrecord_file However, when applying the map function to each element of the Dataset, I get an error that I do not understand. This is my code (it should be copy-and-pasteable):

import numpy as np
import tensorflow as tf
from tensorflow.data import Dataset

def generate_random_img_data(n_count=10, patch_size=5):
    return np.random.randint(low=0, high=256, size=(n_count, patch_size, patch_size, 3))

def as_int64_feature(value):
    return …
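The error in the title typically means tf.py_function received a bare tensor where its inp parameter expects a list of tensors. Since the rest of the script is cut off, the serializer below is a hypothetical reconstruction of the tutorial's pattern; the key line is wrapping the argument as inp=[x]:

import numpy as np
import tensorflow as tf

def serialize_example(x):
    # runs eagerly inside tf.py_function, so .numpy() is available
    feature = {
        "patch": tf.train.Feature(
            int64_list=tf.train.Int64List(value=x.numpy().flatten())
        )
    }
    proto = tf.train.Example(features=tf.train.Features(feature=feature))
    return proto.SerializeToString()

def tf_serialize_example(x):
    # inp must be a LIST of tensors; passing x bare raises
    # "Expected list for 'input' argument to 'EagerPyFunc' Op, not Tensor"
    serialized = tf.py_function(serialize_example, inp=[x], Tout=tf.string)
    return tf.reshape(serialized, ())

dataset = tf.data.Dataset.from_tensor_slices(
    np.random.randint(0, 256, size=(10, 5, 5, 3))
)
serialized_dataset = dataset.map(tf_serialize_example)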
Category: Data Science

Loading file into and out of HDFS via system call/cmd line vs using libhdfs

I am trying to implement a simple C/C++ program for the HDFS file system, like word count: it takes a file from the input path, puts it into HDFS (where it gets split), processes it with my map-reduce function, and produces an output file that I place back in the local file system. My question is: which makes the better design choice for loading the files into HDFS: calling bin/hdfs dfs -put ../inputFile /someDirectory from a C program, or making use of libhdfs?
Category: Data Science

How to read in all text files from UNIX bash directory in Cloudera's Python API

I'm still pretty new to Cloudera and the UNIX environment. I have written a mapper that reads in .txt files from a directory on my Windows system, which works just fine. I read files in like this:

import glob
files = glob.glob("*.txt")

Is there an equivalent way to do this in the UNIX environment? I know I can read in one file via infile = sys.stdin, but I'm not sure how to read in all the files from one directory. Thanks!
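glob is part of the Python standard library and works the same on UNIX as on Windows; only the path convention changes. A small sketch (the directory path is hypothetical):

import glob
import sys

# Forward-slash paths work on UNIX; this directory is a placeholder.
for path in glob.glob("/home/cloudera/input/*.txt"):
    with open(path) as infile:
        for line in infile:
            sys.stdout.write(line)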
Category: Data Science

Which one of these tasks will benefit the most from SPARK?

My company processes data (I am only an intern). We primarily use Hadoop, and we're starting to deploy Spark in production. Currently we have two jobs, and we will choose just one to begin with on Spark. The tasks are: the first job analyzes a large quantity of text to look for ERROR messages (grep); the second job does machine learning and computes model predictions on some data in an iterative way. My question is: which one of the two …
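For context on why iterative workloads are usually the better Spark candidates: Spark can cache a dataset in memory across iterations, while a single-pass grep-style scan gains little over MapReduce. A hedged PySpark sketch (the HDFS paths and the per-iteration computation are placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="spark-candidate-demo")

# Single-pass job: one scan over the logs, little benefit from caching.
errors = sc.textFile("hdfs:///logs/app.log").filter(lambda l: "ERROR" in l)
print(errors.count())

# Iterative job: cache() keeps the parsed points in memory, so every
# iteration after the first skips the disk read MapReduce would repeat.
points = (sc.textFile("hdfs:///data/points.csv")
            .map(lambda l: [float(v) for v in l.split(",")])
            .cache())
for _ in range(10):                       # placeholder iterative loop
    total = points.map(lambda p: p[0]).sum()
print(total)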
Category: Data Science

Nearest neighbors search for very high dimensional data

I have a big sparse matrix of users and the items they like (on the order of 1M users and 100K items, with very low density). I'm exploring ways in which I could perform kNN search on it. Given the size of my dataset and some initial tests I performed, my assumption is that the method I will use will need to be either parallel or distributed. So I'm considering two classes of possible solutions: one that is …
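One non-distributed baseline worth measuring first: scikit-learn's brute-force kNN accepts sparse input with cosine distance and parallelizes across cores, which may suffice before reaching for a cluster. A sketch on a smaller hypothetical stand-in matrix:

from scipy.sparse import random as sparse_random
from sklearn.neighbors import NearestNeighbors

# Hypothetical smaller stand-in for the ~1M x 100K user-item matrix.
X = sparse_random(10_000, 1_000, density=0.001, format="csr", random_state=0)

# Brute force works directly on sparse input and uses all cores with
# n_jobs=-1; space-partitioning trees degrade in very high dimensions.
knn = NearestNeighbors(n_neighbors=10, metric="cosine",
                       algorithm="brute", n_jobs=-1).fit(X)
distances, indices = knn.kneighbors(X[:5])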
Category: Data Science

Pig is not able to read the complete data

I am trying to load a huge dataset of around 3.4 TB, comprising approximately 1.4 million files, into Pig on Amazon EMR. The operations on the data are simple (JOIN and STORE), but the data is not getting loaded completely, and the program terminates with a Java OutOfMemory exception. I've tried increasing the Pig heap size to 8192 MB, but that hasn't worked; however, my code works fine if I use only 25% of the dataset. This is the last …
Category: Data Science

Suggestions on what patterns/analysis to derive from Airlines Big Data

I recently started learning Hadoop. I found this data set: http://stat-computing.org/dataexpo/2009/the-data.html (2009 data). I'd like suggestions on what types of patterns or analyses I could do in Hadoop MapReduce; I just need something to get started with (one starter idea is sketched below). If anyone has a better data set link that I can use for learning, please share it. The attributes are:

1. Year: 1987-2008
2. Month: 1-12
3. DayofMonth: 1-31
4. DayOfWeek: 1 (Monday) - 7 (Sunday)
5. DepTime: actual departure …
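A concrete starter analysis, sketched as a hedged local simulation of a Hadoop Streaming job (a real deployment would split the mapper and reducer into separate scripts): count flights per day of week, using column 4 from the attribute list above.

import sys
from collections import defaultdict

def mapper(lines):
    # emit (DayOfWeek, 1) per flight; DayOfWeek is column 4 in the list
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) > 3 and fields[0] != "Year":   # skip the header row
            yield fields[3], 1

def reducer(pairs):
    totals = defaultdict(int)
    for day, n in pairs:
        totals[day] += n
    return dict(totals)

if __name__ == "__main__":
    print(reducer(mapper(sys.stdin)))   # e.g. {'1': 8423, '2': 8519, ...}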
Category: Data Science

Data produced as an output to Dumbo API of Python not getting distributed to all the nodes of cluster

On the node from which I run Dumbo commands, all the files produced as output end up on that same node. For example, suppose there is a node named hvs on which I ran the script:

dumbo start matrix2seqfile.py -input hdfs://hm1/user/trainf1.csv -output hdfs://hm1/user/train_hdfs5.mseq -numreducetasks 25 -hadoop $HADOOP_INSTALL

When I inspect my file system, I find that all the files produced have accumulated on the hvs node only. Ideally, I'd like the files to be distributed throughout the cluster; my data …
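One way to verify where the output actually lives is to ask HDFS for the block locations: a path browsed through HDFS can appear to sit on whichever node you query from, while its block replicas may be spread across datanodes. A small sketch shelling out to the hdfs fsck command against the job's output path:

import subprocess

# Reports the datanodes holding each block replica of the output file.
subprocess.run(
    ["hdfs", "fsck", "/user/train_hdfs5.mseq",
     "-files", "-blocks", "-locations"],
    check=True,
)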
Category: Data Science
