Accumulators in Spark (PySpark) without global variables?

BACKGROUND: Consider the following textbook example, which uses accumulators to add vectors.

    from pyspark import AccumulatorParam

    class VectorAccumulatorParam(AccumulatorParam):
        def zero(self, value):
            dict1 = {i: 0 for i in range(0, len(value))}
            return dict1

        def addInPlace(self, val1, val2):
            for i in val1.keys():
                val1[i] += val2[i]
            return val1

    rdd1 = sc.parallelize([{0: 0.3, 1: 0.8, 2: 0.4},
                           {0: 0.2, 1: 0.4, 2: 0.2},
                           {0: -0.1, 1: 0.4, 2: 1.6}])
    vector_acc = sc.accumulator({0: 0, 1: 0, 2: 0}, VectorAccumulatorParam())

    def mapping_fn(x):
        global vector_acc
        vector_acc += …
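For what it's worth, the usual answer is that the global statement is unnecessary: functions shipped to executors capture the accumulator through their closure. A minimal sketch reusing the question's sc, rdd1, and VectorAccumulatorParam (it assumes those are defined as above):

    # Assumes sc, rdd1, and VectorAccumulatorParam from the snippet above.
    vector_acc = sc.accumulator({0: 0, 1: 0, 2: 0}, VectorAccumulatorParam())

    def add_vector(x):
        # The accumulator is captured by the closure; no global needed.
        vector_acc.add(x)

    # foreach rather than map: accumulators are meant for side effects
    # inside actions, not transformations.
    rdd1.foreach(add_vector)
    print(vector_acc.value)  # the merged vector, readable on the driver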
Category: Data Science

Storage of N-dimensional matrices (tensors) as part of machine learning pipelines

I'm an infra person working on a storage product. I've been googling quite a bit to find an answer to the following question but have been unable to do so, hence I am attempting to ask it here. I am aware that relational or structured data can often be represented in 2-dimensional tables like DataFrames, and that these can be used as ML training input. If we want to store the DataFrames, they can easily be stored as tables in a …
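For a sense of the baseline such a product competes with: NumPy's .npy format already persists an N-dimensional array together with its shape and dtype (HDF5, Zarr, and TFRecord are the common heavier-weight options in ML pipelines). A minimal sketch with a made-up image-batch tensor:

    import numpy as np

    # Hypothetical 4-D tensor: (batch, height, width, channels).
    tensor = np.random.rand(32, 224, 224, 3).astype(np.float32)

    # .npy stores shape and dtype next to the raw buffer.
    np.save("batch_000.npy", tensor)

    restored = np.load("batch_000.npy")
    assert restored.shape == (32, 224, 224, 3)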
Category: Data Science

Word count with MapReduce

Suppose we use an input file that contains the following lyrics from a famous song:

    We’re up all night till the sun
    We’re up all night to get some

The input pairs for the Map phase will be the following:

    (0, "We’re up all night till the sun")
    (31, "We’re up all night to get some")

The key is the byte offset starting from the beginning of the file. While we won’t need this value in Word Count, it is …
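To make the phases concrete, here is a minimal Hadoop Streaming word count in Python; the file names mapper.py and reducer.py are placeholders, and the reducer relies on the framework sorting keys between the two phases:

    # mapper.py: emit (word, 1) for every word on stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(word + "\t1")

    # reducer.py: lines arrive sorted by key, so each word's
    # counts form one contiguous run
    import sys

    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(current + "\t" + str(count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(current + "\t" + str(count))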
Category: Data Science

KMeans using MapReduce in Python

I wrote a mapreduce code in Python which works locally, i.e.,

    cat test_mapper | python mapper.py

followed by sorting the result, and

    cat sorted_map_output | python reducer.py

produces the desired result. As soon as this code is submitted to the mapreduce engine, it fails:

    21/08/09 11:03:11 INFO mapreduce.Job: map 50% reduce 0%
    21/08/09 11:03:11 INFO mapreduce.Job: Task Id : attempt_1628505794323_0001_m_000001_0, Status : FAILED
    Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
        at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
        at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
        ...
    21/08/09 11:03:21 INFO mapreduce.Job: map 100% reduce 100% …
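In my experience, "subprocess failed with code 1" almost always means the Python script itself died on the cluster (wrong interpreter on the nodes, missing shebang or execute bit, or an exception on a malformed record) rather than anything MapReduce-specific. A hedged skeleton that sidesteps the common causes; the parsing logic is a placeholder:

    #!/usr/bin/env python3
    # mapper.py: the shebang must resolve on every node, and the file
    # must be executable (chmod +x mapper.py) when shipped to the job.
    import sys

    for line in sys.stdin:
        try:
            # Placeholder parsing; replace with the real record format.
            key, value = line.rstrip("\n").split("\t", 1)
            print(key + "\t" + value)
        except ValueError:
            # Skip malformed lines instead of letting one record kill
            # the whole attempt with exit code 1.
            continue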
Category: Data Science

Saving Large Spark ML Pipeline to HDFS

I'm having trouble saving a large (relative to spark.rpc.message.maxSize) Spark ML pipeline to HDFS. Specifically, when I try to save the model to HDFS, it gives me an error related to Spark's maximum message size:

    scala> val mod = pipeline.fit(df)
    mod: org.apache.spark.ml.PipelineModel = pipeline_936bcade4716

    scala> mod.write.overwrite().save(modelPath.concat("model"))
    18/01/08 10:00:32 WARN TaskSetManager: Stage 8 contains a task of very large size (755610 KB). The maximum recommended task size is 100 KB.
    org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 2606:0 was …
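A commonly suggested mitigation (not guaranteed to help every pipeline) is raising spark.rpc.message.maxSize, which is set in MiB and capped at 2047. A PySpark sketch of the same flow; in the Scala shell the equivalent is passing --conf spark.rpc.message.maxSize=1024 at launch:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("large-pipeline-save")
        # Value in MiB; the hard ceiling is 2047.
        .config("spark.rpc.message.maxSize", "1024")
        .getOrCreate()
    )

    # pipeline, df, and model_path stand in for the question's objects.
    # model = pipeline.fit(df)
    # model.write().overwrite().save(model_path + "model")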
Category: Data Science

PySpark: java.io.EOFException

System:

- 1 name node, 4 cores, 16 GB RAM
- 1 master node, 4 cores, 16 GB RAM
- 6 data nodes, 4 cores, 16 GB RAM each
- 6 worker nodes, 4 cores, 16 GB RAM each
- around 5 terabytes of storage space

The data nodes and worker nodes exist on the same 6 machines, and the name node and master node exist on the same machine. In our docker compose, we have 6 GB set for the master, 8 GB set …
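Since the question is cut off, only a guess: java.io.EOFException in PySpark very often means an executor or Python worker died, and memory pressure inside the containers is the usual culprit, so the first thing to reconcile is the docker-compose limits against Spark's own settings. A sketch of the relevant knobs (values are placeholders, not recommendations):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("eof-debug")
        # JVM heap per executor; heap + overhead + Python workers
        # must all fit inside the container's memory limit.
        .config("spark.executor.memory", "6g")
        .config("spark.executor.memoryOverhead", "1g")
        # Memory a Python worker may use before spilling to disk.
        .config("spark.python.worker.memory", "1g")
        .getOrCreate()
    )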
Category: Data Science

Can a single-node Hadoop cluster be installed on a system with 1 GB RAM?

I am trying to learn Hadoop and would like to know whether a system with 1 GB of RAM is enough for a basic single-node installation, or whether more RAM is needed. It would also be helpful if someone could share the other minimum system requirements for a single-node setup. I tried checking the Apache Hadoop site, but there is no specific mention of minimum system requirements for installation. Thanks
Category: Data Science

SAP HANA or Hadoop?

This is a question regarding a career choice. I am a fresher and recently joined an MNC on a Data Engineering team, where I was offered training in either Hadoop or SAP HANA. I am unsure which one to choose. Can anyone help me make the right choice? Which of the two has better scope based on current trends? Thanks in advance.
Category: Data Science

How to run unmodified Python program on GPU servers with scheduled GPUs?

Say I have one server with 10 GPUs. I have a Python program which detects the available GPUs and uses all of them. I have a couple of users who will run Python (machine learning or data mining) programs that use the GPUs. I initially thought of using Hadoop, as I find YARN good at managing resources, including GPUs, and YARN has several scheduling strategies, like fair, FIFO, and capacity. I don't like hard-coded rules, e.g. user1 can only use gpu1, user2 …
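Short of a full scheduler, the lowest-tech way to keep the user programs unmodified is CUDA_VISIBLE_DEVICES: CUDA frameworks enumerate only the devices listed there, so a small launcher can hand each job its slice of the GPUs. A hedged sketch (the launcher and the fixed allocation are hypothetical; YARN 3.x and Kubernetes can schedule GPUs natively):

    import os
    import subprocess

    def run_job(script, gpu_ids):
        """Launch an unmodified Python script restricted to given GPUs."""
        env = os.environ.copy()
        # The child process sees only these devices, renumbered from 0,
        # so its "use all available GPUs" logic stays unchanged.
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
        return subprocess.Popen(["python", script], env=env)

    # Hypothetical split of the 10 GPUs between two users' jobs.
    p1 = run_job("train_user1.py", [0, 1, 2, 3, 4])
    p2 = run_job("mine_user2.py", [5, 6, 7, 8, 9])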
Category: Data Science

What are common problems around Hadoop storage?

I've been asked to lead a program to understand why our Hadoop storage is constantly near capacity. What questions should I ask?

- Data age? Data size?
- Housekeeping schedule?
- How do we identify the different types of compression used by different applications?
- How can we identify where the duplicate data sources are?
- Are jobs designated for edge nodes running only on edge nodes?
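To put numbers behind the data-age and data-size questions, the stock hdfs CLI already reports per-directory usage and modification times. A minimal sketch, assuming the hdfs binary is on PATH and /data is a placeholder root:

    import subprocess

    def hdfs(*args):
        """Run an hdfs dfs subcommand and return its stdout."""
        out = subprocess.run(["hdfs", "dfs", *args],
                             capture_output=True, text=True, check=True)
        return out.stdout

    # Size question: usage per directory (plain and replicated bytes).
    print(hdfs("-du", "-h", "/data"))

    # Age question: modification time of a suspect directory.
    print(hdfs("-stat", "%y", "/data/stale_dataset"))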
Category: Data Science

What is the main difference between Hadoop and Spark?

I recently read the following about Hadoop vs. Spark:

    Insist upon in-memory columnar data querying. This was the killer-feature that let Apache Spark run in seconds the queries that would take Hadoop hours or days. Memory is much faster than disk access, and any modern data platform should be optimized to take advantage of that speed. Also, columnar data storage greatly reduces the amount of memory spent on empty or redundant data.

Can someone explain: 1) what Apache Hadoop and …
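The memory-vs-disk part of that quote is easy to demonstrate: classic Hadoop MapReduce writes intermediate results to disk between jobs, while Spark can pin a dataset in executor memory and reuse it across queries. A small sketch (the data is synthetic, just to show the access pattern):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-demo").getOrCreate()
    df = spark.range(10_000_000).withColumnRenamed("id", "value")

    # The first action pays the full scan cost and populates the cache.
    df.cache()
    df.count()

    # Later queries reuse the in-memory copy instead of re-reading.
    df.filter(df["value"] % 2 == 0).count()
    df.groupBy(df["value"] % 10).count().show()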
Category: Data Science

What is the MapReduce application master?

From Hadoop: The Definitive Guide:

The whole process is illustrated in Figure 7-1. At the highest level, there are five independent entities:

• The client, which submits the MapReduce job.
• The YARN resource manager, which coordinates the allocation of compute resources on the cluster.
• The YARN node managers, which launch and monitor the compute containers on machines in the cluster.
• The MapReduce application master, which coordinates the tasks running the MapReduce job. The application master …
Category: Data Science

MapReduce jobs not working in Hive

I was trying to execute a hive query:

    select name, count(*) from amazon where review != NULL group by name;

    Number of reduce tasks not specified. Estimated from input data size: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1564815666993_0001, Tracking URL = http://aamir-VirtualBox:8088/proxy/application_1564815666993_0001/
    Kill Command = …
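Separate from the MapReduce failure, one thing worth flagging: review != NULL never matches in HiveQL, because any comparison with NULL yields NULL; the filter should be review IS NOT NULL. A sketch running the corrected query via pyhive (host and port are placeholders):

    from pyhive import hive

    conn = hive.Connection(host="localhost", port=10000)
    cursor = conn.cursor()
    cursor.execute(
        "SELECT name, count(*) "
        "FROM amazon "
        "WHERE review IS NOT NULL "  # review != NULL would drop every row
        "GROUP BY name"
    )
    for name, cnt in cursor.fetchall():
        print(name, cnt)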
Category: Data Science

How to make k-means distributed?

After setting up a 2-node Hadoop cluster, understanding Hadoop and Python, and based on this naive implementation, I ended up with this code:

    def kmeans(data, k, c=None):
        if c is not None:
            centroids = c
        else:
            centroids = []
            centroids = randomize_centroids(data, centroids, k)

        old_centroids = [[] for i in range(k)]
        iterations = 0
        while not (has_converged(centroids, old_centroids, iterations)):
            iterations += 1
            clusters = [[] for i in range(k)]

            # assign data points to clusters
            clusters = euclidean_dist(data, centroids, clusters)
            …
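Rather than porting the loop by hand, the usual route is Spark's built-in distributed k-means, which parallelizes the assignment and update steps across the cluster. A minimal PySpark sketch with made-up 2-D points:

    from pyspark.sql import SparkSession
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.appName("distributed-kmeans").getOrCreate()

    data = spark.createDataFrame(
        [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
         (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
        ["features"],
    )

    model = KMeans(k=2, seed=1).fit(data)
    print(model.clusterCenters())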
Category: Data Science

Hive / Impala best practice code structuring

Coming from a DWH background, I am used to putting subqueries almost everywhere in my queries. On a Hadoop project (with Hive version 1.1.0 on Cloudera), I noticed we can forgo subqueries in some cases. It made me wonder whether there are similar SQL-dialect-specific differences between what is used in Hadoop SQL and what you would use in a DWH setting. So I would like to extend this question so that people can mention what they have noticed as differences between Hadoop …
Category: Data Science

What are the use cases for Apache Spark vs Hadoop

With Hadoop 2.0 and YARN, Hadoop is supposedly no longer tied only to MapReduce solutions. With that advancement, what are the use cases for Apache Spark vs. Hadoop, considering both sit atop HDFS? I've read through the introduction documentation for Spark, but I'm curious whether anyone has encountered a problem that was more efficient and easier to solve with Spark compared to Hadoop.
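The textbook answer is iterative workloads: algorithms that re-read the same working set (k-means, PageRank, gradient descent) pay a disk round trip per iteration under classic MapReduce, but only the first read under Spark. A toy sketch of that access pattern:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("iterative-demo").getOrCreate()
    points = spark.sparkContext.parallelize(range(1_000_000)).cache()

    total = 0.0
    for _ in range(10):
        # Every pass reuses the cached partitions; a chain of MapReduce
        # jobs would re-read the input from HDFS on each iteration.
        total += points.map(lambda x: x * 2).mean()
    print(total)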
Category: Data Science

Cannot access Hive from Spark

I am trying to install a Hadoop + Spark + Hive cluster. I am using Hadoop 3.1.2, Spark 2.4.5 (Scala 2.11, prebuilt with user-provided Hadoop) and Hive 2.3.3 (I also tried 3.1.2 with the exact same results), all downloaded from their websites. I can run Spark apps (as yarn client) with no issues, and I can run Hive queries directly (beeline) or via pyhive with no issues (I tried both hive-on-mr and hive-on-spark; both work fine, jobs are created by yarn and …
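Offered as a guess, since the question is truncated: the detail that most often bites in this setup is that Spark only talks to the shared Hive metastore when Hive support is enabled and hive-site.xml is on its classpath (typically copied into $SPARK_HOME/conf); otherwise it silently falls back to a local Derby metastore with no databases in it. A minimal sketch:

    from pyspark.sql import SparkSession

    # Requires hive-site.xml under $SPARK_HOME/conf so this session
    # points at the shared metastore rather than local Derby.
    spark = (
        SparkSession.builder
        .appName("hive-access")
        .enableHiveSupport()
        .getOrCreate()
    )

    spark.sql("SHOW DATABASES").show()  # should list the Hive databases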
Category: Data Science
