I'm an infra person working on a storage product. I've been googling quite a bit to find an answer to the following question but have been unable to find one, so I am attempting to ask it here. I am aware that relational or structured data can often be represented in 2-dimensional tables like DataFrames, and that these can be used as ML training input. If we want to store the DataFrames they can easily be stored as tables in a …
Suppose we use an input file that contains the following lyrics from a famous song:

We’re up all night till the sun
We’re up all night to get some

The input pairs for the Map phase will be the following:

(0, "We’re up all night till the sun")
(31, "We’re up all night to get some")

The key is the byte offset starting from the beginning of the file. While we won’t need this value in Word Count, it is …
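The Map phase described above can be sketched as a Hadoop Streaming mapper in Python (a minimal sketch: under Streaming the framework consumes the byte-offset key itself and hands the mapper one line of text per line of stdin):

```python
import sys

def map_word_count(line):
    """Emit a (word, 1) pair for every whitespace-separated word in the line."""
    for word in line.strip().split():
        yield (word, 1)

if __name__ == "__main__":
    # Hadoop Streaming feeds input lines on stdin; the byte-offset key
    # never reaches the script. Output is tab-separated key/value pairs.
    for line in sys.stdin:
        for word, count in map_word_count(line):
            print("%s\t%d" % (word, count))
```

The shuffle phase would then group these pairs by word before the reducer sums the counts.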
For Spark's RDD operations, data must be in the form of an RDD, or be parallelized using:

ParallelizedData = sc.parallelize(data)

My question is: if I store data in HDFS, does it get parallelized automatically, or should I use the code above to use the data in Spark? Does storing data in HDFS make it an RDD?
I wrote a mapreduce code in Python which works locally, i.e. cat test_mapper | python mapper.py, sorting the result, and cat sorted_map_output | python reducer.py produces the desired result. But as soon as this code is submitted to the mapreduce engine, it fails: <code>21/08/09 11:03:11 INFO mapreduce.Job: map 50% reduce 0%
21/08/09 11:03:11 INFO mapreduce.Job: Task Id : attempt_1628505794323_0001_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
...
21/08/09 11:03:21 INFO mapreduce.Job: map 100% reduce 100% …</code>
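For reference, "subprocess failed with code 1" from PipeMapRed means the script crashed or could not be launched on the task nodes, even though it works locally. Common causes are a missing shebang, a missing executable bit, and Windows (CRLF) line endings. A minimal sketch of a Streaming-safe mapper (the word-count logic is illustrative, not the asker's code):

```python
#!/usr/bin/env python
# A Streaming mapper must be directly executable on every task node:
# the shebang line above, `chmod +x mapper.py`, and Unix (LF) line
# endings -- a CRLF after the shebang alone makes the launch fail with
# exit code 1.
import sys

def mapper(lines):
    """Turn each input line into tab-separated (word, 1) output records."""
    for line in lines:
        for token in line.strip().split():
            yield "%s\t%d" % (token, 1)

if __name__ == "__main__":
    for record in mapper(sys.stdin):
        print(record)
```

A common companion fix is shipping the scripts to the task nodes with the Streaming job's generic `-files mapper.py,reducer.py` option rather than relying on local paths.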
I'm having trouble saving a large (relative to spark.rpc.message.maxSize) Spark ML pipeline to HDFS. Specifically, when I try to save the model to HDFS, it gives me an error related to Spark's maximum message size:

scala> val mod = pipeline.fit(df)
mod: org.apache.spark.ml.PipelineModel = pipeline_936bcade4716

scala> mod.write.overwrite().save(modelPath.concat("model"))
18/01/08 10:00:32 WARN TaskSetManager: Stage 8 contains a task of very large size (755610 KB). The maximum recommended task size is 100 KB.
org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 2606:0 was …
System:
• 1 name node, 4 cores, 16 GB RAM
• 1 master node, 4 cores, 16 GB RAM
• 6 data nodes, 4 cores, 16 GB RAM each
• 6 worker nodes, 4 cores, 16 GB RAM each
• around 5 terabytes of storage space

The data nodes and worker nodes exist on the same 6 machines, and the name node and master node exist on the same machine. In our docker compose, we have 6 GB set for the master, 8 GB set …
I am trying to learn Hadoop and would like to know whether, for a basic single-node installation, a system with 1 GB of RAM would be enough, or whether we need more. It would be helpful if someone could share the other minimum system requirements for a single-node setup. I tried to check the Apache Hadoop site, but there is no specific mention of minimum system requirements for installation. Thanks
This is a question regarding a career choice. I am a fresher and I recently joined an MNC in the Data Engineering team. There I was offered training in either Hadoop or SAP HANA. I am in doubt as to which one I should choose. Can anyone help me make the right choice? Which of these two has better scope based on current trends? Thanks in advance.
An aspiring data scientist here. I don't know anything about Hadoop, but as I have been reading about Data Science and Big Data, I see a lot of talk about Hadoop. Is it absolutely necessary to learn Hadoop to be a Data Scientist?
Say I have one server with 10 GPUs. I have a Python program which detects the available GPUs and uses all of them. I have a couple of users who will run Python (machine learning or data mining) programs that use the GPUs. I initially thought to use Hadoop, as I find YARN is good at managing resources, including GPUs, and YARN has certain scheduling strategies, like fair, FIFO, and capacity. I don't like hard-coded rules, e.g. user1 can only use gpu1, user2 …
I've been asked to lead a program to understand why our Hadoop storage is constantly near capacity. What questions should I ask?
• Data age? Data size?
• Housekeeping schedule?
• How do we identify the different types of compression used by different applications?
• How can we identify where the duplicate data sources are?
• Are jobs designated for edge nodes only on edge nodes?
I recently read the following about Hadoop vs. Spark: Insist upon in-memory columnar data querying. This was the killer-feature that let Apache Spark run in seconds the queries that would take Hadoop hours or days. Memory is much faster than disk access, and any modern data platform should be optimized to take advantage of that speed. Also, columnar data storage greatly reduces the amount of memory spent on empty or redundant data. Can someone explain: 1) what Apache Hadoop and …
From Hadoop: The Definitive Guide:

The whole process is illustrated in Figure 7-1. At the highest level, there are five independent entities:
• The client, which submits the MapReduce job.
• The YARN resource manager, which coordinates the allocation of compute resources on the cluster.
• The YARN node managers, which launch and monitor the compute containers on machines in the cluster.
• The MapReduce application master, which coordinates the tasks running the MapReduce job.

The application master …
I was trying to execute a hive query:

select name, count(*) from amazon where review != NULL group by name;

Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1564815666993_0001, Tracking URL = http://aamir-VirtualBox:8088/proxy/application_1564815666993_0001/
Kill Command = …
After setting up a 2-node Hadoop cluster, understanding Hadoop and Python, and based on this naive implementation, I ended up with this code:

def kmeans(data, k, c=None):
    if c is not None:
        centroids = c
    else:
        centroids = []
        centroids = randomize_centroids(data, centroids, k)

    old_centroids = [[] for i in range(k)]
    iterations = 0
    while not (has_converged(centroids, old_centroids, iterations)):
        iterations += 1
        clusters = [[] for i in range(k)]

        # assign data points to clusters
        clusters = euclidean_dist(data, centroids, clusters) …
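The loop above depends on helpers (`randomize_centroids`, `has_converged`, `euclidean_dist`) that are not shown. A self-contained sketch with the same structure, substituting minimal stand-ins for those helpers (my assumptions, not the asker's code):

```python
import random

def assign_clusters(data, centroids):
    """Assign each point to the nearest centroid by squared Euclidean distance."""
    clusters = [[] for _ in centroids]
    for point in data:
        distances = [sum((p - c) ** 2 for p, c in zip(point, centroid))
                     for centroid in centroids]
        clusters[distances.index(min(distances))].append(point)
    return clusters

def kmeans(data, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(data, k)  # stand-in for randomize_centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        clusters = assign_clusters(data, centroids)
        # Recompute each centroid as the mean of its cluster; keep the old
        # centroid if a cluster went empty.
        new_centroids = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # stand-in for has_converged
            break
        centroids = new_centroids
    return centroids, clusters

# Example: two well-separated 1-D groups recover centroids near 0.5 and 10.5.
centroids, clusters = kmeans([[0.0], [1.0], [10.0], [11.0]], k=2)
```

In a MapReduce setting the `assign_clusters` step would become the map phase (emit nearest-centroid id per point) and the mean recomputation the reduce phase.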
Coming from a DWH background, I am used to putting subqueries almost everywhere in my queries. On a Hadoop project (with Hive version 1.1.0 on Cloudera), I noticed we can forgo subqueries in some cases. It made me wonder whether there are similar SQL-dialect-specific differences between what is used in Hadoop SQL and what you would use in a DWH setting. So I would like to extend this question so that people can mention what they have noticed as differences between Hadoop …
With Hadoop 2.0 and YARN, Hadoop is supposedly no longer tied only to map-reduce solutions. With that advancement, what are the use cases for Apache Spark vs. Hadoop, considering both sit atop HDFS? I've read through the introduction documentation for Spark, but I'm curious whether anyone has encountered a problem that was more efficient and easier to solve with Spark compared to Hadoop.
I am trying to install a Hadoop + Spark + Hive cluster. I am using Hadoop 3.1.2, Spark 2.4.5 (Scala 2.11, prebuilt with user-provided Hadoop) and Hive 2.3.3 (I also tried 3.1.2 with the exact same results). All were downloaded from their websites. I can run Spark apps (as a YARN client) with no issues, and I can run Hive queries directly (beeline) or via pyhive with no issues (I tried both hive-on-mr and hive-on-spark; both work fine, jobs are created by YARN and …