Does storing a file in HDFS parallelize it for Spark?

For Spark's RDD operations, data must be in the shape of an RDD or be parallelized using:

ParallelizedData = sc.parallelize(data)

My question is: if I store data in HDFS, does it get parallelized automatically, or should I use the code above to make it usable in Spark? Does storing data in HDFS put it in the shape of an RDD?



As you can see from the examples presented in the documentation, you can read directly from HDFS without much trouble, and Spark will partition the data for you automatically (typically one partition per HDFS block).

You only need parallelize when the data comes from a collection built in the driver program itself (like a val assignment or an in-memory list).
