Does storing a file in HDFS parallelize it for Spark?

For Spark's RDD operations, data must be in the shape of an RDD or be parallelized using:

ParallelizedData = sc.parallelize(data)

My question is: if I store data in HDFS, does it get parallelized automatically, or should I use the code above to make it usable in Spark? Does storing data in HDFS put it in the shape of an RDD?



As you can see from the examples presented in the documentation, you can read directly from HDFS without much trouble, and Spark will partition the data for you automatically (typically one partition per HDFS block).

You only need parallelize when the data comes from a collection built in the driver program itself (like a val assignment or an in-memory list).
