Scala RDD operation

I am new to Scala.

I have a CSV file stored in HDFS. I am reading that file in Scala using:

  val salesdata = sc.textFile("hdfs://localhost:9000/home/jayshree/sales.csv")

Here is a small sample of the sales data. Column a is the customer ID, b the transaction ID, c the item ID, and d the item price.

 a    b  c    d
 5  199 1   500
 33 235 1   500
 20 249 3   749
 35 36  4   757
 19 201 4   757
 17 94  5   763
 39 146 5   763
 42 162 5   763
 49 41  6   824
 3  70  6   824
 24 161 6   824
 48 216 6   824

I have to perform the following operations on it:

  1. Apply a discount to each item on column d (item price), say 30%. The formula is d - 0.3 * d.
  2. Find the minimum and maximum item value per customer after applying the 30% discount to each item.

I tried to multiply the values in column d by 30. The problem is that the value of d is read as a string, so multiplying it by a number repeats the string that many times instead of doing arithmetic.
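The repetition comes from how Scala treats `*` on a String: it repeats the text rather than multiplying. A minimal illustration in plain Scala (no Spark needed; the value "500" is taken from the sample above, and the object name is only for illustration):

```scala
object StringVsNumber {
  // A field read from a text file is a String, e.g. one price from column d.
  val raw: String = "500"

  // On a String, `*` means repetition, not arithmetic.
  val repeated: String = raw * 3

  // Convert to a number first, then apply the discount: d - 30% of d.
  val price: Double = raw.trim.toDouble
  val discounted: Double = price - 0.3 * price
}
```

Here `repeated` is the text "500" written three times, while `discounted` is a numeric value that can be compared and aggregated. The same `toDouble` conversion applies to each parsed field of an RDD line.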

I could convert it into a DataFrame and do it there, but I want to know whether these operations can be done on an RDD without converting it to a DataFrame.

Topic scala apache-spark

Category Data Science


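To find the overall max and min of the discounted price (across all customers, not per customer):

val path = "filePath"

// Read the file and drop the header row
val rdd = spark.sparkContext.textFile(path)
val header = rdd.first()
val dataWithoutHeader = rdd.filter(line => line != header)
dataWithoutHeader.foreach(println)

// Parse column d (item price) as a number and apply the 30% discount
val discountedPrices = dataWithoutHeader.map(_.split(','))
                                        .map { x => val d = x(3).trim.toDouble; d - 0.3 * d }

println("Max discounted price: " + discountedPrices.max())
println("Min discounted price: " + discountedPrices.min())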



For the first part, you can do the following (if the file has a header row, filter it out first, otherwise toDouble will fail on it):

val discount = salesdata.map( str => str.split(","))
                        .map( array => (array(0), array(1), array(2), array(3).toDouble) )
                        .map{ case(a, b, c, d) => (a, b, c, d-0.3*d)}

I'm not sure I understand the second part. The following gives you the min and max per item ID (column c):

val productPrices = discount.map{ case(a, b, c, d) => (c,(d,d)) }

val minMaxPerItemRDD = productPrices.reduceByKey{ case((min1,max1),(min2,max2)) => (math.min(min1,min2), math.max(max1, max2))}
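The question asks for the min and max per customer (column a) rather than per item, which only changes the key. Below is a minimal sketch of that variant, written against a plain Scala Seq so it runs without a cluster. The names `CustomerMinMax`, `parseLine`, `merge` and `minMaxPerCustomer` are hypothetical, and it assumes comma-separated lines with the price in the fourth column.

```scala
object CustomerMinMax {
  // Parse one CSV line into (customerId, discountedPrice).
  // Returns None for a header or malformed row instead of throwing.
  def parseLine(line: String, discount: Double = 0.3): Option[(String, Double)] = {
    val cols = line.split(",").map(_.trim)
    if (cols.length < 4) None
    else scala.util.Try(cols(3).toDouble).toOption
      .map(d => (cols(0), d - discount * d))
  }

  // Combine two (min, max) pairs; this is the function reduceByKey would take.
  def merge(x: (Double, Double), y: (Double, Double)): (Double, Double) =
    (math.min(x._1, y._1), math.max(x._2, y._2))

  // Local stand-in for the RDD pipeline: group by customer, then fold with merge.
  def minMaxPerCustomer(lines: Seq[String]): Map[String, (Double, Double)] =
    lines.flatMap(line => parseLine(line))
      .map { case (customer, price) => (customer, (price, price)) }
      .groupBy(_._1)
      .map { case (customer, rows) => customer -> rows.map(_._2).reduce(merge) }

  // Assumed RDD wiring (not run here), with salesdata as in the question:
  //   salesdata.flatMap(line => parseLine(line))
  //            .mapValues(p => (p, p))
  //            .reduceByKey(merge)
}
```

Calling `CustomerMinMax.minMaxPerCustomer` on a few lines returns one (min, max) pair per customer ID. Returning an Option from the parser lets the header row drop out without a separate filter step.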

Hope that's what you need.
