Is it possible to implement an RDD version of a for loop with map and reduce in PySpark?
I need to test an algorithm that computes a function on a dataframe, where in each execution I drop a column and compute the function again. Here is an example in Python/PySpark, but without using RDD operations:
from pyspark.sql import Row

df2581 = spark.sparkContext.parallelize([Row(a=1, b=3, c=5, d=7, e=9)]).toDF()
df2581.show()

# prints the type of each value from the second column onward; the map itself only yields None
wo = df2581.rdd.flatMap(lambda x: x[1:]).map(lambda a: print(type(a)))
wo.collect()
def f(x):
    # for each position, keep a copy of the row values with that element removed
    list3 = []
    for index in range(len(x)):
        values = list(x)
        del values[index]
        list3.append(values)
    return list3
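If I read the intent correctly, f should return one copy of the row values per dropped position; a quick check on the sample row's values:

print(f((1, 3, 5, 7, 9)))
# [[3, 5, 7, 9], [1, 5, 7, 9], [1, 3, 7, 9], [1, 3, 5, 9], [1, 3, 5, 7]]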
colu = df2581.columns

def add(x, y):
    return x + y

arr = []
for i in range(0, len(colu)):
    # note: x[i:] keeps the columns from i onward, it does not drop only column i
    words = df2581.rdd.map(lambda x: x[i:]).reduce(lambda a, b: a + b)
    # sum the remaining values with a second reduce
    total = spark.sparkContext.parallelize(words).reduce(add)
    arr.append(total)
I need to know whether it is possible to express this algorithm with RDD operations (map and reduce) rather than a Python for loop in PySpark.
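One way this could be expressed without the driver-side loop is a minimal sketch like the one below, assuming the goal is, for every column, the sum of the row values after dropping that column (the names n and sums_without_column are mine). Each row emits one (dropped_column_index, partial_sum) pair per column via flatMap, and reduceByKey aggregates the partial sums across rows:

from pyspark.sql import Row

df2581 = spark.sparkContext.parallelize([Row(a=1, b=3, c=5, d=7, e=9)]).toDF()
n = len(df2581.columns)

sums_without_column = (
    df2581.rdd
    # one (dropped_column_index, partial_sum) pair per column and per row
    .flatMap(lambda row: [(i, sum(v for j, v in enumerate(row) if j != i))
                          for i in range(n)])
    # add up the partial sums of all rows for each dropped column
    .reduceByKey(lambda a, b: a + b)
    .sortByKey()
    .values()
    .collect()
)
# for the single sample row this gives [24, 22, 20, 18, 16]

This computes the result for all columns in a single pass over the data instead of launching separate Spark jobs per column, which is usually why it scales better than the loop version.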
Topic pyspark apache-spark python bigdata
Category Data Science