Accumulators in Spark (PySpark) without global variables?
BACKGROUND
Consider the following textbook example which uses accumulators to add vectors.
from pyspark import AccumulatorParam

class VectorAccumulatorParam(AccumulatorParam):
    def zero(self, value):
        # Zero vector with the same keys as the initial value.
        dict1 = {i: 0 for i in range(0, len(value))}
        return dict1

    def addInPlace(self, val1, val2):
        # Element-wise addition of two "vectors" represented as dicts.
        for i in val1.keys():
            val1[i] += val2[i]
        return val1

# `sc` is an existing SparkContext (e.g. from the PySpark shell).
rdd1 = sc.parallelize([{0: 0.3, 1: 0.8, 2: 0.4},
                       {0: 0.2, 1: 0.4, 2: 0.2},
                       {0: -0.1, 1: 0.4, 2: 1.6}])

vector_acc = sc.accumulator({0: 0, 1: 0, 2: 0},
                            VectorAccumulatorParam())

def mapping_fn(x):
    global vector_acc
    vector_acc += x

rdd1.foreach(mapping_fn)
print(vector_acc.value)
This prints {0: 0.4, 1: 1.6, 2: 2.2}, i.e., the element-wise sum of the vectors.
QUESTION
The use of the global scope in mapping_fn() gnaws at me, since it's usually bad practice. Is there a simple way to illustrate how accumulators work without resorting to a global variable? This is all the more confusing since, as far as I understand, the beauty of Spark lies in a share-nothing philosophy, and global is precisely the opposite of that.
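The only workaround I have come up with so far (and I'm not sure it is idiomatic) is to call the accumulator's add() method instead of rebinding it with +=, and to hand the accumulator to the worker function through a closure rather than the global scope. A minimal sketch of what I mean, reusing sc, VectorAccumulatorParam and rdd1 from above (make_adder and add_to_acc are just names I made up):

def make_adder(acc):
    # The accumulator is captured by the closure; no `global` is needed
    # because acc.add(x) only calls a method and never rebinds the name `acc`.
    def add_to_acc(x):
        acc.add(x)
    return add_to_acc

vector_acc = sc.accumulator({0: 0, 1: 0, 2: 0}, VectorAccumulatorParam())
rdd1.foreach(make_adder(vector_acc))
print(vector_acc.value)  # expected: {0: 0.4, 1: 1.6, 2: 2.2}

Is something along these lines the recommended way, or is the global-variable version from the textbook actually the standard idiom?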
Topic pyspark apache-spark python apache-hadoop
Category Data Science