Apache Spark RDD filter into two RDDs


I need to split an RDD into two parts:

one part which satisfies a condition, and another part which does not. I can filter twice on the original RDD, but that seems inefficient. Is there a way to do what I'm after? I can't find anything in the API nor in the literature.

Spark doesn't support this by default. Filtering the same data twice isn't that bad if you cache it beforehand, and the filtering itself is quick.
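For example, a minimal sketch (assuming a SparkContext named sc is in scope):

val nums  = sc.parallelize(1 to 100).cache() // cache so both passes reuse the same data
val evens = nums.filter(_ % 2 == 0)
val odds  = nums.filter(_ % 2 != 0)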

If it's really just two different parts you need, you can use a helper method:

import org.apache.spark.rdd.RDD

implicit class RDDOps[T](rdd: RDD[T]) {
  def partitionBy(f: T => Boolean): (RDD[T], RDD[T]) = {
    val passes = rdd.filter(f)
    val fails = rdd.filter(e => !f(e)) // Spark doesn't have filterNot
    (passes, fails)
  }
}

val (matches, matchesNot) = sc.parallelize(1 to 100).cache().partitionBy(_ % 2 == 0)
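If evaluating the predicate is expensive, a variant (my own sketch, not part of the original answer; the name partitionOnce is made up) runs it only once per element by tagging each element with the result, caching the tagged RDD, and filtering on the tag:

import scala.reflect.ClassTag

def partitionOnce[T: ClassTag](rdd: RDD[T])(f: T => Boolean): (RDD[T], RDD[T]) = {
  val tagged = rdd.map(e => (f(e), e)).cache() // evaluate f exactly once per element
  val passes = tagged.filter(_._1).map(_._2)
  val fails  = tagged.filter(t => !t._1).map(_._2)
  (passes, fails)
}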

But if you have multiple types of data, just assign each filtered result to a new val.
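For instance, a hypothetical sketch that splits a cached RDD of raw strings into parsed integers and the unparseable leftovers (the names and the parsing predicate are illustrative):

val raw = sc.parallelize(Seq("1", "two", "3")).cache()
val numbers   = raw.filter(_.forall(_.isDigit)).map(_.toInt)   // RDD[Int]
val leftovers = raw.filter(s => !s.forall(_.isDigit))          // RDD[String]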

