Apache Spark RDD filter into two RDDs
I need to split an RDD into two parts:
one part that satisfies a condition and another part that does not. I can filter
twice on the original RDD, but that seems inefficient. Is there a way to do what I'm after? I can't find anything in the API or in the literature.
Spark doesn't support this by default. Filtering the same data twice isn't that bad if you cache it beforehand, and the filtering itself is quick.
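As a rough sketch of the cache-then-filter-twice approach, assuming a local Spark session and a toy range of integers (the object name `CachedSplit`, the method `evensAndOdds`, and the app name are made up for illustration):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object CachedSplit {
  // Returns (evens, odds). The input RDD is marked for caching, so after the
  // first action materializes it, later actions reuse the cached partitions
  // instead of recomputing the lineage.
  def evensAndOdds(sc: SparkContext): (RDD[Int], RDD[Int]) = {
    val numbers = sc.parallelize(1 to 100).cache()
    val evens = numbers.filter(_ % 2 == 0)
    val odds = numbers.filter(_ % 2 != 0)
    (evens, odds)
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption for this example only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cached-split")
      .getOrCreate()

    val (evens, odds) = evensAndOdds(spark.sparkContext)
    println(s"evens=${evens.count()}, odds=${odds.count()}")

    spark.stop()
  }
}
```

Note that `cache()` is lazy: nothing is stored until the first action runs, so the benefit shows up from the second pass onward.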
If it's really just two different types, you can use a helper method:
    import org.apache.spark.rdd.RDD

    implicit class RDDOps[T](rdd: RDD[T]) {
      // Splits the RDD into (elements passing f, elements failing f).
      def partitionBy(f: T => Boolean): (RDD[T], RDD[T]) = {
        val passes = rdd.filter(f)
        val fails = rdd.filter(e => !f(e)) // Spark doesn't have filterNot
        (passes, fails)
      }
    }

    val (matches, matchesNot) = sc.parallelize(1 to 100).cache().partitionBy(_ % 2 == 0)
But as soon as you have multiple types of data, just assign each filtered dataset to a new val.
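For the multi-way case, a minimal sketch might look like the following. It assumes the "types" can be told apart by a simple predicate (here, the sign of an integer); the object name `MultiWaySplit` and the method `bySign` are hypothetical and exist only for this example.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object MultiWaySplit {
  // Returns (negatives, zeros, positives), each assigned to its own val.
  // The input is cached once so the three filters share the same data.
  def bySign(numbers: RDD[Int]): (RDD[Int], RDD[Int], RDD[Int]) = {
    val cached = numbers.cache()
    val negatives = cached.filter(_ < 0)
    val zeros = cached.filter(_ == 0)
    val positives = cached.filter(_ > 0)
    (negatives, zeros, positives)
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption for this example only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("multi-way-split")
      .getOrCreate()

    val (neg, zero, pos) = bySign(spark.sparkContext.parallelize(-3 to 5))
    println(s"negatives=${neg.count()}, zeros=${zero.count()}, positives=${pos.count()}")

    spark.stop()
  }
}
```

Each extra category costs one more pass over the cached data, which is usually acceptable when the predicates are cheap.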