Aggregate separate Flume streams in Spark


I am researching the possibility of "realtime" log processing in our setup and have a question on how to proceed.

Our current setup (or the setup we intend to build) is as follows:

  • Server A generates log files through rsyslog into a folder per customer.
  • Server B generates log files through rsyslog into a folder per customer.

Both server A and server B generate 15 log files (1 per customer), one folder per customer; the structure looks like this:

/var/log/customer/logfile.log 

On server C I have a Flume sink running that listens to the rsyslog TCP messages from server A and server B. For testing I currently have 1 Flume sink for 1 customer, but I think I will need 1 Flume sink per customer.

This Flume sink forwards the log lines to a Spark application, which should aggregate the results per customer.

Now my question is: how can I make sure Spark (Streaming) aggregates the results per customer? Let's say each customer has its own Flume sink; how can I make sure Spark aggregates each Flume stream separately and doesn't mix 2 or more Flume streams together?
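For reference, this is roughly what I have in mind on the Spark side: union the per-customer Flume streams and key every line by customer before aggregating, so results stay separated. A minimal sketch only, assuming spark-streaming-flume; the host name, ports, and the extractCustomer helper are made-up placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.flume.FlumeUtils

    object PerCustomerCounts {
      // Hypothetical helper: pull the customer id out of an rsyslog line.
      def extractCustomer(line: String): String = line.split(" ")(0)

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("PerCustomerCounts")
        val ssc  = new StreamingContext(conf, Seconds(10))

        // One Flume receiver per customer sink (host/ports are placeholders).
        val ports   = Seq(41414, 41415, 41416)
        val streams = ports.map(port => FlumeUtils.createStream(ssc, "server-c", port))

        // Union all streams, then key every line by customer, so the aggregation
        // never mixes customers no matter which sink a line came from.
        val lines = ssc.union(streams).map(e => new String(e.event.getBody.array()))
        val countsPerCustomer = lines
          .map(line => (extractCustomer(line), 1L))
          .reduceByKey(_ + _)

        // Example aggregation: number of log lines per customer per batch.
        countsPerCustomer.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }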

Or would Kafka be more suitable for this kind of scenario?

Any insights are appreciated.

You can use Kafka with the customer id as the partition key. The basic idea is that in Kafka a message can have both a key and a value. Kafka guarantees that messages with the same key go to the same partition (and Spark Streaming understands the concept of partitions in Kafka, which lets you have a separate node handling every partition). If you want, you can use Flume's Kafka sink to write the messages to Kafka.
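For illustration, a minimal sketch of the producer side, keying each log line by customer id with the plain Kafka producer API. The broker address, topic name, and sample values are placeholders, and in your setup the Flume Kafka sink would fill this role instead of hand-written code:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object KeyedLogProducer {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        // Broker address is a placeholder for your Kafka cluster.
        props.put("bootstrap.servers", "server-c:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

        val producer = new KafkaProducer[String, String](props)

        // The customer id is the message key: Kafka hashes the key to pick the
        // partition, so every log line of one customer lands in the same partition.
        val customerId = "customer-01"
        val logLine    = "example rsyslog line for customer-01"
        producer.send(new ProducerRecord[String, String]("customer-logs", customerId, logLine))

        producer.close()
      }
    }

On the Spark side you would then read the topic with KafkaUtils.createDirectStream from spark-streaming-kafka and aggregate per key, much like the keyed aggregation sketched in the question above.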

