How to pass arguments to a streaming job on Amazon EMR


I want to produce the output of my map function, filtering the data by dates.

In local tests, I call the application passing the dates as parameters:

cat access_log | ./mapper.py 20/12/2014 31/12/2014 | ./reducer.py 

Then the parameters are read inside the map function:

#!/usr/bin/python
import sys

date1 = sys.argv[1]
date2 = sys.argv[2]
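For completeness, a minimal mapper along these lines might look like the sketch below. The Apache-style timestamp layout it parses is an assumption for illustration, not a detail from the original post.

#!/usr/bin/python
# Hypothetical mapper sketch: keep only access-log lines whose date falls
# between the two dates given on the command line, then emit "date<TAB>1".
# The Apache-style timestamp layout below is an assumption for illustration.
import sys
from datetime import datetime

def parse_arg_date(s):
    # arguments arrive as dd/mm/yyyy, e.g. 20/12/2014
    return datetime.strptime(s, "%d/%m/%Y")

date1 = parse_arg_date(sys.argv[1])
date2 = parse_arg_date(sys.argv[2])

for line in sys.stdin:
    # assumed timestamp form: ... [20/Dec/2014:10:15:32 +0000] ...
    try:
        stamp = line.split("[", 1)[1].split("]", 1)[0]
        line_date = datetime.strptime(stamp.split(":", 1)[0], "%d/%b/%Y")
    except (IndexError, ValueError):
        continue  # skip lines that do not match the expected layout
    if date1 <= line_date <= date2:
        print("%s\t%d" % (line_date.strftime("%d/%m/%Y"), 1))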

The question is: how do I pass the date parameters to the map call on Amazon EMR?

I am a beginner in MapReduce. I appreciate any help.

First of all, when you run a local test, the correct format (in order to reproduce how MapReduce works) is:

cat access_log | ./mapper.py 20/12/2014 31/12/2014 | sort | ./reducer.py | sort 

That is the way the Hadoop framework works.
If you are working on a big file, you should do it in steps to verify the results of each stage, meaning:

cat access_log | ./mapper.py 20/12/2014 31/12/2014 > map_result.txt
cat map_result.txt | sort > map_result_sorted.txt
cat map_result_sorted.txt | ./reducer.py > reduce_result.txt
cat reduce_result.txt | sort > map_reduce_result.txt
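The post never shows reducer.py; purely as an illustration, assuming the mapper emits tab-separated key/count lines like the sketch above, a minimal streaming reducer could look like this:

#!/usr/bin/python
# Hypothetical reducer sketch: sums the counts for each key in the
# (sorted) mapper output. Assumes tab-separated "key<TAB>count" lines.
import sys

current_key = None
current_count = 0

for line in sys.stdin:
    try:
        key, count = line.rstrip("\n").split("\t", 1)
        count = int(count)
    except ValueError:
        continue  # skip malformed lines
    if key == current_key:
        current_count += count
    else:
        if current_key is not None:
            print("%s\t%d" % (current_key, current_count))
        current_key = key
        current_count = count

if current_key is not None:
    print("%s\t%d" % (current_key, current_count))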

In regard to the main question: it is the same thing.

If you are going to use the Amazon web console to create your cluster, in the Add Step window write the following:

Name: Learning Amazon EMR
Mapper: (here it says: please give an S3 path to the mapper; ignore that, and write our script name and parameters instead, no backslash...) mapper.py 20/12/2014 31/12/2014
Reducer: (the same as in the mapper) reducer.py (you can add params here too)
Input location: ...
Output location: ... (just remember to use a new output location every time, or the task will fail)
Arguments: -files s3://cod/mapper.py,s3://cod/reducer.py (use your own file paths here; even if you add only one file, use the -files argument)

That's it.
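As a side note, the same step can be added without the console. A rough sketch using boto3 follows; the cluster id, region, and input/output S3 paths are placeholders, and it assumes an EMR release where command-runner.jar dispatches to hadoop-streaming (EMR 4.x and later):

#!/usr/bin/python
# Hypothetical sketch: adding the same streaming step to an existing EMR
# cluster with boto3. Cluster id, region, and S3 paths are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # your cluster id
    Steps=[{
        "Name": "learning amazon emr",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar invokes hadoop-streaming on EMR 4.x+
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", "s3://cod/mapper.py,s3://cod/reducer.py",
                "-mapper", "mapper.py 20/12/2014 31/12/2014",
                "-reducer", "reducer.py",
                "-input", "s3://cod/input/",        # placeholder
                "-output", "s3://cod/output-1/",    # must be new each run
            ],
        },
    }],
)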

If you are going to do the argument thing, I suggest you look at this guy's blog on how to use argument passing in order to use a single map/reduce file.
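The idea there is to keep the map and reduce logic in one script and pick the role with a command-line argument. A rough sketch of that pattern (my own illustration, not the linked blog's code) might be:

#!/usr/bin/python
# Hypothetical "single file" pattern: one script acts as mapper or reducer
# depending on its first argument, e.g.
#   -mapper  "job.py map 20/12/2014 31/12/2014"
#   -reducer "job.py reduce"
import sys

def run_map(date1, date2):
    for line in sys.stdin:
        # ... filter by date1/date2 and emit key<TAB>value pairs ...
        pass

def run_reduce():
    for line in sys.stdin:
        # ... aggregate the sorted key<TAB>value pairs ...
        pass

if __name__ == "__main__":
    role = sys.argv[1]
    if role == "map":
        run_map(sys.argv[2], sys.argv[3])
    else:
        run_reduce()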

Hope that helped.

