How to pass arguments to a streaming job on Amazon EMR
I want to produce the output of a map function, filtering the data by dates.
In local tests, I call the application passing the date parameters like this:
cat access_log | ./mapper.py 20/12/2014 31/12/2014 | ./reducer.py
Then the parameters are taken in the map function:

#!/usr/bin/python
import sys
date1 = sys.argv[1]
date2 = sys.argv[2]
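For context, a minimal runnable version of such a mapper could look like the following sketch (the log line format, with the request date in the first field, is an assumption for illustration only):

#!/usr/bin/python
import sys
from datetime import datetime

# the two boundary dates arrive as command-line arguments, e.g. 20/12/2014 31/12/2014
date1 = datetime.strptime(sys.argv[1], "%d/%m/%Y")
date2 = datetime.strptime(sys.argv[2], "%d/%m/%Y")

for line in sys.stdin:
    fields = line.split()
    # assumption: the request date is the first field of each access_log line
    try:
        when = datetime.strptime(fields[0], "%d/%m/%Y")
    except (IndexError, ValueError):
        continue  # skip malformed lines
    if date1 <= when <= date2:
        # emit key<TAB>value pairs for the reducer
        sys.stdout.write("%s\t1\n" % fields[0])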
The question is: how do I pass the date parameters to the map call on Amazon EMR?
I am a beginner in MapReduce, so any help is appreciated.
First of all, regarding the way you run your local test: the correct format (in order to reproduce how MapReduce works) is:
cat access_log | ./mapper.py 20/12/2014 31/12/2014 | sort | ./reducer.py | sort
That is the way the Hadoop framework works.
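For completeness, the reducer in that pipeline is just another script that reads the sorted key<TAB>value lines from stdin; a minimal count-per-key sketch (the key/value format here is assumed, not necessarily your actual reducer) looks like this:

#!/usr/bin/python
import sys

current_key = None
count = 0
for line in sys.stdin:
    # each input line is "key<TAB>value", already sorted by key
    key, _, value = line.strip().partition("\t")
    if key != current_key:
        if current_key is not None:
            sys.stdout.write("%s\t%d\n" % (current_key, count))
        current_key = key
        count = 0
    count += int(value or 1)
if current_key is not None:
    sys.stdout.write("%s\t%d\n" % (current_key, count))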
If you are working on a big file, you should do it in steps and verify the result of each stage, meaning:

cat access_log | ./mapper.py 20/12/2014 31/12/2014 > map_result.txt
cat map_result.txt | sort > map_result_sorted.txt
cat map_result_sorted.txt | ./reducer.py > reduce_result.txt
cat reduce_result.txt | sort > map_reduce_result.txt
In regard to the main question: it is the same thing.
If you are going to use the Amazon web console to create the cluster, then in the Add Step window fill in the following:

Name: learning amazon emr
Mapper: (the form asks for an S3 path to the mapper; ignore that and write the script name and its parameters, with no backslash) mapper.py 20/12/2014 31/12/2014
Reducer: (same as the mapper) reducer.py (you can add params here too)
Input location: ...
Output location: ... (just remember to use a new output location every time, or the step will fail)
Arguments: -files s3://cod/mapper.py,s3://cod/reducer.py (use your own file paths here; even if you add only one file, use the -files argument)

That's it.
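If you would rather add the step programmatically than through the console, the same mapper arguments can be passed with boto3 as a hadoop-streaming step. This is only a sketch: the region, cluster id, and input/output locations below are placeholders, and it assumes an EMR release where command-runner.jar is available.

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

# hadoop-streaming step equivalent to the console fields above;
# the dates ride along as extra tokens after the mapper script name
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
    Steps=[{
        "Name": "learning amazon emr",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", "s3://cod/mapper.py,s3://cod/reducer.py",
                "-mapper", "mapper.py 20/12/2014 31/12/2014",
                "-reducer", "reducer.py",
                "-input", "s3://your-bucket/input/",    # placeholder
                "-output", "s3://your-bucket/output-1/",  # placeholder, must be new each run
            ],
        },
    }],
)
print(response["StepIds"])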
If you are going to go further with the argument approach, I suggest looking at this guy's blog post on how to pass arguments so you can use a single map/reduce file.
Hope this helped.