hadoop - Which will give the best performance: Hive, Pig, or Python MapReduce with a text file and an Oracle table as sources?


I have the requirements below and am confused about which tool to choose for high performance. I am not a Java developer, but I am comfortable with Hive, Pig, and Python.

I am using HDP 2.1 with the Tez engine. My data sources are text files (80 GB) and an Oracle table (15 GB); both are structured data. I have heard that Hive suits structured data, and that Python MapReduce via the streaming concept has higher performance than Hive and Pig. Please clarify.

Currently I am using Hive, and my reasons are:

  • I need to join the two sources based on one column.
  • I am storing the join results in an ORC-format table, since the data size is huge.
  • The text file name is used to generate one output column; this is done via the virtual-column concept, using the INPUT__FILE__NAME field.
  • After the join I need arithmetic operations on each row, which I am doing via a Python UDF.
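The per-row arithmetic step via a Python UDF (streamed through Hive's TRANSFORM clause) can be sketched roughly as below. The three-numeric-column layout and the arithmetic are made-up placeholders, not the asker's actual schema or formula:

```python
import sys

def process_line(line):
    """Apply placeholder arithmetic to one tab-separated Hive row.

    Assumes three numeric columns plus the file-name column passed
    through by the query; the real schema will differ.
    """
    a, b, c, file_name = line.rstrip("\n").split("\t")
    result = float(a) * float(b) + float(c)  # placeholder arithmetic
    return "\t".join([a, b, c, file_name, str(result)])

if __name__ == "__main__":
    # Hive's TRANSFORM streams rows to the script on stdin
    # and reads result rows back from stdout.
    for row in sys.stdin:
        print(process_line(row))
```

Hive would invoke such a script with something like `SELECT TRANSFORM(...) USING 'python udf.py'` after `ADD FILE udf.py`.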

Currently the complete execution, from copying the data to HDFS through to the final result, takes 2.30 hours on a 4-node cluster using Hive with the Python UDF.

My questions are:

1) I have heard that Java MapReduce is faster. Is that true for Python MapReduce via the streaming concept too?

2) Can I achieve all of the above functions in Python: the join, retrieval of the text file name, and a compressed data flow like ORC, given that the volume is high?

3) Is a Pig join better than a Hive join? If yes, can I get the input text file name in Pig to generate an output column?

Thanks in advance.

  1. Python MapReduce via the Hadoop Streaming interface will be slower. That is due to the overhead of passing data through stdin and stdout, and the implementation of the streaming API consumer (in this case, Python). Python UDFs in Hive and Pig do the same thing.
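To make the overhead concrete: in a streaming job every record crosses a process boundary as text. A minimal word-count mapper (a generic illustration, not the asker's job) shows the per-record (de)serialization that native Java MapReduce avoids:

```python
import sys

def map_line(line):
    """Emit tab-separated (word, 1) pairs for one input line,
    in the Hadoop Streaming key\tvalue convention."""
    return [f"{word}\t1" for word in line.split()]

if __name__ == "__main__":
    # Every input line is parsed from stdin and every output pair is
    # re-serialized to stdout; this text round-trip per record is the
    # cost the streaming interface adds.
    for line in sys.stdin:
        for pair in map_line(line):
            print(pair)
```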

  2. You might not want to compress the data flow to ORC on the Python side. You would be stuck using Python's ORC libraries, and I am not sure any are available. It is easier to let Python return a serialized object and have the Hadoop reduce steps compress and store to ORC (with the Python UDF doing only the computation).

  3. Yes. Pig and Python have a nice programmatic interface, in that you can write Python scripts that dynamically generate Pig logic and submit the jobs in parallel, i.e. embedding Pig Latin in Python. It is robust enough to let you define Python UDFs while Pig handles the overall abstraction and job optimization. Pig's lazy evaluation, in cases of multiple joins or multiple transformations, can show pretty good performance in optimizing the complete pipeline.
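Dynamically generating Pig logic from Python can be as simple as templating Pig Latin text and handing each generated script to Pig (via the embedding API or `pig -f`). The relation names, paths, and positional join column below are illustrative only:

```python
def build_join_script(text_path, oracle_dump_path, join_col, out_path):
    """Return a Pig Latin script joining two tab-separated sources
    on one positional column ($0, $1, ...).

    All paths and the column index are caller-supplied placeholders;
    a Python driver could generate one such script per input slice
    and submit them in parallel.
    """
    return f"""
src_a = LOAD '{text_path}' USING PigStorage('\\t');
src_b = LOAD '{oracle_dump_path}' USING PigStorage('\\t');
joined = JOIN src_a BY ${join_col}, src_b BY ${join_col};
STORE joined INTO '{out_path}';
""".strip()

if __name__ == "__main__":
    print(build_join_script("/data/text", "/data/oracle_dump", 0, "/out/joined"))
```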

You are on HDP 2.1. Have you had a look at Spark? If performance is important, then for datasets of this size (which are not huge) you could expect the overall pipeline execution to be many times faster than with Hadoop's native MR engine.

