Variable/looping sequence of jobs

Tag: hadoop Author: jun583710105 Date: 2010-08-17

I'm considering using Hadoop/MapReduce to tackle a project, but I haven't quite figured out how to set up a job flow with a variable number of levels that must be processed in sequence.

E.g.:

Job 1: Map source data into X levels.
Job 2: MapReduce Level1 -> appends to Level2
Job 3: MapReduce Level2 -> appends to Level3
Job N: MapReduce LevelN -> appends to LevelN+1

And so on until the final level. The key is that each level must include its own specific source data as well as the results of the previous level.

I've looked at Pig, Hive, hamake, and Cascading, but have yet to see clear support for something like this.

Does anyone know an efficient way of accomplishing this? Right now I'm leaning towards writing a wrapper for hamake that will generate the hamake file based on parameters (the number of levels is known at runtime but could change with each run).
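To make the idea of a parameter-driven wrapper concrete, here is a minimal sketch in Python. It is not tied to hamake or any other tool: the directory layout (`levels/<n>/source`, `levels/<n>/result`), the `LevelJob` structure, and the `submit` callback are all hypothetical names chosen for illustration. It only shows how the job list could be built from a level count known at runtime and then run strictly in order.

```python
from typing import Callable, List, NamedTuple


class LevelJob(NamedTuple):
    """One MapReduce step in the chain (hypothetical structure)."""
    level: int
    inputs: List[str]  # directories the job reads
    output: str        # directory the job writes


def plan_jobs(num_levels: int, base_dir: str = "levels") -> List[LevelJob]:
    """Build the ordered job list for a run with `num_levels` levels."""
    if num_levels < 1:
        raise ValueError("num_levels must be at least 1")
    jobs = []
    for level in range(1, num_levels + 1):
        # Every level reads its own source data ...
        inputs = [f"{base_dir}/{level}/source"]
        # ... plus the results of the previous level, if there is one.
        if level > 1:
            inputs.append(f"{base_dir}/{level - 1}/result")
        jobs.append(LevelJob(level, inputs, f"{base_dir}/{level}/result"))
    return jobs


def run_chain(jobs: List[LevelJob], submit: Callable[[LevelJob], bool]) -> int:
    """Run jobs in order, stopping at the first failure.

    `submit` is an assumed callback that launches one job, blocks until
    it finishes, and returns True on success. Returns the number of
    jobs that completed successfully.
    """
    completed = 0
    for job in jobs:
        if not submit(job):
            break
        completed += 1
    return completed
```

In a real setup, `submit` would wrap whatever actually launches a job (for example, a call out to the Hadoop command line), and the same plan could instead be written out as a workflow file for a tool such as hamake.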

Thanks!

Other Answer1

Oozie (http://yahoo.github.com/oozie/) is an open-source server that Yahoo released to manage Hadoop and Pig workflows like the one you are asking about.

Cloudera includes it in their latest distro, with very good documentation: https://wiki.cloudera.com/display/DOC/Oozie+Installation

Here is a video from Yahoo: http://sg.video.yahoo.com/watch/5936767/15449686

Other Answer2

You should be able to generate the Pig code for this pretty easily using Piglet, the Ruby Pig DSL: http://github.com/iconara/piglet