Running jobs in parallel in Hadoop

Tag: hadoop Author: l3051708910 Date: 2011-09-04

I am new to hadoop.

I have set up a 2 node cluster.

How do I run 2 jobs in parallel in Hadoop?

When I submit jobs, they run one by one in FIFO order. I need to run the jobs in parallel. How can I achieve that?

Thanks MRK

Best Answer

Hadoop can be configured with a number of schedulers; the default is the FIFO scheduler.

The FIFO scheduler behaves like this:

Scenario 1: If the cluster has a 10-map-task capacity and job1 needs 15 map tasks, then job1 takes the complete cluster. As job1 makes progress and slots free up that job1 no longer uses, job2 runs on them.

Scenario 2: If the cluster has a 10-map-task capacity and job1 needs 6 map tasks, then job1 takes 6 slots and job2 takes the remaining 4 slots. job1 and job2 run in parallel.
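The two scenarios above can be sketched with a toy model of FIFO slot allocation. This is illustration only, not Hadoop code; `fifo_allocate` is a made-up helper, and real slot assignment happens per heartbeat, not all at once:

```python
# Toy model of FIFO map-slot allocation; not a Hadoop API, just an
# illustration of how a fixed slot capacity is handed out in job order.
def fifo_allocate(capacity, job_demands):
    """Grant map slots to jobs strictly in submission order."""
    allocation = []
    free = capacity
    for demand in job_demands:
        granted = min(demand, free)  # a job gets at most the free slots
        allocation.append(granted)
        free -= granted
    return allocation

# Scenario 1: 10-slot cluster, job1 wants 15 maps -> job1 fills the cluster.
print(fifo_allocate(10, [15, 5]))  # [10, 0] -- job2 waits for free slots

# Scenario 2: 10-slot cluster, job1 wants 6 maps -> job2 runs in parallel.
print(fifo_allocate(10, [6, 5]))   # [6, 4]
```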

To run jobs in parallel from the start, configure either a Fair Scheduler or a Capacity Scheduler, depending on your requirements. Set mapreduce.jobtracker.taskscheduler and the scheduler-specific parameters in mapred-site.xml for this to take effect.
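For example, enabling the Fair Scheduler on a classic (MRv1) JobTracker might look like the sketch below in mapred-site.xml. The exact property name and scheduler class vary by Hadoop version (older releases use mapred.jobtracker.taskScheduler), so verify against the docs for your release:

```xml
<!-- mapred-site.xml: sketch for enabling the Fair Scheduler (MRv1).
     Property names and class names differ across Hadoop versions. -->
<property>
  <name>mapreduce.jobtracker.taskscheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
```

The Fair Scheduler also reads its pool definitions from a separate allocation file; the Capacity Scheduler is configured analogously with its own scheduler class and queue properties.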

Edit: Updated the answer based on the comment from MRK.

comments:

Thank You. Let me explore on other schedulers.
Hi, thanks for the inputs. Here is my observation: both of your points are valid. Hadoop runs jobs in FIFO order by default, but if the cluster has a 4-map-task capacity and job1 needs only 2 maps, then job2 also runs in parallel even though the FIFO scheduler is in use. The Fair Scheduler allocates tasks evenly across all submitted jobs, so short jobs complete soon even when they are submitted after long-running jobs. So jobs can run in parallel in FIFO mode too, if the cluster has spare capacity; with fair scheduling, jobs always run in parallel.

Other Answer1

The cluster has a "Map Task Capacity" and a "Reduce Task Capacity". Whenever slots are free, they pick up tasks from jobs in FIFO order. A submitted job contains mappers and, optionally, reducers. If a job's mapper (and/or reducer) count is smaller than the cluster's capacity, the free slots take on the next job's mappers (and/or reducers).

If you don't like FIFO, you can always give priority to your submitted jobs.
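With the classic mapred CLI, priority can be set per submitted job. The job ID below is a placeholder, and the command only works against a running cluster, so treat this as a sketch:

```
# Raise a submitted job's priority (job ID is a placeholder).
# Valid priorities: VERY_LOW, LOW, NORMAL, HIGH, VERY_HIGH.
hadoop job -set-priority job_201109041234_0001 HIGH
```

Note that, as discussed in the comments below, priority only reorders the FIFO queue; it does not make jobs run concurrently.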

Edit:

Sorry about the slight misinformation; Praveen's answer is the right one. In addition to his answer, you can check the HOD scheduler as well.

comments:

Please see my explanation in the answer.
Sorry, didn't get what you mean.
I was composing the answer :) Giving priority to a job submitted to the FIFO scheduler won't solve the problem of running jobs in parallel, because with FIFO, jobs run in sequence, consuming the entire cluster. Another scheduler has to be configured.
Got your point there :) Your answer is more appropriate.
Hi, thanks for the inputs. Here is my observation.

Other Answer2

With the default scheduler, only one job per user runs at a time. You can launch different jobs from different user IDs, and they will run in parallel. Of course, as mentioned by others, you need enough slot capacity.