Mapper with multipleInput on Hadoop cluster

Tag: hadoop , mapreduce Author: qmc0823126 Date: 2014-01-08

I have to implement two MapReduce jobs, where a mapper in phase II (Mapper_2) needs the output of the reducer in phase I (reducer_1) as one of its inputs.

Mapper_2 also needs another input, which is a large text file (2 TB).

I have written the following, but my question is: the text input will be split among the nodes in the cluster, but what about the output of reducer_1? I want each mapper in phase II to see the whole of reducer_1's output, not just a split of it.

MultipleInputs.addInputPath(Job, TextInputPath, SomeInputFormat.class, Mapper_2.class);
MultipleInputs.addInputPath(Job, Reducer_1OutputPath, SomeInputFormat.class, Mapper_2.class);
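For context, a complete phase-II driver using the newer `org.apache.hadoop.mapreduce` API might look like the sketch below. The paths, job name, and driver class name are hypothetical, and `Mapper_2` is assumed to be the mapper class from the question; note that each mapper task still receives only one split of whichever input it is assigned.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PhaseTwoDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "phase-2");
        job.setJarByClass(PhaseTwoDriver.class);

        // Both inputs feed the same mapper class; each split is routed
        // through the input format registered for its path.
        MultipleInputs.addInputPath(job, new Path("/input/bigtext"),
                TextInputFormat.class, Mapper_2.class);
        MultipleInputs.addInputPath(job, new Path("/output/reducer_1"),
                TextInputFormat.class, Mapper_2.class);

        FileOutputFormat.setOutputPath(job, new Path("/output/phase2"));
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```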

Answer 1

Your use of MultipleInputs seems fine. I would look at using the DistributedCache to share the output of reducer_1 with mapper_2, since the cache makes the whole file available to every mapper rather than splitting it.

JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/path/to/reducer_1/output"), job);

Also, when using the DistributedCache, remember to read the cache file in the setup() method of Mapper_2.

setup() runs once per mapper before map() is called, and cleanup() runs once per mapper after the last call to map().
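A sketch of what that setup() method might look like, assuming reducer_1 wrote tab-separated key/value text and that its output is small enough to hold in memory (the class shape and field names here are illustrative, not the question's actual code):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Mapper_2 extends Mapper<LongWritable, Text, Text, Text> {
    // Holds the entire reducer_1 output; assumes it fits in task memory.
    private final Map<String, String> reducer1Data = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException {
        // Files registered via DistributedCache.addCacheFile are copied to
        // the local disk of every task node before the task starts.
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cached != null) {
            for (Path p : cached) {
                BufferedReader in = new BufferedReader(new FileReader(p.toString()));
                try {
                    String line;
                    while ((line = in.readLine()) != null) {
                        // Assumes tab-separated key/value pairs from reducer_1.
                        String[] kv = line.split("\t", 2);
                        reducer1Data.put(kv[0], kv.length > 1 ? kv[1] : "");
                    }
                } finally {
                    in.close();
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... join each incoming record of the 2 TB text input
        //     against reducer1Data here ...
    }
}
```

On Hadoop 2.x, `context.getCacheFiles()` can be used instead of the deprecated `DistributedCache.getLocalCacheFiles`.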


Thanks for your answer. How can I access the data in the DistributedCache when writing the Mapper_2 code?