Controlling the number of lines written to the output file

Tag: hadoop Author: jason_qiao Date: 2011-09-11

I am new to Hadoop programming.

I have a situation in which I want to stop writing <k3,v3> to my output file after n lines.

In my program, I am sure that the output file will be sorted according to k3, but I don't want the entire list. I only want the first n.

Is there a mechanism in Hadoop to do this?

Can you please give a sample input -> output and your mapper/reducer code?

Other Answer1

I couldn't find a class or API for this.

But you could increment a Counter each time OutputCollector.collect() is called in the reduce function. When the counter reaches a certain value, stop calling OutputCollector.collect().

This wastes CPU cycles, because the reduce task keeps running even after n lines have been written to the output. There might be a better approach to the problem.
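
The counting approach described above can be sketched roughly as follows. This is only an illustration, not tested against a real cluster: `Collector` is a hypothetical stand-in for Hadoop's OutputCollector so the snippet compiles without the Hadoop jars, `LimitedReducer` and `limit` are made-up names, and the counter here is a plain int field rather than a Hadoop Counter object.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for Hadoop's OutputCollector (assumption, not the real class).
interface Collector<K, V> {
    void collect(K key, V value);
}

// Sketch of a reducer that emits at most `limit` records per reduce task.
class LimitedReducer {
    private final int limit;  // n: maximum number of records to emit (hypothetical parameter)
    private int emitted = 0;  // records emitted so far by this task

    LimitedReducer(int limit) {
        this.limit = limit;
    }

    // Called once per key, in sorted key order, as in a Hadoop reduce task.
    void reduce(String key, Iterator<Integer> values, Collector<String, Integer> out) {
        // Emit only while the task is still under the limit.
        while (values.hasNext() && emitted < limit) {
            out.collect(key, values.next());
            emitted++;
        }
        // Once the limit is reached, later calls return without emitting anything.
    }
}
```

Note that the count lives inside one reduce task. With more than one reduce task, each task would keep its own count and write up to n lines to its own output file, so getting a single list of the first n would likely require running the job with one reducer.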

comments:

Thank you, praveen. I thought of the same approach and went on to see whether there is a better way.