Converting word docs to pdf using Hadoop

Tag: hadoop Author: David_newyork Date: 2009-12-13

Say if I want to convert 1000s of word files to pdf then would using Hadoop to approach this problem make sense? Would using Hadoop have any advantage over simply using multiple EC2 instances with job queues?

Also if there was 1 file and 10 free nodes then would hadoop split the file and send it to the 10 nodes or will the file be sent to just 1 node while 9 sit idle?

Best Answer

There isn't much advantage in using hadoop for this use case. Having competing consumers read from a queue and producing output is going to be a lot easier to setup and will probably be more efficient.

Hadoop would not automatically split a document and process sections on differnt nodes. Although if you had a really big (many thousands of pages long) then the Hadoop use case would make sense - but only when the time to produce a pdf on a single machine is significant.

The map tasks could print a few thousand pages each and the reduce task merge the PDF's into a single document - although reading the resulting file may be difficult to read if it is very large.

Other Answer1

Say if I want to convert 1000s of word files to pdf then would using Hadoop to approach this problem make sense? Would using Hadoop have any advantage over simply using multiple EC2 instances with job queues?

I think either tool could accomplish this task, so it depends on what you plan to do with the documents after conversion. Derek Gottfrid at the New York Times famously found Hadoop to be a useful tool for large-scale document conversion, so it's certainly within the realm of tasks at which Hadoop performs well.

Also if there was 1 file and 10 free nodes then would hadoop split the file and send it to the 10 nodes or will the file be sent to just 1 node while 9 sit idle?

It depends on the InputFormat you use. As you can see in the documentation, you can specify how to compute the "InputSplits", which might include splitting a large document into chunks.

Good luck with whatever tool you choose for this problem!

Regards, Jeff

Other Answer2

How many 1000's are you talking about? If this is a once off batch I would set it up on a single machine and simply let it run, you'll be surprised I think at how fast you can convert 1000s of Docs to PDF, even if you need to run the task for a couple of days, if its a once off convert then there is no need for complications such as Hadoop. If you are continually converting 1000s of docs then its probably worth the effort of setting up something else.