Can Hadoop be restricted to spare CPU cycles?

Tag: hadoop Author: doyydmyself Date: 2009-10-06

Is it possible to run Hadoop so that it only uses spare CPU cycles? That is, would it be feasible to install Hadoop on people's work machines so that number crunching can be done when they are not using their PCs, without them experiencing an obvious performance drain (whirring fans aside!)?

Perhaps it's just a case of setting the JVM to run at a low priority and not use 'too much' network (assuming such a thing is possible on a Windows machine)?

If not, does anyone know of any Java equivalents to things like BOINC?

Edit: Found a list of Cycle Scavenging Infrastructure here, although my question about Hadoop still stands.

Best Answer

This is very much outside the intended usage for Hadoop. Hadoop expects all of its nodes to be fully available and networked for optimal throughput -- not something you get with workstations. Furthermore, it doesn't even really run on Windows (you can use it with Cygwin, but I don't know anyone using that for "production" -- except as client machines issuing jobs).

Hadoop does things like store each data chunk on a few of the nodes (three by default), and try to schedule all computation on that data on those nodes; in a work-sharing environment, that means a task that needs this data will want to run on those three workstations -- regardless of what their users are doing at the moment. In contrast, "cycle scavenging" projects keep all the data elsewhere, and ship it and a task to any node that's available at a given moment; this enables them to be nicer to the machines, but it incurs obvious data transfer costs.
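To make that contrast concrete, here is a deliberately simplified toy model in Python. It is not Hadoop's scheduler or any real API: the function names (`locality_schedule`, `scavenging_schedule`, `compare`), the node names, and the block placement are all made up for illustration, and real Hadoop scheduling is more involved than "pick a replica holder". It only counts how often each strategy would land a task on a machine whose user is busy, versus how often it has to ship data.

```python
# Toy model only: all names and data here are hypothetical, not Hadoop APIs.

def locality_schedule(block, placement):
    """Locality-first: run the task on a node that stores the block.

    User activity is not consulted, which mirrors the point that the task
    "wants" the replica holders regardless of what their users are doing.
    """
    return min(placement[block])  # deterministic pick among replica holders


def scavenging_schedule(idle_nodes):
    """Cycle-scavenging: run on any idle node; data is shipped from elsewhere."""
    return min(idle_nodes) if idle_nodes else None


def compare(placement, idle_nodes):
    """Return (tasks that disturb a busy user, tasks that need a data transfer)."""
    disturbed = 0   # locality-first tasks landing on a non-idle workstation
    transfers = 0   # scavenging tasks, each of which must fetch its data
    for block in placement:
        if locality_schedule(block, placement) not in idle_nodes:
            disturbed += 1
        if scavenging_schedule(idle_nodes) is not None:
            transfers += 1
    return disturbed, transfers


# Two blocks, each stored on three workstations; only "d" and "e" are idle.
placement = {0: {"a", "b", "c"}, 1: {"b", "c", "d"}}
idle_nodes = {"d", "e"}
disturbed, transfers = compare(placement, idle_nodes)
print(f"locality-first disturbs {disturbed} busy users; "
      f"scavenging needs {transfers} data transfers")
```

In this made-up setup the locality-first strategy interrupts a busy user for both blocks, while the scavenging strategy never does but pays for a data transfer on every task, which is the trade-off described above.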


It can be run on Windows in "production"; however, having seen it done, I highly recommend against it.

Other Answer1

Perhaps Terracotta is something more up your alley?

Terracotta Product Link