What is Hadoop? [closed]

Tag: hadoop Author: yilushang Date: 2009-09-25

What is Hadoop?

I have gone through Google and Wikipedia, but I am still not clear on what Hadoop actually is or what its goal is.

Best Answer

I can only guess that you're not really asking what Hadoop is but what MapReduce is because that's basically what Hadoop is. I'd suggest looking at MapReduce: Simplified Data Processing on Large Clusters.

Search engines are an obvious application for this, but there are others. The sort of problem this is used to solve is one where:

  1. There are huge amounts of data;
  2. That data by necessity is distributed; and
  3. The problem can be highly parallelized.

Or, to put it another way, Hadoop is designed as a distributed work manager for huge amounts of data on a large number of systems. But Hadoop is more than that: it also handles monitoring, failover and scheduling.

Here are some example applications.

Google often uses the example of building a search index. The Map operation takes a set of Web pages and creates keyword indexes; the Reduce operation takes the distributed keyword maps and combines them.
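The indexing example above can be sketched in plain Python (this is a toy illustration of the Map/Reduce idea, not actual Hadoop code; the page texts and function names are made up for the example):

```python
from collections import defaultdict

# Map: each page emits (keyword, page_id) pairs.
def map_page(page_id, text):
    for word in set(text.lower().split()):
        yield (word, page_id)

# Reduce: merge all page ids seen for a keyword into one posting list.
def reduce_keyword(word, page_ids):
    return (word, sorted(page_ids))

pages = {
    "p1": "hadoop runs mapreduce jobs",
    "p2": "mapreduce jobs scale to many nodes",
}

# "Shuffle" phase: group mapper output by key, as the framework would.
grouped = defaultdict(list)
for page_id, text in pages.items():
    for word, pid in map_page(page_id, text):
        grouped[word].append(pid)

index = dict(reduce_keyword(w, pids) for w, pids in grouped.items())
print(index["mapreduce"])  # ['p1', 'p2']
```

In a real cluster the mappers and reducers run on different machines and the framework handles the grouping; the two user-supplied functions stay this simple.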


@Cletus: Thank you for the information. This was something I was looking for.

Other Answer1

The definition of MapReduce is a bit hard for some to get a grasp of, so here is a simple example to get you into the right thought mode. Let's begin with a standard Google interview question:

Assume you have a 10 TB file of search queries and 10 compute nodes. A search query is a line about 50 characters long that contains the word or phrase someone entered for search.

Your task is to find the top 1000 search phrases.


  1. You only get to scan the 10 TB file ONCE
  2. You don't have any statistics or models about the search phrases
  3. The compute nodes must all be used roughly equally.


Using a hash function (in this area also known as a global hash function), distribute the work among the nodes by the following process:

  1. Give each node a unique ID between 0 and 9
  2. For every line scanned from the 10 TB file, compute the hash of the line
  3. Quantize the hash value into the range 0-9 in a uniform manner
  4. Send that line to the compute node with the matching ID

This process tries to ensure (the per-node line counts follow roughly a Poisson distribution) that each node gets about the same amount of work. Any other approach would rely on models or statistics about the search data to spread the load evenly, which is not feasible when you don't know what the data looks like.
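The four steps above can be sketched as follows (a minimal Python illustration; the sample queries and the choice of MD5 as the hash are arbitrary, and real systems would stream the file rather than hold lines in memory):

```python
import hashlib

NUM_NODES = 10

def node_for(line):
    # Hash the line, then quantize the digest uniformly into a node id 0-9.
    digest = hashlib.md5(line.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NUM_NODES

queries = ["cheap flights", "weather today", "cheap flights", "python tutorial"]
assignments = [node_for(q) for q in queries]

# Identical queries always land on the same node, so each node can count
# its own phrases locally and the per-node top lists can be merged at the end.
assert assignments[0] == assignments[2]
```

The crucial property is determinism: because every copy of a phrase goes to the same node, each node's local counts are exact for the phrases it owns, which is what makes the single-scan top-1000 computation possible.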

Hadoop uses hash partitioning of keys (conceptually similar to a distributed hash table) and some other nifty machinery to provide, among other things, workload balancing, distributed storage access, etc.

Other Answer2

Disclosure: I do not have any exposure to Hadoop. Some of the descriptions below may fall short of "technically accurate", and this would be due to both my own ignorance and to my attempt at putting things in simple terms, in the spirit of Rachel's original question and various "redirects" following Andrew Hare's responses.

Please feel free to leave remarks and annotations (or even to directly edit this response) if you can improve the accuracy of the text while keeping the material accessible to non practitioners.

Hadoop lets you run distributed applications "in the cloud", i.e. on a dynamic collection of commodity hardware hosts.

In very broad terms, the idea is to run a task in parallel without having to worry [much] about where the individual portions of the task execute, or where the data files associated with the task (whether read or written) reside.

Examples: what can and what cannot be "Hadooped"
Search engines are a typical example of a task that lends itself to being handled by Hadoop or similar systems, but other examples include big sorting jobs, document categorization/clustering, SVD and other linear algebra operations (obviously with "big" matrices), etc. A required characteristic of tasks to be handled with Hadoop, however, is that they can be described in terms of two functions: a Map() function, which splits the task into several structurally identical pieces, and a Reduce() function, which takes the individual "pieces" and processes them, producing mergeable results.

The idea behind the MapReduce model is that it provides a generic and clear description of anything that can be processed in parallel. This helps both application developers, who just provide the Map/Reduce function pair and otherwise do not have to worry about synchronization details etc., and the Hadoop system itself, which can then handle every job in the very same fashion (even though the implementations of the Map/Reduce functions may vary tremendously).
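The "structurally identical pieces with mergeable results" property can be made concrete with a small sketch (plain Python, not Hadoop; the word-count task and shard contents are invented for illustration):

```python
from collections import Counter

# Each "node" maps its own shard of lines to a partial word count.
# Every partial result has the same structure: a word -> count table.
def map_shard(lines):
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# Reduce merges the structurally identical partial results into one.
def reduce_counts(partials):
    total = Counter()
    for partial in partials:
        total += partial
    return total

shards = [["a b a"], ["b c"], ["a c c"]]   # three shards on three "nodes"
partials = [map_shard(s) for s in shards]  # runs in parallel in a real cluster
total = reduce_counts(partials)
print(total["a"])  # 3
```

Because the partial results all have the same shape and merging them is associative, the framework is free to run the maps anywhere and combine the results in any order, which is exactly the freedom Hadoop exploits.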

Another key component of Hadoop is its distributed file system, which allows reliably storing huge amounts of data (petabyte-sized). Hadoop also includes several additional subsystems, features and configuration parameters which collectively allow producing fault-tolerant, dynamic systems.

A node is a computer host which can run Map() or Reduce() jobs (and which adheres to the protocol of the Hadoop system, i.e. "obeys" the Hadoop manager which orchestrates all these moving parts). A Hadoop system can have several thousand nodes, i.e. potential workers to which it can feed jobs.
Petabyte = 1 million gigabytes.
Cloud computing: computing based on dynamically scalable groups of general-purpose computers (physical or virtual) available over a network (or, more generically, over the Internet). Two key concepts are: a) applications using the cloud have no knowledge of, nor direct control over, the set of resources which services their requests at a given time; b) no software specific to a given application gets installed "ahead of time" on the nodes (well... they do receive, and possibly cache, the logic of the Map() or Reduce() functions, and they do benefit from the current state of the file system with the various datasets belonging to the application, but the idea is that they are "commodity": they can come and go, and yet the work will get done).


If I could just up-vote this 1000 times .... damn ...

Other Answer3

There is a beautiful tutorial on the Yahoo! developer network. You will like it. It is aimed at beginners who are new to Hadoop.

Other Answer5

Hadoop is a framework made up of a combination of components such as MapReduce, HDFS, HBase, and Hive.

HDFS stores data blocks in the form of files on cluster nodes. There are no tables or columns in HDFS.

MapReduce provides powerful parallel processing of the data located on the clustered nodes.

Hive is a data-warehousing tool and SQL wrapper for processing large amounts of data. Hive can be used for OLAP processing.

HBase is a database on top of HDFS. HBase can be used for real-time processing, i.e. OLTP processing.

Please click what is Hadoop to learn more about the basics of Hadoop and its different sub-components.

Other Answer6

Please see Hadoop:

Apache Hadoop is a Java software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers.


@Andrew: I have gone through Wikipedia but would appreciate a more detailed explanation, as I did not understand much from the Wiki :)
What specifically do you not understand?
@Andrew: It enables applications to work with thousands of nodes and petabytes of data.
@Andrew: What kind of work? What was the motivation for creating Hadoop? What real-world issues does Hadoop solve? What earlier technologies failed to satisfy the need that led to the development of Hadoop?

Other Answer7

Hadoop is a framework mainly used for data-intensive (big data) processing in distributed computing, and it is based on Google's MapReduce paper. First you need to understand what MapReduce is; only then move on to the Hadoop framework. Check this ppt.


I find it the most helpful and fun way to get to know MapReduce.

Other Answer8

Refer to "Hadoop: The Definitive Guide" (O'Reilly). This will give you a good idea about Hadoop.


A one-liner that's little better than a bare link, posted to a three-year-old question that already has a number of complete, cogent responses, is not terribly useful as an answer. That sort of thing works better as a comment.