Which Hadoop product is more appropriate for a quick query on a large data set?

Tag: hadoop Author: li204 Date: 2009-11-25

I am researching Hadoop to see which of its products suits our need for quick queries against large data sets (billions of records per set)

The queries will be performed against chip sequencing data. Each record is one line in a file. To be clear below shows a sample record in the data set.

one line (record) looks like:

1-1-174-418 TGTGTCCCTTTGTAATGAATCACTATC U2 0 0 1 4 ***103570835*** F .. 23G 24C

The highlighted field is called "position of match" and the query we are interested in is the # of sequences in a certain range of this "position of match". For instance the range can be "position of match" > 200 and "position of match" + 36 < 200,000.

Any suggestions on the Hadoop product I should start with to accomplish the task? HBase,Pig,Hive, or ...?

Other Answer1

Rough guideline: If you need lots of queries that return fast and do not need to aggregate data, you want to use HBase. If you are looking at tasks that are more analysis and aggregation-focused, you want Pig or Hive.

HBase allows you to specify start and end rows for scans, meaning it should be satisfy the query example you provide, and seems most appropriate for your use case.

Other Answer2

For posterity, here's the answer Xueling received on the Hadoop mailing list:

First, further detail from Xueling:

The datasets wont be updated often. But the query against a data set is frequent. The quicker the query, the better. For example we have done testing on a Mysql database (5 billion records randomly scattered into 24 tables) and the slowest query against the biggest table (400,000,000 records) is around 12 mins. So if using any Hadoop product can speed up the search then the product is what we are looking for.

The response, from Cloudera's Todd Lipcon:

In that case, I would recommend the following:

  1. Put all of your data on HDFS
  2. Write a MapReduce job that sorts the data by position of match
  3. As a second output of this job, you can write a "sparse index" - basically a set of entries like this:

where you're basically giving offsets into every 10K records or so. If you index every 10K records, then 5 billion total will mean 100,000 index entries. Each index entry shouldn't be more than 20 bytes, so 100,000 entries will be 2MB. This is super easy to fit into memory. (you could probably index every 100th record instead and end up with 200MB, still easy to fit in memory)

Then to satisfy your count-range query, you can simply scan your in-memory sparse index. Some of the indexed blocks will be completely included in the range, in which case you just add up the "number of entries following" column. The start and finish block will be partially covered, so you can use the file offset info to load that file off HDFS, start reading at that offset, and finish the count.

Total time per query should be <100ms no problem.

A few subsequent replies suggested HBase.

Other Answer3

You could also take a short look at JAQL (http://code.google.com/p/jaql/), but unfortunately it's for querying JSON data. But maybe this helps anyway.


So it's not for Hadoop, and it won't handle his data, and generally has nothing to do with the problem at hand, but it might be useful anyway?
Well, did you read the abstract at the main page? "Jaql is a query language designed for Javascript Object Notation (JSON), a data format that has become popular because of its simplicity and modeling flexibility. Jaql is primarily used to analyze large-scale semi-structured data. Core features include user extensibility and parallelism. In addition to modeling semi-structured data, JSON simplifies extensibility. Hadoop's Map-Reduce is used for parallelism." So in fact it is for Hadoop! ;-)

Other Answer4

You may need to look at No-SQL Database approaches like HBase or Cassandra. I would prefer HBase, as it has a growing community.