Fastest access of a file using Hadoop

Tag: hadoop Author: cheyunxing Date: 2011-09-17

I need the fastest possible access to a single file, several copies of which are stored across many systems using Hadoop. I also need to find the ping time for each file in sorted order. How should I approach learning Hadoop to accomplish this task? Please help fast; I have very little time.

Best Answer

If you need faster access to a file, just increase the replication factor for that file using the setrep command. This might not increase the file's throughput proportionally, because of your current hardware limitations.
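As a sketch, the replication factor for a single file could be raised like this (the path is a hypothetical example; adjust the replica count to taste):

```shell
# Set the replication factor of one file to 5 replicas.
# -w blocks until the re-replication actually completes.
hadoop fs -setrep -w 5 /user/hadoop/input/sample.txt
```

Note that replication is set per file here, not cluster-wide, so other files keep their existing replica counts.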

The ls command does not show the access time for directories and files; it shows only the modification time. Use the Offline Image Viewer to dump the contents of HDFS fsimage files to a human-readable format. Below is the command using the Indented option.

bin/hdfs oiv -i fsimagedemo -p Indented -o fsimage.txt

A sample output from fsimage.txt; look for the ACCESS_TIME field.

INODE
  INODE_PATH = /user/praveensripati/input/sample.txt
  REPLICATION = 1
  MODIFICATION_TIME = 2011-10-03 12:53
  ACCESS_TIME = 2011-10-03 16:26
  BLOCK_SIZE = 67108864
  BLOCKS [NUM_BLOCKS = 1]
    BLOCK
      BLOCK_ID = -5226219854944388285
      NUM_BYTES = 529
      GENERATION_STAMP = 1005
  NS_QUOTA = -1
  DS_QUOTA = -1
  PERMISSIONS
    USER_NAME = praveensripati
    GROUP_NAME = supergroup
    PERMISSION_STRING = rw-r--r--

To get the access times (the "ping times" from the question) in sorted order, you need to write a shell script or some other program to extract the INODE_PATH and ACCESS_TIME from each INODE section and then sort by ACCESS_TIME. You can also use Pig as shown here.
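A minimal awk sketch of that extraction, run against a small sample in the same Indented format as above (the file contents here are made up for demonstration; point the script at your real fsimage.txt dump):

```shell
# Create a tiny sample in the Indented oiv format shown above.
cat > fsimage.txt <<'EOF'
INODE
  INODE_PATH = /user/praveensripati/input/sample.txt
  ACCESS_TIME = 2011-10-03 16:26
INODE
  INODE_PATH = /user/praveensripati/input/other.txt
  ACCESS_TIME = 2011-10-03 12:10
EOF

# Remember the path from each INODE section, emit it with its
# access time, then sort lexically (the timestamp format sorts
# chronologically as plain text).
awk '/INODE_PATH/  { path = $3 }
     /ACCESS_TIME/ { print $3, $4, path }' fsimage.txt | sort
```

This prints one `date time path` line per file, oldest access first; pipe through `sort -r` for most recently accessed first.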

How should I approach learning Hadoop to accomplish this task? Please help fast; I have very little time.

Learning Hadoop in a day or two is not possible. Here are some videos and articles to start with.