First Impressions of Hadoop

The other night I sat down and spent some time playing around with Hadoop.

What follows here is based on my brief understanding of the project and one night’s worth of experience :)

Hadoop is an Apache Lucene project that provides an open-source implementation of MapReduce, a programming model for parallel processing developed and popularized by Google. From everything I’ve read and seen of Google’s MapReduce implementation, Hadoop looks to be very similar. Interestingly enough, Yahoo! is a significant supporter of the project and has published and presented widely on its experiences with it.

Having a general conceptual understanding of MapReduce but little hands-on experience with it, I set out to solve a relatively straightforward problem.

As part of our interviewing process, potential candidates are given a programming problem to solve. One that we’ve used previously involves parsing Apache-style log files and gathering specific statistics. It wasn’t a stretch to see how this could be implemented using MapReduce and Hadoop.

Step 1: Installing Hadoop

My daily driver is a MacBook Pro, so the installation went quite smoothly. Nothing significantly more difficult than unpacking an archive, updating a few environment variables, and running a few example jobs.

I set up a two-machine cluster – my laptop and a VMware virtual machine. Nothing particularly sophisticated.

Step 2: Breaking down the problem

The first step of any implementation involves breaking the problem down into a series of map and reduce steps. Hadoop provides interfaces and abstract base classes to help you get started. Arguably, breaking a problem down into its fundamental building blocks is the hard part, but it’s also part of the elegance enabled by a programming model like MapReduce.

It’s important to note that the framework is responsible for all I/O operations and builds this functionality on top of its own implementation of a distributed file system. To make data available for processing, you must first copy it into the distributed file system. Likewise, if you want to do additional work on the output data (outside of Hadoop), you must copy it back out. This not only makes larger volumes of data available to the cluster, it also allows the framework to partition and store records closer to the machines that will be processing them. Google has an interesting paper that describes their approach to distributed file systems (see the Google File System paper).
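
Getting data in and out is just a copy step. You can do it from the command line with the bin/hadoop dfs shell (-put and -get), or programmatically through the FileSystem API. A rough sketch of the latter, with placeholder paths of my own:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Copies local log files into the distributed file system before a job
    // runs, and pulls the results back out afterwards. Paths are placeholders.
    public class DfsCopy {
        public static void main(String[] args) throws Exception {
            FileSystem dfs = FileSystem.get(new Configuration());

            // local disk -> DFS (input for the job)
            dfs.copyFromLocalFile(new Path("/tmp/logs/access.log"),
                                  new Path("/user/me/logs/access.log"));

            // DFS -> local disk (output of the job)
            dfs.copyToLocalFile(new Path("/user/me/output/part-00000"),
                                new Path("/tmp/results/part-00000"));
        }
    }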

In the simplest terms, a mapping step is responsible for processing input data and generating key -> value pairs. Hadoop takes care of grouping all values by their key and passing the result (key -> collection of values) to a reduction step. A reduction step is responsible for reducing that collection of values into a suitable output value. See the Hadoop quick start guide for a simple implementation of both a map and a reduce step.

Step 3: The mapping step

Take the following problem definition: given a collection of Apache log files, determine the urls with the most hits and return them in descending order.

The mapping step is fairly straightforward. It’s responsible for parsing individual log records and outputting a url and a count for each.

In this example, the count is always 1, and you can think of the mapping step as a marker: it flags the url from each record it parses, and it’s the reduction step’s responsibility to count up all of the individual flags.
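
A minimal sketch of what that mapper might look like, using the org.apache.hadoop.mapred API. The class name and the deliberately naive log parsing are mine; a real implementation would want a proper parser for your particular log format:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Maps one log line to (url, 1). The framework handles reading the input
    // files and handing us one record (line) at a time.
    public class HitMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            // Assumes a common log format where the request is the quoted
            // field: "GET /some/url HTTP/1.1". Just enough parsing to
            // illustrate the idea.
            String[] parts = line.toString().split("\"");
            if (parts.length > 1) {
                String[] request = parts[1].split(" ");
                if (request.length > 1) {
                    output.collect(new Text(request[1]), ONE);
                }
            }
        }
    }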

Step 4: The reduction step

Example Input: [http://www.google.ca, [1,1,1,1,1,1,1]], [http://www.yahoo.com, [1,1,1]], [http://www.microsoft.com, [1,1,1,1,1]]

As mentioned previously, the responsibility of the reduction step in this example is to reduce these inputs by counting up the number of flags. The output would look something like:

[http://www.google.ca, 7], [http://www.yahoo.com, 3], [http://www.microsoft.com, 5]
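
A matching sketch of the reducer, again using the org.apache.hadoop.mapred API (the class name is mine):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Receives (url, [1, 1, 1, ...]) and emits (url, total).
    public class HitReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text url, Iterator<IntWritable> counts,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int total = 0;
            while (counts.hasNext()) {
                total += counts.next().get();
            }
            output.collect(url, new IntWritable(total));
        }
    }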

Step 5: Putting it all together
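
The last piece is a small driver that wires the mapper and reducer into a job and points it at input and output directories on the distributed file system. Roughly (the class names are mine, and the exact configuration calls vary a bit between Hadoop releases):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Configures and submits the hit-counting job.
    public class HitCountJob {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(HitCountJob.class);
            conf.setJobName("url-hit-count");

            conf.setMapperClass(HitMapper.class);
            // Summing is associative, so the reducer doubles as a combiner
            // and cuts down the data shuffled between nodes.
            conf.setCombinerClass(HitReducer.class);
            conf.setReducerClass(HitReducer.class);

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }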

That’s it. The result of running a job with these map and reduce steps would be similar to:

http://www.google.ca 7

http://www.microsoft.com 5

http://www.yahoo.com 3

Unfortunately the output is sorted by key (the url) rather than by hit count, so we haven’t quite satisfied the problem definition. However, the beauty is that we can feed this output into a second job whose mapper simply inverts the key and value; the framework then sorts by the new key (the count), and the output of that job is the list of urls ordered by their hit count. Success!
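
A sketch of that second-pass mapper, assuming the first job wrote its output as a SequenceFile so the (Text, IntWritable) types carry through (an identity reducer is enough on the other side). Hadoop also ships an InverseMapper in mapred.lib that does the same thing, and getting descending rather than ascending order means plugging in a comparator via JobConf.setOutputKeyComparatorClass:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Second pass: flips (url, count) into (count, url) so the framework's
    // sort-by-key puts everything in hit-count order.
    public class InvertMapper extends MapReduceBase
            implements Mapper<Text, IntWritable, IntWritable, Text> {

        public void map(Text url, IntWritable count,
                        OutputCollector<IntWritable, Text> output,
                        Reporter reporter) throws IOException {
            output.collect(count, url);
        }
    }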

All in all, it’s an incredibly powerful idea and framework. It’s absolute overkill for small tasks like this one, but if you’ve got many terabytes (or petabytes) of data and a couple thousand processing nodes, it’s more than adequate.