Large-scale Hadoop @ Yahoo! Search
19 Feb 2008
Jeremy Zawodny has mentioned a very large-scale Hadoop deployment over at Yahoo! Search.
It makes sense given their previous commitments and investments in the project, but it's also cool to start seeing some significant migrations to the framework.
Over on the Yahoo! Hadoop blog, you can read about how the webmap team in Yahoo! Search is using the Apache Hadoop distributed computing framework. They’re using over 10,000 CPU cores to build the map and processing a ton of data to do so. They end up using over 5 petabytes of raw disk storage, eventually outputting over 300 terabytes of compressed data that’s used to power every single search.
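The webmap code itself isn't public, but to give a flavor of the kind of job a framework like Hadoop runs over a link graph, here's a minimal MapReduce sketch that counts inbound links per URL. Everything in it is illustrative: the tab-separated "source<TAB>target" edge-list input format and the class names are my own assumptions, not anything from Yahoo!'s actual pipeline.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InlinkCount {

  // Map: each input line is assumed to be "source-url<TAB>target-url";
  // emit (target-url, 1) for every edge in the crawl's link graph.
  public static class EdgeMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text target = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      if (fields.length == 2) {
        target.set(fields[1]);
        context.write(target, ONE);
      }
    }
  }

  // Reduce (also used as a combiner): sum the counts for each target URL.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      total.set(sum);
      context.write(key, total);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "inlink count");
    job.setJarByClass(InlinkCount.class);
    job.setMapperClass(EdgeMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // edge-list input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // counts output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The point of the sketch is the shape of the computation, not its scale: the same map and reduce functions run unchanged whether the input is a few megabytes on one machine or petabytes spread across 10,000 cores, which is exactly what makes a migration like the webmap's attractive.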
Another interesting quote from Eric Baldeschwieler (Senior Director, Grid Computing):
This process is not new (see the AltaVista connectivity server). What is new is the use of Hadoop. Hadoop has allowed us to run the identical processing we ran pre-Hadoop on the same cluster in 66% of the time our previous system took. It does that while simplifying administration. Further we believe that as we continue to scale up Hadoop, we will be able to scale up our production jobs as needed to larger cluster sizes.
Pretty impressive.
As part of this announcement, Jeremy has posted an interview he did with a couple of the webmap and grid computing people. The video feed seems quite slow right now, so you'll have to be patient.
Update: The video feed seems much better now. Check it out.