MapReduce, Chubby and Hadoop

Energy drink Red Bull sponsors air races. In New York, colorful propeller aircraft raced around and through inflated gates. The gates looked like the blow-up animals in Macy’s Thanksgiving Day Parade. When a steel-nerved air racer miscalculated and nicked an inflated course marker, the blow-up sagged like a deflating hot air balloon.

The traditional database market is in an air race with what seem to be faster, more agile and more streamlined aircraft. Structured query language (SQL) databases owe their popularity to Edgar Codd’s insight that data could be held in digital form in a structure reminiscent of a piece of ledger paper. Some of you reading this column may be unfamiliar with the green ruled tablets that accountants once used to record debits and credits.

The row-and-column method still works. Even super-wizards at Google use Codd’s traditional database to fiddle with small chunks of structured data. IBM, Microsoft and Oracle have dominated the database market. Larry Ellison executed a strategic arabesque and purchased the Codd-clone MySQL, an open source competitor to the Oracle database management system. When Oracle’s hands circled the throat of MySQL, one could hear gasps and squeals from MySQL users who feared for its future. Oracle’s approach to open source has been to own it and then share. So far, Oracle’s approach seems to be working.

Open “sourciness”

Not far from Oracle’s black towers, Google’s sprawling junior college-type campus cheerleads for a different approach. Google, like Oracle, is a for-profit enterprise. But Google’s management team has been active in the open source world. In sharp contrast to Oracle’s approach, Google releases code to the open source community. To make sure that the open “sourciness” of Google is recognized, Google has an open source ambassador, open source Web pages at http://code.google.com/opensource, an open source blog, open source programs like the GSoC (Google Summer of Code) and a stealth challenge to the Codd-loving incumbents.

Some of those following Google’s business focus on advertising, which accounts for 99 percent of its revenue. However, Google launched what amounts to a tactical probe of the commercial RDBMS market. Since 2004, Google has described its MapReduce technology in a handful of publicly accessible technical papers, presentations and YouTube videos. Navigate to www.youtube.com and run this query: Google MapReduce. You can kick back and enjoy hours of lectures about Google’s engineering marvel. MapReduce and its pals, Chubby (a distributed lock service) and GFS (the Google File System), are among Google’s core innovations. Those systems make Google’s massively parallel, distributed, high-performance computing infrastructure a reality, one I use several times a day. My hunch is that you rely on Google and those components as well.

Is Google just generous? Was the firm making good on its promise to not be evil? Did Google’s management team have a specific goal in mind? Was Google just being Googley?

Google’s motivations are tough to figure out. From the point of view of a marketer, “open source” is deep voodoo. The interest in open source software seems to be increasing. In search, for instance, Lucene/Solr (http://lucene.apache.org/solr) has captured the attention of a number of high-profile organizations, including Cisco Systems, eHarmony, Mitre and Twitter. In content management, I stumbled upon a conference several months ago with throngs of people entranced by Drupal. I listened to a podcast from IT Conversations (http://itc.conversationsnetwork.org) that focused on Drizzle, an open source database server, and learned that a mini-revolution is taking place against some traditional database methods and limitations.

But the buzzword of the summer is Hadoop (http://hadoop.apache.org). Bloggers and poobahs have done loop-the-loops around Hadoop. According to Doug Cutting’s “Hadoop: A Brief History” (http://research.yahoo.com/files/cutting.pdf), today’s open source wunderkind had its roots in Nutch (http://nutch.apache.org), a Web-scale open source search system. In 2003 and 2004, Google published papers describing its Google File System and its MapReduce method. In 2006, Yahoo hired Cutting, and the Hadoop project was “split out of Nutch.” By 2008, Hadoop “hit Web scale,” and the interest accelerated.

Embracing Hadoop

Apache Hadoop, therefore, inherited some of the Google MapReduce bloodline, but the muscle behind the software framework comes from the open source community with support from Yahoo’s wizards, including Cutting, who named the framework after his child’s stuffed animal.

The basic idea behind Hadoop is that its engineering methods avoid some of the well-known problems inherent in Edgar Codd’s RDBMS invention. Hadoop can work with several storage systems, including its own distributed file system and Amazon’s S3 cloud storage. Hadoop spreads data across many data nodes, and a “name node” keeps track of which data nodes hold which pieces of each file. That arrangement borrows from the Google innovations for knowing where data are across a distributed system and keeping performance free of RDBMS-style bottlenecks. A quick trip to Amazon will provide you with one-click access to books that explain Hadoop in considerable detail. I recommend Chuck Lam’s Hadoop in Action and Tom White’s Hadoop: The Definitive Guide, and a wealth of information can be found at http://hadoop.apache.org.
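
For readers who think in code, the name node and data node split can be sketched in a few lines of Python. This is a toy, in-memory illustration and not Hadoop’s actual API: the class names, the 8-byte block size, the two-copy replication and the round-robin placement are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Toy stand-in for a machine that stores raw blocks (hypothetical)."""
    name: str
    blocks: dict = field(default_factory=dict)  # block_id -> bytes


class NameNode:
    """Toy stand-in for the node that tracks where each block lives."""

    def __init__(self, data_nodes, block_size=8, replicas=2):
        self.data_nodes = data_nodes
        self.block_size = block_size  # tiny on purpose, for illustration
        self.replicas = replicas
        self.block_map = {}  # filename -> list of (block_id, [DataNode, ...])

    def write(self, filename, payload):
        # Simplification: this toy name node also pushes the bytes itself.
        entries = []
        for start in range(0, len(payload), self.block_size):
            index = start // self.block_size
            block_id = f"{filename}#{index}"
            # Round-robin placement is an assumption, not Hadoop's policy.
            holders = [
                self.data_nodes[(index + r) % len(self.data_nodes)]
                for r in range(self.replicas)
            ]
            for node in holders:
                node.blocks[block_id] = payload[start:start + self.block_size]
            entries.append((block_id, holders))
        self.block_map[filename] = entries

    def locate(self, filename):
        """Return metadata only: which data nodes hold which blocks."""
        return self.block_map[filename]


def read(name_node, filename):
    """Ask the name node where the blocks are, then fetch from data nodes."""
    chunks = []
    for block_id, holders in name_node.locate(filename):
        chunks.append(holders[0].blocks[block_id])  # any replica would do
    return b"".join(chunks)


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
name_node = NameNode(nodes)
name_node.write("ledger.txt", b"debits and credits, row by row")
restored = read(name_node, "ledger.txt")
```

The point of the sketch is the division of labor: the name node answers “where is it?” while the data nodes hold the bytes, so no single machine has to store or serve the whole file.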

Hadoop is used by Amazon, Facebook, IBM and Rackspace, among others. Commercial vendors have embraced Hadoop. IBM, for example, has several applications, including an analytics service, running on Hadoop. IBM and Google teamed in 2007 to offer university courses about Hadoop to computer science majors.

What interests me is the emergence of open source software for “reliable, scalable, distributed computing.” Translating that, Hadoop is “a direct alternative to existing commercial operating systems, RDBMS that masquerade as high-performance data management systems, and extremely expensive proprietary solutions that have been designed to lock in licensees.”
