Yesterday during his keynote at HadoopWorld 2011, Apache Hadoop creator and Cloudera employee Doug Cutting announced that the next version of Cloudera’s Distribution Including Hadoop (CDH4) will be based on Apache BigTop. But what exactly is BigTop?
Eli Collins, one of the key developers working on BigTop at Cloudera, appeared on theCube to talk about it, and I also talked to Cutting and Collins about it offline. The short answer is: BigTop is CDH as an Apache Incubator project. The initial commit of BigTop is in fact the CDH codebase. From now on, development of CDH will take place first at BigTop, with CDH being based on BigTop rather than vice versa.
CDH has always been a free open source project – Cloudera Enterprise, a collection of proprietary management tools, is Cloudera’s paid product. The BigTop incubator project is a step towards making CDH a full-fledged Apache project.
This matters for Cloudera as it tries to maintain its lead in the Hadoop market while more distributions, from companies like HortonWorks and MapR, are released. Having Cloudera’s own distribution become the standard distribution from Apache would be a big deal.
To explain the difference between CDH/BigTop and the core Apache Hadoop project, Cutting in his keynote compared the main Hadoop project to the Linux kernel. Most of us incorrectly refer to all distributions of GNU/Linux as simply “Linux,” even though the name Linux only refers to the kernel. It’s a convenient shorthand. The full operating systems are actually distributions such as Debian, Red Hat and Ubuntu. Likewise, Apache Hadoop refers to a core selection of tools, but there is a bigger stack of tools that are becoming central to the use of Hadoop, such as HBase, Pig, Hive and Avro. BigTop/CDH includes all these separate projects and brings them under one big tent.
In his keynote, Cutting said that nothing is sacred at Apache – what we now call Hadoop could be replaced with something else. Presumably, though, BigTop would live on even if the core Apache Hadoop tools were eventually swapped out for something else (not that that’s going to happen any time soon). By aligning its distribution more closely with Apache, Cloudera is positioning itself to be a leader even if there’s a big shift in the underlying technology.