Google and VMware take very different approaches to solving the problem of rapidly provisioning large numbers of virtual nodes for production Big Data systems, says Wikibon CTO and leading Big Data Analyst David Floyer in his latest Wikibon Alert, “Google and VMware Provide Virtualization of Hadoop and Big Data”. Google’s Compute Engine, announced at the Google I/O Developers Conference in June, is an Infrastructure-as-a-Service (IaaS) approach built on the Google Cloud Storage service and the KVM hypervisor on Linux. VMware takes an in-house approach based on vSphere with Project Serengeti, announced at the Hadoop Summit, also in June in San Jose. Both have advantages and challenges, but overall Floyer finds that the Google approach, along with similar IaaS systems offered by competitors such as Microsoft Azure, is likely to prove more attractive to large enterprises.
The Google Compute Engine provides 700,000 virtual cores that users can spin up and tear down rapidly for Big Data in general and MapReduce and Hadoop in particular, without the need to set up any internal data center infrastructure. It includes access to the full resources of Google’s private networks to enable large-scale ingestion of data, including replication to specific, user-selected data centers for compliance with regulations requiring that certain data remain within a specific country.
As part of the announcement, MapR CEO and Founder John Schroeder ran a demonstration of Terasort on a 5,024-core Hadoop cluster with 1,256 disks. The set-up was completed in 1 minute and 20 seconds at a total cost of $16 on the Google Compute Engine service. Building the same system internally would require 1,460 physical servers and more than 11,000 cores, take months to set up, and cost more than $5 million. The main limitation of the Google solution, Floyer writes, is that users have to get their data to a Google Edge node for ingestion, although this is usually not a major problem.
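The scale of that gap can be sanity-checked with simple arithmetic. A minimal sketch using the figures quoted above (taking the "more than $5 million" estimate as exactly $5 million for illustration):

```python
# Figures from the MapR Terasort demonstration on Google Compute Engine.
cloud_run_cost_usd = 16            # one run on the 5,024-core cluster
in_house_build_usd = 5_000_000     # quoted lower bound for the equivalent internal system

# How many $16 cloud runs a single in-house build budget would pay for.
runs_per_build = in_house_build_usd // cloud_run_cost_usd
print(runs_per_build)  # 312500
```

In other words, the internal build cost would cover more than 300,000 runs of the same job on the cloud service, before accounting for the months of set-up time.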
Project Serengeti, in contrast, is designed to manage Hadoop applications through vSphere inside the data center. It simplifies the provisioning of multiple Hadoop implementations from Cloudera, Greenplum, Hortonworks, IBM, and MapR across large populations of physical nodes. It includes a toolkit for the Hadoop Distributed File System (HDFS) and MapReduce, released under the Apache 2.0 license on GitHub, which abstracts the node environment from the physical layer.
The potential benefits of Serengeti are significant, Floyer writes, eliminating much of the effort, time, and errors involved in manual provisioning of large numbers of servers. However, the potential overheads are also significant. vSphere Enterprise Edition is required for every node, probably doubling the cost of nodes. Virtualization adds 10%-15% overhead on the physical servers, with potentially greater performance hits when working with randomized data whose I/O is shared across multiple virtual machines. Floyer believes that VMware will tackle these issues in subsequent versions of Project Serengeti.
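The combined effect of those two overheads can be sketched with a back-of-the-envelope model. The function below is illustrative only: the doubling of node cost and the 10%-15% performance tax come from the alert, while the normalization to "cost per unit of useful work" is an assumption for the sketch:

```python
def effective_cost_per_unit_work(hw_cost, license_multiplier, perf_overhead):
    """Cost of one unit of useful work after licensing and virtualization overhead.

    hw_cost:            baseline hardware cost per node (normalized to 1.0 here)
    license_multiplier: factor applied for per-node licensing (2.0 = doubled cost)
    perf_overhead:      fraction of capacity lost to virtualization (0.10-0.15)
    """
    total_cost = hw_cost * license_multiplier
    useful_fraction = 1.0 - perf_overhead
    return total_cost / useful_fraction

bare_metal     = effective_cost_per_unit_work(1.0, 1.0, 0.00)  # = 1.0 baseline
serengeti_low  = effective_cost_per_unit_work(1.0, 2.0, 0.10)  # doubled license, 10% overhead
serengeti_high = effective_cost_per_unit_work(1.0, 2.0, 0.15)  # doubled license, 15% overhead
```

Under these assumptions a virtualized node delivers each unit of work at roughly 2.2x to 2.4x the bare-metal cost, which is why Floyer flags licensing and overhead as the issues VMware most needs to address.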
VMware has also updated Spring for Apache Hadoop, which Floyer expects to provide an interesting long-term integration option for production Hadoop environments.
Overall, Floyer recommends that CTOs become familiar with both Big Data infrastructure systems, as each offers advantages in certain areas, and those advantages will only become more compelling as both contenders improve their offerings.
As with all Wikibon research, this alert is available free of charge on the Wikibon site. Interested parties are welcome to become Wikibon community members by registering on the site, where they can comment on and correct this and other research and publish their own research and industry announcements.