For all its advantages in Big Data research, Hadoop has one major drawback: it struggles to achieve even 20% server utilization, writes Jeff Kelly in his latest Wikibon Peer Incite, “Improving Hardware Efficiency Important to Overall Hadoop ROI”. That turns large production Hadoop installations running on hundreds of nodes into a drain on IT CapEx. Earlier this summer VMware attacked the issue with Project Serengeti, contributing code to Apache Hadoop that makes HDFS and MapReduce “virtualization aware” so that Hadoop clusters can run on virtual machines. The only problem: each node of a Hadoop cluster needs its own copy of vSphere Enterprise Edition, which adds a major “virtualization tax” to the implementation.
Now Pervasive Software has introduced a potentially less expensive alternative built on DataRush, a Java framework that runs multiple Hadoop jobs in parallel in vanilla JVMs on commodity hardware, according to Pervasive Senior Director of Business Development and Strategy David Inbar. A second product, RushAnalyzer, speeds up Big Data preprocessing, Inbar says.
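Pervasive has not published DataRush internals here, but the utilization argument rests on a familiar JVM technique: saturating every core of one physical machine with concurrent tasks, rather than carving the machine into licensed virtual machines. The sketch below is a minimal, hypothetical illustration of that idea using the standard java.util.concurrent API; it is not DataRush’s actual interface, and the job names are invented.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelJobRunner {
    public static void main(String[] args) throws InterruptedException {
        // Size the worker pool to the machine: one thread per core keeps
        // commodity hardware busy instead of idling at ~20% utilization.
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);

        // Hypothetical stand-ins for MapReduce-style work; a framework such
        // as DataRush would submit real dataflow operators instead.
        List<Runnable> jobs = List.of(
                () -> System.out.println("job 1: scan input split"),
                () -> System.out.println("job 2: aggregate counts"),
                () -> System.out.println("job 3: write results"));

        jobs.forEach(pool::submit); // jobs run concurrently, sharing one JVM
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```

The point of the design is that all of this happens inside a single ordinary JVM process, so there is no per-node hypervisor license to pay.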
Together they abstract away the complexity of parallelizing Hadoop jobs, let users monitor I/O and CPU use in real time, and mitigate memory constraints; Inbar claims server efficiency rates as high as 80%. The products are getting attention from systems integrators and consultancies and, says Kelly, should be evaluated alongside Project Serengeti by IT groups moving from proof-of-concept Hadoop trials to large-scale implementations. If Pervasive’s products perform as promised, the savings in hardware, power and cooling, and data center floor space could be significant, without the “virtualization tax”.
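As a point of reference on the real-time monitoring claim: a vanilla JVM can already report CPU load through the HotSpot extension com.sun.management.OperatingSystemMXBean, shipped with Oracle/OpenJDK builds since Java 7. The sampling loop below is our own illustration, not Pervasive’s tooling.

```java
import java.lang.management.ManagementFactory;

public class CpuMonitor {
    public static void main(String[] args) throws InterruptedException {
        // The com.sun.management variant of the bean adds CPU-load accessors
        // beyond the standard java.lang.management interface.
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean)
                        ManagementFactory.getOperatingSystemMXBean();

        for (int i = 0; i < 10; i++) {
            double process = os.getProcessCpuLoad(); // this JVM, 0.0-1.0
            double system  = os.getSystemCpuLoad();  // whole machine, 0.0-1.0
            // Either call may return -1 until the first sample is available.
            System.out.printf("process %.0f%%  system %.0f%%%n",
                    process * 100, system * 100);
            Thread.sleep(1000);
        }
    }
}
```

Numbers like these are what make the gap between today’s roughly 20% utilization and the claimed 80% directly observable on a running cluster.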