Big Data isn’t just about 1s and 0s. More than most IT functions, Big Data and Data Science in particular are collaborative disciplines. Therefore, as I wrote in my 2012 Big Data predictions post, Big Data analytics platforms “are to Data Scientists what playgrounds are to five-year-olds: They’re both areas where exploration and socialization should occur.”
Put another way, a good analytic platform allows Data Scientists to use the analytic tools of their choice to explore Big Data and makes it easy for them to share, comment on and collaborate on analytic projects.
We’ve heard a lot from EMC over the last few months about a new analytic platform it has developed, called Chorus, that the vendor promised would meet both requirements. Well, the company drew the curtain back on Chorus at a press conference today and there is indeed a lot for Data Scientists to like.
First, the collaboration component. Chorus has social media and collaboration capabilities embedded within the platform that allow Data Scientists to build community networks and to view and build upon research from their peers in real time. The platform can also index an organization’s data assets, making it easy for Data Scientists to search for and locate new data sets.
As for performing analytics, the day-in, day-out work of data science, EMC decided not to create a new toolset or embed an existing tool in the platform. Rather, Chorus users can integrate the analytics tools of their choice into the platform, including well-worn and popular tools like SAS Data Miner and newer tools from vendors like Alpine Data Miner (which I profiled last May) and Squid Solutions. Perhaps most important, EMC said Chorus enables Data Scientists to push analytic models into the Greenplum database (more on in-database analytics here and here).
Another feature sure to be popular with Data Scientists is self-service hardware provisioning. Chorus users can spin up data “sandboxes” on virtual hardware provided by VMware’s vFabric.
Chorus goes GA on Friday, March 23, and is the third of three cornerstones of EMC’s Unified Analytic Platform (the other two being the Greenplum database and the Greenplum Hadoop distribution). EMC also said it plans to open source Chorus during the second half of the year to encourage Big Data application developers to build on top of the platform. Stay tuned on that front.
The other big news of the day was the announcement that EMC plans to acquire Pivotal Labs, a consulting firm that helps its clients develop software and applications using an agile methodology and that worked with EMC to develop Chorus. The news of the acquisition actually leaked last week. My colleague Alex Williams wrote about the news on Friday, pointing out that Pivotal Labs “has developed a strong reputation for its Pivotal Tracker, an agile project management app. Its clients include Engine Yard, Groupon and New Relic.”
Pivotal Labs’ primary value is in the collaboration feature sets that it offers. In many respects, it treats development like a real-time, transparent, evolving story. It features virtual rooms for brainstorming. In these virtual rooms, updates from others are seen instantly, and changes from multiple people are merged in real time. At its core is its tracker, which “drives conversation from its workflow transparency.” Story comments allow for feedback and discussion.
The fit between Chorus and Pivotal Labs is obvious: collaborative, real-time analytics and collaborative, real-time application development. EMC said it will leverage Pivotal Labs to assist its customers in building Big Data applications atop Chorus and UAP.
The last mile for Big Data is the application layer. Business users don’t care what infrastructure and databases you use – Hadoop, Greenplum, DB2, Oracle or something else. What they do care about is having access to intuitive, practical applications that help them do their jobs better and faster. With Pivotal Labs in the fold, EMC is well positioned to help its customers bridge this last mile with application development services.
As for Chorus, EMC is definitely on the right track, enabling data exploration and collaboration in a single, self-service platform. The real test still lies ahead, however, as Data Scientists begin using Chorus in real-world scenarios. There are a lot of moving parts, and creating a frictionless user environment isn’t a trivial task.
My biggest concern with UAP on the whole is that the platform is, obviously and understandably, designed and optimized for EMC products. But Big Data is all about mashing up disparate data sets from a variety of sources, be they internal or external to the enterprise. EMC says that UAP users can integrate data from commercial databases from competing vendors like IBM and Oracle, but the company was vague on the details. In any event, it’s clear that competing databases won’t be supported or optimized to the extent that EMC’s Greenplum database and Hadoop distribution are. Enterprises considering investing in UAP should press EMC to address this issue before signing on the dotted line.