IBM’s pending acquisition of Vivisimo, announced last week, brings into focus an important component of the Big Data stack: Big Data discovery.
By Big Data discovery, I’m referring to a layer in the Big Data stack that gives Data Scientists a comprehensive view of all the data sources available to them within the enterprise. It’s a critical capability because you can’t really take advantage of multiple data sources for Big Data Analytics unless you know they exist in the first place.
Particularly in large enterprises with multiple locations and myriad departments, getting a handle on the totality of data sources available is practically impossible without the aid of a discovery-type platform or tool. It’s not uncommon for a large enterprise to have hundreds, even thousands of data sources in production at any one time. JPMorgan Chase, for example, stores over 150 petabytes of data in an astounding 30,000 databases.
And unless you’re planning on ripping and replacing all those relational databases, data warehouses and file stores and loading all your data into Hadoop – and nobody’s doing that yet – Hadoop could become just another siloed data source without Big Data discovery capabilities. That defeats the whole purpose of Big Data Analytics, which aims to make all your data available for analysis without the need for sampling.
But providing a view of data sources is just one requirement of Big Data discovery. With a federated view of all available data sources, Data Scientists must also be able to access the various data sources and integrate them into analytics platforms to perform Big Data Analytics. Application developers, likewise, need access to the data sources to feed new Big Data apps.
Vivisimo touts its Velocity Information Optimization Platform as meeting both requirements – Big Data discovery and Big Data access/integration. It is a scale-out discovery platform that employs a schema-less index and search capabilities to provide near-real-time metadata analysis, allowing users to search and visualize data sources from across the enterprise and incorporate them into analytic and application development environments.
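To make the idea of a schema-less metadata index a bit more concrete, here is a minimal Python sketch. It is not Vivisimo’s implementation or API; the `MetadataIndex` class, its methods and the sample data sources are all hypothetical and exist only to illustrate the general technique: index whatever metadata fields a source happens to have, then search across all of them with keywords.

```python
from collections import defaultdict


class MetadataIndex:
    """Toy schema-less inverted index over data-source metadata (hypothetical)."""

    def __init__(self):
        self._sources = {}                 # source name -> raw metadata dict
        self._postings = defaultdict(set)  # token -> names of sources containing it

    def add(self, name, metadata):
        # No fixed schema: every key and every value is tokenized and indexed,
        # so each source can describe itself with whatever fields it has.
        self._sources[name] = metadata
        for key, value in metadata.items():
            for token in f"{key} {value}".lower().split():
                self._postings[token].add(name)

    def search(self, query):
        # Return the sources whose metadata contains every query term.
        terms = query.lower().split()
        if not terms:
            return []
        hits = set.intersection(*(self._postings.get(t, set()) for t in terms))
        return sorted(hits)


# Hypothetical data sources, each described by different metadata fields.
index = MetadataIndex()
index.add("crm_db", {"type": "relational", "owner": "sales", "tables": "customers orders"})
index.add("weblogs", {"type": "hdfs", "format": "json", "topic": "clickstream customers"})
index.add("finance_dw", {"type": "warehouse", "owner": "finance"})

print(index.search("customers"))
print(index.search("owner sales"))
```

A real discovery platform would add crawling, ranking, security and scale-out on top, but the core point carries over: a Data Scientist can ask which sources mention customers without knowing in advance that those sources exist or how they are structured.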
IBM’s challenge is to integrate Vivisimo into its already crowded Big Data portfolio and further develop its app dev-enablement capabilities.
Attivio’s Active Intelligence Engine is a similar platform to Vivisimo’s, but AIE takes Big Data discovery a step further. AIE – which incorporates a hybrid indexing approach to handle multi-structured data from data sources including Hadoop and virtually all standard relational databases – includes application development capabilities as well as a unique query-time JOIN functionality that allows users to merge disparate data sources on the fly. Attivio also developed its own text analytics and sentiment analysis capabilities to supplement Big Data discovery and app dev. (Below see Attivio CTO Sid Probstein discuss Big Data discovery and how to avoid great ideas “Dying in the Parkinglot.”)
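As a rough illustration of what a query-time JOIN means, the Python sketch below merges records from two separate sources only when a query is issued, rather than loading them into one store ahead of time. This is not Attivio’s implementation; the function name `query_time_join`, the sample records and the choice of a simple hash join are assumptions made purely for illustration.

```python
def query_time_join(left_rows, right_rows, key):
    """Inner-join two lists of dicts on `key` when the query runs (hash join)."""
    # Build a lookup table from the right-hand source, keyed on the join field.
    lookup = {}
    for row in right_rows:
        lookup.setdefault(row[key], []).append(row)

    # Probe the lookup with each left-hand row and merge the matching records.
    joined = []
    for row in left_rows:
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical records: customers from a relational database and
# sentiment scores derived from text stored elsewhere (e.g. in Hadoop).
customers = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Globex"},
]
sentiment = [
    {"customer_id": 1, "sentiment": "negative"},
    {"customer_id": 1, "sentiment": "positive"},
    {"customer_id": 3, "sentiment": "neutral"},
]

result = query_time_join(customers, sentiment, "customer_id")
for row in result:
    print(row)
```

Only customer 1 appears in both sources, so the result holds two merged records for that customer. Neither source had to be reshaped or copied into the other beforehand.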
A third option comes from the Hadoop ecosystem. HCatalog is a table and storage management service under development with the goal of providing a view of both Hadoop-based data sources and more traditional databases and data warehouses, according to Hortonworks’ CEO Rob Bearden. It’s early days for HCatalog, though, which is in version 0.2.0 under the Apache project.
Big Data discovery is a necessary component of any Big Data stack because you need to know what data you’ve got before you can take advantage of it. It’s an area that is still developing, however, so expect to see a lot more action in the Big Data discovery market from both Hadoop-focused vendors and the big relational database players.