Amazon Web Services: 1 Million Hadoop Clusters And Counting


Amazon is generally pretty cagey about its cloud-based Big Data business. But Adam Gray, a Senior Product Manager at Amazon Web Services, let slip this little nugget at an AWS Big Data event in Boston’s Back Bay this morning: Amazon launches over one million Hadoop clusters per year for customers of its Elastic MapReduce service.

Of course Hadoop clusters can vary significantly in size, but any way you slice it, with over a million clusters Amazon is undoubtedly one of the biggest Hadoop practitioners on the planet, along with Yahoo, Facebook and Google.

And that wasn’t the only revelation. It was actually a really informative event, with Amazon pulling back the covers a bit on its own use of Hadoop and allowing two customers – Airbnb and S&P CapitalIQ – to speak publicly about their AWS-based Hadoop deployments.

“Just One More 4th Quarter!”

Amazon Data Scientist John Rauser highlighted a number of ways Amazon uses Hadoop internally. Two of the more interesting use cases are described below.

First, it uses Hadoop to support its associates program, in which affiliates post links to Amazon-based products on their websites and get a percentage of related revenue. Originally, Rauser explained, Amazon developers wrote three separate applications in C++ to process and analyze data associated with these transactions to determine how much to pay each affiliate. But the system quickly began to run up against scaling issues, particularly every fourth quarter (typically the busiest quarter of the year for Amazon). Each October through December, Amazon engineers scrambled to add memory to one application in particular, hoping it would hold on for “just one more fourth quarter.” Finally, in 2010, Amazon turned to Hadoop on AWS to process the data. Now, data streams daily into S3, and each morning a Hadoop cluster processes the last 60 days’ worth of affiliate data, which is then shipped back to S3 and on for further analysis to produce accurate payout data.
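The rolling-window aggregation described above can be sketched in miniature as a map/reduce pass over affiliate transaction records. This is a hedged illustration only: the record fields, the flat commission rate, and the in-memory data all stand in for whatever Amazon's actual pipeline reads from S3.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical affiliate transaction records; in the real pipeline these
# stream into S3 daily and are read by Hadoop mappers.
records = [
    {"affiliate_id": "aff-1", "date": date(2012, 4, 1), "revenue": 120.00},
    {"affiliate_id": "aff-2", "date": date(2012, 4, 2), "revenue": 80.00},
    {"affiliate_id": "aff-1", "date": date(2012, 1, 5), "revenue": 50.00},  # outside the 60-day window
]

COMMISSION_RATE = 0.06  # assumed flat rate, for illustration only

def map_phase(records, today, window_days=60):
    """Emit (affiliate_id, revenue) pairs for records inside the rolling window."""
    cutoff = today - timedelta(days=window_days)
    for r in records:
        if r["date"] >= cutoff:
            yield r["affiliate_id"], r["revenue"]

def reduce_phase(pairs):
    """Sum revenue per affiliate and apply the commission rate."""
    totals = defaultdict(float)
    for aff, rev in pairs:
        totals[aff] += rev
    return {aff: round(rev * COMMISSION_RATE, 2) for aff, rev in totals.items()}

payouts = reduce_phase(map_phase(records, today=date(2012, 4, 15)))
print(payouts)  # {'aff-1': 7.2, 'aff-2': 4.8}
```

In the actual deployment each phase would run across a cluster over far more data, but the map-then-reduce shape of the daily payout job is the same.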

Second, Amazon uses Hadoop to classify high-risk, high-value items that must be stored in highly secured areas of its fulfillment centers. In other words, expensive items like Kindles and plasma TVs must be locked up tight until they’re shipped so nobody steals them. Until recently, the process involved running extremely dense SQL scripts against a large data warehouse, which generated reports that were sent to each of the fulfillment centers. From there, humans made the decision on which items to secure. The problem with this approach, according to Rauser, is that it requires “highly skilled people” to make “sophisticated” judgment calls. That’s not a scalable approach. Amazon wanted to take advantage of machine learning algorithms so the system could decide which items to classify and secure, rather than Amazon workers. Today, Amazon stores its entire product catalog on S3 and uses a 2-node Hadoop cluster to process the small percentage of the catalog that changes each week. Machine learning takes it from there.
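To make the idea concrete, here is a toy sketch of what an automated secure/don't-secure decision might look like. The item fields, feature weights, and threshold are all invented for illustration; Amazon's actual model and features are not public.

```python
import math

# Hypothetical catalog-change records; the real system processes the
# weekly delta of Amazon's catalog from S3 on a small Hadoop cluster.
items = [
    {"sku": "KINDLE-3G", "price": 189.0, "theft_reports": 12},
    {"sku": "PLASMA-50", "price": 999.0, "theft_reports": 7},
    {"sku": "PAPER-CLIPS", "price": 2.5, "theft_reports": 0},
]

# Toy logistic model standing in for the learned classifier; these
# weights are made up for the example, not Amazon's actual model.
W_PRICE, W_THEFT, BIAS = 0.004, 0.3, -2.0

def secure_probability(item):
    """Score an item; higher means more likely to need secured storage."""
    score = W_PRICE * item["price"] + W_THEFT * item["theft_reports"] + BIAS
    return 1.0 / (1.0 + math.exp(-score))

flagged = [i["sku"] for i in items if secure_probability(i) > 0.5]
print(flagged)  # ['KINDLE-3G', 'PLASMA-50']
```

The point of the design is that the thresholding replaces the human judgment call: once the model is trained, the weekly catalog delta can be scored entirely by machine.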

Hadoop Helping Backpackers Find Couches

Airbnb is an online marketplace that allows anybody to advertise and rent out spare rooms to travelers. If you’re living in San Francisco, for example, and have a free couch, you could make a few bucks if you let a backpacker from France crash in your living room for a couple of nights. Like Amazon, Airbnb uses Hadoop to store all of its user data – that’s 10 million HTTP requests per day – to support multiple use cases.

One is to identify its most valuable users, according to Topher Lin, a Data Engineer at Airbnb. Previously, the company determined who its most valuable customers were based on the amount of revenue each one generated personally for Airbnb. But today the company loads all its social media-related data – the company maintains 178 “social connections,” according to Lin – into a Hadoop cluster, then runs proprietary algorithms against it to identify those customers who have the most influence on their social networks when it comes to doing business with Airbnb. So the most valuable customers in a given market are not necessarily the ones that generate the most revenue personally, but the ones who prompt the greatest number of their social connections to also do business with Airbnb. The company can then target these influencers to maximize their influencing effects and drive more business.
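A much-simplified version of this influence ranking can be sketched as counting referred customers through a social graph. The data and the ranking rule here are hypothetical; Airbnb's actual algorithms are proprietary and run over far richer social-connection data in Hadoop.

```python
from collections import defaultdict

# Hypothetical bookings and referral edges from a social graph.
personal_bookings = {"alice": 2, "bob": 10, "carol": 1}
referred_by = {"carol": "alice", "dave": "alice", "erin": "alice", "bob": None}

# Count how many customers each user brought to the platform.
influence = defaultdict(int)
for user, referrer in referred_by.items():
    if referrer:
        influence[referrer] += 1

# Rank by referred customers rather than personal revenue.
most_influential = max(influence, key=influence.get)
print(most_influential)  # alice, despite modest personal bookings
```

Note the inversion the article describes: bob generates the most personal bookings, but alice is the more valuable customer because three others transact through her network.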

Another use involves calculating demand for a given area. Until recently, Airbnb relied on occupancy rates to determine demand for a city or region. So if only 25% of available crash pads were occupied in Chicago, for example, Airbnb would assume demand for its services in that city was low and it wasn’t worth the money or time trying to find more potential business. But that’s a faulty assumption, according to Lin. Instead, because Hadoop allows it to store and access all the data related to searches on its site, Airbnb can determine if occupancy rates are low not because demand is low but because there aren’t enough suitable matches. It could be that there is huge demand for single rooms and larger spaces in the Chicago area, but the majority of spaces available are living room couches. Airbnb can then work to meet the actual demand.
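The gap between what users search for and what is listed can be sketched with a simple tally. The room types and counts below are invented for illustration; in practice this analysis runs over the full search logs Airbnb stores in Hadoop.

```python
from collections import Counter

# Hypothetical search queries and active listings for one market.
searches = ["private_room"] * 800 + ["entire_home"] * 150 + ["shared_couch"] * 50
listings = ["shared_couch"] * 70 + ["private_room"] * 20 + ["entire_home"] * 10

demand = Counter(searches)
supply = Counter(listings)

# Unmet demand per room type: searches minus available listings.
gap = {room_type: demand[room_type] - supply.get(room_type, 0) for room_type in demand}
print(gap)  # private_room shows by far the largest shortfall
```

An occupancy-only view of this market would see the empty couches and conclude demand is weak; the search-log view reveals strong demand for a room type that simply isn't listed.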

Its Big Data approach appears to be working. Airbnb booked 1 million guest nights in 2011. It’s already topped 5 million in the first four months of this year. That Hadoop cluster looks like it’s going to get some more work.

What’s More Noteworthy, A “Dividend Increase” or “Dividend Affirmation”?

S&P CapitalIQ provides its users with comprehensive information around particular companies or organizations they may want to invest in, according to Data Scientist Jeff Sternberg. It does this by allowing users to build personalized, online dashboards that detail the “key developments” of the companies they’re interested in. Sternberg was careful to point out, however, that S&P CapitalIQ is an impartial data platform and is not authorized to make trading recommendations.

Instead it leverages Hadoop to suggest to users which companies to keep an eye on, by analyzing user behavior and by tracking and scoring news stories related to the companies it watches. For example, it analyzes news stories and press releases to determine which events to highlight in a company’s history. The system learns as it processes more and more data, so it now knows, for example, that a company announcing a “dividend increase” is more noteworthy than a “dividend affirmation,” which is nothing more than a company confirming a previous announcement. The job is not so hard for small companies, but consider large companies like Facebook.

Screenshot of sample CapitalIQ "Key Developments" Service

There are probably thousands of discrete events that occur each year related to Facebook, but it’s CapitalIQ’s job to zero in on the most noteworthy and important ones to communicate to users. That means analyzing a lot of data. Because the company uses Amazon’s EMR service, it can scale up quickly to meet unexpected demand, and scale back just as fast when the spike in demand subsides, said Sternberg. The company is in the process of building out another 16-node Hadoop cluster using Hive on Amazon EMR via a Ruby controller script.

About Jeffrey Kelly

Jeffrey F. Kelly is a Principal Research Contributor at The Wikibon Project, an open source research and advisory firm based in Boston. His research focus is the business impact of Big Data and the emerging Data Economy. Mr. Kelly's research has been quoted and referenced by the Wall Street Journal, the Financial Times, Forbes, IDG News, TechTarget and more. Reach him by email or on Twitter at @jeffreyfkelly.