Hadoopla: Microsoft and the Hadoop Ecosystem Presented at SQL Saturday Waltham May 19th, 2012 Jim O’Neil Developer Evangelist, Microsoft jim.oneil@microsoft.com @jimoneil
Big Data Starts with a V Volume: there’s a lot of it; we’re hoarders. Variety: schema-schmema, it’s coming from the ‘internet of things’. Velocity: he who hesitates doesn’t get the worm.
There’s a Tech for That Volume: Data Warehouses, Distributed File Systems + Map-Reduce. Variety: NoSQL databases. Velocity: Complex Event Processing.
Two Dimensions of Scale: Up, Out
Scaling Out is Hard [Chart: programming complexity versus number of machines, 1 to n]
Distributed File Systems [Diagram: one name node and several data nodes]
Map Reduce [Diagram: job tracker, name node, data nodes, task tracker]
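The name node / data node split in the diagram can be illustrated with a toy model: the name node keeps only metadata (which data nodes hold each block of a file), while the data nodes hold the blocks themselves. The function `placeBlocks`, the node names, the sizes, and the round-robin placement are all assumptions made for illustration; real HDFS uses much larger blocks and a rack-aware placement policy.

```javascript
// Toy sketch, not a Hadoop API: compute the block map a name node might keep.
// Assumes replication <= dataNodes.length so each block's replicas are distinct.
function placeBlocks(fileSizeBytes, blockSizeBytes, dataNodes, replication) {
    var blockCount = Math.ceil(fileSizeBytes / blockSizeBytes);
    var blockMap = [];
    for (var b = 0; b < blockCount; b++) {
        var holders = [];
        // Round-robin placement (an illustrative simplification).
        for (var r = 0; r < replication; r++) {
            holders.push(dataNodes[(b + r) % dataNodes.length]);
        }
        blockMap.push({ block: b, dataNodes: holders });
    }
    return blockMap;
}

// Hypothetical cluster: four data nodes, a 200-byte file, 64-byte blocks, 3 replicas.
var nodes = ["datanode1", "datanode2", "datanode3", "datanode4"];
var blockMap = placeBlocks(200, 64, nodes, 3);
console.log(JSON.stringify(blockMap));
```

Because every block lives on several data nodes in this model, losing one machine leaves each block readable from another holder.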
Map Reduce: Word count example
Input: I am what I am
map: I: 1, am: 1, what: 1, I: 1, am: 1
shuffle and sort
reduce: I: 2, am: 2, what: 1

var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};
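To try the word count without a cluster, the map, shuffle-and-sort, and reduce steps can be simulated in plain JavaScript. This is a minimal sketch: `runLocally` and its in-memory `context` and iterator objects are hypothetical stand-ins for what the Hadoop runtime would supply, and the slide's `map` and `reduce` are repeated so the block is self-contained. Keys come out lowercased because `map` calls `toLowerCase()`.

```javascript
// Map and reduce functions as shown on the slide.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Hypothetical local driver (not a Hadoop API): map, group by key, then reduce.
function runLocally(lines) {
    // Map phase: collect every (key, value) pair the mapper writes.
    var mapped = [];
    var mapContext = { write: function (k, v) { mapped.push([k, v]); } };
    lines.forEach(function (line, offset) { map(offset, line, mapContext); });

    // Shuffle and sort: gather the values for each key.
    var groups = Object.create(null);
    mapped.forEach(function (pair) {
        if (!(pair[0] in groups)) { groups[pair[0]] = []; }
        groups[pair[0]].push(pair[1]);
    });

    // Reduce phase: call reduce once per key, in sorted key order,
    // with an iterator exposing the hasNext()/next() calls the slide uses.
    var results = [];
    var reduceContext = { write: function (k, v) { results.push([k, v]); } };
    Object.keys(groups).sort().forEach(function (key) {
        var vals = groups[key];
        var pos = 0;
        var iterator = {
            hasNext: function () { return pos < vals.length; },
            next: function () { return vals[pos++]; }
        };
        reduce(key, iterator, reduceContext);
    });
    return results;
}

var wordCounts = runLocally(["I am what I am"]);
console.log(wordCounts); // [ [ 'am', 2 ], [ 'i', 2 ], [ 'what', 1 ] ]
```

On a real cluster the map calls run in parallel on the nodes that hold the input, and the framework performs the grouping between the two phases.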
Enter Hadoop Apache project (http://hadoop.apache.org) Open source implementation of Google File System and MapReduce Hadoop Distributed File System (HDFS) Hadoop MapReduce Hadoop Common
Hadoop History
2002 Doug Cutting develops Nutch, web crawler
2003-2004 Google publishes GFS and MapReduce papers
2006 Cutting joins Yahoo!; Hadoop becomes Apache Lucene subproject
2008 Hadoop becomes top-level Apache project
2009 Cutting joins Cloudera
2011 Hortonworks formed by Yahoo! and Benchmark Capital
2011 Hadoop reaches version 1.0.0 (Dec. 27)
Adopters Yahoo! has a 40,000 node cluster Facebook has over 30PB of data in Hadoop Oracle’s Big Data Appliance includes a Hadoop distribution JP Morgan Chase uses it for fraud detection eBay is replacing its core search technology with it Microsoft is working with Hortonworks to distribute Hadoop on Windows both in the cloud and on-premises
Hadoop on Azure Limited customer preview Windows Server on-premises distribution to follow http://hadooponazure.com
Sign up
Cluster Provisioning
Demo
The Menagerie Begins Pig: query infrastructure for Hadoop SQL-like scripts (Pig Latin) launch map-reduce jobs http://pig.apache.org/ Hive: data warehouse system for Hadoop HiveQL (SQL-like) for querying (launching map reduce jobs) http://hive.apache.org
More Demo
More Ecosystem HBase: NoSQL database built on HDFS Cassandra: Wide column NoSQL store Sqoop: bridge from RDBMS to HDFS
And More Flume: log aggregator to HDFS Scribe: another log aggregator Chukwa: log processing platform [Flume console banner: Distributed Log Collection.]
And Some More ZooKeeper: distributed system coordinator Oozie: workflow engine Avro: data serialization system Ganglia: distributed monitoring system
We’re Not Done Yet! Mahout: machine learning library Pegasus: graph mining system CloudBurst: genome sequence mapping
And It’s Just One Piece of the Big Data Pie: Microsoft’s big data solution
FAMILIAR END USER TOOLS: Power View, Excel with PowerPivot, Predictive Analytics, Embedded BI
BI PLATFORM: SSAS, SSRS
Microsoft SQL Server / PDW, Connectors, Hadoop On Windows Azure, Hadoop On Windows Server
UNSTRUCTURED & STRUCTURED DATA: Sensors, Devices, Bots, Crawlers, ERP, CRM, LOB
I meant what I said, and I said what I meant. An elephant's faithful, one hundred percent. Jim O’Neil Developer Evangelist, Microsoft jim.oneil@microsoft.com @jimoneil