Big Data’s Virtualization Journey. Andrew Yu, Sr. Director, Big Data R&D, VMware. © 2009 VMware Inc. All rights reserved.
2 Big Data: Not Just for the Web Giants – Now the Intelligent Enterprise
3 Real-time analysis allows instant understanding of market dynamics. Retailers can gain an intimate understanding of their customers' needs and use directly targeted marketing. Use cases: market segment analysis; personalized customer targeting.
4 The Emerging Pattern of Big Data Systems: Retail Example. Components: real-time streams; exa-scale data store; parallel data processing (shown twice); real-time processing; machine learning; data science; analytics; cloud infrastructure.
5 A single GE jet engine produces 10 terabytes of data in one hour – roughly 90 petabytes per year. This data enables early detection of faults and common-mode failures, and provides product engineering feedback. Stages: post mortem; proactively maintained; connected product.
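The yearly figure on this slide can be checked with a short calculation. The sketch below assumes the engine produces data continuously for every hour of a 365-day year and uses decimal units (1 PB = 1000 TB). Both are assumptions made for illustration, not statements from the slide.

```python
# Rough check of the per-year data volume quoted on the slide.
# Assumptions (not from the slide): continuous operation all year,
# and decimal units where 1 PB = 1000 TB.
HOURS_PER_YEAR = 24 * 365
TB_PER_PB = 1000


def yearly_petabytes(tb_per_hour, hours_per_year=HOURS_PER_YEAR):
    """Convert an hourly rate in terabytes to a yearly volume in petabytes."""
    return tb_per_hour * hours_per_year / TB_PER_PB


engine_pb_per_year = yearly_petabytes(10)
print(f"{engine_pb_per_year:.1f} PB per year")
```

Under these assumptions the result is a little under 90 PB, which is consistent with the rounded figure on the slide.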
6 Storage: Plan for Peta-scale Data Storage and Processing. Analytics data rapidly outgrows traditional data size by 100x. (Chart axis: PB of data.)
7 Cloud Infrastructure Supports Mixed Big Data Workloads. Workloads: machine learning; Hadoop; real-time analytics. Cloud infrastructure layers: management; network/security; storage/availability; compute.
8 Cloud Infrastructure Supports Multiple Tenants. Tenants: web user analytics; financial analysis; historical customer behavior. Cloud infrastructure layers: management; network/security; storage/availability; compute.
9 Software-defined Datacenter: Compute. The Core Values of Virtualization Apply to Big Data: agility and rapid deployment; lower capex; isolation for resource control and security; operational efficiency. Layers: management; network/security; storage/availability; compute.
10 Strong Isolation between Workloads is Key. Workloads sharing the cloud infrastructure: hungry workload 1; reckless workload 2; nosy workload 3.
11 Consolidation of Workloads: Higher Utilization (example clusters: Hadoop 1, Hadoop 2, HBase)
Without virtualization, independent Hadoop clusters each have access to only a fraction of the total physical resources.
Consolidate and virtualize:
-The consolidated cluster has access to the entire pool of physical resources
-For common use cases, latency is reduced for priority jobs on the consolidated cluster
-Multiple HDFS instances are striped across all physical hosts
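The latency argument for consolidation can be illustrated with a toy model. The sketch below assumes a perfectly parallel priority job, idle neighbours at the moment it runs, and made-up host counts and job size. None of these numbers come from the slide, and real schedulers and workloads will behave less ideally.

```python
# Toy model: runtime of a priority job on a siloed cluster versus a
# consolidated pool. All numbers are hypothetical, for illustration only.


def job_runtime_hours(work_host_hours, hosts_available):
    """Runtime of a perfectly parallel job given the hosts it can use."""
    return work_host_hours / hosts_available


# Hypothetical silos: three independent clusters of 10 physical hosts each.
silos = {"Hadoop 1": 10, "Hadoop 2": 10, "HBase": 10}
pool_hosts = sum(silos.values())

work = 60.0  # hypothetical job size in host-hours

# Siloed: the job can only use the hosts of its own cluster.
silo_hours = job_runtime_hours(work, silos["Hadoop 1"])

# Consolidated: the job can use the whole pool while other tenants are idle.
pooled_hours = job_runtime_hours(work, pool_hosts)

print(f"siloed: {silo_hours} h, consolidated: {pooled_hours} h")
```

In this idealized case the priority job finishes three times faster on the consolidated pool, because it can borrow capacity that would otherwise sit idle in the other silos.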
12 Big Data Mix of Workloads. Workloads: Hadoop batch analysis; HBase real-time queries; NoSQL (Cassandra, Mongo, etc.); Big SQL (Impala, Pivotal HAWQ); other (Spark, Shark, Solr, Platfora, etc.). Layers: compute layer; file system/data store; virtualization; host.
13 Software-defined Datacenter: Storage. Requirements of Next Generation Storage: 10x lower cost of storage; handle explosive data growth; support a variety of application types; solve the privacy and security issues. Layers: management; network/security; storage/availability; compute.
14 Software-defined Storage Enables Fundamental Economics. Petabytes deployed, by storage type: traditional SAN/NAS; distributed object storage; HDFS, MapR, Ceph; scale-out NAS (Isilon, NetApp).
15 Big Data Using Local Disks. Host: servers with local disks; core server; SATA 2-4 TB disks; 10 GbE adapter; iSCSI/NFS for shared storage (for vMotion, etc.). Top-of-rack switch: one high-performance 10 GbE switch per rack.
16 Big Data Storage: Scale-out Network Storage with Elastic Compute. Scale-out network storage features: Hadoop protocol; snapshots; POSIX apps; full NFS access; replication; erasure coding.
17 Customer Success: Hadoop as a Service at FedEx
Scale-out Isilon cluster:
-Shared data
-NAS + Hadoop
Elastic vSphere cluster:
-Mixed workloads
-vSphere
-Existing rack-mount servers
18 Storage Configuration for Data/Compute Separation with Isilon. (Diagram: a virtualization host runs Hadoop Virtual Nodes 1, 2 and 3, hosting the JobTracker and TaskTrackers; each node has an OS image VMDK on shared SAN/NAS storage plus Ext4 file systems, including temp space; the NameNode and data node roles are shown on Isilon.)
19 Agile Big Data at FedEx. Security: trusted isolation; well-known, auditable platform. Agility: deploy in minutes; optimize for shifts in workload characteristics. Elasticity: create true multi-tenancy; mixed workloads.
20 Breakthrough Use Cases
-Web log analysis: Initial exploration focused on detecting mobile devices accessing the website. Analysis of 570 billion web server log entries took approximately 9 minutes to complete on a small cluster.
-ZIP code analysis: Analysis of data to determine which ZIP codes are the highest source or destination for shipments.
-Shipment analysis: Analysis of shipment information to determine patterns that may delay a package.
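As a rough illustration of the web log use case, the sketch below classifies log lines as mobile or non-mobile by their user-agent string. The log layout (user agent as the last double-quoted field), the token list, and the sample lines are all assumptions invented for this example. The slides do not describe how the actual analysis was implemented, and a production job would run as a distributed Hadoop job rather than a single-process loop.

```python
import re
from collections import Counter

# Assumption: the user agent is the last double-quoted field on each line,
# as in the common "combined" web server log layout.
USER_AGENT_PATTERN = re.compile(r'"([^"]*)"\s*$')

# Hypothetical, deliberately short list of substrings that suggest a mobile device.
MOBILE_TOKENS = ("Mobile", "Android", "iPhone", "iPad")


def classify(line):
    """Return 'mobile', 'other', or 'unparsed' for one log line."""
    match = USER_AGENT_PATTERN.search(line)
    if match is None:
        return "unparsed"
    user_agent = match.group(1)
    if any(token in user_agent for token in MOBILE_TOKENS):
        return "mobile"
    return "other"


def count_devices(lines):
    """Count log lines per device class (the 'map' and 'reduce' in one pass)."""
    return Counter(classify(line) for line in lines)


# Made-up sample lines, for illustration only.
sample = [
    '10.0.0.1 - - [01/Jan/2013:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) Mobile/10A403"',
    '10.0.0.2 - - [01/Jan/2013:00:00:02 +0000] "GET /track HTTP/1.1" 200 1024 "-" '
    '"Mozilla/5.0 (Windows NT 6.1) Chrome/24.0"',
    "malformed line",
]

counts = count_devices(sample)
print(dict(counts))
```

Because each line is classified independently, the same per-line function could serve as the map step of a distributed job, with the counting done in the reduce step.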
21 Cloud Infrastructure is Ready for Big Data – Are You?
22 Q&A