1 Hadoop Update: Big Data Analytics
May 23rd 2012 – Matt Mead, Cloudera

2 What is Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is scalable, fault tolerant, and distributed.
CORE HADOOP SYSTEM COMPONENTS
Hadoop Distributed File System (HDFS) – Self-healing, high-bandwidth clustered storage
MapReduce – Distributed computing framework
Together they provide storage and computation in a single, scalable system.
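The deck itself contains no code, but the division of labor between the two components is easiest to see in the canonical word-count job: HDFS holds the input and output, MapReduce distributes the computation. The following is a minimal sketch against the standard org.apache.hadoop.mapreduce API; the class names and paths are illustrative, not from the slides.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Classic word count: the "hello world" of MapReduce. */
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (word, 1) for every token in the input line.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // The framework groups values by key; sum the counts per word.
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```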

3 Why Use Hadoop?
Move beyond rigid legacy frameworks.
1. Hadoop handles any data type, in any quantity: structured or unstructured, schema or no schema, high volume or low volume, and all kinds of analytic applications.
2. Hadoop grows with your business: proven at petabyte scale, capacity and performance grow simultaneously, and commodity hardware mitigates costs.
3. Hadoop is 100% Apache® licensed and open source: no vendor lock-in, community development, and a rich ecosystem of related projects.
Hadoop helps you derive the complete value of all your data: it drives revenue by extracting value from data that was previously out of reach, and it controls costs by storing data more affordably than any other platform.

4 The Need for CDH
1. The Apache Hadoop ecosystem is complex
Many different components – lots of moving parts
Most companies require more than just HDFS and MapReduce
Creating a Hadoop stack is time-consuming and requires specific expertise: component and version selection, integration (internal and external), and system testing with end-to-end workflows
2. Enterprises consume software in a certain way
System, not silo
Tested and stable
Documented and supported
Predictable release schedule

5 Core Values of CDH
A Hadoop system with everything you need for production use: storage, computation, integration, coordination, and access.
COMPONENTS OF THE CDH STACK
File System Mount – FUSE-DFS
UI Framework / SDK – HUE, HUE SDK
Workflow / Scheduling – APACHE OOZIE
Metadata – APACHE HIVE
Data Integration – APACHE FLUME, APACHE SQOOP
Languages / Compilers – APACHE PIG, APACHE HIVE, APACHE MAHOUT
Fast Read/Write Access – APACHE HBASE
Storage / Compute – HDFS, MAPREDUCE
Coordination – APACHE ZOOKEEPER

6 The Need for CDH
A set of open source components, packaged into a single system.
CORE APACHE HADOOP
HDFS – Distributed, scalable, fault tolerant file system
MapReduce – Parallel processing framework for large data sets
WORKFLOW / COORDINATION
Apache Oozie – Server-based workflow engine for Hadoop activities
Apache ZooKeeper – Highly reliable distributed coordination service
QUERY / ANALYTICS
Apache Hive – SQL-like language and metadata repository
Apache Pig – High-level language for expressing data analysis programs
Apache HBase – Hadoop database for random, real-time read/write access
Apache Mahout – Library of machine learning algorithms for Apache Hadoop
DATA INTEGRATION
Apache Sqoop – Integrates Hadoop with RDBMSs
Apache Flume – Distributed service for collecting and aggregating log and event data
Fuse-DFS – Module within Hadoop for mounting HDFS as a traditional file system
GUI / SDK
Hue – Browser-based desktop interface for interacting with Hadoop
CLOUD
Apache Whirr – Library for running Hadoop in the cloud

7 Core Hadoop Use Cases
Two core use cases, applied across verticals:
VERTICAL        | ADVANCED ANALYTICS            | DATA PROCESSING
Web             | Social Network Analysis       | Clickstream Sessionization
Media           | Content Optimization          | Engagement
Telco           | Network Analytics             | Mediation
Retail          | Loyalty & Promotions Analysis | Data Factory
Financial       | Fraud Analysis                | Trade Reconciliation
Federal         | Entity Analysis               | SIGINT
Bioinformatics  | Sequencing Analysis           | Genome Mapping

8 Data Processing – Full Motion Video & Image Processing
Record by record -> easy parallelization; choosing the right "unit of work" is important
Raw data lands in HDFS
Existing image analyzers are adapted to map-only or MapReduce jobs (see the sketch below)
Scales horizontally
Simple detections: vehicles, structures, faces
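As a rough illustration of the "adapt an existing analyzer" pattern, here is a hypothetical map-only job: each (image name, image bytes) record is one unit of work, there is no reduce phase, and the ImageAnalyzer class is a stand-in stub for the pre-existing single-machine detection library. The SequenceFile input layout is an assumption, not something stated in the deck.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/**
 * Map-only adaptation of an existing image analyzer. Input is assumed to be
 * a SequenceFile of (image name, image bytes); each record is one unit of work.
 */
public class DetectionJob {

  /** Stand-in stub for the existing detection library (hypothetical). */
  static class ImageAnalyzer {
    static java.util.List<String> detect(byte[] image) {
      return java.util.Collections.emptyList(); // real analyzer goes here
    }
  }

  public static class DetectMapper
      extends Mapper<Text, BytesWritable, Text, Text> {
    @Override
    public void map(Text imageName, BytesWritable imageBytes, Context context)
        throws IOException, InterruptedException {
      // Run the single-machine analyzer on one image; emit its detections
      // (vehicles, structures, faces) keyed by image name.
      for (String detection : ImageAnalyzer.detect(imageBytes.copyBytes())) {
        context.write(imageName, new Text(detection));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "image detections");
    job.setJarByClass(DetectionJob.class);
    job.setMapperClass(DetectMapper.class);
    job.setNumReduceTasks(0);               // map-only: no shuffle, no reduce
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Setting the reducer count to zero is what makes this trivially scalable: with no shuffle, adding nodes adds throughput linearly.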

9 Advanced Analytics – Cybersecurity Analysis
Rates and flows: ingest can exceed multiple gigabytes per second
Can be complex because of mixed-workload clusters
Typically involves ad-hoc, question-oriented analytics (see the sketch below)
"Productionized" use cases give non-analysts access to the insights
Existing open source solution: SHERPASURFING
Provides the cybersecurity analysis underpinnings for common data sets (pcap, NetFlow, audit logs, etc.)
Lets analysts ask questions without reinventing all the plumbing
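To make "question-oriented analytics" concrete, here is a hypothetical MapReduce sketch (not part of SHERPASURFING) that answers one ad-hoc question over CSV-formatted NetFlow records: how many bytes did each source IP send to port 22? The field layout in the comment is an assumption for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Ad-hoc question over NetFlow-style CSV records:
 * "How many bytes did each source IP send to port 22?"
 * Assumed record layout: srcIP,dstIP,srcPort,dstPort,protocol,bytes
 */
public class FlowBytesBySource {

  public static class FlowMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    public void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] f = line.toString().split(",");
      if (f.length >= 6 && "22".equals(f[3])) {     // filter: dstPort == 22
        context.write(new Text(f[0]), new LongWritable(Long.parseLong(f[5])));
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    public void reduce(Text srcIp, Iterable<LongWritable> byteCounts,
        Context context) throws IOException, InterruptedException {
      long total = 0;                               // total bytes per source IP
      for (LongWritable b : byteCounts) total += b.get();
      context.write(srcIp, new LongWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "flow bytes by source");
    job.setJarByClass(FlowBytesBySource.class);
    job.setMapperClass(FlowMapper.class);
    job.setCombinerClass(SumReducer.class);         // pre-sum on each mapper
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Each new analyst question becomes another small job (or Hive/Pig query) over the same raw data in HDFS, which is the "plumbing reuse" the slide is describing.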

10 Data Processing – Index Preparation
Hadoop's seminal use case
Dynamic partitioning -> easy parallelization
String interning
Inverted index construction (see the sketch below)
Dimensional data capture
Destination indices: Lucene/Solr (and derivatives), Endeca
Existing solution: USASearch
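Inverted index construction is the use case MapReduce was originally built around: mappers emit (term, document) pairs, and the shuffle does the heavy lifting of grouping every document mention under its term. A minimal sketch follows; using the input file name as the document id is an assumption for illustration.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Builds term -> list-of-documents postings from plain text files. */
public class InvertedIndex {

  public static class TermMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    private Text docId;

    @Override
    protected void setup(Context context) {
      // Use the source file name as the document id (illustrative choice).
      docId = new Text(((FileSplit) context.getInputSplit()).getPath().getName());
    }

    @Override
    public void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(line.toString());
      while (itr.hasMoreTokens()) {
        // Emit (term, docId); the shuffle groups postings by term.
        context.write(new Text(itr.nextToken().toLowerCase()), docId);
      }
    }
  }

  public static class PostingsReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text term, Iterable<Text> docIds, Context context)
        throws IOException, InterruptedException {
      Set<String> postings = new HashSet<String>();  // deduplicate doc ids
      for (Text d : docIds) postings.add(d.toString());
      context.write(term, new Text(postings.toString()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "inverted index");
    job.setJarByClass(InvertedIndex.class);
    job.setMapperClass(TermMapper.class);
    job.setReducerClass(PostingsReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

In practice the reducer output would be fed into the destination index (Lucene/Solr, Endeca) rather than written as text.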

11 Data Processing – Schema-less Enterprise Data Warehouse / Landing Zone
Begins as storage, light ingest processing, and retrieval (see the sketch below)
Capacity scales horizontally
Schema-less -> holds arbitrary content
Schema-less -> allows ad-hoc fusion and analysis
Additional analytic workload eventually forces architectural decisions
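The "begins as storage" stage can be as simple as copying raw files into a dated HDFS directory with no schema imposed. A minimal sketch using the standard HDFS FileSystem API; the /landing/raw path convention is illustrative, not from the deck.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Minimal landing-zone ingest: copy a local file into a dated HDFS
 * directory, imposing no schema on the content. Paths are illustrative.
 * Usage: LandingZoneIngest <local file> <date partition, e.g. 2012-05-23>
 */
public class LandingZoneIngest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml
    FileSystem fs = FileSystem.get(conf);

    Path local = new Path(args[0]);             // e.g. /tmp/events.log
    Path landing = new Path("/landing/raw/" + args[1] + "/" + local.getName());

    fs.mkdirs(landing.getParent());             // e.g. /landing/raw/2012-05-23
    fs.copyFromLocalFile(local, landing);       // raw bytes, schema-less
    System.out.println("stored " + fs.getFileStatus(landing).getLen() + " bytes");
  }
}
```

Because nothing about the layout assumes a schema, later jobs (Hive, Pig, MapReduce) can impose structure at read time, which is what enables the ad-hoc fusion the slide mentions.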

12 Hadoop: Getting Started
Reactive adoption: forced by scale, or by the cost of scaling
Proactive adoption:
Seek talent ahead of the need to build
Identify data sets
Determine high-value use cases that change organizational outcomes
Start with a modest cluster of nodes and 10+ TB, unless the data sets are super-dimensional (a rough sizing sketch follows)
Either way:
Talent is a major challenge
Start with "Data Processing" use cases
Physical infrastructure is complex, so make the software infrastructure simple to manage
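As a rough illustration of why even 10 TB of data implies a multi-node cluster, consider the raw disk it consumes (the 3x replication factor and 25% scratch-space overhead below are common HDFS rules of thumb, not figures from the deck):

\[
10\ \mathrm{TB\ data} \times \underbrace{3}_{\text{HDFS replication}} \times \underbrace{1.25}_{\text{temp/scratch space}} \approx 37.5\ \mathrm{TB\ raw}
\]

Spread across commodity machines with roughly a dozen terabytes of usable disk each, that is already a handful of nodes before any growth, which is why the deck treats 10+ TB as the natural starting scale.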

13 Customer Success: Self-Source Deployment vs. Cloudera Enterprise (500-node deployment)
Option 1: Use Cloudera Enterprise – estimated cost $2 million, deployment time ~2 months
Option 2: Self-source – estimated cost $4.8 million, deployment time ~6 months
[Chart: cumulative cost ($M) vs. time to production deployment (months)]
Note: cost estimates include personnel, software, and hardware. Source: Cloudera internal estimates.

14 Customer Success: Cloudera Enterprise Subscription vs. Self-Source
Item | Cloudera Enterprise | Self-Source or Contract Support
Support offering | World-class, global, dedicated contributors and committers | Must recruit, hire, train, and retain Hadoop experts
Monitoring and management | Fully integrated application for Hadoop intelligence | Must be developed and maintained in house
Support for the full Hadoop stack | Full stack* | Unknown
Regular scheduled releases | Yearly major, quarterly minor, hot fixes | N/A
Training and certification for the full Hadoop stack | Available worldwide | None
Support for full lifecycle | All-inclusive, development through production | Community support
Rich knowledge base (500+ articles) and production solution guides | Included | —
* Flume, FuseDFS, HBase, HDFS, Hive, Hue, Mahout, MR1, MR2, Oozie, Pig, Sqoop, Zookeeper

15 Contact Us
Erin Hawley – Business Development, Cloudera DoD Engagement
Matt Mead – Sr. Systems Engineer, Cloudera Federal Engagements

