May 23rd 2012 Matt Mead, Cloudera


Hadoop Update: Big Data Analytics
May 23rd 2012, Matt Mead, Cloudera

What is Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is scalable, fault tolerant, and distributed.

Core Hadoop system components:
- Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
- MapReduce: distributed computing framework

Together they provide storage and computation in a single, scalable system.
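The MapReduce model named above can be illustrated with a minimal in-process word count. This is a plain-Python sketch of the map, shuffle, and reduce phases, not Hadoop's actual Java API:

```python
from collections import defaultdict

def map_phase(record):
    # Emit (word, 1) pairs, as a Hadoop mapper would.
    for word in record.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Group values by key; Hadoop performs this step between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum all counts emitted for one key.
    return key, sum(values)

records = ["Hadoop stores data", "Hadoop processes data"]
pairs = [kv for r in records for kv in map_phase(r)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In a real cluster the same per-record map function runs in parallel across HDFS blocks, which is what makes the model scale horizontally.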

Why Use Hadoop?
1. Move beyond rigid legacy frameworks. Hadoop handles any data type, in any quantity: structured or unstructured, schema or no schema, high volume or low volume, all kinds of analytic applications.
2. Hadoop grows with your business: proven at petabyte scale, capacity and performance grow simultaneously, and it leverages commodity hardware to mitigate costs.
3. Hadoop is 100% Apache® licensed and open source: no vendor lock-in, community development, and a rich ecosystem of related projects.

Hadoop helps you derive the complete value of all your data. It drives revenue by extracting value from data that was previously out of reach, and controls costs by storing data more affordably than any other platform.

The Need for CDH
1. The Apache Hadoop ecosystem is complex:
- Many different components, with lots of moving parts
- Most companies require more than just HDFS and MapReduce
- Creating a Hadoop stack is time-consuming and requires specific expertise: component and version selection, integration (internal and external), and system testing with end-to-end workflows
2. Enterprises consume software in a certain way:
- System, not silo
- Tested and stable
- Documented and supported
- Predictable release schedule

Core Values of CDH
A Hadoop system with everything you need for production use: storage, computation, integration, coordination, and access.

Components of the CDH stack:
- File system mount: FUSE-DFS
- UI framework: Hue
- SDK: Hue SDK
- Workflow scheduling: Apache Oozie
- Metadata: Apache Hive
- Data integration: Apache Flume, Apache Sqoop
- Languages / compilers: Apache Pig, Apache Hive, Apache Mahout
- Fast read/write access: Apache HBase
- Coordination: Apache ZooKeeper
- Storage and computation: HDFS, MapReduce

The Need for CDH
A set of open source components, packaged into a single system.

Core Apache Hadoop:
- HDFS: distributed, scalable, fault tolerant file system
- MapReduce: parallel processing framework for large data sets

Workflow / coordination:
- Apache Oozie: server-based workflow engine for Hadoop activities
- Apache ZooKeeper: highly reliable distributed coordination service

Query / analytics:
- Apache Hive: SQL-like language and metadata repository
- Apache Pig: high-level language for expressing data analysis programs
- Apache HBase: Hadoop database for random, real-time read/write access
- Apache Mahout: library of machine learning algorithms for Apache Hadoop

Data integration:
- Apache Sqoop: integrates Hadoop with RDBMSs
- Apache Flume: distributed service for collecting and aggregating log and event data
- Fuse-DFS: module within Hadoop for mounting HDFS as a traditional file system

GUI / SDK:
- Hue: browser-based desktop interface for interacting with Hadoop

Cloud:
- Apache Whirr: library for running Hadoop in the cloud

Core Hadoop Use Cases
Two core use cases, applied across verticals:

Vertical        | Advanced Analytics            | Data Processing
Web             | Social network analysis       | Clickstream sessionization
Media           | Content optimization          | Engagement
Telco           | Network analytics             | Mediation
Retail          | Loyalty & promotions analysis | Data factory
Financial       | Fraud analysis                | Trade reconciliation
Federal         | Entity analysis               | SIGINT
Bioinformatics  | Sequencing analysis           | Genome mapping

Data Processing: Full Motion Video & Image Processing
- Record-by-record processing makes parallelization easy; choosing the "unit of work" is important
- Raw data stored in HDFS
- Existing image analyzers adapted to map-only / MapReduce jobs
- Scales horizontally
- Simple detections: vehicles, structures, faces

Advanced Analytics: Cybersecurity Analysis
- Rates and flows: ingest can exceed multiple gigabytes per second
- Can be complex because of mixed-workload clusters
- Typically involves ad-hoc, question-oriented analytics
- "Productionized" use cases allow insight by non-analysts
- Existing open source solution: SHERPASURFING
  - Focuses on the cybersecurity analysis underpinnings for common data sets (pcap, netflow, audit logs, etc.)
  - Provides a means to ask questions without reinventing all the plumbing
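A toy sketch of the kind of question-oriented analytic described above: aggregating netflow-style records to answer "who are the top talkers?". The record fields are illustrative only, not SHERPASURFING's actual schema:

```python
from collections import Counter

# Hypothetical netflow-style records: (src_ip, dst_ip, bytes_transferred).
flows = [
    ("10.0.0.1", "10.0.0.9", 500),
    ("10.0.0.2", "10.0.0.9", 1200),
    ("10.0.0.1", "10.0.0.3", 800),
]

# The "top talkers" question: total bytes sent per source address.
bytes_by_src = Counter()
for src, _dst, nbytes in flows:
    bytes_by_src[src] += nbytes

print(bytes_by_src.most_common(1))  # [('10.0.0.1', 1300)]
```

At cluster scale the same grouping would run as a MapReduce job over flow records in HDFS, so the same question can be asked of terabytes of traffic.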

Data Processing: Index Preparation
- Hadoop's seminal use case
- Dynamic partitioning makes parallelization easy
- String interning, inverse index construction, dimensional data capture
- Destination indices: Lucene/Solr (and derivatives), Endeca
- Existing solution: USA Search (http://usasearch.howto.gov/)
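Inverse index construction, the core of this use case, fits in a few lines. This is a minimal in-memory version; a real Hadoop job would emit (term, document-id) pairs from mappers and merge the posting lists in reducers before handing them to Lucene/Solr:

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Map each term to the sorted list of document ids containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "hadoop stores data", 2: "hadoop indexes data fast"}
index = build_inverted_index(docs)
print(index["hadoop"])  # [1, 2]
print(index["fast"])    # [2]
```

Partitioning the term space across reducers is what makes this embarrassingly parallel, which is why it was Hadoop's seminal workload.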

Data Processing: Schema-less Enterprise Data Warehouse / Landing Zone
- Begins as storage, light ingest processing, and retrieval
- Capacity scales horizontally
- Schema-less: holds arbitrary content
- Schema-less: allows ad-hoc fusion and analysis
- Additional analytic workload forces decisions

Hadoop: Getting Started
Reactive:
- Forced by scale, or by the cost of scaling
Proactive:
- Seek talent ahead of the need to build
- Identify data sets
- Determine high-value use cases that change organizational outcomes
- Start with 10-20 nodes and 10+ TB unless data sets are super-dimensional
Either way:
- Talent is a major challenge
- Start with "data processing" use cases
- Physical infrastructure is complex; make the software infrastructure simple to manage

Customer Success: Self-Source Deployment vs. Cloudera Enterprise (500-node deployment)
- Option 1: Use Cloudera Enterprise. Estimated cost: $2 million; deployment time: ~2 months
- Option 2: Self-source. Estimated cost: $4.8 million; deployment time: ~6 months
Note: Cost estimates include personnel, software & hardware. Source: Cloudera internal estimates.

Customer Success: Cloudera Enterprise Subscription vs. Self-Source

Item                                              | Cloudera Enterprise                                        | Self-Source or Contract Support
Offering                                          | World-class, global, dedicated contributors and committers | Must recruit, hire, train and retain Hadoop experts
Monitoring and management                         | Fully integrated application for Hadoop intelligence       | Must be developed and maintained in house
Support for the full Hadoop stack                 | Full stack*                                                | Unknown
Regular scheduled releases                        | Yearly major, quarterly minor, hot fixes                   | N/A
Training and certification for the full stack     | Available worldwide                                        | None
Support for full lifecycle                        | All inclusive, development through production              | Community support
Rich knowledge base                               | 500+ articles, production solution guides included         |

* Flume, FuseDFS, HBase, HDFS, Hive, Hue, Mahout, MR1, MR2, Oozie, Pig, Sqoop, ZooKeeper

Contact Us
Erin Hawley, Business Development, Cloudera, DoD Engagement: ehawley@cloudera.com
Matt Mead, Sr. Systems Engineer, Cloudera, Federal Engagements: mmead@cloudera.com