Cloudera & Hadoop Use Cases
Rob Lancaster | Omer Trajman
"Big Data"... Applications From Enterprises to Individuals

The ‘Big Data’ Phenomenon
©2011 Cloudera, Inc. All Rights Reserved.

Big Data Drivers:
  The proliferation of data capture and creation technologies
  Increased “interconnectedness” drives consumption (creating more data)
  Inexpensive storage makes it possible to keep more, longer
  Innovative software and analysis tools turn data into information

Big Data encompasses not only the content itself, but how it’s consumed.
  More Devices
  More Consumption
  More Content
  New & Better Information

Every gigabyte of stored content can generate a petabyte or more of transient data.*
The information about you is much greater than the information you create.

*Source: IDC 2011

Big Data Challenges
It’s not just about “big”
  Cost-effectively managing the volume, velocity and variety of data
  Deriving value across structured and unstructured data
  Adapting to context changes and integrating new data sources and types

Common Challenges
  1. Network Analysis and Sessionization
  2. Content Optimization and Engagement Modeling
  3. Usage Analysis and Mediation
  4. Entity Surveillance and Signal Monitoring
  5. Recommendations and Modeling
  6. Loyalty, Promotion Analysis and Targeting
  7. Fraud Analysis, Reconciliation and Risk
  8. Time Series Analysis, Mapping and Modeling

What is Apache Hadoop?
Apache Hadoop is a platform for data storage and processing that is:
  Scalable
  Fault tolerant
  Open source

Core Hadoop Components
  Hadoop Distributed File System (HDFS)
  MapReduce

Consolidates Mixed Data
  Complex and relational data into a single repository
Stores Inexpensively
  Keep raw data always available
Processes at the Source
  Eliminate ETL bottlenecks
  Mine data first, govern later

©2011 Cloudera, Inc. All Rights Reserved. Confidential. Reproduction or redistribution without written permission is prohibited.
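
To make the MapReduce component above more concrete, here is a minimal single-process sketch of the programming model in plain Python. It does not use Hadoop itself: the function names (`map_words`, `reduce_counts`, `run_mapreduce`) and the sample input are illustrative assumptions, and the in-memory dictionary stands in for the shuffle that a real cluster performs across machines.

```python
from collections import defaultdict


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(key, values):
    """Reduce phase: combine all values emitted for one key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Toy driver: map every record, group values by key, then reduce.

    The grouping dictionary plays the role of the shuffle/sort step;
    on a real cluster that step moves data between nodes.
    """
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Hypothetical sample input, one record per line.
lines = ["big data is big", "data is everywhere"]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
print(word_counts)
```

The point of the model is that the mapper and reducer never share state, which is what lets a framework run them in parallel next to the data.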

Cloudera in Production
[Diagram: Cloudera in a production environment. Labels recovered from the figure:]
  Data sources: Logs, Files, Web Data, Relational Databases
  Connected systems: IDEs, BI / Analytics, Enterprise Reporting, Enterprise Data Warehouse, Operational Rules Engines, Management Tools, Web Application
  Users: Operators, Engineers, Analysts, Business Users, Customers
  Cloudera’s Distribution Including Apache Hadoop (CDH) & SCM Express
  Cloudera Enterprise: Cloudera Management Suite, Cloudera Support
  Consulting Services
  Cloudera University

What Can Hadoop Do For You?
Two Core Use Cases Applied Across Industries

INDUSTRY         1. ADVANCED ANALYTICS            2. DATA PROCESSING
Web              Social Network Analysis          Clickstream Sessionization
Media            Content Optimization             Engagement
Telco            Network Analytics                Mediation
Retail           Loyalty & Promotions Analysis    Data Factory
Financial        Fraud Analysis                   Trade Reconciliation
Federal          Entity Analysis                  SIGINT
Bioinformatics   Sequencing Analysis              Genome Mapping
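
As an illustration of one data processing use case named above, clickstream sessionization, the following sketch groups click events per user and starts a new session whenever the gap between consecutive events exceeds a timeout. The 30-minute timeout, the `(user_id, timestamp)` event shape, and the function name `sessionize` are assumptions made for this example, not details from the slides.

```python
from itertools import groupby
from operator import itemgetter

# Assumed inactivity timeout; real deployments choose their own value.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Split (user_id, timestamp_seconds) events into per-user sessions.

    Returns {user_id: [[ts, ts, ...], ...]} where each inner list is
    one session of timestamps in ascending order.
    """
    sessions = {}
    # Sort by user, then by time, so each user's events are contiguous.
    ordered = sorted(events, key=itemgetter(0, 1))
    for user, user_events in groupby(ordered, key=itemgetter(0)):
        user_sessions = []
        for _, ts in user_events:
            if user_sessions and ts - user_sessions[-1][-1] <= gap:
                user_sessions[-1].append(ts)  # continue the current session
            else:
                user_sessions.append([ts])    # gap too large: new session
        sessions[user] = user_sessions
    return sessions


# Hypothetical click events: (user_id, seconds since some epoch).
clicks = [("u1", 0), ("u2", 50), ("u1", 600), ("u1", 4000), ("u2", 100)]
result = sessionize(clicks)
print(result)
```

In a MapReduce setting the sort-and-group step maps naturally onto the shuffle: the user ID becomes the key, and the per-user loop becomes the reducer.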

Genomics
Cost of DNA Sequencing Falling Very Fast
  Raw data needs to be aligned and matched
  Scientists want to collect and analyze these sequences
Hadoop Can Read Native Format
  hadoop-bam: Java library for manipulation of Binary Alignment/Map (BAM) files
  Alignment, SNP discovery, genotyping
Genomic Tools Based On Hadoop
  SEAL – distributed short read alignment
  BlastReduce – parallel read mapping
  Crossbow – whole genome re-sequencing analysis
  CloudBurst – sensitive MapReduce alignment

Biodiversity Indexing
Consolidation and serving of biological data
  Provide free and open access to biodiversity data
  Collection, search, discovery and access to a variety of data
Data matching and cleansing
  Geography, water/land mapping
  Dictionaries and taxonomic services
Data is harvested into multiple RDBMS
  Sqoop to Hadoop for processing workflows and index generation
  Sqoop back to MySQL for Web app serving
  Future development is to crawl into and serve from HBase

Processing Seismic Data
Optimize the IO-intensive phases of seismic processing
  Incorporate additional parallelism where it makes sense
  Simplify gather/transpose operations with MapReduce
Seismic Unix for Core Algorithms
  Well-known, used at many grad programs in geophysics
  SU file format can be easily transformed for processing on HDFS
Hadoop Streaming
  Seismic Unix, SEPlib, JavaSeis: non-Java code in the MapReduce framework
  Framework is aware of parameter files needed by SU commands
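
The gather operation mentioned above can be sketched in the style of a Hadoop Streaming job, where the mapper and reducer exchange tab-separated `key<TAB>value` text lines. This is only an illustration under stated assumptions: the comma-separated `shot,cdp,samples` input format is hypothetical (real SU traces are binary and would need converting first), and `simulate` replaces the framework's sort between the two phases so the example runs locally.

```python
from itertools import groupby


def mapper(lines):
    """Re-key each trace by its gather key.

    Assumed (hypothetical) input format per line: shot,cdp,samples
    Output: 'cdp<TAB>shot:samples', the key/value convention used by
    streaming-style jobs.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        shot, cdp, samples = line.split(",", 2)
        yield cdp + "\t" + shot + ":" + samples


def reducer(sorted_lines):
    """Collect all traces that share a key into a single gather line.

    Relies on the input arriving sorted by key, which the framework
    guarantees between the map and reduce phases.
    """
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for cdp, group in groupby(parsed, key=lambda kv: kv[0]):
        traces = [value for _, value in group]
        yield cdp + "\t" + "|".join(traces)


def simulate(lines):
    """Local stand-in for the job: map, sort by line, then reduce."""
    return list(reducer(sorted(mapper(lines))))


# Hypothetical traces from two shots, two of which share CDP 100.
sample = ["s1,100,0.1 0.2", "s1,101,0.3 0.4", "s2,100,0.5 0.6"]
gathers = simulate(sample)
for line in gathers:
    print(line)
```

Under Hadoop Streaming the two functions would typically live in separate scripts that read `sys.stdin` and write to standard output. The order of values within one key is only deterministic here because the local simulation sorts whole lines.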

Targeted Offers
The checkout lane is everywhere
  Cookies track users through ad impressions
  Purchasing behavior is time sensitive
Logs collected from on-site and off-site browsing
  Data is ingested incrementally
  Processing happens at a variety of time scales
Data logged to HBase as primary store
  Some events naturally associate, others require deeper analysis
  Random access useful for debugging algorithms

Recommendations and Forecasting
Collect and serve personalization information
  Wide variety of constantly changing data sources
  Data guaranteed to be messy
Data ingestion includes collection of raw data
  Filtering and fixing of poorly formatted data
  Normalization and matching across data sources
Analysis looks for reliable attributes and groupings
  Interpretation (e.g. gender by name)
  Aggregation across likely matching identifiers
  Identify possible predicted attributes or preferences
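
The ingestion and aggregation steps listed above (filter and fix, normalize, then aggregate across matching identifiers) might look roughly like the sketch below. Everything specific in it is an assumption for illustration: the record fields (`email`, `name`, `interests`), the choice of a lowercased email address as the matching identifier, and the rule that a record without an `@` in its email is unusable.

```python
from collections import defaultdict


def clean(record):
    """Filter and fix one raw record; return None if it is unusable."""
    email = (record.get("email") or "").strip().lower()
    if "@" not in email:
        return None  # no usable identifier: drop the record
    # Collapse repeated whitespace and normalize capitalization.
    name = " ".join((record.get("name") or "").split()).title()
    interests = set(record.get("interests") or [])
    return {"email": email, "name": name, "interests": interests}


def aggregate(records):
    """Merge cleaned records that share the same normalized identifier."""
    profiles = defaultdict(lambda: {"names": set(), "interests": set()})
    for raw in records:
        rec = clean(raw)
        if rec is None:
            continue
        profile = profiles[rec["email"]]
        if rec["name"]:
            profile["names"].add(rec["name"])
        profile["interests"] |= rec["interests"]
    return dict(profiles)


# Hypothetical messy input from two sources plus one bad record.
raw_records = [
    {"email": " Ann@Example.com ", "name": "ann  lee", "interests": ["hiking"]},
    {"email": "ann@example.com", "name": "Ann Lee", "interests": ["jazz"]},
    {"email": "not-an-email", "name": "Bob"},
]
profiles = aggregate(raw_records)
print(profiles)
```

Keeping the cleaning step separate from the aggregation step makes it easier to adjust the fixing rules as data sources change, which the slide notes they constantly do.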

Who is Cloudera?
The #1 commercial and non-commercial Apache Hadoop distribution.
Helps organizations profit from all their data
  Largest contributor to Hadoop ecosystem
  Provides the most widely used open source distribution
  Develops the most sophisticated Hadoop operations software
  Supports mission critical Hadoop clusters
  Trained the largest number of Hadoop Developers and Administrators

Complete, Integrated Hadoop Stack
  Coordination: Apache ZooKeeper
  Data Integration: Apache Flume, Apache Sqoop
  Fast Read/Write Access: Apache HBase
  Languages / Compilers: Apache Pig, Apache Hive
  Workflow / Scheduling: Apache Oozie
  Metadata: Apache Hive
  File System Mount: FUSE-DFS
  UI Framework: Hue
  SDK: Hue SDK

Cloudera helps you profit from all your data.
  cloudera.com
  +1 (888)
  twitter.com/cloudera
  facebook.com/cloudera
  Get Hadoop