Engines, Algorithms, and Data Models: From Dimensional Modeling to Machine Learning. Josh Wills | Senior Director of Data Science. © Cloudera, Inc. All rights reserved.

Presentation transcript:

1 Engines, Algorithms, and Data Models: From Dimensional Modeling to Machine Learning. Josh Wills | Senior Director of Data Science

2 My First Data Warehouse

3 My Current Data Warehouse

4 The Rise of the Data Scientist

5 Data Scientist Supply vs. Data Scientist Demand

6 Moneyball and Data Science

7 Choosing The Right Metrics

8 1. Analyzing “Unstructured” Data Sources

9 2. Building Machine Learning Models

10 3. Turn Static Reports Into Analytical Applications

11 Answering More Questions in Less Time

12 How To Answer Questions Like A Data Scientist

13 The Life of a Data Processing Job
1. Read and deserialize input data.
2. Project/filter input records.
3. Shuffle: serialize it, send over the network, deserialize it.
4. Apply aggregation logic.
5. Serialize output data.
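
The five steps on this slide can be walked through in a few lines of Python. This is only a toy, single-process sketch, not how any particular engine implements them: the JSON record layout (`key`, `value`), the `run_job` name, and the CRC-based partitioning are all assumptions made for illustration, and the "network" is just an in-memory dictionary.

```python
import json
import zlib
from collections import defaultdict


def run_job(raw_lines, num_partitions=2):
    """Toy walk-through of the five job steps (hypothetical record layout)."""
    # 1. Read and deserialize input data.
    records = [json.loads(line) for line in raw_lines]

    # 2. Project/filter input records: keep two fields, drop non-positive values.
    projected = [(r["key"], r["value"]) for r in records if r["value"] > 0]

    # 3. Shuffle: serialize each record, route it to a partition by key
    #    (standing in for the network transfer), then deserialize it again.
    partitions = defaultdict(list)
    for key, value in projected:
        payload = json.dumps([key, value])
        partitions[zlib.crc32(key.encode("utf-8")) % num_partitions].append(payload)

    # 4. Apply aggregation logic: sum the values for each key.
    totals = defaultdict(int)
    for payloads in partitions.values():
        for payload in payloads:
            key, value = json.loads(payload)
            totals[key] += value

    # 5. Serialize output data.
    return [json.dumps({"key": k, "total": totals[k]}) for k in sorted(totals)]
```

Even in this toy version, every record is serialized or deserialized at steps 1, 3, and 5, which is the cost the following slides are concerned with.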

14 Handling the Cost of Serialization

15 The Traditional RDBMS Approach

16 The Cost of The Traditional RDBMS Approach

17 Query Scheduling and Exploratory Data Analysis

18 The Spark Approach

19 The Cost of the Spark Approach

20 The MapReduce Approach

21 MapReduce In The Hands of a Data Scientist

22 Example: Hive Multi-Insert

23 Our Goal: Public Transit for Questions

24 Data Modeling for Data Science

25 Motivating Example: Spelling Correction

26 Event Series Analytics

27 A Simple Star Schema for Spell Correction

28 The Combinatorial Explosion

29 Designing the Spell Correction Data Product
What parameters does this model need…
- during the analysis phase?
- during deployment?
Some Candidates
- Lag time between events
- Similarity of queries
- What else?
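
To make the two candidate parameters concrete, here is a minimal sketch that scans one user's search events and flags consecutive query pairs that look like a misspelling followed by its correction. It is an illustration under assumptions, not the model from the talk: the `(timestamp, query)` event shape, the function name, and both default thresholds are hypothetical, and similarity is measured with the standard library's `difflib.SequenceMatcher` ratio as a stand-in for whatever measure the real product would use.

```python
from difflib import SequenceMatcher


def correction_candidates(events, max_lag_seconds=30.0, min_similarity=0.8):
    """Return (first_query, second_query, lag, similarity) for likely corrections.

    events: iterable of (timestamp_in_seconds, query) for a single user.
    Both thresholds are illustrative defaults, not tuned values.
    """
    ordered = sorted(events)
    candidates = []
    for (t1, q1), (t2, q2) in zip(ordered, ordered[1:]):
        lag = t2 - t1                                        # lag time between events
        similarity = SequenceMatcher(None, q1, q2).ratio()   # similarity of queries
        if q1 != q2 and lag <= max_lag_seconds and similarity >= min_similarity:
            candidates.append((q1, q2, lag, similarity))
    return candidates


session = [(0.0, "clouderra"), (5.0, "cloudera"), (200.0, "hadoop")]
pairs = correction_candidates(session)
```

During analysis both thresholds would be swept over a range to see how the candidate set changes; at deployment they would be fixed, which is one reason the two phases may need different parameters.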

30 A Supernova Schema for Search

31 Spell Correction in SQL

32 Exhibit:

33 Querying Nested Types with Impala

34 Trading Up: From Data Analyst to Data Scientist
Core Metric: # Outputs / # Jobs
- Measure on both an individual and aggregate level
- Drive the marginal cost of asking one additional question towards zero
- Point business analysts at output tables for interactive analysis with Impala
- Self-serve BI frees up resources (compute + data science time)
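
As a rough illustration of the core metric, the sketch below computes outputs per job for each person and for the team as a whole. The slide does not say how outputs or jobs are counted, so the job-log shape (one `(user, number_of_outputs)` entry per job run) and the function name are assumptions.

```python
from collections import defaultdict


def outputs_per_job(job_log):
    """Compute # outputs / # jobs per user and in aggregate.

    job_log: iterable of (user, num_outputs), one entry per job run
    (a hypothetical log format chosen for this example).
    """
    per_user = defaultdict(lambda: [0, 0])  # user -> [total outputs, total jobs]
    for user, num_outputs in job_log:
        per_user[user][0] += num_outputs
        per_user[user][1] += 1

    individual = {user: outs / jobs for user, (outs, jobs) in per_user.items()}
    total_outputs = sum(outs for outs, _ in per_user.values())
    total_jobs = sum(jobs for _, jobs in per_user.values())
    aggregate = total_outputs / total_jobs if total_jobs else 0.0
    return individual, aggregate


individual, aggregate = outputs_per_job([("ana", 3), ("ana", 1), ("raj", 1)])
```

A job that writes several output tables in one pass raises the ratio, so each extra question costs less than running a separate job for it.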
