Published by Felicity Dean. Modified over 8 years ago.
1 © Cloudera, Inc. All rights reserved.
Engines, Algorithms, and Data Models
From Dimensional Modeling to Machine Learning
Josh Wills | Senior Director of Data Science
2 My First Data Warehouse
3 My Current Data Warehouse
4 The Rise of the Data Scientist
5 Data Scientist Supply vs. Data Scientist Demand
6 Moneyball and Data Science
7 Choosing The Right Metrics
8 1. Analyzing “Unstructured” Data Sources
9 2. Building Machine Learning Models
10 3. Turn Static Reports Into Analytical Applications
11 Answering More Questions in Less Time
12 How To Answer Questions Like A Data Scientist
13 The Life of a Data Processing Job
1. Read and deserialize input data.
2. Project/filter input records.
3. Shuffle: serialize the records, send them over the network, and deserialize them.
4. Apply aggregation logic.
5. Serialize output data.
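The five stages listed on this slide can be sketched in a few lines of Python. This is a minimal single-process illustration, not how any particular engine implements them: the JSON record layout, the field names `user` and `bytes`, the function name `run_job`, and the byte-sum partitioner are all assumptions made for the example.

```python
import json
from collections import defaultdict


def run_job(raw_lines, num_partitions=2):
    """Toy single-process walk through the five stages of a data processing job."""
    # 1. Read and deserialize input data (hypothetical JSON-lines input).
    records = [json.loads(line) for line in raw_lines]

    # 2. Project/filter input records: keep two fields, drop empty transfers.
    projected = [(r["user"], r["bytes"]) for r in records if r["bytes"] > 0]

    # 3. Shuffle: serialize each record, route it to a partition by key
    #    (standing in for the network hop), and deserialize it on arrival.
    partitions = defaultdict(list)
    for key, value in projected:
        payload = json.dumps([key, value])
        target = sum(key.encode("utf-8")) % num_partitions  # stable toy partitioner
        partitions[target].append(payload)

    # 4. Apply aggregation logic: sum bytes per user within each partition.
    totals = {}
    for payloads in partitions.values():
        for payload in payloads:
            key, value = json.loads(payload)
            totals[key] = totals.get(key, 0) + value

    # 5. Serialize output data.
    return [json.dumps({"user": k, "total_bytes": v}) for k, v in sorted(totals.items())]
```

Note that serialization or deserialization happens in stages 1, 3, and 5, which is why its cost shows up throughout the job rather than only at the edges.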
14 Handling the Cost of Serialization
15 The Traditional RDBMS Approach
16 The Cost of The Traditional RDBMS Approach
17 Query Scheduling and Exploratory Data Analysis
18 The Spark Approach
19 The Cost of the Spark Approach
20 The MapReduce Approach
21 MapReduce In The Hands of a Data Scientist
22 Example: Hive Multi-Insert
23 Our Goal: Public Transit for Questions
24 Data Modeling for Data Science
25 Motivating Example: Spelling Correction
26 Event Series Analytics
27 A Simple Star Schema for Spell Correction
28 The Combinatorial Explosion
29 Designing the Spell Correction Data Product
What parameters does this model need…
  during the analysis phase?
  during deployment?
Some Candidates
  Lag time between events
  Similarity of queries
  What else?
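To make the two candidate parameters on this slide concrete, here is a minimal sketch that scans one user's search events and flags consecutive query pairs that look like a misspelling followed by a correction. The function name `candidate_corrections`, the thresholds (30 seconds, 0.7), and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions for illustration; the slides do not say which measure or cutoffs were used.

```python
from difflib import SequenceMatcher


def candidate_corrections(events, max_lag_seconds=30.0, min_similarity=0.7):
    """Return (earlier_query, later_query, lag, similarity) for likely corrections.

    events: list of (timestamp_seconds, query) tuples for one user,
    sorted by timestamp. Both thresholds are hypothetical tuning knobs.
    """
    pairs = []
    for (t0, q0), (t1, q1) in zip(events, events[1:]):
        lag = t1 - t0  # lag time between consecutive events
        similarity = SequenceMatcher(None, q0, q1).ratio()  # similarity of queries
        if q0 != q1 and lag <= max_lag_seconds and similarity >= min_similarity:
            pairs.append((q0, q1, lag, similarity))
    return pairs
```

During analysis both thresholds would typically be swept over a range of values, whereas a deployed model needs them fixed, which is one way to read the slide's split between the analysis phase and deployment.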
30 A Supernova Schema for Search
31 Spell Correction in SQL
32 Exhibit: http://github.com/jwills/exhibit
33 Querying Nested Types with Impala
34 Trading Up: From Data Analyst to Data Scientist
Core Metric: # Outputs / # Jobs
  Measure on both an individual and aggregate level
  Drive the marginal cost of asking one additional question towards zero
  Point business analysts at output tables for interactive analysis with Impala
  Self-serve BI frees up resources (compute + data science time)
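The core metric on this slide is simple arithmetic, and a short sketch shows how it might be tracked at both levels. The job-log layout (one `(analyst, number_of_output_tables)` entry per job) and the function name `outputs_per_job` are assumptions for the example, not something the slides specify.

```python
from collections import defaultdict


def outputs_per_job(job_log):
    """Compute # outputs / # jobs per analyst and across all jobs.

    job_log: iterable of (analyst, num_outputs) entries, one per job run
    (a hypothetical log format).
    """
    jobs = defaultdict(int)
    outputs = defaultdict(int)
    for analyst, num_outputs in job_log:
        jobs[analyst] += 1
        outputs[analyst] += num_outputs

    # Individual level: outputs per job for each analyst.
    per_analyst = {a: outputs[a] / jobs[a] for a in jobs}

    # Aggregate level: total outputs over total jobs.
    total_jobs = sum(jobs.values())
    overall = sum(outputs.values()) / total_jobs if total_jobs else 0.0
    return per_analyst, overall
```

A job that writes several output tables in one pass raises the ratio, which corresponds to a lower marginal cost for each additional question.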
35 Thanks! @josh_wills