Slide 1: Starfish: A Self-tuning System for Big Data Analytics
Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen Cetin, Shivnath Babu
Department of Computer Science, Duke University
Presented by Nirupam Roy
Slide 2: The Growth of Data
Slide 3: MAD: Features of an Ideal Analytics System
- Magnetism: accept all data
- Agility: adapt with data, real-time processing
- Depth: allow complex analysis
Slide 4: Hadoop is MAD
- Magnetism (accept all data): blindly loads data into HDFS
- Agility (adapt with data): fine-grained scheduler, dynamic node addition/dropping
- Depth (allow complex analysis): end-to-end data pipelines, well integrated with programming languages
Slide 5: Tuning for Good Performance: Challenges
- Multiple dimensions of performance: time, cost, scalability, ...
- Many parameters: more than 190 configuration parameters in Hadoop
- Multiple levels of abstraction: job level, workflow level, workload level, ...
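To make the scale of the problem concrete, the sketch below enumerates a tiny slice of the configuration space. The parameter names are real Hadoop (0.20-era) job settings, but the candidate values are illustrative assumptions; even four knobs already multiply into dozens of full configurations, and the real space has 190+ parameters.

```python
from itertools import product

# A few job-level Hadoop parameters with illustrative candidate values.
param_grid = {
    "io.sort.mb": [50, 100, 200],                # map-side sort buffer size (MB)
    "io.sort.spill.percent": [0.7, 0.8, 0.9],    # buffer fill ratio that triggers a spill
    "mapred.reduce.tasks": [10, 20, 40, 80],     # number of reduce tasks
    "mapred.compress.map.output": [True, False], # compress intermediate map output?
}

# Enumerate every full configuration: the Cartesian product of the value lists.
names = list(param_grid)
configs = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(configs))  # 3 * 3 * 4 * 2 = 72 configurations from just 4 knobs
```

Exhaustive enumeration is already hopeless at this scale, which is why the talk turns to cost-based prediction instead of trial runs.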
6
Thumb rule Tuning for Good Performance: Challenges
7
Thumb rule Tuning for Good Performance: Challenges
Slide 8: Starfish: A Self-tuning System
- Builds on Hadoop
- Tunes itself to 'good' performance automatically
Slide 9: Starfish Architecture
Slide 10: The "What-if" Engine
- Input: profile of a job (P) + new parameter set (S)
- Model- and simulation-based prediction:
-- learns from previous job profiles
-- uses analytical models to estimate dataflow
-- simulates the execution of the MapReduce workload
- Output: predicted performance
[Ref:] A What-if Engine for Cost-based MapReduce Optimization. H. Herodotou et al.
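A toy sketch of the idea: take measurements from one profiled run and scale them under a hypothetical configuration to predict a runtime. The linear cost model and all numbers below are invented stand-ins; the real engine uses detailed per-phase analytical models plus task-level simulation.

```python
# Toy "what-if" estimate: profile (P) + hypothetical parameters (S) -> predicted time.

def what_if_runtime(profile, params):
    """profile: measurements from one real run; params: hypothetical config."""
    map_work = profile["input_bytes"] * profile["map_cost_per_byte"]
    shuffle_bytes = profile["input_bytes"] * profile["map_selectivity"]
    if params["compress_map_output"]:
        shuffle_bytes *= profile["compression_ratio"]
    reduce_work = shuffle_bytes * profile["reduce_cost_per_byte"]
    # Work is spread across parallel task slots.
    return map_work / params["map_slots"] + reduce_work / params["reduce_slots"]

profile = {
    "input_bytes": 10 * 1024**3,   # 10 GB input observed in the profiled run
    "map_cost_per_byte": 2e-8,     # seconds of map work per input byte
    "map_selectivity": 0.5,        # intermediate bytes produced per input byte
    "compression_ratio": 0.4,      # compressed size / uncompressed size
    "reduce_cost_per_byte": 4e-8,  # seconds of reduce work per shuffled byte
}

fast = what_if_runtime(profile, {"compress_map_output": True,  "map_slots": 16, "reduce_slots": 8})
slow = what_if_runtime(profile, {"compress_map_output": False, "map_slots": 16, "reduce_slots": 8})
print(fast < slow)  # compression shrinks the shuffle, so predicted time drops
```

The point of the engine is exactly this kind of answer: "what would happen if I flipped this knob?" without actually re-running the job.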
Slide 11: The "What-if" Engine: ground truth vs. performance estimated by the What-if engine (comparison plot)
Slide 12: Starfish Architecture: Job Level
Slide 13: Job-Level Components
- Just-in-Time Optimizer: searches the parameter space
- Profiler: collects information on MapReduce job execution through dynamic instrumentation; reports timings, data sizes, and resource utilization
- Sampler: generates profile statistics from training benchmark jobs
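The interplay of these components can be sketched as follows: the Profiler/Sampler supply a profile, the What-if Engine prices each candidate configuration, and the optimizer searches for the cheapest one. The quadratic cost surface and the tiny exhaustive grid are illustrative assumptions; Starfish's real optimizer uses smarter enumeration over a far larger space.

```python
from itertools import product

def predicted_cost(config):
    # Stand-in for a What-if Engine call (profile + candidate settings ->
    # predicted running time). The quadratic surface is made up, with a
    # best point at io_sort_mb=100, reduce_tasks=40.
    return (config["io_sort_mb"] - 100) ** 2 + 5 * (config["reduce_tasks"] - 40) ** 2

def jit_optimize(grid):
    """Enumerate a (small) grid and keep the cheapest predicted configuration."""
    names = list(grid)
    best_cost, best_config = float("inf"), None
    for values in product(*grid.values()):
        config = dict(zip(names, values))
        cost = predicted_cost(config)
        if cost < best_cost:
            best_cost, best_config = cost, config
    return best_cost, best_config

grid = {"io_sort_mb": [50, 100, 150, 200], "reduce_tasks": [10, 20, 40, 80]}
cost, config = jit_optimize(grid)
print(config)  # {'io_sort_mb': 100, 'reduce_tasks': 40}
```

Because every candidate is priced by prediction rather than execution, the search costs simulation time, not cluster time.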
Slide 14: Starfish Architecture: Workflow Level
Slide 15: Workflow-Level Data Layout
- Scheduler for balanced distribution of data
- Block placement policy for data collocation: local-write vs. round-robin
- Deals with skewed data, addition/dropping of nodes, and the tradeoff between balanced data and data locality
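The tradeoff can be illustrated with a toy placement simulation (node and block counts are arbitrary): local-write stores every block on the node that produced it, which preserves locality but skews storage when producers are unbalanced; round-robin deals blocks out evenly at the cost of remote writes.

```python
def local_write(blocks_produced):
    """Each node stores exactly the blocks it produced (maximal locality)."""
    return list(blocks_produced)

def round_robin(blocks_produced):
    """Deal all produced blocks out evenly across the nodes (balanced storage)."""
    n = len(blocks_produced)
    total = sum(blocks_produced)
    return [total // n + (1 if i < total % n else 0) for i in range(n)]

# 4 nodes with skewed production: node 0 produces most of the data.
produced = [90, 10, 10, 10]

print(local_write(produced))  # [90, 10, 10, 10] -> node 0 becomes a hotspot
print(round_robin(produced))  # [30, 30, 30, 30] -> balanced, but 3/4 of node 0's
                              #    blocks were written over the network
```

Neither policy wins in general, which is why the slide frames it as a tradeoff the optimizer must weigh per workload.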
Slide 16: Starfish Architecture: Workflow Level (producer-consumer pipelines; wasted production)
Slide 17: Starfish Architecture: Workflow Level (file-level vs. block-level parallelism)
Slide 18: Workflow-Aware Optimizer
- Uses what-if simulation to select the best data layout and job parameters
- Simulates MR job execution, task scheduling, and block placement
- Compares costs and benefits: running time? data layout?
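The decision loop on this slide can be sketched as a cost-benefit comparison (all numbers are invented placeholders for What-if Engine answers): for each candidate plan pairing a data layout with job parameters, add the predicted workflow running time to any one-time cost of changing the layout, and keep the cheapest plan.

```python
# Candidate plans: data layout + job parameters. "predicted_time" stands in
# for a What-if Engine answer; "layout_change_cost" is the one-time price
# of re-laying-out the data. All values are illustrative.
plans = [
    {"layout": "local-write", "params": "defaults", "predicted_time": 520, "layout_change_cost": 0},
    {"layout": "round-robin", "params": "defaults", "predicted_time": 430, "layout_change_cost": 60},
    {"layout": "round-robin", "params": "tuned",    "predicted_time": 310, "layout_change_cost": 60},
]

def total_cost(plan):
    # The benefit (lower running time) must outweigh the cost of switching layouts.
    return plan["predicted_time"] + plan["layout_change_cost"]

best = min(plans, key=total_cost)
print(best["layout"], best["params"])  # round-robin tuned: 310 + 60 = 370 beats 520
```

The key design choice mirrored here is that layout and parameters are chosen jointly: the tuned-parameters plan only pays off because the layout change is priced into the same comparison.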
Slide 19: Starfish Architecture: Workload Level
Slide 20: Workload-Level Components
- Workload Optimizer: cost-based estimation for the best optimization
- Jumbo operator
- Elastisizer: determines the best cluster and Hadoop configurations
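The Elastisizer's job can be sketched as a search over cluster configurations. Everything below (node counts, prices, the near-linear scaling model, the deadline) is an invented assumption standing in for What-if-style predictions: predict each candidate cluster's running time, convert it to money, and pick the cheapest configuration that meets the deadline.

```python
# Candidate clusters: (number of nodes, price per node-hour). Illustrative only.
candidates = [(4, 0.40), (8, 0.40), (16, 0.40), (8, 0.80)]

BASE_HOURS_ON_ONE_NODE = 16  # hypothetical profiled workload size

def predict_hours(nodes):
    # Assume near-linear scaling plus a small fixed per-job overhead.
    return BASE_HOURS_ON_ONE_NODE / nodes + 0.5

def elastisize(candidates, deadline_hours):
    """Cheapest cluster whose predicted runtime meets the deadline."""
    feasible = []
    for nodes, price in candidates:
        hours = predict_hours(nodes)
        if hours <= deadline_hours:
            feasible.append((hours * nodes * price, nodes, price))
    return min(feasible) if feasible else None

cost, nodes, price = elastisize(candidates, deadline_hours=3.0)
print(nodes)  # 8 nodes: 2.5 h * 8 * $0.40 = $8.00, cheaper than 16 nodes ($9.60)
```

Note how the answer is not "the biggest cluster": doubling to 16 nodes finishes sooner but costs more, so the deadline, not raw speed, drives the choice.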
Slide 21: Starfish: Summary
- Optimizes at different granularities: workload, workflow, job (procedural and declarative)
- Considers different decision points: provisioning, optimization, scheduling, data layout
Slide 22: Starfish: Piazza Discussion
Top criticisms (as of 1:30 pm, 17 reviews):
1) Limited evaluation: 10
2) Not explained well: 7
3) Profiler overhead / need for a better search algorithm: 5
Other questions:
- What is the effect of a wrong prediction?
- The What-if engine requires prior knowledge.
Slide 23: Thank you.
http://www.cs.duke.edu/starfish/
Photo courtesy: Starfish group, Duke University
Slide 24: Going MAD with Big Data
- Magnetic system
- Agile system and analytics
- Deep analytics
- Data life cycle awareness
- Elasticity
- Robustness
Slide 25: Backup: What-if Engine 1
Slide 26: Backup: What-if Engine 2