1
Starfish: A Self-tuning System for Big Data Analytics
Presented by Carl Erhard & Zahid Mian
Authors: Herodotos Herodotou, Harold Lim, Fei Dong, Shivnath Babu (Duke University)
2
Analysis in the Big Data Era
Massive Data → Data Analysis → Insight
Key to success = timely and cost-effective analysis
3
We Want a MAD System
- Magnetism: attracts or welcomes all sources of data, regardless of structure, values, etc.
- Agility: adaptive; remains in sync with rapid data evolution and modification
- Depth: more than just typical analytics; supports complex operations like statistical analysis and machine learning
4
No Wait… I Mean MADDER
- Data-lifecycle Awareness: do more than just queries; optimize the movement, storage, and processing of big data
- Elasticity: dynamically adjust resource usage and operational costs based on workload and user requirements
- Robustness: provide storage and querying services even in the event of some failures
5
Practitioners of Big Data Analytics
- Who are the users? Data analysts, statisticians, computational scientists; researchers, developers, testers; business analysts; you!
- Who performs setup and tuning? The users themselves, who usually lack the expertise to tune the system.
6
Motivation
7
Tuning Challenges
- Heavy use of general-purpose programming languages for MapReduce programs (e.g., Java/Python)
- Data loaded and accessed as opaque files
- Large space of tuning choices (over 190 configuration parameters!)
- Elasticity is wonderful, but hard to achieve (Hadoop has many useful mechanisms, but policies are lacking)
- Terabyte-scale data cycles
8
Starfish: Self-tuning System
Our goal: provide good performance automatically.
[Stack diagram: the Starfish analytics system sits on top of Hadoop (MapReduce execution engine and distributed file system), alongside clients and tools such as Java/C++/R/Python programs, Oozie, Hive, Pig, Jaql, HBase, and Elastic MapReduce.]
9
What Are the Tuning Problems?
- Job-level MapReduce configuration
- Workload management
- Data layout tuning
- Cluster sizing
- Workflow optimization (illustrated by a workflow of jobs J1–J4)
10
Starfish's Core Approach to Tuning
What-if questions drive tuning:
1) If Δ(configuration parameters), then what happens to execution?
2) If Δ(data properties), then what?
3) If Δ(cluster properties), then what?
Components:
- Profiler: collects concise summaries of execution
- What-if Engine: estimates the impact of hypothetical changes on execution
- Optimizers: search through the space of tuning choices (job, workflow, workload, data layout, cluster)
11
Starfish Architecture
12
Job-Level Tuning
- Just-in-Time Optimizer: automatically selects efficient execution techniques for MapReduce jobs
- Profiler: collects detailed summaries of jobs on a task-by-task basis
- Sampler: collects statistics about the input, intermediate, and output data of a MapReduce job
13
MapReduce Job Execution
[Diagram: input splits 0–3 feed map tasks, whose outputs are shuffled to reduce tasks that write outputs 0–1.]
A job is characterized as j = ⟨program p, input data d, cluster resources r, configuration settings c⟩.
14
What Controls MR Job Execution?
The configuration settings c in job j = ⟨p, d, r, c⟩ span a large space of choices:
- Number of map tasks
- Number of reduce tasks
- Partitioning of map outputs to reduce tasks
- Memory allocation to task-level buffers
- Whether output data from tasks should be compressed
- Whether the combine function should be used
A rough illustration of how these choices surface as Hadoop settings follows.
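As a hedged illustration (not taken from the slides), the sketch below shows how some of these choices map onto standard Hadoop configuration properties. Property names follow Hadoop 2.x conventions (the 1.x names differ) and the values are arbitrary examples, not recommendations.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class JobConfigSketch {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();

        // Map-side sort buffer size and spill threshold
        // (Hadoop 1.x used io.sort.mb / io.sort.spill.percent).
        conf.setInt("mapreduce.task.io.sort.mb", 200);
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);

        // Compress intermediate map output: trades CPU for I/O and network.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);

        // The number of map tasks is driven indirectly by the input split size.
        conf.setLong("mapreduce.input.fileinputformat.split.minsize",
                     128L * 1024 * 1024);

        Job job = Job.getInstance(conf, "word-cooccurrence");
        // Number of reduce tasks; partitioner and combiner are program-specific.
        job.setNumReduceTasks(16);
        // job.setPartitionerClass(...); job.setCombinerClass(...);
        return job;
    }
}
```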
15
Effect of Configuration Settings
In practice, settings are left at their defaults or set manually using rules of thumb, but rules of thumb may not suffice.
[Figure: two-dimensional projection of a multi-dimensional response surface for the Word Co-occurrence MapReduce program, with the rules-of-thumb settings marked.]
16
MapReduce Job Tuning in a Nutshell
Goal: for job j = ⟨p, d, r, c⟩, find the configuration c_opt = argmin_{c ∈ S} F(p, d, r, c).
Challenges: p is an arbitrary MapReduce program; c is high-dimensional; …
- Profiler: runs p to collect a job profile (a concise execution summary) of ⟨p, d1, r1, c1⟩
- What-if Engine: given the profile of ⟨p, d1, r1, c1⟩, estimates a virtual profile for ⟨p, d2, r2, c2⟩
- Optimizer: enumerates and searches through the optimization space S efficiently
17
Job Profile
- Concise representation of a program's execution as a job
- Records information at the level of task phases
- Generated by the Profiler through measurement, or by the What-if Engine through estimation
[Diagram: a map task reads a split from the DFS and runs through the Read, Map, Collect, Spill, and Merge phases; the Collect/Spill phases serialize, partition, sort, optionally combine and compress records in a memory buffer before spill files are merged.]
18
Job Profile Fields
- Dataflow: amount of data flowing through task phases, e.g., map output bytes, number of spills, number of records in the buffer per spill
- Costs: execution times at the level of task phases, e.g., read phase time, map phase time, and spill phase time in the map task
- Dataflow Statistics: statistical information about the dataflow, e.g., width of input key-value pairs, map selectivity in terms of records, map output compression ratio
- Cost Statistics: statistical information about resource costs, e.g., I/O cost for reading from local disk per byte, CPU cost for executing the mapper per record, CPU cost for uncompressing the input per byte
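Purely to show how these four field groups fit together, here is a hypothetical container type; it is not Starfish's actual class, and all names are made up for this sketch.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified job-profile container; field names are illustrative only. */
public class JobProfileSketch {

    enum MapPhase { READ, MAP, COLLECT, SPILL, MERGE }

    // Dataflow: amounts of data moving through phases (e.g., "mapOutputBytes", "numSpills").
    final Map<String, Long> dataflow = new HashMap<>();

    // Costs: measured execution time per task phase, in nanoseconds.
    final Map<MapPhase, Long> phaseCostNanos = new EnumMap<>(MapPhase.class);

    // Dataflow statistics: dataset-dependent ratios (e.g., map selectivity, compression ratio).
    final Map<String, Double> dataflowStats = new HashMap<>();

    // Cost statistics: per-byte / per-record resource costs (e.g., CPU cost per map record).
    final Map<String, Double> costStats = new HashMap<>();
}
```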
19
Generating Profiles by Measurement
Goals:
- Zero overhead when profiling is turned off
- No modifications to Hadoop
- Support for unmodified MapReduce programs written in Java or via Hadoop Streaming/Pipes (Python/Ruby/C++)
Approach: dynamic (on-demand) instrumentation
- Event-condition-action (ECA) rules are specified in Java
- The rules lead to run-time instrumentation of Hadoop internals
- They monitor the task phases of MapReduce job execution
- BTrace is currently used (Hadoop internals are in Java)
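To make the event-condition-action idea concrete, below is a minimal BTrace-style probe. The probed Hadoop class/method and the exact BTrace helper calls are assumptions for illustration; Starfish's real rules instrument Hadoop-internal phase boundaries in far more detail.

```java
import static com.sun.btrace.BTraceUtils.*;

import com.sun.btrace.annotations.BTrace;
import com.sun.btrace.annotations.Duration;
import com.sun.btrace.annotations.Kind;
import com.sun.btrace.annotations.Location;
import com.sun.btrace.annotations.OnMethod;

/** Event: a probed Hadoop method returns. Condition: this script is attached (profiling on).
 *  Action: record the elapsed time. Probe point below is illustrative, not Starfish's. */
@BTrace
public class MapTaskProbe {

    @OnMethod(clazz = "org.apache.hadoop.mapred.MapTask",
              method = "run",
              location = @Location(Kind.RETURN))
    public static void onMapTaskReturn(@Duration long durationNanos) {
        // A real profiler would accumulate this into a per-task profile rather than print it.
        println(strcat("map task time (ns): ", str(durationNanos)));
    }
}
```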
20
Generating Profiles by Measurement (continued)
[Diagram: ECA rules enable profiling inside the task JVMs; raw monitoring data from map and reduce tasks becomes map profiles and reduce profiles, which are combined into the job profile.]
Use of sampling: profile fewer tasks, or execute fewer tasks. (JVM = Java Virtual Machine, ECA = Event-Condition-Action)
21
Results of Job Profiling
22
Results Using Job Profiling
23
Workflow-Aware Scheduling
Causes of an unbalanced data layout:
- Skewed data
- Data layout not considered when scheduling tasks
- Adding or dropping partitions without rebalancing
An unbalanced layout can lead to failures due to space issues, and locality-aware schedulers can make the problem worse.
Possible solutions: change the number of replicas; collocate data via the block placement policy.
24
Impact of Unbalanced Data Layout
25
Impact of Unbalanced Data Layout
26
Impact of Unbalanced Data Layout
27
Workflow-Aware Scheduling
The scheduler makes decisions by considering producer-consumer relationships among jobs.
[Diagram: data flow between producer and consumer jobs across cluster nodes.]
28
Starfish's Workflow-Aware Scheduler
Space of choices:
- Block placement policy: round robin (local write is the default)
- Replication factor
- Block size: generally larger for large files
- Compression: impacts I/O, but not always beneficial
A sketch of the corresponding Hadoop settings follows.
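As a hedged sketch, these choices roughly correspond to the HDFS/Hadoop settings below. Property names follow Hadoop 2.x and may differ across versions, and the round-robin block placement class named here is hypothetical (Starfish provides its own policy).

```java
import org.apache.hadoop.conf.Configuration;

public class DataLayoutKnobs {
    public static Configuration layoutConfig() {
        Configuration conf = new Configuration();

        // Replication factor for newly written files (HDFS default is 3).
        conf.setInt("dfs.replication", 1);

        // HDFS block size; generally larger for large files (dfs.block.size in Hadoop 1.x).
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

        // Compress job output: trades CPU for storage and I/O, not always a win.
        conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);

        // HDFS writes the first replica locally by default; a round-robin policy can be
        // plugged in via this property (the class name below is hypothetical, and the
        // property name can vary across Hadoop versions).
        conf.set("dfs.block.replicator.classname",
                 "example.RoundRobinBlockPlacementPolicy");

        return conf;
    }
}
```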
29
Starfish's Workflow-Aware Scheduler: What-if Questions
A) What is the expected runtime of job P if the round-robin block placement policy is used for P's output files?
B) What will the new data layout in the cluster be if the round-robin block placement policy is used for P's output files?
C) What is the expected runtime of job C1 (respectively C2) if its input data layout is the one in the answer to question B?
D) What are the expected runtimes of jobs C1 and C2 if they are scheduled concurrently when job P completes?
E) Given the local-write block placement policy and a replication factor of 1 for job P's output, what is the expected increase in the runtime of job C1 if one node in the cluster fails during C1's execution?
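Purely to illustrate the shape of these questions as calls against a what-if service, here is a hypothetical interface; none of the type or method names below are Starfish's actual API.

```java
/** Hypothetical interface sketching how the what-if questions above could be posed
 *  programmatically; illustrative only. */
public interface WhatIfEngineSketch {

    enum BlockPlacement { LOCAL_WRITE, ROUND_ROBIN }

    /** (A) Expected runtime of producer job P under a given placement policy. */
    double estimateRuntime(String jobId, BlockPlacement policy, int replicationFactor);

    /** (B) Predicted data layout (bytes per node) if P writes with the given policy. */
    long[] predictLayout(String jobId, BlockPlacement policy, int replicationFactor);

    /** (C, D) Expected runtimes of consumer jobs given a (possibly hypothetical) layout,
     *  optionally scheduled concurrently. */
    double[] estimateConsumerRuntimes(String[] consumerJobIds, long[] layout, boolean concurrent);

    /** (E) Expected runtime increase for a consumer if one node fails mid-execution. */
    double estimateRuntimeUnderNodeFailure(String consumerJobId, long[] layout);
}
```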
30
Estimates from the What-if Engine
Hadoop cluster: 16 nodes (c1.medium). MapReduce program: Word Co-occurrence. Data set: 10 GB of Wikipedia.
[Figure: true response surface vs. estimated response surface.]
31
Workflow Scheduler Picks Layout
32
Optimizations: Workload Optimizer
33
Provisioning: the Elastisizer
Motivation: Amazon Elastic MapReduce, where data lives in S3, is processed in a cluster, and is stored back to S3, and the user pays for the resources used.
The Elastisizer determines the best cluster and Hadoop configurations based on user-specified goals (execution time and cost).
34
Elastisizer Configuration Evaluation
35
Elastisizer Configuration Evaluation
36
Elastisizer: Cluster Configurations
37
Multi-objective Cluster Provisioning
Instance type for the source cluster: m1.large
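The following sketch illustrates the general idea behind multi-objective provisioning: enumerate candidate cluster configurations, ask a what-if style estimator for the runtime of each, and keep the cheapest candidate that meets a deadline. All types, prices, and method names are hypothetical, not the Elastisizer's actual implementation.

```java
import java.util.List;

/** Hypothetical sketch of multi-objective provisioning. */
public class ElastisizerSketch {

    public static class Candidate {
        final String instanceType;     // e.g., "m1.large" (illustrative)
        final int numNodes;
        final double pricePerNodeHour; // placeholder price, not real EC2 pricing
        Candidate(String t, int n, double p) { instanceType = t; numNodes = n; pricePerNodeHour = p; }
    }

    /** Stands in for the What-if Engine; returns predicted runtime in hours. */
    public interface RuntimeEstimator {
        double estimateHours(Candidate c);
    }

    /** Returns the cheapest candidate whose predicted runtime meets the deadline, or null. */
    public static Candidate cheapestWithinDeadline(List<Candidate> candidates,
                                                   RuntimeEstimator estimator,
                                                   double deadlineHours) {
        Candidate best = null;
        double bestCost = Double.MAX_VALUE;
        for (Candidate c : candidates) {
            double hours = estimator.estimateHours(c);
            if (hours > deadlineHours) continue;                   // violates the time objective
            double cost = hours * c.numNodes * c.pricePerNodeHour; // monetary objective
            if (cost < bestCost) { bestCost = cost; best = c; }
        }
        return best;
    }
}
```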
38
Critique of the Paper
Good:
- Source is available for the implementation
- The impact of various settings can be seen directly
- Good visualization tools
- Tutorials and source are available at duke.edu/starfish
Bad:
- The paper (and subsequent materials) talks a lot about what Starfish does, but not necessarily how it does it
- There is no documentation on LastWord, which seems important
- Starfish only works after the job/workflow has been executed at least once
39
Starfish's Visualizer
- Timeline views: show the progress of a job execution at the task level, and let you compare executions of the same job under different settings
- Data-flow views: show the flow of data among nodes, along with the MapReduce jobs; a video mode allows playback of past executions
- Profile views: show timings, data flow, and resource-level information
40
Timeline Views
41
Timeline View
42
Data Skew View
43
Data Skew View
44
Data Skew View
45
Data-flow Views
46
References
- Herodotou, Herodotos, et al. "Starfish: A Self-tuning System for Big Data Analytics." Proceedings of the Fifth CIDR Conference, 2011.
- Dong, Fei. Extending Starfish to Support the Growing Hadoop Ecosystem. Diss., Duke University, 2012.
- Herodotou, Herodotos, Fei Dong, and Shivnath Babu. "MapReduce Programming and Cost-based Optimization? Crossing this Chasm with Starfish." Proceedings of the VLDB Endowment 4.12 (2011).
- http://www.cs.duke.edu/starfish/
- http://www.youtube.com/watch?v=Upxe2dzE1uk
47
Backup
48
Hadoop MapReduce Ecosystem
A popular solution for big data analytics.
[Stack diagram: Hadoop's MapReduce execution engine and distributed file system, with clients and tools such as Java/C++/R/Python programs, Oozie, Hive, Pig, Jaql, HBase, and Elastic MapReduce layered on top.]
49
Workflow-level Tuning
Starfish has a Workflow-aware Scheduler that addresses several concerns:
- How do we distribute computation evenly across nodes?
- How do we adapt to load imbalance or energy costs?
The Workflow-aware Scheduler works with the What-if Engine and the Data Manager to answer these questions.
50
Workload-level Tuning
Starfish's Workload Optimizer is aware of every workflow that will be executed. It reorders the workflows to make them more efficient, including reusing data across workflows that share the same MapReduce jobs.
51
What-if Engine
[Diagram: the What-if Engine takes a job profile plus the properties of a (possibly hypothetical) job — input data properties, cluster resources, and configuration settings — and uses a Job Oracle and a Task Scheduler Simulator to produce a virtual job profile for that hypothetical job.]
52
Virtual Profile Estimation
Given the profile for job j = ⟨p, d1, r1, c1⟩, estimate the (virtual) profile for job j' = ⟨p, d2, r2, c2⟩:
- Dataflow statistics of j': estimated by cardinality models from j's dataflow statistics and the new input data d2
- Cost statistics of j': estimated by relative black-box models from j's cost statistics and the new resources r2
- Dataflow of j': estimated by white-box models from the dataflow statistics and the new configuration c2
- Costs of j': estimated by white-box models combining the dataflow and the cost statistics
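The four estimation steps above can be read as a simple pipeline. The skeleton below is only a reading of that pipeline with placeholder types and method names; it is not Starfish's implementation, and the model internals are deliberately left abstract.

```java
/** Hypothetical skeleton of the four estimation steps; model implementations are placeholders. */
public abstract class VirtualProfileEstimator {

    // Step 1: cardinality models adapt the source job's dataflow statistics to new input data d2.
    abstract double[] estimateDataflowStatistics(double[] sourceDataflowStats, Object newData);

    // Step 2: relative black-box models adapt cost statistics from resources r1 to r2.
    abstract double[] estimateCostStatistics(double[] sourceCostStats, Object newResources);

    // Step 3: white-box (analytical) models compute the dataflow under configuration c2.
    abstract long[] estimateDataflow(double[] dataflowStats, Object newConfiguration);

    // Step 4: white-box models combine dataflow and cost statistics into phase costs.
    abstract double[] estimateCosts(long[] dataflow, double[] costStats);

    /** Chains the four steps to build the virtual profile for j' = (p, d2, r2, c2). */
    public double[] estimate(double[] srcDataflowStats, double[] srcCostStats,
                             Object d2, Object r2, Object c2) {
        double[] dataflowStats = estimateDataflowStatistics(srcDataflowStats, d2);
        double[] costStats = estimateCostStatistics(srcCostStats, r2);
        long[] dataflow = estimateDataflow(dataflowStats, c2);
        return estimateCosts(dataflow, costStats);
    }
}
```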
53
Job Optimizer
[Diagram: the Just-in-Time Optimizer takes a job profile, input data properties, and cluster resources; using subspace enumeration and recursive random search, and issuing what-if calls, it finds the best configuration settings for the job.]
A simplified sketch of recursive random search follows.
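Recursive random search is the search strategy named on the slide; the sketch below is a simplified, generic version of the technique (sample randomly, then recursively shrink the search box around the best point found), not Starfish's exact optimizer, and the cost function stands in for a what-if call.

```java
import java.util.Random;
import java.util.function.Function;

/** Simplified recursive random search over a numeric configuration space; illustrative only. */
public class RecursiveRandomSearchSketch {
    private static final Random RNG = new Random(42);

    /** Minimizes cost(x) over the box [lo, hi] by sampling, then repeatedly
     *  re-sampling within a shrinking box centered on the best point so far. */
    public static double[] search(Function<double[], Double> cost,
                                  double[] lo, double[] hi,
                                  int samplesPerLevel, int levels, double shrink) {
        lo = lo.clone();
        hi = hi.clone();
        double[] best = sample(lo, hi);
        double bestCost = cost.apply(best);

        for (int level = 0; level < levels; level++) {
            // Exploration: random samples within the current box.
            for (int s = 0; s < samplesPerLevel; s++) {
                double[] x = sample(lo, hi);
                double c = cost.apply(x);
                if (c < bestCost) { bestCost = c; best = x; }
            }
            // Exploitation: shrink the box around the current best point.
            for (int d = 0; d < lo.length; d++) {
                double half = (hi[d] - lo[d]) * shrink / 2.0;
                lo[d] = Math.max(lo[d], best[d] - half);
                hi[d] = Math.min(hi[d], best[d] + half);
            }
        }
        return best;
    }

    private static double[] sample(double[] lo, double[] hi) {
        double[] x = new double[lo.length];
        for (int d = 0; d < x.length; d++) x[d] = lo[d] + RNG.nextDouble() * (hi[d] - lo[d]);
        return x;
    }
}
```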