Published by Felix Haynes. Modified over 9 years ago.
1
On Availability of Intermediate Data in Cloud Computations
Steven Y. Ko, Imranul Hoque, Brian Cho, and Indranil Gupta
Distributed Protocols Research Group (DPRG)
University of Illinois at Urbana-Champaign
2
Our Position
Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
6
Our Position
Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
◦ Dataflow programming frameworks
◦ The importance of intermediate data
◦ Outline of a solution
This talk
◦ Builds up the case
◦ Emphasizes the need, not the solution
7
Dataflow Programming Frameworks
Runtime systems that execute dataflow programs
◦ MapReduce (Hadoop), Pig, Hive, etc.
◦ Gaining popularity for massive-scale data processing
◦ Distributed and parallel execution on clusters
A dataflow program consists of
◦ Multi-stage computation
◦ Communication patterns between stages
8
Example 1: MapReduce
Two-stage computation with all-to-all communication
◦ Introduced by Google, open-sourced by Yahoo! (Hadoop)
◦ Two functions, Map and Reduce, supplied by the programmer
◦ Massively parallel execution of Map and Reduce
[Figure: Stage 1: Map, Shuffle (all-to-all), Stage 2: Reduce]
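The two-stage structure above can be sketched in a few lines of single-process Python. This is only an illustration of the Map, Shuffle, Reduce pattern, not Hadoop's API: the names `map_fn`, `reduce_fn`, `run_mapreduce`, and the `num_reducers` parameter are hypothetical, and word counting is an assumed example workload. The per-reducer `partitions` play the role of the intermediate data discussed in the rest of the talk.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in line.split()]


def reduce_fn(key, values):
    """Reduce: sum all counts collected for one key."""
    return key, sum(values)


def run_mapreduce(inputs, map_fn, reduce_fn, num_reducers=2):
    # Stage 1: Map. Each input split produces its own intermediate pairs.
    map_outputs = [map_fn(split) for split in inputs]

    # Shuffle (all-to-all): any map output may contribute to any reducer.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            partitions[hash(key) % num_reducers][key].append(value)

    # Stage 2: Reduce. Each reducer processes the keys in its partition.
    results = {}
    for partition in partitions:
        for key, values in partition.items():
            out_key, out_value = reduce_fn(key, values)
            results[out_key] = out_value
    return results
```

For example, `run_mapreduce(["a b a", "b c"], map_fn, reduce_fn)` counts each word across both splits. In a real cluster the map outputs would sit on the local disks of the machines that produced them and be fetched remotely during the shuffle.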
9
Example 2: Pig and Hive
Pig from Yahoo! & Hive from Facebook
Built atop MapReduce
Declarative, SQL-style languages
Automatic generation & execution of multiple MapReduce jobs
10
Example 2: Pig and Hive
Multi-stage with either all-to-all or 1-to-1 communication
[Figure: Stage 1: Map, Shuffle (all-to-all), Stage 2: Reduce, 1-to-1 comm., Stage 3: Map, Stage 4: Reduce]
12
Usage
Google (MapReduce)
◦ Indexing: a chain of 24 MapReduce jobs
◦ ~200K jobs processing 50PB/month (in 2006)
Yahoo! (Hadoop + Pig)
◦ WebMap: a chain of 100 MapReduce jobs
Facebook (Hadoop + Hive)
◦ ~300TB total, adding 2TB/day (in 2008)
◦ 3K jobs processing 55TB/day
Amazon
◦ Elastic MapReduce service (pay-as-you-go)
Academic clouds
◦ Google-IBM Cluster at UW (Hadoop service)
◦ CCT at UIUC (Hadoop & Pig service)
13
One Common Characteristic
Intermediate data
◦ Intermediate data? Data between stages
Similarities to traditional intermediate data
◦ E.g., .o files
◦ Critical to produce the final output
◦ Short-lived, written-once and read-once, & used-immediately
14
One Common Characteristic
Intermediate data
◦ Written-locally & read-remotely
◦ Possibly very large amount of intermediate data (depending on the workload, though)
◦ Computational barrier
[Figure: Stage 1: Map, Computational Barrier, Stage 2: Reduce]
15
Computational Barrier + Failures
Availability becomes critical.
◦ Loss of intermediate data before or during the execution of a task => the task can’t proceed
[Figure: Stage 1: Map, Stage 2: Reduce]
16
Current Solution
Store locally & re-generate when lost
◦ Re-run affected map & reduce tasks
◦ No support from a storage system
Assumption: re-generation is cheap and easy
[Figure: Stage 1: Map, Stage 2: Reduce]
17
Hadoop Experiment
Emulab setting (for all plots in this talk)
◦ 20 machines sorting 36GB
◦ 4 LANs and a core switch (all 100 Mbps)
Normal execution: Map–Shuffle–Reduce
[Figure: execution timeline of Map, Shuffle, Reduce]
18
Hadoop Experiment
1 failure after Map
◦ Re-execution of Map-Shuffle-Reduce
~33% increase in completion time
[Figure: execution timeline of Map, Shuffle, Reduce, followed by re-executed Map, Shuffle, Reduce]
19
Re-Generation for Multi-Stage
Cascaded re-execution: expensive
[Figure: Stage 1: Map, Stage 2: Reduce, Stage 3: Map, Stage 4: Reduce]
20
Importance of Intermediate Data
Why?
◦ Critical for execution (barrier)
◦ When lost, very costly
Current systems handle it themselves.
◦ Re-generate when lost: can lead to expensive cascaded re-execution
◦ No support from the storage
We believe the storage is the right abstraction, not the dataflow frameworks.
21
Our Position
Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
Dataflow programming frameworks
The importance of intermediate data
◦ Outline of a solution
  Why is storage the right abstraction?
  Challenges
  Research directions
22
Why is Storage the Right Abstraction?
Replication stops cascaded re-execution.
[Figure: Stage 1: Map, Stage 2: Reduce, Stage 3: Map, Stage 4: Reduce]
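A minimal sketch of why replication bounds the cascade, under simplifying assumptions that are not from the slides: stages are numbered from 1, the input to stage 1 is always available (for example in a replicated file system), and any intermediate output that was not replicated is treated as lost after a failure. The function name `stages_to_rerun` and its parameters are hypothetical.

```python
def stages_to_rerun(failed_stage, replicated_outputs=frozenset()):
    """Return the stages that must execute again after a failure hits
    `failed_stage`.

    `replicated_outputs` holds the stage numbers whose intermediate
    output survives the failure. Re-execution walks backwards until it
    reaches a stage whose input is still available.
    """
    start = failed_stage
    # The input of stage `start` is the output of stage `start - 1`.
    # Keep moving back while that output was not replicated.
    while start > 1 and (start - 1) not in replicated_outputs:
        start -= 1
    return list(range(start, failed_stage + 1))
```

With no replication, a failure in stage 4 cascades back to stage 1: `stages_to_rerun(4)` gives `[1, 2, 3, 4]`. If the output of stage 3 was replicated, `stages_to_rerun(4, {3})` gives `[4]`, so only the failed stage runs again.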
23
So, Are We Done?
No!
Challenge: minimal interference
◦ Network is heavily utilized during Shuffle.
◦ Replication requires network transmission too.
◦ Minimizing interference is critical for the overall job completion time.
Any existing approaches?
◦ HDFS (Hadoop’s default file system): much interference (next slide)
◦ Background replication with TCP-Nice: not designed for network utilization & control (no further discussion, please refer to our paper)
24
Modified HDFS Interference
Unmodified HDFS
◦ Much overhead with synchronous replication
Modification for asynchronous replication
◦ With an increasing level of interference
Four levels of interference
◦ Hadoop: original, no replication, no interference
◦ Read: disk read, no network transfer, no actual replication
◦ Read-Send: disk read & network send, no actual replication
◦ Rep.: full replication
25
Modified HDFS Interference
Asynchronous replication
◦ Network utilization makes the difference
Both Map & Shuffle get affected
◦ Some Maps need to read remotely
26
Our Position
Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
Dataflow programming frameworks
The importance of intermediate data
◦ Outline of a new storage system design
  Why is storage the right abstraction?
  Challenges
  Research directions
27
Research Directions
Two requirements
◦ Intermediate data availability to stop cascaded re-execution
◦ Interference minimization focusing on network interference
Solution
◦ Replication with minimal interference
28
Research Directions
Replication using spare bandwidth
◦ Not much network activity during Map & Reduce computation
◦ Tight B/W monitoring & control
Deadline-based replication
◦ Replicate every N stages
Replication based on a cost model
◦ Replicate only when re-execution is more expensive
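The last two directions can be read as simple decision rules. The sketch below is one possible interpretation, not the authors' design: the slides do not define a cost model, so the expected-cost comparison, the function names, and all parameters (`failure_prob`, `reexec_cost_s`, `replication_cost_s`, `n`) are assumptions made for illustration.

```python
def replicate_by_deadline(stage, n):
    """Deadline-based rule: replicate the output of every n-th stage.

    Stages are numbered from 1; n must be a positive integer.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return stage % n == 0


def replicate_by_cost(failure_prob, reexec_cost_s, replication_cost_s):
    """Cost-model rule (assumed form): replicate only when the expected
    cost of re-execution exceeds the cost of replicating now.

    failure_prob       -- assumed probability that the data is lost
    reexec_cost_s      -- estimated seconds to regenerate the data
    replication_cost_s -- estimated seconds added by replication
    """
    expected_reexec_cost_s = failure_prob * reexec_cost_s
    return expected_reexec_cost_s > replication_cost_s
```

For instance, with `n = 3` an eight-stage job would replicate the outputs of stages 3 and 6. Under the assumed cost rule, a 10% loss probability and a 1000-second re-execution justify up to 100 seconds of replication overhead.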
29
Summary
Our position
◦ Intermediate data as a first-class citizen for dataflow programming frameworks in clouds
Problem: cascaded re-execution
Requirements
◦ Intermediate data availability
◦ Interference minimization
Further research needed
30
BACKUP
31
Default HDFS Interference
Replication of Map and Reduce outputs
32
Default HDFS Interference
Replication policy: local, then remote-rack
Synchronous replication