Papers on Storage Systems
1) Purlieus: Locality-aware Resource Allocation for MapReduce in a Cloud, SC 2011.
2) Making Cloud Intermediate Data Fault-Tolerant, SOCC 2010.
Presented by: Qiangju Xiao
Purlieus: Locality-aware Resource Allocation for MapReduce in a Cloud (SC 2011)
Authors: Balaji Palanisamy, Aameek Singh, Ling Liu, Bhushan Jain
Introduction (1)
What does the paper present?
– Purlieus, a MapReduce resource allocation system designed to enhance the performance of MapReduce jobs in the cloud.
How does Purlieus work?
– It provisions virtual MapReduce clusters in a locality-aware manner;
– It enables MapReduce VMs to access input data (map phase) and intermediate data (reduce phase) from local or close-by physical machines.
Introduction (2)
What improvements does Purlieus achieve?
– Reduces cumulative data center network traffic;
– A 50% reduction in job execution times for a variety of workloads, because network transfer time is a major component of total execution time.
Impact of Reduce Locality (figure)
System Model (1) – Current Cloud Infrastructure: data load (figure)
System Model (2) – Purlieus Infrastructure
1) Data is broken into chunks;
2) Blocks are stored on a distributed file system running on the physical machines;
3) VMs access the data on the physical machines.
System Model (3) – Dataflow from physical to virtual machines (figure)
Two Key Questions
Data Placement – Which physical machines should store each dataset?
VM Placement – Where should the VMs be provisioned to process these data blocks?
Purlieus’ Solution – Principles (1)
Job-Specific Locality Awareness
– Placing data in the MapReduce cloud service should incorporate job characteristics such as the amount of data accessed in the map and reduce phases.
– Three distinct classes of jobs: (1) map-input heavy; (2) map-and-reduce-input heavy; (3) reduce-input heavy.
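The slides do not give a precise classification rule; below is a minimal sketch of one plausible way to bucket a job by the relative sizes of its map input and reduce (shuffle) input. The `heavy_fraction` threshold is an assumption, not a value from the paper.

```python
from enum import Enum

class JobClass(Enum):
    MAP_INPUT_HEAVY = "map-input heavy"
    MAP_AND_REDUCE_HEAVY = "map-and-reduce-input heavy"
    REDUCE_INPUT_HEAVY = "reduce-input heavy"

def classify_job(map_input_bytes: int, reduce_input_bytes: int,
                 heavy_fraction: float = 0.8) -> JobClass:
    """Bucket a job by which phase moves most of its data.

    heavy_fraction is a hypothetical threshold, not from the paper.
    """
    total = map_input_bytes + reduce_input_bytes
    if map_input_bytes >= heavy_fraction * total:
        return JobClass.MAP_INPUT_HEAVY
    if reduce_input_bytes >= heavy_fraction * total:
        return JobClass.REDUCE_INPUT_HEAVY
    return JobClass.MAP_AND_REDUCE_HEAVY
```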
Purlieus’ Solution – Principles (2)
Load Awareness
– Placing data in a MapReduce cloud should also account for the computational load (CPU, memory) on the physical machines.
– Ensure that the expected load on the servers does not exceed a configurable threshold.
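That threshold check is a simple filter; here is a sketch assuming each machine exposes an `expected_load` estimate in [0, 1] (the 0.8 default is illustrative, not from the paper):

```python
def eligible_machines(machines, threshold=0.8):
    """Keep only machines whose expected load stays under the threshold.

    Each machine is assumed to expose an expected_load attribute in
    [0, 1] combining CPU and memory pressure.
    """
    return [m for m in machines if m.expected_load < threshold]
```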
Purlieus’ Solution – Principles (3)
Job-Specific Data Replication
– Replicas of a dataset are placed based on the type and frequency of the jobs that use it. For example, if an input dataset is used by three sets of MapReduce jobs, two of which are reduce-input heavy and one map-input heavy, Purlieus places two replicas of its data blocks in a reduce-input-heavy fashion and the third using the map-input-heavy strategy.
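Apportioning replicas to strategies in proportion to job frequency, as in the example above, might look like the hypothetical helper below; the exact rule is not spelled out on the slide, so this is only a sketch. The closing assertion reproduces the slide's two-reduce-heavy, one-map-heavy example.

```python
def replica_strategies(job_class_counts: dict, n_replicas: int = 3) -> dict:
    """Apportion replicas to placement strategies by job frequency."""
    total = sum(job_class_counts.values())
    alloc = {cls: (count * n_replicas) // total
             for cls, count in job_class_counts.items()}
    # Hand out replicas lost to integer rounding, most frequent class first.
    leftover = n_replicas - sum(alloc.values())
    for cls, _ in sorted(job_class_counts.items(), key=lambda kv: -kv[1]):
        if leftover == 0:
            break
        alloc[cls] += 1
        leftover -= 1
    return alloc

# The slide's example: 2 reduce-input-heavy jobs, 1 map-input-heavy job.
assert replica_strategies({"reduce-input heavy": 2,
                           "map-input heavy": 1}) == \
    {"reduce-input heavy": 2, "map-input heavy": 1}
```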
Purlieus – Placement Techniques (1)
Map-input heavy jobs
– Data placement
These jobs do not require reducers to execute close to each other; Purlieus simply chooses the machines with the least expected load.
– VM placement
Attempt to place each VM on the physical machine that holds its input data chunks for the map phase; if that machine is too heavily loaded, place the VM on a machine close to the one storing the chunk. Among physical machines at the same network distance, the one with the least load is chosen. (See the sketch below.)
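A sketch of that preference order, under an assumed `Machine` record and a toy two-level `network_distance` metric (neither is from the paper):

```python
from dataclasses import dataclass

@dataclass
class Machine:
    name: str
    rack: int
    expected_load: float  # 0..1, CPU/memory pressure estimate

def network_distance(a: Machine, b: Machine) -> int:
    """Toy metric: 0 = same machine, 1 = same rack, 2 = across racks."""
    if a is b:
        return 0
    return 1 if a.rack == b.rack else 2

def place_map_heavy_vm(chunk_host: Machine, machines: list,
                       load_threshold: float = 0.8) -> Machine:
    """Prefer the chunk's own host; otherwise the closest, least-loaded one."""
    if chunk_host.expected_load < load_threshold:
        return chunk_host  # data-local placement
    candidates = [m for m in machines if m is not chunk_host]
    # Closest first; among equally distant machines, least loaded wins.
    return min(candidates,
               key=lambda m: (network_distance(chunk_host, m),
                              m.expected_load))
```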
Purlieus – Placement Techniques (2)
Map- and Reduce-input heavy jobs
– Data placement
Should support reduce locality: the job's VMs should end up on machines close to each other, so data blocks are placed on a set of closely connected physical machines.
– VM placement
Ensure that VMs are placed either on the physical machines storing the input data or on close-by ones. Map tasks then use local reads, and reduce tasks also read within the same rack, maximizing reduce locality.
Purlieus – Placement Techniques (3)
Reduce-input heavy jobs
– Data placement
Map locality is less important here; Purlieus chooses the physical machine with the most free storage.
– VM placement
Network traffic for transferring intermediate data among the MapReduce VMs is intense in reduce-input heavy jobs, so the job's VMs should be placed close to each other.
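Taken together, the three data-placement rules amount to ranking machines by a class-specific criterion and dealing blocks off the top. A toy sketch, assuming machines carry `expected_load`, `free_storage`, and `rack` fields:

```python
def choose_block_hosts(job_class, machines, n_blocks):
    """Rank machines per job class, then deal blocks off the top.

    The ranking criteria paraphrase the last three slides; the
    Machine fields used here are assumptions.
    """
    if job_class == "map-input heavy":
        # Spread blocks over the least-loaded machines.
        ranked = sorted(machines, key=lambda m: m.expected_load)
    elif job_class == "reduce-input heavy":
        # Map locality matters less; favor the most free storage.
        ranked = sorted(machines, key=lambda m: -m.free_storage)
    else:  # map-and-reduce-input heavy
        # Keep blocks closely connected: fill one rack before the next.
        ranked = sorted(machines, key=lambda m: m.rack)
    return [ranked[i % len(ranked)] for i in range(n_blocks)]
```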
Experiments
Data placement techniques:
– Locality- and load-aware data placement (LLADP), as proposed by Purlieus
– Random data placement (RDP)
VM placement techniques:
– Locality-unaware VM placement (LUAVP)
– Map-locality-aware VM placement (MLVP)
– Reduce-locality-aware VM placement (RLVP)
– Map- and reduce-locality-aware VM placement (MRLVP)
– Hybrid locality-aware VM placement (HLVP): adaptively picks the placement strategy based on the type of the input job, using MLVP for map-input heavy jobs, RLVP for reduce-input heavy jobs, and MRLVP for map- and reduce-input heavy jobs.
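HLVP as described is just a dispatch over the job class; a minimal sketch:

```python
def hlvp_strategy(job_class: str) -> str:
    """Hybrid locality-aware VM placement: one strategy per job class."""
    return {
        "map-input heavy": "MLVP",               # map-locality aware
        "reduce-input heavy": "RLVP",            # reduce-locality aware
        "map-and-reduce-input heavy": "MRLVP",   # both
    }[job_class]
```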
Results – Map and Reduce-input heavy workload (figure)
Results – Map-input heavy workload (figure)
Results – Reduce-input heavy workload (figure)
Results – Macro analysis using the MapReduce simulator PurSim, part 1 (figure)
Results – Macro analysis using the MapReduce simulator PurSim, part 2 (figure)
Conclusions
– Purlieus’ placement techniques optimize for data locality during both the map and reduce phases of a job by considering VM placement, MapReduce job characteristics, and the load on the physical cloud infrastructure at the time of data placement.
– Purlieus’ evaluation shows significant performance gains, with some scenarios showing close to a 50% reduction in cross-rack network traffic.
Making Cloud Intermediate Data Fault-Tolerant (SOCC 2010)
Authors: Steven Y. Ko, Imranul Hoque, Brian Cho, Indranil Gupta
MapReduce
Phases:
– Map
– Shuffle
– Reduce
Data:
– Input
– Intermediate
– Output
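To make the data categories concrete, here is a toy in-process MapReduce (word count) in which the grouped shuffle output is exactly the "intermediate data" the paper is about:

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (word, 1) pairs from the input data."""
    for line in records:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group map output by key -- this is the intermediate data."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each group into the final output data."""
    return {key: sum(values) for key, values in groups.items()}

output = reduce_phase(shuffle(map_phase(["a b a", "b c"])))
assert output == {"a": 2, "b": 2, "c": 1}
```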
Intermediate Data
– Short-lived: used immediately and discarded on job completion
– Written once, read a bounded number of times
– Large: many blocks
Intermediate Data – Failures
Losing a machine’s intermediate data forces the tasks that produced it to re-run, and those re-runs can cascade back through earlier stages: cascaded re-execution (figure).
Intermediate Data – Loss requires recomputation (figure)
Intermediate Data – Behavior breakdown (figure: runs with zero failures vs. one failure)
Intermediate Data – Replication
Traditional replication is expensive (figure).
Can replication be accomplished without significantly affecting execution speed?
Extend HDFS
– Asynchronous replication
– Replicate within the rack
– Minimize the data replicated
Asynchronous Replication
– HDFS replication is normally pessimistic: a write blocks until the replicas are made.
– ISS does not block (asynchronous replication): the loss of consistency is not a problem, because each block has only one writer. (See the sketch below.)
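A sketch of the non-blocking write path: the writer returns as soon as the local copy is on disk, while a background thread drains a replication queue. This illustrates the idea only; `send_to_replica` is a hypothetical stand-in, not the actual ISS/HDFS code.

```python
import queue
import threading

replication_queue: queue.Queue = queue.Queue()

def send_to_replica(path: str, data: bytes) -> None:
    """Stand-in for the real transfer to the chosen replica machine."""
    pass  # e.g., an RPC to a same-rack host; hypothetical

def write_block(path: str, data: bytes) -> None:
    """Write locally and return immediately; replication happens later."""
    with open(path, "wb") as f:
        f.write(data)
    replication_queue.put((path, data))  # do NOT wait for replicas

def replicator() -> None:
    """Background thread: drain the queue and copy blocks out.

    With a single writer per block, replicating late cannot create an
    inconsistent replica, which is why not blocking is safe here.
    """
    while True:
        path, data = replication_queue.get()
        send_to_replica(path, data)
        replication_queue.task_done()

threading.Thread(target=replicator, daemon=True).start()
```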
Replicate within Rack
– Stock HDFS replicates to a different rack for greater availability.
– The lifespan of intermediate data is short, so it is “safe” to replicate to a machine in the same rack, keeping replication traffic off the inter-rack links.
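Choosing the replica destination then becomes a same-rack pick rather than HDFS's cross-rack one. A toy selection, assuming each machine object knows its rack:

```python
import random

def pick_replica_host(local_host, machines):
    """Prefer a different machine in the SAME rack (unlike stock HDFS).

    Falls back to any other machine if the rack has no other member.
    """
    same_rack = [m for m in machines
                 if m.rack == local_host.rack and m is not local_host]
    others = [m for m in machines if m is not local_host]
    return random.choice(same_rack or others)
```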
Minimize Data Replicated
– Under HDFS-style replication, the shuffle phase already copies most of the intermediate data to other machines as a side effect; only the data consumed locally is never sent anywhere.
– ISS therefore replicates only the locally consumed data.
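Since the shuffle already delivers each remote partition to its consumer, only partitions consumed on the machine that produced them need an explicit copy. A sketch of that filter, with the `producer`/`consumer` fields as assumptions:

```python
def partitions_to_replicate(partitions, local_host):
    """Replicate only partitions consumed where they were produced.

    Remote partitions already exist in two places (producer and
    consumer) as a side effect of the shuffle, so copying them
    again would be wasted work.
    """
    return [p for p in partitions
            if p.producer is local_host and p.consumer is local_host]
```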
ISS under failure (figure)
Conclusion
– The properties of intermediate data allow a tailored replication strategy to outperform a traditional one.
– Replication improves MapReduce performance in the case of failures.
References
1) Balaji Palanisamy, Aameek Singh, Ling Liu, Bhushan Jain. Purlieus: Locality-aware Resource Allocation for MapReduce in a Cloud. SC 2011.
2) Steven Y. Ko, Imranul Hoque, Brian Cho, Indranil Gupta. Making Cloud Intermediate Data Fault-Tolerant. SOCC 2010.