Published by Rosamund Pierce. Modified over 9 years ago.
1
Location-aware MapReduce in Virtual Cloud
2011 IEEE Computer Society International Conference on Parallel Processing
Yifeng Geng1,2, Shimin Chen3, YongWei Wu1*, Ryan Wu3, Guangwen Yang1,2, Weimin Zheng1
Reporter: Yu Chih Lin
2
Outline: Introduction; Background; Model and New Strategy; Implementation; Experiment; Conclusion
3
Introduction: MapReduce is an important programming model for processing and generating large data sets. It is commonly used in applications such as web indexing, data mining, and machine learning.
4
Introduction: Multi-core CPUs support virtualization technology, so one machine can run two or more virtual machines (VMs) simultaneously and share its I/O resources among users. MapReduce is set up on a distributed file system: Google uses GFS, and Hadoop uses HDFS.
5
Introduction: Running MapReduce in a virtual environment raises three major problems. (1) Disk sharing results in unbalanced data distribution and unbalanced workload. (2) Data imbalance and load imbalance cause I/O interference. (3) Disk sharing reduces data redundancy.
6
Introduction: Purpose of this paper: abstract a model; define evaluation metrics; analyze the data pattern and task pattern; propose a location-aware file block allocation strategy for Hadoop.
7
Introduction: The paper's strategy has three main benefits. MapReduce's workload is more balanced. I/O interference is reduced and HDFS performance improves. Data redundancy is retained.
8
Background: Traditional I/O interference is of two kinds. Disk interference occurs when multiple processes try to access the same disk simultaneously. Network interference mainly concerns latency and throughput.
9
Background: Two kinds of I/O virtualization are considered: KVM and paravirtualization. Virtual machines share CPUs and memory well, but not I/O.
10
Background Virtualized Hadoop architecture
11
Model and New Strategy: Build a general model to analyze different allocation strategies in terms of data pattern and task pattern. To simplify the analysis, four assumptions are made.
12
Model and New Strategy: (1) Every physical machine has the same I/O devices and hosts the same number of virtual machines. (2) All virtual machines are in one local area network, and the network topology is flat. (3) Workload can be assigned at random to any virtual machine, without restriction. (4) All file blocks have the same size.
13
Model and New Strategy
18
A new allocation strategy: Place the replicas of a file block on different physical machines, and keep the number of blocks on each physical machine balanced. Two intuitive ways are presented: round-robin allocation and serpentine allocation.
19
Model and New Strategy: An example of round-robin allocation, taking p = 8 and n = 8 (p: number of physical machines, n: number of file blocks).
20
An example of serpentine allocation, taking p = 8 and n = 8 (p: number of physical machines, n: number of file blocks).
21
Model and New Strategy: Evaluation metrics for the data pattern: actualReplicaNum = 3, maxBlockNum = 3, blockNumSigma = 0. Enumeration average results for the task pattern: round-robin allocation gives maxAssignedNum = 2.2724 and assignedNumSigma = 0.7943; serpentine allocation gives maxAssignedNum = 2.2705 and assignedNumSigma = 0.79323.
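The data-pattern metrics can be checked with a short sketch. The metric definitions below are inferred from their names and are assumptions, not the paper's formal definitions: actualReplicaNum is taken as the average number of distinct physical machines holding a block's replicas, maxBlockNum as the largest number of replicas on one physical machine, and blockNumSigma as the population standard deviation of those per-machine counts. The placement function is a plain round-robin over physical machines with 3 replicas per block, used only as an assumed stand-in; `round_robin` and `data_pattern_metrics` are hypothetical names.

```python
from collections import Counter
from statistics import pstdev


def round_robin(n_blocks, n_machines, replicas=3):
    """Assumed placement: the i-th replica overall goes to machine i mod p."""
    placement = []
    i = 0
    for _ in range(n_blocks):
        placement.append([(i + r) % n_machines for r in range(replicas)])
        i += replicas
    return placement


def data_pattern_metrics(placement, n_machines):
    """Compute the three data-pattern metrics (definitions are assumptions)."""
    # Distinct physical machines that hold each block's replicas.
    distinct = [len(set(machines)) for machines in placement]
    # Replicas stored on each physical machine (0 for unused machines).
    counts = Counter(m for machines in placement for m in machines)
    per_machine = [counts.get(m, 0) for m in range(n_machines)]
    return {
        "actualReplicaNum": sum(distinct) / len(distinct),
        "maxBlockNum": max(per_machine),
        "blockNumSigma": pstdev(per_machine),
    }


metrics = data_pattern_metrics(round_robin(8, 8), 8)
print(metrics)
```

Under these assumptions, p = 8 and n = 8 give actualReplicaNum = 3, maxBlockNum = 3, and blockNumSigma = 0, the same data-pattern values the slide reports.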
22
Implementation: Serpentine allocation is chosen. The location information of each virtual node is added to the network topology. For example, a rack path may be changed from /default-rack to /Phy0, or from /rack1 to /rack1/Phy0.
23
Implementation: This mechanism is easy to apply to Hadoop and keeps compatibility with native Hadoop. A special label starting with “Phy” identifies the locations of virtual machines.
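The labeling rule can be sketched as two small helpers. This is an illustration of the two path examples on the slides, not Hadoop code: `location_aware_path` and `physical_machine_of` are hypothetical names, and the rule "replace /default-rack, otherwise append" is an assumption read off those two examples.

```python
PHY_PREFIX = "Phy"            # special label prefix named on the slide
DEFAULT_RACK = "/default-rack"


def location_aware_path(rack_path, phy_index):
    """Return a topology path that also encodes the physical machine."""
    label = f"/{PHY_PREFIX}{phy_index}"
    if rack_path == DEFAULT_RACK:
        # No real rack information: the label replaces the default rack.
        return label
    # Real rack information: the label is appended below the rack.
    return rack_path.rstrip("/") + label


def physical_machine_of(path):
    """Return the 'PhyN' label of a path, or None for a native path."""
    last = path.rstrip("/").rsplit("/", 1)[-1]
    if last.startswith(PHY_PREFIX) and last[len(PHY_PREFIX):].isdigit():
        return last
    return None


print(location_aware_path("/default-rack", 0))
print(location_aware_path("/rack1", 0))
print(physical_machine_of("/rack1/Phy0"))
```

A path without a "Phy" label yields None, so an unmodified topology is still handled, which is one way to read the compatibility claim.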
24
Implementation: To maintain the block information for each virtual node, a list sorted by number of blocks is added to Hadoop's NameNode. An update takes two steps: first update the block number of the virtual node, then update its position in the sorted list.
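The two-step update can be sketched with a sorted list of (block count, node) pairs. This is a minimal illustration in Python, not the NameNode data structure itself: the class `BlockCountIndex` and its methods are hypothetical, and the `least_loaded` query is an assumed use of the list.

```python
import bisect


class BlockCountIndex:
    """Virtual nodes kept in a list sorted by their number of blocks."""

    def __init__(self, nodes):
        self._count = {node: 0 for node in nodes}
        self._sorted = sorted((0, node) for node in nodes)

    def add_block(self, node, delta=1):
        old = self._count[node]
        new = old + delta
        # Step 1: update the block number of the virtual node.
        self._count[node] = new
        # Step 2: update its position in the sorted list.
        pos = bisect.bisect_left(self._sorted, (old, node))
        del self._sorted[pos]
        bisect.insort(self._sorted, (new, node))

    def least_loaded(self):
        """Return the node that currently holds the fewest blocks."""
        return self._sorted[0][1]


index = BlockCountIndex(["vm0", "vm1", "vm2"])
index.add_block("vm0")
index.add_block("vm1")
print(index.least_loaded())
```

Keeping the list sorted on every update makes the least-loaded node available at the head of the list without a full scan.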
25
Evaluation: A simulation compares the new strategy (serpentine allocation) with Hadoop's original strategy. Parameters: n = 256; p = [8, 16, 32, 64, 128, 256]; the sampling number is set to 1,000,000.
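A sampling simulation of this kind can be sketched as follows. The sketch estimates only actualReplicaNum and rests on loud assumptions: replicas are placed on distinct virtual machines chosen uniformly at random, which is a simplification and not Hadoop's actual placement policy, and the number of virtual machines per physical machine, `v`, is a hypothetical parameter that the slides do not state.

```python
import random


def sample_actual_replica_num(p, v, replicas=3, samples=10_000, seed=0):
    """Estimate the average number of distinct physical machines per block.

    p: physical machines; v: virtual machines per physical machine (assumed).
    Virtual machine k is taken to run on physical machine k // v.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        # Assumed placement: distinct VMs chosen uniformly at random.
        vms = rng.sample(range(p * v), replicas)
        total += len({vm // v for vm in vms})
    return total / samples


for p in [8, 16, 32]:
    print(p, sample_actual_replica_num(p, v=4))
```

With v = 1 every virtual machine is its own physical machine, so the estimate is exactly 3. With v > 1 it can fall below 3, which is the loss of redundancy that a location-aware strategy is meant to avoid.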
26
Evaluation: Comparison of maxBlockNum between Hadoop’s original strategy and the new strategy, using sampling.
27
Evaluation: Comparison of actualReplicaNum between the original and new strategies.
28
Evaluation: Comparison of blockNumSigma between the original and new strategies.
29
Evaluation: Comparison of maxAssignedNum between the original and new strategies.
30
Evaluation: Comparison of assignedNumSigma between the original and new strategies.
31
Experiment: n = 224, p = 8, sampling number = 1,000,000
Metric (average)      Original    New
actualReplicaNum      2.0657      3
maxBlockNum           90.5798     84
blockNumSigma         4.1722      0
maxAssignedNum        33.7660     34.5946
assignedNumSigma      3.6256      4.14939
32
Experiment: Execution time of RandomWriter. Red: SC off. Blue: SC on.
33
Experiment: Execution time of TextSort. Red: SC off. Blue: SC on.
34
Experiment: Execution time of WordCount. Red: SC off. Blue: SC on.
35
Conclusion: The paper addresses the problems of data allocation and their impact on a MapReduce system. It builds a model and evaluation metrics to evaluate the data pattern and task pattern. It proposes a new strategy for file block allocation in Hadoop. Simulation and real experiment results support the new allocation strategy.