
Location-aware MapReduce in Virtual Cloud 2011 IEEE computer society International Conference on Parallel Processing Yifeng Geng1,2, Shimin Chen3, YongWei.


1 Location-aware MapReduce in Virtual Cloud  2011 IEEE Computer Society International Conference on Parallel Processing  Yifeng Geng1,2, Shimin Chen3, YongWei Wu1*, Ryan Wu3, Guangwen Yang1,2, Weimin Zheng1  Reporter: Yu Chih Lin

2 Outline  Introduction  Background  Model and New Strategy  Implementation  Experiment  Conclusion

3 Introduction  MapReduce is an important programming model for processing and generating large data sets  Commonly used in applications such as web indexing, data mining, and machine learning

4 Introduction  Multi-core CPUs support virtualization technology: two or more virtual machines (VMs) run simultaneously; the VMs share the I/O resources among users  MapReduce is set up on a distributed file system: Google uses GFS; Hadoop uses HDFS

5 Introduction  Running MapReduce in a virtual environment raises three major problems: disk sharing results in unbalanced data distribution and unbalanced workload; data imbalance and load imbalance cause I/O interference; disk sharing reduces data redundancy

6 Introduction  Purpose of this paper: abstract a model; define evaluation metrics; analyze the data pattern and task pattern  For Hadoop, propose a location-aware file block allocation strategy

7 Introduction  Three main benefits of the paper’s strategy: MapReduce’s workload is more balanced; I/O interference is reduced and HDFS’s performance improves; data redundancy is retained

8 Background  Traditional I/O interference is of two kinds: disk interference, when multiple processes try to access the same disk simultaneously; network interference, which mainly concerns latency and throughput

9 Background  I/O virtualization is of two kinds: KVM; paravirtualization  Virtual machines share CPUs and memory well, but not I/O.

10 Background  Virtualized Hadoop architecture

11 Model and New Strategy  Build a generation model to analyze different allocation strategies: data pattern; task pattern  To simplify the analysis, make four assumptions

12 Model and New Strategy  All hosts use the same I/O devices and run the same number of virtual machines on each physical machine  All the virtual machines are in a local area network and the network topology is flat  The workload can be randomly assigned to any virtual machine without limitation  All file blocks have the same size

13 Model and New Strategy

14

15

16

17

18  A new allocation strategy: place the replicas of a file block on different physical machines; keep the number of blocks on each physical machine balanced  Present two intuitive ways: round-robin allocation; serpentine allocation

19 Model and New Strategy  For example, take p = 8, n = 8 (p: physical machines, n: file blocks)  An example of round-robin allocation
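The figure for this slide did not survive in the transcript, so the following is only a minimal sketch of what a round-robin allocation could look like for p = 8 and n = 8. It assumes three replicas per block (consistent with actualReplicaNum = 3 reported later in the deck) and assumes that consecutive replicas simply go to consecutive physical machines, wrapping around. The function name and parameters are hypothetical and are not taken from the paper.

```python
def round_robin_allocation(num_blocks, num_machines, replicas=3):
    """Return, for each block, the physical machines holding its replicas.

    Assumption: replicas are dealt out to machines 0, 1, 2, ... in order,
    wrapping around after the last machine.
    """
    if replicas > num_machines:
        raise ValueError("need at least as many machines as replicas")
    placement = []
    slot = 0  # running counter over all replicas of all blocks
    for _ in range(num_blocks):
        # Consecutive slots modulo num_machines are always distinct
        # as long as replicas <= num_machines.
        machines = [(slot + r) % num_machines for r in range(replicas)]
        placement.append(machines)
        slot += replicas
    return placement


placement = round_robin_allocation(num_blocks=8, num_machines=8)
for block, machines in enumerate(placement):
    print(f"block {block}: machines {machines}")
```

Under these assumptions every block lands on three different physical machines and each machine ends up with exactly three replicas, which is the balance property the strategy aims for.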

20  For example, take p = 8, n = 8 (p: physical machines, n: file blocks)  An example of serpentine allocation

21 Model and New Strategy  Evaluation metrics for data pattern: actualReplicaNum = 3, maxBlockNum = 3, blockNumSigma = 0  Results for task patterns, averaged over enumeration  Round-robin allocation: maxAssignedNum = 2.2724, assignedNumSigma = 0.7943  Serpentine allocation: maxAssignedNum = 2.2705, assignedNumSigma = 0.79323
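The slides that defined these metrics are missing from the transcript, so the sketch below relies on assumed definitions inferred from the metric names: actualReplicaNum as the average number of distinct physical machines holding a block, maxBlockNum as the largest number of replicas stored on any one physical machine, and blockNumSigma as the population standard deviation of those per-machine counts. The function name is hypothetical, and the sample placement is an assumed round-robin layout with three replicas per block.

```python
from statistics import pstdev


def data_pattern_metrics(placement, num_machines):
    """Compute (actualReplicaNum, maxBlockNum, blockNumSigma).

    placement[b] lists the physical machine of every replica of block b;
    a machine may appear twice if two VMs on it hold the same block.
    The metric definitions here are assumptions, not the paper's text.
    """
    per_machine = [0] * num_machines
    distinct_total = 0
    for machines in placement:
        distinct_total += len(set(machines))
        for m in machines:
            per_machine[m] += 1
    actual_replica_num = distinct_total / len(placement)
    return actual_replica_num, max(per_machine), pstdev(per_machine)


# Assumed round-robin layout for p = 8, n = 8, three replicas per block.
balanced = [[(3 * b + r) % 8 for r in range(3)] for b in range(8)]
print(data_pattern_metrics(balanced, num_machines=8))

# A block whose first two replicas share one physical machine
# only has two physically independent copies.
print(data_pattern_metrics([[0, 0, 1]], num_machines=8)[0])
```

With these assumed definitions the balanced layout gives 3, 3 and 0, matching the data-pattern values on the slide, while the second call shows how disk sharing lowers the effective replica count.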

22 Implementation  Choose serpentine allocation  Add the location information of each virtual node to the network topology  For example, a rack path on one physical machine may change from /default-rack to /Phy0  With an existing rack, the path may change from /rack1 to /rack1/Phy0

23 Implementation  The mechanism is easy to apply to Hadoop: it keeps compatibility with native Hadoop; it uses a special label starting with “Phy”; the label identifies the locations of virtual machines
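The labelling convention can be illustrated with two small helpers. This is a sketch based only on the two path examples given in the slides (/default-rack becomes /Phy0, /rack1 becomes /rack1/Phy0); the function names are hypothetical and do not correspond to Hadoop APIs or to the authors' code.

```python
def add_phy_label(rack_path, phy_id):
    """Attach a physical-machine label to a rack path.

    Follows the slide examples: the default rack is replaced by the
    label, while any other rack keeps its path and gains a sub-level.
    """
    label = f"/Phy{phy_id}"
    if rack_path == "/default-rack":
        return label
    return rack_path.rstrip("/") + label


def physical_machine_of(location):
    """Return the 'Phy...' label of a location, or None if it has none."""
    last = location.rstrip("/").split("/")[-1]
    return last if last.startswith("Phy") else None


print(add_phy_label("/default-rack", 0))   # /Phy0
print(add_phy_label("/rack1", 0))          # /rack1/Phy0
print(physical_machine_of("/rack1/Phy0"))  # Phy0
print(physical_machine_of("/rack1"))       # None
```

Because unlabelled paths are passed through as ordinary rack locations, a cluster without “Phy” labels behaves as before, which is one plausible reading of the compatibility claim.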

24 Implementation  To maintain the block information for each virtual node, add a list sorted by number of blocks to Hadoop’s NameNode  On each update, first update the block number of the virtual node, then update its position in the sorted list
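The bookkeeping described here can be sketched as a small structure that keeps nodes ordered by block count. This is an illustrative Python model only, not the NameNode's actual Java data structure; the class and method names are hypothetical.

```python
import bisect


class BlockCountIndex:
    """Keep virtual nodes sorted by the number of blocks they hold."""

    def __init__(self, nodes):
        self._count = {node: 0 for node in nodes}
        # (count, node) tuples are unique, so bisect can locate them exactly.
        self._sorted = sorted((0, node) for node in nodes)

    def add_block(self, node):
        old = self._count[node]
        # Locate and drop the node's current entry in the sorted list.
        i = bisect.bisect_left(self._sorted, (old, node))
        del self._sorted[i]
        # Step 1: update the node's block number.
        self._count[node] = old + 1
        # Step 2: re-insert it at its new position in the sorted list.
        bisect.insort(self._sorted, (old + 1, node))

    def least_loaded(self, k=1):
        """Return the k nodes that currently hold the fewest blocks."""
        return [node for _, node in self._sorted[:k]]


index = BlockCountIndex(["vm0", "vm1", "vm2"])
index.add_block("vm0")
index.add_block("vm0")
index.add_block("vm1")
print(index.least_loaded(3))
```

Keeping the list sorted makes it cheap to pick the least-loaded nodes when placing a new block, at the cost of a list shift on every update; a heap or balanced tree would be an alternative design.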

25 Evaluation  Simulation compares the new strategy (serpentine allocation) with Hadoop’s original strategy  Parameters: n = 256; p = [8, 16, 32, 64, 128, 256]; sampling number = 1,000,000

26 Evaluation  Comparison of maxBlockNum between Hadoop’s original strategy and the new strategy, using sampling

27 Evaluation  Comparison of actualReplicaNum between the original and new strategies

28 Evaluation  Comparison of blockNumSigma between the original and new strategies

29 Evaluation  Comparison of maxAssignedNum between the original and new strategies

30 Evaluation  Comparison of assignedNumSigma between the original and new strategies

31 Experiment  n = 224, p = 8  Sampling number = 1,000,000
Metric (average)      Original    New
actualReplicaNum      2.0657      3
maxBlockNum           90.5798     84
blockNumSigma         4.1722      0
maxAssignedNum        33.7660     34.5946
assignedNumSigma      3.6256      4.14939

32 Experiment  Experiment results of RandomWriter’s execution time Red : SC off Blue : SC on

33 Experiment  Experiment results of TextSort’s execution time Red : SC off Blue : SC on

34 Experiment  Experiment results of WordCount’s execution time Red : SC off Blue : SC on

35 Conclusion  Address the problems of data allocation and its impact on the MapReduce system  Build a model and evaluation metrics to evaluate the data and task patterns  Propose a new strategy for file block allocation in Hadoop  Simulation and real experiment results show that the new allocation strategy is effective

