Location-aware MapReduce in Virtual Cloud  2011 IEEE Computer Society International Conference on Parallel Processing  Yifeng Geng, Shimin Chen, YongWei Wu, Ryan Wu, Guangwen Yang, Weimin Zheng  Reporter: Yu Chih Lin

Outline  Introduction  Background  Model and New Strategy  Implementation  Experiment  Conclusion

Introduction  MapReduce is an important programming model for processing and generating large data sets  Commonly used in applications such as web indexing, data mining, and machine learning

Introduction  Multi-core CPUs support virtualization technology: two or more virtual machines (VMs) run simultaneously and share the I/O resources among users  MapReduce is set up on a distributed file system: Google uses GFS, Hadoop uses HDFS

Introduction  Running MapReduce in a virtual environment raises three major problems: disk sharing results in unbalanced data distribution and unbalanced workload; data imbalance and load imbalance cause I/O interference; disk sharing reduces data redundancy

Introduction  Purpose of this paper: abstract a model, define evaluation metrics, and analyze the data pattern and task pattern  For Hadoop, propose a location-aware file block allocation strategy

Introduction  Three main benefits of the proposed strategy: MapReduce's workload is more balanced; I/O interference is reduced and HDFS performance improves; data redundancy is retained

Background  Traditional I/O interference comes in two kinds: disk interference, when multiple processes try to access the same disk simultaneously; network interference, which mainly concerns latency and throughput

Background  Two kinds of I/O virtualization are considered: KVM and paravirtualization  Virtual machines share CPUs and memory well, but not I/O

Background  Virtualized Hadoop architecture

Model and New Strategy  Build a general model to analyze different allocation strategies in terms of data pattern and task pattern  To simplify the analysis, make four assumptions

Model and New Strategy  All physical machines use the same I/O devices and host the same number of virtual machines  All virtual machines are in one local area network and the network topology is flat  Workload can be randomly assigned to any virtual machine without limitation  All file blocks have the same size

Model and New Strategy

Model and New Strategy  A new allocation strategy: place the replicas of a file block on different physical machines and keep the number of blocks on each physical machine balanced  Two intuitive ways to do this: round-robin allocation and serpentine allocation

Model and New Strategy  For example, take p = 8, n = 8 (p: number of physical machines, n: number of file blocks)  An example of round-robin allocation

Model and New Strategy  For example, take p = 8, n = 8 (p: number of physical machines, n: number of file blocks)  An example of serpentine allocation

Model and New Strategy  Evaluation metrics for the data pattern: actualReplicaNum = 3, maxBlockNum = 3, blockNumSigma = 0  Averages over the enumerated task patterns  Round-robin allocation: maxAssignedNum = 2.2724, assignedNumSigma = 0.7943  Serpentine allocation: maxAssignedNum = 2.2705, assignedNumSigma = 0.79323
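The data-pattern values above can be checked with a minimal sketch. The slides do not define the metrics, so this assumes that actualReplicaNum is the average number of distinct physical machines holding a block's replicas, maxBlockNum is the largest replica count on any physical machine, and blockNumSigma is the population standard deviation of those counts. The round-robin layout used here (replica slots handed to machines 0..p-1 in cyclic order) and all function names are also illustrative assumptions, not the authors' code.

```python
from statistics import pstdev


def round_robin_allocation(p, n, replicas=3):
    """Assumed layout: replica slots go to machines 0..p-1 in cyclic order."""
    return [[(b * replicas + r) % p for r in range(replicas)] for b in range(n)]


def data_pattern_metrics(allocation, p):
    """Compute the three data-pattern metrics under the assumed definitions."""
    counts = [0] * p          # replicas stored on each physical machine
    distinct = []             # distinct physical machines per block
    for machines in allocation:
        distinct.append(len(set(machines)))
        for m in machines:
            counts[m] += 1
    return {
        "actualReplicaNum": sum(distinct) / len(distinct),
        "maxBlockNum": max(counts),
        "blockNumSigma": pstdev(counts),
    }


metrics = data_pattern_metrics(round_robin_allocation(p=8, n=8), p=8)
print(metrics)
```

With p = 8 and n = 8, the 24 replica slots spread evenly over the 8 machines, so under these assumptions the sketch gives the same three values as the slide (3, 3, 0). The serpentine order and the task-pattern metrics are not modeled here.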

Implementation  Choose serpentine allocation  Add the location information of each virtual node to the network topology  For example, a location in the default rack may be changed from /default-rack to /Phy0  For example, a location in a named rack may be changed from /rack1 to /rack1/Phy0
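The relabeling above can be sketched as a small mapping function. The host names, the VM_LOCATION table, and both function names are hypothetical; the slides do not say how the authors obtain the VM-to-host mapping, nor whether stock Hadoop accepts these paths unchanged.

```python
# Hypothetical table: VM host name -> (rack path, index of its physical machine).
VM_LOCATION = {
    "vm0": ("/default-rack", 0),
    "vm1": ("/default-rack", 0),
    "vm2": ("/rack1", 0),
    "vm3": ("/rack1", 1),
}


def network_location(host):
    """Return the topology path of a VM, extended with a 'Phy' label."""
    rack, phy = VM_LOCATION.get(host, ("/default-rack", None))
    if phy is None:
        return rack                      # unknown host: leave the path as-is
    if rack == "/default-rack":
        return f"/Phy{phy}"              # /default-rack -> /Phy0
    return f"{rack}/Phy{phy}"            # /rack1 -> /rack1/Phy0


def physical_label(location):
    """Extract the 'Phy...' component, or None if the path has no such label."""
    last = location.rsplit("/", 1)[-1]
    return last if last.startswith("Phy") else None


for host in ("vm0", "vm2", "vm3"):
    loc = network_location(host)
    print(host, loc, physical_label(loc))
```

Two VMs share a disk exactly when physical_label returns the same value for both within the same rack, which is the information a location-aware allocator needs.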

Implementation  The mechanism is easy to apply to Hadoop: it keeps compatibility with native Hadoop, and a special label starting with "Phy" identifies the location of each virtual machine

Implementation  To maintain the block information for each virtual node, the Hadoop NameNode keeps a list sorted by number of blocks  On an update, first update the block number of the virtual node, then update its position in the sorted list
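The two-step update can be illustrated with a minimal sketch in Python. The class name BlockCountIndex and its methods are hypothetical; the actual NameNode change is Java code that the slides do not show.

```python
import bisect


class BlockCountIndex:
    """Nodes kept in a list sorted by (block count, node name)."""

    def __init__(self, nodes):
        self.count = {node: 0 for node in nodes}
        self.order = sorted((0, node) for node in nodes)

    def add_block(self, node):
        old = self.count[node]
        # Step 1: update the block number of the node.
        self.count[node] = old + 1
        # Step 2: update its position in the sorted list.
        i = bisect.bisect_left(self.order, (old, node))
        del self.order[i]
        bisect.insort(self.order, (old + 1, node))

    def least_loaded(self, k):
        """Return the k nodes that currently hold the fewest blocks."""
        return [node for _, node in self.order[:k]]


index = BlockCountIndex(["vm0", "vm1", "vm2", "vm3"])
for node in index.least_loaded(3):   # place three replicas of one block
    index.add_block(node)
print(index.order)
print(index.least_loaded(1))
```

Keeping the list sorted means the least-loaded candidates are always at the front, so choosing targets for a new block does not require scanning every node.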

Evaluation  Simulation compares the new strategy (serpentine allocation) with Hadoop's original strategy  Parameters: n = 256; p = 8, 16, 32, 64, 128, 256; sampling number = 1,000,000
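A much-simplified simulation in the spirit of this comparison is sketched below. It is not the authors' simulator: the baseline places replicas on distinct VMs chosen uniformly at random (a stand-in, not Hadoop's actual placement policy), the location-aware side greedily picks the least-loaded physical machines rather than following a serpentine order, and the value of two VMs per physical machine is an assumption. It is therefore not expected to reproduce the numbers on the slides.

```python
import random
from statistics import mean


def random_vm_placement(n, p, vms_per_host, replicas, rng):
    """Stand-in baseline: distinct VMs chosen uniformly, mapped to their hosts."""
    vms = range(p * vms_per_host)
    return [[vm // vms_per_host for vm in rng.sample(vms, replicas)]
            for _ in range(n)]


def location_aware_placement(n, p, replicas):
    """Greedy sketch: each block goes to the least-loaded physical machines."""
    counts = [0] * p
    allocation = []
    for _ in range(n):
        chosen = sorted(range(p), key=lambda m: (counts[m], m))[:replicas]
        for m in chosen:
            counts[m] += 1
        allocation.append(chosen)
    return allocation


def summarize(allocation, p):
    """Return (average distinct hosts per block, max replicas on one host)."""
    counts = [0] * p
    for hosts in allocation:
        for h in hosts:
            counts[h] += 1
    return mean(len(set(hosts)) for hosts in allocation), max(counts)


rng = random.Random(0)
baseline = summarize(random_vm_placement(224, 8, 2, 3, rng), 8)
aware = summarize(location_aware_placement(224, 8, 3), 8)
print("random VM placement:", baseline)
print("location-aware placement:", aware)
```

The point of the sketch is the qualitative contrast: when placement ignores physical location, some blocks end up with two replicas on the same physical machine and the per-machine load is uneven, while the location-aware variant keeps three distinct machines per block and an even load.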

Evaluation  Comparison of maxBlockNum between Hadoop's original strategy and the new strategy, using sampling

Evaluation  Comparison of actualReplicaNum between the original and new strategies

Evaluation  Comparison of blockNumSigma between the original and new strategies

Evaluation  Comparison of maxAssignedNum between the original and new strategies

Evaluation  Comparison of assignedNumSigma between the original and new strategies

Experiment  n = 224, p = 8; sampling number = 1,000,000
Metric (average)      Original    New
actualReplicaNum      2.0657      3
maxBlockNum           90.5798     84
blockNumSigma         4.1722      0
maxAssignedNum        33.7660     34.5946
assignedNumSigma      3.6256      4.14939

Experiment  RandomWriter execution time (red: SC off, blue: SC on)

Experiment  TextSort execution time (red: SC off, blue: SC on)

Experiment  WordCount execution time (red: SC off, blue: SC on)

Conclusion  Addresses the problems of data allocation and their impact on MapReduce systems  Builds a model and evaluation metrics to evaluate the data pattern and task pattern  Proposes a new strategy for file block allocation in Hadoop  Simulation and real experiment results support the new allocation strategy
