Location-aware MapReduce in Virtual Cloud 2011 IEEE computer society International Conference on Parallel Processing Yifeng Geng1,2, Shimin Chen3, YongWei.

Presentation transcript:

Location-aware MapReduce in Virtual Cloud
2011 IEEE Computer Society International Conference on Parallel Processing
Yifeng Geng1,2, Shimin Chen3, YongWei Wu1*, Ryan Wu3, Guangwen Yang1,2, Weimin Zheng1
Reporter: Yu Chih Lin

Outline
• Introduction
• Background
• Model and New Strategy
• Implementation
• Experiment
• Conclusion

Introduction
• MapReduce is an important programming model for
  - processing large data sets
  - generating large data sets
• Commonly used in applications such as
  - web indexing
  - data mining
  - machine learning

Introduction
• Multi-core CPUs support virtualization technology
  - Run two or more virtual machines (VMs) simultaneously
  - Share the I/O resources among users
• MapReduce is set up on a distributed file system
  - Google uses GFS
  - Hadoop uses HDFS

Introduction
• Running MapReduce in a virtual environment raises three major problems
  - Disk sharing results in unbalanced data distribution and unbalanced workload
  - I/O interference is caused by data imbalance and load imbalance
  - Disk sharing reduces data redundancy

Introduction
• Purpose of this paper
  - Abstract a model
  - Define evaluation metrics
  - Analyze the data pattern and task pattern
• Propose a location-aware file block allocation strategy for Hadoop

Introduction
• Three main benefits of the paper's strategy
  - MapReduce's workload is more balanced
  - I/O interference is reduced and HDFS performance improves
  - Data redundancy is retained

Background
• Traditional I/O interference comes in two kinds
  - Disk interference: multiple processes try to access the same disk simultaneously
  - Network interference: mainly concerns latency and throughput

Background
• Two kinds of I/O virtualization are considered
  - KVM
  - Paravirtualization
• Virtual machines share CPUs and memory well, but not I/O

Background
• Virtualized Hadoop architecture

Model and New Strategy
• Build a general model to analyze different allocation strategies
  - Data pattern
  - Task pattern
• To simplify the analysis, make four assumptions

Model and New Strategy
• All physical machines use the same I/O devices and host the same number of virtual machines
• All virtual machines are on a local area network with a flat topology
• Workload can be assigned randomly to any virtual machine without restriction
• All file blocks have the same size

Model and New Strategy
• A new allocation strategy
  - Place the replicas of a file block on different physical machines
  - Keep the number of blocks on each physical machine balanced
• Two intuitive ways to achieve this
  - Round-robin allocation
  - Serpentine allocation

Model and New Strategy
• For example, take p = 8, n = 8 (p: physical machines, n: file blocks)
• An example of round-robin allocation

• For example, take p = 8, n = 8 (p: physical machines, n: file blocks)
• An example of serpentine allocation

Model and New Strategy
• Evaluation metrics for the data pattern: actualReplicaNum = 3, maxBlockNum = 3, blockNumSigma = 0
• Enumerated average results for task patterns
  - Round-robin allocation: maxAssignedNum = 2.2724, assignedNumSigma = [value missing]
  - Serpentine allocation: maxAssignedNum = 2.2705, assignedNumSigma = [value missing]
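
The data-pattern values above can be checked for the round-robin case with a short sketch. It assumes three replicas per block, dealt to consecutive physical machines in cyclic order. It also reads actualReplicaNum as the average number of distinct physical machines holding a block, maxBlockNum as the largest per-machine replica count, and blockNumSigma as the population standard deviation of those counts. The function names and these readings are assumptions, since the slides do not define the metrics formally. Serpentine allocation is left out because the slides do not spell out its traversal order.

```python
from collections import Counter
from statistics import pstdev


def round_robin(p, n, replicas=3):
    """Deal replicas to physical machines 0..p-1 in cyclic order (assumed reading)."""
    placement, nxt = [], 0
    for _ in range(n):
        # Consecutive machines are distinct as long as replicas <= p.
        placement.append([(nxt + r) % p for r in range(replicas)])
        nxt = (nxt + replicas) % p
    return placement


def data_pattern_metrics(placement, p):
    """Return (actualReplicaNum, maxBlockNum, blockNumSigma) under the assumed definitions."""
    counts = Counter(m for hosts in placement for m in hosts)
    per_machine = [counts.get(m, 0) for m in range(p)]
    # Average number of distinct physical machines per block.
    actual = sum(len(set(hosts)) for hosts in placement) / len(placement)
    return actual, max(per_machine), pstdev(per_machine)


placement = round_robin(p=8, n=8)
print(data_pattern_metrics(placement, p=8))
```

With p = 8 and n = 8, the 24 replicas cover the eight machines exactly three times, which is why the three metrics come out as 3, 3, and 0.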

Implementation
• Choose serpentine allocation
• Add the location information of each virtual node to the network topology
• For example, the rack of a physical machine may be changed from /default-rack to /Phy0
• For example, where a rack is already defined, it may be changed from /rack1 to /rack1/Phy0
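
The relabelling in the two examples above can be sketched as a pair of small helpers. This is an illustration of the naming scheme only, not Hadoop code: `add_physical_label`, `physical_machine`, and `same_physical_machine` are hypothetical names, and the rule that /default-rack is replaced while any other rack path is extended is inferred from the two examples on the slide.

```python
PHY_PREFIX = "Phy"  # label prefix named on the slides


def add_physical_label(rack_path, phy_id):
    """Attach a physical-machine label to a rack path (hypothetical helper)."""
    label = f"/{PHY_PREFIX}{phy_id}"
    if rack_path == "/default-rack":
        # The slide shows the default rack being replaced outright.
        return label
    # Any other rack keeps its path and gains a trailing label.
    return rack_path.rstrip("/") + label


def physical_machine(path):
    """Return the 'PhyN' component of a topology path, or None if absent."""
    for part in path.split("/"):
        if part.startswith(PHY_PREFIX):
            return part
    return None


def same_physical_machine(path_a, path_b):
    """Two virtual nodes are co-located when their labelled paths are identical."""
    return physical_machine(path_a) is not None and path_a == path_b


print(add_physical_label("/default-rack", 0))
print(add_physical_label("/rack1", 0))
```

Paths without a label starting with "Phy" are treated as ordinary racks, which is consistent with the compatibility with native Hadoop that the slides describe.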

Implementation
• This mechanism is easy to adopt in Hadoop
  - It keeps compatibility with native Hadoop
  - It uses a special label starting with "Phy"
  - The label identifies the locations of virtual machines

Implementation
• To maintain the block information for each virtual node
  - In the Hadoop NameNode, add a list of nodes sorted by number of blocks
• On each update
  - First, update the block number of the virtual node
  - Second, update its position in the sorted list
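
The two-step update above can be sketched with a sorted list of (block count, node) pairs. This is a minimal illustration rather than the NameNode's actual data structure. The class name `BlockCountIndex` and its methods are hypothetical, and the `least_loaded` query is an assumed use of the list that the slides do not describe.

```python
import bisect


class BlockCountIndex:
    """Virtual nodes kept in ascending order of block count (illustrative only)."""

    def __init__(self, nodes):
        self.count = {node: 0 for node in nodes}
        self.order = sorted((0, node) for node in nodes)

    def add_block(self, node):
        old = self.count[node]
        # Step 1: update the block number of the virtual node.
        self.count[node] = old + 1
        # Step 2: update its position in the sorted list.
        i = bisect.bisect_left(self.order, (old, node))
        self.order.pop(i)
        bisect.insort(self.order, (old + 1, node))

    def least_loaded(self, k):
        """Return the k nodes currently holding the fewest blocks."""
        return [node for _, node in self.order[:k]]


index = BlockCountIndex(["vm0", "vm1", "vm2"])
index.add_block("vm0")
print(index.least_loaded(2))
```

Keeping the list sorted means the least-loaded nodes are always at the front, so a balancing allocator can read them off without scanning every node.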

Evaluation
• Simulation compares the new strategy (serpentine allocation) with Hadoop's original strategy
• Parameter settings
  - n = 256
  - p = [8, 16, 32, 64, 128, 256]
  - sampling number = 1,000,000
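
A sampling experiment of this kind can be sketched as follows. This is a simplified stand-in, not the authors' simulator. It models location-unaware placement as three distinct virtual machines drawn uniformly at random per block, and it reads actualReplicaNum as the number of distinct physical machines behind them. The number of virtual machines per physical machine (`vms_per_host = 2`) and the reduced sample count are assumptions, since the slides give neither.

```python
import random


def avg_actual_replica_num(p, vms_per_host, n, samples, replicas=3, seed=0):
    """Average distinct physical machines per block under random VM-level placement."""
    rng = random.Random(seed)
    total_vms = p * vms_per_host
    distinct_hosts = 0
    for _ in range(samples):
        for _ in range(n):
            # Location-unaware choice: any distinct VMs, regardless of host.
            vms = rng.sample(range(total_vms), replicas)
            # VM i is assumed to run on physical machine i // vms_per_host.
            distinct_hosts += len({vm // vms_per_host for vm in vms})
    return distinct_hosts / (samples * n)


for p in (8, 16, 32):
    print(p, avg_actual_replica_num(p, vms_per_host=2, n=256, samples=100))
```

Whenever two of the chosen virtual machines share a physical machine, the block effectively has fewer than three independent replicas, which is the loss of redundancy that a location-aware strategy is meant to avoid.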

Evaluation
• Comparison of maxBlockNum between Hadoop's original strategy and the new strategy, using sampling

Evaluation
• Comparison of actualReplicaNum between the original and new strategies

Evaluation
• Comparison of blockNumSigma between the original and new strategies

Evaluation
• Comparison of maxAssignedNum between the original and new strategies

Evaluation
• Comparison of assignedNumSigma between the original and new strategies

Experiment
• n = 224, p = 8
• Sampling number = 1,000,000
• Table of averages for the original and new strategies (values missing): actualReplicaNum, maxBlockNum, blockNumSigma, maxAssignedNum, assignedNumSigma

Experiment
• Experiment results: RandomWriter execution time (red: SC off, blue: SC on)

Experiment
• Experiment results: TextSort execution time (red: SC off, blue: SC on)

Experiment
• Experiment results: WordCount execution time (red: SC off, blue: SC on)

Conclusion
• Addresses the problem of data allocation and its impact on MapReduce systems
• Builds a model and evaluation metrics to evaluate data and task patterns
• Proposes a new strategy for file block allocation in Hadoop
• Simulation and real experiment results show that the new allocation strategy is effective