A Parallel Computational Model for Heterogeneous Clusters Jose Luis Bosque, Luis Pastor, IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 17, NO. 12, DECEMBER 2006 Presented by 張肇烜

Outline: Introduction, Heterogeneous LogGP, HLogGP Validation, Experimental Results, Conclusions

Introduction During the last decade, Beowulf clusters have achieved widespread dissemination and acceptance. However, the design and implementation of efficient parallel algorithms for clusters is still a problematic issue.

Introduction (cont.) In this paper, a new heterogeneous parallel computational model based on the LogGP model is proposed.

Heterogeneous LogGP Reasons for selecting the LogGP model:
- The architecture LogGP assumes is very similar to a cluster.
- LogGP removes the synchronization points needed in other models.
- LogGP allows considering both short and long messages.

Heterogeneous LogGP (cont.)
- LogGP assumes finite network capacity, avoiding situations where the network becomes a bottleneck.
- The model encourages techniques that yield good results in practice, such as designing algorithms with balanced communication patterns.

Heterogeneous LogGP (cont.) HLogGP definition:
- Latency, L: communication latency depends on both network technology and topology.
- The latency matrix of a heterogeneous cluster can be defined as a square matrix L = {l_{1,1}, …, l_{m,m}}.

Heterogeneous LogGP (cont.)
- Overhead, o: the time needed by a processor to send or receive a message.
- Sender overhead vector O_s = {os_1, …, os_m}; receiver overhead vector O_r = {or_1, …, or_m}.
- Gap between messages, g: this parameter reflects each node's proficiency at sending consecutive short messages, captured by a gap vector g = {g_1, …, g_m}.

Heterogeneous LogGP (cont.)
- Gap per byte, G: the gap per byte depends on network technology.
- In a heterogeneous network a message can cross different switches with different bandwidths, so the parameter becomes a gap matrix G = {G_{1,1}, …, G_{m,m}}.

Heterogeneous LogGP (cont.)
- Computational power, P_i: the number of nodes cannot be used in a heterogeneous model for measuring the system's computational power.
- Instead, a computational power vector P = {P_1, …, P_m} is defined.
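
Taken together, the five parameters can be collected into a single structure. Below is a minimal C sketch of such a container; the type and field names (hloggp_t, o_s, o_r, and so on) are illustrative assumptions, not identifiers from the paper.

```c
/* Illustrative container for the HLogGP parameters of an m-node cluster.
 * All names are hypothetical; the paper defines the math, not this API. */
#define M 8            /* cluster size, chosen arbitrarily for the sketch */

typedef struct {
    double L[M][M];    /* latency matrix: l_{i,j} between nodes i and j   */
    double o_s[M];     /* sender overhead vector O_s = {os_1, ..., os_m}  */
    double o_r[M];     /* receiver overhead vector O_r = {or_1, ..., or_m}*/
    double g[M];       /* gap-between-messages vector g = {g_1, ..., g_m} */
    double G[M][M];    /* gap-per-byte matrix G = {G_{1,1}, ..., G_{m,m}} */
    double P[M];       /* computational power vector P = {P_1, ..., P_m}  */
} hloggp_t;
```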

HLogGP Validation Cluster description: [figure: a heterogeneous cluster combining slow (S) and fast (F) nodes interconnected through 10 Mbps and 100 Mbps networks]

HLogGP Validation (cont.) Benchmark 1: [source code figure]

HLogGP Validation (cont.) Benchmark 2: Source code of the benchmark for measuring the gap between messages.
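
The benchmark source itself did not survive the transcript. As a rough reconstruction of the idea, assuming MPI (the usual choice for such cluster benchmarks), the sketch below sends a burst of short messages back-to-back and reports the average inter-send time as an estimate of g. The constants and structure are my assumptions, not the authors' code.

```c
/* Hypothetical gap-between-messages benchmark: rank 0 sends a burst of
 * short messages back-to-back to rank 1; the average inter-send time
 * approximates the gap g. A sketch, not the paper's actual benchmark. */
#include <mpi.h>
#include <stdio.h>

enum { NMSG = 10000, MSG_SIZE = 1 };   /* short messages, long burst */

int main(int argc, char **argv) {
    int rank, size;
    char buf[MSG_SIZE] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2)
        MPI_Abort(MPI_COMM_WORLD, 1);  /* needs a sender and a receiver */

    if (rank == 0) {
        double t0 = MPI_Wtime();
        for (int i = 0; i < NMSG; i++)
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();
        printf("estimated gap g ~ %.3f us per message\n",
               1e6 * (t1 - t0) / NMSG);
    } else if (rank == 1) {
        for (int i = 0; i < NMSG; i++)
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
```

Run with at least two processes (e.g., mpirun -np 2). Message buffering makes any single send time unreliable, which is why benchmarks of this kind average over a long burst.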

HLogGP Validation (cont.) Overhead: [overhead measurement results figures]

HLogGP Validation (cont.) Latency: [measured results for switch-switch, hub-hub, and switch-hub links]

HLogGP Validation (cont.) Gap between messages:

HLogGP Validation (cont.) Gap per Byte:

HLogGP Validation (cont.) Computational power:

Experimental Results Three objectives were pursued in the tests presented here:
- To verify that HLogGP is accurate enough to predict the response time of a parallel program.
- To verify that heterogeneity has a strong impact on system performance.
- To show how the cluster parameterization may be used for determining the performance of a parallel program in a real application environment.

Experimental Results (cont.) A volumetric magnetic resonance image compression application was selected. The sequential process may be divided into the following stages:
- Data acquisition.
- Data read and memory allocation.
- Computation of the 3D Haar wavelet transform.

Experimental Results (cont.)  Thresholding.  Encoding of the subbands using the run- length encoding compression algorithm.  Write back of the compressed image.

Experimental Results (cont.) A theoretical analysis of the application's response time is presented.
- First stage: the master distributes the raw data among the slave processors in proportion to their computational power. If N is the total number of slices and P_T = P_1 + … + P_m is the cluster's total computational power, each slave i receives N_i = N · P_i / P_T slices (see the sketch below).
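
As a worked example of this distribution rule, the hypothetical helper below computes each slave's slice count from the power vector; the function name and the remainder policy are my choices, not the paper's.

```c
/* Proportional data distribution: slave i gets n_i = N * P_i / P_T slices,
 * where P_T is the cluster's total computational power. Giving rounding
 * leftovers to the last slave is an arbitrary choice for this sketch. */
void distribute_slices(int n_total, const double *P, int m, int *n_i) {
    double p_total = 0.0;
    for (int j = 0; j < m; j++)
        p_total += P[j];               /* P_T = sum of the power vector */

    int assigned = 0;
    for (int i = 0; i < m; i++) {
        n_i[i] = (int)((double)n_total * P[i] / p_total);
        assigned += n_i[i];
    }
    n_i[m - 1] += n_total - assigned;  /* hand rounding leftovers over */
}
```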

Experimental Results (cont.)  The total time for this stage is : Cycles of sending overhead to get the first byte into the network Subsequent bytes take G cycles to be sent Each byte travels through the network for cycles The receiving processor spends in receiving overhead

Experimental Results (cont.)
- Second stage: in this case, the response time is the time spent by the last slave to finish its work.
- The total response time for the second stage is estimated as the response time of a generic slave processor, i.e., that slave's share of the work divided by its computational power P_i.

Experimental Results (cont.)
- Third stage: the master process first has to gather the partial results produced by all of the slave processes.
- The total response time of the third phase is calculated as a summation of the individual slave-to-master transfer times.
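
One plausible reading of that summation, using the msg_time helper above and assuming the master receives the slaves' messages one after another; the serialization assumption is mine, and the paper's exact expression may differ.

```c
/* Hypothetical predicted third-stage time: the sum of slave-to-master
 * transfer times, with k[i] the size in bytes of slave i's result.
 * Assumes the transfers serialize at the master. */
double gather_time(int master, int m, const long *k, const hloggp_t *hp) {
    double t = 0.0;
    for (int i = 0; i < m; i++) {
        if (i == master) continue;        /* the master sends nothing here */
        t += msg_time(i, master, k[i], hp);
    }
    return t;
}
```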

Experimental Results (cont.)
- Fourth stage: the master process has to send an image subband to each of the slave processes.
- The total time for this stage is the corresponding sum of master-to-slave transfer times.

Experimental Results (cont.)
- Fifth stage: this stage is similar to the second, except that the amount of work is not distributed according to the nodes' computational power.
- The time for this stage can therefore only be given approximately.

Experimental Results (cont.)
- Sixth stage: this stage is similar to the third, but the message sizes cannot be determined a priori; the size k of each message is determined by the subband size.

Experimental Results (cont.) Execution Results: [execution results figures]

Conclusion In this paper, the HLogGP model for heterogeneous clusters has been proposed and validated. The model can be applied to heterogeneous clusters in which the nodes, the interconnection network, or both are heterogeneous.