Performance Evaluation of Parallel Algorithms on a Computational Grid Environment. Simona Blandino 1, Salvatore Cavalieri 2. 1 Consorzio COMETA, 2 Faculty of Engineering, University of Catania.

Presentation transcript:

The enormous advances made over the last decade in the field of Grid infrastructures make them very attractive for executing HPC parallel algorithms. As is well known, many factors may influence performance and scalability, among them the scheduling policy adopted at the broker level, the MPI library and implementation used, and the characteristics of the network connecting the computing resources. In this paper, a performance evaluation of the Grid infrastructure built by the COMETA Consortium under the PI2S2 project is presented. The evaluation has been carried out in terms of speedup, efficiency and scalability of parallel algorithms written with the MPI paradigm. Well-known benchmarks have been tested, among them the Matrix Multiplication (figure 1) and Pascal's Triangle (figure 2) algorithms. One aim of the evaluation was to point out the influence of the network used to interconnect the computing resources; in particular, the low-latency, high-bandwidth InfiniBand interconnect has been compared with Gigabit Ethernet. Another goal was to analyse the influence of the MPI library and its implementation; for this reason, different MPI implementations have been considered: MPICH, MVAPICH and MVAPICH2. The main results achieved are shown and the relevant conclusions drawn; among them, the higher parallel speedup obtained with InfiniBand over Gigabit Ethernet, the minimal communication overhead offered by MVAPICH, and the superlinear speedup effect.

[Table: number of elements vs. data size (KB to MB) for each test case.]
[Figure 1: Matrix Multiplication algorithm. Figure 2: Pascal's Triangle algorithm.]

Let us focus on the MVAPICH implementation and test different problem sizes. The bigger the problem, the higher the achievable superlinear speedup. The graph shows the speedup obtained with different data distributions on a growing number of resources: given a problem size, we can find the number of resources that yields superlinear speedup.

Case A / Case B. We show the results of two different communication strategies, a point-to-point implementation (case A) and a collective implementation (case B), considering the smallest and the largest of the test data sizes; a minimal sketch of the two strategies is given at the end of this transcript.

This algorithm shows a superlinear speedup effect, with efficiency values higher than one. How is this possible, and what are the potential causes? MVAPICH has the best performance. Employing a higher number of processing units, not only the available main memory but also the cache memory grows; if the problem fits into the aggregated cache, the resulting speedup can be superlinear. In both cases MVAPICH performed best compared with the other available implementations. MVAPICH is a library specifically designed for an InfiniBand interconnect; it offers minimal communication overhead, allowing superior scalability on the infrastructure. On a distributed-memory architecture, the different access speeds of the memory levels can lead to superlinear speedup effects; this speedup can be explained once a realistic computational model that accounts for the full memory hierarchy and its access times is adopted.
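As a brief recap, the standard definitions of the metrics used above are, for p CPUs,

S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p},

where T(p) is the execution time on p CPUs. Superlinear speedup is the case S(p) > p, i.e. E(p) > 1, which a simple computational model rules out but which becomes possible once the aggregated cache of the p processors holds the whole working set.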
The L2 cache size per CPU in the COMETA Consortium Grid infrastructure is 1 MB. If the problem fits in a single processor's L2 cache there is no superlinear effect; as the problem size grows, the peak moves to the right, towards the minimum number of CPUs whose aggregated cache can hold the whole problem. The algorithm is scalable up to 16 CPUs; beyond that, a considerable communication overhead cancels the benefits of parallelism. According to Gustafson, an algorithm is scalable when the execution time stays fixed as the problem size and the number of CPUs grow. If a massively parallel computation is not efficient for a given code-problem pair, it can still be efficient for the same code on a larger problem size. The aim of parallelism is to maximise throughput while keeping the execution time constant. When the input data size is small, using a larger number of processors reduces the speedup, because the increased overhead outweighs the advantages of parallelisation.
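Gustafson's observation can be written compactly as the scaled-speedup law (standard formulation, with s the serial fraction of the run time measured on the parallel system):

S(p) = s + p\,(1 - s) = p - s\,(p - 1).

If the problem size grows with p so that s stays small, the achievable speedup grows almost linearly, which is why a computation that is inefficient for a given problem size can still be efficient for the same code on a larger one.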
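To make the case A / case B comparison concrete, the following is a minimal C/MPI sketch of the two data-distribution strategies. It is an illustration only, not the poster's benchmark code: the buffer size N, the use of double elements and the simple MPI_Wtime timing are assumptions made for the example.

/* Illustrative sketch: contrasts the two distribution strategies discussed
 * above (case A: point-to-point, case B: collective). Not the poster's code;
 * N and the double buffer are assumptions chosen only for the example. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)   /* illustrative problem size: 1M doubles, about 8 MB */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *buf = malloc(N * sizeof(double));
    if (rank == 0)
        for (int i = 0; i < N; i++) buf[i] = (double)i;

    /* Case A: the root sends the whole buffer to every other rank
       with explicit point-to-point messages. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    if (rank == 0) {
        for (int dst = 1; dst < size; dst++)
            MPI_Send(buf, N, MPI_DOUBLE, dst, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    double t_p2p = MPI_Wtime() - t0;

    /* Case B: the same distribution expressed as a single collective,
       leaving the choice of the internal algorithm to the MPI library
       (e.g. MVAPICH over InfiniBand). */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    MPI_Bcast(buf, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    double t_coll = MPI_Wtime() - t0;

    if (rank == 0)
        printf("point-to-point: %.6f s   collective: %.6f s\n", t_p2p, t_coll);

    free(buf);
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with mpirun -np <p>, the program prints the time taken by each strategy, giving a rough view of the communication overhead that the poster compares across MPICH, MVAPICH and MVAPICH2 on Gigabit Ethernet and InfiniBand.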