Comparison of MPI Benchmark Programs on an SGI Altix ccNUMA Shared Memory Machine

Nor Asilah Wati Abdul Hamid, Paul Coddington, Francis Vaughan
School of Computer Science, University of Adelaide
IPDPS - PMEO, April 2006

Motivation

MPI benchmark programs were developed and tested mainly on distributed memory machines. Measuring MPI performance on ccNUMA shared memory machines may be more complex than on distributed memory machines. Our recent work on measuring MPI performance on the SGI Altix showed differences between the results of some MPI benchmarks, and there are no existing detailed comparisons of results from different MPI benchmarks. So we aim to:
- compare the results of different MPI benchmarks on the SGI Altix;
- investigate the reasons for any variations, which are expected to be due to:
  - differences in measurement techniques,
  - implementation details,
  - default configurations of the different benchmarks.

MPI Benchmark Experiments on the Altix

The MPI benchmarks tested were: Pallas MPI Benchmark (PMB), SKaMPI, MPBench, Mpptest and MPIBench. Measurements were done on the SGI Altix 3700 at SAPAC:
- ccNUMA architecture
- 160 Intel Itanium 2 processors at 1.3 GHz
- 160 GB of RAM
- SGI NUMAlink3 interconnect
- SGI Linux (ProPack 3)
- SGI MPI library
MPI_DSM_CPULIST was used for process binding:
- processes must be bound to CPUs for best performance;
- CPU 0 was avoided (it is used for system processes).
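Process placement can be checked independently of the benchmarks. The minimal sketch below (not part of any of the benchmark programs) prints the CPU each MPI rank is running on; it assumes a Linux system with glibc's sched_getcpu(), and the binding itself is controlled outside the program through the SGI MPI_DSM_CPULIST environment variable.

/* Minimal sketch (illustrative only): report which physical CPU each
 * MPI rank is running on, to confirm that MPI_DSM_CPULIST has bound
 * the processes as intended (e.g. to CPUs 1..n, avoiding CPU 0).
 * Assumes Linux with glibc's sched_getcpu(). */
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("rank %d of %d is running on CPU %d\n", rank, size, sched_getcpu());

    MPI_Finalize();
    return 0;
}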

SGI Altix Architecture

(Figure: the architecture of a 128-processor SGI Altix, with Itanium 2 CPUs and memory in C-bricks connected by R-brick router interconnects.)

Avoiding Message Buffering Using Single Copy

It is possible to avoid the need to buffer messages in SGI MPI, which has single-copy options that give better performance. Single copy is the default for MPI_Isend, MPI_Sendrecv, MPI_Alltoall, MPI_Bcast, MPI_Allreduce and MPI_Reduce for messages above a specified size (the default is 2 KBytes), but it is not enabled by default for MPI_Send. Single-copy (non-buffered) sends can be forced using the MPI_BUFFER_MAX environment variable. Using a single-copy send can give significantly better communication performance, almost a factor of 10 in some cases.
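As a rough illustration of the kind of measurement affected, the sketch below is a simple ping-pong timing loop between two ranks (run with two processes). It is not the code of any of the benchmarks, and the message size and repetition count are arbitrary; on SGI MPI, setting MPI_BUFFER_MAX in the environment before launching the job is what enables the single-copy path for MPI_Send.

/* Hedged sketch of a ping-pong timing loop between ranks 0 and 1
 * (illustrative only; run with two processes).  With SGI MPI, single
 * copy for MPI_Send can be forced by setting MPI_BUFFER_MAX in the
 * environment before launching the job. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (1 << 20)   /* 1 MB message, well above the 2 KB threshold */
#define REPS      100

int main(int argc, char **argv)
{
    int rank;
    char *sendbuf = malloc(MSG_BYTES);
    char *recvbuf = malloc(MSG_BYTES);   /* separate send and receive buffers */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(sendbuf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(recvbuf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(recvbuf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(sendbuf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("mean round-trip time: %g us\n",
               1e6 * (MPI_Wtime() - t0) / REPS);

    MPI_Finalize();
    free(sendbuf);
    free(recvbuf);
    return 0;
}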

Point-to-Point Communication

Different MPI benchmarks use different communication patterns to measure MPI_Send/MPI_Recv. MPIBench measures the effects of contention when all processors take part in point-to-point communication concurrently, not just the time for a ping-pong between two processors: processor p communicates with processor (p + n/2) mod n, where n is the total number of processors. (Diagram: all eight processors P0-P7 communicating in pairs.)
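A minimal sketch of this pairing scheme is given below; it is illustrative only (not the actual MPIBench source), assumes an even number of processes, and would be called from a timing loop like the one sketched earlier.

/* Hedged sketch of MPIBench-style concurrent point-to-point traffic:
 * rank p exchanges a message with rank (p + n/2) mod n, so all n
 * processors communicate at the same time and contention is visible.
 * Illustrative only; assumes n is even. */
#include <mpi.h>

void contended_pingpong(char *sendbuf, char *recvbuf, int bytes)
{
    int p, n;
    MPI_Comm_rank(MPI_COMM_WORLD, &p);
    MPI_Comm_size(MPI_COMM_WORLD, &n);

    int partner = (p + n / 2) % n;    /* e.g. P0<->P4, P1<->P5, ... on 8 CPUs */

    if (p < n / 2) {                  /* lower half sends first */
        MPI_Send(sendbuf, bytes, MPI_CHAR, partner, 0, MPI_COMM_WORLD);
        MPI_Recv(recvbuf, bytes, MPI_CHAR, partner, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {                          /* upper half receives first */
        MPI_Recv(recvbuf, bytes, MPI_CHAR, partner, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(sendbuf, bytes, MPI_CHAR, partner, 0, MPI_COMM_WORLD);
    }
}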

PMB and Mpptest involve processors 0 and 1 only. MPBench uses the first and last processors. SKaMPI tests for the processor with the slowest communication time to P0 and uses that pair for measurements. (Diagrams: the processor pairs used by each benchmark.) These different approaches may give very different results on a hierarchical ccNUMA architecture.

MPI_Send/MPI_Recv

(Figure: comparison of results from different MPI benchmarks for point-to-point (send/receive) communication on 8 CPUs.)

MPI_Send/MPI_Recv

MPIBench results are the highest since all processors communicate at once, so there is contention. SKaMPI and MPBench are the second highest because they measure between the first and last CPUs, which are on different C-bricks. PMB and Mpptest results are the lowest since they measure communication time between CPUs on the same node.
When measuring single-copy MPI_Send:
- the results for SKaMPI and MPIBench were the same as for the default MPI setting, which uses buffered copy;
- this occurs because both SKaMPI and MPIBench use the same array to hold the send and receive message data, which appears to be an artefact of the SGI MPI implementation.

MPI_Bcast

(Figure: comparison between MPI benchmarks for MPI_Bcast on 8 CPUs.)

The main difference is that, by default:
- SKaMPI, Mpptest and PMB assume data is not held in cache memory;
- MPIBench and MPBench do preliminary "warm-up" repetitions to ensure data is in cache before measurements are taken.
Another difference is how collective communication times are measured at the root node:
- for broadcast, the root node is the first to finish, which can lead to biased results (a pipelining effect);
- the problem can be solved by inserting a barrier operation before each repetition;
- Mpptest and PMB adopt a different approach: they assign a different root processor for each repetition, which also acts to clear the cache;
- on a distributed memory machine this has little effect on the results, but on a ccNUMA shared memory machine, moving the root to a different processor has a significant overhead.
Changing the benchmarks to remove these differences gives very similar results (within a few percent).
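In outline, the barrier-synchronised, cache-warmed measurement looks like the sketch below. It is a simplified illustration based on the description above, not the actual code of any benchmark; the message size and repetition count are arbitrary.

/* Hedged sketch of a barrier-synchronised MPI_Bcast timing loop.
 * The warm-up call brings the message data into cache (as MPIBench
 * and MPBench do); the barrier before each timed repetition prevents
 * the root, which finishes the broadcast first, from racing ahead and
 * pipelining successive calls.  Illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (4 * 1024 * 1024)   /* 4 MB, as in the plots below */
#define REPS      50

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(MSG_BYTES);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Warm-up repetition: make sure the data is resident in cache. */
    MPI_Bcast(buf, MSG_BYTES, MPI_CHAR, 0, MPI_COMM_WORLD);

    double total = 0.0;
    for (int i = 0; i < REPS; i++) {
        MPI_Barrier(MPI_COMM_WORLD);      /* re-synchronise every repetition */
        double t0 = MPI_Wtime();
        MPI_Bcast(buf, MSG_BYTES, MPI_CHAR, 0, MPI_COMM_WORLD);
        total += MPI_Wtime() - t0;
    }

    if (rank == 0)
        printf("mean MPI_Bcast time: %g s\n", total / REPS);

    MPI_Finalize();
    free(buf);
    return 0;
}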

Times for Different Processors

(Figures: per-node times produced by SKaMPI, and distribution results produced by MPIBench, for MPI_Bcast at 4 MBytes on 8 CPUs.)
SKaMPI and MPIBench provide average times for each processor. MPIBench also provides distributions of times across all processors or for individual processors. The effects of the binary tree broadcast algorithm can be seen.
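Per-processor times of the kind plotted here can be collected along the lines of the sketch below: each rank times its own completion of the broadcast and the root gathers the per-rank values. This is an illustrative reconstruction, not the actual SKaMPI or MPIBench code.

/* Hedged sketch: record each rank's completion time for one MPI_Bcast
 * and gather the times to the root, giving the kind of per-processor
 * breakdown reported by SKaMPI and MPIBench.  Illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

void bcast_per_rank_times(char *buf, int bytes, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    MPI_Barrier(comm);                    /* common starting point */
    double t0 = MPI_Wtime();
    MPI_Bcast(buf, bytes, MPI_CHAR, 0, comm);
    double dt = MPI_Wtime() - t0;         /* this rank's completion time */

    double *times = NULL;
    if (rank == 0)
        times = malloc(size * sizeof(double));
    MPI_Gather(&dt, 1, MPI_DOUBLE, times, 1, MPI_DOUBLE, 0, comm);

    if (rank == 0) {
        for (int p = 0; p < size; p++)
            printf("rank %d: %g s\n", p, times[p]);
        free(times);
    }
}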

MPI_Barrier

(Figure: comparison between MPI benchmarks for MPI_Barrier on 2 to 128 CPUs.)
The SKaMPI result is a bit higher than MPIBench and PMB, probably due to the global clock synchronization that SKaMPI uses by default for its measurements; its authors claim this is more accurate since it avoids the pipelining effect.

MPI_Scatter

(Figure: comparison between MPI benchmarks for MPI_Scatter on 32 CPUs.)
Only MPIBench and SKaMPI measure MPI_Scatter and MPI_Gather; they use a similar approach and report very similar times. There is an unexpected hump for data sizes between 128 bytes and 2 KB per process, because SGI MPI uses buffered communications for message sizes below 2 KBytes. The time for MPI_Scatter grows remarkably slowly with data size.

MPI_Gather

(Figure: comparison between MPI benchmarks for MPI_Gather on 32 CPUs.)
The time is approximately proportional to the total data size for a fixed number of CPUs; the bottleneck is the root processor, which gathers all the data. Times are slower for more CPUs due to serialization and contention effects.

MPI_Alltoall

(Figure: comparison between MPI benchmarks for MPI_Alltoall on 32 CPUs.)
MPI_Alltoall is measured by MPIBench, PMB and SKaMPI. Results for 32 processors are similar to MPI_Scatter, but with a sharper increase for larger data sizes, probably indicating the effects of contention. Results from the different benchmarks mostly agree to within about 10%.

Summary

Different MPI benchmarks can give significantly different results for certain MPI routines on the SGI Altix, which is not usually the case for typical distributed memory architectures. The differences are due to the different measurement techniques used by the benchmarks:
- for point-to-point communications: different communication patterns, differences in how averages are computed, and implementation details of SGI MPI on the Altix, which affect whether single copy is used (e.g. SKaMPI and MPIBench for MPI_Send/Recv);
- for some collective communication routines: different defaults for the use of cache, and differences in synchronizing calls to the routines on each processor.
The hierarchical ccNUMA architecture amplifies the effects of these differences.

Recommendations

Users of MPI benchmarks on shared memory machines should be careful in interpreting the results. MPI benchmarks were primarily designed for, and tested on, distributed memory machines, so consideration should be given to how they perform on shared memory machines. Developers of MPI benchmarks might consider modifying their benchmark programs to provide more accurate results for ccNUMA shared memory machines.

END