Performance Analysis of MPI Communications on the SGI Altix 3700
Nor Asilah Wati Abdul Hamid, Paul Coddington, Francis Vaughan
Distributed & High Performance Computing Group, School of Computer Science, University of Adelaide, Adelaide SA 5005, Australia
September 2005

Motivation
A 1672-CPU SGI Altix (APAC AC) is the new APAC peak facility, replacing an AlphaServer SC with a Quadrics network (APAC SC). Most parallel programs that run on the APAC machines use MPI, so we want to measure MPI performance on these machines. Several MPI benchmark programs are available for measuring MPI communication performance, e.g. the Pallas MPI Benchmark (PMB), SKaMPI, MPBench, Mpptest, and the recently developed MPIBench. The measurements reported here used our benchmark, MPIBench. We aim to:
- Measure the MPI performance of the SGI Altix.
- Compare the Altix results with the AlphaServer SC and its Quadrics network.
- Compare the results of MPIBench with other MPI benchmarks.

MPIBench
Current MPI benchmarks (MPBench, Mpptest, SKaMPI, Pallas PMB) have several limitations. MPIBench is a communications benchmark program for accurately measuring the performance of MPI routines.
- Uses a highly accurate, globally synchronized clock, so it can measure individual message times rather than just an average over many ping-pongs (a sketch of this idea follows below).
- Gives probability distributions (histograms) of communication times.
- Makes point-to-point measurements across many nodes, so it shows the effects of network contention, not just a ping-pong between 2 nodes, and handles SMP clusters (on-node and off-node communications).
- Provides greater insight for performance analysis and is useful for more accurate performance modeling.
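
To illustrate the idea (a minimal sketch only, not MPIBench's actual implementation), the code below times each individual message between ranks 0 and 1 and builds a crude histogram. It assumes MPI_Wtime is globally synchronized across processes; in general it is not, which is why MPIBench implements its own globally synchronized clock.

```c
/* Sketch: per-message timing in the spirit of MPIBench (run with >= 2 processes).
 * Assumes MPI_Wtime() is globally synchronized across processes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NMSGS   1000
#define MSGSIZE 1024            /* message size in bytes */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(MSGSIZE);
    double *t = malloc(NMSGS * sizeof(double));

    for (int i = 0; i < NMSGS; i++) {
        MPI_Barrier(MPI_COMM_WORLD);      /* start each repetition together */
        if (rank == 0) {
            double t0 = MPI_Wtime();      /* timestamp at the sender */
            MPI_Send(buf, MSGSIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&t[i], 1, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            t[i] -= t0;                   /* individual one-way time, not an average */
        } else if (rank == 1) {
            MPI_Recv(buf, MSGSIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            double t1 = MPI_Wtime();      /* arrival timestamp at the receiver */
            MPI_Send(&t1, 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
        }
    }

    if (rank == 0) {
        int hist[50] = {0};               /* 1-microsecond bins, capped at 49 us */
        for (int i = 0; i < NMSGS; i++) {
            int bin = (int)(t[i] * 1e6);
            if (bin < 0)  bin = 0;
            if (bin > 49) bin = 49;
            hist[bin]++;
        }
        for (int b = 0; b < 50; b++)
            if (hist[b]) printf("%2d us: %d messages\n", b, hist[b]);
    }

    free(buf); free(t);
    MPI_Finalize();
    return 0;
}
```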

MPI Benchmark Experiments on the Altix
Measurements were done on the SGI Altix 3700 (Aquila) at SAPAC:
- 160 Intel Itanium 2 CPUs at 1.3 GHz (the APAC AC has 1.6 GHz CPUs)
- 160 GB of RAM (the APAC AC has 2 GB per CPU)
- ccNUMA architecture
- SGI NUMAlink interconnect
- SGI Linux (ProPack 3)
- SGI MPI library
- MPI_DSM_CPULIST used for process binding

SGI Altix Architecture
[Figure: the architecture of a 128-processor SGI Altix. C-bricks hold the Itanium 2 CPUs and memory; R-bricks provide the router interconnect.]

Avoid Message Buffering and Single Copy
Under certain conditions SGI MPI can avoid the need to buffer messages: it provides single copy options with much better performance. Single copy is the default for MPI_Isend, MPI_Sendrecv, MPI_Alltoall, MPI_Bcast, MPI_Allreduce and MPI_Reduce for messages above a specified size (the default is 2 KBytes). However, it is not enabled by default for MPI_Send; single copy (non-buffered) sends can be forced using MPI_BUFFER_MAX. Note that our measurements for MPI_Send use the default settings (buffered). Using a single copy send can give significantly better communication times – a factor of 2 or 3 in some cases.
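
As a hedged illustration of this point, the sketch below sends a large message with an unmodified MPI_Send; whether the transfer is buffered or single copy depends only on the runtime environment. The MPI_BUFFER_MAX usage and threshold value shown in the comments are assumptions based on the environment variable named above, not verified documentation.

```c
/* Sketch: a large MPI_Send whose transfer mode depends only on the runtime
 * environment, not on the code.  Assumed usage with SGI MPI:
 *
 *   mpirun -np 2 ./send_test                          # default: buffered MPI_Send
 *   env MPI_BUFFER_MAX=2048 mpirun -np 2 ./send_test  # request single-copy sends
 *                                                     # (threshold value assumed)
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 4 * 1024 * 1024;          /* 4 MByte message */
    char *buf = malloc(n);

    double t0 = MPI_Wtime();
    if (rank == 0)
        MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    double dt = MPI_Wtime() - t0;           /* rough local elapsed time on each rank */

    if (rank == 1)
        printf("4 MB transfer took %.1f us on the receiver\n", dt * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```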

Point-to-Point
MPIBench measures not just the time for a ping-pong communication between two processors; it can also measure the effects of contention when all processors take part in point-to-point communication, with processor p communicating with processor (p + n/2) mod n, where n is the total number of processors.
[Figure: the MPIBench point-to-point communication pattern for 8 processes P0-P7.]
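
A minimal sketch of this contended pattern (not MPIBench itself): every process exchanges a fixed-size message with its partner (p + n/2) mod n using MPI_Sendrecv, so all processes communicate at the same time. The 64 KByte message size is arbitrary.

```c
/* Sketch: contended point-to-point pattern, partner = (p + n/2) mod n. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int p, n;
    MPI_Comm_rank(MPI_COMM_WORLD, &p);
    MPI_Comm_size(MPI_COMM_WORLD, &n);

    if (n % 2 != 0) {                        /* the pairing assumes an even process count */
        if (p == 0) fprintf(stderr, "run with an even number of processes\n");
        MPI_Finalize();
        return 1;
    }

    int partner = (p + n / 2) % n;           /* every process has a unique partner */
    const int len = 64 * 1024;               /* 64 KByte message */
    char *sendbuf = malloc(len), *recvbuf = malloc(len);

    MPI_Barrier(MPI_COMM_WORLD);             /* start all processes together */
    double t0 = MPI_Wtime();
    MPI_Sendrecv(sendbuf, len, MPI_CHAR, partner, 0,
                 recvbuf, len, MPI_CHAR, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    double dt = MPI_Wtime() - t0;

    printf("rank %d <-> rank %d: %.1f us\n", p, partner, dt * 1e6);

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}
```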

In contrast, PMB and Mpptest involve processors 0 and 1 only, while SKaMPI and MPBench use the first and last processors.
[Figure: the point-to-point communication patterns used by PMB/Mpptest and by SKaMPI/MPBench for 8 processes P0-P7.]

Comparison of results from different MPI benchmarks for point-to-point (send/receive) communications using 8 processors.

Summary of Latency and Bandwidth Results
Measured latency (for sending a zero-byte message) and bandwidth (for a 4 MByte message) for different numbers of processes on the Altix (processor counts increase from left to right). Results for MPIBench are for all processes communicating concurrently, so they include contention effects. Results for MPBench are for only two communicating processes (processes 0 and N-1) with no network or memory contention.
Latency (MPIBench): 1.96 us, 1.76 us, 2.14 us, 2.21 us, 2.56 us, 2.61 us, 2.70 us
Latency (MPBench): 1.76 us, 2.07 us, 2.48 us, 2.41 us, 2.53 us, 3.06 us, 3.01 us
Bandwidth (MPIBench): 851 MB/s, 671 MB/s, 464 MB/s, 462 MB/s, 256 MB/s, 248 MB/s
Bandwidth (MPBench): 831 MB/s, 925 MB/s, 562 MB/s, 549 MB/s, 532 MB/s, 531 MB/s
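
For reference, a minimal ping-pong sketch of how latency and bandwidth figures like these are commonly obtained between two processes (in the style of MPBench's uncontended measurement, not MPIBench's concurrent one); the repetition counts are arbitrary.

```c
/* Sketch: ping-pong latency (zero-byte message) and bandwidth (4 MByte message)
 * between ranks 0 and 1. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static double pingpong(int rank, char *buf, int len, int reps)
{
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    return (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way time per message */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int big = 4 * 1024 * 1024;            /* 4 MByte message */
    char *buf = malloc(big);

    double lat = pingpong(rank, buf, 0, 1000);  /* zero-byte messages -> latency   */
    double t4m = pingpong(rank, buf, big, 50);  /* 4 MByte messages   -> bandwidth */

    if (rank == 0)
        printf("latency %.2f us, bandwidth %.0f MB/s\n",
               lat * 1e6, (big / t4m) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```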

Latency and bandwidth comparison between the AlphaServer SC and the SGI Altix 3700:
Latency (internode): AlphaServer SC 5 microseconds; SGI Altix 3700 1.96 microseconds
Latency (intranode): AlphaServer SC 5 microseconds; SGI Altix 3700 1.76 microseconds
Bandwidth (internode): AlphaServer SC 262 MBytes/sec; SGI Altix 3700 464 MB/s (buffered), 754 MB/s (single copy)
Bandwidth (intranode): AlphaServer SC 740 MBytes/sec; SGI Altix 3700 851 MB/s (buffered), 1421 MB/s (single copy)

MPI_Sendrecv
[Figure: comparison of MPI_Send and MPI_Sendrecv times.]
On the SGI Altix with the SGI MPI library, MPI_Sendrecv takes about the same time as MPI_Send, so bidirectional bandwidth works. On the AlphaServer SC with the Compaq MPI library, MPI_Sendrecv is about 2 times slower, so bidirectional bandwidth does NOT work.

MPI_Bcast
Broadcast is usually implemented as a tree of point-to-point communications, taking O(log N) time for N processes. Broadcast on the SGI Altix takes advantage of single copy. The Quadrics network on the AlphaServer SC provides a very fast hardware broadcast, but only if the program is running on a contiguous set of processors; otherwise a standard software broadcast algorithm is used. A sketch of a tree-based software broadcast is shown below.
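
The following is a minimal sketch of a binomial-tree broadcast built from point-to-point messages (the structure pictured later for 16 processes); it assumes root rank 0 and is illustrative only, since real MPI libraries choose among several broadcast algorithms.

```c
/* Sketch: binomial-tree broadcast from rank 0 using point-to-point messages.
 * In round k (mask = 1, 2, 4, ...) every process that already holds the data
 * forwards it to the process `mask` ranks away, so the broadcast completes
 * in ceil(log2(N)) rounds. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

static void tree_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int mask = 1; mask < size; mask <<= 1) {
        if (rank < mask) {
            int dest = rank + mask;           /* forward to a process without the data */
            if (dest < size)
                MPI_Send(buf, count, type, dest, 0, comm);
        } else if (rank < 2 * mask) {
            int src = rank - mask;            /* receive from a process that has it */
            MPI_Recv(buf, count, type, src, 0, comm, MPI_STATUS_IGNORE);
        }
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char msg[64] = "";
    if (rank == 0) strcpy(msg, "hello from the root");
    tree_bcast(msg, 64, MPI_CHAR, MPI_COMM_WORLD);
    printf("rank %d received: %s\n", rank, msg);

    MPI_Finalize();
    return 0;
}
```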

MPI_Bcast Results
Comparing broadcast performance with the AlphaServer SC is difficult, since it depends on whether hardware broadcast is used. For smaller numbers of processors (< 32) the Altix does better due to its higher bandwidth; for larger numbers of processors the AlphaServer SC does better due to the excellent scalability of the Quadrics hardware broadcast. Hardware broadcast of a 64 KByte message on the AlphaServer SC takes 0.40 ms for 16 CPUs and 0.45 ms on 128 CPUs; on the SGI Altix it takes 0.22 ms on 16 CPUs, 0.45 ms on 64 CPUs and 0.62 ms for 128 CPUs. If the processors for an MPI job on the AlphaServer SC are not contiguous, it uses software broadcast, which is much slower than the Altix.

[Figures: distribution of results for MPI_Bcast at 256 KBytes on 32 processors; the binomial tree that constructs a 16-process software-based MPI_Bcast from a collection of point-to-point messages.]

Benchmark Comparison for MPI_Bcast Comparison between MPI benchmarks for MPI_Bcast on 8 processors.

The main difference is that SKaMPI, Mpptest and PMB assume data is not held in cache memory, whereas MPIBench does preliminary "warm-up" repetitions to ensure data is in cache before measurements are taken (the newer version adds an option to clear the cache). Another difference is how collective communication time is measured at the root node: for broadcast the root node is the first to finish, which can bias the results. One solution is to insert a barrier operation before each repetition; Mpptest and PMB instead adopt a different approach and assign a different root processor for each repetition. On a distributed memory cluster this has little effect on the results, but on shared memory, moving the root to a different processor has a significant overhead. A sketch of the barrier-per-repetition measurement is shown below.
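
A minimal sketch of that barrier-per-repetition measurement, with warm-up repetitions so the data starts in cache (illustrative only; none of the benchmarks are implemented exactly like this, and the message size and repetition counts are arbitrary):

```c
/* Sketch: timing MPI_Bcast with warm-up repetitions and a barrier before every
 * measured repetition, so all processes start together and the root's early
 * finish does not bias the result. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WARMUP 10
#define REPS   100

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 256 * 1024;              /* 256 KByte message */
    char *buf = malloc(count);

    for (int i = 0; i < WARMUP; i++)           /* warm-up: bring buf into cache */
        MPI_Bcast(buf, count, MPI_CHAR, 0, MPI_COMM_WORLD);

    double total = 0.0;
    for (int i = 0; i < REPS; i++) {
        MPI_Barrier(MPI_COMM_WORLD);           /* synchronize before each repetition */
        double t0 = MPI_Wtime();
        MPI_Bcast(buf, count, MPI_CHAR, 0, MPI_COMM_WORLD);
        total += MPI_Wtime() - t0;             /* local completion time on this rank */
    }

    double avg = total / REPS, max_avg;        /* report the slowest rank's average */
    MPI_Reduce(&avg, &max_avg, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("MPI_Bcast, 256 KB: %.1f us (max over ranks)\n", max_avg * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```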

MPI_Barrier
The times scale logarithmically with the number of processors. Hardware broadcast on the Quadrics network means that a barrier operation on the AlphaServer SC is very fast, taking an almost constant time of around 5-8 microseconds for 2 to 128 processors, which is similar to the Altix.
[Figure: average time for an MPI barrier operation for 2 to 128 processors.]

MPI_Scatter
Overall the time grows remarkably slowly with the data size. At 1 KByte per process the Altix is 4 to 6 times faster than the APAC SC; at 4 KBytes per process it is 10 times faster.
[Figure: performance of MPI_Scatter for 2 to 128 processors.]

MPI_Gather
For MPI_Gather, the time is proportional to the data size, and the Altix gives significantly better results than the APAC SC. At 1 KByte per process the Altix is 2 to 4 times faster than the APAC SC; at 2 KBytes per process it is 10 times faster. Above 2 KBytes per process, MPI on the AlphaServer SC became unstable and crashed, while the Altix continues to give good performance.

Performance of MPI_Gather for 2 to 128 processors. The figures show average completion times for the MPI_Gather operation; times are roughly proportional to data size for large data sizes.

Process 0 is by far the slowest process to complete, since it has to gather and merge results from all other processors. Process 1 is the first to complete, since it is on the same node as the root process.
[Figure: probability distribution for 64 processors at 4 KBytes per process.]

MPI_Alltoall
The times for MPI_Alltoall are significantly better on the Altix than on the AlphaServer SC: at 1 KByte per processor the Altix is 2 to 4 times faster than the APAC SC, at 4 KBytes per processor 20 times faster, and at 8 KBytes per processor 30 times faster. This is partly because the MPI implementation on the APAC SC did not appear to be optimized for SMP nodes.

MPI_Alltoall Time Distribution
Distribution for MPI_Alltoall on 32 processors at 256 KBytes. The distribution shows that for large messages there is a wide range of completion times, which indicates contention effects.

Summary
The SGI Altix shows very good MPI communications performance:
- Using process binding improves performance.
- Enabling single copy for MPI_Send greatly improves performance.
Overall performance was significantly better than the APAC SC:
- The Altix provides higher bandwidth and lower latency than the Quadrics network on the APAC SC.
- The Altix has significantly better collective communications performance, except where the SC can use Quadrics hardware for broadcast and barrier.
Different MPI benchmarks can give very different results for some operations, with much greater variation on the Altix than for clusters, mainly due to cache effects, which are important on ccNUMA systems.

END