Planned AlltoAllv: a clustered approach
Stephen Booth (EPCC), Adrian Jackson (EPCC)

2. Why is AlltoAllv important?
Large-scale changes of data decomposition are usually implemented using the MPI_Alltoall or MPI_Alltoallv calls.
– Alltoallv is equivalent to each process sending a message to every other process.
– The Alltoall operation is similar, but all messages are the same size.
These calls are particularly important in the implementation of parallel multi-dimensional Fourier transforms, where a series of data transposes is required.
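
As a reference point, here is a minimal sketch of the unoptimised baseline the rest of the talk builds on: a single MPI_Alltoallv call in which every process exchanges one block with every other process. The block sizes are illustrative and not taken from the slides.

    #include <mpi.h>
    #include <stdlib.h>

    /* Baseline exchange: every rank sends a (possibly different-sized) block
     * to every other rank with one MPI_Alltoallv call. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *scount = malloc(size * sizeof(int));
        int *rcount = malloc(size * sizeof(int));
        int *sdispl = malloc(size * sizeof(int));
        int *rdispl = malloc(size * sizeof(int));
        int stot = 0, rtot = 0;
        for (int p = 0; p < size; p++) {
            scount[p] = 16 + (rank + p) % 4;   /* doubles we send to rank p   */
            rcount[p] = 16 + (p + rank) % 4;   /* doubles rank p sends to us  */
            sdispl[p] = stot;  stot += scount[p];
            rdispl[p] = rtot;  rtot += rcount[p];
        }
        double *sendbuf = calloc(stot, sizeof(double));
        double *recvbuf = calloc(rtot, sizeof(double));

        MPI_Alltoallv(sendbuf, scount, sdispl, MPI_DOUBLE,
                      recvbuf, rcount, rdispl, MPI_DOUBLE, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }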

3. Optimising AlltoAllv
All messages are unique.
– On the face of it there is very little room to optimise.
We can exploit the clustered SMP architecture:
– Messages within a node can be sent with lower latency and higher bandwidth.
– It may be possible to redistribute data within a node, so that a smaller number of larger messages is sent between nodes, reducing message latency costs.
Problems:
– The cost of the data redistribution must be less than the latencies saved, so this approach will break down for sufficiently large messages.
– We need to know how to redistribute the data.
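
To make the potential saving concrete: if each SMP node hosts m processes, a naive Alltoallv sends m*m separate messages from one node to another, whereas the aggregated scheme sends a single combined message per node pair in each direction. For the 32-way nodes used later in the talk that is 32*32 = 1024 messages folded into one transfer, which is why the latency saving matters most when the individual messages are small.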

4. Optimising the local redistribution
Two options:
1. Use MPI point-to-point.
2. Communicate via a shared memory segment (SMS).
Although shared-memory point-to-point is fast, we may be able to do better for collective operations where the communication pattern is known in advance (there is no need to match send/recv calls).
To investigate these options we built on previous work in which we optimised tree-based communications using an SMS.
– The code is written to be switchable between MPI and SMS to allow comparisons.
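
The slides do not say how the shared memory segment is set up; the work predates MPI-3, so an OS-level mechanism must have been used. Purely as an illustration of the idea, this is how a node-local segment could be obtained with today's MPI-3 calls:

    #include <mpi.h>

    /* Rough sketch only: map one shared segment per SMP node using MPI-3.
     * This is not the mechanism used in the original work; seg_bytes is an
     * arbitrary illustrative size chosen by the caller. */
    char *attach_node_segment(MPI_Aint seg_bytes, MPI_Comm *node_comm, MPI_Win *win)
    {
        /* Communicator containing only the processes on this node. */
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, node_comm);

        int node_rank;
        MPI_Comm_rank(*node_comm, &node_rank);

        /* Rank 0 on the node allocates the whole segment; the other
         * processes contribute zero bytes and simply map its allocation. */
        char *base;
        MPI_Win_allocate_shared(node_rank == 0 ? seg_bytes : 0, 1,
                                MPI_INFO_NULL, *node_comm, &base, win);

        MPI_Aint size;
        int disp_unit;
        char *segment;
        MPI_Win_shared_query(*win, 0, &size, &disp_unit, &segment);
        return segment;   /* same physical memory on every process of the node */
    }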

5. Communication patterns
With the Alltoall call, all participating processes know the full communication pattern.
– This can therefore be implemented directly.
With Alltoallv, only the communications involving the local process are known.
– Additional communication would be required to distribute this information.
– That would negate any advantage of the optimisation.
– However, most application codes repeat a small number of data redistributions many times.
This leads us to a "planned" Alltoallv approach.

6. Planned operations
Use the Alltoallv parameters to construct a Plan object.
– This is collective.
– It is relatively expensive.
The Plan can be re-used multiple times to perform the communication step.
Advantage:
– The additional communications are only performed once.
Disadvantage:
– This is an extension to the MPI standard; code must be refactored to use these calls.
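
A hypothetical sketch of what such an interface might look like; the type and function names below are illustrative only and are not the actual EPCC API.

    #include <mpi.h>

    /* Illustrative plan interface: build once (collective, relatively
     * expensive), execute many times, then free. */
    typedef struct AAV_Plan_s *AAV_Plan;

    /* Collective: analyses the Alltoallv arguments and works out the intra-node
     * collect/distribute lists and the aggregated inter-node transfers. */
    int AAV_Plan_create(const int *sendcounts, const int *sdispls, MPI_Datatype sendtype,
                        const int *recvcounts, const int *rdispls, MPI_Datatype recvtype,
                        MPI_Comm comm, AAV_Plan *plan);

    /* Cheap: performs one communication step using the stored pattern. */
    int AAV_Plan_execute(const void *sendbuf, void *recvbuf, AAV_Plan plan);

    int AAV_Plan_free(AAV_Plan *plan);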

7. Implementation
Aim for only a single message to be passed between separate SMPs.
Inter-box sends and receives are distributed equally across the processes within a box.
Three-stage process:
– Collect data to the sending process.
– Exchange data between SMPs.
– Distribute data to the correct end-point.
Collect/distribute is performed by either:
1. MPI_Pack/MPI_Unpack to the SMS, or
2. MPI point-to-point.
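
A minimal sketch of the three stages for a single remote node, using the SMS variant (MPI_Pack/MPI_Unpack into a node-shared staging buffer). The pack/unpack offsets and the choice of which local process performs the inter-node transfer would come from the Plan; here they are passed in as illustrative parameters, and the whole staging buffer is sent rather than just the packed bytes.

    #include <mpi.h>

    /* Sketch, assuming offsets and sender selection are supplied by the Plan. */
    void exchange_with_remote_node(const double *my_block, int my_count, int my_pack_offset,
                                   double *my_recv_block, int my_recv_count, int my_unpack_offset,
                                   char *send_stage, char *recv_stage, int stage_bytes,
                                   int i_am_sender, int remote_rank,
                                   MPI_Comm comm, MPI_Comm node_comm)
    {
        /* Stage 1: collect - each process packs its data for the remote node
         * into its own region of the shared send buffer. */
        int pos = my_pack_offset;
        MPI_Pack(my_block, my_count, MPI_DOUBLE, send_stage, stage_bytes, &pos, comm);
        MPI_Barrier(node_comm);              /* aggregated buffer is complete */

        /* Stage 2: exchange - one designated process per remote node sends the
         * aggregated buffer and receives the other node's aggregated buffer. */
        if (i_am_sender)
            MPI_Sendrecv(send_stage, stage_bytes, MPI_PACKED, remote_rank, 0,
                         recv_stage, stage_bytes, MPI_PACKED, remote_rank, 0,
                         comm, MPI_STATUS_IGNORE);
        MPI_Barrier(node_comm);              /* incoming data is now visible  */

        /* Stage 3: distribute - each process unpacks its own part of the
         * incoming buffer into its receive buffer. */
        pos = my_unpack_offset;
        MPI_Unpack(recv_stage, stage_bytes, &pos,
                   my_recv_block, my_recv_count, MPI_DOUBLE, comm);
    }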

8. Plan objects
Plan objects contain lists of communications for each stage.
– Point-to-point communications are stored as MPI persistent communication requests.
– Pack/Unpack operations are stored as a linked list of structures.
The SMS version uses only a single segment per SMP.
– It is designed to work robustly with split communicators.
– The same segment is used for the tree-based communications.
– Wherever possible normal MPI calls are used, overridden via the profiling interface.
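
The slide says the point-to-point parts of a Plan are held as MPI persistent communication requests. A minimal sketch of storing one inter-node transfer this way (the helper name and tag are illustrative); the communication step then reduces to MPI_Startall followed by MPI_Waitall on the stored requests.

    #include <mpi.h>

    /* Store one aggregated inter-node transfer as a persistent send/recv pair. */
    void plan_add_remote(char *send_stage, int send_bytes,
                         char *recv_stage, int recv_bytes,
                         int remote_rank, MPI_Comm comm, MPI_Request reqs[2])
    {
        MPI_Send_init(send_stage, send_bytes, MPI_PACKED, remote_rank, 0, comm, &reqs[0]);
        MPI_Recv_init(recv_stage, recv_bytes, MPI_PACKED, remote_rank, 0, comm, &reqs[1]);
        /* Each later communication step is just:
         *   MPI_Startall(2, reqs);  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE); */
    }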

9. Performance
The effect of the optimisation should be greatest for large numbers of small messages.
We benchmark the planned Alltoallv code using Alltoall-equivalent communication patterns at a variety of message sizes.
All benchmarks were run on HPCx:
– 32-way p690+ nodes
– Service Pack 12
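
The benchmark code itself is not shown in the slides; a timing loop of the kind implied here might look like the following, with NREPS and the averaging purely an assumption.

    #include <mpi.h>

    #define NREPS 1000   /* assumed repetition count */

    /* Mean time per native MPI_Alltoallv call for a given pattern. */
    double time_native(double *sendbuf, int *scount, int *sdispl,
                       double *recvbuf, int *rcount, int *rdispl, MPI_Comm comm)
    {
        MPI_Barrier(comm);
        double t0 = MPI_Wtime();
        for (int i = 0; i < NREPS; i++)
            MPI_Alltoallv(sendbuf, scount, sdispl, MPI_DOUBLE,
                          recvbuf, rcount, rdispl, MPI_DOUBLE, comm);
        MPI_Barrier(comm);
        return (MPI_Wtime() - t0) / NREPS;
    }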

10. Single SMP results

11. Notes (single SMP results)
Within a single box only a two-stage copy is used.
– The SMS version packs/unpacks from the segment.
– The point-to-point version uses a single message send.
The point-to-point version is very similar to the native Alltoallv.
– It is actually slightly slower; Alltoallv may optimise the send to the local processor as a copy.
The SMS approach is faster for message sizes up to 4K.

12. Multiple SMP results

13. Notes (multiple SMP results)
The point-to-point version is always slower.
– Under SP12 the difference between shared-memory and switch latencies does not seem great enough for this optimisation to work: only a couple of microseconds difference in latency, much reduced from the initial install.
The SMS version is faster for messages up to 2K.
With 32-way nodes we are combining 32*32 = 1024 messages.

14. 512 processors

15. Small message scaling

16. Conclusions
Message aggregation is a workable optimisation for Alltoallv if:
– aggregation is performed via a shared memory segment,
– messages are small (below roughly 2K), and
– the communication pattern is calculated in advance.
The optimisation is much more significant at large processor counts.

17. Future work
Implement this optimisation for the MPI_Alltoall call directly (without the planning step).
Investigate optimising the MPI_Alltoallv call using Plan caching.
– It may be possible to use the existing MPI_Alltoallv interface by caching Plans within the communicator and performing lightweight collectives (Allreduce) to match stored Plans with the arguments of the MPI call.
– The planning overhead would then be incurred only on the first use of a communication pattern.
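
A sketch of how the Plan-matching collective might work: each process checks whether its local arguments match a Plan cached on the communicator, and a single Allreduce confirms that every process found a match before the cached Plan is reused. The cache lookup helper is hypothetical, not part of MPI or the existing code.

    #include <mpi.h>

    /* Hypothetical helper: returns 1 if the local arguments match a Plan
     * already cached on this communicator, 0 otherwise. */
    int plan_cache_lookup(const int *scounts, const int *sdispls,
                          const int *rcounts, const int *rdispls, MPI_Comm comm);

    /* The cached Plan may be reused only if every process matched it. */
    int can_reuse_cached_plan(const int *scounts, const int *sdispls,
                              const int *rcounts, const int *rdispls, MPI_Comm comm)
    {
        int local_hit = plan_cache_lookup(scounts, sdispls, rcounts, rdispls, comm);
        int global_hit;
        MPI_Allreduce(&local_hit, &global_hit, 1, MPI_INT, MPI_MIN, comm);
        return global_hit;
    }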