Parallel Application Scaling, Performance, and Efficiency David Skinner NERSC/LBL

Parallel Scaling of MPI Codes
A practical talk on using MPI with focus on:
–Distribution of work within a parallel program
–Placement of computation within a parallel computer
–Performance costs of different types of communication
–Understanding scaling performance terminology

Topics Application Scaling Load Balance Synchronization Simple stuff File I/O

Scale: Practical Importance
Time required to compute the NxN matrix product C = A*B
Assuming you can address 64 GB from one task, can you wait a month?
How to balance computational goal vs. compute resources?
Choose the right scale!
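
A quick sanity check of this kind can be scripted. The sketch below is a minimal C version; the matrix dimension (N = 50,000, chosen so the three matrices just fit in roughly 60 GB) and the 10 GFLOP/s sustained rate are assumptions for illustration, not measurements.

    #include <stdio.h>

    /* Back-of-the-envelope cost of C = A*B for NxN matrices:
       ~2*N^3 flops and 3*N^2 doubles of memory.
       N and the sustained flop rate below are assumed, not measured. */
    int main(void)
    {
        double n     = 50000.0;            /* matrix dimension N (assumed)  */
        double flops = 2.0 * n * n * n;    /* multiply-adds for C = A*B     */
        double bytes = 3.0 * n * n * 8.0;  /* A, B, C as 8-byte doubles     */
        double rate  = 10e9;               /* assumed sustained 10 GFLOP/s  */

        printf("memory  : %.1f GB\n", bytes / 1e9);
        printf("walltime: %.1f hours on one task\n", flops / rate / 3600.0);
        return 0;
    }

With these assumed inputs the estimate is ~60 GB of memory and a few hours of walltime; changing N or the sustained rate moves the answer from minutes to months, which is exactly the trade-off the slide is asking about.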

Let’s jump to an example
Sharks and Fish II: N² parallel force evaluation, e.g. 4 CPUs evaluate forces for 125 fish
Domain decomposition: each CPU is “in charge” of ~31 fish, but keeps a fairly recent copy of all the fish positions (replicated data)
It is not always possible to uniformly decompose problems in general, especially in many dimensions
This toy problem is simple, has fine granularity, and is 2D
Let’s see how it scales

Sharks and Fish II : Program
Data:
–n_fish (global)
–my_fish (local)
–fish_i = {x, y, …}
Dynamics:
–F = ma
–V = Σ 1/r_ij
–dq/dt = m * p
–dp/dt = -dV/dq

MPI_Allgatherv(myfish_buf, len[rank], MPI_FishType, …);
for (i = 0; i < my_fish; ++i) {
  for (j = 0; j < n_fish; ++j) {   /* i != j */
    a_i += g * mass_j * (fish_i - fish_j) / r_ij;
  }
}
Move fish
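
For readers who want to see the whole pattern, here is a hedged, self-contained C sketch of the replicated-data force step. The flat 3-doubles-per-fish layout, the coupling constant g, the softening term, and the use of MPI_DOUBLE in place of a derived MPI_FishType are all illustrative choices, not the original program.

    #include <mpi.h>
    #include <stdlib.h>

    /* Replicated-data O(N^2) force evaluation, in the spirit of the slide.
       Each task owns n_local fish but gathers all n_fish positions per step.
       Fish are stored as 3 doubles (x, y, mass); a production code would
       define a derived MPI_FishType instead. All names are illustrative. */
    void force_step(double *my_fish, int n_local, int n_fish,
                    double *all_fish, double *ax, double *ay, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        /* Per-task element counts and displacements (in doubles). */
        int *counts = malloc(size * sizeof(int));
        int *displs = malloc(size * sizeof(int));
        int n_elems = 3 * n_local;
        MPI_Allgather(&n_elems, 1, MPI_INT, counts, 1, MPI_INT, comm);
        displs[0] = 0;
        for (int p = 1; p < size; ++p)
            displs[p] = displs[p - 1] + counts[p - 1];

        /* Everyone gets a fresh copy of every fish's position. */
        MPI_Allgatherv(my_fish, n_elems, MPI_DOUBLE,
                       all_fish, counts, displs, MPI_DOUBLE, comm);

        /* O(N^2/P) local work: my fish against all fish. */
        const double g = 1.0;                 /* assumed coupling constant */
        int my_off = displs[rank] / 3;
        for (int i = 0; i < n_local; ++i) {
            ax[i] = ay[i] = 0.0;
            double xi = all_fish[3 * (my_off + i)];
            double yi = all_fish[3 * (my_off + i) + 1];
            for (int j = 0; j < n_fish; ++j) {
                if (j == my_off + i) continue;
                double dx = all_fish[3 * j]     - xi;
                double dy = all_fish[3 * j + 1] - yi;
                double m  = all_fish[3 * j + 2];
                double r2 = dx * dx + dy * dy + 1e-12;  /* softening (assumed) */
                ax[i] += g * m * dx / r2;
                ay[i] += g * m * dy / r2;
            }
        }
        free(counts);
        free(displs);
    }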

Sharks and Fish II: How fast?
100 fish can move 1000 steps in:
–1 task: 5.459 s
–32 tasks: 2.756 s (1.98x speedup)
1000 fish can move 1000 steps in:
–1 task: … s
–32 tasks: … s (24.6x speedup)
So what’s the “best” way to run?
–How many fish do we really have?
–How large a computer (time) do we have?
–How quickly do we need the answer?

Scaling: Good 1st Step: Do runtimes make sense?
Running fish_sim for varying numbers of fish on 1-32 CPUs, we see time ~ fish²
[Plot: runtime vs. number of fish, for 1 task and 32 tasks]

Scaling: Walltimes
Walltime is (all-)important, but let’s look at some other scaling metrics.
[Plot: each line is a contour describing the computations doable in a given time]

Scaling: terminology
Scaling studies involve changing the degree of parallelism. Will we be changing the problem also?
–Strong scaling: fixed problem size
–Weak scaling: problem size grows with additional compute resources
How do we measure success in parallel scaling?
–Speedup = T_s / T_p(n)
–Efficiency = T_s / (n * T_p(n))
Multiple definitions exist!
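
Plugging the 100-fish timings from above into these definitions (taking the 1-task run as T_s) gives a quick worked example:

    S(32) = T_s / T_p(32) = 5.459 / 2.756 ≈ 1.98
    E(32) = T_s / (32 * T_p(32)) = S(32) / 32 ≈ 0.062

so the 32-task run is nearly twice as fast but uses the machine at only about 6% efficiency.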

Scaling: Speedups

Scaling: Efficiencies
Remarkably smooth! Often algorithm and architecture make the efficiency landscape quite complex.

Scaling: Analysis
Why does efficiency drop?
–Serial code sections: Amdahl’s law
–Surface-to-volume ratio: communication bound
–Algorithm complexity or switching
–Communication protocol switching
–Scalability of computer and interconnect: whoa!
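
Amdahl’s law quantifies the first of these. If a fraction f of the runtime is inherently serial, then (a standard result, stated here for reference):

    S(n) <= 1 / (f + (1 - f)/n),   and   S(n) -> 1/f as n -> infinity

For example, a code that is only 1% serial (f = 0.01) can never speed up by more than 100x, no matter how many tasks are added.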

Scaling: Analysis
In general, changing problem size and concurrency exposes or removes compute resources; bottlenecks shift, and the first bottleneck wins.
Scaling brings additional resources too:
–More CPUs (of course)
–More cache(s)
–More memory bandwidth in some cases

Scaling: Superlinear Speedup
[Plot: speedup vs. # CPUs (OMP)]

Strong Scaling: Communication Bound
64 tasks: 52% comm; 192 tasks: 66% comm; 768 tasks: 79% comm
The MPI_Allreduce buffer size is 32 bytes.
Q: What resource is being depleted here?
A: Small-message latency
1) Compute per task is decreasing
2) Synchronization rate is increasing
3) Surface:volume ratio is increasing
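
To see the latency-bound regime directly, a small timing loop like the one below can be run at several concurrencies. The 4-double payload matches the 32-byte buffer above; the iteration count and output format are arbitrary choices.

    #include <mpi.h>
    #include <stdio.h>

    /* Time many back-to-back 32-byte (4 double) MPI_Allreduce calls: at this
       size the cost is dominated by latency, not bandwidth. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, ntasks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

        double in[4] = {1, 2, 3, 4}, out[4];
        const int iters = 1000;

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; ++i)
            MPI_Allreduce(in, out, 4, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double t = (MPI_Wtime() - t0) / iters;

        if (rank == 0)
            printf("%d tasks: %.1f us per 32-byte MPI_Allreduce\n",
                   ntasks, 1e6 * t);
        MPI_Finalize();
        return 0;
    }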

Sharks and Atoms
At HPC centers like NERSC, fish are rarely modeled as point masses. The associated algorithms and their scalings are nonetheless of great practical importance for scientific problems.
Examples: Particle Mesh Ewald, materials science computation

Topics Load Balance Synchronization Simple stuff File I/O
Now, instead of looking at the scaling of specific applications, let’s look at general issues in parallel application scalability.

Load Balance : Application Cartoon
[Cartoon: a universal app timeline, unbalanced vs. balanced; the difference is the time saved by load balance]
We will define synchronization later.

Load Balance : performance data
[Plot: MPI ranks sorted by total communication time]
Communication time: 64 tasks show 200 s, 960 tasks show 230 s.

Load Balance: ~code

while (1) {
  do_flops(N_i);
  MPI_Alltoall();
  MPI_Allreduce();
}

[Plot: per-rank timings at 64 and 960 tasks]
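
A self-contained toy version of this kernel is sketched below. The rank-dependent work split, iteration counts, and buffer sizes are invented for illustration, but the structure (unequal do_flops followed by synchronizing collectives) matches the loop above.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Busy work standing in for do_flops(N_i). */
    static double do_flops(long n)
    {
        double s = 0.0;
        for (long i = 1; i <= n; ++i)
            s += 1.0 / (double)i;
        return s;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Invented imbalance: the last few ranks get twice the work. */
        int  nslow = size / 16 + 1;
        long n_i   = (rank >= size - nslow) ? 20000000L : 10000000L;

        double *abuf = calloc(size, sizeof(double));
        double *bbuf = calloc(size, sizeof(double));
        double local, global, tcomm = 0.0;

        for (int step = 0; step < 10; ++step) {
            local = do_flops(n_i);

            /* Everyone waits here for the slowest rank, so imbalance shows
               up as time inside the collectives on the fast ranks. */
            double t0 = MPI_Wtime();
            MPI_Alltoall(abuf, 1, MPI_DOUBLE, bbuf, 1, MPI_DOUBLE, MPI_COMM_WORLD);
            MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
            tcomm += MPI_Wtime() - t0;
        }

        printf("rank %4d: %.2f s in collectives\n", rank, tcomm);
        free(abuf);
        free(bbuf);
        MPI_Finalize();
        return 0;
    }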

Load Balance: real code
[Plot: time vs. MPI rank, broken down into sync, flops, and exchange]

Load Balance : analysis
The 64 slow tasks (with more compute work) cause 30 seconds more “communication” in 960 tasks.
This leads to 28,800 CPU*seconds (8 CPU*hours) of unproductive computing.
All load imbalance requires is one slow task and a synchronizing collective!
Pair problem size and concurrency well. Parallel computers allow you to waste time faster!

Load Balance : FFT
Q: When is imbalance good?
A: When it leads to a faster algorithm.

Load Balance: Summary
–Imbalance is most often a byproduct of data decomposition
–It must be addressed before further MPI tuning can happen
–Good software exists for graph partitioning / remeshing
–For regular grids consider padding or contracting

Topics Load Balance Synchronization Simple stuff File I/O

Scaling of MPI_Barrier()
[Plot: MPI_Barrier cost, spanning four orders of magnitude]

Synchronization: terminology

MPI_Barrier(MPI_COMM_WORLD);
T1 = MPI_Wtime();
/* e.g. */ MPI_Allreduce();
T2 = MPI_Wtime() - T1;

How synchronizing is MPI_Allreduce?
For a code running on N tasks, what is the distribution of the T2’s?
The average and width of this distribution tell us how synchronizing e.g. MPI_Allreduce is relative to some given interconnect (HW & SW).

Synchronization : MPI Functions
Completion semantics of MPI functions:
–Local: leave based on local logic (MPI_Comm_rank, MPI_Get_count)
–Probably local: try to leave without messaging other tasks (MPI_Isend/Irecv)
–Partially synchronizing: leave after messaging M < N tasks (MPI_Bcast, MPI_Reduce)
–Fully synchronizing: leave after everyone else enters (MPI_Barrier, MPI_Allreduce)

seaborg.nersc.gov
It’s hard to discuss synchronization outside of the context of a particular parallel computer. MPI timings depend on HW, SW, and environment:
–How much of MPI is handled by the switch adapter?
–How big are messaging buffers?
–How many thread locks per function?
–How noisy is the machine (today)?
This is hard to model, so take an empirical approach based on an IBM SP that is largely applicable to other clusters…

seaborg.nersc.gov basics
[Diagram: IBM SP, 380 x 16-way SMP NH II nodes (main memory, GPFS) connected by the Colony switch (CSS0/CSS1) to GPFS and HPSS; 380 x 16 = 6,080 dedicated compute CPUs, 96 shared login CPUs]

Resource         Speed     Size
Registers        3 ns      2560 B
L1 Cache         5 ns      32 KB
L2 Cache         45 ns     8 MB
Main Memory      300 ns    16 GB
Remote Memory    19 us     7 TB
GPFS             10 ms     50 TB
HPSS             5 s       9 PB

–Hierarchy of caching, speeds not balanced
–Bottleneck determined by first depleted resource

MPI on the IBM SP
[Diagram: 16-way SMP NH II node (main memory, GPFS) attached to the Colony switch via CSS0/CSS1]
–…-way concurrency
–MPI-1 and ~MPI-2
–GPFS-aware MPI-IO
–Thread safety
–Ranks on the same node bypass the switch

MPI: seaborg.nersc.gov Intra and Inter Node Communication

MP_EUIDEVICE (fabric)   Bandwidth (MB/sec)   Latency (usec)
css0                    500 / 350            9 / 21
css1                    X                    X
csss                    500 / 350            9 / 21

–Lower latency: can satisfy more syncs/sec
–What is the benefit of two adapters?
–This is for a single pair of tasks
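
Numbers like these typically come from a simple ping-pong microbenchmark between one pair of tasks. A minimal sketch follows; the message size and repetition count are arbitrary choices, not the settings behind the table above.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Ping-pong between ranks 0 and 1: half the round-trip time estimates
       latency (small messages) or bandwidth = msgsize / time (large ones). */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int msgsize = 1 << 20;     /* 1 MB; vary this to see both regimes */
        const int reps = 100;
        char *buf = malloc(msgsize);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; ++i) {
            if (rank == 0) {
                MPI_Send(buf, msgsize, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, msgsize, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, msgsize, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, msgsize, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double dt = (MPI_Wtime() - t0) / reps / 2.0;   /* one-way time */

        if (rank == 0)
            printf("%d bytes: %.1f us one-way, %.1f MB/s\n",
                   msgsize, 1e6 * dt, msgsize / dt / 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }

Placing the pair on the same node or on different nodes reproduces the intranode vs. internode distinction discussed on the next slides.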

Seaborg : point to point messaging
[Diagram: two 16-way SMP NH II nodes; intranode messages stay within a node, internode messages cross the switch]
Switch BW and latency are often stated in optimistic terms. The number and size of concurrent messages change things. A fat tree / crossbar switch helps hide this.

Inter-Node Bandwidth
[Plot: inter-node bandwidth vs. message size for csss and css0]
–Tune message size to optimize throughput
–Aggregate messages when possible (see the sketch below)
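
As a concrete illustration of aggregation (not the benchmark behind this plot), the fragment below replaces many tiny, latency-bound sends to one destination with a single larger send; the receiver would post one matching MPI_Recv of nmsg doubles. The function name and record layout are invented.

    #include <mpi.h>

    /* Aggregation sketch: nmsg small values bound for the same rank are sent
       as one message, paying the per-message latency once instead of nmsg times. */
    void send_aggregated(const double *vals, int nmsg, int dest)
    {
        /* Unaggregated version, for contrast:
           for (int k = 0; k < nmsg; ++k)
               MPI_Send(&vals[k], 1, MPI_DOUBLE, dest, k, MPI_COMM_WORLD);
        */
        MPI_Send(vals, nmsg, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    }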

MPI Performance is often Hierarchical
Message size and task placement are key to performance.
[Plot: intra-node vs. inter-node messaging performance]

MPI: Latency is not always 1 or 2 numbers
The set of all possible latencies describes the interconnect geometry from the application’s perspective.

Synchronization: measurement

MPI_Barrier(MPI_COMM_WORLD);
T1 = MPI_Wtime();
/* e.g. */ MPI_Allreduce();
T2 = MPI_Wtime() - T1;

How synchronizing is MPI_Allreduce?
For a code running on N tasks, what is the distribution of the T2’s?
One can derive the level of synchronization from MPI algorithms. Instead, let’s just measure…
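
Written out as a complete program, that measurement might look like the following; the min/avg/max summary at the end is a convenience added here, not part of the original slide.

    #include <mpi.h>
    #include <stdio.h>

    /* Measure how synchronizing MPI_Allreduce is: after a barrier, every task
       times its own call; the spread of those times across tasks is the
       distribution of T2 discussed above. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, ntasks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

        double in = (double)rank, out;

        MPI_Barrier(MPI_COMM_WORLD);
        double t1 = MPI_Wtime();
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double t2 = MPI_Wtime() - t1;

        /* Summarize the distribution of T2 across tasks. */
        double tmin, tmax, tsum;
        MPI_Reduce(&t2, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
        MPI_Reduce(&t2, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        MPI_Reduce(&t2, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("T2 over %d tasks: min %.1f us, avg %.1f us, max %.1f us\n",
                   ntasks, 1e6 * tmin, 1e6 * tsum / ntasks, 1e6 * tmax);

        MPI_Finalize();
        return 0;
    }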

Synchronization: MPI Collectives
[Plot: distribution of collective timings at 2048 tasks]
Beyond load balance, there is a distribution of MPI timings intrinsic to the MPI call.

Synchronization: Architecture
…and from the machine itself: kernel process scheduling, Unix cron et al.
[Plot: noise sources and their frequency t]

Intrinsic Synchronization : Alltoall

Architecture makes a big difference!

This leads to variability in Execution Time

Synchronization : Summary
As a programmer you can control:
–Which MPI calls you use (it’s not required to use them all)
–Message sizes, problem size (maybe)
–The temporal granularity of synchronization, i.e., where synchronizations occur
Language writers and system architects control:
–How hard it is to do the above
–The intrinsic amount of noise in the machine

Topics Load Balance Synchronization Simple stuff File I/O

Simple Stuff Parallel programs are easier to mess up than serial ones. Here are some common pitfalls.

What’s wrong here?

Is MPI_Barrier time bad? Probably. Is it avoidable?
There are ~three cases:
1) The stray / unknown / debug barrier
2) The barrier that is masking compute imbalance
3) Barriers used for I/O ordering
Often very easy to fix.
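
For the third case, one common fix is to drop the barriers entirely and funnel ordered output through a single rank. A minimal sketch follows; the file name, record type, and choice of MPI_Gather are assumptions for illustration, not a prescription.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Instead of "rank k: barrier; append to file" repeated N times, gather
       the per-rank records to rank 0 and write them in order, barrier-free. */
    void write_ordered(double my_record, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        double *all = NULL;
        if (rank == 0)
            all = malloc(size * sizeof(double));

        MPI_Gather(&my_record, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, 0, comm);

        if (rank == 0) {
            FILE *f = fopen("records.out", "w");   /* hypothetical file name */
            for (int r = 0; r < size; ++r)
                fprintf(f, "%d %g\n", r, all[r]);
            fclose(f);
            free(all);
        }
    }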

Topics Load Balance Synchronization Simple stuff File I/O

Parallel File I/O : Strategies
[Diagram: ways of moving data between MPI tasks and disk]
Some strategies fall down at scale.

Parallel File I/O: Metadata
A parallel file system is great, but it is also another place to create contention.
–Avoid unneeded disk I/O; know your file system
–Often avoid file-per-task I/O strategies when running at scale
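
One alternative to file-per-task output is a single shared file written through MPI-IO (the “GPFS aware MPI-IO” mentioned earlier). Here is a minimal sketch, with an invented file name and a fixed-size block of doubles per task; real codes would pick offsets from their decomposition.

    #include <mpi.h>

    /* Each task writes its block at a rank-computed offset into one shared
       file, avoiding N separate files and their metadata traffic. */
    void write_shared(const double *buf, int count, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);

        MPI_File fh;
        MPI_File_open(comm, "output.dat",                  /* hypothetical name */
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
    }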

Topics Load Balance Synchronization Simple stuff File I/O Happy Scaling!

Other sources of information:
–MPI Performance:
–Seaborg MPI Scaling:
–MPI Synchronization: Fabrizio Petrini, Darren J. Kerbyson, Scott Pakin, “The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q”, in Proc. SuperComputing, Phoenix, November 2003.
–Domain decomposition: google://”space filling”&”decomposition” etc.
–Metis:

Dynamical Load Balance: Motivation
[Plot: time vs. MPI rank, broken down into sync, flops, and exchange]