Scaling Up MPI and MPI-I/O on seaborg.nersc.gov
David Skinner, NERSC Division, Berkeley Lab

Presentation transcript:
Slide 1: Scaling Up MPI and MPI-I/O on seaborg.nersc.gov
David Skinner, NERSC Division, Berkeley Lab
National Energy Research Scientific Computing Center (NERSC)

Slide 2: Scaling: Motivation
– NERSC's focus is on capability computation.
  – Capability == jobs that use ¼ or more of the machine's resources.
– Parallelism can deliver scientific results unattainable on workstations.
– "Big Science" problems are more interesting!

Slide 3: Scaling: Challenges
– CPUs are outpacing memory bandwidth and switches, leaving FLOPs increasingly isolated.
– Vendors often have machines less than half the size of NERSC's, so system software may be operating in uncharted regimes:
  – MPI implementation
  – Filesystem metadata systems
  – Batch queue system
– NERSC consultants can help; users need information on how to mitigate the impact of these issues for large-concurrency applications.

Slide 4: Seaborg.nersc.gov

  MP_EUIDEVICE (switch fabric) | MPI Bandwidth (MB/sec)   | MPI Latency (usec)
  css0                         | 500 / 350                | 8 / 16
  css1                         |                          |
  csss                         | 500 / 350 (single task)  | 8 / 16

Slide 5: Switch Adapter Bandwidth: csss

Slide 6: Switch Adapter Comparison (csss vs. css0)
– Tune message size to optimize throughput.

Slide 7: Switch Adapter Considerations
– For data-decomposed applications with some locality, partition the problem along SMP boundaries (minimize the surface-to-volume ratio); a communicator-splitting sketch follows below.
– Use MP_SHAREDMEMORY to minimize switch traffic.
– csss is most often the best route to the switch.
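A minimal sketch of one way to partition along SMP boundaries (not from the slides): split MPI_COMM_WORLD into per-node communicators and keep the heaviest exchanges inside them. It assumes block task placement and 16 tasks per 16-way node; adjust TASKS_PER_NODE for other layouts.

  /* Group tasks by SMP node so intra-node exchanges stay in shared memory.
     Assumes ranks 0-15 on node 0, 16-31 on node 1, and so on. */
  #include <mpi.h>
  #include <stdio.h>

  #define TASKS_PER_NODE 16   /* assumption: one task per CPU on a 16-way node */

  int main(int argc, char **argv)
  {
      int rank, node_rank;
      MPI_Comm node_comm;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Ranks with the same color land in the same per-node communicator. */
      MPI_Comm_split(MPI_COMM_WORLD, rank / TASKS_PER_NODE, rank, &node_comm);
      MPI_Comm_rank(node_comm, &node_rank);

      printf("world rank %d is node-local rank %d\n", rank, node_rank);

      MPI_Comm_free(&node_comm);
      MPI_Finalize();
      return 0;
  }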

Slide 8: Job Start-Up Times

Slide 9: Synchronization
– On the SP each SMP image is scheduled independently; while user code is waiting, the OS will schedule other tasks.
– A fully synchronizing MPI call requires everyone's attention.
– By analogy, imagine trying to go to lunch with 1024 people.
– The probability that everyone is ready at any given time scales poorly (see the estimate below).
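A back-of-the-envelope estimate (an illustration, not from the slides): if each task is independently ready to enter a collective with probability $p$, then all $N$ tasks are ready at the same moment with probability

  $P(\text{all ready}) = p^{N}, \qquad \text{e.g. } p = 0.99,\ N = 1024 \;\Rightarrow\; 0.99^{1024} \approx 3\times10^{-5}.$

Even small per-task delays therefore make fully synchronizing calls expensive at scale.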

Slide 10: Scaling of MPI_Barrier()

Slide 11: Load Balance
– If one task lags the others, time to complete synchronization suffers; e.g., a 3% slowdown in one task can mean a 50% slowdown for the code overall.
– Seek out and eliminate sources of variation.
– Distribute the problem uniformly among nodes/CPUs.

Slide 12: Synchronization: MPI_Bcast (2048 tasks)

Slide 13: Synchronization: MPI_Alltoall (2048 tasks)

Slide 14: Synchronization (continued)
– MPI_Alltoall and MPI_Allreduce can be particularly bad in the range of 512 tasks and above.
– Use MPI_Bcast where possible; it is not fully synchronizing.
– Remove unneeded MPI_Barrier calls.
– Use immediate sends and asynchronous I/O when possible (see the sketch below).
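A minimal sketch of the immediate-send pattern (illustrative only; the neighbor ranks, buffer layout, and count N are placeholders, not from the slides): post nonblocking receives and sends, overlap with local work, then wait.

  /* Halo-style exchange with immediate (nonblocking) calls. recvbuf must hold
     2*N doubles: the left neighbor's data followed by the right neighbor's. */
  #include <mpi.h>

  void exchange(double *sendbuf, double *recvbuf, int N,
                int left, int right, MPI_Comm comm)
  {
      MPI_Request req[4];

      MPI_Irecv(recvbuf,     N, MPI_DOUBLE, left,  0, comm, &req[0]);
      MPI_Irecv(recvbuf + N, N, MPI_DOUBLE, right, 1, comm, &req[1]);
      MPI_Isend(sendbuf,     N, MPI_DOUBLE, left,  1, comm, &req[2]);
      MPI_Isend(sendbuf + N, N, MPI_DOUBLE, right, 0, comm, &req[3]);

      /* ... overlap: do interior work that does not need the halo ... */

      MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
  }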

Slide 15: Improving MPI Scaling on Seaborg

Slide 16: The SP Switch
– Use MP_SHAREDMEMORY=yes (default).
– Use MP_EUIDEVICE=csss (default).
– Tune message sizes.
– Reduce synchronizing MPI calls.

Slide 17: 64-bit MPI
– 32-bit MPI has inconvenient memory limits:
  – 256 MB per task by default and 2 GB maximum.
  – 1.7 GB can be used in practice, but this depends on MPI usage.
  – The scaling of this internal usage is complicated, but larger-concurrency jobs have more of their memory "stolen" by MPI's internal buffers and pipes.
– 64-bit MPI removes these barriers:
  – 64-bit MPI is fully supported.
  – Just remember to use the "_r" compilers and "-q64".
– Seaborg has 16, 32, and 64 GB per node available.

Slide 18: How to Measure MPI Memory Usage? (2048 tasks)

Slide 19: MP_PIPE_SIZE: 2 * PIPE_SIZE * (ntasks - 1)
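To make the formula concrete (an illustration only; the 64 KB pipe size is an assumed value, not taken from the slides, and the actual setting on seaborg may differ): for a 2048-task job,

  $2 \times 64\,\mathrm{KB} \times 2047 \approx 256\,\mathrm{MB}$

of each task's address space would go to MPI pipes alone, which is one reason large-concurrency 32-bit jobs see their usable memory shrink.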

Slide 20: OpenMP
– Using a mixed model, even when no underlying fine-grained parallelism is present, can take strain off the MPI implementation; e.g., on seaborg a 2048-way job can run with only 128 MPI tasks and 16 OpenMP threads (sketch below).
– A hybrid code whose concurrency can be tuned between MPI tasks and OpenMP threads has portability advantages.
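A minimal hybrid sketch (illustrative; the array, loop, and reduction are placeholders): one MPI task per SMP, OpenMP threads across that SMP's CPUs, with MPI called only from the master thread.

  /* Hybrid MPI + OpenMP sketch (not from the slides). Build with the
     thread-safe ("_r") compilers and OpenMP enabled. */
  #include <mpi.h>
  #include <omp.h>

  #define N 1000000

  int main(int argc, char **argv)
  {
      static double a[N];
      double local = 0.0, global = 0.0;
      int rank, i;

      MPI_Init(&argc, &argv);              /* MPI used only outside parallel regions */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Node-local work shared among OpenMP threads */
      #pragma omp parallel for reduction(+:local)
      for (i = 0; i < N; i++) {
          a[i] = (double)(rank + i);
          local += a[i];
      }

      /* Coarse-grained communication between the (fewer) MPI tasks */
      MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

      MPI_Finalize();
      return 0;
  }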

Slide 21: Beware Hidden Multithreading
– ESSL and IBM Fortran have autotasking-like "features" that work by creating unspecified numbers of threads.
– The Fortran RANDOM_NUMBER intrinsic has some well-known scaling problems.
– XLF can use threads to auto-parallelize code via "-qsmp=auto".
– ESSL's libesslsmp.a has an autotasking feature.
– Synchronization problems are unpredictable when these features are used, and performance suffers when too many threads are created.

Slide 22: MP_LABELIO, phost
– Labeled I/O will tell you which task generated the message, printed "segmentation fault", gave the wrong answer, etc.: export MP_LABELIO=yes
– Run /usr/common/usg/bin/phost prior to your parallel program to map machine names to POE tasks.
  – MPI and LAPI versions are available.
  – Host lists are useful in general.

Slide 23: Core Files
– Core dumps don't scale (no parallel work).
– MP_COREDIR=none → no corefile I/O.
– MP_COREFILE_FORMAT=light_core → less I/O.
– LL script fragment to save just one full-fledged core file and throw away the others:

  if [ "$MP_CHILD" != "0" ]; then
    export MP_COREDIR=/dev/null
  fi

Slide 24: Debugging
– In general, debugging at 512 tasks and above is error-prone and cumbersome; debug at a smaller scale when possible.
– Use the shared-memory device of MPICH on a workstation with lots of memory as a mock-up of a high-concurrency environment.
– For crashed jobs, examine the LL logs for the memory-usage history (ask a NERSC consultant for help with this).

Slide 25: Parallel I/O
– Can be a significant source of variation in task completion prior to synchronization.
– Limit the number of readers or writers when appropriate.
– Pay attention to file creation rates.
– Output reduced quantities when possible (see the sketch below).
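A minimal sketch of "output reduced quantities" with a single writer (illustrative; the diagnostic and the filename "diag.out" are placeholders, not from the slides): reduce across tasks, then let one task do the I/O.

  /* Reduce a diagnostic across all tasks and write it from rank 0 only. */
  #include <mpi.h>
  #include <stdio.h>

  void write_reduced(double local_energy, MPI_Comm comm)
  {
      double total = 0.0;
      int rank;

      MPI_Comm_rank(comm, &rank);
      MPI_Reduce(&local_energy, &total, 1, MPI_DOUBLE, MPI_SUM, 0, comm);

      if (rank == 0) {                    /* one writer, one small file */
          FILE *fp = fopen("diag.out", "a");
          if (fp) {
              fprintf(fp, "%e\n", total);
              fclose(fp);
          }
      }
  }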

Slide 26: Summary
– Resources are available to meet the challenges posed by scaling up MPI applications on seaborg.
– Hopefully, scientists will expand their problem scopes to tackle increasingly challenging computational problems.
– NERSC consultants can provide help in achieving scaling goals.

Slide 28: Scaling of Parallel I/O on GPFS

Slide 29: Motivation
– NERSC uses GPFS for $HOME and $SCRATCH.
– Local disk filesystems on seaborg (/tmp) are tiny.
– Growing data sizes and concurrencies often outpace I/O methodologies.

Slide 30:
– Each compute node relies on the GPFS nodes as gateways to storage.
– 16 nodes are dedicated to serving GPFS filesystems.

Slide 31: Common Problems when Implementing Parallel I/O
– CPU utilization suffers as time is lost to I/O.
– Variation in write times can be severe, leading to batch job failure.

Slide 32: Finding Solutions
– Focus on the checkpoint (saving state) I/O pattern.
– Survey strategies to determine the rate and the variation in rate.

Slide 34: Parallel I/O Strategies

Slide 35: Multiple File I/O

  /* Each task writes its own file, optionally from its own private directory.
     rank_dir() and fname_r are per-rank helpers defined elsewhere in the benchmark. */
  if (private_dir) rank_dir(1, rank);     /* enter a per-rank directory */
  fp = fopen(fname_r, "w");               /* one file per task */
  fwrite(data, nbyte, 1, fp);
  fclose(fp);
  if (private_dir) rank_dir(0, rank);     /* return to the shared directory */
  MPI_Barrier(MPI_COMM_WORLD);

Slide 36: Single File I/O

  /* Each task writes its nbyte chunk into one shared file at offset rank*nbyte,
     so the chunks do not overlap. */
  fd = open(fname, O_CREAT | O_RDWR, S_IRUSR);
  lseek(fd, (off_t)rank * (off_t)nbyte, SEEK_SET);
  write(fd, data, nbyte);
  close(fd);

Slide 37: MPI-I/O

  /* Collective write of each task's chunk through MPI-I/O. mpiio_file_hints is an
     MPI_Info object created elsewhere; MPIIO_FILE_HINT0 stands for a hint key/value
     pair, e.g. "IBM_largeblock_io", "true". */
  MPI_Info_set(mpiio_file_hints, MPIIO_FILE_HINT0);
  MPI_File_open(MPI_COMM_WORLD, fname,
                MPI_MODE_CREATE | MPI_MODE_RDWR, mpiio_file_hints, &fh);
  MPI_File_set_view(fh, (off_t)rank * (off_t)nbyte,
                    MPI_DOUBLE, MPI_DOUBLE, "native", mpiio_file_hints);
  MPI_File_write_all(fh, data, ndata, MPI_DOUBLE, &status);
  MPI_File_close(&fh);

Slide 38: Results

Slide 39: Scaling of Single File I/O

Slide 40: Scaling of Multiple File and MPI I/O

Slide 41: Large Block I/O
– MPI-I/O on the SP includes the file hint IBM_largeblock_io.
– IBM_largeblock_io=true was used throughout; the default value shows large variation.
– IBM_largeblock_io=true also turns off data shipping.
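For reference, a hedged sketch of how such a hint is attached to the file open (fname and fh are placeholder names, not from the slides):

  /* Attach the IBM_largeblock_io hint to an MPI-I/O open. */
  #include <mpi.h>

  void open_with_hint(const char *fname, MPI_File *fh)
  {
      MPI_Info hints;

      MPI_Info_create(&hints);
      MPI_Info_set(hints, "IBM_largeblock_io", "true");

      MPI_File_open(MPI_COMM_WORLD, (char *)fname,
                    MPI_MODE_CREATE | MPI_MODE_RDWR, hints, fh);

      MPI_Info_free(&hints);   /* the info object is copied into the file handle */
  }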

Slide 42: Large Block I/O = false
– MPI-I/O on the SP includes the file hint IBM_largeblock_io.
– Except in the figure above, IBM_largeblock_io=true was used throughout.
– IBM_largeblock_io=true also turns off data shipping.

Slide 43: Bottlenecks to Scaling
– Single-file I/O has a tendency to serialize.
– Scaling up with multiple files creates filesystem problems.
– Akin to data shipping, consider the intermediate case: aggregate within each SMP and let a subset of tasks do the I/O (sketch below).
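A minimal sketch of that intermediate, "mod n" aggregation strategy (illustrative, not the slides' benchmark code; it assumes the per-node communicator node_comm from the earlier sketch and that nbyte is the same on every task): each node-sized group gathers its data to one aggregator, which writes the group's data as a single contiguous chunk.

  /* Gather nbyte bytes from each task in a group onto one aggregator per group,
     which then writes one file per group. */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  void aggregated_write(const char *fname, char *data, int nbyte,
                        MPI_Comm node_comm, int group_id)
  {
      int node_rank, node_size;
      char *agg_buf = NULL;

      MPI_Comm_rank(node_comm, &node_rank);
      MPI_Comm_size(node_comm, &node_size);

      if (node_rank == 0)
          agg_buf = malloc((size_t)nbyte * node_size);

      /* Funnel the group's data to the aggregator (rank 0 of node_comm). */
      MPI_Gather(data, nbyte, MPI_BYTE,
                 agg_buf, nbyte, MPI_BYTE, 0, node_comm);

      if (node_rank == 0) {               /* one writer per SMP group */
          char path[256];
          FILE *fp;
          snprintf(path, sizeof(path), "%s.%d", fname, group_id);
          fp = fopen(path, "w");
          if (fp) {
              fwrite(agg_buf, (size_t)nbyte, node_size, fp);
              fclose(fp);
          }
          free(agg_buf);
      }
  }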

Slide 44: Parallel I/O with SMP Aggregation (32 tasks)

Slide 45: Parallel I/O with SMP Aggregation (512 tasks)

Slide 46: Summary
(Summary chart; only the labels are recoverable: checkpoint data sizes of roughly 1 MB, 10 MB, 100 MB, 1 GB, 10 GB, and 100 GB versus the strategies Serial, Multiple File, mod n, MPI IO, and MPI IO collective.)
