NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER
1 Scaling Up MPI and MPI-I/O on seaborg.nersc.gov
David Skinner, NERSC Division, Berkeley Lab

2 Scaling: Motivation
NERSC’s focus is on capability computation
  – Capability == jobs that use ¼ or more of the machine’s resources
Parallelism can deliver scientific results unattainable on workstations.
“Big Science” problems are more interesting!

3 Scaling: Challenges
CPUs are outpacing memory bandwidth and switches, leaving FLOPs increasingly isolated.
Vendors often have machines < ½ the size of NERSC machines: system software may be operating in uncharted regimes
  – MPI implementation
  – Filesystem metadata systems
  – Batch queue system
NERSC consultants can help
Users need information on how to mitigate the impact of these issues for large concurrency applications.

4 Seaborg.nersc.gov
MP_EUIDEVICE (switch fabric)   MPI Bandwidth (MB/sec)     MPI Latency (usec)
css0                           500 / 350                  8 / 16
css1
csss                           500 / 350 (single task)    8 / 16

5 Switch Adapter Bandwidth: csss

6 Switch Adapter Comparison
[Plots: csss, css0]
Tune message size to optimize throughput
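Why message size matters can be sketched with a simple cost model in which each message pays a fixed latency plus its size divided by the peak bandwidth. This model and the function names below are assumptions for illustration, not something stated in the slides; the 16 usec and 350 MB/sec used in the note afterwards are figures that appear in the table on slide 4.

```c
#include <assert.h>

/* Effective bandwidth (bytes/s) of one message under an assumed
 * latency + size/bandwidth cost model (not taken from the slides). */
double effective_bandwidth(double msg_bytes, double latency_s,
                           double peak_bytes_per_s)
{
    assert(msg_bytes > 0.0 && latency_s >= 0.0 && peak_bytes_per_s > 0.0);
    double transfer_time = latency_s + msg_bytes / peak_bytes_per_s;
    return msg_bytes / transfer_time;
}

/* Message size at which the model reaches half of the peak bandwidth:
 * the point where latency equals the pure transfer time. */
double half_peak_message_bytes(double latency_s, double peak_bytes_per_s)
{
    assert(latency_s >= 0.0 && peak_bytes_per_s > 0.0);
    return latency_s * peak_bytes_per_s;
}
```

Under this model, with 16 usec latency and 350 MB/sec peak, a message needs to be about 5600 bytes before half the peak bandwidth is reached; smaller messages are dominated by latency.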

7 Switch Adapter Considerations
For data-decomposed applications with some locality, partition the problem along SMP boundaries (minimize the surface-to-volume ratio)
Use MP_SHAREDMEMORY to minimize switch traffic
csss is most often the best route to the switch

8 Job Start Up times

9 Synchronization
On the SP each SMP image is scheduled independently, and while user code is waiting the OS will schedule other tasks
A fully synchronizing MPI call requires everyone’s attention
By analogy, imagine trying to go to lunch with 1024 people
Probability that everyone is ready at any given time scales poorly
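The lunch analogy can be made concrete with a toy model: if each task is independently ready with probability p, all N are ready together with probability p^N. Independence and the function name are assumptions made for illustration; real scheduling interference is correlated and harder to model.

```c
#include <assert.h>

/* Probability that all ntasks are ready at once, assuming each task is
 * independently ready with probability p_ready (illustrative model only). */
double all_ready_probability(double p_ready, int ntasks)
{
    assert(p_ready >= 0.0 && p_ready <= 1.0 && ntasks >= 0);
    double p_all = 1.0;
    for (int i = 0; i < ntasks; ++i)
        p_all *= p_ready;   /* one factor of p_ready per task */
    return p_all;
}
```

Even if every task is ready 99.9% of the time, all_ready_probability(0.999, 1024) comes out to roughly 0.36 under this model.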

10 Scaling of MPI_Barrier()

11 Load Balance
If one task lags the others in time to complete, synchronization suffers; e.g., a 3% slowdown in one task can mean a 50% slowdown for the code overall
Seek out and eliminate sources of variation
Distribute the problem uniformly among nodes/CPUs
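One way to look for variation is to time the work each task does between synchronization points and compare the slowest task to the mean. The metric and function name below are a minimal sketch chosen for illustration, not a measurement method given in the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Ratio of the slowest task's time to the mean task time.
 * 1.0 means perfectly balanced; larger values mean the other tasks
 * spend more time waiting at the next synchronization point. */
double load_imbalance(const double *task_seconds, size_t ntasks)
{
    assert(task_seconds != NULL && ntasks > 0);
    double max = task_seconds[0];
    double sum = 0.0;
    for (size_t i = 0; i < ntasks; ++i) {
        if (task_seconds[i] > max)
            max = task_seconds[i];
        sum += task_seconds[i];
    }
    assert(sum > 0.0);
    return max / (sum / (double)ntasks);
}
```

In an MPI code the per-task times would typically be gathered to one rank before calling such a function; that gathering step is left out here.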

12 Synchronization: MPI_Bcast
2048 tasks

13 Synchronization: MPI_Alltoall
2048 tasks

14 Synchronization (continued)
MPI_Alltoall and MPI_Allreduce can be particularly bad in the range of 512 tasks and above
Use MPI_Bcast if possible, which is not fully synchronizing
Remove unneeded MPI_Barrier calls
Use immediate sends and asynchronous I/O when possible

15 Improving MPI Scaling on Seaborg

16 The SP switch
Use MP_SHAREDMEMORY=yes (default)
Use MP_EUIDEVICE=csss (default)
Tune message sizes
Reduce synchronizing MPI calls

17 64 bit MPI
32 bit MPI has inconvenient memory limits
  – 256MB per task default and 2GB maximum
  – 1.7GB can be used in practice, but depends on MPI usage
  – The scaling of this internal usage is complicated, but larger concurrency jobs have more of their memory “stolen” by MPI’s internal buffers and pipes
64 bit MPI removes these barriers
  – 64 bit MPI is fully supported
  – Just remember to use “_r” compilers and “-q64”
Seaborg has 16, 32, and 64 GB per node available

18 How to measure MPI memory usage?
2048 tasks

19 MP_PIPE_SIZE: 2*PIPE_SIZE*(ntasks-1)
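The formula on this slide can be evaluated directly to see how pipe buffers grow with concurrency. The sketch below assumes the result is in bytes and applies to each task; the function name and the 64 KB pipe size used in the note afterwards are assumptions, not values given in the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Memory taken by MPI pipe buffers, following the slide's formula
 * 2 * PIPE_SIZE * (ntasks - 1). Assumed here to be bytes per task. */
int64_t pipe_buffer_bytes(int64_t pipe_size_bytes, int64_t ntasks)
{
    assert(pipe_size_bytes >= 0 && ntasks >= 1);
    return 2 * pipe_size_bytes * (ntasks - 1);
}
```

With an assumed 64 KB pipe, pipe_buffer_bytes(65536, 2048) gives 268304384 bytes, close to 256 MB, which is why the growth is linear in task count and noticeable at high concurrency.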

20 OpenMP
Using a mixed model, even when no underlying fine-grained parallelism is present, can take strain off of the MPI implementation; e.g., on seaborg a 2048-way job can run with only 128 MPI tasks and 16 OpenMP threads per task
Having hybrid code whose concurrencies can be tuned between MPI tasks and OpenMP threads has portability advantages

21 Beware Hidden Multithreading
ESSL and IBM Fortran have autotasking-like “features” which function via creation of unspecified numbers of threads.
Fortran RANDOM_NUMBER intrinsic has some well known scaling problems. http://www.nersc.gov/projects/scaling/random_number.html
XLF: “-qsmp=auto” uses threads to auto-parallelize code.
ESSL: libesslsmp.a has an autotasking feature.
Synchronization problems are unpredictable when using these features.
Performance is impacted when there are too many threads.

22 MP_LABELIO, phost
Labeled I/O will let you know which task generated the message “segmentation fault”, gave the wrong answer, etc.
export MP_LABELIO=yes
Run /usr/common/usg/bin/phost prior to your parallel program to map machine names to POE tasks
  – MPI and LAPI versions available
  – Host lists are useful in general

23 Core files
Core dumps don’t scale (no parallel work)
MP_COREDIR=none -> No corefile I/O
MP_COREFILE_FORMAT=light_core -> Less I/O
LL script to save just one full-fledged core file, throw away others:
  …
  if [ "$MP_CHILD" != "0" ]; then
    export MP_COREDIR=/dev/null
  fi
  …

24 Debugging
In general, debugging at 512 tasks and above is error prone and cumbersome.
Debug at a smaller scale when possible.
Use shared-memory-device MPICH on a workstation with lots of memory as a mock-up high concurrency environment.
For crashed jobs, examine LL logs for memory usage history (ask a NERSC consultant for help with this).

25 Parallel I/O
Can be a significant source of variation in task completion prior to synchronization
Limit the number of readers or writers when appropriate.
Pay attention to file creation rates.
Output reduced quantities when possible

26 Summary
Resources are present to face the challenges posed by scaling up MPI applications on seaborg.
Hopefully, scientists will expand their problem scopes to tackle increasingly challenging computational problems.
NERSC consultants can provide help in achieving scaling goals.

28 Scaling of Parallel I/O on GPFS

29 Motivation
NERSC uses GPFS for $HOME and $SCRATCH
Local disk filesystems on seaborg (/tmp) are tiny
Growing data sizes and concurrencies often outpace I/O methodologies

30 GPFS@Seaborg.nersc.gov
Each compute node relies on the GPFS nodes as gateways to storage
16 nodes are dedicated to serving GPFS filesystems

31 Common Problems when Implementing Parallel IO
CPU utilization suffers as time is lost to I/O
Variation in write times can be severe, leading to batch job failure

32 Finding solutions
Checkpoint (saving state) IO pattern
Survey strategies to determine the rate and variation in rate

34 Parallel I/O Strategies

35 Multiple File I/O
  /* One file per task. rank_dir() and fname_r come from the benchmark source (not shown). */
  if (private_dir) rank_dir(1, rank);
  fp = fopen(fname_r, "w");
  fwrite(data, nbyte, 1, fp);
  fclose(fp);
  if (private_dir) rank_dir(0, rank);
  MPI_Barrier(MPI_COMM_WORLD);

36 Single File I/O
  /* All tasks share one file; each writes its nbyte block at offset rank*nbyte. */
  fd = open(fname, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
  lseek(fd, (off_t)rank * (off_t)nbyte, SEEK_SET);
  write(fd, data, nbyte);
  close(fd);
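Offsets of the form rank*nbyte exceed the 32-bit int range quickly at high concurrency, so each operand should be widened before the multiplication rather than casting the finished product. The helper below is a hypothetical illustration of that point using int64_t; it is not part of the benchmark code on the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of a task's block in a shared file.
 * Both operands are widened first, so the product is computed in
 * 64 bits instead of overflowing a 32-bit int. */
int64_t block_offset(int rank, int nbyte)
{
    assert(rank >= 0 && nbyte >= 0);
    return (int64_t)rank * (int64_t)nbyte;
}
```

For example, rank 2047 with 2,000,000-byte blocks starts at byte 4,094,000,000, which no longer fits in a signed 32-bit integer.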

37 MPI-I/O
  /* MPIIO_FILE_HINT0 must expand to a key, value pair: MPI_Info_set takes (info, key, value). */
  MPI_Info_set(mpiio_file_hints, MPIIO_FILE_HINT0);
  MPI_File_open(MPI_COMM_WORLD, fname, MPI_MODE_CREATE | MPI_MODE_RDWR, mpiio_file_hints, &fh);
  /* The displacement argument is an MPI_Offset. */
  MPI_File_set_view(fh, (MPI_Offset)rank * (MPI_Offset)nbyte, MPI_DOUBLE, MPI_DOUBLE, "native", mpiio_file_hints);
  MPI_File_write_all(fh, data, ndata, MPI_DOUBLE, &status);
  MPI_File_close(&fh);

38 Results

39 Scaling of single file I/O

40 Scaling of multiple file and MPI I/O

41 Large block I/O
MPI I/O on the SP includes the file hint IBM_largeblock_io
IBM_largeblock_io=true used throughout; default values show large variation
IBM_largeblock_io=true also turns off data shipping

42 Large block I/O = false
MPI on the SP includes the file hint IBM_largeblock_io
Except above, IBM_largeblock_io=true used throughout
IBM_largeblock_io=true also turns off data shipping

43 Bottlenecks to scaling
Single file I/O has a tendency to serialize
Scaling up with multiple files creates filesystem problems
Akin to data shipping, consider the intermediate case
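One reading of the intermediate case is that every group of n consecutive ranks funnels its data through a single writer, so the number of files or I/O streams falls by a factor of n. The mapping below is a minimal sketch of that idea; the function names and the grouping by consecutive ranks are assumptions, not the implementation used for the results in this deck.

```c
#include <assert.h>

/* Rank of the task that performs I/O on behalf of `rank` when every
 * group of `group_size` consecutive ranks shares one writer. */
int writer_for_rank(int rank, int group_size)
{
    assert(rank >= 0 && group_size > 0);
    return rank - rank % group_size;
}

/* Number of writers needed for ntasks ranks (last group may be partial). */
int writer_count(int ntasks, int group_size)
{
    assert(ntasks >= 0 && group_size > 0);
    return (ntasks + group_size - 1) / group_size;
}
```

With a group size of 16, which would correspond to one writer per 16-way SMP node if ranks are placed node by node, a 512-task job uses 32 writers instead of 512 files or one serialized file.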

44 Parallel IO with SMP aggregation (32 tasks)

45 Parallel IO with SMP aggregation (512 tasks)

46 Summary
[Chart: axis ticks 16 to 2048 (powers of two) and 1 MB to 100 G; legend: Serial, Multiple File, mod n, MPI IO, MPI IO collective]
