NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER

Evolution of the NERSC SP System
NERSC User Services

  Original Plans
  Phase 1
  Phase 2
  Programming Models and Code Porting
  Using the System

June 2000: SP Evolution - NERSC User Services

Original Plans: The NERSC-3 Procurement
  Complete, reliable, high-end scientific system
  High availability and MTBF
  Fully configured - processing, storage, software, networking, support
  Commercially available components
  The greatest amount of computational power for the money
  Can be integrated with existing computing environment
  Can be evolved with product line
  Much careful benchmarking and acceptance testing done

Original Plans: The NERSC-3 Procurement
  What we wanted:
    – >1 teraflop of peak performance
    – 10 terabytes of storage
    – 1 terabyte of memory
  What we got in phase 1:
    – 410 gigaflops of peak performance
    – 10 terabytes of storage
    – 512 gigabytes of memory
  What we will get in phase 2:
    – 3 teraflops of peak performance
    – 15 terabytes of storage
    – 1 terabyte of memory

Hardware, Phase 1
  Power 3+ nodes: Nighthawk 1
    – Node usage:
        256 compute/batch nodes = 512 CPUs
        8 login nodes = 16 CPUs
        16 GPFS nodes = 32 CPUs
        8 network nodes = 16 CPUs
        16 service nodes = 32 CPUs
    – 2 processors/node
    – 200 MHz clock
    – 4 flops/clock (2 multiply-add ops) = 800 Mflops/CPU, 1.6 Gflops/node
    – 64 KB L-1 d-cache per 5 nsec & 3.2 GB/sec
    – 4 MB L-2 cache per 45 nsec & 6.4 GB/sec
    – 1 GB RAM per 175 nsec & 1.6 GB/sec
    – 150 MB/sec switch bandwidth
    – 9 GB local disk (two-way RAID)

Hardware, Phase 2
  Power 3+ nodes: Winterhawk 2
    – Node usage:
        128 compute/batch nodes = 2048 CPUs
        2 login nodes = 32 CPUs
        16 GPFS nodes = 256 CPUs
        2 network nodes = 32 CPUs
        4 service nodes = 64 CPUs
    – 16 processors/node
    – 375 MHz clock
    – 4 flops/clock (2 multiply-add ops) = 1.5 Gflops/CPU, 22.4 Gflops/node
    – 64 KB L-1 d-cache per 5 nsec & 3.2 GB/sec
    – 8 MB L-2 cache per 45 nsec & 6.4 GB/sec
    – 8 GB RAM per 175 nsec & 14.0 GB/sec
    – ~2000 (?) MB/sec switch bandwidth
    – 9 GB local disk (two-way RAID)
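
The peak figures on the two hardware slides follow from clock rate x flops per clock x CPU count. The short Python sketch below redoes that arithmetic for the compute/batch nodes only; the function and variable names are illustrative and not part of the original slides.

```python
# Peak floating-point rate = clock rate x flops per clock x CPU count.
# Clock rates, flops/clock, and node counts are the figures quoted on the
# two hardware slides; all names here are illustrative.

def peak_gflops(clock_mhz, flops_per_clock, cpus):
    """Return the peak rate in Gflops for `cpus` processors."""
    return clock_mhz * 1e6 * flops_per_clock * cpus / 1e9

# Phase 1: 200 MHz, 2 CPUs per node, 256 compute/batch nodes
phase1_cpu = peak_gflops(200, 4, 1)
phase1_node = peak_gflops(200, 4, 2)
phase1_system = peak_gflops(200, 4, 2 * 256)

# Phase 2: 375 MHz, 16 CPUs per node, 128 compute/batch nodes
phase2_cpu = peak_gflops(375, 4, 1)
phase2_node = peak_gflops(375, 4, 16)
phase2_system = peak_gflops(375, 4, 16 * 128)

print("phase 1:", phase1_cpu, phase1_node, phase1_system)
print("phase 2:", phase2_cpu, phase2_node, phase2_system)
```

The system totals agree with the procurement slide (about 410 Gflops and about 3 Tflops). The straightforward per-node product for Phase 2 comes out at 24 Gflops, so the 22.4 Gflops/node quoted on the slide presumably rests on an assumption not stated there.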

Programming Models, phase 1
  Phase 1 will rely on MPI, with availability of threading
    – OpenMP directives
    – Pthreads
    – IBM SMP directives
  MPI now does intra-node communications efficiently
  Mixed-model programming not currently very advantageous
  PVM and LAPI messaging systems are also available
  SHMEM is “planned”…
  The SP has cache and virtual memory, which means
    – There are more ways to reduce code performance
    – There are more ways to lose portability

Programming Models, phase 2
  Phase 2 will offer more payback for mixed-model programming
    – Single node parallelism is a good target for PVP users
    – Vector and shared-memory codes can be “expanded” into MPI
    – MPI codes can be ported from the T3E
    – Threading can be added within MPI
  In both cases, re-engineering will be required to exploit new and different levels of granularity
  This can be done along with increasing problem sizes

Porting Considerations, part 1
  Things to watch out for in porting codes to the SP
    – Cache
        Not enough on the T3E to make worrying about it worth the trouble
        Enough on the SP to boost performance, if it’s used well
        Tuning for cache is different from tuning for vectorization
        False sharing caused by cache can reduce performance
    – Virtual memory
        Gives you access to 1.75 GB of (virtual) RAM address space
        To use all of virtual (or even real) memory, you must explicitly request “segments”
        Causes performance degradation due to paging
    – Data types
        Default sizes are different on PVP, T3E, and SP systems
        “integer”, “int”, “real”, and “float” must be used carefully
        Best to say what you mean: “real*8”, “integer*4”
        Do the same in MPI calls: “MPI_REAL8”, “MPI_INTEGER4”
        Be careful with intrinsic function use, as well
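
To see why the default size of "real" matters, the sketch below stores the same value in 4 and in 8 bytes and compares what comes back. Python's struct module stands in for the Fortran declarations here; mapping "real*4" to format "f" and "real*8" to format "d" is an illustration chosen for this example, not something taken from the slides.

```python
import struct

# Byte widths of explicitly sized types. The "<" prefix selects struct's
# standard sizes, which do not depend on the platform running the script.
size_real4 = struct.calcsize("<f")   # stand-in for real*4
size_real8 = struct.calcsize("<d")   # stand-in for real*8
size_int4 = struct.calcsize("<i")    # stand-in for integer*4
size_int8 = struct.calcsize("<q")    # stand-in for integer*8

def round_trip(value, fmt):
    """Store `value` in the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

x = 0.1
as_real4 = round_trip(x, "<f")   # loses precision in 4 bytes
as_real8 = round_trip(x, "<d")   # unchanged in 8 bytes

print("sizes:", size_real4, size_real8, size_int4, size_int8)
print("error as real*4:", abs(as_real4 - x))
print("error as real*8:", abs(as_real8 - x))
```

A code that silently drops from 8-byte to 4-byte reals when moved between systems will show this kind of difference in its answers, which is one reason to declare sizes explicitly.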

Porting Considerations, part 2
  More things to watch out for in porting codes to the SP
    – Arithmetic
        Architecture tuning can help exploit special processor instructions
        Both T3E and SP can optimize beyond IEEE arithmetic
        T3E and PVP can also do fast reduced precision arithmetic
        Compiler options on T3E and SP can force IEEE compliance
        Compiler options can also throttle other optimizations for safety
        Special libraries offer faster intrinsics
    – MPI
        SP compilers and runtime will catch loose usage that was accepted on the T3E
        Communication bandwidth on SP Phase 1 is lower than on the T3E
        Message latency on the SP Phase 1 is higher than on the T3E
        We expect approximate parity with T3E in these areas, on the Phase 2 system
        Limited number of communication ports per node - approximately one per CPU
        “Default” versus “eager” buffer management in MPI_SEND

Porting Considerations, part 3
  Compiling & linking
    – “Version” is dependent on language and parallelization scheme
        Language version
          – Fortran 77: f77, xlf
          – Fortran 90: xlf90
          – Fortran 95: xlf95
          – C: cc, xlc, c89
          – C++: xlC
        MPI-included: mpxlf, mpxlf90, mpcc, mpCC
        Thread-safe: xlf_r, xlf90_r, xlf95_r, mpxlf_r, mpxlf90_r
    – Preprocessing can be ordered by compiler flag or source file suffix
  Use one compiler version consistently for all related compilations; the following may NOT produce a parallel executable:
        mpxlf90 -c *.F
        xlf90 -o foo *.o
  Use the -bmaxdata:bytes option to get more than a single 256 MB segment (up to 7 segments, or ~1.75 GB, can be specified; only 3, or 0.75 GB, are real)
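
The byte count passed to -bmaxdata is the number of 256 MB segments multiplied by the segment size. The sketch below works that out and formats the option; the helper names are hypothetical, and writing the value in hexadecimal is a common convention assumed here rather than something stated on the slide.

```python
SEGMENT_BYTES = 256 * 1024 * 1024   # one 256 MB segment, per the slide
MAX_SEGMENTS = 7                    # upper limit quoted on the slide

def bmaxdata_flag(segments):
    """Return a -bmaxdata option requesting `segments` 256 MB segments."""
    if not 1 <= segments <= MAX_SEGMENTS:
        raise ValueError("segments must be between 1 and %d" % MAX_SEGMENTS)
    return "-bmaxdata:0x%08X" % (segments * SEGMENT_BYTES)

def segments_to_gb(segments):
    """Address space covered by `segments` segments, in GB."""
    return segments * SEGMENT_BYTES / float(1024 ** 3)

for n in (3, 7):
    print(n, "segments:", bmaxdata_flag(n), "=", segments_to_gb(n), "GB")
```

Three segments give the 0.75 GB that the slide describes as real memory, and seven give the ~1.75 GB address-space limit.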

Porting: MPI
  MPI codes should port relatively well
  Use one MPI task per node or processor
    – One per node during porting
    – One per processor during production
    – Let MPI worry about where it’s communicating to
    – Environment variables, execution parameters, and/or batch options can specify
        # tasks per node
        Total # tasks
        Total # processors
        Total # nodes
        Communications subsystem in use
          – User Space is best in batch jobs
          – IP may be best for interactive developmental runs
  There is a debug queue/class in batch
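
As a rough illustration of the task-placement choices above, the sketch below derives the tasks-per-node setting and total task count for a porting run versus a production run. The environment variable names MP_NODES, MP_TASKS_PER_NODE, and MP_EUILIB are my understanding of the IBM Parallel Environment settings and do not appear on the slide; the function name and the "porting"/"production" labels are hypothetical.

```python
def poe_settings(nodes, cpus_per_node, stage, interactive=False):
    """Return (environment settings, total MPI task count) for a run.

    stage: "porting" -> one task per node; "production" -> one per CPU.
    The variable names are assumed, not taken from the slide.
    """
    if stage == "porting":
        tasks_per_node = 1
    elif stage == "production":
        tasks_per_node = cpus_per_node
    else:
        raise ValueError("stage must be 'porting' or 'production'")
    env = {
        "MP_NODES": str(nodes),
        "MP_TASKS_PER_NODE": str(tasks_per_node),
        # User Space for batch, IP for interactive development runs
        "MP_EUILIB": "ip" if interactive else "us",
    }
    return env, nodes * tasks_per_node

env, total_tasks = poe_settings(nodes=4, cpus_per_node=2, stage="production")
for name in sorted(env):
    print("export %s=%s" % (name, env[name]))
print("total MPI tasks:", total_tasks)
```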

Porting: Shared Memory
  Don’t throw away old shared memory directives
    – OpenMP will work as is
    – Cray Tasking directives will be useful for documentation
    – We recommend porting Cray directives to OpenMP
    – Even small-scale parallelism can be useful
    – Larger scale parallelism will be available next year
  If your problems and/or algorithms will scale to larger granularities and greater parallelism, prepare for message passing
    – We recommend MPI

From Loop-slicing to MPI, before...

        allocate(A(1:imax,1:jmax))
        !$OMP PARALLEL DO PRIVATE(I, J), SHARED(A, imax, jmax)
        do I = 1, imax
          do J = 1, jmax
            A(I,J) = deep_thought(A, I, J,…)
          enddo
        enddo

  Sanity checking
    – Run the program on one CPU to get baseline answers
    – Run on several CPUs to see parallel speedups and answers
  Optimization
    – Consider changing memory access patterns to improve cache usage
    – How big can your problem get before you run out of real memory?
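
One memory access pattern worth checking is loop order. Fortran stores arrays column by column, so in the loop on this slide, with J innermost, successive iterations touch elements that are imax apart in memory. The Python sketch below computes those strides; the function names are illustrative only.

```python
def column_major_offset(i, j, imax):
    """Element offset of A(i, j) within a Fortran array A(1:imax, 1:jmax)."""
    return (i - 1) + (j - 1) * imax

def inner_loop_stride(inner, imax):
    """Distance, in elements, between consecutive inner-loop accesses."""
    base = column_major_offset(1, 1, imax)
    if inner == "I":
        return column_major_offset(2, 1, imax) - base
    if inner == "J":
        return column_major_offset(1, 2, imax) - base
    raise ValueError("inner must be 'I' or 'J'")

imax = 1000
print("J innermost (as on the slide): stride", inner_loop_stride("J", imax))
print("I innermost (loops swapped):   stride", inner_loop_stride("I", imax))
```

With I as the inner loop, accesses are unit stride and each cache line fetched is fully used, which is usually the first thing to try when tuning for cache.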

From Loop-slicing to MPI, after...

        call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
        call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
        call my_indices(my_id, nprocs, my_imin, my_imax, my_jmin, my_jmax)
        allocate(A(my_imin : my_imax, my_jmin : my_jmax))
        !$OMP PARALLEL DO PRIVATE(I, J), SHARED(A, my_imin, my_imax, my_jmin, my_jmax)
        do I = my_imin, my_imax
          do J = my_jmin, my_jmax
            A(I,J) = deep_thought(A, I, J,…)
          enddo
        enddo
        ! Communicate the shared values with neighbors…
        ! left_nbr, right_nbr, top_nbr, and bottom_nbr are placeholder names for
        ! the neighbor ranks; status is an integer array of size MPI_STATUS_SIZE
        if (mod(my_id, 2) == 1) then
          call MPI_SEND(my_left(...), leftsize, MPI_REAL, left_nbr, tag, MPI_COMM_WORLD, ierr)
          call MPI_RECV(my_right(...), rightsize, MPI_REAL, right_nbr, tag, MPI_COMM_WORLD, status, ierr)
          call MPI_SEND(my_top(...), leftsize, MPI_REAL, top_nbr, tag, MPI_COMM_WORLD, ierr)
          call MPI_RECV(my_bottom(...), rightsize, MPI_REAL, bottom_nbr, tag, MPI_COMM_WORLD, status, ierr)
        else
          call MPI_RECV(my_right(...), rightsize, MPI_REAL, right_nbr, tag, MPI_COMM_WORLD, status, ierr)
          call MPI_SEND(my_left(...), leftsize, MPI_REAL, left_nbr, tag, MPI_COMM_WORLD, ierr)
          call MPI_RECV(my_bottom(...), rightsize, MPI_REAL, bottom_nbr, tag, MPI_COMM_WORLD, status, ierr)
          call MPI_SEND(my_top(...), leftsize, MPI_REAL, top_nbr, tag, MPI_COMM_WORLD, ierr)
        endif
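
The slide leaves my_indices undefined. One possible way to compute each task's index ranges is a block decomposition over a grid of tasks, sketched below in Python. This is an assumption about what such a routine might do, not the author's implementation: the px-by-py task grid, the rank ordering, and the argument list (which differs from the call on the slide) are all hypothetical.

```python
def block_range(n, parts, part):
    """1-based inclusive (lo, hi) of block `part` (0-based) when the
    index range 1..n is split into `parts` nearly equal blocks."""
    base, extra = divmod(n, parts)
    lo = part * base + min(part, extra) + 1
    hi = lo + base - 1 + (1 if part < extra else 0)
    return lo, hi

def my_indices(my_id, px, py, n_i, n_j):
    """Index ranges owned by task `my_id` on a hypothetical px x py grid.

    Tasks are numbered along the I direction first.
    Returns (my_imin, my_imax, my_jmin, my_jmax).
    """
    if not 0 <= my_id < px * py:
        raise ValueError("my_id is outside the task grid")
    pi, pj = my_id % px, my_id // px
    my_imin, my_imax = block_range(n_i, px, pi)
    my_jmin, my_jmax = block_range(n_j, py, pj)
    return my_imin, my_imax, my_jmin, my_jmax

# Four tasks on a 2 x 2 grid sharing a 100 x 100 array
for rank in range(4):
    print(rank, my_indices(rank, 2, 2, 100, 100))
```

When the array size does not divide evenly, the first few blocks each take one extra index, so no task holds more than one row or column beyond any other.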

From Loop-slicing to MPI, after...
  You now have one MPI task and many OpenMP threads per node
    – The MPI task does all the communicating between nodes
    – The OpenMP threads do the parallelizable work
    – Do NOT use MPI within an OpenMP parallel region
  Sanity checking
    – Run on one node and one CPU to check baseline answers
    – Run on one node and several CPUs to see parallel speedup and answers
    – Run on several nodes, one CPU per node, and check answers
    – Run on several nodes, several CPUs per node, and check answers
  Scaling checking
    – Run a larger version of a similar problem on the same set of ensemble sizes
    – Run the same sized problem on a larger ensemble
  (Re-)Consider your I/O strategy…
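
The sanity and scaling checks above come down to two comparisons: do the answers agree with the baseline, and how much faster is the parallel run? The Python sketch below shows one way to script both. The tolerance, the timings, and the function names are made up for illustration; answers from parallel runs are compared with a tolerance because a different order of floating-point operations can change the last few bits.

```python
import math

def answers_match(baseline, candidate, rel_tol=1e-10):
    """True if every candidate value agrees with the baseline within rel_tol."""
    if len(baseline) != len(candidate):
        return False
    return all(math.isclose(b, c, rel_tol=rel_tol, abs_tol=rel_tol)
               for b, c in zip(baseline, candidate))

def speedup(t_serial, t_parallel):
    """Ratio of one-CPU run time to parallel run time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, ncpus):
    """Speedup divided by the number of CPUs used (1.0 is ideal)."""
    return speedup(t_serial, t_parallel) / ncpus

# Hypothetical results: baseline on 1 CPU, candidate on 8 CPUs
baseline = [1.0, 2.0, 3.0]
candidate = [1.0, 2.0 + 1e-13, 3.0]
print("answers match:", answers_match(baseline, candidate))
print("speedup:", speedup(100.0, 25.0))
print("efficiency:", efficiency(100.0, 25.0, 8))
```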

From MPI to Loop-slicing
  Add OpenMP directives to existing code
  Perform sanity and scaling checks, as before
  Results in same overall code structure as on previous slides
    – One MPI task and several OpenMP threads per node
  For irregular codes, Pthreads may serve better, at the cost of increased complexity
  Nobody really expects it to be this easy...

Using the Machine, part 1
  Somewhat similar to the Crays
    – Interactive and batch jobs are possible

Using the Machine, part 2
  Interactive runs
    – Sequential executions run immediately on your login node
    – Every login will likely put you on a different node, so be careful about looking for your executions - “ps” returns info about only the node you’re logged into
    – Small scale parallel jobs may be rejected if LoadLeveler can’t find the resources
    – There are two pools of nodes that can be used for interactive jobs:
        Login nodes
        A small subset of the compute nodes
    – Parallel execution can often be achieved by
        Trying again, after initial rejection
        Changing communication mechanisms from User Space to IP
        Using the other pool

Using the Machine, part 3
  Batch jobs
    – Currently, very similar in capability to the T3E
        Similar run times, processor counts
        More memory available on the SP
    – Limits and capabilities may change, as we learn the machine
    – LoadLeveler is similar to, but simpler than, NQE/NQS on the T3E
        Jobs are submitted, monitored, and cancelled by special commands
        Each batch job requires a script that is essentially a shell script
        The first few lines contain batch options that look like comments to the shell
        The rest of the script can contain any shell constructs
        Scripts can be debugged by executing them interactively
        Users are limited to 3 running jobs, 10 queued jobs, and 30 submitted jobs, at any given time

Using the Machine, part 4
  File systems
    – Use the environment variables to let the system manage your file usage
    – Sequential work can be done in $HOME (not backed up) or $TMPDIR (transient)
        Medium performance, node-local
    – Parallel work can be done in $SCRATCH (transient) or /scratch/username (purgeable)
        High performance, located in GPFS
    – HPSS is available from batch jobs, via HSI, and interactively via FTP, PFTP, and HSI
    – There are quotas on space and inode usage

Using the Machine, part 5
  The future?
    – The allowed scale of parallelism (CPU counts) may change
        Max now = 512 CPUs, same as on T3E
    – The allowed duration of runs may change
        Max now = 4 hours; Max on T3E = 12 hours
    – The size of possible problems will definitely change
        More CPUs in phase 1 than the T3E
        More memory per CPU, in both phases, than on T3E
    – The amount of work possible per unit time will definitely change
        CPUs in both phases are faster than those on the T3E
        Phase 2 interconnect will be faster than on Phase 1
    – Better machine management
        Checkpointing will be available
        We will learn what can be adjusted in the batch system
    – There will be more and better tools for monitoring and tuning
        HPM, KAP, Tau, PAPI...
    – Some current problems will go away (e.g. memory mapped files)