1 Aug 7, 2004 GPU Req
GPU Requirements for Large Scale Scientific Applications
“Begin with the end in mind…”
Dr. Mark Seager, Asst DH for Advanced Technology
UCRL-PRES
August 7, 2004
This work was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under Contract No. W-7405-Eng-48.
Presented to GP² Workshop

Overview
- Code Characteristics
- Hardware requirements
- Software requirements
- Runtime requirements
[Figure: Bringa, 350MAtoms, 1,944 1GiB, 50 TB output, 7 Days. 10K LOC, 35% PE efficiency, 95% parallel efficiency]
[Figure: Gilmer, 10MAtoms, GiB, 40(110 GB output, 48 HR). 10K LOC, 35% PE efficiency, 95% parallel efficiency]

Simulation’s value is dependent on the other elements of the integrated program
- Simulations and experimental program are tightly coupled for overall confidence in the stockpile

Code Characteristics
- Complex multi-physics package applications
  - Typically solving multiple types of PDEs
  - Time evolution calculations (100K time steps, weeks of runtime)
  - Non-linear solves in each package (100s)
  - Linear solves within non-linear solve (1Ks)
  - Multiple physical properties databases
- Languages include C, C++, Fortran90, Python
  - 50K-1.5M LOC
  - Heavy use of complex structures and C++ templates
  - Need programming model and platform architecture stability for horizontal (platform) and vertical (time-dependent) portability
  - Very complex makefiles, controllers (Perl and Python), and pre- and post-processing
- Designed and written from the ground up for MPI and OpenMP style parallelism
  - Targeted at hierarchical memory systems
  - Lots of low level parallelism left to be exploited, but with short vector lengths
- Written by large (5-25 people) teams
  - Core physics physicists
  - Computer scientists
  - Mathematicians
- Timespans
  - 3-5 years to develop, 10 years usage, 5-10 years legacy
  - Constant evolution of codes to add physics features, debug, and improve validation and databases

Application Performance Characteristics
- Node code
  - No hot spots – e.g., package has 20 routines with 5% runtime each
  - Compute intensive with 5-35% performance efficiency
  - 5-20% FMA
  - Random access and block access memory patterns
  - Most don’t have math library (e.g., BLAS3) usage
  - Typically use GiB of memory
- MPI
  - Long and short messages, depending on package
  - Exchanges for FEM
  - Random connections for sparse matrix ops
  - Highly dependent on Barrier, ALL_REDUCE
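
The "no hot spots" point can be made concrete with Amdahl's law. The sketch below uses the slide's illustrative profile of 20 routines at 5% of runtime each; the function name `overall_speedup` and the 10x per-routine speedup are hypothetical values chosen for illustration, not figures from the talk.

```python
def overall_speedup(fractions, speedups):
    """Amdahl's law for a runtime profile split into routines.

    fractions: share of total runtime per routine (must sum to 1).
    speedups:  factor by which each routine is accelerated (1.0 = unchanged).
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("runtime fractions must sum to 1")
    new_time = sum(f / s for f, s in zip(fractions, speedups))
    return 1.0 / new_time


# Flat profile from the slide: 20 routines, 5% of runtime each.
flat_profile = [0.05] * 20

# Assumption (not from the talk): a ported routine runs 10x faster.
one_ported = overall_speedup(flat_profile, [10.0] + [1.0] * 19)
half_ported = overall_speedup(flat_profile, [10.0] * 10 + [1.0] * 10)
all_ported = overall_speedup(flat_profile, [10.0] * 20)

print(f"1 of 20 routines ported:  {one_ported:.2f}x")
print(f"10 of 20 routines ported: {half_ported:.2f}x")
print(f"20 of 20 routines ported: {all_ported:.2f}x")
```

Under these assumptions, porting a single routine buys roughly 5% overall, and even porting half the package stays below 2x. With a flat profile, an accelerator pays off only if most of the code can move to it.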

GPU hardware requirements
- 64b arithmetic predominates, but some 32b is acceptable
- Need better IEEE arithmetic
  - Better FP behavior, not full compliance
  - Exception generation mechanism
- Large memory and access to node memory
  - Streaming access to node memory
  - Random access and block access modes
  - Reduced texture memory restrictions
- Efficient Gather and Scatter mechanisms
- Short vectors need low overhead to start parallelism
- Conditional execution essential for vectorization of if-tests
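
The last requirements on this slide (gather, scatter, conditional execution) can be sketched in a few lines. This is a plain-Python model of the data-parallel primitives, not GPU code; the helper names `gather`, `scatter_add`, `select`, `branchy`, and `if_converted` are hypothetical and chosen for this illustration only.

```python
def gather(src, index):
    """Indexed read: out[k] = src[index[k]]."""
    return [src[i] for i in index]


def scatter_add(dst, index, values):
    """Indexed accumulate: dst[index[k]] += values[k], repeated indices sum."""
    for i, v in zip(index, values):
        dst[i] += v


def select(mask, if_true, if_false):
    """Per-element choice: the data-parallel replacement for an if-test."""
    return [t if m else f for m, t, f in zip(mask, if_true, if_false)]


def branchy(x, a, b):
    """Scalar loop with an if-test in the body."""
    out = []
    for xi in x:
        if xi > 0.0:
            out.append(a * xi)
        else:
            out.append(b * xi)
    return out


def if_converted(x, a, b):
    """Same result without a branch: evaluate both sides, then select."""
    mask = [xi > 0.0 for xi in x]
    then_vals = [a * xi for xi in x]
    else_vals = [b * xi for xi in x]
    return select(mask, then_vals, else_vals)


x = [-2.0, 0.5, 3.0, -0.25]
assert if_converted(x, 2.0, 0.5) == branchy(x, 2.0, 0.5)

picked = gather([10.0, 20.0, 30.0, 40.0], [3, 0, 3])
totals = [0.0, 0.0, 0.0, 0.0]
scatter_add(totals, [3, 0, 3], [1.0, 2.0, 3.0])
print(picked, totals)
```

Note that the if-converted form evaluates both branches for every element, which is one reason well-defined floating-point behavior and an exception mechanism matter: the side that is discarded may still operate on out-of-range inputs.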

GPU software requirements
- Languages
  - The closer to C and C++ the better
  - Porting to OpenGL is not an option
  - Challenge is to be able to express data parallelism (streams) in portable C
- Ability to debug essential
- How to efficiently utilize multiple GPUs?
  - Multiple levels of parallelism (data parallel, mGPU, GPU-CPU, mCPU)
- Open source
  - Device drivers, compilers, debuggers, etc.

Runtime requirements
- Dynamically load programs into GPU with dynamic linked libraries
- Need exception mechanism
- Ability to cleanly map node memory into GPU memory
- Move data with portable constructs

Possible approaches
- HPC market potential could be used to induce vendors to improve environment
  - 1K clusters have 2-4K slots for GPUs… Large market
- Libraries
  - Not widely used
  - Few key applications could benefit
- Key functions
  - Monte Carlo random number generation
  - EOS evaluation utilizing “free” interpolation
  - FEM Elem by Elem Mx operations
  - Secondary calculations (diagnostics, visualization)
- Work with early adopters
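
As an illustration of the EOS item, the sketch below does a tabulated equation-of-state lookup with bilinear interpolation. It assumes that the "free" interpolation refers to the filtering that texture hardware applies on a table fetch; that reading, the function name `bilinear_lookup`, the ideal-gas table, and the grid spacings are all assumptions made for this example, not details from the talk.

```python
def bilinear_lookup(table, x0, dx, y0, dy, x, y):
    """Interpolate table[i][j], sampled at (x0 + i*dx, y0 + j*dy), at (x, y).

    The table needs at least 2 points per axis. Queries outside the table
    are clamped to its edge, similar to a clamped texture fetch.
    """
    nx, ny = len(table), len(table[0])
    # Fractional grid coordinates, clamped to the table extent.
    u = min(max((x - x0) / dx, 0.0), nx - 1.0)
    v = min(max((y - y0) / dy, 0.0), ny - 1.0)
    # Lower-left corner of the enclosing cell.
    i = min(int(u), nx - 2)
    j = min(int(v), ny - 2)
    fu, fv = u - i, v - j
    # Interpolate along x on the two bracketing rows, then along y.
    low = (1.0 - fu) * table[i][j] + fu * table[i + 1][j]
    high = (1.0 - fu) * table[i][j + 1] + fu * table[i + 1][j + 1]
    return (1.0 - fv) * low + fv * high


# Hypothetical table: ideal-gas pressure p = (GAMMA - 1) * rho * e.
GAMMA = 1.4
RHO0, DRHO, NRHO = 0.5, 0.5, 5   # density grid (assumed)
E0, DE, NE = 1.0, 1.0, 4         # specific internal energy grid (assumed)
pressure = [[(GAMMA - 1.0) * (RHO0 + i * DRHO) * (E0 + j * DE)
             for j in range(NE)]
            for i in range(NRHO)]

p = bilinear_lookup(pressure, RHO0, DRHO, E0, DE, 1.3, 2.6)
exact = (GAMMA - 1.0) * 1.3 * 2.6
print(f"interpolated p = {p:.6f}, analytic p = {exact:.6f}")
```

The ideal-gas form is used only because it is bilinear in density and energy, so the interpolated value should match the analytic one to rounding error and the sketch is easy to check. A production EOS table would be far less smooth, and the reduced precision of hardware texture filtering would need to be evaluated against the 64b requirement.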

Conclusions
- Large scientific simulations have enormous computational requirements
- GPUs offer unique capabilities and are becoming more usable
- Widespread adoption awaits more general purpose usability
[Figure: Langer, 6.8 TZone, Laser Plasma interaction, GiB, 10 days for 35 pico-sec, 14 TB vis data]
[Figure: Woodward, 8 BZone, TurbHydro, 2K 1.5GiB, 25 days 25 TB vis data]