1 Computational Research Division (BIPS)
Leading Computational Methods on the Earth Simulator
Leonid Oliker, Lawrence Berkeley National Laboratory
Joint work with: Julian Borrill, Andrew Canning, Stephane Ethier, Art Mirin, David Parks, John Shalf, David Skinner, Michael Wehner, Patrick Worley

2 Motivation
- Stagnating application performance is a well-known problem in scientific computing
- By the end of the decade, numerous mission-critical applications are expected to have 100X the computational demands of current levels
- Many HEC platforms are poorly balanced for the demands of leading applications
  - Memory-CPU gap, deep memory hierarchies, poor network-processor integration, low-degree network topology
- Traditional superscalar trends are slowing down
  - Most benefits of ILP and pipelining have been mined; clock frequency is limited by power concerns
- Achieving the next order of magnitude in computing will require O(100K)-O(1M) processors: interconnects that scale nonlinearly in cost (crossbar, fat-tree) become impractical

3 Application Evaluation
- Microbenchmarks, algorithmic kernels, and performance modeling/prediction are important components of understanding and improving architectural efficiency
- However, full-scale application performance is the final arbiter of system utility, and is necessary as a baseline to support all complementary approaches
- Our evaluation work emphasizes full applications, with real input data, at the appropriate scale
  - Requires coordination of computer scientists and application experts from highly diverse backgrounds
  - Allows advanced optimizations for given architectural platforms
  - In-depth studies reveal limitations of compilers, operating systems, and hardware, since all of these components must work together at scale
- Our recent efforts have focused on the potential of modern parallel vector systems
  - Effective code vectorization is an integral part of the process
  - First US team to conduct an Earth Simulator performance study

4 Vector Paradigm
- High memory bandwidth
  - Allows systems to effectively feed the ALUs (high byte-to-flop ratio)
- Flexible memory addressing modes
  - Support fine-grained strided and irregular data access
- Vector registers
  - Hide memory latency via deep pipelining of memory loads/stores
- Vector ISA
  - A single instruction specifies a large number of identical operations
- Vector architectures allow for:
  - Reduced control complexity
  - Efficient utilization of a large number of computational resources
  - Potential for automatic discovery of parallelism
- Can some of these features be added to modern micros?
- However: vectors are most effective only if sufficient regularity is discoverable in the program structure; performance suffers even if a small % of the code is non-vectorizable (Amdahl's Law), as the sketch below illustrates
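The Amdahl's Law point can be made concrete with a small worked example: if a fraction f of the runtime vectorizes with speedup V, the overall speedup is 1/((1-f) + f/V). The numbers in this sketch (V = 16, f = 0.85-0.99) are illustrative assumptions, not measurements from this study.

```c
#include <stdio.h>

/* Amdahl's Law applied to vectorization: if a fraction f of the runtime
 * vectorizes with speedup V, overall speedup = 1 / ((1 - f) + f / V).
 * V and f below are illustrative assumptions. */
int main(void) {
    double V = 16.0;                          /* assumed speedup of the vectorized portion */
    double fractions[] = {0.85, 0.95, 0.99};  /* fraction of runtime that vectorizes */
    for (int i = 0; i < 3; i++) {
        double f = fractions[i];
        double speedup = 1.0 / ((1.0 - f) + f / V);
        printf("vectorized fraction %.2f -> overall speedup %.2fx\n", f, speedup);
    }
    return 0;
}
```

Even at 95% vectorization the overall speedup is only about 8x of the ideal 16x, which is why a small non-vectorizable remainder quickly dominates.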

5 ES Processor Overview
- Cacheless architecture
  - 8 CPUs per SMP node
  - 8-way replicated vector pipes
  - 72 vector registers, each holding 256 64-bit words
  - Divide unit
  - 32 GB/s pipe to FPLRAM
  - 4-way superscalar out-of-order scalar unit @ 1 Gflop/s, with 64 KB I-cache and D-cache
- ES: specially developed FPLRAM (Fully Pipelined RAM); SX-6: commodity DDR-SDRAM (128/256 Mb)
- ES: IN interconnect, 12.3 GB/s bidirectional between any 2 nodes, 640 nodes; SX-6: IXS interconnect, 8 GB/s bidirectional between any 2 nodes, max 128 nodes
- In April 2002, the Earth Simulator (ES) became operational: peak ES performance exceeded all DOE and DOD systems combined
- It stayed #1 on the Top 500 for an unprecedented 2.5 years!

6 Earth Simulator Overview
- Machine type: 640 nodes, each node an 8-way SMP of vector processors (5120 total processors)
- Machine peak: 40 TFlop/s (processor peak 8 GFlop/s)
- OS: extended version of Super-UX, a 64-bit Unix OS based on System V R3
- Connection structure: a single-stage crossbar network (1400 miles of cable, 83,000 copper cables)
  - 7.9 TB/s aggregate switching capacity
  - 12.3 GB/s bidirectional bandwidth between any two nodes
  - Global Barrier Counter within the interconnect allows global barrier synchronization in < 3.5 usec (see the sketch below)
- Storage: 480 TB disk, 1.5 PB tape
- Compilers: Fortran 90, HPF, ANSI C, C++
- Batch: similar to NQS, PBS
- Parallelization: vectorization at the processor level; OpenMP, Pthreads, MPI, HPF
- No parallel I/O: each node has a separate file system
- No remote access (until recently)
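As a rough illustration of how a barrier figure like the quoted < 3.5 usec is obtained, the sketch below times repeated barriers through the standard MPI interface. This is a generic measurement loop, not ES-specific code; whether MPI_Barrier maps onto the hardware Global Barrier Counter depends on the MPI implementation.

```c
#include <mpi.h>
#include <stdio.h>

/* Generic sketch: estimate global barrier latency by timing many
 * back-to-back MPI_Barrier calls and averaging. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    MPI_Barrier(MPI_COMM_WORLD);            /* warm up and align all processes */
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("mean barrier time: %.2f usec\n", (t1 - t0) / iters * 1e6);
    MPI_Finalize();
    return 0;
}
```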

7 Architectural Comparison

Node Type | Where | Network | CPU/Node | Clock (MHz) | Peak (GFlop/s) | STREAM BW (GB/s/P) | Peak byte/flop | MPI BW (GB/s/P) | MPI Latency (usec) | Network Topology
Power3 | NERSC | Colony | 16 | 375 | 1.5 | 0.4 | 0.26 | 0.13 | 16.3 | Fat-tree
Itanium2 | LLNL | Quadrics | 4 | 1400 | 5.6 | 1.1 | 0.19 | 0.25 | 3.0 | Fat-tree
Opteron | NERSC | InfiniBand | 2 | 2200 | 4.4 | 2.3 | 0.51 | 0.59 | 6.0 | Fat-tree
X1 | ORNL | Custom | 4 | 800 | 12.8 | 14.9 | 1.16 | 6.3 | 7.1 | 4D-Hypercube
X1E | ORNL | Custom | 4 | 1130 | 18.0 | 9.7 | 0.54 | 2.9 | 5.0 | 4D-Hypercube
ES | ESC | IN | 8 | 1000 | 8.0 | 26.3 | 3.29 | 1.5 | 5.6 | Crossbar
SX-8 | HLRS | IXS | 8 | 2000 | 16.0 | 41.0 | 2.56 | 2.0 | 5.0 | Crossbar

- Custom vectors: high bandwidth, flexible addressing, many registers, vector ISA
  - Less control complexity, use of many ALUs, automatic parallelism
- ES shows the best balance between memory bandwidth and peak performance (the byte/flop column is STREAM bandwidth divided by peak flop rate, as recomputed in the sketch below)
- Data caches of the superscalar systems and the X1(E) can potentially reduce memory costs
- X1E: 2 MSPs per MCM increases contention for memory and interconnect
- A key balance point for vector systems is the scalar:vector ratio
- Opteron/InfiniBand shows the best balance among the superscalars; Itanium2/Quadrics has the lowest latency
- Cost is a critical metric, but we are unable to provide such data
  - Pricing is proprietary and varies by customer and time frame
- Poorly balanced systems cannot solve certain classes of problems/resolutions regardless of processor count!
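The balance metric in the table is a simple ratio; the sketch below recomputes "Peak byte/flop" from the STREAM bandwidth and peak flop rate listed above (small differences are rounding).

```c
#include <stdio.h>

/* Peak byte/flop balance = STREAM bandwidth / peak flop rate.
 * Values are copied from the comparison table above. */
int main(void) {
    const char *name[]   = {"Power3", "Itanium2", "Opteron", "X1", "X1E", "ES", "SX-8"};
    double peak_gflops[] = {1.5, 5.6, 4.4, 12.8, 18.0, 8.0, 16.0};
    double stream_gbs[]  = {0.4, 1.1, 2.3, 14.9, 9.7, 26.3, 41.0};
    for (int i = 0; i < 7; i++)
        printf("%-8s  %.2f bytes/flop\n", name[i], stream_gbs[i] / peak_gflops[i]);
    return 0;
}
```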

8 Application Overview

Name | Discipline | Problem/Method | Structure
Cactus | Astrophysics | Theory of GR, ADM-BSSN, Method of Lines | Grid
LBMHD | Plasma Physics | Magneto-hydrodynamics, Lattice-Boltzmann | Lattice/Grid
MADCAP | Cosmology | Cosmic Microwave Background, dense linear algebra, high I/O | Dense Matrix
GTC | Magnetic Fusion | Particle-in-Cell, Vlasov-Poisson | Particle/Grid
PARATEC | Material Science | Density Functional Theory, Kohn-Sham, 3D FFT | Fourier/Grid
FVCAM | Climate Modeling | AGCM, Finite Volume, Navier-Stokes, FFT | Grid

- Examining candidate ultra-scale applications with abundant data parallelism
- Codes designed for superscalar architectures required vectorization effort
- ES use requires clearing minimum vectorization and parallelization hurdles

9 Astrophysics: CACTUS
- Numerical solution of Einstein's equations from the theory of general relativity
- Among the most complex in physics: a set of coupled nonlinear hyperbolic and elliptic systems with thousands of terms
- CACTUS evolves these equations to simulate high gravitational fluxes, such as the collision of two black holes
- Evolves the PDEs on a regular grid using finite differences
- Figure: visualization of a grazing collision of two black holes
- Developed at the Max Planck Institute; vectorized by John Shalf (LBNL)

10 CACTUS: Performance
- SX-8 attains the highest per-processor performance ever attained for Cactus
- ES achieves the highest overall performance and efficiency to date: 39X faster than Power3!
- Vector performance is related to the x-dimension (vector length), as the stencil sketch below illustrates
- Excellent scaling on ES using fixed data size per processor (weak scaling)
  - Opens the possibility of computations at unprecedented scale
- X1 is surprisingly poor (4X slower than ES): low scalar:vector ratio
  - Unvectorized boundary code required 15% of the runtime on ES and 30+% on X1, versus < 5% in the scalar version: unvectorized code can quickly dominate cost
- Poor superscalar performance despite high computational intensity
  - Register spilling due to a large number of loop variables
  - Prefetch engines inhibited by the multi-layer ghost-zone calculations

Problem size: 250x80x80 per processor (each cell: GFlop/s/P, % of peak)
P | Power3 Seaborg | Itanium2 Thunder | X1 Phoenix | ES (SX6*) | SX8
16 | 0.10, 6% | 0.58, 10% | 0.81, 6% | 2.8, 35% | 4.3, 27%
64 | 0.08, 6% | 0.56, 10% | 0.72, 6% | 2.7, 34% | -
256 | 0.07, 5% | 0.55, 10% | 0.68, 5% | 2.7, 34% | -
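To illustrate why the x-dimension sets the effective vector length, here is a minimal, generic finite-difference stencil with a long unit-stride inner loop over x. It is a Laplacian-style placeholder, not the actual ADM-BSSN kernel; only the 250x80x80 per-processor extents are taken from the table above.

```c
/* Generic finite-difference update with a long, unit-stride inner x loop --
 * the pattern that gives Cactus-like codes their vector length.  This is a
 * Laplacian-style stand-in, not the actual ADM-BSSN equations. */
#define NX 250   /* per-processor x extent (from the table above) */
#define NY 80
#define NZ 80

void stencil_update(double u_new[NZ][NY][NX], const double u[NZ][NY][NX],
                    double dt, double h)
{
    double c = dt / (h * h);
    for (int k = 1; k < NZ - 1; k++)
        for (int j = 1; j < NY - 1; j++)
            /* innermost loop runs over x: unit stride, length ~NX, which maps
             * directly onto the 256-word vector registers */
            for (int i = 1; i < NX - 1; i++)
                u_new[k][j][i] = u[k][j][i] + c *
                    (u[k][j][i-1] + u[k][j][i+1] +
                     u[k][j-1][i] + u[k][j+1][i] +
                     u[k-1][j][i] + u[k+1][j][i] - 6.0 * u[k][j][i]);
}
```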

11 Plasma Physics: LBMHD
- LBMHD uses a Lattice Boltzmann method to model magneto-hydrodynamics (MHD)
- Performs 2D/3D simulations of high-temperature plasmas
- Evolves from initial conditions, decaying to form current sheets
- Spatial grid coupled to an octagonal streaming lattice, block-distributed over the processor grid
- Main computational components: Collision, Stream, Interpolation
- Vectorization: loop interchange, unrolling (see the sketch below)
- HPCS benchmark
- Ported by Jonathan Carter; developed by George Vahala's group, College of William & Mary
- Figure: evolution of vorticity into turbulent structures
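A generic before/after sketch of the loop interchange mentioned above: the long grid-point loop is moved innermost so the compiler vectorizes over grid points rather than over the short lattice of streaming velocities. The array shapes and the simple relaxation update are hypothetical stand-ins, not the actual LBMHD collision kernel.

```c
/* Loop-interchange sketch: make the long, unit-stride grid loop innermost so
 * it becomes the vector loop.  Shapes and update are illustrative only. */
#define NPTS  4096   /* grid points in a block (long) */
#define NVEL  27     /* streaming-lattice velocities (short, hypothetical count) */

void collision_original(double f[NVEL][NPTS], const double feq[NVEL][NPTS],
                        double omega)
{
    for (int p = 0; p < NPTS; p++)          /* long loop outer */
        for (int v = 0; v < NVEL; v++)      /* short, strided inner loop: poor vector length */
            f[v][p] += omega * (feq[v][p] - f[v][p]);
}

void collision_interchanged(double f[NVEL][NPTS], const double feq[NVEL][NPTS],
                            double omega)
{
    for (int v = 0; v < NVEL; v++)          /* short loop outer */
        for (int p = 0; p < NPTS; p++)      /* long, unit-stride inner loop: full vector length */
            f[v][p] += omega * (feq[v][p] - f[v][p]);
}
```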

12 LBMHD-3D: Performance

(each cell: GFlop/s/P, % of peak)
Grid Size | P | Power3 Seaborg | Itanium2 Thunder | Opteron Jacquard | X1 Phoenix | X1E Phoenix | ES (SX6) | SX8 HLRS
256^3 | 16 | 0.14, 9% | 0.26, 5% | 0.70, 16% | 5.2, 41% | 6.6, 37% | 5.5, 69% | 7.9, 49%
512^3 | 64 | 0.15, 9% | 0.35, 6% | 0.68, 15% | 5.2, 41% | 5.8, 32% | 5.3, 66% | 8.1, 51%
1024^3 | 256 | 0.14, 9% | 0.32, 6% | 0.60, 14% | 5.2, 41% | 6.0, 33% | 5.5, 68% | 9.6, 60%
2048^3 | 512 | 0.14, 9% | 0.35, 6% | 0.59, 13% | - | 5.8, 32% | 5.2, 65% | -

- It is not unusual to see the vector systems achieve > 40% of peak while the superscalar architectures achieve ~10%
  - There is plenty of computation, but the large working set causes register spilling on the scalar systems
  - Unlike the superscalar approach, large vector register sets hide latency
- Opteron shows impressive superscalar performance, 2X the speed of Itanium2
  - Opteron has > 2X the STREAM bandwidth, and Itanium2 cannot store FP data in L1 cache
- ES sustains 68% of peak up to 4800 processors: 26 TFlop/s (SC2005 Gordon Bell finalist), the highest performance ever attained for this code by far
- SX-8 shows the highest raw performance but lags behind ES in efficiency
  - SX-8: commodity DDR2-SDRAM vs. ES: high-performance custom FPLRAM
- X1E achieved the same performance as X1 when using the original code version; turning off caching resulted in about a 10% improvement over the X1

13 Magnetic Fusion: GTC
- Gyrokinetic Toroidal Code: transport of thermal energy (plasma microturbulence)
- The goal of magnetic fusion is a burning-plasma power plant producing cleaner energy
- GTC solves the 3D gyroaveraged kinetic system with a particle-in-cell (PIC) approach
- PIC scales as N instead of N^2: particles interact with the electromagnetic field on a grid
  - Allows solving the equations of particle motion with ODEs (instead of nonlinear PDEs)
- Vectorization inhibited: multiple particles may attempt to concurrently update the same grid point during charge deposition
- Figure: whole volume and cross-section of the electrostatic potential field, showing elongated turbulence eddies
- Developed at PPPL; vectorized/optimized by Stephane Ethier and ORNL/Cray/NEC

14 GTC Particle Decomposition
- Vectorization:
  - Particle charge is deposited among the nearest grid points
  - Several particles can contribute to the same grid point, preventing vectorization
  - Solution: VLEN copies of the charge deposition array with a reduction after the main loop (see the sketch below)
  - Greatly increases the memory footprint (8X)
- GTC was originally optimized for superscalar SMPs using MPI/OpenMP
  - However, OpenMP severely increases memory use in the vector implementation
  - Vectorization and thread-level parallelism compete with each other
- Previous vector experiments were limited to only 64-way MPI parallelism
  - 64 is the optimal number of domains for the 1D toroidal decomposition (independent of # particles)
- The updated GTC algorithm introduces a third level of parallelism:
  - Splits particles between several processors (within a 1D domain)
  - Allows increased concurrency and number of studied particles
  - Larger particle simulations allow increased-resolution studies
  - Particles are not subject to a Courant condition (same timestep)
  - Allows multiple-species calculations
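A minimal sketch of the VLEN-copies technique described above, with hypothetical names and sizes: each vector lane accumulates charge into its own private copy of the deposition array, and a reduction merges the copies after the main loop. The real GTC kernel deposits each particle's charge onto several neighboring grid points; this sketch deposits onto one to keep the pattern visible.

```c
#include <string.h>

/* Duplicated-accumulator sketch: lane l writes only its own copy of the grid
 * array, so the indirect (scatter) update carries no dependence across the
 * inner loop and can vectorize; a reduction sums the copies afterwards. */
#define VLEN   256      /* hardware vector length (256 64-bit words on the ES) */
#define NGRID  65536    /* hypothetical grid size */

static double charge_copy[VLEN][NGRID];   /* VLEN-fold blow-up of the grid array */

void deposit(const int cell[], const double q[], long npart, double charge[])
{
    memset(charge_copy, 0, sizeof charge_copy);

    /* main loop: lanes are independent, so the scatter can vectorize */
    for (long base = 0; base < npart; base += VLEN)
        for (int l = 0; l < VLEN && base + l < npart; l++)
            charge_copy[l][cell[base + l]] += q[base + l];

    /* reduction after the main loop */
    for (long g = 0; g < NGRID; g++) {
        double s = 0.0;
        for (int l = 0; l < VLEN; l++)
            s += charge_copy[l][g];
        charge[g] = s;
    }
}
```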

15 GTC: Performance
- The new decomposition algorithm efficiently utilizes high processor counts (as opposed to 64 previously on ES)
- Breakthrough of the TFlop/s barrier on ES for an important SciDAC code: 7.2 TFlop/s on 4096 processors
- SX-8 has the highest raw performance (of any architecture) but lower efficiency than ES
- Opens the possibility of a new set of high phase-space-resolution simulations
- Scalar architectures suffer from low computational intensity, irregular data access, and register spilling
- Opteron/InfiniBand is 50% faster than Itanium2/Quadrics and only half the speed of X1
  - Opteron benefits: on-chip memory controller and caching of FP data in L1
- X1 suffers from the overhead of the scalar code portions
  - The original (unmodified) X1 version performed 12% *slower* on X1E
  - Additional optimizations increased performance by 50%!
  - Recent ORNL work increases performance an additional 75%
- SciDAC code, HPCS benchmark

(each cell: GFlop/s/P, % of peak)
P | Part/Cell | Power3 Seaborg | Itanium2 Thunder | Opteron Jacquard | X1 Phoenix | X1E Phoenix | ES (SX6) | SX8 HLRS
128 | 200 | 0.14, 9% | 0.39, 7% | 0.59, 13% | 1.2, 9% | 1.7, 10% | 1.9, 23% | 2.3, 14%
256 | 400 | 0.14, 9% | 0.39, 7% | 0.57, 13% | 1.2, 9% | 1.7, 10% | 1.8, 22% | 2.3, 15%
512 | 800 | 0.14, 9% | 0.38, 7% | 0.51, 12% | - | 1.7, 9% | 1.8, 22% | -
1024 | 1600 | 0.14, 9% | 0.37, 7% | - | - | - | 1.8, 22% | -

16 Material Science: PARATEC
- First-principles quantum mechanical total energy calculation using pseudopotentials and a plane-wave basis set
- Density Functional Theory (DFT) to calculate the structure and electronic properties of new materials
- DFT calculations are one of the largest consumers of supercomputer cycles in the world
- Roughly 33% 3D FFT, 33% BLAS3, 33% hand-coded Fortran 90
- Part of the calculation is in real space, the rest in Fourier space
  - Uses a specialized 3D FFT to transform the wavefunctions
- Global transposes move data between Fourier and real space
  - Multiple simultaneous FFTs reduce communication latency
  - Vectorize across (not within) the FFTs (see the sketch below)
  - Custom FFT: only nonzeros are sent
- Figure: conduction-band minimum electron state for a Cadmium Selenide (CdSe) quantum dot
- Developed by Andrew Canning with Louie's and Cohen's groups (LBNL, UCB)
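A sketch of what "vectorize across (not within) FFTs" means in practice: the innermost loop runs over the batch of independent transforms, giving long unit-stride vectors even when each individual transform is short. A naive DFT stands in for the real library FFT, and the sizes are illustrative assumptions rather than PARATEC's actual data layout.

```c
#include <complex.h>
#include <math.h>

/* Batched transform sketch: advance all NBATCH independent columns together
 * so the innermost loop (over b) is long and unit-stride.  A naive O(N^2)
 * DFT stands in for the real FFT; sizes are illustrative only. */
#define PI      3.14159265358979323846
#define N       128     /* transform length (short) */
#define NBATCH  2048    /* number of independent columns (long) */

void batched_dft(double complex in[N][NBATCH], double complex out[N][NBATCH])
{
    for (int k = 0; k < N; k++) {
        for (int b = 0; b < NBATCH; b++)
            out[k][b] = 0.0;
        for (int n = 0; n < N; n++) {
            double complex w = cexp(-2.0 * PI * I * k * n / N);
            for (int b = 0; b < NBATCH; b++)   /* long vector loop over the batch */
                out[k][b] += w * in[n][b];
        }
    }
}
```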

17 PARATEC: Performance
- All architectures generally perform well due to the computational intensity of the code (BLAS3, FFT)
- ES achieves the highest overall performance to date: 5.5 TFlop/s on 2048 processors
  - The main ES advantage for this code is the fast interconnect
- Allows never-before-possible high-resolution simulations
  - CdSe quantum dot: the largest cell-size atomistic experiment ever run using PARATEC
  - Important uses: quantum dots can be attached to organic molecules as electronic dye tags
- SX-8 achieves the highest per-processor performance
- X1/X1E show the lowest % of peak
  - Non-vectorizable code is much more expensive on X1/X1E (32:1)
  - Lower bisection-bandwidth-to-computation ratio (4D hypercube)
  - X1 performance is comparable to Itanium2
- Itanium2 outperforms Opteron (unlike LBMHD/GTC) because:
  - PARATEC is less sensitive to memory access issues (high computational intensity)
  - Opteron lacks an FMA unit
  - Quadrics shows better scaling of all-to-all communication at large concurrencies

Problem: 488-atom CdSe quantum dot (each cell: GFlop/s/P, % of peak)
P | Power3 Seaborg | Itanium2 Thunder | Opteron Jacquard | X1 Phoenix | X1E Phoenix | ES (SX6) | SX8 HLRS
128 | 0.93, 62% | 2.8, 51% | - | 3.2, 25% | 3.8, 21% | 5.1, 64% | 7.5, 49%
256 | 0.85, 57% | 2.6, 47% | 2.0, 45% | 3.0, 24% | 3.3, 18% | 5.0, 62% | 6.8, 43%
512 | 0.73, 49% | 2.4, 44% | 1.0, 22% | - | 2.2, 12% | 4.4, 55% | -
1024 | 0.60, 40% | 1.8, 32% | - | - | - | 3.6, 46% | -

18 Atmospheric Modeling: FVCAM
- CAM3.1: the atmospheric component of CCSM3.0
- Our focus is the Finite Volume (FV) approach
- The AGCM consists of physics (PS) and a dynamical core (DC)
  - DC approximates the dynamics of the atmosphere
  - PS calculates source terms for the equations of motion: turbulence, radiative transfer, clouds, boundary layer, etc.
- DC default: Eulerian spectral transform, which maps onto the sphere
  - Allows only a 1D decomposition in latitude
- The FVCAM grid is rectangular (longitude, latitude, level)
  - Allows a 2D decomposition (latitude, level) in the dynamics phase
  - The singularity at the pole prevents decomposition in longitude
- Dynamical equations are bounded by Lagrangian surfaces
  - Requires remapping to the Eulerian reference frame
- Hybrid (MPI/OpenMP) programming
  - MPI tasks are limited by the number of latitude lines
  - OpenMP increases potential parallelism and improves the surface-to-volume (S2V) ratio
  - Is not always effective on some platforms
- Experiments/vectorization: Art Mirin, Dave Parks, Michael Wehner, Pat Worley
- Figure: simulated Class IV hurricane at 0.5°. This storm was produced solely through the chaos of the atmospheric model; it is one of many events produced by FVCAM at a resolution of 0.5°.

19 FVCAM Decomposition and Vectorization
- Processor communication topology and volume for 1D and 2D FVCAM
  - Generated by the IPM profiling tool, used to understand interconnect requirements
- The 1D approach is straightforward nearest-neighbor communication
- The bulk of the 2D communication is also nearest-neighbor, however:
  - The pattern is complex due to the vertical decomposition and the transposition during remapping
  - Total volume in the 2D remap is reduced due to the improved surface/volume ratio
- Vectorization: 5 routines, about 1000 lines
  - Move the latitude calculation to the inner loops to maximize parallelism
  - Reduce the number of branches by performing logical tests in advance (indirect indexing), as in the sketch below
  - Vectorize across (not within) the FFTs for the polar filters
  - However, higher concurrency for a fixed-size problem limits the performance of the vectorized FFTs
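A generic sketch of the "logical tests in advance, then indirect indexing" transformation mentioned above: the conditional is hoisted into a cheap index-gathering pass, and the compute loop becomes branch-free and vectorizable through a gather/scatter index list. The latitude threshold and array names are hypothetical, not actual FVCAM code.

```c
/* Branch-hoisting sketch: precompute the indices that need work, then run a
 * branch-free indirect loop over that list.  Threshold and names are
 * hypothetical placeholders. */
void filter_with_branch(int n, const double lat[], double q[], double scale)
{
    for (int i = 0; i < n; i++)
        if (lat[i] > 60.0 || lat[i] < -60.0)   /* test inside the compute loop inhibits vectorization */
            q[i] *= scale;
}

void filter_precomputed(int n, const double lat[], double q[], double scale, int work[])
{
    int m = 0;
    for (int i = 0; i < n; i++)                /* logical test performed in advance */
        if (lat[i] > 60.0 || lat[i] < -60.0)
            work[m++] = i;

    for (int j = 0; j < m; j++)                /* branch-free, vectorizable gather/scatter loop */
        q[work[j]] *= scale;
}
```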

20 FVCAM3.1: Performance
- The FVCAM 2D decomposition allows effective use of more than 2X as many processors
  - Increasing the number of vertical discretizations (1, 4, 7) allows higher concurrencies
- Results show high-resolution vector performance at 361x576x26 (0.5° x 0.625°)
- X1E achieves a speedup of over 4500 at P=672, the highest ever achieved
  - Power3 is limited to a speedup of 600 regardless of concurrency
  - A factor of at least 1000X is necessary for the simulation to be tractable
- Raw speed of X1E: 1.14X X1, 1.4X ES, 3.7X Thunder, 13X Seaborg
- At high concurrencies (P=672) all platforms achieve a low % of peak (< 7%)
  - ES achieves the highest sustained performance (over 10% at P=256)
  - Vectors suffer from the short vector lengths of the fixed problem size, especially in the FFTs
  - Superscalars generally achieve lower efficiencies/performance than vectors
- Finer resolutions require an increased number of more powerful processors

21 Cosmology: MADCAP
- Microwave Anisotropy Dataset Computational Analysis Package
- An optimal general algorithm for extracting key cosmological data from the Cosmic Microwave Background (CMB)
- Anisotropies in the CMB contain the early history of the Universe
- Recasts the problem as dense linear algebra: ScaLAPACK
- Out-of-core calculation: holds only ~3 of the 50 matrices in memory (see the sketch below)
- Figure: temperature anisotropies in the CMB (BOOMERANG)
- Developed by Julian Borrill, LBNL
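A minimal sketch of the out-of-core pattern, under the assumption of one matrix per file and a simple accumulate step standing in for the real distributed ScaLAPACK operations: only the resident workspace and one streamed matrix are in memory at a time, while the remaining matrices stay on disk. File names and sizes are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

/* Out-of-core sketch: stream the ~50 matrices through a small resident
 * working set instead of holding them all in memory.  The file layout and
 * the accumulate() step are hypothetical stand-ins. */
enum { NMAT = 50 };

static void accumulate(double *acc, const double *m, size_t n) {
    for (size_t i = 0; i < n; i++) acc[i] += m[i];   /* stand-in for the real dense-algebra step */
}

int process_out_of_core(size_t nelem) {
    double *acc = calloc(nelem, sizeof *acc);        /* resident workspace */
    double *buf = malloc(nelem * sizeof *buf);       /* one streamed matrix at a time */
    if (!acc || !buf) { free(acc); free(buf); return -1; }

    for (int m = 0; m < NMAT; m++) {
        char path[64];
        snprintf(path, sizeof path, "matrix_%02d.bin", m);   /* hypothetical file naming */
        FILE *fp = fopen(path, "rb");
        if (!fp) { free(acc); free(buf); return -1; }
        size_t got = fread(buf, sizeof(double), nelem, fp);  /* read one matrix from disk */
        fclose(fp);
        if (got == nelem)
            accumulate(acc, buf, nelem);             /* compute while only a few matrices are resident */
    }
    free(acc); free(buf);
    return 0;
}
```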

22 MADCAP: Performance
- Overall performance can be surprisingly low for a dense linear algebra code
- I/O takes a heavy toll on Phoenix and Columbia: I/O optimization is currently in progress
- The NERSC Power3 shows the best system balance with respect to I/O
- ES lacks high-performance parallel I/O: the code was rewritten to utilize local disks
- HPCS I/O candidate benchmark

(each cell: GFlop/s/P, % of peak)
Number of Pixels | P | NERSC (Power3) | Columbia (Itanium2) | Phoenix (X1) | ES (SX6*)
10K | 64 | 0.73, 49% | 1.2, 20% | 2.2, 17% | 2.9, 37%
20K | 256 | 0.76, 51% | 1.1, 19% | 0.6, 5% | 4.0, 50%
40K | 1024 | 0.75, 50% | - | - | 4.6, 58%

23 Performance Overview
- Tremendous potential of vector systems:
  - ES achieves unprecedented aggregate performance on almost all test applications
  - LBMHD-3D achieves 26 TFlop/s using 4800 ES processors (68% of peak) - Gordon Bell finalist
  - GTC achieves 7.2 TFlop/s on 4096 ES processors
  - PARATEC achieves 5.5 TFlop/s on 2048 ES processors
  - ES consistently achieves the highest efficiency
- Investigating how to incorporate vector-like facilities into modern micros
- To date we have looked mostly at regularly structured algorithms
  - Many methods remain unexamined (and are more difficult to vectorize): implicit, multigrid, AMR, unstructured/adaptive, iterative and direct sparse solvers, N-body, fast multipole
- One size does not fit all: a range of architectural designs is needed to best fit algorithms
- Charts: % of theoretical peak; speedup vs. ES

