
1 COMPUTATIONAL RESEARCH DIVISION / BIPS: Leading Computational Methods on Modern Parallel Systems. Leonid Oliker, Lawrence Berkeley National Laboratory. Joint work with: Julian Borrill, Andrew Canning, Stephane Ethier, Art Mirin, David Parks, John Shalf, David Skinner, Michael Wehner, Patrick Worley

2 Motivation
- Stagnating application performance is a well-known problem in scientific computing
- By the end of the decade, numerous mission-critical applications are expected to have 100X the computational demands of current levels
- Many HEC platforms are poorly balanced for the demands of leading applications
  - Memory-CPU gap, deep memory hierarchies, poor network-processor integration, low-degree network topology
- Traditional superscalar trends are slowing down
  - Most of the benefits of ILP and pipelining have been mined; clock frequency is limited by power concerns
- Achieving the next order of magnitude in computing will require O(100K)-O(1M) processors: interconnects that scale nonlinearly in cost (crossbar, fat-tree) become impractical

3 Application Evaluation
- Microbenchmarks, algorithmic kernels, and performance modeling/prediction are important components of understanding and improving architectural efficiency
- However, full-scale application performance is the final arbiter of system utility, and is necessary as a baseline to support all complementary approaches
- Our evaluation work emphasizes full applications, with real input data, at the appropriate scale
  - Requires coordination of computer scientists and application experts from highly diverse backgrounds
  - Allows advanced optimizations for given architectural platforms
  - In-depth studies reveal limitations of compilers, operating systems, and hardware, since all of these components must work together at scale
- Our recent efforts have focused on the potential of modern parallel vector systems
  - Effective code vectorization is an integral part of the process
  - First US team to conduct an Earth Simulator performance study

4 Vector Paradigm
- High memory bandwidth
  - Allows systems to effectively feed the ALUs (high byte-to-flop ratio)
- Flexible memory addressing modes
  - Support fine-grained strided and irregular data access
- Vector registers
  - Hide memory latency via deep pipelining of memory loads/stores
- Vector ISA
  - A single instruction specifies a large number of identical operations
- Vector architectures allow for:
  - Reduced control complexity
  - Efficient utilization of a large number of computational resources
  - Potential for automatic discovery of parallelism
- Can some of these features be added to modern microprocessors?
- However: the approach is most effective only if sufficient regularity is discoverable in the program structure, and performance suffers even if a small % of the code is non-vectorizable (Amdahl's Law)
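To make the access patterns above concrete, here is a minimal C sketch (not from the slides; function names are illustrative) of unit-stride, strided, and indexed (gather/scatter) loops with no loop-carried dependence, which a vectorizing compiler can map onto single vector instructions covering many iterations:

```c
#include <stddef.h>

/* Illustrative only: three access patterns a vector compiler can handle.
   Each loop body is independent across i, so the loop can be issued as a
   few vector instructions instead of one scalar iteration at a time. */
void axpy_unit(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];                        /* unit-stride load/store */
}

void axpy_strided(size_t n, size_t stride, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i * stride] += a * x[i * stride];      /* strided access */
}

void axpy_gather(size_t n, const size_t *idx, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[idx[i]] += a * x[idx[i]];              /* indexed (gather/scatter) access;
                                                    vector-safe only if idx[] has no repeats */
}
```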

5 Architectural Comparison

Node Type | Where | Network    | CPU/Node | Clock (MHz) | Peak (GFlop/s) | Stream BW (GB/s/P) | Peak byte/flop | MPI BW (GB/s/P) | MPI Latency (µsec) | Network Topology
Power3    | NERSC | Colony     | 16       | 375         | 1.5            | 0.4                | 0.26           | 0.13            | 16.3               | Fat-tree
Itanium2  | LLNL  | Quadrics   | 4        | 1400        | 5.6            | 1.1                | 0.19           | 0.25            | 3.0                | Fat-tree
Opteron   | NERSC | InfiniBand | 2        | 2200        | 4.4            | 2.3                | 0.51           | 0.59            | 6.0                | Fat-tree
X1        | ORNL  | Custom     | 4        | 800         | 12.8           | 14.9               | 1.16           | 6.3             | 7.1                | 4D-Hypercube
X1E       | ORNL  | Custom     | 4        | 1130        | 18.0           | 9.7                | 0.54           | 2.9             | 5.0                | 4D-Hypercube
ES        | ESC   | IN         | 8        | 1000        | 8.0            | 26.3               | 3.29           | 1.5             | 5.6                | Crossbar
SX-8      | HLRS  | INX        | 8        | 2000        | 16.0           | 41.0               | 2.56           | 2.0             | 5.0                | Crossbar

- Custom vector systems: high bandwidth, flexible addressing, many registers, vector ISA
  - Less control complexity, use of many ALUs, automatic parallelism
- ES shows the best balance between memory and peak performance
- Data caches of the superscalar systems and X1(E) can potentially reduce memory costs
- X1E: two MSPs per MCM increases contention for memory and interconnect
- A key 'balance point' for vector systems is the scalar:vector performance ratio (8:1 or even 32:1)
- Opteron/InfiniBand shows the best balance among the superscalars; Itanium2/Quadrics has the lowest latency
- Cost is a critical metric; however, we are unable to provide such data
  - Proprietary; pricing varies based on customer and time frame
- Poorly balanced systems cannot solve certain classes of problems/resolutions regardless of processor count!
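As an aside, the "Peak byte/flop" column is simply the STREAM bandwidth divided by the peak flop rate per processor. The short C program below (illustrative only, using the numbers quoted in the table above) reproduces that balance metric:

```c
#include <stdio.h>

/* Recompute the table's "peak byte/flop" balance as STREAM bandwidth
   (GB/s per processor) divided by peak (GFlop/s per processor). */
int main(void) {
    struct { const char *name; double peak_gflops, stream_gbs; } sys[] = {
        {"Power3",   1.5,  0.4}, {"Itanium2", 5.6,  1.1}, {"Opteron", 4.4,  2.3},
        {"X1",      12.8, 14.9}, {"X1E",     18.0,  9.7}, {"ES",      8.0, 26.3},
        {"SX-8",    16.0, 41.0},
    };
    for (int i = 0; i < 7; i++)
        printf("%-8s  %5.2f bytes/flop\n",
               sys[i].name, sys[i].stream_gbs / sys[i].peak_gflops);
    return 0;
}
```

(The Opteron entry comes out near 0.52, which the table rounds to 0.51; the rest match directly.)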

6 Application Overview

NAME    | Discipline       | Problem/Method                                              | Structure
Cactus  | Astrophysics     | Theory of GR, ADM-BSSN, Method of Lines                     | Grid
LBMHD   | Plasma Physics   | Magneto-hydrodynamics, Lattice Boltzmann                    | Lattice/Grid
MADCAP  | Cosmology        | Cosmic Microwave Background, dense linear algebra, high I/O | Dense Matrix
GTC     | Magnetic Fusion  | Particle-in-cell, Vlasov-Poisson                            | Particle/Grid
PARATEC | Material Science | Density Functional Theory, Kohn-Sham, 3D FFT                | Fourier/Grid
FVCAM   | Climate Modeling | AGCM, finite volume, Navier-Stokes, FFT                     | Grid

- Examining candidate ultra-scale applications with abundant data parallelism
- Codes designed for superscalar architectures required a vectorization effort
- Use of the ES requires meeting minimum vectorization and parallelization hurdles

7 Astrophysics: CACTUS
- Numerical solution of Einstein's equations from the theory of general relativity
- Among the most complex in physics: a coupled set of nonlinear hyperbolic and elliptic systems with thousands of terms
- CACTUS evolves these equations to simulate high gravitational fluxes, such as the collision of two black holes
- Evolves the PDEs on a regular grid using finite differences
- Gravitational wave astronomy: an exciting new field about to be born, with new fundamental insight into the universe
[Figure: Visualization of a grazing collision of two black holes]
Developed at the Max Planck Institute; vectorized by John Shalf, LBNL

8 CACTUS: Performance
- SX-8 attains the highest per-processor performance ever recorded for Cactus: 58X Power3!
- Vector performance is related to the x-dimension (the vector length); a stencil sketch follows the table below
- Excellent scaling on ES/SX-8 using fixed data size per processor (weak scaling)
  - Opens the possibility of computations at unprecedented scale
- X1 surprisingly poor (~6x slower than SX-8): low scalar:vector performance ratio
  - Unvectorized boundary conditions required 15% of the runtime on the SX-8 and 30+% on the X1, versus < 5% for the scalar version: unvectorized code can quickly dominate cost
  - To date, unable to successfully execute the code on the X1E
- Poor superscalar performance despite high computational intensity
  - Register spilling due to the large number of loop variables
  - Prefetch engines inhibited by the multi-layer ghost-zone calculations

Problem size: 250x80x80 per processor (GFlop/s per processor and % of peak)

P   | Power3 (Seaborg) | Itanium2 (Thunder) | Opteron (Jacquard) | X1 (Phoenix) | SX-6* (ES) | SX-8 (HLRS)
16  | 0.10 (6%)        | 0.58 (10%)         | 0.82 (19%)         | 0.81 (6%)    | 2.8 (35%)  | 4.3 (27%)
64  | 0.08 (6%)        | 0.56 (10%)         | 0.72 (21%)         | 0.72 (6%)    | 2.7 (34%)  | 4.0 (25%)
256 | 0.07 (5%)        | 0.55 (10%)         | 0.68 (16%)         | 0.68 (5%)    | 2.7 (34%)  | 3.9 (24%)
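The vector-length point can be illustrated with a hedged sketch of a regular-grid finite-difference update (this is not Cactus's actual BSSN kernels; names and the Laplacian-style operator are illustrative). The x sweep is the innermost loop, so its trip count — 250 points per processor in the test problem — sets the vector length:

```c
#include <stddef.h>

/* Hypothetical 7-point stencil update on a regular grid, written the way the
   slide describes: the x sweep is innermost, so nx determines the vector
   length on the NEC and Cray systems. */
#define IDX(i, j, k, nx, ny) ((size_t)(k) * (ny) * (nx) + (size_t)(j) * (nx) + (i))

void stencil_step(int nx, int ny, int nz, double h2inv,
                  const double *u, double *unew) {
    for (int k = 1; k < nz - 1; k++)
        for (int j = 1; j < ny - 1; j++)
            for (int i = 1; i < nx - 1; i++)         /* vectorized dimension */
                unew[IDX(i, j, k, nx, ny)] = h2inv *
                    (u[IDX(i-1, j, k, nx, ny)] + u[IDX(i+1, j, k, nx, ny)] +
                     u[IDX(i, j-1, k, nx, ny)] + u[IDX(i, j+1, k, nx, ny)] +
                     u[IDX(i, j, k-1, nx, ny)] + u[IDX(i, j, k+1, nx, ny)] -
                     6.0 * u[IDX(i, j, k, nx, ny)]);
}
```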

9 Plasma Physics: LBMHD
- LBMHD uses a Lattice Boltzmann method to model magneto-hydrodynamics (MHD)
- Performs 2D/3D simulations of high-temperature plasma
- Evolves from initial conditions, decaying to form current sheets
- Spatial grid coupled to an octagonal streaming lattice
- Block-distributed over the processor grid
- Main computational components: collision, stream, interpolation
- Vectorization: loop interchange and unrolling (sketched below)
  - The X1 compiler reordered the loops automatically
[Figure: Evolution of vorticity into turbulent structures]
Ported by Jonathan Carter; developed by George Vahala's group, College of William & Mary
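A minimal sketch of the loop-interchange idea, assuming a simple BGK-style relaxation and a direction-major data layout (this is not the LBMHD source): the short loop over lattice directions stays outermost, so the long loop over grid points is innermost and vectorizes with a full vector length.

```c
#include <stddef.h>

#define NDIR 9   /* lattice directions in a 2D example (8 streaming + rest) */

/* Collision step with the grid-point loop innermost. Distributions are
   stored direction-major: f[d * npts + i]. */
void collide(size_t npts, double omega, double *f, const double *feq) {
    for (int d = 0; d < NDIR; d++)            /* short loop: NDIR iterations */
        for (size_t i = 0; i < npts; i++)     /* long loop: vectorized       */
            f[d * npts + i] += omega * (feq[d * npts + i] - f[d * npts + i]);
}

/* The vector-unfriendly ordering would put the d loop innermost, leaving a
   vector length of only NDIR. */
```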

10 LBMHD-3D: Performance

(GFlop/s per processor and % of peak)

Grid Size | P   | Power3 (Seaborg) | Itanium2 (Thunder) | Opteron (Jacquard) | X1 (Phoenix) | X1E (Phoenix) | SX-6 (ES) | SX-8 (HLRS)
256^3     | 16  | 0.14 (9%)        | 0.26 (5%)          | 0.70 (16%)         | 5.2 (41%)    | 6.6 (37%)     | 5.5 (69%) | 7.9 (49%)
512^3     | 64  | 0.15 (9%)        | 0.35 (6%)          | 0.68 (15%)         | 5.2 (41%)    | 5.8 (32%)     | 5.3 (66%) | 8.1 (51%)
1024^3    | 256 | 0.14 (9%)        | 0.32 (6%)          | 0.60 (14%)         | 5.2 (41%)    | 6.0 (33%)     | 5.5 (68%) | 9.6 (60%)
2048^3    | 512 | 0.14 (9%)        | 0.35 (6%)          | 0.59 (13%)         | --           | 5.8 (32%)     | 5.2 (65%) | --

- It is not unusual to see the vector systems achieve > 40% of peak while the superscalar architectures achieve ~10%
  - There is plenty of computation, but the large working set causes register spilling on the scalar systems
  - Examining cache-oblivious algorithms for cache-based systems
  - Unlike the superscalar approach, large vector register sets hide latency
- Opteron shows impressive superscalar performance, 2X the speed of Itanium2
  - Opteron has > 2x the STREAM bandwidth, and Itanium2 cannot store FP data in L1 cache
- ES sustains 68% of peak up to 4800 processors: 26 TFlop/s, an SC05 Gordon Bell finalist and by far the highest performance ever attained for this code
- SX-8 shows the highest raw performance, but ES shows the highest efficiency
  - SX-8: commodity DDR2-SDRAM vs. ES: high-performance custom FPLRAM
- X1E achieved the same performance as the X1 using the original code version
  - Turning off caching resulted in about a 10% improvement over the X1

11 Analysis of Bank Conflicts
- Comparison of bank conflicts shows that the ES custom memory has clear advantages

Code  | P   | Grid Size | Bank Conflicts (ES) | Bank Conflicts (SX-8)
MHD2D | 64  | 8192^2    | 0.3%                | 16.6%
MHD2D | 256 | 8192^2    | 0.3%                | 10.7%
MHD3D | 64  | 256^3     | 0.01%               | 1.1%
MHD3D | 256 | 512^3     | 0.01%               | 1.2%
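The sensitivity to grid size can be understood with simple bank arithmetic: with B interleaved banks, a stride-s access stream touches only B / gcd(s, B) distinct banks. The toy program below (illustrative assumptions only: 32 banks, word-interleaved addressing) shows why a power-of-two leading dimension such as 8192 is a worst case and why padding it by one element restores full bank spread.

```c
#include <stdio.h>

/* Number of distinct banks touched by a constant-stride access stream,
   assuming word-interleaved banks: B / gcd(stride, B). */
static unsigned gcd(unsigned a, unsigned b) {
    while (b) { unsigned t = a % b; a = b; b = t; }
    return a;
}

int main(void) {
    const unsigned banks = 32;                 /* assumed bank count              */
    unsigned strides[] = { 8192, 8192 + 1 };   /* unpadded vs padded column walk  */
    for (int i = 0; i < 2; i++)
        printf("stride %5u -> %2u of %u banks used\n",
               strides[i], banks / gcd(strides[i], banks), banks);
    return 0;
}
```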

12 Magnetic Fusion: GTC
- Gyrokinetic Toroidal Code: transport of thermal energy (plasma microturbulence)
- The goal of magnetic fusion is a burning-plasma power plant producing cleaner energy
- GTC solves the 3D gyroaveraged kinetic system with a particle-in-cell (PIC) approach
- PIC scales as N instead of N^2: particles interact with the electromagnetic field on a grid
- Allows solving the equations of particle motion with ODEs (instead of nonlinear PDEs)
- Vectorization inhibited: multiple particles may attempt to concurrently update the same grid point during charge deposition
[Figure: Whole volume and cross section of the electrostatic potential field, showing elongated turbulence eddies]
Developed at PPPL; vectorized/optimized by Stephane Ethier and ORNL/Cray/NEC

13 GTC Particle Decomposition
Vectorization:
- Particle charge is deposited among the nearest grid points
- Several particles can contribute to the same grid point, preventing vectorization
- Solution: VLEN copies of the charge deposition array, with a reduction after the main loop (sketched below)
  - Greatly increases the memory footprint (8x)
- GTC was originally optimized for superscalar SMPs using MPI/OpenMP
  - However, OpenMP severely increases memory use in the vector implementation
  - Vectorization and thread-level parallelism compete with each other
- Previous vector experiments were limited to only 64-way MPI parallelism
  - 64 is the optimal number of domains for the 1D toroidal decomposition (independent of the number of particles)
- Updated GTC algorithm introduces a third level of parallelism:
  - Splits the particles among several processors (within each 1D domain)
  - Allows increased concurrency and number of studied particles
  - Larger particle simulations allow increased-resolution studies
  - Particles are not subject to the Courant condition (same timestep)
  - Allows multiple-species calculations
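A hedged C sketch of the duplicated-array workaround described above (not the actual GTC Fortran): each of VLEN private copies of the charge array is updated by a different vector lane, removing the write conflict within a vector instruction, and the copies are summed after the particle loop. The interpolation weights over neighboring grid points are omitted, and VLEN = 256 is an assumed hardware vector length.

```c
#include <stddef.h>
#include <string.h>

#define VLEN 256   /* assumed vector register length */

void deposit_charge(size_t nparticles, const size_t *cell, const double *q,
                    size_t ncells, double *charge,
                    double *charge_copies /* VLEN * ncells scratch */) {
    memset(charge_copies, 0, VLEN * ncells * sizeof(double));

    /* Vectorizable: within a vector of VLEN consecutive particles, each lane
       writes to its own private copy, so there are no conflicting updates. */
    for (size_t i = 0; i < nparticles; i++)
        charge_copies[(i % VLEN) * ncells + cell[i]] += q[i];

    /* Reduction after the main loop: collapse the VLEN copies. */
    for (size_t c = 0; c < ncells; c++) {
        double sum = 0.0;
        for (int v = 0; v < VLEN; v++)
            sum += charge_copies[(size_t)v * ncells + c];
        charge[c] += sum;
    }
}
```

The memory cost of the scratch copies is exactly the footprint increase the slide refers to; the reduction is itself a vectorizable loop over grid cells.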

14 GTC: Performance
- The new decomposition algorithm efficiently utilizes high processor counts (as opposed to 64 on the ES)
- Breakthrough of the Tflop/s barrier on the ES for an important SciDAC code: 7.2 Tflop/s on 4096 ES processors
- SX-8 has the highest raw performance (of any architecture)
- Opens the possibility of a new set of high phase-space-resolution simulations
- Scalar architectures suffer from low computational intensity, irregular data access, and register spilling
- Opteron/InfiniBand is 50% faster than Itanium2/Quadrics and only half the speed of the X1
  - Opteron: on-chip memory controller and caching of FP data in L1
- X1 suffers from the overhead of the scalar code portions
  - The original (unmodified) X1 version performed 12% *slower* on the X1E
  - Additional optimizations increased performance by 50%!
  - Recent ORNL work increases performance by an additional 75%
- SciDAC code, HPCS benchmark

(GFlop/s per processor and % of peak)

P    | Part/Cell | Power3 (Seaborg) | Itanium2 (Thunder) | Opteron (Jacquard) | X1 (Phoenix) | X1E (Phoenix) | SX-6 (ES) | SX-8 (HLRS)
128  | 200       | 0.14 (9%)        | 0.39 (7%)          | 0.59 (13%)         | 1.2 (9%)     | 1.7 (10%)     | 1.9 (23%) | 2.3 (14%)
256  | 400       | 0.14 (9%)        | 0.39 (7%)          | 0.57 (13%)         | 1.2 (9%)     | 1.7 (10%)     | 1.8 (22%) | 2.3 (15%)
512  | 800       | 0.14 (9%)        | 0.38 (7%)          | 0.51 (12%)         | --           | 1.7 (9%)      | 1.8 (22%) | --
1024 | 1600      | 0.14 (9%)        | 0.37 (7%)          | --                 | --           | --            | 1.8 (22%) | --

15 Material Science: PARATEC
- First-principles quantum mechanical total-energy calculation using pseudopotentials and a plane-wave basis set
- Density Functional Theory to calculate the structure and electronic properties of new materials
- DFT calculations are among the largest consumers of supercomputer cycles in the world
- Roughly 33% 3D FFT, 33% BLAS3, 33% hand-coded F90
- Part of the calculation is in real space, the rest in Fourier space
  - Uses a specialized 3D FFT to transform the wavefunctions
  - Global transpose from Fourier to real space
  - Multiple FFTs at a time reduce communication latency
  - Vectorize across (not within) the FFTs (sketched below)
  - Custom FFT: only the nonzeros are sent
[Figure: Conduction band minimum electron state for a Cadmium Selenide (CdSe) quantum dot]
Developed by Andrew Canning with Louie and Cohen's groups (LBNL, UCB)
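The "vectorize across, not within, the FFTs" idea can be sketched as follows (this is not PARATEC's FFT; names and layout are assumptions): a single radix-2 butterfly is applied to a whole batch of independent 1D transforms stored batch-contiguously, so the innermost loop, and hence the vector length, runs over the number of simultaneous FFTs rather than the short transform length. The surrounding FFT driver that selects the pair (xa, xb) and the twiddle factor w is omitted.

```c
#include <complex.h>
#include <stddef.h>

/* One radix-2 butterfly applied to every transform in a batch of independent
   1D FFTs. xa and xb each point at nbatch contiguous values, one per
   transform, for the two points being combined; w is the shared twiddle. */
void butterfly_batched(size_t nbatch, double complex w,
                       double complex *xa, double complex *xb) {
    for (size_t b = 0; b < nbatch; b++) {   /* innermost loop over the batch */
        double complex t = w * xb[b];
        xb[b] = xa[b] - t;
        xa[b] = xa[b] + t;
    }
}
```

With a batch of, say, hundreds of wavefunction columns, the vector length stays long even when each individual 1D transform is short.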

16 PARATEC: Performance
- All architectures generally perform well due to the computational intensity of the code (BLAS3, FFT)
- ES achieves the highest overall performance to date: 5.5 Tflop/s on 2048 processors
  - The main ES/SX-8 advantage for this code is the fast interconnect
- Allows never-before-possible, high-resolution simulations
  - CdSe quantum dot: largest cell-size atomistic experiment ever run using PARATEC
  - Important uses: quantum dots can be attached to organic molecules as electronic dye tags
- SX-8 achieves the highest per-processor performance (> 2x X1E)
- X1/X1E show the lowest % of peak
  - Non-vectorizable code is much more expensive on the X1/X1E (32:1 scalar:vector ratio)
  - Lower bisection-bandwidth-to-computation ratio (4D hypercube)
  - X1 performance is comparable to Itanium2
- Itanium2 outperforms Opteron (unlike LBMHD/GTC) because
  - PARATEC is less sensitive to memory access issues (high computational intensity)
  - Opteron lacks an FMA unit
- Quadrics shows better scaling of the all-to-all at large concurrencies

Problem: 488-atom CdSe quantum dot (GFlop/s per processor and % of peak)

P    | Power3 (Seaborg) | Itanium2 (Thunder) | Opteron (Jacquard) | X1 (Phoenix) | X1E (Phoenix) | SX-6 (ES) | SX-8 (HLRS)
128  | 0.93 (62%)       | 2.8 (51%)          | --                 | 3.2 (25%)    | 3.8 (21%)     | 5.1 (64%) | 7.5 (49%)
256  | 0.85 (57%)       | 2.6 (47%)          | 2.0 (45%)          | 3.0 (24%)    | 3.3 (18%)     | 5.0 (62%) | 6.8 (43%)
512  | 0.73 (49%)       | 2.4 (44%)          | 1.0 (22%)          | --           | 2.2 (12%)     | 4.4 (55%) | --
1024 | 0.60 (40%)       | 1.8 (32%)          | --                 | --           | --            | 3.6 (46%) | --

17 Atmospheric Modeling: FVCAM
- CAM3.1: atmospheric component of CCSM3.0
- Our focus is the Finite Volume (FV) approach
- AGCM: consists of physics (PS) and dynamical core (DC)
  - The DC approximates the dynamics of the atmosphere
  - The PS calculates source terms for the equations of motion: turbulence, radiative transfer, clouds, boundary, etc.
- The default DC is the Eulerian spectral transform, which maps onto the sphere
  - Allows only a 1D decomposition in latitude
- The FVCAM grid is rectangular (longitude, latitude, level)
  - Allows a 2D decomposition (latitude, level) in the dynamics phase
  - The singularity at the pole prevents longitude decomposition
  - Dynamical equations are bounded by Lagrangian surfaces, requiring remapping to an Eulerian reference frame
- Hybrid (MPI/OpenMP) programming
  - MPI tasks are limited by the number of latitude lines
  - OpenMP increases potential parallelism and improves the S2V ratio, but is not always effective on some platforms
[Figure: Simulated Class IV hurricane at 0.5°. This storm was produced solely through the chaos of the atmospheric model; it is one of many events produced by FVCAM at a resolution of 0.5°.]
Experiments/vectorization: Art Mirin, Dave Parks, Michael Wehner, Pat Worley

18 FVCAM Decomposition and Vectorization
- Processor communication topology and volume for 1D and 2D FVCAM
  - Generated with the IPM profiling tool, used to understand interconnect requirements
- The 1D approach is straightforward nearest-neighbor communication
- The bulk of the 2D communication is nearest-neighbor; however:
  - The pattern is complex due to the vertical decomposition and the transposition during remapping
  - The total volume in the 2D remap is reduced due to the improved surface-to-volume ratio
- Vectorization: 5 routines, about 1000 lines
  - Move the latitude calculation to the inner loops to maximize parallelism
  - Reduce the number of branches by performing the logical tests in advance (indirect indexing, sketched below)
  - Vectorize across (not within) the FFTs for the polar filters
  - However, higher concurrency for a fixed-size problem limits the performance of the vectorized FFTs
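A minimal sketch of the branch-removal technique named above, with illustrative names (this is not the FVCAM source): the logical test is evaluated once into an index list, and the compute loop then runs branch-free over that list using indirect (gather) indexing, which the vector compilers handle well.

```c
#include <stddef.h>

/* Evaluate the logical test in advance, producing an index list. */
size_t build_active_list(size_t n, const double *mask, size_t *active) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (mask[i] > 0.0)
            active[count++] = i;
    return count;
}

/* Main loop: branch-free, vectorizable via gather/scatter on active[]. */
void apply_filter(size_t nactive, const size_t *active,
                  double scale, const double *in, double *out) {
    for (size_t k = 0; k < nactive; k++)
        out[active[k]] = scale * in[active[k]];
}
```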

19 FVCAM3.1: Performance
- The FVCAM 2D decomposition allows effective use of > 2X as many processors
  - Increasing the number of vertical discretizations (1, 4, 7) allows higher concurrencies
- Results show high-resolution vector performance at 361x576x26 (0.5° x 0.625°)
- X1E achieves a speedup of over 4500 at P=672: the highest ever achieved
  - To date, only 1D-decomposition results are available on the SX-8
- Power3 is limited to a speedup of 600 regardless of concurrency
  - A factor of at least 1000x is necessary for the simulation to be tractable
- Raw speed of the X1E: 1.14X X1, 1.4X ES, 3.7X Thunder, 13X Seaborg
- At high concurrencies (P=672) all platforms achieve a low % of peak (< 7%)
  - ES achieves the highest sustained performance (over 10% at P=256)
  - Vectors suffer from the short vector lengths of the fixed problem size, especially in the FFTs
  - Superscalars generally achieve lower efficiencies/performance than vectors
- Finer resolutions require an increased number of more powerful processors

20 Cosmology: MADCAP
- Microwave Anisotropy Dataset Computational Analysis Package
- Optimal general algorithm for extracting key cosmological data from the Cosmic Microwave Background Radiation (CMB)
- Anisotropies in the CMB contain the early history of the universe
- Recasts the problem as dense linear algebra: ScaLAPACK
- Out-of-core calculation: holds ~3 of the 50 matrices in memory (see the I/O sketch below)
[Figure: Temperature anisotropies in the CMB (Boomerang)]
Developed by Julian Borrill, LBNL
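A hedged illustration of the out-of-core pattern (not MADCAP's ScaLAPACK code): each large matrix lives in its own scratch file and is streamed through a small in-core buffer, which is why I/O bandwidth dominates the runtime on several of the platforms in the next slide. The file names, matrix order, and the trace computation are placeholders.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read one whole matrix from its scratch file into the in-core buffer. */
static void load_matrix(const char *path, double *buf, size_t nelems) {
    FILE *f = fopen(path, "rb");
    if (!f || fread(buf, sizeof(double), nelems, f) != nelems) { perror(path); exit(1); }
    fclose(f);
}

int main(void) {
    const size_t n = 1024;                       /* placeholder matrix order */
    double *work = malloc(n * n * sizeof(double));
    if (!work) return 1;
    double trace = 0.0;
    for (int m = 0; m < 50; m++) {               /* stream the 50 matrices   */
        char path[64];
        snprintf(path, sizeof path, "matrix_%02d.bin", m);  /* hypothetical files */
        load_matrix(path, work, n * n);
        for (size_t i = 0; i < n; i++)           /* use the in-core matrix;  */
            trace += work[i * n + i];            /* here just its trace      */
    }
    printf("sum of traces = %g\n", trace);
    free(work);
    return 0;
}
```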

21 MADCAP: Performance
- Overall performance can be surprisingly low for a dense linear algebra code
- I/O takes a heavy toll on Phoenix and Columbia; I/O optimization is currently in progress
- The NERSC Power3 shows the best system balance with respect to I/O
- The ES lacks high-performance parallel I/O; the code was rewritten to utilize local disks
- Preliminary SX-8 experiments were unsuccessful due to poor I/O performance

(GFlop/s per processor and % of peak)

Number of Pixels | P    | NERSC (Power3) | Columbia (Itanium2) | Phoenix (X1) | ES (SX-6*)
10K              | 64   | 0.73 (49%)     | 1.2 (20%)           | 2.2 (17%)    | 2.9 (37%)
20K              | 256  | 0.76 (51%)     | 1.1 (19%)           | 0.6 (5%)     | 4.0 (50%)
40K              | 1024 | 0.75 (50%)     | --                  | --           | 4.6 (58%)

22 Performance Overview
- Tremendous potential of vector systems: the SX-8 achieves unprecedented raw performance
- Demonstrated importance of vector platforms: they allow solutions at unprecedented scale
- ES achieves the highest efficiency on almost all test applications
  - LBMHD-3D achieves 26 TFlop/s using 4800 processors, GTC achieves 7.2 TFlop/s on 4096 ES processors, PARATEC achieves 5.5 TFlop/s on 2048 processors
- Porting of several of our applications to the SX-8 is in progress: MADbench (slow I/O), FVCAM 2D decomposition, Hyperclaw AMR (templated C++)
- Investigating how to incorporate vector-like facilities into modern microprocessors
- Many methods remain unexamined (and are more difficult to vectorize): implicit, multigrid, AMR, unstructured (adaptive), iterative and direct sparse solvers, N-body, fast multipole
- One size does not fit all: a range of architectural designs is needed to best fit algorithms
  - Heterogeneous supercomputing designs are on the horizon
- Moving on to the latest generation of HEC platforms: Power5, BG/L, XT3
[Chart axes: % of theoretical peak; speedup vs. ES]

