Scientific Computations on Modern Parallel Vector Systems
Leonid Oliker, Julian Borrill, Jonathan Carter, Andrew Canning, John Shalf, David Skinner (Lawrence Berkeley National Laboratory)
Stephane Ethier (Princeton Plasma Physics Laboratory)

Overview
• Superscalar cache-based architectures dominate the HPC market
• Leading architectures are commodity-based SMPs, due to generality and the perception of cost effectiveness
• The growing gap between peak and sustained performance is well known in scientific computing
• Modern parallel vector systems may bridge this gap for many important applications
• In April 2002, the Earth Simulator (ES) became operational:
  - Peak ES performance > all DOE and DOD systems combined
  - Demonstrated high sustained performance on demanding scientific applications
• Conducting an evaluation study of scientific applications on modern vector systems
  - 09/2003: MOU between ES and NERSC was completed
  - First visit to the ES center: December 8th-17th, 2003 (ES remote access not available)
  - First international team to conduct a performance evaluation study at the ES
• Examining the best mapping between demanding applications and leading HPC systems - one size does not fit all

Vector Paradigm
• High memory bandwidth
  - Allows systems to effectively feed the ALUs (high byte-to-flop ratio)
• Flexible memory addressing modes
  - Support fine-grained strided and irregular data access
• Vector registers
  - Hide memory latency via deep pipelining of memory loads/stores
• Vector ISA
  - A single instruction specifies a large number of identical operations
• Vector architectures allow for:
  - Reduced control complexity
  - Efficient utilization of a large number of computational resources
  - Potential for automatic discovery of parallelism
• However: most effective only if sufficient regularity is discoverable in the program structure
  - Performance suffers even if a small % of the code is non-vectorizable (Amdahl's Law) - see the sketch below
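To make the vectorization constraint concrete, here is a minimal Fortran illustration (not taken from any of the evaluated codes): the first loop has independent iterations and can be turned into vector instructions, while the second carries a loop dependence and cannot.

    ! Illustration only: a data-parallel loop a vector compiler can vectorize,
    ! followed by a recurrence that it cannot.
    subroutine vector_example(n, a, b, c)
      implicit none
      integer, intent(in) :: n
      real(8), intent(in) :: b(n), c(n)
      real(8), intent(inout) :: a(n)
      integer :: i

      ! Vectorizable: every iteration is independent, so one vector instruction
      ! can update many elements at once.
      do i = 1, n
         a(i) = b(i) + 2.0d0*c(i)
      end do

      ! Not vectorizable as written: a(i) depends on a(i-1) from the previous
      ! iteration (loop-carried dependence), forcing scalar execution.
      do i = 2, n
         a(i) = a(i-1) + b(i)
      end do
    end subroutine vector_example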

Architectural Comparison

Platform | Where | Network Topology
Power3   | NERSC | Fat-tree
Power4   | ORNL  | Fat-tree
Altix    | ORNL  | Fat-tree
ES       | ESC   | Crossbar
X1       | ORNL  | 2D-torus

(Platforms compared on: CPUs/node, clock MHz, peak GFlop/s, memory BW GB/s, peak byte/flop, network BW GB/s per processor, bisection BW byte/flop, and MPI latency in usec.)

• Custom vector architectures have:
  - High memory bandwidth relative to peak (see the worked example below)
  - Superior interconnect: latency, point-to-point, and bisection bandwidth
• Overall, the ES appears to be the most balanced architecture, while the Altix shows the best balance among the superscalar architectures
• A key 'balance point' for vector systems is the scalar:vector ratio
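As a worked example of the byte/flop balance, using the ES figures from the extra slides: each ES CPU has a 32 GB/s path to memory and an 8 Gflop/s peak, so its memory balance is 32 / 8 = 4 bytes per flop; by the same arithmetic, a processor whose memory bandwidth in GB/s is much smaller than its peak rate in Gflop/s sits well below 1 byte per flop.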

Memory Performance
• Triad memory test: A(i) = B(i) + s*C(i), with no machine-specific optimizations (see the sketch below)
• For strided access, the SX6 achieves 10X, 100X, 1000X improvement over the X1, Power4, Power3
• For gather/scatter, the SX6 and X1 show similar performance and exceed the scalar systems at larger data sizes
• All machines' performance can be improved via architecture-specific optimizations
  - Example: on the X1, using the non-cacheable and unroll(4) pragmas improves strided bandwidth by 20X
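A minimal sketch of this kind of triad measurement is shown below; the array size and the stride sweep are illustrative choices, not the actual benchmark parameters.

    ! Minimal triad sketch: a(i) = b(i) + s*c(i), timed over a sweep of strides.
    program triad_sketch
      implicit none
      integer, parameter :: n = 4*1024*1024
      real(8), allocatable :: a(:), b(:), c(:)
      real(8) :: s, t0, t1
      integer :: i, stride

      allocate(a(n), b(n), c(n))
      b = 1.0d0; c = 2.0d0; s = 3.0d0
      do stride = 1, 64, 9                 ! illustrative stride sweep
         call cpu_time(t0)
         do i = 1, n, stride               ! strided triad
            a(i) = b(i) + s*c(i)
         end do
         call cpu_time(t1)
         print '(a,i3,a,f10.6,a)', 'stride ', stride, ': ', t1 - t0, ' s'
      end do
    end program triad_sketch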

Analysis using 'Architectural Probe'
• Tunable parameters mimic the behavior of important scientific kernels (see the sketch below)
• What % of memory accesses can be random before performance decreases by half?
  - Gather/scatter is expensive on commodity cache-based systems
  - Power4 tolerates only 1.6% (1 in 64)
  - Itanium2 is much less sensitive at 25% (1 in 4)
• How much computational intensity is required to hide the penalty of all-random access?
  - A huge amount of computation may be required to hide the overhead of irregular data access
  - Itanium2 requires a CI of about 9 flops/word; Power4 requires a CI of almost 75!
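A sketch of the kind of tunable kernel such a probe is built from is shown below; the parameter names and the flop mix are illustrative assumptions, not the actual probe.

    ! Probe-style kernel: a tunable fraction of the loads go through an index
    ! array (gather), and a small inner loop of flops per element stands in
    ! for computational intensity (CI).
    subroutine probe(n, frac_random, ci, a, b, idx)
      implicit none
      integer, intent(in) :: n, ci, idx(n)
      real(8), intent(in) :: frac_random, b(n)
      real(8), intent(inout) :: a(n)
      integer :: i, k, ncut
      real(8) :: x

      ncut = int((1.0d0 - frac_random) * n)   ! first ncut accesses are unit-stride
      do i = 1, n
         if (i <= ncut) then
            x = b(i)                           ! contiguous load
         else
            x = b(idx(i))                      ! indexed (random) load
         end if
         do k = 1, ci                          ! ~2*ci flops per word loaded
            x = x * 1.000001d0 + 0.5d0
         end do
         a(i) = x
      end do
    end subroutine probe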

Sample Kernel Performance
• NPB FT (Class B):
  - FFT is computationally intensive with data-parallel operations
  - Well suited for vectorization: 17X and 4X faster than Power3 and Power4
  - Fixed-cost interprocessor communication hurts scalability
• N-body (Barnes-Hut):
  - Requires irregular, unstructured data access, control flow, and communication
  - Poorly suited for vectorization: 2X and 5X slower than Power3 and Power4
• Vector architectures may not be suitable for all classes of applications

Applications Studied

Code     | Discipline       | Lines   | Method               | Description
LBMHD    | Plasma Physics   | 1,500   | grid based           | Lattice Boltzmann approach for magneto-hydrodynamics
CACTUS   | Astrophysics     | 100,000 | grid based           | Solves Einstein's equations of general relativity
PARATEC  | Material Science | 50,000  | Fourier space/grid   | Density Functional Theory electronic structures code
GTC      | Magnetic Fusion  | 5,000   | particle based       | Particle-in-cell method for the gyrokinetic Vlasov-Poisson equation
MADbench | Cosmology        | 2,000   | dense linear algebra | Maximum likelihood two-point angular correlation, I/O intensive

• Applications chosen with the potential to run at ultrascale
  - Computations contain abundant data parallelism
  - ES runs require minimal parallelization and vectorization hurdles
• Codes originally designed for superscalar systems
• Ported onto a single node of the SX6; first multi-node experiments performed at the ESC

Plasma Physics: LBMHD
• LBMHD uses a Lattice Boltzmann method to model magneto-hydrodynamics (MHD)
• Performs 2D simulation of high-temperature plasma
• Evolves from initial conditions, decaying to form current sheets
• 2D spatial grid is coupled to an octagonal streaming lattice
• Block-distributed over a 2D processor grid
• Main computational components:
  - Collision: requires coefficients for the local gridpoint only, no communication
  - Stream: values at gridpoints are streamed to neighbors; at cell boundaries information is exchanged via MPI
  - Interpolation: step required between the spatial and stream lattices
• Developed by George Vahala's group (College of William and Mary); ported by Jonathan Carter
[Figure: current density decay of two cross-shaped structures]

LBMHD: Porting Details
• Collision routine rewritten (see the sketch below):
  - For the ES, loop ordering was switched so the gridpoint loop (~1000 iterations) is inner rather than the velocity or magnetic-field loops (~10 iterations)
  - The X1 compiler made this transformation automatically: multistreaming the outer loop and vectorizing (via strip mining) the inner loop
  - Temporary arrays padded to reduce bank conflicts
• Stream routine performs well:
  - Array shift operations, block copies, 3rd-degree polynomial evaluation
• Boundary value exchange
  - MPI_Isend, MPI_Irecv pairs
  - Further work: plan to use the ES "global memory" to remove message copies
[Figure: (left) octagonal streaming lattice coupled with the square spatial grid; (right) example of a diagonal streaming vector updating three spatial cells]
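A schematic of the collision-loop interchange is shown below; the array names, the update formula, and the extents are illustrative, not the actual LBMHD source.

    subroutine collision_sketch(ngrid, nvel, omega, f, feq)
      implicit none
      integer, intent(in) :: ngrid, nvel          ! ~1000 gridpoints, ~10 velocities
      real(8), intent(in) :: omega, feq(ngrid, nvel)
      real(8), intent(inout) :: f(ngrid, nvel)
      integer :: i, v

      ! Original ordering (short velocity loop innermost -> vector length ~10):
      !   do i = 1, ngrid
      !      do v = 1, nvel
      !         f(i,v) = f(i,v) + omega*(feq(i,v) - f(i,v))
      !      end do
      !   end do

      ! ES ordering: the long gridpoint loop is innermost, so the compiler
      ! vectorizes over ~1000 gridpoints; the X1 compiler performed this
      ! interchange automatically (multistreaming the outer loop and
      ! strip-mining the inner one).
      do v = 1, nvel
         do i = 1, ngrid
            f(i,v) = f(i,v) + omega*(feq(i,v) - f(i,v))
         end do
      end do
    end subroutine collision_sketch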

LBMHD: Performance
• ES achieves the highest performance to date: over 3.3 Tflop/s for P=1024
  - X1 shows comparable absolute speed up to P=64 (lower % of peak)
  - But performs 1.5X slower at P=256 (decreased scalability)
• CAF improved the X1 to slightly exceed the ES at P=64 (up to 4.70 Gflop/s per processor)
• ES is 44X, 16X, and 7X faster than Power3, Power4, and Altix
  - Low CI (1.5) and high memory requirement (30GB) hurt scalar performance
• Altix is the best scalar platform due to its high memory bandwidth and fast interconnect
[Table: Gflop/s per processor and % of peak on Power3, Power4, Altix, ES, and X1 for 4096 x 4096 and 8192 x 8192 grids at several processor counts]

LBMHD on X1: MPI vs. CAF
• The X1 is well suited for one-sided parallel languages (globally addressable memory)
  - MPI hinders this feature and requires scalar tag matching
• CAF allows much simpler coding of the boundary exchange (array subscripting):
  feq(ista-1,jsta:jend,1) = feq(iend,jsta:jend,1)[iprev,myrankj]
• MPI requires non-contiguous data copies into a buffer, unpacked at the destination (see the sketch below)
• Since communication is about 10% of LBMHD, only slight improvements
• However, at higher concurrency performance degrades. Tradeoffs:
  - CAF reduced total message volume 3X (eliminates user and system buffer copies)
  - But CAF used more numerous and smaller messages
[Table: X1-MPI vs. X1-CAF Gflop/s per processor and % of peak at several processor counts]
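The contrast between the two exchange styles is sketched below; the array names mirror the slide's example, but the extents, neighbor ranks, and the simple 1D decomposition are illustrative assumptions.

    ! MPI version: pack a non-contiguous boundary slice, exchange, unpack.
    subroutine exchange_mpi(feq, iend, jsta, jend, prev, next)
      use mpi
      implicit none
      integer, intent(in) :: iend, jsta, jend, prev, next
      real(8), intent(inout) :: feq(0:iend, jsta:jend, 1)
      real(8) :: sbuf(jsta:jend), rbuf(jsta:jend)
      integer :: req(2), ierr

      sbuf = feq(iend, jsta:jend, 1)           ! pack (non-contiguous in memory)
      call MPI_Irecv(rbuf, jend-jsta+1, MPI_DOUBLE_PRECISION, prev, 0, &
                     MPI_COMM_WORLD, req(1), ierr)
      call MPI_Isend(sbuf, jend-jsta+1, MPI_DOUBLE_PRECISION, next, 0, &
                     MPI_COMM_WORLD, req(2), ierr)
      call MPI_Waitall(2, req, MPI_STATUSES_IGNORE, ierr)
      feq(0, jsta:jend, 1) = rbuf              ! unpack into the ghost row
    end subroutine exchange_mpi

    ! CAF version: the same exchange is the single one-sided array assignment
    ! quoted on the slide, with no explicit pack/unpack or tag matching:
    !   feq(ista-1, jsta:jend, 1) = feq(iend, jsta:jend, 1)[iprev, myrankj]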

Astrophysics: CACTUS
• Numerical solution of Einstein's equations from the theory of general relativity
• Among the most complex in physics: a set of coupled nonlinear hyperbolic and elliptic systems with thousands of terms
• CACTUS evolves these equations to simulate high gravitational fluxes, such as the collision of two black holes
• Evolves PDEs on a regular grid using finite differences
• Uses the ADM formulation: domain decomposed into 3D hypersurfaces for different slices of space along the time dimension
• An exciting new field is about to be born: Gravitational Wave Astronomy - fundamentally new information about the Universe
• Gravitational waves: ripples in spacetime curvature, caused by the motion of matter, causing distances to change
• Communication at boundaries; expect high parallel efficiency
• Developed at the Max Planck Institute, vectorized by John Shalf
[Figure: visualization of a grazing collision of two black holes]

CACTUS: Performance
• ES achieves the fastest performance to date: 45X faster than Power3!
  - Vector performance related to the x-dimension (vector length)
  - Excellent scaling on the ES using fixed data size per processor (weak scaling)
  - Scalar performance better on the smaller problem size (cache effects)
• X1 surprisingly poor (4X slower than the ES) - low scalar:vector ratio
  - Unvectorized boundary condition required 15% of runtime on the ES and 30+% on the X1
  - < 5% for the scalar version: unvectorized code can quickly dominate cost
• Poor superscalar performance despite high computational intensity
  - Register spilling due to the large number of loop variables
  - Prefetch engines inhibited by the multi-layer ghost-zone calculations
[Table: Gflop/s per processor and % of peak on Power3, Power4, Altix, ES, and X1 for 80x80x80 and 250x80x80 grid points per processor at several processor counts]

Material Science: PARATEC
• PARATEC performs first-principles quantum mechanical total energy calculations using pseudopotentials and a plane wave basis set
• Density Functional Theory (DFT) is used to calculate the structure and electronic properties of new materials
• DFT calculations are among the largest consumers of supercomputer cycles in the world
• Uses an all-band CG approach to obtain the wavefunctions of the electrons
• 33% 3D FFT, 33% BLAS3, 33% hand-coded F90
• Part of the calculation is in real space, the other in Fourier space
  - Uses a specialized 3D FFT to transform the wavefunctions
• Computationally intensive - generally obtains a high percentage of peak
• Developed by Andrew Canning with Louie and Cohen's groups (UCB, LBNL)
[Figure: induced current and charge density in crystallized glycine]

PARATEC: Wavefunction Transpose
• Transpose from Fourier to real space
• 3D FFT done via 3 sets of 1D FFTs and 2 transposes (see the sketch below)
• Most communication is in the global transpose, (b) to (c); little communication from (d) to (e)
• Many FFTs are done at the same time to avoid latency issues
• Only non-zero elements are communicated/calculated
• Much faster than the vendor 3D FFT
[Figure: panels (a)-(f) illustrating the distributed wavefunction layouts and the transposes between them]
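The "three sets of 1D FFTs plus two transposes" structure is sketched below in serial form; a naive DFT stands in for the optimized 1D FFTs, the two transpose steps are where the global MPI communication happens in the parallel code, and the result comes out with its axes permuted (which a real code tracks in its data layout). This is an illustration of the structure, not PARATEC's implementation.

    program fft3d_sketch
      implicit none
      integer, parameter :: n = 8
      complex(8) :: a(n,n,n)
      integer :: j, k

      a = (1.0d0, 0.0d0)

      ! 1D transforms along the first (contiguous) dimension of every pencil.
      do k = 1, n
         do j = 1, n
            call dft1d(n, a(:,j,k))
         end do
      end do
      call transpose_xy(n, a)   ! in the parallel code: first global transpose
      do k = 1, n
         do j = 1, n
            call dft1d(n, a(:,j,k))
         end do
      end do
      call transpose_xz(n, a)   ! second global transpose
      do k = 1, n
         do j = 1, n
            call dft1d(n, a(:,j,k))
         end do
      end do

    contains

      subroutine dft1d(n, x)    ! naive stand-in for an optimized 1D FFT
        integer, intent(in) :: n
        complex(8), intent(inout) :: x(n)
        complex(8) :: y(n)
        real(8) :: twopi
        integer :: p, q
        twopi = 8.0d0*atan(1.0d0)
        do p = 1, n
           y(p) = (0.0d0, 0.0d0)
           do q = 1, n
              y(p) = y(p) + x(q)*exp(cmplx(0.0d0, -twopi*(p-1)*(q-1)/n, 8))
           end do
        end do
        x = y
      end subroutine dft1d

      subroutine transpose_xy(n, a)          ! swap dimensions 1 and 2
        integer, intent(in) :: n
        complex(8), intent(inout) :: a(n,n,n)
        integer :: k
        do k = 1, n
           a(:,:,k) = transpose(a(:,:,k))
        end do
      end subroutine transpose_xy

      subroutine transpose_xz(n, a)          ! swap dimensions 1 and 3
        integer, intent(in) :: n
        complex(8), intent(inout) :: a(n,n,n)
        complex(8) :: b(n,n,n)
        integer :: i, k
        do k = 1, n
           do i = 1, n
              b(k,:,i) = a(i,:,k)
           end do
        end do
        a = b
      end subroutine transpose_xz
    end program fft3d_sketch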

PARATEC Scaling: ES vs. Power3
• The ES can run the same system about 10 times faster than the IBM SP (on any number of processors)
• The main advantage of the ES for these types of codes is its fast communication network
• Fast processors require less fine-grained parallelism in the code to reach the same performance as RISC machines
• Vector architectures offer the opportunity to simulate systems not possible on scalar platforms

PARATEC: Performance
• ES achieves the fastest performance to date! Over 2 Tflop/s on 1024 processors
• The main advantage for this type of code is the fast interconnect
• X1 is 3.5X slower than the ES (although its peak is 50% higher)
  - Non-vectorizable code can be much more expensive on the X1 (32:1 vs. 8:1)
  - Lower bisection bandwidth to computation ratio
• Limited scalability due to the increasing cost of the global transpose and reduced vector length
  - Plan to run a larger problem size on the next ES visit
• Scalar architectures generally perform well due to the high computational intensity
  - Power3, Power4, and Altix are 8X, 4X, and 1.5X slower than the ES
• Vector architectures offer the opportunity to simulate systems not possible on scalar platforms
[Table: Gflop/s per processor and % of peak on Power3, Power4, Altix, ES, and X1 for two problem sizes (including 432 atoms) at several processor counts]

Magnetic Fusion: GTC
• Gyrokinetic Toroidal Code: transport of thermal energy (plasma microturbulence)
• The goal of magnetic fusion is a burning-plasma power plant producing cleaner energy
• GTC solves the 3D gyroaveraged gyrokinetic system with a particle-in-cell (PIC) approach
• PIC scales as N instead of N^2 - particles interact with the electromagnetic field on a grid
• Allows solving the equations of particle motion with ODEs (instead of nonlinear PDEs)
• Main computational tasks:
  - Scatter: deposit particle charge to the nearest grid points
  - Solve: Poisson equation to get the potential at each point
  - Gather: calculate force based on the neighbors' potential
  - Move: advance particles by solving the equations of motion
  - Shift: move particles that left the local domain
• Developed at Princeton Plasma Physics Laboratory, vectorized by Stephane Ethier
[Figure: 3D visualization of the electrostatic potential in a magnetic fusion device]

GTC: Scatter Operation
• Particle charge is deposited amongst the nearest grid points
• Force is calculated based on the neighbors' potential, then the particle is moved accordingly
• Several particles can contribute to the same grid points, resulting in memory conflicts (dependencies) that prevent vectorization
• Solution: VLEN copies of the charge deposition array, with a reduction after the main loop (see the sketch below)
  - However, this greatly increases the memory footprint (8X)
• Since particles are randomly localized, the scatter also hinders cache reuse
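A sketch of this workaround is shown below; the lane assignment, array names, and single-point deposition are simplified assumptions (the actual GTC deposits each particle's charge onto several surrounding grid points).

    subroutine deposit_charge(np, ng, igrid, q, rho)
      implicit none
      integer, parameter :: vlen = 256          ! hardware vector length on the ES
      integer, intent(in) :: np, ng, igrid(np)  ! nearest grid point per particle
      real(8), intent(in) :: q(np)              ! particle charges
      real(8), intent(out) :: rho(ng)           ! charge density on the grid
      real(8) :: rho_v(vlen, ng)                ! one copy per vector lane: the memory cost noted above
      integer :: i, l

      rho_v = 0.0d0
      ! Each lane accumulates into its own copy, so two particles hitting the
      ! same grid point no longer create a dependence inside a vector of 256.
      do i = 1, np
         l = mod(i-1, vlen) + 1
         rho_v(l, igrid(i)) = rho_v(l, igrid(i)) + q(i)
      end do
      ! Reduction over the copies after the main loop.
      rho = sum(rho_v, dim=1)
    end subroutine deposit_charge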

GTC: Performance
• ES achieves the fastest performance of any tested architecture!
  - First time the code achieved 20% of peak - compared with less than 10% on superscalar systems
  - Vector hybrid (OpenMP) parallelism not possible due to increased memory requirements
  - P=64 on the ES is 1.6X faster than P=1024 on Power3!
  - Reduced scalability due to decreasing vector length, not MPI performance
• Non-vectorizable code portions are expensive on the X1
  - Before vectorization, the shift routine accounted for 11% of ES and 54% of X1 overhead
• Larger tests could not be performed at the ES due to parallelization/vectorization hurdles
  - Currently developing a new version with increased particle decomposition
• The advantage of the ES for PIC codes may reside in higher statistical resolution simulations
  - Greater speed allows more particles per cell
[Table: Gflop/s per processor and % of peak on Power3, Power4, Altix, ES, and X1 for 10 particles/cell (20M) and 100 particles/cell (200M) at several processor counts]

GTC: Performance
• With increasing processors and a fixed problem size, the vector length decreases
• Limited scaling is due to decreased vector efficiency rather than communication overhead
• MPI communication by itself has near-perfect scaling

Cosmology: MADbench
• Microwave Anisotropy Dataset Computational Analysis Package
• Optimal general algorithm for extracting key cosmological data from the Cosmic Microwave Background Radiation (CMB)
• CMB encodes the fundamental parameters of cosmology: Universe geometry, expansion rate, number of neutrino species
• Preserves the full complexity of the underlying scientific problem
• Calculates the maximum likelihood two-point angular correlation function
• Recasts the problem in dense linear algebra: ScaLAPACK
  - Steps include: matrix-matrix multiply, matrix-vector multiply, Cholesky decomposition, redistribution
• High I/O requirement - due to the out-of-core nature of the calculation
• Developed at NERSC/CRD by Julian Borrill

CMB Data Analysis
CMB analysis moves
  - from the time domain - observations - O(10^12)
  - to the pixel domain - maps - O(10^8)
  - to the multipole domain - power spectra - O(10^4)
calculating the compressed data and their reduced error bars at each step.

MADbench: Performance
• ES achieves the fastest performance to date: 4.7 Tflop/s on 1024 processors!
• Original ES visit: only partially ported due to the code's requirement of a global file system
• New version of MADbench successfully reduced I/O overhead and removed the global file system requirement
• The computational ScaLAPACK kernel achieves high performance on all systems
• ES shows the highest % of peak due to its balanced I/O system
• X1 performance starts high, but falls quickly due to I/O overheads
• Columbia performance is relatively poor
• Looking at overall performance alone is not sufficient to understand the bottlenecks
[Table: Gflop/s per processor and % of peak on Power3, Columbia, ES, and X1 for 5K, 10K, 20K, and 40K pixels at several processor counts]

IPM Overview
Integrated Performance Monitoring:
• portable, lightweight, scalable profiling
• fast hash method
• profiles MPI topology
• profiles code regions
• open source

Region instrumentation:
  MPI_Pcontrol(1,"W");
  ...code...
  MPI_Pcontrol(-1,"W");

Sample report (abridged):
  ###########################################
  # IPMv0.7 :: csnode tasks ES/ESOS
  # madbench.x (completed) 10/27/04/14:45:56
  # ...
  # region W (sec)
  #   call          [time]     %mpi  %wall
  #   MPI_Reduce    2.395e...
  #   MPI_Recv      9.625e...
  #   MPI_Send      2.708e...
  #   MPI_Testall   7.310e...
  #   MPI_Isend     2.597e...
  ###########################################

MADbench: Performance Characterization
• In-depth analysis shows the performance contribution of each component on the evaluated architectures
• Identifies system-specific balance and opportunities for optimization
• Results show that I/O has more effect on the ES than on Seaborg - due to the ratio between I/O performance and peak ALU speed
• Demonstrated IPM's ability to measure MPI overhead on a variety of architectures without the need to recompile, at a trivial runtime overhead (<1%)

Overview
• Tremendous potential of vector architectures: 5 codes running faster than ever before
• Vector systems allow resolution not possible with scalar architectures (regardless of # of processors)
• Opportunity to perform scientific runs at unprecedented scale
• ES shows high raw and much higher sustained performance compared with the X1
  - Non-vectorizable code segments become very expensive (8:1 or even 32:1 ratio)
  - Evaluation codes contain sufficient regularity in computation for high vector performance
  - GTC example: code at odds with data parallelism
  - Important to characterize the full application, including I/O effects
  - Much more difficult to evaluate codes poorly suited for vectorization
• Vectors are potentially at odds with emerging techniques (irregular, multi-physics, multi-scale)
• Plan to expand the scope of application domains/methods and examine the latest HPC architectures

% of peak at P=64:
Code     | Pwr3 | Pwr4 | Altix | ES  | X1
LBMHD    | 7%   | 5%   | 11%   | 58% | 37%
CACTUS   | 6%   | 11%  | 7%    | 34% | 6%
GTC      | 9%   | 6%   | 5%    | 20% | 11%
PARATEC  | 57%  | 33%  | 54%   | 58% | 20%
MADbench | 49%  | ---  | 19%   | 37% | 17%
[Table also lists: speedup of the ES vs. Pwr3, Pwr4, Altix, and X1 at the maximum available concurrency, and column averages]

EXTRA SLIDES

ES Processor Overview
• 8 Gflop/s per CPU
• 8 CPUs per SMP node
• 8-way replicated vector pipe
• 72 vector registers, 256 64-bit words each
• Divide unit
• 32 GB/s pipe to FPLRAM
• 4-way superscalar unit: 1 Gflop/s, 64KB I$ & D$
• ES: 640 nodes
• ES: newly developed FPLRAM (Full Pipelined RAM); SX6: DDR-SDRAM 128/256Mb
• ES: uses the IN, 12.3 GB/s bi-directional between any 2 nodes, 640 nodes; SX6: uses IXS, 8 GB/s bi-directional between any 2 nodes, max 128 nodes

Earth Simulator Overview
• Machine type: 640 nodes, each node an 8-way SMP of vector processors (5120 total processors)
• Machine peak: 40 TFlop/s (processor peak 8 GFlop/s)
• OS: extended version of Super-UX, a 64-bit Unix OS based on System V-R3
• Connection structure: a single-stage crossbar network (1400 miles of cable, 83,000 copper cables); 7.9 TB/s aggregate switching capacity; 12.3 GB/s bi-directional between any two nodes
• A Global Barrier Counter within the interconnect allows global barrier synchronization in <3.5 usec
• Storage: 480 TB disk, 1.5 PB tape
• Compilers: Fortran 90, HPF, ANSI C, C++
• Batch: similar to NQS, PBS
• Parallelization: vectorization at the processor level; OpenMP, Pthreads, MPI, and HPF across processors

Cray X1 Overview
• SSP: 3.2 GFlop/s computational core; VL = 64, dual pipes (800 MHz); 2-way scalar unit at 0.4 GFlop/s (400 MHz)
• MSP: 12.8 GFlop/s, combines 4 SSPs sharing a 2MB data cache (unique)
• Node: 4 MSPs with flat shared memory
• Interconnect: modified 2D torus - fewer links than a full crossbar but smaller bisection bandwidth
• Globally addressable: processors can directly read/write global memory
• Parallelization:
  - Vectorization (SSP)
  - Multistreaming (MSP)
  - Shared memory (OpenMP, Pthreads)
  - Inter-node (MPI-2, CAF, UPC)

Altix3000 Overview
• Itanium2 at 1.5 GHz (peak 6 GFlop/s); 128 FP registers, 32K L1, 256K L2, 6MB L3
  - Cannot store FP values in L1
• EPIC: instructions grouped into bundles; bundles processed in order, instructions within a bundle processed in parallel
• Consists of "C-bricks": 4 Itanium2 CPUs, memory, and 2 controller ASICs called SHUBs
• Uses the high-bandwidth, low-latency NUMAlink3 interconnect (fat-tree)
• Implements the ccNUMA protocol in hardware
  - A cache miss causes data to be communicated/replicated via the SHUB
• Uses 64-bit Linux with a single system image (256 processors, a few reserved for OS services)
• Scalability to large numbers of processors?

LBMHD: Performance
• Preliminary time breakdown shown relative to each architecture
• The Cray X1 has the highest % of time spent in communication (P=64); the CAF version reduced this
• ES shows the best memory bandwidth performance (stream)
• Communication increases at higher scalability (as expected)
[Charts: % of time in collision, stream, and communication on P3, P4, ES, and X1 for an 8192 x 8192 grid at 64 and 256 processors]

GTC: Porting Details
• Large vector memory footprint required to eliminate dependencies
  - P=64 uses 42 GB on the ES compared with 5 GB on Power3
• Relatively small memory per processor (ES=2GB, X1=4GB) limits problem-size runs
• GTC has a second level of parallelism via OpenMP (hybrid programming)
  - However, on the ES/X1 the memory footprint increased by an additional 8X, to about 320GB
• The non-vectorized "shift" routine accounted for 54% of runtime on the X1 and 11% on the ES
  - Due to the high penalty of serialized sections on the X1 when multistreaming
• The shift routine was vectorized on the X1, but NOT on the ES - here the X1 has the advantage
  - Limited time at the ES prevented vectorization of the shift routine
  - Now shift accounts for only 4% of X1 runtime

Second ES Visit
• Evaluate high-concurrency PARATEC performance using a large-scale quantum dot simulation
• Evaluate CACTUS performance using an updated vectorization of the radiation boundary condition
• Evaluate MADCAP performance using a newly optimized version, without global file system requirements and with improved I/O behavior
• Examine the 3D version of LBMHD, and explore optimization strategies
• Evaluate GTC performance using an updated vectorization of the shift routine as well as a new particle decomposition approach designed to increase concurrency
• Evaluate the performance of FVCAM3 (finite-volume atmospheric model) at high concurrencies and resolutions (1 x 1.25, 0.5 x 0.625, 0.25 x 0.375)
Papers available at

CMB Science
The CMB is a unique probe of the very early Universe. Tiny fluctuations in its temperature and polarization encode:
• the fundamental parameters of cosmology: Universe geometry, expansion rate, number of neutrino species, ionization history, dark matter, cosmological constant
• ultra-high energy physics beyond the Standard Model