TF Sustained on Cray X Series SOS 8 April 13, 2004 James B. White III (Trey)
Real TF on Cray Xn: Who needs capability computing? Application requirements. Why Xn? Laundry, Clean and Otherwise. Rants: Custom vs. Commodity, MPI, CAF, Cray.
Who needs capability computing? OMB? Politicians? Vendors? Center directors? Computer scientists?
Who needs capability computing? Application scientists, according to scientists themselves.
Personal Communications: Fusion (General Atomics, Iowa, ORNL, PPPL, Wisconsin); Climate (LANL, NCAR, ORNL, PNNL); Materials (Cincinnati, Florida, NC State, ORNL, Sandia, Wisconsin); Biology (NCI, ORNL, PNNL); Chemistry (Auburn, LANL, ORNL, PNNL); Astrophysics (Arizona, Chicago, NC State, ORNL, Tennessee).
Scientists Need Capability: Climate scientists need simulation fidelity to support policy decisions; all we can say now is that humans cause warming. Fusion scientists need to simulate fusion devices; all we can do now is model decoupled subprocesses at disparate time scales. Materials scientists need to design new materials; we are just starting to reproduce known materials.
Scientists Need Capability: Biologists need to simulate proteins and protein pathways; so far, baby steps with smaller molecules. Chemists need similar increases in complexity. Astrophysicists need to simulate nucleogenesis (high-res, 3D CFD, 6D neutrinos, long times); so far, low-res, 3D CFD, approximate 3D neutrinos, short times.
Why Scientists Might Resist: Capacity also needed. Software isn't ready. Coerced to run capability-sized jobs on inappropriate systems.
Capability Requirements: Sample DOE SC applications. Climate: POP, CAM. Fusion: AORSA, Gyro. Materials: LSMS, DCA-QMC.
Parallel Ocean Program (POP): Baroclinic (3D, nearest neighbor, scalable, memory-bandwidth limited). Barotropic (2D implicit system, latency bound). Ocean-only simulation (higher resolution, faster time steps). As ocean component for CCSM, atmosphere dominates.
Community Atmosphere Model (CAM): Atmosphere component for CCSM. Higher resolution? Physics changes, parameterization must be retuned, model must be revalidated (major effort, rare event). Spectral transform not dominant. Dramatic increases in computation per grid point (dynamic vegetation, carbon cycle, atmospheric chemistry, …). Faster time steps.
All-Orders Spectral Algorithm (AORSA): Radio-frequency fusion-plasma simulation. Highly scalable. Dominated by ScaLAPACK. Still in weak-scaling regime. But… expanded physics reducing ScaLAPACK dominance; developing sparse formulation.
Gyro: Continuum gyrokinetic simulation of fusion-plasma microturbulence. 1D data decomposition. Spectral method, high communication volume. Some need for increased resolution. More iterations.
Locally Self-Consistent Multiple Scattering (LSMS): Calculates electronic structure of large systems. One atom per processor. Dominated by local DGEMM. First real application to sustain a TF. But… moving to sparse formulation with a distributed solve for each atom.
Dynamic Cluster Approximation (DCA-QMC): Simulates high-temp superconductors. Dominated by DGER (BLAS2). Memory-bandwidth limited. Quantum Monte Carlo, but… fixed start-up per process. Favors fewer, faster processors. Needs powerful processors to avoid parallelizing each Monte-Carlo stream.
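The contrast between a DGEMM-dominated code (LSMS) and a DGER-dominated one (DCA-QMC) can be made concrete with a rough flops-per-byte count. The sketch below is illustrative only: it assumes 8-byte words and an idealized memory system in which each operand crosses the memory interface exactly once, and the function names are hypothetical, not part of any BLAS library.

```python
def dgemm_intensity(n, bytes_per_word=8):
    """Flops per byte for C = C + A*B with n-by-n matrices.

    Assumption: ideal blocking, so A, B, and C are each read once
    and C is written once.
    """
    flops = 2 * n**3              # one multiply and one add per inner-product term
    words = 4 * n * n             # read A, B, C; write C
    return flops / (words * bytes_per_word)


def dger_intensity(m, n, bytes_per_word=8):
    """Flops per byte for the rank-1 update A = A + x*y^T, A m-by-n."""
    flops = 2 * m * n             # one multiply and one add per matrix element
    words = 2 * m * n + m + n     # read A, write A, read x and y
    return flops / (words * bytes_per_word)


for n in (100, 1000):
    print(n, dgemm_intensity(n), dger_intensity(n, n))
```

Under these assumptions the DGEMM ratio grows linearly with n, while the DGER ratio stays below 1/8 flop per byte at any size, which is why a BLAS2-dominated code is bound by memory bandwidth rather than by peak floating-point rate.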
Few DOE SC Applications: Weak-ish scaling. Dense linear algebra, but moving to sparse.
Many DOE SC Applications: "Strong-ish" scaling (limited increase in gridpoints, major increase in expense per gridpoint, major increase in time steps). Fewer, more-powerful processors. High memory bandwidth. High-bandwidth, low-latency communication.
Why X1? "Strong-ish" scaling (limited increase in gridpoints, major increase in expense per gridpoint, major increase in time steps). Fewer, more-powerful processors. High memory bandwidth. High-bandwidth, low-latency communication.
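A toy cost model shows why "strong-ish" scaling pushes toward low-latency communication. This is a sketch under stated assumptions, not a model of any application named here: each time step is perfectly divided compute plus one global reduction costing about log2(P) message latencies, and the cost per gridpoint (1e-6 s) and latency (1e-5 s) are hypothetical values chosen only for illustration.

```python
import math


def step_time(gridpoints, procs, cost_per_point, latency):
    """Return (compute, communication) time for one time step."""
    compute = gridpoints * cost_per_point / procs
    hops = math.ceil(math.log2(procs)) if procs > 1 else 0
    return compute, latency * hops


def comm_fraction(gridpoints, procs, cost_per_point=1e-6, latency=1e-5):
    """Fraction of a time step spent in the latency-bound reduction."""
    compute, comm = step_time(gridpoints, procs, cost_per_point, latency)
    return comm / (compute + comm)


# Strong scaling: the grid stays fixed while the processor count grows.
strong_small = comm_fraction(1_000_000, 64)
strong_large = comm_fraction(1_000_000, 4096)

# Weak scaling: the grid grows with the processor count (15,625 points each).
weak_large = comm_fraction(15_625 * 4096, 4096)

print(strong_small, strong_large, weak_large)
```

With these made-up numbers the reduction is under 1% of a step at 64 processors, about a third of a step at 4096 processors for the fixed grid, and back under 1% when the grid grows with the machine.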
Tangent: Strongish* Scaling Firm Semistrong Unweak Strongoidal MSTW (More Strong Than Weak) JTSoS (Just This Side of Strong) WNS (Well-Nigh Strong) Seak, Steak, Streak, Stroak, Stronk Weag, Weng, Wong, Wrong, Twong * Greg Lindahl, Vendor Scum
X1 for 100 TF Sustained? Uh, no. OS not scalable, fault-resilient enough for 10^4 processors. That "price/performance" thing. That "power & cooling" thing.
Xn for 100 TF Sustained: For DOE SC applications, YES. Most-promising candidate -or- least-implausible candidate.
Why X, again? Most-powerful processors (reduce need for scalability, obey Amdahl's Law). High memory bandwidth (see above). Globally addressable memory (lowest, most hide-able latency; scale latency-bound applications). High interconnect bandwidth (scale bandwidth-bound applications).
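The "obey Amdahl's Law" argument for fewer, more-powerful processors can be sketched numerically. The example assumes a 1% serial fraction and compares two hypothetical machines with the same aggregate peak: 1000 processors of speed 1 versus 100 processors of speed 10. The 10x factor is illustrative, not a measured X1 ratio.

```python
def run_time(serial_fraction, procs, speed=1.0):
    """Amdahl's Law with a per-processor speed factor.

    Work is normalized so that one speed-1 processor takes 1 time unit.
    The serial part runs on one processor; the rest divides evenly.
    """
    parallel_fraction = 1.0 - serial_fraction
    return (serial_fraction + parallel_fraction / procs) / speed


many_slow = run_time(0.01, procs=1000, speed=1.0)
few_fast = run_time(0.01, procs=100, speed=10.0)

print(many_slow, few_fast, many_slow / few_fast)
```

Both machines have the same peak, but the serial 1% runs ten times faster on the powerful processors, so in this model the smaller machine finishes about 5.5 times sooner.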
The Bad News: Scalar performance. "Some tuning required". Ho-hum MPI latency (see Rants).
Scalar Performance: Compilation is slow. Amdahl's Law for single processes (parallelization -> vectorization). Hard to port GNU tools. GCC? Are you kidding? GCC compatibility, on the other hand… Black Widow will be better.
"Some Tuning Required": Vectorization requires independent operations, dependence information, and mapping to vector instructions. Applications take a wide spectrum of steps to inhibit this. May need a couple of compiler directives. May need extensive rewriting.
Application Results: Awesome, Indifferent, Recalcitrant, Hopeless.
Awesome Results: 256-MSP X1 already showing unique capability. Apps bound by memory bandwidth, interconnect bandwidth, interconnect latency. POP, Gyro, DCA-QMC, AGILE-BOLTZTRAN, VH1, Amber, … Many examples from DoD.
Indifferent Results: Cray X1 is brute-force fast, but not cost effective. Dense linear algebra: Linpack, AORSA, LSMS.
Recalcitrant Results: Inherent algorithms are fine. Source code or ongoing code mods don't vectorize. Significant code rewriting done, ongoing, or needed. CLM, CAM, Nimrod, M3D.
Aside: How to Avoid Vectorization. Use pointers to add false dependencies. Put deep call stacks inside loops. Put debug I/O operations inside compute loops. Did I mention using pointers?
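The pointer complaint comes down to aliasing: if two arguments might refer to the same storage, a compiler cannot prove the loop iterations are independent and has to assume a dependence. The sketch below imitates this in plain Python, with list aliasing standing in for pointers. Both function names are hypothetical, and `vector_loop` only mimics vector semantics (load all operands, then store); it is not how any Cray compiler works.

```python
def scalar_loop(dst, src):
    """Execute dst[i] = src[i-1] + 1 one element at a time, in order."""
    for i in range(1, len(dst)):
        dst[i] = src[i - 1] + 1


def vector_loop(dst, src):
    """Same statement with vector semantics: load everything, then store."""
    loaded = [src[i - 1] for i in range(1, len(dst))]   # "vector load"
    for i, value in enumerate(loaded, start=1):          # "vector add and store"
        dst[i] = value + 1


# Distinct arrays: iterations are independent, both orders agree.
d1, d2 = [0] * 4, [0] * 4
scalar_loop(d1, [0, 0, 0, 0])
vector_loop(d2, [0, 0, 0, 0])
print(d1, d2)

# Aliased arrays: each iteration reads what the previous one wrote,
# so the vector order gives a different answer.
a = [0, 0, 0, 0]
b = [0, 0, 0, 0]
scalar_loop(a, a)
vector_loop(b, b)
print(a, b)
```

When the arrays are distinct the two versions agree, so vectorizing is safe. When they alias, the results differ, which is the case a compiler must rule out before it can vectorize; directives are one way for the programmer to supply that dependence information.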
Aside: Software Design. In general, we don't know how to systematically design efficient, maintainable HPC software. Vectorization imposes constraints on software design. Bad: existing software must be rewritten. Good: resulting software often faster on modern superscalar systems. "Some tuning required" for X series. Bad: you must tune. Good: tuning is systematic, not a Black Art. Vectorization "constraints" may help us develop effective design patterns for HPC software.
Hopeless Results: Dominated by unvectorizable algorithms. Some benchmark kernels of questionable relevance. No known DOE SC applications.
Summary: DOE SC scientists do need 100 TF and beyond of sustained application performance. Cray X series is the least-implausible option for scaling DOE SC applications to 100 TF of sustained performance and beyond.
"Custom" Rant: "Custom vs. Commodity" is a red herring. CMOS is commodity. Memory is commodity. Wires are commodity. Cooling is independent of vector vs. scalar (PNNL liquid-cooling clusters; vector systems may move to air-cooling). All vendors do custom packaging. Real issue: software.
MPI Rant: Latency-bound apps often limited by "MPI_Allreduce(…, MPI_SUM, …)", not ping pong! An excellent abstraction that is eminently optimizable. Some apps are limited by point-to-point. Remote load/store implementations (CAF, UPC) have performance advantages over MPI, but MPI could be implemented using load/store, inlined, and optimized. On the other hand, easier to avoid pack/unpack with load/store model.
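Why a sum allreduce is latency bound rather than bandwidth bound can be seen from one common way to implement it, recursive doubling. The sketch below simulates that pattern over a list of per-rank values in plain Python. It is an illustration under stated assumptions (power-of-two rank count, one scalar per rank), not a description of how Cray's MPI or any particular MPI library implements `MPI_Allreduce`, and `allreduce_sum` is a hypothetical name.

```python
def allreduce_sum(values):
    """Simulate a recursive-doubling sum allreduce.

    values[r] is the contribution of rank r. Returns the per-rank
    results and the number of communication rounds.
    """
    nranks = len(values)
    if nranks == 0 or nranks & (nranks - 1) != 0:
        raise ValueError("this sketch handles power-of-two rank counts only")
    current = list(values)
    rounds = 0
    distance = 1
    while distance < nranks:
        # Every rank swaps partial sums with the partner whose rank number
        # differs in one bit; all swaps in a round happen at the same time.
        current = [current[r] + current[r ^ distance] for r in range(nranks)]
        distance *= 2
        rounds += 1
    return current, rounds


results, rounds = allreduce_sum(list(range(8)))
print(results, rounds)
```

Each round moves only one small message per rank, so the cost is roughly the round count times the message latency: log2(P) latencies in sequence, with almost no bandwidth used. That is a different stress from a two-rank ping-pong benchmark, and it is the part that a load/store implementation could shorten.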
Co-Array-Fortran Rant: No such thing as one-sided communication. It's all two sided: send+receive, sync+put+sync, sync+get+sync. Same parallel algorithms. CAF mods can be highly nonlocal: adding CAF in a subroutine can have implications on the argument types, and thus on the callers, the callers' callers, etc. Rarely the case for MPI. We use CAF to avoid MPI-implementation performance inadequacies. Avoiding nonlocality by cheating with Cray pointers.
Cray Rant: Cray XD1 (OctigaBay) follows in tradition of T3E. Very promising architecture. Dumb name. Interesting competitor with Red Storm.
Questions? James B. White III (Trey)