Slide 1: 100 TF Sustained on Cray X Series
SOS 8, April 13, 2004
James B. White III (Trey), trey@ornl.gov
The Center for Computational Sciences, Oak Ridge National Laboratory, U.S. Department of Energy
Slide 2: Disclaimer
The opinions expressed here do not necessarily represent those of the CCS, ORNL, DOE, the Executive Branch of the Federal Government of the United States of America, or even UT-Battelle.
Slide 3: Disclaimer (cont.)
- Graph-free, chart-free environment
- For graphs and charts: http://www.csm.ornl.gov/evaluation/PHOENIX/
Slide 4: 100 Real TF on Cray Xn
- Who needs capability computing?
- Application requirements
- Why Xn?
- Laundry, Clean and Otherwise
- Rants
  - Custom vs. commodity
  - MPI
  - CAF
  - Cray
Slide 5: Who needs capability computing?
- OMB?
- Politicians?
- Vendors?
- Center directors?
- Computer scientists?
Slide 6: Who needs capability computing?
- Application scientists
  - According to the scientists themselves
Slide 7: Personal Communications
- Fusion: General Atomics, Iowa, ORNL, PPPL, Wisconsin
- Climate: LANL, NCAR, ORNL, PNNL
- Materials: Cincinnati, Florida, NC State, ORNL, Sandia, Wisconsin
- Biology: NCI, ORNL, PNNL
- Chemistry: Auburn, LANL, ORNL, PNNL
- Astrophysics: Arizona, Chicago, NC State, ORNL, Tennessee
Slide 8: Scientists Need Capability
- Climate scientists need simulation fidelity to support policy decisions
  - All we can say now is that humans cause warming
- Fusion scientists need to simulate fusion devices
  - All we can do now is model decoupled subprocesses at disparate time scales
- Materials scientists need to design new materials
  - Just starting to reproduce known materials
Slide 9: Scientists Need Capability
- Biologists need to simulate proteins and protein pathways
  - Baby steps so far with smaller molecules
- Chemists need similar increases in complexity
- Astrophysicists need to simulate nucleosynthesis (high-res 3D CFD, 6D neutrinos, long times)
  - Today: low-res 3D CFD, approximate 3D neutrinos, short times
Slide 10: Why Scientists Might Resist
- Capacity is also needed
- Software isn't ready
- Coerced to run capability-sized jobs on inappropriate systems
Slide 11: Capability Requirements
Sample DOE SC applications:
- Climate: POP, CAM
- Fusion: AORSA, Gyro
- Materials: LSMS, DCA-QMC
Slide 12: Parallel Ocean Program (POP)
- Baroclinic: 3D, nearest-neighbor, scalable (see the halo-exchange sketch below)
  - Memory-bandwidth limited
- Barotropic: 2D implicit system, latency bound
- Ocean-only simulation
  - Higher resolution
  - Faster time steps
- As ocean component of CCSM
  - Atmosphere dominates
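A minimal sketch (not POP's actual code) of the baroclinic communication pattern: a nearest-neighbor halo exchange, shown here in one dimension with illustrative sizes. The barotropic solve, by contrast, needs small global reductions every iteration and is therefore latency bound (see the MPI rant later).

```fortran
! Nearest-neighbor halo exchange in 1D: each process trades one boundary
! cell with each neighbor per step.  Names and sizes are illustrative.
program halo_sketch
  use mpi
  implicit none
  integer, parameter :: nloc = 1024          ! local cells (assumed)
  double precision :: field(0:nloc+1)        ! interior plus one halo cell per side
  integer :: ierr, rank, nprocs, left, right
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  left  = mod(rank - 1 + nprocs, nprocs)
  right = mod(rank + 1, nprocs)
  field = dble(rank)

  ! Send my last interior cell right, receive my left halo from the left.
  call MPI_Sendrecv(field(nloc), 1, MPI_DOUBLE_PRECISION, right, 0, &
                    field(0),    1, MPI_DOUBLE_PRECISION, left,  0, &
                    MPI_COMM_WORLD, status, ierr)
  ! Send my first interior cell left, receive my right halo from the right.
  call MPI_Sendrecv(field(1),      1, MPI_DOUBLE_PRECISION, left,  1, &
                    field(nloc+1), 1, MPI_DOUBLE_PRECISION, right, 1, &
                    MPI_COMM_WORLD, status, ierr)
  call MPI_Finalize(ierr)
end program halo_sketch
```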
Slide 13: Community Atmosphere Model (CAM)
- Atmosphere component of CCSM
- Higher resolution?
  - Physics changes: parameterizations must be retuned and the model revalidated
  - Major effort, rare event
- Spectral transform not dominant
- Dramatic increases in computation per grid point
  - Dynamic vegetation, carbon cycle, atmospheric chemistry, …
- Faster time steps
Slide 14: All-Orders Spectral Algorithm (AORSA)
- Radio-frequency fusion-plasma simulation
- Highly scalable
- Dominated by ScaLAPACK
- Still in the weak-scaling regime
- But…
  - Expanded physics is reducing ScaLAPACK dominance
  - Developing a sparse formulation
Slide 15: Gyro
- Continuum gyrokinetic simulation of fusion-plasma microturbulence
- 1D data decomposition
- Spectral method: high communication volume (see the transpose sketch below)
- Some need for increased resolution
- More iterations
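A generic sketch (not Gyro's code) of why a 1D-decomposed spectral method carries a high communication volume: applying transforms along the distributed dimension requires a data transpose, which becomes an all-to-all exchange. The chunk size is an assumption for illustration.

```fortran
! Distributed transpose as an all-to-all: every process sends a chunk to
! every other process, so aggregate bandwidth, not latency, dominates.
program transpose_sketch
  use mpi
  implicit none
  integer, parameter :: chunk = 4096      ! words per process pair (assumed)
  integer :: ierr, nprocs
  double precision, allocatable :: sendbuf(:), recvbuf(:)

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  allocate(sendbuf(chunk*nprocs), recvbuf(chunk*nprocs))
  call random_number(sendbuf)

  call MPI_Alltoall(sendbuf, chunk, MPI_DOUBLE_PRECISION, &
                    recvbuf, chunk, MPI_DOUBLE_PRECISION, &
                    MPI_COMM_WORLD, ierr)

  call MPI_Finalize(ierr)
end program transpose_sketch
```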
Slide 16: Locally Self-Consistent Multiple Scattering (LSMS)
- Calculates the electronic structure of large systems
- One atom per processor
- Dominated by local DGEMM (see the sketch below)
- First real application to sustain a teraflop
- But… moving to a sparse formulation with a distributed solve for each atom
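A minimal sketch of the kind of kernel that dominates LSMS: a dense matrix-matrix multiply per atom, i.e. a local DGEMM call. The matrix size here is illustrative, not LSMS's.

```fortran
! Local DGEMM: O(n^3) flops on O(n^2) data, so it is compute bound and runs
! near peak on most processors.  (Link against a BLAS library.)
program dgemm_sketch
  implicit none
  integer, parameter :: n = 500                       ! illustrative size
  double precision, allocatable :: a(:,:), b(:,:), c(:,:)
  external :: dgemm

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)
  c = 0.0d0
  ! C := 1.0*A*B + 0.0*C
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
  print *, 'c(1,1) =', c(1,1)
end program dgemm_sketch
```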
Slide 17: Dynamic Cluster Approximation (DCA-QMC)
- Simulates high-temperature superconductors
- Dominated by DGER (BLAS 2), so memory-bandwidth limited (see the sketch below)
- Quantum Monte Carlo, but…
  - Fixed start-up cost per process
  - Favors fewer, faster processors
  - Needs powerful processors to avoid parallelizing each Monte Carlo stream
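By contrast, DCA-QMC's dominant kernel is a rank-1 update, DGER, which does only O(n^2) flops on O(n^2) data and is therefore limited by memory bandwidth. A minimal sketch with illustrative sizes:

```fortran
! BLAS-2 rank-1 update: A := A + alpha * x * y**T.  Low arithmetic intensity
! means the processor waits on memory, which is why this code favors high
! memory bandwidth per flop.  (Link against a BLAS library.)
program dger_sketch
  implicit none
  integer, parameter :: n = 2000                      ! illustrative size
  double precision, allocatable :: a(:,:), x(:), y(:)
  external :: dger

  allocate(a(n,n), x(n), y(n))
  a = 0.0d0
  call random_number(x)
  call random_number(y)
  call dger(n, n, 1.0d0, x, 1, y, 1, a, n)
  print *, 'a(1,1) =', a(1,1)
end program dger_sketch
```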
Slide 18: Few DOE SC Applications
- Weak-ish scaling
- Dense linear algebra
  - But moving to sparse
Slide 19: Many DOE SC Applications
- "Strong-ish" scaling
  - Limited increase in grid points
  - Major increase in expense per grid point
  - Major increase in time steps
- Fewer, more-powerful processors
- High memory bandwidth
- High-bandwidth, low-latency communication
Slide 20: Why X1?
- "Strong-ish" scaling
  - Limited increase in grid points
  - Major increase in expense per grid point
  - Major increase in time steps
- Fewer, more-powerful processors
- High memory bandwidth
- High-bandwidth, low-latency communication
Slide 21: Tangent: Strongish* Scaling
- Firm
- Semistrong
- Unweak
- Strongoidal
- MSTW (More Strong Than Weak)
- JTSoS (Just This Side of Strong)
- WNS (Well-Nigh Strong)
- Seak, Steak, Streak, Stroak, Stronk
- Weag, Weng, Wong, Wrong, Twong
* Greg Lindahl, Vendor Scum
Slide 22: X1 for 100 TF Sustained?
- Uh, no
- OS not scalable or fault-resilient enough for 10^4 processors
- That "price/performance" thing
- That "power & cooling" thing
Slide 23: Xn for 100 TF Sustained
- For DOE SC applications, YES
- Most-promising candidate
  -or-
- Least-implausible candidate
Slide 24: Why X, again?
- Most-powerful processors
  - Reduce need for scalability
  - Obey Amdahl's Law (see the formula below)
- High memory bandwidth
  - See above
- Globally addressable memory
  - Lowest, most hide-able latency
  - Scale latency-bound applications
- High interconnect bandwidth
  - Scale bandwidth-bound applications
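For reference, Amdahl's Law in its standard form (not from the slides): with serial fraction s and P processors,

```latex
\[
  S(P) \;=\; \frac{1}{s + \dfrac{1-s}{P}} \;\le\; \frac{1}{s},
\]
```

so shrinking the serial time with more powerful processors keeps paying off after adding more processors no longer does.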
Slide 25: The Bad News
- Scalar performance
- "Some tuning required"
- Ho-hum MPI latency (see Rants)
Slide 26: Scalar Performance
- Compilation is slow
- Amdahl's Law for single processes
  - Parallelization -> Vectorization
- Hard to port GNU tools
  - GCC? Are you kidding?
  - GCC compatibility, on the other hand…
- Black Widow will be better
Slide 27: "Some Tuning Required"
- Vectorization requires:
  - Independent operations
  - Dependence information
  - Mapping to vector instructions
- Applications take a wide spectrum of steps to inhibit this
  - May need a couple of compiler directives (see the sketch below)
  - May need extensive rewriting
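A minimal sketch of the light end of that spectrum: a loop the compiler cannot prove independent, vectorized with a single directive. The !dir$ ivdep spelling is the Cray-style hint and is an assumption about the target compiler.

```fortran
! The indirect update y(idx(i)) could repeat an index, so the compiler will
! not vectorize on its own; the directive asserts there is no dependence.
subroutine scatter_add(n, idx, x, y)
  implicit none
  integer, intent(in) :: n, idx(n)
  double precision, intent(in) :: x(n)
  double precision, intent(inout) :: y(*)
  integer :: i

!dir$ ivdep
  do i = 1, n
     y(idx(i)) = y(idx(i)) + x(i)   ! safe only if idx has no repeated values
  end do
end subroutine scatter_add
```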
Slide 28: Application Results
- Awesome
- Indifferent
- Recalcitrant
- Hopeless
Slide 29: Awesome Results
- 256-MSP X1 already showing unique capability
- Apps bound by memory bandwidth, interconnect bandwidth, or interconnect latency
- POP, Gyro, DCA-QMC, AGILE-BOLTZTRAN, VH1, Amber, …
- Many examples from DoD
Slide 30: Indifferent Results
- Cray X1 is brute-force fast, but not cost-effective
- Dense linear algebra: Linpack, AORSA, LSMS
Slide 31: Recalcitrant Results
- Inherent algorithms are fine
- Source code or ongoing code mods don't vectorize
- Significant code rewriting done, ongoing, or needed
- CLM, CAM, Nimrod, M3D
Slide 32: Aside: How to Avoid Vectorization
- Use pointers to add false dependencies
- Put deep call stacks inside loops
- Put debug I/O operations inside compute loops
- Did I mention using pointers? (see the sketch below)
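An illustrative sketch of those anti-patterns and a clean rewrite; names and sizes are invented for the example.

```fortran
module avoid_vectorization
  implicit none
contains

  ! Inhibited: pointers introduce possible aliasing (a false dependence),
  ! there is a call inside the loop, and debug I/O sits in the loop body.
  subroutine saxpy_inhibited(n, a, x, y)
    integer, intent(in) :: n
    double precision, intent(in) :: a
    double precision, target :: x(n), y(n)
    double precision, pointer :: px(:), py(:)
    integer :: i
    px => x
    py => y
    do i = 1, n
       py(i) = py(i) + scale_term(a, px(i))              ! call inside the loop
       if (mod(i, 1000) == 0) print *, 'debug', i, py(i) ! I/O inside the loop
    end do
  end subroutine saxpy_inhibited

  ! Vectorizable: plain arrays, no calls, no I/O in the loop body.
  subroutine saxpy_clean(n, a, x, y)
    integer, intent(in) :: n
    double precision, intent(in) :: a, x(n)
    double precision, intent(inout) :: y(n)
    integer :: i
    do i = 1, n
       y(i) = y(i) + a*x(i)
    end do
  end subroutine saxpy_clean

  pure function scale_term(a, xi) result(t)
    double precision, intent(in) :: a, xi
    double precision :: t
    t = a*xi
  end function scale_term

end module avoid_vectorization
```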
Slide 33: Aside: Software Design
- In general, we don't know how to systematically design efficient, maintainable HPC software
- Vectorization imposes constraints on software design
  - Bad: existing software must be rewritten
  - Good: the resulting software is often faster on modern superscalar systems
- "Some tuning required" for the X series
  - Bad: you must tune
  - Good: tuning is systematic, not a black art
- Vectorization "constraints" may help us develop effective design patterns for HPC software
Slide 34: Hopeless Results
- Dominated by unvectorizable algorithms
- Some benchmark kernels of questionable relevance
- No known DOE SC applications
Slide 35: Summary
- DOE SC scientists do need 100 TF and beyond of sustained application performance
- The Cray X series is the least-implausible option for scaling DOE SC applications to 100 TF of sustained performance and beyond
Slide 36: "Custom" Rant
- "Custom vs. commodity" is a red herring
  - CMOS is commodity
  - Memory is commodity
  - Wires are commodity
- Cooling is independent of vector vs. scalar
  - PNNL is liquid-cooling clusters
  - Vector systems may move to air cooling
- All vendors do custom packaging
- The real issue: software
Slide 37: MPI Rant
- Latency-bound apps are often limited by MPI_Allreduce(…, MPI_SUM, …)
  - Not ping-pong! (see the sketch below)
  - An excellent abstraction that is eminently optimizable
- Some apps are limited by point-to-point
  - Remote load/store implementations (CAF, UPC) have performance advantages over MPI
  - But MPI could be implemented using load/store, inlined, and optimized
  - On the other hand, it is easier to avoid pack/unpack with the load/store model
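A minimal sketch (not a real benchmark) of the point about MPI_Allreduce: the operation that limits latency-bound applications is a tiny global reduction, a different operation from the ping-pong that latency numbers are usually quoted for.

```fortran
! Time a 1-element allreduce: the cost is dominated by reduction latency
! across the whole machine, not by bandwidth or the additions themselves.
program allreduce_latency_sketch
  use mpi
  implicit none
  integer, parameter :: reps = 1000
  integer :: ierr, rank, i
  double precision :: s_local, s_global, t0, t1

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  s_local = dble(rank)

  t0 = MPI_Wtime()
  do i = 1, reps
     call MPI_Allreduce(s_local, s_global, 1, MPI_DOUBLE_PRECISION, &
                        MPI_SUM, MPI_COMM_WORLD, ierr)
  end do
  t1 = MPI_Wtime()

  if (rank == 0) print *, 'average allreduce time (s):', (t1 - t0)/reps
  call MPI_Finalize(ierr)
end program allreduce_latency_sketch
```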
Slide 38: Co-Array Fortran Rant
- There is no such thing as one-sided communication (see the sketch below)
  - It's all two-sided: send+receive, sync+put+sync, sync+get+sync
  - Same parallel algorithms
- CAF mods can be highly nonlocal
  - Adding CAF in a subroutine can have implications for the argument types, and thus for the callers, the callers' callers, etc.
  - Rarely the case for MPI
- We use CAF to avoid MPI-implementation performance inadequacies
  - Avoiding nonlocality by cheating with Cray pointers
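A minimal Co-Array Fortran sketch, written in the Fortran 2008 coarray spelling (close to the Cray CAF of the time): the "one-sided" put is really sync + put + sync. Because the codimension is part of a variable's declaration, pushing a coarray through an argument list changes interfaces up the call chain, which is the nonlocality the slide complains about.

```fortran
! sync + put + sync between images.  Sizes and the ring pattern are
! illustrative only.
program caf_sketch
  implicit none
  integer, parameter :: n = 1024
  double precision :: buf(n)[*]          ! coarray: one copy per image
  integer :: me, np, neighbor

  me = this_image()
  np = num_images()
  neighbor = mod(me, np) + 1
  buf = 0.0d0

  sync all                               ! neighbor's buf is ready to receive
  buf(:)[neighbor] = dble(me)            ! "one-sided" put into neighbor's memory
  sync all                               ! the put has landed everywhere

  if (me == 1) print *, 'image 1 received', buf(1)
end program caf_sketch
```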
Slides 39-40: Cray Rant
- Cray XD1 (OctigaBay) follows in the tradition of the T3E
  - Very promising architecture
  - Dumb name
  - Interesting competitor with Red Storm
Slide 41: Questions?
James B. White III (Trey), trey@ornl.gov
http://www.csm.ornl.gov/evaluation/PHOENIX/