100 TF Sustained on Cray X Series (presentation transcript)

Slide 1: 100 TF Sustained on Cray X Series
Oak Ridge National Laboratory, U.S. Department of Energy
The Center for Computational Sciences
SOS 8, April 13, 2004
James B. White III (Trey), trey@ornl.gov

Slide 2: Disclaimer
- The opinions expressed here do not necessarily represent those of the CCS, ORNL, DOE, the Executive Branch of the Federal Government of the United States of America, or even UT-Battelle.

Slide 3: Disclaimer (cont.)
- Graph-free, chart-free environment
- For graphs and charts: http://www.csm.ornl.gov/evaluation/PHOENIX/

Slide 4: 100 Real TF on Cray Xn
- Who needs capability computing?
- Application requirements
- Why Xn?
- Laundry, Clean and Otherwise
- Rants
  - Custom vs. Commodity
  - MPI
  - CAF
  - Cray

Slide 5: Who needs capability computing?
- OMB?
- Politicians?
- Vendors?
- Center directors?
- Computer scientists?

Slide 6: Who needs capability computing?
- Application scientists
  - According to scientists themselves

Slide 7: Personal Communications
- Fusion: General Atomics, Iowa, ORNL, PPPL, Wisconsin
- Climate: LANL, NCAR, ORNL, PNNL
- Materials: Cincinnati, Florida, NC State, ORNL, Sandia, Wisconsin
- Biology: NCI, ORNL, PNNL
- Chemistry: Auburn, LANL, ORNL, PNNL
- Astrophysics: Arizona, Chicago, NC State, ORNL, Tennessee

Slide 8: Scientists Need Capability
- Climate scientists need simulation fidelity to support policy decisions
  - All we can say now is that humans cause warming
- Fusion scientists need to simulate fusion devices
  - All we can do now is model decoupled subprocesses at disparate time scales
- Materials scientists need to design new materials
  - Just starting to reproduce known materials

Slide 9: Scientists Need Capability
- Biologists need to simulate proteins and protein pathways
  - Baby steps with smaller molecules
- Chemists need similar increases in complexity
- Astrophysicists need to simulate nucleosynthesis (high-res 3D CFD, 6D neutrinos, long times)
  - Low-res 3D CFD, approximate 3D neutrinos, short times

Slide 10: Why Scientists Might Resist
- Capacity also needed
- Software isn't ready
- Coerced to run capability-sized jobs on inappropriate systems

Slide 11: Capability Requirements
- Sample DOE SC applications
  - Climate: POP, CAM
  - Fusion: AORSA, Gyro
  - Materials: LSMS, DCA-QMC

Slide 12: Parallel Ocean Program (POP)
- Baroclinic
  - 3D, nearest neighbor, scalable
  - Memory-bandwidth limited
- Barotropic
  - 2D implicit system, latency bound
- Ocean-only simulation
  - Higher resolution
  - Faster time steps
- As ocean component for CCSM
  - Atmosphere dominates
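
To illustrate why the barotropic solve is latency bound, here is a minimal sketch; it assumes the common pattern of an iterative 2D implicit solve with one small global reduction per iteration. This is an assumption about the pattern, not POP source code, and all names are hypothetical.

    ! Sketch of the latency-bound pattern in a 2D implicit (barotropic-style)
    ! solve: the local work is cheap, but each iteration needs a small global
    ! reduction, so MPI_Allreduce latency sets the time per iteration at scale.
    ! Not POP source code; all names are illustrative.
    program barotropic_sketch
      use mpi
      implicit none
      integer, parameter :: n = 1000        ! local 2D points on this rank
      integer :: ierr, iter
      real(8) :: r(n), p(n), local_dot, global_dot

      call MPI_Init(ierr)
      call random_number(r)
      p = r

      do iter = 1, 100
        local_dot = sum(r * p)              ! cheap local work
        ! One small, blocking, all-ranks reduction per iteration:
        call MPI_Allreduce(local_dot, global_dot, 1, MPI_REAL8, MPI_SUM, &
                           MPI_COMM_WORLD, ierr)
        ! ... update r and p using global_dot (omitted) ...
      end do

      call MPI_Finalize(ierr)
    end program barotropic_sketch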

Slide 13: Community Atmosphere Model (CAM)
- Atmosphere component for CCSM
- Higher resolution?
  - Physics changes, parameterization must be retuned, model must be revalidated
  - Major effort, rare event
  - Spectral transform not dominant
- Dramatic increases in computation per grid point
  - Dynamic vegetation, carbon cycle, atmospheric chemistry, ...
- Faster time steps

Slide 14: All-Orders Spectral Algorithm (AORSA)
- Radio-frequency fusion-plasma simulation
- Highly scalable
  - Dominated by ScaLAPACK
  - Still in weak-scaling regime
- But...
  - Expanded physics reducing ScaLAPACK dominance
  - Developing sparse formulation
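
Since the dominant cost here is a distributed dense solve, a hedged sketch of that kind of kernel follows. It assumes ScaLAPACK's PDGESV on a block-cyclic matrix and is purely illustrative: the subroutine name, sizes, and placeholder identity matrix are hypothetical, not AORSA code.

    ! Illustrative sketch of the distributed dense solve that dominates the
    ! runtime (not AORSA source). Assumes ScaLAPACK/BLACS with a block-cyclic
    ! matrix; the placeholder identity matrix keeps the call well defined.
    subroutine dense_solve_sketch(n, nb, nprow, npcol)
      implicit none
      integer, intent(in) :: n, nb, nprow, npcol
      integer :: ictxt, myrow, mycol, lld, info
      integer :: desca(9), descb(9)
      integer, external :: numroc
      integer, allocatable :: ipiv(:)
      real(8), allocatable :: a(:,:), b(:,:)

      ! Set up the BLACS process grid and the block-cyclic descriptors.
      call blacs_get(-1, 0, ictxt)
      call blacs_gridinit(ictxt, 'Row-major', nprow, npcol)
      call blacs_gridinfo(ictxt, nprow, npcol, myrow, mycol)

      lld = max(1, numroc(n, nb, myrow, 0, nprow))
      allocate(a(lld, max(1, numroc(n, nb, mycol, 0, npcol))))
      allocate(b(lld, 1), ipiv(lld + nb))

      call descinit(desca, n, n, nb, nb, 0, 0, ictxt, lld, info)
      call descinit(descb, n, 1, nb, 1, 0, 0, ictxt, lld, info)

      ! Placeholder system: identity matrix, right-hand side of ones.
      call pdlaset('All', n, n, 0.0d0, 1.0d0, a, 1, 1, desca)
      b = 1.0d0

      ! Distributed LU factorization and solve: this call is the hot spot.
      call pdgesv(n, 1, a, 1, 1, desca, ipiv, b, 1, 1, descb, info)

      call blacs_gridexit(ictxt)
    end subroutine dense_solve_sketch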

Slide 15: Gyro
- Continuum gyrokinetic simulation of fusion-plasma microturbulence
- 1D data decomposition
- Spectral method: high communication volume
- Some need for increased resolution
- More iterations
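
To show why a 1D-decomposed spectral method carries a high communication volume, here is a minimal sketch of the global transpose (all-to-all) such methods need. It is illustrative only, not Gyro code, and the buffer sizes are hypothetical.

    ! Sketch of why a 1D-decomposed spectral method is communication heavy:
    ! transforming along the distributed dimension needs a global transpose,
    ! i.e. an all-to-all whose volume grows with the problem size.
    ! Illustrative only, not Gyro source; sizes are hypothetical.
    program transpose_sketch
      use mpi
      implicit none
      integer, parameter :: nlocal = 4096   ! local slab size per rank
      integer :: ierr, nprocs, rank, chunk
      real(8), allocatable :: sendbuf(:), recvbuf(:)

      call MPI_Init(ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

      chunk = nlocal / nprocs               ! assumes nprocs divides nlocal
      allocate(sendbuf(nlocal), recvbuf(nlocal))
      call random_number(sendbuf)

      ! Every rank exchanges a chunk with every other rank each step, so
      ! interconnect bandwidth (not just latency) limits performance.
      call MPI_Alltoall(sendbuf, chunk, MPI_REAL8, &
                        recvbuf, chunk, MPI_REAL8, MPI_COMM_WORLD, ierr)

      if (rank == 0) print *, 'transpose step done on', nprocs, 'ranks'
      call MPI_Finalize(ierr)
    end program transpose_sketch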

Slide 16: Locally Self-Consistent Multiple Scattering (LSMS)
- Calculates electronic structure of large systems
- One atom per processor
- Dominated by local DGEMM
- First real application to sustain a TF
- But... moving to a sparse formulation with a distributed solve for each atom
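
To make the DGEMM dominance concrete, here is a minimal sketch of such a level-3 BLAS kernel; the matrix sizes and program name are hypothetical, and this is not LSMS source.

    ! Minimal sketch of the dense kernel that dominates LSMS: each process
    ! spends most of its time in local DGEMM calls on its own atom's
    ! matrices. Sizes are illustrative, not taken from LSMS.
    program dgemm_sketch
      implicit none
      integer, parameter :: n = 1024
      real(8), allocatable :: a(:,:), b(:,:), c(:,:)

      allocate(a(n,n), b(n,n), c(n,n))
      call random_number(a)
      call random_number(b)
      c = 0.0d0

      ! C := 1.0*A*B + 0.0*C  (BLAS level 3: high flops per byte, so it
      ! runs near peak on processors with strong floating-point units)
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)

      print *, 'c(1,1) =', c(1,1)
    end program dgemm_sketch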

Slide 17: Dynamical Cluster Approximation (DCA-QMC)
- Simulates high-temperature superconductors
- Dominated by DGER (BLAS 2)
  - Memory-bandwidth limited
- Quantum Monte Carlo, but...
  - Fixed start-up per process
  - Favors fewer, faster processors
- Needs powerful processors to avoid parallelizing each Monte Carlo stream
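
For contrast with the DGEMM-dominated case above, here is a minimal sketch of a DGER (rank-one update) kernel and why it is bandwidth bound; sizes and names are illustrative, not taken from DCA-QMC.

    ! Sketch of the rank-one update (DGER) that dominates DCA-QMC. BLAS
    ! level 2 does O(n^2) flops on O(n^2) data, so memory bandwidth, not
    ! peak flops, limits it. Sizes are illustrative, not from DCA-QMC.
    program dger_sketch
      implicit none
      integer, parameter :: n = 4096
      real(8), allocatable :: a(:,:), x(:), y(:)

      allocate(a(n,n), x(n), y(n))
      a = 0.0d0
      call random_number(x)
      call random_number(y)

      ! A := A + 1.0 * x * y**T
      ! Each element of A is read and written once per update: roughly two
      ! flops per 16 bytes of traffic, hence the memory-bandwidth limit.
      call dger(n, n, 1.0d0, x, 1, y, 1, a, n)

      print *, 'a(1,1) =', a(1,1)
    end program dger_sketch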

Slide 18: Few DOE SC Applications
- Weak-ish scaling
- Dense linear algebra
  - But moving to sparse

Slide 19: Many DOE SC Applications
- "Strong-ish" scaling
  - Limited increase in gridpoints
  - Major increase in expense per gridpoint
  - Major increase in time steps
- Fewer, more-powerful processors
- High memory bandwidth
- High-bandwidth, low-latency communication

Slide 20: Why X1?
- "Strong-ish" scaling
  - Limited increase in gridpoints
  - Major increase in expense per gridpoint
  - Major increase in time steps
- Fewer, more-powerful processors
- High memory bandwidth
- High-bandwidth, low-latency communication

Slide 21: Tangent: Strongish* Scaling
- Firm
- Semistrong
- Unweak
- Strongoidal
- MSTW (More Strong Than Weak)
- JTSoS (Just This Side of Strong)
- WNS (Well-Nigh Strong)
- Seak, Steak, Streak, Stroak, Stronk
- Weag, Weng, Wong, Wrong, Twong

* Greg Lindahl, Vendor Scum

Slide 22: X1 for 100 TF Sustained?
- Uh, no
- OS not scalable, fault-resilient enough for 10^4 processors
- That "price/performance" thing
- That "power & cooling" thing

Slide 23: Xn for 100 TF Sustained
- For DOE SC applications, YES
- Most-promising candidate -or- least-implausible candidate

Slide 24: Why X, again?
- Most-powerful processors
  - Reduce need for scalability
  - Obey Amdahl's Law
- High memory bandwidth
  - See above
- Globally addressable memory
  - Lowest, most hide-able latency
  - Scale latency-bound applications
- High interconnect bandwidth
  - Scale bandwidth-bound applications

Slide 25: The Bad News
- Scalar performance
- "Some tuning required"
- Ho-hum MPI latency
  - See Rants

Slide 26: Scalar Performance
- Compilation is slow
- Amdahl's Law for single processes
  - Parallelization -> Vectorization
- Hard to port GNU tools
  - GCC? Are you kidding?
  - GCC compatibility, on the other hand...
- Black Widow will be better

Slide 27: "Some Tuning Required"
- Vectorization requires:
  - Independent operations
  - Dependence information
  - Mapping to vector instructions
- Applications take a wide spectrum of steps to inhibit this
  - May need a couple of compiler directives
  - May need extensive rewriting
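
As a hedged illustration of the "couple of compiler directives" case: a loop with indirect addressing that the compiler cannot prove independent, plus the directive that asserts independence. The kernel is hypothetical (not from any application named above), and the directive spelling (!dir$ ivdep) is the Cray/Intel form.

    ! The compiler cannot prove the indirect addresses are distinct, so it
    ! will not vectorize this loop on its own. A directive asserting
    ! independence (here ivdep) is the programmer's promise that idx has no
    ! repeated values. Hypothetical kernel, not from the applications above.
    subroutine scatter_update(n, idx, x, y)
      implicit none
      integer, intent(in)    :: n, idx(n)
      real(8), intent(in)    :: x(n)
      real(8), intent(inout) :: y(*)
      integer :: i

    !dir$ ivdep
      do i = 1, n
        y(idx(i)) = y(idx(i)) + x(i)
      end do
    end subroutine scatter_update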

Slide 28: Application Results
- Awesome
- Indifferent
- Recalcitrant
- Hopeless

Slide 29: Awesome Results
- 256-MSP X1 already showing unique capability
- Apps bound by memory bandwidth, interconnect bandwidth, interconnect latency
- POP, Gyro, DCA-QMC, AGILE-BOLTZTRAN, VH1, Amber, ...
- Many examples from DoD

Slide 30: Indifferent Results
- Cray X1 is brute-force fast, but not cost effective
- Dense linear algebra
  - Linpack, AORSA, LSMS

Slide 31: Recalcitrant Results
- Inherent algorithms are fine
- Source code or ongoing code mods don't vectorize
- Significant code rewriting done, ongoing, or needed
- CLM, CAM, Nimrod, M3D

Slide 32: Aside: How to Avoid Vectorization
- Use pointers to add false dependencies
- Put deep call stacks inside loops
- Put debug I/O operations inside compute loops
- Did I mention using pointers?
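
A small, deliberately bad sketch of two of these anti-patterns, alongside a vectorizable rewrite; the code is hypothetical and not drawn from any application named in these slides.

    ! Deliberately bad sketch of two anti-patterns above, plus a
    ! vectorizable rewrite. Hypothetical code.
    module avoid_vectorization_sketch
      implicit none
    contains

      ! Anti-pattern: pointer dummies give the compiler possible aliasing
      ! between p and q, and the debug write forces scalar, in-order
      ! execution of the loop body.
      subroutine bad(n, p, q)
        integer, intent(in) :: n
        real(8), pointer    :: p(:), q(:)
        integer :: i
        do i = 1, n
          p(i) = p(i) + 2.0d0 * q(i)
          if (p(i) < 0.0d0) write (*, *) 'negative at', i   ! I/O in the loop
        end do
      end subroutine bad

      ! Rewrite: plain (non-pointer) dummy arrays carry no-alias guarantees,
      ! and diagnostics are collected outside the compute loop.
      subroutine good(n, p, q, nneg)
        integer, intent(in)    :: n
        real(8), intent(inout) :: p(n)
        real(8), intent(in)    :: q(n)
        integer, intent(out)   :: nneg
        integer :: i
        do i = 1, n
          p(i) = p(i) + 2.0d0 * q(i)
        end do
        nneg = count(p(1:n) < 0.0d0)
      end subroutine good

    end module avoid_vectorization_sketch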

Slide 33: Aside: Software Design
- In general, we don't know how to systematically design efficient, maintainable HPC software
- Vectorization imposes constraints on software design
  - Bad: Existing software must be rewritten
  - Good: Resulting software often faster on modern superscalar systems
- "Some tuning required" for X series
  - Bad: You must tune
  - Good: Tuning is systematic, not a Black Art
- Vectorization "constraints" may help us develop effective design patterns for HPC software

Slide 34: Hopeless Results
- Dominated by unvectorizable algorithms
- Some benchmark kernels of questionable relevance
- No known DOE SC applications

Slide 35: Summary
- DOE SC scientists do need 100 TF and beyond of sustained application performance
- Cray X series is the least-implausible option for scaling DOE SC applications to 100 TF of sustained performance and beyond

Slide 36: "Custom" Rant
- "Custom vs. Commodity" is a red herring
  - CMOS is commodity
  - Memory is commodity
  - Wires are commodity
- Cooling is independent of vector vs. scalar
  - PNNL liquid-cooling clusters
  - Vector systems may move to air cooling
- All vendors do custom packaging
- Real issue: software

Slide 37: MPI Rant
- Latency-bound apps often limited by "MPI_Allreduce(..., MPI_SUM, ...)"
  - Not ping pong!
  - An excellent abstraction that is eminently optimizable
- Some apps are limited by point-to-point
  - Remote load/store implementations (CAF, UPC) have performance advantages over MPI
  - But MPI could be implemented using load/store, inlined, and optimized
  - On the other hand, easier to avoid pack/unpack with a load/store model
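
A minimal sketch of the measurement implied here: timing a one-element MPI_Allreduce, which is what latency-bound applications actually feel, rather than point-to-point ping-pong. The program and repetition count are illustrative.

    ! Micro-benchmark sketch: the cost latency-bound applications see is
    ! the time of a small MPI_Allreduce across all ranks, not ping-pong
    ! time between two ranks. Illustrative only.
    program allreduce_latency
      use mpi
      implicit none
      integer, parameter :: reps = 1000
      integer :: ierr, rank, i
      real(8) :: x, s, t0, t1

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      x = real(rank, 8)

      call MPI_Barrier(MPI_COMM_WORLD, ierr)
      t0 = MPI_Wtime()
      do i = 1, reps
        ! One 8-byte global sum per iteration: pure latency, no bandwidth.
        call MPI_Allreduce(x, s, 1, MPI_REAL8, MPI_SUM, MPI_COMM_WORLD, ierr)
      end do
      t1 = MPI_Wtime()

      if (rank == 0) print *, 'avg MPI_Allreduce time (s):', (t1 - t0) / reps

      call MPI_Finalize(ierr)
    end program allreduce_latency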

Slide 38: Co-Array Fortran Rant
- No such thing as one-sided communication
  - It's all two-sided: send+receive, sync+put+sync, sync+get+sync
  - Same parallel algorithms
- CAF mods can be highly nonlocal
  - Adding CAF in a subroutine can have implications for the argument types, and thus for the callers, the callers' callers, etc.
  - Rarely the case for MPI
- We use CAF to avoid MPI-implementation performance inadequacies
  - Avoiding nonlocality by cheating with Cray pointers
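
A minimal sketch of the sync + put + sync pattern described above, written in Fortran 2008 coarray syntax (the 2004-era Cray CAF dialect spelled synchronization differently, e.g. call sync_all()). The halo-style exchange is hypothetical, not from any application named in these slides.

    ! Sketch of "sync + put + sync": the explicit synchronizations on both
    ! sides are what make the exchange effectively two-sided, even though
    ! the data movement itself is a one-sided store.
    ! Hypothetical halo exchange, Fortran 2008 coarray syntax.
    program caf_put_sketch
      implicit none
      integer, parameter :: n = 1024
      real(8) :: halo(n)[*]          ! coarray: remotely addressable buffer
      real(8) :: work(n)
      integer :: me, np, neighbor

      me = this_image()
      np = num_images()
      neighbor = mod(me, np) + 1     ! right neighbor, periodic

      call random_number(work)

      sync all                       ! neighbor's buffer is ready to be written
      halo(:)[neighbor] = work(:)    ! "put": store into the neighbor's image
      sync all                       ! my own halo has now been filled

      print *, 'image', me, 'received sum', sum(halo)
    end program caf_put_sketch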

Slide 39: Cray Rant
- Cray XD1 (OctigaBay) follows in tradition of T3E

Slide 40: Cray Rant
- Cray XD1 (OctigaBay) follows in tradition of T3E
- Very promising architecture
- Dumb name
- Interesting competitor with Red Storm

Slide 41: Questions?
James B. White III (Trey), trey@ornl.gov
http://www.csm.ornl.gov/evaluation/PHOENIX/

