
An Inconvenient Question: Are We Going to Get the Algorithms and Computing Technology We Need to Make Critical Climate Predictions in Time?




1 An Inconvenient Question: Are We Going to Get the Algorithms and Computing Technology We Need to Make Critical Climate Predictions in Time? Rich Loft, Director, Technology Development, Computational and Information Systems Laboratory, National Center for Atmospheric Research, loft@ucar.edu (November 18, 2008)

2 Main Points
- The nature of the climate system makes it a grand challenge computing problem.
- We are at a critical juncture: we need regional climate prediction capabilities!
- Computer clock/thread speeds are stalled: massive parallelism is the future of supercomputing.
- Our best algorithms, parallelization strategies, and architectures are inadequate to the task.
- We need model acceleration improvements in all three areas if we are to meet the challenge.

3 Options for Application Acceleration
Scalability:
- Eliminate bottlenecks
- Find more parallelism
- Load balancing algorithms
Algorithmic Acceleration:
- Bigger timesteps: semi-Lagrangian transport; implicit or semi-implicit time integration (solvers)
- Fewer points: Adaptive Mesh Refinement methods
Hardware Acceleration:
- More threads: CMP, GP-GPUs
- Faster threads: device innovations (high-k)
- Smarter threads: architecture - old tricks, new tricks... magic tricks (vector units, GPUs, FPGAs)

4 A Very Grand Challenge: Coupled Models of the Earth System (figure: coupled air-column/water-column schematic at ~150 km resolution; Viner, 2002)
Typical model computation: 15-minute time steps; ~1 petaflop per model year. There are 3.5 million timesteps in a century.
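
A back-of-the-envelope sketch of where those numbers come from (my arithmetic, not the slide's; the per-model-year operation count is taken at face value):

```python
# My arithmetic, not the slide's: where "3.5 million timesteps per century"
# comes from, and what a century costs at ~1 petaflop (10^15 floating-point
# operations) per model year.

steps_per_day = 24 * 60 // 15                  # 15-minute steps -> 96 per day
steps_per_century = 100 * 365 * steps_per_day  # ignoring leap days
flop_per_century = 100 * 1e15                  # assumed: 1 PFLOP per model year

print(f"{steps_per_century:,} timesteps per century")      # ~3.5 million
print(f"{flop_per_century:.1e} floating-point operations per century")
```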

5 Multicomponent Earth System Model (diagram: Atmosphere, Ocean, Sea Ice, and Land components linked by a Coupler, with C/N cycle, dynamic vegetation, ecosystem & BGC, gas chemistry, prognostic aerosols, upper atmosphere, land use, and ice sheets)
Software challenges: increasing complexity; validation and verification; understanding the output.
Key concept: a flexible coupling framework is critical!

6 Climate Change (figure; credit: Caspar Ammann, NCAR)

7 IPCC AR4 (2007)
- "Warming of the climate system is unequivocal"...
- ...and it is "very likely" caused by human activities.
- Most of the observed changes over the past 50 years are now simulated by climate models, adding confidence to future projections.
- Model resolutions: O(100 km)

8 Climate Change Research Epochs
- Before IPCC AR4 (2007), curiosity driven: reproduce historical trends; investigate climate change; run IPCC scenarios.
- After IPCC AR4, policy driven: assess regional impacts; simulate adaptation strategies; simulate geoengineering solutions.

9 ESSL (The Earth & Sun Systems Laboratory). Where we want to go: the Exascale Earth System Model vision, a coupled ocean-land-atmosphere model.
- Atmosphere: ~1 km x ~1 km (cloud-resolving), 100 levels, whole atmosphere, unstructured adaptive grids.
- Land: ~100 m, 10 levels, landscape-resolving.
- Ocean: ~10 km x ~10 km (eddy-resolving), 100 levels, unstructured adaptive grids.
Requirement: computing power enhancement by as much as a factor of 10^10-10^12. YIKES!

10 Compute Factors for an ultra-high resolution Earth System Model (courtesy of John Drake, ORNL)
- Spatial resolution (provide regional details): 10^3-10^5
- Model completeness (add "new" science): 10^2
- New parameterizations (upgrade to "better" science): 10^2
- Run length (long-term implications): 10^2
- Ensembles, scenarios (range of model variability): 10
- Total compute factor: 10^10-10^12
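
Multiplying out the factors in the table reproduces the quoted total; a one-line check:

```python
# Multiplying the factors above reproduces the quoted total of 10^10 to
# 10^12; the range comes from the spatial-resolution entry.

low  = 1e3 * 1e2 * 1e2 * 1e2 * 10   # resolution at its low end  -> 1e10
high = 1e5 * 1e2 * 1e2 * 1e2 * 10   # resolution at its high end -> 1e12
print(f"total compute factor: {low:.0e} to {high:.0e}")
```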

11 Why run length: global thermohaline circulation timescale: ~3,000 years

12 Why resolution: atmospheric convective (cloud) scales are O(1 km)

13 Why High Resolution in the Ocean? (figures: 1° ocean component of CCSM, Collins et al., 2006, vs. 0.1° eddy-resolving POP, Maltrud & McClean, 2005)

14 High Resolution and the Land Surface

15 Performance improvements are not coming fast enough! The trend suggests a 10^10 to 10^12 improvement will take 40 years.

16 ITRS Roadmap: feature size is dropping 14% per year. By 2050 it reaches the size of an atom - oops!
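
A rough check of that extrapolation, using my own assumed starting point of a 45 nm feature size in 2008 and ~0.1 nm as "the size of an atom" (neither number is on the slide):

```python
# Rough check of the extrapolation, with assumed inputs: start from a 45 nm
# feature size (a 2008-era process node) and shrink it 14% per year until it
# reaches ~0.1 nm, roughly the scale of a single atom.

size_nm, year = 45.0, 2008
while size_nm > 0.1:
    size_nm *= 1.0 - 0.14   # 14% smaller each year
    year += 1
print(f"feature size reaches ~0.1 nm around {year}")   # lands near 2050
```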

17 National Security Agency: "The power consumption of today's advanced computing systems is rapidly becoming the limiting factor with respect to improved/increased computational ability."

18 Chip Level Trends: Stagnant Clock Speed (source: Intel, Microsoft (Sutter), and Stanford (Olukotun, Hammond))
- Chip density continues to increase ~2x every 2 years
- Clock speed is not increasing
- The number of cores is doubling instead
- There is little or no additional hidden parallelism (ILP)
- Parallelism must be exploited by software

19 Moore's Law -> More's Law: speed-up through increasing parallelism. How long can we keep doubling the number of cores per chip?

20 NCAR and the University of Colorado Partner to Experiment with Blue Gene/L
(Photo: Dr. Henry Tufo and me with "frost", 2005.)
Characteristics: 2048 processors / 5.7 TF; PPC 440 (750 MHz); two processors per node; 512 MB memory per node; 6 TB file system.

21 Status and immediate plans for high resolution Earth System Modeling

22 Current high resolution CCSM runs
0.25° ATM,LND + 0.1° OCN,ICE [ATLAS/LLNL]:
- 3280 processors
- 0.42 simulated years per day (SYPD)
- 187K CPU-hours/year
0.50° ATM,LND + 0.1° OCN,ICE [FRANKLIN/NERSC]:
- Current: 5416 processors, 1.31 SYPD, 99K CPU-hours/year
- Efficiency goal: 4932 processors, 1.80 SYPD, 66K CPU-hours/year
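
The CPU-hours figures follow directly from processor count and throughput (CPU-hours per simulated year = processors x 24 / SYPD); a quick check of the slide's numbers:

```python
# The CPU-hours figures follow from processor count and throughput:
# cpu_hours per simulated year = processors * 24 / SYPD.

runs = [
    ("0.25deg ATM,LND + 0.1deg OCN,ICE (ATLAS)",    3280, 0.42),
    ("0.50deg ATM,LND + 0.1deg OCN,ICE (FRANKLIN)", 5416, 1.31),
    ("0.50deg efficiency goal (FRANKLIN)",          4932, 1.80),
]
for name, procs, sypd in runs:
    cpu_hours = procs * 24 / sypd
    print(f"{name}: {cpu_hours / 1e3:.0f}K CPU-hours per simulated year")
```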

23 Current 0.5° CCSM "fuel efficient" configuration [franklin], 5416 processors. Component times (from the layout diagram):
- OCN [np=3600]: 168 sec.
- ATM [np=1664]: 120 sec.
- CPL [np=384]: 52 sec.
- LND [np=16]: 21 sec.
- ICE [np=1800]: 91 sec.

24 Efficiency issues in the current 0.5° CCSM configuration (same 5416-processor component layout as slide 23). Use Space Filling Curves (SFC) in POP to reduce the processor count by 13%.

25 Load Balancing: Partitioning with Space-Filling Curves (figure: partition for 3 processors)

26 Space-Filling Curve Partitioning for an Ocean Model running on 8 Processors (figure). Key concept: no need to compute over land! Static load balancing...
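
A minimal sketch of the idea, with my own simplifications: a Morton (Z-order) curve stands in for the curve actually used in POP, and the 8x8 block grid and land mask are invented for illustration. Ocean-only blocks are ordered along the curve and the ordering is cut into equal contiguous pieces:

```python
# Minimal sketch of space-filling-curve (SFC) partitioning for an ocean grid.
# Assumptions (not from the slides): a Morton (Z-order) curve stands in for
# the curves used in POP; the 8x8 block mesh and land mask are toy examples.

def morton_index(i, j, bits=8):
    """Interleave the bits of (i, j) to get a position along a Z-order curve."""
    idx = 0
    for b in range(bits):
        idx |= ((i >> b) & 1) << (2 * b + 1)
        idx |= ((j >> b) & 1) << (2 * b)
    return idx

def sfc_partition(ocean_blocks, nprocs):
    """Order ocean-only blocks along the curve, then cut into equal chunks."""
    ordered = sorted(ocean_blocks, key=lambda ij: morton_index(*ij))
    chunk = -(-len(ordered) // nprocs)          # ceiling division
    return [ordered[p * chunk:(p + 1) * chunk] for p in range(nprocs)]

if __name__ == "__main__":
    n = 8
    # Toy mask: a rectangular "continent" in the middle counts as land.
    land = {(i, j) for i in range(3, 6) for j in range(2, 5)}
    ocean = [(i, j) for i in range(n) for j in range(n) if (i, j) not in land]
    for p, blocks in enumerate(sfc_partition(ocean, nprocs=8)):
        print(f"proc {p}: {len(blocks)} ocean blocks")
```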

27 Ocean Model 1/10-degree performance (scaling figure). Key concept: you need routine access to >1K processors to discover true scaling behaviour!

28 Efficiency issues in the current 0.5° CCSM configuration (same 5416-processor component layout as slide 23). Use weighted SFC (wSFC) in CICE to reduce execution time by 2x.

29 Static, Weighted Load Balancing Example: sea ice model CICE4 at 1° on 20 processors. Small domains at high latitudes, large domains at low latitudes. (Courtesy of John Dennis)
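
A sketch of the weighted variant under my own simplifications (the blocks, weights, and curve ordering are invented; a real weight would come from an estimate of per-block sea-ice work): the curve is cut so each processor gets roughly equal total weight, which automatically yields small domains where the weight is high.

```python
# Minimal sketch of static, weighted load balancing along a space-filling
# curve, in the spirit of the weighted-SFC partitioning described for CICE4.
# The block list and weights below are invented for illustration.

def weighted_partition(blocks, weights, nprocs):
    """Cut SFC-ordered blocks into contiguous pieces of ~equal total weight."""
    target = sum(weights) / nprocs
    parts, current, acc = [], [], 0.0
    for blk, w in zip(blocks, weights):
        current.append(blk)
        acc += w
        # Close this piece once it reaches its share (keep the last piece open).
        if acc >= target and len(parts) < nprocs - 1:
            parts.append(current)
            current, acc = [], 0.0
    parts.append(current)
    return parts

if __name__ == "__main__":
    # Toy example: 24 blocks in curve order; the ends of the list stand in for
    # high-latitude blocks and carry more work than the low-latitude middle.
    blocks = list(range(24))
    weights = [3.0] * 6 + [1.0] * 12 + [3.0] * 6
    for p, part in enumerate(weighted_partition(blocks, weights, nprocs=4)):
        print(f"proc {p}: blocks {part}")    # small pieces where weights are big
```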

30 Efficiency issues in the current 0.5° CCSM configuration: Coupler (same 5416-processor component layout as slide 23). Unresolved scalability issues in the Coupler. Options: better interconnect, nested grids, PGAS language paradigm.

31 Efficiency issues in the current 0.5° CCSM configuration: atmospheric component (same 5416-processor component layout as slide 23). Scalability limitation in 0.5° fv-CAM [MPI]: shift to the hybrid OpenMP/MPI version.

32 Projected 0.5° CCSM "capability" configuration: 3.8 years/day on 19460 processors. Component times (from the layout diagram):
- OCN [np=6100]: 62 sec.
- ATM [np=5200]: 62 sec.
- CPL [np=384]: 31 sec.
- LND [np=40]: 21 sec.
- ICE [np=8120]: 10 sec.
Action: run the hybrid atmospheric model.

33 Projected 0.5° CCSM "capability" configuration, version 2: 3.8 years/day on 14260 processors. Component times (from the layout diagram):
- OCN [np=6100]: 62 sec.
- ATM [np=5200]: 62 sec.
- CPL [np=384]: 31 sec.
- LND [np=40]: 21 sec.
- ICE [np=8120]: 10 sec.
Action: thread the ice model.

34 Scalable Geometry Choice: Cube-Sphere
- The sphere is decomposed into 6 identical regions using a central projection (Sadourny, 1972) with an equiangular grid (Rancic et al., 1996).
- Avoids pole problems; quasi-uniform.
- Non-orthogonal curvilinear coordinate system with identical metric terms.
(Figure: Ne=16 cube-sphere showing the degree of non-uniformity.)
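
For intuition, a small sketch of the standard equiangular central (gnomonic) projection for a single cube face, written from the textbook construction rather than any particular model's code; printing the spacing between successive edge points shows the quasi-uniformity, and the mild non-uniformity, the slide refers to:

```python
# Minimal sketch of the equiangular central (gnomonic) projection used for
# cube-sphere grids. Covers only the face centered on the +x axis; the other
# five faces are rotations of it. Not taken from HOMME source code.
import math

def equiangular_to_sphere(alpha, beta):
    """Map equiangular face coords alpha, beta in [-pi/4, pi/4] to (x, y, z)."""
    X, Y = math.tan(alpha), math.tan(beta)       # central (gnomonic) projection
    r = math.sqrt(1.0 + X * X + Y * Y)
    return (1.0 / r, X / r, Y / r)               # point on the unit sphere

if __name__ == "__main__":
    ne = 16                                       # elements along a cube edge
    d = (math.pi / 2) / ne                        # equal angular spacing
    edge = [equiangular_to_sphere(-math.pi / 4 + i * d, -math.pi / 4)
            for i in range(ne + 1)]
    # Great-circle distance between successive points along the face edge:
    for a, b in zip(edge[:-1], edge[1:]):
        dot = max(-1.0, min(1.0, sum(pa * pb for pa, pb in zip(a, b))))
        print(f"{math.degrees(math.acos(dot)):.3f} deg")
```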

35 Scalable Numerical Method: High-Order Methods
Algorithmic advantages of high order methods:
- h-p element-based method on quadrilaterals (Ne x Ne)
- Exponential convergence in polynomial degree (N)
Computational advantages of high order methods:
- Naturally cache-blocked N x N computations
- Nearest-neighbor communication between elements (explicit)
- Well suited to parallel microprocessor systems

36 HOMME: Computational Mesh
Elements:
- A quadrilateral "patch" of N x N gridpoints
- Gauss-Lobatto grid
- Typically N = 4-8
Cube:
- Ne = elements on an edge
- 6 x Ne x Ne elements total
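
A small sketch (mine, not HOMME source) of generating the N x N Gauss-Lobatto nodal grid inside one element, assuming the standard definition of Gauss-Lobatto-Legendre points as the interval endpoints plus the roots of the derivative of the Legendre polynomial of degree N-1:

```python
# Minimal sketch of a Gauss-Lobatto-Legendre (GLL) point set for one element,
# the kind of N x N nodal grid each HOMME element carries. Uses numpy only;
# N is the number of points per edge (assumed 4-8, as on the slide).
import numpy as np

def gll_points(n):
    """Return the n GLL points on [-1, 1]: endpoints plus roots of P'_{n-1}."""
    legendre = np.polynomial.legendre.Legendre.basis(n - 1)
    interior = legendre.deriv().roots()
    return np.concatenate(([-1.0], np.sort(interior), [1.0]))

if __name__ == "__main__":
    n = 4                                    # points per element edge
    pts_1d = gll_points(n)
    print("1-D GLL points:", np.round(pts_1d, 4))
    # Tensor-product N x N grid inside one quadrilateral element:
    x, y = np.meshgrid(pts_1d, pts_1d)
    print("element grid shape:", x.shape)
```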

37 Partitioning a cube-sphere on 8 processors (figure)

38 Partitioning a cubed-sphere on 8 processors (figure)

39 Aqua-Planet CAM/HOMME Dycore
- Full CAM physics / HOMME dycore
- Parallel I/O library used for physics aerosol input and input data (this work COULD NOT have been done without parallel I/O)
- Work underway to couple to other CCSM components
- 5 years/day

40 Projected 0.25° CCSM "capability" configuration, version 2: 4.0 years/day on 30000 processors. Component times (from the layout diagram):
- OCN [np=6000]: 60 sec.
- HOMME ATM [np=24000]: 60 sec.
- CPL [np=3840]: 47 sec.
- LND [np=320]: 8 sec.
- ICE [np=16240]: 5 sec.
Action: insert the scalable atmospheric dycore.

41 Using a bigger parallel machine can't be the only answer
- Progress in the Top 500 list is not fast enough.
- Amdahl's Law is a formidable opponent.
- The dynamical timestep goes like N^-1:
  - Merciless effect of the Courant limit
  - The cost of dynamics relative to physics increases as N
  - e.g., if dynamics takes 20% of the time at 25 km, it will take 86% of the time at 1 km
- Traditional parallelization of the horizontal leaves an N^2 per-thread cost (vertical x horizontal):
  - Must inevitably slow down with stalled thread speeds
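
A quick check of that 20% to 86% example, assuming only that the cost of dynamics relative to physics grows linearly with the horizontal refinement factor (25 km to 1 km is a factor of 25):

```python
# Check of the slide's example: if the dynamics-to-physics cost ratio grows
# linearly with the refinement factor (the Courant limit shrinks the timestep
# like N^-1), what fraction of runtime is dynamics after refining 25 km -> 1 km?

def dynamics_fraction(frac_coarse, refinement):
    """Fraction of time spent in dynamics after refining by 'refinement'."""
    ratio = frac_coarse / (1.0 - frac_coarse)   # dynamics/physics at coarse res
    ratio *= refinement                          # extra substeps from the Courant limit
    return ratio / (1.0 + ratio)

if __name__ == "__main__":
    print(f"{dynamics_fraction(0.20, 25):.0%}")  # ~86%, as quoted on the slide
```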

42 Options for Application Acceleration
Scalability:
- Eliminate bottlenecks
- Find more parallelism
- Load balancing algorithms
Algorithmic Acceleration:
- Bigger timesteps: semi-Lagrangian transport; implicit or semi-implicit time integration (solvers)
- Fewer points: Adaptive Mesh Refinement methods
Hardware Acceleration:
- More threads: CMP, GP-GPUs
- Faster threads: device innovations (high-k)
- Smarter threads: architecture - old tricks, new tricks... magic tricks (vector units, GPUs, FPGAs)

43 Accelerator Research
- Graphics cards: NVIDIA 9800 / CUDA; measured 109x on WRF microphysics on a 9800GX2
- FPGA: Xilinx (data flow model); 21.7x simulated on shortwave (SW) radiation code
- IBM Cell processor: 8 cores
- Intel Larrabee

44 DG+NH+AMR
- Curvilinear elements
- Overhead of parallel AMR at each time-step: less than 1%
- Idea based on Fischer, Kruse, Loth (2002)
(Courtesy of Amik St. Cyr)

45 SLIM ocean model (Louvain-la-Neuve University)
- DG, implicit, AMR, unstructured
- To be coupled to a prototype unstructured ATM model
(Courtesy of J-F Remacle, LNU)

46 NCAR Summer Internships in Parallel Computational Science (SIParCS), 2007-2008
- Open to: upper-division undergrads; graduate students
- In disciplines such as: CS, software engineering; applied math, statistics; ES science
- Support: travel, housing, per diem; 10 weeks salary
- Number of interns selected: 7 in 2007; 11 in 2008
http://www.cisl.ucar.edu/siparcs

47 Meanwhile - the clock is ticking

48 The Size of the Interdisciplinary/Interagency Team Working on Climate Scalability
Contributors: D. Bailey (NCAR), F. Bryan (NCAR), T. Craig (NCAR), A. St. Cyr (NCAR), J. Dennis (NCAR), J. Edwards (IBM), B. Fox-Kemper (MIT, CU), E. Hunke (LANL), B. Kadlec (CU), D. Ivanova (LLNL), E. Jedlicka (ANL), E. Jessup (CU), R. Jacob (ANL), P. Jones (LANL), S. Peacock (NCAR), K. Lindsay (NCAR), W. Lipscomb (LANL), R. Loy (ANL), J. Michalakes (NCAR), A. Mirin (LLNL), M. Maltrud (LANL), J. McClean (LLNL), R. Nair (NCAR), M. Norman (NCSU), T. Qian (NCAR), M. Taylor (SNL), H. Tufo (NCAR), M. Vertenstein (NCAR), P. Worley (ORNL), M. Zhang (SUNYSB)
Funding: DOE-BER CCPP Program (DE-FC03-97ER62402, DE-PS02-07ER07-06, DE-FC02-07ER64340, B&R KP1206000); DOE-ASCR (B&R KJ0101030); NSF Cooperative Grant NSF01; NSF PetaApps Award
Computer time: Blue Gene/L time (NSF MRI Grant, NCAR, University of Colorado, IBM SUR program, BGW Consortium Days, IBM Research (Watson), LLNL, Stony Brook & BNL); Cray XT3/4 time (ORNL, Sandia)

49 Thanks! Any questions?

50 Q. If you had a petascale computer, what would you do with it? A. Use it as a prototype of an exascale computer.

