1 Scaling CCSM to a Petascale system
John M. Dennis, Software Engineering Working Group Meeting, June 22, 2006


2 Motivation
- Petascale systems with 100K-500K processors: trend or one-off?
  - LLNL: 128K-processor IBM BG/L
  - IBM Watson: 40K-processor IBM BG/L
  - Sandia: 10K-processor RedStorm
  - ORNL/NCCS: 5K-processor Cray XT3: 10K (end of summer) -> 20K (Nov 06) -> ?
  - ANL: large IBM BG/P system
- We already have prototypes for a Petascale system!

3 Motivation (cont'd)
- Prototype Petascale application? 0.1-degree POP:
  - BGW, 30K processors --> 7.9 simulated years/wallclock day
  - RedStorm, 8K processors --> 8.1 simulated years/wallclock day
  (see the conversion sketch below)
- Can CCSM be a Petascale application? Look at each component separately:
  - Current scalability limitations
  - Changes necessary to enable execution on large processor counts
  - Check scalability on BG/L
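
A back-of-envelope conversion, not from the slides' own scripts: wallclock seconds per simulated day to simulated years per wallclock day (30 seconds per simulated day is the 0.1-degree POP figure quoted on slide 29).

 program simulation_rate
   implicit none
   real :: secs_per_sim_day
   secs_per_sim_day = 30.0   ! wallclock seconds to compute one simulated day
   print *, 86400.0/secs_per_sim_day/365.0, ' simulated years per wallclock day'   ! ~7.9
 end program simulation_rate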

4 Motivation (cont'd)
- Why examine scalability on BG/L?
  - Prototype for a Petascale system
  - Access to large processor counts: 2K easily, 40K through Blue Gene Watson Days
  - Scalable architecture
  - Limited memory: 256 MB (VN, virtual node mode), 512 MB (CO, coprocessor mode)
  - Dedicated resources give reproducible timings
  - Lessons translate to other systems [Cray XT3]

5 Outline
- Motivation
- POP
- CICE
- CLM
- CAM + Coupler
- Conclusions

6 Parallel Ocean Program (POP)
- Modified the base POP 2.0 code to reduce execution time and improve scalability; minor changes (~9 files):
  - Reworked the barotropic solver
  - Improved load balancing (space-filling curves)
  - Pilfered the CICE boundary exchange [NEW]
- Significant advances in performance:
  - 1 degree: 128 POWER4 processors --> 2.1x
  - 0.1 degree: 30K BG/L processors --> 2x; 8K RedStorm processors --> 1.3x

7 POP using 20x24 blocks (gx1v3)
- POP data structure: flexible block structure with land 'block' elimination
- Small blocks: better load balance and land-block elimination, but larger halo overhead
- Larger blocks: smaller halo overhead, but load imbalance and no land-block elimination
  (see the halo-overhead sketch below)
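
An illustration of the block-size trade-off (a halo width of 1 and the block sizes other than 20x24 are assumptions, not values from the slides): the fraction of storage spent on halo points grows as blocks shrink.

 program halo_overhead
   implicit none
   integer :: bx(3) = (/ 36, 20, 12 /)
   integer :: by(3) = (/ 24, 24, 12 /)
   integer :: k
   do k = 1, 3
      print '(i4,a,i3,f8.1,a)', bx(k), ' x', by(k), &
           100.0*real((bx(k)+2)*(by(k)+2) - bx(k)*by(k)) / real(bx(k)*by(k)), ' % halo points'
   end do
 end program halo_overhead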

8 Outline
- Motivation
- POP
  - New barotropic solver
  - CICE boundary exchange
  - Space-filling curves
- CICE
- CLM
- CAM + Coupler
- Conclusions

9 Alternate Data Structure
- 2D data structure
  - Advantages: regular stride-1 access; compact form of the stencil operator
  - Disadvantages: includes land points; problem-specific data structure
- 1D data structure
  - Advantages: no more land points; general data structure
  - Disadvantages: indirect addressing; larger stencil operator
(A sketch contrasting the two layouts follows.)
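
A sketch contrasting the two layouts for a 5-point stencil apply; array and routine names are placeholders, not the actual solvers.F90 interfaces.

 subroutine stencil_2d(nx, ny, A0, AN, AS, AE, AW, X, AX)
   implicit none
   integer, intent(in) :: nx, ny
   real(kind=8), intent(in)  :: A0(nx,ny), AN(nx,ny), AS(nx,ny), AE(nx,ny), AW(nx,ny), X(nx,ny)
   real(kind=8), intent(out) :: AX(nx,ny)
   integer :: i, j
   AX = 0.0d0
   do j = 2, ny-1
      do i = 2, nx-1           ! stride-1 inner loop; land points are swept too
         AX(i,j) = A0(i,j)*X(i,j) + AN(i,j)*X(i,j+1) + AS(i,j)*X(i,j-1) &
                                  + AE(i,j)*X(i+1,j) + AW(i,j)*X(i-1,j)
      end do
   end do
 end subroutine stencil_2d

 subroutine stencil_1d(nActive, ntotal, nNbr, A0, Anbr, nbr, X, AX)
   implicit none
   integer, intent(in) :: nActive, ntotal, nNbr     ! ocean points, ocean+halo points, neighbors
   real(kind=8), intent(in)  :: A0(nActive), Anbr(nNbr,nActive), X(ntotal)
   integer,      intent(in)  :: nbr(nNbr,nActive)   ! neighbor index table
   real(kind=8), intent(out) :: AX(nActive)
   integer :: i, k
   do i = 1, nActive                                ! land points never appear
      AX(i) = A0(i)*X(i)
      do k = 1, nNbr
         AX(i) = AX(i) + Anbr(k,i)*X(nbr(k,i))      ! indirect addressing
      end do
   end do
 end subroutine stencil_1d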

10 Using 1D data structures in the POP2 solver (serial)
- Replace solvers.F90; measure execution time on cache-based microprocessors
- Examine two CG algorithms with diagonal preconditioning:
  - PCG2 (2 inner products)
  - PCG1 (1 inner product) [D'Azevedo 93]
- Grid: test [128x192 grid points] with (16x16) blocks

11 Serial execution time on IBM POWER4 (test grid): 56% reduction in cost per iteration

12 Using the 1D data structure in the POP2 solver (parallel)
- New parallel halo update
- Examine several CG algorithms with diagonal preconditioning:
  - PCG2 (2 inner products)
  - PCG1 (1 inner product)
- Existing solver/preconditioner technology: Hypre (LLNL) PCG solver with a diagonal preconditioner

13 Solver execution time for POP2 (20x24 blocks) on BG/L (gx1v3): 48% and 27% reductions in cost per iteration

14 Outline
- Motivation
- POP
  - New barotropic solver
  - CICE boundary exchange
  - Space-filling curves
- CICE
- CLM
- CAM + Coupler
- Conclusions

15 CICE boundary exchange
- POP applies a 2D boundary exchange to 3D variables; this 3D update takes 2-33% of total time
- A specialized 3D boundary exchange reduces the message count and increases message length, reducing dependence on machine latency (see the sketch below)
- Pilfer the CICE 4.0 boundary exchange: code reuse! :-)
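
A minimal sketch of the idea, not the CICE 4.0 routine (subroutine and variable names are assumptions): one east-to-west exchange packs all km vertical levels into a single message instead of sending km separate 2D halo messages, so the latency cost is paid once per neighbor rather than once per level.

 subroutine halo3d_east_west(field, nx, ny, km, east_pe, west_pe, comm)
   use mpi
   implicit none
   integer, intent(in) :: nx, ny, km, east_pe, west_pe, comm
   real(kind=8), intent(inout) :: field(nx,ny,km)   ! halo width of 1 assumed
   real(kind=8) :: sbuf(ny*km), rbuf(ny*km)
   integer :: k, ierr, reqs(2)

   do k = 1, km                                     ! pack every level at once
      sbuf((k-1)*ny+1:k*ny) = field(nx-1,:,k)       ! last interior column
   end do
   call MPI_Irecv(rbuf, ny*km, MPI_REAL8, west_pe, 0, comm, reqs(1), ierr)
   call MPI_Isend(sbuf, ny*km, MPI_REAL8, east_pe, 0, comm, reqs(2), ierr)
   call MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE, ierr)
   do k = 1, km                                     ! unpack into the west halo
      field(1,:,k) = rbuf((k-1)*ny+1:k*ny)
   end do
 end subroutine halo3d_east_west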

16 Simulation rate of gx1v3 on IBM POWER4: 50% of time in the solver

17 Performance of the three code modifications
- 1D data structure
- Space-filling curves
- CICE boundary exchange
- Cumulative impact is huge: separately 10-20% each, together 2.1x on 128 processors. Small improvements add up!

18 Outline
- Motivation
- POP
  - New barotropic solver
  - CICE boundary exchange
  - Space-filling curves
- CICE
- CLM
- CAM + Coupler
- Conclusions

19 Partitioning with Space-filling Curves
- Map the 2D block grid to 1D, then partition the 1D array
- Curves for a variety of sizes (Nb = number of blocks along one side):
  - Hilbert (Nb = 2^n)
  - Peano (Nb = 3^m)
  - Cinco (Nb = 5^p) [new]
  - Hilbert-Peano (Nb = 2^n 3^m)
  - Hilbert-Peano-Cinco (Nb = 2^n 3^m 5^p) [new]
(A Hilbert-curve indexing sketch follows.)
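
The Hilbert curve covers the Nb = 2^n case; a minimal index function is sketched below (assumed names, not the POP/CICE implementation). The Peano and Cinco curves follow the same recursive idea with base-3 and base-5 building blocks, and the mixed curves combine them. Blocks sorted by their curve index are then split into contiguous segments, one per processor.

 module hilbert_sfc
   implicit none
 contains
   ! Position of block (ix,iy) (0-based) along a Hilbert curve covering an
   ! nb x nb block grid, where nb must be a power of two.
   integer function hilbert_index(nb, ix, iy)
     integer, intent(in) :: nb, ix, iy
     integer :: x, y, rx, ry, s, t
     x = ix;  y = iy;  hilbert_index = 0
     s = nb/2
     do while (s > 0)
        rx = 0;  if (iand(x,s) > 0) rx = 1
        ry = 0;  if (iand(y,s) > 0) ry = 1
        hilbert_index = hilbert_index + s*s*ieor(3*rx, ry)
        if (ry == 0) then            ! rotate/reflect the quadrant
           if (rx == 1) then
              x = nb - 1 - x
              y = nb - 1 - y
           end if
           t = x;  x = y;  y = t     ! swap x and y
        end if
        s = s/2
     end do
   end function hilbert_index
 end module hilbert_sfc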

20 Partitioning with SFC: partition for 3 processors

21 POP using 20x24 blocks (gx1v3)

22 POP (gx1v3) + space-filling curve

23 Space-filling curve (Hilbert, Nb = 2^4)

24 Remove land blocks

25 Space-filling curve partition for 8 processors

26 0.1-degree POP
- Global eddy-resolving configuration; computational grid: 3600 x 2400 x 40
- Land creates problems: load imbalances, scalability
- Alternative partitioning algorithm: space-filling curves
- Evaluate using a benchmark: 1 simulated day, internal grid, 7-minute timestep

27 POP 0.1-degree benchmark on Blue Gene/L

28 POP 0.1-degree benchmark (courtesy of Y. Yoshida, M. Taylor, P. Worley): 50% of time in the solver, 33% of time in the 3D update

29 Remaining issues: POP
- Parallel I/O: the current decomposition is in the vertical, so I/O is only parallel for 3D fields and needs all-to-one communication; parallel I/O for 2D fields is needed
- Example: 0.1-degree POP on 30K BG/L processors: time to compute 1 day: 30 seconds; time to read the 2D forcing files: 22 seconds

30 Impact of a 2x increase in simulation rate
- IPCC AR5 control run [1000 years]: at 5 simulated years per wallclock day it takes ~6 months; at 10 years per day, ~3 months
- Huge jump in scientific productivity: search a larger parameter space, run longer sensitivity studies, find and fix problems much quicker
- What about the entire coupled system?

31 Outline
- Motivation
- POP
- CICE
- CLM
- CAM + Coupler
- Conclusions

32 CICE: Sea-ice Model
- Shares grid and infrastructure with POP
- CICE 4.0: not quite ready for general release; sub-block data structures (as in POP2); minimal experience with the code base (<2 weeks)
- Reuse techniques from the POP2 work: partition the grid using weighted space-filling curves?

33 Weighted Space-filling Curves
- Estimate the work for each grid block:
  Work_i = w0 + P_i*w1
  where w0 is the fixed work for all blocks, w1 is the extra work if a block contains sea ice, and P_i is the probability that block i contains sea ice

34 Weighted Space-filling Curves (cont'd)
- The probability that a block contains sea ice depends on the climate scenario: control run, paleoclimate, CO2 doubling
- The probability must be estimated; a bad estimate means a slower simulation rate
- Weight the space-filling curve and partition for equal amounts of work (a minimal sketch follows)
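
A minimal greedy sketch of the weighted partitioning step, assuming the blocks are already ordered along the space-filling curve; the routine and variable names are placeholders, not the CICE code.

 subroutine partition_wsfc(nblocks, npes, p_ice, w0, w1, owner)
   implicit none
   integer, intent(in)  :: nblocks, npes
   real,    intent(in)  :: p_ice(nblocks)   ! P_i: probability block i contains sea ice
   real,    intent(in)  :: w0, w1           ! fixed work, extra work if ice is present
   integer, intent(out) :: owner(nblocks)   ! processor assigned to each block
   real    :: work(nblocks), w_per_pe, acc
   integer :: i, pe

   work = w0 + p_ice*w1                     ! Work_i = w0 + P_i*w1
   w_per_pe = sum(work)/npes                ! ideal work per processor
   pe = 1;  acc = 0.0
   do i = 1, nblocks                        ! cut the curve into equal-work pieces
      owner(i) = pe
      acc = acc + work(i)
      if (acc >= pe*w_per_pe .and. pe < npes) pe = pe + 1
   end do
 end subroutine partition_wsfc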

35 Partitioning with weighted SFC: partition for 5 processors

36 Remaining issues: CICE
- Parallel I/O
- Examine scalability with the weighted SFC: active sea ice covers ~15% of the ocean grid
- Estimated processor counts for 0.1 degree: RedStorm ~4000, Blue Gene/L ~10000. Stay tuned!

37 Outline
- Motivation
- POP
- CICE
- CLM
- CAM + Coupler
- Conclusions

38 Community Land Model (CLM2)
- Fundamentally a scalable code: no communication between grid points
- But it has some serial components:
  - River Transport Model (RTM)
  - Serial I/O (collect on processor 0)

39 What is wrong with just a little serial code? Serial code is Evil!!

40 Why is serial code evil?
- It seems innocent at first but leads to much larger problems
- Serial code causes:
  - A performance bottleneck
  - Excessive memory usage, from collecting everything on one processor and from message-passing bookkeeping (see the sketch below)
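
A sketch of the "collect on processor 0" pattern behind serial I/O (assumed names, not the CLM code): rank 0 must hold the whole field plus O(npes) bookkeeping arrays, so its memory footprint grows with the processor count even when every other rank stays small.

 subroutine write_field_serial(local, nlocal, comm)
   use mpi
   implicit none
   integer, intent(in) :: nlocal, comm
   real(kind=8), intent(in) :: local(nlocal)
   real(kind=8), allocatable :: global(:)
   integer, allocatable :: counts(:), displs(:)
   integer :: npes, me, i, nglobal, ierr

   call MPI_Comm_size(comm, npes, ierr)
   call MPI_Comm_rank(comm, me, ierr)
   allocate(counts(npes), displs(npes))                 ! O(npes) bookkeeping
   call MPI_Gather(nlocal, 1, MPI_INTEGER, counts, 1, MPI_INTEGER, 0, comm, ierr)
   if (me == 0) then
      displs(1) = 0
      do i = 2, npes
         displs(i) = displs(i-1) + counts(i-1)
      end do
      nglobal = sum(counts)
      allocate(global(nglobal))                         ! entire field on rank 0
   else
      allocate(global(1))
   end if
   call MPI_Gatherv(local, nlocal, MPI_REAL8, global, counts, displs, &
                    MPI_REAL8, 0, comm, ierr)
   ! if (me == 0) ... write global(:) to disk with serial I/O ...
   deallocate(global, counts, displs)
 end subroutine write_field_serial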

41 Cost of message-passing information
- Parallel code: each processor communicates with a small number of neighbors, so it stores O(1) bookkeeping information
- A single serial component: one processor communicates with all processors, so it stores O(npes) information

42 Memory usage in subroutine initDecomp
- Four integer arrays of dimension(ancells, npes), where ancells is the number of land sub-grid points (~20,000)
- On 128 processors: 4*4*128*20,000 = 39 MB per processor
- On 1024 processors: 4*4*1024*20,000 = 312 MB per processor
- On 10,000 processors: 4*4*10,000*20,000 = 2.98 GB per processor -> 29 TB across the entire system
(A quick check of these numbers follows.)
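
A quick check of those numbers; the array count, element size, and ancells value are taken from the slide.

 program decomp_memory
   implicit none
   integer, parameter :: narrays = 4, bytes_per_int = 4, ancells = 20000
   integer :: npes(3) = (/ 128, 1024, 10000 /)
   integer :: k
   real    :: mbytes
   do k = 1, 3
      ! four integer arrays dimensioned (ancells, npes)
      mbytes = real(narrays) * bytes_per_int * ancells * npes(k) / 1024.0**2
      print '(a,i6,a,f10.1,a)', 'npes = ', npes(k), '  -> ', mbytes, ' MB per processor'
   end do
 end program decomp_memory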

43 Memory use in CLM
- Subroutine initDecomp deallocates the large arrays
- CLM configuration: 1x1.25 grid, no RTM, MAXPATCH_PFT = 4, no CN or DGVM
- Measure stack and heap on BG/L processors

44 Memory use for CLM on BG/L

45 Non-scalable memory usage
- A common problem: easy to ignore on 128 processors, fatal at large processor counts
- Avoid dimensioning arrays by npes or by a fixed global size
- Eliminate serial code!!
- Re-evaluate initialization code: is it scalable?
- Remember: innocent-looking non-scalable code can kill!

46 Outline
- Motivation
- POP
- CICE
- CLM
- CAM + Coupler
- Conclusions

47 CAM + Coupler
- CAM
  - Extensive benchmarking [P. Worley]
  - Generalizing the interface for modular dynamics and non-lat-lon grids [B. Eaton]: quasi-uniform grids (cubed-sphere, icosahedral)
  - Ported to BG/L [S. Ghosh]: required a rewrite of I/O; FV-core resolution limited by memory
- Coupler
  - Will examine a single-executable concurrent system (Summer 06)

48 A Petascale coupled system
- Design principles: simple/elegant design with attention to implementation details
- Single executable -> runs on anything vendors provide
- Minimize communication hotspots: concurrent execution creates hotspots, e.g. it wastes bisection bandwidth by passing fluxes to the coupler

49 A Petascale coupled system (cont'd)
- Sequential execution: flux interpolation is just a boundary exchange; simplifies the cost budget; all components must be scalable
- Quasi-uniform grids: flux interpolation should be communication with a small number of nearest neighbors, minimizing interpolation costs

50 Possible Configuration
- CAM (100 km, L66)
- 0.1-degree ocean (demonstrated at 30 seconds per simulated day)
- 0.1-degree sea ice
- Land model (50 km)
- Sequential coupler

51 High-Resolution CCSM on ~30K BG/L processors: time-per-day budget (seconds)
- Ocean (0.1 degree): demonstrated [03/29/06], 30
- Sea ice (0.1 degree): not yet demonstrated [Summer 06], 8
- Land (50 km): not yet demonstrated [Summer 06], 5
- Atm + Chem (100 km): not yet demonstrated [Fall 06], 77
- Coupler: not yet demonstrated [Fall 06], 10
- Total [Spring 07]: 130 seconds per simulated day, i.e. ~1.8 years/wallclock day

52 Conclusions
- Examined the scalability of several components on BG/L: stress the limits of resolution and processor count, uncover problems in the code
- It is possible to use large processor counts for 0.1-degree POP; results obtained by modifying ~9 files
  - BGW, 30K processors --> 7.9 years/wallclock day; 33% of time in the 3D update -> CICE boundary exchange
  - RedStorm, 8K processors --> 8.1 years/wallclock day; 50% of time in the solver -> use a preconditioner

53 Conclusions (cont'd)
- CICE needs improved load balancing (weighted SFC)
- CLM needs a parallel RTM, parallel I/O, and cleanup of non-scalable data structures
- Common issues:
  - Focus on returning advances into the models: vector mods in POP? parallel I/O in CAM? high-resolution CRIEPI work?
  - Parallel I/O
  - Eliminate all serial code!
  - Watch the memory usage

54 Conclusions (cont'd)
- Efficient use of a Petascale system is possible!
- Path to Petascale computing:
  1. Test the limits of our codes
  2. Fix the resulting problems
  3. Go to 1.

55 Acknowledgements / Questions?
- Thanks to: D. Bailey (NCAR), F. Bryan (NCAR), T. Craig (NCAR), J. Edwards (IBM), E. Hunke (LANL), B. Kadlec (CU), E. Jessup (CU), P. Jones (LANL), K. Lindsay (NCAR), W. Lipscomb (LANL), M. Taylor (SNL), H. Tufo (NCAR), M. Vertenstein (NCAR), S. Weese (NCAR), P. Worley (ORNL)
- Computer time:
  - Blue Gene/L time: NSF MRI Grant, NCAR, University of Colorado, IBM (SUR) program, BGW Consortium Days, IBM Research (Watson), LLNL
  - RedStorm time: Sandia

56 1D data structure solver: one iteration of the preconditioned CG loop

 eta1_local = 0.0D0
 do i=1,nActive
    Z(i) = Minv2(i)*R(i)                  ! Apply the diagonal preconditioner
    eta1_local = eta1_local + R(i)*Z(i)   !*** (r,(PC)r)
 enddo
 Z(iptrHalo:n) = Minv2(iptrHalo:n)*R(iptrHalo:n)
 !
 ! update conjugate direction vector s
 !
 if (lprecond) call update_halo(Z)
 eta1 = global_sum(eta1_local,distrb_tropic)
 cg_beta = eta1/eta0
 do i=1,n
    S(i) = Z(i) + S(i)*cg_beta
 enddo
 call matvec(n,A,Q,S)
 !
 ! compute next solution and residual
 !
 call update_halo(Q)
 eta0 = eta1
 rtmp_local = 0.0D0
 do i=1,nActive
    rtmp_local = rtmp_local + Q(i)*S(i)
 enddo
 rtmp = global_sum(rtmp_local,distrb_tropic)
 eta1 = eta0/rtmp
 do i=1,n
    X(i) = X(i) + eta1*S(i)
    R(i) = R(i) - eta1*Q(i)
 enddo

57 Original 2D (block) data structure solver: the same iteration

 do iblock=1,nblocks_tropic
    this_block = get_block(blocks_tropic(iblock),iblock)
    if (lprecond) then
       call preconditioner(WORK1,R,this_block,iblock)
    else
       where (A0(:,:,iblock) /= c0)
          WORK1(:,:,iblock) = R(:,:,iblock)/A0(:,:,iblock)
       elsewhere
          WORK1(:,:,iblock) = c0
       endwhere
    endif
    WORK0(:,:,iblock) = R(:,:,iblock)*WORK1(:,:,iblock)
 end do ! block loop
 !
 ! update conjugate direction vector s
 !
 if (lprecond) &
    call update_ghost_cells(WORK1, bndy_tropic, field_loc_center, &
                            field_type_scalar)
 !*** (r,(PC)r)
 eta1 = global_sum(WORK0, distrb_tropic, field_loc_center, RCALCT_B)
 do iblock=1,nblocks_tropic
    this_block = get_block(blocks_tropic(iblock),iblock)
    S(:,:,iblock) = WORK1(:,:,iblock) + S(:,:,iblock)*(eta1/eta0)
    !
    ! compute As
    !
    call btrop_operator(Q,S,this_block,iblock)
    WORK0(:,:,iblock) = Q(:,:,iblock)*S(:,:,iblock)
 end do ! block loop
 !
 ! compute next solution and residual
 !
 call update_ghost_cells(Q, bndy_tropic, field_loc_center, &
                         field_type_scalar)
 eta0 = eta1
 eta1 = eta0/global_sum(WORK0, distrb_tropic, &
                        field_loc_center, RCALCT_B)
 do iblock=1,nblocks_tropic
    this_block = get_block(blocks_tropic(iblock),iblock)
    X(:,:,iblock) = X(:,:,iblock) + eta1*S(:,:,iblock)
    R(:,:,iblock) = R(:,:,iblock) - eta1*Q(:,:,iblock)
    if (mod(m,solv_ncheck) == 0) then
       call btrop_operator(R,X,this_block,iblock)
       R(:,:,iblock) = B(:,:,iblock) - R(:,:,iblock)
       WORK0(:,:,iblock) = R(:,:,iblock)*R(:,:,iblock)
    endif
 end do ! block loop

58 Piece of the 1D data structure solver

 !
 ! compute next solution and residual
 !
 call update_halo(Q)                           ! Update halo
 eta0 = eta1
 rtmp_local = 0.0D0
 do i=1,nActive                                ! Dot product
    rtmp_local = rtmp_local + Q(i)*S(i)
 enddo
 rtmp = global_sum(rtmp_local,distrb_tropic)
 eta1 = eta0/rtmp
 do i=1,n                                      ! Update vectors
    X(i) = X(i) + eta1*S(i)
    R(i) = R(i) - eta1*Q(i)
 enddo

59 POP 0.1 degree
[Table: block size vs. number of blocks (Nb, Nb^2) and maximum parallelism (Max ||), with trade-off arrows: increasing parallelism vs. decreasing overhead.]

60 Serial execution time on multiple platforms (test grid)

61 The unexpected problem: just because your code scales to N processors does not mean it will scale to k*N, where k >= 4.