Chip-Multiprocessors & You
John Dennis, Software Engineering Working Group Meeting, March 16, 2007

Intel "Tera Chip"
- 80-core chip
- 1 teraflop
- 3.16 GHz / 0.95 V / 62 W
- Process: 45 nm technology, high-k
- 2D mesh network
  - Each processor has a 5-port router
  - Connects to "3D memory"

Outline
- Chip-Multiprocessor
- Parallel I/O library (PIO)
- Fun with Large Processor Counts
  - POP
  - CICE

Moore's Law
- Most things are twice as nice every 18 months:
  - Transistor count
  - Processor speed
  - DRAM density
- Historical result:
  - Solve a problem twice as large in the same time
  - Solve the same-size problem in half the time
  --> Inactivity leads to progress!

The Advent of Chip-Multiprocessors
- Moore's Law gone bad!

New Implications of Moore's Law
- Every 18 months:
  - Number of cores per socket doubles
  - Memory density doubles
  - Clock rate may increase slightly
- 18 months from now:
  - 8 cores per socket
  - Slight increase in clock rate (~15%)
  - Same memory per core!!

New Implications of Moore's Law (cont'd)
- Inactivity leads to no progress!
- Possible outcomes:
  - Same problem size / same parallelism: solve the problem ~15% faster
  - Bigger problem size, scalable memory: more processors enable a ~2x reduction in time to solution
  - Bigger problem size, non-scalable memory: may limit the number of processors that can be used; waste half the cores on each socket just to use its memory?
- All components of an application must scale to benefit from Moore's Law increases!
- The memory footprint problem will not solve itself!

Questions?

Parallel I/O Library (PIO)
John Dennis, Ray Loy

Introduction
- All component models need parallel I/O
- Serial I/O is bad:
  - Increased memory requirement
  - Typically a negative impact on performance
- Primary developers: J. Dennis, R. Loy
- Necessary for POP BGW runs

Design Goals
- Provide parallel I/O for all component models
- Encapsulate complexity in the library
- Simple interface for component developers to implement

Design Goals (cont'd)
- Extensible for future I/O technology
- Backward compatible (node=0)
- Support for multiple formats:
  - {sequential, direct} binary
  - netCDF
- Preserve the format of input/output files
- Supports 1D, 2D, and 3D arrays
  - Currently XY; extensible to XZ or YZ

Terms and Concepts
- PnetCDF [ANL]:
  - High-performance I/O
  - Different interface
  - Stable
- netCDF4 + HDF5 [NCSA]:
  - Same interface
  - Needs the HDF5 library
  - Less stable
  - Lower performance
  - No support on Blue Gene
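To make the PnetCDF option concrete, below is a minimal sketch of a collective PnetCDF write in C. It is not the PIO interface itself; the file name "out.nc", the variable name "temp", and the latitude-band decomposition are illustrative assumptions.

```c
/* Minimal PnetCDF sketch (not PIO's API): each rank writes its latitude band
 * of a global nlat x nlon field with a collective call. Error checking omitted. */
#include <mpi.h>
#include <pnetcdf.h>

void write_slab(MPI_Comm comm, int nlat, int nlon,
                int lat0, int nlat_local, const double *local)
{
    int ncid, dimids[2], varid;
    ncmpi_create(comm, "out.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "lat", nlat, &dimids[0]);
    ncmpi_def_dim(ncid, "lon", nlon, &dimids[1]);
    ncmpi_def_var(ncid, "temp", NC_DOUBLE, 2, dimids, &varid);
    ncmpi_enddef(ncid);

    MPI_Offset start[2] = { lat0, 0 };           /* this rank's first latitude row */
    MPI_Offset count[2] = { nlat_local, nlon };  /* rows x all longitudes */
    ncmpi_put_vara_double_all(ncid, varid, start, count, local);  /* collective write */
    ncmpi_close(ncid);
}
```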

Terms and Concepts (cont'd)
- Processor stride: allows matching a subset of MPI I/O nodes to the system hardware
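As a sketch of what "processor stride" buys, the snippet below (an assumption about the mechanism, not PIO's actual code) selects every stride-th MPI rank as an I/O task and gives those ranks their own communicator.

```c
/* Illustrative only: choose every "stride"-th rank as an I/O task. */
#include <mpi.h>

MPI_Comm make_io_comm(MPI_Comm comm, int stride)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    int color = (rank % stride == 0) ? 0 : MPI_UNDEFINED;  /* I/O tasks: 0, stride, 2*stride, ... */
    MPI_Comm io_comm;
    MPI_Comm_split(comm, color, rank, &io_comm);
    return io_comm;   /* MPI_COMM_NULL on ranks that do no I/O */
}
```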

Terms and Concepts (cont'd)
- IO decomposition vs. COMP decomposition:
  - IO decomp == COMP decomp: MPI-IO + message aggregation
  - IO decomp != COMP decomp: need a rearranger (MCT)
- No component-specific info in the library; pair it with existing communication technology
- 1D arrays in the library; the component must flatten 2D and 3D arrays
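The "flatten to 1D" requirement can be pictured with the hypothetical helper below, which builds a global-degree-of-freedom (GDOF) list for a rectangular patch of an nlat x nlon grid; the function name and layout are illustrative, not part of the library.

```c
/* Sketch: describe a rank's rectangular patch of an nlat x nlon grid as a flat
 * list of global 1-D offsets (one GDOF per local element, in storage order). */
#include <stdlib.h>

long *build_gdof(int nlon, int lat0, int lon0, int nlat_l, int nlon_l)
{
    long *gdof = malloc((size_t)nlat_l * nlon_l * sizeof *gdof);
    for (int j = 0; j < nlat_l; j++)
        for (int i = 0; i < nlon_l; i++)
            gdof[j * nlon_l + i] = (long)(lat0 + j) * nlon + (lon0 + i);
    return gdof;
}
```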

Component Model Issues
- POP & CICE:
  - Missing blocks
  - Update of neighbors' halos
  - Who writes the missing blocks?
  - Asymmetry between read and write
  - 'Sub-block' decompositions are not rectangular
- CLM:
  - Decomposition is not rectangular
  - Who writes the missing data?

What Works
- Binary I/O [direct]:
  - Tested on POWER5, BG/L
  - Rearrange with MCT + MPI-IO
  - MPI-IO with no rearrangement
- netCDF:
  - Rearrange with MCT [new]
  - Reduced memory
- PnetCDF:
  - Rearrange with MCT
  - No rearrangement
  - Tested on POWER5, BG/L
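For the "MPI-IO, no rearrangement" binary path, a minimal sketch is shown below; the file name and the assumption that each rank holds one contiguous chunk ordered by rank are illustrative.

```c
/* Sketch of direct binary output: every rank writes its contiguous chunk of
 * doubles at its global offset with a collective MPI-IO call. */
#include <mpi.h>

void write_binary_direct(MPI_Comm comm, const double *local, int nlocal,
                         MPI_Offset elems_before_me)
{
    MPI_File fh;
    MPI_File_open(comm, "restart.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset offset = elems_before_me * (MPI_Offset)sizeof(double);
    MPI_File_write_at_all(fh, offset, local, nlocal, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);   /* collective write */
    MPI_File_close(&fh);
}
```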

What Works (cont'd)
- Prototype added to POP2:
  - Reads restart and forcing files correctly
  - Writes binary restart files correctly
  - Necessary for BGW runs
- Prototype implementation in HOMME [J. Edwards]:
  - Writes netCDF history files correctly
- POPIO benchmark:
  - 2D array [3600 x 2400] (70 MB)
  - Test code for correctness and performance
  - Tested on 30K BG/L processors in Oct 2006
- Performance:
  - POWER5: 2-3x the serial I/O approach
  - BG/L: mixed

Complexity / Remaining Issues
- Multiple ways to express a decomposition:
  - GDOF: global degree of freedom --> (MCT, MPI-IO)
  - Subarrays: start + count (PnetCDF); limited expressiveness, will not support 'sub-block' decompositions in POP & CICE, or CLM
- Need a common language for the interface between the component model and the library
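The expressiveness gap can be illustrated with the hypothetical check below: a GDOF list can be handed to a start+count interface only when it describes a dense rectangle, which the scattered 'sub-block' and CLM decompositions do not. The grid layout and function name are assumptions for illustration, and duplicate entries are assumed absent.

```c
/* Sketch: does a GDOF list over an nlat x nlon grid fill its bounding box
 * exactly (and so fit a single start+count subarray)? Assumes no duplicates. */
#include <stdbool.h>
#include <stddef.h>

bool fits_start_count(const long *gdof, size_t n, int nlon)
{
    if (n == 0) return true;
    long rmin = gdof[0] / nlon, rmax = rmin;
    long cmin = gdof[0] % nlon, cmax = cmin;
    for (size_t k = 1; k < n; k++) {
        long r = gdof[k] / nlon, c = gdof[k] % nlon;
        if (r < rmin) rmin = r;
        if (r > rmax) rmax = r;
        if (c < cmin) cmin = c;
        if (c > cmax) cmax = c;
    }
    /* a dense rectangle contains exactly rows x cols points */
    return (size_t)((rmax - rmin + 1) * (cmax - cmin + 1)) == n;
}
```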

Conclusions
- Working prototype:
  - POP2 for binary I/O
  - HOMME for netCDF
- PIO telecon: progress discussed every 2 weeks
- Work in progress; multiple efforts underway; accepting help
- In the CCSM Subversion repository

Fun with Large Processor Counts: POP, CICE
John Dennis

Motivation
- Can the Community Climate System Model (CCSM) be a petascale application?
- Use K processors per simulation
- Increasingly common access to large systems:
  - ORNL Cray XT3/4: 20K [2-3 weeks]
  - ANL Blue Gene/P: 160K [Jan 2008]
  - TACC Sun: 55K [Jan 2008]
- Petascale for the masses?
  - Lag time in the Top 500 list [4-5 years]: NCAR before 2015

Outline
- Chip-Multiprocessor
- Parallel I/O library (PIO)
- Fun with Large Processor Counts
  - POP
  - CICE

Status of POP
- Access to 17K Cray XT4 processors:
  - 12.5 years/day [current record]
  - 70% of time in the solver
- Won a BGW cycle allocation: "Eddy Stirring: The Missing Ingredient in Nailing Down Ocean Tracer Transport" [J. Dennis, F. Bryan, B. Fox-Kemper, M. Maltrud, J. McClean, S. Peacock]
  - 110 rack-days / 5.4M CPU hours
  - 20-year 0.1° POP simulation
  - Includes a suite of dye-like tracers
  - Simulate the eddy diffusivity tensor

Status of POP (cont'd)
- Allocation will occur over ~7 days
- Run in production on 30K processors
- Needs parallel I/O to write history files
- Start runs in 4-6 weeks

Outline
- Chip-Multiprocessor
- Parallel I/O library (PIO)
- Fun with Large Processor Counts
  - POP
  - CICE

Status of CICE
- Tested at 1/10°:
  - 10K Cray XT4 processors
  - 40K IBM Blue Gene processors [BGW days]
- Uses weighted space-filling curves (wSFC):
  - erfc weighting
  - climatology weighting

POP (gx1v3) + Space-Filling Curve [figure]

Space-Filling Curve Partition for 8 Processors [figure]

Weighted Space-Filling Curves
- Estimate the work for each grid block:
  Work_i = w0 + P_i * w1
  where:
  - w0: fixed work for all blocks
  - w1: additional work if the block contains sea ice
  - P_i: probability that the block contains sea ice
- For our experiments: w0 = 2, w1 = 10
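A direct transcription of this weight into code is straightforward; the function name is illustrative.

```c
/* Per-block work estimate used to weight the space-filling curve:
 * work_i = w0 + P_i * w1, with w0 = 2 and w1 = 10 as on the slide. */
double block_work(double p_ice)   /* p_ice: probability the block contains sea ice */
{
    const double w0 = 2.0;        /* fixed cost paid by every block */
    const double w1 = 10.0;       /* extra cost when the block carries sea ice */
    return w0 + p_ice * w1;
}
```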

Probability Function
- Error function:
  P_i = erfc((θ - max(|lat_i|)) / σ)
  where:
  - lat_i: latitudes in block i
  - θ: mean sea-ice extent
  - σ: variance in sea-ice extent
- θ_NH = 70°, θ_SH = 60°, σ = 5°
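This maps to a one-line use of the C library's erfc; the function and argument names are illustrative.

```c
/* erfc-based weighting from the slide: blocks whose maximum |latitude| lies
 * poleward of the mean ice-edge latitude theta get a larger probability.
 * Slide values: theta = 70 (NH) or 60 (SH) degrees, sigma = 5 degrees. */
#include <math.h>

double ice_probability(double max_abs_lat_deg, double theta_deg, double sigma_deg)
{
    return erfc((theta_deg - max_abs_lat_deg) / sigma_deg);
}
```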

1° CICE4 on 20 Processors [figure]
- Small domains at high latitudes, large domains at low latitudes

0.1° CICE4
- Developed at LANL
- Finite difference
- Models sea ice
- Shares grid and infrastructure with POP; reuses techniques from the POP work
- Computational grid: [3600 x 2400 x 20]
- Computational load imbalance creates problems:
  - ~15% of the grid has sea ice
  - Use weighted space-filling curves?
- Evaluate using a benchmark: 1 day / initial run / 30-minute timestep / no forcing

0.1° [figure]

Timings for 1°, npes=160, θ_NH = 70° [figure]
- Load imbalance: Hudson Bay lies south of 70°

Timings for 1°, npes=160, θ_NH = 55° [figure]

Better Probability Function
- Climatological function [formula not captured in the transcript], where:
  - the climatological maximum sea-ice extent at each grid point comes from satellite observation
  - n_i is the number of points within block i with non-zero climatological sea-ice extent

Timings for 1°, npes=160, climatology-based weighting [figure]
- Reduces dynamics sub-cycling time by 28%!

Acknowledgements / Questions?
- Thanks to: D. Bailey (NCAR), F. Bryan (NCAR), T. Craig (NCAR), J. Edwards (IBM), E. Hunke (LANL), B. Kadlec (CU), E. Jessup (CU), P. Jones (LANL), K. Lindsay (NCAR), W. Lipscomb (LANL), M. Taylor (SNL), H. Tufo (NCAR), M. Vertenstein (NCAR), S. Weese (NCAR), P. Worley (ORNL)
- Computer time:
  - Blue Gene/L time: NSF MRI grant, NCAR, University of Colorado, IBM (SUR) program, BGW Consortium Days, IBM Research (Watson)
  - Cray XT3/4 time: ORNL, Sandia

Partitioning with Space-Filling Curves
- Map 2D -> 1D
- Variety of sizes:
  - Hilbert (Nb = 2^n)
  - Peano (Nb = 3^m)
  - Cinco (Nb = 5^p)
  - Hilbert-Peano (Nb = 2^n 3^m)
  - Hilbert-Peano-Cinco (Nb = 2^n 3^m 5^p)
- Partition the 1D array of Nb blocks
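For reference, the 2D-to-1D step for the Hilbert case can be done with the standard bit-twiddling index computation sketched below (an illustration of the general idea, not the code used in POP/CICE); the Peano and Cinco variants apply the same recursion with factors of 3 and 5.

```c
/* Map (x, y) on an n x n grid (n a power of two) to its distance d along a
 * Hilbert curve. Blocks sorted by d are then cut into equal-work partitions. */
unsigned int hilbert_xy2d(unsigned int n, unsigned int x, unsigned int y)
{
    unsigned int d = 0;
    for (unsigned int s = n / 2; s > 0; s /= 2) {
        unsigned int rx = (x & s) > 0;
        unsigned int ry = (y & s) > 0;
        d += s * s * ((3 * rx) ^ ry);
        if (ry == 0) {                        /* rotate the quadrant */
            if (rx == 1) { x = s - 1 - x; y = s - 1 - y; }
            unsigned int t = x; x = y; y = t;
        }
    }
    return d;
}
```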

Scalable Data Structures
- A common problem among applications:
  - WRF: serial I/O [fixed]; duplication of lateral boundary values
  - POP & CICE: serial I/O
  - CLM: serial I/O; duplication of grid info

Scalable Data Structures (cont'd)
- CAM: serial I/O; lookup tables
- CPL: serial I/O; duplication of grid info
- The memory footprint problem will not solve itself!

Remove Land Blocks [figure]

Case Study: Memory Use in CLM
- CLM configuration:
  - 1x1.25 grid
  - No RTM
  - MAXPATCH_PFT = 4
  - No CN, DGVM
- Measure stack and heap on BG/L processors

Memory Use of CLM on BG/L [figure]

Motivation (cont'd)
- Multiple efforts underway:
  - CAM scalability + high-resolution coupled simulation [A. Mirin]
  - Sequential coupler [M. Vertenstein, R. Jacob]
  - Single-executable coupler [J. Wolfe]
  - CCSM on Blue Gene [J. Wolfe, R. Loy, R. Jacob]
  - HOMME in CAM [J. Edwards]

Outline
- Chip-Multiprocessor
- Fun with Large Processor Counts
  - POP
  - CICE
  - CLM
- Parallel I/O library (PIO)

Status of CLM
- Work of T. Craig:
  - Elimination of global memory
  - Reworking of decomposition algorithms
  - Addition of PIO
- Short-term goal:
  - Participation in BGW days, June 2007
  - Investigate scalability at 1/10°

Status of CLM Memory Usage
- May 1, 2006:
  - Memory usage increases with processor count
  - Can run 1x1.25 on processors of BG/L
- July 10, 2006:
  - Memory usage scales to an asymptote
  - Can run 1x1.25 on 32-2K processors of BG/L
  - ~350 persistent global arrays [24 1/10 degree]
- January 2007:
  - ~150 persistent global arrays [10.5 1/10 degree]
  - 1/2 degree runs on 32-2K BG/L processors
- February 2007:
  - 18 persistent global arrays [1.2 1/10 degree]
- Target:
  - No persistent global arrays
  - 1/10 degree runs on a single rack of BG/L

Proposed Petascale Experiment
- Ensemble of 10 runs / 200 years
- Petascale configuration:
  - CAM (30 km, L66)
  - 0.1°: 12.5 years / wall-clock day [17K Cray XT4 processors]
  - 0.1°: 42 years / wall-clock day [10K Cray XT3 processors]
  - Land 0.1°
- Sequential design (105 days per run): 32K BG/L / 10K XT3 processors
- Concurrent design (33 days per run): 120K BG/L / 42K XT3 processors

POPIO Benchmark on BGW [figure]

CICE Results (cont'd)
- Correct weighting increases the simulation rate
- wSFC works best at high resolution
- Variable-sized domains:
  - Large domains at low latitudes -> higher boundary-exchange cost
  - Small domains at high latitudes -> lower floating-point cost
- Optimal balance of computational and communication cost? Work in progress!