Chip-Multiprocessors & You
John Dennis (dennis@ucar.edu)
March 16, 2007
Intel "Tera Chip"
- 80-core chip, 1 teraflop
- 3.16 GHz / 0.95 V / 62 W
- Process: 45 nm technology, high-k
- 2D mesh network; each core has a 5-port router
- Connects to "3D" stacked memory
Outline
- Chip-Multiprocessors
- Parallel I/O library (PIO)
- Fun with Large Processor Counts
  - POP
  - CICE
Moore's Law
- Most things get twice as nice every 18 months:
  - Transistor count
  - Processor speed
  - DRAM density
- Historical result:
  - Solve a problem twice as large in the same time
  - Solve the same-size problem in half the time
  - --> Inactivity leads to progress!
The advent of chip-multiprocessors
- Moore's Law gone bad!
New implications of Moore's Law
- Every 18 months:
  - Number of cores per socket doubles
  - Memory density doubles
  - Clock rate may increase slightly
- 18 months from now:
  - 8 cores per socket
  - Slight increase in clock rate (~15%)
  - Same memory per core!!
New implications of Moore's Law (cont.)
- Inactivity leads to no progress!
- Possible outcomes:
  - Same problem size / same parallelism: solve the problem ~15% faster
  - Bigger problem size with scalable memory: more processors enable a ~2x reduction in time to solution
  - Non-scalable memory: may limit the number of processors that can be used -- or waste half the cores on each socket just to use its memory?
- All components of an application must scale to benefit from Moore's Law increases!
- The memory footprint problem will not solve itself!
Questions?
Parallel I/O library (PIO)
John Dennis (dennis@ucar.edu), Ray Loy (rloy@mcs.anl.gov)
March 16, 2007
Introduction
- All component models need parallel I/O
- Serial I/O is bad!
  - Increased memory requirement
  - Typically a negative impact on performance
- Primary developers: J. Dennis, R. Loy
- Necessary for POP BGW runs
Design goals
- Provide parallel I/O for all component models
- Encapsulate complexity in the library
- Simple interface for component developers to implement
Design goals (cont.)
- Extensible for future I/O technology
- Backward compatible (node = 0)
- Support for multiple formats: {sequential, direct} binary, netCDF
- Preserve the format of input/output files
- Supports 1D, 2D, and 3D arrays
  - Currently XY; extensible to XZ or YZ
Terms and Concepts
- PnetCDF [ANL]
  - High-performance I/O
  - Different interface
  - Stable
- netCDF4 + HDF5 [NCSA]
  - Same interface
  - Needs the HDF5 library
  - Less stable, lower performance
  - No support on Blue Gene
Terms and Concepts (cont.)
- Processor stride: allows a subset of MPI tasks, matched to the system hardware, to act as I/O nodes (see the sketch below)
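A minimal sketch of the stride idea in Python; the function name and arguments are illustrative only, not PIO's actual interface:

```python
def io_task_ranks(nprocs, num_io_tasks, stride, base=0):
    """Pick every `stride`-th MPI rank, starting at `base`, as an I/O task.

    Illustrative helper only (not PIO's API): it shows how a stride maps a
    subset of compute ranks onto I/O duty, e.g. one rank per node board.
    """
    return list(range(base, nprocs, stride))[:num_io_tasks]

# 512 compute ranks, 16 I/O tasks, one every 32 ranks
print(io_task_ranks(nprocs=512, num_io_tasks=16, stride=32))
# [0, 32, 64, ..., 480]
```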
Terms and Concepts (cont.)
- I/O decomposition vs. compute decomposition
  - I/O decomp == compute decomp: MPI-IO + message aggregation
  - I/O decomp != compute decomp: need a rearranger (MCT)
- No component-specific information in the library
- Pair with existing communication technology
- The library handles 1D arrays; the component must flatten 2D and 3D arrays
Component Model 'issues'
- POP & CICE:
  - Missing blocks and halo updates of neighbors
  - Who writes the missing blocks?
  - Asymmetry between read and write
  - 'Sub-block' decompositions are not rectangular
- CLM:
  - Decomposition not rectangular
  - Who writes the missing data?
What works
- Binary I/O [direct]
  - Tested on POWER5, BG/L
  - Rearrange with MCT + MPI-IO
  - MPI-IO with no rearrangement
- netCDF
  - Rearrange with MCT
  - [New] Reduced memory
- PnetCDF
  - Rearrange with MCT, or no rearrangement
  - Tested on POWER5, BG/L
What works (cont.)
- Prototype added to POP2
  - Reads restart and forcing files correctly
  - Writes binary restart files correctly
  - Necessary for BGW runs
- Prototype implementation in HOMME [J. Edwards]
  - Writes netCDF history files correctly
- POPIO benchmark
  - 2D array [3600 x 2400] (~70 MB)
  - Test code for correctness and performance
  - Tested on 30K BG/L processors in October 2006
- Performance
  - POWER5: 2-3x the serial I/O approach
  - BG/L: mixed
Complexity / Remaining Issues
- Multiple ways to express a decomposition (contrasted in the sketch below):
  - GDOF: global degree of freedom --> (MCT, MPI-IO)
  - Subarrays: start + count (PnetCDF)
    - Limited expressiveness; will not support 'sub-block' decompositions in POP & CICE, or CLM
- Need a common language for the interface between the component model and the library
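A small Python sketch of the two descriptions for the same rectangular block; the grid size, variable names, and 1-based GDOF convention are assumptions for illustration, not PIO's interface:

```python
import numpy as np

NX_GLOBAL, NY_GLOBAL = 8, 6      # assumed global grid, for illustration
start, count = (4, 3), (4, 3)    # subarray description: this rank owns a 4x3
                                 # block whose lower-left corner is (i=4, j=3), 0-based

# The same block as a GDOF list: one global index per local element, in the
# order the component stores its flattened local data (1-based here, as a
# Fortran-style component might use).
gdof = np.array([(start[1] + j) * NX_GLOBAL + (start[0] + i) + 1
                 for j in range(count[1])
                 for i in range(count[0])])
print(gdof)  # [29 30 31 32 37 38 39 40 45 46 47 48]

# A GDOF list can also describe an irregular region (e.g. a sub-block with
# land points removed) simply by omitting indices -- start+count cannot.
```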
Conclusions
- Working prototype
  - POP2 for binary I/O
  - HOMME for netCDF
- PIO telecon to discuss progress every 2 weeks
- Work in progress; multiple efforts underway; accepting help
- http://swiki.ucar.edu/ccsm/93
- In the CCSM Subversion repository
Fun with Large Processor Counts: POP, CICE
John Dennis (dennis@ucar.edu)
March 16, 2007
Motivation
- Can the Community Climate System Model (CCSM) be a petascale application?
  - Use 10-100K processors per simulation
- Increasingly common access to large systems:
  - ORNL Cray XT3/4: 20K [2-3 weeks]
  - ANL Blue Gene/P: 160K [Jan 2008]
  - TACC Sun: 55K [Jan 2008]
- Petascale for the masses? Given the 4-5 year lag time down the Top 500 list, at NCAR before 2015
Outline
- Chip-Multiprocessors
- Parallel I/O library (PIO)
- Fun with Large Processor Counts
  - POP
  - CICE
Status of POP
- Access to 17K Cray XT4 processors
  - 12.5 years/day [current record]
  - 70% of time in the solver
- Won a BGW cycle allocation: "Eddy Stirring: The Missing Ingredient in Nailing Down Ocean Tracer Transport" [J. Dennis, F. Bryan, B. Fox-Kemper, M. Maltrud, J. McClean, S. Peacock]
  - 110 rack-days / 5.4M CPU hours
  - 20-year 0.1° POP simulation
  - Includes a suite of dye-like tracers
  - Simulate the eddy diffusivity tensor
Status of POP (cont.)
- Allocation will occur over ~7 days
- Run in production on 30K processors
- Needs parallel I/O to write history files
- Start runs in 4-6 weeks
Outline
- Chip-Multiprocessors
- Parallel I/O library (PIO)
- Fun with Large Processor Counts
  - POP
  - CICE
Status of CICE
- Tested CICE @ 1/10°
  - 10K Cray XT4 processors
  - 40K IBM Blue Gene processors [BGW days]
- Use weighted space-filling curves (wSFC)
  - erfc- and climatology-based weights
[Figure] POP (gx1v3) + space-filling curve
[Figure] Space-filling curve partition for 8 processors
Weighted Space-filling curves
- Estimate the work for each grid block (see the sketch below):
  Work_i = w0 + P_i * w1
  where:
  - w0: fixed work for all blocks
  - w1: additional work if the block contains sea ice
  - P_i: probability that block i contains sea ice
- For our experiments: w0 = 2, w1 = 10
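A minimal sketch of the per-block work estimate, using the slide's values for w0 and w1; the function name is illustrative only:

```python
def block_work(p_ice, w0=2.0, w1=10.0):
    """Estimated work for one grid block: Work_i = w0 + P_i * w1.

    w0    -- fixed cost that every block pays (2 on the slide)
    w1    -- extra cost when the block contains sea ice (10 on the slide)
    p_ice -- probability in [0, 1] that the block contains sea ice
    """
    return w0 + p_ice * w1

# Blocks are laid out along the space-filling curve with these weights, and the
# resulting 1D list is cut into contiguous pieces of roughly equal total work.
print([block_work(p) for p in (0.0, 0.2, 1.0)])   # [2.0, 4.0, 12.0]
```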
Probability Function
- Error function (see the sketch below):
  P_i = erfc((θ - max(|lat_i|)) / σ)
  where:
  - lat_i: maximum latitude in block i
  - θ: mean sea-ice extent
  - σ: variance in sea-ice extent
- θ_NH = 70°, θ_SH = 60°, σ = 5°
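A sketch of the erfc weight with the slide's parameters; note that erfc ranges over (0, 2), and the slide does not say whether the value is rescaled to [0, 1], so no normalization is assumed here:

```python
import math

THETA_NH, THETA_SH, SIGMA = 70.0, 60.0, 5.0   # degrees, from the slide

def p_ice_erfc(max_abs_lat_deg, hemisphere="NH"):
    """Sea-ice weight for a block: P_i = erfc((theta - max|lat_i|) / sigma)."""
    theta = THETA_NH if hemisphere == "NH" else THETA_SH
    return math.erfc((theta - max_abs_lat_deg) / SIGMA)

for lat in (45.0, 65.0, 70.0, 80.0):
    print(lat, round(p_ice_erfc(lat), 3))
# 45° -> ~0 (far equatorward of the ice edge), 70° -> 1.0, 80° -> ~2 (well inside it)
```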
[Figure] 1° CICE4 on 20 processors
- Small domains at high latitudes
- Large domains at low latitudes
0.1° CICE4
- Developed at LANL; finite-difference; models sea ice
- Shares grid and infrastructure with POP, so techniques from the POP work can be reused
- Computational grid: [3600 x 2400 x 20]
- Computational load imbalance creates problems: only ~15% of the grid has sea ice
- Use weighted space-filling curves?
- Evaluate using a benchmark: 1 day / initial run / 30-minute timestep / no forcing
[Figure] CICE4 @ 0.1°
[Figure] Timings for 1°, npes = 160, θ_NH = 70°
- Load imbalance: Hudson Bay lies south of 70°
[Figure] Timings for 1°, npes = 160, θ_NH = 55°
Better Probability Function
- Climatological function (the formula itself appeared as a figure on the slide), based on:
  - The climatological maximum sea-ice extent at each grid point [satellite observation]
  - n_i, the number of points within block i where that extent is non-zero
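The slide's formula is not recoverable from the extracted text; the sketch below is only one plausible reading, under the assumption that P_i is the fraction of points in block i with non-zero climatological ice extent:

```python
import numpy as np

def p_ice_climatology(ice_extent_block):
    """Climatology-based block weight (assumed form, not the slide's exact formula).

    Assumes P_i = n_i / (points in block), where n_i counts the points whose
    satellite-observed climatological maximum sea-ice extent is non-zero.
    """
    n_i = np.count_nonzero(ice_extent_block)
    return n_i / ice_extent_block.size

block = np.array([[0.0, 0.3, 0.8],
                  [0.0, 0.0, 0.5]])
print(p_ice_climatology(block))   # 0.5 -> half of this block's points see ice
```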
[Figure] Timings for 1°, npes = 160, climatology-based weighting
- Reduces dynamics sub-cycling time by 28%!
Acknowledgements / Questions?
Thanks to: D. Bailey (NCAR), F. Bryan (NCAR), T. Craig (NCAR), J. Edwards (IBM), E. Hunke (LANL), B. Kadlec (CU), E. Jessup (CU), P. Jones (LANL), K. Lindsay (NCAR), W. Lipscomb (LANL), M. Taylor (SNL), H. Tufo (NCAR), M. Vertenstein (NCAR), S. Weese (NCAR), P. Worley (ORNL)
Computer time:
- Blue Gene/L time: NSF MRI grant, NCAR, University of Colorado, IBM (SUR) program, BGW Consortium Days, IBM Research (Watson)
- Cray XT3/4 time: ORNL, Sandia
Partitioning with Space-filling Curves
- Map 2D -> 1D (see the sketch below)
- Curves come in a variety of sizes:
  - Hilbert (Nb = 2^n)
  - Peano (Nb = 3^m)
  - Cinco (Nb = 5^p)
  - Hilbert-Peano (Nb = 2^n 3^m)
  - Hilbert-Peano-Cinco (Nb = 2^n 3^m 5^p)
- Partitioning then reduces to cutting the 1D array of Nb x Nb blocks
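A sketch of the 2D -> 1D step for the Hilbert case (Nb = 2^n), using the standard bit-twiddling construction; this is illustrative Python, not the code used in POP/CICE:

```python
def hilbert_index(nb, x, y):
    """Map block coordinates (x, y) on an nb x nb mesh (nb a power of two) to a
    position along the Hilbert space-filling curve."""
    d = 0
    s = nb // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)      # which quadrant, in curve order
        x &= s - 1                        # keep only the local (lower) bits
        y &= s - 1
        if ry == 0:                       # rotate/reflect the sub-square
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

# Order the 16 blocks of a 4x4 mesh along the curve; the ordered list is then
# cut into contiguous, work-weighted pieces, one per processor.
order = sorted(range(16), key=lambda k: hilbert_index(4, k % 4, k // 4))
print([(k % 4, k // 4) for k in order])
```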
Scalable data structures
- A common problem among applications:
  - WRF: serial I/O [fixed]; duplication of lateral boundary values
  - POP & CICE: serial I/O
  - CLM: serial I/O; duplication of grid info
Scalable data structures (cont.)
- CAM: serial I/O; lookup tables
- CPL: serial I/O; duplication of grid info
- The memory footprint problem will not solve itself!
[Figure] Remove land blocks
Case Study: Memory use in CLM
- CLM configuration:
  - 1 x 1.25° grid
  - No RTM
  - MAXPATCH_PFT = 4
  - No CN, no DGVM
- Measure stack and heap on 32-512 BG/L processors
[Figure] Memory use of CLM on BG/L
Motivation (cont.)
- Multiple efforts underway:
  - CAM scalability + high-resolution coupled simulation [A. Mirin]
  - Sequential coupler [M. Vertenstein, R. Jacob]
  - Single-executable coupler [J. Wolfe]
  - CCSM on Blue Gene [J. Wolfe, R. Loy, R. Jacob]
  - HOMME in CAM [J. Edwards]
Outline
- Chip-Multiprocessors
- Fun with Large Processor Counts
  - POP
  - CICE
  - CLM
- Parallel I/O library (PIO)
Status of CLM
- Work of T. Craig:
  - Elimination of global memory
  - Reworking of the decomposition algorithms
  - Addition of PIO
- Short-term goal: participate in BGW days, June 2007; investigate scalability at 1/10°
Status of CLM memory usage
- May 1, 2006: memory usage increases with processor count; can run 1x1.25° on 32-512 BG/L processors
- July 10, 2006: memory usage scales to an asymptote; can run 1x1.25° on 32-2K BG/L processors; ~350 persistent global arrays [24 GB/proc @ 1/10°]
- January 2007: ~150 persistent global arrays [10.5 GB/proc @ 1/10°]; 1/2° runs on 32-2K BG/L processors
- February 2007: 18 persistent global arrays [1.2 GB/proc @ 1/10°]
- Target: no persistent global arrays; 1/10° runs on a single rack of BG/L
(A back-of-the-envelope check of these per-process numbers follows below.)
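The per-process figures are consistent with the 3600 x 2400 horizontal grid quoted earlier for the 0.1° configuration; a quick check, assuming 8-byte (double-precision) arrays and decimal gigabytes:

```python
# Assumes 8-byte reals and the 3600 x 2400 grid of the 0.1-degree configuration.
nx, ny, bytes_per_real = 3600, 2400, 8
per_array_gb = nx * ny * bytes_per_real / 1e9    # ~0.069 GB per global 2D array

for n_arrays in (350, 150, 18):
    print(n_arrays, "arrays ->", round(n_arrays * per_array_gb, 1), "GB per process")
# 350 -> 24.2 GB, 150 -> 10.4 GB, 18 -> 1.2 GB -- close to the slide's numbers
```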
Proposed Petascale Experiment
- Ensemble of 10 runs x 200 years
- Petascale configuration:
  - CAM (30 km, L66)
  - POP @ 0.1°: 12.5 years / wall-clock day [17K Cray XT4 processors]
  - Sea ice @ 0.1°: 42 years / wall-clock day [10K Cray XT3 processors]
  - Land model @ 0.1°
- Sequential design (105 days per run): 32K BG/L / 10K XT3 processors
- Concurrent design (33 days per run): 120K BG/L / 42K XT3 processors
[Figure] POPIO benchmark on BGW
CICE results (cont.)
- Correct weighting increases the simulation rate
- wSFC works best at high resolution
- Variable-sized domains:
  - Large domains at low latitudes -> higher boundary-exchange cost
  - Small domains at high latitudes -> lower floating-point cost
- Optimal balance of computational and communication cost? Work in progress!