
1 Scaling CCSM to a Petascale system. John M. Dennis (dennis@ucar.edu), June 22, 2006

2 Motivation
- Petascale systems with 100K - 500K processors: trend or one-off?
  - LLNL: 128K-processor IBM BG/L
  - IBM Watson: 40K-processor IBM BG/L
  - Sandia: 10K-processor RedStorm
  - ORNL/NCCS: 5K-processor Cray XT3; 10K (end of summer) -> 20K (Nov 06) -> ?
  - ANL: large IBM BG/P system
- We have prototypes for a Petascale system!

3 Motivation (cont.)
- Prototype Petascale application? POP @ 0.1 degree:
  - BGW, 30K processors --> 7.9 simulated years/wallclock day
  - RedStorm, 8K processors --> 8.1 simulated years/wallclock day
- Can CCSM be a Petascale application? Look at each component separately:
  - Current scalability limitations
  - Changes necessary to enable execution on large processor counts
  - Check scalability on BG/L

4 Motivation (cont.)
- Why examine scalability on BG/L?
  - Prototype for a Petascale system
  - Access to large processor counts: 2K easily, 40K through Blue Gene Watson Days
  - Scalable architecture
  - Limited memory: 256 MB (virtual-node mode), 512 MB (coprocessor mode)
  - Dedicated resources give reproducible timings
  - Lessons translate to other systems [Cray XT3]

5 Outline
- Motivation
- POP
- CICE
- CLM
- CAM + Coupler
- Conclusions

6 Parallel Ocean Program (POP)
- Modified the POP 2.0 base code to reduce execution time and improve scalability
- Minor changes (~9 files):
  - Reworked barotropic solver
  - Improved load balancing (space-filling curves)
  - Pilfered the CICE boundary exchange [NEW]
- Significant advances in performance:
  - POP @ 1 degree: 128 POWER4 processors --> 2.1x
  - POP @ 0.1 degree: 30K BG/L processors --> 2x; 8K RedStorm processors --> 1.3x

7 POP using 20x24 blocks (gx1v3)
- POP data structure: flexible block structure with land-block elimination
- Small blocks: better load balance and more land-block elimination, but larger halo overhead
- Large blocks: smaller halo overhead, but load imbalance and no land-block elimination
  (see the sketch after this list)
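To make the halo side of that trade-off concrete, here is a small back-of-the-envelope sketch (not POP code; the ghost-cell width of 2 is an assumption) of the extra storage and update work a halo adds for a few block sizes:

   ! A small sketch (not from the POP source) quantifying the halo-overhead
   ! trade-off above, assuming a ghost-cell width of 2 on each side.
   program halo_overhead
     implicit none
     integer, parameter :: nghost = 2
     integer :: bsx(3) = (/ 36, 20, 12 /)
     integer :: bsy(3) = (/ 24, 24,  8 /)
     integer :: i, interior, padded
     do i = 1, 3
        interior = bsx(i)*bsy(i)
        padded   = (bsx(i) + 2*nghost)*(bsy(i) + 2*nghost)
        print '(a,i3,a,i3,a,f6.1,a)', 'block ', bsx(i), ' x ', bsy(i), &
             ': halo adds ', 100.0*real(padded - interior)/real(interior), &
             ' % extra points'
     end do
   end program halo_overhead

For a 36x24 block the halo adds roughly 30% extra points, while for a 12x8 block it doubles them, which is why the very small blocks pay a noticeable overhead even though they load-balance better.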

8 Outline
- Motivation
- POP
  - New barotropic solver
  - CICE boundary exchange
  - Space-filling curves
- CICE
- CLM
- CAM + Coupler
- Conclusions

9 Alternate Data Structure
- 2D data structure
  - Advantages: regular stride-1 access; compact form of the stencil operator
  - Disadvantages: includes land points; problem-specific data structure
- 1D data structure
  - Advantages: no more land points; general data structure
  - Disadvantages: indirect addressing; larger stencil operator
  (a sketch contrasting the two layouts follows)
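The sketch below is illustrative only (not the POP implementation; names such as nActive, inorth, isouth, ieast, iwest are made up) and shows where the stride-1 access and the indirect addressing come from for a 5-point stencil y = A*x:

   subroutine stencil_2d(nx, ny, a0, an, as, ae, aw, x, y)
      ! 2D layout: stride-1 access and a compact operator, but the loops
      ! also sweep over land points.
      implicit none
      integer, intent(in)  :: nx, ny
      real(8), intent(in)  :: a0(nx,ny), an(nx,ny), as(nx,ny), ae(nx,ny), aw(nx,ny)
      real(8), intent(in)  :: x(nx,ny)
      real(8), intent(out) :: y(nx,ny)
      integer :: i, j
      do j = 2, ny-1
         do i = 2, nx-1
            y(i,j) = a0(i,j)*x(i,j) + an(i,j)*x(i,j+1) + as(i,j)*x(i,j-1) &
                   + ae(i,j)*x(i+1,j) + aw(i,j)*x(i-1,j)
         end do
      end do
   end subroutine stencil_2d

   subroutine stencil_1d(n, nActive, a0, an, as, ae, aw, inorth, isouth, ieast, iwest, x, y)
      ! 1D layout: only the nActive ocean points are stored (halo points sit
      ! at indices nActive+1..n), but neighbors are reached through
      ! precomputed index arrays, i.e. indirect addressing.
      implicit none
      integer, intent(in)  :: n, nActive
      real(8), intent(in)  :: a0(nActive), an(nActive), as(nActive), ae(nActive), aw(nActive)
      integer, intent(in)  :: inorth(nActive), isouth(nActive), ieast(nActive), iwest(nActive)
      real(8), intent(in)  :: x(n)
      real(8), intent(out) :: y(nActive)
      integer :: i
      do i = 1, nActive
         y(i) = a0(i)*x(i) + an(i)*x(inorth(i)) + as(i)*x(isouth(i)) &
              + ae(i)*x(ieast(i)) + aw(i)*x(iwest(i))
      end do
   end subroutine stencil_1d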

10 Using 1D data structures in the POP2 solver (serial)
- Replace solvers.F90
- Measure execution time on cache-based microprocessors
- Examine two CG algorithms with a diagonal preconditioner:
  - PCG2 (2 inner products)
  - PCG1 (1 inner product) [D'Azevedo 93]
- Grid: 'test' [128x192 grid points] with 16x16 blocks

11 Serial execution time on IBM POWER4 ('test' grid) (figure): 56% reduction in cost per iteration

12 Using 1D data structures in the POP2 solver (parallel)
- New parallel halo update
- Examine several CG algorithms with a diagonal preconditioner:
  - PCG2 (2 inner products)
  - PCG1 (1 inner product)
- Existing solver/preconditioner technology: Hypre (LLNL), http://www.llnl.gov/CASC/linear_solvers
  - PCG solver
  - Preconditioners: diagonal

13 Solver execution time for POP2 (20x24 blocks) on BG/L (gx1v3) (figure): 48% and 27% reductions in cost per iteration

14 Outline
- Motivation
- POP
  - New barotropic solver
  - CICE boundary exchange
  - Space-filling curves
- CICE
- CLM
- CAM + Coupler
- Conclusions

15 CICE boundary exchange
- POP applies a 2D boundary exchange to 3D variables; the 3D update is 2-33% of total time
- Specialized 3D boundary exchange: fewer, longer messages, reducing dependence on machine latency
- Pilfer the CICE 4.0 boundary exchange: code reuse! :-)
  (a sketch of the aggregation idea follows)
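The sketch below illustrates the message-aggregation idea with MPI, packing every vertical level of one halo face into a single message per neighbor. It is not the CICE 4.0 routine; the east/west ranks, tag, and decomposition are illustrative.

   subroutine halo_east_west_3d(field, nx, ny, nz, nghost, east_rank, west_rank, comm)
      use mpi
      implicit none
      integer, intent(in)    :: nx, ny, nz, nghost, east_rank, west_rank, comm
      real(8), intent(inout) :: field(nx, ny, nz)   ! includes nghost-wide halos in x
      real(8), allocatable   :: sendbuf(:), recvbuf(:)
      integer :: req(2), ierr, k, j, g, n, cnt, tag

      tag = 42
      cnt = nghost*ny*nz                     ! one buffer covers every level
      allocate(sendbuf(cnt), recvbuf(cnt))

      n = 0                                  ! pack interior east-edge columns, all levels
      do k = 1, nz
         do j = 1, ny
            do g = 1, nghost
               n = n + 1
               sendbuf(n) = field(nx - 2*nghost + g, j, k)
            end do
         end do
      end do

      call mpi_irecv(recvbuf, cnt, MPI_DOUBLE_PRECISION, west_rank, tag, comm, req(1), ierr)
      call mpi_isend(sendbuf, cnt, MPI_DOUBLE_PRECISION, east_rank, tag, comm, req(2), ierr)
      call mpi_waitall(2, req, MPI_STATUSES_IGNORE, ierr)

      n = 0                                  ! unpack into the west halo
      do k = 1, nz
         do j = 1, ny
            do g = 1, nghost
               n = n + 1
               field(g, j, k) = recvbuf(n)
            end do
         end do
      end do
      deallocate(sendbuf, recvbuf)
   end subroutine halo_east_west_3d

A per-level 2D exchange would send nz separate messages along this face; packing them into one buffer sends a single longer message, which is exactly the latency-hiding effect described above.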

16 Simulation rate of POP @ gx1v3 on IBM POWER4 (figure): 50% of time spent in the solver

17 Performance of POP @ gx1v3
- Three code modifications: 1D data structure, space-filling curves, CICE boundary exchange
- Cumulative impact is huge: 10-20% each separately, 2.1x together on 128 processors
- Small improvements add up!

18 Outline
- Motivation
- POP
  - New barotropic solver
  - CICE boundary exchange
  - Space-filling curves
- CICE
- CLM
- CAM + Coupler
- Conclusions

19 Partitioning with Space-filling Curves
- Map 2D -> 1D
- Variety of sizes:
  - Hilbert (Nb = 2^n)
  - Peano (Nb = 3^m)
  - Cinco (Nb = 5^p) [New]
  - Hilbert-Peano (Nb = 2^n 3^m)
  - Hilbert-Peano-Cinco (Nb = 2^n 3^m 5^p) [New]
- Partition the resulting 1D array of Nb x Nb blocks
  (a sketch of the Hilbert mapping follows)
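As a sketch of the 2D -> 1D mapping, here is the standard bit-manipulation form of the Hilbert curve for Nb = 2^n. The production SFC code, which also handles the Peano and Cinco factors, is more general; this is only an illustration of the idea.

   program hilbert_demo
     implicit none
     integer, parameter :: nb = 4          ! 4 x 4 grid of blocks
     integer :: d, x, y
     do d = 0, nb*nb - 1                   ! walk the curve in 1D order
        call hilbert_d2xy(nb, d, x, y)
        print '(a,i3,a,i2,a,i2,a)', 'curve position ', d, ' -> block (', x, ',', y, ')'
     end do
   contains
     subroutine hilbert_d2xy(nb, d, x, y)
       integer, intent(in)  :: nb, d       ! nb must be a power of two
       integer, intent(out) :: x, y
       integer :: rx, ry, s, t, tmp
       x = 0; y = 0; t = d; s = 1
       do while (s < nb)
          rx = iand(1, t/2)
          ry = iand(1, ieor(t, rx))
          if (ry == 0) then                ! rotate/reflect the sub-square
             if (rx == 1) then
                x = s - 1 - x
                y = s - 1 - y
             end if
             tmp = x; x = y; y = tmp       ! swap x and y
          end if
          x = x + s*rx                     ! add this level's offset
          y = y + s*ry
          t = t/4
          s = 2*s
       end do
     end subroutine hilbert_d2xy
   end program hilbert_demo

Consecutive curve positions map to adjacent blocks, so cutting the 1D list into contiguous pieces gives each processor a spatially compact set of blocks.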

20 Partitioning with SFC: partition for 3 processors (figure)

21 POP using 20x24 blocks (gx1v3) (figure)

22 POP (gx1v3) + space-filling curve (figure)

23 Space-filling curve (Hilbert, Nb = 2^4) (figure)

24 Remove land blocks (figure)

25 Space-filling curve partition for 8 processors (figure)

26 0.1 degree POP
- Global eddy-resolving
- Computational grid: 3600 x 2400 x 40
- Land creates problems: load imbalance and poor scalability
- Alternative partitioning algorithm: space-filling curves
- Evaluation benchmark: 1 simulated day, internal grid, 7-minute timestep

27 POP 0.1 degree benchmark on Blue Gene/L (figure)

28 POP 0.1 degree benchmark (figure, courtesy of Y. Yoshida, M. Taylor, P. Worley): 50% of time in the solver, 33% of time in the 3D update

29 Remaining issues: POP
- Parallel I/O:
  - Current scheme decomposes over the vertical, so I/O is only parallel for 3D fields
  - Needs all-to-one communication
  - Need parallel I/O for 2D fields (a sketch of one possible approach follows)
- Example: 0.1 degree POP on 30K BG/L processors
  - Time to compute 1 day: 30 seconds
  - Time to read the 2D forcing files: 22 seconds
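For illustration, a parallel read of a 2D forcing field could look roughly like the MPI-IO sketch below, where every process reads its own slice at its own offset instead of funneling the field through one processor. This is an assumption about one possible approach, not the POP I/O code; the file name, record layout, and decomposition are invented.

   subroutine read_2d_forcing_parallel(nlocal, my_offset, comm, local)
      use mpi
      implicit none
      integer, intent(in)  :: nlocal      ! number of points owned by this process
      integer, intent(in)  :: my_offset   ! offset (in points) into the global field
      integer, intent(in)  :: comm
      real(8), intent(out) :: local(nlocal)
      integer :: fh, ierr
      integer(kind=MPI_OFFSET_KIND) :: disp

      call mpi_file_open(comm, 'forcing_2d.bin', MPI_MODE_RDONLY, &
                         MPI_INFO_NULL, fh, ierr)
      disp = 8_MPI_OFFSET_KIND * my_offset        ! 8-byte reals: point offset -> bytes
      call mpi_file_read_at_all(fh, disp, local, nlocal, MPI_DOUBLE_PRECISION, &
                                MPI_STATUS_IGNORE, ierr)
      call mpi_file_close(fh, ierr)
   end subroutine read_2d_forcing_parallel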

30 Impact of a 2x increase in simulation rate
- IPCC AR5 control run [1000 years]:
  - 5 years per day ~= 6 months of computing
  - 10 years per day ~= 3 months of computing
- Huge jump in scientific productivity:
  - Search a larger parameter space
  - Longer sensitivity studies -> find and fix problems much more quickly
- What about the entire coupled system?

31 Outline
- Motivation
- POP
- CICE
- CLM
- CAM + Coupler
- Conclusions

32 CICE: Sea-ice Model
- Shares grid and infrastructure with POP
- CICE 4.0:
  - Not quite ready for general release
  - Sub-block data structures (as in POP2)
  - Minimal experience with the code base (<2 weeks)
- Reuse techniques from the POP2 work
- Partition the grid using weighted space-filling curves?

33 Weighted Space-filling Curves
- Estimate the work for each grid block:
    Work_i = w0 + P_i * w1
  - w0: fixed work for all blocks
  - w1: additional work if the block contains sea ice
  - P_i: probability that the block contains sea ice

34 Weighted Space-filling Curves (cont.)
- The probability that a block contains sea ice depends on the climate scenario:
  - Control run
  - Paleo
  - CO2 doubling
- The probability is only an estimate; a bad estimate -> slower simulation rate
- Weight the space-filling curve and partition for equal amounts of work
  (a sketch of the weighted partition follows)
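A minimal sketch of how the weighted partition could work (assumed, not the CICE implementation; the probabilities, weights, and block count are invented): estimate Work_i for each block, then cut the SFC-ordered block list into contiguous segments of roughly equal total work.

   program weighted_sfc_partition
     implicit none
     integer, parameter :: nblocks = 12, npes = 3
     real(8), parameter :: w0 = 1.0d0, w1 = 4.0d0
     real(8) :: p(nblocks), work(nblocks), goal, acc
     integer :: owner(nblocks), i, pe

     ! illustrative sea-ice probabilities along the curve (polar blocks high)
     p = (/ 0.9d0, 0.8d0, 0.1d0, 0.0d0, 0.0d0, 0.0d0, &
            0.0d0, 0.0d0, 0.1d0, 0.7d0, 0.9d0, 1.0d0 /)
     work = w0 + p*w1                       ! Work_i = w0 + P_i*w1

     goal = sum(work)/npes                  ! equal-work target per processor
     pe = 1; acc = 0.0d0
     do i = 1, nblocks                      ! greedy cut along the 1D curve
        owner(i) = pe
        acc = acc + work(i)
        if (acc >= goal .and. pe < npes) then
           pe = pe + 1
           acc = 0.0d0
        end if
     end do
     print '(a,12i3)', 'block owners along the curve: ', owner
   end program weighted_sfc_partition

With this weighting a processor that owns mostly ice-free blocks simply owns more of them, which is the equal-work behavior the unweighted curve cannot provide.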

35 Partitioning with w-SFC: partition for 5 processors (figure)

36 Remaining issues: CICE
- Parallel I/O
- Examine scalability with w-SFC:
  - Active sea ice covers ~15% of the ocean grid
  - Estimate for 0.1 degree: RedStorm ~4,000 processors; Blue Gene/L ~10,000 processors
- Stay tuned!

37 Outline
- Motivation
- POP
- CICE
- CLM
- CAM + Coupler
- Conclusions

38 Community Land Model (CLM2)
- Fundamentally a scalable code: no communication between grid points
- But it has some serial components:
  - River Transport Model (RTM)
  - Serial I/O (collect on processor 0)

39 What is wrong with just a little serial code? Serial code is evil!!

40 Why is serial code evil?
- Seems innocent at first, but leads to much larger problems
- Serial code is a performance bottleneck
- Serial code causes excessive memory usage:
  - Collecting data on one processor
  - Storing message-passing information

41 Cost of message-passing information
- Parallel code: each processor communicates with a small number of neighbors -> O(1) information
- A single serial component: one processor communicates with all processors -> O(npes) information

42 Memory usage in subroutine initDecomp
- Four integer arrays: dimension(ancells, npes)
  - ancells: number of land sub-grid points (~20,000)
- On 128 processors: 4*4*128*20,000 = 39 MB per processor
- On 1,024 processors: 4*4*1,024*20,000 = 312 MB per processor
- On 10,000 processors: 4*4*10,000*20,000 = 2.98 GB per processor -> 29 TB across the entire system
  (a small check of this arithmetic follows)
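The numbers above follow directly from the array dimensions; a quick sketch that reproduces the per-processor totals, assuming 4-byte default integers:

   program initdecomp_memory
     implicit none
     integer, parameter :: narrays = 4, bytes_per_int = 4, ancells = 20000
     integer :: npes(3) = (/ 128, 1024, 10000 /)
     real(8) :: mbytes
     integer :: i
     do i = 1, 3
        ! 4 arrays * 4 bytes * ancells * npes, converted to MB
        mbytes = real(narrays, 8)*bytes_per_int*ancells*npes(i)/2.0d0**20
        print '(a,i6,a,f10.1,a)', 'npes = ', npes(i), ': ', mbytes, &
             ' MB per processor for the initDecomp index arrays'
     end do
   end program initdecomp_memory

The key point is the linear factor of npes in the declared shape: memory per processor grows with the machine size, so the total across the system grows quadratically.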

43 Memory use in CLM
- Subroutine initDecomp deallocates the large arrays
- CLM configuration: 1x1.25 grid, no RTM, MAXPATCH_PFT = 4, no CN or DGVM
- Measure stack and heap on 32-512 BG/L processors

44 Memory use for CLM on BG/L (figure)

45 Non-scalable memory usage
- A common problem: easy to ignore on 128 processors, fatal at large processor counts
- Avoid array dimensions that scale with npes; keep them fixed size
- Eliminate serial code!!
- Re-evaluate initialization code (is it scalable?)
- Remember: innocent-looking non-scalable code can kill!

46 Outline
- Motivation
- POP
- CICE
- CLM
- CAM + Coupler
- Conclusions

47 CAM + Coupler
- CAM:
  - Extensive benchmarking [P. Worley]
  - Generalizing the interface for modular dynamics: non lat-lon grids [B. Eaton], quasi-uniform grids (cubed-sphere, icosahedral)
  - Ported to BG/L [S. Ghosh]: required a rewrite of the I/O
  - FV-core resolution limited by memory
- Coupler:
  - Will examine a single-executable concurrent system (Summer 06)

48 A Petascale coupled system
- Design principles:
  - Simple, elegant design with attention to implementation details
  - Single executable -> runs on anything vendors provide
  - Minimize communication hotspots
- Concurrent execution creates hotspots, e.g. wasting bisection bandwidth by passing fluxes to the coupler

49 A Petascale coupled system (cont.)
- Sequential execution:
  - Flux interpolation is just a boundary exchange
  - Simplifies the cost budget
  - All components must be scalable
- Quasi-uniform grid:
  - Flux interpolation should be communication with a small number of nearest neighbors
  - Minimizes interpolation costs

50 Possible Configuration
- CAM (100 km, L66)
- POP @ 0.1 degree (demonstrated 30 seconds per simulated day)
- Sea ice @ 0.1 degree
- Land model (50 km)
- Sequential coupler

51 High-Resolution CCSM on ~30K BG/L processors (time per simulated day, in seconds)

  Component              Demonstrated?       Budget   Actual
  POP @ 0.1 degree       Yes [03/29/06]        30      30.1
  CICE @ 0.1 degree      No  [Summer 06]        8
  Land (50 km)           No  [Summer 06]        5
  Atm + Chem (100 km)    No  [Fall 06]         77
  Coupler                No  [Fall 06]         10
  Total                  No  [Spring 07]      130

  A 130-second budget per simulated day is ~1.8 simulated years per wallclock day (86,400/130 ~ 665 days).

52 Conclusions
- Examined the scalability of several components on BG/L:
  - Stress the limits of resolution and processor count
  - Uncover problems in the code
- It is possible to use large processor counts for POP @ 0.1 degree (results obtained by modifying ~9 files):
  - BGW, 30K processors --> 7.9 years/wallclock day; 33% of time in the 3D update -> CICE boundary exchange
  - RedStorm, 8K processors --> 8.1 years/wallclock day; 50% of time in the solver -> use a preconditioner

53 Conclusions (cont.)
- CICE needs improved load balancing (w-SFC)
- CLM needs:
  - Parallel RTM and parallel I/O
  - Cleanup of non-scalable data structures
- Common issues:
  - Focus on folding advances back into the models: vector mods in POP? parallel I/O in CAM? high-resolution CRIEPI work?
  - Parallel I/O
  - Eliminate all serial code!
  - Watch the memory usage

54 Conclusions (cont.)
- Efficient use of a Petascale system is possible!
- Path to Petascale computing:
  1. Test the limits of our codes
  2. Fix the resulting problems
  3. Go to 1.

55 Acknowledgements / Questions?
- Thanks to: D. Bailey (NCAR), F. Bryan (NCAR), T. Craig (NCAR), J. Edwards (IBM), E. Hunke (LANL), B. Kadlec (CU), E. Jessup (CU), P. Jones (LANL), K. Lindsay (NCAR), W. Lipscomb (LANL), M. Taylor (SNL), H. Tufo (NCAR), M. Vertenstein (NCAR), S. Weese (NCAR), P. Worley (ORNL)
- Computer time:
  - Blue Gene/L: NSF MRI Grant, NCAR, University of Colorado, IBM (SUR) program, BGW Consortium Days, IBM Research (Watson), LLNL
  - RedStorm: Sandia

56 (PCG iteration with the 1D data structure, solvers.F90 excerpt)

   eta1_local = 0.0D0
   do i=1,nActive
      Z(i) = Minv2(i)*R(i)                   ! Apply the diagonal preconditioner
      eta1_local = eta1_local + R(i)*Z(i)    !*** (r,(PC)r)
   enddo
   Z(iptrHalo:n) = Minv2(iptrHalo:n)*R(iptrHalo:n)

   !-----------------------------------------------------------------------
   ! update conjugate direction vector s
   !-----------------------------------------------------------------------
   if (lprecond) call update_halo(Z)
   eta1 = global_sum(eta1_local,distrb_tropic)
   cg_beta = eta1/eta0
   do i=1,n
      S(i) = Z(i) + S(i)*cg_beta
   enddo
   call matvec(n,A,Q,S)

   !-----------------------------------------------------------------------
   ! compute next solution and residual
   !-----------------------------------------------------------------------
   call update_halo(Q)
   eta0 = eta1
   rtmp_local = 0.0D0
   do i=1,nActive
      rtmp_local = rtmp_local + Q(i)*S(i)
   enddo
   rtmp = global_sum(rtmp_local,distrb_tropic)
   eta1 = eta0/rtmp
   do i=1,n
      X(i) = X(i) + eta1*S(i)
      R(i) = R(i) - eta1*Q(i)
   enddo

57 (PCG iteration with the 2D block data structure, for comparison)

   do iblock=1,nblocks_tropic
      this_block = get_block(blocks_tropic(iblock),iblock)
      if (lprecond) then
         call preconditioner(WORK1,R,this_block,iblock)
      else
         where (A0(:,:,iblock) /= c0)
            WORK1(:,:,iblock) = R(:,:,iblock)/A0(:,:,iblock)
         elsewhere
            WORK1(:,:,iblock) = c0
         endwhere
      endif
      WORK0(:,:,iblock) = R(:,:,iblock)*WORK1(:,:,iblock)
   end do ! block loop

   !-----------------------------------------------------------------------
   ! update conjugate direction vector s
   !-----------------------------------------------------------------------
   if (lprecond) &
      call update_ghost_cells(WORK1, bndy_tropic, field_loc_center, &
                              field_type_scalar)
   !*** (r,(PC)r)
   eta1 = global_sum(WORK0, distrb_tropic, field_loc_center, RCALCT_B)

   do iblock=1,nblocks_tropic
      this_block = get_block(blocks_tropic(iblock),iblock)
      S(:,:,iblock) = WORK1(:,:,iblock) + S(:,:,iblock)*(eta1/eta0)

      !--------------------------------------------------------------------
      ! compute As
      !--------------------------------------------------------------------
      call btrop_operator(Q,S,this_block,iblock)
      WORK0(:,:,iblock) = Q(:,:,iblock)*S(:,:,iblock)
   end do ! block loop

   !-----------------------------------------------------------------------
   ! compute next solution and residual
   !-----------------------------------------------------------------------
   call update_ghost_cells(Q, bndy_tropic, field_loc_center, &
                           field_type_scalar)
   eta0 = eta1
   eta1 = eta0/global_sum(WORK0, distrb_tropic, &
                          field_loc_center, RCALCT_B)

   do iblock=1,nblocks_tropic
      this_block = get_block(blocks_tropic(iblock),iblock)
      X(:,:,iblock) = X(:,:,iblock) + eta1*S(:,:,iblock)
      R(:,:,iblock) = R(:,:,iblock) - eta1*Q(:,:,iblock)
      if (mod(m,solv_ncheck) == 0) then
         call btrop_operator(R,X,this_block,iblock)
         R(:,:,iblock) = B(:,:,iblock) - R(:,:,iblock)
         WORK0(:,:,iblock) = R(:,:,iblock)*R(:,:,iblock)
      endif
   end do ! block loop

58 Piece of the 1D data structure solver (slide callouts shown as trailing comments)

   !-----------------------------------------------------
   ! compute next solution and residual
   !-----------------------------------------------------
   call update_halo(Q)                          ! Update halo
   eta0 = eta1
   rtmp_local = 0.0D0
   do i=1,nActive
      rtmp_local = rtmp_local + Q(i)*S(i)       ! Dot product
   enddo
   rtmp = global_sum(rtmp_local,distrb_tropic)
   eta1 = eta0/rtmp
   do i=1,n
      X(i) = X(i) + eta1*S(i)                   ! Update vectors
      R(i) = R(i) - eta1*Q(i)
   enddo

59 POP 0.1 degree block sizes

  block size   Nb    Nb^2    Max ||
  36x24        100   10000    7545
  30x20        120   14400   10705
  24x16        150   22500   16528
  18x12        200   40000   28972
  15x10        240   57600   41352
  12x8         300   90000   64074

  Smaller blocks increase the available parallelism (Max || = blocks remaining after land-block elimination); larger blocks reduce halo overhead.

60 Serial execution time on multiple platforms ('test' grid) (figure)

61 The Unexpected Problem: just because your code scales to N processors does not mean it will scale to k*N, where k >= 4.

