Slide 1: Scaling CCSM to a Petascale System
John M. Dennis (dennis@ucar.edu)
June 22, 2006
Slide 2: Motivation
- Petascale systems with 100K-500K processors: trend or one-off?
  - LLNL: 128K-processor IBM BG/L
  - IBM Watson: 40K-processor IBM BG/L
  - Sandia: 10K-processor Cray Red Storm
  - ORNL/NCCS: 5K-processor Cray XT3; 10K (end of summer) -> 20K (Nov 06) -> ?
  - ANL: large IBM BG/P system
- We already have prototypes for a petascale system!
Slide 3: Motivation (cont'd)
- Prototype petascale application? POP @ 0.1 degree
  - BGW, 30K processors --> 7.9 years/wallclock day
  - Red Storm, 8K processors --> 8.1 years/wallclock day
- Can CCSM be a petascale application? Look at each component separately:
  - Current scalability limitations
  - Changes necessary to enable execution on large processor counts
  - Check scalability on BG/L
Slide 4: Motivation (cont'd)
- Why examine scalability on BG/L?
  - Prototype for a petascale system
  - Access to large processor counts: 2K easily, 40K through Blue Gene Watson Days
  - Scalable architecture
  - Limited memory: 256 MB (virtual-node mode), 512 MB (coprocessor mode)
  - Dedicated resources give reproducible timings
  - Lessons translate to other systems (e.g., Cray XT3)
Slide 5: Outline
- Motivation
- POP
- CICE
- CLM
- CAM + Coupler
- Conclusions
Slide 6: Parallel Ocean Program (POP)
- Modified the base POP 2.0 code to reduce execution time and improve scalability
  - Minor changes (~9 files)
  - Reworked the barotropic solver
  - Improved load balancing (space-filling curves)
  - Pilfered the CICE boundary exchange [new]
- Significant advances in performance
  - POP @ 1 degree: 128 POWER4 processors --> 2.1x
  - POP @ 0.1 degree: 30K BG/L processors --> 2x; 8K Red Storm processors --> 1.3x
Slide 7: POP using 20x24 blocks (gx1v3)
- POP data structure: flexible block structure with land-block elimination
- Small blocks: better load balance and land-block elimination, but larger halo overhead
- Large blocks: smaller halo overhead, but load imbalance and no land-block elimination
Slide 8: Outline
- Motivation
- POP
  - New barotropic solver
  - CICE boundary exchange
  - Space-filling curves
- CICE
- CLM
- CAM + Coupler
- Conclusions
Slide 9: Alternate Data Structure
- 2D data structure
  - Advantages: regular stride-1 access; compact form of the stencil operator
  - Disadvantages: includes land points; problem-specific data structure
- 1D data structure (see the sketch below)
  - Advantages: no more land points; general data structure
  - Disadvantages: indirect addressing; larger stencil operator
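A minimal sketch of the contrast, not taken from POP itself: a 5-point Laplacian applied first with the 2D form (regular stride-1 access over the full grid, land included) and then with a 1D form that stores only the "ocean" points and reaches neighbors through hypothetical index arrays (inorth, isouth, ieast, iwest). The real POP operator is a 9-point stencil over sub-blocks; every name below is illustrative.

  ! Sketch only: 2D stride-1 stencil vs 1D indirect-addressing stencil.
  program stencil_sketch
    implicit none
    integer, parameter :: nx = 8, ny = 6          ! tiny demo grid
    real(8) :: X2(nx,ny), Y2(nx,ny)
    real(8), allocatable :: X1(:), Y1(:)
    integer, allocatable :: inorth(:), isouth(:), ieast(:), iwest(:)
    integer :: i, j, n, nocean

    ! --- 2D form: regular stride-1 access, but land points would be computed too
    X2 = 1.0d0
    Y2 = 0.0d0
    do j = 2, ny-1
       do i = 2, nx-1
          Y2(i,j) = -4.0d0*X2(i,j) + X2(i-1,j) + X2(i+1,j) + X2(i,j-1) + X2(i,j+1)
       end do
    end do

    ! --- 1D form: only "ocean" points stored; neighbors found by indirect addressing
    nocean = (nx-2)*(ny-2)                        ! pretend every interior point is ocean
    allocate(X1(nocean), Y1(nocean))
    allocate(inorth(nocean), isouth(nocean), ieast(nocean), iwest(nocean))
    n = 0
    do j = 2, ny-1
       do i = 2, nx-1
          n = n + 1
          ! neighbor indices would normally come from the compression step;
          ! boundary points simply point at themselves in this demo
          iwest(n)  = merge(n-1,      n, i > 2)
          ieast(n)  = merge(n+1,      n, i < nx-1)
          isouth(n) = merge(n-(nx-2), n, j > 2)
          inorth(n) = merge(n+(nx-2), n, j < ny-1)
       end do
    end do
    X1 = 1.0d0
    do n = 1, nocean
       Y1(n) = -4.0d0*X1(n) + X1(iwest(n)) + X1(ieast(n)) + X1(isouth(n)) + X1(inorth(n))
    end do

    print '(a,f6.2,a,f6.2)', '2D result at (4,3): ', Y2(4,3), &
          '   1D result at the same point: ', Y1((3-2)*(nx-2) + (4-1))
  end program stencil_sketch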
Slide 10: Using 1D data structures in the POP2 solver (serial)
- Replace solvers.F90
- Measure execution time on cache-based microprocessors
- Examine two CG algorithms with a diagonal preconditioner:
  - PCG2 (2 inner products)
  - PCG1 (1 inner product) [D'Azevedo 93]
- Grid: 'test' (128x192 grid points) with 16x16 blocks
Slide 11: Serial execution time on IBM POWER4 (test grid)
- 56% reduction in cost per iteration
Slide 12: Using the 1D data structure in the POP2 solver (parallel)
- New parallel halo update
- Examine several CG algorithms with a diagonal preconditioner:
  - PCG2 (2 inner products)
  - PCG1 (1 inner product)
- Existing solver/preconditioner technology: Hypre (LLNL), http://www.llnl.gov/CASC/linear_solvers
  - PCG solver
  - Preconditioner: diagonal
Slide 13: Solver execution time for POP2 (20x24 blocks) on BG/L (gx1v3)
- Figure annotations: 48% cost/iteration; 27% cost/iteration
Slide 14: Outline
- Motivation
- POP
  - New barotropic solver
  - CICE boundary exchange
  - Space-filling curves
- CICE
- CLM
- CAM + Coupler
- Conclusions
Slide 15: CICE boundary exchange
- POP applies a 2D boundary exchange to 3D variables; this 3D update is 2-33% of total time
- Specialized 3D boundary exchange:
  - Reduces message count
  - Increases message length
  - Reduces dependence on machine latency (see the cost sketch below)
- Pilfer the CICE 4.0 boundary exchange; code reuse! :-)
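A back-of-the-envelope cost model, not taken from the talk, of why aggregating the vertical levels helps: with a simple latency-plus-bandwidth model, nz separate 2D exchanges pay the per-message latency nz times, while one 3D exchange pays it once. The latency, bandwidth, and message-size numbers below are illustrative assumptions, not measured BG/L values.

  ! Sketch: latency/bandwidth model for per-level vs aggregated halo exchange.
  program halo_cost_sketch
    implicit none
    real(8), parameter :: latency   = 5.0d-6      ! assumed per-message latency (s)
    real(8), parameter :: bandwidth = 150.0d6     ! assumed bandwidth (bytes/s)
    integer, parameter :: nz = 40                 ! vertical levels (0.1-degree POP)
    integer, parameter :: halo_bytes_2d = 24*2*8  ! one face of a 20x24 block, 2-wide halo,
                                                  ! 8-byte reals (assumed)
    real(8) :: t_per_level, t_aggregated

    ! nz separate 2D exchanges: pay the latency for every level
    t_per_level  = nz * (latency + halo_bytes_2d/bandwidth)
    ! one 3D exchange: a single message carrying all nz levels
    t_aggregated = latency + nz*halo_bytes_2d/bandwidth

    print '(a,es10.3,a)', 'per-level 2D updates : ', t_per_level,  ' s per neighbor'
    print '(a,es10.3,a)', 'aggregated 3D update : ', t_aggregated, ' s per neighbor'
  end program halo_cost_sketch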
Slide 16: Simulation rate of POP @ gx1v3 on IBM POWER4
- 50% of time in the solver
Slide 17: Performance of POP @ gx1v3
- Three code modifications:
  - 1D data structure
  - Space-filling curves
  - CICE boundary exchange
- Cumulative impact is huge:
  - Separately: 10-20% each
  - Together: 2.1x on 128 processors
- Small improvements add up!
Slide 18: Outline
- Motivation
- POP
  - New barotropic solver
  - CICE boundary exchange
  - Space-filling curves
- CICE
- CLM
- CAM + Coupler
- Conclusions
Slide 19: Partitioning with Space-Filling Curves
- Map 2D -> 1D (see the Hilbert-mapping sketch below)
- Variety of curve sizes:
  - Hilbert (Nb = 2^n)
  - Peano (Nb = 3^m)
  - Cinco (Nb = 5^p) [new]
  - Hilbert-Peano (Nb = 2^n 3^m)
  - Hilbert-Peano-Cinco (Nb = 2^n 3^m 5^p) [new]
- Partition the resulting 1D array of blocks (Nb blocks per side)
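As an illustration of the 2D -> 1D mapping, here is a small, self-contained sketch of the standard iterative Hilbert-curve index calculation (the classic d2xy construction). It is not POP's implementation; the grid size corresponds only to the Hilbert case Nb = 2^n from the list above.

  ! Sketch: map Hilbert-curve position d to grid cell (x,y) on an Nb x Nb grid, Nb = 2^n.
  program hilbert_demo
    implicit none
    integer, parameter :: nb = 4     ! Nb = 2^2 for a tiny demo
    integer :: d, x, y
    do d = 0, nb*nb - 1
       call d2xy(nb, d, x, y)
       print '(a,i3,a,i2,a,i2,a)', 'd=', d, ' -> (', x, ',', y, ')'
    end do
  contains
    subroutine d2xy(n, d, x, y)
      integer, intent(in)  :: n, d
      integer, intent(out) :: x, y
      integer :: rx, ry, s, t
      t = d;  x = 0;  y = 0
      s = 1
      do while (s < n)
         rx = iand(1, t/2)
         ry = iand(1, ieor(t, rx))
         call rot(s, x, y, rx, ry)
         x = x + s*rx
         y = y + s*ry
         t = t/4
         s = 2*s
      end do
    end subroutine d2xy
    subroutine rot(s, x, y, rx, ry)
      integer, intent(in)    :: s, rx, ry
      integer, intent(inout) :: x, y
      integer :: tmp
      if (ry == 0) then
         if (rx == 1) then
            x = s - 1 - x
            y = s - 1 - y
         end if
         tmp = x;  x = y;  y = tmp   ! swap x and y
      end if
    end subroutine rot
  end program hilbert_demo

Walking d = 0, 1, 2, ... visits neighboring cells in turn, so cutting the 1D index range into contiguous chunks yields spatially compact partitions.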
Slide 20: Partitioning with SFC
- Example partition for 3 processors
Slide 21: POP using 20x24 blocks (gx1v3)
Slide 22: POP (gx1v3) + space-filling curve
Slide 23: Space-filling curve (Hilbert, Nb = 2^4)
Slide 24: Remove land blocks
Slide 25: Space-filling curve partition for 8 processors
Slide 26: 0.1-degree POP
- Global eddy-resolving configuration
- Computational grid: 3600 x 2400 x 40
- Land creates problems: load imbalance and limited scalability
- Alternative partitioning algorithm: space-filling curves
- Evaluation benchmark: 1 simulated day, internal grid, 7-minute timestep
Slide 27: POP 0.1-degree benchmark on Blue Gene/L
Slide 28: POP 0.1-degree benchmark
- Courtesy of Y. Yoshida, M. Taylor, P. Worley
- 50% of time in the solver
- 33% of time in the 3D update
Slide 29: Remaining issues: POP
- Parallel I/O:
  - The I/O decomposition is in the vertical, so I/O is only parallel for 3D fields
  - Needs all-to-one communication
  - Need parallel I/O for 2D fields
- Example: 0.1-degree POP on 30K BG/L processors
  - Time to compute 1 day: 30 seconds
  - Time to read the 2D forcing files: 22 seconds
Slide 30: Impact of a 2x increase in simulation rate
- IPCC AR5 control run (1000 years):
  - 5 years per day ~= 6 months of wallclock time (200 days)
  - 10 years per day ~= 3 months (100 days)
- Huge jump in scientific productivity:
  - Search a larger parameter space
  - Longer sensitivity studies -> find and fix problems much quicker
- What about the entire coupled system?
Slide 31: Outline
- Motivation
- POP
- CICE
- CLM
- CAM + Coupler
- Conclusions
Slide 32: CICE: Sea-Ice Model
- Shares grid and infrastructure with POP
- CICE 4.0:
  - Not quite ready for general release
  - Sub-block data structures (as in POP2)
  - Minimal experience with the code base (<2 weeks)
- Reuse techniques from the POP2 work
- Partition the grid using weighted space-filling curves?
Slide 33: Weighted space-filling curves
- Estimate the work for each grid block (see the partitioning sketch below):
  Work_i = w0 + P_i * w1
  - w0: fixed work for all blocks
  - w1: additional work if the block contains sea ice
  - P_i: probability that block i contains sea ice
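A minimal sketch, assuming the blocks have already been ordered along the space-filling curve, of how the work estimate can drive the partition: compute Work_i = w0 + P_i*w1 per block and cut the 1D block list into contiguous chunks of roughly equal accumulated work. The weights and probabilities are made-up values, and the greedy cutting rule is only one plausible way to use them, not necessarily the one used in CICE.

  ! Sketch: equal-work partition of SFC-ordered blocks using Work_i = w0 + P_i*w1.
  program wsfc_partition_sketch
    implicit none
    integer, parameter :: nblocks = 16, npes = 4
    real(8), parameter :: w0 = 1.0d0, w1 = 3.0d0
    real(8) :: p(nblocks), work(nblocks), target_work, accum
    integer :: owner(nblocks), ib, pe

    ! made-up sea-ice probabilities along the curve (high near the "poles")
    p = 0.0d0
    p(1:3)   = (/ 0.9d0, 0.8d0, 0.4d0 /)
    p(14:16) = (/ 0.5d0, 0.9d0, 1.0d0 /)

    work = w0 + p*w1
    target_work = sum(work)/npes

    ! greedy cut: walk the curve, start a new processor once its share is full
    pe = 1
    accum = 0.0d0
    do ib = 1, nblocks
       if (accum >= target_work .and. pe < npes) then
          pe = pe + 1
          accum = 0.0d0
       end if
       owner(ib) = pe
       accum = accum + work(ib)
    end do

    do pe = 1, npes
       print '(a,i2,a,f6.2)', 'processor ', pe, ' estimated work = ', &
             sum(work, mask=(owner == pe))
    end do
  end program wsfc_partition_sketch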
Slide 34: Weighted space-filling curves (cont'd)
- The probability that a block contains sea ice depends on the climate scenario:
  - Control run
  - Paleoclimate
  - CO2 doubling
- Estimating the probability matters: a bad estimate -> slower simulation rate
- Weight the space-filling curve and partition for equal amounts of work
Slide 35: Partitioning with weighted SFC
- Example partition for 5 processors
Slide 36: Remaining issues: CICE
- Parallel I/O
- Examine scalability with weighted SFC
  - Active sea ice covers ~15% of the ocean grid
  - Estimated processor counts for 0.1 degree: Red Storm ~4,000; Blue Gene/L ~10,000
- Stay tuned!
Slide 37: Outline
- Motivation
- POP
- CICE
- CLM
- CAM + Coupler
- Conclusions
Slide 38: Community Land Model (CLM2)
- Fundamentally a scalable code: no communication between grid points
- But it has some serial components:
  - River Transport Model (RTM)
  - Serial I/O (collect everything on processor 0)
Slide 39: What is wrong with just a little serial code?
- Serial code is evil!!
Slide 40: Why is serial code evil?
- It seems innocent at first, but it leads to much larger problems
- Serial code:
  - Is a performance bottleneck
  - Causes excessive memory usage, both from collecting everything on one processor and from the message-passing bookkeeping
Slide 41: Cost of message-passing information
- Parallel code: each processor communicates with only a small number of neighbors -> O(1) information
- A single serial component: one processor communicates with all processors -> O(npes) information
Slide 42: Memory usage in subroutine initDecomp
- Four integer arrays of dimension (ancells, npes)
  - ancells: number of land sub-grid points (~20,000)
- On 128 processors: 4 * 4 * 128 * 20,000 = 39 MB per processor
- On 1,024 processors: 4 * 4 * 1,024 * 20,000 = 312 MB per processor
- On 10,000 processors: 4 * 4 * 10,000 * 20,000 = 2.98 GB per processor
  -> 29 TB across the entire system
- (the arithmetic is spelled out in the sketch below)
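The figures above follow directly from 4 arrays x 4 bytes per integer x npes x ancells, with every processor holding its own copy. A tiny sketch of that arithmetic (the array count, element size, and ancells value are taken from the slide):

  ! Sketch: reproduces the slide's memory estimates for the (ancells, npes) arrays.
  program initdecomp_memory_sketch
    implicit none
    integer, parameter :: narrays = 4, bytes_per_int = 4, ancells = 20000
    integer :: npes_list(3) = (/ 128, 1024, 10000 /)
    real(8) :: per_proc_bytes, total_bytes
    integer :: k

    do k = 1, 3
       per_proc_bytes = real(narrays,8)*bytes_per_int*npes_list(k)*ancells
       total_bytes    = per_proc_bytes*npes_list(k)   ! every processor holds a copy
       print '(i6,a,f8.2,a,f10.2,a)', npes_list(k), ' procs: ', &
             per_proc_bytes/2.0d0**20, ' MB/proc, ', total_bytes/2.0d0**40, ' TB total'
    end do
  end program initdecomp_memory_sketch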
Slide 43: Memory use in CLM
- Subroutine initDecomp deallocates the large arrays
- CLM configuration:
  - 1 x 1.25 degree grid
  - No RTM
  - MAXPATCH_PFT = 4
  - No CN, no DGVM
- Measure stack and heap on 32-512 BG/L processors
Slide 44: Memory use for CLM on BG/L
Slide 45: Non-scalable memory usage
- A common problem: easy to ignore on 128 processors, fatal on large processor counts
- Avoid array dimensions that scale with npes; keep them fixed size
- Eliminate serial code!!
- Re-evaluate initialization code (is it scalable?)
- Remember: innocent-looking non-scalable code can kill!
Slide 46: Outline
- Motivation
- POP
- CICE
- CLM
- CAM + Coupler
- Conclusions
Slide 47: CAM + Coupler
- CAM:
  - Extensive benchmarking [P. Worley]
  - Generalizing the interface for modular dynamics and non-lat-lon grids [B. Eaton]
    - Quasi-uniform grids (cubed-sphere, icosahedral)
  - Ported to BG/L [S. Ghosh]; required a rewrite of the I/O
  - FV-core resolution limited by memory
- Coupler:
  - Will examine a single-executable concurrent system (Summer 06)
Slide 48: A Petascale Coupled System
- Design principles:
  - Simple, elegant design with attention to implementation details
  - Single executable -> runs on anything vendors provide
  - Minimize communication hotspots
    - Concurrent execution creates hotspots, e.g. wasting bisection bandwidth by passing fluxes to the coupler
Slide 49: A Petascale Coupled System (cont'd)
- Sequential execution:
  - Flux interpolation becomes just a boundary exchange
  - Simplifies the cost budget
  - All components must be scalable
- Quasi-uniform grid:
  - Flux interpolation should be communication with a small number of nearest neighbors
  - Minimizes interpolation costs
Slide 50: Possible Configuration
- CAM (100 km, L66)
- POP @ 0.1 degree (demonstrated: 30 seconds per simulated day)
- Sea ice @ 0.1 degree
- Land model (50 km)
- Sequential coupler
Slide 51: High-Resolution CCSM on ~30K BG/L processors

  Component             Demonstrated?     Budget (s/day)   Actual (s/day)
  POP @ 0.1 degree      Yes [03/29/06]    30               30.1
  CICE @ 0.1 degree     No [Summer 06]    8
  Land (50 km)          No [Summer 06]    5
  Atm + Chem (100 km)   No [Fall 06]      77
  Coupler               No [Fall 06]      10
  Total                 No [Spring 07]    130

  ~1.8 simulated years/wallclock day
Slide 52: Conclusions
- Examined the scalability of several components on BG/L
  - Stress the limits of resolution and processor count
  - Uncover problems in the code
- It is possible to use large processor counts
- POP @ 0.1 degree (results obtained by modifying ~9 files):
  - BGW, 30K processors --> 7.9 years/wallclock day
    - 33% of time in the 3D update -> CICE boundary exchange
  - Red Storm, 8K processors --> 8.1 years/wallclock day
    - 50% of time in the solver -> use a preconditioner
Slide 53: Conclusions (cont'd)
- CICE needs improved load balancing (weighted SFC)
- CLM needs:
  - Parallel RTM and parallel I/O
  - Cleanup of non-scalable data structures
- Common issues:
  - Focus on returning advances into the models: vector mods in POP? parallel I/O in CAM? high-resolution CRIEPI work?
  - Parallel I/O
  - Eliminate all serial code!
  - Watch the memory usage
Slide 54: Conclusions (cont'd)
- Efficient use of a petascale system is possible!
- Path to petascale computing:
  1. Test the limits of our codes
  2. Fix the resulting problems
  3. Go to 1
Slide 55: Acknowledgements / Questions?
- Thanks to: D. Bailey (NCAR), F. Bryan (NCAR), T. Craig (NCAR), J. Edwards (IBM), E. Hunke (LANL), B. Kadlec (CU), E. Jessup (CU), P. Jones (LANL), K. Lindsay (NCAR), W. Lipscomb (LANL), M. Taylor (SNL), H. Tufo (NCAR), M. Vertenstein (NCAR), S. Weese (NCAR), P. Worley (ORNL)
- Computer time:
  - Blue Gene/L time: NSF MRI grant, NCAR, University of Colorado, IBM (SUR) program, BGW Consortium Days, IBM Research (Watson), LLNL
  - Red Storm time: Sandia
Slide 56: 1D data structure solver (code listing)

   eta1_local = 0.0D0
   do i=1,nActive
      Z(i) = Minv2(i)*R(i)                 ! Apply the diagonal preconditioner
      eta1_local = eta1_local + R(i)*Z(i)  !*** (r,(PC)r)
   enddo
   Z(iptrHalo:n) = Minv2(iptrHalo:n)*R(iptrHalo:n)

   !-----------------------------------------------------------------------
   ! update conjugate direction vector s
   !-----------------------------------------------------------------------
   if (lprecond) call update_halo(Z)

   eta1 = global_sum(eta1_local,distrb_tropic)
   cg_beta = eta1/eta0
   do i=1,n
      S(i) = Z(i) + S(i)*cg_beta
   enddo

   call matvec(n,A,Q,S)

   !-----------------------------------------------------------------------
   ! compute next solution and residual
   !-----------------------------------------------------------------------
   call update_halo(Q)

   eta0 = eta1
   rtmp_local = 0.0D0
   do i=1,nActive
      rtmp_local = rtmp_local + Q(i)*S(i)
   enddo
   rtmp = global_sum(rtmp_local,distrb_tropic)
   eta1 = eta0/rtmp

   do i=1,n
      X(i) = X(i) + eta1*S(i)
      R(i) = R(i) - eta1*Q(i)
   enddo
Slide 57: 2D block data structure solver (code listing)

   do iblock=1,nblocks_tropic
      this_block = get_block(blocks_tropic(iblock),iblock)

      if (lprecond) then
         call preconditioner(WORK1,R,this_block,iblock)
      else
         where (A0(:,:,iblock) /= c0)
            WORK1(:,:,iblock) = R(:,:,iblock)/A0(:,:,iblock)
         elsewhere
            WORK1(:,:,iblock) = c0
         endwhere
      endif

      WORK0(:,:,iblock) = R(:,:,iblock)*WORK1(:,:,iblock)
   end do ! block loop

   !-----------------------------------------------------------------------
   ! update conjugate direction vector s
   !-----------------------------------------------------------------------
   if (lprecond) &
      call update_ghost_cells(WORK1, bndy_tropic, field_loc_center, &
                              field_type_scalar)
   !*** (r,(PC)r)
   eta1 = global_sum(WORK0, distrb_tropic, field_loc_center, RCALCT_B)

   do iblock=1,nblocks_tropic
      this_block = get_block(blocks_tropic(iblock),iblock)

      S(:,:,iblock) = WORK1(:,:,iblock) + S(:,:,iblock)*(eta1/eta0)

      !--------------------------------------------------------------------
      ! compute As
      !--------------------------------------------------------------------
      call btrop_operator(Q,S,this_block,iblock)
      WORK0(:,:,iblock) = Q(:,:,iblock)*S(:,:,iblock)
   end do ! block loop

   !-----------------------------------------------------------------------
   ! compute next solution and residual
   !-----------------------------------------------------------------------
   call update_ghost_cells(Q, bndy_tropic, field_loc_center, &
                           field_type_scalar)

   eta0 = eta1
   eta1 = eta0/global_sum(WORK0, distrb_tropic, &
                          field_loc_center, RCALCT_B)

   do iblock=1,nblocks_tropic
      this_block = get_block(blocks_tropic(iblock),iblock)

      X(:,:,iblock) = X(:,:,iblock) + eta1*S(:,:,iblock)
      R(:,:,iblock) = R(:,:,iblock) - eta1*Q(:,:,iblock)

      if (mod(m,solv_ncheck) == 0) then
         call btrop_operator(R,X,this_block,iblock)
         R(:,:,iblock) = B(:,:,iblock) - R(:,:,iblock)
         WORK0(:,:,iblock) = R(:,:,iblock)*R(:,:,iblock)
      endif
   end do ! block loop
Slide 58: Piece of the 1D data structure solver

   !-----------------------------------------------------
   ! compute next solution and residual
   !-----------------------------------------------------
   call update_halo(Q)                          ! update halo

   eta0 = eta1
   rtmp_local = 0.0D0
   do i=1,nActive                               ! dot product
      rtmp_local = rtmp_local + Q(i)*S(i)
   enddo
   rtmp = global_sum(rtmp_local,distrb_tropic)
   eta1 = eta0/rtmp

   do i=1,n                                     ! update vectors
      X(i) = X(i) + eta1*S(i)
      R(i) = R(i) - eta1*Q(i)
   enddo
Slide 59: POP 0.1-degree block sizes

  Block size   Nb    Nb^2    Max parallelism
  36x24        100   10000    7545
  30x20        120   14400   10705
  24x16        150   22500   16528
  18x12        200   40000   28972
  15x10        240   57600   41352
  12x8         300   90000   64074

- Down the table (smaller blocks): increasing available parallelism
- Up the table (larger blocks): decreasing halo overhead
- (see the halo-overhead sketch below)
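A small sketch of the halo-overhead side of this trade-off, assuming a 2-point-wide ghost region around each block; the numbers are computed from block geometry only, while the "Max parallelism" column above additionally depends on the land mask and is not reproduced here.

  ! Sketch: halo storage overhead for the block sizes in the table above.
  program block_overhead_sketch
    implicit none
    integer, parameter :: nghost = 2                ! assumed ghost-cell width
    integer :: bx(6) = (/ 36, 30, 24, 18, 15, 12 /)
    integer :: by(6) = (/ 24, 20, 16, 12, 10,  8 /)
    integer :: k, interior, padded, nb
    real(8) :: overhead

    do k = 1, 6
       nb       = 3600/bx(k)                        ! blocks across the 3600 x 2400 grid
       interior = bx(k)*by(k)
       padded   = (bx(k)+2*nghost)*(by(k)+2*nghost)
       overhead = real(padded-interior,8)/interior
       print '(i3,a,i3,a,i4,a,f6.1,a)', bx(k), ' x', by(k), '  Nb =', nb, &
             '   halo overhead = ', 100.0d0*overhead, ' %'
    end do
  end program block_overhead_sketch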
Slide 60: Serial execution time on multiple platforms (test grid)
Slide 61: The Unexpected Problem
- Just because your code scales to N processors does not mean it will scale to k*N, where k >= 4.