Overview of Nesting in the NMM-B
Tom Black
1 April 2015
NMM-B User Tutorial


● The grid
● Motion
● Upscale (2-way) exchange
● User specification of nest-related variables
● Sequence of execution
● 1-way interactive / MPI task usage
● 2-way interactive / MPI task usage
● Nest characteristics
● Scaling

General Characteristics of NMM-B Nests
● Static / moving
● 1-way / 2-way interactive
● Multiple nests run simultaneously
● Telescoping domains
● Bit restartable (static / moving / 1-way / 2-way)
● Parent-oriented

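NEMS Structure
Figure: the NEMS component hierarchy, with boxes for MAIN, NEMS, EARTH(1:NM), the Ensemble Mediator, Ocean, Atm, Ice, the Atm-Ocn Mediator, NMM, GSM, FIM, Chem, the NMM Domain (1:ND) components (parents and children) with their Solver and Wrt components, and the Dyn, Phy and Wrt components of GSM and FIM. All boxes represent ESMF components.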

1-Way Integration for Three Generations
Figure: timesteps Δt par, Δt child and Δt grandchild; the parent updates the child BCs.
All generations integrate concurrently.

Task Usage for NMM-B 1-Way Nesting
The user distributes all available compute tasks among the various domains and fine-tunes those assignments (along with those of the quilt tasks) so that parents and their children advance through the forecast at virtually the same rate while all domains integrate concurrently. This gives the user the ability to optimize the workload balance.
NOTE: For all nesting, the physical dimensions of a parent's task subdomains MUST BE LARGER than the physical dimensions of its children's task subdomains.

NMM-B with 1-Way Nesting using 72 Compute Tasks
Figure: generation #1 on tasks 0-7, generation #2 on tasks 8-47 and generation #3 on tasks 48-71.

Preliminary Estimate of 1-Way Compute Task Assignments
There are N compute tasks available. There are 3 generations with 1 domain, 2 domains and 2 domains, respectively.
Domain #1: IM1, JM1, DT1 => Work1 = IM1 x JM1
Domain #2: IM2, JM2, DT2 => Work2 = IM2 x JM2 x (DT1 / DT2)
Domain #3: IM3, JM3, DT3 => Work3 = IM3 x JM3 x (DT1 / DT3)
Domain #4: IM4, JM4, DT4 => Work4 = IM4 x JM4 x (DT1 / DT4)
Domain #5: IM5, JM5, DT5 => Work5 = IM5 x JM5 x (DT1 / DT5)
The factor (DT1 / DTk) counts the timesteps domain k takes for each timestep of Domain #1.
Total Work = TW = Work1 + Work2 + Work3 + Work4 + Work5
Domain #1 compute tasks: (Work1 / TW) x N
Domain #2 compute tasks: (Work2 / TW) x N
Domain #3 compute tasks: (Work3 / TW) x N
Domain #4 compute tasks: (Work4 / TW) x N
Domain #5 compute tasks: (Work5 / TW) x N
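The estimate above is simple enough to script. The Python sketch below is only an illustration, not NMM-B code, and `estimate_1way` and its helpers are hypothetical names. It computes each Work value with exact fractions and splits the N tasks in proportion. The slide does not say how to round; here shares are rounded down and the leftover tasks go to the largest fractional remainders, which is an assumption.

```python
from fractions import Fraction
from math import floor


def domain_work(im, jm, dt, dt1):
    """Work of one domain per timestep of Domain #1: IM x JM x (DT1 / DT)."""
    return Fraction(im * jm) * Fraction(dt1) / Fraction(dt)


def proportional_tasks(works, n_tasks):
    """Split n_tasks as (Work_k / TW) x N.

    Shares are rounded down; leftover tasks go to the domains with the
    largest fractional remainders (an assumed rounding rule)."""
    total = sum(works)
    shares = [w * n_tasks / total for w in works]
    counts = [floor(s) for s in shares]
    leftover = n_tasks - sum(counts)
    # sorted() is stable, so domains with equal remainders keep their order.
    order = sorted(range(len(works)), key=lambda k: shares[k] - counts[k], reverse=True)
    for k in order[:leftover]:
        counts[k] += 1
    return counts


def estimate_1way(domains, n_tasks):
    """domains: list of (IM, JM, DT); the first entry is Domain #1."""
    dt1 = domains[0][2]
    works = [domain_work(im, jm, dt, dt1) for im, jm, dt in domains]
    return proportional_tasks(works, n_tasks)


if __name__ == "__main__":
    # Hypothetical parent (DT = 30 s) with two 3-to-1 nests (DT = 10 s), 72 tasks.
    print(estimate_1way([(100, 100, 30), (100, 100, 10), (100, 100, 10)], 72))  # → [10, 31, 31]
```

Each hypothetical nest does three times the parent's work, so the nests get about three times as many tasks. These counts are only first guesses for the timer-based tuning described next.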

Some Key Timers
● cpl1_recv_tim: child wait time to recv BC data; appears as ‘cpl recv = ’ in the stdout file.
● cpl2_wait_tim: parent wait time for the BC send to finish; appears as ‘cpl wait = ’ in the stdout file.
If the child wait time is large, the child is too fast relative to the parent => reduce child tasks, increase parent tasks.
If the parent wait time is large, the parent is too fast relative to the child => reduce parent tasks, increase child tasks.
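A small helper can make this rule of thumb mechanical. The sketch below is illustrative only: the regular expression assumes each timer is printed as a single number after ‘cpl recv =’ or ‘cpl wait =’, and the `tolerance` threshold and the function names are hypothetical.

```python
import re

# Assumed format: the slide only says the timers appear as 'cpl recv = ' and
# 'cpl wait = ' in the stdout file; a plain number after '=' is assumed.
TIMER = re.compile(r"cpl (recv|wait)\s*=\s*([-+]?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?)")


def read_coupler_timers(lines):
    """Collect {'recv': child wait time, 'wait': parent wait time}; later values win."""
    timers = {}
    for line in lines:
        for kind, value in TIMER.findall(line):
            timers[kind] = float(value)
    return timers


def rebalance_hint(child_wait, parent_wait, tolerance=1.0):
    """The side that waits a lot is too fast relative to the other.

    tolerance (same units as the timers) is a hypothetical threshold below
    which the difference is treated as balanced."""
    if child_wait - parent_wait > tolerance:
        return "child too fast: reduce child tasks, increase parent tasks"
    if parent_wait - child_wait > tolerance:
        return "parent too fast: reduce parent tasks, increase child tasks"
    return "balanced"
```

Comparing the two wait times, rather than judging each in isolation, is a design choice of this sketch; the slide only says "large".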

Current Operational NAM with 1-Way Static Nests
● Parent runs at 12 km to 84 hr
● Four static nests run to 60 hr:
  – 4 km CONUS nest (3-to-1)
  – 6 km Alaska nest (2-to-1)
  – 3 km HI & PR nests (4-to-1)
● Single relocatable 1.33 km or 1.5 km FireWeather grandchild run to 36 hr (3-to-1 or 4-to-1)

Relative Compute Resources used by NAM Nests
● 12 km parent: 10%
● 4 km CONUS nest: 57%
● 6 km Alaska nest: 7%
● 3 km Hawaii nest: 5%
● 3 km Puerto Rico nest: 4%
● 1.33 km CONUS FireWx nest: 17%

2-Way Integration for Three Generations
Figure: timesteps Δt par, Δt child and Δt grandchild; the parent updates the child BCs and the child updates the parent.
Only one generation can be active at a given time.

Use 1-Way Task Assignment Strategy in 2-Way Nests?
NO. Too many tasks would sit idle, since domains are active in only one generation at a time. Therefore use a different approach based on the generations of domains.

NMM-B with 1-Way Nesting using 72 Compute Tasks
Figure: generation #1 on tasks 0-7, generation #2 on tasks 8-47 and generation #3 on tasks 48-71.
Only 40 of the 72 tasks are working in the busiest generation if this method is used for 2-way nesting.

Basic Strategy for 2-Way Task Usage by Generations
‣ Generations must wait on each other in 2-way mode.
‣ Since not all domains can execute concurrently, maximize the amount of work that can be done at any given time by assigning ALL compute tasks to the most expensive generation and distributing them among its domains for optimal efficiency.
‣ Then reassign to the domains in each remaining generation only as many compute tasks as help minimize that generation's clock time, avoiding subdomains so small that they do too little computation relative to costly halo exchanges.

Rules for ‘Generational’ Task Usage
‣ Generations execute sequentially.
‣ All domains in each generation execute concurrently.
‣ ALL compute tasks are assigned to the most expensive generation.
‣ A compute task can be in more than one generation but cannot be on more than one domain per generation.
‣ Each quilt task must still be uniquely assigned to a single domain to retain asynchronous writing of output.
The user is now able to optimize speed in 2-way nesting without imposing large imbalances. Some tasks might be idle in some generations, but all generations run as fast as possible.
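These rules can be checked mechanically before a run. The sketch below is a hypothetical validator, not part of NMM-B. It assumes compute tasks are numbered 0 to N-1 and that a layout is given as sets of task ranks per domain per generation, with quilt tasks listed per domain.

```python
def check_generational_layout(compute, quilt, n_compute, most_expensive_gen):
    """Check a 2-way task layout against the 'generational' rules.

    compute: {generation: {domain: set of compute task ranks}}
    quilt:   {domain: set of quilt (write) task ranks}
    Returns a list of violations; an empty list means the layout passes."""
    problems = []
    # A compute task may appear in several generations, but on only one
    # domain within any one generation.
    for gen, domains in compute.items():
        owner = {}
        for dom, tasks in domains.items():
            for t in sorted(tasks):
                if t in owner:
                    problems.append(f"gen {gen}: task {t} on domains {owner[t]} and {dom}")
                else:
                    owner[t] = dom
    # ALL compute tasks go to the most expensive generation.
    used = set().union(*compute[most_expensive_gen].values())
    if used != set(range(n_compute)):
        problems.append(f"gen {most_expensive_gen} does not use all {n_compute} compute tasks")
    # Each quilt task belongs to exactly one domain.
    quilt_owner = {}
    for dom, tasks in quilt.items():
        for t in sorted(tasks):
            if t in quilt_owner:
                problems.append(f"quilt task {t} on domains {quilt_owner[t]} and {dom}")
            else:
                quilt_owner[t] = dom
    return problems
```

For example, a 72-task layout with generation #1 on tasks 0-11, generation #2 split (hypothetically) as 0-55 and 56-71, and generation #3 on 12-35 and 36-53 passes when generation #2 is named as the most expensive.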

NMM-B with 2-Way Nesting using 72 Compute Tasks (‘Generational’ task usage)
Figure: generation #1 on tasks 0-11, generation #2 on tasks 0-71 and generation #3 on tasks 12-53.
All 72 of the 72 tasks are working in the busiest generation.
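The difference between the two 72-task layouts can be counted directly from the task ranges in the figures. This sketch (the function name is hypothetical) takes the number of compute tasks in each generation and reports how many are working in the busiest generation when generations run one at a time, as they do in 2-way mode.

```python
def busiest_generation_usage(tasks_per_generation, n_compute):
    """Return (generation, active tasks, fraction of all compute tasks) for the
    generation with the most active tasks."""
    gen = max(tasks_per_generation, key=tasks_per_generation.get)
    active = tasks_per_generation[gen]
    return gen, active, active / n_compute


# Compute tasks per generation, read off the two 72-task layouts.
one_way_style = {1: 8, 2: 40, 3: 24}   # tasks 0-7, 8-47, 48-71
generational = {1: 12, 2: 72, 3: 42}   # tasks 0-11, 0-71, 12-53
```

With the 1-way style layout only 40 of 72 tasks (about 56%) work in the busiest generation; the generational layout uses all 72.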

Preliminary Estimate of 2-Way Compute Task Assignments
Same setup as the 1-way case. Assume the 2nd generation is the most expensive.
gen #2: Total Work = TW2 = Work2 + Work3. Distribute the tasks in the 2nd generation as was done for all 1-way domains previously:
  Domain #2 compute tasks: (Work2 / TW2) x N
  Domain #3 compute tasks: (Work3 / TW2) x N
Assign as many of the N tasks to generations 1 and 3 as possible without slowing down the run:
gen #1:
  Domain #1 compute tasks: <= N
gen #3: Total Work = TW3 = Work4 + Work5
  Domain #4 compute tasks: <= (Work4 / TW3) x N
  Domain #5 compute tasks: <= (Work5 / TW3) x N
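The 2-way estimate reuses the 1-way proportional split once per generation. In this sketch (hypothetical names, with the same assumed round-down-then-largest-remainder rule as the 1-way case), each generation's domains share all N tasks in proportion to their work. The counts are the assignments for the most expensive generation and only upper bounds elsewhere, to be trimmed when fewer tasks do not slow the run.

```python
from fractions import Fraction
from math import floor


def proportional_tasks(works, n_tasks):
    """(Work_k / TW) x N, rounded down, leftovers to the largest remainders (assumed)."""
    total = sum(works)
    shares = [w * n_tasks / total for w in works]
    counts = [floor(s) for s in shares]
    order = sorted(range(len(works)), key=lambda k: shares[k] - counts[k], reverse=True)
    for k in order[: n_tasks - sum(counts)]:
        counts[k] += 1
    return counts


def estimate_2way(generations, n_tasks):
    """generations: {gen: {domain: (IM, JM, DT)}}, generation 1 holding Domain #1.

    Returns (most expensive generation, {domain: task count}). Counts are the
    assignment for the most expensive generation and upper bounds (<=) for
    the other generations."""
    dt1 = next(iter(generations[1].values()))[2]
    work = {
        gen: {dom: Fraction(im * jm) * Fraction(dt1) / Fraction(dt)
              for dom, (im, jm, dt) in domains.items()}
        for gen, domains in generations.items()
    }
    most_expensive = max(work, key=lambda gen: sum(work[gen].values()))
    tasks = {}
    for gen, dom_work in work.items():
        names = list(dom_work)
        tasks.update(zip(names, proportional_tasks([dom_work[d] for d in names], n_tasks)))
    return most_expensive, tasks
```

With a hypothetical 100x100 parent at DT = 30 s, two 100x100 children at DT = 10 s and 120x120 and 60x60 grandchildren at DT = 5 s, generation #3 is the most expensive (work 108000 vs. 60000), and 72 tasks give 72 for Domain #1, 36 each for Domains #2 and #3, and 58 and 14 for Domains #4 and #5.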

Example of 2-Way Task Assignments
‣ You have 128 available tasks: 112 compute and 16 write.
‣ Five domains; 3 generations; the 3rd is the most expensive.
Table: compute and write task layouts (I x J) for Dom #1 to Dom #5, grouped by gen #1, gen #2 and gen #3; the entries shown include compute layouts of 7x8, 6x6 and 5x8 and write layouts of 1x4, 1x3 and 1x2. Totals: 128 tasks = 112 compute + 16 write.

Scaling
● Code efficiency drops as a task subdomain's computation is overwhelmed by the cost of inter-processor communication.
● This happens when subdomain dimensions become too small and there is too little work to do compared with the time spent in halo exchanges.
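A simple surface-to-volume count shows why small subdomains hurt. The sketch below is an assumed model, not NMM-B's cost model: computation is taken to scale with the number of interior points of a task subdomain, and communication with the ring of halo points around it.

```python
def halo_to_compute_ratio(nx, ny, halo=1):
    """Halo points per interior point for one nx x ny task subdomain.

    Assumed surface-to-volume model: computation scales with nx * ny, and
    the exchange with the ring of halo points of width `halo` around it."""
    interior = nx * ny
    halo_points = (nx + 2 * halo) * (ny + 2 * halo) - interior
    return halo_points / interior


if __name__ == "__main__":
    # A 1080 x 486 domain split over INPES x 24 tasks.
    for inpes in (6, 12, 24, 48, 96):
        nx, ny = 1080 // inpes, 486 // 24
        print(inpes * 24, nx, ny, round(halo_to_compute_ratio(nx, ny), 3))
```

For that 1080 x 486 example the ratio grows from about 0.11 at 144 tasks (180 x 20 points) to 0.3 at 2304 tasks (11 x 20 points), so each task spends relatively more time exchanging halos as the task count rises.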

Scaling (2)
● More expensive computation => the code will scale to a larger number of processors.
● Less expensive computation => the code will scale to a smaller number of processors.
● Therefore scaling is simply a direct indicator of a code's computational density.

Scaling (3)
● However, if minimizing clock time is desired, extensive experimentation is required after first-guess task assignments are made, because optimal counts cannot be predicted.
● When assigning tasks, always be sure the subdomains are not made too small by a task count that is too large.
● As a general rule of thumb, check that no domain has a subdomain dimension of less than ~10 points in I or J, or else halo exchange cost will begin to exceed computational cost.
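The ~10-point rule of thumb is easy to check before a run. In this sketch (hypothetical names), points are split as evenly as possible among tasks, which is an assumption; NMM-B may place the remainder points differently.

```python
def split_extents(n_points, n_tasks):
    """Point counts per task when n_points are split as evenly as possible
    (assumed; NMM-B may place the remainder points differently)."""
    base, extra = divmod(n_points, n_tasks)
    return [base + 1] * extra + [base] * (n_tasks - extra)


def too_small(im, jm, inpes, jnpes, min_points=10):
    """True if any task subdomain has fewer than min_points in I or J."""
    return (min(split_extents(im, inpes)) < min_points
            or min(split_extents(jm, jnpes)) < min_points)
```

For example, a 181 x 181 nest on 24 x 32 tasks gives subdomains of 7 or 8 points in I and 5 or 6 in J and is flagged, while a 361 x 361 nest on 32 x 32 tasks (11 or 12 points) passes. The check only guards against obviously small subdomains; as the slide says, final counts still come from experimentation.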

18 km Parent 1080x486, Outer Nest 181x181, Inner Nest 361x361

48 hour simulation, parent only (speed-up is relative to the previous row)
INPES  JNPES  Total tasks  i-points  j-points  Elapsed time  Speed-up
    6     24          144       180        20          1370
   12     24          288        90        20           687     1.994
   24     24          576        45        20           388     1.771
   48     24         1152        23        20           242     1.603
   96     24         2304        11        20           171     1.415

48 hour simulation, parent and outer nest
(each domain: INPES x JNPES, total tasks, i-points x j-points)
Parent                   Outer nest               Elapsed time
48 x 24, 1152, 23 x 20   16 x 16, 256, 11 x 11    756
48 x 24, 1152, 23 x 20   16 x 24, 384, 11 x 7     752
48 x 24, 1152, 23 x 20   24 x 24, 576, 7 x 7      684
48 x 24, 1152, 23 x 20   24 x 32, 768, 7 x 5      822

48 hour simulation, parent, outer nest and inner nest
(each domain: INPES x JNPES, total tasks, i-points x j-points)
Parent                   Outer nest               Inner nest                Elapsed time
48 x 24, 1152, 23 x 20   24 x 24, 576, 7 x 7      48 x 32, 1536, 7 x 11     2598
48 x 24, 1152, 23 x 20   24 x 24, 576, 7 x 7      48 x 24, 1152, 7 x 15     2405
48 x 24, 1152, 23 x 20   24 x 24, 576, 7 x 7      32 x 32, 1024, 11 x 11    2656
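The speed-up column can be recomputed from the elapsed times, and dividing each speed-up by the factor-of-two increase in tasks gives the parallel efficiency of each doubling. The sketch uses only the parent-only numbers from the table; efficiency is a derived quantity that the slide does not show.

```python
# Parent-only 48 h runs: (total compute tasks, elapsed time).
runs = [(144, 1370), (288, 687), (576, 388), (1152, 242), (2304, 171)]


def speedups(runs):
    """For each run after the first: (tasks, speed-up over the previous run,
    parallel efficiency = speed-up / increase in task count)."""
    out = []
    for (n0, t0), (n1, t1) in zip(runs, runs[1:]):
        s = t0 / t1
        out.append((n1, round(s, 3), round(s / (n1 / n0), 3)))
    return out


if __name__ == "__main__":
    for row in speedups(runs):
        print(row)
```

Efficiency falls from about 0.997 at 288 tasks to about 0.708 at 2304 tasks, in line with the shrinking subdomains described in the scaling slides.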

Parent-Oriented Nests
The southwest H point of the nest domain coincides with a parent H point.
Figure: a portion of the parent domain showing parent task subdomains (◦ points) and nest task subdomains (x points).

Summary of Parent-Child Gridpoint Relationships
Figure: parent H and V points with child h and v points for odd and even space ratios.
ODD space ratio: child h points lie on parent H points; child v points lie on parent V points.
EVEN space ratio: child h points lie on parent H and V points; child v points do not coincide with parent points.

Parent and Child H Gridpoints for 3:1 Ratio
Figure: a row of parent H points with child h points between them, marking the SW corner of the nest, the 1st and last points on a parent task, and the gap. Labeled values: I_PARENT_SW=3; I=IDS=1 on the nest; ITS_PARENT=1; ITE_PARENT=5; ITS_PARENT_ON_CHILD=-5; ITE_PARENT_ON_CHILD=9.
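The labeled values are consistent with a simple index mapping, sketched below. This relationship is inferred from the figure's numbers, not taken from NMM-B source, and the function name is hypothetical. It assumes child I=1 sits on parent I=I_PARENT_SW and that the parent task's span on the child runs up to the child point just before parent point ITE_PARENT+1, which covers the gap.

```python
def parent_task_on_child(its_parent, ite_parent, i_parent_sw, ratio):
    """Child H-point I range spanned by one parent task's I range.

    Inferred mapping (an assumption): parent point I lies on child point
    (I - I_PARENT_SW) * ratio + 1, and the span ends at the child point just
    before parent point ITE_PARENT + 1."""
    its_on_child = (its_parent - i_parent_sw) * ratio + 1
    ite_on_child = (ite_parent + 1 - i_parent_sw) * ratio
    return its_on_child, ite_on_child
```

With ITS_PARENT=1, ITE_PARENT=5, I_PARENT_SW=3 and a 3:1 ratio this returns (-5, 9), matching the figure.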

Parent and Nest V Gridpoints for 3:1 Ratio
Figure: the same row with parent V and child v points added. The parent indices are the same on V as on H (ITS_PARENT=1, ITE_PARENT=5, I_PARENT_SW=3), with I=IDS=1, ITS_PARENT_ON_CHILD=-5 and ITE_PARENT_ON_CHILD=9 on the nest.

NMM-B Moving Nests
● 1-way or 2-way interactive.
● Forecast can contain multiple nests.
● Telescoping domains.

Three Types of Data Motion Needed to Satisfy a Nest’s Shift
Figure: the nest domain before and after a shift. Intra-task and inter-task updates occupy the pre-move ‘footprint’; parent updates fill the rest of the shifted domain.

Shift onto a Corner
Figure: the nest domain before and after a shift onto a corner; the intra-task update occupies the pre-move ‘footprint’ and parent updates cover the remainder.

Simplest Parent Update Over the SW Corner
Figure: a nest task subdomain at the SW corner of the pre-move footprint. Here one parent task updates the entire parent update region of this nest task subdomain.

Four Parent Tasks Update Over the SW Corner
Figure: a nest task subdomain at the SW corner of the pre-move footprint whose parent update region is split among four parent tasks: the 1st parent task's update region, the 2nd parent task's 1st and 2nd update regions, the 3rd parent task's update region and the 4th parent task's update region.

Child’s Bookkeeping for Relative Motion
The child tasks determine which of their points are updated by each of the three processes.
‣ Intra-task updating is the simplest (a shift in memory).
‣ Inter-task updating is more complex.
‣ Updates from the parent are the most complicated.
Child tasks determine which of their subdomain points outside the pre-move footprint will be updated by which parent tasks. Child tasks determine which of their subdomain points inside the pre-move footprint will be updated by which other child tasks, and vice versa.
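The child-side classification can be sketched as follows, assuming a rectangular nest domain of IM x JM points with 1-based indices and a shift measured in whole nest grid points. `Subdomain` and `classify_after_shift` are hypothetical names; the real bookkeeping also records which parent or child task supplies each point, which this sketch omits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subdomain:
    """Inclusive I,J bounds of one nest task's subdomain (hypothetical type)."""
    its: int
    ite: int
    jts: int
    jte: int

    def contains(self, i, j):
        return self.its <= i <= self.ite and self.jts <= j <= self.jte


def classify_after_shift(task, im, jm, shift_i, shift_j):
    """Sort a task's points by the process that updates them after the nest
    shifts by (shift_i, shift_j) nest grid points.

    Assumption: after the shift, point (i, j) sits where pre-move point
    (i + shift_i, j + shift_j) was."""
    updates = {"intra-task": [], "inter-task": [], "parent": []}
    for j in range(task.jts, task.jte + 1):
        for i in range(task.its, task.ite + 1):
            i_old, j_old = i + shift_i, j + shift_j
            if not (1 <= i_old <= im and 1 <= j_old <= jm):
                updates["parent"].append((i, j))      # outside the pre-move footprint
            elif task.contains(i_old, j_old):
                updates["intra-task"].append((i, j))  # a shift in this task's memory
            else:
                updates["inter-task"].append((i, j))  # data held by another nest task
    return updates
```

On an 8 x 8 nest split into 4-point-wide tasks, a 2-point shift to the east leaves the western task with 8 intra-task and 8 inter-task points, while the eastern task gets 8 intra-task points and 8 points from the parent.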

Parent’s Bookkeeping for Relative Child Motion
The parent tasks perform bookkeeping to determine which nest points outside the pre-move footprint are updated by the parent. Because of the complexity involved, both the parent and child tasks perform this bookkeeping from their own perspectives, which lets them serve as checks on each other and eliminates additional communication.

The Parent Stores Its Bookkeeping Results
Child task subdomains, and the points on them that are updated by a given parent task, change with each shift of the nest. Arrays of linked lists are used to deal with this continual change.
Figure: the parent array of moving nest update specifications, with Element 1 for Moving Child #1, Element 2 for Moving Child #2 and Element 3 for Moving Child #3; each element heads a list of the nest tasks to be updated.
Each link holds the parent task's update specifications for one relevant task of a moving child following a shift.
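The structure can be sketched in Python. A plain list per element would do the same job in Python; this sketch mirrors the slide's design of an array with one element per moving child, each heading a linked list. The `UpdateSpec` fields, the class names and the 0-based element numbering are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class UpdateSpec:
    """What one parent task sends to one task of a moving child (hypothetical fields)."""
    child_task: int
    i_start: int
    i_end: int
    j_start: int
    j_end: int


@dataclass
class Link:
    spec: UpdateSpec
    next: Optional["Link"] = None


class MovingNestUpdates:
    """Array with one element per moving child; each element heads a linked list."""

    def __init__(self, n_moving_children: int):
        self.heads = [None] * n_moving_children

    def clear(self, child: int) -> None:
        """Discard the old specifications when this child shifts again."""
        self.heads[child] = None

    def add(self, child: int, spec: UpdateSpec) -> None:
        """Prepend a link; the list grows to however many child tasks are involved."""
        self.heads[child] = Link(spec, self.heads[child])

    def specs(self, child: int) -> Iterator[UpdateSpec]:
        link = self.heads[child]
        while link is not None:
            yield link.spec
            link = link.next
```

After each shift the parent would clear the child's list and rebuild it, since the set of child tasks it must update can change in length.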

The Child Stores Its Bookkeeping Results
There is no need for linked-list arrays to store the bookkeeping results from the child's perspective, since the number of parent tasks providing update data is always between 0 and 4.
=> Allocate a derived datatype array (1:4) and store the results appropriately.
This assumes the geographical area of parent task subdomains is always larger than that of child task subdomains.
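The bound of 4 follows from the size assumption. Along one direction, a child subdomain no wider than a parent subdomain can straddle at most two parent subdomains, so in two directions it touches at most 2 x 2 = 4. The brute-force sketch below checks this. It assumes uniform parent subdomains measured in child grid points and uses hypothetical names. It counts overlap only; the number of parent tasks actually sending data also depends on the parent update region, which is why it can be 0.

```python
def tasks_spanned(start, size, block):
    """Number of consecutive blocks of width `block` touched by the points
    start .. start + size - 1 (0-based, in child grid units)."""
    return (start + size - 1) // block - start // block + 1


def max_parent_tasks(child_i, child_j, parent_i, parent_j):
    """Worst case, over every alignment, of how many parent task subdomains
    overlap one child task subdomain. Parent subdomains are assumed to form
    a uniform grid of parent_i x parent_j child points each."""
    worst_i = max(tasks_spanned(s, child_i, parent_i) for s in range(parent_i))
    worst_j = max(tasks_spanned(s, child_j, parent_j) for s in range(parent_j))
    return worst_i * worst_j
```

If the assumption is violated, for example a child subdomain 40 points wide against parent subdomains 30 points wide, the worst case rises above 4 (to 6 in that case), and an array of size 4 would no longer suffice.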

Surface Data
‣ Eight invariant surface fields from NPS cover the uppermost parent domain at each different resolution of all moving nests.
‣ Each nest task with a parent update region reads the external files to update those variables rather than receiving them from the parent, so as not to lose the higher-resolution information.
‣ Among these are topography, land/sea mask, soil type, vegetation type and vegetation fraction.
‣ For sfc variables NOT among those eight:
  (a) Generate a search list of I,J increments from near to far.
  (b) If the parent update sfc data is from a different surface type, then the nest searches for its own nearest point with the same sfc type (e.g. soil T or SST).
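Steps (a) and (b) can be sketched as below. Ordering the increments by Euclidean distance and using integer surface-type codes (e.g. 1 for land, 0 for water) are assumptions, and the function names are hypothetical.

```python
def search_increments(max_radius):
    """(a) I,J increments ordered from near to far (Euclidean distance assumed)."""
    incs = [(di, dj)
            for dj in range(-max_radius, max_radius + 1)
            for di in range(-max_radius, max_radius + 1)]
    incs.sort(key=lambda d: d[0] ** 2 + d[1] ** 2)  # stable: ties keep generation order
    return incs


def nearest_same_type(sfc_type, i, j, wanted, incs):
    """(b) Nearest nest point (I, J) whose surface type equals `wanted`, or None.

    sfc_type[j][i] holds a hypothetical per-point type code (0-based indices)."""
    jm, im = len(sfc_type), len(sfc_type[0])
    for di, dj in incs:
        ii, jj = i + di, j + dj
        if 0 <= ii < im and 0 <= jj < jm and sfc_type[jj][ii] == wanted:
            return ii, jj
    return None
```

Building the increment list once and reusing it for every point avoids recomputing distances during the search.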

Upscale (2-way) Data Exchange
As is done for motion, both the child and the parent compute which parent tasks will receive upscale data from which points on which child tasks. This eliminates communication and serves as a check.

Upscale Exchange - Child
Is the child at the end of a parent timestep? If so, determine which points on which parent tasks it will update.
(1) Loop through the appropriate parent tasks.
(2) Loop through the specified 2-way variables.
  - Generate upscale values using the mean of child values within the stencil region.
(3) Send upscale data for all variables to the given parent task.

Generate Upscale Values – Odd Space Ratio
Figure: child h and v points around a parent H point and a parent V point; the child values are averaged over the stencils shown, one for H-pt variables and one for V-pt variables.

Generate Upscale Values – Even Space Ratio
Figure: child h and v points around a parent H point and a parent V point; the child values are averaged over the stencils shown, one for H-pt variables and one for V-pt variables.
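A minimal version of the averaging step is sketched below for an odd space ratio on a plain rectangular array. It averages a ratio x ratio block centred on the child point that coincides with a parent H point. It does not reproduce the E-grid H-point and V-point stencils in the figures, and it does not cover the even-ratio case. The function name is hypothetical.

```python
def upscale_mean(child, ic, jc, ratio):
    """Mean of child values in a ratio x ratio stencil centred on child point
    (ic, jc), the point that coincides with a parent H point.

    Odd ratios only; child is indexed child[j][i] on a plain rectangular
    array (an assumed simplification of the E-grid stencils)."""
    if ratio % 2 == 0:
        raise ValueError("this sketch handles odd space ratios only")
    half = ratio // 2
    values = [child[j][i]
              for j in range(jc - half, jc + half + 1)
              for i in range(ic - half, ic + half + 1)]
    return sum(values) / len(values)
```

For a field that varies linearly across the stencil, the mean equals the value at the centre point, which is a convenient sanity check.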

Upscale Exchange - Parent
Determine which of its points are updated by which child tasks. Save each child task's specs as a link in a linked list (since we do not know ahead of time how many child tasks will send upscale data after each shift of moving nests).
(1) Loop through the appropriate child tasks.
(2) Recv data for all specified 2-way variables.
  - If the parent's sfc elevation differs from the child's, then adjust the data using a spline interpolation.
  - Update the parent values, applying the user-specified child weight from the configure file.
  - Incorporate the data only if the current timestep does not immediately follow a restart output time (for bit identical restarts).
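How the child weight enters the update is not spelled out here. The sketch below assumes a linear blend of the parent value and the child's upscale mean, and skips the update on the timestep right after a restart output time. The function name and the blending rule are assumptions, and the elevation adjustment is omitted.

```python
def blend_upscale(parent_value, child_mean, child_weight, follows_restart_output):
    """Update one parent value from the child's upscale mean.

    Assumed blending rule: new = (1 - w) * parent + w * child, with w the
    user-specified child weight. Data are not incorporated on the timestep
    immediately after a restart output time, to keep restarts bit identical."""
    if follows_restart_output:
        return parent_value
    return (1.0 - child_weight) * parent_value + child_weight * child_mean
```

A weight of 0 leaves the parent untouched, and a weight of 1 replaces the parent value with the child's mean.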

Specify Update Variables for BC, Motion, and 2-Way Exchange
● Use the nests.txt file, which (like solver_state.txt) lists desired variables from the Solver internal state.
KEY for moving vbls: H – mass pt, V – velocity pt, L – land sfc, W – water sfc, F – read external file in parent update region, x – parent must update halo when child moves
KEY for 2-way vbls: H – mass pt, V – velocity pt
KEY for boundary vbls: H – mass pt, V – velocity pt

Example of ‘nests.txt’ specifications
The three flag columns hold the moving, 2-way and BC specifications (see the keys above).
### 2-D Integer
‘ISLTYP’  -  F   -  ‘Soil type’
### 2-D Real
‘FIS’     -  F   -  ‘Sfc geopotential (m2 s-2)’
‘CMC’     -  Lx  -  ‘Canopy moisture (m)’
‘SST’     -  Wx  -  ‘Sea surface temperature (K)’
### 3-D Real
‘T’       H  H   H  ‘Sensible temperature (K)’
‘U’       V  V   V  ‘U component of wind (m s-1)’
‘STC’     -  Lx  -  ‘Soil temperature (K)’
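The file layout above is regular enough to parse for quick checks. This sketch assumes one variable per line (a quoted name, three flag columns and a quoted description) and accepts straight or curly single quotes. It keeps the three flags in file order rather than guessing which column is BC, moving or 2-way. The function names are hypothetical, and this is not NMM-B's own reader.

```python
import re

# Quoted name, three whitespace-separated flag columns, quoted description.
LINE = re.compile(r"^[‘']([^’']+)[’']\s+(\S+)\s+(\S+)\s+(\S+)\s+[‘'](.*)[’']\s*$")


def parse_nests_txt(text):
    """Return (section, name, flags, description) tuples; '###' lines start sections."""
    section, entries = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("###"):
            section = line.lstrip("#").strip()
        elif (m := LINE.match(line)):
            name, f1, f2, f3, desc = m.groups()
            entries.append((section, name, (f1, f2, f3), desc))
    return entries


def parent_updates_halo(flags):
    """True if a flag carries the 'x' key (parent must update the halo when the child moves)."""
    return any("x" in f for f in flags)
```

Because ‘x’ appears only among the moving-variable keys, `parent_updates_halo` can pick out variables such as CMC, SST and STC without knowing the column order.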

High Level Order of Execution
Timestepping loop in subroutine NMM_INTEGRATE:
► Children recv BC updates from parents from the end of the current parent timestep.
► Parents recv upscale data from children from the end of the previous parent timestep.
► Domain integrates.
► Parents send BC updates to children who are at the beginning of the current parent timestep.
► Children send upscale data to parents who recv it at the beginning of the next parent timestep.

Run Step of the NMM
The Run step contains a loop over generations (a single iteration for 1-way interaction) and a loop over all (1-way) or some (2-way) forecast timesteps. Each timestep proceeds as:
  CALL phase 2 Parent-Child Coupler Run (children recv BCs from parents)
  CALL phase 3 Parent-Child Coupler Run (parents recv upscale from children)
  CALL phase 1 Domain Run (integrate the forecast one timestep)
  CALL phase 4 Parent-Child Coupler Run (parents send BCs to children)
  CALL phase 5 Parent-Child Coupler Run (children send upscale to parents)
  CALL phase 3 Domain Run (write history/restart)
  Advance the Clock
Other calls in the Run step:
  CALL phase 1 Parent-Child Coupler Run (check 2-way signals)
  CALL phase 2 Domain Run (digital filter)

Example of erratic nest motions due to weak storm(s) interacting with complex terrain. Note how the wind field remains coherent as it evolves within the outer and inner nest domains.

Additional Slides

The Composite Object
● A derived datatype to hold assorted variables used throughout the Parent-Child coupler component.
● Allows tasks lying on multiple domains to easily reference such variables generically when they have different values on different domains.

Composite Object – Defined / Allocated
Top of module before CONTAINS:
  TYPE COMPOSITE
    INTEGER(kind=KINT),DIMENSION(1:3) :: PARENT_SHIFT
  END TYPE COMPOSITE

  TYPE(COMPOSITE),DIMENSION(:),POINTER,SAVE :: CPL_COMPOSITE
  INTEGER(kind=KINT),DIMENSION(:),POINTER :: PARENT_SHIFT

  SUBROUTINE PARENT_CHILD_COUPLER_SETUP
    ALLOCATE(CPL_COMPOSITE(1:NUM_DOMAINS),stat=ISTAT)
  END SUBROUTINE PARENT_CHILD_COUPLER_SETUP

Composite Object - Used
  SUBROUTINE CHILDREN_RECV_PARENT_DATA
    CALL POINT_TO_COMPOSITE(MY_DOMAIN_ID)
    CALL MPI_RECV(PARENT_SHIFT, 3, MPI_INTEGER, …….
  END SUBROUTINE CHILDREN_RECV_PARENT_DATA

  SUBROUTINE POINT_TO_COMPOSITE(MY_DOMAIN_ID)
    INTEGER,INTENT(IN) :: MY_DOMAIN_ID   ! dummy argument
    TYPE(COMPOSITE),POINTER :: CC
    CC => CPL_COMPOSITE(MY_DOMAIN_ID)    ! this domain's composite
    PARENT_SHIFT => CC%PARENT_SHIFT      ! generic name now refers to this domain's data
  END SUBROUTINE POINT_TO_COMPOSITE

1-Way Communication Between a Parent and Child
‣ MPI intercommunicators are very convenient for this.
‣ The lead tasks on both domains have rank 0.
‣ MPI sends/recvs use simple target and sender task ranks.

Example of an Intercommunicator
The global task ranks (unique task assignments to domains):
  Parent – 25, 26, 27
  Child – 52, 53, 54, 55
The intercommunicator task ranks:
  Parent – 0, 1, 2
  Child – 0, 1, 2, 3

Parent and Child Communications with Generations
‣ MPI intercommunicators cannot be used because parent and child may share some of the same tasks, and MPI does not allow global task ranks to be repeated in intercommunicators.
‣ Therefore we use MPI intracommunicators.
‣ Parent/child task ranks may repeat but will lie in a single non-repeating sequence in the communicator.

Example of an Intracommunicator
The global task ranks (tasks can be in more than 1 generation):
  Parent – 3, 4, 5, 6
  Child – 1, 2, 3, 4, 5, 6, 7
The intracommunicator task ranks (parent first):
  Union – 3, 4, 5, 6, 1, 2, 7 -> 0, 1, 2, 3, 4, 5, 6
  Parent – 0, 1, 2, 3
  Child – 4, 5, 0, 1, 2, 3, 6
This requires more bookkeeping in the Init step and variable sources/targets in MPI sends/recvs.
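The rank mapping in this example can be reproduced with a few lines. The sketch below is a hypothetical helper, not NMM-B code. It lists the parent's global ranks first, appends the child's ranks not already present, and takes each rank's position in that union as its new rank. In MPI, passing the union in this order to MPI_Group_incl and then calling MPI_Comm_create would give processes their ranks in the same order. Whether NMM-B builds its communicators exactly this way is not stated here.

```python
def intracomm_ranks(parent_global, child_global):
    """Return (union of global ranks, parent's new ranks, child's new ranks)
    for a parent-first intracommunicator."""
    parent_set = set(parent_global)
    union = list(parent_global) + [g for g in child_global if g not in parent_set]
    new_rank = {g: k for k, g in enumerate(union)}
    return union, [new_rank[g] for g in parent_global], [new_rank[g] for g in child_global]
```

For the example above it returns the union 3, 4, 5, 6, 1, 2, 7, the parent ranks 0 to 3 and the child ranks 4, 5, 0, 1, 2, 3, 6.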

B-grid vs. E-grid
Figure: H and v point layouts of the B-grid and E-grid, with the dx and dy of each marked.
The B-grid is just a rotated E-grid.
