Execution Frontiers: CnC support for highly adaptive execution. Kath Knobe, Intel (Software and Services Group), 12/07/12.


1 Software and Services Group. Execution Frontiers: CnC support for highly adaptive execution. Kath Knobe, Intel, 12/07/12

2 Warning
This is all high-level conceptual thinking.
Many details to be determined.
Today: just the basic idea without any concern for efficiency. Lots of room for optimizing.
Suggestions/comments more than welcome!

3 Motivation: Highly adaptive computing for exascale
Critical exascale issues (inspired by work on UHPC and X-Stack) require the ability to move currently executing parts of the app to another place in the platform or to a later time.
Resilience
− Fragile components
− Lots of them
Power management
− Power components off
− Power components down
Self-aware computing
− Modify mapping based on feedback
Change of goals
− Between power and time to solution, for example
Thesis: management of the execution frontiers in CnC is a mechanism supporting highly adaptive computing for exascale.

4 (Diagram; labels only)
Checkpoint/restart
Hierarchical CnC
Hierarchical checkpoint/restart
For faults
For adaptive execution
2 passes
− Abstract: unlimited resources
− Actual: with resource constraints

5 Outline
Abstract (platform has infinite memory and processors)
− Semantic state
− Checkpoint/restart
− Hierarchical CnC
− Hierarchical checkpoint/restart
Actual (with resource constraints)
Beyond faults

6 Outline
Abstract
− Semantic state
− Checkpoint/restart
− Hierarchical CnC
− Hierarchical checkpoint/restart
Actual
Beyond faults

8-13 Semantics / execution model (diagram built up over six slides; labels only)
Slide 8: item avail, tag avail
Slide 9: adds step controlReady, step dataReady
Slides 10-12: add step ready
Slide 13: adds step executed

14 Semantics / execution model
(Diagram labels: item avail; tag avail; step controlReady; step dataReady; step ready; step executed)
The primitive attributes come from below: available, executed.
The derived attributes propagate at this level: control_ready, data_ready, ready.
2 levels:
− Graph level (above)
− User serial code level (below)
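
A minimal Python sketch of how the derived attributes could be computed from the primitive ones. It assumes the usual CnC reading, which the slide's diagram presumably showed: a step instance is control_ready once its prescribing tag is available, data_ready once all of its input items are available, and ready when both hold. The names `StepInstance`, `inputs`, and the tuple keys are illustrative and not part of any CnC API.

```python
from dataclasses import dataclass


@dataclass
class StepInstance:
    name: str
    tag: tuple               # the tag instance that prescribes this step (assumed layout)
    inputs: list             # keys of the items this step instance gets
    executed: bool = False   # primitive attribute: set from the serial code below


def control_ready(step, avail_tags):
    # Derived: the prescribing tag has become available.
    return step.tag in avail_tags


def data_ready(step, avail_items):
    # Derived: every input item has become available.
    return all(key in avail_items for key in step.inputs)


def ready(step, avail_tags, avail_items):
    # Derived: both conditions hold, so the step may be scheduled.
    return control_ready(step, avail_tags) and data_ready(step, avail_items)


step = StepInstance("trisolve", ("TrisolveTag", 2, 1), [("Array", 1, 2, 1)])
avail_tags = {("TrisolveTag", 2, 1)}
avail_items = set()
before = ready(step, avail_tags, avail_items)   # tag present, item missing
avail_items.add(("Array", 1, 2, 1))
after = ready(step, avail_tags, avail_items)    # both present
```

Only `available` and `executed` are ever set directly; everything else is recomputed from them, which is what makes the graph level independent of the serial code below it.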

15 Execution frontier
An execution frontier is a CnC program state:
− The set of attributes of instances of steps, tags and items
− The contents of available items
CnC execution can proceed from an execution frontier.
Some examples of execution frontiers:
− Normal program input (set of available items and tags)
− Normal program output (set of available items and tags)
− Any state during execution (more general)
Perspective
− Traditional focus:
> Data structure is items; computation is step.
> Step instance consumes and produces items.
− Alternate view:
> Data structure is execution frontier; computation is step, subgraph or full program.
> Applying a computation to an execution frontier yields another execution frontier.
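
The alternate view can be sketched in Python, assuming a frontier is represented as a plain dictionary of items, tags, and executed step names (a hypothetical layout chosen only for illustration). `apply_step` treats a computation as a function from one frontier to the next and leaves its input untouched.

```python
def apply_step(frontier, step):
    """Apply one computation to a frontier and return the resulting frontier."""
    new_items, new_tags = step(frontier["items"])
    return {
        "items": {**frontier["items"], **new_items},
        "tags": frontier["tags"] | new_tags,
        "executed": frontier["executed"] | {step.__name__},
    }


def double_x(items):
    # A toy step: gets item "x", puts item "y" and one new tag.
    return {"y": 2 * items["x"]}, {("next", 0)}


start = {
    "items": {"x": 21},
    "tags": frozenset({("double", 0)}),
    "executed": frozenset(),
}
end = apply_step(start, double_x)
```

Because the result is again a frontier, the same function shape works whether the computation is a single step, a subgraph, or the whole program.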

16 Outline
Abstract
− Semantic state
− Checkpoint/restart
− Hierarchical CnC
− Hierarchical checkpoint/restart
Actual
Beyond faults

17 Checkpoint/restart summary (abstract)
Changes to the execution frontier are saved continuously as they occur.
Changes are saved in a less volatile “place”.
Asynchronous, no barriers.
No programmer involvement.
Saved state may not correspond to an actual state.
Can restart from any saved state.
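
One way to picture continuous, barrier-free saving is an append-only log of frontier changes that can be replayed up to any point. The sketch below is an assumption about representation, not the design from the talk; `CheckpointLog`, `record`, and `restore` are hypothetical names.

```python
class CheckpointLog:
    """Append-only record of frontier changes, kept in a less volatile place."""

    def __init__(self):
        self._deltas = []

    def record(self, kind, key, value=None):
        # Called as each change occurs; no barrier, no programmer involvement.
        self._deltas.append((kind, key, value))

    def restore(self, upto=None):
        # Rebuild a frontier from the first `upto` changes (all of them if None).
        frontier = {"items": {}, "tags": set(), "executed": set()}
        for kind, key, value in self._deltas[:upto]:
            if kind == "item":
                frontier["items"][key] = value
            elif kind == "tag":
                frontier["tags"].add(key)
            elif kind == "executed":
                frontier["executed"].add(key)
        return frontier


log = CheckpointLog()
log.record("tag", ("CholeskyTag", 0))
log.record("item", ("Array", 0, 0, 0), 4.0)
log.record("executed", ("Cholesky", 0))
partial = log.restore(upto=2)
full = log.restore()
```

A prefix such as `partial` may never have existed as an observed state of the running program, yet execution can still proceed from it, since it only ever contains changes that really happened.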

18 Outline
Abstract
− Semantic state
− Checkpoint/restart
− Hierarchical CnC
− Hierarchical checkpoint/restart
Actual
Beyond faults

19 Cholesky domain spec
CONTROL TAG
− CholeskyTag: iter
− TrisolveTag: row, iter
− UpdateTag: col, row, iter
COMPUTE STEP
− Cholesky: iter
− Trisolve: row, iter
− Update: col, row, iter
DATA ITEM
− Array: col, row, iter
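
To make the tag components concrete, the sketch below enumerates tag instances for a small tiled matrix. The index ranges are an assumption taken from the common tiled Cholesky formulation (trisolve on rows below the diagonal tile, update on the lower triangle of the trailing tiles); the slide itself only names the components. `cholesky_tags` is a hypothetical helper.

```python
def cholesky_tags(n):
    """Enumerate tag instances for an n x n grid of tiles (assumed index ranges)."""
    tags = {"CholeskyTag": [], "TrisolveTag": [], "UpdateTag": []}
    for it in range(n):
        tags["CholeskyTag"].append((it,))                   # CholeskyTag: iter
        for row in range(it + 1, n):
            tags["TrisolveTag"].append((row, it))           # TrisolveTag: row, iter
            for col in range(it + 1, row + 1):
                tags["UpdateTag"].append((col, row, it))    # UpdateTag: col, row, iter
    return tags


tags = cholesky_tags(3)
counts = {name: len(instances) for name, instances in tags.items()}
```

Each tag instance prescribes one instance of the step with the matching components, so the same enumeration also lists the step instances.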

20 Looks like a CnC spec at each level
(Diagram labels: CONTROL TAG; COMPUTE STEP (C: iter))

21 Looks like a CnC spec at each level
(Diagram labels: iterations; CONTROL TAG; COMPUTE STEP (cholesky:); COMPUTE STEP (C: iter); COMPUTE STEP (TU:))

22 Looks like a CnC spec at each level
(Diagram labels: CONTROL TAG; COMPUTE STEP (C: iter); COMPUTE STEP (U:); COMPUTE STEP (trisolve); CONTROL TAG; COMPUTE STEP (cholesky:); COMPUTE STEP (TU:))

23 Executed semantics: leaf
(Diagram: COMPUTE STEP (trisolve: row) above schematic serial code: get… … =.. + … *… / … = … if … put)
Executed is a primitive attribute. It comes from below.
− Leaf: termination of the serial code below

24 Executed semantics: non-leaf
(Diagram labels: COMPUTE STEP (U:); COMPUTE STEP (trisolve); CONTROL TAG; COMPUTE STEP (TU:))
Executed is a primitive attribute. It comes from below.
− Leaf: termination of the serial code below
− Non-leaf: termination of the subgraph below
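
The leaf and non-leaf cases can be expressed as one recursive definition. This Python sketch simplifies by assuming the children of a non-leaf node are all known up front, whereas a real CnC subgraph grows dynamically; `Node` and `terminated` are illustrative names.

```python
class Node:
    """A step in a hierarchical CnC application (illustrative model)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.terminated = False   # leaf only: the serial code below has finished

    def executed(self):
        if not self.children:
            # Leaf: termination of the serial code below.
            return self.terminated
        # Non-leaf: termination of the whole subgraph below.
        return all(child.executed() for child in self.children)


trisolve = Node("trisolve")
update = Node("U")
tu = Node("TU", [trisolve, update])

trisolve.terminated = True
half_done = tu.executed()     # update has not terminated yet
update.terminated = True
all_done = tu.executed()
```

At either level the parent sees only the single `executed` attribute, so it does not need to know whether the child is serial code or another graph.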

25 Hierarchical CnC application: execution is at the leaves only
(Diagram labels: Cholesky; trisolve; update)

26 Hierarchical CnC application: intermediate nodes maintain state
(Diagram labels: State of each iteration; State of each row)

27 Hierarchical view of the abstract platform tree
A node looks like a full machine at each level: a subtree of the memory hierarchy + the associated set of cores.
(Diagram label: Hierarchical platform node)

28 Abstract platform
Depth and extent of the platform hierarchy correspond exactly to the depth and extent of the dynamic application.
The mapping is direct.

29 Outline
Abstract
− Semantic state
− Checkpoint/restart
− Hierarchical CnC
− Hierarchical checkpoint/restart
Actual
Beyond faults

30-33 Hierarchical checkpoint/restart (abstract) (diagram built up over four slides)
(Diagram label: Hierarchical application node)
Checkpoint for that application node resides at the parent place.
Distinct checkpoints residing at a single place remain separate. We will see why later.

34 Abstract failure model
The system knows if/when a node fails
− We’re not talking about soft errors
Abstract platform node fails temporarily then returns

35-38 Hierarchical checkpoint/restart (abstract): 1-level
(Diagram sequence over four slides: Checkpoint, Fault, Fullstop, Restart)

39-42 Hierarchical checkpoint/restart (abstract): checkpoint in hierarchy
(Diagram sequence over four slides: Checkpoint, Fault, Fullstop, Restart)

43 Hierarchical checkpoint/restart (abstract): checkpoint in hierarchy
(Diagram sequence: Checkpoint, Fault, Fullstop, Restart)
From above: the step simply looks like it took longer than expected.
Checkpoint/fullstop at one node looks like checkpoint/continue for the whole program.
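
A toy Python simulation of the checkpoint, fault, fullstop, restart sequence at a single node. It assumes the abstract failure model (the node fails once, then returns) and a checkpoint held outside the failing node; `run_child`, `run_with_restart`, and the checkpoint layout are hypothetical.

```python
class Fault(Exception):
    """Signals that the node running the child has failed."""


def run_child(work, checkpoint, fail_at=None):
    # Process work in order, saving progress after every element.
    for i in range(checkpoint["next"], len(work)):
        if i == fail_at:
            raise Fault(f"node failed before element {i}")
        checkpoint["total"] += work[i]
        checkpoint["next"] = i + 1
    return checkpoint["total"]


def run_with_restart(work, fail_at=None):
    checkpoint = {"next": 0, "total": 0}   # resides at the parent place
    restarts = 0
    while True:
        try:
            return run_child(work, checkpoint, fail_at), restarts
        except Fault:
            restarts += 1
            fail_at = None   # abstract model: the node fails temporarily, then returns


result, restarts = run_with_restart([1, 2, 3, 4], fail_at=2)
clean_result, clean_restarts = run_with_restart([1, 2, 3, 4])
```

The caller of `run_with_restart` gets the same result in both runs; the fault shows up only as extra time and a restart count it is free to ignore.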

44 Hierarchical checkpoint/restart: Summary
Each node in a hierarchy has all the characteristics of a whole program checkpoint.
Checkpoint/fullstop/restart at nodes in the hierarchy enables the application as a whole to adapt and continue through faults.

45 Outline
Abstract
Actual: with resources and resource constraints
− Semantic state
− Checkpoint/restart
− Hierarchical CnC
− Hierarchical checkpoint/restart
Beyond faults

46 Semantic state for execution (limited memory)
Checkpointed information leaves the trailing edge of the execution frontier
− Dead tags
− Dead items
− Dead steps
This is the motivation for the term “execution frontier” as opposed to “execution state”. It’s only the relevant frontier of the state.
Dead is a derived attribute. It doesn’t propagate up from the children. It is derived independently within each (sub)program.
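
The slide does not say how `dead` is derived. One plausible reading, used in the Python sketch below purely as an assumption, is that an item is dead once every step instance that gets it has executed. The `consumers` map and the `"env"` pseudo-consumer for program outputs are hypothetical.

```python
def dead_items(consumers, executed):
    """Items whose every consumer has executed (assumed definition of dead).

    consumers maps each item key to the set of step instances that get it.
    """
    return {item for item, steps in consumers.items() if steps <= executed}


def trim_frontier(items, consumers, executed):
    # Drop dead items from the trailing edge, keeping only the live frontier.
    dead = dead_items(consumers, executed)
    return {key: value for key, value in items.items() if key not in dead}


items = {"a": 1, "b": 2, "out": 3}
consumers = {
    "a": {"s1"},
    "b": {"s1", "s2"},
    "out": {"env"},    # program output: "consumed" by the environment, never dead here
}
live = trim_frontier(items, consumers, executed={"s1"})
```

The derivation uses only information local to this (sub)program, which matches the slide's point that dead is not propagated up from children.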

47 Hierarchical CnC map to actual platform
Platform: limited depth / limited extent at each level
(Diagram labels: Platform hierarchy; Application hierarchy)

48 Hierarchical CnC map to actual platform
Flatten the depth
(Diagram labels: Platform hierarchy; Application hierarchy)

49 Hierarchical CnC map to actual platform
Fold extent
(Diagram labels: Platform hierarchy; Application hierarchy)
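
The slides give only the names of the two adjustments. As a purely illustrative policy (an assumption, not the mapping from the talk), flattening can be pictured as truncating an application node's path to the platform's depth, and folding as wrapping each child index into the platform's fan-out. `map_to_platform` is a hypothetical helper.

```python
def map_to_platform(app_path, platform_depth, platform_fanout):
    """Map an application-tree path to a platform-tree path (illustrative policy).

    app_path is the sequence of child indices from the application root.
    """
    flattened = app_path[:platform_depth]                            # flatten the depth
    return tuple(index % platform_fanout for index in flattened)     # fold the extent


# An application node three levels deep, on a platform with 2 levels and fan-out 4.
place = map_to_platform((3, 5, 2), platform_depth=2, platform_fanout=4)
```

Several application nodes can land on the same platform node under such a policy, which is why the platform node has to keep their checkpoints apart.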

50 Actual failure model
Platform node fails and may not return
− or we don’t want to wait until it returns
Restart is at some other platform node

51-53 Remapping (diagram built up over three slides)
(Diagram labels: A, B, X, Y, Map)
Original checkpoint of B is at X.
New checkpoint of B is at Y.
Follows the new platform location.

54 Remapping
(Diagram labels: A, B, X, Y, Map)
Original checkpoint of B is at X.
New checkpoint of B is at Y.
Follows the new platform location.
This is why we don’t want to merge checkpoints of the application children at the platform parent. We may want to relocate each child independently.
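
A small Python sketch of why separate per-child checkpoints make relocation simple. It assumes a checkpoint lives at the platform parent of wherever its application node is mapped; the place names `x1`, `x2`, `y1` and the helper names are hypothetical.

```python
def checkpoint_place(app_node, mapping, platform_parent):
    # The checkpoint resides at the parent of the node's current platform location.
    return platform_parent[mapping[app_node]]


def remap(app_node, new_place, mapping, platform_parent, checkpoints):
    """Move one application node and let its checkpoint follow it."""
    old_home = checkpoint_place(app_node, mapping, platform_parent)
    mapping[app_node] = new_place
    new_home = checkpoint_place(app_node, mapping, platform_parent)
    saved = checkpoints[old_home].pop(app_node)
    checkpoints.setdefault(new_home, {})[app_node] = saved


platform_parent = {"x1": "X", "x2": "X", "y1": "Y"}
mapping = {"A": "x1", "B": "x2"}
checkpoints = {"X": {"A": {"next": 4}, "B": {"next": 7}}}   # kept separate per child

remap("B", "y1", mapping, platform_parent, checkpoints)
```

Had the checkpoints of A and B been merged at X, moving B alone would have required splitting the merged state first.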

55 What do we have?
A way of maintaining the execution frontier of
− A running application
− A running subgraph of an application
A mechanism for taking an execution frontier and moving it
− To another place
− To a later time
Use of this to cope with faults

56 Outline
Abstract
Actual: with resources and resource constraints
Beyond faults

57 Adaptive execution
If we can checkpoint and continue elsewhere on a fault, we can checkpoint and continue elsewhere for our own reasons.
Big relevant exascale issues:
− Resilience: actual/predicted failures
− Power management
− Self-aware computing
− Changes in goals
Mechanism not policy!
Status:
− No staffing or funding yet.

58 Other uses of execution frontiers
Mechanism for connecting reusable components
Low priority app
− Execute/checkpoint/restart one step at a time
− Stop mid-step when high priority work arrives
Long-lived app with very slowly arriving input
− e.g., phylogenetic tree for SARS virus
Debugging
− View state
− Reverse time (undo)
Soft errors
− Compute more than once. Compare
Something like out-of-core computation but not baked into application
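
The soft-error item ("compute more than once, compare") can be sketched in Python as follows, assuming a step is a deterministic function of its input items so that any disagreement between replicas signals an error. `run_redundantly` and the toy steps are hypothetical.

```python
def run_redundantly(step, inputs, copies=2):
    """Run a step several times on copies of its inputs and compare the outputs."""
    results = [step(dict(inputs)) for _ in range(copies)]
    if any(result != results[0] for result in results[1:]):
        raise RuntimeError("replicas disagree: possible soft error")
    return results[0]


def scale(items):
    # A deterministic toy step.
    return {"y": 3 * items["x"]}


calls = {"count": 0}


def flaky_scale(items):
    # Simulates a soft error: the second execution returns a corrupted value.
    calls["count"] += 1
    corruption = 1 if calls["count"] == 2 else 0
    return {"y": 3 * items["x"] + corruption}


good = run_redundantly(scale, {"x": 2})
try:
    run_redundantly(flaky_scale, {"x": 2})
    detected = False
except RuntimeError:
    detected = True
```

Comparing the outputs of a step is a special case of comparing two execution frontiers, which is why the same mechanism applies to whole subgraphs.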

59 Potential: Forms & operations
Forms
As executing
− general, arrays, trees…
Serialized
Streaming
Encrypted
Compressed
Database
Excel
Human readable
Operations
Save/restore
Partition/specialize
− At fork into distinct large subgraphs
Merge
− At join of distinct large subgraphs
Send
Compare (e.g., for fault tolerance)
Explicitly modify (e.g., debug)
Rename collections (e.g., for composition)
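
Two of the listed ideas, a serialized human-readable form and a merge operation, can be sketched in Python with the standard `json` module. The frontier layout (string keys, a set of string tags) and the conflict rule in `merge` are assumptions made for illustration.

```python
import json


def serialize(frontier):
    """Serialized, human-readable form of a frontier."""
    payload = {"items": frontier["items"], "tags": sorted(frontier["tags"])}
    return json.dumps(payload, sort_keys=True)


def deserialize(text):
    data = json.loads(text)
    return {"items": data["items"], "tags": set(data["tags"])}


def merge(a, b):
    """Merge two frontiers at a join; shared items must agree (assumed rule)."""
    for key in a["items"].keys() & b["items"].keys():
        if a["items"][key] != b["items"][key]:
            raise ValueError(f"conflicting values for item {key!r}")
    return {"items": {**a["items"], **b["items"]}, "tags": a["tags"] | b["tags"]}


left = {"items": {"x": 1}, "tags": {"t0"}}
right = {"items": {"y": 2}, "tags": {"t1"}}
joined = merge(left, right)
round_trip = deserialize(serialize(joined))
```

The same serialized text could also serve the send and compare operations, since two frontiers with equal content serialize to equal strings here.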

60 Relook at motivation: Highly adaptive computing for exascale
Critical exascale issues: require the ability to move currently executing parts of the app to another place in the platform or to a later time.
Resilience
− Fragile components
− Lots of them
Power management
− Power components off
− Power components down
Self-aware computing
− Modify mapping based on feedback
Change of goals
− Between power and time to solution, for example
Looking forward to:
Lowering the design
Implementation
Experimenting
Looking for feedback and collaborators
