Execution Frontiers: CnC support for highly adaptive execution. Kath Knobe, Intel, 12/07/12.

Presentation transcript:

1 Execution Frontiers: CnC support for highly adaptive execution
Kath Knobe, Intel (Software and Services Group), 12/07/12

2 Warning
This is all high-level conceptual thinking. Many details remain to be determined.
Today: just the basic idea, without any concern for efficiency. There is lots of room for optimizing.
Suggestions and comments are more than welcome!

3 Motivation: Highly adaptive computing for exascale
Critical exascale issues (inspired by work on UHPC and X-Stack) require the ability to move currently executing parts of the app to another place in the platform or to a later time.
Resilience
−Fragile components
−Lots of them
Power management
−Power components off
−Power components down
Self-aware computing
−Modify mapping based on feedback
Change of goals
−Between power and time to solution, for example
Thesis: management of the execution frontiers in CnC is a mechanism supporting highly adaptive computing for exascale.

4 [Diagram labels: Checkpoint/restart; Hierarchical CnC; Hierarchical checkpoint/restart; For faults; For adaptive execution]
2 passes
−Abstract: unlimited resources
−Actual: with resource constraints

5 Outline
Abstract (platform has infinite memory and processors)
−Semantic state
−Checkpoint/restart
−Hierarchical CnC
−Hierarchical checkpoint/restart
Actual (with resource constraints)
Beyond faults

6 Outline
Abstract
−Semantic state
−Checkpoint/restart
−Hierarchical CnC
−Hierarchical checkpoint/restart
Actual
Beyond faults

8 Semantics / execution model
[Diagram labels: item avail; tag]

9 Semantics / execution model
[Diagram labels: item avail; step controlReady; step dataReady; tag avail]

10 Semantics / execution model
[Diagram labels: item avail; step controlReady; step dataReady; step ready; tag avail]

13 Semantics / execution model
[Diagram labels: item avail; step controlReady; step dataReady; step ready; step executed; tag avail]

14 Semantics / execution model
[Diagram labels: item avail; step controlReady; step dataReady; step ready; step executed; tag avail]
The primitive attributes come from below: available, executed.
The derived attributes propagate at this level: control_ready, data_ready, ready.
2 levels:
−Graph level (above)
−User serial code level (below)
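Editor's sketch of how the derived attributes might be computed from the primitive ones. The slide does not define them; this assumes the usual CnC reading (a step instance is control-ready when its prescribing tag is available, data-ready when every item it reads is available, and ready when both hold). The function name `derive_attributes` and the `reads` table are hypothetical, introduced only for illustration.

```python
def derive_attributes(avail_tags, avail_items, reads):
    """Compute derived attributes for each step instance.

    avail_tags  -- set of available tags (primitive attribute)
    avail_items -- set of keys of available items (primitive attribute)
    reads       -- hypothetical table: step instance -> item keys it reads
    """
    attrs = {}
    for step, needed in reads.items():
        control_ready = step in avail_tags
        data_ready = all(key in avail_items for key in needed)
        attrs[step] = {
            "control_ready": control_ready,
            "data_ready": data_ready,
            "ready": control_ready and data_ready,
        }
    return attrs


# Two step instances, identified here by the tag that prescribes them.
reads = {
    ("trisolve", 1): [("A", 0)],
    ("update", 1): [("A", 0), ("A", 1)],
}
attrs = derive_attributes(
    avail_tags={("trisolve", 1), ("update", 1)},
    avail_items={("A", 0)},
    reads=reads,
)
```

In this example both steps are control-ready, but only the trisolve instance is ready, because the update instance still waits for item ("A", 1).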

15 Execution frontier
An execution frontier is a CnC program state:
−The set of attributes of instances of steps, tags and items
−The contents of available items
CnC execution can proceed from an execution frontier.
Some examples of execution frontiers:
−Normal program input (set of available items and tags)
−Normal program output (set of available items and tags)
−Any state during execution (more general)
Perspective
−Traditional focus:
>Data structure is items; computation is step.
>A step instance consumes and produces items.
−Alternate view:
>Data structure is execution frontier; computation is step, subgraph or full program.
>Applying a computation to an execution frontier yields another execution frontier.
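Editor's sketch of the alternate view: the frontier is the data structure, and applying a step to a frontier yields a new frontier. The `Frontier` class, `apply_step` and the `double` step body are hypothetical names, not part of any CnC API, and a real frontier would carry more attributes than the three shown.

```python
from dataclasses import dataclass, field


@dataclass
class Frontier:
    items: dict = field(default_factory=dict)   # contents of available items
    tags: set = field(default_factory=set)      # available tags
    executed: set = field(default_factory=set)  # step instances that have executed


def apply_step(frontier, tag, body):
    """Apply one step instance and return a new frontier; the input is left unchanged."""
    produced_items, produced_tags = body(tag, frontier.items)
    return Frontier(
        items={**frontier.items, **produced_items},
        tags=frontier.tags | set(produced_tags),
        executed=frontier.executed | {tag},
    )


def double(tag, items):
    """Hypothetical step body: reads item ("x", tag), produces item ("y", tag), no new tags."""
    return {("y", tag): 2 * items[("x", tag)]}, []


start = Frontier(items={("x", 0): 21}, tags={0})
end = apply_step(start, 0, double)
```

Because `apply_step` returns a fresh value, `start` stays usable as a saved state while `end` is the state execution proceeds from.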

16 Outline
Abstract
−Semantic state
−Checkpoint/restart
−Hierarchical CnC
−Hierarchical checkpoint/restart
Actual
Beyond faults

17 Checkpoint/restart summary (abstract)
Changes to the execution frontier are saved continuously as they occur.
Changes are saved in a less volatile “place”.
Asynchronous, no barriers
No programmer involvement
Saved state may not correspond to an actual state.
Can restart from any saved state
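Editor's sketch of continuous checkpointing as an append-only log of frontier changes. This is one possible reading of the summary, not the design from the talk; the `CheckpointLog` class and its methods are hypothetical. It relies on CnC items being single-assignment, so replaying any prefix of the log gives a state that execution can proceed from, even if the running program never held exactly that state.

```python
class CheckpointLog:
    """Append-only record of frontier changes, kept in a less volatile place."""

    def __init__(self):
        self.log = []

    def record(self, kind, key, value=None):
        # Called as each change occurs; no barrier, no global snapshot.
        self.log.append((kind, key, value))

    def restore(self, upto=None):
        """Rebuild (items, tags) from the first `upto` changes (all of them if None)."""
        items, tags = {}, set()
        for kind, key, value in self.log[:upto]:
            if kind == "item":
                items[key] = value
            elif kind == "tag":
                tags.add(key)
        return items, tags


ckpt = CheckpointLog()
ckpt.record("item", "a", 1)
ckpt.record("tag", 7)
ckpt.record("item", "b", 2)

full_state = ckpt.restore()
earlier_state = ckpt.restore(upto=2)
```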

18 Outline
Abstract
−Semantic state
−Checkpoint/restart
−Hierarchical CnC
−Hierarchical checkpoint/restart
Actual
Beyond faults

19 Cholesky domain spec
CONTROL TAG: CholeskyTag: iter; TrisolveTag: row, iter; UpdateTag: col, row, iter
COMPUTE STEP: Cholesky: iter; Trisolve: row, iter; Update: col, row, iter
DATA ITEM: Array: col, row, iter

20 Looks like a CnC spec at each level
[Diagram labels: CONTROL TAG; COMPUTE STEP (C: iter)]

21 Looks like a CnC spec at each level
[Diagram labels: iterations; CONTROL TAG; COMPUTE STEP (cholesky:); COMPUTE STEP (C: iter); COMPUTE STEP (TU:)]

22 Looks like a CnC spec at each level
[Diagram labels: CONTROL TAG; COMPUTE STEP (C: iter); COMPUTE STEP (U:); COMPUTE STEP (trisolve); CONTROL TAG; COMPUTE STEP (cholesky:); COMPUTE STEP (TU:)]

23 Executed semantics: leaf
[Diagram: COMPUTE STEP (trisolve: row) above its serial code (get … put)]
Executed is a primitive attribute. It comes from below.
−Leaf: termination of the serial code below

24 Executed semantics: non-leaf
[Diagram labels: COMPUTE STEP (TU:); CONTROL TAG; COMPUTE STEP (trisolve); COMPUTE STEP (U:)]
Executed is a primitive attribute. It comes from below.
−Leaf: termination of the serial code below
−Non-leaf: termination of the subgraph below
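Editor's sketch of the two cases of "executed". It simplifies by assuming that the children of a non-leaf node are all known when the question is asked; in a dynamic CnC graph that would need more care. `Node`, `serial_done` and `executed` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    serial_done: bool = False  # meaningful for leaves: the serial code has terminated


def executed(node):
    """Leaf: the serial code below terminated. Non-leaf: the subgraph below terminated."""
    if not node.children:
        return node.serial_done
    return all(executed(child) for child in node.children)


# The TU step from the slides, with trisolve finished and U still running.
tu = Node("TU", [Node("trisolve", serial_done=True), Node("U")])
before = executed(tu)

tu.children[1].serial_done = True
after = executed(tu)
```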

25 Hierarchical CnC application: execution is at the leaves only
[Diagram labels: Cholesky; trisolve; update]

26 Hierarchical CnC application: intermediate nodes maintain state
[Diagram labels: State of each iteration; State of each row]

27 Hierarchical view of the abstract platform tree
A node looks like a full machine at each level: a subtree of the memory hierarchy + the associated set of cores.
[Diagram label: Hierarchical platform node]

28 Abstract platform
The depth and extent of the platform hierarchy correspond exactly to the depth and extent of the dynamic application.
The mapping is direct.

29 Outline
Abstract
−Semantic state
−Checkpoint/restart
−Hierarchical CnC
−Hierarchical checkpoint/restart
Actual
Beyond faults

30 Hierarchical checkpoint/restart (abstract)
[Diagram label: Hierarchical application node]

31 Hierarchical checkpoint/restart (abstract)
Checkpoint for that application node
[Diagram label: Hierarchical application node]

32 Hierarchical checkpoint/restart (abstract)
Checkpoint for that application node resides at the parent place.
[Diagram label: Hierarchical application node]

33 Hierarchical checkpoint/restart (abstract)
Checkpoint for that application node resides at the parent place.
[Diagram label: Hierarchical application node]
Distinct checkpoints residing at a single place remain separate. We will see why later.

34 Abstract failure model
The system knows if/when a node fails.
−We’re not talking about soft errors
An abstract platform node fails temporarily, then returns.

35 Hierarchical checkpoint/restart (abstract)
[Diagram labels: 1-level; Checkpoint; Fault; Fullstop; Restart]

39 Hierarchical checkpoint/restart (abstract)
[Diagram labels: Checkpoint in hierarchy; Fault; Fullstop; Restart]

43 Hierarchical checkpoint/restart (abstract)
[Diagram labels: Checkpoint in hierarchy; Fault; Fullstop; Restart]
From above: the step simply looks like it took longer than expected.
Checkpoint/fullstop at one node looks like checkpoint/continue for the whole program.
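Editor's sketch of checkpoint/fullstop/restart at one node. It assumes the abstract failure model (the node fails temporarily and then returns) and a checkpoint that lives at the parent place, so it survives the child's fault. All names (`Fault`, `run_child`, `run_parent`, `make_step`) are hypothetical.

```python
class Fault(Exception):
    """Stands in for a detected failure of the child's platform node."""


def run_child(work, checkpoint, fail_on=None):
    """Run named steps in order, saving each result in the parent's checkpoint."""
    for name, fn in work:
        if name in checkpoint:
            continue  # saved before the fault, so not re-executed
        if name == fail_on:
            raise Fault(name)
        checkpoint[name] = fn()


def run_parent(work, fail_on):
    checkpoint = {}  # resides at the parent place, separate from the child
    attempts = 1
    try:
        run_child(work, checkpoint, fail_on)
    except Fault:
        attempts += 1
        run_child(work, checkpoint)  # restart from the saved state
    return checkpoint, attempts


calls = []


def make_step(name, value):
    def fn():
        calls.append(name)
        return value
    return fn


work = [(n, make_step(n, v)) for n, v in [("a", 1), ("b", 2), ("c", 3)]]
result, attempts = run_parent(work, fail_on="b")
```

The caller of `run_parent` sees only the final result: the fault and restart are absorbed at this node, and step "a" is not executed a second time.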

44 Hierarchical checkpoint/restart: Summary
Each node in a hierarchy has all the characteristics of a whole-program checkpoint.
Checkpoint/fullstop/restart at nodes in the hierarchy enables the application as a whole to adapt and continue through faults.

45 Outline
Abstract
Actual: with resources and resource constraints
−Semantic state
−Checkpoint/restart
−Hierarchical CnC
−Hierarchical checkpoint/restart
Beyond faults

46 Semantic state for execution (limited memory)
Checkpointed information leaves the trailing edge of the execution frontier:
−Dead tags
−Dead items
−Dead steps
This is the motivation for the term “execution frontier” as opposed to “execution state”. It’s only the relevant frontier of the state.
Dead is a derived attribute. It doesn’t propagate up from the children. It is derived independently within each (sub)program.
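Editor's sketch of dropping dead items from the frontier. The slide does not say how "dead" is derived; this assumes, purely for illustration, that each item carries a count of reads still to come, and treats an item as dead when that count reaches zero. `drop_dead` and `remaining_reads` are hypothetical names.

```python
def drop_dead(items, remaining_reads):
    """Keep only items that some step may still read.

    remaining_reads -- assumed table: item key -> number of reads still expected.
    Items with no entry are kept, since nothing is known about them.
    """
    return {
        key: value
        for key, value in items.items()
        if remaining_reads.get(key, 1) > 0
    }


items = {("A", 0): 1.0, ("A", 1): 2.0, ("A", 2): 3.0}
live = drop_dead(items, {("A", 0): 0, ("A", 1): 2})
```

Here ("A", 0) leaves the frontier, while ("A", 1) and the untracked ("A", 2) stay.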

47 Hierarchical CnC map to actual platform
Platform: limited depth / limited extent at each level
[Diagram labels: Platform hierarchy; Application hierarchy]

48 Hierarchical CnC map to actual platform
Flatten the depth
[Diagram labels: Platform hierarchy; Application hierarchy]

49 Hierarchical CnC map to actual platform
Fold extent
[Diagram labels: Platform hierarchy; Application hierarchy]

50 Actual failure model
A platform node fails and may not return.
−Or we don’t want to wait until it returns
Restart is at some other platform node.

51 Remapping
[Diagram labels: A; B; Map]

52 Remapping
[Diagram labels: A; B; Map]

53 Remapping
[Diagram labels: X; Y; A; B; Map]
Original checkpoint of B is at X.
New checkpoint of B is at Y. It follows the new platform location.

54 Remapping
[Diagram labels: X; Y; A; B; Map]
Original checkpoint of B is at X.
New checkpoint of B is at Y. It follows the new platform location.
This is why we don’t want to merge checkpoints of the application children at the platform parent. We may want to relocate each child independently.
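Editor's sketch of remapping one application child while its sibling stays put. Checkpoints are keyed by (place, application node), so moving B's checkpoint from X to Y does not touch A's. The platform node names P1, P2, P3 and the function `remap` are hypothetical; only A, B, X and Y come from the slides.

```python
def remap(app_node, new_platform_node, mapping, parent_of, checkpoints):
    """Move one application node, and its checkpoint, to a new platform location."""
    old_place = parent_of[mapping[app_node]]
    new_place = parent_of[new_platform_node]
    # The checkpoint follows the new platform location.
    checkpoints[(new_place, app_node)] = checkpoints.pop((old_place, app_node))
    mapping[app_node] = new_platform_node


mapping = {"A": "P1", "B": "P2"}                # application node -> platform node
parent_of = {"P1": "X", "P2": "X", "P3": "Y"}   # platform node -> parent place
checkpoints = {("X", "A"): {"a": 1}, ("X", "B"): {"b": 2}}

remap("B", "P3", mapping, parent_of, checkpoints)
```

Had the checkpoints of A and B been merged into one record at X, relocating B alone would first require separating them again.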

55 What do we have?
A way of maintaining the execution frontier of
−A running application
−A running subgraph of an application
A mechanism for taking an execution frontier and moving it
−To another place
−To a later time
Use of this to cope with faults

56 Outline
Abstract
Actual: with resources and resource constraints
Beyond faults

57 Adaptive execution
If we can checkpoint and continue elsewhere on a fault, we can checkpoint and continue elsewhere for our own reasons.
Big relevant exascale issues:
−Resilience (actual/predicted failures)
−Power management
−Self-aware computing
−Changes in goals
Mechanism, not policy!
Status:
−No staffing or funding yet.

58 Other uses of execution frontiers
Mechanism for connecting reusable components
Low-priority app
−Execute/checkpoint/restart one step at a time
−Stop mid-step when high-priority work arrives
Long-lived app with very slowly arriving input
−e.g., phylogenetic tree for SARS virus
Debugging
−View state
−Reverse time (undo)
Soft errors
−Compute more than once. Compare.
Something like out-of-core computation, but not baked into the application

59 Potential: Forms & operations
Forms
As executing
−general, arrays, trees…
Serialized
Streaming
Encrypted
Compressed
Database
Excel
Human readable
Operations
Save/restore
Partition/specialize
−At fork into distinct large subgraphs
Merge
−At join of distinct large subgraphs
Send
Compare (e.g., for fault tolerance)
Explicitly modify (e.g., debug)
Rename collections (e.g., for composition)

60 Relook at motivation: Highly adaptive computing for exascale
Critical exascale issues require the ability to move currently executing parts of the app to another place in the platform or to a later time.
Resilience
−Fragile components
−Lots of them
Power management
−Power components off
−Power components down
Self-aware computing
−Modify mapping based on feedback
Change of goals
−Between power and time to solution, for example
Looking forward to:
Lowering the design
Implementation
Experimenting
Looking for feedback and collaborators
