Design Environment for Fault-Adaptive Systems (Institute for Software Integrated Systems, Vanderbilt University)

1 Institute for Software Integrated Systems, Vanderbilt University
Design Environment for Fault-Adaptive Systems
Ted Bapty, Sandeep Neema, Sweta Shetty, Steve Nordstrom, Divya Vashishtha, Jason Scott, Jason Overdorf
Vanderbilt Univ.

2 BTeV RTES Team
NSF/ITR
Fermilab
 – Building BTeV Trigger Hardware
 – Domain Experts, Define Goals, Constraints, etc.
Vanderbilt
 – RTES Lead (Physics)
 – Design Environment, System Synthesis, System Integration, Prototype Hardware
UIUC
 – ARMOR, Fault Tolerant Middleware
Syracuse & Pitt
 – Very Lightweight Agents, Diagnostics, Load Balancing

3 High Energy Physics
[Figure labels: FermiLab Accelerator; BTeV Experiment]

4 Particle Measurement
[Figure: detector layout along a 0 to 12 m axis, ± 300 mrad; labels: p, p, Dipole, Magnet, RICH, EM Cal, Hadron Absorber, Muon Toroid, Detector Grids]
Forward tracker provides:
 – Momentum measurement
 – Pattern recognition for tracks born in decays downstream of vertex detector
 – Projection of tracks into particle ID devices
Problem:
 – Massive amounts of data (Terabytes/Sec)
 – Determine the set of particle trajectories
 – Decide if it is interesting, keep or toss
 – Hardware => 2500 DSPs + 2500 PCs
 – Never Fail (ok to degrade)

5 Trigger System (20,000 ft. view)
[Diagram labels: Pre-Process (FPGA); 1st Level (DSP); Memory Queue, ~ms; 2nd Level (PC); Store; ~2000 Nodes; ~2000 Nodes]

6 System Constraints
Triple-Mode Redundancy == Too Expensive
 – Some Over-capacity designed in
Parallel System, Real-Time
 – Heterogeneous Processors
 – RT Constraints = Queue Length
No Generic Response to Faults
 – Based on application requirements
 – Based on system state
 – Based on available resources
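One way to read "RT Constraints = Queue Length" is that a buffer of fixed depth, fed at a steady rate, bounds how long an event can wait for a trigger decision before the buffer overflows. The sketch below shows that arithmetic only. The function name, the constant-rate assumption, and the numbers are illustrative and do not come from the slides.

```python
def max_decision_latency_s(queue_depth_events: int, arrival_rate_hz: float) -> float:
    """Time for a queue of the given depth to fill at a constant arrival rate.

    An event that waits longer than this for a decision would be overwritten,
    so the queue depth acts as the real-time deadline.
    """
    if arrival_rate_hz <= 0:
        raise ValueError("arrival rate must be positive")
    return queue_depth_events / arrival_rate_hz


# Hypothetical numbers: a 10,000-event buffer fed at 1 MHz.
latency = max_decision_latency_s(10_000, 1_000_000.0)
print(f"decision deadline: {latency * 1e3:.1f} ms")
```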

7 Fault Mitigation
System has excess capacity
 – But not much… (~10%)
 – Cannot pre-plan use of redundancy
 – Excess capacity may be used for ‘disposable tasks’
Fault Occurs:
 – React quickly to regain minimal function
 – Rearrange Resources to make Best Use of Remaining Resources
 – User-defined recovery behavior
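To get a feel for what "~10%" excess capacity buys, the following sketch estimates how many identical nodes can fail before the remaining ones can no longer carry the nominal workload. It assumes identical nodes and that total capacity equals the nominal workload times (1 + excess). The ~2000-node figure is taken from the trigger system slide; the function name is hypothetical.

```python
import math


def tolerable_failures(total_nodes: int, excess_fraction: float) -> int:
    """Nodes that can be lost before capacity drops below the nominal workload.

    Assumes identical nodes and total capacity = nominal * (1 + excess_fraction).
    """
    if total_nodes < 0 or excess_fraction < 0:
        raise ValueError("inputs must be non-negative")
    nodes_needed = math.ceil(total_nodes / (1.0 + excess_fraction))
    return total_nodes - nodes_needed


# ~2000 nodes per trigger level with ~10% excess capacity.
print(tolerable_failures(2000, 0.10))
```

Under these assumptions the margin is small relative to the size of the system, which is consistent with the slide's point that redundancy cannot simply be pre-planned.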

8 Reflex + Healing
‘Reflex Action’:
 – Simple
 – Rapid
 – Real-Time, Guaranteed Response Time
 – Sub-Optimal
 – Handle a Single Failure
‘Healing’:
 – Re-Evaluate Resources & Tasks
 – Re-Balance/Re-Allocate Resources
 – Recover Failed Resources (After Testing)
 – Generate New Reflex Actions

9 Reflex Mitigation Example
[Diagram legend: Primary Task; Secondary Task; User-Defined Mitigation Actions]
1. Normal Operation
2. Processor Failure
3. Subdivide Primary Task
4. Migrate to Adjacent Processors
5. Replace Secondary Task
5+. Reset/Test Failed Processor
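The reflex sequence on this slide can be sketched as a small in-memory simulation. This is an illustration under stated assumptions, not the BTeV implementation: processors are modeled as a linear array, "adjacent" means the immediate left and right neighbors, and the `Processor` class and task names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    name: str
    primary: list[str] = field(default_factory=list)    # primary task slices
    secondary: list[str] = field(default_factory=list)  # disposable tasks
    failed: bool = False


def reflex_on_failure(procs: list[Processor], failed_idx: int) -> None:
    """Apply steps 2 to 5 of the slide to a linear array of processors, in place."""
    failed = procs[failed_idx]
    failed.failed = True                                  # 2. processor failure
    neighbors = [procs[i] for i in (failed_idx - 1, failed_idx + 1)
                 if 0 <= i < len(procs) and not procs[i].failed]
    if not neighbors:
        raise RuntimeError("no healthy adjacent processor to take over")
    for task in failed.primary:
        # 3. subdivide the primary task into one part per healthy neighbor
        for part, neighbor in enumerate(neighbors):
            neighbor.secondary.clear()                    # 5. replace secondary task
            neighbor.primary.append(f"{task}/part{part}")  # 4. migrate to adjacent
    failed.primary.clear()
    failed.secondary.clear()


procs = [Processor(f"P{i}", [f"T{i}"], [f"S{i}"]) for i in range(4)]
reflex_on_failure(procs, 1)
for p in procs:
    print(p.name, p.primary, p.secondary, "FAILED" if p.failed else "")
```

The response is deliberately local and table-like: only the neighbors of the failed processor change, which is what keeps a reflex fast and bounded in time, at the cost of a sub-optimal global allocation.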

10 Healing Mitigation Example
[Diagram labels: Primary Task; Secondary Task; Reflex Action; Mitigation Actions; Re-Eval; Re-Plan]
1. Normal Operation
2. Processor Failure
3. Update Models
4. Re-Evaluate Resources
5. Re-Plan System
6. Rearrange tasks
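The "Re-Plan System" and "Rearrange tasks" steps amount to recomputing a task-to-processor assignment over whatever hardware is still healthy. The slides do not say which planning algorithm is used; the sketch below substitutes a simple greedy heuristic (heaviest task first, onto the least-loaded processor) purely for illustration. Task names, loads, and the function name are hypothetical.

```python
import heapq


def rebalance(task_loads: dict[str, float], healthy: list[str]) -> dict[str, list[str]]:
    """Greedy re-plan: place the heaviest tasks first on the least-loaded processor."""
    if not healthy:
        raise ValueError("no healthy processors left")
    heap = [(0.0, name) for name in healthy]   # (current load, processor name)
    heapq.heapify(heap)
    plan: dict[str, list[str]] = {name: [] for name in healthy}
    # Sort by descending load; break ties by task name for a deterministic plan.
    for task, load in sorted(task_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        current, name = heapq.heappop(heap)
        plan[name].append(task)
        heapq.heappush(heap, (current + load, name))
    return plan


# P1 has failed, so only P0 and P2 remain available for planning.
print(rebalance({"T1": 4.0, "T2": 3.0, "T3": 2.0, "T4": 1.0}, ["P0", "P2"]))
```

Unlike the reflex step, this looks at the whole system, so it is slower but can undo the local imbalance that a reflex action leaves behind.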

11 Design Issues
Complex System
 – Thousands of Processors
 – High Data Rates
 – Real-Time Constraints
User-Defined Behaviors
 – Domain-Specific Design Tool
 – System-Specific Implementation
Run-Time Implementation
 – Heterogeneous Architecture
 – Real-Time: Execution & Mitigation
 – Fault-Tolerant

12 [Architecture diagram; recoverable labels follow]
Design and Analysis: Model Integrated Computing; System Models; Algorithm; Fault Behavior; Resource; Reconfig Behavior; Synthesis; Feedback; Analysis; Performance Simulation; Diagnosability Analysis; Reliability Analysis; Experiment Interface
Runtime: Global Operations Manager; Global Fault Manager; Region Operations Mgr; Region Fault Mgr; Logical Control Network; Logical Data Net; Soft Real-Time; Hard
L1/DSP nodes: Local Oper. Manager; Local Fault Mgr; Trig Algo. (several per node); ARMOR/RTOS
L2,3/RISC nodes: Local Oper. Manager; Local Fault Mgr; Trig Algo. (several per node); ARMOR/Linux

13 Modeling Language
Processing & Data Flow
 – Concepts: Processes, streams, data channels, Functions, data types, communication
Hardware Resources
 – Concepts: Processors, Memory, Topology, Reliability, Failure Modes,…
Hierarchical Fault Management
 – Concepts: Recovery Strategies, Modes of Operation, goals/importance
 – [Diagram labels: Full; Recov. Mode 1; Recov. Mode 2; Recov. Mode 3]

14 Resource Models = Capture Hardware Resources
 – Nodes
 – Networks
 – Attributes
 – Hierarchy

15 Algorithm Models
 – Processes
 – Info Flow
 – Interfaces
 – Hierarchy

16 Fault Mitigation Models
Regional Manager, Local Manager
Finite State Machine
 – Parallel, Hierarchical
 – Events & Transitions
 – Mitigation Actions
 – Time Specs
[Diagram labels: State; Transition; Mitigation Actions; Conditions]
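The ingredients listed on this slide (states, events, transitions guarded by conditions, mitigation actions, time specs) map naturally onto a small state-machine data structure. The sketch below is a flat, single-machine illustration only; it does not model the parallel or hierarchical composition the slide mentions. The class names, state names, and the `retries` context key are hypothetical; the action names are borrowed from the command list on the Fault Mitigation Interface slide.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Transition:
    source: str
    event: str
    target: str
    condition: Callable[[dict], bool]   # guard evaluated on the manager's context
    actions: tuple[str, ...]            # names of mitigation actions to issue
    deadline_ms: float                  # time spec for completing the actions


class MitigationFSM:
    def __init__(self, initial: str, transitions: list[Transition]) -> None:
        self.state = initial
        self._transitions = transitions

    def handle(self, event: str, context: dict) -> tuple[str, ...]:
        """Fire the first enabled transition and return the actions it requests."""
        for t in self._transitions:
            if t.source == self.state and t.event == event and t.condition(context):
                self.state = t.target
                return t.actions
        return ()   # event is ignored in the current state


fsm = MitigationFSM("NOMINAL", [
    Transition("NOMINAL", "LINK_ERROR", "RETRYING",
               lambda ctx: ctx["retries"] < 3, ("RETRY_LINK",), 5.0),
    Transition("NOMINAL", "LINK_ERROR", "REROUTED",
               lambda ctx: ctx["retries"] >= 3, ("REROUTE_LINK",), 20.0),
])
print(fsm.handle("LINK_ERROR", {"retries": 0}), fsm.state)
```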

17 System Generation
[Diagram labels: Algorithms; Resources; Schedules; Comm Maps; SW Loads; OS Cfg; Task Assign; Boot Maps]

18 Generation of Reflex Networks
[Diagram labels: State A; State B; State C; Action AB1; Action AB2; Action AB3; Action AC1; Action AC2; Action AC3; System Fault State; Primary Struct.; Reflex Struct.]
Reflex Scripts (1 set for each processor and failure type), example:
 ON (L76 Fail) DO
   Del P1
   Conn P1,S22
   Map S22, C3
   Kill T22
   Migrate T33,
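The slide shows one reflex script and notes that a set is generated for each processor and failure type. A generator for such a table could look roughly like the sketch below. The slides do not define the script grammar or the meaning of `Del`, `Conn`, `Map`, `Kill`, and `Migrate`, so the code treats them as opaque operation names and only reproduces the visible `ON (...) DO ...` shape. The function names and the `("P1", "link_fail")` key are hypothetical, and the `Migrate` line keeps only the one argument that is legible on the slide.

```python
Action = tuple[str, tuple[str, ...]]          # (operation name, arguments)
ReflexKey = tuple[str, str]                   # (processor, failure type)


def render_reflex_script(trigger: str, actions: list[Action]) -> str:
    """Render one script in the 'ON (<trigger>) DO <actions>' shape."""
    body = "\n".join(f"  {op} {', '.join(args)}" for op, args in actions)
    return f"ON ({trigger}) DO\n{body}"


def build_reflex_table(
    plans: dict[ReflexKey, tuple[str, list[Action]]]
) -> dict[ReflexKey, str]:
    """Produce one rendered script per (processor, failure type) key."""
    return {key: render_reflex_script(trigger, actions)
            for key, (trigger, actions) in plans.items()}


table = build_reflex_table({
    ("P1", "link_fail"): ("L76 Fail", [
        ("Del", ("P1",)),
        ("Conn", ("P1", "S22")),
        ("Map", ("S22", "C3")),
        ("Kill", ("T22",)),
        ("Migrate", ("T33",)),
    ]),
})
print(table[("P1", "link_fail")])
```

Precomputing the table offline is what lets the runtime respond with a lookup instead of a planning step.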

19 Model-Based Healing
[Diagram labels: System Hardware; MIC Healing Controller; Nominal; Faults; Update Model; Re-Balance; New Reflex; Interface; Tasks, Links; Reflex Program; Reflex]

20 Runtime Environment
[Diagram labels: Global Manager, Mitigation Engine; Regional Manager, Mitigation Engine; Local Manager, Mitigation Engine; DSP Kernel; DSP Hardware; Reflex Actions; Actions; Feed Back; Experiment Interface; Model Interface]

21 Fault Mitigation Interface
Fault Mitigation Interface:
 – The FMA interfaces with the local diagnostics facility (receive local status, clear errors, trigger rediagnosis, set diagnosis mode, etc.)
Commands
 – RETRY_LINK(link_id)
   Function: Reset/resync a comm link. Returns: failure or success
 – REROUTE_LINK(link_id)
   Function: Reroute communications through a separate link
 – ADD_TASK(task_id, link_id)
   Function: Adds a task to the task list; the task operates on data from link_id
 – TEST_MEMORY(memory_bank)
   Function: Intensive test on memory bank
 – RELOCATE_DATA(from_bank, to_bank)
   Function: Moves data, marks source memory bank as unused/unavail
 – GET_LOCAL_STATUS
   Function: Reports status of a resource on a local node
 – SEND_MESSAGE
 – RECEIVE_MESSAGE
 – ...
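To make the command list concrete, here is a minimal in-memory stand-in that dispatches a few of the commands by name. It is a sketch for experimenting with mitigation logic, not the actual interface: only the command names and their described effects come from the slide, while the class, its fields, the Python method signatures, and the always-succeeding `RETRY_LINK` are assumptions.

```python
from typing import Callable


class FaultMitigationStub:
    """In-memory stand-in for a node-local fault mitigation interface."""

    def __init__(self) -> None:
        self.tasks: list[tuple[str, str]] = []       # (task_id, link_id)
        self.unavailable_banks: set[str] = set()
        self.link_route: dict[str, str] = {}         # link_id -> "primary" / "alternate"
        self._commands: dict[str, Callable[..., object]] = {
            "RETRY_LINK": self.retry_link,
            "REROUTE_LINK": self.reroute_link,
            "ADD_TASK": self.add_task,
            "RELOCATE_DATA": self.relocate_data,
        }

    def retry_link(self, link_id: str) -> bool:
        # A real node would reset/resync hardware; this stub always reports success.
        self.link_route.setdefault(link_id, "primary")
        return True

    def reroute_link(self, link_id: str) -> None:
        self.link_route[link_id] = "alternate"

    def add_task(self, task_id: str, link_id: str) -> None:
        self.tasks.append((task_id, link_id))

    def relocate_data(self, from_bank: str, to_bank: str) -> None:
        # Data movement is omitted; only the "mark source as unavailable" effect is modeled.
        self.unavailable_banks.add(from_bank)

    def execute(self, command: str, *args: str) -> object:
        """Look up a command by name and run it with the given arguments."""
        try:
            handler = self._commands[command]
        except KeyError:
            raise ValueError(f"unknown command: {command}") from None
        return handler(*args)


fmi = FaultMitigationStub()
print(fmi.execute("RETRY_LINK", "L76"))
fmi.execute("REROUTE_LINK", "L76")
fmi.execute("ADD_TASK", "T22", "L76")
print(fmi.link_route, fmi.tasks)
```

Dispatching by command name keeps the stub close to the slide's vocabulary, so action names emitted by a mitigation model can be replayed against it directly.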

22 Synthesis: Analysis/Offline
Simulation
 – Functional (e.g. Matlab)
 – Performance (Timing, Discrete Event)
 – Interfacing/generating to Swarms/Jackal/TAEMS
Diagnosability
 – Failure Modes + Sensors
 – Predict ability to Detect/Isolate Failures
Reliability Analysis
 – Predict MTBF, Maximum Failures
 – Robustness
Stability Analysis
 – Reconfiguration Strategies/Control System
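As a small worked example of the "Predict MTBF" item: under the textbook assumption of independent nodes with constant (exponential) failure rates, the mean time until some node fails scales as the per-node MTBF divided by the node count. The sketch below applies that to the 2500 DSPs + 2500 PCs mentioned on the Particle Measurement slide. The per-node MTBF of 50,000 hours is a made-up figure for illustration, and the slides do not say which reliability model the tool actually uses.

```python
def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """Mean time to the first node failure, for independent exponential failures."""
    if node_mtbf_hours <= 0 or node_count <= 0:
        raise ValueError("inputs must be positive")
    return node_mtbf_hours / node_count


def expected_failures(node_mtbf_hours: float, node_count: int, run_hours: float) -> float:
    """Expected node failures during a run, assuming failed nodes are replaced promptly."""
    if node_mtbf_hours <= 0 or node_count <= 0 or run_hours < 0:
        raise ValueError("invalid inputs")
    return node_count * run_hours / node_mtbf_hours


nodes = 2500 + 2500                 # DSPs + PCs
node_mtbf = 50_000.0                # hypothetical per-node MTBF in hours
print(system_mtbf_hours(node_mtbf, nodes))           # hours between failures somewhere
print(expected_failures(node_mtbf, nodes, 24.0))     # failures expected per day
```

Even with optimistic per-node figures, a system of this size should expect failures during normal running, which is the motivation for designing mitigation in rather than treating faults as exceptional.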

23 System Simulation
[Diagram labels: System Model; Task Model; Communication Model]

24 Summary
Developing Model-Based Approach
 – Capture Algorithm, Resource, and Mitigation Aspects
Generation of Software
 – Normal application Code
 – Fault Mitigation Code
Two Fault Mitigation Approaches
 – Reflex: Fast, Limited Response
 – Healing: Slower, system ‘re-design’
Analysis & Simulation
Runtime Infrastructure

