Published by Julissa Blowers. Modified over 9 years ago.
1
Design Environment for Fault-Adaptive Systems
Ted Bapty, Sandeep Neema, Sweta Shetty, Steve Nordstrom, Divya Vashishtha, Jason Scott, Jason Overdorf
Institute for Software Integrated Systems, Vanderbilt University
2
BTeV RTES Team NSF/ITR
Fermilab
–Building BTeV Trigger Hardware
–Domain Experts, Define Goals, Constraints, etc.
Vanderbilt
–RTES Lead (Physics)
–Design Environment, System Synthesis, System Integration, Prototype Hardware
UIUC
–ARMOR, Fault Tolerant Middleware
Syracuse & Pitt
–Very Lightweight Agents, Diagnostics, Load Balancing
3
High Energy Physics
[Images: Fermilab Accelerator; BTeV Experiment]
4
Particle Measurement
[Figure: detector layout on a 0 to 12 m scale, ± 300 mrad, beams labeled p and p. Component labels: Dipole, RICH, EM Cal, Hadron Absorber, Muon Toroid, Magnet, Detector Grids]
Forward tracker provides:
–Momentum measurement
–Pattern recognition for tracks born in decays downstream of vertex detector
–Projection of tracks into particle ID devices
Problem:
–Massive amounts of data (Terabytes/Sec)
–Determine the set of particle trajectories
–Decide if it is interesting, keep or toss
–Hardware => 2500 DSPs + 2500 PCs
–Never Fail (ok to degrade)
5
Trigger System (20,000 ft. view)
[Diagram labels: Pre-Process (FPGA); Memory Queue, ~ms; 1st Level (DSP); 2nd Level (PC); Store; ~2000 Nodes (shown twice)]
6
System Constraints
Triple-Mode Redundancy == Too Expensive
–Some Over-capacity designed in
Parallel System, Real-Time
–Heterogeneous Processors
–RT Constraints = Queue Length
No Generic Response to Faults
–Based on application requirements
–Based on system state
–Based on available resources
7
Fault Mitigation
System has excess capacity
–But not much… (~10%)
–Cannot pre-plan use of redundancy
–Excess capacity may be used for ‘disposable tasks’
Fault Occurs:
–React quickly to regain minimal function
–Rearrange Resources to make Best Use of Remaining Resources
–User-defined recovery behavior
8
Reflex + Healing
‘Reflex Action’:
–Simple
–Rapid
–Real-Time, Guaranteed Response Time
–Sub-Optimal
–Handle a Single Failure
‘Healing’:
–Re-Evaluate Resources & Tasks
–Re-Balance/Re-Allocate Resources
–Recover Failed Resources (After Testing)
–Generate New Reflex Actions
9
Reflex Mitigation Example
[Diagram legend: Primary Task, Secondary Task]
1. Normal Operation
2. Processor Failure
User-Defined Mitigation Actions:
3. Subdivide Primary Task
4. Migrate to Adjacent Processors
5. Replace Secondary Task
5+. Reset/Test Failed Processor
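The reflex steps above can be sketched in a few lines of Python. This is only an illustration, not the BTeV implementation: the `Processor` class, the `reflex_on_failure` function, and the round-robin split of the primary task are all assumptions made for the example, and at least one adjacent processor is assumed to exist.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Processor:
    """Hypothetical node model: primary work items plus disposable secondary tasks."""
    name: str
    primary: List[str] = field(default_factory=list)
    secondary: List[str] = field(default_factory=list)
    healthy: bool = True


def reflex_on_failure(failed: Processor, neighbours: List[Processor]) -> None:
    """Subdivide the failed node's primary work and migrate it to its neighbours.

    Each neighbour that receives work drops its secondary (disposable) tasks
    to free capacity. Assumes `neighbours` is non-empty.
    """
    failed.healthy = False                      # step 2: processor failure
    work, failed.primary = failed.primary, []   # step 3: subdivide primary task
    failed.secondary.clear()
    for i, item in enumerate(work):
        target = neighbours[i % len(neighbours)]  # step 4: migrate round-robin
        target.secondary.clear()                  # step 5: replace secondary task
        target.primary.append(item)
```

The response is deliberately simple and local: it touches only the failed node and its neighbours, which is what lets a reflex run within a bounded time.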
10
Healing Mitigation Example
[Diagram labels: Primary Task, Secondary Task, Reflex Action, Mitigation Actions, Re-Eval, Re-Plan]
1. Normal Operation
2. Processor Failure
3. Update Models
4. Re-Evaluate Resources
5. Re-Plan System
6. Rearrange tasks
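The re-plan step can be illustrated with a simple load-balancing heuristic. The slides do not say which planner is used, so the sketch below swaps in a generic greedy longest-processing-time assignment; the function name `replan` and the task cost figures are hypothetical.

```python
import heapq
from typing import Dict, List


def replan(task_costs: Dict[str, float], healthy_nodes: List[str]) -> Dict[str, List[str]]:
    """Assign tasks to healthy nodes, largest task first, least-loaded node first."""
    if not healthy_nodes:
        raise ValueError("no healthy nodes left to plan on")
    # Min-heap of (current load, node name) so the least-loaded node pops first.
    heap = [(0.0, node) for node in healthy_nodes]
    heapq.heapify(heap)
    plan: Dict[str, List[str]] = {node: [] for node in healthy_nodes}
    # Sort by descending cost; ties are broken by task name for determinism.
    for task, cost in sorted(task_costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, node = heapq.heappop(heap)
        plan[node].append(task)
        heapq.heappush(heap, (load + cost, node))
    return plan
```

Calling `replan` again with the failed node removed from `healthy_nodes` yields a new task arrangement, which corresponds to steps 4 through 6 above.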
11
Design Issues
Complex System
–Thousands of Processors
–High Data Rates
–Real-Time Constraints
User-Defined Behaviors
–Domain-Specific Design Tool
–System-Specific Implementation
Run-Time Implementation
–Heterogeneous Architecture
–Real-Time - Execution & Mitigation
–Fault-Tolerant
12
[Architecture diagram. Recoverable labels:]
Design and Analysis
–System Models; Reconfig Behavior; Algorithm; Fault Behavior; Resource
–Analysis: Performance Simulation, Diagnosability Analysis, Reliability Analysis
–Synthesis; Feedback; Model Integrated Computing
Runtime
–Global Operations Manager, Global Fault Manager
–Region Operations Mgr, Region Fault Mgr
–L1/DSP nodes: Local Oper. Manager, Local Fault Mgr, Trig Algo. (several per node), ARMOR/RTOS
–L2,3/RISC nodes: Local Oper Manager, Local Fault Mgr, Trig Algo. (several per node), ARMOR/Linux
–Logical Control Network, Logical Data Net
–Experiment Interface
–Soft Real-Time, Hard
13
Modeling Language
Processing & Data Flow
–Concepts: Processes, streams, data channels, Functions, data types, communication
Hardware Resources
–Concepts: Processors, Memory, Topology, Reliability, Failure Modes,…
Hierarchical Fault Management
–Concepts: Recovery Strategies, Modes of Operation, goals/importance
[Diagram labels: Full, Recov. Mode 1, Recov. Mode 2, Recov. Mode 3]
14
Resource Models = Capture Hardware Resources
Nodes
Networks
–Attributes
–Hierarchy
15
Algorithm Models
Processes
Info Flow
–Interfaces
–Hierarchy
16
Fault Mitigation Models
Regional Manager
Local Manager
Finite State Machine
–Parallel, Hierarchical
–Events & Transitions
–Mitigation Actions
–Time Specs
[Diagram labels: State, Transition, Mitigation Actions, Conditions]
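A minimal sketch of such a mitigation state machine follows, assuming a flat machine with guarded transitions. The class name `MitigationFSM`, its methods, and the state and event names used with it are hypothetical; the parallel and hierarchical composition and the time specs listed above are left out for brevity.

```python
class MitigationFSM:
    """Flat finite state machine: events trigger guarded transitions with actions."""

    def __init__(self, initial):
        self.state = initial
        # (state, event) -> list of (condition, actions, next_state), tried in order
        self.transitions = {}
        self.log = []  # mitigation actions issued so far

    def add(self, state, event, next_state, actions=(), condition=lambda ctx: True):
        """Register a transition; `condition` is a guard over a context dict."""
        self.transitions.setdefault((state, event), []).append(
            (condition, tuple(actions), next_state)
        )

    def fire(self, event, ctx=None):
        """Take the first transition whose guard holds; return True if one fired."""
        for condition, actions, next_state in self.transitions.get((self.state, event), []):
            if condition(ctx or {}):
                self.log.extend(actions)
                self.state = next_state
                return True
        return False
```

For example, a local manager might retry a failed link while a retry budget remains and reroute otherwise, by registering two transitions on the same event with different guards.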
17
System Generation
[Diagram labels: Algorithms, Resources, Schedules, Comm Maps, SW Loads, OS Cfg, Task Assign, Boot Maps]
18
Generation of Reflex Networks
[Diagram: State A, State B, State C; transition actions Action AB1, Action AB2, Action AB3 and Action AC1, Action AC2, Action AC3; labels System Fault State, Primary Struct., Reflex Struct.]
Reflex Scripts (1 set for each processor and failure type):
ON (L76 Fail) DO
  Del P1
  Conn P1,S22
  Map S22, C3
  Kill T22
  Migrate T33,
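One way to picture the pre-generated reflex scripts is as a lookup table keyed by resource and failure type, so that the run-time response is a single dictionary lookup followed by a fixed command list. The sketch below is an assumption about the representation, not the generated format: `REFLEX_TABLE` and `run_reflex` are hypothetical names, and the `Migrate T33,` command is omitted because its target is cut off in the slide.

```python
# Hypothetical table: (resource, failure type) -> ordered list of (command, args).
# The single entry mirrors the "ON (L76 Fail) DO ..." script shown on the slide.
REFLEX_TABLE = {
    ("L76", "Fail"): [
        ("Del", ["P1"]),
        ("Conn", ["P1", "S22"]),
        ("Map", ["S22", "C3"]),
        ("Kill", ["T22"]),
    ],
}


def run_reflex(resource, failure_type, execute):
    """Run the pre-computed script for this fault, if one exists.

    `execute` is a caller-supplied callback taking (command, args).
    Returns True when a script was found and run, False otherwise.
    """
    script = REFLEX_TABLE.get((resource, failure_type))
    if script is None:
        return False
    for command, args in script:
        execute(command, args)
    return True
```

Because nothing is planned at fault time, the cost of a reflex is bounded by the length of the longest script in the table.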
19
Model-Based Healing
[Diagram labels: System Hardware; MIC Healing Controller; Nominal; Faults; Update Model; Re-Balance; New Reflex; Interface; Tasks, Links; Reflex Program; Reflex]
20
Runtime Environment
[Diagram labels: Global Manager (Mitigation Engine); Regional Manager (Mitigation Engine); Local Manager (Mitigation Engine); DSP Kernel; DSP Hardware; Reflex Actions; Actions; Feed Back; Experiment Interface; Model Interface]
21
Fault Mitigation Interface
Fault Mitigation Interface:
–The FMA interfaces with the local diagnostics facility (receive local status, clear errors, trigger rediagnosis, set diagnosis mode, etc.)
Commands
–RETRY_LINK(link_id)
  Function: Reset/resync a comm link. Returns: failure or success
–REROUTE_LINK(link_id)
  Function: Reroute communications through a separate link
–ADD_TASK(task_id, link_id)
  Function: Adds a task to the task list, operating on data from link_id
–TEST_MEMORY(memory_bank)
  Function: Intensive test on memory bank
–RELOCATE_DATA(from_bank, to_bank)
  Function: Moves data, marks source memory bank as unused/unavail
–GET_LOCAL_STATUS
  Function: Reports status of a resource on a local node
–SEND_MESSAGE
–RECEIVE_MESSAGE
–...
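To make the command set concrete, here is a small in-memory stand-in for a node that accepts a few of the commands above. Only the command names come from the slide; the `SimulatedNode` class, its state layout, the `dead_links` parameter, and every return value other than the success/failure of RETRY_LINK are assumptions for illustration.

```python
class SimulatedNode:
    """In-memory stand-in for a local node; state and return values are assumed."""

    def __init__(self, links, banks, dead_links=()):
        self.links = {link: "up" for link in links}
        self.banks = {bank: "in_use" for bank in banks}
        self.dead_links = set(dead_links)  # links that a retry cannot recover
        self.tasks = []                    # (task_id, link_id) pairs

    def retry_link(self, link_id):
        """RETRY_LINK: reset/resync a link; True on success, False on failure."""
        ok = link_id not in self.dead_links
        self.links[link_id] = "up" if ok else "down"
        return ok

    def add_task(self, task_id, link_id):
        """ADD_TASK: add a task that operates on data from link_id."""
        self.tasks.append((task_id, link_id))

    def relocate_data(self, from_bank, to_bank):
        """RELOCATE_DATA: move data and mark the source bank unavailable."""
        self.banks[to_bank] = "in_use"
        self.banks[from_bank] = "unavailable"

    def get_local_status(self):
        """GET_LOCAL_STATUS: here, a snapshot of all local resources."""
        return {"links": dict(self.links), "banks": dict(self.banks),
                "tasks": list(self.tasks)}

    def execute(self, command, *args):
        """Dispatch a command name from the slide to its handler."""
        handlers = {
            "RETRY_LINK": self.retry_link,
            "ADD_TASK": self.add_task,
            "RELOCATE_DATA": self.relocate_data,
            "GET_LOCAL_STATUS": self.get_local_status,
        }
        return handlers[command](*args)
```

Routing every command through `execute` keeps the mitigation logic independent of the node type, which matters in a system that mixes DSPs and PCs.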
22
Synthesis: Analysis/Offline
Simulation
–Functional (e.g. Matlab)
–Performance (Timing, Discrete Event)
–Interfacing/generating to Swarms/Jackal/TAEMS
Diagnosability
–Failure Modes + Sensors
–Predict ability to Detect/Isolate Failures
Reliability Analysis
–Predict MTBF, Maximum Failures
–Robustness
Stability Analysis
–Reconfiguration Strategies/Control System
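As a rough illustration of the MTBF prediction, the sketch below uses the textbook model of independent nodes with constant failure rates, where the rates simply add. This is not the project's reliability analysis, and the per-node MTBF of 50,000 hours is a made-up figure; only the node counts (2500 DSPs + 2500 PCs) come from the slides.

```python
def mean_time_between_node_failures(groups):
    """Mean time between failures anywhere in the system, in hours.

    `groups` is a list of (node_count, node_mtbf_hours) pairs. Assumes
    independent nodes with constant failure rates, so the system failure
    rate is the sum of the per-node rates.
    """
    total_rate = sum(count / mtbf for count, mtbf in groups)
    if total_rate <= 0:
        raise ValueError("need at least one node with a finite MTBF")
    return 1.0 / total_rate


# Hypothetical per-node MTBF of 50,000 hours for both node types.
hours = mean_time_between_node_failures([(2500, 50_000.0), (2500, 50_000.0)])
print(f"some node fails roughly every {hours:.1f} hours")
```

Even with reliable individual nodes, a system of this size sees failures routinely, which is why the design goal is graceful degradation.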
23
System Simulation
[Diagram labels: System Model, Task Model, Communication Model]
24
Summary
Developing Model-Based Approach
–Capture Algorithm, Resource, and Mitigation Aspects
Generation of Software
–Normal application Code
–Fault Mitigation Code
Two Fault Mitigation Approaches
–Reflex: Fast, Limited Response
–Healing: Slower, system ‘re-design’
Analysis & Simulation
Runtime Infrastructure