Anand Bhat, Soheil Samii†, Raj Rajkumar *Carnegie Mellon University

Slides:

Advertisements

Similar presentations

Simulation of Feedback Scheduling Dan Henriksson, Anton Cervin and Karl-Erik Årzén Department of Automatic Control.

Advertisements

EE5900 Advanced Embedded System For Smart Infrastructure

QoS-based Management of Multiple Shared Resources in Dynamic Real-Time Systems Klaus Ecker, Frank Drews School of EECS, Ohio University, Athens, OH {ecker,

CPE555A: Real-Time Embedded Systems

Carnegie Mellon R-BATCH: Task Partitioning for Fault-tolerant Multiprocessor Real-Time Systems Junsung Kim, Karthik Lakshmanan and Raj Rajkumar Electrical.

Bogdan Tanasa, Unmesh D. Bordoloi, Petru Eles, Zebo Peng Department of Computer and Information Science, Linkoping University, Sweden December 3, 2010.

1 of 30 June 14, 2000 Scheduling and Communication Synthesis for Distributed Real-Time Systems Paul Pop Department of Computer and Information Science.

TH EDA NTHU-CS VLSI/CAD LAB 1 Re-synthesis for Reliability Design Shih-Chieh Chang Department of Computer Science National Tsing Hua University.

1 of 14 1/15 Schedulability Analysis and Optimization for the Synthesis of Multi-Cluster Distributed Embedded Systems Paul Pop, Petru Eles, Zebo Peng Embedded.

8. Fault Tolerance in Software

7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems Upon the detection of a failure, the system discards the current.

ATIF MEHMOOD MALIK KASHIF SIDDIQUE Improving dependability of Cloud Computing with Fault Tolerance and High Availability.

Towards a Contract-based Fault-tolerant Scheduling Framework for Distributed Real-time Systems Abhilash Thekkilakattil, Huseyin Aysan and Sasikumar Punnekkat.

1 Reducing Queue Lock Pessimism in Multiprocessor Schedulability Analysis Yang Chang, Robert Davis and Andy Wellings Real-time Systems Research Group University.

Scheduling policies for real- time embedded systems.

Integrated Scheduling and Synthesis of Control Applications on Distributed Embedded Systems Soheil Samii 1, Anton Cervin 2, Petru Eles 1, Zebo Peng 1 1.

Real-Time CORBA By Christopher Bolduc. What is Real-Time? Real-time computing is the study of hardware and software systems that are subject to a “real-

Resource Mapping and Scheduling for Heterogeneous Network Processor Systems Liang Yang, Tushar Gohad, Pavel Ghosh, Devesh Sinha, Arunabha Sen and Andrea.

5 May CmpE 516 Fault Tolerant Scheduling in Multiprocessor Systems Betül Demiröz.

1 SYNTHESIS of PIPELINED SYSTEMS for the CONTEMPORANEOUS EXECUTION of PERIODIC and APERIODIC TASKS with HARD REAL-TIME CONSTRAINTS Paolo Palazzari Luca.

O PTIMAL SERVICE TASK PARTITION AND DISTRIBUTION IN GRID SYSTEM WITH STAR TOPOLOGY G REGORY L EVITIN, Y UAN -S HUN D AI Adviser: Frank, Yeong-Sung Lin.

Mixed Criticality Systems: Beyond Transient Faults Abhilash Thekkilakattil, Alan Burns, Radu Dobrin and Sasikumar Punnekkat.

DISTRIBUTED COMPUTING

Tolerating Communication and Processor Failures in Distributed Real-Time Systems Hamoudi Kalla, Alain Girault and Yves Sorel Grenoble, November 13, 2003.

CSCI1600: Embedded and Real Time Software Lecture 23: Real Time Scheduling I Steven Reiss, Fall 2015.

Dynamic Control of Coding for Progressive Packet Arrivals in DTNs.

1 of 14 1/15 Schedulability-Driven Frame Packing for Multi-Cluster Distributed Embedded Systems Paul Pop, Petru Eles, Zebo Peng Embedded Systems Lab (ESLAB)

Fault-Tolerant Rate- Monotonic Scheduling Sunondo Ghosh, Rami Melhem, Daniel Mosse and Joydeep Sen Sarma.

Distributed Process Scheduling- Real Time Scheduling Csc8320(Fall 2013)

Optimization of Time-Partitions for Mixed-Criticality Real-Time Distributed Embedded Systems Domițian Tămaș-Selicean and Paul Pop Technical University.

Embedded System Scheduling

CHaRy Software Synthesis for Hard Real-Time Systems

ECE 720T5 Fall 2012 Cyber-Physical Systems

REAL-TIME OPERATING SYSTEMS

Primary-Backup Replication

Memory Management.

Multiprocessor Real-Time Scheduling

Advanced Operating Systems CIS 720

Introduction to Load Balancing:

Hydra: Leveraging Functional Slicing for Efficient Distributed SDN Controllers Yiyang Chang, Ashkan Rezaei, Balajee Vamanan, Jahangir Hasan, Sanjay Rao.

Prabhat Kumar Saraswat Paul Pop Jan Madsen

A Study of Group-Tree Matching in Large Scale Group Communications

Wayne Wolf Dept. of EE Princeton University

Cluster Communications

Paul Pop, Petru Eles, Zebo Peng

Ying Zhang&Krishnendu Chakrabarty Presenter Kasım Sert

Fault-Tolerant NoC-based Manycore system: Reconfiguration & Scheduling

Anand Bhat1, Dr. Soheil Samii2, Dr. Raj Rajkumar1

ElasticTree Michael Fruchtman.

Period Optimization for Hard Real-time Distributed Automotive Systems

CprE 458/558: Real-Time Systems

ISP and Egress Path Selection for Multihomed Networks

Supporting Fault-Tolerance in Streaming Grid Applications

Fault and Energy Aware Communication Mapping with Guaranteed Latency for Applications Implemented on NoC Sorin Manolache, Petru Eles, Zebo Peng {sorma,

湖南大学-信息科学与工程学院-计算机与科学系

Buffer Space Optimisation with Communication Mapping and Traffic Shaping for NoCs Sorin Manolache, Petru Eles, Zebo Peng Linköping University, Sweden.

Fault Tolerance Distributed Web-based Systems

Real Time Scheduling Mrs. K.M. Sanghavi.

Christophe Dubach, Timothy M. Jones and Michael F.P. O’Boyle

EEC 688/788 Secure and Dependable Computing

Networked Real-Time Systems: Routing and Scheduling

Jian-Jia Chen and Tei-Wei Kuo

EEC 688/788 Secure and Dependable Computing

Algorithms for Budget-Constrained Survivable Topology Design

Processes and operating systems

Department of Electrical Engineering Joint work with Jiong Luo

Mark McKelvin EE249 Embedded System Design December 03, 2002

EEC 688/788 Secure and Dependable Computing

Guaranteeing Message Latencies on Controller Area Network (CAN)

Presentation transcript:

Recovery-Time Considerations in Real-Time Systems Employing Software Fault Tolerance Anand Bhat*, Soheil Samii†, Raj Rajkumar* *Carnegie Mellon University †General Motors R&D

Outline Motivation System Model Recovery Time Analysis Task Model Platform Model Fault Model Recovery Time Analysis Redundant-Task Type Assignment Task partitioning with Recovery Time Constraints RTT Heuristic Simulated Annealing Approach Evaluation Conclusions and Future Directions

Motivation Real-time systems like cars are inherently safety-critical. To ensure safety, computing systems must be fault-tolerant. Typically fault tolerance is achieved with the help of redundancy.

Traditional Redundancy Approaches Hardware(HW) Replication Triple Modular Redundancy (TMR) Hardware replication is inefficient in terms of Cost Weight Space Input HW1 HW2 HW3 Hence, there is a need for adaptive cost-optimized software fault tolerance solutions. VOTER Output

Software Fault Tolerance Software replication can be of different types. Active Replication: All tasks perform all operations Primary Backup Approach: Only primary performs all operations There is built-in trade-off between time required to recover from failure (Recovery Time) to system resources utilized.

Recovery Time Requirements The recovery time for a task depends on: redundant-task type assignment, task allocation, and scheduling policy. Each task can have a different recovery time requirement (RTR), i.e. the number of deadlines that can be missed in the case of a failure. The redundant-task type assignment and allocation must be selected to meet the RTR for each task.

Outline Motivation System Model Recovery Time Analysis Task Model Platform Model Fault Model Recovery Time Analysis Redundant-Task Type Assignment Task partitioning with Recovery Time Constraints RTT Heuristic Simulated Annealing Approach Evaluation Conclusions and Future Directions

Computation Model (1 of 2) Periodic Task Model (Ci, Ti, Di) Ci: Worst-case execution time Ti: Period Di: Deadline (Di = Ti) Ui: Utilization (Ui = Ci/Ti) Fixed Priority Preemptive Scheduling Rate Monotonic Scheduling (RMS) Worst-case completion time (WCCTi): The worst-case time from arrival to completion of task. A task is said to be schedulable if its WCCTi ≤ Di.

Computation Model (2 of 2) Sample Task Set Brake Control (BC) (Ui = 0.4) (2 copies) RTR = 0 Throttle Control (TC) (Ui = 0.3) (2 copies) RTR = 1 Steering Control (SC) (Ui = 0.6) (1 copies) RTR = 3 Video Playback (VP) (Ui = 0.8) (0 copies) Audio Playback (AP) (Ui = 0.8) (0 copies) HVAC (Ui = 0.4) (0 copies) Critical tasks Non-critical tasks

Fault Model (1 of 2) Crash Faults (fail-stop) Hardware, Operating system and Process crashes Communication Assumption: The network is analyzable and every message in the network is delivered within a known delay bound  or not delivered. Fault detection Assumption: Replicas monitor the status and health of the primary, for example, by using heartbeats

Fault Model (2 of 2) We assume three types of Redundant-Task Type Assignment Type of Task Scheduled Read and Process Inputs Calculate State Perform Calculations Produce Output Primary  Active Replica Hot Standby  Cold Standby Primary (0.1) Hot (0.1) Cold (0.1 OR 0.05)

Goals Given N nodes, and a task set = {1, 2, ..., n}, where every task has an application-dependent recovery-time requirement RTRi derive bounds on the recovery time for each redundant-task type, decide a redundant-task type, i.e., active, hot or cold, and find an allocation where all tasks satisfy their recovery-time requirements while minimizing the number of nodes used for allocation.

Outline Motivation System Model Recovery Time Analysis Task Model Platform Model Fault Model Recovery Time Analysis Redundant-Task Type Assignment Task partitioning with Recovery Time Constraints RTT Heuristic Simulated Annealing Approach Evaluation Conclusions and Future Directions

Recovery Time Bounds – Active Replicas Primary Failure Implicit Recovery Primary Active Replica Time Active replicas run alongside the primary They allow seamless fault recovery

Backup Following the Primary Uncontrolled Release offsets Backup Following the Primary

Recovery Time Bounds - Hot Standbys

Recovery Time Bounds - Cold Standbys

Outline Motivation System Model Recovery Time Analysis Task Model Platform Model Fault Model Recovery Time Analysis Redundant-Task Type Assignment Task partitioning with Recovery Time Constraints RTT Heuristic Simulated Annealing Approach Evaluation Conclusions and Future Directions

Redundant-Task Type Assignment A recovery time of a redundant task must satisfy the RTR of the task. For RTR = n, RT ≤ (n+1) T We prioritize Cold and Hot standbys over Active replicas since they are more resource efficient.

Redundant-Task Type Assignment The task allocation directly affects the choice of redundant-task type assignment.

Outline Motivation System Model Recovery Time Analysis Task Model Platform Model Fault Model Recovery Time Analysis Redundant-Task Type Assignment Task partitioning with Recovery Time Constraints RTT Heuristic Simulated Annealing Approach Evaluation Conclusions and Future Directions

Fault-Tolerant Task Allocation Problem definition: Given a task set, assign every task to a node such that No primary or any of its standby is allocated to the same node while minimizing the number nodes used for allocation. Node 1 Node 1 Node 2 Hot(0.1) Primary (0.1) Primary (0.1) Primary (0.1) Hot (0.1) Placement Conflict No Placement Conflict Goal: Avoid common-mode failures while optimizing cost, weight and space.

Task Allocation Heuristics Finding an optimal task allocation given an arbitrary task set is NP-hard. Hence we use heuristics for task allocation. Any Task Allocation Heuristic has 2 steps Task Ordering Node Ordering

RTR-tiered (RTT) heuristic (1 of 3) Task Ordering – Tiered by increasing RTR constraints and descending order of Task Utilization (Ui) in each tier for critical tasks.

RTR-tiered (RTT) heuristic (2 of 3) Node Ordering – Non increasing total task utilization order excluding nodes with placement conflicts

RTR-tiered (RTT) heuristic (3 of 3) Non –critical tasks are ordered in decreasing utilization values and assigned last after accounting for lower replica utilizations

Evaluation: Standby Distribution

Evaluation : Node Usage

Applying Simulated Annealing Problem formulation: Given n tasks (1, 2, ..., n), with utilization (u1, u2, ..., un), where ui ≤ 1, find the number of nodes M of size 1 that are needed to pack all tasks such that a primary task and its corresponding redundant copies obey the placement constraint of not being co-located on the same node and optimizing the following cost function

Simulated Annealing

Generating Random Solutions (1 of 2) We randomly move a single task from a randomly-selected node k to another randomly-selected node l.

Generating Random Solutions (2 of 2) We randomly select two tasks currently located in two different bins and swap them.

Evaluation: Node Usage

Evaluation: Execution Time

Conclusions and Future Work RTR of a task can be mapped to a redundant-task type assignment. Recovery-Time Tiered (RTT) prioritizes the RTR constraints in the task allocation sequence and produces better allocations than the TPCDC+R The simulated annealing method was adapted to solve the fault-tolerant task allocation optimization problem it produces allocations utilizing fewer computing resources than the proposed heuristics, at the cost of substantial run-time.

Thank you

Questions?