1 ORTEGA: An Efficient and Flexible Online Fault Tolerance Architecture for Real-Time Control Systems Xue Liu, Qixin Wang, Sathish Gopalakrishnan, Wenbo He, Lui Sha, Hui Ding, Kihwal Lee

2 Outline
- Motivation and related work
- ORTEGA goals
- ORTEGA architecture
- Details of ORTEGA designs
- Implementation and evaluation
- Demo

3 Motivations: Cyber-Physical Systems
- Real-world systems involve not only computer science but also knowledge from various other disciplines.
- Not only is the computer system becoming more complex; the complexity of the integrated system (i.e., the cyber-physical system) grows even faster.
- Major challenge: how can engineers of drastically different backgrounds collaborate with each other?

4 Motivations
- Control systems: conventional analog control systems; digital control systems
- Computer systems: real-time scheduling; fault tolerance; reliable/online software upgrade
- We need to design a framework in which computer engineers and control engineers can easily collaborate and integrate their knowledge.

6 Related work: Simplex architecture
- Demand: low-cost development of upgraded control systems for mission-critical control applications
  - Instead of multi-versioning, develop just one version
  - Focus on control theory
  - Runtime upgrade and testing of a single, possibly buggy, new version
- Applications:
  - Aircraft control (F-16, Seto et al., 2000)
  - Submarine control (NSSN, the new attack submarine program of the U.S. Navy)

7 Simplex for real-time control
[Figure: Simplex architecture, with a simple high-assurance control subsystem (HAC), a complex high-performance control subsystem (HPC), a decision module, and the plant.]

8 Simplex for real-time control
Given an LTI control system [equation not preserved in the transcript], the system is stable iff there exists a P > 0 such that the Lyapunov function [equation not preserved in the transcript].
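The formulas on this slide did not survive in the transcript. As a hedged reconstruction, the standard continuous-time Lyapunov statement that the text appears to paraphrase is sketched below. The name $A$ for the closed-loop system matrix is an assumption, not the slide's own notation.

```latex
\dot{x} = A x, \qquad
\text{stable} \iff \exists\, P \succ 0 :\; A^{T} P + P A \prec 0, \qquad
V(x) = x^{T} P x .
```

Along trajectories, $\dot{V}(x) = x^{T}(A^{T}P + PA)\,x < 0$ for $x \neq 0$, so each level set of $V$ is an ellipsoid that the state cannot leave once inside it.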

9 Simplex for real-time control
[Figure: state constraints, stability regions given by Lyapunov functions, and the maximum stability region (recovery region).]
We can choose a smaller solution ellipsoid (i.e., $x^T P x < x^T P_{\max} x$) to leave margins that guard against model, actuator, and measurement errors.

10 Drawbacks of Simplex
- P1: Lack of efficiency
  - The analytically redundant high-assurance controller (HAC) runs in parallel with the complex controller (HPC).
  - This lowers system performance and increases operating costs.
  - It limits the application of Simplex to safety-critical domains only.
- P2: Lack of flexibility
  - Simplex enforces the same execution period on the HAC and the HPC.
  - In practice, different controllers may use different periods for different performance considerations, for example fast HAC recovery.

11 Design goals of ORTEGA
- On-demand Real-TimE GuArd (ORTEGA): a new, efficient fault tolerance software architecture designed for real-time control systems
- More efficient resource usage (addresses P1)
  - Through on-demand real-time recovery
- Flexible design (addresses P2)
  - Allows the HAC and the HPC to run at different rates
  - Through new design and schedulability analysis
- Applicable to a wider range of real-time control systems

12 ORTEGA Architecture

13 On-demand execution of HAC
- At any time, only one of the HAC and the HPC is running to control the plant.
- The decision module (DM) uses a mutex semaphore to control which of the two is running.
  - While the HPC is running well, the HAC blocks on the semaphore.
  - Only when a fault is detected in the HPC does the DM release the semaphore so that the HAC can take over.
- The decision logic is based on stability regions, which are determined through Linear Matrix Inequality (LMI) theory (details later).
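The semaphore hand-off described on this slide can be sketched as follows. This is a minimal illustration, not the ORTEGA implementation: the class `DecisionModule`, its methods, and `run_demo` are hypothetical names, and Python threads stand in for real-time tasks.

```python
import threading


class DecisionModule:
    """Hypothetical decision module (DM) that gates the HAC with a semaphore."""

    def __init__(self):
        # Initial count 0: the HAC blocks until the DM releases the semaphore.
        self._hac_gate = threading.Semaphore(0)
        self.active = "HPC"

    def report_fault(self):
        """Called when a fault is detected in the HPC; releases the HAC once."""
        if self.active == "HPC":
            self.active = "HAC"
            self._hac_gate.release()

    def wait_until_needed(self, timeout=None):
        """Called by the HAC task; blocks while the HPC is running well."""
        return self._hac_gate.acquire(timeout=timeout)


def run_demo():
    dm = DecisionModule()
    log = []

    def hac_task():
        dm.wait_until_needed()       # blocked until a fault is reported
        log.append("HAC took over")

    worker = threading.Thread(target=hac_task)
    worker.start()
    log.append("HPC running")        # the HAC is still blocked here
    dm.report_fault()                # fault detected: release the HAC
    worker.join()
    return dm.active, log
```

Because the HAC consumes no CPU time while blocked, this hand-off is the source of the resource savings discussed next.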

14 CPU savings of ORTEGA
- HPC's timing parameters: {$C_p$, $T_p$}; HAC's timing parameters: {$C_a$, $T_a$}
- Pr: the percentage of time spent in recovery (HAC) during a total time of $T$
- Total CPU resource usage under Simplex:
- Total CPU resource usage under ORTEGA:
- CPU resource usage savings:
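The three formulas on this slide are missing from the transcript. The sketch below assumes the usual utilization model (execution time divided by period): under Simplex both controllers always run, while under ORTEGA the HPC runs for a fraction 1 - Pr of the time and the HAC for a fraction Pr. The function names and the numeric values are illustrative assumptions.

```python
def simplex_usage(c_p, t_p, c_a, t_a):
    """CPU utilization when the HPC and the HAC both run all the time (assumed model)."""
    return c_p / t_p + c_a / t_a


def ortega_usage(c_p, t_p, c_a, t_a, pr):
    """CPU utilization when only one controller runs at a time (assumed model).

    pr is the fraction of time spent in recovery, i.e. with the HAC in control.
    """
    return (1.0 - pr) * c_p / t_p + pr * c_a / t_a


def relative_saving(c_p, t_p, c_a, t_a, pr):
    """Fraction of the Simplex utilization that ORTEGA avoids."""
    base = simplex_usage(c_p, t_p, c_a, t_a)
    return (base - ortega_usage(c_p, t_p, c_a, t_a, pr)) / base


# Hypothetical numbers: 2 ms HPC and 1 ms HAC, both every 20 ms, 5% recovery time.
saving = relative_saving(c_p=0.002, t_p=0.02, c_a=0.001, t_a=0.02, pr=0.05)
print(f"relative CPU saving: {saving:.2%}")
```

With Pr = 0 this model reduces to a relative saving of U_a / (U_a + U_p). As a back-calculation (not a value stated in the source), a single ratio C_a / C_p of about 0.414 reproduces both savings figures reported on the evaluation slide, which suggests the assumed model is at least consistent with them.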

15 No free lunch: the on-demand execution of the HAC incurs an extra delay of up to one period $T_a$.
[Figure: timelines comparing Simplex and ORTEGA.]

16 Handle the extra delay by state projections
(1) The extra delay causes disturbances when a fault occurs, which is infrequent.
(2) But the gain in resource usage is large.
Resource usage reduction vs. extra delay:

17 Recovery region design
[Figure: state constraints, stability regions given by Lyapunov functions, and the maximum stability region (recovery region).]
- The decision module uses the recovery region to determine when to switch to the HAC.
- The recovery region is defined as the maximum region in which the HAC can make the plant stable.
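A minimal sketch of such a switching rule is shown below. It assumes the recovery region is the ellipsoid x^T P x <= 1 and that the decision module switches slightly before the boundary, at a threshold below 1. The function names, the threshold, and the matrix P are illustrative assumptions.

```python
def quadratic_form(p, x):
    """Return x^T P x for a square matrix p (list of rows) and a state vector x."""
    n = len(x)
    return sum(x[i] * p[i][j] * x[j] for i in range(n) for j in range(n))


def choose_controller(p, x, threshold=0.8):
    """Keep the HPC while the state is well inside the recovery region.

    The region is assumed to be {x : x^T P x <= 1}; a threshold below 1
    leaves a safety margin before the HAC has to take over.
    """
    return "HPC" if quadratic_form(p, x) <= threshold else "HAC"


P = [[4.0, 0.0], [0.0, 1.0]]                 # hypothetical Lyapunov matrix
print(choose_controller(P, (0.2, 0.5)))      # x^T P x = 0.41
print(choose_controller(P, (0.45, 0.5)))     # x^T P x = 1.06
```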

18 Determine recovery region (1) State constraints: Digital controllers: Stability region: The discrete LTI control system is stable iff there exists a P>0, such that

19 Determine recovery region (1) State constraints: Digital controllers: Stability region: Stability region of the system with respect to P is defined as
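The equations for these two slides are not preserved in the transcript. A hedged reconstruction in the standard discrete-time form follows. The symbols $A_d$ (closed-loop matrix of the sampled system), $a_k$, and $q$ are assumed names, and normalizing the region to level 1 is an assumption as well.

```latex
\begin{aligned}
&\text{State constraints:} && a_k^{T} x \le 1, \quad k = 1, \dots, q \\
&\text{Digital controller (closed loop):} && x(k+1) = A_d\, x(k) \\
&\text{Stability:} && \exists\, P \succ 0 :\; A_d^{T} P A_d - P \prec 0 \\
&\text{Stability region w.r.t. } P: && \mathcal{S}(P) = \{\, x : x^{T} P x \le 1 \,\}
\end{aligned}
```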

20 Determine recovery region (2)
Theorem: Determining the maximum stability region of the digitally implemented closed-loop system with constraints (1) can be transformed into the following MAXDET (LMI) problem, whose objective is the area of the recovery region and whose constraints express stability and the state constraints.
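The optimization problem itself is missing from the transcript. One standard way to write such a MAXDET problem, given here as a sketch and not necessarily in the slide's exact form, uses the substitution $Q = P^{-1}$. The symbols $A_d$ and $a_k$ are assumed names for the discrete closed-loop matrix and the state-constraint vectors.

```latex
\begin{aligned}
\min_{Q}\quad & \log\det Q^{-1} \\
\text{s.t.}\quad & A_d\, Q\, A_d^{T} - Q \prec 0, \quad Q \succ 0 && \text{(stability)} \\
& a_k^{T} Q\, a_k \le 1, \quad k = 1, \dots, q && \text{(state constraints)}
\end{aligned}
```

The volume of the ellipsoid $\{x : x^{T} Q^{-1} x \le 1\}$ is proportional to $\sqrt{\det Q}$, so minimizing $\log\det Q^{-1}$ maximizes the region. The stability constraint is equivalent to $A_d^{T} P A_d - P \prec 0$ and is linear in $Q$.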

21 Recovery region vs. control loop period
- Stability index A(T): the area of the maximum stability region, as a function of the control loop period T.
- The smaller the control loop period, the larger the maximum stability region (recovery region).
- Example: an inverted pendulum [figure: system model and controller].
- ORTEGA allows a larger recovery region (more flexible).

22 Implementation and evaluation
- Plant: inverted pendulum from Quanser
- CPU: Pentium II, 350 MHz
- OS: Linux kernel with RMS
- HAC: field-tested state feedback controller
- Evaluation of CPU savings:
  - If the HAC and the HPC both run at 50 Hz, ORTEGA's CPU saving is 29.29%.
  - If the HAC runs at 50 Hz and the HPC runs at 20 Hz, ORTEGA's CPU saving is 50.87%.

23 Evaluation of fault tolerance
- Infinite-loop bug
- Non-performing bug
- Maximum control output bug
- Divide-by-zero bug
- Bang-bang type bug
- Positive feedback bug
- Tricky design bug
- …

24 Evaluation of fault tolerance

25 Evaluation of fault tolerance

26 Thank You Q&A

27 Backup Slides

Simplex: software engineering economics: the more effort, the more reliable. Reliability: the probability that no failure happens during [0, t]. [Figure labels: failure rate, complexity, effort.]

Simplex: comparison with N-version

Simplex: roots in the recovery block: only one version must be correct

Simplex: recovery block: dividing into more alternatives doesn't always gain.

Simplex: recovery block: reducing complexity gains. One alternative has complexity 1, ½, and 1/10

Simplex: a two-alternative recovery block with reduced complexity wins if a reliable acceptance test is possible.

Simplex: recovery block: reducing complexity gains
- RB2: 2 alternatives, same complexity C = 1, perfect acceptance test
- *RB2L5: 2 alternatives, C1 = 1, C2 = 1/5, imperfect acceptance test whose reliability equals that of alternative 2

35 Schedulability analysis of ORTEGA

36 Mode-Change Problem Incurred by Recovery
- Example: suppose one plant has an HPC task $\tau_1^p$: $(C_1^p, T_1^p) = (3, 5)$ and an HAC task $\tau_1^a$: $(C_1^a, T_1^a) = (4, 10)$, alongside another real-time task $\tau_2$: $(C_2, T_2) = (6, 15)$.
- Tasks can become unschedulable because of the recovery:
  - Before the recovery at t = 10, $\{\tau_1^p, \tau_2\} = \{(3,5), (6,15)\}$ is schedulable.
  - After the recovery transition, $\{\tau_1^a, \tau_2\} = \{(4,10), (6,15)\}$ is also schedulable.
  - However, during the recovery transition, $\tau_2$ misses its deadline at t = 15!
- Mode change in fixed-priority scheduling is recognized by the real-time community as a difficult problem.
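The deadline miss in this example can be reproduced with a small unit-slot simulation. The sketch assumes preemptive fixed-priority scheduling in which the controller task always has priority over the second task (rate-monotonic order), that the HAC job is released at the recovery instant, and that the pending HPC job is dropped. The function name and these modelling choices are assumptions, not taken from the slides.

```python
def tau2_finish_time(recovery_at=None, horizon=15):
    """Return the time at which the first job of tau_2 finishes, or None on a miss.

    Time advances in unit slots. The controller task (HPC before recovery,
    HAC after) always preempts tau_2.
    """
    c_p, t_p = 3, 5      # HPC task: execution time, period
    c_a, t_a = 4, 10     # HAC task: execution time, period
    c_2 = 6              # tau_2 execution time (deadline = horizon = 15)

    ctrl_remaining = 0   # work left in the current controller job
    tau2_remaining = c_2
    next_release = 0
    in_recovery = False

    for t in range(horizon):
        if recovery_at is not None and t == recovery_at and not in_recovery:
            in_recovery = True
            ctrl_remaining = 0   # drop the faulty HPC job
            next_release = t     # release the HAC immediately
        if t == next_release:
            ctrl_remaining = c_a if in_recovery else c_p
            next_release = t + (t_a if in_recovery else t_p)
        if ctrl_remaining > 0:
            ctrl_remaining -= 1
        elif tau2_remaining > 0:
            tau2_remaining -= 1
            if tau2_remaining == 0:
                return t + 1
    return None


print(tau2_finish_time())                 # HPC only
print(tau2_finish_time(recovery_at=0))    # HAC only
print(tau2_finish_time(recovery_at=10))   # switch at t = 10
```

In the two steady-state cases $\tau_2$ finishes by its deadline, while the switch at t = 10 leaves it one unit of work short at t = 15, matching the slide.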

37 Schedulability Analysis
- We adopt the work by Real and Crespo (2004).
- Idea: analyze the transitional scheduling overhead incurred by the recovery.
  (I) Schedulability analysis of the steady-state task set
  (II) Schedulability analysis of old-mode tasks with transitional scheduling overhead (due to the mode change)
  (III) Schedulability analysis of new-mode tasks with transitional scheduling overhead (due to the mode change)

38 Fault Tolerance and Scheduling Co-design: one FT-enabled task case
- Maximize the recovery region subject to the schedulability constraint.
- Find the smallest (optimal) control loop period $T_k^{*a}$ such that the task set is schedulable under random recoveries.
- Given the schedulability test, we can use a binary search algorithm to find $T_k^{*a}$.
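The binary search mentioned on this slide can be sketched as follows, assuming integer periods and a schedulability test that is monotonic in the period (if a period is schedulable, every longer period is too). The function name is hypothetical, and the utilization-bound test in the example is only a stand-in for the mode-change analysis the slides adopt.

```python
def smallest_schedulable_period(is_schedulable, lo, hi):
    """Return the smallest integer period in [lo, hi] that passes the test.

    Assumes monotonicity: once a period is schedulable, all longer ones are.
    Returns None if even the longest period fails.
    """
    if not is_schedulable(hi):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if is_schedulable(mid):
            hi = mid          # mid works; try shorter periods
        else:
            lo = mid + 1      # mid fails; the answer is longer
    return lo


# Stand-in test: Liu-Layland utilization bound for two tasks, with a HAC of
# execution time 4 and a second task (6, 15). Illustrative numbers only.
LL_BOUND_2 = 2 * (2 ** 0.5 - 1)


def utilization_test(period):
    return 4 / period + 6 / 15 <= LL_BOUND_2


print(smallest_schedulable_period(utilization_test, 1, 100))
```

A shorter HAC period enlarges the recovery region, so the smallest period that still passes the test is the best one under this constraint.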

39 P2: Recovery Region for Digital Controllers
[Figure: controller with sampling time h and zero-order hold.]
Theorem (Lyapunov): The discrete-time LTI system shown above is stable iff there exists a matrix P > 0 such that

40 Stability Region (Continued)
- The stability region of the system with respect to P is defined as:
- Stability region with constraints:
  - State constraints
  - Control input constraints
  - These can be combined in the closed-loop system as
- Lemma: The stability region defined above satisfies constraints (1) iff
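The lemma's condition is missing from the transcript. A standard result of this kind, assumed here and not confirmed by the source, is that the ellipsoid {x : x^T P x <= 1} lies inside the half-space a^T x <= 1 exactly when a^T P^{-1} a <= 1, because the maximum of a^T x over the ellipsoid is sqrt(a^T P^{-1} a). The sketch below checks this for a 2-by-2 case; the function names and numbers are illustrative.

```python
def inverse_2x2(p):
    """Invert a 2x2 matrix given as a list of two rows."""
    (a, b), (c, d) = p
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def region_satisfies(p, constraints):
    """Check a^T P^{-1} a <= 1 for every constraint vector a (assumed lemma form)."""
    q = inverse_2x2(p)
    for a in constraints:
        value = (a[0] * (q[0][0] * a[0] + q[0][1] * a[1])
                 + a[1] * (q[1][0] * a[0] + q[1][1] * a[1]))
        if value > 1.0:
            return False
    return True


P = [[4.0, 0.0], [0.0, 1.0]]                # ellipse with semi-axes 0.5 and 1
print(region_satisfies(P, [(1.0, 0.0)]))    # x1 <= 1 holds on the whole ellipse
print(region_satisfies(P, [(0.0, 2.0)]))    # x2 <= 0.5 is violated near x2 = 1
```

A control input constraint of the form b^T u <= 1 with state feedback u = Kx can be checked the same way by using a = K^T b.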