Reset-Based Recovery for Real-Time Cyber-Physical Systems with Temporal Safety Constraints. Fardin Abdi, Renato Mancuso, Stanley Bak, Or Dantsker, Marco Caccamo. 21st Conference on Emerging Technologies and Factory Automation
Safety-Critical CPS
CPS Safety Constraints: physical limits; regulations (e.g., maximum altitude set by the FAA)
Safety Is Only Meaningful with Liveness
Software Faults: Main Obstacle for Safety and Liveness
Software Faults: A Major Challenge. Verification: cost; correctness; upgrades (time/cost); 3rd-party SW; specialized knowledge; not always doable. Testing: no guarantees.
Our approach: Tolerate Faults and Recover using Restarts
Recovery Using Restarts [Panels: Cyber-Physical Systems; Traditional Computers]. First, most bugs in production-quality software are Heisenbugs \cite{candea2001recursive}, which are hard to reproduce or depend on the timing of external events, for example race conditions. Restarting is very effective at recovering from this type of bug. Second, restarting can reclaim all stale resources, clean up all corrupt state (e.g., memory leaks, dangling pointers, damaged heap), and take the system back to a known, well-tested state within a predictable amount of time.
Two Types of Safety Constraints
System Constraints I: Linear Constraints. Example: $\left\{ \begin{array}{l} p < 2 \\ p/4 + t < 2.5 \end{array} \right.$ where $p$ is pressure and $t$ is temperature.
System Constraints II: Overrun Constraints. Example: $\text{Stress}(p) = \left\{ \begin{array}{ll} 1 & p > 10 \\ 0 & p \leq 10 \end{array} \right.$ with $\int_{t}^{t+16} \text{Stress}(p(\tau))\, d\tau \leq 15$. [Figure: power vs. time, threshold at P = 10]
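The overrun example above can be checked on a sampled power trace. The following is a minimal sketch, assuming uniformly spaced samples and a rectangle-rule approximation of the integral; the function names, the sampling step `dt`, and the test traces are illustrative and not part of the original slides.

```python
def stress(p, threshold=10.0):
    """Stress(p) from the slide: 1 when power exceeds the threshold, else 0."""
    return 1.0 if p > threshold else 0.0


def overrun_ok(power, dt, window=16.0, budget=15.0):
    """Check that the stress integral stays within budget in every window.

    power  : list of power samples taken every dt time units (assumption)
    window : window length (16 in the slide example)
    budget : maximum allowed accumulated stress per window (15 in the example)
    """
    n = max(1, round(window / dt))            # samples per window
    vals = [stress(p) * dt for p in power]    # rectangle-rule contributions
    last_start = max(1, len(vals) - n + 1)
    for start in range(last_start):
        if sum(vals[start:start + n]) > budget:
            return False
    return True


# Hypothetical traces: the first leaves one low-power sample in every window.
trace_ok = [12.0] * 15 + [5.0] + [12.0] * 15
trace_bad = [12.0] * 16
print(overrun_ok(trace_ok, dt=1.0), overrun_ok(trace_bad, dt=1.0))
```

With these numbers, power may exceed 10 for at most 15 time units out of any 16, so a single low-power sample per window is enough to satisfy the constraint.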
Architecture. WD timers: restart the board if components fail. SC: can always keep the system safe. RTR: predicts whether the future states are safe. CC: not verified; can create unsafe commands. FS switch: switches to SC during the restart. Rescue Unit: bare metal, verified. Main Unit: OS/firmware; can fail. [Diagram labels: Sensors, FS Switch, Control Command, Complex Controller, Physical Plant, WD Timer, MUX, Safety Controller, FS Enable, RTR Module, RESET PIN, Rescue Unit, Main Unit]
Fault Model. Rescue Unit: verified, no faults. RTR unit: fail-stop failure model. Complex Controller: any type of fault.
Safety Controller Design. Goals: keep the system within the linear constraints; satisfy the overrun constraints; stay within the limits of the actuators. Strategy: find a region where all of the above are always satisfied; design a state-feedback controller that keeps the system within that region.
Finding a Safe Region for Overrun Constraints. Example: $O = \{x \mid \text{Stress}(x) \leq \frac{C}{T^{win}}\}$, chosen so that $\forall t:\ \int_{t}^{t+T^{win}} \text{Stress}(x(\tau))\, d\tau \leq C$ holds while the state stays in $O$.
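The reason the region $O$ is sufficient can be written out in one line. This is a short worked bound, assuming the state $x(\tau)$ stays inside $O$ for the whole window; it uses only the symbols from the slide.

```latex
% Assumption: x(\tau) \in O for every \tau in [t, t + T^{win}]
\int_{t}^{t+T^{win}} \text{Stress}(x(\tau))\, d\tau
  \;\leq\; \int_{t}^{t+T^{win}} \frac{C}{T^{win}}\, d\tau
  \;=\; T^{win} \cdot \frac{C}{T^{win}}
  \;=\; C
```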
Safety Controller Design. Linear constraints: $a_m^T x \leq 1,\ m = 1, \dots, q$. Overrun constraints: $c_{i,k}^T x \leq 1,\ k = 1, \dots, p_i,\ i = 1, \dots, p$. Actuator limits: $b_j^T u \leq 1,\ j = 1, \dots, r$. Gamma: the intersection of all the linear inequalities. Use an LMI solver to find a linear state-feedback controller and its Q matrix.
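A membership test for Gamma follows directly from the normalized inequalities. The sketch below is illustrative only: the function names are hypothetical, the state rows reuse the pressure/temperature example rescaled to the form $a^T x \leq 1$ (so $p < 2$ becomes $0.5\,p \leq 1$ and $p/4 + t < 2.5$ becomes $0.1\,p + 0.4\,t \leq 1$, with the strict inequalities relaxed), and the actuator row is an assumed limit $u \leq 5$.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def in_gamma(x, u, state_rows, input_rows):
    """True if every normalized inequality row^T v <= 1 holds.

    state_rows : rows a_m and c_{i,k}, applied to the state x
    input_rows : rows b_j, applied to the control input u
    """
    return (all(dot(row, x) <= 1.0 for row in state_rows)
            and all(dot(row, u) <= 1.0 for row in input_rows))


# State x = (p, t); rows are the slide's linear example in normalized form.
STATE_ROWS = [[0.5, 0.0], [0.1, 0.4]]
INPUT_ROWS = [[0.2]]          # hypothetical actuator limit u <= 5

print(in_gamma([1.0, 1.0], [2.0], STATE_ROWS, INPUT_ROWS))
print(in_gamma([3.0, 0.0], [2.0], STATE_ROWS, INPUT_ROWS))
```

Normalizing every constraint to a right-hand side of 1 keeps the state, overrun, and actuator rows in one uniform form, which is the shape the slide's inequalities already use.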
Stability Region: under the control of SC, any point inside R will remain inside R. [Figure: Gamma and the stability region R]
Switching Condition for Hard Constraints: (1) $\text{Reach}_{\leq T_c}(x, CC) \subseteq \mathcal{S}$; (2) $\text{Reach}_{\leq T_s}(\text{Reach}_{\leq T_c}(x, CC), SC) \subseteq \mathcal{S}$; (3) $\text{Reach}_{= T_s}(\text{Reach}_{\leq T_c}(x, CC), SC) \subseteq \mathcal{R}$.
Switching Condition for Hard Constraints [Figure: safe region S and stability region R]
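The three reachability conditions can be illustrated on a toy plant. The sketch below is not the authors' reachability tool: it assumes a hypothetical 1-D system where an arbitrary CC drives $\dot{x} = u$ with $|u| \leq 1$ and the SC enforces $\dot{x} = -x$, so that reach sets are intervals that can be computed in closed form. The sets S and R, all function names, and the reading that CC may keep running only when all three conditions hold are assumptions.

```python
import math


def subset(inner, outer):
    """Interval inclusion: inner = (lo, hi) lies inside outer = (lo, hi)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def reach_cc(x, t_c, u_max=1.0):
    """Reach_{<=Tc}(x, CC) for x' = u with |u| <= u_max and arbitrary CC."""
    return (x - u_max * t_c, x + u_max * t_c)


def reach_sc_at(interval, t_s):
    """Reach_{=Ts}(interval, SC) for x' = -x: every state decays by e^{-Ts}."""
    decay = math.exp(-t_s)
    return (interval[0] * decay, interval[1] * decay)


def reach_sc_upto(interval, t_s):
    """Reach_{<=Ts}(interval, SC): hull of the start set and the set at Ts."""
    end = reach_sc_at(interval, t_s)
    return (min(interval[0], end[0]), max(interval[1], end[1]))


def conditions_hold(x, t_c, t_s, safe, stable):
    """Evaluate the three switching conditions for the toy plant."""
    after_cc = reach_cc(x, t_c)
    return (subset(after_cc, safe)                              # condition (1)
            and subset(reach_sc_upto(after_cc, t_s), safe)      # condition (2)
            and subset(reach_sc_at(after_cc, t_s), stable))     # condition (3)


S = (-2.0, 2.0)   # hypothetical safe region
R = (-1.0, 1.0)   # hypothetical stability region
print(conditions_hold(0.5, t_c=0.1, t_s=1.0, safe=S, stable=R))
print(conditions_hold(1.95, t_c=0.1, t_s=1.0, safe=S, stable=R))
```

Because trajectories of $\dot{x} = -x$ move monotonically toward the origin, the hull of the initial interval and the interval at $T_s$ covers every state visited in between; for a real plant this step would need a proper reachability computation.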
Switching Conditions for Overrun Constraints. Due to the design of the stability region: $\int_{0}^{T^{win}} \text{Stress}(x(\tau))\, d\tau \leq \alpha C$
Switching Conditions for Overrun Constraints. We keep track of the past stress in an array. We predict future stress using reachability analysis. [Figure: numbered time axis with $T^{win} = 14$ and $T_c$ marked; labels: current time, stored in array, future predictions, interval of time, sum of stress in this interval of time]
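The bookkeeping described on this slide can be sketched as a sliding-window check over stored past stress plus predicted future stress. This is a minimal illustration under stated assumptions: time is discretized into unit steps, each sample is the stress accumulated in one step, and the class name `StressTracker`, its methods, and the numbers in the usage lines are hypothetical.

```python
from collections import deque


def max_window_sum(samples, win):
    """Largest sum over any run of `win` consecutive samples."""
    if len(samples) <= win:
        return sum(samples)
    current = sum(samples[:win])
    best = current
    for i in range(win, len(samples)):
        current += samples[i] - samples[i - win]   # slide the window by one
        best = max(best, current)
    return best


class StressTracker:
    """Keeps the last win-1 stress samples and tests predicted futures."""

    def __init__(self, win, budget):
        self.win = win
        self.budget = budget
        # Only win-1 past samples can share a window with a future sample.
        self.past = deque(maxlen=win - 1)

    def record(self, stress_sample):
        """Store the stress observed during the step that just ended."""
        self.past.append(stress_sample)

    def future_is_safe(self, predicted):
        """True if no window over past + predicted samples exceeds budget."""
        samples = list(self.past) + list(predicted)
        return max_window_sum(samples, self.win) <= self.budget


tracker = StressTracker(win=4, budget=2)
for s in (1, 0, 1):
    tracker.record(s)
print(tracker.future_is_safe([0, 0]), tracker.future_is_safe([1, 0]))
```

Keeping only the most recent `win - 1` samples bounds the memory needed, since older samples cannot fall in the same window as any predicted step.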
Evaluations
Restarting in Action
Flight Trace
Progress Analysis
Stability Region Size – Experiment 1. No overrun constraints. [Figure legend: LMI-Simplex; RTR, our approach]
Stability Region Size – Experiment 2. [Figure legend: no overrun constraints: LMI-Simplex RTR; with overrun constraints: our approach]
Thank You!
Support Slides
Introducing $\alpha$: $\Gamma = \{x \mid a_m^T x \leq 1,\ m = 1, \dots, q;\ \ c_{i,k}^T x \leq 1,\ k = 1, \dots, p_i,\ i = 1, \dots, p;\ \ b_j^T u \leq 1,\ j = 1, \dots, r\}$
If O Is Not Linear: find a convex region inside O.
How to Predict Stress Using Reachability. MaxSumStress($[t_1, t_2]$): returns the maximum of the integral of the stress function over a given window $[t_1, t_2]$. [Figure: power vs. time]
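One conservative way to realize MaxSumStress is to bound the stress from reachable-set bounds on power. The sketch below assumes a hypothetical input format: one `(lo, hi)` bound on power per time step of length `dt`, valid over the whole step, as a reachability tool might produce. The function name, the threshold of 10 from the earlier example, and the sample bounds are illustrative.

```python
def max_sum_stress(reach_bounds, dt, threshold=10.0):
    """Upper bound on the stress integral over the steps in reach_bounds.

    A step counts as stressed whenever its upper power bound can exceed the
    threshold, so the result never underestimates the true integral.
    """
    return sum(dt for _lo, hi in reach_bounds if hi > threshold)


# Hypothetical per-step power bounds over the window [t1, t2].
bounds = [(8.0, 9.0), (9.0, 11.0), (10.5, 12.0)]
print(max_sum_stress(bounds, dt=0.5))
```

The bound is pessimistic when the reachable interval straddles the threshold, which errs on the side of safety at the cost of switching earlier than strictly necessary.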