Fault-Tolerant Embedded Systems: Scheduling and Optimization
Viacheslav Izosimov, Petru Eles, Zebo Peng (Embedded Systems Lab (ESLAB), Linköping University, Sweden)
Paul Pop (Computer Science & Engineering Group, Technical University of Denmark, Denmark)
Motivation
Hard real-time applications: time-constrained, cost-constrained, fault-tolerant, etc.
Focus on transient faults and intermittent faults
Outline
Motivation
Background and limitations of previous work
Contributions:
  Scheduling with fault tolerance requirements
  Trading off transparency for performance
  Fault tolerance policy assignment
  Mapping optimization with transparency
  Checkpoint optimization
Current work
Fault Tolerance Techniques
Re-execution (with error-detection overhead)
Rollback recovery with checkpointing (with checkpointing and recovery overheads)
Active replication
[Figure: time-line examples of re-execution, checkpointing, and active replication for process P1 on nodes N1 and N2]
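The overheads above determine the worst-case execution time of a process under each technique. A minimal sketch, assuming a simple recovery model (re-execution: k full re-runs, each preceded by a recovery overhead mu; checkpointing with n checkpoints: only the faulty segment C/n is re-executed per fault). The symbols C, k, mu, chi and the function names are illustrative, not taken from the slides:

```python
def reexecution_wcet(C, k, mu):
    """Worst case under re-execution: the primary run plus k full
    re-executions, each preceded by a recovery overhead mu."""
    return C + k * (C + mu)

def checkpointing_wcet(C, k, n, chi, mu):
    """Worst case with n equally spaced checkpoints (overhead chi each):
    per fault, only the faulty segment C/n is re-executed after recovery."""
    return C + n * chi + k * (C / n + mu)

print(reexecution_wcet(50, 2, 15))           # -> 180
print(checkpointing_wcet(50, 2, 2, 10, 15))  # -> 150.0
```

Checkpointing trades a fixed overhead (n checkpoints) for a much shorter per-fault recovery, which is why it pays off as k grows.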
Fault-Tolerant Time-Triggered Systems
Transient faults: maximum k transient faults within each application run (system period)
Processes: re-execution, active replication, rollback recovery with checkpointing
Messages: fault-tolerant predictable protocol
Static scheduling
[Figure: application graph with processes P1-P5 and messages m1, m2]
Limitations of Previous Work
Design optimization with fault tolerance is limited
Process mapping is not considered together with fault tolerance issues
Multiple faults are not addressed in the framework of static cyclic scheduling
Transparency, if addressed at all, is restricted to a whole computation node
Scheduling with Fault Tolerance Requirements
Conditional Scheduling
[Figure sequence over several slides: step-by-step construction of a conditional schedule for processes P1 and P2 with message m1 under k = 2 faults; the branches cover the re-executions P1/1-P1/3 and P2/1-P2/3, conditioned on where faults occur]
Conditional Schedule Table
[Figure: conditional schedule table for P1, m1, and P2 on nodes N1 and N2, k = 2; each activity has one start time per fault condition]
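A conditional schedule table can be sketched as a nested lookup: each activity maps every observed fault condition to a start time. The encoding of conditions as tuples of fault events and all concrete values below are illustrative assumptions, not the paper's exact table:

```python
# Conditional schedule table for one node: start times depend on which
# fault events have been observed so far (empty tuple = no faults).
schedule_N1 = {
    "P1": {(): 0, ("F_P1",): 30, ("F_P1", "F_P1"): 60},  # re-executions of P1
    "m1": {(): 30, ("F_P1",): 60, ("F_P1", "F_P1"): 90},
}

def start_time(table, activity, condition):
    """Look up the start time of an activity under a fault condition."""
    return table[activity][condition]

print(start_time(schedule_N1, "m1", ("F_P1",)))  # -> 60
```

The table's size grows with the number of fault conditions, which is the memory cost the next slide weighs against schedule length.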
Conditional Scheduling
Conditional scheduling:
+ Generates short schedules
- Scheduling algorithm is very slow
- Requires a lot of memory to store schedule tables
Alternative: shifting-based scheduling
  Messages sent over the bus are scheduled at a single time
  Faults on one computation node must not affect other nodes
+ Requires less memory
+ Schedule generation is very fast
- Schedules are longer
Root Schedules
[Figure: root schedule for P1-P4 with messages m1-m3 on N1, N2, and the bus; the worst-case scenario for P1 and the recovery slack shared by P1 and P2 are highlighted]
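In a root schedule, processes placed back-to-back on one node can share a single recovery slack, sized for the worst case among them. A minimal sketch of that idea; the slack model k*(C + mu) and all names are assumptions for illustration, not the authors' exact algorithm:

```python
def shared_recovery_slack(wcets, k, mu):
    """Recovery slack shared by consecutive processes on one node: it must
    accommodate k recoveries (plus overhead mu each) of the largest one."""
    return max(k * (c + mu) for c in wcets)

# P1 (C=30) and P2 (C=20) share one slack instead of reserving two:
print(shared_recovery_slack([30, 20], k=2, mu=5))  # -> 70
```

Sharing the slack is what keeps shifting-based schedules from paying the full per-process recovery cost in the common no-fault case.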
Extracting Execution Scenarios
[Figure: execution scenario extracted from the root schedule on N1, N2, and the bus, with P4 re-executed as P4/1-P4/3]
Trading Off Transparency for Performance
FT Implementations with Transparency
Transparency is achieved with frozen processes and messages
Good for debugging and testing
[Figure: application graph with processes P1-P5 and messages m1, m2; the legend distinguishes regular from frozen processes/messages, with P3 frozen]
No Transparency
k = 2; recovery overhead of 5 ms (the overhead symbol was lost in extraction)
[Table: worst-case execution times of P1-P4 on N1 and N2 (e.g., 30 and 20 ms); X marks nodes a process cannot be mapped to]
[Figure: no-fault scenario vs. worst-case fault scenario on N1, N2, and the bus]
Without transparency, processes start at different times and messages are sent at different times in the two scenarios.
Full Transparency vs. Customized Transparency
[Figure: no-fault and worst-case schedules under no transparency, customized transparency, and full transparency, compared against the deadline]
Trading Off Transparency for Performance
How much longer is the schedule with fault tolerance?
[Chart: schedule-length increase (0-100%) for k = 1, 2, 3 under increasing transparency; four computation nodes, recovery time 5 ms]
Trading transparency for performance is essential
Fault Tolerance Policy Assignment
Fault Tolerance Policy Assignment
Re-execution: P1/1, P1/2, P1/3 on N1
Replication: P1(1) on N1, P1(2) on N2, P1(3) on N3
Re-executed replicas: P1(1)/1, P1(1)/2 on N1; P1(2) on N2
[Figure: the three policies applied to process P1, k = 2]
Re-execution vs. Replication
[Figure: for application A1 (P1-P3 with message m1), re-execution misses the deadline while replication meets it, so replication is better; for application A2 (P1-P3 with messages m1, m2), re-execution meets the deadline while replication misses it, so re-execution is better]
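The two outcomes above can be made concrete with a crude cost comparison: re-execution keeps everything on one node but reserves recovery slack, while replication runs k+1 parallel copies and needs no slack (at the cost of extra nodes and bus traffic, which this toy model ignores). All names, the cost model, and the numbers are illustrative assumptions:

```python
def reexec_finish(chain, k, mu):
    """Chain of processes on one node with a shared recovery slack:
    the slack must fit k recoveries of the longest process."""
    return sum(chain) + k * (max(chain) + mu)

def replication_finish(chain):
    """k+1 replicas of the chain run in parallel on disjoint nodes;
    no recovery slack is needed (inter-replica message delays ignored)."""
    return sum(chain)

chain, k, mu, deadline = [30, 20, 25], 1, 5, 100
print(reexec_finish(chain, k, mu), replication_finish(chain))  # -> 110 75
```

Here replication wins against a 100 ms deadline; with tighter bus bandwidth or fewer nodes the comparison flips, which is why the policy must be chosen per process.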
Fault Tolerance Policy Assignment
Optimization of fault tolerance policy assignment
[Figure: for P1-P4 with messages m1-m3, pure re-execution misses the deadline, pure replication misses the deadline, and a combined policy assignment meets it]
Optimization Strategy
Design optimization:
  Fault tolerance policy assignment
  Mapping of processes and messages
Tabu search over root schedules, evaluated with shifting-based scheduling
Three tabu-search optimization algorithms:
1. Mapping and fault tolerance policy assignment (MRX): re-execution, replication, or both
2. Mapping and only re-execution (MX)
3. Mapping and only replication (MR)
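The tabu-search loop underlying MRX, MX, and MR can be sketched generically. This toy version only remaps processes between nodes to minimize a crude load-balance cost; the real algorithms also move policy assignments and call shifting-based scheduling to evaluate each candidate. All names and the cost model are illustrative assumptions:

```python
def tabu_search_mapping(wcets, n_nodes=2, iters=50, tabu_len=4):
    """Toy tabu search: a mapping is a list of node ids, one per process;
    cost = maximum node load (a crude proxy for schedule length)."""
    mapping = [0] * len(wcets)              # start: everything on node 0

    def cost(m):
        loads = [0] * n_nodes
        for p, node in enumerate(m):
            loads[node] += wcets[p]
        return max(loads)

    best, best_cost = mapping[:], cost(mapping)
    tabu = []                               # recently moved processes
    for _ in range(iters):
        # enumerate non-tabu moves: remap one process to another node
        candidates = []
        for p in range(len(wcets)):
            if p in tabu:
                continue
            for node in range(n_nodes):
                if node != mapping[p]:
                    m = mapping[:]
                    m[p] = node
                    candidates.append((cost(m), p, m))
        if not candidates:
            break
        c, p, m = min(candidates, key=lambda t: t[0])
        mapping = m                         # accept best move, even uphill
        tabu = (tabu + [p])[-tabu_len:]     # fixed-length tabu list
        if c < best_cost:
            best, best_cost = m[:], c
    return best, best_cost

print(tabu_search_mapping([30, 20, 40, 10]))  # -> ([0, 0, 1, 1], 50)
```

Accepting the best move even when it worsens the cost, while forbidding recent moves, is what lets tabu search escape the local optima a greedy mapper would get stuck in.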
Experimental Results
Schedulability improvement under resource constraints
[Chart: average % deviation from MRX vs. number of processes, for mapping and replication (MR), mapping and re-execution (MX), and mapping and policy assignment (MRX)]
Checkpoint Optimization
Locally Optimal Number of Checkpoints
P1: C1 = 50 ms, k = 2; overhead parameters of 15 ms, 5 ms, and 10 ms (their symbols were lost in extraction)
[Figure: worst-case schedule length of P1 as the number of checkpoints varies from 1 to 5]
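The locally optimal checkpoint count can be found by simply evaluating the worst-case length for each candidate n. A minimal sketch under the same simple cost model as before (n checkpoints cost chi each; each of the k faults re-executes one segment C/n plus a recovery overhead mu); the parameter values echo the slide's example but the symbol-to-value mapping is an assumption:

```python
def worst_case_length(C, k, n, chi, mu):
    """Worst-case length of a process split into n segments under k faults."""
    return C + n * chi + k * (C / n + mu)

def optimal_checkpoints(C, k, chi, mu, n_max=10):
    """Locally optimal n: the count minimizing the worst-case length."""
    return min(range(1, n_max + 1),
               key=lambda n: worst_case_length(C, k, n, chi, mu))

print(optimal_checkpoints(C=50, k=2, chi=10, mu=15))  # -> 3
```

Too few checkpoints make each recovery long; too many pile up checkpointing overhead, so the worst-case length is minimized at an interior n.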
Globally Optimal Number of Checkpoints
P1: C1 = 50 ms; P2: C2 = 60 ms; k = 2
[Figure: locally optimal checkpoint counts for P1 and P2 versus the globally optimal distribution for the application P1, m1, P2]
Global Optimization vs. Local Optimization
Does the optimization reduce the fault tolerance overheads on the schedule length?
[Chart: % deviation from MC0 (local optimization of checkpoint distribution) achieved by MC (global optimization of checkpoint distribution), 0-40%, vs. application size (number of tasks); 4 nodes, 3 faults; larger deviation means smaller fault tolerance overhead]
Conclusions
Scheduling with fault tolerance requirements:
  Two novel scheduling techniques
  Handling customized transparency requirements
Design optimization with fault tolerance:
  Policy assignment optimization strategies
  Checkpoint optimization
To be continued…
Design Optimization and Scheduling of Mixed Soft-Hard Real-Time Systems
Soft processes with soft deadlines; hard processes with hard deadlines
Design Optimization of Embedded Systems with Fault Tolerance is Essential