Slide 1: Scheduling and Optimization of Fault-Tolerant Embedded Systems
Viacheslav Izosimov, Embedded Systems Lab (ESLAB), Linköping University, Sweden
Presentation of Licentiate Thesis
Slide 2: Motivation
Hard real-time applications: time-constrained, cost-constrained, fault-tolerant, etc.
Focus on transient faults and intermittent faults.
Slide 3: Motivation — Transient Faults
Causes: radiation, electromagnetic interference (EMI), lightning storms.
Transient faults happen for a short time, corrupt data and cause miscalculations in logic, but do not cause permanent damage to the circuits.
Their causes are outside the system boundaries.
Slide 4: Motivation — Intermittent Faults
Causes: internal EMI, crosstalk, power supply fluctuations, software errors (Heisenbugs).
Intermittent faults manifest themselves similarly to transient faults, but happen repeatedly.
Their causes are inside the system boundaries.
Slide 5: Motivation
Errors caused by transient faults have to be tolerated before they crash the system.
However, fault tolerance against transient faults leads to significant performance overhead.
Transient faults are becoming more likely to occur as transistor sizes shrink and operating frequencies grow.
Slide 6: Motivation
Hard real-time applications: time-constrained, cost-constrained, fault-tolerant, etc.
Hence: the need for design optimization of embedded systems with fault tolerance.
Slide 7: Outline
Motivation
Background and limitations of previous work
Thesis contributions:
  Scheduling with fault tolerance requirements
  Fault tolerance policy assignment
  Checkpoint optimization
  Trading off transparency for performance
  Mapping optimization with transparency
Conclusions and future work
Slide 8: General Design Flow
System Specification -> Architecture Selection -> Mapping & Hardware/Software Partitioning -> Scheduling -> Back-end Synthesis, with feedback loops between the stages.
Fault tolerance techniques are introduced into this design flow.
Slide 9: Fault Tolerance Techniques
[Figure: the three techniques illustrated for process P1 on nodes N1/N2:
re-execution (P1/1, P1/2 on N1, with an error-detection overhead after each execution),
rollback recovery with checkpointing (checkpointed segments P1(1), P1(2), with checkpointing and recovery overheads),
and active replication (replicas P1(1) on N1 and P1(2) on N2).]
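To make the overheads concrete, here is a minimal sketch of the worst-case completion time of one process under each technique. The formulas follow the standard re-execution / checkpointing / replication models illustrated above; the parameter values are illustrative, not taken from the thesis.

```python
# Worst-case completion time of one process under each technique,
# tolerating up to k transient faults. C is the process WCET; alpha,
# chi and mu are the error-detection, checkpointing and recovery
# overheads. Values below are illustrative, not taken from the thesis.

def wc_reexecution(C, k, alpha, mu):
    """Each of the k faults forces a recovery plus a full re-execution."""
    return (C + alpha) + k * (mu + C + alpha)

def wc_checkpointing(C, k, n, alpha, chi, mu):
    """With n checkpoints, a fault re-executes only one segment C/n."""
    fault_free = C + n * (alpha + chi)
    return fault_free + k * (mu + C / n + alpha + chi)

def wc_replication(C, k, alpha):
    """k+1 replicas run in parallel on different nodes, so one run suffices."""
    return C + alpha

print(wc_reexecution(60, 2, alpha=5, mu=5))                # 205
print(wc_checkpointing(60, 2, n=3, alpha=5, chi=5, mu=5))  # 160.0
print(wc_replication(60, 2, alpha=5))                      # 65
```

Replication pays its price in processors rather than in time, which is why the trade-off between these policies becomes a design-optimization problem later in the talk.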
Slide 10: Limitations of Previous Work
Design optimization with fault tolerance is limited:
Process mapping is not considered together with fault tolerance issues.
Multiple faults are not addressed in the framework of static cyclic scheduling.
Transparency, if addressed at all, is restricted to a whole computation node.
Slide 11: Outline
Motivation
Background and limitations of previous work
Thesis contributions:
  Scheduling with fault tolerance requirements
  Fault tolerance policy assignment
  Checkpoint optimization
  Trading off transparency for performance
  Mapping optimization with transparency
Conclusions and future work
Slide 12: Fault-Tolerant Time-Triggered Systems
Processes: re-execution, active replication, rollback recovery with checkpointing.
Messages: fault-tolerant predictable protocol.
Fault model: at most k transient faults within each application run (system period).
[Figure: example application graph with processes P1-P5 and messages m1, m2, hit by transient faults.]
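Under this fault model, a fault scenario is simply a choice of at most k process executions hit by faults. A small illustrative sketch (not an algorithm from the thesis) enumerates them:

```python
# A fault scenario under this model is a multiset of at most k process
# executions hit by faults. Illustrative enumeration for small examples.

from itertools import combinations_with_replacement

def fault_scenarios(processes, k):
    """Yield every way at most k faults can hit the given processes."""
    for j in range(k + 1):
        yield from combinations_with_replacement(processes, j)

# For P1, P2 and k = 2: (), (P1,), (P2,), (P1,P1), (P1,P2), (P2,P2)
print(list(fault_scenarios(["P1", "P2"], 2)))
```

The scheduling techniques that follow must guarantee the deadline in every one of these scenarios.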
Slide 13: Scheduling with Fault Tolerance Requirements
Two techniques:
Conditional scheduling
Shifting-based scheduling
Slides 14-19: Conditional Scheduling (example)
[Figure, built up over six slides: an application with processes P1 and P2 and message m1, k = 2, on a timeline from 0 to 200 ms. The schedule branches on fault occurrences: P1 starts at 0, with re-execution alternatives P1/1, P1/2, P1/3 starting at 0, 45 and 90; depending on where the faults hit, P2 starts at 40, 85, 95, 130 or 150, with its own re-execution alternatives P2/1, P2/2, P2/3.]
Slide 20: Fault-Tolerance Conditional Process Graph
[Figure: the FT-CPG for the example (P1, P2, m1, k = 2). Each process is expanded into its re-execution alternatives (three copies of P1, six of P2, three of m1), connected by edges guarded by fault conditions, so that every path through the graph corresponds to one fault scenario.]
Slide 21: Conditional Schedule Table
[Figure: the conditional schedule table for the example on nodes N1 and N2. Rows list P1, m1 and P2; columns are labelled by conjunctions of fault conditions (the fault-free column is labelled "true"); entries are start times between 0 and 160 ms, e.g. P1 starts at 0 under the condition "true".]
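A minimal sketch of how such a table could be represented and consulted at run time, assuming entries keyed by the set of fault conditions observed so far; the encoding and the numbers are illustrative, not the thesis' table:

```python
# Assumed encoding of a conditional schedule table: keys are the sets of
# fault conditions observed so far (e.g. "F_P1" for "the first execution
# of P1 failed"); values map activities to start times in ms. All
# entries here are illustrative, not the thesis data.

table = {
    frozenset():                    {"P1": 0,  "m1": 55,  "P2": 60},
    frozenset({"F_P1"}):            {"P1/2": 45, "m1": 100, "P2": 105},
    frozenset({"F_P1", "F_P1/2"}):  {"P1/3": 90, "m1": 145, "P2": 150},
}

def start_time(observed, activity):
    """Look up when `activity` starts in the currently observed scenario."""
    return table.get(frozenset(observed), {}).get(activity)

print(start_time(set(), "P2"))       # fault-free scenario
print(start_time({"F_P1"}, "P1/2"))  # after P1's first execution fails
```

Because the table needs one column per reachable condition combination, its size grows quickly with k, which motivates the memory results shown later.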
Slide 22: Conditional Scheduling
Generates short schedules
Allows trading off transparency for performance (to be discussed later)
– Requires a lot of memory to store the schedule tables
– The scheduling algorithm is very slow
Alternative: shifting-based scheduling
Slide 23: Shifting-based Scheduling
Messages sent over the bus are scheduled at a single, fixed time.
Faults on one computation node must not affect other computation nodes.
Requires less memory
Schedule generation is very fast
– Schedules are longer
– Does not allow trading off transparency for performance (to be discussed later)
Slide 24: Ordered FT-CPG
[Figure: the FT-CPG of an application with processes P1-P4 and messages m1-m3, k = 2, after fixing an execution order on each node ("P2 after P1", "P3 after P4"); synchronization nodes (S) order the re-execution alternatives of the processes and messages.]
Slide 25: Root Schedules
[Figure: a root schedule on nodes N1 and N2 and the bus, with processes P1-P4 and messages m1-m3. The worst-case scenario for P1 defines a recovery slack that is shared by P1 and P2.]
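A sketch of how a shared recovery slack could be sized, assuming the shifting-based model in which processes scheduled consecutively on a node share one slack that must absorb the worst case of all k faults hitting any single sharer:

```python
# Sizing of a shared recovery slack, assuming the shifting-based model.
# mu is the recovery overhead; Cs are the WCETs of the sharing processes.

def shared_slack(Cs, k, mu):
    """Worst case: all k faults hit the process with the longest recovery."""
    return max(k * (c + mu) for c in Cs)

# Illustrative: P1 (C=30) and P2 (C=20) sharing slack on N1, k=2, mu=5.
# One 70 ms slack serves both, instead of 70 + 50 ms of separate slacks.
print(shared_slack([30.0, 20.0], k=2, mu=5.0))  # 70.0
```

Sharing the slack is what keeps the root schedule reasonably short even though every process must be able to recover inside it.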
Slide 26: Extracting Execution Scenarios
[Figure: a concrete execution scenario extracted from the root schedule: P4 fails twice and is re-executed as P4/1, P4/2, P4/3 inside the recovery slack, while the rest of the schedule is shifted accordingly.]
Slide 27: Memory Required to Store Schedule Tables
Memory (values in kilobytes, as in the thesis figure) per number of processes and number of faults k, for different fractions of frozen inter-processor messages:

Frozen   20 proc.           40 proc.            60 proc.             80 proc.
         k=1   k=2   k=3    k=1   k=2   k=3     k=1   k=2    k=3     k=1   k=2    k=3
100%     0.13  0.28  0.54   0.36  0.89  1.73    0.71  2.09   4.35    1.18  4.21   8.75
 75%     0.22  0.57  1.37   0.62  2.06  4.96    1.20  4.64   11.55   2.01  8.40   21.11
 50%     0.28  0.82  1.94   0.82  3.11  8.09    1.53  7.09   18.28   2.59  12.21  34.46
 25%     0.34  1.17  2.95   1.03  4.34  12.56   1.92  10.00  28.31   3.05  17.30  51.30
  0%     0.39  1.42  3.74   1.17  5.61  16.72   2.16  11.72  34.62   3.41  19.28  61.85

Applications with more frozen nodes require less memory.
Slide 28: Memory Required to Store Root Schedule
Memory (kilobytes) with 100% frozen messages, independent of k:

20 proc.   40 proc.   60 proc.   80 proc.
0.016      0.034      0.054      0.070

Shifting-based scheduling requires very little memory (e.g. ~0.03 KB where conditional scheduling needs 1.73 KB).
Slide 29: Schedule Generation Time and Quality
Shifting-based scheduling is much faster than conditional scheduling:
Shifting-based scheduling requires 0.2 seconds to generate a root schedule for an application of 120 processes and 10 faults.
Conditional scheduling already takes 319 seconds to generate a schedule table for an application of 40 processes and 4 faults.
Shifting-based schedules are ~15% worse (in terms of fault tolerance overhead) than conditional scheduling with 100% of inter-processor messages frozen.
Slide 30: Fault Tolerance Policy Assignment and Checkpoint Optimization
Slide 31: Fault Tolerance Policy Assignment
[Figure: the three policies for process P1 with k = 2: re-execution (P1/1, P1/2, P1/3 on N1), replication (replicas P1(1), P1(2), P1(3) on N1, N2, N3), and re-executed replicas (P1(1)/1 and P1(1)/2 on N1, P1(2) on N2).]
Slide 32: Re-execution vs. Replication
[Figure: two applications, A1 and A2, on nodes N1 and N2 (WCETs: P1 40/50 ms, P2 40/60 ms, P3 50/70 ms; k = 1). For A1, re-execution misses the deadline while replication meets it: replication is better. For A2, replication misses the deadline while re-execution meets it: re-execution is better.]
Neither policy is universally better; the winner depends on the application structure and the mapping.
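A toy model (not reproducing the slide's A1/A2 examples) shows how sensitive the trade-off is: a three-stage pipeline is timed under pure re-execution on one node versus pure replication across nodes, using the WCETs from the slide's table.

```python
# Toy comparison of the two policies for a pipeline. wcets[p][n] is p's
# WCET on node n; msg is the bus delay per inter-replica message.
# Illustrative model, not the thesis' evaluation.

def chain_reexecution(wcets, node, order, k, mu):
    """All processes on `node`; k faults re-execute the longest process."""
    base = sum(wcets[p][node] for p in order)
    return base + k * (max(wcets[p][node] for p in order) + mu)

def chain_replication(wcets, nodes, order, msg):
    """Replicas run in parallel; each stage waits for its slowest replica,
    then broadcasts its result to the successor's replicas."""
    t = 0.0
    for p in order:
        t += max(wcets[p][n] for n in nodes) + msg
    return t - msg  # the final process sends no message

wcets = {"P1": {"N1": 40, "N2": 50},
         "P2": {"N1": 40, "N2": 60},
         "P3": {"N1": 50, "N2": 70}}
print(chain_reexecution(wcets, "N1", ["P1", "P2", "P3"], k=1, mu=5))       # 185
print(chain_replication(wcets, ["N1", "N2"], ["P1", "P2", "P3"], msg=10))  # 200
```

With k = 1 re-execution wins in this toy; at k = 2 its bound grows to 240 while replication stays at 200, so the winner flips. This is exactly why policy assignment must be optimized per process.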
Slide 33: Fault Tolerance Policy Assignment
[Figure: an application with processes P1-P4 and messages m1-m3 on nodes N1 and N2 (WCETs: P1 40/50 ms, P2 60 ms, P3 80 ms, P4 40/50 ms; k = 1). Pure re-execution misses the deadline, and pure replication misses it too, but a mixed assignment (e.g. replicating P1 while re-executing the others) meets the deadline.]
Hence: optimization of the fault tolerance policy assignment.
Slide 34: Optimization Strategy
Design optimization:
  Fault tolerance policy assignment
  Mapping of processes and messages
  Root schedules (shifting-based scheduling is used inside the tabu search)
Three tabu-search optimization algorithms (a skeleton of the search loop is sketched below):
  1. Mapping and fault tolerance policy assignment (MRX): re-execution, replication, or both
  2. Mapping and only re-execution (MX)
  3. Mapping and only replication (MR)
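The following is a hedged skeleton of the tabu-search loop behind MRX/MX/MR. Here `cost` would run shifting-based scheduling on a candidate design (a mapping plus a policy assignment) and return its schedule length; the move set, tabu tenure and stopping rule are illustrative choices, not the thesis'.

```python
# Generic tabu search over design candidates. `neighbours(s)` yields
# candidates reachable by one move (re-map a process, or flip its
# policy between re-execution and replication); `cost(s)` evaluates a
# candidate via shifting-based scheduling. Illustrative skeleton only.

def tabu_search(initial, neighbours, cost, iterations=1000, tenure=20):
    best = current = initial
    best_cost = cost(best)
    tabu = []
    for _ in range(iterations):
        candidates = [s for s in neighbours(current) if s not in tabu]
        if not candidates:
            break
        current = min(candidates, key=cost)  # best admissible move
        tabu.append(current)
        tabu = tabu[-tenure:]                # bounded tabu list
        if cost(current) < best_cost:
            best, best_cost = current, cost(current)
    return best
```

Accepting the best admissible move even when it worsens the cost is what lets tabu search climb out of local optima that plain hill climbing would get stuck in.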
Slide 35: Experimental Results
[Figure: average % deviation from MRX (y axis, 0-100%) versus the number of processes (x axis, 20-100), for mapping with replication only (MR) and mapping with re-execution only (MX). Both single-policy strategies deviate substantially from the combined MRX (roughly 80% for MR and 20% for MX in the highlighted case).]
Schedulability improvement under resource constraints.
Slide 36: Checkpoint Optimization
[Figure: process P1 on node N1 with one and with two checkpoints; with two checkpoints, a fault forces re-execution of only the affected segment (P1/1 or P1/2).]
Slide 37: Locally Optimal Number of Checkpoints
[Figure: process P1 with C1 = 50 ms and k = 2; the checkpointing, error-detection and recovery overheads are 5, 10 and 15 ms (the symbol-to-value assignment is garbled in extraction). Worst-case lengths are shown for 1 to 5 checkpoints, with the minimum at 3 checkpoints.]
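The locally optimal checkpoint count minimizes the worst-case length of the process in isolation. Below is a sketch under the standard rollback-recovery model; splitting the slide's 5/10/15 ms overheads as chi = 10, alpha = 5, mu = 15 is an assumption, and the optimum only depends on chi + alpha = 15.

```python
# Locally optimal checkpoint count for one process, minimizing its
# worst-case length in isolation under the standard rollback-recovery
# model. The chi/alpha/mu split of the slide's values is assumed.

import math

def worst_case(C, n, k, chi, alpha, mu):
    fault_free = C + n * (chi + alpha)        # overhead per checkpoint
    per_fault = C / n + chi + alpha + mu      # re-execute one segment
    return fault_free + k * per_fault

def local_optimum(C, k, chi, alpha, mu):
    n0 = math.sqrt(k * C / (chi + alpha))     # stationary point of worst_case
    lo, hi = max(1, math.floor(n0)), max(1, math.ceil(n0))
    return min((lo, hi), key=lambda n: worst_case(C, n, k, chi, alpha, mu))

# Slide example: C1 = 50 ms, k = 2 gives n0 = sqrt(100/15) ~ 2.6, and the
# discrete comparison picks 3 checkpoints, matching the slide's optimum.
print(local_optimum(50, 2, chi=10, alpha=5, mu=15))  # 3
```

Intuitively, more checkpoints mean more per-checkpoint overhead in the fault-free case but a shorter re-executed segment per fault; the square root balances the two.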
Slides 38-40: Globally Optimal Number of Checkpoints
[Figure: processes P1 (C1 = 50 ms) and P2 (C2 = 60 ms) connected by message m1, k = 2. (a) With the locally optimal number of checkpoints, three in each process, the schedule is 265 ms long; (b) with two checkpoints in each process it is 255 ms long.]
The locally optimal number of checkpoints is not globally optimal: the local analysis considers each process in isolation, while the end-to-end schedule length also depends on the interaction between processes, so fewer checkpoints can yield a shorter schedule.
Slide 41: Global Optimization vs. Local Optimization
Does the optimization reduce the fault tolerance overhead on the schedule length?
[Figure: % deviation from MC0 (i.e. how much smaller the fault tolerance overhead becomes; y axis, 0-40%) versus application size (40-100 tasks; x axis), for 4 nodes and 3 faults: global optimization of the checkpoint distribution (MC) versus local optimization (MC0).]
Slide 42: Trading-off Transparency for Performance and Mapping Optimization with Transparency
Slide 43: FT Implementations with Transparency
Transparency is achieved with frozen processes and messages: a frozen process or message keeps the same start (send) time in every fault scenario.
Good for debugging and testing.
[Figure: application graph with processes P1-P5 and messages m1, m2; P3 is frozen, the rest are regular processes/messages.]
Slide 44: No Transparency
[Figure: application with processes P1-P4 and messages m1-m3 on nodes N1, N2 (recovery overhead 5 ms, k = 2; some processes can run on only one node, marked X in the WCET table). The no-fault schedule and the worst-case fault schedule are compared against the deadline.]
Without transparency, processes start at different times and messages are sent at different times in different fault scenarios.
Slide 45: Full Transparency vs. Customized Transparency
[Figure: three implementations of the same application compared against the deadline in the no-fault and worst-case scenarios: no transparency meets the deadline, full transparency (everything frozen) misses it, and customized transparency (only P3 frozen) still meets it.]
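What "frozen" means operationally can be sketched in a few lines (an assumed formulation, consistent with the previous slides): a frozen process or message must start at one fixed time that is valid in every fault scenario, so its start is the latest start any scenario requires.

```python
# Freezing an activity: its single start time must accommodate every
# fault scenario, hence the maximum. Assumed formulation, illustrative
# numbers (cf. the conditional-table sketch earlier).

def frozen_start(scenario_starts):
    """Map of fault-scenario id -> earliest feasible start; freezing
    takes the maximum over all scenarios."""
    return max(scenario_starts.values())

# m1 could go at 55 ms fault-free but only at 145 ms after two faults
# in P1; frozen, it is always sent at 145 ms.
print(frozen_start({"no_fault": 55, "one_fault": 100, "two_faults": 145}))
```

This is the source of the overhead: every frozen activity is delayed to its worst-case start even in the fault-free scenario, which is why transparency costs performance.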
Slide 46: Trading-Off Transparency for Performance
How much longer is the schedule with fault tolerance? Increase (%) of the schedule length, for four computation nodes and 5 ms recovery time, as the fraction of frozen inter-processor messages (transparency) increases:

Proc.   0%              25%             50%             75%             100%
        k=1 k=2 k=3     k=1 k=2 k=3     k=1 k=2 k=3     k=1 k=2 k=3     k=1 k=2 k=3
20      24  44  63      32  60  92      39  74  115     48  83  133     48  86  139
40      17  29  43      20  40  58      28  49  72      34  60  90      39  66  97
60      12  24  34      13  30  43      19  39  58      28  54  79      32  58  86
80       8  16  22      10  18  29      14  27  39      24  41  66      27  43  73

Trading transparency for performance is essential.
Slide 47: Mapping with Transparency
[Figure: application with processes P1-P6 and messages m1-m4 on nodes N1, N2 (recovery overhead 10 ms, k = 2; WCETs between 30 and 60 ms). The mapping that is optimal when transparency is ignored, and its worst-case fault scenario (P4 re-executed as P4/1, P4/2, P4/3), are shown.]
Slide 48: Mapping with Transparency (cont.)
[Figure: with transparency requirements, the "optimal" mapping from the previous slide misses the deadline in the worst-case fault scenario (P2 re-executed three times), while a transparency-aware optimized mapping meets the deadline (P4 re-executed three times within it).]
Slide 49: Design Optimization
Hill-climbing mapping optimization heuristic; the schedule length of each candidate mapping is evaluated by either:
  1. Conditional scheduling (CS): exact, but slow
  2. Schedule length estimation (SE): approximate, but fast
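A hedged sketch of the hill-climbing mapping loop with a pluggable evaluator: `evaluate` stands in for either conditional scheduling (CS) or schedule length estimation (SE). The move set (re-mapping one process at a time) is an illustrative choice, not the thesis' exact heuristic.

```python
# Hill climbing over mappings with a pluggable schedule-length
# evaluator. `mapping` maps process names to node names; `evaluate`
# returns the (estimated or exact) schedule length of a mapping.

def hill_climb(mapping, processes, nodes, evaluate):
    best, best_len = dict(mapping), evaluate(mapping)
    improved = True
    while improved:
        improved = False
        for p in processes:                    # try re-mapping each process
            for n in nodes:
                if n == best[p]:
                    continue
                cand = dict(best)
                cand[p] = n
                cand_len = evaluate(cand)      # SE during search, CS at the end
                if cand_len < best_len:
                    best, best_len = cand, cand_len
                    improved = True
    return best
```

Since the loop calls `evaluate` once per candidate move, swapping the slow CS for the fast SE is what makes the heuristic scale, as the next slide quantifies.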
Slide 50: Experimental Results
How much faster is schedule length estimation (SE) compared to conditional scheduling (CS)? Run times in seconds (4 nodes, 15 applications, 25% of processes and 50% of messages frozen, recovery overhead 5 ms):

              k = 2 faults     k = 3 faults     k = 4 faults
              SE      CS       SE      CS       SE      CS
20 processes  0.01    0.07     0.02    0.28     0.04    1.37
30 processes  0.13    0.39     0.19    2.93     0.26    31.50
40 processes  0.32    1.34     0.50    17.02    0.69    318.88

Schedule length estimation (SE) is more than 400 times faster than conditional scheduling (CS): 0.69 s vs. 318.88 s.
Slide 51: Experimental Results
How much is the improvement when transparency is taken into account during mapping? (4 computation nodes, 15 applications, 25% of processes and 50% of messages frozen, recovery overhead 5 ms):

              k = 2 faults   k = 3 faults   k = 4 faults
20 processes  32.89%         32.20%         30.56%
30 processes  35.62%         31.68%         30.58%
40 processes  28.88%         28.11%         28.03%

The schedule lengths of fault-tolerant applications are around 30% shorter on average (e.g. 31.68% for 30 processes, k = 3) when transparency is considered during mapping.
Slide 52: Outline
Motivation
Background and limitations of previous work
Thesis contributions:
  Scheduling with fault tolerance requirements
  Fault tolerance policy assignment
  Checkpoint optimization
  Trading off transparency for performance
  Mapping optimization with transparency
Conclusions and future work
Slide 53: Conclusions
Scheduling with fault tolerance requirements:
  Two novel scheduling techniques
  Handling of customized transparency requirements, trading off transparency for performance
  A fast scheduling alternative with low memory requirements for the schedules
Slide 54: Conclusions
Design optimization with fault tolerance:
  Policy assignment optimization strategy
  Estimation-driven mapping optimization that can handle customized transparency requirements
  Optimization of the number of checkpoints
The approaches and algorithms have been evaluated on a large number of synthetic applications and a real-life example, a vehicle cruise controller.
Slide 55: Design Optimization of Embedded Systems with Fault Tolerance is Essential
Slide 56: Future Work
Fault-tree analysis
Probabilistic fault model
Soft real-time