Slide 1: Scheduling and Optimization of Fault-Tolerant Embedded Systems
Viacheslav Izosimov, Embedded Systems Lab (ESLAB), Linköping University, Sweden
Presentation of Licentiate Thesis
Slide 2: Motivation
Hard real-time applications: time-constrained, cost-constrained, fault-tolerant, etc.
Focus on transient faults and intermittent faults.
Slide 3: Motivation — Transient Faults
Causes: radiation, electromagnetic interference (EMI), lightning storms.
Transient faults happen for a short time, corrupt data and cause miscalculations in logic, but do not cause permanent damage to the circuits.
Their causes are outside the system boundaries.
Slide 4: Motivation — Intermittent Faults
Causes: internal EMI, crosstalk, power supply fluctuations, software errors (Heisenbugs).
Intermittent faults manifest themselves similarly to transient faults, but happen repeatedly.
Their causes are inside the system boundaries.
Slide 5: Motivation
Errors caused by transient faults have to be tolerated before they crash the system.
However, fault tolerance against transient faults leads to significant performance overhead.
Transient faults are becoming more likely to occur as transistor sizes shrink and operating frequencies grow.
Slide 6: Motivation
Hard real-time applications: time-constrained, cost-constrained, fault-tolerant, etc.
Hence: the need for design optimization of embedded systems with fault tolerance.
Slide 7: Outline
Motivation
Background and limitations of previous work
Thesis contributions:
  Scheduling with fault tolerance requirements
  Fault tolerance policy assignment
  Checkpoint optimization
  Trading off transparency for performance
  Mapping optimization with transparency
Conclusions and future work
Slide 8: General Design Flow
System Specification -> Architecture Selection -> Mapping & Hardware/Software Partitioning -> Scheduling -> Back-end Synthesis, with feedback loops between the stages.
Fault tolerance techniques are introduced into this design flow.
Slide 9: Fault Tolerance Techniques
[Figure: the three techniques illustrated for process P1 on nodes N1/N2:
re-execution (P1/1, P1/2 on N1, with an error-detection overhead after each execution),
rollback recovery with checkpointing (checkpointed segments P1(1), P1(2), with checkpointing and recovery overheads),
and active replication (replicas P1(1) on N1 and P1(2) on N2).]
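To make the overheads concrete, here is a minimal sketch of the worst-case completion time of one process under each technique. The formulas follow the standard re-execution / checkpointing / replication models illustrated above; the parameter values are illustrative, not taken from the thesis.

```python
# Worst-case completion time of one process under each technique,
# tolerating up to k transient faults. C is the process WCET; alpha,
# chi and mu are the error-detection, checkpointing and recovery
# overheads. Values below are illustrative, not taken from the thesis.

def wc_reexecution(C, k, alpha, mu):
    """Each of the k faults forces a recovery plus a full re-execution."""
    return (C + alpha) + k * (mu + C + alpha)

def wc_checkpointing(C, k, n, alpha, chi, mu):
    """With n checkpoints, a fault re-executes only one segment C/n."""
    fault_free = C + n * (alpha + chi)
    return fault_free + k * (mu + C / n + alpha + chi)

def wc_replication(C, k, alpha):
    """k+1 replicas run in parallel on different nodes, so one run suffices."""
    return C + alpha

print(wc_reexecution(60, 2, alpha=5, mu=5))                # 205
print(wc_checkpointing(60, 2, n=3, alpha=5, chi=5, mu=5))  # 160.0
print(wc_replication(60, 2, alpha=5))                      # 65
```

Replication pays its price in processors rather than in time, which is why the trade-off between these policies becomes a design-optimization problem later in the talk.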
Slide 10: Limitations of Previous Work
Design optimization with fault tolerance is limited:
Process mapping is not considered together with fault tolerance issues.
Multiple faults are not addressed in the framework of static cyclic scheduling.
Transparency, if addressed at all, is restricted to a whole computation node.
Slide 11: Outline
Motivation
Background and limitations of previous work
Thesis contributions:
  Scheduling with fault tolerance requirements
  Fault tolerance policy assignment
  Checkpoint optimization
  Trading off transparency for performance
  Mapping optimization with transparency
Conclusions and future work
Slide 12: Fault-Tolerant Time-Triggered Systems
Processes: re-execution, active replication, rollback recovery with checkpointing.
Messages: fault-tolerant predictable protocol.
Fault model: at most k transient faults within each application run (system period).
[Figure: example application graph with processes P1-P5 and messages m1, m2, hit by transient faults.]
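Under this fault model, a fault scenario is simply a choice of at most k process executions hit by faults. A small illustrative sketch (not an algorithm from the thesis) enumerates them:

```python
# A fault scenario under this model is a multiset of at most k process
# executions hit by faults. Illustrative enumeration for small examples.

from itertools import combinations_with_replacement

def fault_scenarios(processes, k):
    """Yield every way at most k faults can hit the given processes."""
    for j in range(k + 1):
        yield from combinations_with_replacement(processes, j)

# For P1, P2 and k = 2: (), (P1,), (P2,), (P1,P1), (P1,P2), (P2,P2)
print(list(fault_scenarios(["P1", "P2"], 2)))
```

The scheduling techniques that follow must guarantee the deadline in every one of these scenarios.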
Slide 13: Scheduling with Fault Tolerance Requirements
Two techniques:
Conditional scheduling
Shifting-based scheduling
Slides 14-19: Conditional Scheduling (example)
[Figure, built up over six slides: an application with processes P1 and P2 and message m1, k = 2, on a timeline from 0 to 200 ms. The schedule branches on fault occurrences: P1 starts at 0, with re-execution alternatives P1/1, P1/2, P1/3 starting at 0, 45 and 90; depending on where the faults hit, P2 starts at 40, 85, 95, 130 or 150, with its own re-execution alternatives P2/1, P2/2, P2/3.]
Slide 20: Fault-Tolerance Conditional Process Graph
[Figure: the FT-CPG for the example (P1, P2, m1, k = 2). Each process is expanded into its re-execution alternatives (three copies of P1, six of P2, three of m1), connected by edges guarded by fault conditions, so that every path through the graph corresponds to one fault scenario.]
Slide 21: Conditional Schedule Table
[Figure: the conditional schedule table for the example on nodes N1 and N2. Rows list P1, m1 and P2; columns are labelled by conjunctions of fault conditions (the fault-free column is labelled "true"); entries are start times between 0 and 160 ms, e.g. P1 starts at 0 under the condition "true".]
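A minimal sketch of how such a table could be represented and consulted at run time, assuming entries keyed by the set of fault conditions observed so far; the encoding and the numbers are illustrative, not the thesis' table:

```python
# Assumed encoding of a conditional schedule table: keys are the sets of
# fault conditions observed so far (e.g. "F_P1" for "the first execution
# of P1 failed"); values map activities to start times in ms. All
# entries here are illustrative, not the thesis data.

table = {
    frozenset():                    {"P1": 0,  "m1": 55,  "P2": 60},
    frozenset({"F_P1"}):            {"P1/2": 45, "m1": 100, "P2": 105},
    frozenset({"F_P1", "F_P1/2"}):  {"P1/3": 90, "m1": 145, "P2": 150},
}

def start_time(observed, activity):
    """Look up when `activity` starts in the currently observed scenario."""
    return table.get(frozenset(observed), {}).get(activity)

print(start_time(set(), "P2"))       # fault-free scenario
print(start_time({"F_P1"}, "P1/2"))  # after P1's first execution fails
```

Because the table needs one column per reachable condition combination, its size grows quickly with k, which motivates the memory results shown later.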
Slide 22: Conditional Scheduling
Generates short schedules
Allows trading off transparency for performance (to be discussed later)
– Requires a lot of memory to store the schedule tables
– The scheduling algorithm is very slow
Alternative: shifting-based scheduling
Slide 23: Shifting-based Scheduling
Messages sent over the bus are scheduled at a single, fixed time.
Faults on one computation node must not affect other computation nodes.
Requires less memory
Schedule generation is very fast
– Schedules are longer
– Does not allow trading off transparency for performance (to be discussed later)
Slide 24: Ordered FT-CPG
[Figure: the FT-CPG of an application with processes P1-P4 and messages m1-m3, k = 2, after fixing an execution order on each node ("P2 after P1", "P3 after P4"); synchronization nodes (S) order the re-execution alternatives of the processes and messages.]
Slide 25: Root Schedules
[Figure: a root schedule on nodes N1 and N2 and the bus, with processes P1-P4 and messages m1-m3. The worst-case scenario for P1 defines a recovery slack that is shared by P1 and P2.]
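A sketch of how a shared recovery slack could be sized, assuming the shifting-based model in which processes scheduled consecutively on a node share one slack that must absorb the worst case of all k faults hitting any single sharer:

```python
# Sizing of a shared recovery slack, assuming the shifting-based model.
# mu is the recovery overhead; Cs are the WCETs of the sharing processes.

def shared_slack(Cs, k, mu):
    """Worst case: all k faults hit the process with the longest recovery."""
    return max(k * (c + mu) for c in Cs)

# Illustrative: P1 (C=30) and P2 (C=20) sharing slack on N1, k=2, mu=5.
# One 70 ms slack serves both, instead of 70 + 50 ms of separate slacks.
print(shared_slack([30.0, 20.0], k=2, mu=5.0))  # 70.0
```

Sharing the slack is what keeps the root schedule reasonably short even though every process must be able to recover inside it.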
Slide 26: Extracting Execution Scenarios
[Figure: a concrete execution scenario extracted from the root schedule: P4 fails twice and is re-executed as P4/1, P4/2, P4/3 inside the recovery slack, while the rest of the schedule is shifted accordingly.]
Slide 27: Memory Required to Store Schedule Tables
Memory (values in kilobytes, as in the thesis figure) per number of processes and number of faults k, for different fractions of frozen inter-processor messages:

Frozen   20 proc.           40 proc.            60 proc.             80 proc.
         k=1   k=2   k=3    k=1   k=2   k=3     k=1   k=2    k=3     k=1   k=2    k=3
100%     0.13  0.28  0.54   0.36  0.89  1.73    0.71  2.09   4.35    1.18  4.21   8.75
 75%     0.22  0.57  1.37   0.62  2.06  4.96    1.20  4.64   11.55   2.01  8.40   21.11
 50%     0.28  0.82  1.94   0.82  3.11  8.09    1.53  7.09   18.28   2.59  12.21  34.46
 25%     0.34  1.17  2.95   1.03  4.34  12.56   1.92  10.00  28.31   3.05  17.30  51.30
  0%     0.39  1.42  3.74   1.17  5.61  16.72   2.16  11.72  34.62   3.41  19.28  61.85

Applications with more frozen nodes require less memory.
Slide 28: Memory Required to Store Root Schedule
Memory (kilobytes) with 100% frozen messages, independent of k:

20 proc.   40 proc.   60 proc.   80 proc.
0.016      0.034      0.054      0.070

Shifting-based scheduling requires very little memory (e.g. ~0.03 KB where conditional scheduling needs 1.73 KB).
Slide 29: Schedule Generation Time and Quality
Shifting-based scheduling is much faster than conditional scheduling:
Shifting-based scheduling requires 0.2 seconds to generate a root schedule for an application of 120 processes and 10 faults.
Conditional scheduling already takes 319 seconds to generate a schedule table for an application of 40 processes and 4 faults.
Shifting-based schedules are ~15% worse (in terms of fault tolerance overhead) than conditional scheduling with 100% of inter-processor messages frozen.
Slide 30: Fault Tolerance Policy Assignment and Checkpoint Optimization
Slide 31: Fault Tolerance Policy Assignment
[Figure: the three policies for process P1 with k = 2: re-execution (P1/1, P1/2, P1/3 on N1), replication (replicas P1(1), P1(2), P1(3) on N1, N2, N3), and re-executed replicas (P1(1)/1 and P1(1)/2 on N1, P1(2) on N2).]
Slide 32: Re-execution vs. Replication
[Figure: two applications, A1 and A2, on nodes N1 and N2 (WCETs: P1 40/50 ms, P2 40/60 ms, P3 50/70 ms; k = 1). For A1, re-execution misses the deadline while replication meets it: replication is better. For A2, replication misses the deadline while re-execution meets it: re-execution is better.]
Neither policy is universally better; the winner depends on the application structure and the mapping.
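A toy model (not reproducing the slide's A1/A2 examples) shows how sensitive the trade-off is: a three-stage pipeline is timed under pure re-execution on one node versus pure replication across nodes, using the WCETs from the slide's table.

```python
# Toy comparison of the two policies for a pipeline. wcets[p][n] is p's
# WCET on node n; msg is the bus delay per inter-replica message.
# Illustrative model, not the thesis' evaluation.

def chain_reexecution(wcets, node, order, k, mu):
    """All processes on `node`; k faults re-execute the longest process."""
    base = sum(wcets[p][node] for p in order)
    return base + k * (max(wcets[p][node] for p in order) + mu)

def chain_replication(wcets, nodes, order, msg):
    """Replicas run in parallel; each stage waits for its slowest replica,
    then broadcasts its result to the successor's replicas."""
    t = 0.0
    for p in order:
        t += max(wcets[p][n] for n in nodes) + msg
    return t - msg  # the final process sends no message

wcets = {"P1": {"N1": 40, "N2": 50},
         "P2": {"N1": 40, "N2": 60},
         "P3": {"N1": 50, "N2": 70}}
print(chain_reexecution(wcets, "N1", ["P1", "P2", "P3"], k=1, mu=5))       # 185
print(chain_replication(wcets, ["N1", "N2"], ["P1", "P2", "P3"], msg=10))  # 200
```

With k = 1 re-execution wins in this toy; at k = 2 its bound grows to 240 while replication stays at 200, so the winner flips. This is exactly why policy assignment must be optimized per process.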
Slide 33: Fault Tolerance Policy Assignment
[Figure: an application with processes P1-P4 and messages m1-m3 on nodes N1 and N2 (WCETs: P1 40/50 ms, P2 60 ms, P3 80 ms, P4 40/50 ms; k = 1). Pure re-execution misses the deadline, and pure replication misses it too, but a mixed assignment (e.g. replicating P1 while re-executing the others) meets the deadline.]
Hence: optimization of the fault tolerance policy assignment.
Slide 34: Optimization Strategy
Design optimization:
  Fault tolerance policy assignment
  Mapping of processes and messages
  Root schedules (shifting-based scheduling is used inside the tabu search)
Three tabu-search optimization algorithms (a skeleton of the search loop is sketched below):
  1. Mapping and fault tolerance policy assignment (MRX): re-execution, replication, or both
  2. Mapping and only re-execution (MX)
  3. Mapping and only replication (MR)
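The following is a hedged skeleton of the tabu-search loop behind MRX/MX/MR. Here `cost` would run shifting-based scheduling on a candidate design (a mapping plus a policy assignment) and return its schedule length; the move set, tabu tenure and stopping rule are illustrative choices, not the thesis'.

```python
# Generic tabu search over design candidates. `neighbours(s)` yields
# candidates reachable by one move (re-map a process, or flip its
# policy between re-execution and replication); `cost(s)` evaluates a
# candidate via shifting-based scheduling. Illustrative skeleton only.

def tabu_search(initial, neighbours, cost, iterations=1000, tenure=20):
    best = current = initial
    best_cost = cost(best)
    tabu = []
    for _ in range(iterations):
        candidates = [s for s in neighbours(current) if s not in tabu]
        if not candidates:
            break
        current = min(candidates, key=cost)  # best admissible move
        tabu.append(current)
        tabu = tabu[-tenure:]                # bounded tabu list
        if cost(current) < best_cost:
            best, best_cost = current, cost(current)
    return best
```

Accepting the best admissible move even when it worsens the cost is what lets tabu search climb out of local optima that plain hill climbing would get stuck in.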
Slide 35: Experimental Results
[Figure: average % deviation from MRX (y axis, 0-100%) versus the number of processes (x axis, 20-100), for mapping with replication only (MR) and mapping with re-execution only (MX). Both single-policy strategies deviate substantially from the combined MRX (roughly 80% for MR and 20% for MX in the highlighted case).]
Schedulability improvement under resource constraints.
Slide 36: Checkpoint Optimization
[Figure: process P1 on node N1 with one and with two checkpoints; with two checkpoints, a fault forces re-execution of only the affected segment (P1/1 or P1/2).]
Slide 37: Locally Optimal Number of Checkpoints
[Figure: process P1 with C1 = 50 ms and k = 2; the checkpointing, error-detection and recovery overheads are 5, 10 and 15 ms (the symbol-to-value assignment is garbled in extraction). Worst-case lengths are shown for 1 to 5 checkpoints, with the minimum at 3 checkpoints.]
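The locally optimal checkpoint count minimizes the worst-case length of the process in isolation. Below is a sketch under the standard rollback-recovery model; splitting the slide's 5/10/15 ms overheads as chi = 10, alpha = 5, mu = 15 is an assumption, and the optimum only depends on chi + alpha = 15.

```python
# Locally optimal checkpoint count for one process, minimizing its
# worst-case length in isolation under the standard rollback-recovery
# model. The chi/alpha/mu split of the slide's values is assumed.

import math

def worst_case(C, n, k, chi, alpha, mu):
    fault_free = C + n * (chi + alpha)        # overhead per checkpoint
    per_fault = C / n + chi + alpha + mu      # re-execute one segment
    return fault_free + k * per_fault

def local_optimum(C, k, chi, alpha, mu):
    n0 = math.sqrt(k * C / (chi + alpha))     # stationary point of worst_case
    lo, hi = max(1, math.floor(n0)), max(1, math.ceil(n0))
    return min((lo, hi), key=lambda n: worst_case(C, n, k, chi, alpha, mu))

# Slide example: C1 = 50 ms, k = 2 gives n0 = sqrt(100/15) ~ 2.6, and the
# discrete comparison picks 3 checkpoints, matching the slide's optimum.
print(local_optimum(50, 2, chi=10, alpha=5, mu=15))  # 3
```

Intuitively, more checkpoints mean more per-checkpoint overhead in the fault-free case but a shorter re-executed segment per fault; the square root balances the two.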
Slides 38-40: Globally Optimal Number of Checkpoints
[Figure: processes P1 (C1 = 50 ms) and P2 (C2 = 60 ms) connected by message m1, k = 2. (a) With the locally optimal number of checkpoints, three in each process, the schedule is 265 ms long; (b) with two checkpoints in each process it is 255 ms long.]
The locally optimal number of checkpoints is not globally optimal: the local analysis considers each process in isolation, while the end-to-end schedule length also depends on the interaction between processes, so fewer checkpoints can yield a shorter schedule.
Slide 41: Global Optimization vs. Local Optimization
Does the optimization reduce the fault tolerance overhead on the schedule length?
[Figure: % deviation from MC0 (i.e. how much smaller the fault tolerance overhead becomes; y axis, 0-40%) versus application size (40-100 tasks; x axis), for 4 nodes and 3 faults: global optimization of the checkpoint distribution (MC) versus local optimization (MC0).]
Slide 42: Trading-off Transparency for Performance and Mapping Optimization with Transparency
Slide 43: FT Implementations with Transparency
Transparency is achieved with frozen processes and messages: a frozen process or message keeps the same start (send) time in every fault scenario.
Good for debugging and testing.
[Figure: application graph with processes P1-P5 and messages m1, m2; P3 is frozen, the rest are regular processes/messages.]
Slide 44: No Transparency
[Figure: application with processes P1-P4 and messages m1-m3 on nodes N1, N2 (recovery overhead 5 ms, k = 2; some processes can run on only one node, marked X in the WCET table). The no-fault schedule and the worst-case fault schedule are compared against the deadline.]
Without transparency, processes start at different times and messages are sent at different times in different fault scenarios.
Slide 45: Full Transparency vs. Customized Transparency
[Figure: three implementations of the same application compared against the deadline in the no-fault and worst-case scenarios: no transparency meets the deadline, full transparency (everything frozen) misses it, and customized transparency (only P3 frozen) still meets it.]
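What "frozen" means operationally can be sketched in a few lines (an assumed formulation, consistent with the previous slides): a frozen process or message must start at one fixed time that is valid in every fault scenario, so its start is the latest start any scenario requires.

```python
# Freezing an activity: its single start time must accommodate every
# fault scenario, hence the maximum. Assumed formulation, illustrative
# numbers (cf. the conditional-table sketch earlier).

def frozen_start(scenario_starts):
    """Map of fault-scenario id -> earliest feasible start; freezing
    takes the maximum over all scenarios."""
    return max(scenario_starts.values())

# m1 could go at 55 ms fault-free but only at 145 ms after two faults
# in P1; frozen, it is always sent at 145 ms.
print(frozen_start({"no_fault": 55, "one_fault": 100, "two_faults": 145}))
```

This is the source of the overhead: every frozen activity is delayed to its worst-case start even in the fault-free scenario, which is why transparency costs performance.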
Slide 46: Trading-Off Transparency for Performance
How much longer is the schedule with fault tolerance? Increase (%) of the schedule length, for four computation nodes and 5 ms recovery time, as the fraction of frozen inter-processor messages (transparency) increases:

Proc.   0%              25%             50%             75%             100%
        k=1 k=2 k=3     k=1 k=2 k=3     k=1 k=2 k=3     k=1 k=2 k=3     k=1 k=2 k=3
20      24  44  63      32  60  92      39  74  115     48  83  133     48  86  139
40      17  29  43      20  40  58      28  49  72      34  60  90      39  66  97
60      12  24  34      13  30  43      19  39  58      28  54  79      32  58  86
80       8  16  22      10  18  29      14  27  39      24  41  66      27  43  73

Trading transparency for performance is essential.
Slide 47: Mapping with Transparency
[Figure: application with processes P1-P6 and messages m1-m4 on nodes N1, N2 (recovery overhead 10 ms, k = 2; WCETs between 30 and 60 ms). The mapping that is optimal when transparency is ignored, and its worst-case fault scenario (P4 re-executed as P4/1, P4/2, P4/3), are shown.]
Slide 48: Mapping with Transparency (cont.)
[Figure: with transparency requirements, the "optimal" mapping from the previous slide misses the deadline in the worst-case fault scenario (P2 re-executed three times), while a transparency-aware optimized mapping meets the deadline (P4 re-executed three times within it).]
Slide 49: Design Optimization
Hill-climbing mapping optimization heuristic; the schedule length of each candidate mapping is evaluated by either:
  1. Conditional scheduling (CS): exact, but slow
  2. Schedule length estimation (SE): approximate, but fast
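A hedged sketch of the hill-climbing mapping loop with a pluggable evaluator: `evaluate` stands in for either conditional scheduling (CS) or schedule length estimation (SE). The move set (re-mapping one process at a time) is an illustrative choice, not the thesis' exact heuristic.

```python
# Hill climbing over mappings with a pluggable schedule-length
# evaluator. `mapping` maps process names to node names; `evaluate`
# returns the (estimated or exact) schedule length of a mapping.

def hill_climb(mapping, processes, nodes, evaluate):
    best, best_len = dict(mapping), evaluate(mapping)
    improved = True
    while improved:
        improved = False
        for p in processes:                    # try re-mapping each process
            for n in nodes:
                if n == best[p]:
                    continue
                cand = dict(best)
                cand[p] = n
                cand_len = evaluate(cand)      # SE during search, CS at the end
                if cand_len < best_len:
                    best, best_len = cand, cand_len
                    improved = True
    return best
```

Since the loop calls `evaluate` once per candidate move, swapping the slow CS for the fast SE is what makes the heuristic scale, as the next slide quantifies.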
Slide 50: Experimental Results
How much faster is schedule length estimation (SE) compared to conditional scheduling (CS)? Run times in seconds (4 nodes, 15 applications, 25% of processes and 50% of messages frozen, recovery overhead 5 ms):

              k = 2 faults     k = 3 faults     k = 4 faults
              SE      CS       SE      CS       SE      CS
20 processes  0.01    0.07     0.02    0.28     0.04    1.37
30 processes  0.13    0.39     0.19    2.93     0.26    31.50
40 processes  0.32    1.34     0.50    17.02    0.69    318.88

Schedule length estimation (SE) is more than 400 times faster than conditional scheduling (CS): 0.69 s vs. 318.88 s.
Slide 51: Experimental Results
How much is the improvement when transparency is taken into account during mapping? (4 computation nodes, 15 applications, 25% of processes and 50% of messages frozen, recovery overhead 5 ms):

              k = 2 faults   k = 3 faults   k = 4 faults
20 processes  32.89%         32.20%         30.56%
30 processes  35.62%         31.68%         30.58%
40 processes  28.88%         28.11%         28.03%

The schedule lengths of fault-tolerant applications are around 30% shorter on average (e.g. 31.68% for 30 processes, k = 3) when transparency is considered during mapping.
Slide 52: Outline
Motivation
Background and limitations of previous work
Thesis contributions:
  Scheduling with fault tolerance requirements
  Fault tolerance policy assignment
  Checkpoint optimization
  Trading off transparency for performance
  Mapping optimization with transparency
Conclusions and future work
Slide 53: Conclusions
Scheduling with fault tolerance requirements:
  Two novel scheduling techniques
  Handling of customized transparency requirements, trading off transparency for performance
  A fast scheduling alternative with low memory requirements for the schedules
Slide 54: Conclusions
Design optimization with fault tolerance:
  Policy assignment optimization strategy
  Estimation-driven mapping optimization that can handle customized transparency requirements
  Optimization of the number of checkpoints
The approaches and algorithms have been evaluated on a large number of synthetic applications and a real-life example, a vehicle cruise controller.
Slide 55: Design Optimization of Embedded Systems with Fault Tolerance is Essential
Slide 56: Future Work
Fault-tree analysis
Probabilistic fault model
Soft real-time