Copyright © 2005 EEM202A/CSM213A - Fall 2005 Ram Kumar & Roy Shea UCLA - NESL Lecture #12: Reliable Embedded.

2 Reading List for this Lecture
● “The Model Checker Spin”, IEEE Trans. on Software Engineering, Vol. May 1997.
● D. Gay, P. Levis, R. von Behren, M. Welsh, E. Brewer, and D. Culler. “The nesC Language: A Holistic Approach to Networked Embedded Systems”. Proceedings of Programming Language Design and Implementation (PLDI) 2003, June 2003.
● G. Necula, S. McPeak, S.P. Rahul, and W. Weimer. “CIL: Intermediate Language and Tools for Analysis and Transformation of C Programs”. Proceedings of Conference on Compiler Construction, 2002.
● Mark Weiser. “Program Slicing”. 5th International Conference on Software Engineering. 1981.
● Robert Wahbe, Steven Lucco, Thomas E. Anderson, Susan L. Graham. “Efficient software-based fault isolation”. Proceedings of the fourteenth ACM symposium on Operating systems principles (SOSP-93).
  – http://citeseer.ist.psu.edu/wahbe93efficient.html
● Niall Murphy. “Watchdog Timers”. Embedded Systems Programming.
  – http://www.embedded.com/2000/0011/0011feat4.htm

3 Outline
● Overview of design process
● Static analysis
  – Concurrency
  – Memory usage
● Runtime monitoring
  – Detection
  – Isolation
● Hardware support
● Conclusions
[Figure: design process stages labeled Implementation (Static analysis), Deployment (Runtime monitoring), Testing, Specification]

4 Overview of Software Design Process
● Specification
  – Understand task and constraints
  – Develop formal models for protocols
  – “The Model Checker Spin”, IEEE Trans. on Software Engineering, Vol. May 1997.
● Testing
  – Feed inputs
  – Stress test
  – Long test
● Implementation*
  – Coding standards
  – Code reviews and pair programming
  – Static analysis
● Deployment*
  – Fault detection
  – Isolation
  – Feedback

5 What and Why of Static Analysis
● “Testing and verification of a system without running the code”
● Specification may not be implemented correctly
● Not all errors appear during test runs
  – Concurrency problems with timing dependence
  – Faults under specific system loads
● Complements other techniques
● Enables early detection of errors, as type checking does

6 Techniques
● Create abstract model of the program
  – Direct reasoning about code is hard
  – Basic blocks or AST
  – G. Necula, S. McPeak, S.P. Rahul, and W. Weimer. “CIL: Intermediate Language and Tools for Analysis and Transformation of C Programs”. Proceedings of Conference on Compiler Construction, 2002.
● Examine the model
  – Mark Weiser. “Program Slicing”. 5th International Conference on Software Engineering. 1981.
  – Dataflow to track state through a program

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        int x;
        int y;
        x = rand() % 10;
        y = rand() % 9;
        if (x > y) {
            /* on this branch only x is redefined */
            x = x * x / 2;
        } else {
            /* on this branch x is redefined first, then y uses the new x */
            x = y / 2;
            y = y * x;
        }
        printf("X+Y = %d\n", x + y);
        return 0;
    }

7 Example: Concurrency
● Problem
  – Shared data can be corrupted by concurrent accesses
  – Concurrency is a problem even without threading (why?)
● Solution
  – Annotate atomic code blocks
  – Infer what must be protected
  – Verify protection by looking at code base
● D. Gay, P. Levis, R. von Behren, M. Welsh, E. Brewer, and D. Culler. “The nesC Language: A Holistic Approach to Networked Embedded Systems”. Proceedings of Programming Language Design and Implementation (PLDI) 2003, June 2003.
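One common answer to the "why?" above is interrupt handlers: even a single-threaded program shares state with code that can preempt it at any instruction. The sketch below is a host-side simulation, not nesC and not taken from the lecture. The names `irq_disable`, `irq_enable`, `adc_isr`, and `read_average` are hypothetical, and the "interrupt" is modeled as an ordinary function call that is refused while the main loop is inside its atomic section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the platform's interrupt control. On real
 * hardware these would map to the MCU's global interrupt enable bit. */
static volatile bool irq_enabled = true;

void irq_disable(void) {
    assert(irq_enabled);  /* this sketch does not support nested atomic sections */
    irq_enabled = false;
}

void irq_enable(void) { irq_enabled = true; }

/* State shared between the main loop and the interrupt handler. */
static volatile uint16_t sample_sum = 0;
static volatile uint16_t sample_count = 0;

/* Simulated interrupt handler. Returns false when the interrupt is held
 * off because the main loop is inside an atomic section. */
bool adc_isr(uint16_t sample) {
    if (!irq_enabled) {
        return false;
    }
    sample_sum = (uint16_t)(sample_sum + sample);
    sample_count = (uint16_t)(sample_count + 1u);
    return true;
}

/* Main-loop reader: both fields are copied inside one atomic section, so
 * the handler cannot update one of them between the two reads. */
uint16_t read_average(void) {
    irq_disable();
    uint16_t sum = sample_sum;
    uint16_t count = sample_count;
    irq_enable();
    return count ? (uint16_t)(sum / count) : 0u;
}
```

Without the atomic section, the handler could run between the two reads and the average would be computed from a sum and a count that do not belong together. A static checker in the style described above would flag exactly that unprotected access to data shared with interrupt context.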

8 Example: Memory Management
● Problem
  – Dynamic memory in embedded applications can result in hard-to-understand bugs and strange errors
  – Dangling pointers, memory leaks, data corruption
● Important benefits of dynamic memory
  – Significantly simplify code base
  – Dynamic Memory Allocation in Embedded Apps?
  – http://ask.slashdot.org/article.pl?sid=05/11/16/2236235&tid=156&tid=201&tid=4

    int *p = malloc (sizeof(int)*num);
    int *q = malloc (sizeof(int)*num*2);
    int *r = p;                            /* r aliases p */
    ...
    free(r);                               /* frees the block p still points to */
    ...
    if (p[0] == 0) launchMissile();        /* bug: p is dangling (use after free) */

9 Model for Memory Formalized by Shane Markstrum

10 Implementation
● Convert module into an AST
● Use data flow to track annotated data
  – __attribute__((sos_claim))
  – __attribute__((sos_release))
● Must either:
  – persistently store data once
  – free data
  – release data to ownership of another module
● Must not create any persistent references to data before call
● Must treat data as dead after the call
[Figure labels: caller, callee]

11 Outline
● Run Time Techniques
  – Operate during the execution of system
  – Access to more information than the static analysis tools
  – Introduce performance overheads
● Fault Isolation
  – Localize the impact of the fault
  – Specifically looking at memory corruption faults
● Fault Tolerance
  – Detect and recover from a fault
    – Restore to a known good state
    – Re-initialize the state
  – Specifically looking at hardware/architecture based techniques

12 Memory Corruption Fault Within Single Address Space
● A program is free to access the entire address space
● Memory Corruption Fault
  – Very easy for a program to corrupt the state of other programs
● Desktop/Server CPUs have MMU
  – No MMU in Embedded Processors (esp. micro-controllers)
  – Reasons: power, performance, cost …
[Figure: Program Memory and Data Memory, each a single continuous address space; labels: Middleware, Operating System, Applications, Run-time Stack, Global Data and Heap]

13 Software Fault Isolation (SFI)
● Re-write the program to perform fault isolation in software
  – Simple but a very powerful concept
  – Useful even in servers/desktops for high performance application extensions, kernel extensions etc.
● Trade slower instrumented code for more protection
  – No need for a hardware protection boundary
● Slogan - You can still shoot yourself in the foot, but you can’t shoot the other guy in the foot
Ack. Prof. Aiken UCB

14 Overview
● Maintain two invariants for isolated code
  – Any jumps stay within the isolated code
  – Any writes are to data belonging to the isolated code
● Idea: Divide the address-space into segments
  – Segment addresses have unique high-order bits
● Protection subdomains are defined by segments
  – Every write must be within the segment
  – Every jump must be within the segment

15 Fault Domain
● No jumps outside fault domain
● No writes outside fault domain
[Figure: fault domains drawn over program (PROG) and data (DATA) memory; labels: Sampling Application, Middleware, Operating System, Run-time Stack]

16 Implementation - Segment Matching
● Replace each store by the sequence:
  dedicated-reg ← target address
  – Move target address into dedicated register
  scratch-reg ← (dedicated-reg >> shift-reg)
  – Right shift address to get segment identifier
  – Shift-reg is dedicated
  scratch-reg == segment-reg
  – Compare segment identifier with current segment
  – Segment-reg is dedicated
  trap if not equal
  – Trap if store address is outside of the segment
  store through dedicated-reg
  – Guaranteed to store at the correct address
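The sequence above can be sketched in C against a simulated memory. This is an illustration under stated assumptions, not the instrumentation from the Wahbe et al. paper: addresses are taken to be 16 bits wide with the top 4 bits as the segment identifier, and `sim_memory`, `SEGMENT_SHIFT`, and `checked_store` are hypothetical names. Returning `false` stands in for the trap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout for this sketch: 16-bit addresses, top 4 bits = segment id. */
#define SEGMENT_SHIFT 12u

/* Simulated 64 KB address space shared by all fault domains. */
uint8_t sim_memory[1u << 16];

/* Segment-matching store. Returns false where real SFI code would trap. */
bool checked_store(uint16_t target_addr, uint8_t value, uint16_t segment_id) {
    assert(segment_id < 16u);  /* only 4 segment bits exist in this layout */

    uint16_t dedicated = target_addr;                           /* dedicated-reg <- target address */
    uint16_t scratch = (uint16_t)(dedicated >> SEGMENT_SHIFT);  /* extract the segment identifier */
    if (scratch != segment_id) {                                /* compare with the current segment */
        return false;                                           /* trap: store is outside the segment */
    }
    sim_memory[dedicated] = value;                              /* store through dedicated-reg */
    return true;
}
```

For example, with segment 0x3 active, `checked_store(0x3010, 7, 0x3)` writes the byte, while `checked_store(0x4010, 9, 0x3)` is rejected and leaves memory untouched.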

17 Comments
● Segment matching overhead
  – 4 additional instructions for EVERY store instruction in the program
● Requires three dedicated registers
  – Dedicated-reg holds the address being computed
  – Segment-reg holds current valid segment
  – Shift-reg holds the size of the shift to perform
  – These three registers must not be used by the rest of the program
● Why dedicated registers?
  – What will happen if a jump instruction by-passes all checks?
  – What will happen if a jump lands in the middle of the checks?

18 Sandboxing - Faster Approach
● Idea
  – Don’t test the segment bits
  – Just overwrite segment bits with correct segment
  dedicated-reg ← (target-reg & and-mask-reg)
  – Use dedicated register and-mask-reg to clear segment identifier bits
  dedicated-reg ← dedicated-reg | segment-reg
  – Use dedicated register segment-reg to set segment identifier bits
● This is much faster
  – Only two instructions per instrumentation point
● Loses information about errors
  – Program may keep running with incorrect instructions and data
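A minimal C sketch of the two-instruction sandboxing sequence, assuming 16-bit addresses whose top 4 bits are the segment identifier. The names `SEGMENT_SHIFT`, `OFFSET_MASK`, and `sandbox_address` are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for this sketch: 16-bit addresses, top 4 bits = segment id. */
#define SEGMENT_SHIFT 12u
#define OFFSET_MASK ((uint16_t)((1u << SEGMENT_SHIFT) - 1u))  /* plays the role of and-mask-reg */

/* Forces an arbitrary address into the given segment instead of checking it. */
uint16_t sandbox_address(uint16_t target_addr, uint16_t segment_id) {
    assert(segment_id < 16u);  /* only 4 segment bits exist in this layout */

    /* Plays the role of segment-reg: the segment id moved into the high bits. */
    uint16_t segment_bits = (uint16_t)(segment_id << SEGMENT_SHIFT);

    uint16_t dedicated = (uint16_t)(target_addr & OFFSET_MASK);  /* clear segment identifier bits */
    dedicated = (uint16_t)(dedicated | segment_bits);            /* set the correct segment bits */
    return dedicated;
}
```

An in-segment address passes through unchanged, while an out-of-segment address such as 0x4010 is silently redirected to 0x3010 when segment 0x3 is active. That redirect is the lost error information noted above: the stray write lands somewhere inside the faulty module's own segment rather than raising a trap.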

19 Implementation Details
● Optimizations
  – Traditional compiler optimizations
    – Move sandboxing out of the loop
  – Don’t instrument statically verifiable writes and jumps
● Binary instrumentation
  – Most portable & easily deployed
  – Also the hairiest option
  – Need to verify the binary
    – No use of dedicated registers
● Modified compiler
  – Less easy to adopt
  – But easier to implement

20 Things to ponder about …
● How will the applications residing in their respective fault domains communicate with one another?
● How will the data be shared amongst the fault domains?
● How will SFI be implemented on micro-controllers with less than 1 KB of memory?

21 Embedded Systems In Real World
● Used in inaccessible places
  – Controllers for space vehicles - Mars Pathfinder
  – Closer home … sensor networks in dense forests
● Used for critical applications
  – Brake-by-wire systems
  – Medical Instruments
● Unexpected faults
  – Cosmic rays may flip on-chip bits
● Hard or even impossible to produce perfect firmware
  – Strive to design our systems to cleanly handle failures

22 Watchdog Timer Hardware
● Hardware counter that is set to an initial value
● Continually counts down to zero
● Software is responsible for periodically resetting the count to its original value
● When the counter reaches zero, the software is assumed to have failed
● Perform any suitable recovery
  – Typically reset the CPU
● Visual Metaphor
  – “If the man stops kicking the dog, the dog will take advantage of the hesitation and bite the man.”
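The countdown-and-kick behavior can be modeled in a few lines of C. This is a host-side simulation for illustration only. Real watchdogs are memory-mapped peripherals with vendor-specific registers, and the names `watchdog_t`, `wdt_init`, `wdt_kick`, and `wdt_tick` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated watchdog: a down-counter plus a latched "bitten" flag. */
typedef struct {
    uint32_t reload;   /* initial value the counter is set to */
    uint32_t count;    /* current countdown value */
    bool expired;      /* latched once the counter reaches zero */
} watchdog_t;

void wdt_init(watchdog_t *w, uint32_t reload) {
    assert(w != NULL);
    w->reload = reload;
    w->count = reload;
    w->expired = false;
}

/* Called by healthy software: restores the count to its original value. */
void wdt_kick(watchdog_t *w) {
    assert(w != NULL);
    if (!w->expired) {
        w->count = w->reload;
    }
}

/* One hardware clock tick. Returns true once the watchdog has bitten,
 * which is where real hardware would typically reset the CPU. */
bool wdt_tick(watchdog_t *w) {
    assert(w != NULL);
    if (!w->expired) {
        if (w->count > 0u) {
            w->count--;
        }
        if (w->count == 0u) {
            w->expired = true;
        }
    }
    return w->expired;
}
```

With a reload value of 3, software that kicks at least once every two ticks never sees a bite. If the kicks stop, the third consecutive tick returns true.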

23 Failures detected by watchdog
● Catch events that hang the system
● Transient Failures
  – Power glitches may corrupt program counter, stack pointer or even the data in RAM
● Software Bugs
  – Infinite loops
  – Accidental jump out of code memory
  – Deadlock conditions (Incorrect design)
● Watchdog guarantees that none of the bugs will hang the system indefinitely

24 Watchdog Design Considerations
● First Aid - Recovery from watchdog bite
● Maintain a count of number of resets
  – Shut down a persistently errant application
● Use watchdog for sanity checks
  – Verify the control flow through a piece of code
  – Record failure reports in non-volatile storage
  – Diagnostic information is very useful
● Choosing watchdog timeout interval
  – Need to understand the timing characteristics of the program
  – Large interval - Slow response
  – Small interval - Frequent resets, difficult to diagnose
● Space Shuttle’s main engine controller
  – WDT timeout 18 ms
  – Switchover to a backup computer
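One way to act on the reset count is a boot-time decision like the sketch below. It assumes a counter kept in storage that survives resets and a threshold of three consecutive watchdog resets. Both are illustrative choices, and `nv_state_t`, `boot_decide`, and `MAX_WDT_RESETS` are hypothetical names, not part of any real platform API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WDT_RESETS 3u  /* hypothetical threshold for "persistently errant" */

/* Stand-in for a record kept in non-volatile storage across resets. */
typedef struct {
    uint8_t wdt_resets;  /* consecutive watchdog resets seen so far */
} nv_state_t;

typedef enum { BOOT_NORMAL, BOOT_SAFE_MODE } boot_mode_t;

/* Called early at boot with the cause of the last reset. */
boot_mode_t boot_decide(nv_state_t *nv, bool reset_by_watchdog) {
    assert(nv != NULL);
    if (!reset_by_watchdog) {
        nv->wdt_resets = 0;  /* clean power-on: forget earlier bites */
        return BOOT_NORMAL;
    }
    if (nv->wdt_resets < UINT8_MAX) {
        nv->wdt_resets++;    /* saturate instead of wrapping around */
    }
    /* Too many bites in a row: keep the errant application shut down. */
    return (nv->wdt_resets >= MAX_WDT_RESETS) ? BOOT_SAFE_MODE : BOOT_NORMAL;
}
```

A fuller design would also clear the count after the application has run correctly for some time, so that isolated bites spread over months do not accumulate into a shutdown.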

25 Watchdog Self Test
● What if WDT fails in a way that it never bites?
● Would be discovered only if a failure hangs the system
● WDT failure can happen VERY EASILY
  – WDT can be disabled in software
  – HW Misconfiguration - Jumper of reset line pulled out
● Startup self-test
  – Allow WDT to time out and reset the processor
  – Flag to distinguish power on reset from WDT reset
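The startup self-test can be sketched as a small state machine driven by the reset cause. The sketch assumes the flag lives in memory that survives a watchdog reset and that the hardware reports why it last reset. The names `reset_cause_t`, `boot_action_t`, and `wdt_self_test_step` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { RESET_POWER_ON, RESET_WATCHDOG } reset_cause_t;

typedef enum {
    ACTION_WAIT_FOR_WDT,  /* stop kicking and let the watchdog bite */
    ACTION_RUN_APP,       /* self-test passed, start the application */
    ACTION_RECOVER        /* genuine watchdog bite during operation */
} boot_action_t;

/* test_pending is the flag that distinguishes the deliberate self-test
 * reset from a watchdog reset caused by a real failure. It is assumed to
 * live in storage that survives a watchdog reset. */
boot_action_t wdt_self_test_step(reset_cause_t cause, bool *test_pending) {
    assert(test_pending != NULL);
    if (cause == RESET_POWER_ON) {
        *test_pending = true;       /* arm the self-test */
        return ACTION_WAIT_FOR_WDT;
    }
    if (*test_pending) {
        *test_pending = false;      /* the watchdog bit on demand: it works */
        return ACTION_RUN_APP;
    }
    return ACTION_RECOVER;
}
```

If the watchdog is broken, the system stays in the waiting state after power-on and never starts the application. The defect therefore shows up at startup rather than during a later hang.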

26 Grenade Timer
● Idea - Build a counter that cannot be reloaded once it is running
  – Grenade whose pin has been pulled will have to explode
● Guaranteed reboot is a “useful feature” in some applications
  – Purges all bad state and re-initializes the system
[Figure: Grenade Timer HW Interface]
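The defining property, that the counter cannot be reloaded once it is running, can be captured in a small simulation. The hardware interface from the slide's figure is not reproduced in this transcript, so the interface below (`grenade_timer_t`, `grenade_init`, `grenade_arm`, `grenade_tick`) is purely hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated grenade timer: once armed, nothing can stop or extend it. */
typedef struct {
    uint32_t count;  /* remaining ticks until the forced reset */
    bool armed;      /* set once by the first successful arm */
} grenade_timer_t;

void grenade_init(grenade_timer_t *g) {
    assert(g != NULL);
    g->count = 0;
    g->armed = false;
}

/* Pull the pin. Returns false, changing nothing, if already armed. */
bool grenade_arm(grenade_timer_t *g, uint32_t ticks) {
    assert(g != NULL);
    if (g->armed) {
        return false;  /* no reload while running */
    }
    g->count = ticks;
    g->armed = true;
    return true;
}

/* One clock tick. Returns true when the reset must fire. */
bool grenade_tick(grenade_timer_t *g) {
    assert(g != NULL);
    if (!g->armed) {
        return false;
    }
    if (g->count > 0u) {
        g->count--;
    }
    return g->count == 0u;
}
```

Unlike a watchdog, there is deliberately no kick operation: a second `grenade_arm` call is ignored, so even misbehaving software cannot postpone the reboot.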

27 Taxonomy
● FAILURE
  – Event that occurs when the delivered service deviates from the correct service
  – Failure is the effect that is observed
  – E.g. - “Your iPod Nano stops responding.”
● FAULT
  – Fault is the cause of an error
  – An error may lead to failure
  – E.g. - “Memory corruption fault led to the failure of the iPod”
