A Simplified Approach to Fault Tolerant State Machine Design for Single Event Upsets
Melanie Berg

D219/MAPLD 2004 (Berg)

Overview
- Presentation describes “Hardened by Design” techniques at a high level of abstraction: FPGA/ASIC logic design
- Background
  - Definition of Fault Tolerance
  - State Machines
  - Synchronous Design Theory
- Proposed method of SEU detection
- Proposed method of SEU correction

Definition of Fault Tolerance
- Masking or recovering from erroneous conditions in a system once they have been detected
- The degree of fault tolerance implemented is defined by your system-level requirements, i.e. what behavior is actually acceptable upon error
- Questions that must be answered within the system requirements documentation:
  - Does your system only need to detect an error?
  - How quickly must the system respond to an error?
  - Must your system also correct the error?
  - Is the system susceptible to more than one error per clock cycle?

Synchronous Design with Asynchronous Events
- This discussion focuses on sequential Single Event Upsets (SEUs) within a synchronous design environment.
- The SEU is considered a soft (temporary) error that occurs when a D flip-flop (DFF) is hit by a charged particle.
- Configuration or SRAM errors will not be considered.
- Although the design is synchronous, it is very important to note that the SEU is an asynchronous event:
  - This is generally not taken into account
  - Metastability and unpredictable events can occur
  - It can invoke a Single Event Functional Interrupt (SEFI)

Common Fault Tolerant Implementation
- Triple Modular Redundancy (TMR) is the most commonly implemented solution for SEU tolerance.
  - Why? Because it is a very simple solution
- In many cases it is not implemented correctly
  - Glitches within the TMR voting logic (due to mitigation across separate clock domains or hazardous combinational logic) must be taken into account in case an SEU occurs near a clock edge
- TMR can be very area intensive
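
The voting function at the heart of TMR can be sketched as a purely functional model. This is a minimal Python illustration, not HDL and not taken from the presentation: the function name and the 4-bit register value are assumptions, and the model deliberately ignores timing, so it says nothing about the glitch behavior the slide warns about.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority over three redundant register copies."""
    return (a & b) | (a & c) | (b & c)


# Three copies of a hypothetical 4-bit register; an SEU flips one bit in copy b.
good = 0b1010
upset = good ^ 0b0100
voted = majority_vote(good, upset, good)  # the single upset is outvoted
```

The area cost is visible even here: three copies of every register plus a voter per bit, which is the overhead the rest of the presentation tries to reduce for state machines.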

Glitches in TMR Circuitry: Example

Glitchy TMR Circuitry, Continued

Proposed EDAC Methodology
- Goal: the proposed Error Detection and Correction (EDAC) techniques are:
  - Targeted at synchronous Finite State Machine designs
  - Less area intensive than TMR
  - Glitch free and synchronous, which reduces the rate of SEFI
- Note: the synchronous design techniques referred to in this presentation are derived from the ASIC industry and are implemented using HDL
  - DFF data inputs should not change within the setup and hold window of the DFF; otherwise metastability and unpredictable functionality can occur
  - Within a synchronous design, metastability will only happen at clock domain crossings, so metastability filters (synchronizers) must be used to protect against these asynchronous events
  - Synchronous design theory minimizes clock boundary crossings
  - This is a challenge when SEUs can occur at any point in time, anywhere in the circuit

Synchronous State Machines
- A Finite State Machine (FSM) is designed to deterministically transition through a pattern of defined states
- A synchronous FSM uses flip-flops to hold its current state, transitions on a clock edge, and only accepts inputs that have been synchronized to the same clock
- Generally, FSMs are used as control mechanisms
- Concern/challenge:
  - If an SEU occurs within an FSM, the entire system can lock up in an unreachable state: a SEFI!

Synchronous State Machines
- The structure consists of four major parts:
  - Inputs
  - Current state register
  - Next-state logic
  - Output logic
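
The four parts can be made concrete with a small behavioral model. This is a sketch in Python rather than HDL; the state names, the `start` input, the transition order, and the `busy`/`tx_enable` outputs are all hypothetical and chosen only to show where each part sits.

```python
def next_state_logic(state: str, start: int) -> str:
    """Next-state logic: combinational function of current state and inputs."""
    if state == "IDLE":
        return "GET_DATA" if start else "IDLE"
    if state == "GET_DATA":
        return "PROCESS_DATA"
    if state == "PROCESS_DATA":
        return "SEND_DATA"
    return "IDLE"  # SEND_DATA returns to IDLE


def output_logic(state: str) -> dict:
    """Output logic: here a function of the current state only."""
    return {"busy": int(state != "IDLE"), "tx_enable": int(state == "SEND_DATA")}


class SyncFSM:
    def __init__(self) -> None:
        self.current_state = "IDLE"  # current state register

    def clock_edge(self, start: int) -> dict:
        """One clock edge: sample the input, update the register, drive outputs."""
        self.current_state = next_state_logic(self.current_state, start)
        return output_logic(self.current_state)
```

In this model an SEU corresponds to `current_state` changing between calls to `clock_edge`, without any input having caused it.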

Encoding Schemes
- Each state of an FSM must be mapped into some type of encoding (pattern of bits)
- Once the state is mapped, it is then considered a defined (legal) state
- Unmapped bit patterns are illegal states

Encoding Schemes
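
To illustrate the difference between legal and illegal bit patterns, the following Python sketch maps five states to a binary and a one-hot encoding and counts the unmapped patterns in each. The five state names are illustrative, and the helper names are assumptions made for this example.

```python
from math import ceil, log2

STATES = ["IDLE", "GET_DATA", "PROCESS_DATA", "SEND_DATA", "BAD_DATA"]


def binary_encoding(states):
    """Map states to consecutive binary codes using the minimum bit width."""
    width = max(1, ceil(log2(len(states))))
    return {name: format(i, f"0{width}b") for i, name in enumerate(states)}


def one_hot_encoding(states):
    """Map each state to a code with exactly one bit set."""
    width = len(states)
    return {name: format(1 << i, f"0{width}b") for i, name in enumerate(states)}


def illegal_count(encoding):
    """Number of bit patterns of the encoding's width that map to no state."""
    width = len(next(iter(encoding.values())))
    return 2 ** width - len(encoding)


binary = binary_encoding(STATES)     # 3 flip-flops
one_hot = one_hot_encoding(STATES)   # 5 flip-flops
```

For five states, binary leaves only 3 of 8 patterns unmapped, while one-hot leaves 27 of 32 unmapped. The larger the share of illegal patterns, the more likely a bit flip lands somewhere that can be recognized as wrong.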

Safe State Machines???
- A “Safe” State Machine has been defined as one that:
  - Has a set of defined states
  - Can deterministically jump to a defined state if an illegal state has been reached (due to an SEU)
- Synthesis tools offer a “Safe” option (in response to demand from our industry):

    TYPE states IS ( IDLE, GET_DATA, PROCESS_DATA, SEND_DATA, BAD_DATA );
    SIGNAL current_state, next_state : states;
    attribute SAFE_FSM: Boolean;
    attribute SAFE_FSM of states: type is true;

- However, designers beware!
  - The synthesis tools' Safe option is not deterministic if an SEU occurs near a clock edge!

Binary Encoding: How Safe is the “Safe” Attribute?
- If a binary-encoded FSM flips into an illegal (unmapped) state, the safe option will return the FSM to a known state that is defined by the others or default clause
- If a binary-encoded FSM flips into another legal (“good”) state, this error will go undetected.
  - If the FSM is controlling a critical output, this phenomenon can be very detrimental!
  - How safe is this?
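
The size of the problem can be checked by enumerating every single-bit upset of a binary-encoded FSM. The sketch below assumes five states packed into three bits in declaration order; that mapping is an assumption for illustration, since a synthesis tool may choose a different one.

```python
WIDTH = 3
# Hypothetical binary mapping of five states onto 3 bits.
LEGAL = {
    0b000: "IDLE",
    0b001: "GET_DATA",
    0b010: "PROCESS_DATA",
    0b011: "SEND_DATA",
    0b100: "BAD_DATA",
}


def single_bit_flips(code, width):
    """All codes reachable from `code` by flipping exactly one bit."""
    return [code ^ (1 << bit) for bit in range(width)]


def classify_upsets(legal, width):
    """Split single-bit upsets into those landing on legal vs illegal codes."""
    undetected, detected = [], []
    for code in legal:
        for flipped in single_bit_flips(code, width):
            (undetected if flipped in legal else detected).append((code, flipped))
    return undetected, detected


undetected, detected = classify_upsets(LEGAL, WIDTH)
```

Under this mapping, 10 of the 15 possible single-bit upsets land on another legal state, where neither a default clause nor a “safe” attribute can notice them.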

Safe State Machines???

One-Hot vs. Binary
- There used to be a consensus that Binary is “safer” than One-Hot
  - This was based on the idea that One-Hot requires more DFFs to implement an FSM and thus has a higher probability of incurring an error
- This view has changed!
  - Most of the community now understands that although One-Hot requires more registers, it has the built-in detection that is necessary for safe design
  - Binary encoding can lead to a very “unsafe” design

Proposed SEU Error Detection: One-Hot
- One-Hot requires that exactly one bit be active high per clock period
- If more than one bit (or no bit) is turned on, an error is detected
- A combinational XNOR (parity check) over the FSM bits is sufficient for SEU detection, even if an SEU occurs near a clock edge
- A MUX can be used to transition the current state into a defined “ERROR STATE” if the parity check fails
- If the system cannot receive multiple event upsets within one clock period, then the circuitry can never flip into a legal state (illegally)!
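
The detection scheme can be modeled as a parity check feeding a 2:1 selection. The Python sketch below is a functional model only; the 5-bit width, the choice of the top bit as the error state, and the function names are assumptions, and the timing argument about upsets near a clock edge is not represented.

```python
WIDTH = 5
ERROR_STATE = 0b10000  # hypothetical: one one-hot code reserved as the error state


def xnor_reduce(bits: int, width: int = WIDTH) -> int:
    """XNOR reduction: 1 when an even number of state bits are set."""
    ones = bin(bits & ((1 << width) - 1)).count("1")
    return int(ones % 2 == 0)


def state_mux(current: int, proposed_next: int) -> int:
    """Model of the MUX: force the error state when the parity check fails."""
    return ERROR_STATE if xnor_reduce(current) else proposed_next
```

A legal one-hot code has odd parity, and any single bit flip leaves either zero or two bits set, so the XNOR output goes high for every single upset. Two simultaneous flips restore odd parity and escape the check, which is why the slide conditions its claim on the absence of multiple upsets within one clock period.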

FSM SEU Error Correction: Using Companion States
- There exist many publications on error correction theory.
- None directly address how to correctly implement FSM fault correction using current synthesis tools.
  - Glitch control: synthesis tools will generally produce “glitchy” logic
  - Synthesis “optimization” algorithms will erase the redundancy that EDAC needs
  - The user must sometimes hand-instantiate logic
  - The user must place the necessary attributes to prevent the redundant logic from being erased.

Error Correction within One Cycle: Using Companion States
- We’ll base the derivation on a 4-state FSM:

Error Correction within One Cycle: Using Companion States
1. Find an encoding such that the states have a Hamming distance of 3 (at least 3 bits must differ from state to state):
  - 00000 (state A)
  - 11100 (state B)
  - 01111 (state C)
  - 10011 (state D)
  - Five bits are necessary to encode a four-state machine in order to achieve the required Hamming distance of three.
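
Both claims on this slide can be checked by brute force. The Python sketch below takes state A as 00000, as stated on the companion-encoding slide, and uses helper names that are assumptions for this example. It computes the minimum pairwise Hamming distance of the four codes and searches exhaustively for a four-word, distance-3 code at a given width.

```python
from itertools import combinations

ENCODING = {"A": 0b00000, "B": 0b11100, "C": 0b01111, "D": 0b10011}


def hamming(x: int, y: int) -> int:
    """Number of bit positions in which x and y differ."""
    return bin(x ^ y).count("1")


def min_distance(encoding) -> int:
    """Smallest Hamming distance between any two state codes."""
    return min(hamming(x, y) for x, y in combinations(encoding.values(), 2))


def code_exists(width: int, n_states: int, distance: int) -> bool:
    """Exhaustively test whether n_states codes of this width can all be
    at least `distance` apart."""
    return any(
        all(hamming(x, y) >= distance for x, y in combinations(words, 2))
        for words in combinations(range(2 ** width), n_states)
    )
```

The search finds no valid set of four codes at width 4 and at least one at width 5, which is consistent with the statement that five bits are necessary.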

Error Correction within One Cycle: Using Companion States
2. For each encoding, calculate the companion encodings, i.e. those at a Hamming distance of one from it. For example:
  - Companion encodings for state A (00000) are: 00001, 00010, 00100, 01000, 10000
  - Companion encodings for state B (11100) are: 11101, 11110, 11000, 10100, 01100
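
Companion encodings are simply the single-bit neighbours of a state code, so they can be generated mechanically. A minimal Python sketch, with the function name and string output format chosen for this example:

```python
def companions(code: int, width: int = 5) -> list:
    """Return every encoding at Hamming distance one from `code`,
    as bit strings, flipping bit 0 (rightmost) first."""
    return [format(code ^ (1 << bit), f"0{width}b") for bit in range(width)]


companions_a = companions(0b00000)
companions_b = companions(0b11100)
```

For state A this yields 00001, 00010, 00100, 01000, 10000, and for state B it yields 11101, 11110, 11000, 10100, 01100. Because the state codes are at least three bits apart, no companion of one state can coincide with a companion of another.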

Error Correction within One Cycle: Using Companion States
- When implementing the state machine, state A is encoded as 00000 and then (theoretically) “OR-ed” with all of its companion encodings. This covers all possible single SEUs
- Do the same for all other states
- Use the output of the “OR-ed” states to determine the next-state logic.
  - Thus if a bit flips, the companion state will catch it and the FSM will still be able to correctly determine the next state
- Be careful! The “OR” logic is more complex than simply using a string of “OR” gates.
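
Functionally, the “OR-ed” companion states amount to a decoder that maps a stored bit pattern to a state if it equals that state's code or any of its single-bit neighbours. The Python sketch below models that behavior as a lookup table. State A is taken as 00000, the A to B to C to D ring of transitions is hypothetical, and the model says nothing about glitch-free gate-level structure.

```python
ENCODING = {"A": 0b00000, "B": 0b11100, "C": 0b01111, "D": 0b10011}
WIDTH = 5
NEXT = {"A": "B", "B": "C", "C": "D", "D": "A"}  # hypothetical transition ring


def build_decoder(encoding, width):
    """Map each state code and each of its companions to the state name."""
    table = {}
    for name, code in encoding.items():
        table[code] = name
        for bit in range(width):
            table[code ^ (1 << bit)] = name
    return table


DECODER = build_decoder(ENCODING, WIDTH)


def next_state(stored: int) -> int:
    """Next-state code computed from a possibly upset state register."""
    name = DECODER.get(stored)
    if name is None:
        # Two or more bits away from every state: not correctable here.
        raise ValueError(f"uncorrectable pattern {stored:05b}")
    return ENCODING[NEXT[name]]
```

The next state is always written as a clean state code, so a single upset is corrected within one cycle rather than carried forward. Of the 32 possible patterns, 24 decode to a state and 8 remain uncovered; those can only be reached by multiple upsets.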

Error Correction within One Cycle: Glitch Control
- One major issue that is frequently overlooked is SEUs occurring near clock edges
- If this occurs, your error checking logic may cause a glitch
- Due to routing timing differences, this can cause incorrect values to be latched into the current state registers.
- Refer to a Karnaugh map for a glitch-free implementation
- The designer may have to hand-instantiate the logic if the synthesis tool does not implement the VHDL as expected
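
The Karnaugh map advice can be illustrated with the textbook static-1 hazard, which is not the circuit from the presentation. In the sum-of-products form of a 2:1 MUX, f = s·a + s'·b, the output should stay at 1 when a = b = 1 and s toggles, but no single product term holds it there during the change. The Python sketch below finds such transitions and shows that adding the consensus term a·b removes them; the variable and function names are assumptions.

```python
from itertools import product

VARS = ("s", "a", "b")
# Product terms as {variable: required value}.
MUX_TERMS = [{"s": 1, "a": 1}, {"s": 0, "b": 1}]
HAZARD_FREE_TERMS = MUX_TERMS + [{"a": 1, "b": 1}]  # consensus term added


def term_covers(term, point):
    """True when the product term evaluates to 1 at this input point."""
    return all(point[v] == val for v, val in term.items())


def evaluate(terms, point):
    """Sum-of-products output at this input point."""
    return int(any(term_covers(t, point) for t in terms))


def static1_hazards(terms):
    """Single-input transitions where the output is 1 before and after,
    but no one product term is 1 across the whole transition."""
    hazards = []
    for values in product((0, 1), repeat=len(VARS)):
        point = dict(zip(VARS, values))
        for v in VARS:
            if point[v] == 1:
                continue  # visit each adjacent pair once, from the 0 side
            neighbour = dict(point, **{v: 1})
            if evaluate(terms, point) and evaluate(terms, neighbour):
                if not any(term_covers(t, point) and term_covers(t, neighbour)
                           for t in terms):
                    hazards.append((point, neighbour))
    return hazards
```

The consensus term does not change the truth table, so it is logically redundant. That is exactly the kind of term an area optimizer tends to remove, which is why hand instantiation and preservation attributes come into play.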

Error Correction within One Cycle: Glitch Control

Error Correction within One Cycle: Glitch Control
- The designer will have to include synthesis directives in order to turn off the tool's “optimization”:
  - Preserve_driver
  - Preserve_signal
- Always check the gate-level output of the synthesis tool.

Conclusion
- This presentation proposes methods of fault tolerant state machine implementation to address potential IC SEU susceptibility.
- Be aware of potential glitches due to asynchronous SEUs occurring near a clock edge:
  - Mitigation techniques must be glitch free!
  - Mitigation may need a synchronization circuit
  - Due to metastability and routing delay differences, the effects can be more catastrophic than expected
- Special directives must be used to steer the synthesis tools when implementing fault tolerant redundant logic, because the tools are generally focused on area and speed optimization.