Fault Tolerance in the Systems-on-Chip Era

Andreas Steininger

Contributors to this Material
Jakob Lechner, former PostDoc, now with RUAG
Robert Najvirt, PhD at TU Wien
Thomas Polzer, PostDoc at TU Wien

Outline
Why GALS?
Interfacing between uncorrelated clock domains
  synchronizers
  pausible clocking
Protection of the clock
Protecting asynchronous data paths
  fault-tolerant delay-insensitive codes
Redundancy in GALS architectures
  synchronizing voting and recovery

The Beauty of Synchrony
activities need to be co-ordinated
  on system level (braking of wheels, …)
  on algorithmic level (consensus, …)
  on communication level
  on logic level (state machine switching, …)
need a global notion of time (discrete „ticks“ + ordering)
Figure labels: time scale from ms to ps

A Fundamental Choice
retaining synchrony (ps): single clock source, accurate distribution
re-gaining synchrony (ms): uncorrelated clocks, metastability issues

Globally Synchronous Design
whole design is „isochronic“ („perfect“ precision)
retains precise synchrony all over the system on all levels
assumes knowledge of all circuit delays
assumes perfect clock distribution
very efficient design
  consistent states
  high level of abstraction
very efficient implementation:
  single crystal oscillator
  single control line (clock net)

The Clock Distribution Problem
speed of light (in medium) = 2 × 10^8 m/s = 20 cm/ns
Figure labels: Ref 2 cm; 1 GHz, 4 GHz, 8 GHz
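The arithmetic behind this slide can be sketched as follows. The propagation speed is the 2 × 10^8 m/s quoted above; the function name is only illustrative. The sketch computes how far a signal travels within one clock period at the three frequencies shown in the figure.

```python
# Distance a signal travels during one clock period, assuming the
# propagation speed of 2e8 m/s (20 cm/ns) quoted on the slide.
SPEED_M_PER_S = 2e8

def distance_per_period_cm(freq_hz):
    """Return the distance in cm covered within one clock period."""
    period_s = 1.0 / freq_hz
    return SPEED_M_PER_S * period_s * 100.0

for f_ghz in (1, 4, 8):
    d = distance_per_period_cm(f_ghz * 1e9)
    print(f"{f_ghz} GHz: {d:.1f} cm per clock period")
```

At 8 GHz a full period corresponds to only 2.5 cm of wire, so a 2 cm line already consumes most of the cycle.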

The Variation Problem
Designer: projected conditions, worst case system model, safety margins
User: actual conditions (unknown at design time), actual system (with imperfections)
Timing completely fixed after design
No way to react to actual conditions & system („PVT variations“)

A Comparison

The Dawn of Synchronous Design?
cannot adapt to PVT variations
rigid timing allows no graceful degradation
clock distribution extremely cumbersome
continuous clocking wastes energy
…

Alternative: Asynchronous Design
co-ordination based on handshaking (closed loop!)
REQ (SRC to SNK): „Data word valid, you can use it“
ACK (SNK to SRC): „Data word consumed, send the next“
Figure labels: SRC, f(x), SNK
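A minimal sketch of the REQ/ACK loop, assuming a four-phase (return-to-zero) protocol; the slide does not say which handshake variant is meant, and the function name and trace format are made up for illustration.

```python
def four_phase_transfer(words, f):
    """Transfer words from SRC to SNK; return (results, trace of (REQ, ACK))."""
    req = ack = 0
    trace, results = [], []
    for word in words:
        data = word               # SRC places the data word ...
        req = 1                   # ... then raises REQ: "data word valid"
        trace.append((req, ack))
        results.append(f(data))   # SNK uses the word
        ack = 1                   # SNK raises ACK: "data word consumed"
        trace.append((req, ack))
        req = 0                   # SRC withdraws REQ
        trace.append((req, ack))
        ack = 0                   # SNK withdraws ACK; next word may follow
        trace.append((req, ack))
    return results, trace

results, trace = four_phase_transfer([1, 2, 3], lambda x: x * x)
print(results)
print(trace[:4])
```

Each step waits for the previous one, which is the closed loop the slide refers to: no step depends on an assumed delay.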

Async. Design – Advantages
closed-loop control makes timing much more robust and adaptive to PVT variations
no need for worst-case timing
local handshakes replace global clock
activity only when needed
beneficial for EMI
tends to stop operation in case of fault
…

Async. Design – Techniques
Need to handle race between REQ and data

Async. Design – Bundled Data
Need to handle race between REQ and data
Solution 1: „Bundled Data“
REQ: „Data word valid, you can use it“
Figure labels: SRC, f(x), SNK

Async. Design – Delay Insensitive
Need to handle race between REQ and data
Solution 2: „Delay Insensitive“ (Coding)
REQ: „Data word valid, you can use it“
Figure labels: SRC, f(x), completion detection, SNK

Async. Design – Issues
significant HW overhead (coding, delay elements)
„adaptive“ timing = not as predictable
way more difficult to design
classical fault-tolerance schemes not applicable („wait for all“ paradigm)
testability?
CAD tools?

Best of Both Worlds
GALS: Globally Asynchronous Locally Synchronous
retain efficiency of synchronous design wherever possible: „intra-module“
use asynchronous principle where clock distribution is too cumbersome: „inter-module“
first mentioned in the PhD thesis by Chapiro (Stanford, 1984)

A GALS Example
CPU 2 GHz, DSP 2.7 GHz, PCI-IF 533 MHz, USB-IF 24 MHz
Phase synchronous (single source) clock neither attainable nor desirable

Benefits of GALS
clock distribution only for local islands
each function can have its optimal clock
clock is no longer a single point of failure
lower noise (conducted & radiated)


Correlated or not correlated?
synchronous
  identical frequency, constant phase relation
  classical synchronous system driven by one clock source
mesochronous
  identical frequency (no accumulating drift) but unknown, constant phase shift (bounded)
  example: unbalanced clock tree
ratiochronous
  fixed (known) frequency ratio, identical source
  example: source clock divided by different values

Correlated or not correlated?
plesiochronous
  same nominal clock frequency, mutual (low) drift
  independent clock sources with same nominal frequency
heterochronous
  clocks totally unrelated but periodic
  independent clock sources with different nominal frequency
aperiodic
  events arrive totally unrelated to clock
  sporadic event (pushbutton) needs to be synchronized
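The six categories can be condensed into a small decision function. This is only a sketch of the taxonomy: the parameter names (`same_source`, `phase_known`, `periodic`) are assumptions introduced here, and the frequencies are nominal values.

```python
def classify_clock_relation(f1_hz, f2_hz, same_source,
                            phase_known=False, periodic=True):
    """Classify the relation between two clock domains (nominal frequencies)."""
    if not periodic:
        return "aperiodic"            # e.g. a pushbutton event
    if same_source:
        if f1_hz == f2_hz:
            # same frequency from one source: only the phase knowledge differs
            return "synchronous" if phase_known else "mesochronous"
        return "ratiochronous"        # e.g. one source divided by different values
    # independent sources
    return "plesiochronous" if f1_hz == f2_hz else "heterochronous"

print(classify_clock_relation(2e9, 2.7e9, same_source=False))
print(classify_clock_relation(100e6, 100e6, same_source=False))
print(classify_clock_relation(100e6, 50e6, same_source=True))
```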

Metastability
Ignoring setup/hold constraints of a flip-flop causes
  intermediate („analog“) output voltages
  delayed transitions
  glitches
for uncorrelated clocks there is no magic way of completely avoiding metastability
but: resulting upsets can be made improbable

Communication in GALS
Boundary Synchronizers: direct data exchange, handshake + synchronizer
Shared Memory: data exchange decoupled through memory, needs arbitration
Dual-Clock FIFOs: data exchange buffered through FIFO queue, status flags need synchronization
Local Clock Stretching: direct data exchange, need mutex to halt receiver clock

A problematic Solution…
CPU 2 GHz, DSP 2.7 GHz
Two uncorrelated clock domains: need to establish protection against metastability!

A problematic Fix…
CPU 2 GHz, DSP 2.7 GHz, synchronizers (S) in between
metastability correctly mitigated, but…
individual (parallel) synchronizer paths may resolve inconsistently

Boundary Synchronizer
Figure labels: CPU 2 GHz, DSP 2.7 GHz, synchronizers (S) on REQ and ACK
synchronizes a single signal only (REQ)
captures data after REQ has been properly received
but: data needs to be available & stable => timing condition

Figure labels: CPU 2 GHz, DSP 2.7 GHz, synchronizers (S) on REQ and ACK
synchronizer for REQ, ACK
can make rate of metastable upsets arbitrarily low
but never zero („time safe“ solution)
and at the cost of performance
with bad scalability for variations
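Why the upset rate can be pushed arbitrarily low but never to zero can be illustrated with the commonly used synchronizer model MTBF = exp(t_res / τ) / (T0 · f_clk · f_data). The model is not taken from the slides, and all parameter values below (τ, T0, the data rate, and one clock period of resolution time per stage) are illustrative assumptions.

```python
import math

def synchronizer_mtbf_s(t_res_s, tau_s, t0_s, f_clk_hz, f_data_hz):
    """Mean time between metastable upsets for a given resolution time."""
    return math.exp(t_res_s / tau_s) / (t0_s * f_clk_hz * f_data_hz)

# Illustrative values only, not measured data.
TAU = 20e-12       # resolution time constant of the flip-flop
T0 = 50e-12        # metastability window parameter
F_CLK = 2e9        # receiving clock (the 2 GHz CPU of the example)
F_DATA = 100e6     # rate of incoming REQ transitions

for stages in (1, 2, 3):
    t_res = stages / F_CLK   # assume one full clock period gained per stage
    mtbf = synchronizer_mtbf_s(t_res, TAU, T0, F_CLK, F_DATA)
    print(f"{stages} stage(s): MTBF = {mtbf:.3g} s")
```

Every added stage multiplies the MTBF by a large factor but also adds a clock cycle of latency, and the exponent shrinks as the clock period shrinks or τ degrades, which matches the performance and scalability concerns on the slide.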

Pausible Clocking
Figure labels: CPU 2 GHz, data word 0xff14, latch, DSP 2.7 GHz with pausible clock
SRC: request SNK to stop clock
SNK: acknowledge stopping of clock; open data latch (safe now!)
SRC: release SNK clock blocking
SNK: release ACK, close data latch; start clocking (data stable now!)

Pausible Clock Implementation
Ring oscillator provides receiver clock: unstable & inaccurate
Receiver must handle clock pausing
Mutex reliably handles metastability issues: FR = 0

Time safe or value safe?
Time safe
  need result at given point in time
  accept the risk that FF has not yet decided
  example: synchronizer
Value safe
  take result only after decision
  accept that there is no time bound for this
  example: mutex
  FR = 0

Why pausible clocking is safe
value safe solution
  reaction to stop request may be delayed
  ack given only when actually stopped
  mutex has time to decide
  there may be an extra pulse at the end
the price
  we need a pausible clock
  ring oscillators are inaccurate and unstable

The best of both Worlds?
crystal oscillator provides stable & accurate clock
„pausing“ by gating
REQ must be aligned with clock to avoid glitches
need a synchronizer (there's no ingenious alternative)
which buys us all the drawbacks of the boundary synchronizer solution

Synchronizing a pausible clock

What we get…
value safe, FR = 0
minimum pulse width D, no glitches
in steady state output will directly follow reference
stabilization interval after switching
D must be smaller than T/2
reference and ring oscillator consume power

A versatile Glitch Filter
mutex removed, no ack
min pulse width guaranteed
time safe: may delay edges
useful for
  SET filtering on clock
  clock attack prevention
  clock switching or selection
  clock self-repair
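The two properties named on this slide, a guaranteed minimum pulse width and possibly delayed edges, can be illustrated with an idealized behavioral model. This is not the circuit from the referenced paper: it is a plain inertial-delay filter, it ignores metastability entirely, and the event format and function name are assumptions.

```python
def min_pulse_filter(events, d, initial=0):
    """Idealized glitch filter.

    events: list of (time, value) input changes, sorted by time.
    The output follows the input only after the input has been stable
    for d time units, so every accepted edge is delayed by d.
    Returns the list of (time, value) output changes.
    """
    out, out_val = [], initial
    for i, (t, v) in enumerate(events):
        next_t = events[i + 1][0] if i + 1 < len(events) else float("inf")
        stable_long_enough = next_t - t >= d
        if stable_long_enough and v != out_val:
            out.append((t + d, v))
            out_val = v
    return out

# A clock-like input with a 1-unit glitch between t=12 and t=13.
clock_in = [(0, 1), (10, 0), (12, 1), (13, 0), (20, 1), (30, 0)]
print(min_pulse_filter(clock_in, d=3))
```

The glitch disappears and every output pulse is at least d wide, at the price of shifted edges.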

Self-Repairing Clock
watchdog supervision
glitch-free switch-over
redundant clk source
redundant delay line

Why care for Clock Protection?
long lines prone to coupling, EMI
many drivers prone to single-event effects
global effect of faults
no temporal masking

The Beauty of Delay-Insensitivity
self-regulating timing
natural adaptation to PVT variations
Figure labels: SRC, f(x), completion detection, SNK

Delay Insensitive Codes
arbitrary delays => bits may arrive in any order
DI code: one can recognize complete data words
required property: no code word covers another
popular DI codes:
  m-of-n codes
  Berger code
  Zero-sum code

Example: 2-of-5 code
{00011, 01010, 00110} are valid code words
{00000, 00010, 11011} are invalid
covers could be complete or an intermediate state of receiving 00011
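The covering property can be checked mechanically. The sketch below uses bit strings as in the example above; the helper names are made up for illustration. A word a covers a word b if a has a 1 wherever b has a 1.

```python
from itertools import combinations

def covers(a, b):
    """True if word a has a 1 in every position where word b has a 1."""
    return all(x == "1" for x, y in zip(a, b) if y == "1")

def m_of_n_code(m, n):
    """All n-bit words with exactly m ones."""
    return ["".join("1" if i in ones else "0" for i in range(n))
            for ones in combinations(range(n), m)]

def is_unordered(code):
    """True if no code word covers a different code word."""
    return not any(covers(a, b) for a in code for b in code if a != b)

code = m_of_n_code(2, 5)
print(len(code), is_unordered(code))
print(covers("00011", "00010"))   # 00010 may be 00011 still in transit
print(covers("11011", "00011"))   # 11011 contains the code word 00011
```

Because no 2-of-5 word covers another, a receiver that starts from all zeros cannot pass through one valid code word on the way to another in the fault-free case.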

Completion Detection with DI Code
we want to send
start with zero (RTZ)
intermediate word
finally becomes
would be a valid codeword, CD would trigger

Why DI is not fault tolerant
we want to send
propagates as
now a fault occurs
this is a valid code word and triggers the CD
before it turns into
and finally (but ignored) 00011
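The failure scenario can be replayed in a few lines. This is a sketch under assumptions: a 2-of-5 code with return-to-zero signalling, a completion detector that simply counts ones, and a hypothetical transient that sets one extra wire to 1. The concrete words on the original slide are not preserved, so the values below are illustrative only.

```python
def completion_detector(word, m=2):
    """CD for an m-of-n code: fires as soon as exactly m wires are 1."""
    return word.count("1") == m

def receive(arrival_order, n=5, m=2, fault_bit=None, fault_after=0):
    """Simulate RTZ reception, wires rising one by one from all zeros.

    arrival_order: wire indices (0 = leftmost) in the order they rise.
    fault_bit: wire hit by a hypothetical transient, raised after
               fault_after regular arrivals.
    Returns the word latched when the CD first fires, or None.
    """
    wires = ["0"] * n
    steps = list(arrival_order)
    if fault_bit is not None:
        steps.insert(fault_after, fault_bit)
    for bit in steps:
        wires[bit] = "1"
        word = "".join(wires)
        if completion_detector(word, m):
            return word
    return None

print(receive([4, 3]))                               # fault-free: 00011
print(receive([4, 3], fault_bit=1, fault_after=1))   # CD fires early on 01001
```

The faulty run latches a perfectly valid but wrong code word, and the late arrival of the last correct bit is ignored.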

Building fault tolerant DI codes (1)
Solution 1:
further restrict the set of valid codewords (form a „subcode“)
exclude those that can be confused in case of a fault (need fault hypothesis here)

Subcode Selection: Example
connect those that cannot be confused for f=1
then find largest fully connected subnet
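The graph construction can be sketched as a brute-force clique search. The confusion criterion used here is a simplifying assumption, not the fault hypothesis of the referenced paper: while code word a is arriving, up to f spurious 1s may appear, so a can be mistaken for b if b has at most f ones outside a. Code words are represented as sets of wire indices.

```python
from itertools import combinations

def m_of_n_code(m, n):
    """All m-of-n code words, each as the set of wires that carry a 1."""
    return [frozenset(c) for c in combinations(range(n), m)]

def confusable(a, b, f=1):
    """Assumed criterion: at most f spurious 1s turn a partial a into b."""
    return len(b - a) <= f

def largest_subcode(code, f=1):
    """Largest set of mutually non-confusable code words (brute force)."""
    best = []

    def extend(chosen, candidates):
        nonlocal best
        if len(chosen) > len(best):
            best = list(chosen)
        for i, w in enumerate(candidates):
            # keep only later candidates that are compatible with w
            rest = [c for c in candidates[i + 1:]
                    if not confusable(w, c, f) and not confusable(c, w, f)]
            extend(chosen + [w], rest)

    extend([], list(code))
    return best

def to_bits(word, n):
    return "".join("1" if i in word else "0" for i in range(n))

sub = largest_subcode(m_of_n_code(2, 5), f=1)
print([to_bits(w, 5) for w in sub])
```

Under this criterion only two of the ten 2-of-5 code words survive for f=1, which illustrates how costly the subcode approach can be.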

Building fault tolerant DI codes (2)
Solution 2:
apply an ED (error-detecting) code to data
then submit encoded data to DI encoding
How to combine ED code and DI code most efficiently?

The Assignment matters!
DI codewords:
ED codewords:
A single fault can change one DI codeword into the other!
We have a 3-bit fault for the ED code to detect!
For details on how to do better, see the paper!

Fault-Tolerant Architectures
Duplication & Comparison: two FUs, outputs compared (=?), ERR on mismatch
Triple-Modular Redundancy: three FUs feeding a voter, output Y
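Both schemes reduce to a few lines of logic. The sketch below models the function unit outputs as plain values; the function names are illustrative.

```python
from collections import Counter

def duplicate_and_compare(y1, y2):
    """Return (output, err); err is True when the two replicas disagree."""
    return y1, y1 != y2

def tmr_vote(y1, y2, y3):
    """Return the majority of three replica outputs, or None if all differ."""
    value, count = Counter([y1, y2, y3]).most_common(1)[0]
    return value if count >= 2 else None

print(duplicate_and_compare(3, 4))   # detects the error, cannot correct it
print(tmr_vote(3, 3, 4))             # masks the single faulty replica
```

Duplication only detects a fault, while TMR masks a single faulty replica, which is why the remaining slides build on the voter.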

Lock-Step Operation
single clock
Figure labels: three FUs (states „3“, „4“), voter, output Y
single point of failure
good replica determinism

Lock-Step Operation
independent clocks
Figure labels: three FUs (states „3“, „4“), voter, output Y
single fault tolerant
bad replica determinism

A Fundamental Choice
single clock source
  retain synchrony
  achieve high precision
  avoid metastability issues
multiple clk sources
  tolerate clock faults
  gain temporal redundancy
  require explicit synchronization for voting

Traditional TMR Architecture
is globally synchronous
we want plesiochronous operation
clocks accumulate phase shift => need to sync on voting
cannot vote per cycle
vote after „n“ cycles

Synchronizing the Voting
bypass voter normally => choose reasonable n
dedicated recovery step
  feed back voted result to restore correct state
  error-free starting point for next n cycles
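The voting schedule can be sketched as a toy simulation. The function block is a hypothetical counter, and the upset injection interface is made up for illustration; clocks and their synchronization are not modelled, only the idea that replicas run unvoted for n cycles and then restore their state from the voted result.

```python
from collections import Counter

def step(state):
    """Hypothetical function block: a simple counter."""
    return state + 1

def majority(values):
    """Majority value of the replica states (first seen wins a full tie)."""
    return Counter(values).most_common(1)[0][0]

def run_tmr(cycles, n, upsets):
    """Run three replicas; vote and recover every n cycles.

    upsets: dict mapping cycle number -> (replica index, corrupted state).
    Returns the replica states after the last cycle.
    """
    states = [0, 0, 0]
    for cycle in range(1, cycles + 1):
        states = [step(s) for s in states]      # voter bypassed
        if cycle in upsets:
            replica, bad = upsets[cycle]
            states[replica] = bad               # transient corrupts one replica
        if cycle % n == 0:
            voted = majority(states)
            states = [voted] * 3                # recovery step
    return states

print(run_tmr(cycles=8, n=4, upsets={2: (1, 99)}))
```

The corrupted replica is restored at the next voting point, so the following n cycles start from an error-free state; a smaller n shortens the window in which a second upset could defeat the vote, at the cost of more frequent synchronization.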

GALS-TMR: Details
Figure labels: synchronous, asynchronous
every nth clock cycle
  stop own clock
  synchronize with others
  perform recovery step

Recovery Controller
coordinates recovery of replica
operates when clock stops => asynchronous
needs to be fault tolerant as well => distributed
need internal replication to prevent local controllers from causing a deadlock as a consequence of
  internal transients
  transients on communication lines

What we could achieve…
All units replicated, no single point of failure
  function blocks
  clocks
  recovery controllers
Recovery from single transients
Value safe solution (pausible clocks)
  no residual risk of upsets
Function blocks stay synchronous

Summary
GALS opens new options, if handled with care
Redundant clocks ultimately need synchronization
Pausible clocks can synchronize without risk but cannot be made perfectly stable
Building a safe glitch filter is tricky
Making a DI coding FT is tricky as well
Pausible clocks allow elegant GALS TMR solutions

Related Publications
Jakob Lechner: Designing Robust GALS Circuits with Triple Modular Redundancy, ECCD 2012
Jakob Lechner and Andreas Steininger: Methods for Analysing and Improving the Fault resilience of Delay Insensitive Codes, ICCD 2015, follow-up ICCD 2016
Robert Najvirt and Andreas Steininger: Equivalence of Clock Gating and Synchronization with Applicability to GALS Communication, PATMOS 2014
Robert Najvirt and Andreas Steininger: How to synchronize a pausible Clock to a Reference, ASYNC 2015
Robert Najvirt and Andreas Steininger: A versatile and reliable Glitch Filter for Clocks, ECCTD 2015
