Reliability. Threads for Fault Tolerance. Multiprocessors: Transient fault detection.


1 Reliability

2 Threads for Fault Tolerance
- Multiprocessors:
  - Transient fault detection

3 Transient Faults
- Faults that persist for a “short” duration
- Cause: cosmic rays, energetic particles originating from outer space
- Effect: knock off electrons, discharge capacitor
- Solution
  - no practical absorbent for cosmic rays
    - 1 fault per 1000 computers per year (estimated fault rate)
- Future is worse
  - smaller feature size, higher transistor count, reduced noise margin
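To put the estimated fault rate in perspective, the arithmetic can be sketched as below. The rate of 1 fault per 1000 computers per year is the slide's estimate; the function names, the fleet sizes, and the assumption that faults arrive independently (a Poisson model) are illustrative and not part of the presentation.

```python
import math

# Estimated rate quoted on the slide: 1 fault per 1000 computers per year.
FAULTS_PER_MACHINE_YEAR = 1 / 1000


def expected_faults(machines: int, years: float = 1.0) -> float:
    """Expected number of transient faults across a fleet (hypothetical helper)."""
    return FAULTS_PER_MACHINE_YEAR * machines * years


def prob_at_least_one_fault(machines: int, years: float = 1.0) -> float:
    """Probability of one or more faults, ASSUMING independent Poisson arrivals."""
    return 1.0 - math.exp(-expected_faults(machines, years))


if __name__ == "__main__":
    for fleet in (1, 1_000, 10_000):
        print(fleet, expected_faults(fleet), round(prob_at_least_one_fault(fleet), 4))
```

Under these assumptions a single machine rarely sees a fault, but a large installation should expect them routinely, which is the motivation for detection hardware.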

4 Background
- Fault tolerant systems use redundancy to improve reliability:
  - Time redundancy: separate executions
  - Space redundancy: separate physical copies of resources
    - DMR/TMR
  - Data redundancy
    - ECC: Automatic repeat request (ARQ), Forward error correction (FEC)
    - Parity: odd/even
- Examples:
  - IBM: duplicated pipelines, spare processors, ECC in memories...
  - HP: DMR/TMR processors, Parity/ECC in buses, memories...
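The redundancy styles listed above can be illustrated with a small sketch: a DMR comparison (detects a mismatch), a TMR majority vote (masks one faulty copy), and an even-parity bit (detects a single flipped bit). All function names are hypothetical and the code models the ideas in software only; real systems implement them in hardware.

```python
from collections import Counter
from typing import Sequence


def dmr_check(a: int, b: int) -> bool:
    """Dual modular redundancy: two copies can detect, but not correct, a fault."""
    return a == b


def tmr_vote(a: int, b: int, c: int) -> int:
    """Triple modular redundancy: return the value at least two copies agree on."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: more than one copy is faulty")
    return value


def even_parity_bit(bits: Sequence[int]) -> int:
    """Parity bit that makes the total number of 1s even."""
    return sum(bits) % 2


def parity_ok(bits: Sequence[int], parity: int) -> bool:
    """True if data plus parity bit still contains an even number of 1s."""
    return (sum(bits) + parity) % 2 == 0


if __name__ == "__main__":
    word = [1, 0, 1, 1]
    p = even_parity_bit(word)
    print(parity_ok(word, p))          # stored word is intact
    print(parity_ok([1, 0, 0, 1], p))  # one bit flipped by a transient fault
    print(tmr_vote(42, 42, 7))         # the faulty third copy is outvoted
```

Parity and DMR only detect a fault, while TMR can also mask it, at the cost of a third copy.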

5 Multiprocessors: Fault Detection
- Chip-level Redundantly Threaded processor
  - Replicates register values but not memory values
  - The leading thread commits stores only after checking
    - Memory is guaranteed to be correct
    - Other instructions commit without checking
  - The leading thread sends committed values for:
    - branch outcomes
    - load/store values
    - store addresses

6 Sphere of Replication (SoR)
- Logical boundary of redundant execution within a system
  - Components within protected via redundant execution
  - Components outside must be protected via other means
- Its size matters:
  - Error detection latency
  - Stored-state size

7 Example Spheres of Replication
- Compaq Himalaya
- ORH-Dual: On-Chip Replicated Hardware (similar to IBM G5)

8 Fault Detection in Compaq Himalaya System
- Replicated Microprocessors + Cycle-by-Cycle Lockstepping

9 Fault Detection via Simultaneous Multithreading (SMT)
- Replicated Microprocessors + Cycle-by-Cycle Lockstepping

10 Concept
- SMT improves the performance of a processor by:
  - allowing independent threads to execute simultaneously
  - doing so in different functional units
- Redundant Multithreading (RMT):
  - leverages SMT’s properties to allow fault detection for microprocessors
    - runs two copies of the same program as independent threads
    - compares their outputs and initiates recovery in case of mismatch
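The RMT idea (run two copies, compare outputs, react to a mismatch) can be sketched in software as follows. This is only an illustration: real RMT runs the copies as hardware threads, and `run_redundant`, `FaultDetected`, and the `inject_fault` hook are hypothetical names introduced here to simulate a transient fault in the trailing copy.

```python
from typing import Any, Callable, Optional


class FaultDetected(Exception):
    """Raised when the two redundant copies disagree."""


def run_redundant(
    program: Callable[[Any], Any],
    data: Any,
    inject_fault: Optional[Callable[[Any], Any]] = None,
) -> Any:
    """Run `program` twice on the same input and compare the outputs."""
    leading = program(data)    # first copy (leading thread)
    trailing = program(data)   # second copy (trailing thread)
    if inject_fault is not None:
        # Simulation only: corrupt the trailing result as a transient fault would.
        trailing = inject_fault(trailing)
    if leading != trailing:
        # A real design would initiate recovery here; the sketch only reports.
        raise FaultDetected(f"outputs differ: {leading!r} vs {trailing!r}")
    return leading


if __name__ == "__main__":
    print(run_redundant(sum, [1, 2, 3]))
    try:
        run_redundant(sum, [1, 2, 3], inject_fault=lambda v: v ^ 1)
    except FaultDetected as err:
        print("fault detected:", err)
```

Comparing only final outputs keeps the check cheap, but a fault is noticed only when it reaches an output.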

11 Input Replication
- Load Value Queue (LVQ)
  - Keep threads on same path despite I/O or MP writes
  - Out-of-order load issue possible
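A minimal sketch of the Load Value Queue idea: the leading thread reads memory and records each loaded value, and the trailing thread takes its load values from the queue instead of from memory, so a write by another processor between the two loads cannot send the threads down different paths. The class and method names are hypothetical, and the strict FIFO order is a simplification of this sketch (the slide notes that out-of-order load issue is possible).

```python
from collections import deque
from typing import Deque, Dict, Tuple


class LoadAddressMismatch(Exception):
    """Raised when the trailing thread loads from a different address."""


class LoadValueQueue:
    def __init__(self) -> None:
        self._entries: Deque[Tuple[int, int]] = deque()

    def leading_load(self, memory: Dict[int, int], addr: int) -> int:
        """Leading thread: read memory and record (address, value)."""
        value = memory[addr]
        self._entries.append((addr, value))
        return value

    def trailing_load(self, addr: int) -> int:
        """Trailing thread: reuse the recorded value; never touch memory."""
        lead_addr, value = self._entries.popleft()
        if lead_addr != addr:
            raise LoadAddressMismatch(f"{lead_addr:#x} vs {addr:#x}")
        return value


if __name__ == "__main__":
    memory = {0x10: 5}
    lvq = LoadValueQueue()
    a = lvq.leading_load(memory, 0x10)
    memory[0x10] = 9                  # another processor writes in between
    b = lvq.trailing_load(0x10)
    print(a, b)                       # both threads observe the same value
```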

12 Output Comparison
- Compare & validate output before sending it outside the SoR

13 Store Queue Comparator (STQ)
- Store Queue Comparator
  - Compares outputs to data cache
  - Catch faults before propagating to rest of system

14 Store Queue Comparator (cont’d)
- Extends residence time of leading-thread stores
  - Size constrained by cycle time goal
  - Base CPU statically partitions single queue among threads
  - Potential solution: per-thread store queues
- Deadlock if matching trailing store cannot commit
  - Several small but crucial changes to avoid this
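The store comparison described on the two slides above can be sketched as a queue that holds leading-thread stores until the matching trailing-thread store arrives; only a matching pair is released to memory. The names below are hypothetical, and the sketch ignores the queue-size and deadlock issues the slide mentions.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class StoreMismatch(Exception):
    """Raised when leading and trailing stores differ in address or value."""


class StoreComparator:
    def __init__(self, memory: Dict[int, int]) -> None:
        self.memory = memory
        self._pending: Deque[Tuple[int, int]] = deque()

    def leading_store(self, addr: int, value: int) -> None:
        """Buffer the leading store; it must not reach memory yet."""
        self._pending.append((addr, value))

    def trailing_store(self, addr: int, value: int) -> None:
        """Compare with the oldest buffered store; commit only on a match."""
        lead = self._pending.popleft()
        if lead != (addr, value):
            raise StoreMismatch(f"leading {lead} vs trailing {(addr, value)}")
        self.memory[addr] = value


if __name__ == "__main__":
    mem: Dict[int, int] = {}
    cmp_ = StoreComparator(mem)
    cmp_.leading_store(4, 7)
    print(mem)                 # still empty: the store waits for verification
    cmp_.trailing_store(4, 7)
    print(mem)                 # the verified store is now visible
```

Holding the leading store until it is verified is what extends its residence time in the queue.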

15 Branch Outcome Queue (BOQ)
- Branch Outcome Queue
  - Forward leading-thread branch targets to trailing fetch
  - 100% prediction accuracy in absence of faults
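A rough sketch of the Branch Outcome Queue: the leading thread pushes each committed branch target, and the trailing thread's fetch uses the queued target as its prediction. If the trailing thread later computes a different target, the "prediction" was wrong, which in this scheme indicates a fault rather than an ordinary misprediction. Class and method names are hypothetical.

```python
from collections import deque
from typing import Deque, Tuple


class BranchFault(Exception):
    """Raised when the trailing thread disagrees with the leading outcome."""


class BranchOutcomeQueue:
    def __init__(self) -> None:
        self._outcomes: Deque[Tuple[int, int]] = deque()

    def leading_commit(self, branch_pc: int, target_pc: int) -> None:
        """Leading thread: record the committed target of a branch."""
        self._outcomes.append((branch_pc, target_pc))

    def trailing_fetch(self, branch_pc: int) -> int:
        """Trailing fetch: use the leading thread's outcome as the prediction."""
        lead_pc, target_pc = self._outcomes.popleft()
        if lead_pc != branch_pc:
            raise BranchFault("threads reached different branches")
        return target_pc

    @staticmethod
    def verify(predicted_target: int, computed_target: int) -> None:
        """Trailing execute: a wrong prediction can only come from a fault."""
        if predicted_target != computed_target:
            raise BranchFault(f"{predicted_target:#x} vs {computed_target:#x}")


if __name__ == "__main__":
    boq = BranchOutcomeQueue()
    boq.leading_commit(0x400, 0x480)
    predicted = boq.trailing_fetch(0x400)
    boq.verify(predicted, 0x480)   # fault-free: the prediction is always right
    print(hex(predicted))
```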

16 Simultaneous & Redundantly Threaded Processor (SRT)
- SRT = SMT + Fault Detection
- Less hardware compared to replicated microprocessors
  - SMT needs ~5% more hardware over uniprocessor
  - SRT adds very little hardware overhead to existing SMT
- Better performance than complete replication
  - better use of resources
  - lower cost

17 Issues
- Cycle-by-cycle output comparison and input replication:
  - Equivalent instructions from different threads may execute in different cycles
  - Equivalent instructions from different threads might execute in a different order
- Precise scheduling of the threads is crucial for optimal performance
- Branch misprediction
- Cache miss

