1 Unreliable Silicon: Myth or Reality?
Shubu Mukherjee
Principal Engineer
Director, SPEARS Group (SPEARS = Simulation & Pathfinding of Efficient And Reliable Systems)
Intel Corporation
Workshop on Computer Architecture Research Directions (CARD)
Feb. 11th, 2007
2 What’s the Truth?
There are three versions of the truth:
  My truth
  Your truth
  The truth
3 The Truth: Silicon is Becoming Unreliable
[Figure panels:]
  Time-dependent device degradation
  Extreme device variations (Wider)
  Cell Instability Is Increasing
  Soft Error FIT/Chip (Logic & Mem)
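The last panel title uses the FIT unit. One FIT is one failure per billion (10^9) device-hours, so a FIT rate converts directly to a mean time to failure. The sketch below shows that conversion. The function names and the 1000 FIT input are illustrative assumptions, not figures from the talk, and the conversion assumes a constant failure rate.

```python
HOURS_PER_YEAR = 24 * 365  # ignoring leap years for a rough estimate


def fit_to_mttf_hours(fit: float) -> float:
    """Convert a FIT rate (failures per 1e9 device-hours) to MTTF in hours.

    Assumes a constant failure rate, so MTTF is the reciprocal of the rate.
    """
    if fit <= 0:
        raise ValueError("FIT rate must be positive")
    return 1e9 / fit


def fit_to_mttf_years(fit: float) -> float:
    """Same conversion, expressed in years."""
    return fit_to_mttf_hours(fit) / HOURS_PER_YEAR


# Hypothetical chip with a 1000 FIT soft-error rate (illustrative value only).
print(fit_to_mttf_hours(1000))   # one million hours
print(fit_to_mttf_years(1000))   # roughly 114 years
```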
4 The End-User’s Truth
End-users
  Care deeply about reliable systems
  May not be able to determine why their system failed
  Expect the industry to produce reliable systems for them
Goal of silicon vendors
  Keep the number of silicon errors low enough (e.g., < 0.1% of all errors)
  Low enough that end-users don’t notice or don’t care
Point Risks
  Individual corruption or crash may be critical (e.g., Windows 98 crash during a Gates demo)
  End-users may demand chip replacement, even if the error was not permanent
5 The IT Manager or Vendor’s Truth
The Lightbulb Phenomenon
  A house with 48 lightbulbs, each with a 4-year MTTF
  Will replace a lightbulb every month, on average
Negative Impact to Business
  Billions of dollars involved
  Increased total cost of ownership
  Product returns & replacement
  Loss of data and/or availability
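The lightbulb arithmetic can be checked in a few lines. This is a minimal sketch that assumes each bulb fails independently at a constant rate, so the per-bulb rates simply add. The function name is hypothetical. A 4-year MTTF is 48 months, so 48 bulbs together fail at 48/48 = 1 bulb per month.

```python
def expected_failures_per_month(num_parts: int, mttf_years: float) -> float:
    """Expected failures per month across num_parts identical parts.

    Assumes independent parts with a constant failure rate, so the
    system failure rate is the sum of the per-part rates.
    """
    mttf_months = mttf_years * 12.0
    return num_parts / mttf_months


rate = expected_failures_per_month(num_parts=48, mttf_years=4)
print(rate)        # failures per month for the whole house
print(1.0 / rate)  # mean months between replacements
```

The same scaling applies to a data center or a large installed base: the more parts deployed, the shorter the time between failures seen by whoever maintains them, even when each part is individually reliable.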
6 The Designer’s Awakening
Shock: “SER is the crabgrass in the lawn of computer design”
Denial: “We will do the SER work two months before tapeout”
Anger: “Our reliability target is too ambitious”
Acceptance: “You can deny physics only for so long”
Designers have accepted silicon reliability as a challenge they will have to deal with
7 The Designer’s Challenge
Protection comes from
  Process – improved process technology
  Materials – shielding for alpha particles
  Circuits – rad-hard cells
  Architecture – ECC, parity, hardened gates, redundant execution
  Software – can provide detection & recovery at a higher level
Companies constantly making trade-offs for reliability
  Cost of protection (performance & die size) vs. chip reliability
  Products must meet the end-users’ reliability expectations
Industry will produce reliably operating parts
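Of the architectural techniques listed above, parity is the simplest to illustrate. The sketch below is a software model with hypothetical helper names, not a description of any particular Intel design. It stores one even-parity bit per word, detects a single-bit flip, and shows that a double-bit flip goes unnoticed. Parity only detects errors; correcting them needs a stronger code such as ECC.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 when the word contains an odd number of 1 bits."""
    return bin(word).count("1") % 2


def store(word: int) -> tuple[int, int]:
    """Model a protected storage cell: the data plus its parity bit."""
    return word, parity_bit(word)


def check(word: int, stored_parity: int) -> bool:
    """Return True when the word still matches its stored parity."""
    return parity_bit(word) == stored_parity


data, p = store(0b10110010)

single_flip = data ^ (1 << 3)         # one bit upset, e.g. by a particle strike
double_flip = data ^ 0b11             # two bits upset in the same word

print(check(data, p))                 # True: clean word passes
print(check(single_flip, p))          # False: single-bit error is detected
print(check(double_flip, p))          # True: double-bit error escapes parity
```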
8 Industry Needs Help with Research
Academia has some misconceptions
  MTBF is only a rough estimate of an individual part’s life
  A system hang does not protect from data corruption
  Adding protection without correction does not reduce the overall error rate
  …
Research needed in different areas of silicon reliability
  How do we predict and/or measure error rate from radiation, wearout, & variability?
  How do we detect soft errors, wearout, variability on individual parts?
  Many traditional solutions exist, but how do we make them cheaper?
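The point that detection without correction does not reduce the overall error rate can be made concrete with a small accounting model. This is a sketch under stated assumptions: the function, the category names, and the 1000 FIT and 90% coverage figures are illustrative, not data from the talk. Detection moves errors from the silent bucket to the detected bucket, and only correction removes them from the total.

```python
def error_breakdown(raw_rate: float, detect_coverage: float,
                    correct_coverage: float = 0.0) -> dict[str, float]:
    """Split a raw error rate into silent, detected-only, and total.

    detect_coverage:  fraction of raw errors that are detected.
    correct_coverage: fraction of raw errors that are detected AND corrected
                      (cannot exceed detect_coverage).
    """
    if not 0.0 <= correct_coverage <= detect_coverage <= 1.0:
        raise ValueError("need 0 <= correct_coverage <= detect_coverage <= 1")
    detected = raw_rate * detect_coverage
    corrected = raw_rate * correct_coverage
    return {
        "silent": raw_rate - detected,
        "detected_uncorrected": detected - corrected,
        "total": raw_rate - corrected,
    }


raw = 1000.0  # hypothetical raw error rate, e.g. in FIT

unprotected = error_breakdown(raw, detect_coverage=0.0)
detect_only = error_breakdown(raw, detect_coverage=0.9)
detect_and_correct = error_breakdown(raw, detect_coverage=0.9,
                                     correct_coverage=0.9)

print(unprotected["total"], detect_only["total"], detect_and_correct["total"])
```

In this model the detect-only design has the same total as the unprotected one. It has traded silent corruptions for detected errors, which may still be worth doing, but the user-visible error count falls only once correction is added.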