Published by Brett Staker. Modified over 9 years ago.
1
A Mechanism for Online Diagnosis of Hard Faults in Microprocessors Fred A. Bower, Daniel J. Sorin, and Sule Ozev
2
overview: Motivation; Current Techniques; Proposed Mechanism for Online Fault Diagnosis; Results; Challenges; Conclusion
3
background: Hard Faults (electromigration, gate oxide breakdown); Transient Faults (single event upset)
4
motivation: Process Scaling
5
current fault handling techniques: DIVA; Redundancy
6
hybrid approach: DIVA for error detection and correction; utilize redundancy
7
online diagnosis: Track units. On a DIVA error: error_count++. If (error_count > threshold): YES, deconfigure the unit; NO, no action.
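The flow on this slide can be sketched in a few lines of Python. This is a software illustration only, not the hardware from the paper: the class name `OnlineDiagnoser`, the method `report_error`, and the unit names are hypothetical, and the sketch assumes the checker tells us which units an erroneous instruction used.

```python
from collections import defaultdict


class OnlineDiagnoser:
    """Hypothetical model of the slide's flow: count errors per unit, deconfigure past a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.error_count = defaultdict(int)  # one counter per tracked unit
        self.deconfigured = set()            # units that have been turned off

    def report_error(self, units_used):
        """Called when the checker flags an instruction; returns the units newly deconfigured."""
        newly_deconfigured = []
        for unit in units_used:
            if unit in self.deconfigured:
                continue  # already turned off, nothing left to track
            self.error_count[unit] += 1                     # error_count++
            if self.error_count[unit] > self.threshold:     # YES branch
                self.deconfigured.add(unit)
                newly_deconfigured.append(unit)
            # NO branch: no action
        return newly_deconfigured


diagnoser = OnlineDiagnoser(threshold=2)
first = diagnoser.report_error(["alu0"])
second = diagnoser.report_error(["alu0"])
third = diagnoser.report_error(["alu0"])  # counter now exceeds the threshold
```

With a threshold of 2, the first two reports cause no action and the third deconfigures the unit, matching the strict `error_count > threshold` comparison on the slide.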
8
Field Deconfigurable Units (FDU): units that can be turned off in case of a fault. Diagram labels: ALU, DIVA checker, reorder buffer, reservation station.
9
deconfiguring mechanism: Deconfigure entries in a circular buffer; deconfigure entries in a tabular structure
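One way to picture deconfiguring an entry in a circular buffer is an allocation pointer that skips entries marked as disabled. The slide does not describe the actual circuit, so the sketch below is an assumption throughout: the class `CircularBuffer` and its methods are hypothetical, and occupancy tracking (full and empty conditions) is left out to keep the skipping logic visible.

```python
class CircularBuffer:
    """Hypothetical circular buffer whose allocation pointer skips deconfigured entries."""

    def __init__(self, size):
        self.size = size
        self.disabled = [False] * size  # one "deconfigured" bit per entry
        self.tail = 0                   # next entry to consider for allocation

    def deconfigure(self, index):
        """Mark an entry as faulty so it is never allocated again."""
        self.disabled[index] = True

    def next_slot(self):
        """Return the next usable entry, advancing past disabled ones."""
        for _ in range(self.size):
            slot = self.tail
            self.tail = (self.tail + 1) % self.size
            if not self.disabled[slot]:
                return slot
        raise RuntimeError("all entries are deconfigured")


buf = CircularBuffer(4)
buf.deconfigure(1)
allocated = [buf.next_slot() for _ in range(5)]
```

The buffer keeps working with one fewer entry, which is the performance cost of losing a component that the analysis slide refers to.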
10
analysis: Hard fault diagnosis latency. Performance impact of losing a component to a hard fault. Hardware overhead: DIVA (6% of an Alpha 21264 core); error counters (~1227 bits total); instruction resource usage tracking (19 wires in total); deconfiguration logic. The overhead can be reduced by tracking at a coarser granularity.
11
challenges: Error count threshold. It is related to resource usage; heavily used resources have higher counters; pipeline flushes occur before the threshold is reached.
13
challenges: Transient faults; independent resource usage. Diagram labels: ERROR, HARD FAULT, TRANSIENT FAULT, A B C, D E F, Desired, Observed, DIVA CHECKER.
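The resource-usage challenge can be illustrated with a toy simulation. It assumes, as a simplification not stated on the slide, that every flagged instruction increments the counter of each unit it used, and that instructions are spread over units in a fixed round-robin pattern. The function `diagnose`, the unit names, and the unit counts (2 ALUs, 4 reservation station entries, 8 reorder buffer entries) are all hypothetical.

```python
from collections import Counter


def diagnose(num_instructions, faulty_unit, threshold):
    """Toy model: blame every unit used by a flagged instruction, deconfigure past the threshold."""
    counts = Counter()
    deconfigured = set()
    alus = ["alu0", "alu1"]
    for n in range(num_instructions):
        # Only ALUs that are still configured can be assigned work.
        live_alus = [a for a in alus if a not in deconfigured]
        used = (live_alus[n % len(live_alus)], f"rs{n % 4}", f"rob{n % 8}")
        if faulty_unit in used:  # the checker flags this instruction
            for unit in used:
                counts[unit] += 1
                if counts[unit] > threshold:
                    deconfigured.add(unit)
    return counts, deconfigured


counts, deconfigured = diagnose(num_instructions=32, faulty_unit="alu0", threshold=4)
```

Because the faulty ALU appears in every flagged instruction while the other units only appear in some of them, its counter crosses the threshold first and the healthy units stay configured. If a healthy unit were always used together with the faulty one, their counters would rise in lockstep, which is why independent resource usage matters.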
14
limitations: Certain structures cannot be protected: register file, issue logic, common data bus (CDB). A transient fault can cause a false deconfiguration, although it may be masked by the error counter. Faults in the error counter or deconfiguration logic: periodically test the counters; an error there can permanently configure or deconfigure an FDU. Window of vulnerability: DIVA continues to see errors until the counter saturates.
15
conclusion: As transistors shrink, the hard fault rate increases. Current reliability mechanisms: redundancy (TMR); thread-level redundancy; pre-shipment testing and deconfiguration; low-cost solutions such as DIVA. Online diagnosis: low cost and low hardware overhead; uses FDUs along with DIVA to diagnose faults dynamically; increases yield, since a part with a faulty unit can be binned to a lower performance bin.
16
discussion: What are the advantages of this hybrid scheme over using just a DIVA checker? As process technology shrinks, can this mechanism significantly increase the lifetime of the processor? As transistors shrink and the number of cores increases, can this mechanism still be used, as opposed to turning off a faulty core? How can we extend this mechanism to handle the issue logic, singleton resources, and the CDB?
17
citations: images: Electron Migration. Digital image. Wikimedia.org. Wikimedia, 6 Mar. 2007. Web. Gate Oxide Breakdown. Digital image. Attopsemi Technology. Attopsemi Technology, n.d. Web. Sawant, Minal. Single Event Upset. Digital image. COTS. Microsemi, Jan. 2012. Web. Sawant, Minal. Soft Error Rate. Digital image. CCCP. University of Michigan, 11 May 2012. Web. Carr, Robert. Simultaneous Multithreading. Digital image. Prezi. Prezi, 31 Oct. 2013. Web. Wong, William. Out of Order Pipeline. Digital image. Electronic Design. Electronic Design, 19 Oct. 2011. Web. Mark Brehob, EECS 470 Lecture Slides. papers: Fred A. Bower, Daniel J. Sorin, and Sule Ozev. A Mechanism for Online Diagnosis of Hard Faults in Microprocessors. In Proc. of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05), 2005. T. M. Austin. DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design. In Proc. of the 32nd Annual IEEE/ACM Int'l Symposium on Microarchitecture, pages 196-207, Nov. 1999.