1 ISCA 2004, Shubu Mukherjee, FACT Group, MMDC, Intel
Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor
31st Annual International Symposium on Computer Architecture (ISCA), Munich, Germany, June 2004
Shubu Mukherjee (1), Christopher Weaver (1), Joel Emer (1), Steve Reinhardt (1,2)
(1) Massachusetts Microprocessor Design Center, Intel
(2) University of Michigan, Ann Arbor

2 Outline
Trade off performance for a lower soft error rate
  - MITF (mean instructions to failure)
  - reduce errors by keeping objects longer in protected memory
False Detected Unrecoverable Errors
  - processor would unnecessarily crash on such an error
  - techniques to avoid false errors
    - π (possibly incorrect) bit
    - anti-π bit

3 Alpha or Neutron Particle Strike Changes State of a Single Bit
[Figure: a stored bit flips between 0 and 1.]

4 Silent Data Corruption (SDC)
[Flowchart]
Bit read?
  - no: benign fault, no error
  - yes: bit has error protection?
    - detection & correction: no error
    - detection only: the fault is detected, so it is not silent
    - no: affects program outcome?
      - no: benign fault, no error
      - yes: SDC

5 SDC Definitions
SDC = Silent Data Corruption
MTTF = Mean Time to Failure
  - SDC MTTF = time between two SDC events
Chip SDC Rate (inversely proportional to MTTF)
  = rate of occurrence of SDC events
  = Σ over all bits [ (Circuit Soft Error Rate) × (SDC AVF) ]
Target market will typically set the SDC budget
  - note: budget is non-zero
Circuit Soft Error Rate
  - determined by alpha or neutron flux, circuit parameters, etc.
AVF (Architectural Vulnerability Factor), Mukherjee et al., MICRO '03
  - fraction of strikes that affect program outcome
  - AVF = 0% for branch predictor
  - AVF = 100% for program counter
  - AVF < 100% for instruction queue
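The chip-level sum on this slide can be sketched in Python. This is only an illustration: the structure sizes and the per-bit circuit soft error rate are made-up values, and the sketch assumes every bit in a structure shares one rate and one AVF. The AVF values of 0% (branch predictor) and 100% (program counter) come from the slide; 29% for the instruction queue is the SDC AVF quoted later in the talk.

```python
from dataclasses import dataclass


@dataclass
class Structure:
    name: str
    bits: int            # number of storage bits (hypothetical)
    ser_per_bit: float   # circuit soft error rate per bit, arbitrary units (hypothetical)
    sdc_avf: float       # fraction of strikes that affect program outcome


def chip_sdc_rate(structures):
    """Sum over all bits of (circuit soft error rate) x (SDC AVF)."""
    return sum(s.bits * s.ser_per_bit * s.sdc_avf for s in structures)


structures = [
    Structure("branch predictor", bits=4096, ser_per_bit=0.001, sdc_avf=0.0),
    Structure("program counter", bits=64, ser_per_bit=0.001, sdc_avf=1.0),
    Structure("instruction queue", bits=2048, ser_per_bit=0.001, sdc_avf=0.29),
]
rate = chip_sdc_rate(structures)
```

Note that the branch predictor contributes nothing to the sum despite being the largest structure in this toy example, because its AVF is zero.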

6 Instruction Queue's SDC AVF
Similar to Mukherjee et al., MICRO '03
[Chart: instruction queue SDC AVF. Setup: CPU2000, Asim, Simpoint, Itanium®2-like.]

7 SDC Reduction Techniques
Chip SDC Rate = Σ over all bits [ (Circuit Soft Error Rate) × (SDC AVF) ]
Conventional techniques
  - process technology (e.g., fully-depleted SOI)
  - circuit technology (e.g., radiation-hardened cells)
  - error detection or correction codes (e.g., parity, ECC)
Our new technique
  - reduce exposure to radiation to reduce SDC AVF
    - trade off between performance and soft error rate

8 MITF = mean instructions to failure (work between two errors)
MITF = (# instructions committed) / (# errors encountered)
     = IPC × (# cycles) / (# errors encountered)
     = IPC × (total time) × frequency / (# errors encountered)
     = IPC × MTTF × frequency
     = IPC × frequency / [ (Circuit Soft Error Rate) × AVF ]
     = (IPC / AVF) × (frequency / Circuit Soft Error Rate)
Higher IPC or lower AVF gives higher MITF (MITF is proportional to IPC / AVF).
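A minimal Python sketch of the MITF relation on this slide follows. The function and parameter names are illustrative, and the sketch assumes MTTF = 1 / (circuit soft error rate × AVF), as the slide's derivation implies. The numeric inputs are hypothetical and chosen only to show the proportionality.

```python
import math


def mitf(ipc, frequency_hz, circuit_ser_per_s, avf):
    """Mean instructions to failure = IPC x frequency x MTTF."""
    # Assumption: MTTF is the reciprocal of the effective error rate.
    mttf_s = 1.0 / (circuit_ser_per_s * avf)
    return ipc * frequency_hz * mttf_s


# Hypothetical operating point (not from the talk).
base = mitf(ipc=1.0, frequency_hz=2e9, circuit_ser_per_s=1e-9, avf=0.5)

# Doubling IPC or halving AVF each doubles MITF.
double_ipc = mitf(ipc=2.0, frequency_hz=2e9, circuit_ser_per_s=1e-9, avf=0.5)
half_avf = mitf(ipc=1.0, frequency_hz=2e9, circuit_ser_per_s=1e-9, avf=0.25)
```

Because frequency and circuit soft error rate are fixed for a given design and process, comparing two design points only needs the ratio IPC / AVF.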

9 Reducing SDC of an Instruction Queue (IQ) (assume protected instruction cache)
Increase IPC: fetch aggressively from IC to IQ
Reduce SDC AVF: prevent instructions from sitting needlessly in IQ
Net benefit if we improve MITF (proportional to IPC / AVF)
[Diagram: pipeline with Instruction Cache (IC), Fetch, Decode, IQ, RR, Execute, Commit.]

10 Squash Instructions
Goal
  - don't have instructions sit needlessly in the instruction queue
Algorithm to Reduce Exposure to Radiation
  - Trigger: cache miss
  - Action: squash all instructions in the instruction queue following the load miss
Evaluation using
  - Asim performance model framework
  - First 100-million-instruction Simpoint of all CPU2000 benchmarks
  - Itanium®2-like architecture, but scaled (note: in-order machine)
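The trigger/action rule above can be sketched as a toy Python function. This is not the Asim model used in the evaluation; it assumes the queue is a plain list in program order (oldest first), and the function name and instruction strings are hypothetical.

```python
def squash_after_load_miss(queue, miss_index):
    """Split the queue at a load that missed in the cache.

    Entries up to and including the missing load stay in the queue;
    every younger entry is squashed so it does not sit exposed while
    the miss is outstanding. Squashed instructions would be refetched
    later from the (protected) instruction cache.
    """
    kept = queue[: miss_index + 1]
    squashed = queue[miss_index + 1:]
    return kept, squashed


iq = ["ld r1, [r2]", "add r3, r1, r4", "mul r5, r3, r3"]
kept, squashed = squash_after_load_miss(iq, miss_index=0)
```

The cost of the policy is the refetch delay for the squashed entries, which is why the talk evaluates it with MITF rather than AVF alone.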

11 SDC MITF Improvement from Reducing Exposure

Design Point               IPC    SDC AVF   MITF Improvement
Baseline                   1.21   29%       0%
Squash on L1 load misses   1.19   22%       37%

IPC down slightly, SDC AVF down, SDC MITF up.
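As a rough cross-check, the relative MITF of two design points can be computed from IPC / AVF alone. The helper name below is hypothetical. With the rounded IPC and AVF values shown in the table the ratio comes out near 1.30; the 37% on the slide presumably reflects unrounded or per-benchmark data, which this transcript does not include.

```python
def relative_mitf(ipc_new, avf_new, ipc_base, avf_base):
    """Ratio of MITF between two design points, using MITF ~ IPC / AVF."""
    return (ipc_new / avf_new) / (ipc_base / avf_base)


# Rounded values from the table: squash on L1 load misses vs. baseline.
ratio = relative_mitf(ipc_new=1.19, avf_new=0.22, ipc_base=1.21, avf_base=0.29)
```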

12 Outline
Trade off performance for a lower soft error rate
  - MITF (mean instructions to failure)
  - reduce errors by keeping objects longer in protected memory
False Detected Unrecoverable Errors
  - processor would unnecessarily crash on such an error
  - techniques to avoid false errors
    - π (possibly incorrect) bit
    - anti-π bit

13 Detected Unrecoverable Error (DUE)
[Flowchart]
Bit read?
  - no: benign fault, no error
  - yes: bit has error protection?
    - detection & correction: no error
    - detection only: affects program outcome?
      - yes: True DUE
      - no: False DUE
    - no: affects program outcome?
      - no: benign fault, no error
      - yes: SDC
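The flowchart can be read as a small classification function. The Python sketch below follows the branches as reconstructed from the transcript; the enum and function names are illustrative.

```python
from enum import Enum


class Protection(Enum):
    NONE = "none"
    DETECTION_ONLY = "detection only"
    DETECTION_AND_CORRECTION = "detection & correction"


def classify_fault(bit_read, protection, affects_outcome):
    """Classify a single-bit fault following the flowchart's branches."""
    if not bit_read:
        return "benign fault (no error)"
    if protection is Protection.DETECTION_AND_CORRECTION:
        return "no error"
    if protection is Protection.DETECTION_ONLY:
        # The error is flagged either way; it is "false" if the
        # program outcome would not have been affected.
        return "True DUE" if affects_outcome else "False DUE"
    return "SDC" if affects_outcome else "benign fault (no error)"
```

For example, a parity-protected bit holding a wrong-path instruction is read, flagged, and classified as a False DUE, even though the program would have completed correctly.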

14 DUE Definitions
DUE = Detected Unrecoverable Error
MTTF = Mean Time to Failure
  - DUE MTTF = time between two DUE events
Chip DUE Rate (inversely proportional to MTTF)
  = rate of occurrence of DUE events
  = Σ over all bits [ (Circuit Soft Error Rate) × (DUE AVF) ]
Target market will typically set the DUE budget
  - note: budget is non-zero
Circuit Soft Error Rate
  - determined by alpha or neutron flux, circuit parameters, etc.
DUE AVF (Architectural Vulnerability Factor)
  - fraction of strikes that result in DUE events
  - Total DUE AVF = (True DUE AVF) + (False DUE AVF)

15 DUE AVF of Instruction Queue with Parity
False DUE AVF: 33%
[Chart: DUE AVF of the instruction queue. Setup: CPU2000, Asim, Simpoint, Itanium®2-like.]

16 Total Soft Error Rate
Total Soft Error Rate = Σ over all bits [ (SDC Rate) + (DUE Rate) ]
Parity converts SDC to DUE
  - True DUE AVF (with error detection) = SDC AVF (without detection)
Parity also introduces False DUE
  - e.g., error flagged on wrong-path or dynamically dead instruction
Parity-protecting a bit increases overall observed soft error rate
  - Example: instruction queue
  - SDC AVF (without error detection) = 29%
  - DUE AVF (with error detection) = 62%
    - True DUE AVF = 29%
    - False DUE AVF = 33%
    - Idle & miscellaneous = 38%
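The instruction queue example can be checked with a few lines of Python. The percentages are the slide's; the variable names are illustrative, and the final ratio assumes the circuit soft error rate of a bit is unchanged by adding parity, so the observed error rate scales with AVF.

```python
# Values from the slide, as fractions.
sdc_avf_without_parity = 0.29
false_due_avf = 0.33

# Parity converts SDC to DUE: True DUE AVF equals the old SDC AVF.
true_due_avf = sdc_avf_without_parity
total_due_avf = true_due_avf + false_due_avf

# Remaining strikes hit idle or otherwise unexposed state.
idle_and_misc = 1.0 - total_due_avf

# Observed error rate with parity relative to without it
# (assumes the same circuit soft error rate in both cases).
observed_rate_increase = total_due_avf / sdc_avf_without_parity
```

Under that assumption the observed error rate of the queue roughly doubles once parity is added, which is the motivation for removing False DUE events.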

17 Reducing DUE
Chip DUE Rate = Σ over all bits [ (Circuit Soft Error Rate) × (DUE AVF) ]
DUE AVF = (True DUE AVF) + (False DUE AVF)
Techniques
  - convert back to SDC
  - process technology (e.g., fully-depleted SOI)
  - circuit technology (e.g., radiation-hardened cells)
  - error recovery techniques (e.g., ECC)
Our new techniques
  - exposure reduction techniques (first part of this talk)
  - False DUE AVF reduction

18 Sources of False DUE in an Instruction Queue
Instructions with uncommitted results
  - e.g., wrong-path, predicated-false
  - solution: π (possibly incorrect) bit till commit
Instruction types neutral to errors
  - e.g., no-ops, prefetches, branch predict hints
  - solution: anti-π bit
Dynamically dead instructions
  - instructions whose results will not be used in future
  - solution: π bit beyond commit

19 Coping with Wrong-Path Instructions (assume parity-protected instruction queue)
Declare error on issue
Problem: not enough information at issue
[Diagram: pipeline with Instruction Cache (IC), Fetch, Decode, IQ, RR, Execute, Commit, and Data Cache; an instruction ("inst") in the IQ is marked with an X.]

20 The π (Possibly Incorrect) Bit (assume parity-protected instruction queue)
On issue, post the error in the π bit
At the commit point, declare an error only if the instruction is not wrong-path and its π bit is set
[Diagram: pipeline with Instruction Cache (IC), Fetch, Decode, IQ, RR, Execute, Commit, and Data Cache; the instruction carries its π bit from issue to commit.]

21 anti-π bit: coping with No-ops (assume parity-protected instruction queue)
On issue, if the anti-π bit is set, then do not set the π bit
The anti-π bit neutralizes the π bit
[Diagram: pipeline with Instruction Cache (IC), Fetch, Decode, IQ, RR, Execute, Commit, and Data Cache; instructions are shown with their π and anti-π bits.]
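The issue-time and commit-time rules for the π and anti-π bits can be sketched together in Python. This is a toy model, not the hardware design: the `Inst` fields and the two function names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Inst:
    parity_error: bool = False  # parity check failed when read from the IQ
    anti_pi: bool = False       # set for no-ops, prefetches, branch predict hints
    wrong_path: bool = False    # known only by commit time
    pi: bool = False            # "possibly incorrect" bit


def issue(inst):
    """On issue, post a detected error in the pi bit instead of declaring it."""
    if inst.parity_error and not inst.anti_pi:
        inst.pi = True
    return inst


def commit(inst):
    """Return True if an error must be declared at the commit point."""
    return inst.pi and not inst.wrong_path


# A struck wrong-path instruction raises no error; a struck
# correct-path instruction does; a struck no-op never sets pi.
wrong_path_err = commit(issue(Inst(parity_error=True, wrong_path=True)))
real_err = commit(issue(Inst(parity_error=True)))
noop_err = commit(issue(Inst(parity_error=True, anti_pi=True)))
```

Deferring the decision to commit is what supplies the information that was missing at issue, namely whether the instruction was on the correct path.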

22 π bit: avoiding False DUE on Dynamically Dead Instructions
Declare the error on reading R1, if the π bit is set
If R1 isn't read (i.e., dynamically dead), then no False DUE
π bit can be used in caches & main memory …
[Diagram: pipeline with Instruction Cache (IC), Fetch, Decode, IQ, RR, Execute, Commit, and Data Cache; "write R1" carries a π bit into the register, and "read R1" sees it.]
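A toy Python register file shows how carrying the π bit past commit avoids False DUE on dead values. The class and its methods are hypothetical and stand in for whatever structure holds the π bit; raising an exception models declaring the error.

```python
class RegisterFile:
    """Registers that carry a pi (possibly incorrect) bit with each value."""

    def __init__(self):
        self.values = {}
        self.pi = {}

    def write(self, reg, value, pi=False):
        # A later write replaces both the value and its pi bit, so an
        # error on a value that is never read simply disappears.
        self.values[reg] = value
        self.pi[reg] = pi

    def read(self, reg):
        # The error is declared on use, not when it was detected.
        if self.pi.get(reg, False):
            raise RuntimeError(f"DUE: {reg} is possibly incorrect")
        return self.values[reg]


rf = RegisterFile()
rf.write("R1", 5, pi=True)   # struck instruction writes R1
rf.write("R1", 7)            # R1 overwritten before any read: dynamically dead
value = rf.read("R1")        # no error is declared
```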

23 Scope of the π Bit
π bit allows declaring an error on use of a value or object
  - rather than when the error is detected
  - e.g., declare error on register read, rather than when it was detected
π bit goes out of scope
  - when error information cannot be propagated
  - e.g., store writes data into cache without π bits
  - typically, raise error when π bit goes out of scope
Design points: increasing levels of π bit protection
  - π bit till register commit
  - π bit till register read
  - π bit till store commit
  - π bit till I/O commit

24 % False DUE AVF Eliminated (PI = π)
Practical to eliminate most of the False DUE AVF
[Chart: percentage of False DUE AVF eliminated. Setup: CPU2000, Asim, Simpoint, Itanium®2-like.]

25 Summary
Trade off performance for a lower soft error rate
  - MITF (mean instructions to failure) is proportional to (IPC / AVF)
  - reduce errors by keeping objects longer in protected memory
False Detected Unrecoverable Errors
  - processor would unnecessarily crash on such an error
  - techniques to avoid false errors
    - π (possibly incorrect) bit
    - anti-π bit
    - PET (post-commit error tracking) buffer, see paper

26 BACKUP SLIDES FOLLOW

27 % of False DUE Covered
Possible to eliminate most of the False DUE AVF

