Raghuraman Balasubramanian, Karthikeyan Sankaralingam: Understanding the Impact of Gate-Level Physical Reliability Effects on Whole Program Execution.


1 Raghuraman Balasubramanian, Karthikeyan Sankaralingam: Understanding the Impact of Gate-Level Physical Reliability Effects on Whole Program Execution

2 PERSim: A framework with unprecedented fidelity for studying the end-to-end physical effects of reliability at the gate level while running entire programs.

3 Understanding Reliability: Wearout; particle strikes; permanent faults; change in gate behavior; logic fault; architecture errors. How does reliability affect the processor over its lifetime? What happens to the user application?

4 State-Of-The-Art: Device-level models that capture the effects of reliability physics; simulators capable of running full programs. Wearout; particle strikes; permanent faults; change in gate behavior; logic fault; architecture errors. What happens to the user application?

5 PERSim: A framework with unprecedented fidelity for studying the end-to-end physical effects of reliability at the gate level while running entire programs. It combines device-level models that capture the effects of reliability physics with simulators capable of running full programs.

6 Why is this important? Device-level problems ➔ microarchitectural / architectural level. Evaluations ➔ a small structured hardware, abstracted physics.
References:
M. Agarwal, B. Paul, M. Zhang, and S. Mitra. Circuit failure prediction and its application to transistor aging. In VTS ’07.
T. M. Austin. Diva: A reliable substrate for deep submicron microarchitecture design. In MICRO ’99.
R. Balasubramanian and K. Sankaralingam. Virtually aged sampling dmr: Unifying circuit failure detection and circuit failure prediction. In MICRO ’13.
J. Blome, S. Feng, S. Gupta, and S. Mahlke. Self-calibrating online wearout detection. In MICRO ’07.
K. Bowman, J. Tschanz, C. Wilkerson, S. Lu, T. Karnik, V. De, and S. Borkar. Circuit techniques for dynamic variation tolerance. In DAC ’09.
K. Constantinides, O. Mutlu, T. Austin, and V. Bertacco. Software-based online detection of hardware defects: mechanisms, architectural support, and evaluation. In MICRO ’07.
M. de Kruijf, S. Nomura, and K. Sankaralingam. Relax: An architectural framework for software recovery of hardware faults. In ISCA ’10.
Ernst et al. Razor: A low-power pipeline based on circuit-level timing speculation. In MICRO ’03.
Gherman, J. Massas, S. Evain, S. Chevobbe, and Y. Bonhomme. Error prediction based on concurrent self-test and reduced slack time. In DATE ’11.
D. Gizopoulos, M. Psarakis, S. V. Adve, P. Ramachandran, S. K. S. Hari, D. Sorin, A. Meixner, A. Biswas, and X. Vera. Architectures for online error detection and recovery in multicore processors. In DATE ’11.
M. Gomaa, C. Scarbrough, T. N. Vijaykumar, and I. Pomeranz. Transient-fault recovery for chip multiprocessors. In ISCA ’03.
A. Meixner, M. E. Bauer, and D. J. Sorin. Argus: Low-cost, comprehensive error detection in simple cores. In MICRO ’07.
S. Nomura, M. D. Sinclair, C.-H. Ho, V. Govindaraju, M. de Kruijf, and K. Sankaralingam. Sampling+dmr: practical and low-overhead permanent fault detection. In ISCA ’11.
J. Park and J. Abraham. An aging-aware flip-flop design based on accurate, run-time failure prediction. In VTS ’12.
M. Prvulovic, Z. Zhang, and J. Torrellas. Revive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors. In ISCA ’02.
A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and D. Grossman. Enerj: Approximate data types for safe and general low-power computation. In PLDI ’11.
S. Shyam, K. Constantinides, S. Phadke, V. Bertacco, and T. Austin. Ultra low-cost defect protection for microprocessor pipelines. In ASPLOS ’06.
C. Smolens, B. T. Gold, J. C. Hoe, B. Falsafi, and K. Mai. Detecting emerging wearout faults. In SELSE ’07.
D. J. Sorin, M. M. K. Martin, M. D. Hill, and D. A. Wood. Safetynet: improving the availability of shared memory multiprocessors with global checkpoint/recovery. In ISCA ’02.
X. Wang, D. Tran, S. George, L. Winemberg, N. Ahmed, S. Palosh, A. Dobin, and M. Tehranipoor. Radic: A standard-cell-based sensor for on-chip aging and flip-flop metastability measurements. In ITC ’12.
B. Zandian, W. Dweik, S. H. Kang, T. Punihaole, and M. Annavaram. Wearmon: Reliability monitoring using adaptive critical path testing. In DSN ’12.

7 Executive Summary: PERSim: model physical effects at the gate level; see the impact running full programs on a full processor; at high simulation speeds (25 million cycles per second); with good signal observability; fine-grained control over fault injection. Demonstration: evaluation of 4 recently proposed techniques; end-to-end transient fault analysis.

8 Outline: Motivation and Overview; How do we do it (motivating running example, breaking it down into mechanisms); Implementation; Evaluations using PERSim; Questions.

9 Virtually Aged Sampling DMR: Virtual aging. Fault exposure: in most gates the faults are automatically exposed; a new mechanism exposes faults in other gates. Detect errors.

10 Evaluation using PERSim: [Flow diagram; labels: Wearout (delay as a function of time/Vdd); Delay Aware Simulation; Applications; Input Sequences; Applications; DMR; Error?; Fault Vector]

11 Mechanisms – Fault Modeling: [Flow diagram; labels: Delay Aware Simulation; Applications; Input Sequences; Applications; DMR; Error?; Fault Vector; Fault Modeling; Wearout (delay as a function of time/Vdd)]

12 Mechanisms – Delay Aware Simulation: [Flow diagram; labels: Delay Aware Simulation; Applications; Input Sequences; Applications; DMR; Error?; Fault Vector; Fault Modeling; Wearout (delay as a function of time/Vdd)]
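The idea behind delay-aware simulation can be illustrated with a small timing check: if wearout slows the gates on a path so that the signal arrives after the clock edge, the flip-flop keeps its stale value. The following is a minimal sketch under stated assumptions, not PERSim's implementation; `Gate`, `aging_factor`, `path_delay_ps`, and `captured_value` are hypothetical names, and the delays are made-up numbers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gate:
    """One gate on a combinational path (hypothetical model)."""
    name: str
    nominal_delay_ps: float      # delay of a fresh, unaged gate
    aging_factor: float = 1.0    # assumed multiplier supplied by a wearout model


def path_delay_ps(path: List[Gate]) -> float:
    """Total propagation delay along the path, including aging."""
    return sum(g.nominal_delay_ps * g.aging_factor for g in path)


def captured_value(path: List[Gate], new_value: int, old_value: int,
                   clock_period_ps: float) -> int:
    """Value latched at the clock edge.

    The flip-flop sees new_value only if the signal arrives within the
    clock period; otherwise it latches the stale old_value (a timing fault).
    """
    if path_delay_ps(path) <= clock_period_ps:
        return new_value
    return old_value


fresh = [Gate("g0", 100.0), Gate("g1", 100.0), Gate("g2", 100.0)]
aged = [Gate(g.name, g.nominal_delay_ps, aging_factor=1.2) for g in fresh]

fresh_result = captured_value(fresh, new_value=1, old_value=0, clock_period_ps=350.0)
aged_result = captured_value(aged, new_value=1, old_value=0, clock_period_ps=350.0)
```

With these illustrative numbers the fresh path (300 ps) meets the 350 ps clock, while the aged path (360 ps) misses it and the stale value is captured.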

13 Mechanisms – Input Sequence Extraction: [Flow diagram; labels: Delay Aware Simulation; Applications; Input Sequences; Applications; DMR; Error?; Fault Vector; Fault Modeling; Wearout (delay as a function of time/Vdd); Input Sequence Extraction]

14 Mechanisms – Fault Injection: [Flow diagram; labels: Delay Aware Simulation; Applications; Input Sequences; Applications; DMR; Error?; Fault Vector; Fault Modeling; Wearout (delay as a function of time/Vdd); Input Sequence Extraction; Fault Injection & Deterministic Re-execution]
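Fault injection with deterministic re-execution can be pictured as replaying a recorded input sequence twice, once fault-free and once with a fault vector applied, and then comparing the two traces cycle by cycle. This is a minimal sketch under assumptions: the 8-bit adder stands in for a hardware module, and `golden_alu`, `run`, `first_divergence`, and the cycle-to-bitmask form of the fault vector are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def golden_alu(a: int, b: int) -> int:
    """Fault-free reference behavior: an 8-bit adder (illustrative module)."""
    return (a + b) & 0xFF


def run(inputs: List[Tuple[int, int]],
        fault_vector: Optional[Dict[int, int]] = None) -> List[int]:
    """Replay a recorded input sequence deterministically.

    fault_vector maps a cycle number to a bitmask that is XORed into the
    module output on that cycle (assumed encoding of an injected fault).
    """
    fault_vector = fault_vector or {}
    trace = []
    for cycle, (a, b) in enumerate(inputs):
        trace.append(golden_alu(a, b) ^ fault_vector.get(cycle, 0))
    return trace


def first_divergence(golden: List[int], faulty: List[int]) -> Optional[int]:
    """First cycle at which the two traces differ, or None if they match."""
    for cycle, (g, f) in enumerate(zip(golden, faulty)):
        if g != f:
            return cycle
    return None


recorded_inputs = [(1, 2), (3, 4), (5, 6)]
golden_trace = run(recorded_inputs)
faulty_trace = run(recorded_inputs, fault_vector={1: 0x01})
divergence_cycle = first_divergence(golden_trace, faulty_trace)
```

Because the replay is deterministic, any difference between the traces is attributable to the injected fault rather than to run-to-run variation.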

15 Mechanisms: Input Sequence Extraction; Delay Aware Simulation; Fault Modeling; Fault Injection and Deterministic Re-execution.

16 Outline: Motivation and Overview; How do we do it; Implementation (Input Sequence Extraction & Fault Injection, Delay Aware Simulation, Fault Modeling); Evaluations using PERSim; Questions.

17 Implementation: Input Sequence Extraction; Fault Injection and Deterministic Re-execution. Be able to run full programs.

18 Input Sequence Extraction: Input Sequence Extraction; Fault Injection and Deterministic Re-execution. Be able to observe signals on a cycle-by-cycle basis.

19 Fault Injection: Input Sequence Extraction; Fault Injection and Deterministic Re-execution. Fine-grained control over fault injection.

20 Implementation: Input Sequence Extraction; Fault Injection and Deterministic Re-execution. Automate: run multiple tests at the push of a button.

21 Implementation: 25 million cycles per second per board; full SPEC benchmarks.
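To put the quoted rate in perspective, the arithmetic below converts a cycle count into wall-clock time at 25 million cycles per second per board. It is a rough sketch: the 90-billion-cycle workload is a made-up figure, `simulation_hours` is a hypothetical helper, and the linear scaling with board count assumes the work splits into independent runs.

```python
SIM_CYCLES_PER_SECOND = 25_000_000  # per-board rate quoted on the slide


def simulation_hours(total_cycles: int, boards: int = 1) -> float:
    """Wall-clock hours needed to simulate total_cycles.

    Assumes independent runs spread evenly across boards, so the
    aggregate rate scales linearly with the number of boards.
    """
    seconds = total_cycles / (SIM_CYCLES_PER_SECOND * boards)
    return seconds / 3600.0


# Hypothetical workload of 90 billion cycles.
one_board_hours = simulation_hours(90_000_000_000)
four_board_hours = simulation_hours(90_000_000_000, boards=4)
```

Under these assumptions a 90-billion-cycle run takes one hour on a single board and a quarter of that on four.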

22 Fault Modeling: Reliability phenomena ⇒ behavior of gates? Wearout: Synopsys HSPICE+MOSRA.

23 Fault Modeling: Reliability phenomena ⇒ behavior of gates? Transient faults: charge accumulation model.
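The slide names a charge accumulation model without giving its details. As a generic illustration only, the sketch below treats a circuit node as accumulating charge from particle strikes, draining some charge each time step, and suffering an upset when the accumulated charge reaches a critical threshold. The function name, the drain term, the reset-after-upset behavior, and all numbers are assumptions, not the model used in PERSim.

```python
from typing import List


def simulate_node(charge_pulses_fc: List[float],
                  critical_charge_fc: float,
                  drain_per_step_fc: float) -> List[int]:
    """Return the time steps at which the node output is upset.

    charge_pulses_fc[i] is the charge (in fC) deposited by strikes at step i.
    Each step the node first drains drain_per_step_fc, then collects the new
    pulse; an upset occurs when the total reaches critical_charge_fc.
    """
    charge = 0.0
    upsets = []
    for step, pulse in enumerate(charge_pulses_fc):
        charge = max(0.0, charge - drain_per_step_fc) + pulse
        if charge >= critical_charge_fc:
            upsets.append(step)
            charge = 0.0  # assumption: the node is restored after the upset
    return upsets


upset_steps = simulate_node([0.0, 5.0, 0.0, 12.0, 0.0],
                            critical_charge_fc=10.0,
                            drain_per_step_fc=2.0)
```

In this example the small pulse at step 1 drains away, and only the larger pulse at step 3 pushes the node past the threshold.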

24 Fault Modeling: Reliability phenomena ⇒ behavior of gates? Permanent faults: probabilistic models.
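A probabilistic model of permanent faults can be sketched as sampling, for each gate, whether it fails and which value it sticks at. The example below is illustrative only: the uniform per-gate failure probability, the stuck-at-0/stuck-at-1 fault type, and the name `sample_permanent_faults` are assumptions rather than details taken from the slides. A fixed seed keeps the sample reproducible, which fits the deterministic re-execution described earlier in the deck.

```python
import random
from typing import Dict, List


def sample_permanent_faults(gate_names: List[str],
                            failure_prob: float,
                            seed: int) -> Dict[str, int]:
    """Pick the gates that suffer a permanent fault.

    Each gate fails independently with probability failure_prob.
    Returns a map from gate name to its stuck-at value (0 or 1).
    """
    rng = random.Random(seed)  # private, seeded generator for reproducibility
    faults = {}
    for name in gate_names:
        if rng.random() < failure_prob:
            faults[name] = rng.choice((0, 1))
    return faults


gates = ["and0", "or1", "xor2", "nand3"]
no_faults = sample_permanent_faults(gates, failure_prob=0.0, seed=1)
all_faulty = sample_permanent_faults(gates, failure_prob=1.0, seed=1)
repeat = sample_permanent_faults(gates, failure_prob=1.0, seed=1)
```

The resulting map could then drive a fault injector that forces each listed gate's output to its stuck-at value.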

25 Implementation: [Flow diagram; labels: Delay Aware Simulation; Applications; Input Sequences; Applications; DMR; Error?; Fault Vector; Fault Modeling] Wearout: HSPICE+MOSRA. Transient faults: charge accumulation model. Permanent faults: probabilistic models.

26 Outline: Motivation and Overview; How do we do it; Implementation; Evaluations using PERSim (Reliability Techniques, Key Results); Questions.

27 Reliability Techniques: Circuit failure prediction (wearout): FIRST [Smolens et al., SELSE ’07], WearMon [Zandian et al., DSN ’12], Online Wearout Prediction [Blome et al., MICRO ’07]. Transient fault analysis: gate-level modeling of particle strikes; application-level impact analysis. Permanent fault detection: Sampling-DMR [Nomura et al., ISCA ’11].

28 Key Results: Circuit failure prediction (wearout): gates in non-critical paths not covered; PERSim enables full processor coverage; modeling hole covered. Transient fault analysis: accurate modeling of particle strikes on individual gates ⇒ impact on full programs; cross-layer transient fault analysis. Permanent fault detection: cycle-by-cycle error traces running full programs; fine-grained signal visibility.

29 Executive Summary: PERSim: model physical effects at the gate level; see the impact running full programs on a full processor; at high simulation speeds (25 million cycles per second); with good signal observability; fine-grained control over fault injection. Demonstration: evaluation of 4 recently proposed techniques; end-to-end transient fault analysis. www.persim.org

30 Backup slides

31 Limitations: Related faults: interaction between faults not captured. OpenRISC: simple, in-order processor. Limited online visibility on current state/program progress. ZedBoard memory footprint/Zynq FPGA size. Determinism: requires careful manipulation of programs.

