Published by Gabriela Holbrook. Modified over 9 years ago.
1
Combining Error Detection and Transactional Memory for Energy-Efficient Computing below Safe Operation Margin
Gulay Yalcin, Anita Sobe, Alexey Voronin, Jons-Tobias Wamhoff, Derin Harmanci, Adrián Cristal, Osman Unsal, Pascal Felber, Christof Fetzer
PDP2014, Turin, Italy, 13 February 2014
This project and the research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement n° 318693.
2
Combining Error Detection and TM for Energy-Efficient Computing below Safe Operation Margin
Dark Silicon Phenomenon
- The number of transistors on a chip can keep increasing.
- To stay within the chip’s power budget, some of them must remain “dark”.
- One solution: downscale the voltage.
3
How about Reliability?
[Figure labels: Power, Reliability, Performance]
When Vdd is reduced, the error rate increases exponentially [1].
Our goal: investigate how far the voltage can be reduced while energy consumption, including the cost of error recovery, still goes down.
[1] Dan Ernst et al. “Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation.” In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, pages 7–18, 2003.
4
Agenda / Overview
- Motivation
- Experiment: Scaling Vdd in a Real System
- Basics of Reliability
- Error Recovery with TM
- Error Detection Schemes
- Analysis
- Conclusion
5
Reducing Vdd in a Real System
- AMD FX-6100 6-core CPU
- CPU-heavy execution
- Every 10 seconds, reduce Vdd by 12.5 mV
- Monitor: incorrect results, system crashes, Machine Check Architecture (MCA)
The system encounters errors that MCA cannot correct after only a 10% reduction in Vdd. The errors are in the instruction cache (37%), the execution unit (61%), and others (less than 2%).
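The stepping procedure on this slide can be sketched as a simple schedule. This is only an illustration of the arithmetic: the nominal voltage of 1250 mV and the function name `vdd_schedule` are assumptions, since the slide gives only the 12.5 mV step, the 10-second interval, and the 10% reduction at which uncorrectable errors appeared.

```python
def vdd_schedule(nominal_mv, step_mv=12.5, interval_s=10, max_reduction=0.10):
    """Return (elapsed seconds, Vdd in mV) pairs until the reduction exceeds max_reduction."""
    schedule = []
    step = 0
    # Keep stepping down while the total reduction is within the allowed fraction.
    while step * step_mv <= nominal_mv * max_reduction:
        schedule.append((step * interval_s, nominal_mv - step * step_mv))
        step += 1
    return schedule


# Hypothetical nominal voltage; the slide does not state the real value.
steps = vdd_schedule(nominal_mv=1250.0)
for elapsed_s, vdd_mv in steps:
    print(f"t={elapsed_s:4d} s  Vdd={vdd_mv:7.1f} mV")
```

Under the assumed 1250 mV starting point, a 10% reduction is reached after ten steps, which shows how quickly such an experiment leaves the safe operating margin.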
6
Basics of Reliability
Error Recovery:
- Global Checkpointing
- Coordinated Local Checkpointing
- Un-coordinated Local Checkpointing
Error Detection:
- Replication
- Assertions/Invariants
- Symptom-Based
- Encoded Processing
Transactional Memory can provide lightweight Coordinated Local Checkpointing [2].
[2] Gulay Yalcin et al. “FaulTM: Fault Tolerance Using Hardware Transactional Memory.” DATE 2013.
7
TM provides checkpointing/rollback
[Figure labels: Processor 1, Checkpoint (Log Area), P2, P3, P4, Pn]
- TM write-sets log the tentative memory updates.
- Synchronize checkpoints: data versioning provides a synchronization mechanism between checkpoints.
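The idea that a write-set doubles as a checkpoint can be illustrated with a toy software model. This is a minimal sketch, not the hardware mechanism from the talk: the `Transaction` class and its methods are hypothetical, and it assumes tentative updates are buffered and only become visible at commit.

```python
class Transaction:
    """Toy transaction: buffers writes in a write-set until commit."""

    def __init__(self, memory):
        self.memory = memory   # shared dict standing in for main memory
        self.write_set = {}    # tentative updates, not yet visible to others

    def read(self, addr):
        # Prefer our own tentative value, else fall back to the committed one.
        return self.write_set.get(addr, self.memory.get(addr, 0))

    def write(self, addr, value):
        self.write_set[addr] = value

    def commit(self):
        # Make all tentative updates visible at once.
        self.memory.update(self.write_set)
        self.write_set.clear()

    def abort(self):
        # Rollback: discard tentative updates; committed memory is untouched.
        self.write_set.clear()


memory = {"x": 1}

tx = Transaction(memory)
tx.write("x", 99)
tx.abort()              # e.g. an error was detected before commit
after_abort = memory["x"]

tx = Transaction(memory)
tx.write("x", tx.read("x") + 1)
tx.commit()             # no error detected: publish the update
after_commit = memory["x"]
```

Because an aborted transaction never touched committed memory, the start of the transaction acts as the checkpoint and the abort as the rollback.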
8
Error Detection Schemes - Replication
- Execute instruction streams multiple times.
- Compare the results of the executions.
- With TM, fewer comparisons are needed.
- Dual/Triple Modular Redundancy
Pro: high error detection rate
Con: high energy overhead
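Replication with a single comparison at the end of a unit of work can be sketched as follows. The helper names (`run_replicated`, `flaky_add`) and the explicit fault offsets are hypothetical and exist only to make the behaviour visible; results are assumed to be hashable so they can be counted.

```python
from collections import Counter


def run_replicated(fn, args, replicas=3):
    """Run fn several times and compare the results once at the end."""
    results = [fn(*args) for _ in range(replicas)]
    value, votes = Counter(results).most_common(1)[0]
    if votes == replicas:
        return value, "ok"          # all replicas agree
    if votes > replicas // 2:
        return value, "corrected"   # a strict majority outvotes the faulty replica
    return None, "detected"         # mismatch without a majority: roll back and retry


def flaky_add(faults):
    """Build an adder whose successive calls are offset by the given fault values."""
    fault_iter = iter(faults)

    def add(a, b):
        return a + b + next(fault_iter)

    return add


tmr = run_replicated(flaky_add([0, 1, 0]), (2, 3), replicas=3)  # one faulty replica of three
dmr = run_replicated(flaky_add([0, 1]), (2, 3), replicas=2)     # one faulty replica of two
```

With three replicas a single fault is outvoted, while with two replicas it can only be detected, which is why the two-replica case needs a recovery mechanism such as a transaction abort.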
9
Error Detection Schemes - Assertions/Invariants
- Assertions: conditions that refer to the current and previous state of the program.
- They check the program state.
- They can be added manually or automatically.
- TM facilitates inserting invariants.
Ex: [example figure missing]
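The example from the original slide is not available here. As a stand-in, the following hypothetical sketch shows one way an invariant over the previous and current state could gate a transaction commit; `run_transaction` and `non_decreasing` are illustrative names, and the shallow copy is a simplification.

```python
def run_transaction(state, body, invariant):
    """Run body on a copy of state; commit only if invariant(old, new) holds."""
    tentative = dict(state)      # shallow copy acts as the tentative version
    body(tentative)
    if invariant(state, tentative):
        state.update(tentative)  # commit
        return True
    return False                 # abort: state is left unchanged


def non_decreasing(old, new):
    """Hypothetical invariant: the counter must never go backwards."""
    return new["count"] >= old["count"]


state = {"count": 5}
ok = run_transaction(state, lambda s: s.update(count=s["count"] + 1), non_decreasing)
bad = run_transaction(state, lambda s: s.update(count=-7), non_decreasing)  # simulated corruption
```

Checking the invariant at the commit point means a violated condition simply prevents the tentative state from becoming visible.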
10
Error Detection Schemes - Symptoms
- Monitor program execution for symptoms of hardware faults.
- Symptoms: mispredictions in high-confidence branches, high OS activity, fatal traps (e.g. undefined instruction code).
- Reliability at a low cost.
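A symptom check over a window of execution might look like the sketch below. The counters, the `WindowStats` type, and both thresholds are assumptions made for illustration; the slide names the symptom classes but gives no concrete values.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Hypothetical counters collected over one monitoring window."""
    high_conf_mispredicts: int
    os_instructions: int
    total_instructions: int
    fatal_trap: bool


def has_symptom(stats, max_mispredicts=2, max_os_fraction=0.5):
    """Return True if the window shows any of the three symptom classes."""
    if stats.fatal_trap:
        return True  # e.g. undefined instruction code
    if stats.high_conf_mispredicts > max_mispredicts:
        return True  # high-confidence branches should rarely mispredict
    if stats.total_instructions == 0:
        return False
    # Unusually high OS activity relative to the total work in the window.
    return stats.os_instructions / stats.total_instructions > max_os_fraction


clean = WindowStats(high_conf_mispredicts=0, os_instructions=50,
                    total_instructions=1000, fatal_trap=False)
suspect = WindowStats(high_conf_mispredicts=5, os_instructions=50,
                      total_instructions=1000, fatal_trap=False)
```

The check is cheap because it only inspects counters, which matches the low-cost character of symptom-based detection, at the price of missing faults that produce no visible symptom.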
11
Error Detection Schemes - Encoded Processing
- Apply software coding (ECC-like) techniques.
- Redundancy is added by applying arithmetic codes to the values.
- Arithmetic codes: AN, ANBDmem, etc.
- With TM, the validation of a code word can be deferred until a transaction (TX) commits.
Ex: [example figure missing]
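The original example is not available here. As an illustration of the simplest arithmetic code mentioned, the sketch below uses an AN code: a value x is stored as A * x, sums of code words remain code words, and validity is checked once at the end rather than after every operation. The constant `A` and the function names are assumptions chosen for the example.

```python
A = 58659  # odd constant chosen for illustration only


def encode(x):
    return x * A


def is_valid(code_word):
    # Every valid code word is a multiple of A.
    return code_word % A == 0


def decode(code_word):
    if not is_valid(code_word):
        raise ValueError("invalid code word: possible hardware error")
    return code_word // A


def encoded_sum(code_words):
    """Add encoded values without checking; validation is deferred to the caller."""
    total = 0
    for c in code_words:
        total += c
    return total


total = encoded_sum([encode(20), encode(22)])
result = decode(total)      # single check, e.g. at transaction commit

corrupted = total ^ (1 << 3)  # simulate a single bit flip in the result
```

A single bit flip changes the value by a power of two, which is never a multiple of an odd A greater than one, so the deferred check still catches it.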
12
Comparing Error Detection Schemes
13
Analysis
- Gem5 full-system simulator
- 4 in-order cores at 1 GHz
- x86 ISA
- 64 KB L1 data and instruction caches
- Unified 2 MB L2 cache
- SPLASH2 benchmark suite
14
Energy Analysis
$E \approx C \times V_{dd}^{2}$
[Figure labels: Vdd, Error-free Overhead, Recovery Overhead, Fault Injection, TX size, Error Detection Rate]
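The trade-off behind this slide can be made concrete with a small back-of-the-envelope model. It assumes energy scales with the square of Vdd, as in the relation above, and that detection and recovery add extra work expressed as fractions of the error-free execution. That additive overhead model and all numbers below are assumptions for illustration, not results from the talk.

```python
def relative_energy(vdd_ratio, detection_overhead, recovery_overhead):
    """Energy relative to running at nominal Vdd with no detection or recovery.

    vdd_ratio: reduced Vdd divided by nominal Vdd (e.g. 0.9 for a 10% reduction).
    detection_overhead, recovery_overhead: extra work as fractions of the
    error-free execution (simplifying assumption).
    """
    return vdd_ratio ** 2 * (1.0 + detection_overhead + recovery_overhead)


def saves_energy(vdd_ratio, detection_overhead, recovery_overhead):
    return relative_energy(vdd_ratio, detection_overhead, recovery_overhead) < 1.0


# Hypothetical scenarios at a 10% voltage reduction.
cheap_detection = relative_energy(0.90, detection_overhead=0.10, recovery_overhead=0.05)
costly_detection = relative_energy(0.90, detection_overhead=0.30, recovery_overhead=0.00)
```

In this model, lowering the voltage only pays off while the combined detection and recovery overhead stays below the quadratic saving, which is the edge the analysis sets out to find.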
15
Energy Reduction
16
Reliability of the System
17
Conclusion
The energy consumption of CPUs can be reduced if we have efficient hardware support for Transactional Memory and for Error Detection.
18
Future Work: Combining DMR and Symptoms
19
Thanks!