
EXPERT: Effective and Flexible Error Protection by Redundant Multithreading
Hwisoo So*, Moslem Didehban#, Yohan Ko*, Aviral Shrivastava#, Kyoungwoo Lee*
*Department of Computer Science, Yonsei University, Seoul, Korea
#Compiler Microarchitecture Lab, Arizona State University, Tempe, AZ
Presented by Hwisoo So

EXPERT: Effective and Flexible Error Protection by Redundant Multithreading
- Background and motivation
- Problem: vulnerability in previous redundant multithreading (RMT)
- EXPERT: an improved RMT
- Experiments and conclusion

Soft and hard errors: main threats to reliability
- Reliability is now one of the most important design concerns
- Main sources of hardware unreliability:
  - Soft error, also known as transient fault
  - Hard error, also known as permanent fault
Photo-illustration: iStockphoto
16 November 2018 Hwisoo So / Yonsei University

Redundant multithreading: flexible and effective
- Software-level redundancy: flexible error detection
  - No hardware modification
  - Can provide flexibility
- Redundant multithreading: effective software-level detection
- Main approaches to software-level redundancy:
  - Instruction-level redundancy
  - Redundant multithreading
- Detection capability listed on the slide:
  - Soft error: can detect
  - Hard error: cannot detect
  - Control flow: difficult to detect

Previous RMT research
- SRMT: software-based redundant multithreading [Wang, CGO ‘07]
- COMET [Mitropoulou, Cases ‘16], DAFT [Zhang, IJPP ‘12]: improve runtime
- [Wadden, ISCA ‘14], [Gupta, DAC ‘17]: apply SRMT to GPUs
- RedThreads [Hukerikar, IJPP ‘16]: programmer-tunable SRMT for HPC
Figure labels: Leading thread, Trailing thread, Data Memory; identical computation; memory operation; checking values for memory operation

Experiment: Setup
- Benchmark: 9 applications from MiBench
  - Original / SRMT-protected
  - Without hardware support for inter-thread communication
- Fault injection on the cycle-accurate gem5 simulator
  - 6 components for fault injection
  - 1 error injection per execution
  - 500 soft errors and 100 hard errors per component per benchmark
- Fault coverage validation
  - Main target: number of silent data corruptions (SDCs)
  - With correction factor [Schirmeier, DSN ‘15]: (# of SDCs * runtime * # of cores)

Experiment: error coverage of SRMT
- Total: 27,000 soft-error and 5,400 hard-error injections (9 benchmarks x 6 components x 500 or 100), applied to both the unprotected and the SRMT-protected applications
- On average, SRMT requires ~3.9x runtime
- 2 cores are used for physically separated multithreading

Why does SRMT remain vulnerable?
- SRMT checking only checks an old snapshot of the registers
- Incorrect execution of a memory operation can go undetected
- Vulnerable input replication and vulnerable output comparison
Figure (Leading thread, Trailing thread, Communication Queue, Data Memory):
- #1 (load): Send addr; Checking; Load (Address of #1); Send result; Copying result (Result of #1)
- #2 (store): Send addr, data; Checking; Store
- Labels on the data memory for #2: Address of #3, Data of #3, Corrupted
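
The check-then-store weakness described above can be illustrated with a small functional model. This is only a sketch under stated assumptions: `srmt_store` and its `store_fault` hook are hypothetical names introduced here, and the two threads are collapsed into one sequential function. It is not the SRMT implementation from [Wang, CGO ‘07].

```python
def srmt_store(memory, addr, data, addr_star, data_star, store_fault=None):
    """Model one SRMT-protected store (hypothetical helper).

    The trailing thread compares the values sent by the leading thread,
    which are a snapshot of its registers, BEFORE the store executes.
    `store_fault` is an assumed hook that corrupts (addr, data) after the
    check has passed, while the store itself is executing.
    Returns True if the check detected a mismatch.
    """
    if addr != addr_star or data != data_star:
        return True  # mismatch detected, store is not performed
    if store_fault is not None:
        addr, data = store_fault(addr, data)  # error strikes after the check
    memory[addr] = data
    return False


memory = {0x100: 0, 0x104: 0}
# The address is corrupted while the store executes, after the check passed.
detected = srmt_store(memory, 0x100, 42, 0x100, 42,
                      store_fault=lambda a, d: (a + 4, d))
print(detected, memory[0x100], memory[0x104])
```

In this model the check passes, the store lands at the wrong address, and nothing reports it, which is the undetected case the slide points at.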

EXPERT: Reliable software-level RMT
- The Main Thread and the Checker Thread perform identical computation
- Load: the main thread executes Load data ← [addr]; the checker thread executes Load data* ← [addr*] (data for load comes from the Data Memory)
- ①: sync point reached by the checker thread
- ②: the main thread waits until the checker reaches ①, then executes Store data → [addr]; the result of the store may be corrupted
- ③: the checker thread waits until ② is done, then executes Load temp* ← [addr*] (temp* ← result of store) and Check temp*, data*
- Checker thread label in the figure: Store data* → [addr*]
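
The two waits (② and ③) can be sketched with ordinary threads. The following is a minimal illustration, not the EXPERT compiler transformation: the function name `expert_add4`, the fixed address, and the `store_fault` hook are assumptions made for this example, and `threading.Event` merely stands in for whatever synchronization the real scheme uses.

```python
import threading


def expert_add4(initial, store_fault=None):
    """Run 'load, add 4, store' on a main and a checker thread.

    `store_fault` is a hypothetical hook that corrupts the value written
    by the main thread's store. Returns (mismatch_detected, final_value).
    """
    ADDR = 0x100
    memory = {ADDR: initial}
    checker_reached_1 = threading.Event()  # checker reached sync point (1)
    store_done = threading.Event()         # main thread finished store (2)
    outcome = {}

    def main_thread():
        r0 = memory[ADDR]
        r1 = r0 + 4
        checker_reached_1.wait()           # (2) waits until checker reaches (1)
        value = r1 if store_fault is None else store_fault(r1)
        memory[ADDR] = value
        store_done.set()

    def checker_thread():
        r0_star = memory[ADDR]             # replicated load sees the old value
        r1_star = r0_star + 4
        checker_reached_1.set()            # (1)
        store_done.wait()                  # (3) waits until (2) is done
        temp_star = memory[ADDR]           # load-back
        outcome["mismatch"] = temp_star != r1_star

    threads = [threading.Thread(target=main_thread),
               threading.Thread(target=checker_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome["mismatch"], memory[ADDR]


print(expert_add4(1000))
print(expert_add4(1000, store_fault=lambda v: v ^ 1))
```

Because the store cannot happen before the checker's load, and the load-back cannot happen before the store, the result does not depend on thread scheduling.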

EXPERT: Store Packing Optimization
- 2-way sync between the Main Thread and the Checker Thread for every store: ~7.2x runtime on average
- If there is no dependency between ①, ②, and ③
- Expert checking needs to keep
- “Store Packing” is possible if there is no memory dependency for both STORE and LOAD
- ~43% performance improvement
Figure labels: Main Thread, Checker Thread; Wait, Store, Notify, Notify, Wait, Check; stores ① ② ③ and checks ❶ ❷ ❸, shown interleaved (① ❶ ② ❷ ③ ❸) and packed
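
As a quick consistency check on the two runtime figures above, the snippet below assumes that "~43% performance improvement" means a 1.43x speedup over the per-store sync version. That reading is an assumption, not something the slide states; under it, the result lines up with the ~5.0x runtime reported in the conclusion.

```python
per_store_sync_runtime = 7.2   # x original runtime, 2-way sync on every store
improvement = 0.43             # "~43% performance improvement", read as speedup

# Assumed interpretation: new runtime = old runtime / (1 + improvement)
packed_runtime = per_store_sync_runtime / (1 + improvement)
print(round(packed_runtime, 1))
```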

Experiment: Setup
- Benchmark: 9 applications from MiBench
  - Original / SRMT-protected / EXPERT-protected
- Fault injection on the cycle-accurate gem5 simulator
  - 6 components for fault injection
  - 1 error injection per execution
  - 500 soft errors and 100 hard errors per component per benchmark
  - Total number of injections: 81,000 soft errors and 16,200 hard errors
- Fault coverage validation
  - Main target: number of silent data corruptions (SDCs)
  - With correction factor [Schirmeier, DSN ‘15]: (# of SDCs * runtime * # of cores)
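
The injection totals follow directly from the setup numbers, and the correction factor is a plain product. The sketch below reproduces both; the constant and function names are chosen for this example, and the sample inputs to `weighted_sdc` are made up purely to show the call.

```python
BENCHMARKS = 9
COMPONENTS = 6
SOFT_PER_COMPONENT = 500   # soft errors per component per benchmark
HARD_PER_COMPONENT = 100   # hard errors per component per benchmark
VERSIONS = 3               # Original, SRMT-protected, EXPERT-protected

soft_per_version = BENCHMARKS * COMPONENTS * SOFT_PER_COMPONENT
hard_per_version = BENCHMARKS * COMPONENTS * HARD_PER_COMPONENT
soft_total = soft_per_version * VERSIONS
hard_total = hard_per_version * VERSIONS


def weighted_sdc(sdc_count, relative_runtime, cores):
    """Correction factor as written on the slide: SDCs * runtime * cores."""
    return sdc_count * relative_runtime * cores


print(soft_per_version, hard_per_version, soft_total, hard_total)
print(weighted_sdc(10, 2.0, 2))  # illustrative inputs only
```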

Experiment: SDC coverage validation
Chart: normalized number of SDCs (log scale)

Version    Soft   Hard   Total
Original   6638   423    7,061 (21.79%)
SRMT       1145   165    1,310 (4.04%)
EXPERT     20     0      20 (0.062%)
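
The percentages are consistent with dividing each total by the 32,400 injections per version (27,000 soft plus 5,400 hard). The short check below recomputes them and the SRMT-to-EXPERT ratio; the variable names are introduced here for illustration.

```python
INJECTIONS_PER_VERSION = 27_000 + 5_400  # soft + hard injections per version

sdc = {
    "Original": {"soft": 6638, "hard": 423},
    "SRMT": {"soft": 1145, "hard": 165},
    "EXPERT": {"soft": 20, "hard": 0},
}

totals = {name: c["soft"] + c["hard"] for name, c in sdc.items()}
rates = {name: 100 * t / INJECTIONS_PER_VERSION for name, t in totals.items()}
srmt_over_expert = totals["SRMT"] / totals["EXPERT"]

for name in sdc:
    print(f"{name}: {totals[name]} SDCs ({rates[name]:.3f}%)")
print(f"SRMT / EXPERT = {srmt_over_expert}")
```

The ratio of 65.5 corresponds to the "65x better SDC coverage" stated in the conclusion.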

Conclusion
- Improved soft and hard error detection
  - With load-back checking and load replication on redundant multithreading
  - An additional sync scheme is needed
  - 65x better SDC coverage compared to SRMT
- Limitations
  - Runtime becomes ~5.0x on average, even with sync optimization (SRMT: 3.9x on average)
  - Can be improved with hardware support for communication
  - SDC cases on silent stores

References
[Wang, CGO ‘07] C. Wang et al., “Compiler-managed software-based redundant multi-threading for transient fault detection,” in CGO, 2007.
[Mitropoulou, Cases ‘16] K. Mitropoulou et al., “Comet: communication-optimised multithreaded error-detection technique,” in CASES. ACM, 2016.
[Zhang, IJPP ‘12] Y. Zhang et al., “DAFT: Decoupled Acyclic Fault Tolerance,” International Journal of Parallel Programming, 2012.
[Wadden, ISCA ‘14] J. Wadden et al., “Real-world design and evaluation of compiler-managed gpu redundant multithreading,” in ISCA. IEEE, 2014.
[Gupta, DAC ‘17] M. Gupta et al., “Compiler techniques to reduce the synchronization overhead of gpu redundant multithreading,” in DAC, 2017.
[Hukerikar, IJPP ‘16] S. Hukerikar et al., “Redthreads: An interface for application-level fault detection/correction through adaptive redundant multithreading,” IJPP, 2016.
[Schirmeier, DSN ‘15] H. Schirmeier et al., “Avoiding pitfalls in fault-injection based comparison of program susceptibility to soft errors,” in DSN, 2015.

Extra slides

Soft error and hard error
- Soft error: temporary bit flip
- Hard error: permanent bit fault
Soft error occurs while executing #1:
  #1: R0 = R1 + R2, with R1 = 2 and R2 = 4; adder result R0: 6 → 7
  #2: R3 = R4 + R5, with R4 = 4 and R5 = 4; R3 = 8
Hard error (this adder always sets the last bit of the result to 1):
  #1: R0 = R1 + R2, with R1 = 2 and R2 = 4; adder result R0: 6 → 7
  #2: R3 = R4 + R5, with R4 = 4 and R5 = 4; R3: 8 → 9
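
The two fault types in the adder example can be mimicked in a few lines. This is a toy model: the `adder` function and its `flip_bit` and `stuck_lsb` parameters are names invented for this sketch, with `flip_bit` standing for a one-off transient flip and `stuck_lsb` for the permanent "last bit is always 1" fault.

```python
def adder(a, b, flip_bit=None, stuck_lsb=False):
    """Toy adder. `flip_bit` injects a transient flip of that bit position
    for this one operation; `stuck_lsb` models a permanent fault that
    forces the lowest result bit to 1 on every operation."""
    result = a + b
    if flip_bit is not None:
        result ^= 1 << flip_bit
    if stuck_lsb:
        result |= 1
    return result


# Soft error: only instruction #1 is affected.
soft = [adder(2, 4, flip_bit=0), adder(4, 4)]
# Hard error: every instruction that uses the adder is affected.
hard = [adder(2, 4, stuck_lsb=True), adder(4, 4, stuck_lsb=True)]
print(soft, hard)
```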

SRMT: Error cases
Load in SRMT protection (Leading thread, Trailing thread, Data Memory):
- Check addr, addr*: fine
- Load data ← [addr]
- Copy data* ← data
Store in SRMT protection (Leading thread, Trailing thread, Data Memory):
- Check addr, addr*: fine
- Check data, data*
- Store data → [addr]: corrupted

EXPERT: Removing Vulnerability from LOAD
- Replicating the load operation on the checker thread
  - Main Thread: load data ← [addr]
  - Checker Thread: load data* ← [addr*]
  - NOTE: the checker thread accesses memory using its local registers
- A soft error on a load operation can corrupt only one thread
  - The system can detect the mismatch, as the other thread is clean
- Checking for load operations is not necessary
  - Only store operations can propagate the effect of an error
  - The mismatch will be found by the later checking for the store operation

EXPERT: Load-back checking against errors
If an error corrupts the data of a store operation:
- Main Thread: Store data → [addr] (wrong result in the Data Memory)
- Checker Thread: Load temp* ← [addr*]; Cmp temp*, data*
If an error corrupts the address of a store operation:
- Main Thread: Store data → [addr] (the intended location in the Data Memory is not updated)
- Checker Thread: Load temp* ← [addr*]; Cmp temp*, data*
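
Both cases can be reproduced with a sequential model of one store followed by the load-back comparison. This is a sketch only: `load_back_check` is a hypothetical helper, memory is a plain dictionary, and the corrupted values are passed in directly rather than produced by a fault injector.

```python
def load_back_check(memory, addr, data, addr_star, data_star):
    """Main thread stores (addr, data), which may be corrupted; the checker
    then loads back from its own addr* and compares with its own data*.
    Returns True if a mismatch is detected."""
    memory[addr] = data                # main thread: Store data -> [addr]
    temp_star = memory.get(addr_star)  # checker: Load temp* <- [addr*]
    return temp_star != data_star      # checker: Cmp temp*, data*


clean = load_back_check({0x100: 0}, 0x100, 42, 0x100, 42)
bad_data = load_back_check({0x100: 0}, 0x100, 43, 0x100, 42)
bad_addr = load_back_check({0x100: 0, 0x104: 0}, 0x104, 42, 0x100, 42)
print(clean, bad_data, bad_addr)
```

A corrupted data value shows up as a wrong value at `addr*`, and a corrupted address shows up as `addr*` not having been updated; in both cases the comparison fails.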

Silent Store Problem
- Silent store: if the previous value in memory is the same as the data of the store, the store does not change memory
- If the address of a silent store is corrupted, EXPERT cannot detect the memory corruption
Figure: the Main Thread executes Store data → [addr]; the value in the Data Memory is already the same as data; the Checker Thread executes Load temp* ← [addr*] and Cmp temp*, data*
Speaker notes: the earlier text was removed, so talk about synchronization here instead. Also show an example of an error occurring in a store or load.
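
The silent-store blind spot is easy to see in a small model. In the sketch below, `store_with_bad_address` is a hypothetical helper: the main thread's store lands at a corrupted address, while the checker loads back from the intended one.

```python
def store_with_bad_address(memory, addr, data, faulty_addr):
    """The store is redirected to `faulty_addr` by an address error; the
    checker loads back from the intended `addr` and compares with `data`.
    Returns True if the load-back check detects a mismatch."""
    memory[faulty_addr] = data
    return memory[addr] != data


# Normal store: the intended location still holds the old value, so the
# check fails and the error is detected.
mem_a = {0x100: 5, 0x104: 9}
detected_normal = store_with_bad_address(mem_a, 0x100, 7, 0x104)

# Silent store: the intended location already holds the value being stored,
# so the check passes even though 0x104 was overwritten.
mem_b = {0x100: 5, 0x104: 9}
detected_silent = store_with_bad_address(mem_b, 0x100, 5, 0x104)
print(detected_normal, detected_silent, mem_b[0x104])
```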

EXPERT: Memory Coherence Problem
In LOAD and STORE with the same address:
- Main Thread: Load R0 ← [R4]; R1 = R0 + 4; Store R1 → [R4]
- Checker Thread: Load R0* ← [R4*]; R1* = R0* + 4; DO CHECKING
- Figure labels: Data Memory value 1000, then 1004; steps ①, ②, ③; Not Done
In STORE and its related CHECKING:
- Main Thread: Store R1 → [R4] (R1 = 1004)
- Checker Thread: Load temp* ← [R4*]; Cmp temp*, data* (data* = 1004)
- Figure labels: Data Memory value 1000, then 1004; steps ①, ②; Not Done
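
One reading of the two cases is that, without synchronization, an unlucky ordering makes the check fail even though no error occurred. The sketch below fixes the interleaving by hand instead of using real threads; the function names and the ordering flags are assumptions introduced for this illustration.

```python
ADDR = 0x100  # stands in for the address held in R4 / R4*


def check_after_load_store_race(store_before_checker_load):
    """Case 1: LOAD and STORE to the same address.
    Returns True if the check reports a mismatch."""
    memory = {ADDR: 1000}
    r1 = memory[ADDR] + 4                # main: Load R0, R1 = R0 + 4
    if store_before_checker_load:
        memory[ADDR] = r1                # main store races ahead
        r1_star = memory[ADDR] + 4       # checker loads 1004, computes 1008
    else:
        r1_star = memory[ADDR] + 4       # checker loads 1000, computes 1004
        memory[ADDR] = r1
    return memory[ADDR] != r1_star       # load-back and compare


def check_after_store_race(check_before_store_done):
    """Case 2: STORE and its related CHECKING.
    Returns True if the check reports a mismatch."""
    memory = {ADDR: 1000}
    data_star = 1004
    if check_before_store_done:
        temp_star = memory[ADDR]         # store not done yet: reads 1000
        memory[ADDR] = 1004
    else:
        memory[ADDR] = 1004
        temp_star = memory[ADDR]
    return temp_star != data_star


print(check_after_load_store_race(True), check_after_load_store_race(False))
print(check_after_store_race(True), check_after_store_race(False))
```

In this model both bad orderings produce a mismatch without any fault, which is why the main thread has to wait for the checker's load and the checker has to wait for the store.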

Two approaches to compiler-level error detection
Original code: data = data + 4
In-thread replication (replicates instructions):
- data = data + 4 and data* = data* + 4 both run on the adder of core i
- Wrong result and correct result: mismatch can be detected
- Wrong result and wrong result: mismatch cannot be detected
Redundant multithreading (replicates the execution thread):
- Thread 0: data = data + 4 on the adder of core i (wrong result)
- Thread 1: data* = data* + 4 on the adder of core j (correct result)
- Mismatch can be detected
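
The difference between the two approaches under a permanent adder fault can be shown with a toy model. The sketch below assumes the same stuck-at fault as the earlier adder slide (lowest result bit forced to 1); `make_adder`, `core_i`, and `core_j` are names made up for this example.

```python
def make_adder(stuck_lsb):
    """Return an adder function; if `stuck_lsb` is True, it models a
    permanent fault that forces the lowest bit of every result to 1."""
    def add(a, b):
        result = a + b
        return result | 1 if stuck_lsb else result
    return add


core_i = make_adder(stuck_lsb=True)    # core with the faulty adder
core_j = make_adder(stuck_lsb=False)   # healthy core

data = data_star = 2

# In-thread replication: both copies execute on the same (faulty) core.
in_thread_mismatch = core_i(data, 4) != core_i(data_star, 4)

# Redundant multithreading: the redundant copy executes on another core.
rmt_mismatch = core_i(data, 4) != core_j(data_star, 4)

print(in_thread_mismatch, rmt_mismatch)
```

With both copies on core i the two wrong results agree, so nothing is flagged; with the copy on core j the results differ and the mismatch is visible.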