CS717 1 Hardware Fault Tolerance Through Simultaneous Multithreading (part 2) Jonathan Winter.

CS717 2 SMT + Fault Tolerance Papers Eric Rotenberg, "AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors", Symposium on Fault-Tolerant Computing (FTCS-29), 1999. Steven K. Reinhardt and Shubhendu S. Mukherjee, "Transient Fault Detection via Simultaneous Multithreading", ISCA 2000. Shubhendu S. Mukherjee, Michael Kontz and Steven K. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives", ISCA 2002.

CS717 3 Outline 1. Background: SMT, hardware fault tolerance 2. AR-SMT: basic mechanisms, implementation issues, simulation and results 3. Transient Fault Detection via SMT: sphere of replication, basic mechanisms, comparison to AR-SMT, simulation and results 4. Redundant Multithreading Alternatives: realistic processor implementation, CRT, simulation and results 5. Fault Recovery 6. Next Lecture

CS717 4 Transient Fault Detection via SMT More detailed analysis of Simultaneous and Redundant Threading (SRT) Introduces the Sphere of Replication concept Explores the SRT design space Discussion of input replication Architecture for output comparison Performance-improving mechanisms More depth in simulation

CS717 5 Sphere of Replication Components inside sphere are protected against faults using replication External components must use other means of fault tolerance (parity, ECC, etc.) Inputs to sphere must be duplicated for each of the redundant processes Outputs of the redundant processes are compared to detect faults Simple to understand in lockstepping Larger sphere –more state to replicate –less input replication and output comparison
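The sphere-of-replication boundary can be illustrated with a toy Python model (illustrative only — the real sphere is hardware; `compute` and `inject_fault` are hypothetical names standing in for the replicated pipeline and a transient fault):

```python
def sphere_execute(inputs, compute, inject_fault=None):
    """Toy sphere of replication: inputs entering the sphere are replicated
    to two redundant copies of the computation; outputs leaving the sphere
    are compared, and any mismatch signals a detected fault."""
    out_a = compute(inputs)          # redundant copy A
    out_b = compute(inputs)          # redundant copy B
    if inject_fault is not None:
        out_b = inject_fault(out_b)  # model a transient fault in one copy
    if out_a != out_b:
        raise RuntimeError("fault detected at sphere boundary")
    return out_a                     # only checked outputs leave the sphere
```

Note how state outside the boundary (here, the caller) only ever sees a single, compared output — which is why components outside the sphere need other protection such as parity or ECC.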

CS717 6 Sphere of Replication (part 2) Size of sphere of replication –Two alternatives – with and without register file –Instruction and data caches kept outside

CS717 7 Input Replication Must ensure that both threads receive the same inputs to guarantee they follow the same path Instructions – assume no self-modification Cached load data –Out-of-order execution issue –Multiprocessor cache coherence issues Uncached load data – must synchronize External interrupts –Stall lead thread and deliver interrupt synchronously –Record interrupt delivery point and deliver later

CS717 8 Cached Load Data - ALAB Active Load Address Buffer (ALAB) –Delays cache block replacement or invalidation –ALAB is a table with an address tag, a counter, and a pending-invalidate bit –Counter tracks the trailing thread’s outstanding loads –Blocks cannot be replaced or invalidated until the counter is zero –Pending-invalidate bit set on an unevictable block –Leading thread stalls when the ALAB is full –Must detect and resolve deadlocks
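The ALAB bookkeeping above can be sketched as a toy Python model (a sketch under the slide's description, not the paper's hardware design — table layout and method names are illustrative):

```python
class ALAB:
    """Toy Active Load Address Buffer: a block may not be replaced or
    invalidated while the trailing thread has outstanding loads to it."""

    def __init__(self, size):
        self.size = size
        # block address -> [outstanding-load counter, pending-invalidate bit]
        self.entries = {}

    def leading_load(self, addr):
        """Leading thread load: bump the counter; stall if the table is full."""
        if addr not in self.entries:
            if len(self.entries) >= self.size:
                return False              # leading thread must stall
            self.entries[addr] = [0, False]
        self.entries[addr][0] += 1
        return True

    def trailing_load(self, addr):
        """Trailing thread load: decrement the counter; once it reaches
        zero the entry is freed and the cache may retry any invalidation."""
        entry = self.entries[addr]
        entry[0] -= 1
        if entry[0] == 0:
            del self.entries[addr]

    def try_invalidate(self, addr):
        """Cache eviction/invalidation request: allowed only when no
        trailing loads are outstanding; otherwise defer via the bit."""
        if addr in self.entries and self.entries[addr][0] > 0:
            self.entries[addr][1] = True  # pending-invalidate: retry later
            return False
        return True
```

The deadlock the slide mentions arises when the leading thread stalls on a full ALAB while the trailing thread cannot make progress to drain it — real hardware must detect and break that cycle; this sketch only exposes the stall condition.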

CS717 9 Cached Load Data - LVQ Load Value Queue (LVQ) –Explicit designation of leading and trailing thread –Only the leading thread issues loads and stores –Load addresses and values are forwarded to the trailing thread via the LVQ –Trailing thread executes loads in-order and non-speculatively (why?) –Input replication guaranteed –Simpler design and less pressure on the cache –Earlier fault detection –Constrains scheduling of trailing-thread loads
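A minimal LVQ sketch in Python (toy model for illustration; the FIFO-and-compare behavior follows the slide, everything else is an assumption):

```python
from collections import deque

class LoadValueQueue:
    """Toy LVQ: only the leading thread accesses the cache; the trailing
    thread consumes (address, value) pairs strictly in program order."""

    def __init__(self):
        self.queue = deque()

    def leading_load(self, addr, memory):
        value = memory[addr]              # the one real cache access
        self.queue.append((addr, value))  # forward to trailing thread
        return value

    def trailing_load(self, addr):
        """Trailing thread loads in-order and non-speculatively: the head
        of the queue must match its load address, else a fault is flagged."""
        lead_addr, value = self.queue.popleft()
        if lead_addr != addr:
            raise RuntimeError("fault detected: load address mismatch")
        return value
```

Because both threads see the exact same value for every load, input replication is guaranteed by construction — this is why the trailing thread must issue its loads in-order and non-speculatively, so they pair up with the queue entries.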

CS717 10 Output Comparison Store buffer used to verify address and value of stores about to commit –Trailing thread searches for a matching entry –Mismatch means a fault occurred Cached load values require no checking Uncached load values could have side effects –Issued non-speculatively, so stall the leading thread –Assumes uncached loads are always detected Register Check Buffer used to match register writebacks –3 register files required: two future files + an architectural file
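The store-buffer comparison can be sketched as a toy Python model (illustrative; real SRT hardware searches associatively and handles partial overlaps, which this FIFO sketch ignores):

```python
from collections import deque

class StoreBuffer:
    """Toy output comparison: the leading thread's stores wait in a buffer;
    a store commits to memory only when the trailing thread produces a
    matching (address, value) pair, otherwise a fault is signaled."""

    def __init__(self, size):
        self.size = size
        self.pending = deque()

    def leading_store(self, addr, value):
        if len(self.pending) >= self.size:
            return False                  # buffer full: leading thread stalls
        self.pending.append((addr, value))
        return True

    def trailing_store(self, addr, value, memory):
        lead = self.pending.popleft()
        if lead != (addr, value):
            raise RuntimeError("fault detected: store mismatch")
        memory[addr] = value              # both copies agree: commit
```

This is the mechanism behind the later result that the leading thread can stall on a full store queue — no store escapes the sphere until its redundant twin confirms it.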

CS717 11 Enhancing SRT Performance Slack Fetch –Maintain a constant lag between the threads’ execution –Lead thread updates branch and data predictors –Lead thread prefetches loads –Traditional SMT ICOUNT fetch policy is modified to maintain the slack Branch Outcome Queue –Deliver branch outcomes directly to the trailing thread –Trailing thread has no control speculation

CS717 12 AR-SMT versus SRT AR-SMT only has space redundancy in the functional units SRT can potentially have space redundancy across the pipeline AR-SMT is trace-processor-based while SRT is conventional The register file of the R-stream must be protected AR-SMT forwards load data values AR-SMT checks every instruction during fault detection SRT requires no operating system modifications AR-SMT doesn’t support uncached loads and stores or multiprocessor coherence The delay buffer performs the functions of the register check buffer and branch outcome queue All of main memory is in the AR-SMT sphere –Better fault coverage but very costly

CS717 13 Simulation Environment Modified SimpleScalar “sim-outorder” Long front-end pipeline because of out-of-order execution and SMT Simple approximation of a trace cache Used 11 SPEC95 benchmarks

CS717 14 Results Again, this paper only analyzes the performance impact of fault tolerance Baseline characterization –ORH-Dual: two pipelines, each with half the resources –SMT-Dual: replicated threads with no detection hardware ORH-Dual and SMT-Dual are 32% slower than SMT-Single

CS717 15 Slack Fetch & Branch Outcome Queue 10%, 14%, and 15% (27% max) performance improvements for SF, BOQ, and SF + BOQ Reduced memory stalls through prefetching Prevents the trailing thread from wasting resources by speculating Performance is better with a slack of 256 instructions than with 32 or 128

CS717 16 Input Replication Assumes output comparison is performed by an oracle Almost no performance penalty for a 64-entry ALAB or LVQ With a 16-entry ALAB or LVQ, benchmark performance degraded 8% and 5%, respectively

CS717 17 Output Comparison Assumes inputs are replicated by an oracle Leading thread can stall if the store queue is full A 64-entry store buffer eliminates almost all stalls A register check buffer of size 32, 64, or 128 entries degrades performance by 27%, 6%, and 1%, respectively

CS717 18 Overall Results Speedup of an SRT processor with a slack fetch of 256 instructions, a 128-entry branch outcome queue, a 64-entry store buffer, and a 64-entry load value queue: SRT demonstrates a 16% speedup on average (up to 29%) over a lockstepping processor with the “same” hardware

CS717 19 Multi-cycle and Permanent Faults Transient faults could potentially persist for multiple cycles and affect both threads Increasing the slack decreases this possibility Spatial redundancy can be increased by partitioning the functional units and forcing the threads to execute on different groups Performance loss for this approach is less than 2%
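The functional-unit partitioning idea can be sketched as a trivial assignment policy in Python (a hypothetical illustration of the slide's point, not the paper's actual steering logic):

```python
def assign_fu(thread, op_index, n_units=4):
    """Toy spatial-redundancy policy: split the functional units into two
    groups and pin each redundant thread to a different group, so a
    permanent fault in one unit cannot corrupt both copies of an
    instruction."""
    half = n_units // 2
    if thread == "leading":
        return op_index % half           # units 0 .. half-1
    return half + (op_index % half)      # units half .. n_units-1
```

Since the two threads draw from disjoint unit pools, no single faulty unit can produce matching wrong answers in both copies — at the cost of some scheduling flexibility, which is where the reported <2% performance loss comes from.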

CS717 20 Conclusions The sphere of replication helps the analysis of input replication and output comparison Keep the register file in the sphere The LVQ is superior to the ALAB (simpler) Slack fetch and the branch outcome queue mechanisms enhance performance The SRT fault-tolerance method performs 16% better on average than lockstepping