
Redundant Multithreading Techniques for Transient Fault Detection
Shubu Mukherjee (Intel), Michael Kontz (HP, current), Steve Reinhardt (Intel consultant, U. of Michigan)
Versions of this work have been presented at ISCA 2000 and ISCA 2002.

2 Transient Faults from Cosmic Rays & Alpha Particles
The problem is getting worse because of:
- decreasing feature size
- decreasing voltage (exponential dependence?)
- increasing number of transistors (Moore's Law)
- increasing system size (number of processors)
- no practical absorbent for cosmic rays

3 Fault Detection via Lockstepping (HP Himalaya)
Replicated microprocessors + cycle-by-cycle lockstepping.
[Figure: two microprocessors each execute the same instruction (R1 ← (R2)); inputs are replicated to both and outputs are compared before leaving the pair. Memory is covered by ECC, the RAID array by parity, and ServerNet by CRC, so lockstepping covers the processors themselves.]

4 Fault Detection via Simultaneous Multithreading
Can threads replace replicated, cycle-by-cycle lockstepped microprocessors?
[Figure: same system, but two threads on one processor each execute R1 ← (R2), with input replication and output comparison as before. Memory covered by ECC, RAID array by parity, ServerNet by CRC.]

5 Simultaneous Multithreading (SMT)
[Figure: instructions from Thread 1 and Thread 2 flow through a shared instruction scheduler to a shared pool of functional units.]
Examples: Alpha 21464, Intel Northwood.

6 Redundant Multithreading (RMT)
RMT = Multithreading + Fault Detection
- Multithreaded uniprocessor: multithreading (MT) = Simultaneous Multithreading (SMT); RMT = Simultaneous & Redundant Threading (SRT)
- Chip Multiprocessor (CMP): MT = multiple threads running on the CMP; RMT = Chip-Level Redundant Threading (CRT)

7 Outline
- SRT concepts & design
- Preferential Space Redundancy
- SRT Performance Analysis
  - Single- & multi-threaded workloads
- Chip-level Redundant Threading (CRT)
  - Concept
  - Performance analysis
- Summary
- Current & Future Work

8 Overview
SRT = SMT + Fault Detection
- Advantages
  - Piggybacks on an SMT processor with little extra hardware
  - Better performance than complete replication
  - Lower cost due to market volume of SMT & SRT
- Challenges
  - Lockstepping is very difficult with SRT
  - Must carefully fetch/schedule instructions from the redundant threads

9 Sphere of Replication
- Two copies of each architecturally visible thread
  - Co-scheduled on the SMT core
- Compare results: signal a fault if they differ
[Figure: leading and trailing threads inside the sphere of replication; the memory system (incl. L1 caches) lies outside, reached through output comparison on the way out and input replication on the way in.]
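To make the boundary concrete, here is a minimal C++ sketch (not from the papers; all names are illustrative) of the rule at the sphere's edge: a value may leave the sphere only after the leading and trailing copies agree.

#include <cstdint>
#include <cstdio>

// A store about to leave the sphere of replication.
struct StoreOut { uint64_t addr; uint64_t data; };

// Hypothetical comparison point at the sphere boundary: outputs may exit
// only if the redundant copies agree.
bool outputs_match(const StoreOut& lead, const StoreOut& trail) {
    return lead.addr == trail.addr && lead.data == trail.data;
}

int main() {
    StoreOut lead  = {0x120, 5};
    StoreOut trail = {0x120, 5};   // flip a bit here to emulate a transient fault
    std::puts(outputs_match(lead, trail)
                  ? "outputs agree; store may leave the sphere"
                  : "fault detected: redundant outputs differ");
}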

10 Basic Pipeline
[Figure: baseline pipeline — Fetch → Decode → Dispatch → Execute (with Data Cache access) → Commit.]

11 Load Value Queue (LVQ)
- Keeps the threads on the same path despite I/O or MP writes
- Out-of-order load issue is possible
[Figure: the LVQ carries leading-thread load values from the data cache to the trailing thread's execute stage.]
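A toy C++ model of the LVQ idea, assuming trailing-thread loads consume entries in program order; the class and method names are invented for illustration, not taken from the design.

#include <cstdint>
#include <deque>
#include <cassert>

struct LoadEntry { uint64_t addr; uint64_t value; };

// Toy Load Value Queue: the leading thread records each load; the trailing
// thread replays the value from the queue instead of reading the data cache,
// so intervening I/O or multiprocessor writes cannot make the threads diverge.
class LoadValueQueue {
    std::deque<LoadEntry> q;
public:
    void lead_push(uint64_t addr, uint64_t value) { q.push_back({addr, value}); }

    // Trailing thread's load: the address check also catches a corrupted
    // effective address in either thread.
    uint64_t trail_pop(uint64_t addr, bool& fault) {
        assert(!q.empty());
        LoadEntry e = q.front();
        q.pop_front();
        fault = (e.addr != addr);   // divergence => signal transient fault
        return e.value;
    }
};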

12 Store Queue Comparator (STQ)
- Compares store outputs before they reach the data cache
- Catches faults before they propagate to the rest of the system
[Figure: the STQ sits between the execute stage and the data cache.]

13 Store Queue Comparator (cont'd)
- Extends the residence time of leading-thread stores
  - Size constrained by the cycle-time goal
  - Base CPU statically partitions a single queue among threads
  - Potential solution: per-thread store queues
- Deadlock if the matching trailing store cannot commit
  - Several small but crucial changes avoid this
[Figure: a leading-thread store "st 5 → [0x120]" waits in the store queue; when the trailing thread's "st 5 → [0x120]" arrives, address & data are compared before the write goes to the data cache.]
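A sketch of the comparator discipline in the same toy C++ style (the queue structure and names are assumptions, not the base CPU's RTL):

#include <cstdint>
#include <deque>
#include <optional>
#include <cassert>

struct Store { uint64_t addr; uint64_t data; };

// Toy store-queue comparator: leading-thread stores wait for their trailing
// twin; only a matching pair is released to the data cache, so a faulty
// store never propagates outside the sphere of replication.
class StoreQueueComparator {
    std::deque<Store> pending;   // leading-thread stores awaiting comparison
public:
    void lead_store(const Store& s) { pending.push_back(s); }

    // Trailing stores arrive in the same program order as leading ones.
    // Returns the verified store, or nullopt if a mismatch (fault) is found.
    std::optional<Store> trail_store(const Store& s) {
        assert(!pending.empty());
        Store head = pending.front();
        pending.pop_front();
        if (head.addr != s.addr || head.data != s.data)
            return std::nullopt;   // mismatch: signal fault, do not write cache
        return head;               // verified: safe to write the data cache
    }
};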

14 Branch Outcome Queue (BOQ)
- Forwards leading-thread branch targets to the trailing thread's fetch stage
- 100% prediction accuracy in the absence of faults
[Figure: the BOQ carries committed branch outcomes from commit back to fetch.]
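The BOQ reduces to a FIFO of committed outcomes; a minimal sketch, assuming one target per branch and invented names:

#include <cstdint>
#include <deque>

// Toy Branch Outcome Queue: committed leading-thread branch targets become
// the trailing thread's "predictions", perfect in the fault-free case.
class BranchOutcomeQueue {
    std::deque<uint64_t> targets;
public:
    void lead_commit_branch(uint64_t target) { targets.push_back(target); }

    // Trailing fetch consumes the next committed outcome instead of using
    // a real branch predictor.
    bool trail_next_target(uint64_t& target) {
        if (targets.empty()) return false;   // trailing thread must wait
        target = targets.front();
        targets.pop_front();
        return true;
    }
};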

15 Line Prediction Queue (LPQ)
- The base Alpha CPU fetches chunks using line predictions
- Chunk = contiguous block of 8 instructions
[Figure: the LPQ carries line predictions from commit back to fetch.]

16 Line Prediction Queue (cont'd)
- Generate a stream of "chunked" line predictions
  - Every leading-thread instruction carries its I-cache coordinates
  - Commit logic merges these into fetch chunks for the LPQ
    - Independent of the leading thread's own fetch chunks
    - The commit-to-fetch dependence raised deadlock issues
[Example: 1F8: add, 1FC: load R1 ← (R2), 200: beq, 204: and, 208: bne, 20C: add — chunk 1 ends at the end of a cache line; chunk 2 ends at a taken branch.]
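One plausible way commit logic could close chunks, in the same C++ sketch style (assumed, not taken from the paper): a chunk ends when the PC stream breaks (a taken branch), a new cache line begins, or 8 instructions have accumulated. The 32-byte line granularity is an assumption implied by 8 four-byte instructions.

#include <cstdint>
#include <cstddef>
#include <vector>

struct LinePrediction { uint64_t start_pc; int count; };

// Merge a committed leading-thread PC stream into fetch chunks for the LPQ.
std::vector<LinePrediction> merge_chunks(const std::vector<uint64_t>& pcs) {
    std::vector<LinePrediction> chunks;
    for (std::size_t i = 0; i < pcs.size(); ++i) {
        bool sequential = !chunks.empty() && pcs[i] == pcs[i - 1] + 4;
        bool line_start = (pcs[i] % 32) == 0;   // assumed 32B line = 8 instrs
        if (chunks.empty() || !sequential || line_start || chunks.back().count == 8)
            chunks.push_back({pcs[i], 1});       // start a new chunk
        else
            ++chunks.back().count;               // extend the current chunk
    }
    return chunks;
}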

17 Line Prediction Queue (cont'd)
- Read-out on trailing-thread fetch is also complex
  - The base CPU's "thread chooser" gets multiple line predictions but ignores all but one
  - Fetches must be retried on an I-cache miss
- Tricky to keep the queue in sync with thread progress
  - Add a handshake to advance the queue head
  - Roll back the head on an I-cache miss
    - Track both the last attempted & last successful chunks
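The attempted/successful bookkeeping might look like this (a sketch assuming one fetch in flight; names invented):

#include <cstddef>

// Toy LPQ head pointer: advance speculatively as chunks are handed to fetch,
// commit the advance on a successful fetch, roll back on an I-cache miss.
class LPQHead {
    std::size_t attempted  = 0;   // chunks handed to fetch so far
    std::size_t successful = 0;   // chunks known to have fetched OK
public:
    std::size_t next_chunk() { return attempted++; }      // hand out a chunk
    void fetch_succeeded()   { successful = attempted; }  // handshake from fetch
    void icache_miss()       { attempted = successful; }  // roll back & retry
};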

18 Outline
- SRT concepts & design
- Preferential Space Redundancy
- SRT Performance Analysis
  - Single- & multi-threaded workloads
- Chip-level Redundant Threading (CRT)
  - Concept
  - Performance analysis
- Summary
- Current & Future Work

19 Preferential Space Redundancy
- SRT combines two types of redundancy
  - Time: same physical resource, different time
  - Space: different physical resource
- Space redundancy is preferable
  - Better coverage of permanent/long-duration faults
- Bias towards space redundancy where possible

20 PSR Example: Clustered Execution
- Base CPU has two execution clusters
  - Separate instruction queues and function units
  - Instructions are steered to a cluster in the dispatch stage
[Figure: "add r1,r2,r3" flows from fetch through decode to dispatch, which steers it to IQ 0/Exec 0 or IQ 1/Exec 1.]

21 PSR Example: Clustered Execution (cont'd)
- Leading-thread instructions record their cluster
  - The cluster bit is carried with the fetch chunk through the LPQ
  - and attached to the trailing-thread instruction
  - Dispatch sends the trailing instruction to the opposite cluster if possible
[Figure: "add r1,r2,r3 [0]" — the trailing copy carries cluster bit 0 through the LPQ, so dispatch steers it to cluster 1.]

22 PSR Example: Clustered Execution (cont'd)
- 99.94% of instruction pairs use different clusters
  - Full spatial redundancy for execution
  - No performance impact (occasional slight gain)
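A sketch of the dispatch-stage steering rule, with invented structures; stall handling when both issue queues are full is omitted:

#include <array>

struct IssueQueue { int free_slots; };

// Dispatch-stage steering for preferential space redundancy: send the
// trailing instruction to the cluster its leading twin did NOT use, falling
// back to time redundancy only when the preferred cluster is full.
int steer_trailing(std::array<IssueQueue, 2>& iq, int leading_cluster) {
    int preferred = 1 - leading_cluster;    // opposite cluster = space redundancy
    int chosen = (iq[preferred].free_slots > 0) ? preferred : leading_cluster;
    --iq[chosen].free_slots;                // caller guarantees a free slot exists
    return chosen;
}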

23 Outline
- SRT concepts & design
- Preferential Space Redundancy
- SRT Performance Analysis
  - Single- & multi-threaded workloads
- Chip-level Redundant Threading (CRT)
  - Concept
  - Performance analysis
- Summary
- Current & Future Work

24 SRT Evaluation
- Used SPEC CPU95, 15M instructions per thread
  - Constrained by the simulation environment → 120M instructions for 4 redundant thread pairs
- Eight-issue, four-context SMT CPU
  - 128-entry instruction queue
  - 64-entry load and store queues
    - Default: statically partitioned among active threads
  - 22-stage pipeline
  - 64KB 2-way set-associative L1 caches
  - 3MB 8-way set-associative L2 cache

25 SRT Performance: One Thread
- One logical thread → two hardware contexts
- Performance degradation = 30%
- A per-thread store queue buys an extra 4%

26 SRT Performance: Two Threads
- Two logical threads → four hardware contexts
- Average slowdown increases to 40%
- Only 32% with per-thread store queues

27 Outline
- SRT concepts & design
- Preferential Space Redundancy
- SRT Performance Analysis
  - Single- & multi-threaded workloads
- Chip-level Redundant Threading (CRT)
  - Concept
  - Performance analysis
- Summary
- Current & Future Work

28 Chip-Level Redundant Threading
- SRT is typically more efficient than splitting one processor into two half-size CPUs
- What if you already have two CPUs?
  - IBM Power4, HP PA-8800 (Mako)
- Conceptually easy to run these in lockstep
  - Benefit: full physical redundancy
  - Costs: latency through centralized checker logic; overheads (misspeculation etc.) incurred twice
- CRT combines the best of SRT & lockstepping
  - Requires multithreaded CMP cores

29 Chip-Level Redundant Threading (cont'd)
[Figure: CPU A runs leading thread A and trailing thread B; CPU B runs leading thread B and trailing thread A. Each leading thread forwards load values (LVQ), line predictions (LPQ), and stores for comparison to its trailing copy on the other core.]
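A purely structural C++ sketch of this cross-coupling (names illustrative):

#include <cstdio>

// Each core runs the leading thread of one program and the trailing thread
// of the other; checking traffic (LVQ fills, line predictions, stores to
// compare) therefore crosses between cores, giving full spatial redundancy
// without cycle-by-cycle lockstep.
struct Core { const char* leading; const char* trailing; };

int main() {
    Core cpu_a{"A", "B"};   // leading A, trailing B
    Core cpu_b{"B", "A"};   // leading B, trailing A
    std::printf("CPU A: lead %s / trail %s\n", cpu_a.leading, cpu_a.trailing);
    std::printf("CPU B: lead %s / trail %s\n", cpu_b.leading, cpu_b.trailing);
}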

30 CRT Performance
- With per-thread store queues, ~13% improvement over lockstepping with an 8-cycle checker latency

31 Summary & Conclusions
- SRT is applicable in a real-world SMT design
  - ~30% slowdown, slightly worse with two threads
  - Store queue capacity can limit performance
- Preferential space redundancy improves coverage
- Chip-Level Redundant Threading = SRT for CMPs
  - Looser synchronization than lockstepping
  - Frees up resources for other application threads

32 More Information
- Publications
  - S.K. Reinhardt & S.S. Mukherjee, "Transient Fault Detection via Simultaneous Multithreading," International Symposium on Computer Architecture (ISCA), 2000
  - S.S. Mukherjee, M. Kontz, & S.K. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives," International Symposium on Computer Architecture (ISCA), 2002
- Patents
  - Compaq/HP filed eight patent applications on SRT