Presenter: Jyun-Yan Li Multiplexed redundant execution: A technique for efficient fault tolerance in chip multiprocessors Pramod Subramanyan, Virendra.

Presentation transcript:

Presenter: Jyun-Yan Li
Multiplexed redundant execution: A technique for efficient fault tolerance in chip multiprocessors
Pramod Subramanyan, Virendra Singh (Supercomputer Education and Research Center, Indian Institute of Science, Bangalore, India)
Kewal K. Saluja (Electrical and Computer Engg. Dept., University of Wisconsin-Madison, Madison, WI)
Erik Larsson (Dept. of Computer and Info. Science, Linkoping University, Linkoping, Sweden)
Design, Automation & Test in Europe Conference & Exhibition (DATE), 2010
Cite count: 16

Continued CMOS scaling is expected to make future microprocessors susceptible to transient faults, hard faults, manufacturing defects and process variations, so fault tolerance becomes important even for general-purpose processors targeted at the commodity market. To mitigate the effect of decreased reliability, a number of fault-tolerant architectures have been proposed that exploit the natural coarse-grained redundancy available in chip multiprocessors (CMPs). These architectures execute a single application using two threads, typically as one leading thread and one trailing thread. Errors are detected by comparing the outputs produced by these two threads. These architectures schedule a single application on two cores or two thread contexts of a CMP.

As a result, besides the additional energy consumption and performance overhead required to provide fault tolerance, such schemes also impose a throughput loss. Consequently, a CMP that can execute 2n threads in non-redundant mode can execute only half as many (n) threads in fault-tolerant mode. In this paper we propose multiplexed redundant execution (MRE), a low-overhead architectural technique that executes multiple trailing threads on a single processor core. MRE exploits the observation that it is possible to accelerate the execution of the trailing thread by providing execution assistance from the leading thread.

Execution assistance combined with coarse-grained multithreading allows MRE to schedule multiple trailing threads concurrently on a single core with only a small performance penalty. Our results show that MRE increases the throughput of a fault-tolerant CMP by 16% over an ideal dual modular redundant (DMR) architecture.

Chip multiprocessors (CMPs) have become the main source of performance growth
- Susceptible to soft errors, wear-out related permanent faults, ...
Two cores or thread contexts execute a single program in the CMP
- Throughput loss
  - The throughput of the CMP is halved
- System cost
  - Cooling, energy and maintenance cost

Related work shown on the slide:
- AR-SMT [22]
- SRT [21]
- CRT [18]
- CRTR [13]
- SRTR [29]
- Razor [11]
- Power efficient redundant execution [26]
- This paper: Multiplexed redundant execution: A technique for efficient fault tolerance in chip multiprocessors
Annotations on the slide's figure:
- Leading thread stores results in a delay buffer, and the trailing thread re-executes and compares the results
- Using SMT to detect transient faults
- Adding recovery
- Replicating critical pipeline registers and comparing them to detect errors
- Dynamic frequency and voltage scaling to reduce power

Input replication
- Issue the same input values to both threads
- Optimizations for the trailing thread
  - Load Value Queue (LVQ) for load data
  - Branch Outcome Queue (BOQ) for instruction fetch
Output comparator
- Verifies the results of both threads before they are forwarded to the rest of the system
- The store queue holds store data back until it has been compared
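The output comparison above can be illustrated with a small model. This is only a sketch under assumptions: the class name `StoreComparator`, the exception `FaultDetected`, and the dictionary standing in for "the rest of the system" are hypothetical and are not taken from the slides or the paper.

```python
from collections import deque


class FaultDetected(Exception):
    """Raised when the two threads disagree on a store (hypothetical name)."""


class StoreComparator:
    """Holds leading-thread stores until the trailing thread confirms them."""

    def __init__(self):
        self.pending = deque()  # unverified (address, value) pairs from the leading thread
        self.memory = {}        # stands in for the rest of the system

    def leading_store(self, addr, value):
        # The store is buffered, not released: nothing is visible outside yet.
        self.pending.append((addr, value))

    def trailing_store(self, addr, value):
        # Compare against the oldest unverified leading store.
        expected = self.pending.popleft()
        if expected != (addr, value):
            raise FaultDetected(f"leading {expected} != trailing {(addr, value)}")
        self.memory[addr] = value  # verified, so it may leave the core
```

A store therefore becomes visible only after both threads have produced the same address and value, which is the property the store queue is there to enforce.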

Multiplexed Redundant Execution (MRE)
- Cores are logically partitioned into pools
  - Leading core pool and trailing core pool: execute applications that require fault tolerance
  - Third pool: runs non-redundant applications
- Chunk: a unit of the application's execution
  - The leading core sends a message to the trailing core
  - The trailing core pushes it into the Run Request Queue (RRQ)

Input Replication
- The LVQ accelerates the trailing thread's loads
  - The leading thread transfers the loaded value to the trailing core's LVQ after the load instruction retires
  - The trailing thread loads its data from the LVQ
    - Eliminates cache misses
- The BOQ eliminates mispredictions
  - The leading thread's branch outcome is stored in the BOQ after the branch instruction retires
  - The trailing thread takes its branch predictions from the BOQ
    - Eliminates branch mispredictions
Interconnect
- Cores are divided into clusters and connected by a bus interconnect
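The input replication described above can be sketched as two queues shared between a leading and a trailing thread. The class and method names here are hypothetical, and the address check on the trailing load is an assumption added for illustration; the slides do not say how a mismatch is handled.

```python
from collections import deque


class LeadingThread:
    """Executes normally and forwards inputs to the trailing thread."""

    def __init__(self, memory, lvq, boq):
        self.memory, self.lvq, self.boq = memory, lvq, boq

    def load(self, addr):
        value = self.memory[addr]
        self.lvq.append((addr, value))  # forwarded when the load retires
        return value

    def branch(self, taken):
        self.boq.append(taken)          # forwarded when the branch retires
        return taken


class TrailingThread:
    """Re-executes using the leading thread's inputs instead of the cache."""

    def __init__(self, lvq, boq):
        self.lvq, self.boq = lvq, boq

    def load(self, addr):
        lead_addr, value = self.lvq.popleft()
        if lead_addr != addr:           # assumption: an address mismatch is reported
            raise ValueError("load address differs from the leading thread")
        return value                    # no cache access, so no cache miss

    def next_branch(self):
        return self.boq.popleft()       # known outcome used as the prediction
```

Because the trailing thread reads the value the leading thread saw, both threads observe the same input even if memory changes in between.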

Figure annotations:
- Run Request Queue (RRQ): entries are inserted by the leading threads; the trailing core loads chunks from it
- Branch Outcome Queue (BOQ): entries are inserted by the leading thread; the trailing core takes branch outcomes from it
- Load Value Queue (LVQ): entries are inserted by the leading thread; the trailing core takes load values from it
- Fingerprints are exchanged with the other core to detect faults
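A minimal sketch of how one trailing core might serve chunks from several leading threads through the RRQ. The names `RunRequest` and `TrailingCore` are hypothetical, and the first-in, first-out service order is an assumption; the slides do not state the scheduling policy.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RunRequest:
    thread_id: int  # which leading thread produced the chunk
    chunk_id: int   # position of the chunk within that thread


class TrailingCore:
    """Runs one chunk at a time, switching threads only at chunk boundaries."""

    def __init__(self):
        self.rrq = deque()  # Run Request Queue
        self.executed = []  # (thread_id, chunk_id) in execution order

    def receive(self, request):
        # Message from a leading core announcing a chunk to re-execute.
        self.rrq.append(request)

    def run_next(self):
        if not self.rrq:
            return None     # nothing to verify, the core idles
        request = self.rrq.popleft()
        self.executed.append((request.thread_id, request.chunk_id))
        return request
```

Switching only between chunks is what makes the multithreading coarse-grained: the core never interleaves instructions from two trailing threads.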

[Figure: thread switch on a core. Labels: storage space, thread, core registers (regs), copy state, new thread's state.]

Sharing the LVQ and BOQ for maximum utilization
- Sections are allocated dynamically, on demand
- Free queue: a list of unused sections
  - Shared by all threads
- Allocated queue: a list of allocated sections
[Figure: an LVQ divided into sections, with a free queue, an allocated queue per thread, and the values stored in each section.]
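The on-demand section allocation above can be sketched as follows. The class name `SectionedQueue`, its parameters, and the choice to return `False` when no free section is left are assumptions made for illustration.

```python
from collections import defaultdict, deque


class SectionedQueue:
    """An LVQ- or BOQ-like buffer whose sections are shared between threads."""

    def __init__(self, num_sections, section_size):
        self.section_size = section_size
        self.sections = [deque() for _ in range(num_sections)]
        self.free = deque(range(num_sections))  # free queue: unused sections
        self.allocated = defaultdict(deque)     # per thread: sections in fill order

    def push(self, thread_id, value):
        owned = self.allocated[thread_id]
        # Allocate a new section when the thread has none or its last one is full.
        if not owned or len(self.sections[owned[-1]]) == self.section_size:
            if not self.free:
                return False                    # no space: the producer must wait
            owned.append(self.free.popleft())
        self.sections[owned[-1]].append(value)
        return True

    def pop(self, thread_id):
        owned = self.allocated[thread_id]
        if not owned:
            raise IndexError("no entries queued for this thread")
        head = owned[0]
        value = self.sections[head].popleft()
        if not self.sections[head]:             # drained: return it to the free queue
            owned.popleft()
            self.free.append(head)
        return value
```

A thread that produces many values can hold several sections, while an idle thread holds none, which is how sharing raises utilization compared with a fixed per-thread partition.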

Fault Detection
- The cores exchange execution fingerprints
  - The execution results should be the same
- Branch misprediction in the trailing thread
  - The trailing thread should never mispredict, so a misprediction indicates a fault
Fault Isolation
- A fault must not propagate to other parts of the system
  - A speculative (S) bit is added to each line of the data cache (D$)
    - S=1 when data is written: the cache line is locked and cannot be written back to memory
    - If the line must be replaced, the fingerprints are compared first
    - S=0 once the comparison matches
  - I/O operations
    - Take a checkpoint and compare fingerprints before an I/O operation
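The fingerprint comparison and the speculative bit can be sketched together. This is an illustration under assumptions: `zlib.crc32` is used only as a stand-in hash, since the slides do not say how the fingerprint is computed, and the class names are hypothetical. The model also simplifies eviction by refusing to write back a locked line instead of triggering a comparison.

```python
import zlib


class Fingerprint:
    """Running hash over retired results (CRC-32 is a stand-in, not the paper's hash)."""

    def __init__(self):
        self.value = 0

    def update(self, dest_reg, result):
        data = f"{dest_reg}:{result}".encode()
        self.value = zlib.crc32(data, self.value)


class SpeculativeCache:
    """Data cache whose written lines stay locked until fingerprints match."""

    def __init__(self):
        self.lines = {}   # address -> [value, s_bit]
        self.memory = {}

    def write(self, addr, value):
        self.lines[addr] = [value, True]   # S=1: unverified data

    def fingerprints_matched(self):
        for line in self.lines.values():
            line[1] = False                # S=0: data is verified

    def write_back(self, addr):
        value, speculative = self.lines[addr]
        if speculative:
            raise RuntimeError("line is locked until the fingerprints are compared")
        self.memory[addr] = value
```

Comparing one short fingerprint per chunk instead of every result keeps interconnect traffic low, at the price of the aliasing risk noted under fault coverage.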

Checkpoint and Recover
- If the comparison matches, the leading core stores all registers in the checkpoint store
- If it does not match
  - Recover the register state of both cores from the checkpoint store
  - Invalidate data cache lines whose speculative bit is set (S=1)
  - The leading core re-executes from the last checkpoint
Fault coverage
- Processor logic and certain parts of the memory access circuitry
  - Faults in the cache and memory controller cannot be detected
- Fingerprint aliasing
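The checkpoint and recovery steps above can be sketched as a single function called at the end of each chunk. The function name `end_of_chunk` and the use of plain dictionaries for registers, the checkpoint store and the cache are assumptions made for illustration.

```python
def end_of_chunk(lead_regs, trail_regs, lead_fp, trail_fp, checkpoint, cache):
    """Commit or roll back one chunk.

    cache maps address -> (value, s_bit). Returns True on commit and
    False when a fault was detected and state was rolled back.
    """
    if lead_fp == trail_fp:
        # Match: the leading core's registers become the new checkpoint
        # and the speculative lines are unlocked.
        checkpoint.clear()
        checkpoint.update(lead_regs)
        for addr, (value, _) in list(cache.items()):
            cache[addr] = (value, False)
        return True

    # Mismatch: restore both cores and drop unverified cache lines.
    for regs in (lead_regs, trail_regs):
        regs.clear()
        regs.update(checkpoint)
    for addr in [a for a, (_, s_bit) in cache.items() if s_bit]:
        del cache[addr]
    return False
```

After a rollback the leading core would simply re-execute from the restored registers; lines verified by an earlier chunk stay in the cache because their S bit is already clear.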

Hardware and software cost
Comparison with SRT and CRT
- Fault tolerance capability
- Performance degradation or improvement
  - Throughput
- Power consumption


- Comparing stores between the leading and trailing threads increases interconnect traffic
- CRT compares store values with the store comparator in the store queue
- The store buffer of the leading thread waits for the trailing thread's stores
[Figure annotations: -2%, -18%]

Multiplexed Redundant Execution (MRE)
- Mitigates the throughput loss caused by redundant execution
  - One trailing core executes the redundant threads of many leading cores, using the RRQ and coarse-grained multithreading
My comment
- Load Value Queue (LVQ) and Branch Outcome Queue (BOQ)
  - Reduce the extra load time from the lower levels of the memory hierarchy
- The paper does not describe which type of multithreading the leading core uses

A form of thread-level parallelism (TLP)
- Executes multiple instructions from multiple threads at the same time
Architecture
- Superscalar
Picture from "Computer Architecture: A Quantitative Approach", John L. Hennessy, David A. Patterson