Presentation transcript:

Necromancer: Enhancing System Throughput by Animating Dead Cores
Authors: Amin Ansari, Shuguang Feng*, Shantanu Gupta, Scott Mahlke
University of Michigan Electrical Engineering and Computer Science
ISCA-37, June 21-23, 2010
* presenter

Slide 2: Manufacturing Defects
• Hard-faults
  • Intrinsic (silicon defects)
  • Extrinsic (impurities, litho imperfections)
• One defect per five 100 mm² dies expected (ITRS)
• Threatens manufacturing yield
• Currently resolved with core disabling (e.g., IBM Cell)

Slide 3: Improving Yield w/o Core Disabling
On-chip Caches:
• Large % of chip area
• Regular design and behavior
• Many existing solutions
Processing Cores:
• Significant % of chip area
• Inherently complex and irregular
• Must be addressed to improve overall yield

Slide 4: Necromancer (NM)
• Goal:
  • Maintain the overall performance of a CMP in the face of hard-faults (in processing cores)
• Intuition:
  • A core with a hard-fault (a “dead” core) may still be able to perform useful work
  • Utilize dead cores to mitigate performance loss

Slide 5: Impact of Hard-Faults on Program Execution
[Figure: % of injected hard-faults that manifest as architectural state* mismatches at different latencies (# of committed instructions)]
• More than 40% of the injected faults cause an immediate architectural state* mismatch (<10K instructions)
• A faulty core cannot be trusted to perform correctly even for short periods of program execution

Slide 6: Relax Correctness Constraint
• Similarity Index: % of committed PCs matching between a faulty and golden execution (1K instruction intervals)
• At a similarity index of 90%, more than 85% of the faulty cores can successfully commit at least 100K instructions
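
The similarity index above can be illustrated with a short Python sketch. The slide does not say how "matching" is evaluated, so this sketch assumes a position-by-position comparison of the two committed-PC traces within each 1K-instruction interval. The function names and the helper that counts instructions committed before the index first drops below 90% are hypothetical and not taken from the presentation.

```python
from typing import List, Sequence


def similarity_index(faulty_pcs: Sequence[int],
                     golden_pcs: Sequence[int],
                     interval: int = 1000) -> List[float]:
    """Per-interval percentage of committed PCs that match the golden trace.

    Assumption: PCs are compared position by position; the slide does not
    specify the matching rule.
    """
    n = min(len(faulty_pcs), len(golden_pcs))
    scores = []
    for start in range(0, n, interval):
        end = min(start + interval, n)
        matches = sum(
            1 for f, g in zip(faulty_pcs[start:end], golden_pcs[start:end])
            if f == g
        )
        scores.append(100.0 * matches / (end - start))
    return scores


def instructions_before_divergence(scores: Sequence[float],
                                   threshold: float = 90.0,
                                   interval: int = 1000) -> int:
    """Instructions committed before the first interval below the threshold.

    Assumes every interval in `scores` is a full interval.
    """
    committed = 0
    for score in scores:
        if score < threshold:
            break
        committed += interval
    return committed
```

Under these assumptions, a faulty core would count toward the "at least 100K instructions" figure when `instructions_before_divergence` returns 100000 or more at a 90% threshold.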

Slide 7: Using the (Un)dead Core to Generate Hints
• Observation:
  • The execution of a program on a faulty core, although imperfect, coarsely resembles a fault-free execution
• Proposal:
  • Use the faulty, “dead”, core to accelerate a fault-free core running the same application
  • Extract useful information from the (un)dead core and send it as hints to the fault-free core, the “animator” core
[Figure labels: (Un)dead Core, Hints, Animator Core, Performance]

Slide 8: Opportunities for Acceleration
• Original Performance
  • IPC of different Alpha microprocessors (normalized to an EV4)
• Performance w/ Hints
  • Perfect branch prediction
  • No L1 cache misses
[Figure: cores ordered by increasing complexity/resources]
With perfect hints, most of the simpler cores (EV4, EV5, and EV4-OoO) can achieve a performance comparable to that of the 6-issue OoO EV6

Slide 9: Traditional Core Coupling
• Typically configured as leader/follower cores where the leader runs ahead and attempts to accelerate the follower
  • Slipstream
  • Master/slave Speculation
  • Flea Flicker
  • Dual-core Execution
  • Paceline
  • DIVA
• The leader runs ahead by executing a “pruned” version of the application
• The leader speculates on long-latency operations
• The leader is aggressively frequency scaled (reduced safety margins)
• A smaller follower core simplifies the design/verification of the leader core
Conventional coupling solutions cannot operate in the presence of frequent faults

Slide 10: (Faulty) Core Coupling Challenges
• Frequent Fine-Grained Variations
  • Must identify “robust” hints
  • Even robust hints are not always reliable
    • Necessitates fine-grained hint disabling
  • The undead may execute/commit more or fewer instructions than the animator
    • Difficult to determine when to apply hints
• Occasional Global Divergences
  • Requires periodic resynchronizations with the animator
  • Online monitoring needed to identify synchronization periods

Slide 11: Necromancer Architecture
A robust heterogeneous core coupling design
[Figure: block diagram with labels Undead Core, Animator Core, L1-Inst, L1-Data, Shared L2 cache, Read-Only, Communication Queue (head, tail), Resynchronization and hint disabling, Memory Hierarchy]
The Undead:
• Serves as an external run-ahead engine for the animator core
• Executes an identical copy of the program
• Supplies hints to the animator
  • I$: PC of committed instructions
  • D$: address of committed loads and stores
  • Branch prediction: predictor updates
• Dirty D$ lines are not written back
• Exception generation/handling disabled
The Animator:
• An older version of the undead core with the same ISA and fewer resources (i.e., a previous generation)
• Consumes hints to improve performance
  • Prefetches on $ hints
  • Branch predictor hints improve speculation accuracy
• Dynamic hint disabling based on online monitoring
• Provides architecturally correct state for resynchronization
Inter-core Communication:
• Undead → Animator
  • Hints sent through single unified FIFO queue
• Animator → Undead
  • Resynchronization data (architectural state)
  • Hint disabling signals

Slide 12: Example: Branch Predictor Hints
[Figure: Necromancer architecture diagram (as on slide 11) annotated with the branch predictor hint path]
• Hint Gathering: undead core pipeline (FET, DEC, REN, DIS, EXE, MEM, COM); labels: Cache Fingerprint, PC, NPC
• Hint Format: Type, Age, PC, NPC
• Hint Distribution: animator core pipeline (FE, DE, RE, DI, EX, ME, CO); labels: Buffer, Type, Age, PC, NPC
• Buffer condition: Age tag ≤ # committed instructions + Δ
• Hint Disabling
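
The age-tag condition on this slide (Age tag ≤ # committed instructions + Δ) can be sketched as a gate on the hint queue. This is a minimal illustration, not the hardware design. It assumes that the age tag is the undead core's committed-instruction count when the hint was generated, that hints sit in the FIFO in non-decreasing age order, and that "# committed instructions" refers to the animator core. The `Hint` class and `release_hints` function are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass(frozen=True)
class Hint:
    kind: str  # "Type" field of the hint format, e.g. "BP"
    age: int   # "Age" field (assumed: undead commit count at generation)
    pc: int    # "PC" field
    npc: int   # "NPC" field (next PC)


def release_hints(queue: Deque[Hint], committed: int, delta: int) -> List[Hint]:
    """Pop hints from the head of the FIFO while age <= committed + delta.

    `committed` is assumed to be the animator's committed-instruction count;
    hints that are too far ahead stay buffered until the animator catches up.
    """
    released = []
    while queue and queue[0].age <= committed + delta:
        released.append(queue.popleft())
    return released
```

Calling `release_hints` each time the animator's commit count advances would hand over only the hints that fall within Δ instructions of the animator's current position.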

Slide 13: Example: Branch Predictor Hints
[Figure: Necromancer architecture diagram (as on slide 11) with Hint Gathering, Hint Distribution, and Hint Disabling; labels in the animator core's FE stage: Tournament Predictor, Original AC Predictor (PC, NPC), NM Predictor (PC, NPC), Branch Prediction, Undead update]

Slide 14: Coarse-grained Branch Prediction Disabling
[Figure: Necromancer architecture diagram (as on slide 11) with Hint Gathering, Hint Distribution, and Hint Disabling]
[Table: Prediction Outcomes (Original BP, NM BP) and the resulting Action; cell contents not recoverable]
• Hint Disabling: Counter > Threshold → Disable Hint
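
A counter-and-threshold disabling policy like the one on this slide might look as follows. The outcome table did not survive extraction, so the update rule here is an assumption: the counter goes up when the original predictor is correct and the NM predictor is wrong, goes down (saturating at zero) in the opposite case, and is unchanged when both agree. The class and method names are hypothetical, and re-enabling hints is left out because the slide does not describe it.

```python
class HintDisabler:
    """Coarse-grained branch-hint disabling via a counter and a threshold."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.counter = 0
        self.hints_enabled = True

    def record(self, original_correct: bool, nm_correct: bool) -> None:
        """Update the counter from one branch outcome (assumed update rule)."""
        if original_correct and not nm_correct:
            self.counter += 1  # the hint-driven prediction hurt
        elif nm_correct and not original_correct:
            self.counter = max(0, self.counter - 1)  # the hint helped
        # Slide rule: Counter > Threshold -> Disable Hint
        if self.counter > self.threshold:
            self.hints_enabled = False
```

With `HintDisabler(threshold=2)`, hints stay enabled until the counter reaches 3 under the assumed update rule.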

Slide 15: NM Design for CMP Systems

Slide 16: Evaluation Methodology
• Area-weighted Monte Carlo fault injection (microarchitectural simulations)
• Performance
  • Heavily modified SimAlpha
  • SPEC-CPU-2k w/ SimPoint
• Power
  • Wattch, HotLeakage, and CACTI
• Area
  • Synopsys 90nm
• Undead Core
  • Modeled after an OoO EV6
• Animator Core
  • Modeled after an OoO EV4
  • Limited resources vs. undead core (e.g., 8K D$ vs. 64K D$)
[Figure: Fault Injection Sites]
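
Area-weighted Monte Carlo fault injection can be sketched as weighted random sampling: each trial picks one structure with probability proportional to its area. The function name and the area values in the usage note are placeholders for illustration, not figures from the presentation; the structure names are borrowed from the fault-location slide.

```python
import random
from typing import Dict, List


def sample_fault_sites(areas: Dict[str, float],
                       trials: int,
                       seed: int = 0) -> List[str]:
    """Draw one fault site per trial, weighted by structure area.

    `areas` maps a structure name to its relative area; a fixed seed keeps
    the experiment reproducible.
    """
    rng = random.Random(seed)
    names = list(areas)
    weights = [areas[name] for name in names]
    return rng.choices(names, weights=weights, k=trials)
```

For example, `sample_fault_sites({"Program Counter": 1.0, "Instruction Fetch Queue": 4.0, "Integer ALU": 15.0}, trials=1000)` returns 1000 site names in which the larger structures appear more often; each sampled site would then drive one simulation run with a fault injected there.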

Slide 17: Impact of Fault Location on Performance
[Figure panels: Program Counter, Instruction Fetch Queue, Integer ALU]

Slide 18: Performance Gain
[Figure: performance chart with callouts 88% and 72%]
*Live core: a fault-free version of the undead core

Slide 19: Area and Power Overheads

Slide 20: Conclusion
• Faulty, “dead” cores can be revived to perform useful work
• Coupling faulty cores presents unique challenges
• Necromancer exploits efficient microarchitectural enhancements to provide
  • Intrinsically robust hints (BP, I$ and D$ prefetching)
  • Fine and coarse-grained hint monitoring/disabling
  • Dynamic inter-core state resynchronization (see paper)
• In a 4-core CMP, Necromancer
  • Recovers, on average, 88% of an undead core’s original performance
  • Incurs modest area and power overheads of 5.3% and 8.5%

Slide 21: Questions?