University of Michigan Electrical Engineering and Computer Science: The StageNet Fabric


University of Michigan Electrical Engineering and Computer Science
Slide 1: The StageNet Fabric for Constructing Resilient Multicore Systems
Shantanu Gupta, Shuguang Feng, Amin Ansari, Jason Blome and Scott Mahlke
University of Michigan, Ann Arbor

Slide 2: Journey of Silicon Technology
[Figure: CPU performance (log scale) across processor generations: 486, Pentium, Pentium II, Pentium III, Pentium 4, Core Duo, Core 2 Quad. Region labels: perfect transistors, rising variability and defects, unreliable silicon. Annotations: memory redundancy, IBM z servers, Cell.]

Slide 3: Reliability Threats
► Transient faults
► Hard faults (manufacturing defects and device wear-out)
► Manufacturing defects that escape testing (inefficient burn-in testing)
► Thermal runaway: increased heating, higher transistor leakage, higher power dissipation
► Parametric variability (uncertainty in device and environment), e.g. intra-die variations in ILD thickness
[Todd Austin, GSRC Sep 08]

Slide 4: Goal of this Research
Reliability is developing into a first-class design constraint.
Design a computing substrate that:
► Provides scalable fault tolerance
► Is highly reconfigurable
► Has marginal overheads
Enable CMP designs capable of facing 100s of faults while maintaining useful throughput.

Slide 5: Reconfiguration Granularity
[Figure: a FETCH, DEC, EXEC, MEM, WB pipeline shown at three reconfiguration granularities. Arrows: lower complexity toward the core level, better resource utilization toward the module level. MTTF figures are for 100% area overhead (redundancy).]
► CORE level (ElastIC, DT'06; Reunion, MICRO'06; Configurable Isolation, ISCA'07): -- Poor MTTF gains, + Easy to implement. 100% MTTF ↑
► STAGE level: + Good MTTF gains, + Circuit / architectural boundary, + Full coverage. 170% MTTF ↑
► MODULE level (Online Diagnosis of Hard Faults, MICRO'05; Ultra Low-Cost Defect Protection, ASPLOS'06): + Best MTTF gains, -- Complex implementation. 200% MTTF ↑

Slide 6
[Figure: a CMP fabric of four cores (Core 0 to Core 3), each a pipeline of Stage1, Stage2, Stage3, ..., StageN with latches between stages.]

Slide 7: StageNet (SN) Fabric
[Figure: four pipelines of Stage1, Stage2, Stage3, ..., StageN. Labels: Configuration Manager, StageNet Slice (SNS), Crossbar Switch, Wearout Sensors (delay, temperature, current).]

Slide 8: SN – Benefits
[Figure: four pipelines of Stage1, Stage2, Stage3, ..., StageN with the Configuration Manager.]

Slide 9: Outline
► SN Slice (SNS) architecture
► SNS performance results
► SN architecture
► Lifetime reliability evaluation

Slide 10: StageNet Slice (SNS) – Decoupled uArch
[Figure: a 5-stage pipeline (Fetch, Decode, Issue, Ex/Mem, WB separated by latches, with Gen PC, Branch Predictor and Register File, and feedback paths for register writeback, branch resolution and bypass) compared with an SNS, in which Fetch, Decode, Issue and Ex/Mem are decoupled, connected through double buffers, and a Scoreboard is added.]

Slide 11: SNS Performance Hit
Sources of slowdown in the decoupled pipeline:
1. Control stall
2. Data forwarding
3. Transmission delays
Result: > 5X slowdown
[Figure: timelines for a branch (BR), a register dependency and commit, comparing the 5-stage pipeline with the SNS pipeline.]

Slide 12: 1. Control Handling using Stream ID
► Stream-ID: 1 bit to represent the execution path
► Toggled upon a branch misprediction
► Wrong-path instructions are squashed
[Figure: SNS with a Stream-ID (SID) register in each stage, all initially 0.]

Slide 13: Stream-ID Example
► Branch mispredict
► Toggle Stream-ID
► Squash the wrong ones
► Continue on the right path
► Toggle Stream-ID
[Figure: SNS stages processing a branch (BR); instructions are marked as squashed or committed according to their SID.]
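
The Stream-ID steps on slides 12 and 13 can be sketched in a few lines of Python. This is an illustrative model only, not the authors' hardware: the class names `FetchStage` and `ExecuteStage`, the `(name, sid)` instruction tuples, and the `run_example` helper are all hypothetical. It assumes that fetch tags each instruction with its current 1-bit stream ID and that execute drops any instruction whose tag differs from its own.

```python
class FetchStage:
    """Tags every fetched instruction with the current 1-bit stream ID."""

    def __init__(self):
        self.sid = 0

    def fetch(self, name):
        return (name, self.sid)

    def redirect(self):
        # A misprediction was reported: toggle so new instructions carry the new ID.
        self.sid ^= 1


class ExecuteStage:
    """Squashes instructions whose stream ID does not match its own."""

    def __init__(self):
        self.sid = 0
        self.committed = []
        self.squashed = []

    def execute(self, instr, mispredicted=False):
        """Return True when a misprediction must be reported to fetch."""
        name, sid = instr
        if sid != self.sid:
            self.squashed.append(name)  # wrong-path instruction
            return False
        self.committed.append(name)
        if mispredicted:
            self.sid ^= 1  # everything still tagged with the old ID is now stale
            return True
        return False


def run_example():
    fetch, execute = FetchStage(), ExecuteStage()
    # BR is mispredicted; w1 and w2 were fetched down the wrong path behind it.
    in_flight = [fetch.fetch(n) for n in ("BR", "w1", "w2")]
    for instr in in_flight:
        if execute.execute(instr, mispredicted=(instr[0] == "BR")):
            fetch.redirect()
    # r1 and r2 are fetched from the correct path after the redirect.
    for instr in [fetch.fetch(n) for n in ("r1", "r2")]:
        execute.execute(instr)
    return execute.committed, execute.squashed
```

In this sketch no global flush signal is needed: each stage decides locally, from the tag alone, whether an instruction is still on the right path.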

Slide 14: SNS with Stream-ID
Remaining sources of slowdown once Stream-ID removes the branch-induced stall:
► Data forwarding
► Transmission delays
[Figure: SNS with SID registers, and timelines for a branch (BR), a register dependency and commit, comparing the 5-stage pipeline with the SNS pipeline.]

Slide 15: SNS – Challenges and Solutions [CASES 08]
1. Control handling: Stream-ID takes care of this
2. Data forwarding: Bypass$ emulates data forwarding
   - Store previous results
   - Pass them on to new instructions
3. Transmission delay: Macro-ops are used to amortize delay
   - Bundles of instructions
   - Increases system utilization
Side labels: reduce feedback links, conserve bandwidth, decentralized control

Slide 16: Simulation Infrastructure
[Figure: tool flow. Trimaran compiler side: Benchmarks, Trimaran Compiler, Rebel, Trimaran Assembler, HPL-PD Assembly. Liberty Simulation Environment side: HPL-PD Emulator (functional) and SN Architecture (timing) within the Liberty Simulation Framework.]
Simulated configuration:
► Branch predictor: global, 16-bit, gshare predictor
► Level 1 I/D cache: 4-way, 16KB, 1 cycle latency
► Level 2 unified cache: 8-way, 64KB, 5 cycle latency

Slide 17: Final SNS Performance
[Figure: normalized runtime for the benchmarks des, g721decode, g721encode, idct, rawcaudio, rawdaudio, rijndael, mcf, eqn, grep and wc, plus the mean. Series: SNS + StreamID; SNS + StreamID + Bypass$; SNS + StreamID + Bypass$ + MOPs.]

Slide 18: SNS – Design Summary
Hardware added for each technique:
1. StreamID: SID registers
2. Bypass$: Bypass$, Scoreboard
3. Macro-ops: Packer, buffer sizes
Cost: ~12% area overhead, ~10% performance overhead
[Figure: SNS with double buffers, Scoreboard, SID registers, Bypass$ and Packer.]

Slide 19: SN – Architecture
5 SNSs combined to form SN
SN architecture is resilient
► Broken stages can be isolated
► Crossbar switches are redundant
► Interconnection wires are relatively reliable
Configuration manager acts upon failures
► Stage borrowing / lending
► Stage sharing

Slide 20: SN – Stage Borrowing
► Pipelines borrow / lend stages to form SNSs
► Exclusive use of stages by SNSs
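
The effect of stage borrowing on the number of usable pipelines can be illustrated with a small calculation. This is a simplified sketch under stated assumptions, not the configuration manager described in the talk: the `health` matrix and both function names are hypothetical, every stage of a given type is assumed reachable from every pipeline through the crossbars, and the crossbars themselves are assumed to be fault-free.

```python
def traditional_pipelines(health):
    """health[core][stage] is True if that stage still works.

    Without reconfiguration a core is usable only if all of its stages work.
    """
    return sum(all(core) for core in health)


def borrowing_pipelines(health):
    """With borrowing, any working stage of type s can serve any pipeline.

    Each SNS needs one stage of every type for its exclusive use, so the
    scarcest stage type limits how many SNSs can be formed.
    """
    num_stages = len(health[0])
    return min(sum(core[s] for core in health) for s in range(num_stages))


# Four cores with five stages each; three cores have one broken stage each.
health = [
    [False, True, True, True, True],   # core 0: stage 0 broken
    [True, False, True, True, True],   # core 1: stage 1 broken
    [True, True, False, True, True],   # core 2: stage 2 broken
    [True, True, True, True, True],    # core 3: fully working
]
```

For this example `traditional_pipelines(health)` gives 1 and `borrowing_pipelines(health)` gives 3, because the three faults fall in different stage types.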

Slide 21: SN – Stage Sharing
► Allow SNSs to share stages
► Degree of sharing is tunable (2-way, 3-way, ...)

Slide 22: Lifetime Reliability Experiments
Monte Carlo simulation of ~300 lifetime experiments, where each experiment involves:
► Assigning a time to failure (TTF) to all the components
► Killing components at their failure times
► Reconfiguring the system to isolate broken components
► Computing instantaneous throughput
Evaluation for three designs:
► Traditional CMP
► SN + borrowing
► SN + borrowing + sharing
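
The experiment loop on slide 22 can be sketched as follows. This is a minimal illustration under loud assumptions, not the authors' evaluation setup: the Weibull TTF distribution and its parameters, the 4-core by 5-stage layout, and the function names are all hypothetical. Crossbar failures and stage sharing are not modeled, and throughput is simply the number of working pipelines.

```python
import random


def simulate_lifetime(rng, cores=4, stages=5, scale=10.0, shape=2.0):
    """Run one lifetime experiment.

    Returns the cumulative work (throughput integrated over time) of a
    traditional CMP and of an SN fabric with stage borrowing.
    """
    # Assign a time to failure to every stage (assumed Weibull-distributed).
    ttf = [[rng.weibullvariate(scale, shape) for _ in range(stages)]
           for _ in range(cores)]
    failure_times = sorted(t for core in ttf for t in core)

    trad_work = sn_work = 0.0
    prev = 0.0
    for t in failure_times:
        # During (prev, t) exactly the stages with ttf >= t are still alive.
        trad = sum(all(x >= t for x in core) for core in ttf)
        sn = min(sum(core[s] >= t for core in ttf) for s in range(stages))
        trad_work += trad * (t - prev)
        sn_work += sn * (t - prev)
        prev = t  # the component failing at t is dead from here on
    return trad_work, sn_work


def monte_carlo(runs=300, seed=0):
    """Average cumulative work over many independent lifetimes."""
    rng = random.Random(seed)
    results = [simulate_lifetime(rng) for _ in range(runs)]
    trad = sum(r[0] for r in results) / runs
    sn = sum(r[1] for r in results) / runs
    return trad, sn
```

Calling `monte_carlo()` returns the two averages. In this model the SN figure can never fall below the traditional one, since any core with all stages working also counts as a pipeline under borrowing; the actual gap depends entirely on the assumed failure distribution.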

Slide 23: SN – Throughput
[Figure: throughput comparison; annotation: 4X.]

Slide 24: SN – Cumulative Work
[Figure: cumulative work comparison; annotation: 50%.]

Slide 25: SN Many-core Vision
SN, as presented, cannot scale to many cores.
How to deploy SN in a 64-core system?
► Create SN blocks: an optimal number of cores tied together
► Deploy a sparse network between blocks
[Figure: a traditional many-core chip next to an SN many-core chip built from SN blocks.]

Slide 26: Conclusions
Architectural innovations will be crucial in tackling the high failure rates.
SN is a potential solution
► 50% more cumulative work
► Low overheads (10% performance, 12% area)
SNS, a decoupled pipeline microarchitecture, forms its basis
► Stream-ID
► Bypass$ (not presented)
► Macro-ops (not presented)
Ongoing work
► SNS design for aggressive cores
► Optimal SN configuration for many-core systems

Slide 27: Thank You

Slide 28: Back up

Slide 29: SN – Defect Tolerance
[Figure: comparison against the number of faults (# Faults). Series: Traditional CMP, StageNet CMP.]

Slide 30: Scoreboard
► Scoreboard to handle RAW dependencies
► Stalls generate backpressure
[Figure: SNS with a Scoreboard in the Issue stage; the scoreboard table has the columns REG ID and Valid.]
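
A scoreboard with one valid bit per register, as on slide 30, can be sketched like this. The sketch is a hedged illustration rather than the SNS implementation: the `Scoreboard` class and its method names are hypothetical, and only read-after-write (RAW) hazards are checked, as the slide states.

```python
class Scoreboard:
    """One valid bit per register ID; a cleared bit means a write is pending."""

    def __init__(self, num_regs):
        self.valid = [True] * num_regs

    def can_issue(self, srcs):
        # RAW check: every source register must hold an up-to-date value.
        return all(self.valid[r] for r in srcs)

    def issue(self, dest, srcs):
        """Try to issue; return False to signal a stall (backpressure)."""
        if not self.can_issue(srcs):
            return False
        if dest is not None:
            self.valid[dest] = False  # result not yet written back
        return True

    def writeback(self, dest):
        self.valid[dest] = True


sb = Scoreboard(num_regs=8)
first = sb.issue(dest=1, srcs=[2, 3])    # r1 = r2 op r3: issues
stalled = sb.issue(dest=4, srcs=[1])     # needs r1, which is still pending
sb.writeback(1)
retried = sb.issue(dest=4, srcs=[1])     # r1 is valid again, so this issues
```

While `issue` keeps returning False the issue stage stops accepting new work, which is how a stall would propagate backward through the double buffers.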

Slide 31: Area overhead breakdown
Router area for 32 and 64 bit configurations

Slide 32: Architectural Details

Slide 33: Stage modifications for SNS

Slide 34: Bypass$ for data forwarding
Bypass cache (entries: REG ID, VALUE)
► Fully associative structure
► FIFO replacement policy
Key benefits
► Reduced stalls
► Lower bandwidth consumption
[Figure: SNS with double buffers, Scoreboard, SID registers and Bypass$.]
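
The bypass cache described on slide 34 (fully associative, FIFO replacement, entries mapping a register ID to a value) can be modeled with an ordered dictionary. This is a sketch under assumptions: the `BypassCache` class, its default capacity of 4, and the choice to treat a rewritten register as a fresh entry are all hypothetical and not taken from the slides.

```python
from collections import OrderedDict


class BypassCache:
    """Fully associative store of recent results with FIFO replacement."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()  # register ID -> value, oldest first

    def record(self, reg, value):
        """Store a newly produced result."""
        if reg in self.entries:
            # Assumption: a rewritten register replaces its stale entry
            # and counts as the newest one.
            del self.entries[reg]
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry
        self.entries[reg] = value

    def lookup(self, reg):
        """Return the cached value, or None if the register file must be read."""
        return self.entries.get(reg)


cache = BypassCache(capacity=2)
cache.record(1, 10)
cache.record(2, 20)
cache.record(3, 30)  # evicts register 1, the oldest entry
```

A hit in `lookup` stands in for the forwarding path of a conventional pipeline: the dependent instruction gets its operand locally instead of waiting for a round trip to the register file.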

Slide 35: SNS with Stream-ID, Bypass$
Remaining source of slowdown once Bypass$ handles data forwarding:
► Transmission delays
[Figure: SNS with SID registers and Bypass$, and timelines for a branch (BR), a register dependency and commit, comparing the 5-stage pipeline with the SNS pipeline.]

Slide 36: Transmission delay
Multiple cycles for instruction transfer → low utilization
[Figure: SNS with double buffers, Scoreboard, SID registers and Bypass$.]

Slide 37: Hide delay with Macro-ops
Need to improve utilization
► Balance transfer and compute time
Send instruction bundles
► Macro-ops (MOP)
► Greedy selection policy
Advantages
► Removes temporary intermediates
► Parallelizes transfer and compute
Constraints: max length 4, max live-ins 2
[Figure: a dataflow graph of LD, ST, +, /, >>, << and & operations grouped into macro-ops.]
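
A greedy packer that respects the two limits on slide 37 (at most 4 instructions and at most 2 live-in registers per macro-op) might look like the sketch below. It is a simplification and not the selection policy from the talk: it walks instructions in program order instead of over a dataflow graph, and the `(dest, srcs)` instruction encoding and the function name `pack_macro_ops` are hypothetical.

```python
def pack_macro_ops(instrs, max_len=4, max_live_ins=2):
    """Greedily bundle consecutive instructions into macro-ops.

    Each instruction is a (dest, srcs) pair. A live-in is a source register
    that is not produced earlier inside the same bundle.
    """
    bundles = []
    current, defined, live_ins = [], set(), set()
    for dest, srcs in instrs:
        new_live = live_ins | {s for s in srcs if s not in defined}
        if current and (len(current) >= max_len or len(new_live) > max_live_ins):
            # Adding this instruction would break a limit: close the bundle.
            bundles.append(current)
            current, defined = [], set()
            new_live = set(srcs)
        # A lone instruction is always accepted, even with many live-ins.
        current.append((dest, srcs))
        live_ins = new_live
        if dest is not None:
            defined.add(dest)
    if current:
        bundles.append(current)
    return bundles


program = [
    ("r1", ["r0"]),         # LD  r1 <- [r0]
    ("r2", ["r1"]),         # >>  r2 <- r1
    ("r3", ["r2", "r9"]),   # +   r3 <- r2, r9
    (None, ["r3", "r0"]),   # ST  [r0] <- r3
    ("r4", ["r5"]),         # LD  r4 <- [r5]
    ("r6", ["r4", "r7"]),   # &   r6 <- r4, r7
    ("r8", ["r6", "r10"]),  # <<  r8 <- r6, r10
]
bundles = pack_macro_ops(program)
```

Here the first bundle closes because it reaches four instructions, and the second closes because the last instruction would bring in a third live-in. Values such as `r1` and `r2` that live only inside a bundle never have to cross the interconnect, which is the "removes temporary intermediates" advantage listed on the slide.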

Slide 38: SNS with Stream-ID, Bypass$, MOP
[Figure: SNS with SID registers, Bypass$ and Packer, and timelines for a branch (BR), a register dependency and commit, comparing the 5-stage pipeline with the SNS pipeline after transmission delays are addressed.]

Slide 39: Tolerating Permanent Faults
Traditional solutions
► TMR
► Tandem / HP Non-stop
► IBM zSeries
...are impractical
► Cost
► Power
► Low gain
Current approach
1. Detection
2. Diagnosis
► Using sensors
► Redundant computation
► BIST
3. Repair
► Replacement
► Reconfiguration