University of Michigan Electrical Engineering and Computer Science
Erasing Core Boundaries

Erasing Core Boundaries for Robust and Configurable Performance
Shantanu Gupta, Shuguang Feng, Amin Ansari, Scott Mahlke
University of Michigan, Ann Arbor
December 7, 2010
43rd International Symposium on Microarchitecture

Multicore Architectures
Industry-wide move to multicores
► 2 – 16 cores on a single die
Multiple challenges confront them:
► Single-thread performance
► Reliability
► Power density
► Memory bandwidth
► …
Our hypothesis: A highly configurable architecture can handle these issues in a unified manner.
[Figure: example multicore dies: IBM Cell, Sun Niagara 2, Intel 4 Core Nehalem]

Multicore Performance Challenge
Spectrum of applications:
► Sequential workloads (legacy workloads, most mobile/desktop apps)
► Parallel workloads (scientific computing, newer web browsers, video decoding)
1. Good throughput / parallel performance with more cores
2. Stagnating sequential performance
[Figure: CPU performance (log scale) across processor generations (486, Pentium, Pentium II, Pentium III, Pentium 4, Core Duo, Core 2 Quad, Core i7), annotated with the power wall]
Need flexibility to provide both parallel and sequential performance

Solution: Configurable Performance
Assign resources where they are needed…
In an N core chip:
► Use all N cores for best parallel performance
► Group M cores together for serial performance (M < N)
Core Fusion, ISCA'07; Composable Lightweight Processors, MICRO'07
[Figure: Parallel / Throughput configuration versus Serial / Sequential configuration. Source: Mark D. Hill]

Multicore Reliability Challenge
Hard faults and parametric variability:
► Electromigration (EM)
► Oxide breakdown (OBD)
► Intra-die variations in ILD thickness
► Negative Bias Temperature Instability (NBTI)
► Manufacturing defects that escape testing (inefficient burn-in testing)
► Thermal runaway: increased heating, higher transistor leakage, higher power dissipation
[Todd Austin, GSRC Sep 08]
Need mechanisms for in-field silicon failures

Solution: Isolate Broken Resources
CORE level
► ElastIC, DT'06
► Configurable Isolation, ISCA'07
MODULE level
► Online Diagnosis of Hard Faults, MICRO'05
► Ultra Low-Cost Defect Protection, ASPLOS'06
STAGE level
► StageNet, MICRO'08
► Core Cannibalization, PACT'08
[Figure: three pipelines, each made of Stage 1, Stage 2, Stage 3, ..., Stage N]
- StageNet decouples the pipeline stages
- Regular fabric, no global interconnections
- Any set of stages can be connected to form a pipeline

Point Solutions: Summary and Limitations
► Configurable performance: fuse cores for higher single-thread performance
► Reliability: stage level isolation

Point Solutions: Summary and Limitations
► Fuse cores for higher single-thread performance: tightly coupled resources, centralized structures for data and control management
► Stage level isolation: decoupled resources, distributed data and control management
Point solutions:
1. Solve only one challenge at a time
2. Incur additive overheads, no resource overlap
3. Are incompatible with one another
Our Goal: Design an architectural solution which
1. Simultaneously targets configurable performance and reliability
2. Overlaps hardware changes, and
3. Resolves any conflicting requirements

The CoreGenesis (CG) Architecture
► Regular grid of pipeline stages. No explicit core boundary.
► Stages interconnected by full crossbars
► Distributed structures for data and control management
[Figure: four rows of Fetch, Decode, Issue and Ex/Mem stages joined by crossbar switches, with distributed structures; configuration shown: 1. Throughput]
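To make the effect of full crossbars concrete, the sketch below counts how many pipelines remain usable after some stages fail. It is an illustration only: the four-row grid matches the figure, but the fault set, the function names, and the assumption that the crossbars themselves never fail are hypothetical and not taken from the talk.

```python
from collections import Counter

STAGE_TYPES = ("Fetch", "Decode", "Issue", "Ex/Mem")

def pipelines_with_crossbars(faulty, rows=4):
    """Pipelines that can be formed when any working stage of each type
    may be connected to any other (assumes fault-free crossbars)."""
    working = Counter({stage: rows for stage in STAGE_TYPES})
    for _row, stage in faulty:
        working[stage] -= 1
    # A pipeline needs one working stage of every type.
    return min(working[stage] for stage in STAGE_TYPES)

def pipelines_with_core_disabling(faulty, rows=4):
    """Baseline: a row (core) is lost as soon as any one of its stages fails."""
    broken_rows = {row for row, _stage in faulty}
    return rows - len(broken_rows)

# Hypothetical fault pattern: three faults in three different rows.
faults = {(0, "Fetch"), (1, "Decode"), (2, "Issue")}
stage_level = pipelines_with_crossbars(faults)      # 3
core_level = pipelines_with_core_disabling(faults)  # 1
print(stage_level, core_level)
```

Under these assumptions three faults cost a single pipeline with stage-level reconfiguration, while core-level disabling loses three of the four cores.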

The CoreGenesis (CG) Architecture
[Figure: the same grid of Fetch, Decode, Issue and Ex/Mem stages configured as a single pipeline processor and as a conjoined pipelines processor; labels: 1. Throughput, 2. Reliability, 3. Configurable Performance]
Advantages:
1. Unified performance / reliability solution
2. Overlaps hardware overheads
3. Regular fabric
4. No centralized resources for fetch, issue, operand copying

CG – Microarchitectural Hurdles
Single Pipeline: Solved by the StageNet design, MICRO'08
► Stream Identification bits for Control Flow
► Bypass cache (inside the execute stage) for Register Data Flow
Control Flow
- Instruction sequence needs to be managed across fetch stages
Instruction Issue
- Segregate data flow chains between pipelines
Register and Memory Data Flow
- Detection of cross pipeline register and memory data flow violations
- Recovery to a consistent architectural state

CG – Overview
[Figure: conjoined pipelines processor built from two Fetch, Decode, Issue, Ex/Mem pipelines, annotated with: distributed fetch, distributed decode, decentralized instruction issue, detection of data flow violations, in-order writeback (broadcasted)]
1. Control Flow
2. Register Data Flow tracking
3. Memory Data Flow tracking
4. Replay Mechanism
5. Instruction Issue

CG – Control Flow
Distributed Fetch
- Pipelines fetch alternate instructions
- Branch predictors are kept in sync
Advantages
- Evenly splits the work (fetch, decode, issue) between two pipelines
- No explicit communication required for control decisions
- Consistent control decisions due to mirrored branch predictors
[Figure: two Fetch/Decode pipelines; one receives instructions 2, 4, 6, 8, 10 and the other receives 1, 3, 5, 7, 9]
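The distributed fetch scheme on this slide can be sketched in a few lines. This is a minimal illustration, not the hardware design: the per-PC 2-bit counter predictor stands in for the real predictor, and the assumption that every branch outcome is applied to both predictors is one simple way to keep them mirrored; the slide does not say how synchronization is implemented. All names and addresses are hypothetical.

```python
class TwoBitPredictor:
    """Per-PC 2-bit saturating counters (illustrative stand-in predictor)."""

    def __init__(self):
        self.counters = {}

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken"; unseen PCs start at 1.
        return self.counters.get(pc, 1) >= 2

    def update(self, pc, taken):
        count = self.counters.get(pc, 1)
        self.counters[pc] = min(3, count + 1) if taken else max(0, count - 1)

def split_fetch(instructions, num_pipelines=2):
    """Pipeline p fetches instructions p, p + n, p + 2n, ... (alternate fetch)."""
    return [instructions[p::num_pipelines] for p in range(num_pipelines)]

stream = list(range(1, 11))        # instruction numbers 1..10, as in the figure
odd, even = split_fetch(stream)    # [1, 3, 5, 7, 9] and [2, 4, 6, 8, 10]

# Assumption: every resolved branch outcome reaches both predictors.
predictors = [TwoBitPredictor(), TwoBitPredictor()]
outcomes = [(0x40, True), (0x40, True), (0x80, False), (0x40, False)]
for pc, taken in outcomes:
    for predictor in predictors:
        predictor.update(pc, taken)

in_sync = all(
    predictors[0].predict(pc) == predictors[1].predict(pc) for pc, _ in outcomes
)
print(odd, even, in_sync)
```

Because both predictors see the same update sequence, they make identical predictions, which is why no explicit communication is needed for control decisions.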

CG – Data Flow
Across-pipeline dependencies are tricky…
Approach: split the instruction stream, make local decisions and execute, compare notes at commit time, and replay if any dependency was violated.
Register Data Flow
1. Issue stages locally maintain a table of source registers
2. Issue stages monitor write-backs, and detect if any other pipeline updates a source for an outstanding instruction
3. Missed dependency → initiate a light-weight replay
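The three register data flow steps above can be sketched as a small monitor attached to each issue stage. This is a simplified model under stated assumptions: instructions carry a program-order sequence number, only consumers younger than the writer are flagged, and the class and method names are hypothetical rather than taken from the CoreGenesis design.

```python
class IssueStageMonitor:
    """Tracks source registers of outstanding instructions in one pipeline."""

    def __init__(self, pipeline_id):
        self.pipeline_id = pipeline_id
        self.outstanding = {}  # sequence number -> set of source registers read

    def issue(self, seq, sources):
        """Step 1: record the sources an instruction read when it issued."""
        self.outstanding[seq] = set(sources)

    def commit(self, seq):
        """Drop the entry once the instruction is no longer outstanding."""
        self.outstanding.pop(seq, None)

    def observe_writeback(self, writer_pipeline, writer_seq, dest):
        """Step 2: watch broadcast write-backs from the other pipeline.

        Step 3: return the local instructions that missed this producer
        and therefore need a replay (assumed: only younger consumers).
        """
        if writer_pipeline == self.pipeline_id:
            return []
        return sorted(
            seq
            for seq, sources in self.outstanding.items()
            if seq > writer_seq and dest in sources
        )

monitor = IssueStageMonitor(pipeline_id=1)
monitor.issue(seq=3, sources=["R2"])                 # instruction 3 reads R2
own_write = monitor.observe_writeback(1, 1, "R1")    # same pipeline: ignored
replays = monitor.observe_writeback(2, 2, "R2")      # pipeline 2 writes R2
print(own_write, replays)
```

In this sketch the second write-back reports instruction 3 for replay, because another pipeline updated a register that instruction 3 had already read.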

CG – Register Data Flow: Example
Pipeline 1 (Issue, Execute) handles instructions 1 and 3; pipeline 2 (Issue, Execute) handles instructions 2 and 4.
Scenario A
1. R1 = ….
2. R2 = ….
3. … = R1
4. … = R2
Scenario B
1. R1 = ….
2. R2 = ….
3. … = R2
4. … = R2
Data flow violation in Scenario B! Pipeline 1 used a stale value of R2
How can we avoid these violations?
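The two scenarios can be replayed with a small checker. It uses a deliberately pessimistic model, which is an assumption of this sketch and not a statement about the hardware timing: a consumer is flagged whenever the most recent producer of one of its sources ran on a different pipeline, as if the broadcast write-back always arrived too late. The function name and data layout are hypothetical.

```python
def find_violations(program, steering):
    """Return (instruction number, register) pairs that read a stale value.

    program:  list of (dest, sources) tuples in program order
    steering: pipeline number for each instruction, in the same order
    """
    last_writer = {}  # register -> pipeline of its most recent producer
    violations = []
    numbered = enumerate(zip(program, steering), start=1)
    for number, ((dest, sources), pipeline) in numbered:
        for src in sources:
            if src in last_writer and last_writer[src] != pipeline:
                violations.append((number, src))
        if dest is not None:
            last_writer[dest] = pipeline
    return violations

scenario_a = [("R1", []), ("R2", []), (None, ["R1"]), (None, ["R2"])]
scenario_b = [("R1", []), ("R2", []), (None, ["R2"]), (None, ["R2"])]
steering = [1, 2, 1, 2]  # instructions 1 and 3 on pipeline 1, 2 and 4 on pipeline 2

result_a = find_violations(scenario_a, steering)  # []
result_b = find_violations(scenario_b, steering)  # [(3, 'R2')]
print(result_a, result_b)
```

Scenario A is clean because each consumer sits on its producer's pipeline, while in Scenario B instruction 3 reads R2 on pipeline 1 although R2 was produced on pipeline 2.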

CG – Instruction Issue
Instructions can be:
► Straight steered
► Cross steered
Objective: match producers and consumers
Mismatch → Data Flow violation → Replay
Solution: Use static compiler analysis to generate steering hints
[Figure: two Issue stages feeding two Ex/Mem stages over straight and cross paths]

CG – Instruction Issue: Example
[Figure: data flow graph of instructions 1 to 10 with one critical cross dependency; fetch order 2, 4, 6, 8, 10 on one pipeline and 1, 3, 5, 7, 9 on the other; Issue stages feed Ex/Mem stages over straight and cross paths]
Always straight steering
► Ignores data dependencies
► Number of replays = 5
Compiler orchestrated steering
► Use clustering algorithms
► Accounts for dependencies and communication delays
► Number of replays = 0
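The contrast between the two steering policies can be sketched as follows. This is not the compiler's clustering algorithm: it is a simple greedy heuristic that places each consumer on the pipeline of its first producer, it ignores communication delays, and it counts producer/consumer mismatches as a stand-in for replays. The dependency graph is hypothetical, since the slide's graph is not recoverable from the transcript.

```python
def mismatches(deps, steering):
    """Count producer/consumer pairs that ended up on different pipelines."""
    return sum(
        1
        for consumer, producers in deps.items()
        for producer in producers
        if steering[producer] != steering[consumer]
    )

def straight_steering(count, num_pipelines=2):
    """Every instruction executes on the pipeline that fetched it."""
    return {i: (i - 1) % num_pipelines for i in range(1, count + 1)}

def greedy_steering(count, deps, num_pipelines=2):
    """Steer each consumer to its first producer's pipeline; otherwise
    pick the least loaded pipeline (illustrative heuristic only)."""
    steering = {}
    load = [0] * num_pipelines
    for i in range(1, count + 1):
        producers = deps.get(i, [])
        if producers:
            pipeline = steering[producers[0]]
        else:
            pipeline = load.index(min(load))
        steering[i] = pipeline
        load[pipeline] += 1
    return steering

# Hypothetical dependencies: consumer -> list of earlier producers.
deps = {3: [2], 4: [1], 5: [4], 6: [3]}

straight_count = mismatches(deps, straight_steering(6))     # 4
greedy_count = mismatches(deps, greedy_steering(6, deps))   # 0
print(straight_count, greedy_count)
```

On this example, straight steering separates every producer from its consumer, while dependency-aware steering keeps both chains local and still balances the load across the two pipelines.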

CG – Design Summary
1. Control Flow
- Pipelines fetch alternate instructions
- Branch predictors kept in sync
2. Register Data Flow
- Maintain local data flow information
- Check the decisions at writeback
3. Memory Data Flow tracking
4. Replay Mechanism
5. Instruction Issue
- Steer consumers to producers
- Leverage static compiler analysis

Evaluation Methodology
Liberty Simulation Infrastructure
► For cycle accurate simulations
Trimaran Compilation System
► For instruction steering hints
Experiments:
► Single-thread performance gain from conjoining
► Throughput improvement from conjoining (at low utilizations)
► Throughput sustainability (in face of failures)
Microarchitectural Parameters
► Branch predictor: global, 16-bit, gshare predictor
► Level 1 I/D cache: 4-way, 16KB, 1 cycle latency
► Level 2 unified cache: 8-way, 64KB, 5 cycle latency

Sequential Performance

Throughput at varying utilization

Throughput Sustainability (Reliability)

Conclusions
Architectural flexibility can tackle multiple multicore challenges
CoreGenesis is our attempt at a unified performance and reliability solution
► Decentralized instruction flow management to combine resources for higher single-thread performance
► Decoupled pipeline architecture to allow stage level reconfiguration
Results:
► Combining two single issue pipelines gives 40% speedup
► Sustains the same throughput for up to 70% longer
► Overheads: 20% area, 17% power

Thank you
http://cccp.eecs.umich.edu
Erasing Core Boundaries for Robust and Configurable Performance

Backup slides

Traditional Solutions and CoreGenesis (CG)
Traditional point solutions
(A) Dynamic Multicore. Cores can fuse together when sequential performance is needed (throughput and sequential modes).
(B) Core Disabling. Isolates broken cores (red). Sustains throughput only at low failure rates.
(C) Heterogeneous CMP. Maintains a variety of cores to offer power-proportional computing.
CoreGenesis Vision
The architecture is composed of a sea of building blocks (B). These blocks can be configured for:
► Throughput computing: by forming single-issue pipelines
► Single-thread performance: by forming wider-issue pipelines
► Fault-tolerance: by decommissioning broken blocks
► Customized processing: heterogeneous building blocks can be introduced in the fabric to form customized pipelines

CG Instance: A Unified Performance-Reliability Solution
Design Characteristics
► Elementary pipeline stages form the building blocks
► Stages interconnected using full crossbars
► No global flush, stall or forwarding signals
► No modifications to the cache hierarchy
Summary of Challenges
[Table: Single Pipeline versus Conjoint Pipelines, with rows Control Flow, Register Data Flow, Memory Data Flow, Instruction Steering]
Provides:
1. Configurable Performance: by merging varying number of stages
2. Reliability: by isolating broken stages

Decoupling Stages in a Pipeline [MICRO'08]
1. Control Flow: Stream ID (SID)
► Control flow handling
► Eliminates flush signals
2. Data Flow: Bypass $
► Stores previous results
► Fully associative structure
► Emulates data forwarding
3. Transmission Delays: Macro-Ops
► Send instruction bundles
► Amortizes transfer delay
► Increases system utilization
[Figure: Fetch, Decode, Issue and Ex/Mem stages separated by double buffers, showing Gen PC, Branch Predictor, Register File, SID registers, Macro-op Generator and Bypass $]
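One way a stream ID can replace a global flush signal is sketched below. The slide only states that the SID handles control flow and eliminates flush signals; the rest is an assumption of this sketch: a single SID bit per stage, instructions stamped with the fetch stage's SID, and a toggle on misprediction. All names and addresses are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    pc: int
    sid: int  # stream ID bit stamped when the instruction was fetched

class Stage:
    """A decoupled stage that keeps its own copy of the current stream ID."""

    def __init__(self):
        self.sid = 0

    def accept(self, instr):
        # Instructions from an old stream are dropped locally, so no
        # global flush signal has to reach every stage.
        return instr.sid == self.sid

def on_mispredict(execute, fetch):
    """Assumed protocol: execute toggles its SID and fetch adopts it
    together with the corrected PC."""
    execute.sid ^= 1
    fetch.sid = execute.sid

fetch, execute = Stage(), Stage()

# Wrong-path instructions already in flight, stamped with the old SID.
in_flight = [Instr(pc, fetch.sid) for pc in (100, 104, 108)]
on_mispredict(execute, fetch)
# Instructions fetched from the corrected path carry the new SID.
refetched = [Instr(pc, fetch.sid) for pc in (200, 204)]

kept = [instr.pc for instr in in_flight + refetched if execute.accept(instr)]
print(kept)  # [200, 204]
```

The wrong-path instructions are discarded when they arrive at the execute stage, purely by comparing one bit.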

Replays

Area

Power
