

Decoupled Pipelines: Rationale, Analysis, and Evaluation Frederick A. Koopmans, Sanjay J. Patel Department of Computer Engineering University of Illinois at Urbana-Champaign

2 Outline
Introduction & Motivation
Background
DSEP Design
Average-Case Optimizations
Experimental Results

3 Motivation
Why Asynchronous?
No clock skew
No clock distribution circuitry
Lower power (potentially)
Increased modularity
But what about performance? What is the architectural benefit of removing the clock?
→ Decoupled Pipelines!

4 Motivation
Advantages of a Decoupled Pipeline
Pipeline achieves average-case performance
Rarely taken critical paths no longer affect performance
New potential for average-case optimizations

5 Synchronous vs. Decoupled
[Diagram: a synchronous pipeline (Stage1–Stage2–Stage3 joined by synchronous latches and a shared clock) contrasted with a decoupled pipeline, in which each stage is paired with its own control unit (Control1–Control3) and stages exchange go/ack events and data through elastic buffers; self-timed logic replaces the clock as the synchronizing mechanism]

6 Outline
Introduction & Motivation
Background
DSEP Design
Average-Case Optimizations
Experimental Results

7 Self-Timed Logic
Bounded Delay Model
Definition: event = signal transition
Start event provided when inputs are available
Done event produced when outputs are stable
→ Fixed delay based on critical path analysis
Computational circuit is unchanged
[Diagram: a self-timing circuit sits alongside the computational circuit; the start event enters with the input, and the done event leaves with the output]

8 Asynchronous Logic Gates
C-gate → logical AND: waits for events to arrive on both inputs
XOR-gate → logical OR: waits for an event to arrive on either input
SEL-gate → logical DEMUX: routes the input event to one of the outputs
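The event semantics of these three gates can be sketched in software. This is a minimal illustrative model with hypothetical class names, not circuitry from the talk:

```python
class CGate:
    """C-gate (Muller C-element): produces an output event only after
    events have arrived on BOTH inputs -- the logical AND of events."""
    def __init__(self):
        self.pending = set()

    def event(self, inp):            # inp is "a" or "b"
        self.pending.add(inp)
        if self.pending == {"a", "b"}:
            self.pending.clear()     # consume both input events
            return True              # output event fires
        return False


class XorGate:
    """XOR-gate: produces an output event on an event from EITHER input --
    the logical OR of events (inputs assumed never to arrive together)."""
    def event(self, inp):
        return True


class SelGate:
    """SEL-gate: event DEMUX -- routes the input event to the output
    chosen by a select condition."""
    def event(self, select):
        return select                # name of the output receiving the event
```

Note that a C-gate is stateful: after firing, it must again wait for events on both inputs before it fires next.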

9 Asynchronous Communication Protocol
2-Step, Event-Triggered, Level-Insensitive Protocol
Transactions are encoded in go/ack events
Asynchronously passes instructions between stages
[Waveform: sender and receiver stages exchange go and ack transitions; each go/ack pair frames one data transaction (data_1, data_2)]
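As a sketch, this transition-signaling handshake can be modeled with toggling go/ack bits (a hypothetical helper class of my own naming, under the assumption of one outstanding transaction per channel):

```python
class TwoPhaseChannel:
    """Two-phase handshake: a TRANSITION on go (not a level) announces data;
    a transition on ack completes the transaction. go == ack means idle."""
    def __init__(self):
        self.go = 0
        self.ack = 0
        self.data = None

    def send(self, data):              # sender stage
        assert self.go == self.ack, "previous transaction not yet acked"
        self.data = data
        self.go ^= 1                   # go event = signal transition

    def receive(self):                 # receiver stage
        assert self.go != self.ack, "no pending go event"
        item = self.data
        self.ack ^= 1                  # ack event completes the transaction
        return item
```

Because only transitions carry meaning, no return-to-zero phase is needed, which is what makes the protocol level-insensitive.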

10 Outline
Introduction & Motivation
Background
DSEP Design
Average-Case Optimizations
Experimental Results

11 DSEP Microarchitecture
At a high level:
9-stage dynamic pipeline
Multiple instruction issue
Multiple functional units
Out-of-order execution
Looks like Intel P6 µarch
What's the difference? Decoupled, Self-Timed, Elastic Pipeline
[Pipeline diagram: Fetch → Decode → Rename → Read/Reorder → Issue → Execute → Data Read → Write back/Commit → Retire, with result and flush paths back from Retire]

12 DSEP Microarchitecture
Decoupled: each stage controls its own latency
Based on local critical path
Stage balancing not important
Each stage can have several different latencies
Selection based on inputs
The pipeline is operating at several different speeds simultaneously!

13 Pipeline Elasticity
Definition: a pipeline's ability to stretch with the latency of its instruction stream
Global Elasticity
Provided by reservation stations and the reorder buffer
Same for synchronous and asynchronous pipelines
[Diagram: Fetch → Execute → Retire; when Execute stalls, the buffers allow Fetch and Retire to keep operating]

14 Pipeline Elasticity
Local Elasticity
Needed for a completely decoupled pipeline
Provided by micropipelines
Variable-length queues between stages
Efficient implementation, little overhead
Behave like shock absorbers
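The shock-absorber effect can be illustrated with a toy two-stage model. The latency numbers and the queue-drain policy below are my own assumptions, chosen only to show how a deeper micropipeline hides a one-off slow operation:

```python
def finish_time(prod_lat, cons_lat, depth):
    """Time to drain a 2-stage decoupled pipeline whose stages are joined by
    a `depth`-entry micropipeline. prod_lat/cons_lat list per-item latencies."""
    n = len(prod_lat)
    prod_done = [0.0] * n   # when item i leaves the producer stage
    cons_done = [0.0] * n   # when item i leaves the consumer stage
    for i in range(n):
        start = prod_done[i - 1] if i else 0.0
        if i >= depth:      # queue full: wait for item i-depth to drain
            start = max(start, cons_done[i - depth])
        prod_done[i] = start + prod_lat[i]
        cons_start = max(prod_done[i], cons_done[i - 1] if i else 0.0)
        cons_done[i] = cons_start + cons_lat[i]
    return cons_done[-1]

# A one-off slow consume stalls the producer behind a depth-1 queue,
# but is absorbed entirely by a depth-4 queue.
fast = [1, 1, 1, 1]
bursty = [4, 1, 1, 1]
```

With these numbers the depth-1 pipeline finishes at t=11 while the depth-4 pipeline finishes at t=8: the queue lets the producer run ahead while the consumer digests its slow first item.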

15 Outline
Introduction & Motivation
Background
DSEP Design
Average-Case Optimizations
Experimental Results

16 Analysis
Synchronous Processor
Each stage runs at the speed of the worst-case stage running its worst-case operation
Designer: focus on critical paths, stage balancing
DSEP
Each stage runs at the speed of its own average operation
Designer: optimize for the most common operation
→ Fundamental advantage of the decoupled pipeline

17 Average-Case Optimizations
Designer's strategy:
Implement fine-grain latency tuning
Avoid the latency of untaken paths
Consider a generic example: if the short op is much more common, throughput is proportional to the select logic
[Diagram: a generic stage in which select logic steers inputs to either a short or a long operation, with a MUX merging the outputs]
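A toy latency model makes the claim concrete. The latency values here are hypothetical, chosen only to echo the slide's short/long split:

```python
def stage_latency(op, select_lat=5, short_lat=20, long_lat=130):
    """Self-timed stage: pay the select overhead plus ONLY the taken path."""
    return select_lat + (short_lat if op == "short" else long_lat)

def average_latency(ops):
    """Average per-operation latency over an instruction mix."""
    return sum(stage_latency(op) for op in ops) / len(ops)

# With 90% short ops, the decoupled stage averages far below the
# synchronous worst case, which must always budget for the long path.
mix = ["short"] * 9 + ["long"]
```

For this mix the average is 36 time units against a synchronous budget of 130: the common short path, plus a small select overhead, dominates.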

18 Average-Case ALU
Tune ALU latency to closely match the input operation
ALU performance is proportional to the average op
Computational circuit is unchanged
[Diagram: the ALU self-timing circuit routes the start event through per-operation delays (arithmetic, logic, shift, compare) via SEL and XOR gates to produce the done event]

19 Average-Case Decoder
Tune decoder latency to match the input instruction
Common instructions often have simple encodings
Prioritize the most frequent instructions
[Diagram: the decoder self-timing circuit selects among per-format delays (Format 1, 2, 3) via SEL and XOR gates]

20 Average-Case Fetch Alignment
Optimize for aligned fetch blocks
If the fetch block is aligned on a cache line, it can skip alignment and masking overhead
The optimization is effective when software/hardware alignment optimizations are effective
[Diagram: fetch delivers the instruction block directly when the address is aligned; otherwise the block passes through the align/mask path before the MUX]
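The idea reduces to a conditional latency, sketched here with hypothetical numbers (line size and latencies are my own assumptions):

```python
def fetch_latency(pc, line_bytes=32, fetch_lat=50, align_lat=30):
    """Aligned fetch blocks bypass the align/mask network entirely;
    misaligned blocks pay the extra alignment latency."""
    if pc % line_bytes == 0:
        return fetch_lat               # aligned: skip align/mask
    return fetch_lat + align_lat       # misaligned: align and mask
```

When compilers align branch targets to line boundaries, most fetches take the fast path.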

21 Average-Case Cache Access
Optimize for consecutive reads to the same cache line
Allows subsequent references to skip the cache access
Effective for small-stride access patterns and tight loops in the I-Cache
Very little overhead for non-consecutive references
[Diagram: if the address falls in the same line as the previous access, the previous line buffer supplies the data; otherwise the line is read from the cache]
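The same-line shortcut behaves like a one-entry line buffer in front of the cache. This sketch uses hypothetical latencies and a hypothetical class name:

```python
class LineBuffer:
    """Keep the previously read cache line; a read that falls in the same
    line is served from the buffer and skips the cache access."""
    def __init__(self, line_bytes=32, cache_lat=120, buffer_lat=20):
        self.line_bytes = line_bytes
        self.cache_lat = cache_lat
        self.buffer_lat = buffer_lat
        self.last_line = None

    def read_latency(self, addr):
        line = addr // self.line_bytes
        if line == self.last_line:
            return self.buffer_lat     # same line: skip the cache
        self.last_line = line
        return self.cache_lat          # new line: full cache access
```

Sequential instruction fetch hits the buffer on every access until it crosses a line boundary, which is why tight loops benefit most.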

22 Average-Case Comparator
Optimize for the case that a difference exists in the lower 4 bits of the inputs
A 4-bit comparison is > 50% faster than a 32-bit comparison
Very effective for iterative loops
Can be extended for tag comparisons
[Diagram: a 4-bit comparator decides "not equal" early; only inputs with equal low bits fall through to the full 32-bit comparator]
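Functionally this is an early-out equality test. A sketch (the function name and the fast/slow tags are mine, for illustration):

```python
def average_case_equal(a, b, width=32, low_bits=4):
    """Early-out equality: if the low 4 bits differ, the short comparator
    decides immediately; only equal low bits need the full-width compare."""
    mask = (1 << low_bits) - 1
    if (a & mask) != (b & mask):
        return False, "fast"           # decided by the 4-bit comparator
    full_mask = (1 << width) - 1
    return (a & full_mask) == (b & full_mask), "slow"
```

In an iterative loop comparing a counter against its bound, the low bits change every iteration, so the fast path decides almost every comparison.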

23 Outline
Introduction & Motivation
Background
DSEP Design
Average-Case Optimizations
Experimental Evaluation

24 Simulation Environment
VHDL simulator using the Renoir Design Suite
MIPS I instruction set
Fetch and Retire bandwidth = 1
Execute bandwidth ≤ 4
4-entry split instruction window
64-entry reorder buffer
Benchmarks:
BS → 50-element bubble sort
MM → 10x10 integer matrix multiply

25 Two Pipeline Configurations
Operation               DSEP Latencies               Fixed Latencies
Fetch
Decode                  50/80/
Rename                  80/120/
Read                    120
Execute                 20/40/80/100/130/150/360/    /360/600
Retire                  5/100/
Caches
Main Memory             960
Micropipeline Register  55
"Synchronous" Clock Period = 120 time units

26 DSEP Performance
Compared the Fixed and DSEP configurations
DSEP increased performance by 28% and 21% for BS and MM, respectively
[Chart: execution time for the two configurations]

27 Micropipeline Performance
Goals:
Determine the need for local elasticity
Determine appropriate lengths of the queues
Method:
Evaluate DSEP configurations of the form AxBxC
A → micropipelines in Decode, Rename, and Retire
B → micropipelines in Read
C → micropipelines in Execute
All configurations include a fixed-length instruction window and reorder buffer

28 Micropipeline Performance
Measured percent speedup over 1x1x1
2x2x1 best for both benchmarks
2.4% performance improvement for BS, 1.7% for MM
Stalls in Fetch reduced by 60% for 2x2x1
[Chart: percent speedup for Bubble-Sort and Matrix-Multiply]

29 OOO Engine Utilization
Measured OOO engine utilization
Instruction Window (IW) and Reorder Buffer (RB)
Utilization = average number of instructions in the buffer
IW utilization up 75%, RB utilization up 40%
[Chart: instruction window and reorder buffer utilization]

30 Total Performance
Compared the Fixed and DSEP configurations
DSEP 2x2x1 increased performance by 29% and 22% for BS and MM, respectively
[Chart: execution time for the two configurations]

31 Conclusions
Decoupled, Self-Timed
Average-case optimizations significantly increase performance
Rarely taken critical paths no longer matter
Elasticity
Removes pipeline jitter from decoupled operation
Increases utilization of existing resources
Not as important as the average-case optimizations (at least for our experiments)

32 Questions?