WaveScalar: Swanson et al. Presented by Andrew Waterman, ECE259 Spring 2008

Presentation transcript:

WaveScalar Swanson et al. Presented by Andrew Waterman, ECE259 Spring 2008

Why Dataflow?
- “[...] as wire delays grow relative to gate delays, improvements in clock rate and IPC become directly antagonistic” [Agarwal00]
- Large bypass networks and highly associative structures are especially problematic
- Superscalar designs can only ameliorate this somewhat (21264 clustering, WIB, etc.)
- Shorter wires, smaller loads => higher f_clk possible with point-to-point networks and decentralized structures

Dataflow Locality
- Def: predictability of instruction dependencies
- 3 of 5 source operands come from the most recent producer
- Completely ignored by most superscalars
  - Over-general design: large bypass networks, regular references to a huge PRF, ...
  - Partial exceptions: clustering, hierarchical RFs
  - P4: only 1 cycle of 31 stages devoted to execution
- Can be exploited to greatly cheapen communication

The von Neumann abstraction
- Elegant as it is, the von Neumann execution model is inherently sequential
- Control dependencies limit exploitable ILP considerably
  - P4 again: 20-stage (!) branch misprediction loop
- Store/load aliasing hurts, too

Why Not Dataflow?
- Dataflow architectures may scale further, but...
- Who the hell wants to write a program in Id?
- For commercial adoption and future sanity, must support von Neumann memory semantics
  - But ideally without fetch serialization

Enter WaveScalar
- WaveScalar: dataflow's new groove
- Enabled by process improvements: can integrate on the order of 2K processing elements (PEs) plus nearby storage on-die
- "Cache-only" architecture (not in the COMA sense)
- Provides total load/store ordering
- Can be programmed conventionally
- ...without a program counter

WaveScalar ISA
- The WaveScalar binary encodes the DFG
- ISA is RISCy, plus a few new primitives
- Control flow:
  - ɸ insn implements the C ternary operator (similar to predication)
  - ɸ⁻¹ insn conditionally sends data to one PE or another based upon a boolean control input
  - Indirect-Send(arg,addr,offset) insn implements indirect jumps, calls, returns
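A rough way to picture the two control-flow primitives is as value selection versus value steering. The plain-C sketch below is purely illustrative; the names phi and phi_inv and the function-pointer "targets" are stand-ins invented here, not part of the actual ISA encoding. ɸ behaves like the C ternary operator, choosing between two already-computed inputs, while ɸ⁻¹ forwards one input to exactly one of two downstream consumers based on a boolean.

    #include <stdio.h>
    #include <stdbool.h>

    /* phi: like the C ternary operator. Both inputs have been computed;
     * the predicate picks which one flows onward (similar to predication). */
    static int phi(bool p, int if_true, int if_false) {
        return p ? if_true : if_false;
    }

    /* phi-inverse: one input value, and the predicate decides which of two
     * downstream consumers receives it. Here the "consumers" are function
     * pointers standing in for the two possible target PEs. */
    static void phi_inv(bool p, int value,
                        void (*true_target)(int),
                        void (*false_target)(int)) {
        if (p) true_target(value);
        else   false_target(value);
    }

    static void then_side(int v) { printf("then-side consumer got %d\n", v); }
    static void else_side(int v) { printf("else-side consumer got %d\n", v); }

    int main(void) {
        int t = 7;
        /* phi corresponds to: out = t ? 1 : 2; */
        printf("phi result: %d\n", phi(t != 0, 1, 2));
        /* phi-inverse corresponds to steering t to whichever side consumes it. */
        phi_inv(t != 0, t, then_side, else_side);
        return 0;
    }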

WaveScalar ISA: Waves
- Wave = a connected DAG, a subset of the DFG
  - Can span multiple hyperblocks iff each insn is executed at most once (no loops)
  - Easily elongated with unrolling
- To disambiguate which dynamic insn a PE is executing, data values carry a wave number
  - Wave numbers incremented by the Wave-Advance insn
  - Wave number assignment is not centralized!
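To see why the wave number is needed, consider several loop iterations in flight at once: a PE must know which dynamic instance an arriving operand belongs to. The toy C sketch below is an assumed illustration of that idea, not the actual WaveCache matching hardware: each value carries a wave tag, and an instruction fires only when both of its operands with the same wave number have arrived.

    #include <stdio.h>
    #include <stdbool.h>

    #define MAX_WAVES 8   /* toy bound on in-flight waves */

    /* A tagged operand: the value plus the wave it belongs to. */
    struct token { int wave; int value; bool present; };

    /* Per-instruction operand buffers, indexed by wave number (mod MAX_WAVES). */
    struct pe_slot { struct token left[MAX_WAVES], right[MAX_WAVES]; };

    /* Deliver an operand; fire the (here: add) instruction once both operands
     * with a matching wave number are present. */
    static void deliver(struct pe_slot *s, int wave, int value, bool is_left) {
        struct token *t = is_left ? &s->left[wave % MAX_WAVES]
                                  : &s->right[wave % MAX_WAVES];
        t->wave = wave; t->value = value; t->present = true;

        struct token *l = &s->left[wave % MAX_WAVES];
        struct token *r = &s->right[wave % MAX_WAVES];
        if (l->present && r->present && l->wave == r->wave) {
            printf("wave %d: fire add -> %d\n", wave, l->value + r->value);
            l->present = r->present = false;
        }
    }

    int main(void) {
        struct pe_slot add = {0};
        /* Operands from two different waves arrive interleaved; matching by
         * wave number keeps the iterations from mixing. */
        deliver(&add, 0, 10, true);
        deliver(&add, 1, 100, true);
        deliver(&add, 1, 200, false);  /* fires wave 1 */
        deliver(&add, 0, 20, false);   /* fires wave 0 */
        return 0;
    }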

WaveScalar ISA: Memory Ordering
- Wave-ordered memory
- Where possible, each mem op is labeled with its location within the wave: <predecessor, this, successor>
- Control flow may prohibit this; when unknown, '?' is used as the label
- Rule: no op with '?' in its succ. field may connect to an op with '?' in its pred. field
  - Solution: memory-nops
- Result: memory has enough info to establish a total load/store order
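A minimal sketch of the linking rule, assuming the <predecessor, this, successor> labels are small integers and '?' is represented by a sentinel: two memory ops can be chained when at least one side names the other explicitly, and a '?' meeting a '?' is exactly the gap a memory-nop must fill. This is illustrative only; the real store-queue logic is more involved.

    #include <stdio.h>
    #include <stdbool.h>

    #define UNKNOWN (-1)   /* stands for the '?' label */

    /* A memory operation's wave-ordering annotation. */
    struct mem_op { int pred, self, succ; };

    /* Can 'next' be accepted as the op that directly follows 'prev' in this
     * wave? The chain is intact if at least one side names the other
     * explicitly; two '?'s in a row leave a gap that a memory-nop fills. */
    static bool links(struct mem_op prev, struct mem_op next) {
        if (prev.succ != UNKNOWN) return prev.succ == next.self;
        if (next.pred != UNKNOWN) return next.pred == prev.self;
        return false;  /* '?' meets '?': ordering cannot be established */
    }

    int main(void) {
        struct mem_op a = { UNKNOWN, 1, 2 };        /* knows its successor   */
        struct mem_op b = { 1, 2, UNKNOWN };        /* knows its predecessor */
        struct mem_op c = { UNKNOWN, 3, UNKNOWN };  /* branchy: knows neither */

        printf("a -> b linkable: %s\n", links(a, b) ? "yes" : "no");
        printf("b -> c linkable: %s\n", links(b, c) ? "yes" : "no");  /* needs a memory-nop */
        return 0;
    }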

WaveCache: WaveScalar Implemented
- Grid of 2^11 PEs in clusters of 16
- On each PE: control logic, IQ/OQ, ALU, buffering for 8 static insns
- Small L1 D$ per 4 clusters; traditional unified L2$
- 1 StQ per 4 clusters; each wave bound to a StQ dynamically
- Intra-cluster comm: shared buses; inter-cluster: mesh?
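For a sense of scale, a quick sanity check on those numbers (assuming the 2^11-PE baseline and the per-4-cluster sharing described above; the arithmetic below is just illustrative bookkeeping, not data from the paper):

    #include <stdio.h>

    int main(void) {
        /* Baseline WaveCache configuration as described above (illustrative). */
        const int pes             = 1 << 11;   /* 2^11 processing elements      */
        const int pes_per_cluster = 16;
        const int clusters        = pes / pes_per_cluster;
        const int clusters_per_d1 = 4;         /* one L1 D$ / StQ per 4 clusters */
        const int insns_per_pe    = 8;         /* buffered static insns per PE   */

        printf("clusters:              %d\n", clusters);                   /* 128   */
        printf("L1 D-caches / StQs:    %d\n", clusters / clusters_per_d1); /* 32    */
        printf("resident static insns: %d\n", pes * insns_per_pe);         /* 16384 */
        return 0;
    }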

Compilation
- Compilation is basically the same as for a traditional arch.
  - To the point that binary translation is possible
- Additional steps: inserting memory-nops and wave-advances, converting branches to ɸ⁻¹
- Binaries are larger
  - Extra insns
  - Larger insns (each carries a list of target PEs)
  - ...but this is OK (no repeated fetch)
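The extra lowering steps can be pictured as a small rewrite pass over the instruction stream. The sketch below uses a toy IR invented here, not the WaveScalar compiler's actual representation: branches become ɸ⁻¹ steering instructions, and a wave-advance is appended at the loop back edge so each iteration gets a fresh wave number; memory-nop insertion is omitted for brevity.

    #include <stdio.h>

    /* Toy IR, purely for illustrating the listed lowering steps. */
    enum op { OP_LOAD, OP_STORE, OP_ADD, OP_BRANCH,
              OP_PHI_INV, OP_WAVE_ADVANCE };

    struct insn { enum op op; };

    static const char *name(enum op o) {
        switch (o) {
        case OP_LOAD:    return "load";
        case OP_STORE:   return "store";
        case OP_ADD:     return "add";
        case OP_BRANCH:  return "branch";
        case OP_PHI_INV: return "phi^-1";
        default:         return "wave-advance";
        }
    }

    /* Lower a tiny loop body: rewrite branches as phi^-1 steering insns and
     * append a wave-advance at the back edge. */
    static int lower(const struct insn *in, int n, struct insn *out) {
        int m = 0;
        for (int i = 0; i < n; i++) {
            if (in[i].op == OP_BRANCH) out[m++] = (struct insn){ OP_PHI_INV };
            else                       out[m++] = in[i];
        }
        out[m++] = (struct insn){ OP_WAVE_ADVANCE };  /* one per iteration */
        return m;
    }

    int main(void) {
        struct insn body[] = { {OP_LOAD}, {OP_BRANCH}, {OP_STORE}, {OP_ADD} };
        struct insn lowered[8];
        int m = lower(body, 4, lowered);
        for (int i = 0; i < m; i++) printf("%s\n", name(lowered[i].op));
        return 0;
    }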

Program Load/Termination
- Loading
  - As usual, the program is loaded by setting the PC and incurring an I$ miss
  - Insn targets are labeled "not-loaded" until they, too, miss and are fetched
  - In general, hopefully I$ misses are infrequent
  - Must back up an evicted insn's state (queues) and restore the new insn's state
    - Probably need to invoke the OS
- Termination
  - OS purges all insns from all PEs

Execution Example
void s(char in[10], char out[10]) {
  int i = 0, j = 0;
  do {
    int t = in[i];
    if (t) out[j++] = t;
  } while (i++ < 10);
}
And it's that simple!

Just Kidding...
void s(char in[10], char out[10]) {
  int i = 0, j = 0;
  do {
    int t = in[i];
    if (t) out[j++] = t;
  } while (i++ < 10);
}

Unmapped... and Mapped [figures from the slide: the example's dataflow graph, before and after being mapped onto WaveCache PEs]

How Well Does It Do? Methodology
- Benchmarks: SPEC and a few others
- Compiled for Alpha & binary-translated
  - Fairness: better overall code generation, but no WaveCache-specific optimizations
- Results reported in Alpha-equivalent IPC
  - Fairness: WaveScalar has extra insns

How Well Does It Do?
- Favorable comparison to superscalar
  - 16-wide (!!), out-of-order, |PRF| = |IW| = 1024
- Better IPC than TRIPS, but certainly lower f_clk
  - TRIPS limited by smaller execution units (hyperblocks vs. waves)

Other performance results
- Extra instruction overhead
  - In terms of static code size: 20%-140%
  - In terms of execution time: 10%
- Parallelism: input queue size
  - 8 sets of input values are sufficient for most programs
  - Except for victims of parallelism explosion

Performance improvements
- Control speculation
  - Baseline WaveCache: no branch prediction
  - 47% perf. improvement with perfect prediction
- Memory speculation
  - Baseline WaveCache: no memory disambiguation
  - 62% perf. improvement with perfect memory disambiguation
- Upshot: unrealistic, but lots of headroom
  - 340% improvement with both

Analysis
- WaveScalar makes dataflow much more general-purpose
- Seems fast enough to be worth implementing
  - Good IPC; more clock-period headroom
- Why isn't this the gold standard?
- Why are Swanson and Oskin no longer into dataflow?

Questions? Swanson et al. Presented by Andrew Waterman, ECE259 Spring 2008
