Michael Bedford Taylor, Walter Lee, Saman Amarasinghe, Anant Agarwal

Presentation transcript:

Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architectures. Michael Bedford Taylor, Walter Lee, Saman Amarasinghe, Anant Agarwal, Laboratory for Computer Science, Massachusetts Institute of Technology

Motivation As a thought experiment, let’s examine the Itanium II, published in last year’s ISSCC: the 6-way issue integer unit is < 2% of the die area, while cache logic is > 50% of the die area. [die photo: INT6 unit vs. cache logic]

Hypothetical Modification Why not replace a small portion of the cache with additional issue units? A “30-way” issue micro! Integer units still occupy less than 10% of the area, and cache logic is still > 42%. [die photo: five INT6 units vs. cache logic]

Can monolithic structures like this be attained at high frequency? The 6-way integer unit in Itanium II already spends 50% of its critical path in bypassing. [ISSCC 2002 – 25.6] Even if dynamic logic or logarithmic circuits could be used to flatten the number of logic levels of these huge structures –

...wire delay is inescapable. [figure: the die area reachable in 1 cycle shrinks dramatically from 180 nm to 45 nm] Ultimately, wire delay limits the scalability of un-pipelined, high-frequency, centralized structures.

One solution: Chip multiprocessors, e.g., IBM’s two-core Power4. Research and commercial multiprocessors have been designed to scale to thousands of ALUs. These multiprocessors scale because they don’t have any centralized resources.

Multiprocessors: Not Quite Appropriate for ILP High cost of inter-node operand routing: 10s to 100s of cycles to transfer the output of one instruction to the input of an instruction on another node. The vast difference between local and remote communication costs (~30x)... ...forces programmers and compilers to use entirely different algorithms at the two levels.

An alternative to a CMP: a distributed microprocessor design Such a microprocessor would distribute resources to varying degrees: Partitioned register files, Partitioned ALU clusters, Banked caches, Multiple independent compute pipelines, ... even multiple program counters

Some distributed microprocessor designs Conventional: Alpha 21264 – integer clusters Radical Proposals: UT Austin’s Grid, Wisconsin’s ILDP and Multiscalar, MIT’s Raw and Scale, Dynamic Dataflow, TTA, Stanford Smart Memories...

Some distributed microprocessor designs Interesting Secondary Development: The centralized bypass network is being replaced by a more general, distributed interconnection network!

Artist’s View [figure: a grid of tiles, each with its own I$, RF, and D$, executing a partitioned computation (ld a, ld b, +, >> 3, *, st b); the distributed resources are joined by a sophisticated interconnect]

How are these networks different from existing networks? Designed to join operands and operations in space: Route scalar values, not multi-word packets Ultra-Low latency Ultra-Low occupancy Unstructured communication patterns In this paper, we call these networks “scalar operand networks”, whether centralized or distributed.

What can we do to gain insight about the scalar operand networks? Looking at existing systems and proposals, Try to figure out what’s hard about these networks Find a way to classify them Gain a quantitative understanding

5 Challenges for Scalar Operand Networks Delay Scalability - ability of a design to maintain high frequencies as that design scales

Challenge 1: Delay Scalability Intra-component  Structures that grow as the system scales become bottlenecked by both interconnect delay and logic depths Register Files Memories Selection Logic Wakeup Logic ....

Challenge 1: Delay Scalability Intra-component  Structures that grow as the system scales become bottlenecked by both interconnect delay and logic depths Register Files Memories Selection Logic Wakeup Logic .... Solution: Pipeline the structure Turn propagation delay into pipeline latency Example: Pentium 4 pipelines regfile access Solution: Tile

Challenge 1: Delay Scalability Intra-component Inter-component  Problem of wire delay between components Occurs because it can take many cycles for remote components to communicate

Challenge 1: Delay Scalability Intra-component Inter-component  Problem of wire delay between components Occurs because it can take many cycles for remote components to communicate  Solution: Decentralize Each component must operate with only partial knowledge. Assign time cost for transfer of non-local information. Examples: ALU outputs, stall signals, branch mispredicts, exceptions, memory dependence info Examples: Pentium 4 wires, 21264 int. clusters

5 Challenges for Scalar Operand Networks Delay Scalability - ability of design to scale while maintaining high frequencies Bandwidth Scalability - ability of design to scale without inordinately increasing the relative percentage of resources dedicated to interconnect

Challenge 2: Bandwidth Scalability Global broadcasts don’t scale Example: Snoopy caches Superscalar Result Buses Problem: Each node has to process some sort of incoming data proportional to the total number of nodes in the system.

Challenge 2: Bandwidth Scalability Global broadcasts don’t scale Example: Snoopy caches Superscalar Result Buses Problem: Each node has to process some sort of incoming data proportional to the total number of nodes in the system. The delay can be pipelined à la the Alpha 21264, but each node still has to process too many incoming requests each cycle. Imagine a 30-way issue superscalar where each ALU has its own register file copy: 30 writes per cycle!

Challenge 2: Bandwidth Scalability Global broadcasts don’t scale Example: Snoopy caches Superscalar Result Buses Problem: Each node has to process some sort of incoming data proportional to the total number of nodes in the system. The delay can be pipelined à la the Alpha 21264, but each node still has to process too many incoming requests. Solution: Switch to a directory scheme Replace bus with point-to-point network Replace broadcast with unicast or multicast Decimate the bandwidth requirement
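
As a back-of-the-envelope illustration (not from the talk), the sketch below models the incoming operand traffic each node must absorb per cycle under a broadcast result network versus a point-to-point network; the per-node issue width and the bound on consumers per result are assumptions chosen for illustration.

#include <stdio.h>

/* Rough model, not from the talk: per-node incoming operand writes per cycle.
 * With a broadcast result network every node sees every result; with
 * point-to-point delivery each result reaches only its bounded set of
 * consumers, independent of machine size. */
int main(void) {
    const int results_per_node_per_cycle = 1;  /* assumed issue width per node */
    const int consumers_per_result = 2;        /* assumed bound for point-to-point */

    for (int nodes = 2; nodes <= 64; nodes *= 2) {
        int broadcast = nodes * results_per_node_per_cycle;
        int p2p = results_per_node_per_cycle * consumers_per_result;
        printf("%2d nodes: broadcast %2d writes/node/cycle, point-to-point %d\n",
               nodes, broadcast, p2p);
    }
    return 0;
}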

Challenge 2: Bandwidth Scalability A directory scheme for ILP?!!! Isn’t that expensive? Directories store dependence information, in other words, the locations where an instruction should send its result Fixed Assignment Architecture:  Assign each static instruction to an ALU at compile time  Compile dependent ALU locations w/ instrs. The directory is “looked up” locally when the instruction is fetched.

Challenge 2: Bandwidth Scalability A directory scheme for ILP?!!! Isn’t that expensive? Directories store dependence information, in other words, the locations where an instruction should send its result Fixed Assignment Architecture:  Assign each static instruction to an ALU at compile time  Compile dependent ALU locations w/ instrs. The directory is “looked up” locally when the instruction is fetched (see the sketch below). Dynamic Assignment Architecture:  Harder; somehow we have to figure out which ALU owns the dynamic instruction we are sending the operand to. A true directory lookup may be too $$$.
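
A minimal sketch of the fixed-assignment idea, with hypothetical field names and widths (this is not Raw's actual instruction encoding): the compiler resolves the "directory", i.e., the consumer tiles of each result, and stores it with the static instruction, so the lookup is purely local at fetch time.

#include <stdint.h>

#define MAX_DESTS 2  /* assumed bound on remote consumers encoded per instruction */

/* Hypothetical encoding: the consumer list is computed at compile time
 * and carried with the static instruction. */
struct static_instr {
    uint8_t opcode;
    uint8_t src_reg[2];
    uint8_t dest_reg;
    uint8_t home_alu;             /* ALU/tile this instruction was assigned to */
    uint8_t n_dests;              /* number of remote consumers of the result  */
    uint8_t dest_alu[MAX_DESTS];  /* tiles the result must be forwarded to     */
};

/* When the home ALU fetches the instruction, the "directory lookup" is just
 * reading dest_alu[0..n_dests-1]; no network access is required. */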

5 Challenges for Scalar Operand Networks Delay Scalability - ability of design to scale while maintaining high frequencies Bandwidth Scalability - ability of design to scale without inordinately increasing the relative percentage of resources dedicated to interconnect Deadlock and Starvation - distributed systems need to worry about over-committing internal buffering example: dynamic dataflow machines “throttling” Exceptional Events - Interrupts, branch mispredictions, exceptions

5 Challenges for Scalar Operand Networks Delay Scalability - ability of design to scale while maintaining high frequencies Bandwidth Scalability - ability of design to scale without inordinately increasing the relative percentage of resources dedicated to interconnect Deadlock and Starvation Exceptional Events Efficient Operation-Operand Matching - Gather operands and operations to meet at some point in space to perform a dataflow computation

Challenge 5: Efficient Operation-Operand Matching The rest of this talk! If operation-operand matching is too expensive, there’s little point to scaling. Since this is so important, let’s try to come up with a figure of merit for a scalar operand network -

What can we do to gain insight about scalar operand networks? Looking at existing systems and proposals, Try to figure out what’s hard about these networks  Find a way to classify the networks Gain a quantitative understanding

Defining a figure of merit for operation-operand matching 5-tuple <SO, SL, NHL, RL, RO>: Send Occupancy Send Latency Network Hop Latency Receive Latency Receive Occupancy tip: Ordering follows timing of message from sender to receiver
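
A minimal sketch (assuming the obvious additive reading of the tuple, and ignoring contention) of how the 5-tuple serves as a figure of merit: it directly gives the end-to-end cost of delivering one operand over h hops, and the occupancy terms give the cycles stolen from the two ALUs.

/* Additive reading of the 5-tuple; a simplification that ignores contention. */
struct five_tuple {
    int so;   /* send occupancy: cycles the sending ALU is tied up            */
    int sl;   /* send latency: cycles for the value to exit the sender        */
    int nhl;  /* network hop latency, per hop                                 */
    int rl;   /* receive latency: cycles for the value to enter the receiver  */
    int ro;   /* receive occupancy: cycles the receiving ALU is tied up       */
};

/* End-to-end cycles from "value computed" to "value usable" over h hops. */
static int operand_cost(struct five_tuple t, int hops) {
    return t.so + t.sl + hops * t.nhl + t.rl + t.ro;
}

/* The occupancy portion cannot be hidden by overlapping communication
 * with other useful work on the two ALUs. */
static int occupancy(struct five_tuple t) {
    return t.so + t.ro;
}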

The interesting region Conventional distributed multiprocessor <10, 30, 5, 30, 40> Superscalar <0, 0, 0, 0, 0> (not scalable)

Raw: Experimental Vehicle 16 instructions per cycle (fp, int, br, ld/st, alu..) no centralized resources ~250 Operand Routes / cycle Two applicable on-chip networks - message passing - dedicated scalar operand network Scalability story: tiles registered on input, just add more tiles Simulations are for 64 tiles, prototype has 16

Two points in the interesting region Conventional distributed multiprocessor <10, 30, 5, 30, 40> Raw / msg passing <3, 2, 1, 1, 7> Raw / scalar <0, 1, 1, 1, 0> Superscalar <0, 0, 0, 0, 0> (not scalable)

Message Passing 5-tuple <3, (Using Raw’s on-chip message passing network) Three wasted cycles per send Sender Occupancy = 3 compute value send header send sequence # send value use the value

Message Passing 5-tuple <3,2, Two cycles for message to exit proc  Sender Latency = 2 (Assumes early commit point) compute value send header send sequence # send value use the value

Message Passing 5-tuple <3,2,1, Messages take one cycle per hop  Per-hop latency = 1 compute value send header send sequence # send value use the value

Message Passing 5-tuple <3,2,1,1, One cycle for message to enter proc  Receive Latency = 1 compute value send header send sequence # send value use the value

Message Passing 5-tuple <3,2,1,1,7> Seven wasted cycles for receive  Receive Occupancy = 7 (minimum) [sender: compute value, send header, send sequence #, send value; receiver: load tag, branch if set (demultiplex message), get sequence #, compare #, branch if not eq, use the value]
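
To make the receive side concrete, here is an illustrative C rendering of the receiver's demultiplexing loop; it is not Raw's actual message-passing interface, and the message layout, tag value, and helper function are assumptions.

enum { TAG_OPERAND = 1 };                           /* assumed tag value      */

struct msg { int tag; int seq; int value; };        /* assumed message layout */

void handle_other_message(volatile struct msg *m);  /* assumed helper, defined elsewhere */

/* The receiver must demultiplex the incoming message and check its sequence
 * number before the operand can be consumed; every step here costs
 * instructions on the receiving ALU (receive occupancy). */
int receive_operand(volatile struct msg *inbox, int expected_seq) {
    for (;;) {
        if (inbox->tag != TAG_OPERAND) {   /* load tag, branch if set        */
            handle_other_message(inbox);   /* demultiplex unrelated traffic  */
            continue;
        }
        if (inbox->seq != expected_seq)    /* get sequence #, compare #,     */
            continue;                      /* branch if not equal            */
        return inbox->value;               /* only now can the value be used */
    }
}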

Raw’s 5-tuple <0, Zero wasted cycles per send Sender Occupancy = 0 compute, send value use the value

Raw’s 5-tuple <0,1, One cycle for message to exit proc  Sender Latency = 1 compute, send value use the value

Raw’s 5-tuple <0,1,1, Messages take one cycle per hop  Per-hop latency = 1 compute, send value use the value

Raw’s 5-tuple <0,1,1,1, One cycle for message to enter proc  Receive Latency = 1 compute, send value use the value

Raw’s 5-tuple <0,1,1,1,0> No wasted cycles for receive  Receive Occupancy = 0 compute, send value use the value
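
By contrast, a rough sketch of why Raw's network adds no instructions at either end: the ALU result is written directly to a register-mapped network port and read as an ordinary source operand on the other side. NET_OUT and NET_IN are hypothetical stand-ins; in the real machine this is a register specifier in the instruction, which plain C can only approximate.

extern volatile int NET_OUT;  /* hypothetical stand-in: writing injects into the network  */
extern volatile int NET_IN;   /* hypothetical stand-in: reading dequeues from the network */

/* Send occupancy 0: the compute instruction itself targets the network port. */
void producer_tile(int a, int b) {
    NET_OUT = a + b;          /* "compute, send value" in one instruction */
}

/* Receive occupancy 0: the arriving operand is used directly as a source. */
int consumer_tile(int c) {
    return NET_IN * c;        /* "use the value", no demultiplexing code */
}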

Superscalar 5-tuple <0, Zero wasted cycles for send  Send Occupancy = 0 compute, send value use the value

Superscalar 5-tuple <0,0,0,0, Zero cycles for all latencies  Send, Hop, Receive Latencies = 0 compute, send value use the value

Superscalar 5-tuple <0,0,0,0,0> No wasted cycles for receive  Receive Occupancy = 0 compute, send value use the value

Superscalar 5-tuple, late wakeup Wakeup signal will usually have to be sent ahead of time. If it’s not, then the 5-tuple could be <0,0,0,1,0>. [instruction sequence: compute, send value; wakeup, select; use the value]

5-tuples of several architectures
Superscalar                            <0,  0,   0,  0,  0>
Message Passing                        <3, 2+c,  1,  1,  7>  or  <3, 3+c, 1, 1, 12>
Distributed Shared Memory (F/E bits)   <1, 14+c, 2, 14,  1>
Raw                                    <0,  1,   1,  1,  0>
ILDP                                   <0,  n,   0,  1,  0>   (n = 0, 2)
Grid                                   <0,  0,  n/8, 0,  0>   (n = 0..8)
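
A small usage sketch (assuming c = 0, a 2-hop route, and the additive reading of the tuple sketched earlier) comparing a few of these entries end to end:

#include <stdio.h>

/* so + sl + hops*nhl + rl + ro, per the additive model sketched earlier. */
static int cost(int so, int sl, int nhl, int rl, int ro, int hops) {
    return so + sl + hops * nhl + rl + ro;
}

int main(void) {
    printf("Superscalar:                %d cycles\n", cost(0, 0, 0, 0, 0, 0));   /* 0  */
    printf("Raw:                        %d cycles\n", cost(0, 1, 1, 1, 0, 2));   /* 4  */
    printf("Message passing (c = 0):    %d cycles\n", cost(3, 2, 1, 1, 7, 2));   /* 15 */
    printf("DSM with F/E bits (c = 0):  %d cycles\n", cost(1, 14, 2, 14, 1, 2)); /* 34 */
    return 0;
}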

What can we do to gain insight about scalar operand networks? Looking at existing systems and proposals, Try to figure out what’s hard about these networks Find a way to classify the systems  Gain a quantitative understanding

5-tuple Simulation Experiments Raw’s actual scalar operand network Raw + Magic parameterized scalar operand network - Each tile has FIFOs connected to every other tile. - Allows us to vary latencies and measure contention <0,1,1,1,0> Raw <0,1,1,1,0> Magic Network ...vary all 5 parameters... <1,14,2,14,0> Magic Network, Shared Memory Costs <3,3,1,1,12> Magic Network, Message Passing Costs ...and others

Raw’s Scalar Operand Network, i.e., <0,1,1,1,0>: speedup versus 1 tile
tiles           2      4      8     16     32     64
cholesky      1.62   3.23   6.00   9.19  11.90  12.93
vpenta        1.71   3.11   6.09  12.13  24.17  44.87
mxm           1.93   3.73   6.21   8.90  14.84  20.47
fpppp-kernel  1.51   3.34   5.72   6.14   5.99   6.54
sha           1.12   1.96   1.98   2.32   2.54   2.52
swim          1.60   2.62   4.69   8.30  17.09  28.89
jacobi        1.43   2.76   4.95   9.30  15.88  22.76
life          1.81   3.37   6.44  12.05  21.08  36.10

Impact of Receive Occupancy, 64 tiles, i.e., <0,1,1,1,n>

Impact of Receive Latency (magic network), 64 tiles, i.e., <0,0,0,n,0>

Impact of Contention, i.e., Magic <0,1,1,1,0> / Raw’s <0,1,1,1,0>

Comparison: Superscalar vs. Raw vs. Grid vs. ILDP
Superscalar: 5-Tuple <0,0,0,0,0>; operand transport: broadcast; message demultiplex: associative instruction window; intranode instruction order: runtime ordering; free intranode bypassing: yes; instruction distribution: dynamic assignment.
Raw: 5-Tuple <0,1,1,1,0>; operand transport: point-to-point; message demultiplex: compile-time scheduling; intranode instruction order: compile-time ordering; free intranode bypassing: yes; instruction distribution: compiler assignment.
Grid: 5-Tuple <0,1,N/8,1,0>; operand transport: point-to-point; message demultiplex: distributed associative instruction window; intranode instruction order: runtime ordering; free intranode bypassing: no; instruction distribution: compiler assignment.
ILDP: 5-Tuple <0,N,0,1,0>; operand transport: broadcast; message demultiplex: F/E bits on distributed register files; intranode instruction order: compile-time ordering; free intranode bypassing: yes; instruction distribution: dynamic assignment.

Open Scalar Operand Network ?’s Can we beat <0,1,1,1,0>? How close can we get to <0,0,0,0,0>? How do we prove that our <0,0,0,0,0> scalar operand network would have a high frequency?

Open Scalar Operand Network ?’s Can we build scalable dynamic-assignment architectures for load-balancing? Is there a penalty for the 5-tuple? What is the impact of run-time vs. compile-time routing on the 5-tuple? What are the benefits of heterogeneous scalar operand networks? For instance, a <0,1,2,1,0> of 2-way <0,0,0,0,0>’s. Can we generalize these networks to support other models of computation, like streams or SMT-style threads?

More Open ?’s How do we design low-energy scalar operand networks? How do we support speculation on a distributed scalar operand network? How do compilers need to change?

Summary 5 Challenges Delay Scalability Bandwidth Scalability Deadlock / Starvation Exceptions Efficient Operation-Operand Matching 5-tuple model Quantitative Results Mentioned open questions