Scalar Operand Networks for Tiled Microprocessors
Michael Taylor, Raw Architecture Project, MIT CSAIL (now at UCSD)

Until 3 years ago, computer architects used the N-way superscalar (or VLIW) as the ideal for a parallel processor: nearly "perfect," but not attainable.

Superscalar:
- "PE"->"PE" communication: Free
- Exploitation of parallelism: Implicit (hw scheduler or compiler)
- Clean semantics: Yes
- Scalable: No
- Power efficient: No

What's great about superscalar microprocessors? It's the networks! Fast, low-latency, tightly-coupled networks (0-1 cycles of latency, no occupancy).
- For lack of a better name, let's call them Scalar Operand Networks (SONs).
- Can we combine the benefits of superscalar communication with multicore scalability?
- Can we build scalable Scalar Operand Networks?
(I agree with Jose: "We need low-latency tightly-coupled … network interfaces" – Jose Duato, OCIN, Dec 6, 2006)
Example of a dependence carried over the SON: mul $2,$3,$4 -> add $6,$5,$2

The industry shift toward multicore: attainable, but hardly ideal.

                             Superscalar   Multicore
"PE"->"PE" communication     Free          Expensive
Exploitation of parallelism  Implicit      Explicit
Clean semantics              Yes           No
Scalable                     No            Yes
Power efficient              No            Yes

What we'd like is neither superscalar nor multicore: superscalars have fast networks and great usability; multicore has great scalability and efficiency. We want the best of both columns of the table above.

Why communication is expensive on multicore
[Diagram: an operand travels from Multiprocessor Node 1 to Multiprocessor Node 2.]
The send overhead decomposes into send occupancy and send latency; the network adds transport cost; the receive overhead decomposes into receive latency and receive occupancy.

Multiprocessor SON Operand Routing: the send side
[Diagram: Multiprocessor Node 1.] Send occupancy comes from assembling the message (destination node name, sequence number, value) and executing the launch sequence; send latency includes commit latency and network injection.

Multiprocessor SON Operand Routing: the receive side
[Diagram: Multiprocessor Node 2.] Receive occupancy and receive latency come from the receive sequence: demultiplexing, branch mispredictions, and injection cost. Shared-memory multiprocessors pay similar overheads: store instructions, commit latency, and spin locks (plus the attendant branch mispredicts).

Defining a figure of merit for scalar operand networks: the 5-tuple
<Send Occupancy, Send Latency, Network Hop Latency, Receive Latency, Receive Occupancy>
Tip: the ordering follows the timing of a message from sender to receiver. We can use this metric to quantitatively differentiate SONs from existing multiprocessor networks…
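As a rough illustration, the 5-tuple reads as an additive cost model: the end-to-end cost of delivering one operand is the sum of the five components, with the hop latency multiplied by the route length. A minimal sketch in C, assuming per-component costs in cycles (the struct and function names are mine, not from the papers):

    typedef struct {
        int send_occupancy;    /* cycles the sender's ALU is tied up injecting    */
        int send_latency;      /* cycles from value ready to network injection    */
        int hop_latency;       /* cycles added per network hop                    */
        int receive_latency;   /* cycles from arrival until the value is usable   */
        int receive_occupancy; /* cycles the receiver's ALU is tied up extracting */
    } son_5tuple;

    /* End-to-end cycles to move one operand across `hops` network hops. */
    int operand_cost(son_5tuple t, int hops) {
        return t.send_occupancy + t.send_latency
             + hops * t.hop_latency
             + t.receive_latency + t.receive_occupancy;
    }

Under this reading, a superscalar bypass network sits at the ideal corner, effectively <0,0,0,0,0>, while a traditional message-passing machine pays heavily in both occupancies.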

Impact of occupancy (o = send occupancy + receive occupancy): if o × "surface area" (operands communicated) exceeds the "volume" (work offloaded), it is not worth offloading; the overhead is too high because the parallelism is too fine-grained.
Impact of latency: the lower the latency, the less independent work a processor needs to keep itself busy while waiting for the answer. With high latency and too little parallelism to hide it, offloading is again not worthwhile: the sender could have computed the result itself faster while the other processor sat with nothing to do.
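The slide's rule of thumb can be made concrete. A minimal sketch in C, assuming occupancy is counted in cycles, "surface area" is the number of operands crossing the cut, and "volume" is the number of instructions in the offloaded region (the names are illustrative, not from the Raw toolchain):

    #include <stdbool.h>

    /* Offloading pays only if the cycles spent on communication occupancy
       are smaller than the work being moved to the other processor. */
    bool worth_offloading(int send_occ, int recv_occ,
                          int operands_crossing, int region_instructions) {
        int o = send_occ + recv_occ;          /* per-operand occupancy */
        return (long)o * operands_crossing < region_instructions;
    }

Latency enters the same way: the sender must find roughly `latency` cycles of independent work to overlap with the round trip, or it would have been faster to compute the value locally.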

The interesting region
[Design-space plot of SON latency and occupancy: the Power4 on-chip network and the non-scalable superscalar bypass are marked as reference points.]

Tiled Microprocessors (or "Tiled Multicore")

                             Superscalar   Multicore   Tiled Multicore
PE-PE communication          Free          Expensive   Cheap
Exploitation of parallelism  Implicit      Explicit    Both
Scalable                     No            Yes         Yes
Power efficient              No            Yes         Yes (w/ scalable SON)

(The same comparison holds at the level of ALU-to-ALU communication.)

Transforming from multicore or superscalar to tiled:
- Superscalar + scalability -> Tiled
- CMP/multicore + scalable SON -> Tiled

The interesting region, revisited
[The same design-space plot, with tiled designs added: Raw and Tiled "Famous Brand 2" land in the interesting region, alongside Power4 (on-chip) and the non-scalable superscalar.]

Scalability Problems in Wide-Issue Microprocessors
[Block diagram of a 16-issue processor: centralized control with a single PC, wide fetch (16 inst), a unified load/store queue, one register file (RF), and ALUs connected by a bypass network.]

Area and Frequency Scalability Problems
For N ALUs, the register file grows as ~N^3 and the bypass network as ~N^2 (ex: Itanium 2). Without modification, frequency decreases linearly or worse.
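One common back-of-envelope model behind these exponents: each of the N ALUs adds register-file ports, the area of a register-file cell grows with the square of its port count, and the number of registers grows with N, giving ~N^3; the bypass crossbar connects N producers to N consumers, giving ~N^2. A sketch, with constants and units deliberately omitted:

    #include <math.h>

    /* Illustrative growth models only; the exponents, not the values, are the point. */
    double rf_area(int n_alus)     { return pow(n_alus, 3); } /* ~N^2 cell area x ~N registers */
    double bypass_area(int n_alus) { return pow(n_alus, 2); } /* N-input x N-output crossbar   */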

Operand Routing is Global
[Diagram: an operand produced by one ALU (e.g., a shift feeding an add) is broadcast across the entire bypass network and register file.]

Idea: Make Operand Routing Local
[Diagram: the same datapath, but operands are routed only between neighboring ALUs.]

Idea: Exploit Locality
[Diagram: ALUs and register files grouped so that communicating instructions sit near one another.]

Replace the crossbar with a point-to-point, pipelined, routed scalar operand network.


Operand Transport Scaling: Bandwidth and Area
For N ALUs and N^(1/2) bisection bandwidth:

            Un-pipelined crossbar bypass   Point-to-point routed mesh network
Local BW    ~ N^(1/2)                      ~ N
Area        ~ N^2                          ~ N

The crossbar is as in a conventional superscalar; the mesh scales as 2-D VLSI. We can route more operands per unit time if we are able to map communicating instructions nearby.

Operand Transport Scaling: Latency
Time for an operand to travel between instructions mapped to different ALUs:

                           Un-pipelined crossbar   Point-to-point routed mesh network
Non-local placement        ~ N                     ~ N^(1/2)
Locality-driven placement  ~ N                     ~ 1

We get a latency bonus if we map communicating instructions nearby so that communication is local.
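Read together, the two tables give simple latency models. A sketch in C (function names are mine; units are arbitrary wire/hop delays):

    #include <math.h>

    double xbar_latency(int n_alus)          { return (double)n_alus; }       /* wire length grows ~N */
    double mesh_latency_nonlocal(int n_alus) { return sqrt((double)n_alus); } /* ~N^(1/2) hops across */
    double mesh_latency_local(int n_alus)    { (void)n_alus; return 1.0; }    /* neighbor-to-neighbor */

The payoff is multiplicative: locality-driven placement turns the mesh's ~N^(1/2) worst case into ~1, while no placement strategy helps the un-pipelined crossbar.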

Distribute the Register File
[Diagram: one small register file per ALU instead of a single central one.]

[Diagram: the ALU/RF/network portion of the datapath is now SCALABLE, but control, wide fetch (16 inst), the unified load/store queue, and the single PC remain centralized.]

More Scalability Problems
The centralized structures remain: control, wide fetch (16 inst), the unified load/store queue, and the single PC.

Distribute the rest: Raw, a Fully-Tiled Microprocessor
[Diagram: the centralized control, fetch, and load/store structures are replaced by a 4x4 array of tiles, each with its own instruction cache (I$), program counter (PC), data cache (D$), register file, and ALU.]


Tiles!

Tiled Microprocessors
- fast inter-tile communication through the SON
- easy to scale (for the same reasons as multicore)

Outline
1. Scalar operand network and tiled microprocessor intro
2. Raw architecture + SON
3. VLSI implementation of Raw, a scalable microprocessor with a scalar operand network

Raw Microprocessor
- Tiled, scalable microprocessor
- Point-to-point pipelined networks
- 16 tiles, 16 issue
- Each 4 mm x 4 mm tile:
  - MIPS-style compute processor: single-issue 8-stage pipe, 32b FPU, 32K D-cache and I-cache
  - 4 on-chip networks: two for operands, one for cache misses, one for message passing

Raw Microprocessor Components
[Tile diagram, labeled: compute processor (fetch unit, instruction cache, execution core with functional units, data cache); static router (switch processor with its own instruction cache and crossbars) carrying the intra-tile and inter-tile SON; dynamic routers "GDN" and "MDN" forming the generalized transport networks; inter-tile network links; trusted core vs. untrusted core.]

Raw Compute Processor Internals
[Pipeline diagram: the 8-stage compute pipeline (fetch through execute, memory, FP, and writeback stages), with registers r24-r27 mapped to the on-chip network ports at both the input and output ends.]
Ex: fadd r24, r25, r26

Tile-Tile Communication
The sending tile's compute processor writes its result directly into the network by targeting a network-mapped register: add $25,$1,$2. The sender's switch processor forwards the operand off-tile (Route $P->$E), the receiving tile's switch delivers it into its processor (Route $W->$P), and the receiving compute processor reads the operand straight off the network as a source register: sub $20,$1,$25.

Compilation
Example source:
  tmp3 = (seed*6+2)/3
  v2 = (tmp1 - tmp3)*5
  v1 = (tmp1 + tmp2)*3
  v0 = tmp0 - v1
  ...
[Diagram: the expanded instruction-level dataflow graph, partitioned across tiles.]
RawCC assigns instructions to the tiles, maximizing locality. It also generates the static router instructions that transfer operands between tiles.
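The slide's example, written out in plain C with an illustrative two-tile split of the kind RawCC might choose (the assignment shown is hypothetical, not the compiler's actual output):

    int seed, tmp0, tmp1, tmp2, tmp3, v0, v1, v2;

    void example(void) {
        /* Instructions placed on tile 0: */
        tmp3 = (seed * 6 + 2) / 3;   /* tmp3 forwarded over the static SON */
        v1   = (tmp1 + tmp2) * 3;    /* v1 forwarded as well               */

        /* Instructions placed on tile 1, reading tmp3 and v1 off the network: */
        v2   = (tmp1 - tmp3) * 5;
        v0   = tmp0 - v1;
    }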

One cycle in the life of a tiled micro
[Chip diagram: at the same instant, different tile groups run an httpd server, a 4-way automatically parallelized C program, and a 2-thread MPI app; another region takes a direct I/O stream into the scalar operand network; one tile accesses memory; unused tiles sleep ("Zzz...").]
An application uses only as many tiles as needed to exploit the parallelism intrinsic to that application…

One Streaming Application on Raw
[Diagram: the 16 tiles (Tile 0 through Tile 15) with the application's communication pattern overlaid.]
This produces very different traffic patterns than RawCC-style parallelization.

Auto-Parallelization Approach #2: StreamIt Language + Compiler
[Stream graphs, original and after fusion: splitters and joiners route data through parallel Vec Mult -> FIRFilter -> Magnitude -> Detector pipelines; fusion collapses the graph into fewer, coarser filters.]

End Results – auto-parallelized by MIT StreamIt to 8 tiles.
[Layout diagram: a FIRFilter, a Joiner, and two Vec Mult -> FIRFilter -> Magnitude -> Detector pipelines mapped onto the tiles.]

AsTrO Taxonomy: Classifying SON Diversity
- Assignment (Static/Dynamic): Is the assignment of instructions to ALUs predetermined?
- Transport (Static/Dynamic): Are operand routes predetermined?
- Ordering (Static/Dynamic): Is the execution order of the instructions assigned to a node predetermined?

Microprocessor SON diversity using the AsTrO taxonomy (Assignment, Transport, Ordering):
- Raw: Static, Static, Static
- RawDyn: Static, Dynamic, Static
- Scale: Static, Dynamic, Static
- TRIPS: Static, Dynamic, Dynamic
- ILDP: Dynamic, Dynamic, Static
- WaveScalar: Dynamic, Dynamic, Dynamic
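The taxonomy is compact enough to encode directly; a sketch in C reproducing the classification above (the enum and struct names are mine):

    typedef enum { S /* static */, D /* dynamic */ } astro_t;

    typedef struct {
        const char *son;
        astro_t assignment;  /* instruction -> ALU mapping predetermined? */
        astro_t transport;   /* operand routes predetermined?             */
        astro_t ordering;    /* per-node execution order predetermined?   */
    } astro_class;

    static const astro_class sons[] = {
        { "Raw",        S, S, S },
        { "RawDyn",     S, D, S },
        { "Scale",      S, D, S },
        { "TRIPS",      S, D, D },
        { "ILDP",       D, D, S },
        { "WaveScalar", D, D, D },
    };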

Outline
1. Scalar operand network and tiled microprocessor intro
2. Raw architecture + SON
3. VLSI implementation of Raw, a scalable microprocessor with a scalar operand network

Raw Chips, October 2002
[Photograph.]

Raw
- 16 tiles (16 issue)
- 180 nm ASIC (IBM SA-27E)
- ~100 million transistors, 1 million gates
- 3-4 years of development, 1.5 years of testing, 200K lines of test code
- Core frequency competitive with IBM-implemented PowerPCs in the same process
- 18 W average power

Raw motherboard: support chipset implemented in an FPGA.

A Scalable Microprocessor in Action [Taylor et al., ISCA '04]

Conclusions
- Scalability problems in general-purpose processors can be addressed by tiling resources across a scalable, low-latency, low-occupancy scalar operand network (SON).
- These SONs can be characterized by a 5-tuple and the AsTrO classification.
- The 180 nm, 16-issue Raw prototype shows that the approach is feasible; 64+-issue is possible in today's VLSI processes.
- Multicore machines could benefit from adding an inter-node SON for cheap communication.
