Creating a Scalable Microprocessor: A 16-issue Multiple-Program-Counter Microprocessor With Point-to-Point Scalar Operand Network Michael Bedford Taylor.

Slides:



Advertisements
Similar presentations
Final Project : Pipelined Microprocessor Joseph Kim.
Advertisements

Dr. Rabie A. Ramadan Al-Azhar University Lecture 3
The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric David Wentzlaff, Michael Taylor, Jason Kim, Jason Miller, Fae Ghodrat,
THE RAW MICROPROCESSOR: A COMPUTATIONAL FABRIC FOR SOFTWARE CIRCUITS AND GENERAL- PURPOSE PROGRAMS Taylor, M.B.; Kim, J.; Miller, J.; Wentzlaff, D.; Ghodrat,
Microprocessor Microarchitecture Multithreading Lynn Choi School of Electrical Engineering.
Khaled A. Al-Utaibi  Computers are Every Where  What is Computer Engineering?  Design Levels  Computer Engineering Fields  What.
Instruction-Level Parallelism (ILP)
Presenter: Jeremy W. Webb Course: EEC 289Q: Reconfigurable Computing Course Instructor: Professor Soheil Ghiasi Processor Architectures At A Glance: M.I.T.
CGRA QUIZ. Quiz What is the fundamental drawback of fine-grained architecture that led to exploration of coarse grained reconfigurable architectures?
The Raw Processor: A Scalable 32 bit Fabric for General Purpose and Embedded Computing Presented at Hotchips 13 On August 21, 2001 by Michael Bedford Taylor.
A tiled processor architecture prototype: the Raw microprocessor October 02.
Computational Astrophysics: Methodology 1.Identify astrophysical problem 2.Write down corresponding equations 3.Identify numerical algorithm 4.Find a computer.
1 Digital Space Anant Agarwal MIT and Tilera Corporation.
Scalar Operand Networks for Tiled Microprocessors Michael Taylor Raw Architecture Project MIT CSAIL (now at UCSD)
University College Cork IRELAND Hardware Concepts An understanding of computer hardware is a vital prerequisite for the study of operating systems.
CS 7810 Lecture 24 The Cell Processor H. Peter Hofstee Proceedings of HPCA-11 February 2005.
Intel IA-64 Architecture Chehun Kim Glenn Ramos. Contents *Pipelining - Stages of pipelining *Microprogramming *Interconnection Structures.
An Efficient Programmable 10 Gigabit Ethernet Network Interface Card Paul Willmann, Hyong-youb Kim, Scott Rixner, and Vijay S. Pai.
Distributed Microarchitectural Protocols in the TRIPS Prototype Processor Sankaralingam et al. Presented by Cynthia Sturton CS 258 3/3/08.
CS-334: Computer Architecture
Evaluating the Raw microprocessor Michael Bedford Taylor Raw Architecture Group Computer Science and AI Laboratory Massachusetts Institute of Technology.
Gigabit Routing on a Software-exposed Tiled-Microprocessor
Computer Organization CSC 405 Bus Structure. System Bus Functions and Features A bus is a common pathway across which data can travel within a computer.
Blue Gene / C Cellular architecture 64-bit Cyclops64 chip: –500 Mhz –80 processors ( each has 2 thread units and a FP unit) Software –Cyclops64 exposes.
1 Lecture 7: Part 2: Message Passing Multicomputers (Distributed Memory Machines)
C.S. Choy95 COMPUTER ORGANIZATION Logic Design Skill to design digital components JAVA Language Skill to program a computer Computer Organization Skill.
Interconnection Structures
Introduction to Interconnection Networks. Introduction to Interconnection network Digital systems(DS) are pervasive in modern society. Digital computers.
ECE 526 – Network Processing Systems Design Network Processor Architecture and Scalability Chapter 13,14: D. E. Comer.
Exercise 2 The Motherboard
CHAPTER 3 TOP LEVEL VIEW OF COMPUTER FUNCTION AND INTERCONNECTION
High-Level Interconnect Architectures for FPGAs An investigation into network-based interconnect systems for existing and future FPGA architectures Nick.
TILEmpower-Gx36 - Architecture overview & performance benchmarks – Presented by Younghyun Jo 2013/12/18.
High-Level Interconnect Architectures for FPGAs Nick Barrow-Williams.
Frank Casilio Computer Engineering May 15, 1997 Multithreaded Processors.
CSE 661 PAPER PRESENTATION
Copyright © 2007 Heathkit Company, Inc. All Rights Reserved PC Fundamentals Presentation 3 – The Motherboard.
General Concepts of Computer Organization Overview of Microcomputer.
Dynamic Pipelines. Interstage Buffers Superscalar Pipeline Stages In Program Order In Program Order Out of Order.
Processor and Memory Organisation By: Prof. Mahendra B. Salunke Asst. Prof., Department of Computer Engg, SITS, Pune-41 URL:
Computer Organization & Assembly Language © by DR. M. Amer.
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) Reconfigurable Architectures Forces that drive.
ECEG-3202 Computer Architecture and Organization Chapter 3 Top Level View of Computer Function and Interconnection.
Authors – Jeahyuk huh, Doug Burger, and Stephen W.Keckler Presenter – Sushma Myneni Exploring the Design Space of Future CMPs.
On-chip Parallelism Alvin R. Lebeck CPS 221 Week 13, Lecture 2.
Computer Architecture Introduction Lynn Choi Korea University.
UltraSPARC III Hari P. Ananthanarayanan Anand S. Rajan.
1 Chapter 2 Central Processing Unit. 2 CPU The "brain" of the computer system is called the central processing unit. Everything that a computer does is.
The Raw Architecture A Concrete Perspective Michael Bedford Taylor Raw Architecture Group Laboratory for Computer Science Massachusetts Institute of Technology.
Computer operation is of how the different parts of a computer system work together to perform a task.
The Alpha – Data Stream Matt Ziegler.
Different Microprocessors Tamanna Haque Nipa Lecturer Dept. of Computer Science Stamford University Bangladesh.
11 Pipelining Kosarev Nikolay MIPT Oct, Pipelining Implementation technique whereby multiple instructions are overlapped in execution Each pipeline.
Raw Status Update Chips & Fabrics James Psota M.I.T. Computer Architecture Workshop 9/19/03.
High-Bandwidth Packet Switching on the Raw General-Purpose Architecture Gleb Chuvpilo Saman Amarasinghe MIT LCS Computer Architecture Group January 9,
1 Versatile Tiled-Processor Architectures The Raw Approach Rodric M. Rabbah with Ian Bratt, Krste Asanovic, Anant Agarwal.
Chapter 3 System Buses.  Hardwired systems are inflexible  General purpose hardware can do different tasks, given correct control signals  Instead.
Evaluating The Raw Microprocessor: Scalability and Versatility Michael Taylor Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry.
On-chip Parallelism Alvin R. Lebeck CPS 220/ECE 252.
Chapter 11 System Performance Enhancement. Basic Operation of a Computer l Program is loaded into memory l Instruction is fetched from memory l Operands.
UltraSparc IV Tolga TOLGAY. OUTLINE Introduction History What is new? Chip Multitreading Pipeline Cache Branch Prediction Conclusion Introduction History.
Niagara: A 32-Way Multithreaded Sparc Processor Kongetira, Aingaran, Olukotun Presentation by: Mohamed Abuobaida Mohamed For COE502 : Parallel Processing.
Packet Switching on Raw
Basic Computer Organization
Michael Bedford Taylor, Walter Lee, Saman Amarasinghe, Anant Agarwal
RAW Scott J Weber Diagrams from and summary of:
Michael Bedford Taylor, Walter Lee, Saman Amarasinghe, Anant Agarwal
ADSP 21065L.
Presentation transcript:

Creating a Scalable Microprocessor: A 16-issue Multiple-Program-Counter Microprocessor With Point-to-Point Scalar Operand Network Michael Bedford Taylor J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffmann, P. Johnson, W. Lee, A. Saraf, N. Shnidman, V. Strumpen, Saman Amarasinghe, Anant Agarwal Raw Architecture Group Laboratory for Computer Science Massachusetts Institute of Technology

Creating a Scalable Microprocessor Motivation The Raw Architecture The Raw Prototype

Motivation INT6 As a thought experiment, let’s examine the Itanium II, published in last year’s ISSCC: 6-way issue Integer Unit < 2% die area Cache logic > 50% die area Cache logic

Hypothetical Microprocessor INT6 Why not replace a small portion of the cache with additional issue units? “30-way” issue micro! Integer Units still occupy less than 10% area > 42 % cache logic INT6

Scalability Problems Compound Fetch unit fetches 30 instructions every cycle Decode 30 instructions every cycle Check dependencies on 30 instructions every cycle Register file needs 60 read ports and 30 write ports -- and 100’s of physical registers just to keep all of the live values around Bypassing: a 30-input 60-output, 32(64)-bit crossbar L1 cache will need many read and write ports -- and like the register file, be larger

Can monolithic structures like this be attained at high frequency? The 6-way integer unit in Itanium II already spends 50% of its critical path in bypassing. [ISSCC 2002 – 25.6] Even if dynamic logic or logarithmic circuits could be used to flatten the number of logic levels of these huge structures –

...wire delay is inescapable 1 cycle 180 nm 45 nm Ultimately, wire delay limits the scalability of un-pipelined, high-frequency, centralized structures.

Raw addresses scalability by... -Fetch Unit -Decode -Register File -L1 Data Cache and Instruction Caches -Control -Stall signals -I/O, memory system, interrupts -Operand Bypassing The only centralized resource is a global clock. (we could get rid of that too)  Distributing everything over an interconnection network 

Idea: distribute everything over a generalized on-chip network I$RFD$I$RFD$I$RFD$I$RFD$ “Artists View” Generalized Interconnect Distributed Resources

The Raw Architecture Divide the silicon into an array of 16 identical, programmable tiles. (A signal can get through a small amount of logic and to the next tile in one cycle.)

The Raw Tile 8 stage 32b MIPS-style single-issue in-order compute processor Tile 4-stage 32b pipelined FPU 32 KB DCache 32 KB ICache Routers and wires for three on-chip mesh networks

Tiling has gotten us this far... Fetch Units Decoders Register Files L1 I/D Caches Control Stall signals I/O, memory system, interrupts Inter-tile Bypassing 16 instructions / cycle 32 read / 16 write ports 512 registers 32 read / write ports 1 MB Using the on-chip networks. Why Raw is not just a CMP. 16 instructions / cycle aggregate totals for a 16 tile processor

Raw’s three on-chip mesh networks Compute Pipeline Registered at input  longest wire = length of tile ( Mhz) 8 32-bit channels

Raw’s three on-chip networks “Memory Network” dynamic, dim. order wormhole routed cache misses, I/O, DMA, OS “General Network” dynamic, dim. order wormhole routed user-level, message passing programs “Scalar Operand Network” For communication of operands among local and remote ALUs

Raw’s Scalar Operand Network Consists of two tightly-coupled sub-networks: Tile interconnection network For communication of operands between tiles Controlled by the 16 tiles’ static router processors Local bypass network For communication of operands within a tile

Between tile operand transport The routes programmed into the static router ICache guarantee in-order delivery of operands between tiles at a rate of 2 words/direction/cycle. Compute Pipeline static router crossbar static router crossbar Compute Pipeline Static Router ICache JZ P, label route P->E,S FIFO Static Router ICache JZ W, label route W->P FIFO

IFRFD ATL M1M2 FP E U TV F4WB r26 r27 r25 r24 Input FIFOs from Static Router r26 r27 r25 r24 Output FIFOs to Static Router Ex: lb r25, 0x341(r26) 0-cycle “local bypass network” Operand Transport among functional units and the static router

Raw also distributes: Control Branch conditions are transmitted over scalar operand network. Each affected tile and static router executes a branch instruction. Tiles can have vastly different control flow. Stall signals Inter-tile stall conditions are conveyed only by stalls on input and output FIFOs, usually reflecting an operand dependence between the two tiles. I/O, memory system, interrupts Inter-tile Bypassing 3 cycles nearest neighbor 1 cycle per hop

Raw’s I/O and Memory System Raw chipset DRAM PCI x 2 DRAM D/A Routes on any network off the edge of the chip appear on the pins Gb/s channels ( Mhz) Tiles cache-miss independently to the DRAM that owns a given cache line. Using on-chip network, off-chip devices send data (DMA) to each other and interrupts to one or more tiles.

tmp3 = (seed*6+2)/3 v2 = (tmp1 - tmp3)*5 v1 = (tmp1 + tmp2)*3 v0 = tmp0 - v1 …. pval5=seed.0*6.0 pval4=pval5+2.0 tmp3.6=pval4/3.0 tmp3=tmp3.6 v3.10=tmp3.6-v2.7 v3=v3.10 v2.4=v2 pval3=seed.o*v2.4 tmp2.5=pval3+2.0 tmp2=tmp2.5 pval6=tmp1.3-tmp2.5 v2.7=pval6*5.0 v2=v2.7 seed.0=seed pval1=seed.0*3.0 pval0=pval1+2.0 tmp0.1=pval0/2.0 tmp0=tmp0.1 v1.2=v1 pval2=seed.0*v1.2 tmp1.3=pval2+2.0 tmp1=tmp1.3 pval7=tmp1.3+tmp2.5 v1.8=pval7*3.0 v1=v1.8 v0.9=tmp0.1-v1.8 v0=v0.9 pval5=seed.0*6.0 pval4=pval5+2.0 tmp3.6=pval4/3.0 tmp3=tmp3.6 v3.10=tmp3.6-v2.7 v3=v3.10 v2.4=v2 pval3=seed.o*v2.4 tmp2.5=pval3+2.0 tmp2=tmp2.5 pval6=tmp1.3-tmp2.5 v2.7=pval6*5.0 v2=v2.7 seed.0=seed pval1=seed.0*3.0 pval0=pval1+2.0 tmp0.1=pval0/2.0 tmp0=tmp0.1 v1.2=v1 pval2=seed.0*v1.2 tmp1.3=pval2+2.0 tmp1=tmp1.3 pval7=tmp1.3+tmp2.5 v1.8=pval7*3.0 v1=v1.8 v0.9=tmp0.1-v1.8 v0=v0.9 The Raw compiler assigns instructions to the tiles, maximizing locality. It also generates the static router instructions that transfer operands between tiles. Compilation

A Raw System in Action httpd 4-way automatically parallelized C program 2-thread MPI app Direct I/O stream into Scalar Operand Network mem Zzz... Note that an application uses only as many tiles as needed to exploit the parallelism intrinsic to that application.

1 cycle 180 nm 90 nm Scalability to Larger Issue Width 16-issue64-issue Just stamp out more tiles! Longest wire, frequency, design and verification complexity all independent of issue width. Architecture is backwards compatible.

The Raw Prototype 16 tile processor in an ASIC process - Intended as an experimental prototype for future processor designs - Imagine a version implemented by a full-custom design team The architecture is targeted for systems from 16-issue to 1024-issue – this prototype is at the beginning of this range. We’re just coming to VLSI processes where this architecture starts to make sense.

18.2 mm x 18.2 mm 16 Flops/ops per cycle 208 Operand Routes / cycle 2048 KB L1 SRAM 1657 Pin CCGA Package 1080 HSTL core-speed signal I/O Raw ASIC IBM SA-27E.15u 6L Cu 3.6 Peak GFLOPS (without FMAC) 230 Gb/s on-chip bisection bandwidth 201 Gb/s off-chip I/O 225 MHz Worst Case (Temp, Vdd, process):

Close-up of a single Raw tile (design one and the rest will follow!)

A lot like a PC motherboard. 64-bit PCI slots, DIMM slots, PS/2 keyboard, ATX power, and..... twenty-eight 32-bit  225 MHz Connecting I/O Devices and Raw Chip. Raw Motherboard

Raw Chip Raw Chip Raw Chip Raw Chip Raw Chip Raw Chip Raw Chip Raw Chip Raw Chip Raw Chip Raw Chip Raw Chip Raw Chip Raw Chip Raw Chip Raw Chip 128MB SDRAM 128MB SDRAM XCV 2000E FPGA 128MB SDRAM 128MB SDRAM XCV 2000E FPGA 128MB SDRAM 128MB SDRAM XCV 2000E FPGA 128MB SDRAM 128MB SDRAM XCV 2000E FPGA This 16 chip array would approximate a 256 tile Raw from a 45 nm process. Raw chips gluelessly connect to form larger virtual chips up to 32x32 tiles.

Summary Centralized structures are one of the key impediments to microprocessor scalability. The Raw architecture scales because it distributes everything over interconnection networks. - Raw uses a routed, point-to-point scalar operand network to transport operands among functional units with very low latency. We’ve designed and built a complete 16-issue prototype system, including a compiler, a chip, and a motherboard.