The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT

Slides:



Advertisements
Similar presentations
Larrabee Eric Jogerst Cortlandt Schoonover Francis Tan.
Advertisements

Chapter 1. Basic Structure of Computers
3-Software Design Basics in Embedded Systems
Chapter 3 General-Purpose Processors: Software
Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.
The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric David Wentzlaff, Michael Taylor, Jason Kim, Jason Miller, Fae Ghodrat,
THE RAW MICROPROCESSOR: A COMPUTATIONAL FABRIC FOR SOFTWARE CIRCUITS AND GENERAL- PURPOSE PROGRAMS Taylor, M.B.; Kim, J.; Miller, J.; Wentzlaff, D.; Ghodrat,
Presenter: Jeremy W. Webb Course: EEC 289Q: Reconfigurable Computing Course Instructor: Professor Soheil Ghiasi Processor Architectures At A Glance: M.I.T.
CGRA QUIZ. Quiz What is the fundamental drawback of fine-grained architecture that led to exploration of coarse grained reconfigurable architectures?
Lecture 9: Coarse Grained FPGA Architecture October 6, 2004 ECE 697F Reconfigurable Computing Lecture 9 Coarse Grained FPGA Architecture.
Taxanomy of parallel machines. Taxonomy of parallel machines Memory – Shared mem. – Distributed mem. Control – SIMD – MIMD.
The Raw Processor: A Scalable 32 bit Fabric for General Purpose and Embedded Computing Presented at Hotchips 13 On August 21, 2001 by Michael Bedford Taylor.
A tiled processor architecture prototype: the Raw microprocessor October 02.
Lecture 26: Reconfigurable Computing May 11, 2004 ECE 669 Parallel Computer Architecture Reconfigurable Computing.
1 Digital Space Anant Agarwal MIT and Tilera Corporation.
Scalar Operand Networks for Tiled Microprocessors Michael Taylor Raw Architecture Project MIT CSAIL (now at UCSD)
Performance Analysis of the IXP1200 Network Processor Rajesh Krishna Balan and Urs Hengartner.
Hot Chips 16August 24, 2004 OptimoDE: Programmable Accelerator Engines Through Retargetable Customization Nathan Clark, Hongtao Zhong, Kevin Fan, Scott.
PSU CS 106 Computing Fundamentals II Introduction HM 1/3/2009.
ECE 526 – Network Processing Systems Design
SSS 4/9/99CMU Reconfigurable Computing1 The CMU Reconfigurable Computing Project April 9, 1999 Mihai Budiu
ECE669 L23: Parallel Compilation April 29, 2004 ECE 669 Parallel Computer Architecture Lecture 23 Parallel Compilation.
Processor Types And Instruction Sets Barak Perelman CS147 Prof. Lee.
COM181 Computer Hardware Ian McCrumRoom 5B18,
Single-Chip Multi-Processors (CMP) PRADEEP DANDAMUDI 1 ELEC , Fall 08.
Evaluating the Raw microprocessor Michael Bedford Taylor Raw Architecture Group Computer Science and AI Laboratory Massachusetts Institute of Technology.
Gigabit Routing on a Software-exposed Tiled-Microprocessor
ECEn 191 – New Student Seminar - Session 8: Computer Systems ECEn 191 – New Student Seminar – Session 7: Computer Systems Computer Systems ECEn 191 New.
February 12, 1998 Aman Sareen DPGA-Coupled Microprocessors Commodity IC’s for the Early 21st Century by Aman Sareen School of Electrical Engineering and.
Computer Architecture ECE 4801 Berk Sunar Erkay Savas.
 Design model for a computer  Named after John von Neuman  Instructions that tell the computer what to do are stored in memory  Stored program Memory.
Chapter 1 Basic Structure of Computers. Chapter Outline computer types, structure, and operation instructions and programs numbers, arithmetic operations,
COMPUTER SCIENCE &ENGINEERING Compiled code acceleration on FPGAs W. Najjar, B.Buyukkurt, Z.Guo, J. Villareal, J. Cortes, A. Mitra Computer Science & Engineering.
ECEn 191 – New Student Seminar - Session 9: Microprocessors, Digital Design Microprocessors and Digital Design ECEn 191 New Student Seminar.
COMP3221 lec04--prog-model.1 Saeid Nooshabadi COMP 3221 Microprocessors and Embedded Systems Lecture 4: Programmer’s Model of Microprocessors
1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.
Spring 2007Lecture 16 Heterogeneous Systems (Thanks to Wen-Mei Hwu for many of the figures)
RICE UNIVERSITY ‘Stream’-based wireless computing Sridhar Rajagopal Research group meeting December 17, 2002 The figures used in the slides are borrowed.
VTU – IISc Workshop Compiler, Architecture and HPC Research in Heterogeneous Multi-Core Era R. Govindarajan CSA & SERC, IISc
Computer Organization and Design Computer Abstractions and Technology
Jump to first page One-gigabit Router Oskar E. Bruening and Cemal Akcaba Advisor: Prof. Agarwal.
L11: Lower Power High Level Synthesis(2) 성균관대학교 조 준 동 교수
Embedded Systems Design: A Unified Hardware/Software Introduction 1 Chapter 3 General-Purpose Processors: Software.
General Concepts of Computer Organization Overview of Microcomputer.
Computer Architecture And Organization UNIT-II General System Architecture.
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) Reconfigurable Architectures Forces that drive.
3/29: Processors Roll Call Lecture: CPU’s Other (?)
Chapter Overview Microprocessors Replacing and Upgrading a CPU.
12/13/ _01 1 Computer Organization EEC-213 Computer Organization Electrical and Computer Engineering.
Electronic Analog Computer Dr. Amin Danial Asham by.
CPU/BIOS/BUS CES Industries, Inc. Lesson 8.  Brain of the computer  It is a “Logical Child, that is brain dead”  It can only run programs, and follow.
Lecture 3: Computer Architectures
Baring It All to Software: Raw Machines E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb,
CBP 2002ITY 270 Computer Architecture1 Module Structure Whirlwind Review – Fetch-Execute Simulation Instruction Set Architectures RISC vs x86 How to build.
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) CPRE 583 Reconfigurable Computing Lecture.
High-Bandwidth Packet Switching on the Raw General-Purpose Architecture Gleb Chuvpilo Saman Amarasinghe MIT LCS Computer Architecture Group January 9,
1 Versatile Tiled-Processor Architectures The Raw Approach Rodric M. Rabbah with Ian Bratt, Krste Asanovic, Anant Agarwal.
Creating a Scalable Microprocessor: A 16-issue Multiple-Program-Counter Microprocessor With Point-to-Point Scalar Operand Network Michael Bedford Taylor.
Evaluating The Raw Microprocessor: Scalability and Versatility Michael Taylor Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry.
My Coordinates Office EM G.27 contact time:
1 TM 1 Embedded Systems Lab./Honam University ARM Microprocessor Programming Model.
Lecture 1: Introduction CprE 585 Advanced Computer Architecture, Fall 2004 Zhao Zhang.
Spring 2003CSE P5481 WaveScalar and the WaveCache Steven Swanson Ken Michelson Mark Oskin Tom Anderson Susan Eggers University of Washington.
Lynn Choi School of Electrical Engineering
Packet Switching on Raw
CHAINSAW Von-Neumann Accelerators To Leverage Fused Instruction Chains
Michael Bedford Taylor, Walter Lee, Saman Amarasinghe, Anant Agarwal
RAW Scott J Weber Diagrams from and summary of:
Michael Bedford Taylor, Walter Lee, Saman Amarasinghe, Anant Agarwal
CSE 502: Computer Architecture
Presentation transcript:

The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT

A tiled processor architecture prototype: the Raw microprocessor October 02

Embedded system: 1020 Element Microphone Array

1020 Node Beamformer Demo. 2 People moving about and talking. Track movement of people using camera (vision group) and display on monitor. Beamformer focuses on speech of one person. Select another person using mouse. Beamformer switches focus to the speech of the other person

The opportunity 20MIPS cpu 1987

2007 The billion transistor chip The opportunity

Enables seeking out new application domains o Redefine our notion of a general purpose processor o Imagine a single-chip handheld that is a speech driven cellphone, camera, PDA, MP3 player, video engine o Imagine a single-chip PC that is also a 10G router, wireless access point, graphics engine o While running the gamut of existing desktop binaries -- A new versatile processor -- APU* *Asanovic

But, where is general purpose computing today? Other…Encryption Sound EthernetWireless Graphics 0.25TFLOPS X86 Pentium IV ASICs

How does the ASIC do it? Lots of ALUs, lots of registers, lots of small memories Hand-routed, short wires Lower power (everything close by) Stream data model for high throughput But, not general purpose mem

Our challenge How to exploit ASIC-like features Lots of resources like ALUs and memories Application-specific routing of short wires While being general purpose Programmable And even running ILP-based sequential programs One Approach: Tiled Processor Architecture (TPA)

Tiled Processor Architecture (TPA) SMEM SWITCH PC Short, prog. wires DMEM IMEM REG PC FPU ALU Lots of ALUs and regs Tile Lower power Programmable. Supports ILP and Streams

A Prototype TPA: The Raw Microprocessor The Raw Chip Software-scheduled interconnects (can use static or dynamic routing – but compiler determines instruction placement and routes) Tile Disk stream Video1 RDRAM Packet stream A Raw Tile SMEM SWITCH PC DMEM IMEM REG PC FPU ALU Raw Switch PC SMEM [Billion transistor IEEE Computer Issue 97]

Tight integration of interconnect The Raw Chip Tile Disk stream Video1 RDRAM Packet stream A Raw Tile SMEM SWITCH PC DMEM IMEM REG PC FPU ALU IF RF D ATL M1 FP E U WB r26 r27 r25 r24 Input FIFOs from Static Router r26 r27 r25 r24 Output FIFOs to Static Router 0-cycle local bypass network M2 TV F4 Point-to-point bypass-integrated compiler-orchestrated on-chip networks

Tile 11 fmul r24, r3, r4 software controlled crossbar software controlled crossbar fadd r5, r3, r25 route P->E route W->P Tile 10 How to program the wires

The result of orchestrating the wires Custom Datapath Pipeline mem httpd C program ILP computation MPI program Zzzz

Perspective We have replaced Bypass paths, ALU-reg bus, FPU-Int. bus, reg-cache-bus, cache-mem bus, etc. With a general, point-to-point, routed interconnect called: Scalar operand network (SON) Fundamentally new kind of network optimized for both scalar and stream transport

Programming models and software for tiled processor architectures o Conventional scalar programs (C, C++, Java) Or, how to do ILP o Stream programs

Scalar (ILP) program mapping v2.4 = v2 seed.0 = seed v1.2 = v1 pval1 = seed.0 * 3.0 pval0 = pval tmp0.1 = pval0 / 2.0 pval2 = seed.0 * v1.2 tmp1.3 = pval pval3 = seed.0 * v2.4 tmp2.5 = pval pval5 = seed.0 * 6.0 pval4 = pval tmp3.6 = pval4 / 3.0 pval6 = tmp1.3 - tmp2.5 v2.7 = pval6 * 5.0 pval7 = tmp1.3 + tmp2.5 v1.8 = pval7 * 3.0 v0.9 = tmp0.1 - v1.8 v3.10 = tmp3.6 - v2.7 tmp2 = tmp2.5 v1 = v1.8; tmp1 = tmp1.3 v0 = v0.9 tmp0 = tmp0.1 v3 = v3.10 tmp3 = tmp3.6 v2 = v2.7 E.g., Start with a C program, and several transformations later: Lee, Amarasinghe et al, Space-time scheduling, ASPLOS 98 Existing languages will work

pval5=seed.0*6.0 pval4=pval5+2.0 tmp3.6=pval4/3.0 tmp3=tmp3.6 v3.10=tmp3.6-v2.7 v3=v3.10 v2.4=v2 pval3=seed.o*v2.4 tmp2.5=pval3+2.0 tmp2=tmp2.5 pval6=tmp1.3-tmp2.5 v2.7=pval6*5.0 v2=v2.7 seed.0=seed pval1=seed.0*3.0 pval0=pval1+2.0 tmp0.1=pval0/2.0 tmp0=tmp0.1 v1.2=v1 pval2=seed.0*v1.2 tmp1.3=pval2+2.0 tmp1=tmp1.3 pval7=tmp1.3+tmp2.5 v1.8=pval7*3.0 v1=v1.8 v0.9=tmp0.1-v1.8 v0=v0.9 Scalar program mapping v2.4 = v2 seed.0 = seed v1.2 = v1 pval1 = seed.0 * 3.0 pval0 = pval tmp0.1 = pval0 / 2.0 pval2 = seed.0 * v1.2 tmp1.3 = pval pval3 = seed.0 * v2.4 tmp2.5 = pval pval5 = seed.0 * 6.0 pval4 = pval tmp3.6 = pval4 / 3.0 pval6 = tmp1.3 - tmp2.5 v2.7 = pval6 * 5.0 pval7 = tmp1.3 + tmp2.5 v1.8 = pval7 * 3.0 v0.9 = tmp0.1 - v1.8 v3.10 = tmp3.6 - v2.7 tmp2 = tmp2.5 v1 = v1.8; tmp1 = tmp1.3 v0 = v0.9 tmp0 = tmp0.1 v3 = v3.10 tmp3 = tmp3.6 v2 = v2.7 Graph Program code

pval5=seed.0*6.0 pval4=pval5+2.0 tmp3.6=pval4/3.0 tmp3=tmp3.6 v3.10=tmp3.6-v2.7 v3=v3.10 v2.4=v2 pval3=seed.o*v2.4 tmp2.5=pval3+2.0 tmp2=tmp2.5 pval6=tmp1.3-tmp2.5 v2.7=pval6*5.0 v2=v2.7 seed.0=seed pval1=seed.0*3.0 pval0=pval1+2.0 tmp0.1=pval0/2.0 tmp0=tmp0.1 v1.2=v1 pval2=seed.0*v1.2 tmp1.3=pval2+2.0 tmp1=tmp1.3 pval7=tmp1.3+tmp2.5 v1.8=pval7*3.0 v1=v1.8 v0.9=tmp0.1-v1.8 v0=v0.9 Program graph clustering

Placement seed.0=seed pval1=seed.0*3.0 pval0=pval1+2.0 tmp0.1=pval0/2.0 tmp0=tmp0.1 v1.2=v1 pval2=seed.0*v1.2 tmp1.3=pval2+2.0 tmp1=tmp1.3 pval7=tmp1.3+tmp2.5 v1.8=pval7*3.0 v1=v1.8 v0.9=tmp0.1-v1.8 v0=v0.9 pval5=seed.0*6.0 pval4=pval5+2.0 tmp3.6=pval4/3.0 tmp3=tmp3.6 v3.10=tmp3.6-v2.7 v3=v3.10 v2.4=v2 pval3=seed.o*v2.4 tmp2.5=pval3+2.0 tmp2=tmp2.5 pval6=tmp1.3-tmp2.5 v2.7=pval6*5.0 v2=v2.7 Tile1 Tile2 Tile3 Tile4

Routing Processor code seed.0=recv() pval5=seed.0*6.0 pval4=pval5+2.0 tmp3.6=pval4/3.0 tmp3=tmp3.6 v2.7=recv() v3.10=tmp3.6-v2.7 v3=v3.10 route(W,S,t) route(W,S) route(S,t) Tile2 v2.4=v2 seed.0=recv(0) pval3=seed.o*v2.4 tmp2.5=pval3+2.0 tmp2=tmp2.5 send(tmp2.5) tmp1.3=recv() pval6=tmp1.3-tmp2.5 v2.7=pval6*5.0 Send(v2.7) v2=v2.7 route(N,t) route(t,E) route(E,t) route(t,E) Tile3 v1.2=v1 seed.0=recv() pval2=seed.0*v1.2 tmp1.3=pval2+2.0 send(tmp1.3) tmp1=tmp1.3 tmp2.5=recv() pval7=tmp1.3+tmp2.5 v1.8=pval7*3.0 v1=v1.8 tmp0.1=recv() v0.9=tmp0.1-v1.8 v0=v0.9 route(N,t) route(t,W) route(W,t) route(N,t) route(W,N) Tile4 seed.0=seed send(seed.0) pval1=seed.0*3.0 pval0=pval1+2.0 tmp0.1=pval0/2.0 send(tmp0.1) tmp0=tmp0.1 route(t,E,S) route(t,E) Tile1 Switch code

Instruction Scheduling seed.0=recv() pval5=seed.0*6.0 pval4=pval5+2.0 tmp3.6=pval4/3.0 tmp3=tmp3.6 v2.7=recv() v3.10=tmp3.6-v2.7 v3=v3.10 route(W,t) route(W,S) route(S,t) send(seed.0) pval1=seed.0*3.0 pval0=pval1+2.0 tmp0.1=pval0/2.0 send(tmp0.1) tmp0=tmp0.1 route(t,E) v2.4=v2 seed.0=recv(0) pval3=seed.o*v2.4 tmp2.5=pval3+2.0 tmp2=tmp2.5 send(tmp2.5) tmp1.3=recv() pval6=tmp1.3-tmp2.5 v2.7=pval6*5.0 Send(v2.7) v2=v2.7 route(N,t) route(t,E) route(E,t) route(t,E) v1.2=v1 seed.0=recv() pval2=seed.0*v1.2 tmp1.3=pval2+2.0 send(tmp1.3) tmp1=tmp1.3 tmp2.5=recv() pval7=tmp1.3+tmp2.5 v1.8=pval7*3.0 v1=v1.8 v0.9=tmp0.1-v1.8 v0=v0.9 route(N,t) route(W,N) seed.0=seed route(W,S) tmp0.1=recv() route(t,W) route(W,t) route(W,N) route(t,E) Tile1 Tile3 Tile4 Tile2 time

class BeamFormer extends Pipeline { void init(numChannels, numBeams) { add(new SplitJoin() { void init() { setSplitter(Duplicate()); for (int i=0; i<numChannels; i++) { add(new FIR1(N1)); add(new FIR2(N2)); } setJoiner(RoundRobin()); }}); } add(new SplitJoin() { void init() { setSplitter(Duplicate()); for (int i=0; i<numBeams; i++) { add(new VectorMult()); add(new FIR3(N3)); add(new Magnitude()); add(new Detect()); } setJoiner(Null()); }}); } StreamIt: Stream Language and Compiler Splitter FIRFilter Joiner Splitter Detector Magnitude FirFilter Vector Mult Detector Magnitude FirFilter Vector Mult Detector Magnitude FirFilter Vector Mult Detector Magnitude FirFilter Vector Mult Joiner e.g., BeamFormer Amarasinghe et al.

mem Search FRM FIR Control Raw Beamformer Layout (by hand)

Raw die photo.18 micron process, 16 tiles, 425MHz, 18 Watts (vpenta) Of course, custom IC designed by industrial design team could do much better

Raw motherboard

More Experimental Systems Systems Online or in Pipeline Raw Workstation Raw-based 1020 Microphone Array Raw a/g wireless system (collab with Engim) Raw Gigabit IP router Raw graphics system Raw supercomputing fabric

Empirical Evaluation Compare to P3 implemented in similar technology ParameterRaw (IBM ASIC)P3 (Intel) Litho180 nm ProcessCMOS 7SFP858 Metal LayersCu 6Al 6 FO1 Delay23 ps11 ps Dielectric k Design StyleStandard Cell SA27E ASIC Full custom Initial Freq425 MHz MHz (use 600) Die Area331 mm mm 2 Raw #s from cycle-accurate simulator validated against real chip -- FPGA mem controller in Raw -- Raw SW i-caching adds 0-30% ovhd

[ISCA04] Performance Architecture Space Performance Results ~10x parallelism ~ 4x ld/store elim ~ 4x stream mem bw

Raw IP Router Gb/sec

VersaBench Sharing a benchmark set to stress versatility of processors Categories of programs: ILP – Desktop and Scientific Streams Throughput oriented servers Bit-level embedded

Summary Raw: single chip for ILP and streaming Scalar operand network is key to ILP and streams Can enable the APU