RAW Scott J Weber Diagrams from and summary of:

Slides:



Advertisements
Similar presentations
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
Advertisements

POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? ILP: VLIW Architectures Marco D. Santambrogio:
The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric David Wentzlaff, Michael Taylor, Jason Kim, Jason Miller, Fae Ghodrat,
THE RAW MICROPROCESSOR: A COMPUTATIONAL FABRIC FOR SOFTWARE CIRCUITS AND GENERAL- PURPOSE PROGRAMS Taylor, M.B.; Kim, J.; Miller, J.; Wentzlaff, D.; Ghodrat,
Microprocessor Microarchitecture Multithreading Lynn Choi School of Electrical Engineering.
CGRA QUIZ. Quiz What is the fundamental drawback of fine-grained architecture that led to exploration of coarse grained reconfigurable architectures?
Lecture 9: Coarse Grained FPGA Architecture October 6, 2004 ECE 697F Reconfigurable Computing Lecture 9 Coarse Grained FPGA Architecture.
The Raw Processor: A Scalable 32 bit Fabric for General Purpose and Embedded Computing Presented at Hotchips 13 On August 21, 2001 by Michael Bedford Taylor.
Chapter 17 Parallel Processing.
Multiscalar processors
SSS 4/9/99CMU Reconfigurable Computing1 The CMU Reconfigurable Computing Project April 9, 1999 Mihai Budiu
Evaluating the Raw microprocessor Michael Bedford Taylor Raw Architecture Group Computer Science and AI Laboratory Massachusetts Institute of Technology.
Computer performance.
February 12, 1998 Aman Sareen DPGA-Coupled Microprocessors Commodity IC’s for the Early 21st Century by Aman Sareen School of Electrical Engineering and.
Basic Microcomputer Design. Inside the CPU Registers – storage locations Control Unit (CU) – coordinates the sequencing of steps involved in executing.
1 Chapter 7 Processor-in-Memory, Reconfigurable, and Asynchronous Processors.
TILEmpower-Gx36 - Architecture overview & performance benchmarks – Presented by Younghyun Jo 2013/12/18.
CSE 661 PAPER PRESENTATION
Processor Architecture
On-chip Parallelism Alvin R. Lebeck CPS 221 Week 13, Lecture 2.
February 12, 1999 Architecture and Circuits: 1 Interconnect-Oriented Architecture and Circuits William J. Dally Computer Systems Laboratory Stanford University.
UltraSPARC III Hari P. Ananthanarayanan Anand S. Rajan.
Baring It All to Software: Raw Machines E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb,
3/12/2013Computer Engg, IIT(BHU)1 PARALLEL COMPUTERS- 2.
High-Bandwidth Packet Switching on the Raw General-Purpose Architecture Gleb Chuvpilo Saman Amarasinghe MIT LCS Computer Architecture Group January 9,
Capability of processor determine the capability of the computer system. Therefore, processor is the key element or heart of a computer system. Other.
Creating a Scalable Microprocessor: A 16-issue Multiple-Program-Counter Microprocessor With Point-to-Point Scalar Operand Network Michael Bedford Taylor.
CDA-5155 Computer Architecture Principles Fall 2000 Multiprocessor Architectures.
Evaluating The Raw Microprocessor: Scalability and Versatility Michael Taylor Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry.
On-chip Parallelism Alvin R. Lebeck CPS 220/ECE 252.
Niagara: A 32-Way Multithreaded Sparc Processor Kongetira, Aingaran, Olukotun Presentation by: Mohamed Abuobaida Mohamed For COE502 : Parallel Processing.
SPRING 2012 Assembly Language. Definition 2 A microprocessor is a silicon chip which forms the core of a microcomputer the concept of what goes into a.
CS 352H: Computer Systems Architecture
Nios II Processor: Memory Organization and Access
Lynn Choi School of Electrical Engineering
Overview Parallel Processing Pipelining
Topics SRAM-based FPGA fabrics: Xilinx. Altera..
Baring It All to Software: Raw Machines
Microarchitecture.
Packet Switching on Raw
Multiscalar Processors
Pipelining and Retiming 1
Lynn Choi School of Electrical Engineering
A Common Machine Language for Communication-Exposed Architectures
Assembly Language for Intel-Based Computers, 5th Edition
A Quantitative Analysis of Stream Algorithms on Raw Fabrics
Architecture & Organization 1
Scalable Processor Design
Basic Computer Organization
/ Computer Architecture and Design
Michael Bedford Taylor, Walter Lee, Saman Amarasinghe, Anant Agarwal
ELEN 468 Advanced Logic Design
Parallel and Multiprocessor Architectures
Architecture & Organization 1
Computer Architecture Lecture 4 17th May, 2006
BIC 10503: COMPUTER ARCHITECTURE
ELEC 7770 Advanced VLSI Design Spring 2012 Introduction
Morgan Kaufmann Publishers Computer Organization and Assembly Language
ELEC 7770 Advanced VLSI Design Spring 2010 Introduction
Chapter 1 Introduction.
General Purpose Processors as Processor Arrays
Sampoorani, Sivakumar and Joshua
CC423: Advanced Computer Architecture ILP: Part V – Multiple Issue
Computer Evolution and Performance
What is Computer Architecture?
The Vector-Thread Architecture
Michael Bedford Taylor, Walter Lee, Saman Amarasinghe, Anant Agarwal
CSC3050 – Computer Architecture
Author: Xianghui Hu, Xinan Tang, Bei Hua Lecturer: Bo Xu
CSE378 Introduction to Machine Organization
Presentation transcript:

RAW Scott J Weber Diagrams from and summary of: [1] “Bringing It All to Software: Raw Machines” by Elliot Waingold, Michael Taylor, Devabhaktuni, Vivek Sarkar, Walter Lee, Victor Lee, Jang Kim, Matthew Frank, Peter Finch, Rajeev Barua, Jonathan Babb, Saman Amarasinghe, Anant Agarwal. [2] “The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs” by Michael Bedford Taylor, Jason Kim, Jason Miller, David Wentzlaff, Fae Ghodrat, Ben Greenwald, Henry Hoffman, Jae-Wook Lee, Paul Johnson, Walter Lee, Albert Ma, Arvind Saraf, Mark Seneski, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal. [3] “The Raw Processor – A Scalable 32 bit Fabric for General Purpose and Embedded Computing” (presentation at Hot Chips XIII) by Michael Taylor, Jason Kim, Jason Miller, Fae Ghodrat, Ben Greenwald, Paul Johnson, Walter Lee, Albert Ma, Nathan Shnidman, Volker Strumpen, David Wentzlaff, Matt Frank, Saman Amarasinghe, and Anant Agarwal.

Outline Architecture Goals Architecture Details Compilation Architecture Comparison

Architecture Goals What? How? Handle increasing wire delay Verification of new designs Stream-based multi-media computation How? Minimize the “ISA gap” Export the low-level details of the architecture to the compiler Access to gate, wire, and pin resources

“ISA gap” Gap between software-usable processing elements and actual underlying physical resources is steadily increasing 8-way superscalar 21464 is 27x as large as the 2-way 21064 Management dwarfs area of ALUs Intel has 300+ page document on how to avoid stalls in the Pentium 4 Power consumption, design and verification cost is sky-rocketing

Minimize the “ISA Gap” Minimize “ISA gap” by exposing physical resources as architectural entities

RAW Processor Tightly integrated synchronous network interface Register access latencies Static scheduling of operands Eliminates explicit synchronization Single-cycle message injection and reception Multigranular (bit-, byte- and word-level) operations Configurable logic for application-specific operations Pipelined, point-to-point network between registers of different tiles increases ILP SRAM distributed across tiles decreases memory latency High-bandwidth paths to external devices (DRAM, I/O) Separate control for processing elements and static routing

RAW Processor

Tile Processor Pipeline

Interconnect Architecture RAW Superscalar Multiprocessor RAW – switched point-to-point interconnect Superscalar – single register file and global bus Multiprocessor – coarser grained through memory

Programmable Interconnect Two compile time static networks 5-stage pipeline static routing processor 15 operations/cycle per tile small command and 14 routes Single-cycle/hop latency Two runtime dynamic networks Dimension-ordered, wormhole-routed “Memory” network with deadlock avoidance Operating system, data cache, interrupts, DMA, I/O “General” network relies on deadlock recovery Untrusted clients Static network: 2 routing crossbars  2 physical networks 7 entities (static router pipeline, North, East, South, West, Main Processor, and the other crossbar) in both directions  7 * 2 = 14

Architecture Comparison Systolic Arrays Point-to-point networks for static scheduling Startup costs are high and static scheduling for long periods is difficult FPGAs Fine-grained parallelism and fast static communication Does not support instruction sequencing VLIW Processors Large number of operands and compiler-dependent Does not support multiple instruction streams

Architecture Comparison Multiscalar processors Deceptively similar except for exposure Register renaming and dependency checking in hardware not software Single-chip multiprocessor Expensive message startup and synchronization IRAM architectures Long memory lines or long crossbar wires increase memory latency

Compilation Parallelizes C code onto static network View N tiles as functional units for ILP Partition parallel code to tiles Placement of threads to physical tiles Routing and global scheduling Logic synthesis for custom operations

Systolic Structures for (i = 0; i < n; i++) a[b[i]] = a[c[i]] Value dependency between a load and store of a Two-cycles/iteration RAW compiler handles memory dependencies

Results on RawLogic Prototype Life +32x bit-level operations +32x parallelism +22x configurability -3x slower FPGA clock -13x communication overhead 600x total speed-up Reference Processor: Sparc 20/71 RawLogic: Ikos VirtuaLogic Emulator coupled with Sparc 10/51, 5 boards, 64 Xilinx 4013 FPGAs/board (25 MHz)

RAW Tile

RAW Stats IBM SA-27E .15u 6L Cu 18.2 mm x 18.2 mm die .122 Billion Transistors 16 Tiles 2048 KB SRAM Onchip 1657 Pin CCGA Package (1080 HSTL signal IO) ~225 MHz ~25 Watts

Conclusion Does RAW solve all computation problems? Stream-based multi-media applications? Scientific computation?