Versatile Tiled-Processor Architectures: The Raw Approach. Rodric M. Rabbah with Ian Bratt, Krste Asanovic, Anant Agarwal.

Presentation transcript:

1 Versatile Tiled-Processor Architectures The Raw Approach Rodric M. Rabbah with Ian Bratt, Krste Asanovic, Anant Agarwal

2 Processor Model
Stable model for the last few decades
– Von Neumann architecture
– Sequentially execute instructions
– Simple abstraction
– Easy to program

3 Change Is Around the Corner
Processor performance is not scaling as before
– Wire delay and power
How to effectively use transistors?
[Figure: chip size compared with the distance a signal can travel in 1 cycle]
Old view: the chip looks small to a wire
New view: the chip looks much bigger to a wire; communication is expensive even on chip!

4 Spatially-Aware Architectures
Many forward-looking architectures are addressing the physical challenges
– MIT Raw processor
– MIT Scale processor
– Stanford Imagine processor
– Stanford Smart Memories processor
– UC Davis Synchroscalar
– UT Austin TRIPS processor
– Wisconsin ILDP architecture
– The original IBM BlueGene processor

5 Problems with Monolithic Designs
Super-wide general-purpose processors are no longer practical
Centralized control with global operand routing
Area, power, and frequency concerns
[Figure labels: Wide Fetch (16 inst), Unified Load/Store Queue, PC, RF, ALU, Bypass Net, control]

6 Spatial Architectures [figure labels: ALU, Bypass Net, RF]

7 Spatial Architectures [figure labels: ALU, RF, Bypass Net]

8 Spatial Architectures [figure labels: ALU, RF]

9 Exploiting Locality [figure labels: ALU, RF, with >> and + operations]

10 Raw On-Chip Networks
2 Static Networks
– Software-configurable crossbar
– 3-cycle latency for nearest-neighbor ALU to ALU
– Must know pattern at compile time
– Flow controlled
2 Dynamic Networks
– Header encodes destination
– Fire and forget
– 15-cycle latency for nearest-neighbor
[Figure labels: Computation Resources, Switch Processor]
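The dynamic-network bullets ("header encodes destination", "fire and forget") can be illustrated with a small sketch. The slide does not say how a message finds its way to the destination, so the X-then-Y (dimension-ordered) routing below, the `(x, y)` tile coordinates, and the names `make_message` and `route_dynamic` are all assumptions made for illustration. On the static networks the route would instead be fixed by the compiler, with no header needed.

```python
def make_message(dst, payload):
    """Build a dynamic-network message whose header carries the destination tile."""
    return {"header": {"dst": dst}, "payload": list(payload)}


def route_dynamic(src, dst):
    """Return the tiles visited when routing along X first, then along Y.

    Assumption: dimension-ordered routing on a 2D mesh; the slide only
    states that the header encodes the destination.
    """
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


# The sender injects the message and does not manage the route itself:
# each hop is derived from the header alone.
msg = make_message(dst=(3, 2), payload=[42])
path = route_dynamic((0, 0), msg["header"]["dst"])
hops = len(path) - 1
```

The sketch only counts hops. It does not model the 3-cycle and 15-cycle nearest-neighbor latencies quoted on the slide, because the slide gives no per-hop cost for longer routes.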

11 Distribute the Register File [figure labels: ALU, RF]

12 Distribute the Rest [figure labels: ALU, RF, Control, Wide Fetch (16 inst), Unified Load/Store Queue, PC; repeated I$, PC, D$]

13 Tiled-Processor Architecture [figure labels: ALU, RF; repeated I$, PC, D$]

14 Tiled-Processor Architecture
Tile abstraction is quite powerful
– e.g., power → resources used as necessary
Easily scalable
All signals registered at tile boundaries, no global signals
– Easier to tune the frequency
– Easier to do the physical design
– Easier to verify
Make a tile as big as you can go in one clock cycle, and expose longer communication to the programmer

15 Close-up of a Single Raw Tile [figure labels: Static Router Fetch Unit, Compute Processor Fetch Unit, Compute Processor Data Cache]

16 The MIT Raw Processor
180 nm ASIC (IBM SA-27E)
16 tiles → 16 issue
Core Frequency: V V
Frequency competitive with IBM-implemented PowerPCs in the same process
18 W (vpenta)

17 The Raw Goal
Create an architecture that
– Scales to 100s to 1000s of functional units and memory ports
  By exploiting custom-chip-like features
  – Application-specific routing of operands
– Is “general purpose” (versatile)
  Run ILP sequential programs, scientific computations, server-style processing, streaming systems, and bit-level applications
  Support standard general-purpose abstractions
  – Context switching, caching, and instruction virtualization

18 The New Performance Goal
[Figure: Architecture and Application Space, borrowed from DARPA PCA Forum; labels: Versatility, Selectable Virtual Machines, Performance, Desktop ILP, ASIC Bit-level, DSP Stream, Server Throughput]
Raw architecture as an “all-purpose” processor
– Better SPECmark/Watt across the board
Higher SPECmark → think more MIPS compared to some reference machine (e.g., VAX 11/780)

19 Application Domains
5 market-dominant application domains
– Desktop Integer
– Desktop Floating (Scientific codes)
– Server (Throughput Based): Ergonomic simulations, Grid computation, Transaction processing
– Embedded Streaming
– Embedded Bit-Level

20 How Applications Differ [figure: desktop integer, desktop floating point (scientific), server, streaming, and bit-level domains placed by data spatial locality and data temporal locality]

21 Distinguishing Application Domains
Five basis properties
– Data temporal locality: quantify address reuse
– Data spatial locality: quantify address adjacency
– Predominant data type
– Parallelism: ILP, DLP, TLP, etc.
– Instruction temporal locality: inverse of control complexity
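To make "address reuse" and "address adjacency" concrete, here is a minimal sketch that scores a memory-address trace on both. The slide does not define the metrics, so the functions, window size, and distance threshold below are hypothetical stand-ins, not the VersaBench definitions; the traces are synthetic.

```python
from collections import deque


def temporal_locality(trace, window=64):
    """Fraction of accesses that reuse an address seen in the last `window` accesses.

    Hypothetical metric: a simplified stand-in for "quantify address reuse".
    """
    recent = deque(maxlen=window)
    hits = 0
    for addr in trace:
        if addr in recent:
            hits += 1
        recent.append(addr)
    return hits / len(trace) if trace else 0.0


def spatial_locality(trace, window=64, max_distance=8):
    """Fraction of accesses within `max_distance` of a different recent address.

    Hypothetical metric: a simplified stand-in for "quantify address adjacency".
    """
    recent = deque(maxlen=window)
    hits = 0
    for addr in trace:
        if any(0 < abs(addr - prev) <= max_distance for prev in recent):
            hits += 1
        recent.append(addr)
    return hits / len(trace) if trace else 0.0


# Synthetic traces (not measured data).
streaming = list(range(0, 400, 4))      # sequential word accesses, no reuse
looping = [0, 4096, 8192, 12288] * 25   # a few far-apart addresses, reused often

stream_scores = (temporal_locality(streaming), spatial_locality(streaming))
loop_scores = (temporal_locality(looping), spatial_locality(looping))
```

Under these assumed definitions the sequential trace scores high on adjacency and zero on reuse, while the looping trace does the opposite, which is the kind of separation the basis properties are meant to expose.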

22 Classifying Applications
[Figure labels: Parallelism, Data Temporal Locality, Data Spatial Locality, Data Type, Instruction Temporal Locality]
Quantitative metrics for the basis properties
– Measure properties of different applications
Cluster applications into domains
– VersaBench

23 VersaBench Status
15 total benchmarks
– 3 per category
Drawn from SPEC INT/FP, Raw, StreamIt, DIS (AAEC), USC ISI
– Manageable size, encourages evaluation using the entire suite
– Available online at
Benchmarks selected systematically
– MIT Technical Memo 646, June 2004: Rabbah, Bratt, Asanovic, Agarwal

24 Proposed Metric: Versatility
SPECmark (SPEC): geometric mean of speedup relative to a single reference machine
Versatility (VersaBench): geometric mean of speedup relative to the best-performing machines
Normalization to the best-performing machines identifies areas for improvement
– This is especially important → VersaGraphs
– Not another mean over N benchmarks
A high Versatility mark implies the architecture is good across the board
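The two metrics differ only in what each benchmark is normalized against, which a short sketch can show. The function names, machine names, and execution times below are made up for illustration and are not results from the talk; speedup is taken as a ratio of execution times, which is an assumption about how the slide's "speedup" is measured.

```python
from math import prod


def geometric_mean(values):
    """Geometric mean of a non-empty collection of positive numbers."""
    values = list(values)
    return prod(values) ** (1.0 / len(values))


def specmark(times, reference_times):
    """Geometric mean of speedups over one fixed reference machine."""
    return geometric_mean(reference_times[b] / times[b] for b in times)


def versatility(machine, all_times):
    """Geometric mean of speedups relative to the best machine on each benchmark."""
    ratios = []
    for bench, own_time in all_times[machine].items():
        best_time = min(t[bench] for t in all_times.values())
        ratios.append(best_time / own_time)
    return geometric_mean(ratios)


# Hypothetical execution times in seconds (illustrative only).
times = {
    "general_purpose": {"integer": 1.0, "stream": 8.0},
    "asic_like": {"integer": 16.0, "stream": 1.0},
    "ideal": {"integer": 1.0, "stream": 1.0},
}
scores = {name: versatility(name, times) for name in times}
```

Because every per-benchmark ratio is at most 1, a Versatility score of 1.0 is reached only by a machine that matches the best performer on every benchmark, and a low ratio on any one domain points at where the architecture falls short.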

25 VersaGraph Example: speedup of an ideal machine [figure: speedup relative to best machine for desktop integer, bit-level, stream, server, desktop float]

26 VersaGraph Example: speedup of a general-purpose machine (e.g., P3) [figure: speedup relative to best machine for desktop integer, bit-level, stream, server, desktop float]

27 VersaGraph Example: speedup of an ASIC [figure: speedup relative to best machine for desktop integer, bit-level, stream, server, desktop float]

28 VersaGraphs For Real Architectures
Machines: P3 (600 MHz), P4 (2.8 GHz), Raw (425 MHz)
Domains: integer, scientific, server, stream, bit-level
Benchmarks: mcf, parser, bmm, vpenta, mgrid, dbms, corner, radio, 80211a, 8b10b
Also compared against Athlon 64 and Itanium 2

29 Raw Homepage download papers, benchmarks, …