Evaluating The Raw Microprocessor: Scalability and Versatility Michael Taylor Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry.

Presentation transcript:

Evaluating The Raw Microprocessor: Scalability and Versatility Michael Taylor Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal. M.I.T.

Could processors be even more general purpose? Square inch of silicon Gets more powerful every generation Custom Chip “General Purpose” Microprocessor Video/3D Graphics Network Encryption Wireless/Cell Phone Digital Camera MP3 Player Automotive Why can custom chips run these apps? Spec Office

Custom Chips: Efficient Extraction of Parallelism - 10’s, 100’s or 1000’s of parallel operators - 10’s or 100’s of parallel memory ports - 10’s or 100’s of parallel I/O ops But, not general purpose! Can’t run GCC. Customized placement and routing of operators & operands - High locality - Minimum control - Operands routed over wires, not thru register files => Area and Power Efficient (diagram label: GP Micro)

The Raw Goal Create an architecture that: Scales to 100’s-1000’s of functional units, memory ports by exploiting custom-chip like features - in particular, application-specific routing of operands … while being “general purpose”: Run ILP-based sequential programs Support standard General Purpose Abstractions - like context switching, caching and instruction virtualization [IEEE Micro, “Billion Transistor” Issue, 1997]

Un-buildable Super-Wide Issue GP Control Wide Fetch (16 inst) Unified Load/Store Queue PC RF ALU Bypass Net

Area and Frequency Scalability Problems (diagram: N ALUs, Bypass Net, RF; scaling labels ~N^3 and ~N^2) Ex: Itanium 2. Without modification, freq decreases linearly or worse.

Operand Routing is Global ALU Bypass Net RF >> +

Idea: Exploit Locality ALU Bypass Net RF

ALU RF Bypass Net Idea: Exploit Locality

ALU RF Replace the crossbar with a point-to-point, pipelined, routed network.

ALU RF >> + Replace the crossbar with a point-to-point, pipelined, routed network.

Operand Transport Scaling – Bandwidth and Area
Columns: Un-pipelined crossbar | Point-to-Point Routed Mesh Network (N ALUs each)
Bisection BW: ~ N^½
Local BW: ~ N^½ | ~ N
Area: ~ N^2 | ~ N
If we want to keep our ALUs busy, we better map communicating instructions nearby so communication is local. Scales as 2-D VLSI
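The area scaling on this slide can be made concrete with a small sketch. It assumes only what the slide states (area grows as N^2 for the un-pipelined crossbar and as N for the routed mesh) and ignores constant factors; the function name `area_growth` is illustrative and not from the talk.

```python
def area_growth(n_from, n_to):
    """Return (crossbar_growth, mesh_growth) when scaling from n_from to n_to ALUs.

    Assumes area ~ N^2 for the un-pipelined crossbar and ~ N for the
    point-to-point routed mesh, ignoring all constant factors.
    """
    ratio = n_to / n_from
    return ratio ** 2, ratio


crossbar, mesh = area_growth(16, 1024)
print(f"16 -> 1024 ALUs: crossbar area x{crossbar:.0f}, mesh area x{mesh:.0f}")
```

Under these assumptions, going from 16 to 1,024 ALUs multiplies the operand-transport area by 64 for the mesh but by 4,096 for the crossbar.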

Operand Transport Scaling - Latency
Time for operand to travel between instructions mapped to different ALUs.
Placement | Un-pipelined crossbar | Point-to-Point Routed Mesh Network
Non-local Placement | ~ N | ~ N^½
Locality Driven Placement | ~ N | ~ 1
If we want to make sure that a latency-bound program doesn’t slow down when more ALUs are added, we must map the instructions to ALUs in a local fashion. [ASPLOS98]
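A minimal sketch of why placement matters on a mesh, assuming operand latency is proportional to the Manhattan hop count between tiles on a k x k grid (N = k^2 ALUs). The two functions are hypothetical helpers, not part of any Raw tool: one averages the hop count over all pairs of tiles (non-local placement), the other models a producer and consumer placed on adjacent tiles (locality-driven placement).

```python
from itertools import product


def mean_hops_nonlocal(k):
    """Average Manhattan distance between two tiles picked uniformly on a k x k mesh."""
    tiles = list(product(range(k), repeat=2))
    total = sum(abs(ax - bx) + abs(ay - by)
                for (ax, ay) in tiles
                for (bx, by) in tiles)
    return total / len(tiles) ** 2


def mean_hops_local(k):
    """Hop count when every consumer sits on a tile adjacent to its producer."""
    return 1 if k > 1 else 0


for k in (2, 4, 8, 16):
    n = k * k
    print(f"N={n:4d}  non-local ~{mean_hops_nonlocal(k):.2f} hops  "
          f"local {mean_hops_local(k)} hop")
```

The non-local average grows roughly in proportion to k, that is N^½, while the local case stays at one hop no matter how many tiles are added.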

Distribute the Register File ALU RF

ALU RF Control Wide Fetch (16 inst) Unified Load/Store Queue PC SCALABLE

More Scalability Problems Control Wide Fetch (16 inst) Unified Load/Store Queue PC

Distribute the rest. (diagram labels: ALU, RF, Control, Wide Fetch (16 inst), Unified Load/Store Queue, PC; 16 tiles, each with I$ PC D$) [ISCA99]

Tiles! (diagram labels: ALU, RF; 16 tiles, each with I$ PC D$)

Tiles!

Tiled Processor Architectures - composed of a replicated tile -all signals registered at tile boundaries -NO global signals -wire delay problem much easier - easy scalability story Easier to Tune the Frequency Easier to Verify Easier to do the Physical Design

Raw Compute Internals (diagram labels: ALU, RF; 16 tiles, each with I$ PC D$; pipeline labels IF RF D ATL M1 M2 FP E U; registers r24 r25 r26 r27)

We could not find this type of network in Patterson & Hennessy. - optimizes time for delivery of scalar operands between functional units - we conceptualized this idea into the term “scalar operand network” or SON - CMP: cycles - iWarp: 12 cycles - Raw: 3 cycles - Alpha 21264: 1 cycle - Superscalar: 0 cycles (figure labels: 16 tiles, each with I$ PC D$, plus ALU and RF; “scalable”; “Intended for use as SON”) HPCA 2003 – “Scalar Operand Networks”

Evaluation of Raw - holistic approach - design a complete architecture - design and build the processor and enclosing system - build the compilers - used the chip in real systems - head-to-head versus Intel Chip in same litho generation

Raw 180 nm ASIC (IBM SA-27E) 16 tiles Core Frequency: V V Frequency competitive with IBM-implemented PowerPCs in same process. 18 W (vpenta) Critical Path: ≈ Single-Ported 32 KB SRAM + 14-bit Mux. + Flip Flop

Raw Chips October 02

Raw motherboard Support Chipset implemented in FPGA (vs. custom ASICs for P3)

Comparison to Pentium 3 Self-comparisons hide architectural and compiler inefficiency. What’s hard: Normalization between processors is very tricky, especially for academic projects versus indu$try. - ASIC cannot attain the same frequencies. Honest: Our solution: - Pick closest Intel processor implementation - Don’t scale any numbers in any way. People can now compare to P3 and by extension to Raw.

Parameter | IBM SA-27E (Raw) | Intel P858 (P3) | Favors
Litho | 180 nm | | -
Metal Layers | Cu 6 | Al 6 | Raw
Wire sizing | No | Yes | Intel
Dielectric k | | | Intel
FO1 Delay | 23 ps | 11 ps | Intel
Design Style | Std Cell ASIC | Full custom | Intel
Voltage Tweak | 0 % | 10 % | Intel
Initial Freq | | |
Presumed Ave. Chip Freq | | |
Pins | | | Raw
Die Area | 331 mm^2 | mm^2 | Raw

Methodology - HW Intel: Pentium III Coppermine 600 MHz Dell Precision 410, stocked with PC100 DRAM Raw: Validated Cycle-Accurate Simulator - Matches RTL for Raw Chip to the precise cycle for all 200,000+ lines of test code Simulator used so we could: - Normalize motherboard + DRAM timings - replace (research) software i-caching system with conventional hardware i-cache.

Methodology - SW When applicable - normalize compiler: P3: gcc 3.3 -O3 -march=pentium3 -mfpmath=sse Raw: gcc 3.3 -O3 (non parallelizing) - normalize stdio/stdlib: P3 & Raw: Newlib w/ Deionizer P3: Intel Performance Primitives LAPACK/BLAS with SSE for linear algebra routines Raw: rawcc - home brew parallelizing compiler StreamIt - home brew parallelizing compiler gcc snippets inline assembly for some parallel apps

Performance Survey

Sources of Speedup vs. P3 or 1 Tile
Factor | Approx. Upper Bound on Speedup
Tile Parallelism | 16x
Streaming I/O Bandwidth | 60x
Streaming v. cache thrashing | 15x

Future Work: Raw supercomputing fabric Emulator of a 1K-tile Raw chip circa …Ultimate test of scaling

Related Work: AsTrO Taxonomy
Assignment (Static/Dynamic): Is instruction assignment to ALUs predetermined?
Transport (Static/Dynamic): Are operand routes predetermined?
Ordering (Static/Dynamic): Is the execution order of instructions assigned to a node predetermined?
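The three AsTrO questions can be read as a small data structure. The sketch below is illustrative only: the class name, field names and three-letter code (for example "SDS" for static assignment, dynamic transport, static ordering) are assumptions chosen for this example, and no specific processor is classified here.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AsTrO:
    """One point in the AsTrO design space; True means static (predetermined)."""
    assignment_static: bool  # is instruction-to-ALU assignment predetermined?
    transport_static: bool   # are operand routes predetermined?
    ordering_static: bool    # is per-node execution order predetermined?

    def code(self):
        """Three-letter code in Assignment, Transport, Ordering order."""
        flags = (self.assignment_static, self.transport_static, self.ordering_static)
        return "".join("S" if flag else "D" for flag in flags)


# Enumerate all eight classes the taxonomy allows.
all_classes = [AsTrO(a, t, o) for a, t, o in product((True, False), repeat=3)]
print([c.code() for c in all_classes])
```

Each axis is a yes/no answer, so the taxonomy has exactly eight classes, ranging from fully static (SSS) to fully dynamic (DDD).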

How Raw relates to other distributed microprocessors using AsTrO taxonomy (tree figure with Static/Dynamic branches for Assignment, Transport and Ordering; leaves: RawDyn [00], Raw [97], Scale [04], GRID [01], WaveScalar [03], ILDP [00], OOO-Superscalar)

Conclusions VLSI Scalable microprocessors are possible. Constant factors are beginning to give way to asymptotics: - 16 ALU Raw – Oct ALU Raw – Now - 1,024 ALU Raw ,768 ALU Raw – If Moore’s Law makes it to 2 nm There is an opportunity to make processors more “versatile” i.e., steal applications from custom chips. Tiled Processor Architectures are a promising approach and merit further research.

Embedded system: 1020 Element Microphone Array