Baring It All to Software: Raw Machines

Presentation transcript:

Baring It All to Software: Raw Machines
Waingold, Taylor, et al.
Massachusetts Institute of Technology, Lab. for CS
Presented by Garver Moore for ECE 259: Advanced Computer Architecture II, Prof. D.J. Sorin, Duke University

These three trends . . .
1) Verification Complexity and Constraints
- Superscalar verification
- Dynamic execution structures -> area, complexity
- Corner cases++
2) Chip Wire Length Constraints
- Pipelined communication between resources
- Clock net limits
- Transmission-line design
3) “Dynamic” Workload Space
- “Changing application workloads”
- Y2K ISA appropriate for Y2K workloads
- E.g. streaming I/D apps (MMX / SSE)?

. . . motivate “Raw” Architectures
Philosophy: Tile machine (a la 128-CMP)
Per tile:
- Instruction stream
- Cache (I$, D$ and memories)
- Functional units (visible registers, ALU)
- Switch (reprogrammable)
- (Re)configurable units (more on this later)
- Leverage STATIC information
- Provide correctness for dynamic events

Proposed Raw tile
- Point-to-point inter-tile network
- No instruction traverses more than 1 tile width per cycle
- Reconfigurable switch memory enables scheduling directives
- Architecturally visible registers
- ALU operations
- Configurable logic

3 Distinct Approaches
[Figure: three memory/state distribution models, with panels labeled Raw, Superscalar, and Multiprocessor]
- Raw: memory ports and the register file are distributed among a switched point-to-point network between functional units and state.
- Superscalar: communicates between one memory and state port and distributed functional units over a large, often pipelined, global bus.
- Traditional multiprocessors: distribute memory, state, and functional units on a switched network through memory.
The key difference between Raw and multiprocessors is the granularity of communication.
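The one-tile-width-per-cycle limit and the compiler-filled switch memory can be pictured with a small latency model. The following is a minimal sketch, assuming a 2D mesh of tiles addressed by (x, y), dimension-ordered routing (X first, then Y), and one hop per cycle. The function names `static_route` and `switch_schedule` and the schedule format are hypothetical illustrations, not anything defined in the presentation.

```python
def static_route(src, dst):
    """Return the tiles visited from src to dst, moving in X first, then Y.

    Tiles are (x, y) coordinates on an assumed 2D mesh. Dimension-ordered
    routing is an illustrative choice; the slides do not specify a policy.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def switch_schedule(src, dst, start_cycle=0):
    """Map each cycle to the tile whose switch forwards the value that cycle.

    Assumes one hop per cycle, mirroring the one-tile-width-per-cycle limit.
    """
    path = static_route(src, dst)
    # The value leaves path[i] during cycle start_cycle + i.
    return {start_cycle + i: tile for i, tile in enumerate(path[:-1])}


if __name__ == "__main__":
    route = static_route((0, 0), (2, 1))
    print(route)
    print(len(route) - 1, "hops, one cycle each under this model")
```

Because every hop is known ahead of time in this model, a compiler could in principle emit the per-cycle switch entries directly, which is the sense in which the switch memory carries scheduling directives.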

Configurable Logic (CL)
- Do-it-yourself architecture extensions
- Create customized instructions
- Example: Game of Life “benchmark” drops a 22-cycle software sequence to 1 instruction
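The slide does not spell out the 22-cycle sequence, so the following is only a sketch of the general idea. It contrasts a multi-step software computation of one Game of Life cell (add eight neighbor bits, then compare) with a single table lookup standing in for a custom instruction that configurable logic might evaluate in one step. The function names and the 9-bit table encoding are assumptions made for illustration.

```python
def next_state_software(cell, neighbors):
    """Multi-step software sequence: sum eight neighbor bits, then compare."""
    count = 0
    for n in neighbors:  # eight separate additions on a conventional ALU
        count += n
    if cell == 1:
        return 1 if count in (2, 3) else 0  # survival rule
    return 1 if count == 3 else 0           # birth rule


def _build_table():
    """Precompute the rule for all 512 inputs (bit 0 = cell, bits 1-8 = neighbors)."""
    table = []
    for bits in range(512):
        cell = bits & 1
        neighbors = [(bits >> i) & 1 for i in range(1, 9)]
        table.append(next_state_software(cell, neighbors))
    return table


LIFE_TABLE = _build_table()


def next_state_custom(cell, neighbors):
    """Stand-in for a one-step custom instruction: pack 9 bits, look up the result."""
    bits = cell
    for i, n in enumerate(neighbors, start=1):
        bits |= n << i
    return LIFE_TABLE[bits]


if __name__ == "__main__":
    print(next_state_software(1, [1, 1, 0, 0, 0, 0, 0, 0]))
    print(next_state_custom(0, [1, 1, 1, 0, 0, 0, 0, 0]))
```

In Python the bit packing is itself a loop; in hardware the nine input bits would simply be wires into the logic, which is why the lookup models a single operation.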

Raw vs. Other Architectures I
A[b[i]] = A[c[i]];
Systolic Arrays: (Mark II Colossus)
- Slightly more recently, NuMesh (MIT)
- Almost ZERO support for dynamic events, reconfiguration, patterns
FPGAs:
- Configurable, application specific
VLIW:
- Large register namespace
- Distributed register file
- Massive compiler dependency

Systolic arrays: One of the first computers (Mark II Colossus) was used for code breaking, finding wheel settings for the Lorenz enciphering machine. Dataflow between functional units is directional, e.g. inputs taken from “NW” and outputs given to “SE.” In its simplest form this obviously does not allow for dynamic dependencies.

NuMesh is a packaging and interconnect technology supporting high-bandwidth systolic communications on a 3D nearest-neighbor lattice; our goal is to combine Lego-like modularity with supercomputer performance. To date, the primary focus of the project has been the class of applications whose static communication patterns can be precompiled into independent and carefully choreographed finite state machines running on each node. Extensions of the NuMesh to more general communication paradigms have been implemented.

FPGAs: configurability is obvious, but they do not support instruction sequencing and have “onerous” compilation times. The RAW architecture has complex but pre-compiled units (ALUs, etc.).

VLIW: “inspiring” RAW, with a similar dependence on static information, distributed registers, many registers, etc. However, RAW allows for multiple instruction streams, so it can perform independent but statically scheduled computations in different tiles.
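One plausible reading of the slide's `A[b[i]] = A[c[i]]` statement is that it stands for a dynamic reference: the addresses depend on run-time data, so a compiler cannot fix in advance which tiles must communicate. The sketch below assumes array `A` is spread across tiles by low-order interleaving and a tile count of 4. Both are assumptions for illustration, and `home_tile` and `tiles_touched` are hypothetical names.

```python
N_TILES = 4  # hypothetical tile count, chosen only for this illustration


def home_tile(index, n_tiles=N_TILES):
    """Tile holding A[index] under an assumed low-order interleaving."""
    return index % n_tiles


def tiles_touched(b, c, n_tiles=N_TILES):
    """For each i in A[b[i]] = A[c[i]], return (source tile, destination tile).

    The pairs depend on the contents of b and c, which are only known at
    run time, so no fixed communication schedule covers every input.
    """
    return [(home_tile(ci, n_tiles), home_tile(bi, n_tiles))
            for bi, ci in zip(b, c)]


if __name__ == "__main__":
    # Two different inputs produce two different communication patterns.
    print(tiles_touched(b=[0, 5, 2], c=[3, 3, 6]))
    print(tiles_touched(b=[1, 1, 1], c=[0, 4, 8]))
```

By contrast, a reference with a constant index, such as `A[2]`, maps to a tile that is known at compile time, which is the case purely static designs such as systolic arrays handle well.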

Raw vs. Other Architectures II
Multiscalar
- “Deceptive” similarity
- Resources unexposed
- E.g. value forwarding
CMP
- Simple replication
- Message startup / synchronization performance issues
IRAM
- “On-chip balance.”
- Still, long bitlines and multibanked memory delays
- Might suffice “now” (1997), but the delays will be exposed in future processes

Multiscalar: hardware renaming exposes only 32 architecturally visible registers, whereas Raw gives the compiler more flexibility. Raw allows explicit value forwarding to tiles, which allows for a scalable interconnect; the Multiscalar approach uses a bus for broadcast to tiles.

Results – “RawLogic” FPGA Implementation
- Does not support “general” instruction processing
- Converted static control sequences into state machines
- Less flexible, more compilation time

Questions / Discussion I
- Small register name-space problem?
- “Reducing HW support . . . opposes current trends, but [more area] and reduced verification complexity. Taken together, these benefits can make the software synthesis of complex operations competitive with hardware for overall application performance.” (Emphasis mine)
- Limits of do-it-yourself ISA?
- Where is the dynamic limit? I/O? Contexts?
- Along the same vein, appropriate performance evaluation? Or too tailored (i.e. Tarantula)?
- Market size?

Questions / Discussion II
- How have innovations since 1997 affected this?
- Is there a limit to multiple-granularity reconfigurability’s usefulness?
- The Prophecy: “In 10 to 15 years, we believe that [giga-transistor chips], faster switch speeds, and growing compiler sophistication will allow a Raw machine’s performance/cost ratio to surpass that of traditional architectures for future, general-purpose workloads”
- Dynamic event support: too thin?
- “The Google Test”