February 12, 1999 Architecture and Circuits: 1 Interconnect-Oriented Architecture and Circuits William J. Dally Computer Systems Laboratory Stanford University.

Presentation transcript:

Slide 1: Interconnect-Oriented Architecture and Circuits. William J. Dally, Computer Systems Laboratory, Stanford University. February 12, 1998

Slide 2: On-chip wires
[Figure: minimum-width wire in a 0.35 µm process; position labels 0.0 mm, 2.5 mm, 5.0 mm, 7.5 mm, 10.0 mm]

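Slide 3: On-chip wires are getting slower
Scaling the feature size by $s$ (the multipliers shown are for $s = 0.5$), with the wire length $y$ unchanged:
$x_2 = s\,x_1$ (0.5x)
$R_2 = R_1 / s^2$ (4x)
$C_2 = C_1$ (1x)
$t_{w2} = R_2 C_2 y^2 = t_{w1} / s^2$ (4x)
$t_{w2} / t_{g2} = t_{w1} / (t_{g1} s^3)$ (8x)
$v = 0.5\,(t_g R C)^{-1/2}$ (m/s), so $v_2 = v_1 s^{1/2}$ (0.7x)
$v\,t_g = 0.5\,(t_g / RC)^{1/2}$ (m/gate), so $v_2 t_{g2} = v_1 t_{g1} s^{3/2}$ (0.35x)
[Figure labels: wire widths $x_1$, $x_2$ and length $y$; $t_w = RCy^2$; wire segments of delay $RCy^2$ alternating with gate delays $t_g$]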
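The multipliers on this slide can be checked numerically. The following is a minimal sketch, not the author's own calculation: it assumes what the slide's rows imply, namely that resistance per unit length scales as 1/s^2, capacitance per unit length is unchanged, gate delay scales as s, and wire length y is fixed. The function name `wire_scaling` and the dictionary keys are hypothetical.

```python
def wire_scaling(s: float) -> dict:
    """Multiplier applied to each quantity when feature size scales by s.

    Assumptions read off the slide: R per unit length scales as 1/s^2,
    C per unit length is unchanged, gate delay t_g scales as s, and the
    wire length y is held constant.
    """
    r = 1.0 / s**2                     # R_2 / R_1
    c = 1.0                            # C_2 / C_1
    t_g = s                            # t_g2 / t_g1 (implied by the 8x row)
    t_w = r * c                        # t_w = R C y^2 with y fixed
    v = (t_g * r * c) ** -0.5          # from v = 0.5 (t_g R C)^(-1/2)
    v_tg = (t_g / (r * c)) ** 0.5      # from v t_g = 0.5 (t_g / RC)^(1/2)
    return {
        "x": s,
        "R": r,
        "C": c,
        "t_w": t_w,
        "t_w/t_g": t_w / t_g,
        "v": v,
        "v*t_g": v_tg,
    }


factors = wire_scaling(0.5)
for name, value in factors.items():
    print(f"{name}: {value:.2f}x")
```

With s = 0.5 the sketch reproduces the slide's 4x, 8x, 0.7x, and 0.35x entries, which is a quick way to confirm that the reconstructed exponents are mutually consistent.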

Slide 4: Technology scaling makes communication the scarce resource
0.35 µm: 64Mb DRAM; 16 64b FP Proc; 400MHz; ? mm; 12,000 tracks; 1 clock; repeaters every 3mm
0.10 µm: 4Gb DRAM; 1K 64b FP Proc; 2.5GHz; 32mm; 90,000 tracks; 20 clocks; repeaters every 0.4mm
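A short calculation makes the slide's point concrete: compute capacity grows far faster than wiring. The sketch below uses only figures that appear on the slide, with two labeled assumptions: "1K" is read as 1024 processors, and the "1 clock" and "20 clocks" entries are read as a communication latency in clock cycles. The names `GEN_035`, `GEN_010`, `growth`, and `tracks_per_processor` are hypothetical.

```python
# Figures read off the slide. "1K" is assumed to mean 1024 processors,
# and "clocks" is assumed to be a latency in clock cycles.
GEN_035 = {"dram_mbit": 64, "fp_procs": 16, "clock_mhz": 400,
           "tracks": 12_000, "clocks": 1, "repeater_mm": 3.0}
GEN_010 = {"dram_mbit": 4 * 1024, "fp_procs": 1024, "clock_mhz": 2500,
           "tracks": 90_000, "clocks": 20, "repeater_mm": 0.4}


def growth(old: dict, new: dict) -> dict:
    """Ratio new/old for every quantity listed in both generations."""
    return {key: new[key] / old[key] for key in old}


def tracks_per_processor(gen: dict) -> float:
    """Wiring tracks available per floating-point processor."""
    return gen["tracks"] / gen["fp_procs"]


ratios = growth(GEN_035, GEN_010)
print(ratios)
print(tracks_per_processor(GEN_035), tracks_per_processor(GEN_010))
```

Under these assumptions the processor count and DRAM capacity both grow 64x while the track count grows only 7.5x, so the tracks available per processor fall from 750 to about 88.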

Slide 5: Architecture Must Evolve to Fit the Landscape
Local, parallel operations: high bandwidth, low latency and low power
Global operations: low bandwidth, high latency and high power
[Figure labels: 20 clocks; 90,000 tracks]

Slide 6: Architecture Today Depends on Fast Global Communication
All instructions issued from a single global instruction unit
All data passes through a global register file
This won't work when global accesses cost 20 clocks of latency
[Figure labels: Regs; I-Unit]

Slide 7: Tomorrow's Architectures Must Exploit Locality and Expose Communication
Multiple elements (clusters) with
– local instruction dispatch
– local register files
– co-located with arithmetic elements
Explicit communication between elements through a switch or network
Fast synchronization between instruction units
[Figure: four Regs/IU pairs connected by a Switch]

Slide 8: Multi-ALU Processor Chip

Slide 9: Crafted-Cell Design
[Figure labels: 1x, 1.64x, 5.25x; Standard-Cell, Full-Custom, Crafted-Cell; 80 different cells, 7 different cells, 17 different cells]
[Table labels: Design: IRRDP, ADDSUB; Full-Custom, Crafted-Cell, Standard Cell; Performance, Area; values 2.23x, 2.7x, 1.11x, 1.17x, 1.0x]
Results courtesy of Andrew Chang

Slide 10: Interconnect: repeaters with switching
Need repeaters every 1mm or less
Easy to insert switching
– zero-cost reconfiguration
Can't afford decision time
– static routing: fixed or regular pattern
– source routing: on-demand, requires arbitration and fanout
Queuing and flow-control
Pipelining control
[Figure labels: 1mm; Arb; LUT]
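The two routing styles named on this slide can be contrasted in a few lines. This is an illustrative sketch under stated assumptions, not the design shown in the talk: the 2D grid of switches, the port names N/S/E/W, and the functions `source_route` and `static_route` are all hypothetical. In both cases a switch does no route computation, which is the point of "can't afford decision time".

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]

# Hypothetical port names and the grid step each one takes.
STEP: Dict[str, Coord] = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}


def source_route(start: Coord, route: List[str]) -> List[Coord]:
    """Source routing: the sender supplies the whole route up front and
    each switch simply consumes the next hop from it."""
    path = [start]
    x, y = start
    for port in route:
        dx, dy = STEP[port]
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


def static_route(start: Coord, config: Dict[Coord, str], hops: int) -> List[Coord]:
    """Static routing: every switch has a preconfigured output port (a
    fixed pattern set ahead of time), so the data carries no route."""
    path = [start]
    x, y = start
    for _ in range(hops):
        dx, dy = STEP[config[(x, y)]]
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


# The same three-hop path expressed both ways.
via_source = source_route((0, 0), ["E", "E", "N"])
via_static = static_route((0, 0), {(0, 0): "E", (1, 0): "E", (2, 0): "N"}, hops=3)
print(via_source, via_static)
```

The sketch leaves out the arbitration and fanout that the slide associates with on-demand routing; it only shows where the routing information lives in each scheme.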

Slide 12: Bandwidth Hierarchy
Provide lots of bandwidth where it's inexpensive
– short wires between ALUs
Moderate bandwidth with intermediate cost
– local RAM associated with each ALU cluster
Low bandwidth where it's expensive
– global RAM with long wires
Very low bandwidth off chip
[Figure: four ALU clusters, each with local RAM, sharing a global on-chip RAM; wire lengths: local 1mm, medium 4mm, global 30mm, off chip]

Slide 13: Bandwidth Hierarchy
A key problem is to match the demands of an application to the bandwidth available at each level of the hierarchy
Casting applications in a streaming model exposes much of the locality necessary to exploit the hierarchy
[Figure: four ALU clusters, each with local RAM, sharing a global on-chip RAM]
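Matching application demand to the hierarchy can be framed as finding the level that takes longest to serve its share of the traffic. The sketch below is a simplified model under stated assumptions: every bandwidth and demand figure is made up for illustration and none comes from the talk, and the level names and the function `bottleneck` are hypothetical.

```python
from typing import Dict, Tuple


def bottleneck(bandwidth: Dict[str, float], demand: Dict[str, float]) -> Tuple[str, float]:
    """Return the level that limits execution and the cycles it needs.

    bandwidth: words per cycle each level can supply.
    demand:    words the application moves at each level.
    Levels are assumed to operate in parallel, so the slowest one
    sets the run time.
    """
    cycles = {level: demand[level] / bandwidth[level] for level in bandwidth}
    worst = max(cycles, key=cycles.get)
    return worst, cycles[worst]


# Hypothetical bandwidths (words/cycle), decreasing with wire length.
BANDWIDTH = {"local": 64, "cluster_ram": 16, "global_ram": 4, "off_chip": 1}

# Hypothetical traffic for the same work with poor and with good locality.
POOR_LOCALITY = {"local": 6400, "cluster_ram": 1600, "global_ram": 1600, "off_chip": 800}
STREAMING = {"local": 6400, "cluster_ram": 800, "global_ram": 200, "off_chip": 50}

print(bottleneck(BANDWIDTH, POOR_LOCALITY))
print(bottleneck(BANDWIDTH, STREAMING))
```

In this made-up example the version with poor locality is limited by the off-chip level, while the streaming version keeps most traffic on short wires and becomes limited by the ALU-local level instead.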

Slide 14: Architecture Research Issues
Processor architecture
– configuration of ALUs: clustered vs distributed
– method for controlling ALUs: distributed control, VLIW, SIMD
– communication-aware instruction sets: how to hide details while exposing communication
Memory architecture
– methods for exploiting 2D spatial locality
– communication-aware cache organizations
Communication architecture
– on-chip interconnection networks
– the use of repeaters with switching
– the use of hierarchy and selective 'fat' wires

Slide 15: Circuit Challenges of Slow Interconnect
The clock cycle is dominated by wire delay
– novel circuits to improve effective signal velocity
Power is largely used to drive wires
– low-swing on-chip signaling methods
– reject rather than overpower noise
It's difficult to distribute a global clock
– locally synchronous design methods
– fast synchronizers: no wait for metastable decay

Slide 16: Overdrive gives 3x improvement in RC wire latency

Slide 17: Low-Swing Overdrive Signaling
[Figure labels: 1V swing at source; 300mV swing at receiver; recovered signal]

Slide 18: Conclusion: Exploit, Don't Fight, the Technology
Interconnect is rapidly dominating the delay, power, and area of ICs
Traditional architectures rely on global communication
– they are ill-suited for an interconnect-dominated technology
Emerging architectures expose communication and exploit locality
– distributed register files and instruction dispatch
– bandwidth hierarchy
Novel circuits can mitigate effects of slow wires
– overdrive, low-swing signaling, locally synchronous design