Implementing Tile-based Chip Multiprocessors with GALS Clocking Styles
Zhiyi Yu, Bevan Baas
VLSI Computation Lab, ECE Department, University of California, Davis

Presentation transcript:

Implementing Tile-based Chip Multiprocessors with GALS Clocking Styles
  Zhiyi Yu, Bevan Baas
  VLSI Computation Lab, ECE Department
  University of California, Davis, USA

Outline
  Introduction
  Timing issues
  Scalability issues
  A design example

Tile-based Chip Multiprocessors and GALS Clocking Style
  Chip multiprocessors
    – High performance due to parallel computing
    – Potentially high energy efficiency, since high performance may allow the clock frequency and supply voltage to be reduced
  Tile-based architecture
    – Highly scalable
  Globally Asynchronous Locally Synchronous (GALS)
    – Simplified clock tree design
    – High energy efficiency from adaptive clock/voltage scaling for each module

Tile-based GALS Chip Multiprocessors
  Globally synchronous vs. GALS
    – Tile-based GALS chip multiprocessors have nearly perfect scalability

Hierarchical Physical Design Flow
  Three steps
    – Oscillator
    – Single processor
    – Entire chip
  Chip array is assembled by a simple tiling of processors

The Challenges
  Timing issues
    – Boundaries between clock domains
  Scalability issues
    – The most important global signal (clock) is avoided
    – But the clock might not be the only global signal
  [Figure: two diagrams. Labels: P1, P2 with clk1, clk2; P1, P2, P3 with programming and configuration signals]

Two Methods to Cross the GALS Clock Domains
  Single transaction handshake
    – Each data word is acknowledged before a subsequent transfer
  Coarse grain flow control
    – Data words are transmitted without an individual acknowledgement
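
The difference between the two crossing methods can be illustrated with a rough cycle-count model. This is only a sketch under stated assumptions: the function names, the `latency` parameter (one-way delay across the boundary, in cycles), and the assumption that the receiving FIFO never fills are all hypothetical and do not come from the slides.

```python
def handshake_cycles(n_words: int, latency: int) -> int:
    """Single transaction handshake: every word waits for its own acknowledgement.

    Assumed cost per word: `latency` cycles for the data to cross the boundary
    plus `latency` cycles for the acknowledgement to come back.
    """
    if n_words <= 0:
        return 0
    return n_words * 2 * latency


def streaming_cycles(n_words: int, latency: int) -> int:
    """Coarse grain flow control: words are sent back to back, one per cycle.

    Assumes the destination FIFO never becomes full, so the sender is never
    stalled. The last word is launched in cycle n_words - 1 and needs
    `latency` more cycles to arrive.
    """
    if n_words <= 0:
        return 0
    return (n_words - 1) + latency


if __name__ == "__main__":
    for n in (1, 8, 64):
        print(n, handshake_cycles(n, latency=2), streaming_cycles(n, latency=2))
```

Under these assumptions the per-word handshake cost grows with the crossing latency, while the streaming cost approaches one cycle per word for long transfers. A real design would still need a back-pressure signal to stall the sender when the FIFO fills.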

Overview of Timing Issues
  Signal categories between processors A and B
    – A to B clock, which synchronizes the source signals
    – A to B signals: data and other signals such as “valid”
    – B to A signals, such as “ready” or “hold”
  Each processor contains two clock domains

Two Clock Domains within One Processor
  Use a dual-clock FIFO to handle the unrelated read and write clocks within one processor
  Multiple flip-flops are inserted at the clock domain boundary as a configurable synchronizer
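
A behavioral sketch of a configurable synchronizer may help make the idea concrete. It models a chain of flip-flops clocked by the destination domain, with the number of stages as a parameter. The Gray-code helpers are an assumption: Gray-coded FIFO pointers are a common companion to such synchronizers, but the slides do not say how the pointers are encoded. All class and function names here are hypothetical.

```python
class Synchronizer:
    """Chain of flip-flops clocked by the destination clock domain."""

    def __init__(self, stages: int = 2, init: int = 0) -> None:
        if stages < 1:
            raise ValueError("a synchronizer needs at least one stage")
        self.regs = [init] * stages

    def clock(self, async_in: int) -> int:
        """Model one destination clock edge.

        The new sample enters the first flip-flop, every other flip-flop takes
        the previous value of its predecessor, and the last one is the output.
        """
        self.regs = [async_in] + self.regs[:-1]
        return self.regs[-1]


def to_gray(n: int) -> int:
    """Binary to Gray code (assumed pointer encoding, not from the slides)."""
    return n ^ (n >> 1)


def from_gray(g: int) -> int:
    """Gray code back to binary."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


if __name__ == "__main__":
    sync = Synchronizer(stages=3)
    write_pointer = to_gray(5)
    # The pointer value becomes visible on the third destination clock edge.
    print([sync.clock(write_pointer) for _ in range(4)])
```

Making the stage count configurable trades latency for a lower probability of metastability reaching the downstream logic. With Gray coding, consecutive pointer values differ in one bit, so a sample taken mid-transition resolves to either the old or the new pointer.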

Inter-processor Timing Issues: Three Communication Methods
  (a) Sends the clock only when there is valid data
  (b) Sends the clock from one cycle before to one cycle after the valid data
  (c) Always sends the clock
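
The three methods can be compared by computing, for a given pattern of valid data, the cycles in which the source forwards its clock. This is a minimal sketch: the function name `clock_enable` and the list-of-cycles representation are hypothetical, chosen only for illustration.

```python
def clock_enable(valid, method):
    """Return, per cycle, whether the source clock is forwarded to the receiver.

    valid  -- sequence of 0/1 flags, one per source clock cycle
    method -- "a": clock only while data is valid
              "b": clock also one cycle before and one cycle after valid data
              "c": clock always sent
    """
    n = len(valid)
    if method == "a":
        return [bool(v) for v in valid]
    if method == "b":
        # Cycle i is enabled if cycle i-1, i, or i+1 carries valid data.
        return [any(valid[max(0, i - 1): i + 2]) for i in range(n)]
    if method == "c":
        return [True] * n
    raise ValueError("method must be 'a', 'b', or 'c'")


if __name__ == "__main__":
    valid = [0, 0, 1, 1, 0, 0, 0]
    for m in "abc":
        enabled = clock_enable(valid, m)
        print(m, enabled, "active clock cycles:", sum(enabled))
```

For this example pattern, method (a) forwards 2 clock cycles, method (b) forwards 4, and method (c) forwards all 7, which shows the trade-off between clock activity and the timing margin around each burst of data.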

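Timing Waveform of the Inter-processor Communication
  [Figure: waveforms labeled Send clk, Send data, Rec. clk, and Rec. data, with delays D clk, D data, and D DLY; one case is marked as a timing violation, and the case with DLY is marked as having no timing violation at the Rec. data sample time]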

Circuitry for Inter-processor Communication
  Insert configurable DLY logic in the data path
    – Compensates for the additional clock tree delay and avoids the timing violation
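
One way to reason about the DLY setting is to pick the smallest delay tap that moves the data transition out of the setup and hold window around the capturing clock edge. The sketch below assumes a simple model in which the clock and data leave the sender together, the clock reaches the capturing register after the clock tree delay, and the data reaches it after its wire delay plus DLY. Every number, name, and tap value is hypothetical, and all times are in picoseconds.

```python
def violates(clk_delay, data_delay, period, setup, hold):
    """True if the data transition lands inside the setup/hold window.

    `offset` is how long after a capturing clock edge the data changes,
    folded into one clock period.
    """
    offset = (data_delay - clk_delay) % period
    too_soon_after_edge = offset < hold              # hold violation
    too_close_to_next_edge = offset > period - setup  # setup violation
    return too_soon_after_edge or too_close_to_next_edge


def pick_delay(taps, clk_tree_delay, data_delay, period, setup, hold):
    """Return the smallest DLY tap that avoids a violation, or None."""
    for dly in sorted(taps):
        if not violates(clk_tree_delay, data_delay + dly, period, setup, hold):
            return dly
    return None


if __name__ == "__main__":
    # Hypothetical values: about 2.1 ns period, 600 ps clock tree, 650 ps data wire.
    args = dict(clk_tree_delay=600, data_delay=650, period=2100, setup=150, hold=100)
    print("violation without DLY:", violates(600, 650, 2100, 150, 100))
    print("chosen DLY tap (ps):", pick_delay([0, 100, 200, 300], **args))
```

With these made-up numbers the undelayed data changes 50 ps after the capturing edge, inside the hold window, and the 100 ps tap is the first setting that clears it.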

Inter-chip Communication
  Inter-chip communication shares similar features with inter-processor communication
  The path is longer and the timing is more complex
    – Output processor might need a low-speed clock
    – Destination processor can operate at full speed

Special Signals Besides Clock
  Avoiding a global clock enhances scalability, but other signals must also be addressed to maintain high scalability
    – Various global signals such as configuration and test signals
    – Power distribution
    – Processor IO pins
  Key idea: avoid or isolate all global signals if possible, so that multiple processors can be tiled directly without further changes

Clocking and Buffering of Global Signals
  Some global signals, such as configuration and test, are unavoidable
  Three options:
    – Pipeline these signals
    – Use an asynchronous interconnect
    – Use a low-speed clock, and buffer the signals in each processor
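
The pipelining option can be sketched as a signal that is registered once per tile, so that no wire spans more than one processor. The model below assumes the signal enters at the corner tile (0, 0) and that each tile forwards it to its east and south neighbors on the next cycle. That routing pattern and the function name are assumptions made for illustration, not details from the slides.

```python
def simulate_pipelined_signal(rows, cols):
    """Return the cycle in which each tile's register first holds the signal.

    Assumed routing: the signal is injected at tile (0, 0); every tile that
    holds it passes it to its east and south neighbors one cycle later.
    """
    arrived = [[None] * cols for _ in range(rows)]
    have = set()  # tiles whose register already holds the signal
    cycle = 0
    while len(have) < rows * cols:
        cycle += 1
        new = set()
        for r in range(rows):
            for c in range(cols):
                if (r, c) in have:
                    continue
                if (r, c) == (0, 0) or (r - 1, c) in have or (r, c - 1) in have:
                    new.add((r, c))
        for r, c in new:
            arrived[r][c] = cycle
        have |= new
    return arrived


if __name__ == "__main__":
    for row in simulate_pipelined_signal(6, 6):
        print(row)
```

Under this assumed routing the far corner of a 6 x 6 array sees the signal after 11 cycles. The latency grows with the array size, but the per-tile wiring stays the same, which is what keeps the tile reusable.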

Complete Power Distribution for Each Processor

Position of IO Pins
  The position of IO pins is important since they must connect to other processors
  Align IO pins with each other so that connecting wires are very short

An Asynchronous Array of Simple Processors (AsAP)
  Single-chip tile-based 6 x 6 GALS multiprocessor
    – Simple architecture and small memories for each processor
    – Nearest-neighbor interconnect between processors
  Targets computationally intensive DSP applications
  [Figure: block diagram of one processor, labeled FIFO 0, FIFO 1, CPU, OSC, and Output]

Back-end Design Flow
  A standard-cell-based design flow was used
    – RTL coding
    – Synthesis
    – Placement & routing
  Intensive verification was used throughout the design process
    – Gate-level analysis
    – Circuit-level simulation
    – DRC/LVS
    – Formal verification

Chip Micrograph
  Technology: TSMC 0.18 µm
  Max speed: [...] V
  Area
    – 1 processor: 0.66 mm²
    – Chip: 32.1 mm²
  Power (1 processor, 1.8 V, 475 MHz)
    – Typical application: 32 mW
    – Typical 100% active: 84 mW
  Power (1 processor, 0.9 V, 116 MHz)
    – Typical application: 2.4 mW
  [Figure: chip micrograph with a single processor marked]
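
The reported numbers allow a quick back-of-the-envelope check of the energy benefit of voltage scaling. The sketch below assumes that both "typical application" power figures refer to one processor, and it compares the measured energy per clock cycle with the ideal quadratic dependence of dynamic energy on supply voltage. The quadratic rule is a standard CMOS approximation brought in here for comparison and is not stated on the slides. The variable names are hypothetical.

```python
def energy_per_cycle_pj(power_mw: float, freq_mhz: float) -> float:
    """Energy per clock cycle in picojoules (mW / MHz = nJ, times 1000 = pJ)."""
    return power_mw / freq_mhz * 1000.0


# Figures from the slide (assumed to be per processor, typical application).
e_high = energy_per_cycle_pj(32.0, 475.0)   # at 1.8 V, 475 MHz
e_low = energy_per_cycle_pj(2.4, 116.0)     # at 0.9 V, 116 MHz

measured_ratio = e_high / e_low
ideal_ratio = (1.8 / 0.9) ** 2               # assumed V^2 scaling of dynamic energy

# Share of the chip area taken by the 6 x 6 array of 0.66 mm² tiles.
tile_fraction = 36 * 0.66 / 32.1

if __name__ == "__main__":
    print(f"energy per cycle at 1.8 V: {e_high:.1f} pJ")
    print(f"energy per cycle at 0.9 V: {e_low:.1f} pJ")
    print(f"measured reduction: {measured_ratio:.2f}x (ideal V^2 scaling: {ideal_ratio:.1f}x)")
    print(f"processor tiles cover {tile_fraction:.0%} of the chip area")
```

Under these assumptions, halving the supply voltage cuts the energy per cycle by roughly 3.3x, somewhat short of the ideal 4x, at about a quarter of the clock rate.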

Summary
  A GALS tile-based chip multiprocessor is an attractive architecture
    – High performance, energy efficient, and highly scalable
  Timing issues
    – Multiple clock domains within a single processor
    – Inter-processor communication
    – Inter-chip communication
  Scalability issues
    – “Global” signals
    – Power distribution
    – IO pins

Acknowledgments
  Funding
    – Intel Corporation
    – UC Micro
    – NSF Grant No
    – UCD Faculty Research Grant
  Special thanks
    – E. Work, D. Truong, W. Cheng, T. Jacobson, T. Mohsenin, R. Krishnamurthy, M. Anders, and S. Mathew
    – MOSIS
    – Artisan