Demystifying Data-Driven and Pausible Clocking Schemes. Robert Mullins, Computer Architecture Group, Computer Laboratory, University of Cambridge. ASYNC 2007, 13th IEEE International Symposium on Asynchronous Circuits and Systems

Presentation transcript:

1 Demystifying Data-Driven and Pausible Clocking Schemes
Robert Mullins, Computer Architecture Group, Computer Laboratory, University of Cambridge
ASYNC 2007, 13th IEEE International Symposium on Asynchronous Circuits and Systems

2 System-Timing: Emerging Challenges
- The current shift is from complex monolithic designs to networks of energy-efficient cores
- Distinct block-level and system-level timing challenges
- Network-level timing
  - Physically distributed
  - Activity may be sparse
  - Interconnect delay and power are significant
  - Significant variations in temperature, supply voltage and process parameters
- Higher-level control, timing and scheduling are naturally event-driven

3 Combining Local and Global Approaches to Timing
- Synchronization-free approaches
- Coping with metastability
  - Timing-safe: allocate a fixed period of time for metastability to resolve, e.g. a two flip-flop synchronizer
  - Value-safe: wait for metastability to resolve, e.g. clock stretching or pausing techniques
    - Clock is generated locally
- Value-safe ideas are less well understood and are avoided by industry
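
The contrast between the two approaches can be made concrete with the standard first-order metastability model, which is not taken from the slides. In this minimal sketch the function names and parameters (`tau_s`, `window_s`, and the example values in the usage note) are assumptions chosen for illustration. A timing-safe synchronizer fixes the resolution time and accepts a finite MTBF; a value-safe scheme waits instead, so the same exponential describes how rarely a long wait is needed.

```python
import math


def synchronizer_mtbf(resolve_time_s, tau_s, window_s, f_clk_hz, f_data_hz):
    """First-order MTBF (in seconds) of a timing-safe synchronizer.

    resolve_time_s: fixed time allowed for metastability to resolve
    tau_s:          resolution time constant of the flip-flop (assumed value)
    window_s:       metastability window of the flip-flop (assumed value)
    """
    return math.exp(resolve_time_s / tau_s) / (window_s * f_clk_hz * f_data_hz)


def prob_unresolved(wait_time_s, tau_s):
    """Probability that a metastable element is still unresolved after waiting.

    A value-safe scheme keeps the clock paused while this is the case,
    so it trades a rare long wait for the absence of a failure rate.
    """
    return math.exp(-wait_time_s / tau_s)
```

For example, with hypothetical values `tau_s=20e-12`, `window_s=20e-12`, a 1 GHz clock and a 100 MHz data rate, calling `synchronizer_mtbf(1e-9, 20e-12, 20e-12, 1e9, 1e8)` shows how strongly the MTBF depends on the time budgeted for resolution.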

4 Advantages of a value-safe approach
- Efficiency
  - Synchronization delay is minimized
  - Opportunities for optimization
- Robustness
  - Inherently robust, no trade-off against performance
  - Only way to guarantee data is never lost, no MTBF
  - Could still have functional failures if we are delayed too long and miss performance requirements
- Transparency
  - Synchronous block is unaffected by the clocking wrapper
  - Less true for traditional synchronization and clock-gating approaches
- Simplicity and modularity
  - I aim to illustrate how simple these schemes are

5 Adding an asynchronous interface to a clock generator

6

7

8

9 Input register driven by a pausible clock

10 Data-Driven Clock / Pausible Clock
- Data-driven clock: may need to add a mechanism to ensure the block receives enough clock edges, e.g. to flush the pipeline
- Pausible clock: need to add an explicit sleep mechanism if we want to halt the clock generator during periods of inactivity
- Helps classify and understand existing techniques. In reality, the design space is a continuum

11 Stretchable Clocks
A type of data-driven clock
1. Rising clock edge is generated
2. Stretch signal may be asserted (synchronously) in response to clk+
3. Low phase of the clock is stretched until some operation has completed and the stretch signal is removed
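
The three steps above can be sketched as a small timing model. This is an illustration only: the function name `rising_edges` and the simplification that the next rising edge occurs at the later of the nominal period and the completion of the stretched operation are assumptions, not details from the slides.

```python
def rising_edges(period, op_delays):
    """Return the times of successive rising clock edges.

    period:    nominal clock period when no stretch is requested
    op_delays: for each cycle, the time (measured from that cycle's rising
               edge) at which the operation completes and stretch is removed;
               0 means stretch is never asserted in that cycle
    """
    t = 0
    edges = []
    for delay in op_delays:
        edges.append(t)  # step 1: rising edge is generated
        # steps 2 and 3: if stretch is asserted, the low phase is extended
        # until the operation completes; otherwise the nominal period applies
        t += max(period, delay)
    return edges
```

With a nominal period of 10 time units, `rising_edges(10, [0, 25, 0])` places the third edge 25 units after the second, because the second cycle's low phase is stretched.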

12 Stretchable Clocks

13 Stretchable Clocks

14 Stretchable Clocks

15 Stretchable Clocks

16 Stretchable Clocks

17 Input Ports
- Arbitrated inputs
  - At most one input can be served per cycle
- Synchronised inputs
  - Cannot proceed until multiple inputs are ready
- Sampled inputs
  - Can progress with a variable number of data inputs (or none)
- Need to also choose the event that triggers sampling of inputs
- Paper provides implementation details for each input port type for pausible and data-driven clock generators
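
The three input port types differ only in which ready inputs are admitted in a given cycle. The sketch below expresses that as three small policies. The function names are hypothetical, the fixed-priority choice in `arbitrated` stands in for a real arbiter, and `synchronised` assumes that all inputs are required; the paper's circuits are not reproduced here.

```python
def arbitrated(ready):
    """At most one input is served per cycle (lowest index wins here)."""
    for index, is_ready in enumerate(ready):
        if is_ready:
            return [index]
    return []


def synchronised(ready):
    """Proceed only when every input is ready; otherwise admit nothing."""
    if ready and all(ready):
        return list(range(len(ready)))
    return []


def sampled(ready):
    """Proceed with whichever inputs are ready, possibly none."""
    return [index for index, is_ready in enumerate(ready) if is_ready]
```

For the ready pattern `[False, True, True]`, `arbitrated` admits input 1 only, `synchronised` admits nothing, and `sampled` admits inputs 1 and 2.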

18 Output Ports
- Scheduled
  - Ensure data is output on a particular clock cycle, stall until data is consumed
- Registered
  - Addition of an output register allows the next computation to proceed while data is consumed
- Polled
  - Sample the output port ready signal and take appropriate action
  - Clock period is only ever extended to allow metastability to resolve, not because the output is blocked

19 A GALS Wrapper Example
- Free-running clock
- Asynchronous input
  - We know nothing about when data will arrive
  - For simplicity, let's assume we can always accept new data
- Registered output feeding an asynchronous FIFO
- Simple to combine clock generator, input and output ports

20 A GALS Wrapper Example: Step 1. Local clock generator with H/S interface

21 A GALS Wrapper Example: Step 2. Pausible Clock Template

22 A GALS Wrapper Example: Step 3. Provide registered output port support (stretchable clock template)

23 A GALS Wrapper Example: Step 4.

24 Data-Driven Clocking for On-Chip Networks
- Why is global synchrony limiting for on-chip networks?
  - Reconfigurable networks, adaptive low-voltage interconnect drivers, irregular topologies, ...
- Problem with traditional synchronization techniques
  - Latency (could easily double best-case latency; our routers are single-cycle, support VCs, < 30 FO4)
- Problems with fully-asynchronous implementations
  - Latency (for the router designs we have examined)
  - More difficult to speculate? Scheduling is expensive?

25 Data-Driven Clocking for On-Chip Routers
- Router should be clocked when one or more inputs are valid (or flits are buffered)
- Elevator analogy
  - Free-running (paternoster) elevator
    - Chain of open compartments
    - Must synchronise before you jump on!
  - Traditional elevator (data-driven clock)
    - Wait for someone to arrive
    - Close doors, decide who is in and who is out
    - Metastability issue again (potentially painful!)

26 Data-Driven Clock with Sampled Inputs
[Figure: local clock generator template. Annotations: incoming data; sample inputs when at least one input is ready (and clock is low); assert Lock; each input is either admitted or locked out (close lift doors).]
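
The lift-door behaviour described on this slide can be sketched as a behavioural model. Everything named here (`data_driven_cycles`, `lock_delay`, `period`) is an assumption for illustration. Metastability is ignored: an input arriving exactly at the lock time is treated as admitted, whereas in hardware an arbitration element would decide.

```python
def data_driven_cycles(arrivals, lock_delay, period):
    """Group input arrival times into clock cycles.

    arrivals:   arrival times of input data
    lock_delay: time between starting to sample and asserting Lock (>= 0)
    period:     time the clock is busy after Lock before it can sample again

    Returns a list of (lock_time, admitted_arrivals) tuples, one per cycle.
    """
    pending = sorted(arrivals)
    cycles = []
    clock_free = 0
    while pending:
        # Sampling starts when at least one input is ready and the clock is low.
        start = max(pending[0], clock_free)
        lock_time = start + lock_delay
        # Inputs that arrived before Lock are admitted; the rest are locked out
        # and wait for the next cycle.
        admitted = [a for a in pending if a <= lock_time]
        pending = [a for a in pending if a > lock_time]
        cycles.append((lock_time, admitted))
        clock_free = lock_time + period
    return cycles
```

No clock cycles are generated while `pending` is empty, which is the sense in which the clock is data-driven.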

27 Clock Tree Insertion Delays
- Delay from root to leaf of the clock tree can be considerable (certainly non-zero!)
- If every clock cycle is the same, this clock insertion delay is not normally an issue
- If we stretch the clock, the insertion delay must be considered in our timing analysis (also true for clock gating in the synchronous world)
- Not difficult to handle, but can increase the time required to admit new data

28 Clock Tree Insertion Delays Can place logic here

29 Clock Tree Insertion Delays
- How do we handle multi-cycle insertion delays?
- In practice, we would want to avoid very large synchronous blocks
- Need to ensure we admit data on the correct clock cycle
- Cannot cheat and promote data!
- We simply remember on which clock cycle data has been scheduled to be admitted
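
One way to read "remember on which clock cycle data has been scheduled" is as a delay line that is as long as the insertion delay measured in clock cycles. The sketch below follows that reading. The class `AdmissionSchedule` and its method are hypothetical and are not the mechanism described in the paper.

```python
from collections import deque


class AdmissionSchedule:
    """Tracks which data item belongs to each clock edge in flight in the tree."""

    def __init__(self, insertion_cycles):
        # One slot per clock edge currently travelling from root to leaves.
        self.in_flight = deque([None] * insertion_cycles)

    def root_edge(self, admitted):
        """Record the admission decision for the edge launched at the root.

        Returns the decision attached to the edge that reaches the leaves now,
        so data is never promoted to an earlier edge than it was scheduled on.
        """
        self.in_flight.append(admitted)
        return self.in_flight.popleft()
```

With `AdmissionSchedule(2)`, data scheduled on one root edge is returned two calls later, matching a two-cycle insertion delay.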

30 Summary
- Value-safe techniques are simple and robust
  - Powerful framework for composing synchronous sub-systems
  - Build efficient event-driven global communication and scheduling infrastructure?
  - Scope for supporting low-power techniques? (self-timed power-gating, DVFS support, timing-speculation, ...)
- Scope for exploiting event-driven scheduling and clocking at system level
- Synchronization costs are low enough to prompt use in on-chip network applications
- More in the paper, which aims to be a useful survey and hopefully fills some gaps too

31 Thank You!