1 Clockless Logic
or: How do I make hardware fast, power-efficient, less noisy, and easy-to-design?
Montek Singh
Tue, Jan 14, 2003



2 Course Information (1)
Course Number: COMP
Time and Place: Tue/Thu 3:30-4:45pm, Sitterson Hall 325
Instructor: Montek Singh, SN 245
Office hours: most afternoons/by appointment
Teaching Assistant: None
Course Web Page

3 Course Information (2)
Prerequisites:
• undergraduate knowledge of: digital logic, algorithms, discrete math (sets and graphs)
• no knowledge of advanced circuit design or of VLSI is assumed
    • relevant topics will be covered in class as needed
• you are assumed to know the following topics:
    • digital logic: Boolean algebra, logic gates, and latches and registers
    • algorithms: search techniques, enumeration, divide and conquer, and time complexity
    • discrete math: elementary set theory and graph theory

4 Course Information (3)
Reading Material:
• Papers and technical reports supplied by instructor
Course Content:
• The following topics will be covered:
    • Introduction to clockless logic
    • Graphical representation of asynchronous systems
    • Algorithms for logic synthesis
        • Combinational
        • Sequential
    • Design techniques
        • High-performance
        • Low-power
    • Formal methods (performance analysis and verification)
    • Case studies of real-world asynchronous processors

5 Course Information (4)
Grading:
• 30% homework assignments
• 35% class project
    • your choice of topic: from pure algorithms to VLSI design
• 30% exams
• 5% class participation
Honor Code is in effect:
• you are encouraged to discuss ideas/concepts
• work handed in must be your own

6 Lecture 1: Introduction
• What is asynchronous design?
• Why do we want to study it?
• How is data represented in an asynchronous system?
• How is information exchanged?

7 Introduction: Clocked Digital Design
Most current digital systems are synchronous:
• Clock: a global signal that paces operation of all components
[Figure: clock]
Benefit of clocking: enables discrete-time representation
• all components operate exactly once per clock tick
• component outputs need to be ready by next clock tick
    • allows “glitchy” or incorrect outputs between clock ticks

8 Microelectronics Trends
Current and Future Trends: Significant Challenges
• Large-Scale “Systems-on-a-Chip” (SoC)
    • 100 Million ~ 1 Billion transistors/chip
• Very High Speeds
    • multiple GigaHertz clock rates
• Explosive Growth in Consumer Electronics
    • demand for ever-increasing functionality …
    • … with very low power consumption (limited battery life)
• Higher Portability/Modularity/Reusability
    • “plug ’n play” components, robust interfaces

9 Challenges to Clocked Design
Breakdown of Single-Clock Paradigm:
• Chip will be partitioned into multiple timing domains
    • challenge: gluing together multiple timing domains
        • glue logic is susceptible to “metastability” (= incorrect values transferred) and latency overheads
Increasing Difficulties with Clocked Design:
• Clock distribution: requires significant designer effort
• Performance bottleneck: a single slow component
• Clock burns large fraction of chip power (~40-70%)
• Fixed clock rate: poor match for
    • designing reusable components
    • interfacing with mixed-timing environments

10 What is Asynchronous Design?
• Digital design with no centralized clock
• Synchronization using local “handshaking”
[Figure: Synchronous System (Centralized Control) driven by a clock, next to an Asynchronous System (Distributed Control) whose components are linked by handshaking interfaces]

11 Why Asynchronous Design? (1)
• Higher Performance
    • May obtain “average-case” operation (not “worst-case”)
        • not limited by slowest component
    • Avoids overheads of multi-GHz clock distribution
• Lower Power
    • No clock power expended
    • Inactive components consume negligible power
• Better Electromagnetic Compatibility
    • Smooth radiation spectra: no clock spikes
    • Much less interference with sensitive receivers [e.g., Philips pagers, smartcards]
• Greater Flexibility/Modularity
    • Naturally adapt to variable-speed environments
    • Supports reusable components
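The “average-case, not worst-case” point can be made concrete with a small sketch. The delay values below are invented for illustration, and the model deliberately ignores handshake overhead: a clocked design gives every operation one clock period sized for the slowest operation, while a self-timed design lets each operation finish as soon as it signals completion.

```python
def clocked_time(delays_ns):
    """Total time when every operation gets one clock period,
    and the period must cover the slowest operation."""
    clock_period = max(delays_ns)
    return clock_period * len(delays_ns)


def self_timed_time(delays_ns):
    """Total time when each operation signals its own completion
    (handshake overhead is ignored in this simplified model)."""
    return sum(delays_ns)


# Hypothetical per-operation delays in nanoseconds: mostly fast, one slow.
delays = [2.0, 2.0, 3.0, 2.0, 8.0, 2.0]
print(clocked_time(delays))     # 48.0
print(self_timed_time(delays))  # 19.0
```

In a real circuit the handshake itself costs time, so the gap is smaller than this idealized model suggests.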

12 Why Asynchronous Design? (2)
• The world already is mostly asynchronous!
    • Events at the level of (or in between) large-scale systems are asynchronous
        • several seconds to several milliseconds
        • e.g., PC-printer communication, keyboard inputs, network comm.
    • Events at the board level (or between chips) are often asynchronous
        • milliseconds to 100 nanoseconds
        • e.g., CPU-memory interface, interface with I/O subsystem (interrupts)
    • Events within a chip, at the level of functional units (e.g., adders, control logic) are currently synchronous
        • several nanoseconds to 100 picoseconds
    • Events at the level of a single logic gate are asynchronous
        • 10 picoseconds
    • Events at the quantum level are asynchronous
        • picoseconds to femtoseconds
• So, why bother with clocks at all?!
    • make everything asynchronous → greater elegance and robustness

13 Challenges of Asynchronous Design
• Hazards: potential “glitches” on wire
    • communication must be hazard-free!
    • special design challenge = “hazard-free synthesis”
    [Figure: clean signals vs. hazardous signals relative to a clock tick; glitches are no problem for clocked systems]
• Testability Issues:
    • absence of clock means no “single-stepping”
• Lack of Commercial CAD Tools:
    • chicken-and-egg problem

14 Asynchronous Design: Past & Present
Async Design: In existence for 50 years, but …
… many recent technical advances:
• Hazard-Free Circuit Design:
    • several practical techniques for controllers [Stanford/Columbia]
• Design for Testability:
    • several test solutions, e.g. Philips Research
• Maturing Computer-Aided-Design (“CAD”) Tools:
    • software tools for automated design [Philips, Columbia, Manchester]
• Successful Fabricated Chips:
    • embedded processors, high-speed pipelines, consumer electronics…

15 Recent Commercial Interest
Several commercial asynchronous chips:
• Philips: asynchronous 80c51 microcontrollers
    • used in commercial pagers [1998] and smartcards [2001]
• Univ. of Manchester: async ARM processor [2000]
• Motorola: async divider in PowerPC chip [2000]
• HAL: async floating-point divider
    • in HAL-I and II processors [early 1990’s]
Recent experimental chips:
• IBM, Sun and Intel:
    • fast pipelines, arbiters, instruction-length decoder…
• IBM/Columbia/UNC: asynchronous digital FIR filter
Several recent startups:
• Theseus Logic, Fulcrum, Self-Timed Solutions…

16 A 5-minute Homework Problem
Alice and Bob live on opposite sides of a wide river:
Alice is supposed to send a message (say, a “Yes”/“No”) across to Bob around midnight.
Both have flashlights, but neither owns a watch.
What should they do? Suggest several strategies, and discuss pros and cons of each.
[Figure: Alice and Bob on opposite banks]

17 Solution 1
Alice uses 2 lamps:
• 1 to indicate that she is ready with the message, and
• 1 for the message itself
Bob uses 1 lamp:
• to indicate that he has received the message
[Figure: Alice signals “ready” and “yes/no” to Bob; Bob signals “got it” back to Alice]

18 Solution 2
Alice uses 2 lamps:
• Green lamp to indicate “yes”
• Red lamp to indicate “no”
Bob uses 1 lamp:
• to indicate that he has received the message
[Figure: Alice signals “yes” or “no” to Bob; Bob signals “got it” back to Alice]

19 Solution 3
What if Alice and Bob could keep time?
Alice uses 1 lamp for the message:
• At 12 midnight: turns on lamp if message = “yes”
• At 12:01: turns lamp off
Bob needs no lamps!
• Takes down the message between 12 and 12:01
Pros: Fewer signals, less processing needed
Cons: Alice and Bob must keep their clocks closely synchronized
• If Bob’s watch is off by a minute, incorrect communication is possible
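The weakness of the timed solution can be sketched in a few lines. This is a toy model, not part of the original slides: time is counted in whole minutes after midnight on Alice's watch, and `bob_offset_minutes` is a hypothetical parameter giving the minute (on Alice's watch) at which Bob actually looks.

```python
def lamp_is_on(message_is_yes, minute_on_alices_watch):
    """Alice's lamp is lit only during minute 0 (12:00 to 12:01),
    and only if the message is 'yes'."""
    return message_is_yes and minute_on_alices_watch == 0


def bob_reads(message_is_yes, bob_offset_minutes):
    """Bob looks when HIS watch says midnight. If his watch disagrees with
    Alice's by `bob_offset_minutes`, he samples the lamp at that minute."""
    return lamp_is_on(message_is_yes, bob_offset_minutes)


print(bob_reads(True, 0))  # True:  watches agree, Bob reads "yes"
print(bob_reads(True, 1))  # False: Bob is a minute late and reads "no"
```

Solutions 1 and 2 avoid this failure because Bob's “got it” lamp replaces the shared notion of time.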

20 Data Representation Styles: “Bundled Data”
Single-rail “Bundled Datapath”: simplest approach
• widely used
Features:
• datapath: 1 wire per bit (e.g. standard sync blocks)
• matched delay: produces delayed “done” signal
    • worst-case delay: longer than slowest path
+ Practical style: can reuse sync components; small area
– Fixed (worst-case) completion time
[Figure: a function block with inputs bit 1 … bit n and outputs bit 1 … bit m; the request passes through a matched delay to produce “done”, which indicates valid data]
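A minimal sketch of the matched-delay constraint, assuming made-up path delays and a hypothetical safety margin: the “done” signal is simply the request delayed by a fixed amount, so that amount has to exceed the slowest path through the function block, and every operation then completes in that same worst-case time.

```python
def matched_delay_is_safe(path_delays_ns, matched_delay_ns, margin_ns=0.0):
    """The matched delay must be at least the slowest datapath delay
    (plus an optional safety margin), otherwise 'done' could rise
    before the data outputs are valid."""
    return matched_delay_ns >= max(path_delays_ns) + margin_ns


def done_time(request_time_ns, matched_delay_ns):
    """'done' is the request delayed by the matched delay, regardless of
    how quickly the actual data settled (fixed completion time)."""
    return request_time_ns + matched_delay_ns


paths = [1.2, 2.5, 4.0]                      # hypothetical path delays (ns)
print(matched_delay_is_safe(paths, 4.5))     # True
print(matched_delay_is_safe(paths, 3.0))     # False: done would be early
print(done_time(10.0, 4.5))                  # 14.5
```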

21 Data Representation Styles: Dual-Rail
Dual-rail: uses 2 wires per data bit
Each Dual-Rail Pair: provides both data value and validity
+ provides robust data-dependent completion
– needs completion detectors
[Figure: function block with dual-rail inputs bit 1 … bit n and dual-rail outputs bit 1 … bit m]

22 Dual-Rail (contd.)
Dual-Rail Completion Detector:
• combines dual-rail signals
• indicates when all bits are valid (or reset)
Construction:
• OR together 2 rails per bit
• Merge results using a Müller “C-element”
C-element:
• if all inputs=1, output → 1
• if all inputs=0, output → 0
• else, maintain output value
[Figure: one OR gate per bit (bit 0, bit 1, …, bit n) feeding a C-element whose output is “Done”]
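The completion detector described above can be modelled behaviourally in a few lines. This is a software sketch rather than a circuit description; the class and function names are invented for illustration, and each bit is represented as a pair of rail values.

```python
class CElement:
    """Behavioural model of a Muller C-element: the output goes to 1 when all
    inputs are 1, to 0 when all inputs are 0, and otherwise holds its value."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, inputs):
        if all(v == 1 for v in inputs):
            self.out = 1
        elif all(v == 0 for v in inputs):
            self.out = 0
        # mixed inputs: keep the previous output
        return self.out


def completion_detector(c_element, dual_rail_bits):
    """OR the two rails of every bit, then merge the results with the
    C-element. Returns 1 once all bits are valid, 0 once all are reset."""
    per_bit_valid = [rail_a | rail_b for rail_a, rail_b in dual_rail_bits]
    return c_element.update(per_bit_valid)


c = CElement()
print(completion_detector(c, [(0, 0), (0, 0)]))  # 0: all bits reset
print(completion_detector(c, [(0, 1), (0, 0)]))  # 0: only one bit valid, hold
print(completion_detector(c, [(0, 1), (1, 0)]))  # 1: all bits valid
print(completion_detector(c, [(0, 0), (1, 0)]))  # 1: partially reset, hold
print(completion_detector(c, [(0, 0), (0, 0)]))  # 0: all bits reset
```

Because the detector only ORs the rails, it does not depend on which rail is taken to mean 0 and which to mean 1.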

23 Handshaking Styles: 4-phase
4-Phase: requires 4 events per handshake
[Figure: Request and Acknowledge waveforms, annotated in order: start event, done, get ready for next event, ready for next event]
+ “Level-sensitive” → simpler logic implementation
– Overhead of “return-to-zero” (RTZ or resetting)
    • extra events which do no useful computation
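The four events of one 4-phase handshake can be traced as successive (request, acknowledge) wire levels. The sketch below is only a bookkeeping model with an invented function name; it assumes both wires start low.

```python
def four_phase_handshake():
    """Return the sequence of (req, ack) levels for one 4-phase handshake,
    starting and ending with both wires low."""
    req, ack = 0, 0
    trace = [(req, ack)]

    req = 1                    # 1. sender raises request: start event
    trace.append((req, ack))
    ack = 1                    # 2. receiver raises acknowledge: done
    trace.append((req, ack))
    req = 0                    # 3. sender lowers request: get ready for next event
    trace.append((req, ack))
    ack = 0                    # 4. receiver lowers acknowledge: ready for next event
    trace.append((req, ack))
    return trace


trace = four_phase_handshake()
print(trace)           # [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
print(len(trace) - 1)  # 4 events for a single transfer
```

Steps 3 and 4 are the return-to-zero overhead: they restore the idle state but carry no new data.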

24 Handshaking Styles: 2-phase
2-Phase: requires 2 events per handshake
[Figure: Request and Acknowledge waveforms, annotated in order: start event, done, start next event, next event done]
+ Elegant: no return-to-zero
– Slower logic implementation:
    • logic primitives are inherently level-sensitive, not event-based (at least in CMOS)
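In 2-phase (transition) signalling every wire toggle is an event, whether it rises or falls. The sketch below, with an invented function name and both wires assumed to start low, traces the (request, acknowledge) levels over several transfers.

```python
def two_phase_transfers(num_transfers):
    """Return the (req, ack) levels over `num_transfers` 2-phase handshakes.
    Each transfer is one toggle of req followed by one toggle of ack."""
    req, ack = 0, 0
    trace = [(req, ack)]
    for _ in range(num_transfers):
        req ^= 1               # sender toggles request: start event
        trace.append((req, ack))
        ack ^= 1               # receiver toggles acknowledge: done
        trace.append((req, ack))
    return trace


trace = two_phase_transfers(2)
print(trace)           # [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
print(len(trace) - 1)  # 4 events, but two complete transfers
```

The four wire transitions that a return-to-zero protocol spends on one transfer carry two transfers here; the price is that the receiving logic must react to edges rather than levels.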

25 Handshaking + Data Representation
Several combinations possible:
• dual-rail 4-phase, single-rail 4-phase, dual-rail 2-phase, and single-rail 2-phase
Example: dual-rail 4-phase
• dual-rail data: functions as an implicit “request”
• 4-phase cycle: between acknowledge and implicit request
[Figure: block A sends dual-rail bits 1 … m to block B; B returns “ack” to A]
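One dual-rail 4-phase transfer can be sketched as follows. The rail assignment is an assumption not stated in the slides: each bit is a pair (false_rail, true_rail), with (1, 0) meaning 0, (0, 1) meaning 1, and (0, 0) the reset “spacer”. All function names are illustrative.

```python
SPACER = (0, 0)  # assumed encoding of "no data" on one dual-rail pair


def encode(bits):
    """Assumed encoding: bit 0 -> (1, 0), bit 1 -> (0, 1)."""
    return [(1 - b, b) for b in bits]


def is_valid(word):
    """A word is valid when every pair has exactly one rail high."""
    return all(f + t == 1 for f, t in word)


def decode(word):
    return [t for _, t in word]


def transfer(bits):
    """Trace one dual-rail 4-phase transfer from sender A to receiver B."""
    events = []
    word = encode(bits)               # 1. A drives valid data (implicit request)
    events.append("data valid")
    received = decode(word) if is_valid(word) else None
    events.append("ack high")         # 2. B has the data and raises ack
    word = [SPACER] * len(bits)       # 3. A resets all rails (return-to-zero)
    events.append("data reset")
    events.append("ack low")          # 4. B sees the spacer and lowers ack
    return received, events


print(transfer([1, 0, 1]))
# ([1, 0, 1], ['data valid', 'ack high', 'data reset', 'ack low'])
```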

26 Other Data Representation Styles
• Level-Encoded Dual-Rail (LEDR)
    • 2 wires per bit: “data” and “phase”
    • exactly one wire per bit changes value
        • if new value is different, “data” wire changes value
        • else “phase” wire changes value
    [Figure: “data” and “phase” waveforms]
• M-of-N Codes
    • N wires used for a data word
    • M wires (M <= N) change value
    • Values of N and M: have impact on…
        • information transmitted, power consumed and logic complexity
• Knuth codes, Huffman codes, …
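The LEDR rule stated above translates directly into code. The sketch below follows the slide's wording (a “data” wire and a “phase” wire per bit); the function names are invented. The second helper counts M-of-N codewords under the assumption that a codeword is identified by which M of the N wires are used, which gives a binomial coefficient.

```python
import math


def ledr_send(state, new_value):
    """Advance one LEDR wire pair. `state` is (data, phase).
    Exactly one of the two wires changes per transmitted value."""
    data, phase = state
    if new_value != data:
        data = new_value       # value differs: the data wire changes
    else:
        phase ^= 1             # value repeats: the phase wire changes
    return (data, phase)


def m_of_n_codewords(n, m):
    """Number of distinct codewords if a codeword is a choice of M wires
    out of N (an assumption about how the code is defined)."""
    return math.comb(n, m)


state = (0, 0)
trace = [state]
for bit in [1, 1, 0, 0]:
    state = ledr_send(state, bit)
    trace.append(state)
print(trace)  # [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]

print(m_of_n_codewords(4, 1), m_of_n_codewords(4, 2))  # 4 6
```

In the trace, the parity of the two wires alternates with every value sent, which is how a receiver can tell that a new value has arrived without any return-to-zero phase.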

27 Which to use?
Depends on several performance parameters:
• speed
    • single-rail vs. dual-rail
        • single-rail may be faster (if designed aggressively)
        • dual-rail may be faster (if completion times vary widely)
    • 2-phase vs. 4-phase
        • 2-phase may be faster (if logic overhead is small)
        • 4-phase may be faster (if overhead of return-to-zero is small)
• power consumption
    • 2-phase typically has fewer gate transitions (→ lower power)
• amount of logic used (#gates/wires/pins → chip area)
    • single-rail needs fewer gates/wires/pins
• design and verification effort
    • dual-rail, 1-of-N, M-of-N, Knuth codes…:
        • delay-insensitive: robust in the presence of arbitrary delays
    • single-rail: requires greater timing verification effort

28 Sutherland’s Micropipelines Seminal Paper

29 Focus of Sutherland’s Turing Award Lecture: Pipelining
Motivation:
• Pipelining is at the heart of nearly all high-performance digital systems
Additional Benefits:
• Low power
• Interfacing with mixed systems
• Modular and scalable design

30 Background: Pipelining
What is Pipelining?: Breaking up a complex operation on a stream of data into simpler sequential operations
+ Throughput: significantly increased
– Latency: somewhat degraded
Throughput = #data items processed/second
[Figure: a “coarse-grain” pipeline (e.g. simple processor) with stages fetch, decode, execute, and a “fine-grain” pipeline (e.g. pipelined adder); stages are separated by storage elements (latches/registers)]
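The throughput/latency trade-off can be worked through with assumed numbers. The model below is a simplification in which every stage takes one cycle, the cycle is set by the slowest stage plus a storage-element overhead, and all delay values are hypothetical.

```python
def unpipelined(total_logic_delay_ns):
    """One item must finish before the next starts."""
    latency = total_logic_delay_ns
    throughput = 1.0 / total_logic_delay_ns        # items per ns
    return latency, throughput


def pipelined(stage_delays_ns, storage_overhead_ns):
    """Each stage takes one cycle; the cycle is set by the slowest stage
    plus the latch/register overhead between stages."""
    cycle = max(stage_delays_ns) + storage_overhead_ns
    latency = cycle * len(stage_delays_ns)
    throughput = 1.0 / cycle                       # items per ns
    return latency, throughput


# Hypothetical example: 9 ns of logic split into three 3 ns stages,
# with 0.5 ns of storage-element overhead per stage.
print(unpipelined(9.0))                  # latency 9.0 ns
print(pipelined([3.0, 3.0, 3.0], 0.5))   # latency 10.5 ns, higher throughput
```

With these numbers the latency grows from 9 ns to 10.5 ns while the throughput rises from one item every 9 ns to one item every 3.5 ns.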

31 Focus of Async Community
Our Focus: Extremely fine-grain pipelines
• “gate-level” pipelining = use narrowest possible stages
• each stage consists of only a single level of logic gates
    • some of the fastest existing digital pipelines to date
Application areas:
• multimedia hardware (graphics accelerators, video DSP’s, …)
    • naturally pipelined systems, throughput is critical
    • input is often “bursty”
• optical networking
    • serializing/deserializing FIFO’s
• genomic string matching?
    • KMP style string matching: variable skip lengths