1 Clockless Computing Montek Singh Thu, Sep 13, 2007.

Presentation transcript:

1 Clockless Computing Montek Singh Thu, Sep 13, 2007

2 Dynamic Logic Pipelines (contd.)
  - Drawbacks of Williams’ PS0 Pipelines
  - Lookahead Pipelines [Singh/Nowick 2000]
  - High-Capacity Pipelines [Singh/Nowick 2000]

3 Drawbacks of PS0 Pipelining
  1. Poor throughput:
     - long cycle time: 6 events per cycle
     - data “tokens” are forced far apart in time
  2. Limited storage capacity:
     - at most 50% of stages can hold distinct tokens
     - data tokens must be separated by at least one spacer
  My research goals have been:
     - address both issues
     - still maintain very low latency

4 Recent Approaches
  3 novel styles for high-speed async pipelining:
     - MOUSETRAP Pipelines [Singh/Nowick, TAU-00, ICCD-01]
     - “Lookahead Pipelines” (LP) [Singh/Nowick, Async-00]
     - “High-Capacity Pipelines” (HC) [Singh/Nowick, WVLSI-00]
  Goal: significantly improve throughput of PS0
  Two distinct strategies:
     - LP: introduce protocol optimizations: “shave off” components from critical cycle
     - HC: fundamentally new protocol: greater concurrency, “loosely-coupled” stages

5 Outline
  -> New Asynchronous Pipelines:
     - MOUSETRAP Pipelines (static circuit style)
     -> Lookahead Pipelines (LP) (dynamic circuit style)
     - High-Capacity Pipelines (HC) (dynamic circuit style)

6 Lookahead Pipeline Styles Singh and Nowick Async-2000 [Best Paper Award]

7 Lookahead Pipelines: Strategy #1
  Use non-neighbor communication:
     - stage receives information from multiple later stages
     - allows “early evaluation”
  Benefit: stage gets head-start on next cycle

8 Lookahead Pipelines: Strategy #2
  Use early completion detection:
     - completion detector moved before stage (not after)
     - stage indicates “early done” in parallel with computation
  Benefit: again, stage gets head-start on next cycle
  [Figure label: early completion detector]

9 Lookahead Pipelines: Overview
  5 new designs:
  -> “Dual-Rail” Data Signaling:
     - LP3/1: “early evaluation”
     - LP2/2: “early done”
     - LP2/1: “early evaluation” + “early done”
  - “Single-Rail” Bundled-Data Signaling:
     - LP SR 2/2: “early done”
     - LP SR 2/1: “early evaluation” + “early done”

10 Dual-Rail Design #1: LP3/1
  Optimization = “early evaluation”
     - each stage has two control inputs: from stages N+1 and N+2
  Idea: shorten precharge phase
     - terminate precharge early: when N+2 is done evaluating
  [Figure: stages N, N+1, N+2, each with a Processing Block and a Completion Detector; labels: Data in, Data out, PC, Eval (from N+2)]

11 LP3/1 Protocol
  PRECHARGE N: when N+1 completes evaluation
  EVALUATE N: when N+2 completes evaluation (New! Enables “early evaluation”)
  [Figure: numbered events across stages N, N+1, N+2; labels: N evaluates; N+1 evaluates; N+1 indicates “done”; N+2 evaluates; N+2 indicates “done”]
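The two LP3/1 rules above can be sketched as a small behavioral function. This is only an illustration, not the circuit from the slides: it assumes each stage's "done" is a level signal (True after evaluation completes, False after precharge), and the function name `lp31_phase` is hypothetical.

```python
def lp31_phase(next_done: bool, next_next_done: bool) -> str:
    """Phase commanded for stage N, given the 'done' levels of N+1 and N+2.

    Assumption: 'done' is True once a stage has finished evaluating and
    False once it has precharged.
    """
    # PRECHARGE N: stage N+1 has completed evaluation...
    # ...but precharge ends early as soon as N+2 has also completed
    # evaluation (the "early evaluation" optimization).
    if next_done and not next_next_done:
        return "precharge"
    return "evaluate"


if __name__ == "__main__":
    for n1 in (False, True):
        for n2 in (False, True):
            print(f"done(N+1)={n1}, done(N+2)={n2} -> {lp31_phase(n1, n2)}")
```

Under this reading, stage N is held in precharge only during the window between the "done" of N+1 and the "done" of N+2, which is how the protocol shortens the precharge phase.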

12 LP3/1: Comparison with PS0
  PS0 (6 events in cycle):
     - PRECHARGE N: when N+1 completes evaluation
     - EVALUATE N: when N+1 completes precharging
  LP3/1 (only 4 events in cycle):
     - PRECHARGE N: when N+1 completes evaluation
     - EVALUATE N: when N+2 completes evaluation (enables “early evaluation”)
  [Figure: numbered event diagrams for stages N, N+1, N+2 under each protocol]

13 LP3/1 Performance
  Cycle Time = [formula not preserved in transcript]
  Savings over PS0: 1 Precharge + 1 Completion Detection
  [Figure label: saved path]

14 LP3/1: Inside a Stage
  Merging 2 control inputs:
     [Figure: PC (from stage N+1) and Eval (from stage N+2) merged by a NAND gate; labels: “early Eval”, “old Eval”]
  Timing issues:
     - must satisfy several simple constraints
     - Ex.: PC must arrive before Eval is de-asserted
        - 1-sided timing requirement
        - easily satisfied in practice

15 Dual-Rail Design #2: LP2/2
  Optimization = “early done”
  Idea: move completion detector before processing block
     - stage indicates when “about to” precharge/evaluate
  [Figure: “early” Completion Detector placed ahead of the Processing Block; labels: Data in, Data out, “early done”]

16 LP2/2 Completion Detector
  Modified completion detectors needed:
     - Done = 1 when stage starts evaluating, and inputs valid
     - Done = 0 when stage starts precharging
     - implemented with an asymmetric C-element
  [Figure: asymmetric C-element (C) with output Done; inputs labeled OR bit 0, OR bit 1, ..., OR bit n (marked “+”) and PC]
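The set/reset asymmetry of this detector can be illustrated with a small behavioral model. It is a sketch under stated assumptions, not the slide's gate netlist: the class name `EarlyCompletionDetector` and the boolean `evaluating` control are hypothetical, and "inputs valid" is taken to mean that each dual-rail bit has at least one rail high (the per-bit OR from the figure).

```python
from typing import List, Tuple


class EarlyCompletionDetector:
    """Behavioral sketch of an asymmetric C-element producing Done."""

    def __init__(self) -> None:
        self.done = False

    def update(self, evaluating: bool, rails: List[Tuple[bool, bool]]) -> bool:
        # A dual-rail bit is valid when either of its two rails is high.
        inputs_valid = all(true_rail or false_rail for true_rail, false_rail in rails)
        if not evaluating:
            # Reset depends on the control input alone (stage starts precharging).
            self.done = False
        elif inputs_valid:
            # Set needs both: stage is evaluating AND all inputs are valid.
            self.done = True
        # Otherwise (evaluating, inputs not all valid): hold the previous value.
        return self.done


if __name__ == "__main__":
    cd = EarlyCompletionDetector()
    print(cd.update(True, [(True, False), (False, False)]))  # one bit still empty
    print(cd.update(True, [(True, False), (False, True)]))   # all bits valid
    print(cd.update(False, [(True, False), (False, True)]))  # precharge starts
```

The asymmetry is the point of the design: the data inputs take part only in the rising transition of Done, so the falling transition does not wait for the inputs to reset.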

17 LP2/2 Protocol
  Completion detection: performed in parallel with evaluation/precharge of stage
  [Figure: numbered events across stages N, N+1, N+2; labels: N evaluates; N+1 evaluates; “early done” of N+1 eval; “early done” of N+2 eval; “early done” of N+1 prech]

18 LP2/2 Performance
  Cycle Time = [formula not preserved in transcript]
  LP2/2 savings over PS0: 1 Evaluation + 1 Precharge
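The savings quoted for LP3/1 and LP2/2 can be turned into a rough arithmetic sketch. The baseline used here is an assumption, since the cycle-time formulas did not survive in this transcript: a PS0 cycle is taken as 3 evaluations + 2 completion detections + 1 precharge, which matches the "6 events per cycle" count. The delay values are placeholders, and any extra control-gate delay (such as the NAND in LP3/1) is ignored.

```python
from dataclasses import dataclass


@dataclass
class Delays:
    """Per-stage delays in arbitrary units (hypothetical values)."""
    t_eval: float   # evaluation of one stage
    t_prech: float  # precharge of one stage
    t_cd: float     # one completion detection


def ps0_cycle(d: Delays) -> float:
    # Assumed baseline: 3 evaluations + 2 completion detections + 1 precharge.
    return 3 * d.t_eval + 2 * d.t_cd + d.t_prech


def lp31_cycle(d: Delays) -> float:
    # Slide: savings over PS0 = 1 precharge + 1 completion detection.
    return ps0_cycle(d) - d.t_prech - d.t_cd


def lp22_cycle(d: Delays) -> float:
    # Slide: savings over PS0 = 1 evaluation + 1 precharge.
    return ps0_cycle(d) - d.t_eval - d.t_prech


if __name__ == "__main__":
    unit = Delays(t_eval=1.0, t_prech=1.0, t_cd=1.0)
    print("PS0  :", ps0_cycle(unit))
    print("LP3/1:", lp31_cycle(unit))
    print("LP2/2:", lp22_cycle(unit))
```

With unit delays the model gives 6 for PS0 and 4 for LP3/1, in line with the event counts on the comparison slide. Real numbers depend on the actual gate delays.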

19 Dual-Rail Design #3: LP2/1
  Hybrid of LP3/1 and LP2/2. Combines:
     - early evaluation of LP3/1
     - early done of LP2/2
  Cycle Time = [formula not preserved in transcript]

20 Lookahead Pipelines: Overview
  5 new designs:
  - “Dual-Rail” Data Signaling:
     - LP3/1: “early evaluation”
     - LP2/2: “early done”
     - LP2/1: “early evaluation” + “early done”
  -> “Single-Rail” Bundled-Data Signaling:
     - LP SR 2/2: “early done”
     - LP SR 2/1: “early evaluation” + “early done”

21 Single-Rail Design: LP SR 2/1
  Derivative of LP2/1, adapted to single-rail:
     - bundled-data: matched delays instead of completion detectors
  “Ack” to previous stages is “tapped off early”:
     - once in evaluate (precharge), dynamic logic is insensitive to input changes
  [Figure: pipeline stages with matched delay elements]

22 Inside an LP SR 2/1 Stage
  PC and Eval are combined exactly as in LP3/1
  “done” generated by an asymmetric C-element:
     - done = 1 when stage evaluates, and data inputs valid
     - done = 0 when stage precharges
  [Figure: PC (from stage N+1) and Eval (from stage N+2) into a NAND; asymmetric C-element (aC, “+” input); labels: “req” in, “req” out, “ack”, data in, data out, matched delay, done]

23 LP SR 2/1 Protocol
  Cycle Time = [formula not preserved in transcript]
  [Figure: numbered events across stages N, N+1, N+2; labels: N evaluates; N+1 evaluates; N+1 indicates “done”; N+2 evaluates; N+2 indicates “done”]

24 FIFO Results (simulations)
  LP dual-rail: over 80% faster than Williams’ PS0
     - comparable latency
  LP single-rail: even faster
  Conditions: 0.19 µ CMOS, 3.3 V, 300°K
  [Figure: results for dual-rail and single-rail designs]

25 Practicality of Gate-Level Pipelining
  When datapath is wide:
     - can often split into narrow “streams”
     - use “localized” completion detector for each stream:
        - need to examine only a few bits: small fan-in
        - send “done” to only a few gates: small fan-out
     - comp. det. fairly low cost!
  [Figure labels: datapath width = 32 dual-rail bits; comp. det. fan-in = 2; done fan-out = 2]

26 High-Capacity Pipelines Singh/Nowick WVLSI-00, ISSCC-02, Async-02

27 HC Pipeline Style
  High-Capacity Pipelines (HC):
     - bundled datapaths; dynamic logic function blocks
     - latch-free: no explicit latches needed
        - dynamic logic provides implicit latching
     - novel highly-concurrent protocol maximizes storage capacity
        - traditional latch-free approaches: “spacers” limit capacity to 50%
  Key idea: obtain greater control of stage’s operation
     - separate control of pull-up/pull-down
     - result = new “isolate phase”
     - stage holds outputs/impervious to input changes
  Advantage: each stage can hold a distinct data item -> 100% storage capacity
  Extra benefit: obtain greater concurrency -> high throughput
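The 50% versus 100% capacity claim can be checked with a toy occupancy model. This is a sketch only: a pipeline is represented as a list of booleans (True = stage holds a distinct data token), and the helper names `is_valid_occupancy` and `max_tokens` are hypothetical.

```python
from typing import List


def is_valid_occupancy(stages: List[bool], needs_spacers: bool) -> bool:
    """Check whether a token placement is allowed.

    With spacers required, two adjacent stages may not both hold tokens,
    because data tokens must be separated by at least one spacer.
    """
    if not needs_spacers:
        return True  # every stage may hold a distinct token
    return not any(a and b for a, b in zip(stages, stages[1:]))


def max_tokens(num_stages: int, needs_spacers: bool) -> int:
    # Follows the slides' figures: at most 50% of stages with spacers, 100% without.
    return num_stages // 2 if needs_spacers else num_stages


if __name__ == "__main__":
    print(max_tokens(8, needs_spacers=True))   # spacer-based style such as PS0
    print(max_tokens(8, needs_spacers=False))  # HC-style, no spacers needed
    print(is_valid_occupancy([True, True, False, True], needs_spacers=True))
```

For an 8-stage FIFO this gives 4 tokens when spacers are required and 8 when every stage can isolate and hold its own item.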

28 HC: Basic Structure
  Key idea: 2 independent control signals:
     - pc: controls precharge
     - eval: controls evaluation
  Allows novel 3-phase cycle:
     - Evaluate
     - “Isolate” (hold)
     - Precharge
  Single-rail “bundled datapath”:
     - matched delay: produces delayed “done” signal
        - worst-case delay: longer than slowest path for data
  [Figure: stages N, N+1, N+2, each with a stage controller driving pc and eval; matched delay elements; ack]
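The matched-delay requirement is a one-sided constraint: the delayed "done" must not arrive before the slowest data path has settled. A minimal sketch of that check follows. The path delays and the 10% safety margin are made-up illustration values, and the function names are hypothetical.

```python
from typing import Sequence


def matched_delay(path_delays: Sequence[float], margin: float = 0.1) -> float:
    """Pick a matched delay longer than the slowest data path.

    `margin` is an assumed fractional safety margin, not a value from the slides.
    """
    return max(path_delays) * (1.0 + margin)


def bundling_ok(path_delays: Sequence[float], delay: float) -> bool:
    # "done" must be produced only after every data output is stable.
    return delay > max(path_delays)


if __name__ == "__main__":
    paths = [1.0, 2.0, 1.5]  # hypothetical data-path delays of one stage
    d = matched_delay(paths)
    print(d, bundling_ok(paths, d))
```

A larger margin makes the bundling constraint safer against variation but adds directly to the stage latency, which is the usual trade-off of bundled-data design.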

29 HC: Inside a Stage
  Independent controls of pull-up and pull-down:
     - allows new 3rd phase: “isolate”
        - pc asserted: precharge
        - eval asserted: evaluate
        - pc and eval de-asserted: enter “isolate” (hold) phase
  [Figure: dynamic gate with inputs, outputs, and “keeper”; pc controls precharge; eval controls evaluation]

30 HC: Protocol
  Most existing protocols: 3 synchronization arcs
     - 1 forward arc: data dependency
     - 2 backward arcs: control synchronization
  Our protocol: only 2 synchronization arcs
     - only 1 backward arc
     - once stage N+1 evaluates, N can complete entire next cycle!
  [Figure: phase cycles (Eval, Isolate, Precharge) of Stage N and Stage N+1, with one arc marked X; signal levels: Eval: pc=1, eval=1; Isolate: pc=1, eval=0; Precharge: pc=0, eval=0]
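The three-phase cycle can be decoded from the two control levels listed on this slide. The sketch below assumes those levels are read as pc active-low (pc=0 precharges) and eval active-high; the function name `hc_phase` is hypothetical, and treating (pc=0, eval=1) as an error is my assumption, since the slide lists no such combination.

```python
def hc_phase(pc: int, eval_: int) -> str:
    """Map the (pc, eval) control levels of one HC stage to its phase."""
    phases = {
        (1, 1): "evaluate",   # pull-down enabled, precharge off
        (1, 0): "isolate",    # both off: outputs held, inputs ignored
        (0, 0): "precharge",  # precharge on, pull-down disabled
    }
    if (pc, eval_) not in phases:
        # Assumed illegal: precharging while the pull-down is enabled.
        raise ValueError(f"unexpected control levels pc={pc}, eval={eval_}")
    return phases[(pc, eval_)]


if __name__ == "__main__":
    # One full cycle in the order given on the slides.
    for levels in [(1, 1), (1, 0), (0, 0)]:
        print(levels, "->", hc_phase(*levels))
```

Note that moving from evaluate to isolate, and from precharge back to evaluate, each changes only one control signal, while the isolate phase is what lets a stage hold its item without a spacer behind it.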

31 Formal Specification of Controller
  Problem: specification too concurrent for direct synthesis
     - desired precharge condition: N and N+1 have evaluated same data
     - problem: this condition not uniquely captured by given signals!
        - N may evaluate next data item, while N+1 stuck on current item!
  [Figure: signal transition graph; transitions: pc+, eval+, S+, eval-, pc-, S-; annotations: (start evaluate), (evaluate complete), (isolate), (start precharge), (precharge complete); T+ (evaluate of N+1 complete), T- (precharge of N+1 complete)]

32 Modified Specification of Controller
  Solution: add a state variable ok2pc
     - ok2pc records whether N+1 has “absorbed” N’s data item
        - ok2pc resets immediately when N deletes item (N precharges)
        - ok2pc is set when N+1 deletes item (N+1 precharges)
  [Figure: signal transition graph with added transitions ok2pc+, ok2pc-; transitions: pc+, eval+, S+, eval-, pc-, S-; T+ (evaluate of N+1 complete), T- (precharge of N+1 complete)]
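The role of ok2pc can be illustrated with a small behavioral model of the precharge decision. This is a sketch and not the synthesized controller: it assumes S and T are the "evaluation complete" levels of stages N and N+1, that precharge is allowed only when S, T and ok2pc all hold, and that the pipeline starts empty with ok2pc set. The class and method names are hypothetical.

```python
class PrechargeGuard:
    """Tracks ok2pc for stage N and decides when N may precharge."""

    def __init__(self) -> None:
        self.ok2pc = True  # assumption: empty pipeline at start

    def can_precharge(self, s: bool, t: bool) -> bool:
        # N and N+1 have both evaluated, and N+1's item is N's current item.
        return s and t and self.ok2pc

    def n_precharged(self) -> None:
        # ok2pc resets immediately when N deletes its item.
        self.ok2pc = False

    def next_precharged(self) -> None:
        # ok2pc is set when N+1 deletes its item.
        self.ok2pc = True


if __name__ == "__main__":
    g = PrechargeGuard()
    print(g.can_precharge(s=True, t=True))  # same item in N and N+1
    g.n_precharged()
    # N has evaluated the next item while N+1 is still stuck on the old one:
    print(g.can_precharge(s=True, t=True))
    g.next_precharged()
    print(g.can_precharge(s=True, t=True))  # N+1 has now evaluated the new item
```

Without the state variable, the second case looks identical to the first (S and T both high), which is exactly the ambiguity described on the previous slide.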

33 Controller Implementation
  Controller implementation is very simple:
     - each signal implemented using a single gate
     - ok2pc typically off the critical path
  [Figure: gates INV, NAND3, and asymmetric C-element (aC, “+” input); signals S, T, ok2pc, pc, eval]

34 HC: Stage Implementation
  [Figure: stage with controller gates (NAND, INV, asymmetric C-element with “+” input) and matched delay; signals req, done, ack, pc, eval; inputs from current stage and from next stage]
  Callouts:
     - state variable: off the critical path
     - self-loop: key to fast “isolation”
     - early ack

35 HC: Operation
  [Figure: numbered events for stages N and N+1; labels: N evaluates; N isolates; N+1 starts to evaluate; N precharges; N enables itself for next evaluation; callouts: (fast self-loop), (fast self-loop), (early ack)]
  Cycle Time = 8 CMOS gate delays

36 Performance
  Cycle Time = [formula not preserved in transcript]
  [Figure: numbered events across stages N, N+1, N+2; labels: N evaluates; N isolates; N+1 evaluates; N precharges; N enables itself for next evaluation]

37 FIFO Results (simulations)
  LP dual-rail: over 80% faster than Williams’ PS0
     - comparable latency
  LP single-rail: even faster
  Conditions: 0.19 µ CMOS, 3.3 V, 300°K
  [Figure: results for dual-rail and single-rail designs]

38 Fabricated Chip: HC FIFO
  - 2.5 GHz in 0.18µ