High Performance Asynchronous Circuit Design and Application

Presentation transcript:

High Performance Asynchronous Circuit Design and Application
Charlie Brej, APT Group, University of Manchester
18/01/2019 Async Forum

Introduction: async performance (asynchronous logic is slow)
Wagging Logic: example circuits
Red Star: design and results
Conclusions

Data propagation
[diagram: a pipeline of logic stages joined by C-elements, annotating the gate delays that make up the data latency and the much longer cycle time]

Control propagation
[diagram: the same pipeline with the C-element control path drawn; the latency is a single gate delay, but the control round-trip stretches the cycle time to twelve]
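The pipeline stages in these diagrams are joined by Muller C-elements. As a rough behavioral sketch (mine, not from the slides), a C-element drives its output to the inputs' common value when they agree, and holds its previous value while they disagree:

```python
def c_element(a, b, prev):
    """Behavioral model of a Muller C-element.

    The output switches to the inputs' common value when both inputs
    agree, and holds its previous value while they disagree.
    """
    return a if a == b else prev
```

This hold behaviour is why the acknowledge round-trip becomes part of the cycle: a stage cannot fire again until both its forward data and its backward acknowledgement have arrived.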

And then it gets worse
The latency is at least six times shorter than the cycle time
Assumes all data arrives at the same time
Assumes all acknowledgements arrive at the same time
The actual ratio is somewhere between 10 and 100

What can we do?
Use two-phase signalling: halves the control delay, but loses all average-case advantages
Fine grain pipelining: needs 10+ added latches per stage, adds latency, gives faster completion
Anti-tokens, early-drop latches…: require careful timing analysis

Wagging Latches
Alternate reads and writes between two latches: the capacity of two latches with the depth of one
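As an illustrative model (the class name and structure are mine, not from the slides), the wagging pair can be sketched in Python: writes and reads each rotate between the two latches, so the pair buffers two items while any single item only ever crosses one latch:

```python
class WaggingLatchPair:
    """Sketch of a wagging latch pair: writes and reads alternate
    between two latches, giving the capacity of two latches with
    the depth (latency) of one."""

    def __init__(self):
        self.latch = [None, None]
        self.w = 0  # next latch to write
        self.r = 0  # next latch to read

    def write(self, value):
        assert self.latch[self.w] is None, "latch occupied"
        self.latch[self.w] = value
        self.w ^= 1  # rotate to the other latch

    def read(self):
        value = self.latch[self.r]
        assert value is not None, "latch empty"
        self.latch[self.r] = None
        self.r ^= 1
        return value
```

Two writes can complete before any read (capacity two), yet each value passes through exactly one latch (depth one).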

Wagging Logic
Apply the same method to the logic: rotate between copies of the logic, allowing one copy to set while the others reset
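The rotation itself can be sketched as a round-robin over duplicated logic copies (an illustrative model, not the slides' circuit): each copy evaluates one input in turn, leaving the other copies free to reset in the meantime:

```python
from itertools import cycle

def wagging(logic_copies, inputs):
    """Round-robin a stream of inputs across duplicated logic blocks,
    so each copy has time to reset while the others evaluate."""
    slices = cycle(logic_copies)
    return [next(slices)(x) for x in inputs]
```

With three copies, inputs 0..5 are dispatched to copies 0, 1, 2, 0, 1, 2 in turn.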

Single Channel Mixer

LCM Channels Mixer

Direct Connection Mixer

32bit Incrementer Example
[diagram: a register looping through a +1 incrementer, first as a single N-bit loop, then unrolled into wagging slices (Slice 0, Slice 1, Slice 2…) with half-buffers (HB) between the +1 stages]

32bit Incrementer
Optimal design: 3288 operations, 3.04 GDs per operation
Original design: 77 operations, 130 GDs per operation
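The quoted rates are consistent with both designs being simulated for the same fixed window of roughly 10,000 gate delays; that window length is my assumption, not stated on the slide:

```python
# Hypothetical consistency check: assuming both designs ran for the same
# ~10,000 gate-delay (GD) simulation window, the quoted GDs-per-operation
# figures follow from the operation counts.
WINDOW_GDS = 10_000  # assumed simulation length, not stated on the slide

for ops in (3288, 77):
    gds_per_op = WINDOW_GDS / ops
    print(f"{ops} operations -> {gds_per_op:.2f} GDs per operation")
```

This reproduces the slide's 3.04 GDs per operation for the optimal design and roughly 130 for the original.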

32bit Accumulator Example
Load or Accumulate
[diagram: wagging slices, each performing either a Load or an Accumulate as the operations rotate across the slices]

Red Star
MIPS ISA: 32bit RISC, fast and simple development
Use synchronous design methodology
Complicated features without complicated design effort: OOO execution, banked caching…

Red Star

Register Bank

ADD R1, R1, #1
1401 operations, 7.14 GDs per operation

[diagram: fetch loop with PC, +1 incrementer and branch logic]
Additional unnecessary stages to extend the branch shadow

Overlapping Instructions
[diagram: eight instructions overlapped in the pipeline, each passing through Fetch, Decode, Execute, Memory, Dummy and WriteBack stages; the branch shadow covers the stages before the branch resolves]

Five Instruction Loop

Caching
[diagram: RAM behind four per-slice caches (Slice 0–3); a short loop of instructions ending in a branch is fetched round-robin, each instruction landing in its own slice's cache]

Caching
If (PC % WagLevel != Slice): execute a NOP and don't increment the PC
[diagram: the out-of-turn slice issues a NOP while the other slices' caches serve the loop instructions]
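The fetch rule on this slide can be sketched directly (the function name and program representation are illustrative):

```python
NOP = "nop"

def slice_fetch(pc, slice_idx, wag_level, program):
    """Per-slice fetch rule from the slide: a slice only issues the
    instruction when the PC is its turn (pc % wag_level == slice_idx);
    otherwise it issues a NOP and leaves the PC unchanged."""
    if pc % wag_level != slice_idx:
        return NOP, pc          # not our turn: NOP, don't increment PC
    return program[pc], pc + 1  # our turn: issue instruction, advance PC
```

With four slices, a branch that lands the PC out of phase simply causes the out-of-turn slices to pad with NOPs until the PC comes round.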

Caching
Instead of one large 16Kb cache (12bit address): 16 small 1Kb caches (8bit address)
Approximately 50% faster lookup
No data duplication
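The quoted address widths are consistent with word-indexed caches of 32-bit words, reading the slide's "16Kb"/"1Kb" as kilobytes; both the word size and that reading are my assumptions:

```python
import math

WORD_BYTES = 4  # assumed 32-bit instruction words (not stated on the slide)

def addr_bits(cache_bytes):
    """Address bits needed to index every word in a cache."""
    return int(math.log2(cache_bytes // WORD_BYTES))

print(addr_bits(16 * 1024))  # one large 16KB cache  -> 12-bit address
print(addr_bits(1024))       # each small 1KB cache  ->  8-bit address
```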

Two Nasty Loops (7 and 17)

Area
45,000 gates per slice (15,000 without the register bank)
Approx 6 million transistors (16 way), 2 million without the register bank
Final design: ~10 million transistors

How much is 10 million?

Future work
Very early in development: one week of development, clumsy completion logic
Slowest path analysis: remove unnecessary dependencies, improve worst-case latency
Target of 5 gate delays per instruction
Parallel instruction execution
Removing unnecessary latches

Distant Future Work
Simplification of completion logic: use timing assumptions on the reset phase, halve the area
Redundant slices: bypass broken slices

Conclusions
A method of producing very fast circuits with minimal design effort and minimal experience required
Implicit data dependency, eager evaluation
Many improvements possible: area could be halved, performance of 5 gate delays per instruction