ASPDAC / VLSI 2002 - Tutorial on "Large Asynchronous Systems" 1 Logic design of asynchronous circuits Part IV: Large Asynchronous Systems.

Slides:

Advertisements

Similar presentations

Computer Architecture

Advertisements

Accessing I/O Devices Processor Memory BUS I/O Device 1 I/O Device 2.

DSPs Vs General Purpose Microprocessors

1/1/ / faculty of Electrical Engineering eindhoven university of technology Speeding it up Part 3: Out-Of-Order and SuperScalar execution dr.ir. A.C. Verschueren.

CSCI 4717/5717 Computer Architecture

Lecture 12 Reduce Miss Penalty and Hit Time

Southampton: Oct 99AMULET3i - 1 AMULET3i - asynchronous SoC Steve Furber - n Agenda: AMULET3i Design tools Future problems.

THE MIPS R10000 SUPERSCALAR MICROPROCESSOR Kenneth C. Yeager IEEE Micro in April 1996 Presented by Nitin Gupta.

Microprocessors VLIW Very Long Instruction Word Computing April 18th, 2002.

Instruction-Level Parallelism (ILP)

Chapter 8. Pipelining. Instruction Hazards Overview Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline.

Chapter 12 CPU Structure and Function. CPU Sequence Fetch instructions Interpret instructions Fetch data Process data Write data.

Modern VLSI Design 4e: Chapter 8 Copyright  2008 Wayne Wolf Topics High-level synthesis. Architectures for low power. GALS design.

Southampton: Oct 99Asynchronous Circuit Compilation- 1 AMULET3-H n Asynchronous macrocell ARM compatible processor core Full custom RAM Compiled ROM Balsa.

The ARM7TDMI Hardware Architecture

11-May-04 Qianyi Zhang School of Computer Science, University of Birmingham (Supervisor: Dr Georgios Theodoropoulos) A Distributed Colouring Algorithm.

Pipelining II Andreas Klappenecker CPSC321 Computer Architecture.

Chapter XI Reduced Instruction Set Computing (RISC) CS 147 Li-Chuan Fang.

Computer Organization and Architecture The CPU Structure.

ELEC 6200, Fall 07, Oct 24 Jiang: Async. Processor 1 Asynchronous Processor Design for ELEC 6200 by Wei Jiang.

Chapter 12 Pipelining Strategies Performance Hazards.

1  2004 Morgan Kaufmann Publishers Chapter Six. 2  2004 Morgan Kaufmann Publishers Pipelining The laundry analogy.

Data Manipulation Computer System consists of the following parts:

Pipelined Processor II CPSC 321 Andreas Klappenecker.

1 COMP 206: Computer Architecture and Implementation Montek Singh Wed, Nov 9, 2005 Topic: Caches (contd.)

Reducing Cache Misses 5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6 Main.

CPU Chips The logical pinout of a generic CPU. The arrows indicate input signals and output signals. The short diagonal lines indicate that multiple pins.

Pipelining By Toan Nguyen.

Prardiva Mangilipally

Group 5 Alain J. Percial Paula A. Ortiz Francis X. Ruiz.

ARM Processor Architecture

1 Instant replay  The semester was split into roughly four parts. —The 1st quarter covered instruction set architectures—the connection between software.

Computer Architecture Lecture 08 Fasih ur Rehman.

The 6713 DSP Starter Kit (DSK) is a low-cost platform which lets customers evaluate and develop applications for the Texas Instruments C67X DSP family.

Advanced Higher Computing  Computer Architecture  Chapter 2.

Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

MICROPROCESSOR INPUT/OUTPUT

CHAPTER 3 TOP LEVEL VIEW OF COMPUTER FUNCTION AND INTERCONNECTION

Top Level View of Computer Function and Interconnection.

Computer Architecture Lecture10: Input/output devices Piotr Bilski.

1 Pipelining Reconsider the data path we just did Each instruction takes from 3 to 5 clock cycles However, there are parts of hardware that are idle many.

RISC By Ryan Aldana. Agenda Brief Overview of RISC and CISC Features of RISC Instruction Pipeline Register Windowing and renaming Data Conflicts Branch.

Interrupts, Buses Chapter 6.2.5, Introduction to Interrupts Interrupts are a mechanism by which other modules (e.g. I/O) may interrupt normal.

Chapter 8 CPU and Memory: Design, Implementation, and Enhancement The Architecture of Computer Hardware and Systems Software: An Information Technology.

EEE440 Computer Architecture

Caches Where is a block placed in a cache? –Three possible answers  three different types AnywhereFully associativeOnly into one block Direct mappedInto.

80386DX functional Block Diagram PIN Description Register set Flags Physical address space Data types.

Computer Organization CDA 3103 Dr. Hassan Foroosh Dept. of Computer Science UCF © Copyright Hassan Foroosh 2002.

The Intel 86 Family of Processors

Implementing Precise Interrupts in Pipelined Processors James E. Smith Andrew R.Pleszkun Presented By: Shrikant G.

Different Microprocessors Tamanna Haque Nipa Lecturer Dept. of Computer Science Stamford University Bangladesh.

Fundamentals of Programming Languages-II

بسم الله الرحمن الرحيم MEMORY AND I/O.

The Standford Hydra CMP  Lance Hammond  Benedict A. Hubbert  Michael Siu  Manohar K. Prabhu  Michael Chen  Kunle Olukotun Presented by Jason Davis.

Chapter 3 System Buses.  Hardwired systems are inflexible  General purpose hardware can do different tasks, given correct control signals  Instead.

CPU (Central Processing Unit). The CPU is the brain of the computer. Sometimes referred to simply as the processor or central processor, the CPU is where.

DIRECT MEMORY ACCESS and Computer Buses

William Stallings Computer Organization and Architecture 8th Edition

CSC 4250 Computer Architectures

Architecture & Organization 1

5.2 Eleven Advanced Optimizations of Cache Performance

Architecture & Organization 1

Control unit extension for data hazards

* From AMD 1996 Publication #18522 Revision E

High Performance Asynchronous Circuit Design and Application

Control unit extension for data hazards

Control unit extension for data hazards

Presentation transcript:

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 1 Logic design of asynchronous circuits Part IV: Large Asynchronous Systems

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 2 Why Asynchronous Logic? Low Power Do nothing when there is nothing to be done Modularity Added design freedom and component reusability Electromagnetic Interference (EMI) Clocks concentrate noise energy at particular frequencies Security? Surprise the hackers!

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 3 Contents Some asynchronous microprocessors Problems in processor design Problems and opportunities in asynchronous design Example solutions to a selection of problems Memories and peripherals Design styles GALS and asynchronous interconnection Conclusions

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 4 Why Microprocessors? Well defined problem Easy to demonstrate correct function Self-contained Make stand-alone devices Not obviously suited to asynchronous techniques Forced to examine ‘real’ problems Interesting problems New techniques to devise

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 5 Microprocessors as Design Examples AMULET1 (1994) ARM6 compatible processor (almost) 1.0um transistors Hand designed Bundled data, two-phase control Feasibility study Experiences Two-phase logic hard to work with and interface to

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 6 Microprocessors as Design Examples AMULET2e (1996) ARM7 compatible processor Asynchronous cache 0.5um transistors Hand designed Bundled data, four-phase control Experiences: Easy to use, self-contained system

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 7 Microprocessors as Design Examples AMULET3i (2000) ARM9 compatible processor SoC: Memory, DMA controller, bus, … 0.35um transistors Mostly hand designed Bundled data, two-phase control Commercial application (DRACO) Experiences: Universities don’t have the resources for such projects!

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 8 Microprocessors as Design Examples SPA (2002) ARM10 compatible processor 0.18um well over 1 million transistors Mostly synthesized Dual-rail, four-phase control Secure smartcard chip Experiences: To be confirmed ?

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 9 Other Asynchronous Microprocessors “Caltech Asynchronous Microprocessor”(1989) First asynchronous microprocessor University of Tokyo’s “TITAC-2” (1997) Mostly hand designed Caltech “MiniMIPS” (R3000) (1997) Hand designed

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 10 Other Asynchronous Microprocessors IMAG (Grenoble) “ASPRO-216” (1998) 16-bit signal processor Philips Research Laboratories 80C51 (1998) Synthesized using “Tangram” tool Theseus Star-8 (2000) Uses Null Convention Logic (NCL)

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 11 AMULET3 Processor Architecture Highly pipelined structure Similar to synchronous architecture Features: Branch prediction Halting Forwarding Out-of-order completion Precise exceptions

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 12 Synchronous vs. Asynchronous Architectures An AMULET looks very like a synchronous ARM Functional blocks divided by pipeline latches But: Some ideas can be ‘copied', some need reinvention Some synchronous ‘tricks’ don’t work in an asynchronous environment Non-local interactions, dependency resolution,... Pipelining is too easy Temptation to inefficient design

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 13 Data-dependent Timing ARM instructions allow a shift before an ALU operation These are rarely exploited The shifter is often bypassed The execution timing may be adjusted appropriately

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 14 Process-level Parallelism Example: decode and execute stages Various threads – many invoked conditionally Skewed pipeline latches (lower power EMI) Variable stage delay (e.g. ‘stretch’ for series shift) Differing pipeline depths (extra buffer for LDM/STM) Conditional invocation of functions

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 15 PC Pipeline In a synchronous design non-local values can be read The ARM uses the PC as an operand (e.g. relative branch) The actual value is two instructions ‘out’ due to the pipeline depth of the original implementation

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 16 PC Pipeline In an asynchronous design only ‘adjacent’ stages can communicate AMULET supplies the PC value with every instruction This can be adjusted as required in an implementation independent manner

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 17 Thumb Expansion Handshaking allows pipeline occupancy to be changed without global control Thumb instructions normally fetched in pairs but executed individually Same scheme applied to (e.g.) ‘Load Multiple’ Similar scheme applied to removing ‘surplus’ packets

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 18 Halting Any unit can impose an arbitrary delay A pipeline stage could choose to delay indefinitely This will halt the whole pipeline shortly thereafter Downstream stages ‘drain’ Upstream stages ‘back up’ Halted CMOS circuits use ‘no’ power No clock power either Restart is instantaneous Both AMULET2 and AMULET3 exploit this for easy power management

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 19 Colouring Branches change the local colour and request a new stream Prefetched operations discarded until a new stream arrives Pipeline occupancy may be non-deterministic

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 20 Deadlock Question: if pipeline occupancy is variable, what happens if a token is inserted into a ‘full’ pipeline? Answer: Deadlock! a danger with the large number of states available can be avoided with careful design Two cases in AMULET3: Branches when the prefetch pipeline is full Memory conflict between instruction and data fetches Both were known early and prevented at a higher level

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 21 Data Aborts Wait for MMU to abort or not Stretch cycle if memory access

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 22 Data Aborts Speculate on memory not aborting Register results returned out of order More parallelism => higher throughput

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 23 Reorder Buffer Allows instructions to complete in any order Resolves register dependencies Allows register forwarding Permits low-overhead memory management Supports exact page fault exceptions

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 24 Reorder Buffer Data can arrive along any path at any time, providing their targets are mutually exclusive Read out waits for each register to be filled in turn, then copies out the result (or not, if unwanted) Copy out frees the register but does not delete the data

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 25 AMULET3i Memory System The RAM is ‘dual-port’ (at this level) The instruction bus is simpler so it has a higher bandwidth

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 26 Memory Structure The local RAM is divided into 1 Kbyte blocks Unified RAM model Close to dual-port efficiency About 50% instruction fetches are from the ‘Ibuffers’

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 27 AMULET2e Cache Pipelined Data-dependent timing Asynchronous line fetch Newer design includes: Copy-back Write buffer Victim cache

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 28 Synthesis vs. Hand Design Most of the AMULET3i system was designed at schematic level Part (the DMA controller) was a test of Balsa A new asynchronous synthesis system Synthesised blocks are more efficient to design but less efficient in operation Suitable for (e.g.) peripherals that are rarely invoked No timing closure problems!

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 29 DMAC About transistors Regular structures (register banks) in full custom Control synthesised from Balsa description Cheats slightly by letting a clock into one corner!

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 30 SPA A project to produce a synthesisable ARM core in Balsa Simple 3-stage pipelining Omits many ‘performance’ features Uses dual-rail coding to enhance security Retargettable to any process, including dual-rail, 1-of-N codes etc. by recompilation

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 31 AMULET3 vs. SPA

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 32 Asynchronous on-chip Interconnection MARBLE Centrally arbitrated, multi-channel, asynchronous on-chip bus Supports: 8-, 16- and 32-bit transfers, bus locking, sequential bursts, … Separate, decoupled, asynchronous transfer phases for address and data 32-bit bundled data pathways Used on AMULET3i Standard ‘master’ and ‘slave’ interfaces Standard interface to on-chip synchronous bus too

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 33 Asynchronous on-chip Interconnection CHAIN Delay insensitive coding for ‘distance’ transmission Requires more wires per bit Exploit lack of clock to send serial symbols fast 4 wires, 2-bit (1-of-4) symbols Point-to-point unidirectional wiring Standard ‘master’ and ‘slave’ interfaces Could easily provide standard synchronous interfaces

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 34 GALS Globally Asynchronous, Locally Synchronous interconnection Use ‘conventional’ synchronous design blocks for SoC Use asynchronous interconnection to avoid timing closure problems May be the first big application of asynchronous logic No reason why the ‘local’ blocks need to be synchronous …

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 35 AMULET3i AMULET3 microprocessor (ARMv4T) 8 Kbytes RAM 16 Kbytes ROM Flexible, multi-channel DMAC Programmable memory interface On-chip asynchronous bus (MARBLE) Bridge to on-chip synchronous bus Configuration registers Software debug support Test interface

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 36 Experience of Large Asynchronous Designs Hard, but feasible Competitive Advantages? Power management EMI Composability (GALS) Security? Commercial Philips, Theseus, ADD, Intel?, …

ASPDAC / VLSI Tutorial on "Large Asynchronous Systems" 37 Conclusions Asynchronous logic: Can be competitive with ‘conventional’ designs Has advantages with low-power and low EMI think portable systems May be the only solution for some tasks block interconnections on large chips but Designing big systems is a lot of work It’s hard to catch up with the big companies