Published by Karol Witkowski. Modified over 6 years ago.
1
High Performance Asynchronous Circuit Design and Application
Charlie Brej
APT Group, University of Manchester
18/01/2019 Async Forum
2
Introduction
  Async performance
    Asynchronous logic is slow
  Wagging Logic
    Example circuits
  Red Star
    Design
    Results
  Conclusions
3
Data propagation
  [Figure: chain of Logic blocks and C-elements (C), with Latency and Cycle Time marked against numbered steps]
4
Control propagation
  [Figure: chain of Logic blocks and C-elements (C), with Latency and Cycle Time marked against numbered steps]
5
Control propagation
  [Figure: chain of Logic blocks and C-elements (C), with Latency and Cycle Time marked against numbered steps]
6
And then it gets worse
  Latency is at least six times lower than the cycle time
    Assumes all data arrives at the same time
    Assumes all acknowledgements arrive at the same time
  Actual number is somewhere between 10 and 100
7
What can we do
  Use two-phase signalling
    Halve the control delay
    Lose all average case advantages
  Fine grain pipelining
    Need to add 10+ latches per stage
    Adds latency
  Faster completion
    Anti-tokens, Early-drop latches…
    Careful timing analysis
8
Wagging Latches
  Alternate latch read/write
  Capacity of two latches
  Depth of one latch
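The alternating read/write idea on this slide can be sketched as a small behavioural model. This is only an illustration under assumptions: the class name `WaggingLatch` and its methods are hypothetical, and the model ignores handshake signalling and timing entirely.

```python
class WaggingLatch:
    """Two latches used in alternation (hypothetical behavioural model)."""

    def __init__(self):
        self.latches = [None, None]   # storage for the two latches
        self.full = [False, False]    # whether each latch holds an unread token
        self.write_sel = 0            # latch that takes the next write
        self.read_sel = 0             # latch that supplies the next read

    def can_write(self):
        return not self.full[self.write_sel]

    def can_read(self):
        return self.full[self.read_sel]

    def write(self, value):
        if not self.can_write():
            raise RuntimeError("target latch still holds an unread token")
        self.latches[self.write_sel] = value
        self.full[self.write_sel] = True
        self.write_sel ^= 1           # wag to the other latch

    def read(self):
        if not self.can_read():
            raise RuntimeError("no token available")
        value = self.latches[self.read_sel]
        self.full[self.read_sel] = False
        self.read_sel ^= 1            # wag to the other latch
        return value


buf = WaggingLatch()
buf.write("a")
buf.write("b")                          # capacity of two: both latches are now full
blocked = not buf.can_write()           # a third write would have to wait
first, second = buf.read(), buf.read()  # tokens leave in arrival order
```

Each token passes through exactly one of the two latches, which is the sense in which the structure has the depth of one latch while holding two tokens.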
9
Wagging Logic
  Apply the same method to the logic
  Rotate the logic, allowing one copy to set while the others reset
  [Figure: logic copies labelled Set, Reset, Reset]
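A rough way to see why rotating the logic helps is a toy timing model. Everything here is an assumption made for illustration: the function names, the parameters `set_delay`, `reset_delay` and `mixer_delay`, and the delay values are hypothetical and do not come from the slides.

```python
def wagged_issue_times(n_slices, set_delay, reset_delay, mixer_delay, n_ops):
    """Return the start time of each operation under round-robin rotation."""
    free_at = [0] * n_slices      # time at which each copy has finished resetting
    starts = []
    t = 0                         # earliest time the next operation can be steered
    for op in range(n_ops):
        s = op % n_slices                   # rotate to the next copy of the logic
        start = max(t, free_at[s])          # wait if that copy is still resetting
        starts.append(start)
        free_at[s] = start + set_delay + reset_delay
        t = start + mixer_delay
    return starts


def cycle_time(starts):
    """Average spacing between consecutive operation starts."""
    return (starts[-1] - starts[0]) / (len(starts) - 1)


# Made-up delays: 6 units to set, 6 to reset, 1 to steer between copies.
for n in (1, 3, 12, 16):
    print(n, cycle_time(wagged_issue_times(n, 6, 6, 1, 301)))
```

With these made-up delays the average spacing falls from 12 units with one copy to 4 with three copies, and stops improving at 1 unit once the steering delay dominates.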
10
Single Channel Mixer
11
LCM Channels Mixer
12
Direct Connection Mixer
13
32-bit Incrementer Example
  [Figure: registers (Reg) and +1 incrementers, with connections labelled N]
14
32-bit Incrementer Example
  [Figure: incrementer split into Slice 0, Slice 1 and Slice 2, with blocks labelled Reg, HB and +1]
15
32-bit Incrementer
  Optimal Design: 3288 operations, 3.04 gate delays (GDs) per operation
  Original Design: 77 operations, 130 GDs per operation
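The two result lines can be cross-checked with a little arithmetic. The figures below are the ones quoted on the slide; the reading that both runs cover roughly the same span of gate delays is an inference from those figures, not something the slide states.

```python
# Figures quoted on the slide (GD = gate delay).
original = {"operations": 77, "gd_per_op": 130.0}
optimal = {"operations": 3288, "gd_per_op": 3.04}


def total_gate_delays(run):
    """Gate delays covered by a run: operations x gate delays per operation."""
    return run["operations"] * run["gd_per_op"]


# Ratio of gate delays per operation between the two designs.
speedup = original["gd_per_op"] / optimal["gd_per_op"]

print(f"original run: {total_gate_delays(original):.0f} GDs")
print(f"optimal run:  {total_gate_delays(optimal):.0f} GDs")
print(f"throughput gain: {speedup:.1f}x")
```

Both products come out close to 10,000 gate delays, and the ratio of gate delays per operation is a little under 43.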
16
32-bit Accumulator Example
  Load or Accumulate
17
32-bit Accumulator Example
  [Figure labels: Load, Accumulate, Accumulate, Load, Accumulate, Load]
18
32-bit Accumulator Example
19
Red Star
  MIPS ISA
    32-bit RISC
  Fast and simple development
    Use synchronous design methodology
  Complicated features without complicated design effort
    OOO execution, banked caching…
20
Red Star
21
Register Bank
22
ADD R1, R1, #1
  1401 operations
  7.14 GDs per operation
23
Additional unnecessary stages to extend the branch shadow
  [Figure: Branch Logic, PC, +1 and + blocks]
24
Overlapping Instructions
  [Figure: eight overlapped instructions, each passing through Fetch, Decode, Execute, Memory, Dummy and WriteBack, with the Branch Shadow marked]
25
Five Instruction Loop
26
Caching
  Program: 0:Instruction 1:Instruction 2:Instruction 3:Branch 0
  [Figure: RAM feeding a separate cache in each of Slice 0 to Slice 3, with the addresses held by each cache]
27
Caching
  Program: 0:Instruction 1:Instruction 2:Branch 0
  [Figure: RAM feeding a separate cache in each of Slice 0 to Slice 3, with the addresses held by each cache]
28
Caching
  If (PC%WagLevel != Slice)
    Execute a NOP
    Don’t increment the PC
  Program: 0:Instruction 1:Instruction 2:Branch 0
  [Figure: RAM feeding a separate cache in each of Slice 0 to Slice 3, with a NOP marked at Slice 3]
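The NOP rule on this slide can be traced with a short simulation. This is a sketch under assumptions: the function `run_slices` is hypothetical, slices are assumed to take turns in a fixed rotation, and branches are assumed to resolve immediately (the branch shadow is ignored).

```python
def run_slices(program, wag_level, steps):
    """Trace which slice does what, one rotation step at a time.

    program maps an address to ("op", None) or ("branch", target).
    """
    pc = 0
    trace = []
    for step in range(steps):
        slice_id = step % wag_level          # slices take turns in fixed rotation
        if pc % wag_level != slice_id:
            trace.append((slice_id, "NOP"))  # wrong slice: idle, PC unchanged
            continue
        kind, target = program[pc]
        trace.append((slice_id, pc))         # this slice's cache holds address pc
        pc = target if kind == "branch" else pc + 1
    return trace


# Three-entry loop from the slide: 0:Instruction 1:Instruction 2:Branch 0
program = {0: ("op", None), 1: ("op", None), 2: ("branch", 0)}
trace = run_slices(program, wag_level=4, steps=8)
for slice_id, action in trace:
    print(f"slice {slice_id}: {action}")
```

In this trace Slice 3 only ever executes NOPs, and each address is fetched by a single slice, so no instruction needs to be held in more than one cache.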
29
Caching
  Instead of one large 16Kb cache
    12-bit address
  16 small 1Kb caches
    8-bit address
    Approximately 50% faster lookup
    No data duplication
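The address widths on this slide fit together as 12 = 4 + 8: four bits pick one of the 16 slices and eight bits index that slice's small cache. The sketch below assumes the low-order bits select the slice, in line with the PC%WagLevel test used for fetching; the slide does not say which bits form the index, and the name `split_address` is hypothetical.

```python
WAG_LEVEL = 16                            # number of slices, one small cache each
INDEX_BITS = 8                            # address width of each small cache
SLICE_BITS = WAG_LEVEL.bit_length() - 1   # 4 bits select one of 16 slices


def split_address(addr):
    """Map a 12-bit address to (slice, index within that slice's cache)."""
    if not 0 <= addr < (1 << (SLICE_BITS + INDEX_BITS)):
        raise ValueError("address out of range")
    # Assumption: low bits choose the slice, high bits index its cache.
    return addr % WAG_LEVEL, addr >> SLICE_BITS


# Every address maps to its own (slice, index) pair, so nothing is stored twice.
pairs = {split_address(a) for a in range(1 << 12)}
print(len(pairs), split_address(0x123))
```

Because the mapping is one-to-one, the 16 small caches together hold the same contents as the single large cache without duplicating any entry.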
30
Two Nasty Loops (7 and 17)
31
Area
  45,000 gates per slice
    15,000 gates without the register bank
  Approx 6 million transistors (16 way)
    2 million without the register bank
  Final design ~10 million transistors
32
How much is 10 million?
33
Future work
  Very early in development
    One week of development
  Clumsy completion logic
  Slowest path analysis
  Remove unnecessary dependencies
  Improve worst case latency
  Target of 5 gate delays per instruction
  Parallel instruction execution
  Removing unnecessary latches
34
Distant Future Work
  Simplification of completion logic
    Use timing assumptions on the reset phase
    Halve the area
  Redundant slices
    Bypass broken slices
35
Conclusions
  Method of producing very fast circuits
  Minimal design effort
  Minimal experience required
  Implicit data dependency
  Eager evaluation
  Many improvements possible
  Area could be halved
  Performance of 5 gate delays per instruction