High Performance Asynchronous Circuit Design and Application Charlie Brej APT Group University of Manchester 18/01/2019 Async Forum
Introduction
  Async performance
    Asynchronous logic is slow
  Wagging Logic
    Example circuits
  Red Star
    Design
    Results
  Conclusions
Data propagation
  [Figure: pipeline of Logic blocks and C-elements (C); Latency is marked 1 to 6, Cycle Time is marked 1 to 12]
Control propagation
  [Figure: pipeline of Logic blocks and C-elements (C); Latency is marked 1, Cycle Time is marked 1 to 12]
And then it gets worse
  Latency is at least six times lower than the cycle time
    Assumes all data arrives at the same time
    Assumes all acknowledgements arrive at the same time
  Actual number is somewhere between 10 and 100
What can we do
  Use two-phase signalling
    Halve the control delay
    Lose all average case advantages
  Fine grain pipelining
    Need to add 10+ latches per stage
    Adds latency
  Faster completion
    Anti-tokens, Early-drop latches…
    Careful timing analysis
Wagging Latches
  Alternate latch read/write
  Capacity of two latches
  Depth of one latch
Wagging Logic
  Apply same method to the logic
  Rotate logic allowing one to set while others reset
  [Figure: three logic blocks labelled Set, Reset, Reset]
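The rotation described on this slide can be sketched as a simple round-robin schedule. This is only an illustrative model, not the circuit itself: the function names `wag_schedule` and `slice_phases` are hypothetical, and the sketch assumes that exactly one slice is setting at any moment while every other slice is resetting.

```python
def wag_schedule(num_tokens, num_slices):
    """Assign each incoming token to a slice in strict rotation.

    Assumption: token n is always handled by slice n mod num_slices.
    """
    return [token % num_slices for token in range(num_tokens)]


def slice_phases(token, num_slices):
    """Return the phase of every slice while `token` is being processed.

    The slice that owns the token is in its Set phase; all other slices are
    assumed to be in their Reset phase, as in the Set / Reset / Reset figure.
    """
    active = token % num_slices
    return ["Set" if s == active else "Reset" for s in range(num_slices)]


if __name__ == "__main__":
    # Three slices, six tokens: the Set phase walks around the slices.
    for token in range(6):
        print(token, slice_phases(token, 3))
```

With three slices, each slice has two token periods in which to reset before it is needed again, which is the point of rotating the logic.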
Single Channel Mixer
LCM Channels Mixer
Direct Connection Mixer
32bit Incrementer Example
  [Figure: incrementer built from Reg and +1 blocks, with N labels]
32bit Incrementer Example
  [Figure: Slice 0, Slice 1 and Slice 2, built from Reg, HB and +1 blocks]
32bit Incrementer
  Optimal Design: 3288 operations, 3.04 gate delays (GDs) per operation
  Original Design: 77 operations, 130 GDs per operation
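The incrementer figures can be checked with a little arithmetic. The sketch below uses only the numbers on the slide; the helper names are hypothetical. The observation that both runs cover roughly the same window of about 10,000 gate delays is an inference from multiplying the figures, not something the slide states.

```python
def total_gate_delays(operations, gd_per_op):
    """Total gate delays covered by a run (operations x GDs per operation)."""
    return operations * gd_per_op


def speedup(old_gd_per_op, new_gd_per_op):
    """Throughput ratio between two designs, from their GDs per operation."""
    return old_gd_per_op / new_gd_per_op


# Figures taken from the slide.
original_window = total_gate_delays(77, 130)     # original design
optimal_window = total_gate_delays(3288, 3.04)   # optimal (wagged) design
ratio = speedup(130, 3.04)

if __name__ == "__main__":
    print(f"original window: {original_window} GDs")
    print(f"optimal window:  {optimal_window:.1f} GDs")
    print(f"throughput improvement: about {ratio:.1f}x")
```

Both products come out close to 10,000 gate delays, and the throughput ratio is a little under 43.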
32bit Accumulator Example
  Load or Accumulate
32bit Accumulator Example
  [Figure: Load and Accumulate paths]
32bit Accumulator Example
Red Star
  MIPS ISA
    32bit RISC
  Fast and simple development
    Use synchronous design methodology
  Complicated features without complicated design effort
    OOO execution, banked caching…
Red Star
Register Bank
ADD R1, R1, #1
  1401 operations
  7.14 GDs per operation
Additional unnecessary stages to extend the branch shadow
  [Figure: Branch Logic, PC, +1 and + blocks]
Overlapping Instructions
  [Figure: staggered rows of overlapping instructions, each passing through Fetch, Decode, Execute, Memory, Dummy, WriteBack; the Branch Shadow is marked]
Five Instruction Loop
Caching
  [Figure: RAM and Slice 0 to Slice 3, each with a Cache]
  Program: 0:Instruction 1:Instruction 2:Instruction 3:Branch 0
Caching
  [Figure: RAM and Slice 0 to Slice 3, each with a Cache]
  Program: 0:Instruction 1:Instruction 2:Branch 0
Caching
  If (PC%WagLevel != Slice)
    Execute a NOP
    Don’t increment the PC
  [Figure: RAM and Slice 0 to Slice 3, each with a Cache; a NOP is marked]
  Program: 0:Instruction 1:Instruction 2:Branch 0
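The NOP rule on this slide can be sketched as a small simulation. This is a simplified model under stated assumptions: slices take turns in strict rotation, a branch is resolved before the next slice takes its turn (the branch shadow is ignored), and the program encoding and the function name `run` are hypothetical.

```python
def run(program, wag_level, turns):
    """Simulate slices taking turns in rotation.

    program: dict mapping PC -> ("op", None) or ("branch", target).
    Returns a trace of (slice_id, pc) for executed instructions and
    (slice_id, "NOP") when the slice does not own the current PC.
    """
    pc = 0
    trace = []
    for turn in range(turns):
        slice_id = turn % wag_level
        if pc % wag_level != slice_id:
            # Rule from the slide: execute a NOP, don't increment the PC.
            trace.append((slice_id, "NOP"))
            continue
        kind, target = program[pc]
        trace.append((slice_id, pc))
        pc = target if kind == "branch" else pc + 1
    return trace


# Three-instruction loop from the slide: 0:Instruction 1:Instruction 2:Branch 0
LOOP = {0: ("op", None), 1: ("op", None), 2: ("branch", 0)}

if __name__ == "__main__":
    for entry in run(LOOP, wag_level=4, turns=8):
        print(entry)
```

With four slices and a three-instruction loop, slice 3 never owns the PC after the branch back to 0, so it executes a NOP on every rotation, and each instruction address is only ever fetched by one slice.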
Caching
  Instead of one large 16Kb cache
    12bit address
  16 small 1Kb caches
    8bit address
  Approximately 50% faster lookup
  No data duplication
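The address widths above suggest how a 12-bit address could be divided between 16 per-slice caches. The sketch below assumes the low four bits select the slice, in line with the PC%WagLevel rule, and the remaining eight bits index into that slice's cache. The slide only gives the address widths, so this mapping and the name `split_address` are assumptions.

```python
WAG_LEVEL = 16      # number of slices, each with its own small cache
ADDRESS_BITS = 12   # address width of the single large cache


def split_address(addr):
    """Split a 12-bit address into (slice_id, local_index).

    Assumption: the slice is addr mod WAG_LEVEL and the local index is the
    remaining upper bits, giving an 8-bit index per slice.
    """
    if not 0 <= addr < (1 << ADDRESS_BITS):
        raise ValueError("address out of range")
    return addr % WAG_LEVEL, addr // WAG_LEVEL


if __name__ == "__main__":
    for addr in (0, 1, 15, 16, 0xFFF):
        print(addr, split_address(addr))
```

Every address maps to exactly one (slice, index) pair under this scheme, which is consistent with the "No data duplication" point.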
Two Nasty Loops (7 and 17)
Area
  45,000 gates per slice
    15,000 gates without the register bank
  Approx 6 million transistors (16 way)
    2 million without the register bank
  Final design ~10 million transistors
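As a quick consistency check on the area figures, the per-slice gate counts can be scaled to the 16-way design and compared with the transistor estimates. The helper name is hypothetical, and the implied figure of roughly eight transistors per gate is derived from the slide's numbers rather than stated on it.

```python
WAG_LEVEL = 16  # "16 way" from the slide


def total_gates(gates_per_slice, wag_level=WAG_LEVEL):
    """Gate count for the whole design, assuming identical slices."""
    return gates_per_slice * wag_level


full_gates = total_gates(45_000)   # with the register bank
core_gates = total_gates(15_000)   # without the register bank

# Implied transistors per gate, using the slide's transistor estimates.
full_ratio = 6_000_000 / full_gates
core_ratio = 2_000_000 / core_gates

if __name__ == "__main__":
    print(full_gates, core_gates)
    print(f"{full_ratio:.2f} and {core_ratio:.2f} transistors per gate")
```

Both estimates imply the same ratio, so the two transistor figures are consistent with each other.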
How much is 10 million?
Future work
  Very early in development
    One week of development
    Clumsy completion logic
  Slowest path analysis
    Remove unnecessary dependencies
    Improve worst case latency
  Target of 5 gate delays per instruction
    Parallel instruction execution
    Removing unnecessary latches
Distant Future Work
  Simplification of completion logic
    Use timing assumptions on the reset phase
    Halve the area
  Redundant slices
    Bypass broken slices
Conclusions
  Method of producing very fast circuits
  Minimal design effort
    Minimal experience required
    Implicit data dependency
    Eager evaluation
  Many improvements possible
    Area could be halved
    Performance of 5 gate delays per instruction