High Performance Asynchronous Circuit Design and Application


1 High Performance Asynchronous Circuit Design and Application
Charlie Brej APT Group University of Manchester 18/01/2019 Async Forum

2 Introduction
Async performance
  Asynchronous logic is slow
Wagging Logic
  Example circuits
Red Star
  Design
  Results
Conclusions

3 Data propagation
[Figure: pipeline of Logic blocks and C-elements (C), annotated with Latency and Cycle Time; steps numbered 1 to 6 and 1 to 12.]

4 Control propagation
[Figure: pipeline of Logic blocks and C-elements (C), annotated with Latency and Cycle Time; steps numbered 1 to 12.]

5 Control propagation
[Figure: pipeline of Logic blocks and C-elements (C), annotated with Latency and Cycle Time; steps numbered 1 to 12.]

6 And then it gets worse
Latency is at least six times lower than the cycle time
  Assumes all data arrives at the same time
  Assumes all acknowledgements arrive at the same time
Actual number is somewhere between 10 and 100

7 What can we do
Use two-phase signalling
  Halve the control delay
  Lose all average case advantages
Fine grain pipelining
  Need to add 10+ latches per stage
  Adds latency
Faster completion
  Anti-tokens, Early-drop latches…
  Careful timing analysis

8 Wagging Latches
Alternate latch read/write
Capacity of two latches
Depth of one latch
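The alternating read/write idea on this slide can be sketched in software. This is only a behavioural illustration, not the circuit from the talk: the class name `WaggingLatch` and its methods are hypothetical, and real latches would use handshake signals rather than method calls. It shows two storage slots (capacity of two) where each value still passes through exactly one slot (depth of one).

```python
class WaggingLatch:
    """Two latches used alternately: writes and reads each toggle between them."""

    def __init__(self):
        self.slots = [None, None]      # the two latches
        self.full = [False, False]     # whether each latch holds unread data
        self.write_sel = 0             # latch that takes the next write
        self.read_sel = 0              # latch that supplies the next read

    def can_write(self):
        return not self.full[self.write_sel]

    def can_read(self):
        return self.full[self.read_sel]

    def write(self, value):
        if not self.can_write():
            raise RuntimeError("both latches are full")
        self.slots[self.write_sel] = value
        self.full[self.write_sel] = True
        self.write_sel ^= 1            # wag to the other latch

    def read(self):
        if not self.can_read():
            raise RuntimeError("no data to read")
        value = self.slots[self.read_sel]
        self.full[self.read_sel] = False
        self.read_sel ^= 1             # wag to the other latch
        return value
```

Two writes can be accepted before any read, and values come out in the order they went in.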

9 Wagging Logic
Apply same method to the logic
Rotate logic allowing one to set while others reset
[Figure: three logic copies labelled Set, Reset, Reset.]
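A rough way to see why rotating copies of the logic helps is a toy timing model. Everything here is an assumption made for illustration: the function `average_cycle`, the `handoff_delay` between consecutive operation starts, and the example delay values do not come from the slides. Each operation goes to the next copy in rotation, which must finish its set and reset phases before it can be reused.

```python
def average_cycle(n_copies, set_delay, reset_delay, handoff_delay, n_ops=1000):
    """Average gate delays between operation starts with round-robin logic copies."""
    free_at = [0] * n_copies           # time at which each copy has finished resetting
    prev_start = None
    start = 0
    for op in range(n_ops):
        copy = op % n_copies           # rotate through the copies in order
        earliest = 0 if prev_start is None else prev_start + handoff_delay
        start = max(earliest, free_at[copy])
        free_at[copy] = start + set_delay + reset_delay
        prev_start = start
    return start / (n_ops - 1)


for copies in (1, 3, 10):
    print(copies, average_cycle(copies, set_delay=10, reset_delay=10, handoff_delay=2))
```

In this model a single copy pays the full set plus reset time on every operation, while with enough copies the reset phases are hidden and only the hand-off delay remains.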

10 Single Channel Mixer

11 LCM Channels Mixer

12 Direct Connection Mixer

13 32bit Incrementer Example
[Figure: block diagrams built from Reg and +1 blocks, with N labels.]

14 32bit Incrementer Example
[Figure: Slice 0, Slice 1 and Slice 2, built from Reg, HB and +1 blocks.]

15 32bit Incrementer
Optimal Design: 3288 Operations
  3.04 GDs per operation
Original Design: 77 Operations
  130 GDs per operation
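The four figures on this slide can be cross-checked with simple arithmetic, reading "GDs" as gate delays (the term used later in the talk). Multiplying operations by GDs per operation gives close to 10,000 for both designs, which suggests, as an assumption rather than something the slide states, that both were measured over the same window of about 10,000 gate delays.

```python
# Figures quoted on the slide.
optimal_ops, optimal_gd_per_op = 3288, 3.04
original_ops, original_gd_per_op = 77, 130

optimal_window = optimal_ops * optimal_gd_per_op      # about 9996 gate delays
original_window = original_ops * original_gd_per_op   # 10010 gate delays
speedup = original_gd_per_op / optimal_gd_per_op      # about 42.8

print(round(optimal_window), round(original_window), round(speedup, 1))
```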

16 32bit Accumulator Example
Load or Accumulate

17 32bit Accumulator Example
[Figure: operation sequence Load, Accumulate, Accumulate, Load, Accumulate, Load.]

18 32bit Accumulator Example

19 Red Star
MIPS ISA
  32bit RISC
Fast and simple development
  Use synchronous design methodology
Complicated features without complicated design effort
  OOO execution, banked caching…

20 Red Star

21 Register Bank

22 ADD R1, R1, #1
1401 Operations
7.14 GDs per operation

23 Additional unnecessary stages to extend the branch shadow
[Figure: Branch Logic, PC, +1 and + blocks.]

24 Overlapping Instructions
[Figure: eight overlapping instructions, each passing through Fetch, Decode, Execute, Memory, Dummy and WriteBack, with the Branch Shadow marked.]

25 Five Instruction Loop

26 Caching
[Figure: RAM and the Slice 0 to Slice 3 caches, with the loop 0:Instruction, 1:Instruction, 2:Instruction, 3:Branch 0.]

27 Caching
[Figure: RAM and the Slice 0 to Slice 3 caches, with the loop 0:Instruction, 1:Instruction, 2:Branch 0.]

28 Caching
If (PC%WagLevel != Slice)
  Execute a NOP
  Don’t increment the PC
[Figure: RAM and the Slice 0 to Slice 3 caches, with the loop 0:Instruction, 1:Instruction, 2:Branch 0 and a NOP.]
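The PC%WagLevel rule can be traced with a short simulation. This is a sketch under stated assumptions: slices are taken to fire in fixed rotation, the branch shadow is ignored, and the function `run` and the tuple encoding of the program are hypothetical names chosen for the example. It uses the three-entry loop from the slide (two instructions, then a branch to 0) on four slices.

```python
def run(program, wag_level, steps):
    """Return what each slice does, in rotation order, for `steps` turns."""
    pc = 0
    trace = []
    for step in range(steps):
        slice_id = step % wag_level          # assumed: slices take turns in order
        if pc % wag_level != slice_id:
            trace.append((slice_id, "NOP"))  # wrong slice: NOP, PC unchanged
            continue
        kind, target = program[pc]
        trace.append((slice_id, pc))         # this slice executes address pc
        pc = target if kind == "branch" else pc + 1
    return trace


loop = [("instruction", None), ("instruction", None), ("branch", 0)]
for slice_id, action in run(loop, wag_level=4, steps=8):
    print(slice_id, action)
```

In this trace Slice 3 only ever executes NOPs, so each slice keeps seeing the same addresses on every pass through the loop.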

29 Caching
Instead of one large 16Kb cache
  12bit address
16 small 1Kb caches
  8bit address
  Approximately 50% faster lookup
  No data duplication
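The address widths on this slide fit together if the 12-bit address is split into a 4-bit slice number and an 8-bit address within that slice. The sketch below assumes the low bits select the slice, in line with the PC%WagLevel rule quoted earlier; the function name `split_address` is hypothetical. It checks that every 12-bit address lands in exactly one slot of one slice cache, which is one way to read "No data duplication".

```python
WAG_LEVEL = 16          # number of slice caches, from the slide
ADDRESS_BITS = 12       # address width of the single large cache, from the slide


def split_address(addr):
    """Map a 12-bit cache address to (slice, 8-bit address within that slice)."""
    return addr % WAG_LEVEL, addr // WAG_LEVEL


# Every address should map to a distinct (slice, index) pair.
placements = {split_address(a) for a in range(2 ** ADDRESS_BITS)}
print(len(placements), max(index for _, index in placements))
```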

30 Two Nasty Loops (7 and 17)

31 Area
45,000 gates per slice
  15,000 gates without the register bank
Approx 6 million transistors (16 way)
  2 million without the register bank
Final design ~10 million transistors

32 How much is 10 million?

33 Future work
Very early in development
  One week of development
Clumsy completion logic
Slowest path analysis
  Remove unnecessary dependencies
  Improve worst case latency
  Target of 5 gate delays per instruction
Parallel instruction execution
Removing unnecessary latches

34 Distant Future Work
Simplification of completion logic
  Use timing assumptions on the reset phase
  Halve the area
Redundant slices
  Bypass broken slices

35 Conclusions
Method of producing very fast circuits
  Minimal design effort
  Minimal experience required
Implicit data dependency
Eager evaluation
Many improvements possible
  Area could be halved
  Performance of 5 gate delays per instruction

