Charlie Brej APT Group University of Manchester

Presentation on theme: "Charlie Brej APT Group University of Manchester"— Presentation transcript:

1 Charlie Brej APT Group University of Manchester
Group Talk Charlie Brej APT Group University of Manchester 19/09/2018 Async Forum

2 Part 1: The Future According to Me
Charlie Brej APT Group University of Manchester

3 Razor Blades
Scheme 1: “Name” [Number] Plus/Extreme/Ultra/Turbo/?X
  Trac II Plus
  Core Quad Extreme
  Athlon 64 FX
  GeForce 8800 Ultra
Scheme 2: “Company Name” Fusion/Quattro/Mach
  Gillette Fusion, AMD Fusion, Ford Fusion
  Schick Quattro, NVIDIA Quadro, Audi Quattro
  Gillette Mach, ATI Mach, Ford Mustang Mach 1
Maybe more soon…
[Figure: timeline labelled 1901, 1971, 1998, 2004, 2005]

4 Razor Blade History

5 Prediction: 2007 Jan-Sept 15 Blade Apple iShave

6 Why did this not happen?
Because you don’t need more than five blades on your razor
  Unless we grow larger faces
  Which hasn’t happened before, so we won’t need them for some time
We don’t need more than four processors
  Unless we invent an automagic parallelism extractor
  Which we haven’t since the 60s, so we won’t need them for some time
People will still demand faster single thread performance

7 Real Future
Moore’s law will continue
  Transistor count doubles every 18 months
  Moving into 3rd dimension
Intelligent transistors placed per person will remain constant
  Not copy-paste
Verification becomes problematic
  Designs become very complicated

8 Productivity
Managers 40%
Grunt Coder 80%
Sales 0%
Hero Coder 100%
Marketing -20%
Maintainers 60%
Admin 20%
[Speech bubbles: “Can we make it pink?”, “How about “Intel Terrano””]

9 Brej’s Law
Person years per design doubles every 18 months
  Most transistors are copy-paste
  Verification becomes much more complex
  Hero coders become more rare
  People get stupider
  Marketing becomes more important

10 Brej’s Law
1985: 5 person years
  ARM
1997: 2560 person years
  Pentium II (about right)
2007: person years
  Intel has 94,000 employees
  AMD has 16,000
  A new design every 7 years
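
As a rough illustration of the doubling rule on these slides, the sketch below projects person years per design from the 1985 starting point. The function name `person_years` and its parameters are made up for this example. Strict 18-month doubling over the 12 years to 1997 gives eight doublings, or 1280 person years; the slide's 2560 is one doubling more, which the slide itself only calls "about right".

```python
def person_years(year, base_year=1985, base_effort=5.0, doubling_months=18):
    """Project design effort assuming it doubles every `doubling_months`.

    The base values (1985, 5 person years) come from the slide; the
    function itself is only an illustrative sketch.
    """
    doublings = (year - base_year) * 12 / doubling_months
    return base_effort * 2 ** doublings


if __name__ == "__main__":
    for year in (1985, 1997):
        print(year, round(person_years(year)))
```

Because the growth is exponential, shifting the start date by 18 months changes every later figure by a factor of two, which is why a single doubling of disagreement is easy to pick up.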

11 Brej’s Law
2028: Entire population of the USA are employed by Intel
2031: Entire population of China employed by AMD
2034: Entire world population working on creating Pentium 12
2090: Project to build Pentium 15 starts but hits a snag as universe finishes before the project does

12 “The most powerful force in the universe is compound interest”
  Albert Einstein
“And we didn't have any fancy Sony Playstation video games. We had the Atari 2600! There were no multiple levels or screens. It was just ONE screen, forever, and you could never win. The game just kept getting harder and faster until you died. Just like LIFE!”
  Ernest Cline

13 Back to the Future
Transistors will be free
  Mostly consumed in memory
Diminishing returns
  Single thread grinds to a halt
  Increase performance by 1% get 100% more money
Fewer designs
  Very expensive and long lead up times
  Extend rather than redesign

14 Part 2: Wagging Logic: Non Throughput-Bound Design Methodology
Charlie Brej APT Group University of Manchester

15 Introduction
Async performance
  Asynchronous logic is slow
Wagging Logic
  Example circuits
Red Star
  Design
  Results
Conclusions

16 Data propagation
[Figure: Logic and C blocks, with Latency and Cycle Time marked against time steps 1 to 12]

17 Control propagation
[Figure: Logic and C blocks, with Latency and Cycle Time marked against time steps 1 to 12]

18 Control propagation
[Figure: Logic and C blocks, with Latency and Cycle Time marked against time steps 1 to 12]

19 And then it gets worse
Latency is at least six times lower than the cycle time
  Assumes all data arrives at the same time
  Assumes all acknowledgements arrive at the same time
  Actual number is somewhere between 10 and 100

20 What can we do
Use two-phase signalling
  Halve the control delay
  Lose all average case advantages
Fine grain pipelining
  Need to add 10+ latches per stage
  Adds latency
Faster completion
  Anti-tokens, Early-drop latches…
  Careful timing analysis

21 Wagging Latches
Alternate latch read/write
Capacity of two latches
Depth of one latch
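
The three bullets above can be pictured with a small behavioural model. This is only a sketch: the class name `WaggingLatch` and its methods are invented here, and a real wagging latch is a handshaking circuit rather than a software object. Two latches sit side by side; writes alternate between them and reads follow in the same order, so two items can be held at once (capacity of two) while each item only ever passes through a single latch (depth of one).

```python
class WaggingLatch:
    """Two latches written and read alternately (illustrative model only)."""

    def __init__(self):
        self.slots = [None, None]   # the two parallel latches
        self.full = [False, False]
        self.write_idx = 0          # which latch the next write goes to
        self.read_idx = 0           # which latch the next read comes from

    def can_write(self):
        return not self.full[self.write_idx]

    def can_read(self):
        return self.full[self.read_idx]

    def write(self, value):
        if not self.can_write():
            raise RuntimeError("both latches are full")
        self.slots[self.write_idx] = value
        self.full[self.write_idx] = True
        self.write_idx ^= 1         # wag to the other latch

    def read(self):
        if not self.can_read():
            raise RuntimeError("no data available")
        value = self.slots[self.read_idx]
        self.full[self.read_idx] = False
        self.read_idx ^= 1          # wag to the other latch
        return value
```

A chain of two ordinary latches would offer the same capacity, but every item would have to travel through both of them.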

22 Wagging Logic
Apply same method to the logic
Rotate logic allowing one to set while others reset
[Figure labels: Set, Reset, Reset]
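
To see why rotating the logic helps, here is a deliberately simple timing model. All of it is an assumption for illustration: operations are handed to slices in round-robin order, each operation must wait `latency` for the previous one's data, and a slice stays busy for a full `cycle_time` (set plus reset) before it can be reused. The default six-to-one ratio is the lower bound quoted on the "And then it gets worse" slide.

```python
def average_period(ways, latency=1.0, cycle_time=6.0, operations=600):
    """Average time between operation starts under a simple timing model.

    Assumptions (not from the slides): operation i runs on slice i % ways,
    it cannot start until the previous operation's data is ready
    (`latency` later), and a slice is busy for `cycle_time` before it
    can accept another operation.
    """
    start = []
    for i in range(operations):
        t = 0.0
        if i >= 1:
            t = max(t, start[i - 1] + latency)        # wait for input data
        if i >= ways:
            t = max(t, start[i - ways] + cycle_time)  # wait for slice to reset
        start.append(t)
    return start[-1] / (operations - 1)


if __name__ == "__main__":
    for ways in (1, 2, 3, 6, 12):
        print(ways, round(average_period(ways), 2))
```

In this model the period falls as slices are added until it reaches the latency; beyond that point extra slices no longer help.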

23 Single Channel Mixer

24 LCM Channels Mixer

25 Direct Connection Mixer

26 32bit Incrementer Example
[Figure labels: Reg, HB, +1; Slice 0, Slice 1, Slice 2]

27 32bit Incrementer
Optimal Design: 3288 Operations
  3.04 GDs per operation
Original Design: 77 Operations
  130 GDs per operation
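
The two pairs of numbers can be cross-checked with a little arithmetic. The sketch below uses only the slide's figures; the dictionary layout and function names are invented for the example, and GD is read as gate delay. Both products come to roughly 10,000 gate delays, which suggests (the slide does not say so) that both designs were measured over the same simulation window.

```python
# Figures quoted on the slide (GD = gate delay).
designs = {
    "original": {"operations": 77, "gd_per_op": 130.0},
    "optimal": {"operations": 3288, "gd_per_op": 3.04},
}


def total_gate_delays(design):
    """Approximate length of the run implied by the two quoted numbers."""
    return design["operations"] * design["gd_per_op"]


def speedup(slow, fast):
    """Throughput ratio, taken as the ratio of gate delays per operation."""
    return slow["gd_per_op"] / fast["gd_per_op"]


if __name__ == "__main__":
    for name, design in designs.items():
        print(name, round(total_gate_delays(design)))
    print("speedup", round(speedup(designs["original"], designs["optimal"]), 1))
```

On these figures the optimal design completes operations a little under 43 times as often as the original.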

28 32bit Accumulator Example
Load or Accumulate

29 32bit Accumulator Example
[Figure labels: Load, Accumulate, Accumulate, Load, Accumulate, Load]

30 32bit Accumulator Example

31 Transistors are “Free”
What is expensive?
  Design effort
  Time to market
  Yield
What we want
  Simple
  Copy-Paste
  Redundancy

32 Redundancy
[Figure: six blocks, each labelled Slice]

33 Arrangement
[Figure labels: Slice 0, Slice 0, Slice 0, Slice 2, Slice 1, Slice 5, Slice 3]

34 Teaching Monkeys
Dynamic extraction of parallelism
  Implicit data dependency tracking
  No locking
  No polling
  No handshakes
Average case performance

35 Red Star
MIPS ISA
  32bit RISC
Fast and simple development
  Use synchronous design methodology
Complicated features without complicated design effort
  OOO execution, banked caching…

36 Red Star

37 Register Bank

38 ADD R1, R1, #1
1401 Operations
7.14 GDs per operation

39 Additional unnecessary stages to extend the branch shadow
[Figure labels: Branch Logic, PC, +1, +]

40 Overlapping Instructions
[Figure: eight overlapping instructions, each passing through Fetch, Decode, Execute, Memory, Dummy, WriteBack, with the Branch Shadow marked]

41 Nine Instruction Loop

42 Caching: 4 Instruction Loop
[Figure: RAM and the Slice 0 to Slice 3 caches; loop contents 0:Instruction, 1:Instruction, 2:Instruction, 3:Branch 0]

43 Caching: 3 Instruction Loop
[Figure: RAM and the Slice 0 to Slice 3 caches; loop contents 0:Instruction, 1:Instruction, 2:Branch 0]

44 Caching: Delayed Branch
If (PC%WagLevel != Slice)
  Execute a NOP
  Don’t increment the PC
[Figure: RAM and the Slice 0 to Slice 3 caches; loop contents 0:Instruction, 1:Instruction, 2:Branch 0; NOP label]
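
The rule on this slide can be traced with a short simulation. This is a sketch under assumptions that are not in the slides: slices take turns in strict round-robin order, and the program is encoded as a dictionary mapping each address to the next PC (both the encoding and the names `run` and `loop3` are made up here). A slice whose number does not match `PC % WagLevel` executes a NOP and leaves the PC alone, so the instruction waits for the slice that caches it.

```python
def run(program, wag_level, steps):
    """Trace (slice, executed address or 'NOP') for `steps` slice turns.

    `program` maps an address to its next PC; this encoding, like the
    round-robin visiting order, is an assumption made for the sketch.
    """
    pc = 0
    trace = []
    for step in range(steps):
        slice_id = step % wag_level          # slices take turns in order
        if pc % wag_level != slice_id:
            trace.append((slice_id, "NOP"))  # wrong slice: NOP, PC unchanged
        else:
            trace.append((slice_id, pc))
            pc = program[pc]                 # sequential step or branch target
    return trace


# Three-instruction loop from the slide: 0, 1, then a branch back to 0.
loop3 = {0: 1, 1: 2, 2: 0}

if __name__ == "__main__":
    for entry in run(loop3, wag_level=4, steps=8):
        print(entry)
```

With four slices and a three-instruction loop, slice 3 executes one NOP per iteration, and each instruction only ever needs to be held in one slice's cache.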

45 Caching
Instead of one large 16Kb cache
  12bit address
16 small 1Kb caches
  8bit address
  Approximately 50% faster lookup
  No data duplication
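
The address widths on this slide are consistent with 32-bit words if "Kb" is read as kilobytes: 16 KB is 4096 words (12 address bits) and 1 KB is 256 words (8 address bits). The sketch below shows one way the 12-bit address could be divided between slices. The low-order interleaving is an assumption, chosen to match the PC%WagLevel test on the previous slide, and the names are hypothetical.

```python
WAG_LEVEL = 16          # number of slices, each with its own small cache
WORDS_PER_CACHE = 256   # 1 KB of 32-bit words, assuming "Kb" means kilobytes


def split_address(word_address):
    """Map a 12-bit word address to (slice, 8-bit index within that cache).

    Low-order interleaving is an assumption, chosen to match the
    PC % WagLevel test on the previous slide.
    """
    if not 0 <= word_address < WAG_LEVEL * WORDS_PER_CACHE:
        raise ValueError("address outside the 12-bit range")
    return word_address % WAG_LEVEL, word_address // WAG_LEVEL


if __name__ == "__main__":
    print(split_address(0xABC))
```

Every address lands in exactly one (slice, index) pair, which is the "no data duplication" property: the small caches partition the large one rather than mirroring it.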

46 Area
~4 times larger than synchronous
  Times the number of slices
Currently 45,000 gates per slice
  15,000 gates without the register bank
  Approx 6 million transistors (16 way)
  2 million without the register bank
Final design target: 4 million transistors
  Don’t wag the register bank (66% of area)
  Simplify completion detection (50% of area)
  Technology mapper
  Complete the ISA

47 How much is 4 million?

48 How much is 4 million?

49 How much is 4 million?

50 How much is 4 million?

51 How much is 4 million?

52 Performance
Gate delay based simulations
  No optimiser
  No technology mapper
7 Gate delays per instruction
  10±3 inversion delays
Target of 5
  7±2 inversion delays

53 How Much is 10 Inversion Delays

54 How Much is 10 Inversion Delays

55 Future work
Very early in development
  One week of development
Clumsy completion logic
  Slowest path analysis
  Remove unnecessary dependencies
  Improve worst case latency
Target of 5 gate delays per instruction
  Parallel instruction execution
  Removing unnecessary latches

56 Conclusions
Method of producing very fast circuits
  Minimal design effort
  Minimal experience required
Implicit data dependency
  Eager evaluation
Many improvements possible
  Area could be halved
  Performance of 5 gate delays per instruction

