
1/26 Performance-oriented Optimisation of Balsa Dual-Rail Circuits
Luis Tarazona
Advanced Processor Technologies Group, School of Computer Science

2/26 Agenda
Some examples of performance-oriented coding
Current Balsa Optimisations:
1. Eliminating redundant FalseVariable components
2. Concurrent RTZ Fetch component
3. Conditional parallel/sequencer component: ParSeq
4. Write-after-read loop unrolling
5. CaseFetch with "default"

3/26 Viterbi Decoder BMU description
BMU – Branch Metric Unit
Algorithm:
Read inputs a, c
Calculate b = 7 - a, d = 7 - c
Calculate ac = a + c, ad = a + d, bd = b + d, bc = b + c
Normalise by subtracting Min(ac, ad, bd, bc) from ac, ad, bd, bc
Output {ac, ad, bd, bc}
Can be pipelined
Can be unrolled
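As a rough picture of the kind of description being discussed, a straightforward, purely sequential Balsa version of the algorithm above might look like the sketch below. The procedure name, the type widths, the casts and the fully sequential structure are assumptions made for illustration; this is not the source used in the presentation.

    import [balsa.types.basic]

    type symbol is 3 bits   -- inputs a, c lie in 0..7
    type metric is 4 bits   -- branch metrics (sums of two symbols)

    -- Sequential BMU sketch: read a and c, form b = 7-a and d = 7-c,
    -- compute the four branch metrics, normalise by their minimum,
    -- then output all four results.
    procedure bmu (
      input a : symbol; input c : symbol;
      output ac : metric; output ad : metric;
      output bd : metric; output bc : metric
    ) is
      variable va, vb, vc, vd : symbol
      variable mac, mad, mbd, mbc, m : metric
    begin
      loop
        a -> va ; c -> vc ;                     -- read inputs
        vb := (7 - va as symbol) ;              -- b = 7 - a
        vd := (7 - vc as symbol) ;              -- d = 7 - c
        mac := (va + vc as metric) ; mad := (va + vd as metric) ;
        mbd := (vb + vd as metric) ; mbc := (vb + vc as metric) ;
        m := mac ;                              -- m = Min(ac, ad, bd, bc)
        if mad < m then m := mad end ;
        if mbd < m then m := mbd end ;
        if mbc < m then m := mbc end ;
        ac <- (mac - m as metric) ;             -- normalised outputs
        ad <- (mad - m as metric) ;
        bd <- (mbd - m as metric) ;
        bc <- (mbc - m as metric)
      end
    end

A description in this style tends to compile to one large, mostly sequential control tree, which is what the optimised versions on the following slides restructure.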

4/26 Viterbi Decoder BMU HC
[Figure: BMU handshake circuit, with "data + control" and "control" regions labelled]

5/26 Optimised BMU description
Pipeline and unroll operations
Pipelined multicast

6/26 Optimised BMU HC
[Figure: optimised BMU handshake circuit, with "data + control" and "control" regions labelled]
Small, scattered control trees
Pipelined "data-driven" structure
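One way to read "pipelined multicast" is sketched below: a value is pushed from its own small loop to several consumers in parallel over dedicated channels, instead of being pulled on demand from a shared variable under one big control tree. The stage boundaries, names and types here are assumptions for illustration only, not the presentation's optimised source.

    type symbol is 3 bits

    -- One pipeline stage (a sketch): for each input a, multicast a and its
    -- complement b = 7 - a to the downstream adder stages in parallel.
    procedure complementStage (
      input a : symbol;
      output aToAC : symbol; output aToAD : symbol;
      output bToBD : symbol; output bToBC : symbol
    ) is
    begin
      loop
        a -> then
          aToAC <- a || aToAD <- a ||
          bToBD <- (7 - a as symbol) || bToBC <- (7 - a as symbol)
        end
      end
    end

In a full description, several such stages would be connected by local channels and composed in parallel, giving the small, scattered control trees mentioned on this slide.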

7/26 Viterbi Decoder PMU Description
PMU – Path Metric Unit
Find the global winner; normalise the state
Algorithm:
Let MemState be an array of values, then:
Global winner = the unique MemState[i] which is equal to zero; otherwise no global winner is found
Normalise the values by subtracting Min(all MemState) from every MemState[i]
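Only the global-winner test is sketched below; the normalisation step mirrors the min-and-subtract pattern already shown for the BMU. Four scalar metrics stand in for the MemState array, and all names and types are assumptions made for illustration, not the presentation's actual description.

    import [balsa.types.basic]

    type metric is 4 bits
    type stateId is 2 bits

    -- Global-winner sketch: the winner is the (unique) state whose metric
    -- is zero; if no metric is zero, no global winner is reported.
    procedure findWinner (
      input m0In : metric; input m1In : metric;
      input m2In : metric; input m3In : metric;
      output winner : stateId; output found : boolean
    ) is
      variable m0, m1, m2, m3 : metric
      variable w : stateId
      variable ok : boolean
    begin
      loop
        m0In -> m0 ; m1In -> m1 ; m2In -> m2 ; m3In -> m3 ;
        w := 0 ; ok := false ;
        if m0 = 0 then w := 0 ; ok := true end ;
        if m1 = 0 then w := 1 ; ok := true end ;
        if m2 = 0 then w := 2 ; ok := true end ;
        if m3 = 0 then w := 3 ; ok := true end ;
        winner <- w ;
        found <- ok
      end
    end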

8/26 PMU HC
[Figure: PMU handshake circuit, with "data + control" and "control" regions labelled]

9/26 Optimised PMU description

10/26 Optimised PMU HC
[Figure: optimised PMU handshake circuit, with "data + control" and "control" regions labelled]
Small, scattered control trees

11/26 Agenda
Some examples of performance-oriented coding
Current Balsa Optimisations:
1. Eliminating redundant FalseVariable components
2. Concurrent RTZ Fetch component
3. Conditional parallel/sequencer component: ParSeq
4. Write-after-read loop unrolling
5. CaseFetch with "default"

12/26 Eliminating redundant FVs
Targets active input control
Single-access, single-read-port FalseVariable or eagerFalseVariable HCs
i -> then CMD end

13/26 Eliminating redundant FVs - Example
a, b -> then o <- a + b end

14/26 Eliminating redundant FVs - Example
Latency and area reduction
Preserves external behaviour
a, b -> then o <- a + b end
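For reference, the fragment on this slide is already a complete command; wrapped in a minimal procedure it might look like the sketch below. The procedure name, port types and the narrowing cast are assumptions added so that the widths check; the slide's own fragment omits them.

    import [balsa.types.basic]

    -- Minimal context for the slide's example: both inputs are enclosed,
    -- each is read exactly once, and their sum is sent to o.
    procedure addOnce (input a : byte; input b : byte; output o : byte) is
    begin
      loop
        a, b -> then
          o <- (a + b as byte)
        end
      end
    end

Per the previous slide, this single-access, single-read-port enclosure is the pattern whose redundant FalseVariable/eagerFalseVariable components the optimisation removes.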

15/26 Concurrent RTZ (return-to-zero) Fetch
Wires-only dual-rail Fetch and its STG (signal transition graph)

16/26 Concurrent RTZ Fetch
New concurrent RTZ Fetch and its STG

17/26 The ParSeq
Acts conditionally as a Concur (parallel) or as a Sequencer HC
Few opportunities to apply it in the available design examples
– Perhaps because the component did not exist when those examples were written?
Interesting increase in performance where it does apply, though
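The behaviour can be pictured at the source level with the sketch below: depending on a run-time condition, the same two transfers either run concurrently or are forced into sequence. This is only an illustration of "conditionally parallel or sequential"; it is not the ParSeq component itself, nor a usage taken from the presentation.

    import [balsa.types.basic]

    -- Conditionally parallel / sequential behaviour (illustration only):
    -- when go is true the two outputs fire concurrently (Concur-like),
    -- otherwise they are sequenced (Sequencer-like).
    procedure parOrSeq (input go : boolean; input i : byte;
                        output o1 : byte; output o2 : byte) is
      variable x : byte
    begin
      loop
        i -> x ;
        go -> then
          if go then
            o1 <- x || o2 <- x    -- parallel
          else
            o1 <- x ; o2 <- x     -- sequential
          end
        end
      end
    end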

18/26 Handshake circuit implementation
[Figure: ParSeq handshake circuit implementation; the schematic includes a "||" component]

19/26 Optimised ParSeq Schematics

20/26 Some Simulation Results
Pre-layout, transistor-level simulations, 180 nm technology

21/26 Write-after-read Loop unrolling
WAR hazards prevent the use of a T-element based Seq

22/26 Write-after-read Loop unrolling
First-read unrolling reorders the operations inside the loop and allows the use of a T-element based sequencer
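A minimal sketch of the transformation, using a hypothetical buffer-like loop (the example itself is an assumption; only the reordering idea comes from the slide):

    import [balsa.types.basic]

    -- Original loop: x is read (sent on o) and then overwritten from i in
    -- the same iteration, i.e. a write-after-read inside the loop body.
    procedure warLoop (input i : byte; output o : byte) is
      variable x : byte
    begin
      loop
        o <- x ;   -- read x
        i -> x     -- write x: WAR with the read above
      end
    end

    -- After first-read unrolling: the first read of x is peeled out of the
    -- loop, so the remaining body is write-then-read and, as the slide
    -- states, a T-element based sequencer can be used.
    procedure warUnrolled (input i : byte; output o : byte) is
      variable x : byte
    begin
      o <- x ;     -- peeled first read (x is initially undefined, as before)
      loop
        i -> x ;   -- write x
        o <- x     -- read x
      end
    end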

23/26 Write-after-read Loop unrolling
Automatic unrolling when an unbounded WAR loop is detected

24/26 Write-after-read Loop unrolling
Performance gain depends on data width
Upper-bound performance gains (transistor-level simulation results, 180 nm)
In real examples (32-bit processor units), speed-ups of around 4-6% for 32-bit data widths

25/26 CaseFetch with "default"
A "default" signal decides whether the selected input or a default value is sent to the output
Used to override possibly spurious data (e.g. when reading speculatively while a variable/channel is being written) when using DI (delay-insensitive) encoding

26/26 Conclusions and Future Work
Conclusions:
Coding style heavily influences performance and control-optimisation opportunities
– Large speed-ups at low area cost (2x-20x speed-up for a 10-25% area overhead)
– Pipeline-style descriptions normally have small control trees, which leave less room for control resynthesis
– Datapath optimisations are needed to increase performance further
The presented optimisations and components increase the performance of highly optimised code by ~10%
Future work:
Incorporate the optimisations into the Balsa design flow
ParSeq as a language construct or as a peephole optimisation?
Add a performance-oriented coding-style guide and examples to the Balsa manual

