1/26 Performance-oriented Optimisation of Balsa Dual-Rail Circuits Luis Tarazona Advanced Processor Technologies Group School of Computer Science
2/26 Agenda
  Some examples of performance-oriented coding
  Current Balsa Optimisations
    1. Eliminating redundant FalseVariable components
    2. Concurrent RTZ Fetch component
    3. Conditional parallel/sequencer component: ParSeq
    4. Write-after-read loop unrolling
    5. CaseFetch with “default”
3/26 Viterbi Decoder BMU description
  BMU: Branch Metric Unit
  Algorithm:
    Read inputs a, c
    Calculate: b = 7 - a, d = 7 - c
    Calculate: ac = a + c, ad = a + d, bd = b + d, bc = b + c
    Normalise by subtracting Min(ac, ad, bd, bc) from ac, ad, bd, bc
    Output {ac, ad, bd, bc}
  Can be pipelined
  Can be unrolled
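The BMU algorithm above can be checked with a small behavioural model. This is only a software sketch in Python, not the Balsa description from the slides; the function name `bmu` and the 0..7 input range (inferred from the `7 - a` step) are assumptions.

```python
from typing import Tuple


def bmu(a: int, c: int) -> Tuple[int, int, int, int]:
    """Behavioural model of the branch metric computation.

    Assumption: a and c are 3-bit values (0..7), inferred from b = 7 - a.
    """
    if not (0 <= a <= 7 and 0 <= c <= 7):
        raise ValueError("inputs are assumed to lie in 0..7")

    # Complemented inputs.
    b = 7 - a
    d = 7 - c

    # The four branch metrics.
    ac, ad, bd, bc = a + c, a + d, b + d, b + c

    # Normalise so that the smallest metric becomes zero.
    m = min(ac, ad, bd, bc)
    return (ac - m, ad - m, bd - m, bc - m)
```

The four sums are independent of each other, which is consistent with the slide's note that the description can be pipelined and unrolled.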
4/26 Viterbi Decoder BMU HC Data + control Control
5/26 Optimised BMU description pipeline & unroll operations Pipelined multicast
6/26 Optimised BMU HC Data + Control Control Small scattered control trees Pipelined “data driven” structure
7/26 Viterbi Decoder PMU Description
  PMU: Path Metric Unit
    Find global winner
    Normalise state
  Algorithm:
    Let MemState be an array of values, then
    Global winner = the unique MemState[i] which is equal to zero; otherwise no GW found
    Normalise values by subtracting Min(all MemState) from every MemState[i]
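A behavioural sketch of the two PMU steps, again in Python rather than Balsa. The function name `pmu`, the return shape, and the choice to look for the global winner on the values before normalisation are assumptions; the slide does not state the order.

```python
from typing import List, Optional, Tuple


def pmu(mem_state: List[int]) -> Tuple[Optional[int], List[int]]:
    """Return (global winner index or None, normalised state).

    Assumption: mem_state is non-empty and the winner is searched for
    on the incoming values, before normalisation.
    """
    # Global winner: the index of the unique entry equal to zero.
    zeros = [i for i, value in enumerate(mem_state) if value == 0]
    winner = zeros[0] if len(zeros) == 1 else None

    # Normalise by subtracting the minimum from every entry.
    minimum = min(mem_state)
    normalised = [value - minimum for value in mem_state]
    return winner, normalised
```

For example, `pmu([3, 0, 5, 2])` reports index 1 as the global winner, while `pmu([0, 0, 4])` reports no winner because the zero is not unique.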
8/26 PMU HC Data + control Control
9/26 Optimised PMU description
10/26 Optimised PMU HC Data + control Control Small scattered control trees
11/26 Agenda
  Some examples of performance-oriented coding
  Current Balsa Optimisations
    1. Eliminating redundant FalseVariable components
    2. Concurrent RTZ Fetch component
    3. Conditional parallel/sequencer component: ParSeq
    4. Write-after-read loop unrolling
    5. CaseFetch with “default”
12/26 Eliminating redundant FVs
  Targets active input control
  Single access, single read-port FalseVariable or eagerFalseVariable HCs
    i -> then CMD end
13/26 Eliminating redundant FVs - Example a, b -> then o <- a + b end
14/26 Eliminating redundant FVs - Example Latency and area reduction Preserves external behaviour a, b -> then o <- a + b end
15/26 Concurrent RTZ Fetch Wires-only dual-rail Fetch and its STG
16/26 Concurrent RTZ Fetch New concurrent RTZ Fetch and its STG
17/26 The ParSeq
  Acts conditionally as a Concur (parallel) or as a Sequencer HC
  Few opportunities to apply it in the design examples available
    - Perhaps because the component did not exist when they were written?
  Interesting increase in performance, though
18/26 Handshake circuit implementation ||
19/26 Optimised ParSeq Schematics
20/26 Some Simulation Results Pre-layout, transistor-level simulations, 180nm technology
21/26 Write-after-read Loop unrolling WAR hazards prevent the use of a T-element-based Seq
22/26 Write-after-read Loop unrolling First-read-unroll to reorder operations inside loop and allow the use of a T-element based sequencer
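One way to picture first-read unrolling is as a software analogy: peel the first read out of the loop so that the loop body becomes write-then-read instead of read-then-write, while the overall order of operations is unchanged. The Python sketch below only models the order of operations as a trace. The generator names are hypothetical, and reading "first-read-unroll" as peeling the first read is an assumption, not something the slides spell out.

```python
from itertools import islice
from typing import Iterator, Tuple

Op = Tuple[str, int]  # (operation, iteration index)


def original_loop() -> Iterator[Op]:
    """Unbounded loop whose body reads, then writes (write-after-read)."""
    i = 0
    while True:
        yield ("read", i)
        yield ("write", i)
        i += 1


def first_read_unrolled() -> Iterator[Op]:
    """Same loop with the first read peeled off (assumed transformation).

    Inside the loop the body is now write-then-read.
    """
    yield ("read", 0)
    i = 0
    while True:
        yield ("write", i)
        yield ("read", i + 1)
        i += 1


# Both versions produce the same sequence of operations.
prefix_a = list(islice(original_loop(), 9))
prefix_b = list(islice(first_read_unrolled(), 9))
```

Comparing `prefix_a` and `prefix_b` shows that the reordering changes only where the loop boundary falls, not the observable sequence of reads and writes.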
23/26 Write-after-read Loop unrolling Automatic unrolling if unbounded WAR loop detected
24/26 Write-after-read Loop unrolling
  Performance gain depends on data width
  Upper-bound performance gains (transistor-level simulation results, 180 nm)
  In real examples (32-bit processor units), speed-ups are around 4-6% for 32-bit data widths
25/26 CaseFetch with “default”
  The “default” signal decides whether the selected input or a default value is sent to the output
  Used to override possibly spurious data (i.e. when reading speculatively while writing a variable/channel) if using DI encoding
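The selection behaviour described above can be sketched as a plain function. This is a behavioural illustration in Python only. The name `case_fetch_with_default` and its parameters are hypothetical and do not reflect the component's actual ports or its dual-rail implementation.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def case_fetch_with_default(
    inputs: Sequence[T], select: int, use_default: bool, default_value: T
) -> T:
    """Forward either the selected input or a default value.

    Hypothetical model: use_default stands in for the "default" signal.
    """
    if use_default:
        # The selected input is ignored, so possibly spurious data on it
        # never reaches the output.
        return default_value
    return inputs[select]
```

With `use_default` set, the result does not depend on the contents of `inputs[select]`, which is the property the slide relies on for speculative reads.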
26/26 Conclusions and Future Work
  Conclusions:
    Coding style heavily influences performance and control optimisation opportunities
      - Large speed-ups at low area cost (2x-20x speed-up for 10-25% area overhead)
      - Pipeline-style descriptions normally have small control trees, which leave less room for control resynthesis
      - Datapath optimisations are needed for increasing performance
    The presented optimisations and components increase the performance of highly optimised code by ~10%
  Future work:
    Incorporate the optimisations into the Balsa design flow
    ParSeq as a construct or as a peephole optimisation?
    Add a performance-oriented coding style guide and examples to the Balsa manual