BAT: Performance-Driven Crosstalk Mitigation Based on Bus-grouping Asynchronous Transmission Guihai Yan, Yinhe Han, Xiaowei Li, and Hui Liu Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences
Outline Introduction Proposed BAT Scheme Implementation of BAT Experimental Results Conclusions 2019/1/2
Introduction: Technology improvement brings lower voltage, higher frequency, higher transistor density, and smaller feature size. Q: What are the implications for bus wires?
Introduction: This kind of capacitance is dominant.
Introduction: Crosstalk [figure: 0.25um wires, Length = 100um, Fan-out = 2, signals Y and Z]. Speed: 1.8x slower.
Crosstalk Factor [P.P. Sotiriadis et al, 2001] [figure: transitions classified as Crosstalk Insensitive or Crosstalk Sensitive]
Introduction: As technology advances, the impact of crosstalk gets worse: the aspect ratio gets bigger and the bus gets wider. Bus transmission is likely to suffer severe crosstalk delay. Q: How to alleviate the effects of crosstalk delay on bus transmission?
The conventional approaches. Codec [B. Victor et al, 01] [P.P. Sotiriadis et al, 01]. Pros: relatively low bandwidth overhead, but still at least 47%. Cons: Codec algorithms are hard to construct for large bus widths. Shield: Passive Shield, Active Shield [H.Kaul et al, 02] [R.Arunachalam et al, 03]. Pros: high performance. Cons: area-hungry, usually 100% area overhead.
Several new approaches. Delay-line bus [M. Ghoneima et al, 04]. Pros: nearly zero bandwidth overhead. Cons: very complicated synchronization; lack of scalability. Variable cycle transmission [L. Li et al, 04]. Pros: low area overhead; high performance for relatively narrow buses. Cons: due to the “Cask Effect”, it is likely to fail for wide buses (width of 64 bits or more).
Variable cycle transmission (DYN). Consider a transition between two patterns. If the two patterns are {0 1 0 0 1 0 0 1} → {0 0 1 1 0 1 0 0}: Transition: {– ↓ ↑ ↑ ↓ ↑ – ↓}; Delay vector: {1, 3, 2, 2, 4, 3, 1, 1}. If the two patterns are {0 1 0 1 1 0 0 1} → {0 0 1 1 0 1 0 0}: Transition: {– ↓ ↑ – ↓ ↑ – ↓}; Delay vector: {1, 3, 3, 1, 3, 3, 1, 1}. Q: What if the bus gets wider? The probability that a “4” appears in the delay vector gets higher, which makes DYN less efficient.
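Both delay vectors on this slide are consistent with a simple per-wire rule, sketched below in Python. The rule is an assumption inferred from the two examples, not something the slides state: a quiet wire costs 1 cycle, and a switching wire costs max(1, |d_i - d_(i-1)| + |d_i - d_(i+1)|), where d is +1 for ↑, -1 for ↓ and 0 for –, and a missing neighbor at the bus edge contributes 0. The names `delay_vector` and `prob_worst_case` are illustrative, and the probability helper additionally assumes uniformly random, independent patterns.

```python
from fractions import Fraction
from itertools import product


def delay_vector(prev, nxt):
    """Per-wire delay in cycles for the transition prev -> nxt (assumed rule)."""
    d = [b - a for a, b in zip(prev, nxt)]  # +1 rising, -1 falling, 0 quiet
    delays = []
    for i, di in enumerate(d):
        if di == 0:
            delays.append(1)  # quiet wire: minimum delay
            continue
        cf = 0
        if i > 0:
            cf += abs(di - d[i - 1])  # coupling to the left neighbor
        if i < len(d) - 1:
            cf += abs(di - d[i + 1])  # coupling to the right neighbor
        delays.append(max(1, cf))
    return delays


def prob_worst_case(width):
    """Exact probability that some wire needs 4 cycles on a bus of this width.

    Enumerates every (prev, nxt) pair, so keep width small.
    Assumes uniformly random, independent patterns (hypothetical workload).
    """
    hits = total = 0
    for bits in product((0, 1), repeat=2 * width):
        prev, nxt = bits[:width], bits[width:]
        total += 1
        if max(delay_vector(prev, nxt)) == 4:
            hits += 1
    return Fraction(hits, total)


first = delay_vector([0, 1, 0, 0, 1, 0, 0, 1], [0, 0, 1, 1, 0, 1, 0, 0])
second = delay_vector([0, 1, 0, 1, 1, 0, 0, 1], [0, 0, 1, 1, 0, 1, 0, 0])
```

Under these assumptions the chance of hitting a 4 rises from 1/32 on a 3-wire bus to 7/128 on a 4-wire bus, which illustrates the trend the question on the slide points at.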
Proposed BAT scheme: We propose the BAT scheme by extending the Variable Cycle Transmission (DYN) scheme. What is BAT?
BAT scheme. Ungrouped bus: all transitions are Crosstalk Sensitive. Grouped bus: not all sub-transitions are Crosstalk Sensitive, so the sub-buses can transmit asynchronously.
BAT: How to group the bus into sub-buses? Q: Which grouping is best? It depends on the crosstalk factor distribution (that is, it is application-specific).
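To see why the choice of grouping matters, the sketch below counts cycles for a short pattern stream under DYN and under two groupings of the same 8-bit bus. Several things here are assumptions rather than details from the slides: the per-wire delay rule (inferred from the DYN delay-vector examples), the idea that each transition costs the largest per-wire delay in its bus or sub-bus, that sub-buses are fully isolated from each other at group boundaries, and that receive buffering is unlimited. The function names and the third pattern are hypothetical.

```python
def delay_vector(prev, nxt):
    """Assumed rule: quiet wire = 1 cycle; switching wire = max(1, coupling)."""
    d = [b - a for a, b in zip(prev, nxt)]
    delays = []
    for i, di in enumerate(d):
        if di == 0:
            delays.append(1)
            continue
        cf = 0
        if i > 0:
            cf += abs(di - d[i - 1])
        if i < len(d) - 1:
            cf += abs(di - d[i + 1])
        delays.append(max(1, cf))
    return delays


def stream_cycles(patterns):
    """Cycles to send a stream when each transition waits for its slowest wire."""
    return sum(max(delay_vector(a, b)) for a, b in zip(patterns, patterns[1:]))


def dyn_cycles(patterns):
    """DYN: the whole bus advances together."""
    return stream_cycles(patterns)


def bat_cycles(patterns, group_widths):
    """Grouped bus: each sub-bus advances on its own; the slowest one sets the total."""
    assert sum(group_widths) == len(patterns[0])
    totals = []
    start = 0
    for width in group_widths:
        sub_stream = [p[start:start + width] for p in patterns]
        totals.append(stream_cycles(sub_stream))
        start += width
    return max(totals)


patterns = [
    [0, 1, 0, 0, 1, 0, 0, 1],
    [0, 0, 1, 1, 0, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 1, 0],  # hypothetical third pattern
]
```

For this stream, DYN needs 8 cycles, an equal 4+4 grouping needs 7, and an unequal 2+6 grouping needs 8, so the better grouping depends on where the crosstalk-sensitive transitions fall.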
Crosstalk Factor Distribution: Instruction bus vs. data bus. Grouping according to CF locality: unequal grouping or equal grouping.
Implementation of BAT [figure: grouping line and valid-indicating line]
Differential Counter Cluster (DCC) Synchronizing Mechanism. C(i, j) is a bi-directional counter with range –L ~ +L (L: buffer length). ‘OF’ is short for ‘OverFlow’, ‘UF’ is short for ‘UnderFlow’, and ‘+’ means logical OR. Hold the i-th sub-bus if and only if {OF(C(i, 1)) + OF(C(i, 2)) + · · · + OF(C(i, i−1)) + OF(C(i, i+1)) + · · · + OF(C(i, n))} is true. Hold the j-th sub-bus if and only if {UF(C(1, j)) + UF(C(2, j)) + · · · + UF(C(j−1, j)) + UF(C(j+1, j)) + · · · + UF(C(n, j))} is true.
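The hold rule can be sketched as a small cycle-level simulation. This is one possible reading, not the authors' hardware: it assumes C(i, j) tracks how many words sub-bus i is ahead of sub-bus j, that "overflow" means the counter has reached +L, and that C(i, j) = −C(j, i), so the underflow condition is covered by the same check. For brevity the counters are derived from per-sub-bus word counts instead of being stored. `hold_flags`, `simulate`, and the delay table are hypothetical.

```python
def hold_flags(sent, buffer_len):
    """Hold sub-bus i if it is buffer_len words ahead of any other sub-bus."""
    n = len(sent)
    return [
        any(sent[i] - sent[j] >= buffer_len for j in range(n) if j != i)
        for i in range(n)
    ]


def simulate(delays, buffer_len):
    """Total cycles to send all words.

    delays[g][k] is the number of cycles sub-bus g needs for word k.
    """
    n = len(delays)
    words = len(delays[0])
    sent = [0] * n
    remaining = [d[0] for d in delays]  # cycles left on the current word
    cycles = 0
    while min(sent) < words:
        holds = hold_flags(sent, buffer_len)  # sampled at the start of the cycle
        cycles += 1
        for g in range(n):
            if sent[g] == words or holds[g]:
                continue  # finished, or held back by the counters
            remaining[g] -= 1
            if remaining[g] == 0:
                sent[g] += 1
                if sent[g] < words:
                    remaining[g] = delays[g][sent[g]]
    return cycles


# Two sub-buses whose slow words fall at different positions in the stream.
delays = [
    [4, 1, 1, 1],
    [1, 1, 1, 4],
]
```

With this delay table, a buffer length of 1 gives 10 cycles (no better than sending the words in lock-step), a length of 3 gives 8, and a length of 4 gives 7, the sum of the slower sub-bus's delays.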
DAS Scheme (merges the grouping lines and valid-indicating lines): Delay Line + Active Shield. Simultaneous switch vs. delayed switch.
DAS Scheme: Delay Active Shield. Skew: T/2. Reuse the data-valid indicating line as the grouping line to reduce wire overhead.
Experiment: SimpleScalar 3.0; SPEC CPU2000 benchmarks. On-chip buses: Instruction bus (instruction buffer to L1 I$); Data bus (datapath to L1 D$). Compared against: ORI (original conservative approach), 4 cycles/pattern; DYN (variable cycle transmission), 1~4 cycles/pattern; CDC (Codec approaches), 2 cycles/pattern.
Results /1: BAT applied to a 64-bit instruction bus. Compared with the ORI, DYN and Codec approaches, the average performance improvement of the BAT scheme with the 4-Group configuration is 55.3%, 30.4% and 10.5%, respectively.
Results /2: BAT applied to a 32-bit data bus. Compared with the DYN scheme, BAT with the 4-Group configuration still gains a 12.5% performance improvement on average.
Overhead Analysis /1: Wire routing overhead and avg. cycle/pattern (64-bit bus, 4-group configuration)

Approach   Normalized Area   Avg. cycle/pattern
ORI        100               4
DYN        103               2.57
CPC        145               2
PSD        199
ASD        201               1
BAT        113               1.79
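As a quick arithmetic check, the average cycle/pattern values in this table reproduce the 55.3%, 30.4% and 10.5% improvements reported for the 64-bit instruction bus, assuming improvement is defined as the relative reduction in average cycles per pattern (the slides do not spell out the definition). The dictionary and function names below are illustrative; the CPC row is taken to be the Codec approach.

```python
# Values copied from the overhead table (64-bit bus, 4-group configuration).
avg_cycles = {"ORI": 4.0, "DYN": 2.57, "CPC": 2.0, "BAT": 1.79}
normalized_area = {"ORI": 100, "DYN": 103, "CPC": 145, "BAT": 113}


def improvement_pct(baseline, scheme="BAT"):
    """Relative reduction in avg. cycles/pattern versus a baseline, in percent."""
    return 100.0 * (1.0 - avg_cycles[scheme] / avg_cycles[baseline])


def area_overhead_pct(scheme="BAT", baseline="ORI"):
    """Extra routing area relative to the baseline, in percent."""
    return 100.0 * (normalized_area[scheme] / normalized_area[baseline] - 1.0)


for name in ("ORI", "DYN", "CPC"):
    print(f"BAT vs {name}: {improvement_pct(name):.2f}% fewer cycles per pattern")
print(f"BAT routing overhead vs ORI: {area_overhead_pct():.0f}%")
```

The results agree with the reported figures to within rounding of the table entries, and the 13% routing overhead follows directly from the normalized area of 113.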
Overhead Analysis /2: How much buffer is sufficient to synchronize data receiving? Buffer size vs. avg. cycle/pattern: 4 ~ 8 words is optimal!
Conclusions: Proposed the Bus-grouping Asynchronous Transmission (BAT) scheme. Optimized BAT using the locality of the CF (crosstalk factor) distribution. Proposed the DCC synchronizing mechanism and the DAS scheme. Improved performance by 30+% and 10+% compared with the DYN and Codec approaches, at the cost of 13% routing overhead, when applied to a 64-bit bus.
Thanks for your attention!