1
Performance Driven Crosstalk Elimination at Compiler Level
TingTing Hwang
Department of Computer Science, Tsing Hua University, Taiwan
2
Definition of 4C Crosstalk
A 3C or 4C crosstalk data transmission sequence on a bus
[Figure: bus wires labeled victim and aggressor]
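The slide's figure is not recoverable, so the following is only a minimal sketch of how a transition could be classified, assuming the commonly used model in which a switching victim wire sees an effective coupling of |d_i - d_(i-1)| + |d_i - d_(i+1)| units, where d is +1 for a rising wire, -1 for a falling wire, and 0 for a quiet wire. The function names, the bit ordering, and the treatment of wires outside the bus as quiet are assumptions, not taken from the presentation.

```python
def crosstalk_class(prev: int, curr: int, width: int, wire: int) -> int:
    """Crosstalk class (0..4) seen by `wire` when the bus goes prev -> curr."""
    def delta(i: int) -> int:
        # +1 rising, -1 falling, 0 quiet; wires outside the bus count as quiet
        # (an assumption about shielding at the bus edges).
        if i < 0 or i >= width:
            return 0
        return ((curr >> i) & 1) - ((prev >> i) & 1)

    d = delta(wire)
    if d == 0:
        return 0  # a quiet victim has no transition delay to classify
    return abs(d - delta(wire - 1)) + abs(d - delta(wire + 1))


def worst_class(prev: int, curr: int, width: int) -> int:
    """Highest crosstalk class over all wires for one bus transition."""
    return max(crosstalk_class(prev, curr, width, i) for i in range(width))
```

Under this model the transition 010 -> 101 is 4C for the middle wire (both aggressors switch against the victim), while 010 -> 100 is 3C (one aggressor switches against it, the other is quiet).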
3
Worst-Case Delay Comparison (ps)

Length   Buffer size   4C+3C free   4C free      4C
5mm      30x           241 (36%)    516 (77%)    665
5mm      60x           213 (52%)    399 (99%)    402
5mm      120x          136 (48%)    196 (70%)    279
10mm     30x           437 (42%)    912 (88%)    1026
10mm     60x           413 (44%)    722 (78%)    919
10mm     120x          270 (49%)    379 (69%)    548
20mm     30x           793 (50%)    1068 (67%)   1586
20mm     60x           691 (44%)    1161 (74%)   1561
20mm     120x          580 (42%)    969 (70%)    1365
average                45%          77%          1

Summarized from DATE 2004, "Exploiting Crosstalk to Speed up On-chip Buses," Chunjie Duan
4
Previous Work
Bus encoding (expand Boolean space)
Hardware overhead: encoders, decoders, additional wires
[Figure: Sender -> Encoder -> channel -> Decoder -> Receiver, with width labels b, n, b]
Copied from ICCAD 2001, "Bus Encoding to Prevent Crosstalk Delay," Bret Victor
5
Motivation
Previous work uses codec design
  Logic level: no information about the data
  Large area overhead (e.g., 128 bus width: 128 + 85)
Data sequences on an instruction bus
  Known at compile time
To eliminate crosstalk data sequences:
  Instruction re-scheduling
  Register renaming
6
Problem Definition and Target Architecture
Given a program, generate a 4C (or 3C-and-4C) crosstalk-free program (on an instruction bus)
Performed during compiler optimization
7
Crosstalk Elimination in Compiler Optimization
Input: binary executable program
Step 1: Decompose the input program into basic blocks
Step 2: Instruction rescheduling
Step 3: Register renaming
Step 4: NOP insertion
Output: crosstalk-free binary executable program
[Figure: example on a basic block: interchange I4 and I5, rename R2 to R3, NOP insertion, crosstalk free]
8
Step 2: Instruction Re-scheduling
Instructions are reordered under data-dependency constraints
Construct a weighted Instruction Adjacency Graph (IAG)
9
Instruction Adjacency Graph
Node: instruction
Edge: execution sequence
Weight: the number of crosstalk patterns
  If the crosstalk sequence comes from unchangeable bits (opcode, functional code, constants), the weight is set larger
[Figure: example IAG with nodes A, B, C, D, E and edge weights 11 1 0 6 6 1 1]
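As an illustration of how such an edge weight might be computed, here is a minimal sketch that counts 4C patterns (010 <-> 101 on three neighboring wires) between two instruction words placed on the bus back to back. The function name, the `fixed_mask` and `fixed_penalty` parameters, and the rule that a pattern counts as unchangeable only when all three wires lie in fixed fields are assumptions; the slides do not give the exact weighting. The sample words concatenate the 3-bit fields listed for instructions A and B on the Register Adjacency Graph slide, which is itself an assumed bus layout.

```python
def edge_weight(prev: int, curr: int, width: int,
                fixed_mask: int = 0, fixed_penalty: int = 10) -> int:
    """Weight of the IAG edge prev -> curr, counted in 4C patterns."""
    weight = 0
    for victim in range(1, width - 1):
        low = victim - 1
        before = (prev >> low) & 0b111   # victim plus both neighbors, before
        after = (curr >> low) & 0b111    # the same three wires, after
        if {before, after} == {0b010, 0b101}:
            window = 0b111 << low
            # Patterns that sit entirely in unchangeable bits cannot be fixed
            # by register renaming, so they get a larger (hypothetical) weight.
            unchangeable = (window & fixed_mask) == window
            weight += fixed_penalty if unchangeable else 1
    return weight


# Instruction words built from the slide's field codes (assumed layout:
# opcode in the top three bits, then three 3-bit operand fields).
A = 0b101_010_001_000   # ADD R2, R1, R0
B = 0b010_001_010_000   # MUL R1, R2, R0
OPCODE_MASK = 0b111 << 9
```

With these words, `edge_weight(A, B, 12)` finds a single 4C pattern, located in the opcode field; passing `fixed_mask=OPCODE_MASK` raises its weight to the penalty value.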
10
Instruction Re-scheduling
Given a weighted Instruction Adjacency Graph (IAG)
Model instruction re-scheduling as a Traveling Salesman Problem (TSP) on the IAG
Find a minimum-weight path that contains each node exactly once
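The path formulation can be sketched with an exhaustive search, which is only practical for small basic blocks; the presentation does not say which TSP solver was used. The function name, the `must_precede` representation of data dependencies, and the sample weights below are hypothetical.

```python
from itertools import permutations


def best_order(nodes, weight, must_precede=()):
    """Return (order, cost) of a minimum-weight path visiting every node once.

    weight(a, b) gives the IAG edge weight for b issued right after a.
    must_precede holds (a, b) pairs meaning a must stay before b.
    """
    best, best_cost = None, None
    for order in permutations(nodes):
        position = {n: k for k, n in enumerate(order)}
        if any(position[a] > position[b] for a, b in must_precede):
            continue  # violates a data dependency
        cost = sum(weight(a, b) for a, b in zip(order, order[1:]))
        if best_cost is None or cost < best_cost:
            best, best_cost = order, cost
    return best, best_cost


# Hypothetical symmetric weights for a four-instruction block.
W = {("A", "B"): 6, ("A", "C"): 1, ("A", "D"): 1,
     ("B", "C"): 0, ("B", "D"): 6, ("C", "D"): 1}


def sample_weight(a, b):
    return W[(a, b)] if (a, b) in W else W[(b, a)]
```

For example, `best_order("ABCD", sample_weight, must_precede=[("A", "B")])` keeps A ahead of B while lowering the total weight from 7 for the original order A, B, C, D to 2.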
11
Results of TSP
Original sequence: weight 18
Minimum-weight sequence: weight 8
[Figure: the example IAG (nodes A, B, C, D, E; edge weights 11 1 0 6 6 1 1) shown once for each sequence]
12
Step 3: Register Renaming
Registers can be renamed as long as live-in/live-out registers and system-reserved registers are not renamed.
Weighted Register Adjacency Graph (RAG)
  Node: register
  Edge between nodes RA and RB: registers RA and RB are adjacent to each other
  Weight: frequency
13
Register Adjacency Graph
[Figure: RAG with nodes R0, R1, R2, 4, R3, R4, R5 and edge weights 1 3 1 2 1 1 1 1 1]

Label   Instruction        Encoding
A       ADD R2, R1, R0     101, 010, 001, 000
C       XOR R4, R0, R2     000, 100, 000, 010
B       MUL R1, R2, R0     010, 001, 010, 000
D       BIS R3, R1, 4      011, 011, 001, 100
E       BIS R5, R3, R4     011, 101, 011, 100
14
4C Crosstalk-free Cliques
To rename all registers at once, a database containing all 4C crosstalk-free cliques of 5-bit codes is pre-constructed.
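To make the idea of a crosstalk-free clique concrete, the sketch below treats two 5-bit codes as compatible when switching between them never produces a 010 <-> 101 pattern on three neighboring wires, and then grows one clique greedily. This is an assumed reading of the slide: the presentation's database holds all such cliques and its construction method is not given, and the sketch ignores interaction with wires in neighboring instruction fields. All names are hypothetical.

```python
from itertools import combinations

WIDTH = 5  # register codes are 5 bits wide on the slide


def is_4c_free(a: int, b: int, width: int = WIDTH) -> bool:
    """True if switching between codes a and b causes no 4C pattern."""
    for low in range(width - 2):
        wa = (a >> low) & 0b111
        wb = (b >> low) & 0b111
        if {wa, wb} == {0b010, 0b101}:
            return False
    return True


def is_clique(codes, width: int = WIDTH) -> bool:
    """True if every pair of codes in the set is 4C-free."""
    return all(is_4c_free(a, b, width) for a, b in combinations(codes, 2))


def greedy_clique(width: int = WIDTH):
    """Grow one 4C-free clique by scanning codes in increasing order."""
    clique = []
    for code in range(1 << width):
        if all(is_4c_free(code, member, width) for member in clique):
            clique.append(code)
    return clique
```

Registers whose codes all come from one such clique can follow each other on the bus in any order without a 4C transition inside the register field, which is what makes a clique a safe target for renaming.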
15
Register Renaming Algorithm
REGISTER-RENAMING()
  Construct RAG
  Do clique partitioning on RAG
  while (RAG is not NULL) {
    Select a clique with maximum weight
    Reassign all registers in the clique
    Remove the clique from RAG
  }
16
Example of Register Renaming
[Figure: RAG partitioned into cliques A, B, C, and the renamed cliques A', B', C' with their register codes]
Assumption: R0 and R1 are live-in registers; R5 is a live-out register
17
Step 4: NOP Insertion
A NOP
  Is inserted between two instructions that induce 4C crosstalk
  Is crosstalk-free with all other instructions
  Does not change program functionality
  Takes one clock period to execute and one memory location to store (overhead)
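The insertion step can be sketched as a single pass over the instruction words of a basic block. The sketch assumes a hypothetical all-zero NOP encoding: a transition to or from an all-zero word moves every switching wire in the same direction, so it cannot form a 010 <-> 101 pattern. A real instruction set's NOP encoding may differ, and the function names are illustrative.

```python
def has_4c(prev: int, curr: int, width: int) -> bool:
    """True if prev -> curr contains a 010 <-> 101 pattern on any three wires."""
    for low in range(width - 2):
        before = (prev >> low) & 0b111
        after = (curr >> low) & 0b111
        if {before, after} == {0b010, 0b101}:
            return True
    return False


def insert_nops(words, width: int, nop: int = 0):
    """Insert `nop` between every adjacent pair of words that induces 4C."""
    result = []
    for word in words:
        if result and has_4c(result[-1], word, width):
            result.append(nop)  # costs one cycle and one memory location
        result.append(word)
    return result
```

For a 3-bit toy bus, `insert_nops([0b010, 0b101, 0b111], 3)` places one NOP between the first two words and leaves the last pair untouched.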
18
Benchmarking Results
19
Static Instruction Count Overhead: SPEC2000 (CINT), 4C crosstalk-free
Legend: ORI, NP, CE

Benchmark   Instr ORI   Instr NP   Ratio NP   Instr CE   Ratio CE
bzip2       47404       51559      8.77%      48645      2.62%
crafty      108880      118942     9.24%      111719     2.61%
eon         196768      211921     7.70%      202526     2.93%
gap         232176      252152     8.60%      237368     2.24%
gcc         497084      534083     7.44%      507563     2.11%
gzip        51072       55413      8.50%      52365      2.53%
mcf         40680       44343      9.00%      41914      3.03%
parser      77888       84175      8.07%      79558      2.14%
perlbmk     216864      235093     8.41%      224078     3.33%
twolf       111820      122778     9.80%      115098     2.93%
vortex      204356      220421     7.86%      209400     2.47%
vpr         92444       101005     9.26%      95141      2.92%
average                            8.55%                 2.65%
20
Dynamic Instruction Count Overhead: SPEC2000 (CINT)
Legend: ORI, NP, CE

Benchmark   Instr ORI     Instr NP      Ratio NP   Instr CE      Ratio CE
bzip2       8822143331    9309886706    5.53%      8934249068    1.27%
crafty      4264781568    4898845793    14.87%     4350000207    2.00%
eon         94039360      107175057     13.97%     95720980      1.79%
gap         1246758360    1379345078    10.63%     1258166081    0.91%
gcc         2016086377    2185649725    8.41%      2031553326    0.77%
gzip        3367274764    3662499683    8.77%      3392763657    0.76%
mcf         259643303     277640753     6.93%      262536416     1.11%
parser      4203522388    4464473569    6.21%      4224666434    0.50%
perlbmk     2061247362    2315079234    12.31%     2075082673    0.67%
twolf       258759073     289936212     12.05%     261004898     0.87%
vortex      9808168516    10772114387   9.83%      9905847855    1.00%
vpr         692814071     781107898     12.74%     703310218     1.52%
average                                 10.19%                   1.10%
21
Computation of Improved Performance Ratio
0.10 um, bus length: 10 mm
Cycle length
  With 4C: 1
  Without 4C: 0.8
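The slide does not spell out the formula, so the following is a sketch under the assumption that the improvement is the relative reduction in total execution time, with execution time taken as dynamic instruction count times cycle length. The function name and parameters are hypothetical.

```python
def improvement(instr_ori: int, instr_ce: int, cycle_ratio: float = 0.8) -> float:
    """Fractional reduction in execution time after crosstalk elimination.

    instr_ori   dynamic instruction count of the original program (cycle length 1)
    instr_ce    dynamic instruction count after crosstalk elimination
    cycle_ratio cycle length without 4C relative to the cycle length with 4C
    """
    return 1.0 - (instr_ce * cycle_ratio) / instr_ori


# bzip2 dynamic counts from the SPEC2000 (CINT) dynamic overhead table.
bzip2_rounded = improvement(8822143331, 8934249068, 0.8)
bzip2_unrounded = improvement(8822143331, 8934249068, 722 / 919)
```

With the rounded ratio 0.8 this gives about 18.98% for bzip2. The 20.44% reported for bzip2 in the SPEC2000 improvement table is reproduced when the ratio is instead 722/919, the 4C-free over 4C delay for the 10 mm, 60x row of the worst-case delay table; that the authors used the unrounded value is an inference from the numbers, not something the slides state.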
22
Improved Total Performance Ratio: SPEC2000 (CINT)

Benchmark   Instr ORI     Instr CE      Ratio IMP
bzip2       8822143331    8934249068    20.44%
crafty      4264781568    4350000207    19.87%
eon         94039360      95720980      20.03%
gap         1246758360    1258166081    20.72%
gcc         2016086377    2031553326    20.83%
gzip        3367274764    3392763657    20.84%
mcf         259643303     262536416     20.56%
parser      4203522388    4224666434    21.04%
perlbmk     2061247362    2075082673    20.91%
twolf       258759073     261004898     20.75%
vortex      9808168516    9905847855    20.65%
vpr         692814071     703310218     20.25%
average                                 20.57%
23
Thank you
24
Static Instruction Count Overhead: DSPstone
Legend: ORI, NP, CE

Benchmark                Instr ORI   Instr NP   Ratio NP   Instr CE   Ratio CE
complex_multiply         17412       18933      8.74%      18060      3.72%
complex_update           17496       19036      8.80%      18165      3.82%
convolution              17424       18949      8.75%      18068      3.70%
dot_product              17428       18944      8.70%      18073      3.70%
fir                      17452       18981      8.76%      18103      3.73%
fir2dim                  17740       19331      8.97%      18431      3.90%
iir_biquad_N_sections    17488       19025      8.79%      18143      3.75%
iir_biquad_one_section   17416       18942      8.76%      18058      3.69%
lms                      17492       19046      8.88%      18154      3.78%
matrix                   17496       19045      8.85%      18153      3.76%
matrix1x3                17444       18961      8.70%      18085      3.67%
n_complex_updates        17564       19134      8.94%      18244      3.87%
n_real_updates           17484       19024      8.81%      18154      3.83%
real_update              17408       18933      8.76%      18050      3.69%
average                                         8.80%                 3.76%
25
Dynamic Instruction Count Overhead: DSPstone
Legend: ORI, NP, CE

Benchmark                Instr ORI   Instr NP   Ratio NP   Instr CE   Ratio CE
complex_multiply         1649        1879       13.95%     1659       0.61%
complex_update           1767        2018       14.20%     1778       0.62%
convolution              2400        2725       13.54%     2414       0.58%
dot_product              1709        1951       14.16%     1720       0.64%
fir                      2842        3165       11.37%     2853       0.39%
fir2dim                  9296        10259      10.36%     9338       0.45%
iir_biquad_N_sections    2675        2935       9.72%      2684       0.34%
iir_biquad_one_section   1663        1892       13.77%     1673       0.60%
lms                      3069        3578       16.59%     3096       0.88%
matrix                   34038       37596      10.45%     34049      0.03%
matrix1x3                2078        2345       12.85%     2098       0.96%
n_complex_updates        4867        5615       15.37%     4875       0.16%
n_real_updates           3187        3542       11.14%     3197       0.31%
real_update              1654        1885       13.97%     1662       0.48%
average                                         12.96%                0.50%
26
Improved Total Performance Ratio: DSPstone

Benchmark                Instr ORI   Instr CE   Ratio IMP
complex_multiply         1649        1659       20.96%
complex_update           1767        1778       20.95%
convolution              2400        2414       20.98%
dot_product              1709        1720       20.93%
fir                      2842        2853       21.13%
fir2dim                  9296        9338       21.08%
iir_biquad_N_sections    2675        2684       21.17%
iir_biquad_one_section   1663        1673       20.96%
lms                      3069        3096       20.75%
matrix                   34038       34049      21.41%
matrix1x3                2078        2098       20.68%
n_complex_updates        4867        4875       21.31%
n_real_updates           3187        3197       21.19%
real_update              1654        1662       21.06%
average                                         19.78%