Performance Driven Crosstalk Elimination at Compiler Level TingTing Hwang Department of Computer Science Tsing Hua University, Taiwan.

1 Performance Driven Crosstalk Elimination at Compiler Level TingTing Hwang Department of Computer Science Tsing Hua University, Taiwan

2 Definition of 4C Crosstalk
[Figure: a 3C and 4C crosstalk data transmission sequence on a bus, with the victim and aggressor wires labeled]
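A minimal sketch of how a bus transition could be classified. It assumes the commonly used coupling model in which a switching victim wire sees a factor of |dv - dl| + |dv - dr|, where dv, dl, and dr are the transitions (-1, 0, or +1) on the victim and on its left and right aggressors. The function name `crosstalk_class` and the handling of edge wires (a missing neighbour contributes nothing) are assumptions for illustration and are not taken from the slides.

```python
def crosstalk_class(prev: str, curr: str) -> int:
    """Worst-case crosstalk class (0..4) over all wires for one bus transition.

    prev and curr are bit strings of equal width, e.g. "010" -> "101".
    """
    if len(prev) != len(curr):
        raise ValueError("bus words must have the same width")
    # Per-wire transition: +1 rising, -1 falling, 0 quiet.
    delta = [int(c) - int(p) for p, c in zip(prev, curr)]
    worst = 0
    for i, d in enumerate(delta):
        if d == 0:
            continue  # a quiet wire is not treated as a victim here
        # Assumption: an edge wire has no neighbour on one side, so that
        # side adds no coupling (modelled as a neighbour moving with it).
        left = delta[i - 1] if i > 0 else d
        right = delta[i + 1] if i + 1 < len(delta) else d
        worst = max(worst, abs(d - left) + abs(d - right))
    return worst


print(crosstalk_class("010", "101"))  # victim opposes both aggressors
print(crosstalk_class("010", "100"))  # victim opposes one, other is quiet
```

Under this model, a 4C transition is one where the victim switches in the opposite direction to both aggressors, and a 3C transition is one where it opposes one aggressor while the other stays quiet.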

3 Worst-Case Delay Comparison (ps)
| length | buf size | 4C+3C free | 4C free | 4C |
|---|---|---|---|---|
| 5mm | 30x | 241 (36%) | 516 (77%) | 665 |
| 5mm | 60x | 213 (52%) | 399 (99%) | 402 |
| 5mm | 120x | 136 (48%) | 196 (70%) | 279 |
| 10mm | 30x | 437 (42%) | 912 (88%) | 1026 |
| 10mm | 60x | 413 (44%) | 722 (78%) | 919 |
| 10mm | 120x | 270 (49%) | 379 (69%) | 548 |
| 20mm | 30x | 793 (50%) | 1068 (67%) | 1586 |
| 20mm | 60x | 691 (44%) | 1161 (74%) | 1561 |
| 20mm | 120x | 580 (42%) | 969 (70%) | 1365 |
| average | | 45% | 77% | 1 |
Summarized from DATE 2004 "Exploiting Crosstalk to Speed up On-chip Buses," Chunjie Duan

4 Previous Work
- Bus encoding (expand Boolean space)
- Hardware overhead: encoders, decoders, additional wires
[Figure: Sender -> Encoder -> channel -> Decoder -> Receiver, with signal widths b, n, b]
Copied from ICCAD 2001 "Bus Encoding to Prevent Crosstalk Delay," Bret Victor

5 Motivation
- Previous work uses codec design
  - Logic level: no information about the data
  - Large area overhead (e.g., 128 bus width: 128 + 85)
- Data sequences on an instruction bus
  - Known at compile time
- To eliminate crosstalk data sequences:
  - Instruction re-scheduling
  - Register renaming

6 Problem Definition and Target Architecture
- Given a program, generate a 4C (3C-and-4C) crosstalk-free program (on an instruction bus)
- Performed during compiler optimization

7 Crosstalk Elimination in Compiler Optimization
Input: binary executable program
- Step 1: Decompose the input program into basic blocks
- Step 2: Instruction rescheduling (example: interchange I4 and I5)
- Step 3: Register renaming (example: rename R2 to R3)
- Step 4: NOP insertion
Output: crosstalk-free binary executable program

8 Step 2: Instruction Re-scheduling
- Instructions are reordered under data-dependency constraints
- Construct a weighted Instruction Adjacency Graph (IAG)

9 Instruction Adjacency Graph
- Node: instruction
- Edge: execution sequence
- Weight: the number of crosstalk patterns
  - If the crosstalk sequence comes from unchangeable bits (opcode, function code, constants), the weight is set larger
[Figure: example IAG with nodes A, B, C, D, E and edge weights 11, 1, 0, 6, 6, 1, 1]
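One way an IAG edge weight could be computed is sketched below. It counts the 4C victim positions between two instruction words and charges more when the victim sits in an unchangeable bit position. The function name `iag_weight`, the `fixed_bits` set, and the `penalty` value of 5 are hypothetical; the slides only say that the weight is "set to be larger" for unchangeable bits.

```python
def iag_weight(a: str, b: str, fixed_bits=frozenset(), penalty: int = 5) -> int:
    """Count 4C patterns when word a is followed by word b on the bus.

    fixed_bits holds the bit positions (0 = leftmost) that renaming cannot
    change, such as opcode or constant fields. penalty is an assumed value.
    """
    if len(a) != len(b):
        raise ValueError("instruction words must have the same width")
    delta = [int(y) - int(x) for x, y in zip(a, b)]
    weight = 0
    for i in range(1, len(delta) - 1):
        d = delta[i]
        # 4C: the victim switches against both of its neighbours.
        if d != 0 and delta[i - 1] == -d and delta[i + 1] == -d:
            weight += penalty if i in fixed_bits else 1
    return weight


print(iag_weight("010101", "101010"))
print(iag_weight("010101", "101010", fixed_bits={1, 2}))
```

Weighting the unchangeable positions more heavily steers the scheduler away from adjacencies that register renaming cannot repair later.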

10 Instruction Re-scheduling
- A weighted Instruction Adjacency Graph (IAG)
- Model instruction re-scheduling as a Traveling Salesman Problem (TSP) on the IAG
- Goal: find a minimum-weight path that contains each node exactly once
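The search for a minimum-weight path can be sketched with a brute-force enumeration, which is only practical for small basic blocks. The slides do not say which TSP solver is used, so this is an illustration of the problem and not the authors' method. The node names, edge weights, and the dependency pair below are made up for the example.

```python
from itertools import permutations


def best_schedule(nodes, weight, deps):
    """Return (cost, order) of the cheapest path visiting each node once.

    weight maps frozenset({a, b}) to the IAG edge weight.
    deps is a set of (before, after) data-dependency pairs.
    """
    best = None
    for order in permutations(nodes):
        pos = {n: i for i, n in enumerate(order)}
        # Skip orders that violate a data dependency.
        if any(pos[a] > pos[b] for a, b in deps):
            continue
        cost = sum(weight[frozenset(p)] for p in zip(order, order[1:]))
        if best is None or cost < best[0]:
            best = (cost, order)
    return best


# Hypothetical 4-instruction block; A must stay before D.
W = {
    frozenset("AB"): 5, frozenset("AC"): 1, frozenset("AD"): 2,
    frozenset("BC"): 1, frozenset("BD"): 4, frozenset("CD"): 3,
}
print(best_schedule("ABCD", W, {("A", "D")}))
```

For larger blocks a heuristic TSP solver would be needed, since the number of permutations grows factorially.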

11 Results of TSP
[Figure: the example IAG (nodes A, B, C, D, E; edge weights 11, 1, 0, 6, 6, 1, 1) shown twice, once per sequence]
- Original sequence: weight 18
- Minimum-weight sequence: weight 8

12 Step 3: Register Renaming
- Registers can be renamed as long as live-in/live-out registers and system-reserved registers are not renamed
- Weighted Register Adjacency Graph (RAG)
  - Node: register
  - Edge between nodes RA and RB: registers RA and RB are adjacent to each other
  - Weight: frequency

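13 Register Adjacency Graph
[Figure: RAG with nodes R0, R1, R2, 4, R3, R4, R5 and edge weights 1, 3, 1, 2, 1, 1, 1, 1, 1]
| Label | Instruction | Encoding |
|---|---|---|
| A | ADD R2, R1, R0 | 101, 010, 001, 000 |
| C | XOR R4, R0, R2 | 000, 100, 000, 010 |
| B | MUL R1, R2, R0 | 010, 001, 010, 000 |
| D | BIS R3, R1, 4 | 011, 011, 001, 100 |
| E | BIS R5, R3, R4 | 011, 101, 011, 100 |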
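A sketch of how the RAG for this example could be built. It assumes that two operands are adjacent when they occupy the same operand field in consecutive instructions, and that the immediate 4 is treated as a node like a register. Both are assumptions, although under them the result has one edge of weight 3, one of weight 2, and seven of weight 1, which matches the set of weights listed on the slide. The name `build_rag` is hypothetical.

```python
from collections import Counter

# Operand fields of the slide's instructions, in the listed order A, C, B, D, E.
SEQUENCE = [
    ("ADD", "R2", "R1", "R0"),
    ("XOR", "R4", "R0", "R2"),
    ("MUL", "R1", "R2", "R0"),
    ("BIS", "R3", "R1", "4"),
    ("BIS", "R5", "R3", "R4"),
]


def build_rag(sequence):
    """Map each unordered operand pair to how often it is adjacent."""
    rag = Counter()
    for prev, curr in zip(sequence, sequence[1:]):
        # Compare operands field by field, skipping the opcode at index 0.
        for a, b in zip(prev[1:], curr[1:]):
            if a != b:
                rag[frozenset((a, b))] += 1
    return rag


rag = build_rag(SEQUENCE)
for pair, w in sorted(rag.items(), key=lambda kv: -kv[1]):
    print(sorted(pair), w)
```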

14 4C Crosstalk-free Cliques
- To rename all registers at once, a database of all 4C crosstalk-free cliques of 5-bit codes is pre-constructed.
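A sketch of the membership test such a database could rely on: a set of codes forms a 4C crosstalk-free clique if no pair of codes in it produces a 4C pattern when one follows the other. The helper names `has_4c` and `is_4c_free_clique` are hypothetical, and the check looks only inside one register field, ignoring coupling across field boundaries.

```python
from itertools import combinations


def has_4c(a: str, b: str) -> bool:
    """True if switching from code a to code b puts 4C crosstalk on some wire."""
    delta = [int(y) - int(x) for x, y in zip(a, b)]
    return any(
        delta[i] != 0
        and delta[i - 1] == -delta[i]
        and delta[i + 1] == -delta[i]
        for i in range(1, len(delta) - 1)
    )


def is_4c_free_clique(codes) -> bool:
    """True if every pair of codes is 4C-free (the test is symmetric in a and b)."""
    return not any(has_4c(a, b) for a, b in combinations(codes, 2))


print(is_4c_free_clique(["00000", "00001", "00011", "00111", "01111"]))
print(is_4c_free_clique(["01010", "10101"]))
```

Registers in one RAG clique can then be assigned codes from one such set, so that any adjacency inside the clique is safe.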

15 Register Renaming Algorithm
REGISTER-RENAMING ( )
    Construct RAG
    Do clique partitioning on RAG
    while ( RAG is not NULL ) {
        Select a clique with maximum weight
        Reassign all registers in the clique
        Remove the clique from RAG
    }

16 Example of Register Renaming
[Figure: the RAG is partitioned into cliques A, B, and C, whose registers are reassigned 3-bit codes to give cliques A', B', and C']
Assumption: R0 and R1 are live-in registers; R5 is a live-out register

17 Step 4: NOP Insertion
An NOP:
- Is inserted between two instructions that induce 4C crosstalk
- Is crosstalk-free with all other instructions
- Does not change program functionality
- Takes one clock period to execute and one memory location to store (overhead)
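The insertion pass can be sketched as a single scan over the instruction words. The all-zero NOP encoding is an assumption chosen for the example: starting from or arriving at an all-zero word, every wire either stays quiet or moves in the same direction, so no 4C pattern can occur. The slides do not give the actual NOP encoding, and the 3-bit words are made up.

```python
def has_4c(a: str, b: str) -> bool:
    """True if word a followed by word b puts 4C crosstalk on some wire."""
    delta = [int(y) - int(x) for x, y in zip(a, b)]
    return any(
        delta[i] != 0
        and delta[i - 1] == -delta[i]
        and delta[i + 1] == -delta[i]
        for i in range(1, len(delta) - 1)
    )


def insert_nops(words, nop):
    """Insert nop between every consecutive pair of words that induces 4C."""
    out = []
    for w in words:
        if out and has_4c(out[-1], w):
            out.append(nop)
        out.append(w)
    return out


# Hypothetical 3-bit instruction words and an assumed all-zero NOP.
program = ["010", "101", "111"]
print(insert_nops(program, nop="000"))
```

Each inserted NOP costs one cycle and one memory word, which is why rescheduling and renaming run first to keep the number of insertions low.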

18 Benchmarking Results

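19 Static Instruction Count Overhead: SPEC2000 (CINT)
| Benchmark | Instr (ORI) | Instr (NP) | Ratio (NP) | Instr (CE) | Ratio (CE) |
|---|---|---|---|---|---|
| bzip2 | 47404 | 51559 | 8.77% | 48645 | 2.62% |
| crafty | 108880 | 118942 | 9.24% | 111719 | 2.61% |
| eon | 196768 | 211921 | 7.70% | 202526 | 2.93% |
| gap | 232176 | 252152 | 8.60% | 237368 | 2.24% |
| gcc | 497084 | 534083 | 7.44% | 507563 | 2.11% |
| gzip | 51072 | 55413 | 8.50% | 52365 | 2.53% |
| mcf | 40680 | 44343 | 9.00% | 41914 | 3.03% |
| parser | 77888 | 84175 | 8.07% | 79558 | 2.14% |
| perlbmk | 216864 | 235093 | 8.41% | 224078 | 3.33% |
| twolf | 111820 | 122778 | 9.80% | 115098 | 2.93% |
| vortex | 204356 | 220421 | 7.86% | 209400 | 2.47% |
| vpr | 92444 | 101005 | 9.26% | 95141 | 2.92% |
| average | | | 8.55% | | 2.65% |
4C Crosstalk-free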

20 Dynamic Instruction Count Overhead: SPEC2000 (CINT)
| Benchmark | Instr (ORI) | Instr (NP) | Ratio (NP) | Instr (CE) | Ratio (CE) |
|---|---|---|---|---|---|
| bzip2 | 8822143331 | 9309886706 | 5.53% | 8934249068 | 1.27% |
| crafty | 4264781568 | 4898845793 | 14.87% | 4350000207 | 2.00% |
| eon | 94039360 | 107175057 | 13.97% | 95720980 | 1.79% |
| gap | 1246758360 | 1379345078 | 10.63% | 1258166081 | 0.91% |
| gcc | 2016086377 | 2185649725 | 8.41% | 2031553326 | 0.77% |
| gzip | 3367274764 | 3662499683 | 8.77% | 3392763657 | 0.76% |
| mcf | 259643303 | 277640753 | 6.93% | 262536416 | 1.11% |
| parser | 4203522388 | 4464473569 | 6.21% | 4224666434 | 0.50% |
| perlbmk | 2061247362 | 2315079234 | 12.31% | 2075082673 | 0.67% |
| twolf | 258759073 | 289936212 | 12.05% | 261004898 | 0.87% |
| vortex | 9808168516 | 10772114387 | 9.83% | 9905847855 | 1.00% |
| vpr | 692814071 | 781107898 | 12.74% | 703310218 | 1.52% |
| average | | | 10.19% | | 1.10% |

21 Computation of Improved Performance Ratio
- 0.10 um, bus length: 10 mm
- Cycle length
  - With 4C: 1
  - Without 4C: 0.8
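The slide does not spell out the formula, so the sketch below assumes that execution time is the dynamic instruction count multiplied by the cycle length, and that the improved ratio is the relative reduction in execution time. The function name `improved_ratio` and its parameters are hypothetical.

```python
def improved_ratio(instr_ori: int, instr_ce: int,
                   cycle_free: float, cycle_4c: float = 1.0) -> float:
    """Relative reduction in execution time after crosstalk elimination.

    Assumes time = dynamic instruction count * cycle length.
    """
    time_ori = instr_ori * cycle_4c    # original program, cycle limited by 4C
    time_ce = instr_ce * cycle_free    # more instructions, shorter cycle
    return 1.0 - time_ce / time_ori


# bzip2 dynamic instruction counts from the SPEC2000 results.
ORI, CE = 8822143331, 8934249068
print(f"{improved_ratio(ORI, CE, 0.8):.2%}")
print(f"{improved_ratio(ORI, CE, 722 / 919):.2%}")
```

With the rounded cycle length of 0.8, this formula gives about 18.98% for bzip2. Using 722/919 (the 4C-free and 4C delays of the 10 mm, 60x row in the worst-case delay comparison) it gives about 20.44%, which agrees with the tabulated bzip2 ratio, so the reported figures appear to use the unrounded delay ratio.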

22 Improved Total Performance Ratio: SPEC2000 (CINT)
| Benchmark | Instr (ORI) | Instr (CE) | Ratio (IMP) |
|---|---|---|---|
| bzip2 | 8822143331 | 8934249068 | 20.44% |
| crafty | 4264781568 | 4350000207 | 19.87% |
| eon | 94039360 | 95720980 | 20.03% |
| gap | 1246758360 | 1258166081 | 20.72% |
| gcc | 2016086377 | 2031553326 | 20.83% |
| gzip | 3367274764 | 3392763657 | 20.84% |
| mcf | 259643303 | 262536416 | 20.56% |
| parser | 4203522388 | 4224666434 | 21.04% |
| perlbmk | 2061247362 | 2075082673 | 20.91% |
| twolf | 258759073 | 261004898 | 20.75% |
| vortex | 9808168516 | 9905847855 | 20.65% |
| vpr | 692814071 | 703310218 | 20.25% |
| average | | | 20.57% |

23 Thank you

24 Static Instruction Count Overhead: DSPstone
| Benchmark | Instr (ORI) | Instr (NP) | Ratio (NP) | Instr (CE) | Ratio (CE) |
|---|---|---|---|---|---|
| complex_multiply | 17412 | 18933 | 8.74% | 18060 | 3.72% |
| complex_update | 17496 | 19036 | 8.80% | 18165 | 3.82% |
| convolution | 17424 | 18949 | 8.75% | 18068 | 3.70% |
| dot_product | 17428 | 18944 | 8.70% | 18073 | 3.70% |
| fir | 17452 | 18981 | 8.76% | 18103 | 3.73% |
| fir2dim | 17740 | 19331 | 8.97% | 18431 | 3.90% |
| iir_biquad_N_sections | 17488 | 19025 | 8.79% | 18143 | 3.75% |
| iir_biquad_one_section | 17416 | 18942 | 8.76% | 18058 | 3.69% |
| lms | 17492 | 19046 | 8.88% | 18154 | 3.78% |
| matrix | 17496 | 19045 | 8.85% | 18153 | 3.76% |
| matrix1x3 | 17444 | 18961 | 8.70% | 18085 | 3.67% |
| n_complex_updates | 17564 | 19134 | 8.94% | 18244 | 3.87% |
| n_real_updates | 17484 | 19024 | 8.81% | 18154 | 3.83% |
| real_update | 17408 | 18933 | 8.76% | 18050 | 3.69% |
| average | | | 8.80% | | 3.76% |

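25 Dynamic Instruction Count Overhead: DSPstone
| Benchmark | Instr (ORI) | Instr (NP) | Ratio (NP) | Instr (CE) | Ratio (CE) |
|---|---|---|---|---|---|
| complex_multiply | 1649 | 1879 | 13.95% | 1659 | 0.61% |
| complex_update | 1767 | 2018 | 14.20% | 1778 | 0.62% |
| convolution | 2400 | 2725 | 13.54% | 2414 | 0.58% |
| dot_product | 1709 | 1951 | 14.16% | 1720 | 0.64% |
| fir | 2842 | 3165 | 11.37% | 2853 | 0.39% |
| fir2dim | 9296 | 10259 | 10.36% | 9338 | 0.45% |
| iir_biquad_N_sections | 2675 | 2935 | 9.72% | 2684 | 0.34% |
| iir_biquad_one_section | 1663 | 1892 | 13.77% | 1673 | 0.60% |
| lms | 3069 | 3578 | 16.59% | 3096 | 0.88% |
| matrix | 34038 | 37596 | 10.45% | 34049 | 0.03% |
| matrix1x3 | 2078 | 2345 | 12.85% | 2098 | 0.96% |
| n_complex_updates | 4867 | 5615 | 15.37% | 4875 | 0.16% |
| n_real_updates | 3187 | 3542 | 11.14% | 3197 | 0.31% |
| real_update | 1654 | 1885 | 13.97% | 1662 | 0.48% |
| average | | | 12.96% | | 0.50% |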

26 Improved Total Performance Ratio: DSPstone
| Benchmark | Instr (ORI) | Instr (CE) | Ratio (IMP) |
|---|---|---|---|
| complex_multiply | 1649 | 1659 | 20.96% |
| complex_update | 1767 | 1778 | 20.95% |
| convolution | 2400 | 2414 | 20.98% |
| dot_product | 1709 | 1720 | 20.93% |
| fir | 2842 | 2853 | 21.13% |
| fir2dim | 9296 | 9338 | 21.08% |
| iir_biquad_N_sections | 2675 | 2684 | 21.17% |
| iir_biquad_one_section | 1663 | 1673 | 20.96% |
| lms | 3069 | 3096 | 20.75% |
| matrix | 34038 | 34049 | 21.41% |
| matrix1x3 | 2078 | 2098 | 20.68% |
| n_complex_updates | 4867 | 4875 | 21.31% |
| n_real_updates | 3187 | 3197 | 21.19% |
| real_update | 1654 | 1662 | 21.06% |
| average | | | 19.78% |

