Performance Driven Crosstalk Elimination at Compiler Level
  TingTing Hwang
  Department of Computer Science, Tsing Hua University, Taiwan
Definition of 4C Crosstalk
  - Figure: a 3C and a 4C crosstalk data transmission sequence on a bus (wires labeled victim and aggressor)
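The figure for this slide did not survive, so here is a minimal sketch of the commonly used 0C to 4C classification. It assumes each wire's transition is -1, 0 or +1, that a victim's class is |2*d_victim - d_left - d_right|, and that wires beyond the bus edge are quiet. The function names are hypothetical and not taken from the slides.

```python
def crosstalk_class(prev_word: str, next_word: str, i: int) -> int:
    """Crosstalk class (0..4) of wire i for the bus transition prev_word -> next_word."""
    def delta(j: int) -> int:
        # Assumption: wires outside the bus are quiet (no transition).
        if j < 0 or j >= len(prev_word):
            return 0
        return int(next_word[j]) - int(prev_word[j])

    d = delta(i)
    if d == 0:
        return 0  # a victim that does not switch has no transition delay to classify
    return abs(2 * d - delta(i - 1) - delta(i + 1))


def max_crosstalk(prev_word: str, next_word: str) -> int:
    """Worst crosstalk class over all wires of the bus."""
    return max(crosstalk_class(prev_word, next_word, i) for i in range(len(prev_word)))


if __name__ == "__main__":
    print(max_crosstalk("010", "101"))  # victim opposes both aggressors: 4C
    print(max_crosstalk("010", "100"))  # one aggressor opposes, one is quiet: 3C
```

Under this definition a 4C event needs the victim and both aggressors to switch, with both aggressors moving opposite to the victim.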
Worst-Case Delay Comparison (ps)
  Length  Buffer size  4C+3C free   4C free      4C
  5mm     30x          241 (36%)    516 (77%)    665
  5mm     60x          213 (52%)    399 (99%)    402
  5mm     120x         136 (48%)    196 (70%)    279
  10mm    30x          437 (42%)    912 (88%)    ?
  10mm    60x          413 (44%)    722 (78%)    919
  10mm    120x         270 (49%)    379 (69%)    548
  20mm    30x          793 (50%)    1068 (67%)   ?
  20mm    60x          691 (44%)    1161 (74%)   ?
  20mm    120x         580 (42%)    969 (70%)    1365
  average              45%          77%          1
  Summarized from "Exploiting Crosstalk to Speed up On-chip Buses," Chunjie Duan, DATE 2004
Previous Work
  - Bus encoding (expand Boolean space)
  - Hardware overhead: encoders, decoders, additional wires
  - Figure: Sender -> Encoder -> channel -> Decoder -> Receiver (widths b, n, b)
  - Figure copied from "Bus Encoding to Prevent Crosstalk Delay," Bret Victor, ICCAD 2001
Motivation
  - Previous work using codec design
    - Logic level: no information about the data
    - Large area overhead (e.g., 128 bus width: )
  - Data sequences on an instruction bus
    - Known at compile time
  - To eliminate crosstalk data sequences:
    - Instruction re-scheduling
    - Register renaming
Problem Definition and Target Architecture
  - Given a program,
  - generate a 4C (3C-and-4C) crosstalk-free program (on an instruction bus)
  - Performed during compiler optimization
Crosstalk Elimination in Compiler Optimization
  - Input: binary executable program
  - Step 1: Decompose the input program into basic blocks
  - Step 2: Instruction rescheduling (example: interchange I4 and I5)
  - Step 3: Register renaming (example: rename R2 to R3)
  - Step 4: NOP insertion (result: crosstalk-free)
  - Output: crosstalk-free binary executable program
Step 2: Instruction Re-scheduling
  - Instructions are reordered under data-dependency constraints
  - Construct a weighted Instruction Adjacency Graph (IAG)
Instruction Adjacency Graph
  - Node: instruction
  - Edge: execution sequence
  - Weight: the number of crosstalk patterns
    - If the crosstalk sequence comes from unchangeable bits, the weight is set larger
    - Unchangeable bits: opcode, function code, constants
  - Figure: example IAG with nodes A, B, C, D, E
Instruction Re-scheduling
  - Input: a weighted Instruction Adjacency Graph (IAG)
  - Model instruction re-scheduling as a Traveling Salesman Problem (TSP) on the IAG
  - Goal: find a minimum-weight path that visits each node exactly once
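The path search can be sketched as a brute-force enumeration, which is only practical for small basic blocks. This is an illustration under stated assumptions, not the solver used in the talk: the edge weight is simply the number of wires in 4C between two encodings, the extra weight for unchangeable bits is omitted, and the names `count_4c`, `best_schedule` and the 4-bit encodings are hypothetical.

```python
from itertools import permutations


def count_4c(prev_word: str, next_word: str) -> int:
    """Number of interior wires whose transition is opposed by both neighbours."""
    d = [int(b) - int(a) for a, b in zip(prev_word, next_word)]
    return sum(
        1
        for i in range(1, len(d) - 1)
        if d[i] != 0 and d[i - 1] == -d[i] and d[i + 1] == -d[i]
    )


def best_schedule(encodings, must_precede):
    """Return (order, weight) of a minimum-weight order that respects dependencies.

    encodings:    dict mapping instruction name -> bit string on the bus
    must_precede: set of (earlier, later) name pairs (data dependencies)
    """
    best_order, best_weight = None, None
    for order in permutations(encodings):
        pos = {name: k for k, name in enumerate(order)}
        if any(pos[a] > pos[b] for a, b in must_precede):
            continue  # this order violates a data dependency
        weight = sum(
            count_4c(encodings[x], encodings[y]) for x, y in zip(order, order[1:])
        )
        if best_weight is None or weight < best_weight:
            best_order, best_weight = order, weight
    return best_order, best_weight


if __name__ == "__main__":
    # Hypothetical encodings; I1 must stay before I3.
    enc = {"I1": "0101", "I2": "1010", "I3": "0000"}
    print(best_schedule(enc, {("I1", "I3")}))
```

In the example, the original order I1, I2, I3 has weight 2, while moving I3 between I1 and I2 removes both 4C events.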
Results of TSP
  - Original sequence (nodes A, B, C, D, E): weight 18
  - Minimum-weight sequence: weight 8
Step 3: Register Renaming
  - Registers can be renamed, except live-in/live-out registers and system-reserved registers
  - Weighted Register Adjacency Graph (RAG)
    - Node: register
    - Edge between nodes RA and RB: registers RA and RB are adjacent to each other
    - Weight: frequency of the adjacency
Register Adjacency Graph
  - Figure: RAG built from the instructions below
  Label  Instruction       Encoding
  A      ADD R2, R1, R0    101, 010, 001, 000
  C      XOR R4, R0, R2    000, 100, 000, 010
  B      MUL R1, R2, R0    010, 001, 010, 000
  D      BIS R3, R1, 4     011, 011, 001, 100
  E      BIS R5, R3, R4    011, 101, 011, 100
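The slides do not spell out when two registers count as adjacent. The sketch below assumes one plausible reading: two registers are adjacent when they occupy the same operand field in consecutive instructions, and the edge weight counts how often that happens. The execution order A, B, C, D, E and the function name `build_rag` are also assumptions, so the result is not claimed to match the figure.

```python
from collections import Counter


def build_rag(instructions):
    """Build a weighted RAG as a Counter keyed by frozenset({RA, RB}).

    instructions: list of operand tuples in execution order;
                  non-register operands (constants) are given as None.
    """
    weights = Counter()
    for cur, nxt in zip(instructions, instructions[1:]):
        for ra, rb in zip(cur, nxt):
            if ra is None or rb is None or ra == rb:
                continue  # constants and identical registers add no edge
            weights[frozenset((ra, rb))] += 1
    return weights


if __name__ == "__main__":
    # Operands of instructions A..E from the slide, in the assumed order A, B, C, D, E.
    program = [
        ("R2", "R1", "R0"),  # A: ADD R2, R1, R0
        ("R1", "R2", "R0"),  # B: MUL R1, R2, R0
        ("R4", "R0", "R2"),  # C: XOR R4, R0, R2
        ("R3", "R1", None),  # D: BIS R3, R1, 4 (constant operand)
        ("R5", "R3", "R4"),  # E: BIS R5, R3, R4
    ]
    for edge, weight in sorted(build_rag(program).items(), key=lambda kv: sorted(kv[0])):
        print(sorted(edge), weight)
```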
4C Crosstalk-free Cliques
  - To rename all registers of a clique at once, a database of all 4C crosstalk-free cliques of 5-bit register codes is pre-constructed.
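A minimal sketch of what membership in such a database could mean: a set of 5-bit codes is treated as a crosstalk-free clique when no pair of codes produces a 4C transition. It assumes only the interior wires of the 5-bit field are checked (bits of neighbouring fields are ignored), and the greedy construction is an illustration, not the enumeration used in the talk. All function names are hypothetical.

```python
from itertools import combinations


def has_4c(a: int, b: int, width: int = 5) -> bool:
    """True if switching the field from code a to code b puts some wire in 4C."""
    d = [((b >> k) & 1) - ((a >> k) & 1) for k in range(width)]
    return any(
        d[i] != 0 and d[i - 1] == -d[i] and d[i + 1] == -d[i]
        for i in range(1, width - 1)
    )


def is_crosstalk_free_clique(codes, width: int = 5) -> bool:
    """True if no pair of codes causes 4C (the condition is symmetric in a and b)."""
    return not any(has_4c(a, b, width) for a, b in combinations(codes, 2))


def greedy_clique(width: int = 5):
    """Grow one crosstalk-free clique by scanning the codes in increasing order."""
    clique = []
    for code in range(2 ** width):
        if all(not has_4c(code, other, width) for other in clique):
            clique.append(code)
    return clique


if __name__ == "__main__":
    clique = greedy_clique()
    print([format(c, "05b") for c in clique])
    print(is_crosstalk_free_clique(clique))
```

Registers in one RAG clique can then be assigned codes from one such set, so that any two of them may follow each other on the bus without a 4C event inside the field.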
Register Renaming Algorithm
  REGISTER-RENAMING ( )
  1. Construct RAG
  2. Do clique partitioning on RAG
  3. while ( RAG is not NULL ) {
  4.     Select a clique with maximum weight
  5.     Reassign all registers in the clique
  6.     Remove the clique from RAG
  7. }
Example of Register Renaming
  - Figure: RAG before renaming (regions labeled A, B, C) and after renaming (regions labeled A', B', C')
  - Assumption: R0 and R1 are live-in registers; R5 is a live-out register
Step 4: NOP Insertion
  - A NOP
    - is inserted between two instructions that induce 4C crosstalk
    - is crosstalk-free with all other instructions
    - does not change program functionality
    - takes one clock cycle to execute and one memory location to store (overhead)
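The insertion pass can be sketched as a single scan over the instruction words. The sketch assumes the NOP is encoded as all zeros: any transition to or from an all-zero word moves every switching wire in the same direction, so it can never create a 4C event. The real NOP encoding of the target architecture is not given in the slides, and the function names are hypothetical.

```python
def has_4c(prev_word: str, next_word: str) -> bool:
    """True if some interior wire switches against both of its neighbours."""
    d = [int(b) - int(a) for a, b in zip(prev_word, next_word)]
    return any(
        d[i] != 0 and d[i - 1] == -d[i] and d[i + 1] == -d[i]
        for i in range(1, len(d) - 1)
    )


def insert_nops(program, nop):
    """Insert nop between every adjacent pair of words that would cause 4C."""
    result = []
    for word in program:
        if result and has_4c(result[-1], word):
            result.append(nop)  # costs one cycle and one memory location
        result.append(word)
    return result


if __name__ == "__main__":
    # Hypothetical 4-bit instruction words; assumed all-zero NOP.
    print(insert_nops(["0101", "1010", "0000"], nop="0000"))
```

Because each NOP adds a cycle, this step runs last, after rescheduling and renaming have removed as many 4C sequences as they can.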
Benchmarking Results
Static Instruction Count Overhead: SPEC2000 (CINT), 4C crosstalk-free
  - Columns: Benchmark, Instr ORI, Instr NP, Ratio NP, Instr CE, Ratio CE
  - Benchmarks: bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr
  - Average: Ratio NP 8.55%, Ratio CE 2.65%
Dynamic Instruction Count Overhead: SPEC2000 (CINT)
  - Columns: Benchmark, Instr ORI, Instr NP, Ratio NP, Instr CE, Ratio CE
  - Benchmarks: bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr
  - Average: Ratio NP 10.19%, Ratio CE 1.10%
Computation of Improved Performance Ratio
  - 0.10 um, bus length: 10 mm
  - Cycle length
    - With 4C: 1
    - Without 4C: 0.8
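The formula itself is not preserved on this slide. One plausible reading, offered only as an assumption, is that total execution time equals dynamic instruction count times cycle length, and that the improvement is the relative reduction in total time. The function name and the instruction counts in the example are hypothetical, and the sketch is not claimed to reproduce the reported ratios.

```python
def improved_ratio(instr_ori: int, instr_ce: int,
                   cycle_with_4c: float = 1.0,
                   cycle_without_4c: float = 0.8) -> float:
    """Relative reduction in total execution time (assumed definition).

    instr_ori: dynamic instruction count of the original program
    instr_ce:  dynamic instruction count after crosstalk elimination
    """
    time_ori = instr_ori * cycle_with_4c    # original program runs at the 4C cycle length
    time_ce = instr_ce * cycle_without_4c   # crosstalk-free program runs at the shorter cycle
    return 1.0 - time_ce / time_ori


if __name__ == "__main__":
    # Hypothetical counts: about 1% extra instructions after elimination.
    print(f"{improved_ratio(1_000_000, 1_011_000):.2%}")
```

Under this reading, the shorter cycle pays off as long as the instruction count grows by less than 25%, since 1.25 x 0.8 = 1.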
Improved Total Performance Ratio: SPEC2000 (CINT)
  - Columns: Benchmark, Instr ORI, Instr CE, Ratio IMP
  - Benchmarks: bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr
  - Average: Ratio IMP 20.57%
Thank you
Static Instruction Count Overhead: DSPstone
  - Columns: Benchmark, Instr ORI, Instr NP, Ratio NP, Instr CE, Ratio CE
  - Benchmarks: complex_multiply, complex_update, convolution, dot_product, fir, fir2dim, iir_biquad_N_sections, iir_biquad_one_section, lms, matrix, matrix1x, n_complex_updates, n_real_updates, real_update
  - Average: Ratio NP 8.80%, Ratio CE 3.76%
Dynamic Instruction Count Overhead: DSPstone
  - Columns: Benchmark, Instr ORI, Instr NP, Ratio NP, Instr CE, Ratio CE
  - Benchmarks: complex_multiply, complex_update, convolution, dot_product, fir, fir2dim, iir_biquad_N_sections, iir_biquad_one_section, lms, matrix, matrix1x, n_complex_updates, n_real_updates, real_update
  - Average: Ratio NP 12.96%, Ratio CE 0.50%
Improved Total Performance Ratio: DSPstone
  - Columns: Benchmark, Instr ORI, Instr CE, Ratio IMP
  - Benchmarks: complex_multiply, complex_update, convolution, dot_product, fir, fir2dim, iir_biquad_N_sections, iir_biquad_one_section, lms, matrix, matrix1x, n_complex_updates, n_real_updates, real_update
  - Average: Ratio IMP 19.78%