Performance Driven Crosstalk Elimination at Compiler Level
TingTing Hwang, Department of Computer Science, Tsing Hua University, Taiwan

Presentation transcript:

Definition of 4C Crosstalk
  A 3C, 4C crosstalk data transmission sequence on a bus
  [Figure: victim and aggressor wires]
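The slide defines 4C crosstalk only through a figure. As a minimal sketch, the classification below follows a formulation common in the crosstalk literature, which is an assumption here and not taken from the slide: each switching wire is a victim, and each neighbor adds 0, 1, or 2 units of coupling depending on whether it switches with the victim, stays quiet, or switches against it. The function names and the treatment of edge wires (a missing neighbor adds nothing) are illustrative choices.

```python
def transitions(prev: int, curr: int, width: int) -> list[int]:
    """Per-wire transition: +1 rising, -1 falling, 0 quiet (index 0 = LSB wire)."""
    return [((curr >> i) & 1) - ((prev >> i) & 1) for i in range(width)]


def crosstalk_class(prev: int, curr: int, width: int) -> int:
    """Worst crosstalk class (0 to 4) over all switching wires for prev -> curr."""
    d = transitions(prev, curr, width)
    worst = 0
    for i, victim in enumerate(d):
        if victim == 0:
            continue  # only switching wires are treated as victims in this sketch
        # Assumption: a missing neighbor (edge wire) contributes nothing.
        left = abs(victim - d[i - 1]) if i > 0 else 0
        right = abs(victim - d[i + 1]) if i < width - 1 else 0
        worst = max(worst, left + right)
    return worst


# 010 -> 101: the middle wire falls while both neighbors rise (4C).
print(crosstalk_class(0b010, 0b101, 3))
# 010 -> 100: the middle wire falls, one neighbor rises, the other is quiet (3C).
print(crosstalk_class(0b010, 0b100, 3))
```

Under this model a 4C pattern needs three adjacent wires that all switch, with the middle one moving against both neighbors.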

Worst-Case Delay Comparison (ps)

  length  buf size  4C+3C free  4C free      4C
  5mm     30x       241 (36%)   516 (77%)    665
  5mm     60x       213 (52%)   399 (99%)    402
  5mm     120x      136 (48%)   196 (70%)    279
  10mm    30x       437 (42%)   912 (88%)    [missing]
  10mm    60x       413 (44%)   722 (78%)    919
  10mm    120x      270 (49%)   379 (69%)    548
  20mm    30x       793 (50%)   1068 (67%)   [missing]
  20mm    60x       691 (44%)   1161 (74%)   [missing]
  20mm    120x      580 (42%)   969 (70%)    1365
  average           45%         77%          1

  Percentages are relative to the 4C delay.
  Summarized from DATE 2004, "Exploiting Crosstalk to Speed up On-chip Buses," Chunjie Duan

Previous Work
  Bus encoding (expands the Boolean space)
  Hardware overhead: encoders, decoders, additional wires
  [Figure: Sender -> Encoder -> channel -> Decoder -> Receiver; b-bit words at sender and receiver, n-bit words on the channel]
  Copied from ICCAD 2001, "Bus Encoding to Prevent Crosstalk Delay," Bret Victor

Motivation
  Previous work uses codec design
    Logic level: no information about the data
    Large area overhead (e.g., 128 bus width: )
  Data sequences on an instruction bus
    Known at compile time
  To eliminate crosstalk data sequences:
    Instruction re-scheduling
    Register renaming

Problem Definition and Target Architecture
  Given a program, generate a 4C (3C-and-4C) crosstalk-free program (on an instruction bus)
  Performed during compiler optimization

Crosstalk Elimination in Compiler Optimization
  Input: binary executable program
  Step 1: Decompose the input program into basic blocks
  Step 2: Instruction rescheduling (example: interchange I4 and I5)
  Step 3: Register renaming (example: rename R2 to R3)
  Step 4: NOP insertion
  Output: crosstalk-free binary executable program

Step 2: Instruction Re-scheduling
  Instructions are reordered under data-dependency constraints
  Construct a weighted Instruction Adjacency Graph (IAG)

Instruction Adjacency Graph
  Node: instruction
  Edge: execution sequence
  Weight: the number of crosstalk patterns
    If the crosstalk sequence comes from unchangeable bits (opcode, function code, constants), the weight is set larger
  [Figure: IAG over nodes A, B, C, D, E]

Instruction Re-scheduling
  Build a weighted Instruction Adjacency Graph (IAG)
  Model instruction re-scheduling as a Traveling Salesman Problem (TSP) on the IAG
  Find a minimum-weight path that contains each node exactly once

Results of TSP
  Original sequence: weight 18
  Minimum-weight sequence: weight 8
  [Figure: IAG over nodes A, B, C, D, E for each sequence]
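The re-scheduling step can be sketched as an exhaustive search over dependency-respecting orderings of one small basic block. This is only an illustration: the slides do not say which TSP solver is used, and the encodings, the dependency pair, and the helper names (`count_4c`, `sequence_weight`, `best_schedule`) are hypothetical. The edge weight here is the number of wires that see a 4C pattern between two consecutive instruction words, and the extra weighting for unchangeable bits is left out.

```python
from itertools import permutations


def count_4c(prev: int, curr: int, width: int) -> int:
    """Number of wires that switch against both of their neighbors (4C victims)."""
    d = [((curr >> i) & 1) - ((prev >> i) & 1) for i in range(width)]
    return sum(
        1
        for i in range(1, width - 1)
        if d[i] != 0 and d[i - 1] == -d[i] and d[i + 1] == -d[i]
    )


def sequence_weight(order, encodings, width):
    """Sum of edge weights along one instruction ordering."""
    return sum(
        count_4c(encodings[a], encodings[b], width)
        for a, b in zip(order, order[1:])
    )


def respects(order, deps):
    """True if every (before, after) dependency pair keeps its relative order."""
    pos = {name: k for k, name in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in deps)


def best_schedule(encodings, deps, width):
    """Brute-force minimum-weight ordering; factorial cost, small blocks only."""
    candidates = (p for p in permutations(encodings) if respects(p, deps))
    return min(candidates, key=lambda p: sequence_weight(p, encodings, width))


# Hypothetical 5-bit instruction words; B depends on A.
encodings = {"A": 0b01010, "B": 0b10101, "C": 0b00000}
deps = [("A", "B")]
print(best_schedule(encodings, deps, 5))  # ('A', 'C', 'B')
```

Placing C between A and B removes the three 4C victims of the direct A-to-B transition without violating the dependency.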

Step 3: Register Renaming
  Registers can be renamed as long as live-in, live-out, and system-preserved registers are left unchanged
  Weighted Register Adjacency Graph (RAG)
    Node: register
    Edge between nodes RA and RB: registers RA and RB are adjacent to each other
    Weight: frequency

Register Adjacency Graph
  [Figure: register adjacency graph with weighted edges]

  label  instruction       encoding
  A      ADD R2, R1, R0    101, 010, 001, 000
  C      XOR R4, R0, R2    000, 100, 000, 010
  B      MUL R1, R2, R0    010, 001, 010, 000
  D      BIS R3, R1, 4     011, 011, 001, 100
  E      BIS R5, R3, R4    011, 101, 011, 100
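A RAG for the five instructions above could be built as in the following sketch. It assumes that two registers are "adjacent" when they occupy the same operand field in consecutive instructions, and that a field holding an immediate is skipped. Both points are interpretations, not statements from the slides, so the resulting weights may differ from those in the original figure.

```python
from collections import Counter

# (label, mnemonic, operand fields) in the order listed on the slide;
# None marks a field that holds an immediate instead of a register.
PROGRAM = [
    ("A", "ADD", ("R2", "R1", "R0")),
    ("C", "XOR", ("R4", "R0", "R2")),
    ("B", "MUL", ("R1", "R2", "R0")),
    ("D", "BIS", ("R3", "R1", None)),
    ("E", "BIS", ("R5", "R3", "R4")),
]


def build_rag(program):
    """Map each unordered register pair to how often it is adjacent."""
    weights = Counter()
    for (_, _, prev_ops), (_, _, next_ops) in zip(program, program[1:]):
        for ra, rb in zip(prev_ops, next_ops):
            if ra is None or rb is None or ra == rb:
                continue  # immediates and unchanged fields add no edge
            weights[frozenset((ra, rb))] += 1
    return weights


rag = build_rag(PROGRAM)
for pair, weight in sorted(rag.items(), key=lambda kv: -kv[1]):
    print(sorted(pair), weight)
```

Under these assumptions R0 and R2 form the heaviest edge, so they are the pair whose codes matter most for crosstalk.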

4C Crosstalk-free Cliques
  To rename all registers of a clique at once, a database of 4C crosstalk-free cliques of 5-bit codes is pre-constructed.
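To illustrate what a 4C crosstalk-free clique of 5-bit codes might look like, the sketch below greedily collects codes such that no pair produces a 4C pattern in either direction. It is a simplification under stated assumptions: the 5-bit field is examined in isolation (wires next to the field boundary are ignored), only one clique is built instead of the full database the slide mentions, and `has_4c` and `greedy_clique` are hypothetical names.

```python
def has_4c(a: int, b: int, width: int = 5) -> bool:
    """True if switching between codes a and b causes 4C on some wire.

    A 4C victim needs a 3-wire window that reads 010 in one code and 101 in
    the other, so the check is symmetric in a and b.
    """
    for i in range(width - 2):
        window_a = (a >> i) & 0b111
        window_b = (b >> i) & 0b111
        if {window_a, window_b} == {0b010, 0b101}:
            return True
    return False


def greedy_clique(width: int = 5) -> list[int]:
    """Collect codes in increasing order, keeping each one that is
    4C-free against every code kept so far."""
    clique: list[int] = []
    for code in range(1 << width):
        if all(not has_4c(code, member, width) for member in clique):
            clique.append(code)
    return clique


clique = greedy_clique()
print([format(code, "05b") for code in clique])
```

Any two registers assigned codes from the same clique can follow each other in the same field without a 4C transition inside that field.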

Register Renaming Algorithm
  REGISTER-RENAMING()
    Construct RAG
    Do clique partitioning on RAG
    while (RAG is not empty) {
      Select a clique with maximum weight
      Reassign all registers in the clique
      Remove the clique from RAG
    }

Example of Register Renaming
  [Figure: register adjacency graph before renaming (labels A, B, C) and after renaming (labels A', B', C')]
  Assumption: R0 and R1 are live-in registers; R5 is a live-out register

Step 4: NOP Insertion
  A NOP
    Is inserted between two instructions that induce 4C crosstalk
    Is crosstalk-free with all other instructions
    Does not change program functionality
    Takes one clock period to execute and one memory location to store, which is the overhead
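The insertion pass can be sketched as a single scan over the instruction words of a basic block. The all-zero NOP encoding is an assumption made for this example: a transition to or from an all-zero word moves every switching wire in the same direction, so it cannot create a 4C pattern. The real NOP encoding of the target architecture is not given in the slides, and `has_4c` and `insert_nops` are hypothetical names.

```python
NOP = 0b00000  # assumed all-zero NOP word (hypothetical encoding)


def has_4c(prev: int, curr: int, width: int) -> bool:
    """True if some 3-wire window flips between 010 and 101."""
    for i in range(width - 2):
        if {(prev >> i) & 0b111, (curr >> i) & 0b111} == {0b010, 0b101}:
            return True
    return False


def insert_nops(words: list[int], width: int, nop: int = NOP) -> list[int]:
    """Insert a NOP between consecutive words that would induce 4C crosstalk."""
    out: list[int] = []
    for word in words:
        if out and has_4c(out[-1], word, width):
            out.append(nop)  # costs one cycle and one memory location
        out.append(word)
    return out


block = [0b01010, 0b10101, 0b00001]
fixed = insert_nops(block, 5)
print([format(w, "05b") for w in fixed])
# ['01010', '00000', '10101', '00001']
```

Each inserted NOP adds to both the static and the dynamic instruction count, which is why re-scheduling and renaming run first.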

Benchmarking Results

Static Instruction Count Overhead: SPEC2000 (CINT), 4C crosstalk-free
  Columns: Benchmark, Instr ORI, Instr NP, Ratio NP, Instr CE, Ratio CE
  Benchmarks: bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr
  [per-benchmark values missing]
  Average: Ratio NP 8.55%, Ratio CE 2.65%

Dynamic Instruction Count Overhead: SPEC2000 (CINT)
  Columns: Benchmark, Instr ORI, Instr NP, Ratio NP, Instr CE, Ratio CE
  Benchmarks: bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr
  [per-benchmark values missing]
  Average: Ratio NP 10.19%, Ratio CE 1.10%

Computation of Improved Performance Ratio
  0.10 um, bus length: 10mm
  Cycle length
    With 4C: 1
    Without 4C: 0.8
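The slide gives only the two normalized cycle lengths, so the exact formula behind the reported ratios is not known. One plausible reading, sketched below as an assumption, takes execution time as dynamic instruction count times cycle length and reports the relative reduction in execution time. The instruction counts in the example are made up, and the function is not claimed to reproduce the reported figures.

```python
def improved_ratio(
    instr_ori: int,
    instr_ce: int,
    cycle_ori: float = 1.0,  # normalized cycle length with 4C crosstalk
    cycle_ce: float = 0.8,   # normalized cycle length without 4C crosstalk
) -> float:
    """Relative execution-time reduction, assuming one instruction per cycle."""
    time_ori = instr_ori * cycle_ori
    time_ce = instr_ce * cycle_ce
    return 1.0 - time_ce / time_ori


# Hypothetical counts: 1% more dynamic instructions after crosstalk elimination.
print(f"{improved_ratio(1_000_000, 1_010_000):.2%}")
```

Under this reading the shorter cycle stops paying off once the instruction count grows by 25%, since 1.25 times 0.8 equals 1.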

Improved Total Performance Ratio: SPEC2000 (CINT)
  Columns: Benchmark, Instr ORI, Instr CE, Ratio IMP
  Benchmarks: bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr
  [per-benchmark values missing]
  Average: Ratio IMP 20.57%

Thank you

Static Instruction Count Overhead: DSPstone
  Columns: Benchmark, Instr ORI, Instr NP, Ratio NP, Instr CE, Ratio CE
  Benchmarks: complex_multiply, complex_update, convolution, dot_product, fir, fir2dim, iir_biquad_N_sections, iir_biquad_one_section, lms, matrix, matrix1x, n_complex_updates, n_real_updates, real_update
  [per-benchmark values missing]
  Average: Ratio NP 8.80%, Ratio CE 3.76%

Dynamic Instruction Count Overhead: DSPstone
  Columns: Benchmark, Instr ORI, Instr NP, Ratio NP, Instr CE, Ratio CE
  Benchmarks: complex_multiply, complex_update, convolution, dot_product, fir, fir2dim, iir_biquad_N_sections, iir_biquad_one_section, lms, matrix, matrix1x, n_complex_updates, n_real_updates, real_update
  [per-benchmark values missing]
  Average: Ratio NP 12.96%, Ratio CE 0.50%

Improved Total Performance Ratio: DSPstone
  Columns: Benchmark, Instr ORI, Instr CE, Ratio IMP
  Benchmarks: complex_multiply, complex_update, convolution, dot_product, fir, fir2dim, iir_biquad_N_sections, iir_biquad_one_section, lms, matrix, matrix1x, n_complex_updates, n_real_updates, real_update
  [per-benchmark values missing]
  Average: Ratio IMP 19.78%