Retargeting of VPO to the tms320c54x - a status report
Presented by Joshua George
Advisor: Dr. Jack Davidson

Status
- Register assignment and allocation
- Common sub-expression elimination
- Constant propagation / copy propagation
- Induction variable elimination
- Code motion
- Recurrence detection

Status (continued)
- Strength reduction
- Instruction selection
- Dead code elimination
- Constant folding (simp())
- Branch minimization
- Support for repeat blocks

The tms320c54x
- 1 40-bit ALU; 2 40-bit accumulators (A, B) (r[0], r[2] in vpo)
- 1 17x17-bit parallel multiplier with adder for single-cycle MAC operation
- 1 barrel shifter
- 8 16-bit address registers (AR0-AR7) (w[0]..w[7] in vpo)

Compiler writer woes
- Address arithmetic: only a constant can be added to an address register. This causes complications in the optimizer (e.g. in the strength reduction code).
- Interesting note:
    r[0]=(w[0]{24)}24;
    r[0]=r[0]+1;
    w[1]=r[0];      /* w[1]=w[0]+1 gets rejected */
    W[w[1]]=50;     /* by instruction selection  */

    w[0]=w[0]+1;
    W[w[0]+1]=50;
  The first sequence cannot normally be collapsed into the more efficient second sequence, but after minimize_registers, instruction selection is able to fold it into a single instruction.

Compiler writer woes
- 16-bit word addressing: required special-case handling in the lcc frontend.
- Only 2 accumulator registers: the Local Register Assigner had to be fixed to handle this. Lots of spills.
- Refined vpo to use memory disambiguation techniques in instruction selection (maybe_same()); a sketch of this kind of check follows below.
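To illustrate the idea only (this is not VPO's actual interface), here is a minimal sketch in C of the conservative question a maybe_same()-style query answers; the memref representation and the names below are hypothetical:

    #include <stdbool.h>

    /* Hypothetical, simplified memory reference: an address register plus a
       constant word offset, e.g. W[w[3]+2].  VPO's real RTL representation
       is richer than this. */
    struct memref {
        int base_reg;   /* index of the address register, e.g. 3 for w[3] */
        int offset;     /* constant displacement in words                 */
    };

    /* Conservative disambiguation: answer "false" only when the two
       references provably address different words; otherwise assume they
       may overlap.  Instruction selection can then safely combine or
       reorder only the accesses that are provably independent. */
    static bool refs_may_alias(struct memref a, struct memref b)
    {
        if (a.base_reg == b.base_reg)   /* same base: alias iff same offset */
            return a.offset == b.offset;
        return true;                    /* different bases: assume the worst */
    }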

Compiler writer woes
- No pipeline interlocks => unprotected pipeline conflicts.
- 40-bit accumulator: needed a major change to simp(). Complicated machine description with sign-extends and ANDs (a rough model is sketched below).
- Global data is placed in a special cinit section and is relocated to RAM at run-time. VISTA/EASE code instrumentation had to be done differently from other targets.
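A rough C model (my reconstruction, not vpo code) of why the sign-extends and ANDs appear: a 16-bit word loaded into the 40-bit accumulator is sign-extended via the shift-left-24 / shift-right-24 pattern seen in the RTLs, and only the low 16 bits are written back. int64_t stands in for the 40-bit accumulator:

    #include <stdint.h>

    /* Keep only the low 40 bits and sign-extend from bit 39, modelling a
       40-bit accumulator held in an int64_t. */
    static int64_t clamp40(uint64_t v)
    {
        const uint64_t mask = (UINT64_C(1) << 40) - 1;
        v &= mask;
        if (v & (UINT64_C(1) << 39))
            v |= ~mask;                 /* propagate the accumulator sign bit */
        return (int64_t)v;
    }

    /* Models LD *(ARn),A with SXM set: the (x{24)}24 pattern in the RTLs,
       i.e. shift left 24 then arithmetic shift right 24 inside 40 bits,
       which sign-extends a 16-bit word into the accumulator. */
    static int64_t load_sext16(uint16_t w)
    {
        return clamp40((uint64_t)w << 24) >> 24;
    }

    /* Models STL A,*(ARn): only the low 16 bits reach memory, hence the
       "& 65535" that shows up whenever a value is stored back into a
       16-bit register or word. */
    static uint16_t store_low16(int64_t acc)
    {
        return (uint16_t)(acc & 65535);
    }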

Compiler writer woes
- Compare-and-jump has the induction variable and the value to compare with spread over two instructions. All targets till now had a simple compare-and-jump. This resulted in a small change to the vpo lib/md interface.
- E.g. AR1 (w[1]) is the induction variable and runs from 0 to 9. The loop exit check (the C loop shape is sketched below):
    SSBX SXM      ; // s[0]=1; (set sign-extension on)
    LD *(AR1),A   ; // r[0]=(w[1]{24)}24;
    SUB #10,A,A   ; // r[0]=r[0]-10;
    BC L1,ALT     ; // PC=r[0]<0?L1;
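For reference, a minimal reconstruction (not the actual test source) of the C loop shape behind this exit check, with i living in AR1 and running from 0 to 9:

    /* Minimal reconstruction: i lives in AR1 (w[1]).  The test i < 10 is
       split across two instructions: LD *(AR1),A loads i, SUB #10,A,A
       forms i-10, and BC L1,ALT keeps looping while the result is
       negative. */
    void loop_shape(void)
    {
        for (int i = 0; i < 10; i++) {
            /* loop body elided */
        }
    }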

Timeline of progress on this project
Spring 2002
- Code expander completed. Only basic addressing modes and instructions supported.
- Stack layout
- Calling sequence
- Data declarations
- Structure operations
- Passes ctests/ptests with instruction selection.
- Support for stdargs added.

Timeline of progress on this project
Fall 2002
- Major changes to simp() to handle 40-bit arithmetic.
- Enabled register coloring and CSE.
- Lots of work on comp() to allow better instruction selection and other optimizations. (E.g. w[1]=(((w[1]{24)}24)+1)&65535; folds down to w[1]=w[1]+1; only then can strength reduction detect the induction variable.)
- Integrated VISTA into mainline vpo.

Timeline of progress on this project
Spring 2003
- Enabled code motion and strength reduction.
- Further refined the machine description/grammar.
- Started work on Zero Overhead Loop Buffer (ZOLB) support.
- Second merge of VISTA with vpo done.
- Retargeted VISTA to the tms320c54x.

To-dos / future work
- Parallel instructions
- Issues with ZOLB (details later)
- Scheduling
- The banz instruction (very useful for loops) - allows comparison of an address register with zero
- Circular addressing

TI's compiler cl500 has...
- Inter-procedural analysis: for example, if the parameters to a function are constants or globals, the actual parameters are substituted into the function, avoiding expensive stack frame setup (see the hypothetical example below).
- Inline expansion of runtime-support library functions.
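A hypothetical C example (names invented here) of the substitution described above: because scale() is only ever called with the constant 3, an inter-procedural pass can propagate the constant into the callee or inline the call, avoiding the stack-frame setup:

    /* Hypothetical example: the actual parameter is a constant. */
    static int scale(int x, int k)
    {
        return x * k;
    }

    int caller(int v)
    {
        return scale(v, 3);   /* k is always 3, so the call can be
                                 specialized or inlined away        */
    }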

Code comparison
Code fragment: get address of local _a

VPO:
    r[2]=(w[7]{24)}24;
    r[2]=r[2]+_l0_2_a;
    w[3]=r[2]&65535;    // w[3]=w[7]+_l0_2_a

cl500 (TI compiler):
    w[3]=w[7];
    w[3]=w[3]+_l0_2_a;

Code comparison
Code fragment:
    for (i = 0; i < STRUCTSIZE; i++)   // STRUCTSIZE = 2
        sum += b.field[i];
Because vpo maintains the running sum in a 16-bit register (an address register), we use 2 extra instructions and lose the opportunity to convert the loop into a single-instruction repeat. The TI compiler maintains the sum in an accumulator register. A self-contained version of this fragment is sketched below.
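For context, a self-contained version of the fragment under discussion (the struct layout and function wrapper are assumptions; only the loop itself appears on the slide):

    #define STRUCTSIZE 2

    /* Assumed surrounding declarations.  vpo keeps sum in an address
       register, while the TI compiler keeps it in accumulator A. */
    struct s { int field[STRUCTSIZE]; };

    int sum_fields(struct s b)
    {
        int sum = 0;
        for (int i = 0; i < STRUCTSIZE; i++)   /* STRUCTSIZE = 2 */
            sum += b.field[i];
        return sum;
    }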

VPO: AR3 (w[3]) points to start of array. AR1 maintains the running count.
    brc=1;
    rptb .L10_rpt_end-1
.L10:
    ld  *(AR1),A    // r[0]=(w[1]{24)}24;
    add *AR3+,A     // r[0]=r[0]+(W[w[3]]{24)}24; w[3]=w[3]+1;
    stl A,*(AR1)    // w[1]=r[0]&65535;
.L10_rpt_end:

cl500 (TI compiler): AR3 (w[3]) points to start of array. A (r[0]) maintains the running count.
    RPT #1
L5:
    ADD *AR3+,A     // r[0]=r[0]+(W[w[3]]{24)}24; w[3]=w[3]+1;
L6:

Zero Overhead Loop Buffers
- Loops are buffered in a special internal buffer using a rpt instruction whose parameters are start label, end label, and loop count. Access to this buffer may be faster than fetching the instructions from memory.
- The usual branch instruction at the end of the loop is no longer necessary when using a repeat instruction, so pipeline bubbles are avoided.
- On the tms320c54x, a single-instruction rpt allows memory block copies/initializations without using an address register (an illustrative loop is sketched below).
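As an illustration (my example, not from the slides), a simple block initialization is the kind of loop that can become a single-instruction repeat: the compiler can emit RPT followed by one store-with-post-increment, so no address register is tied up as a loop counter:

    /* Illustrative only: a block initialization that maps naturally to
       "RPT #(n-1)" followed by a single store-and-post-increment, leaving
       the loop count in the repeat counter rather than an address register. */
    void clear_block(int *dst, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = 0;
    }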

Detail on ZOLB
Advantages of doing it in vpo:
- Can make use of all the information that vpo has already collected about the loop.
- Easily retargetable: code in the machine-independent part is reused, and code in the machine-dependent part for one target provides a framework for a new target.
- After conversion to a repeat block, registers may be freed up and other optimizations may get enabled.

Status of ZOLB
- Repeat blocks with a compile-time-known loop iteration count are implemented.
- Plan to implement the banz instruction, which is the next-best option to ZOLB.

Acknowledgements
- Dr. Jack Davidson (advisor)
- Jason Hiser
- Clark Coleman