Published by Gwen Todd. Modified over 9 years ago.
1
Retargeting of VPO to the tms320c54x: a status report
Presented by Joshua George
Advisor: Dr. Jack Davidson
2
Status
- Register assignment and allocation
- Common sub-expression elimination
- Constant propagation/Copy propagation
- Induction variable elimination
- Code motion
- Recurrence detection
3
Status (continued)
- Strength reduction
- Instruction selection
- Dead code elimination
- Constant folding (simp())
- Branch minimization
- Support for repeat blocks
4
The tms320c54x
- One 40-bit ALU, two 40-bit accumulators (A, B) (r[0], r[2] in vpo)
- One 17x17-bit parallel multiplier with adder for single-cycle MAC operation
- One barrel shifter
- Eight 16-bit address registers (AR0-AR7) (w[0]..w[7] in vpo)
5
Compiler writer woes
- Address arithmetic: only a constant can be added to an address register. This complicates the optimizer (e.g. the strength reduction code).
- Interesting note:
    r[0]=(w[0]{24)}24;
    r[0]=r[0]+1;
    w[1]=r[0];       /* w[1]=w[0]+1 gets rejected */
    W[w[1]]=50;      /* by instruction selection */
    ---------------------------
    w[0]=w[0]+1;
    W[w[0]+1]=50;
  The first sequence cannot normally collapse into the more efficient second sequence. But after minimize_registers, instruction selection is able to fold them into a single instruction.
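The `(w[0]{24)}24` pattern in the RTLs above can be read as a left shift by 24 followed by an arithmetic right shift by 24, which sign-extends a 16-bit address register value inside a 40-bit accumulator. A minimal Python sketch under that reading of `{` and `}`; the helper names are hypothetical and are not part of vpo:

```python
ACC_BITS = 40                      # accumulator width on the tms320c54x
ACC_MASK = (1 << ACC_BITS) - 1


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def sign_extend_16_in_40(word):
    """Model r=(w{24)}24 for a 16-bit word held in a 40-bit accumulator."""
    shifted = (word << 24) & ACC_MASK          # {24 moves bit 15 up to bit 39
    return to_signed(shifted, ACC_BITS) >> 24  # }24 shifts back, copying the sign


print(sign_extend_16_in_40(0x7FFF))  # largest positive 16-bit value
print(sign_extend_16_in_40(0xFFFF))  # all ones reads back as a negative value
```

Because every address register has to pass through this idiom before accumulator arithmetic, a plain `w[1]=w[0]+1` shows up to the optimizer as a multi-RTL sequence rather than a single add.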
6
Compiler writer woes
- 16-bit word addressing: required special-case handling in the lcc frontend.
- Only 2 accumulator registers. The Local Register Assigner had to be fixed to handle this. Lots of spills.
- Refined vpo to use memory disambiguation techniques in instruction selection (maybe_same()).
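The slide only names maybe_same(), so the following is not its implementation. It is a minimal sketch of one common disambiguation rule, assuming memory references of the form base plus constant offset and assuming the base register is not modified between the two references. `MemRef` and `may_alias` are hypothetical names:

```python
from typing import NamedTuple


class MemRef(NamedTuple):
    base: str       # register or symbol the address is formed from
    offset: int     # constant displacement in 16-bit words
    size: int = 1   # number of words accessed


def may_alias(a, b):
    """Return False only when the two references provably do not overlap."""
    if a.base != b.base:
        # Different bases: nothing is known here, so answer conservatively.
        return True
    # Same base: the references overlap only if their word ranges intersect.
    return a.offset < b.offset + b.size and b.offset < a.offset + a.size


spill_slot = MemRef("w[7]", 0)
other_local = MemRef("w[7]", 1)
print(may_alias(spill_slot, other_local))
```

With only two accumulators and many spills, a "definitely different" answer of this kind is what would let instruction selection combine RTLs across an intervening store.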
7
Compiler writer woes
- No pipeline interlocks => unprotected pipeline conflicts.
- 40-bit accumulator. Needed a major change to simp(). Complicated machine description with sign-extends and ANDs.
- Global data is placed in a special cinit section and is relocated to RAM at run time.
- VISTA/EASE code instrumentation had to be done differently from other targets.
8
Compiler writer woes
- Compare and jump has the induction variable and the value to compare with spread over two instructions. All targets until now had a simple compare and jump. This resulted in a small change to the vpo lib/md interface.
- E.g. AR1 (w[1]) is the induction variable and runs from 0 to 9. The loop exit check:
    SSBX SXM       // s[0]=1; (set sign-ext on)
    LD *(AR1),A    // r[0]=(w[1]{24)}24;
    SUB #10,A,A    // r[0]=r[0]-10;
    BC L1,ALT      // PC=r[0],0?L1;
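The exit check above can be traced with a small Python model. This is a sketch only: it assumes ALT means "accumulator A less than zero", that L1 is the loop head, and that AR1 is incremented just before the check. The function names are hypothetical:

```python
def sign_extend_16(word):
    """Sign-extend a 16-bit value, as LD does when SXM is set."""
    word &= 0xFFFF
    return word - 0x10000 if word & 0x8000 else word


def exit_check_branches(ar1, bound=10):
    """Return True when BC L1,ALT would branch back to the loop head."""
    a = sign_extend_16(ar1)   # LD  *(AR1),A
    a = a - bound             # SUB #10,A,A
    return a < 0              # BC  L1,ALT


def count_iterations(bound=10):
    """Run the loop body until the exit check falls through."""
    ar1 = 0
    iterations = 0
    while True:
        iterations += 1                  # loop body executes here
        ar1 = (ar1 + 1) & 0xFFFF         # induction variable update
        if not exit_check_branches(ar1, bound):
            break
    return iterations


print(count_iterations())
```

The comparison value (10) lives in the SUB while the branch itself only tests the accumulator, which is why loop analysis has to look at two instructions to recover the bound.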
9
Timeline of progress on this project
Spring 2002
- Code-expander completed. Only basic addressing modes and instructions supported.
  - Stack layout
  - Calling sequence
  - Data declarations
  - Structure operations
- Passes ctests/ptests with instruction selection.
- Support for stdargs added.
10
Timeline of progress on this project
Fall 2002
- Major changes to simp() to handle 40-bit arithmetic.
- Enabled Register Coloring and CSE.
- A lot of work on comp() to allow better instruction selection and other optimizations. E.g. w[1]=( (w[1]{24)}24)+1 ) & 65535 folds down to w[1]=w[1]+1; only after this fold can strength reduction detect the induction variable.
- Integrated VISTA into mainline vpo.
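The fold mentioned for comp() is valid because sign-extending, adding 1, and masking back to 16 bits gives the same result as a 16-bit wrapping increment. A minimal exhaustive check in Python, assuming `{24)}24` denotes sign extension of the 16-bit register; the function names are hypothetical:

```python
def unfolded(w1):
    """w[1] = (((w[1]{24)}24) + 1) & 65535, as seen before folding."""
    extended = (w1 ^ 0x8000) - 0x8000   # sign-extend the 16-bit value
    return (extended + 1) & 65535


def folded(w1):
    """w[1] = w[1] + 1 on a 16-bit address register."""
    return (w1 + 1) & 0xFFFF


# Compare the two forms for every possible 16-bit register value.
mismatches = [w for w in range(1 << 16) if unfolded(w) != folded(w)]
print(len(mismatches))
```

An empty mismatch list is what justifies treating the long form as a plain increment, which is the shape induction variable detection looks for.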
11
Timeline of progress on this project
Spring 2003
- Enabled Code motion and Strength reduction.
- Further refined the machine description/grammar.
- Started work on Zero Overhead Loop Buffer (ZOLB) support.
- Second merge of VISTA with vpo done.
- Retargeted VISTA to the tms320c54x.
12
To-Dos/Future work
- Parallel instructions
- Issues with ZOLB (details later)
- Scheduling
- The banz instruction (very useful for loops): allows comparison of an address register with zero
- Circular addressing
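To illustrate why banz suits counted loops, here is a small Python model of a loop closed by a BANZ-style instruction. It is a sketch that assumes the instruction branches when the address register is nonzero and then post-decrements it (as in `BANZ label,*AR1-`), with AR1 preloaded to the trip count minus one. `run_banz_loop` is a hypothetical name:

```python
def run_banz_loop(trip_count):
    """Count how many times the loop body runs for a given trip count (>= 1)."""
    ar1 = (trip_count - 1) & 0xFFFF   # assumed setup: AR1 = N - 1
    executed = 0
    while True:
        executed += 1                  # loop body
        branch = ar1 != 0              # BANZ compares the register with zero
        ar1 = (ar1 - 1) & 0xFFFF       # post-decrement of AR1
        if not branch:
            break
    return executed


print(run_banz_loop(10))
```

Test and update happen in one instruction here, whereas the accumulator-based exit check needs a load, a subtract, and a conditional branch.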
13
TI's compiler cl500 has:
- Inter-procedural analysis. For example, if the parameters to a function are constants or globals, the actual parameters are substituted into the function, thus avoiding expensive stack frame setup.
- Inline expansion of runtime-support library functions.
14
Code comparison
Code fragment: get address of local _a
VPO:
    r[2]=(w[7]{24)}24;
    r[2]=r[2]+_l0_2_a;
    w[3]=r[2]&65535;      // w[3]=w[7]+_l0_2_a
cl500 (TI compiler):
    w[3]=w[7];
    w[3]=w[3]+_l0_2_a;
15
Code comparison
Code fragment:
    for (i = 0; i < STRUCTSIZE; i++)   // STRUCTSIZE=2
        sum += b.field[i];
Because vpo maintains the running sum in a 16-bit register (an address register), we use 2 extra instructions and lose the opportunity to convert the loop into a repeat-single instruction. The TI compiler maintains the sum in an accumulator register.
16
VPO:
AR3 (w[3]) points to the start of the array. AR1 maintains the running sum.
    brc=1;
    rptb .L10_rpt_end-1
.L10:
    ld *(AR1),A      // r[0]=(w[1]{24)}24;
    add *AR3+,A      // r[0]=r[0]+(W[w[3]]{24)}24;w[3]=w[3]+1;
    stl A,*(AR1)     // w[1]=r[0]&65535;
.L10_rpt_end:
--------------------------------------------
cl500 (TI compiler):
AR3 (w[3]) points to the start of the array. A (r[0]) maintains the running sum.
    RPT #1
L5: ADD *AR3+,A      // r[0]=r[0]+(W[w[3]]{24)}24;w[3]=w[3]+1;
L6:
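The two loop shapes can be compared with a small Python model. This is a sketch under stated assumptions: a repeat block runs BRC+1 times, RPT #n runs the following instruction n+1 times, the sum starts at zero, and the counts are executed loop-body instructions rather than cycles. All function names are hypothetical:

```python
def sext16(word):
    """Sign-extend a 16-bit memory or register value."""
    word &= 0xFFFF
    return word - 0x10000 if word & 0x8000 else word


def vpo_repeat_block(field):
    """Repeat-block version: the running sum lives in 16-bit AR1."""
    ar1, ar3, executed = 0, 0, 0
    brc = len(field) - 1                 # brc=1 when STRUCTSIZE=2
    for _ in range(brc + 1):             # block body runs BRC+1 times
        a = sext16(ar1)                  # ld  *(AR1),A
        a += sext16(field[ar3])          # add *AR3+,A
        ar3 += 1
        ar1 = a & 0xFFFF                 # stl A,*(AR1)
        executed += 3
    return sext16(ar1), executed


def cl500_repeat_single(field):
    """Repeat-single version: the running sum stays in accumulator A."""
    a, ar3, executed = 0, 0, 0
    for _ in range(len(field)):          # RPT #1 runs the ADD twice
        a += sext16(field[ar3])          # ADD *AR3+,A
        ar3 += 1
        executed += 1
    return a, executed


print(vpo_repeat_block([3, 4]))
print(cl500_repeat_single([3, 4]))
```

Both versions compute the same sum; the difference is the load and store around each add, which is what keeping the sum in an address register costs.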
17
Zero Overhead Loop Buffers
- Loops are buffered in a special internal buffer using a rpt instruction whose parameters are start label, end label and loop count.
- Access to this buffer may be faster than fetching the instructions from memory.
- The usual branch instruction at the end of the loop is no longer necessary when using a repeat instruction, and hence pipeline bubbles are avoided.
- On the tms320c54x, a single-instruction rpt allows memory block copies/initializations without using an address register.
18
Detail on ZOLB
Advantages of doing it in vpo:
- Can make use of all the information that vpo has already collected about the loop.
- Easily retargetable.
  - Code in the machine-independent part is reused.
  - Code in the machine-dependent part for one target provides a framework for the new target.
- After conversion to a Repeat Block, registers may be freed up.
- Other optimizations may get enabled.
19
Status of ZOLB
- Repeat Blocks with a compile-time-known loop iteration count are implemented.
- Plan to implement the banz instruction, which is the next best option to ZOLB.
20
Acknowledgements
- Dr. Jack Davidson (advisor)
- Jason Hiser
- Clark Coleman