Infrastructures for Assembly Level Tools F. Bodin CAPS Team IRISA-INRIA.

2 Introduction
New media processors need fancy compilers
– instruction level parallelism
– memory hierarchy
– multimedia instructions
Issues are moving from code generation (i.e. instruction selection, register allocation) to code optimizations
– techniques from high performance computing are needed
Highly optimized but short codes
Retargeting must be fast

3 Compilation for Embedded Applications: Specific Issues
Asymptotic performance is not the only goal
Hardware may not exist when developing the system
Retargetability is essential
Users are ready to make more effort
– cost versus performance
– tuning is a large part of the development time
Compilation time is not critical

4 Current Compilers Are Not Enough
Users need more control
– many heuristics to “drive”
– different optimization strategies on different code parts
Feedback needed
Not only used for compiling but also for processor design
Must adapt to “weird” processor architectures

5 Infrastructures rather than Compilers
Flexibility is critical
Open software rather than black boxes
Machine description must exist for
– code generation: usually the case
– code scheduling: instruction resource uses must be described
Very few infrastructures for low level optimizations

6 Our View of Compiler Infrastructure for Embedded Systems
[Figure labels: Interface, Front-end, Target Description(s), Code Generation, Back-end Optimizer, Instruction set Simulator, tool 1, tool 2, tool 3, tool 4, feedback]

7 Overview
Salto
Assembly source to source preprocessor
– to instrument or transform assembly code,
– to schedule assembly code,
– for register allocation, basic block layout, etc.
Fine grain machine description
Independent from compilers
[Figure labels: Transformation tool, SALTO interface, C++, Machine Description, assembly language]

8 Code Abstraction
[Figure labels: block_begin, block_end, label, asm 1, asm 2, branch, CFG 3, CFG 4, BB 1, BB 2]
list of procedures, list of basic blocks, list of instructions

9 Code Abstraction
_gcd:
    save %sp,-104,%sp
    cmp %i0, %i1
    bge L9
    mov %i0,%o0
L9:
    ..........
[Figure labels: control flow edges NOT_TAKEN, TAKEN, TAKEN; instructions xor %i1,%i0,%i1; xor %i0,%i1,%i0; mov %i0,%o0; xor %i0,%i1,%i0; dependence edge RAW (delay = 1 cycle)]
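
The abstraction described here (procedures holding basic blocks, basic blocks holding instructions, with dependence edges such as RAW between instructions) can be sketched in C++ as follows. This is an illustration only, not Salto's actual interface: the types Instr, BasicBlock and Procedure and the functions hasRaw and countAdjacentRaw are hypothetical names, and the register read and write sets are filled in by hand instead of being derived from a machine description.

```cpp
#include <algorithm>
#include <cassert>   // for the checks in the usage note
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical instruction record: opcode plus the registers it reads and writes.
struct Instr {
    std::string opcode;
    std::vector<std::string> reads;
    std::vector<std::string> writes;
};

// A basic block is a list of instructions plus its CFG successors.
struct BasicBlock {
    std::string label;
    std::vector<Instr> instrs;
    std::vector<int> successors;  // indices into Procedure::blocks
};

// A procedure is a list of basic blocks.
struct Procedure {
    std::string name;
    std::vector<BasicBlock> blocks;
};

// True if `later` reads a register that `earlier` writes (read-after-write).
bool hasRaw(const Instr& earlier, const Instr& later) {
    for (const auto& w : earlier.writes) {
        if (std::find(later.reads.begin(), later.reads.end(), w) != later.reads.end()) {
            return true;
        }
    }
    return false;
}

// Counts RAW dependences between adjacent instructions of one block.
int countAdjacentRaw(const BasicBlock& bb) {
    int count = 0;
    for (std::size_t i = 1; i < bb.instrs.size(); ++i) {
        if (hasRaw(bb.instrs[i - 1], bb.instrs[i])) {
            ++count;
        }
    }
    return count;
}
```

With SPARC operand order (two sources, then the destination), xor %i1,%i0,%i1 followed by xor %i0,%i1,%i0 is reported as a RAW pair on %i1. A scheduler would attach the delay from the machine description to such an edge.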

10 Machine Description
Machine resource
Class of machine resources
Assembly code description
Resource usage description
[Figure labels: register file, ALU, memory]
_main:
    !#PROLOGUE# 0
    save %sp,-104,%sp
    !#PROLOGUE# 1
    call ___main,0
    nop
    cmp %i0,1
    bg L2
    sethi %hi(__iob+40),%o0
    or %o0,%lo(__iob+40),%o0
    sethi %hi(LC0),%o1
    or %o1,%lo(LC0),%o1
    call _fprintf,0
[Resource usage table. Resources: issue, reg 1, reg 2, ALU, reg 3; cycles: 0, 1, 2, 3; entries: use, read, write, read, use]
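
A per-cycle resource usage description of this kind is commonly represented as a reservation table. The sketch below assumes such a representation to show how a scheduler could test whether two instructions collide on a resource. The resource constants, the ReservationTable alias and the functions conflicts and earliestIssue are hypothetical and do not reproduce Salto's description format.

```cpp
#include <cassert>   // for the checks in the usage note
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical resources, one bit each.
constexpr std::uint32_t ISSUE = 1u << 0;
constexpr std::uint32_t ALU   = 1u << 1;
constexpr std::uint32_t MEM   = 1u << 2;

// Element c holds the set of resources an instruction occupies
// c cycles after it is issued.
using ReservationTable = std::vector<std::uint32_t>;

// True if `second`, issued `delta` cycles after `first`,
// needs a resource in a cycle where `first` still holds it.
bool conflicts(const ReservationTable& first,
               const ReservationTable& second,
               std::size_t delta) {
    for (std::size_t c = delta; c < first.size(); ++c) {
        const std::size_t k = c - delta;  // cycle relative to second's issue
        if (k >= second.size()) {
            break;
        }
        if ((first[c] & second[k]) != 0) {
            return true;
        }
    }
    return false;
}

// Smallest issue distance at which `second` can follow `first`.
// Terminates because no conflict is possible once delta >= first.size().
std::size_t earliestIssue(const ReservationTable& first,
                          const ReservationTable& second) {
    std::size_t delta = 0;
    while (conflicts(first, second, delta)) {
        ++delta;
    }
    return delta;
}
```

For example, with a two-cycle add {ISSUE, ALU} and a three-cycle load {ISSUE, MEM, MEM}, two loads must be issued at least two cycles apart under this model, while a load can follow an add after one cycle.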

11 Some Application Examples
Code optimization
– a prototype software pipeliner for TriMedia
Code instrumentation
– Calvin, a tool for trace collection
Simulation
– Absciss, a compiled instruction set simulator generator
Code tuning
– assembly code tuning tool

12 A Prototype Software Pipeliner
[Figure labels: Source code, Code generator, Salto, opt. asm file, Sea, Pilo/Lora SP, Pilo/Lora SP, asm file, IL file, interface with the compiler, interface with the abstract scheduler]
Difficult to implement
Separate implementation scheduling computation
Interface with compiler
data dependencies
loop structure

13 IL Example
Keywords = { Loop,.. }
DefaultLevel = asm
#(BB 1, DESC = { #(INST 38)...} )
#(SS 1, DESC = { #(BB 1) })
#(SS 1).ToOptimize := true
#(SS 1).Loop := true
#(SS 1).body := #(BB 1)
#(SS 1).loopBack := { #(INST 57) }
...

for(i=0; i < 100; i++) {
    a[i] = b[i] + c[i];
}

__main_DT_3:
    ....
    .instid 54 iadd IF r1 r0 r122 -> r9;
    .instid 55 h_st32d IF r1(0) r127 r124;
    .instid 56 ijmpf IF r1 r125 r126 ;
    .instid 57 ijmpi IF r125(__main_DT_3);
__main_DT_4:
[Figure label: Code generator]

14 Some Issues
Instruction slots do not fit in machine descriptions
Need to modify loop branches and counters
Extensive code structure modifications (register renaming, unrolling, loop test, remainder loop,...)
May generate very large code
Not efficient for small iteration numbers
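
The code growth and the remainder loop mentioned above can be seen at source level on the a[i] = b[i] + c[i] loop used in the IL example. The sketch below unrolls that loop by four. It shows the structural cost only and is not a software pipeliner: the function names vecAdd and vecAddUnrolled4 and the unroll factor are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference loop: a[i] = b[i] + c[i].
void vecAdd(std::vector<int>& a,
            const std::vector<int>& b,
            const std::vector<int>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = b[i] + c[i];
    }
}

// Same computation, unrolled by 4 with a remainder loop.
void vecAddUnrolled4(std::vector<int>& a,
                     const std::vector<int>& b,
                     const std::vector<int>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    const std::size_t n = a.size();
    const std::size_t unrolledEnd = n - n % 4;  // last index covered by the unrolled body
    std::size_t i = 0;
    // Unrolled body: the loop test and counter update run once per 4 elements.
    for (; i < unrolledEnd; i += 4) {
        a[i]     = b[i]     + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
        a[i + 2] = b[i + 2] + c[i + 2];
        a[i + 3] = b[i + 3] + c[i + 3];
    }
    // Remainder loop: handles the last n % 4 elements.
    for (; i < n; ++i) {
        a[i] = b[i] + c[i];
    }
}
```

The loop body now exists five times in the code, and for n below 4 only the remainder loop runs, so the transformation adds size without any benefit for small iteration counts.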

15 Code Instrumentation
Collecting data for hardware studies
– trace collections
Extracting information for code optimizations
– counting dynamic events
– coupling code to simulators (for instance cache memories)
Instrumentation is sometimes more convenient and flexible than simulation
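
As a rough illustration of counting dynamic events by instrumenting assembly, the sketch below inserts a probe line after every label of an assembly listing. It is a simplified stand-in and not how Calvin or Salto work internally: labels only approximate basic block entries (a block also starts after a branch), PROBE is a placeholder and not a real instruction, and the names isLabel, makeProbe and instrument are hypothetical.

```cpp
#include <cassert>   // for the checks in the usage note
#include <string>
#include <vector>

// A line is treated as a label if it starts in column 0 and ends with ':'.
bool isLabel(const std::string& line) {
    return !line.empty() && line.front() != ' ' && line.front() != '\t' &&
           line.back() == ':';
}

// Placeholder probe text. A real probe would be target code that bumps
// counter `blockId` and preserves every register and flag it touches.
std::string makeProbe(int blockId) {
    return "\tPROBE " + std::to_string(blockId);
}

// Returns a copy of the listing with one probe inserted after each label.
std::vector<std::string> instrument(const std::vector<std::string>& asmLines) {
    std::vector<std::string> out;
    int nextId = 0;
    for (const auto& line : asmLines) {
        out.push_back(line);
        if (isLabel(line)) {
            out.push_back(makeProbe(nextId++));
        }
    }
    return out;
}
```

Running the instrumented program then yields one execution count per probe, which is the kind of dynamic event count the slide refers to.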

16 An Example: Calvin
[Figure labels: Salto, foo.i.s, foo.s, foo.i.s, Salto, light instrumentation, counting events, Simulation library]

17 High Performance Instruction Set Simulators
What for
– evaluate ISA
– develop compiler back-end
– debug applications
Requirements
– high performance
– flexible
– targets VLIW and superscalars

18 Approach
Compiled instruction set simulator
Assembly input, no need to specify binary encoding
Use static information to speed up resulting code
Generate only the needed events
Resulting code can be edited and modified
Separates scheduling computation and simulator generation
Use Salto machine description enhanced with instruction semantics

19 Overview of Current Experiments
[Figure labels: C Source, TriMedia Assembly code, tmcc, TriMedia Binary, Absciss, tmsim, tmas, g++/ld, C++ Source, compiled simulator, Salto, Architecture Description]

20 Absciss Instruction Descriptions
Salto description
Pseudo assembly instructions are used for semantics (RTL based on an extension of Zephyr operators)
Example 1
– modulo add: iadd r,s -> d
– semantics: $3=add($1,$2)
Example 2
– load 2 bytes: uld16r r,s -> d
– semantics: $3=zx(mem(add($1,$2),2),32)
Timings are given by the scheduler for “static” processors (i.e. VLIW)
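
To make the two semantic descriptions concrete, the sketch below shows the kind of host C++ code a compiled simulator might emit for them. It is an assumption-laden illustration, not Absciss output: the State type, the register count of 128, the big-endian byte order in memRead and all function names are hypothetical choices for this example.

```cpp
#include <array>
#include <cassert>   // for the checks in the usage note
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kNumRegs = 128;  // assumed register file size

// Simulated machine state: registers and a flat byte-addressed memory.
struct State {
    std::array<std::uint32_t, kNumRegs> r{};
    std::vector<std::uint8_t> mem;
};

// add($1,$2): unsigned 32-bit addition wraps modulo 2^32.
inline std::uint32_t add(std::uint32_t a, std::uint32_t b) {
    return a + b;
}

// mem(addr, n): reads n bytes (n <= 4) starting at addr.
// Big-endian order is an assumption; at() throws on out-of-range access.
inline std::uint32_t memRead(const State& s, std::uint32_t addr, unsigned n) {
    std::uint32_t value = 0;
    for (unsigned i = 0; i < n; ++i) {
        value = (value << 8) | s.mem.at(static_cast<std::size_t>(addr) + i);
    }
    return value;
}

// zx(v, bits): keeps the low `bits` bits, upper bits become zero.
inline std::uint32_t zx(std::uint32_t v, unsigned bits) {
    return bits >= 32 ? v : (v & ((1u << bits) - 1u));
}

// iadd r,s -> d with semantics $3 = add($1,$2)
void iadd(State& st, std::size_t r, std::size_t s, std::size_t d) {
    st.r.at(d) = add(st.r.at(r), st.r.at(s));
}

// uld16r r,s -> d with semantics $3 = zx(mem(add($1,$2),2),32)
void uld16r(State& st, std::size_t r, std::size_t s, std::size_t d) {
    st.r.at(d) = zx(memRead(st, add(st.r.at(r), st.r.at(s)), 2), 32);
}
```

Because each simulated instruction becomes a plain function call or inline expression on the host, the host compiler can optimize across instruction boundaries.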

21 Absciss Code Optimizations
Similar to classical compiler optimizations
– Constant folding
    r8 = addsu( 255, r4 ); becomes r8 = 255;
– Simulation dead code removal
    r6 = 0; r6 = r12; becomes r6 = r12;
– ...
Exploit host compiler optimizations
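
The dead code removal case (r6 = 0; r6 = r12; reduced to r6 = r12;) can be sketched as a backward scan over straight-line assignments. This is a textbook dead-assignment pass written for illustration and not Absciss's implementation. The Assign type and removeDeadAssignments are hypothetical names, and the sketch assumes assignments have no side effects and that every register is live at the end of the block.

```cpp
#include <cassert>   // for the checks in the usage note
#include <set>
#include <string>
#include <vector>

// One straight-line assignment: dest = f(srcs). An empty srcs list
// stands for a constant right-hand side.
struct Assign {
    std::string dest;
    std::vector<std::string> srcs;
};

// Removes assignments whose result is overwritten before it is ever read.
std::vector<Assign> removeDeadAssignments(const std::vector<Assign>& block) {
    // Registers that, from the current point on, are written again
    // before any read.
    std::set<std::string> overwritten;
    std::vector<Assign> keptReversed;
    for (auto it = block.rbegin(); it != block.rend(); ++it) {
        if (overwritten.count(it->dest) != 0) {
            continue;  // value never observed: drop the assignment
        }
        keptReversed.push_back(*it);
        overwritten.insert(it->dest);
        // Sources are read before dest is written, so they are live here.
        for (const auto& src : it->srcs) {
            overwritten.erase(src);
        }
    }
    return std::vector<Assign>(keptReversed.rbegin(), keptReversed.rend());
}
```

In a simulator, an assignment that also produces an event or a memory write would have to be kept, which is one reason a pass of this kind is restricted to side-effect-free statements.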

22 Performance Example 1
int f(int n, char *v1, char *v2) {
    int i;
    int sum = 0;
    for (i = 0; i < n; i++) {
        sum = sum + v1[i]*v2[i];
    }
    return sum;
}
...
f(...)
...
[Chart labels: generation, compilation, execution]

23 Performance Example 2
Inverse Discrete Cosine Transform
static void idctrow(short *blk) {
    int x0, x1, x2, x3, x4, x5, x6, x7, x8;
    if (!((x1 = blk[4]<<11) | (x2 = blk[6]) | (x3 = blk[2]) |
          (x4 = blk[1]) | (x5 = blk[7]) | (x6 = blk[5]) | (x7 = blk[3]))) {
        blk[0]=blk[1]=blk[2]=blk[3]=blk[4]=blk[5]=blk[6]=blk[7]=blk[0]<<3;
        return;
    }
    x0 = (blk[0]<<11) + 128;
    x8 = W7*(x4+x5);
    x4 = x8 + (W1-W7)*x4;
    x5 = x8 - (W1+W7)*x5;
    ...

24 Performance Example 2
[Chart labels: -O0, -O1, tmsim]

25 A Code Tuning Tool
Fine grain tuning needs to check the assembly code and profiling data
– Did the compiler perform properly?
– Feedback to set pragma and other source code transformations
– Direct use of the scheduler,...
Checking usually performed by hand
Extends use of Salto

26 A Graphical Interface to Salto
[Figure labels: server, SALTO, Analysis and optimizing algorithms, Communication layer, Client 1, Client 2, ..., Client N, assembly source, assembly, network]

27 Current Prototype

28 Summary of the Approach
Salto machine description allows for a large range of uses
Assembly code better than binary for optimization
Syntax issue is difficult
Information from the compiler needed
Salto lacks support for writing optimizations
Abstract optimization algorithms but not enough
In some cases need to mix code generation and scheduling

29 Related Works
Optimizations and compilers
– ISDL (http://caa.lcs.mit.edu/caa/)
– SPAM (http://www.ee.princeton.edu/spam)
– ZEPHYR (http://www.cs.virginia.edu/zephyr/)
– IMPACT (http://www.crhc.uiuc.edu/Impact/)
– Trimaran (http://www.trimaran.org/)
– ...
Compiled simulation
– AXYS,...
Assembly related tuning tools
– VTune for x86,...

30 Salto2
Extends and specializes Salto
Better suited for code optimizations
Global scheduling techniques and register allocation
Better integration with compiler back-end

31 Basic Principles
Abstracting algorithms
Support multiple alternatives on the same piece of code
Better integration with compilers
User interface
Capabilities for simple code generation
High level code abstractions (loops, superblocks,...)

32 Structure Overview
[Figure labels: Architecture Description, D → M, Architecture Model, Intermediate representation, Opt 1, Opt 2, Opt n, P → RI, Text Input, D → Ass (Emit), Optimized Program, interface to IR, Interfaces, External Infrastructure, User interface, G.U.I., Intermediate Code]

33 Conclusion
There is room for new compiler infrastructures
Optimizations more important than code generation
A black-box approach is not enough, interaction with the programmer must be easier
Implementing new optimizing algorithms must be fast
Separate parts, in the compiler infrastructure, that move faster (processor implementations move faster than ISA)
Must be integrated in the processor design flow
