
1 Writing software for embedded systems Part I: Introduction, design flow Part II: The role of compilers Alessandro Dalla Torre, Andrea Marongiu {alessandro.dallatorre, andrea.marongiu3}@unibo.it

2 Introduction Writing software for embedded systems

3 Writing software for embedded systems Traditional method: microprocessor emulation. –The software engineer: 1. develops the code on a PC or workstation; 2. uses the emulator as a window into the system. Alternative approach: –ready-built prototype board; –a way to download and debug the code on the target board has to be supplied.

4 The Compilation Process Middleware and OS integration. The toolchain: a set of computer programs (tools) used to create a product (typically another program). Native versus cross-compilers. Run-time libraries. –Processor/architecture dependent (system calls): memory allocation, task control, semaphores. –I/O dependent. Writing and linking additional libraries. –Embedded OSes like RTEMS. –Communication libraries, for instance a message-passing support library like MP-Queue.

5 Downloading binary code into the target platform Serial lines, parallel ports: 1. The host is connected to the onboard debugger via a serial communications port. 2. A download command sends the file to the target debugger. 3. The target debugger converts the ASCII format back into binary and loads it at the correct location. Alternative: burn the program into EPROM or another form of non-volatile memory such as FLASH. If we are simulating on a virtual platform (emulator), the cross-compiled binary file has to be placed in the correct path. From that location it will be picked up and the code loaded at bootstrap by the simulator: –Possibly a different binary may be bound to a different virtual core.

6 Instruction Set Simulator (ISS) An Instruction Set Simulator (ISS) is a simulation model, usually coded in a high-level language. It mimics the behavior of another hardware device or microprocessor by: –"reading" instructions; –maintaining internal variables which represent the processor's registers. The basic loop for each target instruction is: –Fetch –Execute –Calculate new address. The number of host instructions needed for this loop depends on the hardware, but simulating a single target instruction always requires multiple host instructions.
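To make the fetch / execute / calculate-new-address loop concrete, here is a minimal ISS sketch in C for an invented accumulator machine. The opcodes, state layout and program are illustrative assumptions, not any real ISA or simulator.

#include <stdint.h>
#include <stdio.h>

/* Made-up 4-instruction ISA, for illustration only. */
enum { OP_LOADI = 0, OP_ADD = 1, OP_STORE = 2, OP_HALT = 3 };

typedef struct {
    uint32_t pc;        /* simulated program counter      */
    int32_t  acc;       /* simulated accumulator register */
    int32_t  mem[256];  /* simulated data memory          */
} cpu_state_t;

int main(void)
{
    /* Each simulated instruction: opcode, operand. */
    static const int32_t program[][2] = {
        { OP_LOADI, 5 }, { OP_ADD, 2 }, { OP_STORE, 0 }, { OP_HALT, 0 }
    };
    cpu_state_t cpu = { 0, 0, { 0 } };

    for (;;) {
        /* Fetch */
        int32_t op  = program[cpu.pc][0];
        int32_t arg = program[cpu.pc][1];
        /* Calculate new address */
        cpu.pc++;
        /* Execute: one target instruction costs many host instructions */
        if (op == OP_LOADI)      cpu.acc = arg;
        else if (op == OP_ADD)   cpu.acc += arg;
        else if (op == OP_STORE) cpu.mem[arg] = cpu.acc;
        else break;              /* OP_HALT */
    }
    printf("mem[0] = %d\n", cpu.mem[0]);  /* prints 7 */
    return 0;
}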

7 Debugging techniques High level language simulation –Directly on the host machine used for development, by means of Linux threads/processes and IPC facilities. Task level debugging –The operating system may provide breakpointing facilities on system conditions, such as: events, messages, interrupt routines. Low level simulation –ISS simulation, slower but more accurate (even cycle accurate). Onboard debugging –The code has to be downloaded onto the target evaluation board; –A remote terminal can be attached and run on the host machine to monitor program execution.

8 Cross Compiler A compiler capable of creating executable code for a platform other than the one on which the compiler runs. Its fundamental use is to separate the build environment from the target environment. –Embedded systems have limited resources, often not powerful enough to run a compiler or a development environment (debugger). –A single build environment can be set up to compile for different (multiple) targets.

9 GNU Toolchain The set of programming tools used for programming both applications and operating systems. A vital component in Linux kernel development. A standard tool when developing software for embedded systems. Projects included in the GNU toolchain are: –GNU make –GNU Compiler Collection (GCC) –GNU Binutils

10 Compiling code

#include <stdio.h>                       /* handled by the pre-processor */
#define TRUE 1
#define FALSE 0
main()
{
  int i;
  i = 5 * 2;                             /* handled by the compiler */
  printf("5 times 2 is %d.\n", i);       /* handled by library routines */
  printf("TRUE is %d.\n", TRUE);
  printf("FALSE is %d.\n", FALSE);
}

11 Compiling code Source code and header files feed the pre-processor; the compiler turns the pre-processed source into assembler (.s), the assembler turns it into object code, and the linker produces the executable file. The pre-processor and compiler form the compiler proper; the assembler and linker are provided by binutils.

12 Compiling code The pre-processor handles –Macros (#define) –Inclusions (#include) –Conditional code inclusion (#ifdef, #if) –Language extensions (#pragma). The compiler processes source code and turns it into assembler modules. The assembler converts them into binary object code. The linker takes the object files and searches library files to find the routines they call. It resolves the address references and incorporates any symbolic information to create an executable file.

13 Runtime Libraries A compiler generates only a small subset of a high-level language's facilities and commands from built-in routines. It relies on libraries to provide the full range of functions that the language offers: –Processor dependent: mathematical functions, string manipulation and similar features that use the processor and do not need to communicate with peripherals; –I/O dependent: define the hardware that the software needs to access. The library routine either drives the hardware directly or calls the operating system to perform its task; –System calls: typical routines are those which dynamically allocate memory, task control commands, semaphore usage, etc.; –Exit routines: used to terminate programs and free up the memory used by the application.
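As an illustration of these library categories, the small hosted C program below exercises one routine from each; it is an ordinary desktop example (link with -lm for sqrt), not embedded-specific code.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    double r = sqrt(2.0);                 /* processor-dependent math routine      */
    int *buf = malloc(16 * sizeof *buf);  /* dynamic memory allocation category    */
    if (buf == NULL)
        exit(EXIT_FAILURE);               /* exit routine: terminate, free memory  */
    buf[0] = 1;
    printf("sqrt(2) = %f, buf[0] = %d\n", r, buf[0]);  /* I/O-dependent routine    */
    free(buf);
    return 0;
}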

14 Part II: The role of compilers A compiler translates a program written in a programming language into assembly language.

15 High-level View of a Compiler A compiler turns source code into machine code, reporting errors along the way. It must recognize legal (and illegal) programs –Understand and preserve the meaning of the source program. It must generate correct code –Map the functionality of the source program to the target (usually the ISA of some computer system). It must introduce optimizations on the original code: 1. Performance/speed 2. Code size 3. Power consumption

16 Traditional Two-pass Compiler Uses an intermediate representation (IR). The front end maps legal source code into IR; the back end maps IR into target machine code (source code → front end → IR → back end → machine code, with errors reported by the front end).

17 Multiple front and back ends Several front ends (Fortran, C/C++, Java, Pascal) share a common IR that feeds several back ends (target 1, 2, 3). All language-specific knowledge must be encoded in each front end, all features must be encoded in the single IR, and all target-specific knowledge must be encoded in each back end.

18 Anatomy of a Compiler Program (character stream) → Lexical Analyzer (Scanner) → Token Stream → Syntax Analyzer (Parser) → Parse Tree → Semantic Analyzer → Intermediate Representation → Code Optimizer → Optimized Intermediate Representation → Code Generator → Assembly code. The scanner, parser and semantic analyzer form the FRONT END, the code optimizer is the MIDDLE END, and the code generator is the BACK END.

19 Lexical Analyzer (Scanner) Input: 4*(11+-22). Token stream: Num(4) mul_op lpar_op Num(11) add_op Num(-22) rpar_op. Error examples: 18..23 is not a number; val#ue is rejected because variable names cannot contain the '#' character.
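A minimal sketch, assuming a hand-written scanner over ASCII input, of how this example could be tokenized; the token names mirror the slide and the error handling is deliberately crude.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

static const char *src = "4*(11+-22)";

int main(void)
{
    const char *p = src;
    while (*p) {
        if (isdigit((unsigned char)*p) ||
            (*p == '-' && isdigit((unsigned char)p[1]))) {
            char *end;
            long v = strtol(p, &end, 10);   /* read a (possibly signed) number */
            printf("Num(%ld) ", v);
            p = end;
        } else if (*p == '*') { printf("mul_op ");  p++; }
        else if (*p == '+')   { printf("add_op ");  p++; }
        else if (*p == '(')   { printf("lpar_op "); p++; }
        else if (*p == ')')   { printf("rpar_op "); p++; }
        else {
            fprintf(stderr, "unexpected character '%c'\n", *p);
            return 1;
        }
    }
    printf("\n");   /* prints: Num(4) mul_op lpar_op Num(11) add_op Num(-22) rpar_op */
    return 0;
}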

20 Syntax Analyzer (Parser) E.g. x + 2 - y. The resulting parse tree (goal → expr → expr op term, and so on) contains a lot of unneeded information. Grammar: 1. goal → expr 2. expr → expr op term 3. | term 4. term → number 5. | id 6. op → + 7. | -
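A minimal recursive-descent sketch of this grammar, with the left-recursive expr rule rewritten as a loop; the input string, the single-character identifiers and the printed trace are illustrative assumptions.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

static const char *p = "x + 2 - y";

static void skip(void) { while (*p == ' ') p++; }

static void term(void)                      /* term -> number | id */
{
    skip();
    if (isdigit((unsigned char)*p)) {
        char *end;
        long v = strtol(p, &end, 10);
        p = end;
        printf("term: number %ld\n", v);
    } else if (isalpha((unsigned char)*p)) {
        printf("term: id %c\n", *p++);
    } else {
        fprintf(stderr, "syntax error at '%s'\n", p);
        exit(1);
    }
}

static void expr(void)                      /* expr -> expr op term | term */
{
    term();
    skip();
    while (*p == '+' || *p == '-') {        /* left recursion as iteration */
        printf("op: %c\n", *p++);
        term();
        skip();
    }
}

int main(void) { expr(); return 0; }        /* goal -> expr */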

21 Syntax Analyzer (Parser) Example of code the parser rejects:

int * foo(i, j, k))        ← extra parentheses
int i; int j;
{
  for(i=0; i<j) {          ← missing increment (not an expression where one is expected)
    fi(i>j)                ← "fi" is not a keyword
      return j;
}

22 Semantic Analyzer Example of code the semantic analyzer rejects:

int * foo(i, j, k)
int i; int j;
{
  int x;
  x = x + j + N;
  return j;
}

Errors flagged: type of k not declared, mismatched return type (j is an int but foo returns int *), uninitialized variable x used, undeclared variable N.

23 Anatomy of a Compiler (recap) The front end (Scanner, Parser, Semantic Analyzer) has produced the Intermediate Representation; next comes the MIDDLE END (Code Optimizer), then the BACK END (Code Generator).

24 Constant Propagation Uses of variables initialized with a constant value are replaced with the value itself. Here the use of y becomes 0:

int i, x, y;
x = 0; y = 0;
for(i = 0; i <= N; i++) {
  x = x + (4*a/b)*i + (i+1)*(i+1);
  x = x + b*y;        →   x = x + b*0;
}
return x;

25 Algebraic Simplification Simple algebraic expressions are evaluated and replaced with the resulting value. Here x + b*0 reduces to x:

int i, x, y;
x = 0; y = 0;
for(i = 0; i <= N; i++) {
  x = x + (4*a/b)*i + (i+1)*(i+1);
  x = x + b*0;        →   x = x;
}
return x;

26 Copy Propagation Targets of direct assignments of the form y = x are replaced with their values; for example, after y = x; the statement z = 3 + y; becomes z = 3 + x;. In the running example the useless copy x = x; simply remains for now:

int i, x, y;
x = 0; y = 0;
for(i = 0; i <= N; i++) {
  x = x + (4*a/b)*i + (i+1)*(i+1);
  x = x;
}
return x;

27 Common Subexpression Elimination A subexpression that occurs more than once is replaced with the use of a temporary variable initialized with the subexpression. Here (i+1) is computed once into t and (i+1)*(i+1) becomes t * t:

int i, x, y, t;
x = 0; y = 0;
for(i = 0; i <= N; i++) {
  t = i+1;
  x = x + (4*a/b)*i + t * t;
  x = x;
}
return x;

28 Dead Code Elimination Code that is never reached during execution, or assignments to variables that are never used, are considered dead and removed. Here y and the useless copy x = x; disappear:

int i, x, t;
x = 0;
for(i = 0; i <= N; i++) {
  t = i+1;
  x = x + (4*a/b)*i + t * t;
}
return x;

29 Loop Invariant Removal Expressions within a loop that are independent of the iteration count are moved outside the loop body and computed only once. Here (4*a/b) is hoisted into u:

int i, x, t, u;
x = 0;
u = (4*a/b);
for(i = 0; i <= N; i++) {
  t = i+1;
  x = x + u * i + t * t;
}
return x;

30 Anatomy of a Compiler (recap) After optimization, the BACK END (Code Generator) turns the Optimized Intermediate Representation into assembly code.

31 Code Generator The first role of the back end is to map the optimized code (IR) into the instructions of the target machine ISA, but it is not the only one: some optimizations (register allocation, instruction scheduling) need knowledge of details of the target architecture, so they too are implemented in the back end. For the optimized sumcalc below, the generated x86-64 assembly is shown on the right:

int sumcalc(int a, int b, int N)
{
  int i, x, t, u;
  x = 0;
  u = (4*a/b);
  for(i = 0; i <= N; i++) {
    t = i+1;
    x = x + u * i + t * t;
  }
  return x;
}

sumcalc:
  xorl %r8d, %r8d
  xorl %ecx, %ecx
  movl %edx, %r9d
  cmpl %edx, %r8d
  jg .L7
  sall $2, %edi
.L5:
  movl %edi, %eax
  cltd
  idivl %esi
  leal 1(%rcx), %edx
  movl %eax, %r10d
  imull %ecx, %r10d
  movl %edx, %ecx
  imull %edx, %ecx
  leal (%r10,%rcx), %eax
  movl %edx, %ecx
  addl %eax, %r8d
  cmpl %r9d, %edx
  jle .L5
.L7:
  movl %r8d, %eax
  ret

32 Back-end optimizations: Instruction scheduling Modern processors have many pipeline stages: –Pentium: 5 –Pentium Pro: 10 –Pentium IV (130nm): 20 –Pentium IV (90nm): 31. Different instructions take different amounts of time to execute. Most modern processors have multiple execution units (superscalar): –If the instruction sequence is right, multiple operations happen in the same cycle; –This makes it even more important to have the right instruction sequence. The scheduler reorders instructions so that pipeline stalls are minimized.

33 Data Dependency between Instructions If two instructions access the same variable, they can be dependent. Kinds of dependence: –True: write → read –Anti: read → write –Output: write → write. If two instructions are dependent: –their order of execution cannot be reversed; –this reduces the possibilities for scheduling.
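A small C illustration of the three dependence kinds; the computations themselves are arbitrary, only the order of reads and writes to the variable a matters.

#include <stdio.h>

int a, b = 1, c;

void deps(void)
{
    a = b + 1;   /* S1: writes a                                            */
    c = a + 2;   /* S2: reads a  -> true (write -> read) dependence on S1   */
    a = c + 3;   /* S3: writes a -> anti (read -> write) dependence on S2   */
                 /*                 and output (write -> write) dep. on S1  */
}

int main(void) { deps(); printf("a = %d\n", a); return 0; }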

34 Representing Dependencies Dependencies are represented with a dependence DAG (Directed Acyclic Graph), one per basic block. Nodes are instructions, edges represent dependencies, and each edge is labeled with its latency. Example:

1: r2 = *(r1 + 4)
2: r3 = *(r1 + 8)
3: r4 = r2 + r3
4: r5 = r2 - 1

Instruction 3 depends on the loads 1 and 2, and instruction 4 depends on load 1.

35 Example

1: lea var_a, %rax
2: add $4, %rax
3: inc %r11
4: mov 4(%rsp), %r10
5: add %r10, 8(%rsp)
6: and 16(%rsp), %rbx
7: imul %rax, %rbx

The instructions take different numbers of cycles to produce their results (between 1 and 4 cycles each); executed in this order, the sequence takes 14 cycles.

36 List Scheduling Algorithm Create a dependence DAG of the basic block, then: –Do a topological sort of the dependence DAG; –Consider when an instruction can be scheduled without causing a stall; –Schedule the instruction if it causes no stall and all its predecessors have already been scheduled. Topological sort: READY = nodes with no predecessors; loop until READY is empty. Heuristics for selecting from the READY list: –pick the node with the longest path to a leaf in the dependence graph; –pick the node with the most immediate successors.
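A minimal sketch of the selection step, assuming the 4-node DAG from the "Representing Dependencies" slide with made-up edge latencies: nodes whose predecessors have all been scheduled form the READY set, and the one with the longest latency-weighted path to a leaf is picked first. Stall cycles are not modelled here.

#include <stdio.h>

#define N 4

/* edge[i][j] = assumed latency of dependence i -> j, 0 if no edge.
 * 1: r2 = *(r1+4)   2: r3 = *(r1+8)   3: r4 = r2+r3   4: r5 = r2-1 */
static int edge[N][N] = {
    /* to:  1  2  3  4 */
    {       0, 0, 3, 3 },    /* from 1 */
    {       0, 0, 3, 0 },    /* from 2 */
    {       0, 0, 0, 0 },    /* from 3 */
    {       0, 0, 0, 0 },    /* from 4 */
};

/* longest latency-weighted path from node i to a leaf */
static int path(int i)
{
    int best = 0;
    for (int j = 0; j < N; j++)
        if (edge[i][j] && edge[i][j] + path(j) > best)
            best = edge[i][j] + path(j);
    return best;
}

int main(void)
{
    int done[N] = { 0 };
    for (int step = 0; step < N; step++) {
        int pick = -1;
        for (int i = 0; i < N; i++) {
            if (done[i]) continue;
            int ready = 1;
            for (int j = 0; j < N; j++)      /* READY: all predecessors scheduled */
                if (edge[j][i] && !done[j]) ready = 0;
            if (ready && (pick < 0 || path(i) > path(pick)))
                pick = i;                    /* heuristic: longest path to a leaf */
        }
        if (pick < 0) break;                 /* cannot happen for a DAG */
        done[pick] = 1;
        printf("schedule instruction %d\n", pick + 1);
    }
    return 0;
}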

37 Example

1: lea var_a, %rax
2: add $4, %rax
3: inc %r11
4: mov 4(%rsp), %r10
5: add %r10, 8(%rsp)
6: and 16(%rsp), %rbx
7: imul %rax, %rbx
8: mov %rbx, 16(%rsp)
9: lea var_b, %rax

In the dependence DAG each node is annotated with d, the longest latency-weighted path to a leaf, and f, the number of immediate successors; these values drive the selection heuristics.

38 Example Evolution of the READY set as instructions are picked (final schedule: 6, 1, 2, 4, 7, 3, 5, 8, 9):
{ 6, 1, 4, 3 } → pick 6 → { 1, 4, 3 } → pick 1 → { 2, 4, 3 } → pick 2 → { 4, 7, 3 } → pick 4 → { 7, 3, 5 } → pick 7 → { 3, 5, 8, 9 } → pick 3 → { 5, 8, 9 } → pick 5 → { 8, 9 } → pick 8 → { 9 } → pick 9 → { }

39 Example With the scheduled order 6, 1, 2, 4, 7, 3, 5, 8, 9 the same nine instructions execute in 9 cycles, versus 14 cycles for the original order 1 to 9.

40 Anatomy of a Compiler (recap) The full pipeline again: FRONT END (Scanner, Parser, Semantic Analyzer) → MIDDLE END (Code Optimizer) → BACK END (Code Generator) → assembly code.

41 GCC Internals Multiple front-ends Common intermediate representation Retargetable!

42 IR – Control Flow Graph Most analysis/optimization passes inside a compiler are performed on CFGs. A CFG is a directed graph which models the flow of control in the program. Each node corresponds to a basic block, i.e. a sequence of non-branch instructions. Edges correspond to possible transfers of control between blocks.
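A hedged sketch of the data structures such a representation might use; the field names, and the two-successor limit, are illustrative assumptions for a simple IR, not GCC's actual internals.

/* One non-branch instruction inside a block. */
typedef struct instr {
    struct instr *next;      /* next instruction in the same basic block */
    /* ... opcode, operands ... */
} instr_t;

/* A basic block: straight-line code with edges to successor blocks. */
typedef struct basic_block {
    instr_t            *first, *last;   /* instruction sequence                 */
    struct basic_block *succ[2];        /* at most two outgoing edges here      */
    int                 nsucc;          /* (fall-through and/or branch target)  */
} basic_block_t;

/* The whole CFG: analyses walk the blocks starting from the entry. */
typedef struct cfg {
    basic_block_t *entry;
    basic_block_t *exit;
} cfg_t;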

43 Control Flow Graph

int add(n, k) {
  s = 0; a = 4; i = 0;
  if (k == 0)
    b = 1;
  else
    b = 2;
  while (i < n) {
    s = s + a*b;
    i = i + 1;
  }
  return s;
}

The corresponding CFG has the blocks: [s = 0; a = 4; i = 0;], [k == 0], [b = 1;], [b = 2;], [i < n], [s = s + a*b; i = i + 1;], [return s;].

44 Basic Block Construction Start with the instruction-level control-flow graph (each basic block contains a single instruction), then visit all edges in the graph and merge adjacent nodes when: –there is only one edge out of the first node; –there is only one edge into the second node. For example, the single-instruction blocks s = 0; and a = 4; are merged into one block.

45 Applying this to add(): the single-instruction blocks s = 0;, a = 4; and i = 0; merge into one basic block, and s = s + a*b; merges with i = i + 1;, yielding the CFG shown on the Control Flow Graph slide.

46 Optimizing for parallel architectures Automatic Loop Parallelization

47 Types of Parallelism –Instruction Level Parallelism (ILP): exploited by scheduling and hardware –Task Level Parallelism (TLP): mainly by hand –Loop Level Parallelism (LLP) or Data Parallelism: generated by hand or by the compiler –Pipeline Parallelism: hardware or streaming –Divide and Conquer Parallelism: recursive functions

48 Loop parallelization Why loops? –90% of the execution time is in 10% of the code, mostly in loops; –if parallel, loops can give good performance and load balancing; –they are relatively easy to analyze. How to automatically parallelize loops? Find FORALL loops among the FOR loops using data dependence analysis. Definition: a loop-carried dependence is a dependence that crosses a loop iteration boundary. If there are no loop-carried dependences, the loop is parallelizable.
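Two loops in C illustrating the dependence test: the first has no loop-carried dependence (each iteration touches only its own element) and could be turned into a FORALL; in the second, iteration i reads the value written by iteration i-1, so the iterations cannot run in parallel as written.

#include <stdio.h>

#define N 100
int A[N], B[N];

void independent(void)
{
    for (int i = 0; i < N; i++)
        A[i] = A[i] + 1;          /* no loop-carried dependence: parallelizable */
}

void carried(void)
{
    for (int i = 1; i < N; i++)
        B[i] = B[i - 1] + 1;      /* loop-carried dependence across iterations */
}

int main(void)
{
    independent();
    carried();
    printf("A[0] = %d, B[N-1] = %d\n", A[0], B[N - 1]);
    return 0;
}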

49 Programmer Defined Parallel Loop FORALL –No “loop carried dependences” –Fully parallel FORACROSS –Some “loop carried dependences”

50 Loop Splitting Example:

FORPAR I = 0 to N
  A[I] = A[I] + 1

With block distribution, the program gets mapped into:

Iters = ceiling(N/NUMPROC);
FOR P = 0 to NUMPROC-1
  FOR I = P*Iters to MIN((P+1)*Iters, N)
    A[I] = A[I] + 1
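A C rendering of the block-distributed loop above, assuming the processor id P and the processor count NUMPROC would normally be supplied by the parallel runtime; here they are plain parameters so the sketch runs sequentially.

#include <stdio.h>

#define N 10          /* small so the output is easy to check */
int A[N];

/* One processor's share of the FORPAR loop A[I] = A[I] + 1 */
void block_chunk(int P, int NUMPROC)
{
    int iters = (N + NUMPROC - 1) / NUMPROC;          /* ceiling(N/NUMPROC) */
    int lo = P * iters;
    int hi = (P + 1) * iters < N ? (P + 1) * iters : N;
    for (int i = lo; i < hi; i++)
        A[i] = A[i] + 1;
}

int main(void)
{
    for (int P = 0; P < 4; P++)   /* stand-in for 4 parallel processors */
        block_chunk(P, 4);
    for (int i = 0; i < N; i++)
        printf("%d ", A[i]);      /* every element incremented exactly once */
    printf("\n");
    return 0;
}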

51 Code transformation A parallelizing compiler transforms the sequential code:

int main() {
  ...
  for (i=0; i<N; i++)
    for (j=i; j<N; j++)
      A[i][j] = 1;
  ...
}

into parallel code in which the loop body is outlined into a routine and the iteration space is split among processors:

void parallel_routine() {
  for (i=N*cpuID/nprocs; i<N*(cpuID+1)/nprocs; i++)
    for (j=i; j<N; j++)
      A[i][j] = 1.0;
}

int start() {
  ...
  do_all();
  ...
}

52 Runtime support to parallelization The parallel code relies on a runtime library:

void parallel_routine() {
  for (i=N*cpuID/nprocs; i<N*(cpuID+1)/nprocs; i++)
    for (j=i; j<N; j++)
      A[i][j] = 1.0;
}

int start() {
  // sequential code
  do_all();
  // sequential code
}

void main() {
  initenv();
  if (cpuID == MASTER) {
    // gather workers on barrier
    start();
    // release workers
  } else {
    // spin until work provided
    parallel_routine();
    // spin until work provided
  }
}

void doall() {
  // release workers
  parallel_routine();
  // gather workers on barrier
}

// Synchronization facilities
// Lock implementation
// Barrier implementation

53 Cooperative approaches: OpenMP The compiler may not be able to do the parallelization in the way you would like to see it: –A loop is not parallelized because the data dependence analysis cannot determine whether it is safe to parallelize or not; –The granularity is not high enough because the compiler lacks the information to parallelize at the highest possible level. This is where explicit parallelization through OpenMP directives comes into the picture.
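A minimal OpenMP example of this explicit approach: the programmer asserts that the iterations are independent, so the compiler no longer has to prove it. Build with gcc -fopenmp; the array and loop are illustrative.

#include <stdio.h>

#define N 1000
int A[N];

int main(void)
{
    /* the directive tells the compiler the iterations can run in parallel */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        A[i] = A[i] + 1;

    printf("A[0] = %d\n", A[0]);
    return 0;
}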

54 OpenMP and GCC Language extensions for shared memory concurrency (#pragma) Supports C, C++ and Fortran Embedded directives specify –Parallelism –Data sharing semantics –Work sharing semantics Standard and increasingly popular

55 OpenMP programming model A typical OpenMP implementation relies on the pthreads library; with MPSoCs the SPMD paradigm is also used. Based on fork-join semantics: –The master thread spawns teams of child threads; –This allows sequential and parallel execution to alternate.

56 Supporting OpenMP in GCC Recognize OpenMP pragmas –Within the parser in each frontend Lower into GIMPLE IR –Augment it with new OMP nodes Identify data sharing –The master thread creates a local structure which contains all the data items marked for sharing and passes its address to slaves Identify work sharing –Split computation between different cores (omp for) Identify parallel regions –Outline the body of parallel regions into functions that are used as arguments to the thread creation routines
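A purely conceptual C sketch of the outlining and data-sharing step described above; the names are invented, the "threads" are simulated by a sequential loop, and real GCC/libgomp code generation (the functions it emits and the runtime calls it makes) differs.

#include <stdio.h>

#define N 8
#define NTHREADS 4

/* structure the master fills with the data items marked for sharing */
struct omp_data { int *A; int n; };

/* outlined body of the parallel region, one call per "thread" */
static void parallel_region_fn(struct omp_data *d, int tid)
{
    /* work sharing: each thread takes a contiguous chunk of the iterations */
    int chunk = (d->n + NTHREADS - 1) / NTHREADS;
    int lo = tid * chunk;
    int hi = lo + chunk < d->n ? lo + chunk : d->n;
    for (int i = lo; i < hi; i++)
        d->A[i] = tid;
}

int main(void)
{
    int A[N] = { 0 };
    struct omp_data d = { A, N };                /* master packs shared data  */
    for (int tid = 0; tid < NTHREADS; tid++)     /* stand-in for thread spawn */
        parallel_region_fn(&d, tid);
    for (int i = 0; i < N; i++)
        printf("%d ", A[i]);                     /* prints 0 0 1 1 2 2 3 3 */
    printf("\n");
    return 0;
}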

