Performance Optimization for Embedded Software Presented by: Yingjun Lyu
What is Software Optimization? The process of modifying a software system to make it work more efficiently or use fewer resources.
Do you Optimize your Program?
When to Optimize? A better approach: design first, code from the design, then profile the code. Keep performance goals in mind throughout.
Levels of Optimization: design level; algorithms and data structures; source code level (e.g. while(1) vs for(;;)); build level; compile level; assembly level; run time.
The Code Optimization Process: Build -> Optimize -> Check outputs. Alternatively: Build -> Generate tests -> Optimize -> Check outputs.
Basic C Optimization Techniques: Choose the right data type. Example: a processor does not support 32-bit multiplication, so using a 32-bit type in a multiply produces a sequence of 16-bit operations. What if only 16-bit precision is needed? Solution: use intrinsics to leverage embedded processor features.
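A minimal sketch of the data-type point, assuming a target that has a native 16x16->32 multiply but no 32x32 multiply. The function names are hypothetical. It uses plain narrow types rather than a vendor intrinsic, because intrinsic names differ from one compiler to another; on a real target an intrinsic could take the place of the multiply in scale_narrow.

```c
#include <stdint.h>

/* Operands declared as 32-bit: on a target without a 32x32 multiplier the
   compiler may expand this into several 16-bit multiplies and adds. */
int32_t scale_wide(int32_t sample, int32_t gain)
{
    return sample * gain;
}

/* Operands declared as 16-bit: the product of two int16_t values always
   fits in 32 bits, so a single 16x16->32 multiply is sufficient. */
int32_t scale_narrow(int16_t sample, int16_t gain)
{
    return (int32_t)sample * (int32_t)gain;
}
```

Both functions return the same value whenever the inputs fit in 16 bits; only the work the compiler has to generate differs.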
An intrinsic function is a function available for use in a given programming language whose implementation is handled specially by the compiler.
Function calling conventions Definition: an implementation-level (low-level) scheme for how callees receive parameters from their caller and how they return a result. Stack-based or Register-based?
Restrict and pointer aliasing: when the compiler knows that pointers do not alias, it can schedule memory accesses in parallel.
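As an illustration, the C99 restrict qualifier lets the caller promise that the pointers never overlap. The function names below are hypothetical, and whether the compiler actually reorders or vectorizes the loop depends on the target and the optimization level.

```c
#include <stddef.h>

/* Without restrict the compiler must assume dst may overlap a or b, so a
   store to dst[i] could change a value that a later iteration loads. */
void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* With restrict the caller guarantees that dst, a and b do not overlap,
   so loads and stores from different iterations may be overlapped. */
void add_arrays_restrict(float *restrict dst, const float *restrict a,
                         const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Calling add_arrays_restrict with overlapping buffers is undefined behavior, so the qualifier should only be added where the guarantee really holds.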
Loops: Communicate loop count information by specifying the loop count bounds to the compiler. Example: a hardware loop, which keeps the loop body in a buffer or prefetches it.
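Many embedded compilers accept vendor-specific pragmas for stating loop bounds; those are not shown here because their spelling varies by toolchain. The sketch below shows a portable way to expose the same information in plain C. FRAME_LEN and the function names are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define FRAME_LEN 64u /* hypothetical fixed frame size */

/* Trip count known only at run time: the compiler must allow for n == 0
   and for any remainder when it unrolls or sets up a hardware loop. */
int32_t sum_any(const int16_t *x, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Trip count is a compile-time constant: nonzero and a multiple of 4. */
int32_t sum_frame(const int16_t *x)
{
    int32_t acc = 0;
    for (size_t i = 0; i < FRAME_LEN; i++)
        acc += x[i];
    return acc;
}

/* Trip count passed as a number of groups of four, which makes the
   "multiple of 4" property visible in the loop bound itself. */
int32_t sum_quads(const int16_t *x, size_t quads)
{
    int32_t acc = 0;
    for (size_t i = 0; i < quads * 4u; i++)
        acc += x[i];
    return acc;
}
```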
General Loop Transformation Loop unrolling Multisampling Partial summation Software pipelining
Loop unrolling: A loop body is duplicated one or more times. The loop count is then reduced by the same factor to compensate.
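A hand-written sketch of unrolling by a factor of four, with hypothetical function names. Compilers often perform this transformation themselves; the manual version is shown only to make the structure visible.

```c
#include <stddef.h>
#include <stdint.h>

/* Original loop: one element per iteration. */
int32_t sum_rolled(const int16_t *x, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Unrolled by 4: the body appears four times and the main loop runs
   about n / 4 times. A second loop handles the leftover elements. */
int32_t sum_unrolled4(const int16_t *x, size_t n)
{
    int32_t acc = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc += x[i];
        acc += x[i + 1];
        acc += x[i + 2];
        acc += x[i + 3];
    }
    for (; i < n; i++) /* remainder when n is not a multiple of 4 */
        acc += x[i];
    return acc;
}
```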
Multisampling: independent output values whose input source data overlap are computed together, so that each shared input value is loaded only once.
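One way to picture multisampling is a FIR-style filter in which two neighboring outputs are computed in the same inner loop. The sketch below assumes the output count is even and that x holds at least n_out + taps - 1 samples; the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference version: each output reloads every input it needs. */
void fir_basic(const int16_t *x, const int16_t *h, int32_t *y,
               size_t n_out, size_t taps)
{
    for (size_t n = 0; n < n_out; n++) {
        int32_t acc = 0;
        for (size_t k = 0; k < taps; k++)
            acc += (int32_t)h[k] * x[n + k];
        y[n] = acc;
    }
}

/* Two outputs per outer iteration. y[n] uses x[n + k] and y[n + 1] uses
   x[n + k + 1], so the sample loaded for y[n + 1] at tap k is reused for
   y[n] at tap k + 1. Each coefficient is also loaded once for both sums.
   Assumes n_out is even. */
void fir_multisample2(const int16_t *x, const int16_t *h, int32_t *y,
                      size_t n_out, size_t taps)
{
    for (size_t n = 0; n + 1 < n_out; n += 2) {
        int32_t acc0 = 0;
        int32_t acc1 = 0;
        int16_t x0 = x[n];
        for (size_t k = 0; k < taps; k++) {
            int16_t x1 = x[n + k + 1];
            int16_t c = h[k];
            acc0 += (int32_t)c * x0;
            acc1 += (int32_t)c * x1;
            x0 = x1; /* carry the shared sample into the next tap */
        }
        y[n] = acc0;
        y[n + 1] = acc1;
    }
}
```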
Partial Summation: The computation for one output sum is divided into multiple smaller, or partial, sums.
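A small sketch of partial summation on a dot product, splitting the single accumulator into two independent ones. The function name is hypothetical. With integers the result is identical to the single-sum version; with floating point the changed order of additions can change rounding.

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product computed as two partial sums over even and odd indices.
   The two accumulators do not depend on each other, so their
   multiply-accumulate chains can run in parallel. */
int32_t dot_partial2(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t sum_even = 0;
    int32_t sum_odd = 0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        sum_even += (int32_t)a[i] * b[i];
        sum_odd += (int32_t)a[i + 1] * b[i + 1];
    }
    if (i < n) /* last element when n is odd */
        sum_even += (int32_t)a[i] * b[i];
    return sum_even + sum_odd; /* combine the partial sums at the end */
}
```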
Software pipelining: A sequence of instructions is transformed into a pipeline of several copies of that sequence
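Software pipelining is normally done by the compiler, but its structure can be sketched by hand. In the example below, with hypothetical function names, each loop iteration is split into a load stage and a multiply-and-store stage; the pipelined version loads the input for iteration i + 1 while finishing iteration i, with a prologue and an epilogue around the steady-state loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Straightforward loop: load, multiply and store in sequence. */
void scale_simple(const int16_t *in, int32_t *out, size_t n, int16_t gain)
{
    for (size_t i = 0; i < n; i++) {
        int16_t v = in[i];
        out[i] = (int32_t)v * gain;
    }
}

/* Two-stage pipeline written by hand. */
void scale_pipelined(const int16_t *in, int32_t *out, size_t n, int16_t gain)
{
    if (n == 0)
        return;
    int16_t loaded = in[0];              /* prologue: load for iteration 0 */
    for (size_t i = 0; i + 1 < n; i++) {
        int16_t next = in[i + 1];        /* stage 1 of iteration i + 1 */
        out[i] = (int32_t)loaded * gain; /* stage 2 of iteration i */
        loaded = next;
    }
    out[n - 1] = (int32_t)loaded * gain; /* epilogue: finish last iteration */
}
```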
Is there any cost for performance optimization?
Example: Loop Unrolling
Code Size Optimization. Why? Code size determines the amount of memory the code occupies at program run time; reducing it can also reduce the amount of instruction cache the device needs.
Compiler flags (configure the compiler): Optimize for code size, for example with the command-line option -Os in GCC. Optimize for performance, for example with -O3. -O3 or -Os? Critical code is optimized for speed, and the bulk of the code may be optimized for size.
“Premium encodings”: The most commonly used instructions can be represented in a reduced binary footprint. Example: integer add instructions on a 32-bit device are represented with a premium 16-bit encoding. Drawback: performance degradation.
Tuning the ABI for code size ABI: application binary interface, an interface between a given program and the OS, system libraries, etc. To reduce code size, there are two areas of interest: calling convention and alignment
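The alignment side can be illustrated with structure layout. In the sketch below the struct and function names are hypothetical, and the exact sizes depend on the alignment rules of the ABI in use; on an ABI that aligns 32-bit fields to 4 bytes, the first layout needs padding after each single-byte field while the second does not.

```c
#include <stddef.h>
#include <stdint.h>

/* Field order that forces padding when uint32_t must be 4-byte aligned:
   padding follows id, and more padding follows flags. */
struct sensor_padded {
    uint8_t id;
    uint32_t timestamp;
    uint8_t flags;
};

/* Same fields with the widest member first, so the small members share
   one aligned slot. */
struct sensor_reordered {
    uint32_t timestamp;
    uint8_t id;
    uint8_t flags;
};

/* Bytes saved per object by reordering; the value is ABI dependent. */
size_t padding_saved(void)
{
    return sizeof(struct sensor_padded) - sizeof(struct sensor_reordered);
}
```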
Calling Convention: Fewer instructions are required to set up parameters passed via registers than parameters passed via the stack.
Space-time Tradeoff: Loop unrolling can increase cache misses and register pressure; the effect depends on the unrolling factor.
Improve Performance through Memory Layout Optimization. Vectorization of loops: computation performed across multiple loop iterations can be combined into single vector instructions.
An important concern for vectorizing: loop dependence analysis (array accesses, data modification, conditional statements, etc.). Challenge: pointer aliasing. Solution: apply the restrict keyword to pointers that do not alias.
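To make the dependence question concrete, the sketch below contrasts a loop whose iterations are independent with one that carries a dependence from each iteration to the next. The function names are hypothetical, and whether a given compiler vectorizes the first loop depends on the target and flags.

```c
#include <stddef.h>

/* Independent iterations: y[i] depends only on x[i] and the old y[i].
   restrict tells the compiler that x and y do not overlap, which removes
   the aliasing obstacle to vectorization. */
void scale_add(float *restrict y, const float *restrict x, float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Loop-carried dependence: iteration i reads the value that iteration
   i - 1 just wrote, so the iterations cannot simply be combined into
   one vector instruction. */
void prefix_sum(float *y, size_t n)
{
    for (size_t i = 1; i < n; i++)
        y[i] += y[i - 1];
}
```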
Array-of-structures or Structure-of-arrays? Hint: Memory is most efficiently accessed sequentially.
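The two layouts can be sketched as follows. N_POINTS and all type and function names are assumptions made for illustration. When a loop touches only one field, the structure-of-arrays layout reads consecutive memory, while the array-of-structures layout strides over the unused fields.

```c
#include <stddef.h>

#define N_POINTS 8 /* hypothetical element count */

/* Array-of-structures: x, y, z of one point are adjacent in memory. */
struct point_aos {
    float x;
    float y;
    float z;
};

/* Structure-of-arrays: all x values are adjacent, then all y, then all z. */
struct points_soa {
    float x[N_POINTS];
    float y[N_POINTS];
    float z[N_POINTS];
};

/* Reads every third float: strided access. */
float sum_x_aos(const struct point_aos *p, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += p[i].x;
    return acc;
}

/* Reads consecutive floats: sequential access that is easy to vectorize. */
float sum_x_soa(const struct points_soa *p, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += p->x[i];
    return acc;
}
```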
Source Code Level Optimization. Performance bugs: bugs that cause significant performance degradation. PerfChecker: a static-analysis tool for detecting performance bugs in mobile applications.
GUI lagging is the most dominant bug type (75.7%). Long-running operations in main threads.
View holder design pattern
[1] Oshana and Kraeling. Software Engineering for Embedded Systems: Methods, Practical Techniques, and Applications. Chapter 11: Optimizing Embedded Software for Performance.
[2] Oshana and Kraeling. Software Engineering for Embedded Systems: Methods, Practical Techniques, and Applications. Chapter 12: Optimizing Embedded Software for Memory.
[3] Heydemann, K., Bodin, F., Knijnenburg, P. M. W. and Morin, L. 2006. UFS: a global trade-off strategy for loop unrolling for VLIW architectures. Concurrency Computat.: Pract. Exper., 18: 1413–1434. doi:10.1002/cpe.1014
[4] Yepang Liu, Chang Xu, and Shing-Chi Cheung. 2014. Characterizing and detecting performance bugs for smartphone applications. In Proceedings of the 36th International Conference on Software Engineering (ICSE 2014). ACM, New York, NY, USA, 1013-1024. DOI=http://dx.doi.org/10.1145/2568225.2568229
[5] http://sccpu2.cse.ust.hk/andrewust/files/ICSE2014_presentation.pdf