Profile-Guided Optimization Targeting High Performance Embedded Applications
  David Kaeli, Murat Bicer, Efe Yardimci
  Center for Subsurface Sensing and Imaging Systems (CenSSIS), Northeastern University
  Jeffrey Smith
  Mercury Computer

Why Consider Using Profile-Guided Optimization?
  Much of the potential performance available on data-parallel systems cannot be obtained because of unpredictable control flow and data flow in programs
  Memory system performance continues to dominate the performance of many data-parallel applications
  Program profiles provide clues to the compiler/linker/runtime to:
    –Enable more aggressive use of interprocedural optimizations
    –Eliminate bottlenecks in the data flow/control flow
    –Improve a program's layout on the available memory hierarchy
  Applications can then be developed at higher levels of programming abstraction (e.g., from UML) and tuned for performance later

Profile Guidance
  Obtain run-time profiles in the form of:
    –Procedure call graphs, basic block traversals
    –Program variable value profiles
    –Hardware performance counters (using PCL): cache and TLB misses, pipeline stalls, heap allocations, synchronization messages
  Utilize run-time profiles as input to:
    –Provide user feedback (e.g., program visualization)
    –Perform profile-driven compilation (recompile using the profile)
    –Enable dynamic optimization (just-in-time compilation)
    –Evaluate software testing coverage

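To make the "procedure call graph" form of profile concrete, here is a minimal sketch in Python. It is not one of the tools used in this work (those are TATL, gprof, and PCL on the PowerPC); it only illustrates the kind of data a call-graph profile holds. The helper `profile_calls` and the toy functions `main` and `leaf` are hypothetical names introduced for the example; `sys.setprofile` is the standard Python hook.

```python
import sys
from collections import Counter

def profile_calls(fn, *args):
    """Run fn(*args) and count (caller, callee) edges between Python functions."""
    edges = Counter()

    def tracer(frame, event, arg):
        # "call" fires on entry to a Python function; f_back is the caller's frame.
        if event == "call" and frame.f_back is not None:
            edges[(frame.f_back.f_code.co_name, frame.f_code.co_name)] += 1

    sys.setprofile(tracer)
    try:
        fn(*args)
    finally:
        sys.setprofile(None)  # always remove the hook
    return edges

# Toy program (hypothetical): main calls leaf three times.
def leaf():
    return 1

def main():
    total = 0
    for _ in range(3):
        total += leaf()
    return total

call_graph = profile_calls(main)
```

The resulting edge counts are the weights that a profile-driven layout pass, such as code reordering, would consume.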
Profiling Tools
  Mercury Tools
    –TATL: Trace Analysis Tool and Library
  Procedure profiles
    –GNU gprof
  PowerPC Performance Counters
    –PCL: Performance Counter Library
    –PM API: targeting the PowerPC
  Green Hills Compiler
    –MULTI profiling support
  Custom instrumentation drivers

[Figure: profile feedback loop. Data-parallel applications (SAR, GPR, MRI, Software Defined Radio) pass through the compiler to produce a program binary; a program run yields a program profile (counter values, program paths, variable values), which is fed back to drive compile-time optimizations and binary-level optimizations.]

Target Optimizations
  Compile-time
    –Aggressive procedure inlining
    –Aggressive constant propagation: program variable specialization, procedure cloning
    –Removal of redundant loads/stores
  Link-time
    –Code reordering utilizing coloring
    –Static data reordering
  Dynamic (during runtime)
    –Heap layout optimization

Memory Performance is Key to Scalability in Data-parallel Applications
  The performance gap between processor technology and memory technology continues to grow
  Hierarchical memory systems (multi-level caches) have been used to bridge this gap
  Embedded processing applications place a heavy burden on the supporting memory system
  Applications will need to adapt (potentially dynamically) to better utilize the available memory system

Cache Line Coloring
  Attempts to reorder a program executable by coloring the cache space, avoiding caller-callee conflicts in the cache
  Can be driven by either statically generated call graphs or profile data
  Improves upon the work of Pettis and Hansen by considering the organization of the cache space (i.e., cache size, line size, associativity)
  Can be used at different levels of granularity (procedures, basic blocks), both intra- and inter-procedurally

Cache Line Coloring Algorithm
  Build program call graph
    –nodes represent procedures
    –edges represent calls
    –edge weights represent call frequencies
  Prune edges based on a threshold value
  Sort graph edges and process in decreasing edge weight order
  Place procedures in the cache space, avoiding color conflicts
  Fill in gaps with remaining procedures
  Reduces execution time by up to 49% for data compression algorithms
  [Figure: example call graph with weighted edges]

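The steps above can be sketched as a greedy placement. This is a heavily simplified illustration, not the published algorithm: it assumes a direct-mapped cache, measures procedure sizes in whole cache lines, and only avoids conflicts between a procedure and its already-placed call-graph neighbours. The function names, the example sizes, and the call counts are all hypothetical.

```python
def colors(start, size, num_lines):
    """Cache lines (colors) occupied by a procedure placed at line `start`."""
    return {(start + i) % num_lines for i in range(size)}

def color_layout(sizes, calls, num_lines, threshold=1):
    """Return {procedure: start_line} for a simplified greedy coloring."""
    # Prune light edges, then process the rest in decreasing weight order.
    edges = sorted(((w, a, b) for (a, b), w in calls.items() if w >= threshold),
                   reverse=True)
    neighbours = {}
    for _, a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    placed, gaps = {}, []   # gaps holds (start_line, length) holes left behind
    next_free = 0           # first unused line in the linear address space

    def place(proc):
        nonlocal next_free
        taken = set()
        for n in neighbours.get(proc, ()):
            if n in placed:
                taken |= colors(placed[n], sizes[n], num_lines)
        start = next_free   # fallback: accept a conflict if no offset is clean
        for offset in range(num_lines):
            if not (colors(next_free + offset, sizes[proc], num_lines) & taken):
                start = next_free + offset
                break
        if start > next_free:
            gaps.append((next_free, start - next_free))
        placed[proc] = start
        next_free = start + sizes[proc]

    for _, a, b in edges:
        for proc in (a, b):
            if proc not in placed:
                place(proc)

    # Fill gaps with the remaining procedures (first fit), else append.
    for proc in sizes:
        if proc in placed:
            continue
        for i, (g_start, g_len) in enumerate(gaps):
            if sizes[proc] <= g_len:
                placed[proc] = g_start
                gaps[i] = (g_start + sizes[proc], g_len - sizes[proc])
                break
        else:
            placed[proc] = next_free
            next_free += sizes[proc]
    return placed

# Hypothetical example: 8-line cache, sizes in lines, call counts as weights.
sizes = {"A": 4, "B": 3, "C": 2, "D": 2}
calls = {("A", "B"): 90, ("A", "C"): 40, ("B", "C"): 5}
layout = color_layout(sizes, calls, num_lines=8, threshold=10)
```

In this example C is pushed past the lines that would collide with A, which leaves a gap that the uncalled procedure D then fills. The B-C edge falls below the threshold, so B and C are allowed to share colors.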
Data Memory Access
  A disproportionate number of data cache misses are caused by accesses to dynamically allocated (heap) memory
  Increases in cache size do not effectively reduce data cache misses caused by heap accesses
  A small number of objects account for a large percentage of heap misses (90/10 rule)
  Existing memory allocation routines tend to balance allocation speed and memory usage (locality preservation has not been a major concern)

Miss rates (%) vs. Cache Configurations
Profile-driven Data Layout
  We have developed a profile-guided approach to allocating heap objects to improve heap behavior
  The idea is to use existing knowledge of the computing platform (e.g., cache organization), combined with profile data, to enable the target application to execute more efficiently
  Mapping temporally local memory blocks possessing high reference counts to the same cache area will generate a significant number of cache misses

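A short sketch of why two heap blocks can collide in the cache: the sets a block occupies depend only on its address, the line size, and the number of sets. The cache geometry below (32-byte lines, 256 sets, direct-mapped, i.e. 8 KB) and the block addresses are assumptions chosen for illustration, and `cache_sets` is a hypothetical helper.

```python
def cache_sets(addr, size, line_size=32, num_sets=256):
    """Cache set indices touched by a block of `size` bytes at `addr`."""
    first = addr // line_size
    last = (addr + size - 1) // line_size
    return {line % num_sets for line in range(first, last + 1)}

# Two 64-byte blocks exactly one cache size (8 KB) apart map to the same sets.
block_a = cache_sets(0x10000, 64)
block_b = cache_sets(0x12000, 64)
# Shifting the second block by 64 bytes moves it onto different sets.
block_c = cache_sets(0x12040, 64)

conflict_ab = bool(block_a & block_b)
conflict_ac = bool(block_a & block_c)
```

If blocks A and B are both referenced heavily and close together in time, each access to one can evict the other. This is the situation a profile-guided layout tries to avoid.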
Allocation
  We have developed our own malloc routine which uses a conflict profile to avoid allocating potentially conflicting addresses
  A multi-step allocation algorithm is repeated until a non-conflicting allocation is made
  If all steps produce conflicts, allocation is made within the wilderness region
  If conflicts still occur in the wilderness region, we allocate these conflicting chunks (creating a hole)
    –Allocation occurs at the first non-conflicting address after the chunk
    –The hole is immediately freed, causing minimal space wastage (though possibly some limited fragmentation)

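The allocation policy above can be sketched as a small simulation. This is not the authors' malloc: the slides do not spell out the individual steps, so the sketch assumes just two (first fit over freed chunks, then the wilderness). It also represents the conflict profile as a set of cache sets to avoid for each request. The class name, cache geometry, and addresses are all hypothetical.

```python
LINE_SIZE = 32   # assumed cache line size in bytes
NUM_SETS = 256   # assumed number of sets (direct-mapped, 8 KB)

def sets_touched(addr, size):
    """Cache set indices covered by `size` bytes starting at `addr`."""
    first = addr // LINE_SIZE
    last = (addr + size - 1) // LINE_SIZE
    return {line % NUM_SETS for line in range(first, last + 1)}

class ConflictAwareHeap:
    def __init__(self, base=0):
        self.free_chunks = []    # (addr, size) holes available for reuse
        self.wilderness = base   # start of the untouched top-of-heap region

    def malloc(self, size, avoid_sets):
        # Step 1: first fit among free chunks that avoids the forbidden sets.
        for i, (addr, chunk_size) in enumerate(self.free_chunks):
            if chunk_size >= size and not (sets_touched(addr, size) & avoid_sets):
                rest = chunk_size - size
                self.free_chunks[i:i + 1] = [(addr + size, rest)] if rest else []
                return addr
        # Step 2: wilderness. Skip forward one line at a time until the
        # candidate no longer conflicts; fall back to the start if none is clean.
        start = self.wilderness
        addr = start
        for step in range(NUM_SETS + 1):
            candidate = start + step * LINE_SIZE
            if not (sets_touched(candidate, size) & avoid_sets):
                addr = candidate
                break
        if addr > start:
            # The skipped region is a hole; free it immediately for reuse.
            self.free_chunks.append((start, addr - start))
        self.wilderness = addr + size
        return addr

heap = ConflictAwareHeap(base=0x10000)
a = heap.malloc(64, set())             # hot object, no constraints yet
hot = sets_touched(a, 64)              # sets the hot object occupies
heap.malloc(8192 - 64, set())          # filler that advances the wilderness
c = heap.malloc(64, hot)               # must not share sets with `a`
d = heap.malloc(32, set())             # unconstrained request reuses the hole
```

In this run the wilderness address for `c` would land on the same sets as `a`, so the allocator skips 64 bytes and frees that hole, which the next unconstrained request then partially reuses.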
Runtime improvements over non-optimized heap layout
Future Work
  Present algorithms have only been evaluated on uniprocessor platforms
  Follow-on work will target Mercury RACE multiprocessor systems
  Target applications will include:
    –FM3TR for Software Defined Radio
    –Steepest Descent Fast Multipole Methods (SDFMM) and Method for demining applications

Related Publications
  “Improving the Performance of Heap-based Memory Access,” E. Yardimci and D. Kaeli, Proc. of the Workshop on Memory Performance Issues, June
  “Accurate Simulation and Evaluation of Code Reordering,” J. Kalamatianos and D. Kaeli, Proc. of the IEEE International Symposium on the Performance Analysis of Systems and Software, May
  “Model Based Parallel Programming with Profile-Guided Application Optimization,” J. Smith and D. Kaeli, Proc. of the 4th Annual High Performance Embedded Computing Workshop, MIT Lincoln Labs, Lexington, MA, September 2000, pp
  “Cache Line Coloring Using Real and Estimated Profiles,” A. Hashemi, J. Kalamatianos, D. Kaeli and W. Meleis, Digital Technical Journal, Special Issues on Tools and Languages, February
  “Parameter Value Characterization of Windows NT-based Applications,” J. Kalamatianos and D. Kaeli, Workload Characterization: Methodology and Case Studies, IEEE Computer Society, 1999, pp

Related Publications (also see
  “Analysis of Temporal-based Program Behavior for Improved Instruction Cache Performance,” J. Kalamatianos, A. Khalafi, H. Hashemi, D. Kaeli and W. Meleis, IEEE Transactions on Computers, Vol. 10, No. 2, February 1999, pp
  “Memory Architecture Dependent Program Mapping,” B. Calder, A. Hashemi, and D. Kaeli, US Patent No. 5,963,972, October 5,
  “Temporal-based Procedure Reordering for Improved Instruction Cache Performance,” Proc. of the 4th HPCA, Feb. 1998, pp
  “Efficient Procedure Mapping Using Cache Line Coloring,” H. Hashemi, D. Kaeli and B. Calder, Proc. of PLDI’97, June 1997, pp
  “Procedure Mapping Using Static Call Graph Estimation,” Proc. of the Workshop on the Interaction Between Compilers and Computer Architecture, TCCA News,