Profile-Guided Optimization Targeting High Performance Embedded Applications David Kaeli Murat Bicer Efe Yardimci Center for Subsurface Sensing and Imaging.

Presentation transcript:

Profile-Guided Optimization Targeting High Performance Embedded Applications
David Kaeli, Murat Bicer, Efe Yardimci
Center for Subsurface Sensing and Imaging Systems (CenSSIS), Northeastern University
Jeffrey Smith, Mercury Computer

Why Consider Using Profile-Guided Optimization?
Much of the potential performance available on data-parallel systems cannot be obtained due to unpredictable control flow and data flow in programs
Memory system performance continues to dominate the performance of many data-parallel applications
Program profiles provide clues to the compiler/linker/runtime to:
  – Enable more aggressive use of interprocedural optimizations
  – Eliminate bottlenecks in the data flow/control flow
  – Improve a program’s layout on the available memory hierarchy
Applications can then be developed at higher levels of programming abstraction (e.g., from UML) and tuned for performance later

Profile Guidance
Obtain run-time profiles in the form of:
  – Procedure call graphs, basic block traversals
  – Program variable value profiles
  – Hardware performance counters (using PCL)
      Cache and TLB misses, pipeline stalls, heap allocations, synchronization messages
Utilize run-time profiles as input to:
  – Provide user feedback (e.g., program visualization)
  – Perform profile-driven compilation (recompile using the profile)
  – Enable dynamic optimization (just-in-time compilation)
  – Evaluate software testing coverage
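As a rough illustration of the first profile type above (a procedure call graph with call counts), the sketch below uses Python's standard `sys.setprofile` hook. This is not the tooling used in the presentation; the function names `collect_call_graph`, `workload`, `fft` and `butterfly` are hypothetical and exist only for the example.

```python
import sys
from collections import Counter

def collect_call_graph(fn):
    """Run fn() and return a Counter mapping (caller, callee) -> call count."""
    edges = Counter()

    def on_event(frame, event, arg):
        # Only Python-level calls are recorded; C calls and returns are ignored.
        if event == "call" and frame.f_back is not None:
            caller = frame.f_back.f_code.co_name
            callee = frame.f_code.co_name
            edges[(caller, callee)] += 1

    sys.setprofile(on_event)
    try:
        fn()
    finally:
        sys.setprofile(None)  # always remove the hook
    return edges

# Hypothetical workload, for illustration only.
def butterfly():
    pass

def fft():
    for _ in range(3):
        butterfly()

def workload():
    fft()
    fft()

graph = collect_call_graph(workload)
for (caller, callee), count in graph.most_common():
    print(f"{caller} -> {callee}: {count}")
```

The resulting edge weights are the kind of input a profile-driven compiler or linker could consume, for example when deciding which call sites to inline or which procedures to place apart in the cache.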

Profiling Tools
Mercury Tools
  – TATL (Trace Analysis Tool and Library)
Procedure profiles
  – GNU gprof
PowerPC Performance Counters
  – PCL (Performance Counter Library)
  – PM API, targeting the PowerPC
Green Hills Compiler
  – MULTI profiling support
Custom instrumentation drivers

[Figure: profile-guided optimization flow. Data-parallel applications (SAR, GPR, Software Defined Radio, MRI) are compiled into a program binary; a program run produces a program profile (counter values, program paths, variable values), which is fed back to compile-time optimizations in the compiler and to binary-level optimizations.]

Target Optimizations
Compile-time
  – Aggressive procedure inlining
  – Aggressive constant propagation
      Program variable specialization
      Procedure cloning
  – Removal of redundant loads/stores
Link-time
  – Code reordering utilizing coloring
  – Static data reordering
Dynamic (during runtime)
  – Heap layout optimization
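To make variable specialization and procedure cloning concrete, here is a minimal sketch, assuming a value profile that records how often each value of one argument was observed. The names `scale`, `make_dispatcher` and the 80% threshold are assumptions made for the example, not details from the presentation. In a compiler the clone would have the constant propagated into its body; Python only mimics the shape of the transformation.

```python
from collections import Counter

def scale(samples, gain):
    """Generic procedure: gain is an ordinary run-time argument."""
    return [s * gain for s in samples]

def make_dispatcher(generic, value_profile, min_share=0.8):
    """Clone `generic` for the dominant profiled value of its second argument.

    value_profile: Counter mapping observed argument values to counts.
    min_share: assumed threshold; specialize only if one value dominates.
    """
    hot, count = value_profile.most_common(1)[0]
    if count / sum(value_profile.values()) < min_share:
        return generic  # no value dominates, keep the generic version

    def clone(samples):
        # Specialized clone: the profiled value is bound as a constant.
        return [s * hot for s in samples]

    def dispatch(samples, gain):
        # Guard: take the clone only when the run-time value matches the profile.
        return clone(samples) if gain == hot else generic(samples, gain)

    return dispatch

# Hypothetical value profile: gain was 2 in 95 of 100 profiled calls.
profile = Counter({2: 95, 3: 5})
fast_scale = make_dispatcher(scale, profile)
print(fast_scale([1, 2, 3], 2))  # served by the specialized clone
print(fast_scale([1, 2, 3], 3))  # falls back to the generic procedure
```

The guard keeps the transformation safe: the profile only suggests which value is common, so inputs that deviate from it still reach the original code.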

Memory Performance is Key to Scalability in Data-parallel Applications
The performance gap between processor technology and memory technology continues to grow
Hierarchical memory systems (multi-level caches) have been used to bridge this gap
Embedded processing applications place a heavy burden on the supporting memory system
Applications will need to adapt (potentially dynamically) to better utilize the available memory system

Cache Line Coloring
Attempts to reorder a program executable by coloring the cache space, avoiding caller-callee conflicts in a cache
Can be driven by either statically-generated call graphs or profile data
Improves upon the work of Pettis and Hansen by considering the organization of the cache space (i.e., cache size, line size, associativity)
Can be used with different levels of granularity (procedures, basic blocks) and both intra- and inter-procedurally

Cache Line Coloring Algorithm
Build program call graph
  – nodes represent procedures
  – edges represent calls
  – edge weights represent call frequencies
Prune edges based on a threshold value
Sort graph edges and process in decreasing edge weight order
Place procedures in the cache space, avoiding color conflicts
Fill in gaps with remaining procedures
Reduces execution time by up to 49% for data compression algorithms
[Figure: example call graph with nodes A, B, E and edge weights 90, 40]
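The steps above can be sketched as a greedy placement, with several simplifying assumptions that are not from the slides: the cache is direct-mapped with `num_colors` lines, procedure sizes are given in whole cache lines, a conflicting procedure is simply shifted forward (leaving a gap), and leftover procedures are appended rather than used to fill gaps. The function name, the threshold and the example graph (including procedure `C`) are hypothetical.

```python
from collections import defaultdict

def color_procedures(sizes, edges, num_colors, threshold):
    """Greedy layout sketch.

    sizes: {procedure: size in cache lines}
    edges: {(caller, callee): call count}
    Returns {procedure: starting line in the linear code layout}.
    """
    # Prune infrequent edges, then visit the rest heaviest first.
    hot = sorted(((w, a, b) for (a, b), w in edges.items() if w >= threshold),
                 reverse=True)
    neighbors = defaultdict(set)
    for _, a, b in hot:
        neighbors[a].add(b)
        neighbors[b].add(a)

    placement = {}
    next_free = 0  # first unused line in the layout

    def colors(start, size):
        # Cache lines (colors) occupied by a procedure placed at `start`.
        return {(start + i) % num_colors for i in range(size)}

    def place(proc):
        nonlocal next_free
        if proc in placement:
            return
        # Colors already used by placed procedures that call or are called by proc.
        taken = set()
        for n in neighbors[proc]:
            if n in placement:
                taken |= colors(placement[n], sizes[n])
        start = next_free
        # Try every shift within one cache wrap; keep next_free if all conflict.
        for shift in range(num_colors):
            if not (colors(next_free + shift, sizes[proc]) & taken):
                start = next_free + shift
                break
        placement[proc] = start
        next_free = start + sizes[proc]

    for _, a, b in hot:
        place(a)
        place(b)
    for proc in sizes:  # remaining (cold) procedures are simply appended here
        place(proc)
    return placement

# Hypothetical example: an 8-line cache and four procedures.
sizes = {"A": 4, "B": 4, "E": 4, "C": 2}
edges = {("A", "B"): 90, ("B", "E"): 40, ("A", "C"): 1}
layout = color_procedures(sizes, edges, num_colors=8, threshold=10)
print(layout)
```

In this example `A` and `E` end up sharing colors, which is acceptable because no heavy edge connects them; only procedures linked by an unpruned edge are kept apart.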

Data Memory Access
A disproportionate number of data cache misses are caused by accesses to dynamically allocated (heap) memory
Increases in cache size do not effectively reduce data cache misses caused by heap accesses
A small number of objects account for a large percentage of heap misses (90/10 rule)
Existing memory allocation routines tend to balance allocation speed and memory usage (locality preservation has not been a major concern)

Miss rates (%) vs. Cache Configurations

Profile-driven Data Layout
We have developed a profile-guided approach to allocating heap objects to improve heap behavior
The idea is to use existing knowledge of the computing platform (e.g., cache organization), combined with profile data, to enable the target application to execute more efficiently
Mapping temporally local memory blocks possessing high reference counts to the same cache area will generate a significant number of cache misses
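One way to picture the profile data involved is a conflict count between heap objects that evict each other from the same cache location. The sketch below assumes a tiny direct-mapped cache and objects that each fit in one line; the constants, the function name `conflict_profile` and the object addresses are hypothetical and only illustrate the idea.

```python
from collections import Counter

LINE_SIZE = 32  # bytes per cache line (assumed)
NUM_SETS = 4    # lines in a tiny direct-mapped cache (assumed)

def conflict_profile(trace, addresses):
    """Count evictions between objects that map to the same cache line.

    trace: object names in reference order
    addresses: {name: start address}; each object is assumed to fit in one line
    Returns a Counter mapping (evicted, evictor) -> number of evictions.
    """
    resident = {}          # cache index -> object currently held there
    conflicts = Counter()
    for name in trace:
        index = (addresses[name] // LINE_SIZE) % NUM_SETS
        owner = resident.get(index)
        if owner is not None and owner != name:
            conflicts[(owner, name)] += 1  # name evicts owner: a conflict miss
        resident[index] = name
    return conflicts

# Hypothetical objects: x and y are 128 bytes apart, so they share index 0.
addresses = {"x": 0, "y": 128, "z": 32}
trace = ["x", "y", "x", "y", "z", "x"]
profile = conflict_profile(trace, addresses)
print(profile)
```

Here `x` and `y` are referenced close together in time and keep evicting each other, which is exactly the pattern a profile-guided allocator would try to break by moving one of them.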

Allocation
We have developed our own malloc routine which uses a conflict profile to avoid allocating potentially conflicting addresses
A multi-step allocation algorithm is repeated until a non-conflicting allocation is made
If all steps produce conflicts, allocation is made within the wilderness region
If conflicts still occur in the wilderness region, we allocate these conflicting chunks (creating a hole)
  – Allocation occurs at the first non-conflicting address after the chunk
  – The hole is immediately freed, causing minimal space wastage (though possibly some limited fragmentation)
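The wilderness fallback described above can be simulated in a few lines. This is a sketch under stated assumptions, not the authors' malloc: it models only a bump allocator over the wilderness region, treats the conflict profile as a per-request `hot` flag, and assumes a direct-mapped cache with the constants shown. The class and method names are hypothetical, and the earlier steps of the multi-step algorithm are not modeled.

```python
LINE = 32    # bytes per cache line (assumed)
SETS = 256   # lines in a direct-mapped 8 KB cache (assumed)

def cache_sets(addr, size):
    """Cache indices touched by the byte range [addr, addr + size)."""
    first = addr // LINE
    last = (addr + size - 1) // LINE
    return {line % SETS for line in range(first, last + 1)}

class WildernessAllocator:
    def __init__(self, base=0):
        self.top = base        # start of the unallocated wilderness
        self.free_holes = []   # (addr, size) regions skipped and freed
        self.hot_sets = set()  # cache indices already used by hot objects

    def malloc(self, size, hot=False):
        addr = self.top
        if hot:
            # Step forward one line at a time until no hot index is shared.
            for _ in range(SETS):
                if not (cache_sets(addr, size) & self.hot_sets):
                    break
                addr = (addr // LINE + 1) * LINE
            else:
                addr = self.top  # every placement conflicts: accept it
            if addr > self.top:
                # The skipped chunk becomes a hole that is freed immediately.
                self.free_holes.append((self.top, addr - self.top))
            self.hot_sets |= cache_sets(addr, size)
        self.top = addr + size
        return addr

heap = WildernessAllocator()
p = heap.malloc(64, hot=True)  # first hot object
heap.malloc(8192 - 64)         # cold data filling the rest of one cache image
q = heap.malloc(64, hot=True)  # would alias p without the skip
print(p, q, heap.free_holes)
```

The second hot object is pushed past the addresses that would share cache lines with the first one, and the skipped bytes are recorded as a freed hole, matching the small space cost mentioned on the slide.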

Runtime improvements over non-optimized heap layout

Future Work
Present algorithms have only been evaluated on uniprocessor platforms
Follow-on work will target Mercury RACE multiprocessor systems
Target applications will include:
  – FM3TR for Software Defined Radio
  – Steepest Descent Fast Multipole Methods (SDFMM) and Method for demining applications

Related Publications
“Improving the Performance of Heap-based Memory Access,” E. Yardimci and D. Kaeli, Proc. of the Workshop on Memory Performance Issues, June
“Accurate Simulation and Evaluation of Code Reordering,” J. Kalamatianos and D. Kaeli, Proc. of the IEEE International Symposium on the Performance Analysis of Systems and Software, May
“Model Based Parallel Programming with Profile-Guided Application Optimization,” J. Smith and D. Kaeli, Proc. of the 4th Annual High Performance Embedded Computing Workshop, MIT Lincoln Labs, Lexington, MA, September 2000, pp
“Cache Line Coloring Using Real and Estimated Profiles,” A. Hashemi, J. Kalamatianos, D. Kaeli and W. Meleis, Digital Technical Journal, Special Issues on Tools and Languages, February
“Parameter Value Characterization of Windows NT-based Applications,” J. Kalamatianos and D. Kaeli, Workload Characterization: Methodology and Case Studies, IEEE Computer Society, 1999, pp

Related Publications (also see
“Analysis of Temporal-based Program Behavior for Improved Instruction Cache Performance,” J. Kalamatianos, A. Khalafi, H. Hashemi, D. Kaeli and W. Meleis, IEEE Transactions on Computers, Vol. 10, No. 2, February 1999, pp
“Memory Architecture Dependent Program Mapping,” B. Calder, A. Hashemi, and D. Kaeli, US Patent No. 5,963,972, October 5,
“Temporal-based Procedure Reordering for Improved Instruction Cache Performance,” Proc. of the 4th HPCA, Feb. 1998, pp
“Efficient Procedure Mapping Using Cache Line Coloring,” H. Hashemi, D. Kaeli and B. Calder, Proc. of PLDI’97, June 1997, pp
“Procedure Mapping Using Static Call Graph Estimation,” Proc. of the Workshop on the Interaction Between Compilers and Computer Architecture, TCCA News,