Taming Hardware Event Samples for FDO Compilation Dehao Chen (Tsinghua University) Neil Vachharajani, Robert Hundt, Shih-wei Liao (Google) Vinodha Ramasamy.

Slides:



Advertisements
Similar presentations
TM 1 ProfileMe: Hardware-Support for Instruction-Level Profiling on Out-of-Order Processors Jeffrey Dean Jamey Hicks Carl Waldspurger William Weihl George.
Advertisements

SE-292 High Performance Computing Profiling and Performance R. Govindarajan
Profiler In software engineering, profiling ("program profiling", "software profiling") is a form of dynamic program analysis that measures, for example,
A Structure Layout Optimization for Multithreaded Programs Easwaran Raman, Princeton Robert Hundt, Google Sandya S. Mannarswamy, HP.
1 4/20/06 Exploiting Instruction-Level Parallelism with Software Approaches Original by Prof. David A. Patterson.
Hardware-based Devirtualization (VPC Prediction) Hyesoon Kim, Jose A. Joao, Onur Mutlu ++, Chang Joo Lee, Yale N. Patt, Robert Cohn* ++ *
PERFORMANCE ANALYSIS OF MULTIPLE THREADS/CORES USING THE ULTRASPARC T1 (NIAGARA) Unique Chips and Systems (UCAS-4) Dimitris Kaseridis & Lizy K. John The.
Intel® performance analyze tools Nikita Panov Idrisov Renat.
Overview Motivations Basic static and dynamic optimization methods ADAPT Dynamo.
Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.
SE-292 High Performance Computing Profiling and Performance R. Govindarajan
1 Lecture 6 Performance Measurement and Improvement.
Computer Architecture Instruction Level Parallelism Dr. Esam Al-Qaralleh.
EENG449b/Savvides Lec /25/05 March 24, 2005 Prof. Andreas Savvides Spring g449b EENG 449bG/CPSC 439bG.
Compiler Optimization-Space Exploration Adrian Pop IDA/PELAB Authors Spyridon Triantafyllis, Manish Vachharajani, Neil Vachharajani, David.
Code Coverage Testing Using Hardware Performance Monitoring Support Alex Shye, Matthew Iyer, Vijay Janapa Reddi and Daniel A. Connors University of Colorado.
University of California San Diego Locality Phase Prediction Xipeng Shen, Yutao Zhong, Chen Ding Computer Science Department, University of Rochester Class.
1 Lecture 10: FP, Performance Metrics Today’s topics:  IEEE 754 representations  FP arithmetic  Evaluating a system Reminder: assignment 4 due in a.
1 Feedback-directed optimizations with estimated edge profiles from hardware event sampling Open64 workshop, CGO 2008 April 6, 2008 Vinodha Ramasamy, Robert.
CH12 CPU Structure and Function
Rapid Identification of Architectural Bottlenecks via Precise Event Counting John Demme, Simha Sethumadhavan Columbia University
University of Maryland Compiler-Assisted Binary Parsing Tugrul Ince PD Week – 27 March 2012.
Multi-core Programming VTune Analyzer Basics. 2 Basics of VTune™ Performance Analyzer Topics What is the VTune™ Performance Analyzer? Performance tuning.
Korea Univ B-Fetch: Branch Prediction Directed Prefetching for In-Order Processors 컴퓨터 · 전파통신공학과 최병준 1 Computer Engineering and Systems Group.
Lecture 8. Profiling - for Performance Analysis - Prof. Taeweon Suh Computer Science Education Korea University COM503 Parallel Computer Architecture &
Uncovering the Multicore Processor Bottlenecks Server Design Summit Shay Gal-On Director of Technology, EEMBC.
Performance Monitoring on the Intel ® Itanium ® 2 Processor CGO’04 Tutorial 3/21/04 CK. Luk Massachusetts Microprocessor Design.
Hadi Salimi Distributed Systems Lab, School of Computer Engineering, Iran University of Science and Technology, Fall 2010 Performance.
Timing and Profiling ECE 454 Computer Systems Programming Topics: Measuring and Profiling Cristiana Amza.
Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.
© 2010 IBM Corporation Code Alignment for Architectures with Pipeline Group Dispatching Helena Kosachevsky, Gadi Haber, Omer Boehm Code Optimization Technologies.
ACMSE’04, ALDepartment of Electrical and Computer Engineering - UAH Execution Characteristics of SPEC CPU2000 Benchmarks: Intel C++ vs. Microsoft VC++
Performance of mathematical software Agner Fog Technical University of Denmark
* Third party brands and names are the property of their respective owners. Performance Tuning Linux* Applications LinuxWorld Conference & Expo Gary Carleton.
Performance Counters on Intel® Core™ 2 Duo Xeon® Processors Michael D’Mello
IBM Haifa Labs © 2005 IBM Corporation Performance Tools developed in IBM Haifa Gad Haber
Timers and Interrupts Anurag Dwivedi. Let Us Revise.
1 Out-Of-Order Execution (part I) Alexander Titov 14 March 2015.
Limits of Instruction-Level Parallelism Presentation by: Robert Duckles CSE 520 Paper being presented: Limits of Instruction-Level Parallelism David W.
CPE 631 Project Presentation Hussein Alzoubi and Rami Alnamneh Reconfiguration of architectural parameters to maximize performance and using software techniques.
Practical Path Profiling for Dynamic Optimizers Michael Bond, UT Austin Kathryn McKinley, UT Austin.
Software Performance Monitoring Daniele Francesco Kruse July 2010.
CSCI1600: Embedded and Real Time Software Lecture 33: Worst Case Execution Time Steven Reiss, Fall 2015.
Guiding Ispike with Instrumentation and Hardware (PMU) Profiles CGO’04 Tutorial 3/21/04 CK. Luk Massachusetts Microprocessor Design.
A Software Performance Monitoring Tool Daniele Francesco Kruse March 2010.
CSE598c - Virtual Machines - Spring Diagnosing Performance Overheads in the Xen Virtual Machine EnvironmentPage 1 CSE 598c Virtual Machines “Diagnosing.
LECTURE 10 Pipelining: Advanced ILP. EXCEPTIONS An exception, or interrupt, is an event other than regular transfers of control (branches, jumps, calls,
1 Computer Architecture. 2 Basic Elements Processor Main Memory –volatile –referred to as real memory or primary memory I/O modules –secondary memory.
Continuous Flow Multithreading on FPGA Gilad Tsoran & Benny Fellman Supervised by Dr. Shahar Kvatinsky Bsc. Winter 2014 Final Presentation March 1 st,
Out-of-order execution Lihu Rappoport 11/ MAMAS – Computer Architecture Out-Of-Order Execution Dr. Lihu Rappoport.
Architectural Effects on DSP Algorithms and Optimizations Sajal Dogra Ritesh Rathore.
*Pentium is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries Performance Monitoring.
Confessions of a Performance Monitor Hardware Designer Workshop on Hardware Performance Monitor Design HPCA February 2005 Jim Callister Intel Corporation.
PINTOS: An Execution Phase Based Optimization and Simulation Tool) PINTOS: An Execution Phase Based Optimization and Simulation Tool) Wei Hsu, Jinpyo Kim,
July 10, 2016ISA's, Compilers, and Assembly1 CS232 roadmap In the first 3 quarters of the class, we have covered 1.Understanding the relationship between.
Computer Organization CS224
CS203 – Advanced Computer Architecture
Selective Code Compression Scheme for Embedded System
CS203 – Advanced Computer Architecture
Performance monitoring on HP Alpha using DCPI
Pipelining: Advanced ILP
CSCI1600: Embedded and Real Time Software
Feedback directed optimization in Compaq’s compilation tools for Alpha
Understanding Performance Counter Data - 1
Ann Gordon-Ross and Frank Vahid*
Adaptive Optimization in the Jalapeño JVM
Hyesoon Kim Onur Mutlu Jared Stark* Yale N. Patt
How to improve (decrease) CPI
CSCI1600: Embedded and Real Time Software
Srinivas Neginhal Anantharaman Kalyanaraman CprE 585: Survey Project
Presentation transcript:

Taming Hardware Event Samples for FDO Compilation Dehao Chen (Tsinghua University) Neil Vachharajani, Robert Hundt, Shih-wei Liao (Google) Vinodha Ramasamy Paul Yuan (Peking University) Wenguang Chen, Weimin Zheng (Tsinghua University) 1

Why FDO? 2 Feedback Directed Optimization Performance Improvements – 5% speedup on SPEC2000 INT – Small? Huge for millions of computers Not widely adopted

Instrumentation based FDO gcc –fprofile-generate … Instrumented Binary Representative Workload Run the instrumented binary.gcda files gcc –fprofile-use … FDO optimized binary Have to build twice 2.Instrumentation run is slow 3.Need representative input 4.Perturbs execution 3

Sample based FDO Running Environment gcc –O2 -g … Normal Binary Real-World Workload Profile Data gcc –fsample-profile … FDO optimized binary Previous deployment/test binary to collect profile 2.Profiling input: real traffic 3.Profiling does not perturb code Profiling Tools (Oprofile, Pfmon) Profiling Tools (Oprofile, Pfmon) 4

PMU Sampling Performance monitoring unit (PMU) o Captures events generated by CPU  cache miss  instruction retired  clock tick o Configurable counters increment on selected events o Optional interrupt on counter overflow Sampling o On interrupt capture instruction pointer (IP) o Can also sample other state  registers  other PMU counters 5

Sampling Instructions Retired Instruction Samples x76a1f x76a48a x76aea x76c09d 1 0x77e3cf 733 0x77ee7e x78109d Symbolized Samples 0x76a1f4 : 1499 foo.c:1183 0x76a48a : 1517 foo.c:992 0x76aea1 : 1489 foo.c:906 0x76c09d : 1527 foo.h: x77e3cf : 1 bar.c:3481 0x77ee7e : 733 bar.c x78109d : 1242 bar.c 4762 Symbolizer GCC foo.c:853 foo.c:906 foo.c:992 foo.c:1183 foo.c:

Sampling Instructions Retired Instruction Samples x76a1f x76a48a x76aea x76c09d 1 0x77e3cf 733 0x77ee7e x78109d Symbolized Samples 0x76a1f4 : 1499 foo.c:1183 0x76a48a : 1517 foo.c:992 0x76aea1 : 1489 foo.c:906 0x76c09d : 1527 foo.h: x77e3cf : 1 bar.c:3481 0x77ee7e : 733 bar.c x78109d : 1242 bar.c 4762 Symbolizer GCC foo.c:853 foo.c:906 foo.c:992 foo.c:1183 foo.c: [Levin et.al. Complementing Missing and Inaccurate Profiling using a Minimum Cost Circulation Algorithm. HIPEAC’08]

Accuracy Challenge 8

9

10

Improving the Accuracy Flow Consistency – Use Minimum Cost Circulation Algorithm – Control flow  Network flow Predict Aggregation/Shadow Effect – Sampling Multiple Events – Using the prediction to adjust the frequency 11

Branch: 7954 Taken: 7922 Join: 1049 Source Sink Source Sink 12 [Levin et.al. Complementing Missing and Inaccurate Profiling using a Minimum Cost Circulation Algorithm. HIPEAC’08]

Use Prediction to Adjust Profile Cost Function in MCC – Each basic block is attached by two edges Forward (flow represents increasing the count) Backward (flow represents decreasing the count) – Cost function for each edge Larger cost means prevent changing in this direction Using the prediction – Over-sampled: high cost on forward edge – Under-sampled: high cost on backward edge BB1’ BB1’’ Forward EdgeBackward Edge 13

Predict Aggregation/Shadow Model Aggregation Effects – Long latency instructions – Sample major long latency events Branch Mispredict, Cache/DTLB Miss, etc Estimate the stalls these events will cause Skid has little influence on long latency events 14

Predict Aggregation/Shadow Model Shadow Effects – CPU_CORE_CYCLES event Time based sampling Skid will only shift the profile CPU_CYCLE – INST_RETIRED  Stalled Cycle (with skid) Each stalled cycle will set a shadow area Aggregation and Shadow co-exist – Heuristic to check which one dominates 15

Evaluation: Accuracy 16

Evaluation: Performance 17

Conclusion and Future Work Sampling based FDO is promising The artifacts in PMU data can be compensated for with appropriate understanding and heuristics, which improves the accuracy by 6% Sample based Value Profiling Future: Last Branch Register  More precise edge profile at binary level Sample based LIPO 18

Questions? 19