Guiding Ispike with Instrumentation and Hardware (PMU) Profiles
CGO'04 Tutorial, 3/21/04
Chi-Keung Luk
Massachusetts Microprocessor Design Center, Intel Corporation

What is Ispike?
A post-link optimizer for Itanium/Linux:
- No source code required
- Memory-centric optimizations: code layout + prefetching, data layout + prefetching
- Significant speedups over compiler-optimized programs: 10% average speedup over gcc -O3 on SPEC CINT2000
Profile usages:
- Understanding program characteristics
- Driving optimizations automatically
- Evaluating the effectiveness of optimizations

Profiles used by Ispike

Granularity    | Hardware profiles (pfmon) | Instrumentation profiles (Pin) | Usages
Per inst.      | PC sample                 | ---                            | Identifying hot spots
Per inst. line | I-EAR (I-cache)           | ---                            | Inst. prefetching
Per inst. line | I-EAR (I-TLB)             | ---                            |
Per branch     | BTB                       | Edge profile                   | Code layout, data layout, and other opts
Per load       | D-EAR (D-cache)           | Load-latency profile           | Data prefetching
Per load       | D-EAR (D-TLB)             | ---                            |
Per load       | D-EAR (stride)            | Stride profile                 | Data prefetching

Profile Example: D-EAR (cache)
[Chart: the top 10 loads in the D-EAR profile of the MCF benchmark, with each load's samples broken down into latency buckets and its total sampled miss latency.]
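To make the shape of such a per-load profile concrete, here is a minimal sketch that aggregates sampled misses into latency buckets. It assumes D-EAR samples arrive as (load PC, miss latency) pairs; the bucket boundaries, struct fields, and function names are illustrative and are not the actual pfmon or Ispike formats.

```c
#include <stdio.h>

#define NBUCKETS 4

/* Illustrative latency-bucket upper bounds, in cycles. */
static const unsigned bucket_max[NBUCKETS] = { 8, 32, 128, ~0u };

struct load_stats {
    unsigned long long pc;             /* address of the load instruction */
    unsigned long count[NBUCKETS];     /* samples falling in each bucket  */
    unsigned long long total_latency;  /* total sampled miss latency      */
};

/* Attribute one D-EAR sample (a miss of this load with the given latency)
 * to the load's latency histogram. */
static void record_sample(struct load_stats *s, unsigned latency_cycles)
{
    for (int b = 0; b < NBUCKETS; b++) {
        if (latency_cycles <= bucket_max[b]) {
            s->count[b]++;
            break;
        }
    }
    s->total_latency += latency_cycles;
}

int main(void)
{
    struct load_stats s = { .pc = 0x40000000000a1230ULL };  /* hypothetical load PC */
    unsigned latencies[] = { 7, 150, 31, 200, 6 };          /* hypothetical samples */
    for (int i = 0; i < 5; i++)
        record_sample(&s, latencies[i]);
    printf("pc=0x%llx  total sampled miss latency=%llu cycles\n",
           s.pc, s.total_latency);
    return 0;
}
```

Sorting such records by total sampled miss latency is what produces a "top 10 loads" view like the one in the chart.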

Profile Analysis Tools
A set of tools written for visualizing and analyzing profiles, e.g.:
- Control flow graph (CFG) viewer
- Code-layout viewer
- Load-latency comparator

CFG Viewer
For evaluating the accuracy of profiles.

Code-layout Viewer
For evaluating code-layout optimization.

Load-latency Comparator
For evaluating data-layout optimization and data prefetching.

Deriving New Profiles from PMUs
New profile types can be derived from PMU samples. Two examples:
- Consumer stall cycles
- D-cache miss strides

Consumer Stall Cycles
Question: how many cycles of stall are experienced by I2? (Note: this is not necessarily the load latency of I1.)
Method: an instruction's PC-sample count is proportional to (stall cycles * frequency).

Basic block A, with PC-sample counts N1, N2, N3 for I1, I2, I3:
    I1: ld8 r2 = [r3] ;;        // N1
        /* other instructions */
    I2: add r2 = r2, 1 ;;       // N2
    I3: st8 [r3] = r2           // N3
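As a hedged illustration of that proportionality, the sketch below turns a PC-sample count into an average per-execution stall estimate. It assumes a cycle-based sampler with a fixed sampling period and a basic-block execution count taken from an edge profile; the function and parameter names are illustrative, not Ispike's actual interfaces.

```c
#include <stdio.h>

/* Estimate the average stall cycles a consumer instruction (e.g., I2)
 * spends waiting, from its PC-sample count.  With one sample taken
 * every `sampling_period` cycles, the cycles attributed to the
 * instruction are roughly pc_samples * sampling_period; dividing by
 * how often its basic block executed gives a per-execution estimate. */
static double consumer_stall_cycles(unsigned long pc_samples,      /* N2 for I2             */
                                    unsigned long sampling_period, /* cycles per PC sample  */
                                    unsigned long block_execs)     /* executions of block A */
{
    if (block_execs == 0)
        return 0.0;
    return (double)pc_samples * (double)sampling_period / (double)block_execs;
}

int main(void)
{
    /* Hypothetical numbers: 1,200 samples at I2, one sample every
     * 100,000 cycles, block A executed 10,000,000 times. */
    printf("~%.1f stall cycles per execution of I2\n",
           consumer_stall_cycles(1200, 100000, 10000000));
    return 0;
}
```

Comparing such estimates for I1, I2, and I3 within the block is what lets a high N2 be read as the consumer I2 stalling on the load's result rather than as latency charged to I1.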

D-cache Miss Strides
Problem: detect strides that are statically unknown.
Example: two strided loads in MCF (figure annotations: arcin, -192B; tail, -120B):

    arc*  arcin;
    node* tail;
    ...
    while (arcin) {
        tail = arcin->tail;
        ...
        arcin = tail->mark;
    }

D-EAR-based Stride Profiling
Sample load misses in two alternating phases:
- Skipping phases: 1 sample per 1000 misses
- Inspection phases: 1 sample per miss
Use the GCD of differences between consecutive miss addresses to figure out strides. Example: let A1, A2, A3, A4 be four consecutive miss addresses of a load, with
    A2 - A1 = 5*48 = 240,   A3 - A2 = 7*48 = 336,   A4 - A3 = 3*48 = 144.
Then GCD(A2-A1, A3-A2) = GCD(240, 336) = 48 and GCD(A3-A2, A4-A3) = GCD(336, 144) = 48, so the load has a stride of 48 bytes.
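A minimal sketch of this GCD reduction is below, assuming an inspection phase has already produced a small array of consecutive miss addresses for one load; the function names and the consistency check are illustrative, not the actual Ispike implementation.

```c
#include <stdint.h>
#include <stdio.h>

/* Greatest common divisor of two non-negative byte deltas. */
static uint64_t gcd64(uint64_t a, uint64_t b)
{
    while (b != 0) {
        uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Estimate a load's stride from consecutive sampled miss addresses
 * captured in an inspection phase.  Returns 0 if fewer than three
 * addresses are available or the GCDs of successive deltas disagree. */
static uint64_t estimate_stride(const uint64_t *addr, int n)
{
    uint64_t stride = 0;
    for (int i = 2; i < n; i++) {
        uint64_t d1 = addr[i-1] > addr[i-2] ? addr[i-1] - addr[i-2]
                                            : addr[i-2] - addr[i-1];
        uint64_t d2 = addr[i]   > addr[i-1] ? addr[i]   - addr[i-1]
                                            : addr[i-1] - addr[i];
        uint64_t g  = gcd64(d1, d2);
        if (stride == 0)
            stride = g;
        else if (g != stride)
            return 0;            /* no single consistent stride */
    }
    return stride;
}

int main(void)
{
    /* The slide's example: address deltas of 240, 336, and 144 bytes. */
    uint64_t addr[] = { 0x1000, 0x1000 + 240, 0x1000 + 240 + 336,
                        0x1000 + 240 + 336 + 144 };
    printf("estimated stride = %llu bytes\n",
           (unsigned long long)estimate_stride(addr, 4));
    return 0;
}
```

With the four miss addresses from the slide's example, both GCDs reduce to 48, so the sketch reports a 48-byte stride.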

Performance Evaluation
Instrumentation vs. PMU profiles:
- Profiling overhead
- Performance impact
Ispike optimizations:
- Code layout, instruction prefetching, data layout, data prefetching, inlining, global-data optimization, scalar optimizations
Baseline compilers:
- Intel Electron compiler (ecc), version 8.0 Beta, -O3
- GNU C compiler (gcc), version 3.2, -O3
Benchmarks:
- SPEC CINT2000 (profiled with the "training" inputs, measured with the "reference" inputs)
System:
- 1GHz Itanium 2, 16KB L1I / 16KB L1D, 256KB L2, 3MB L3, 16GB memory
- Red Hat Enterprise Linux AS with kernel

Performance Gains with PMU Profiles
Sampling rates: BTB (1 sample per 10K branches), D-EAR cache (1 sample per 100 load misses), D-EAR stride (1 sample per 100 misses in skipping phases, 1 sample per miss in inspection phases).
Results against the ecc 8.0 -O3 and gcc 3.2 -O3 baselines: up to 40% gain, with geometric means of 8.5% over ecc and 9.9% over gcc.

Cycle Breakdown (ecc Baseline)
Helps show whether individual optimizations are doing a good job.

PMU Profiling Overhead
- Overhead drops from 58% to 23% when the BTB sampling rate is lowered by 10x.
- Overhead drops to 3% when the D-EAR sampling rate is lowered by 10x.

Instrumentation Profiling Overhead
Why is the overhead so large?
- Training runs are too short to amortize the dynamic-compilation cost.
- Techniques like ephemeral instrumentation have not yet been applied.

PMU vs. Instrumentation (Performance Gains)
PMU profiles can be as good as instrumentation profiles, and in some cases even better (e.g., mcf).
However, performance can drop when samples are too sparse, e.g., gap and parser with the stride profile.

Reference
"Ispike: A Post-link Optimizer for the Intel Itanium Architecture", C.-K. Luk et al., in Proceedings of CGO 2004.