Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Slides:



Advertisements
Similar presentations
Re-examining Instruction Reuse in Pre-execution Approaches By Sonya R. Wolff Prof. Ronald D. Barnes June 5, 2011.
Advertisements

Efficient Program Compilation through Machine Learning Techniques Gennady Pekhimenko IBM Canada Angela Demke Brown University of Toronto.
1 Pipelining Part 2 CS Data Hazards Data hazards occur when the pipeline changes the order of read/write accesses to operands that differs from.
Hardware-based Devirtualization (VPC Prediction) Hyesoon Kim, Jose A. Joao, Onur Mutlu ++, Chang Joo Lee, Yale N. Patt, Robert Cohn* ++ *
Combining Statistical and Symbolic Simulation Mark Oskin Fred Chong and Matthew Farrens Dept. of Computer Science University of California at Davis.
PERFORMANCE ANALYSIS OF MULTIPLE THREADS/CORES USING THE ULTRASPARC T1 (NIAGARA) Unique Chips and Systems (UCAS-4) Dimitris Kaseridis & Lizy K. John The.
Sim-alpha: A Validated, Execution-Driven Alpha Simulator Rajagopalan Desikan, Doug Burger, Stephen Keckler, Todd Austin.
8 Processing of control transfer instructions TECH Computer Science 8.1 Introduction 8.2 Basic approaches to branch handling 8.3 Delayed branching 8.4.
Accurately Approximating Superscalar Processor Performance from Traces Kiyeon Lee, Shayne Evans, and Sangyeun Cho Dept. of Computer Science University.
U NIVERSITY OF D ELAWARE C OMPUTER & I NFORMATION S CIENCES D EPARTMENT Mitigating the Compiler Optimization Phase- Ordering Problem using Machine Learning.
CISC Machine Learning for Solving Systems Problems Presented by: John Tully Dept of Computer & Information Sciences University of Delaware Using.
A Scalable Front-End Architecture for Fast Instruction Delivery Paper by: Glenn Reinman, Todd Austin and Brad Calder Presenter: Alexander Choong.
Online Performance Auditing Using Hot Optimizations Without Getting Burned Jeremy Lau (UCSD, IBM) Matthew Arnold (IBM) Michael Hind (IBM) Brad Calder (UCSD)
Using one level of Cache:
Green Governors: A Framework for Continuously Adaptive DVFS Vasileios Spiliopoulos, Stefanos Kaxiras Uppsala University, Sweden.
Chuanjun Zhang, UC Riverside 1 Low Static-Power Frequent-Value Data Caches Chuanjun Zhang*, Jun Yang, and Frank Vahid** *Dept. of Electrical Engineering.
Energy Efficient Instruction Cache for Wide-issue Processors Alex Veidenbaum Information and Computer Science University of California, Irvine.
Evaluating Non-deterministic Multi-threaded Commercial Workloads Computer Sciences Department University of Wisconsin—Madison
Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt.
Adaptive Cache Compression for High-Performance Processors Alaa R. Alameldeen and David A.Wood Computer Sciences Department, University of Wisconsin- Madison.
Compiler Optimization-Space Exploration Adrian Pop IDA/PELAB Authors Spyridon Triantafyllis, Manish Vachharajani, Neil Vachharajani, David.
U NIVERSITY OF M ASSACHUSETTS, A MHERST Department of Computer Science John Cavazos Architecture and Language Implementation Lab Thesis Seminar University.
8/16/2015\course\cpeg323-08F\Topics1b.ppt1 A Review of Processor Design Flow.
Efficient Model Selection for Support Vector Machines
DBMSs On A Modern Processor: Where Does Time Go? by A. Ailamaki, D.J. DeWitt, M.D. Hill, and D. Wood University of Wisconsin-Madison Computer Science Dept.
CISC Machine Learning for Solving Systems Problems Arch Explorer Lecture 5 John Cavazos Dept of Computer & Information Sciences University of Delaware.
Waleed Alkohlani 1, Jeanine Cook 2, Nafiul Siddique 1 1 New Mexico Sate University 2 Sandia National Laboratories Insight into Application Performance.
Statistical Simulation of Superscalar Architectures using Commercial Workloads Lieven Eeckhout and Koen De Bosschere Dept. of Electronics and Information.
André Seznec Caps Team IRISA/INRIA HAVEGE HArdware Volatile Entropy Gathering and Expansion Unpredictable random number generation at user level André.
(1) Scheduling for Multithreaded Chip Multiprocessors (Multithreaded CMPs)
1 of 20 Phase-based Cache Reconfiguration for a Highly-Configurable Two-Level Cache Hierarchy This work was supported by the U.S. National Science Foundation.
Architectural Characterization of an IBM RS6000 S80 Server Running TPC-W Workloads Lei Yang & Shiliang Hu Computer Sciences Department, University of.
Performance Prediction for Random Write Reductions: A Case Study in Modelling Shared Memory Programs Ruoming Jin Gagan Agrawal Department of Computer and.
CISC Machine Learning for Solving Systems Problems Presented by: Alparslan SARI Dept of Computer & Information Sciences University of Delaware
Simulation is the process of studying the behavior of a real system by using a model that replicates the behavior of the system under different scenarios.
Investigating Adaptive Compilation using the MIPSpro Compiler Keith D. Cooper Todd Waterman Department of Computer Science Rice University Houston, TX.
VGreen: A System for Energy Efficient Manager in Virtualized Environments G. Dhiman, G Marchetti, T Rosing ISLPED 2009.
CISC Machine Learning for Solving Systems Problems John Cavazos Dept of Computer & Information Sciences University of Delaware
Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan.
Ensemble Learning for Low-level Hardware-supported Malware Detection
Reservoir Uncertainty Assessment Using Machine Learning Techniques Authors: Jincong He Department of Energy Resources Engineering AbstractIntroduction.
Chapter 5 Memory III CSE 820. Michigan State University Computer Science and Engineering Miss Rate Reduction (cont’d)
CSCI1600: Embedded and Real Time Software Lecture 33: Worst Case Execution Time Steven Reiss, Fall 2015.
1 1 Slide Simulation Professor Ahmadi. 2 2 Slide Simulation Chapter Outline n Computer Simulation n Simulation Modeling n Random Variables and Pseudo-Random.
A Memory-hierarchy Conscious and Self-tunable Sorting Library To appear in 2004 International Symposium on Code Generation and Optimization (CGO ’ 04)
Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File Stephen Hines, Gary Tyson, and David Whalley Computer Science Dept. Florida.
Computer Organization CS224 Fall 2012 Lessons 41 & 42.
DISSERTATION RESEARCH PLAN Mitesh Meswani. Outline  Dissertation Research Update  Previous Approach and Results  Modified Research Plan  Identifying.
Floorplanning Optimization with Trajectory Piecewise-Linear Model for Pipelined Interconnects C. Long, L. J. Simonson, W. Liao and L. He EDA Lab, EE Dept.
CISC Machine Learning for Solving Systems Problems Microarchitecture Design Space Exploration Lecture 4 John Cavazos Dept of Computer & Information.
CISC Machine Learning for Solving Systems Problems Presented by: Eunjung Park Dept of Computer & Information Sciences University of Delaware Solutions.
U NIVERSITY OF D ELAWARE C OMPUTER & I NFORMATION S CIENCES D EPARTMENT Intelligent Compilation John Cavazos Computer & Information Sciences Department.
Workload Design: Selecting Representative Program-Input Pairs Lieven Eeckhout Hans Vandierendonck Koen De Bosschere Ghent University, Belgium PACT 2002,
U NIVERSITY OF M ASSACHUSETTS, A MHERST Department of Computer Science John Cavazos J Eliot B Moss Architecture and Language Implementation Lab University.
CMSC 611: Advanced Computer Architecture Performance & Benchmarks Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some.
Learning A Better Compiler Predicting Unroll Factors using Supervised Classification And Integrating CPU and L2 Cache Voltage Scaling using Machine Learning.
High Performance Computing1 High Performance Computing (CS 680) Lecture 2a: Overview of High Performance Processors * Jeremy R. Johnson *This lecture was.
A Quantitative Framework for Pre-Execution Thread Selection Gurindar S. Sohi University of Wisconsin-Madison MICRO-35 Nov. 22, 2002 Amir Roth University.
CS161 – Design and Architecture of Computer Systems
Ioannis E. Venetis Department of Computer Engineering and Informatics
Understanding Performance Counter Data - 1
Address-Value Delta (AVD) Prediction
CMSC 611: Advanced Computer Architecture
Reducing Training Time in a One-shot Machine Learning-based Compiler
Christophe Dubach, Timothy M. Jones and Michael F.P. O’Boyle
Presented by: Divya Muppaneni
Hyesoon Kim Onur Mutlu Jared Stark* Yale N. Patt
Lecture 10: Branch Prediction and Instruction Delivery
CMSC 611: Advanced Computer Architecture
Srinivas Neginhal Anantharaman Kalyanaraman CprE 585: Survey Project
Presentation transcript:

Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware Rapidly Selecting Good Compiler Optimizations Using Performance Counters

Dept. of Computer and Information Sciences : University of Delaware Traditional Compilers ► “One size fits all” approach ► Tuned for average performance ► Aggressive opts often turned off ► Target hard to model analytically application runtime system compiler operating system hardware virtualization

Dept. of Computer and Information Sciences : University of Delaware Solution ► Use performance counter characterization ► Train model off-line ► Counter values are “features” of program ► Out-performs state-of-the-art compiler ► 2 orders of magnitude faster than pure search application runtime system compiler operating system hardware virtualization Performance Counter Information

Dept. of Computer and Information Sciences : University of Delaware Performance Counters ► 60 counters available ► 5 categories ► FP, Branch, L1 cache, L2 cache, TLB, Others ► Examples: Mnemonic Description Avg Values ► FPU_IDL (Floating Unit Idle) ► VEC_INS (Vector Instructions) ► BR_INS (Branch Instructions) ► L1_ICH (L1 Icache Hits)

Dept. of Computer and Information Sciences : University of Delaware Characterization of SPEC FP

Dept. of Computer and Information Sciences : University of Delaware Larger number of L1 icache misses, L1 store misses and L2 D-cache writes Characterization of SPEC FP

Dept. of Computer and Information Sciences : University of Delaware Characterization of MiBench Exercises the cache less than SPEC FP.

Dept. of Computer and Information Sciences : University of Delaware Characterization of MiBench More branches than SPEC FP and more are mispredictions.

Dept. of Computer and Information Sciences : University of Delaware Characterization of 181.mcf Problem: Greater number of memory accesses per instruction than average

Dept. of Computer and Information Sciences : University of Delaware Characterization of 181.mcf Problem: BUT also Branch Instructions

Dept. of Computer and Information Sciences : University of Delaware Characterization of 181.mcf Reduce total/branch instructions and L1 I-cache and D-cache accesses.

Dept. of Computer and Information Sciences : University of Delaware Characterization of 181.mcf Model reduces L1 cache misses which reduces L2 cache accesses.

Dept. of Computer and Information Sciences : University of Delaware Putting Perf Counters to Use ► Important aspects of programs captured with performance counters ► Automatically construct model (PC Model)‏ ► Map performance counters to good opts ► Model predicts optimizations to apply ► Uses performance counter characterization

Dept. of Computer and Information Sciences : University of Delaware Training PC Model Compiler and

Dept. of Computer and Information Sciences : University of Delaware Programs to train model (different from test program). Compiler and Training PC Model

Dept. of Computer and Information Sciences : University of Delaware Baseline runs to capture performance counter values. Compiler and Training PC Model

Dept. of Computer and Information Sciences : University of Delaware Obtain performance counter values for a benchmark. Compiler and Training PC Model

Dept. of Computer and Information Sciences : University of Delaware Best optimizations runs to get speedup values. Compiler and Training PC Model

Dept. of Computer and Information Sciences : University of Delaware Best optimizations runs to get speedup values. Compiler and Training PC Model

Dept. of Computer and Information Sciences : University of Delaware New program interested in obtaining good performance. Compiler and Using PC Model

Dept. of Computer and Information Sciences : University of Delaware Baseline run to capture performance counter values. Compiler and Using PC Model

Dept. of Computer and Information Sciences : University of Delaware Feed performance counter values to model. Compiler and Using PC Model

Dept. of Computer and Information Sciences : University of Delaware Model outputs a distribution that is use to generate sequences Compiler and Using PC Model

Dept. of Computer and Information Sciences : University of Delaware Optimization sequences drawn from distribution. Compiler and Using PC Model

Dept. of Computer and Information Sciences : University of Delaware ► Trained on data from Random Search ► 500 evaluations for each benchmark ► Leave-one-out cross validation ► Training on N-1 benchmarks ► Test on Nth benchmark ► Logistic Regression PC Model

Dept. of Computer and Information Sciences : University of Delaware ► Variation of ordinary regression ► Inputs ► Continuous, discrete, or a mix ► 60 performance counters ► All normalized to cycles executed ► Ouputs ► Restricted to two values (0,1)‏ ► Probability an optimization is beneficial Logistic Regression

Dept. of Computer and Information Sciences : University of Delaware ► PathScale compiler ► Compare to highest optimization level ► 121 compiler flags ► AMD Athlon processor ► Real machine; Not simulation ► 57 benchmarks ► SPEC (INT 95, INT/FP 2000), MiBench, Polyhedral Experimental Methodology

Dept. of Computer and Information Sciences : University of Delaware ► RAND ► Randomly select 500 optimization seqs ► Combined Elimination [CGO 2006] ► Pure search technique ► Evaluate optimizations one at a time ► Eliminate negative optimizations in one go ► Out-performed other pure search techniques ► PC Model Evaluated Search Strategies

Dept. of Computer and Information Sciences : University of Delaware PC Model vs CE (MiBench/Polyhedral)

Dept. of Computer and Information Sciences : University of Delaware 1. 9 benchmarks over 20% improvement and 17% on average! 2. CE uses 607 iterations ( ) and PC Model 25 iterations. PC Model vs CE (MiBench/Polyhedral)

Dept. of Computer and Information Sciences : University of Delaware PC Model vs CE (SPEC 95/SPEC 2000)

Dept. of Computer and Information Sciences : University of Delaware 1. Obtain over 25% improvement on 7 benchmarks! 2. On average, CE obtains 9% and PC Model 17% over -ofast! PC Model vs CE (SPEC 95/SPEC 2000)

Dept. of Computer and Information Sciences : University of Delaware Performance vs Evaluations

Dept. of Computer and Information Sciences : University of Delaware Random (17%)‏ Combined Elimination (12%)‏ PC Model (17%)‏ Performance vs Evaluations

Dept. of Computer and Information Sciences : University of Delaware ► Combined Elimination ► Dependent on dimensions of space ► Easily stuck in local minima ► RAND ► Probabilistic technique ► Depends on distribution of good points ► Not susceptible to local minima Note: CE may improve in space with many bad opts. Why is CE worse than RAND?

Dept. of Computer and Information Sciences : University of Delaware ► Characterizing large programs hard ► Performance counters effectively summarize program's dynamic behavior ► Previously* used static features [CGO 2006] ► Does not work for whole program characterization Program Characterization

Dept. of Computer and Information Sciences : University of Delaware ► Use performance counters to find good optimization settings ► Out-performs production compiler in few evaluations (+ 3 for counters)‏ ► 2 orders of magnitude faster than best known pure search technique Conclusions

Dept. of Computer and Information Sciences : University of Delaware Backup Slides

Dept. of Computer and Information Sciences : University of Delaware Static vs Dynamic Features

Dept. of Computer and Information Sciences : University of Delaware 1. L1 Cache Accesses 2. L1 Dcache Hits 3. TLB Data Misses 4. Branch Instructions 5. Resource Stalls 6. Total Cycles 7. L2 Icache Hits 8. Vector Instructions 9. L2 Dcache Hits 10. L2 Cache Accesses 11. L1 Dcache Accesses 12. Hardware Interrupts 13. L2 Cache Hits 14. L1 Cache Hits 15. Branch Misses Most Informative Performance Counters Most Informative Features