Improving Compiler Heuristics with Machine Learning

Presentation transcript:

Meta Optimization: Improving Compiler Heuristics with Machine Learning. Mark Stephenson, Una-May O’Reilly, Martin Martin, and Saman Amarasinghe. MIT Computer Architecture Group. http://www.cag.lcs.mit.edu/metaopt

Motivation. Compiler writers are faced with many challenges: many compiler problems are NP-hard; modern architectures are increasingly complex; simple models can’t capture architecture intricacies; and micro-architectures change quickly. Compiler writers have a difficult job. First of all, many compiler problems are NP-hard. This is a problem since compilers are expected to run in a reasonable amount of time; even when exact solutions to NP-hard problems exist, they take an unreasonable amount of time to compute. Furthermore, modern architectures – the targets of compilation – are becoming increasingly complicated. The simple models that compiler writers use don’t capture all of the architecture’s intricacies. Another problem is that microarchitectures change quickly: even though the ISA may not change, the compiler has to be re-tuned to match the architecture.

Motivation. Heuristics alleviate complexity woes: they find good approximate solutions for a large class of applications, and they find those solutions quickly. Unfortunately, they require a lot of trial-and-error tweaking to achieve suitable performance. Fortunately for compiler writers, heuristics alleviate a lot of the aforementioned problems. In practice they find good approximate solutions for a large class of applications, and just as importantly, they find the solutions quickly. The big problem with heuristics is that they require a lot of tweaking in order to achieve a suitable level of performance.

Priority Functions: a heuristic’s Achilles heel. A single priority (or cost) function often dictates the efficacy of a heuristic. Priority functions rank the options available to a compiler heuristic: graph coloring register allocation (selecting nodes to spill), list scheduling (identifying instructions in the worklist to schedule first), and hyperblock formation (selecting paths to include). A key insight that enables our research is that heuristics often have an Achilles heel: a single priority function may dictate the efficacy of a heuristic. This is what compiler writers spend so long tweaking. Here are a few examples of how priority functions are used. In graph coloring register allocation, a priority function is used to select nodes (or variables) to spill if spilling is required. A list scheduler uses a priority function to decide which instructions in the ready list to schedule first. And as you’ll see later in this talk, a well-known hyperblock formation algorithm uses priority functions to determine which paths to merge into a single predicated hyperblock.
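
To make the role of a priority function concrete, here is a minimal sketch (in Python; the data structures and the critical_path_priority example are hypothetical illustrations, not Trimaran code) of a list scheduler whose behavior is dictated entirely by a swappable priority function:

    # Minimal sketch of a priority-driven list scheduler. The 'priority'
    # callable is the single knob that dictates the heuristic's behavior --
    # exactly the kind of function Meta Optimization tries to learn.
    def list_schedule(instructions, deps, priority):
        """Greedily pick the ready instruction the priority function ranks highest."""
        scheduled, remaining = [], set(instructions)
        while remaining:
            # An instruction is ready once all of its predecessors are scheduled.
            ready = [i for i in remaining
                     if all(p in scheduled for p in deps.get(i, []))]
            best = max(ready, key=priority)
            scheduled.append(best)
            remaining.remove(best)
        return scheduled

    # One hand-written priority a compiler writer might tweak by hand:
    # favor instructions with large dependence height (illustrative only;
    # 'height' is an assumed attribute).
    def critical_path_priority(instr):
        return instr.height

Swapping critical_path_priority for a learned expression changes the schedule without touching the surrounding heuristic.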

Machine Learning We propose using machine learning techniques to automatically search the priority function space Search space is feasible Make use of spare computer cycles So here’s our proposal: let’s use machine learning techniques to automatically search the priority function solution space. Since we limit our search to priority functions, the search space size is feasible. In other words, we’re not trying to find completely new heuristics. We want to work within established frameworks and try to find priority functions. This is a good way to make use of spare computer cycles. http://www.cag.lcs.mit.edu/metaopt

Case Study I: Hyperblock Formation. Find predicatable regions of control flow; enumerate the paths of control in the region (exponential, but in practice it’s okay); prioritize paths based on several characteristics (the priority function we want to optimize); add paths to the hyperblock in priority order. Here’s an example of a heuristic that uses a priority function. This example comes from Trimaran’s IMPACT compiler’s hyperblock formation algorithm. (By the way, we use Trimaran to collect all the results in this presentation.) The algorithm first identifies predicatable regions (such as if-then-else statements). It then enumerates the paths of control through the region, and uses a priority function to rank the paths based on program characteristics. This is the priority function that we want to find. The algorithm then blindly merges these paths in priority order, creating a predicated hyperblock (it does this until machine resources are used up). Don’t worry if you don’t understand exactly how this algorithm works; the take-home message is that there’s a single priority function that controls this algorithm. All told, Trimaran is close to 1 million lines of code, and if you change a single line of code, you can get up to 3x improvements for certain benchmarks.
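
The shape of that loop is easy to sketch. The following is a simplified Python illustration, not IMPACT’s actual code; enumerate_paths and resources_exhausted are assumed helpers:

    # Sketch of priority-ordered hyperblock formation.
    def form_hyperblock(region, priority, enumerate_paths, resources_exhausted):
        paths = enumerate_paths(region)         # exponential in theory, fine in practice
        paths.sort(key=priority, reverse=True)  # the (learned) priority function goes here
        hyperblock = []
        for path in paths:
            if resources_exhausted(hyperblock, path):
                break
            hyperblock.append(path)             # path gets if-converted into the hyperblock
        return hyperblock

Everything except the priority argument is fixed machinery; learning a better priority changes which paths get predicated.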

Case Study I: IMPACT’s Function. The function favors frequently executed paths, favors short paths, penalizes paths with hazards, and favors parallel paths. Here’s IMPACT’s priority function for hyperblock formation (see Scott Mahlke’s thesis for details). It favors frequently executed paths, penalizes paths with hazards, favors short paths, and favors parallel paths. It seems to make sense. But are these the only characteristics that are important to hyperblock formation?
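
The exact formula is in Mahlke’s thesis; purely to illustrate the shape of a function built from those four traits, here is an assumption-laden sketch (this is NOT the real formula, and every path attribute below is hypothetical):

    # Illustrative only -- not Mahlke's actual priority function.
    def impact_style_priority(path):
        hazard_penalty = 0.25 if path.has_hazard else 1.0  # penalize hazards
        return (path.exec_ratio            # favor frequently executed paths
                * hazard_penalty
                * path.parallelism         # favor parallel paths
                / max(path.num_ops, 1))    # favor short paths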

Hyperblock Formation. What are the important characteristics of a hyperblock formation priority function? IMPACT uses four characteristics. Our approach: extract all the characteristics you can think of and have a machine learning algorithm find the priority function. IMPACT only uses four characteristics. Are there other characteristics that might be useful in creating good priority functions? We would like to extract all the characteristics we can think of and feed them to a machine learning algorithm. Let’s let a machine find a good priority function for us.

Hyperblock Formation: candidate path characteristics.
x1: Maximum ops over paths
x2: Dependence height
x3: Number of paths
x4: Number of operations
x5: Does path have subroutine calls?
x6: Number of branches
x7: Does path have unsafe calls?
x8: Path execution ratio
x9: Does path have pointer derefs?
x10: Average ops executed in path
x11: Issue width of processor
x12: Average predictability of branches in path
…
xN: Predictability product of branches in path
So that’s what we do. We extract a bunch of program characteristics and feed them to a machine learning algorithm. We replace Trimaran’s static priority function with one that our learning algorithm learned.
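
In practice these characteristics become the variables a learned expression can reference. A sketch of the packaging step (the names are illustrative, not Trimaran’s):

    # Sketch: bundle a path's characteristics into the variable environment
    # that a learned priority expression will be evaluated against.
    def extract_features(path, processor):
        return {
            "num_ops": path.num_ops,
            "dependence_height": path.dependence_height,
            "exec_ratio": path.exec_ratio,
            "num_branches": path.num_branches,
            "has_pointer_deref": float(path.has_pointer_deref),
            "has_unsafe_jsr": float(path.has_unsafe_jsr),
            "issue_width": processor.issue_width,
            "predict_product": path.branch_predictability_product,
        }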

Genetic Programming. GP’s representation is a directly executable expression, basically a Lisp expression (or an AST). In our case, GP variables are interesting characteristics of the program. The slide depicts an expression tree over the leaves num_ops, 2.3, predictability, and 4.1 with the operators -, /, and *, i.e. something like (- (/ num_ops 2.3) (* predictability 4.1)). While there are many machine-learning techniques out there, we chose to use genetic programming because it suits our needs well. Like priority functions, GP’s representation is a directly executable expression (a Lisp expression). The GP variables (not to be confused with program variables) are program characteristics that we think might be useful in creating an effective priority function.
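
A directly executable tree is simple to realize. A minimal sketch, assuming features come as a dict like the one extract_features above produces (this is not the paper’s implementation):

    # Sketch of a directly executable GP expression tree.
    class Node:
        def __init__(self, op, *children):
            self.op, self.children = op, children

        def evaluate(self, features):
            if self.op == "var":            # leaf: a program characteristic
                return features[self.children[0]]
            if self.op == "const":          # leaf: a real-valued constant
                return self.children[0]
            a = self.children[0].evaluate(features)
            b = self.children[1].evaluate(features)
            if self.op == "add": return a + b
            if self.op == "sub": return a - b
            if self.op == "mul": return a * b
            if self.op == "div": return a / b if b else 0.0  # protected division
            raise ValueError(self.op)

    # (- (/ num_ops 2.3) (* predictability 4.1)) as a tree:
    expr = Node("sub",
                Node("div", Node("var", "num_ops"), Node("const", 2.3)),
                Node("mul", Node("var", "predictability"), Node("const", 4.1)))

Protected division (returning 0.0 on a zero divisor) is a common GP convention so that randomly generated expressions never crash; that detail is an assumption here.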

Genetic Programming is a searching algorithm analogous to natural selection. It maintains a population of expressions. Selection: the fittest expressions in the population are more likely to reproduce. Sexual reproduction: crossing over subexpressions of two expressions. Mutation. We don’t want to take the analogy too far, but GP is a searching algorithm akin to natural selection. It maintains a population of expressions (in our case 400). Just as with natural selection, the fittest expressions in the population are more likely to reproduce and send their ‘genes’ to the next generation. It features a reproduction operation that creates two new expressions from two existing expressions. There’s also a mutation operation to introduce diversity into a possibly stagnant population. Unfortunately, I don’t have more time to talk about GP in this talk. The theory is, expressions should ‘evolve’ such that the overall fitness of the population increases every generation.
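
Selection and crossover can be sketched in a few lines. This is one simple variant under stated assumptions: fitness is a precomputed dict of scores, and random_subtree / replace_random_subtree are hypothetical helpers; the paper’s GP setup may differ in detail:

    import copy
    import random

    # Fitness-proportional selection: fitter expressions are
    # proportionally more likely to become parents.
    def select(population, fitness):
        r = random.uniform(0, sum(fitness[e] for e in population))
        for expr in population:
            r -= fitness[expr]
            if r <= 0:
                return expr
        return population[-1]

    # Subtree crossover: graft a random subexpression of one parent
    # into a copy of the other.
    def crossover(parent_a, parent_b):
        child = copy.deepcopy(parent_a)
        donor = copy.deepcopy(random_subtree(parent_b))   # hypothetical helper
        replace_random_subtree(child, donor)              # hypothetical helper
        return child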

Genetic Programming: create initial population (initial solutions). Most expressions in the initial population are randomly generated; it is also seeded with the compiler writer’s best guesses. The flow is: create initial population, then evaluation, selection, and generation of variants (mutation and crossover), repeating while generations < limit, then end. Here’s the flow of genetic programming. Our initial population is mostly random (399/400 expressions). We do however seed it with the priority function that comes with Trimaran.
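
A sketch of that seeding step (random_tree is a hypothetical generator of random expression trees, and its depth bound is an arbitrary choice; the size of 400 matches the population mentioned earlier):

    # Sketch: mostly random initial population, seeded with the
    # compiler writer's hand-tuned baseline expression.
    def initial_population(baseline_expr, size=400):
        population = [baseline_expr]                     # 1 seed: Trimaran's function
        while len(population) < size:                    # + 399 random expressions
            population.append(random_tree(max_depth=6))  # hypothetical helper
        return population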

Genetic Programming: evaluation. Each expression is evaluated by compiling and running benchmark(s). Fitness is the relative speedup over the baseline on the benchmark(s); the baseline expression is the one that’s distributed with Trimaran. The algorithm evaluates each expression in the population by compiling and running the benchmark(s) in our ‘test’ suite. It then assigns a fitness to each expression based on the relative speedup over the baseline priority function – the one that’s distributed with Trimaran.
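
As a sketch of the evaluation step (compile_and_run is a hypothetical harness that builds a benchmark with the given priority expression and returns its running time; aggregating with a geometric mean is one reasonable choice, not necessarily the paper’s):

    # Sketch: fitness = speedup relative to the baseline priority function.
    def fitness(expr, benchmarks, baseline_times, compile_and_run):
        speedups = [baseline_times[b] / compile_and_run(b, expr)
                    for b in benchmarks]
        product = 1.0
        for s in speedups:       # geometric mean across benchmarks
            product *= s
        return product ** (1.0 / len(speedups))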

Genetic Programming: selection. Just as with natural selection, the fittest individuals are more likely to survive and reproduce. The selection phase simply sorts the expressions by fitness.

Genetic Programming: if the algorithm hasn’t reached a user-defined limit on the number of ‘generations’, the algorithm continues; otherwise it ends.

Genetic Programming: generation of variants. Use crossover and mutation to generate new expressions. With the existing population, the algorithm creates a new generation. It does so via crossover (the analogy of sexual reproduction) and mutation. With the exception of the best expression, any expression in the population can be replaced (referred to as an elitist policy).
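
Putting the pieces together, one generation under the elitist policy might look like this sketch (fitness here is a dict of precomputed scores, mutation_rate is an assumed value, and select, crossover, and mutate are the operators sketched above, under the same assumptions):

    import random

    # Sketch of one GP generation with elitism: the single best expression
    # always survives unchanged; every other slot can be replaced.
    def next_generation(population, fitness, select, crossover, mutate,
                        mutation_rate=0.05):
        ranked = sorted(population, key=lambda e: fitness[e], reverse=True)
        new_pop = [ranked[0]]                        # elitism: keep the best
        while len(new_pop) < len(population):
            child = crossover(select(population, fitness),
                              select(population, fitness))
            if random.random() < mutation_rate:
                child = mutate(child)                # inject diversity
            new_pop.append(child)
        return new_pop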

Hyperblock Results: Compiler Specialization. [Bar chart: speedup over the baseline for toast, huff_dec, huff_enc, mpeg2dec, g721encode, g721decode, rawcaudio, rawdaudio, and 129.compress, plus the average; dark bars use the training data set (average speedup 1.54), light bars an alternate data set (average 1.23). Two of the evolved expressions shown: (add (sub (cmul (gt (cmul $b0 0.8982 $d17)…$d7)) (cmul $b0 0.6183 $d28))) and (add (div $d20 $d5) (tern $b2 $d0 $d9)).] Here are some of the results that we have collected. This graph shows the results of ‘specializing’ a compiler for a given application. It’s basically a limit study: the best one could hope to get out of a general-purpose priority function. In other words, we ‘train’ the priority function using one benchmark; for each benchmark we try to find a unique priority function that maximizes performance for it. Using the same input data set that we used to train each benchmark, the overall speedup that we get is 54%. This is unfair though, since it essentially partially evaluates the benchmark. If we apply an alternate data set to each benchmark, we get a 23% improvement. But note that hyperblock selection is very susceptible to variations in input data.

Hyperblock Results: A General Purpose Priority Function. Here we try to find a single general-purpose priority function that works well for all the benchmarks in this graph. In machine-learning terms, all of the benchmarks in this graph comprise our ‘training set’. Once again, the dark bar represents the speedups when applying the same data sets to the benchmarks that were used to train the priority function. When applying an alternate data set (that the GP algorithm hasn’t seen yet), we get a 25% improvement. This is still a somewhat unfair comparison: a compiler has to work well for a wide variety of benchmarks, not just the 12 shown in this slide. Though it may be a useful (but skanky!) way for compiler writers to improve their SPEC numbers: just throw all the SPEC benchmarks into the training set!

Cross Validation: Testing General Purpose Applicability. To really test the general applicability of the priority function, we have to apply it to benchmarks that the GP algorithm hasn’t seen before. The machine learning community refers to this as cross validation. Applying this priority function to all the benchmarks in this graph we get a 9% improvement. We achieve speedups on all the benchmarks in this ‘test’ set except for two: unepic and 085.cc1.

Case Study II: Register Allocation. A General Purpose Priority Function. Here are some results for register allocation, which is a well-studied optimization (see Chow and Hennessy). The benchmarks in this slide were used to find a single priority function which maximized (normalized) performance across them. Even for this well-studied optimization, we found a 3% improvement. And when we apply an alternate data set, the improvement stays at 3%. This implies that register allocation is not as susceptible to variations in input as hyperblock formation is.

Register Allocation Results: Cross Validation. When we apply the priority function to an unrelated ‘test’ set, we see that the performance gains remain. For a 32-register machine, on which we trained the priority function, we get a 2% improvement. When we apply the priority function to these benchmarks on a 64-register machine, the speedup is 3%. This suggests that the register allocation priority function is stable across benchmarks, input sets, and even architectural variations.

Conclusion. Machine learning techniques can identify effective priority functions. ‘Proof of concept’ by evolving two well-known priority functions. Human cycles v. computer cycles. In conclusion, we provide ‘proof of concept’ here that priority functions can be learned by machines. Why not trade off human cycles for machine cycles?

GP Hyperblock Solutions: General Purpose. (add (sub (mul exec_ratio_mean 0.8720) 0.9400) (mul 0.4762 (cmul (not has_pointer_deref) (mul 0.6727 num_paths) (mul 1.1609 (add (sub (mul (div num_ops dependence_height) 10.8240) exec_ratio) (sub (mul (cmul has_unsafe_jsr predict_product_mean 0.9838) (sub 1.1039 num_ops_max)) (sub (mul dependence_height_mean num_branches_max) num_paths))))))) Annotation: an intron that doesn’t affect the solution. SAFETY SLIDES FOLLOW: the general-purpose priority function for hyperblock formation.

GP Hyperblock Solutions: General Purpose. (add (sub (mul exec_ratio_mean 0.8720) 0.9400) (mul 0.4762 (cmul (not has_pointer_deref) (mul 0.6727 num_paths) (mul 1.1609 (add (sub (mul (div num_ops dependence_height) 10.8240) exec_ratio) (sub (mul (cmul has_unsafe_jsr predict_product_mean 0.9838) (sub 1.1039 num_ops_max)) (sub (mul dependence_height_mean num_branches_max) num_paths))))))) Annotation: favor paths that don’t have pointer dereferences.

GP Hyperblock Solutions: General Purpose. (add (sub (mul exec_ratio_mean 0.8720) 0.9400) (mul 0.4762 (cmul (not has_pointer_deref) (mul 0.6727 num_paths) (mul 1.1609 (add (sub (mul (div num_ops dependence_height) 10.8240) exec_ratio) (sub (mul (cmul has_unsafe_jsr predict_product_mean 0.9838) (sub 1.1039 num_ops_max)) (sub (mul dependence_height_mean num_branches_max) num_paths))))))) Annotation: favor highly parallel (fat) paths.

GP Hyperblock Solutions: General Purpose. (add (sub (mul exec_ratio_mean 0.8720) 0.9400) (mul 0.4762 (cmul (not has_pointer_deref) (mul 0.6727 num_paths) (mul 1.1609 (add (sub (mul (div num_ops dependence_height) 10.8240) exec_ratio) (sub (mul (cmul has_unsafe_jsr predict_product_mean 0.9838) (sub 1.1039 num_ops_max)) (sub (mul dependence_height_mean num_branches_max) num_paths))))))) Annotation: if a path calls a subroutine that may have side effects, penalize it.

Case Study I: IMPACT’s Algorithm. [Figure: a control-flow graph with blocks A through G and edge execution counts (4k, 24k, 22k, 2k, 10, 25, 28k, …), alongside a table giving, for each path (A-B-D-F-G, A-B-F-G, A-C-F-G, A-C-E-F-G, A-C-E-G), its exec ratio, hazard, op count, dependence height, and resulting priority.] Detailed example of IMPACT’s hyperblock formation algorithm.

Case Study I: IMPACT’s Algorithm. [Same control-flow graph and path table as the previous slide.] The priority function determines which paths to merge into a predicated hyperblock.