Improving Compiler Heuristics with Machine Learning

1 Improving Compiler Heuristics with Machine Learning
Meta Optimization Improving Compiler Heuristics with Machine Learning Mark Stephenson, Una-May O’Reilly, Martin Martin, and Saman Amarasinghe MIT Computer Architecture Group

2 Motivation Compiler writers are faced with many challenges:
Many compiler problems are NP-hard Modern architectures are extremely complex Simple models can’t capture architecture intricacies Micro-architectures change quickly Compiler writers have a difficult job. First of all, many compiler problems are NP-hard. This is a problem because compilers are expected to run in a reasonable amount of time, and exact solutions to NP-hard problems take an unreasonable amount of time to compute. Furthermore, modern architectures – the targets of compilation – are becoming increasingly complicated. The simple models that compiler writers use don’t capture all of an architecture’s intricacies. Another problem is that microarchitectures change quickly: even though the ISA may not change, the compiler has to be re-tuned to match the new microarchitecture.

3 Motivation Heuristics alleviate complexity woes Unfortunately…
Find good approximate solutions for a large class of applications Find solutions quickly Unfortunately… They require a lot of trial-and-error tweaking to achieve suitable performance Fortunately for compiler writers, heuristics alleviate a lot of the aforementioned problems. In practice they find good approximate solutions for a large class of applications. And just as importantly, they find the solutions quickly. The big problem with heuristics is that they require a lot of tweaking in order to achieve a suitable level of performance.

4 Priority Functions A heuristic’s Achilles heel
A single priority or cost function often dictates the efficacy of a heuristic Priority functions rank the options available to a compiler heuristic Graph coloring register allocation (selecting nodes to spill) List scheduling (identifying instructions in worklist to schedule first) Hyperblock formation (selecting paths to include) A key insight that enables our research is that heuristics often have an Achilles heel: a single priority function may dictate the efficacy of a heuristic. This is what compiler writers spend so long tweaking. Here are a few examples of how priority functions are used. In graph coloring register allocation, a priority function is used to select nodes (or variables) to spill if spilling is required. A list scheduler uses a priority function to decide which instructions in the ready list to schedule first. And as you’ll see later in this talk, a well-known hyperblock formation algorithm uses priority functions to determine which paths to merge into a single predicated hyperblock.
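To make the idea concrete, here is a minimal Python sketch of a priority function in the register-allocation setting, written in the style of classic priority-based coloring allocators. The field names, cost constants, and the benefit/degree form are illustrative assumptions, not code from any actual compiler.

```python
# Illustrative sketch: a spill-priority function for a graph-coloring
# register allocator. The live range with the LOWEST priority (cheap to
# keep in memory, conflicts with many others) is the spill candidate.
def spill_priority(uses, defs, degree, load_cost=2.0, store_cost=2.0):
    """Estimated benefit of keeping a variable in a register,
    normalized by how many other live ranges it conflicts with."""
    benefit = uses * load_cost + defs * store_cost
    return benefit / max(degree, 1)

def choose_spill_candidate(live_ranges):
    """Pick the live range whose spill hurts least: lowest priority."""
    return min(live_ranges,
               key=lambda lr: spill_priority(lr["uses"], lr["defs"],
                                             lr["degree"]))
```

The point is that this one small arithmetic expression, not the surrounding coloring machinery, is what gets hand-tweaked.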

5 Machine Learning We propose using machine learning techniques to automatically search the priority function space Search space is feasible Make use of spare computer cycles So here’s our proposal: let’s use machine learning techniques to automatically search the priority function solution space. Since we limit our search to priority functions, the search space size is feasible. In other words, we’re not trying to find completely new heuristics. We want to work within established frameworks and try to find priority functions. This is a good way to make use of spare computer cycles.

6 Case Study I: Hyperblock Formation
Find predicatable regions of control flow Enumerate paths of control in region Exponential, but in practice it’s okay Prioritize paths based on several characteristics The priority function we want to optimize Add paths to hyperblock in priority order Here’s an example of a heuristic that uses a priority function. This example comes from Trimaran’s IMPACT compiler’s hyperblock formation algorithm. By the way, we use Trimaran to collect all the results in this presentation. Anyway, the algorithm first identifies predicatable regions (such as if-then-else statements). It then enumerates the paths of control through the region. It then uses a priority function to rank the paths based on program characteristics. This is the priority function that we want to find. The algorithm then blindly merges these paths in priority order, creating a predicated hyperblock (it does this until machine resources are used up). Don’t worry if you don’t understand exactly how this algorithm works. The take-home message is that there’s a single priority function that controls this algorithm. All told, Trimaran is close to 1 million lines of code, and if you change a single line of code, you can get up to 3x improvements for certain benchmarks.
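The steps above can be sketched as a short greedy loop. This is only a sketch of the shape of the algorithm; the path dictionaries, the `ops` resource model, and the single `budget` limit are stand-ins for IMPACT's much richer machinery.

```python
# Minimal sketch of the hyperblock-formation loop: rank the enumerated
# paths with the priority function, then greedily merge them in priority
# order until machine resources run out.
def form_hyperblock(paths, priority, budget):
    """paths: list of dicts, each with an 'ops' count plus feature fields.
    priority: the (tunable) priority function.
    budget: stand-in resource limit (total ops allowed in the block)."""
    hyperblock, used = [], 0
    for path in sorted(paths, key=priority, reverse=True):
        if used + path["ops"] <= budget:
            hyperblock.append(path)
            used += path["ops"]
    return hyperblock
```

Everything here is fixed plumbing except the `priority` argument, which is exactly the piece the learning algorithm will search over.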

7 Case Study I: IMPACT’s Function
Favor frequently executed paths Favor short paths Penalize paths with hazards Favor parallel paths Here’s IMPACT’s priority function for hyperblock formation (see Scott Mahlke’s thesis for details). It favors frequently executed paths, penalizes paths with hazards, favors short paths, and favors parallel paths. It seems to make sense. But are these the only characteristics that are important to hyperblock formation?
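One way the four factors on this slide could compose is sketched below. The exact form and constants are in Mahlke's thesis; this function only illustrates the shape, and `max_ops` and `hazard_penalty` are made-up weights.

```python
# Illustrative composition of the four factors (NOT IMPACT's actual
# formula): each factor rewards or penalizes one path characteristic.
def impact_style_priority(exec_ratio, num_ops, dep_height, has_hazard,
                          max_ops=50, hazard_penalty=0.25):
    freq = exec_ratio                        # favor frequently executed paths
    short = 1.0 - num_ops / max_ops          # favor short paths
    parallel = num_ops / max(dep_height, 1)  # favor parallel (fat) paths
    haz = hazard_penalty if has_hazard else 1.0  # penalize hazards
    return freq * short * parallel * haz
```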

8 Hyperblock Formation What are the important characteristics of a hyperblock formation priority function? IMPACT uses four characteristics Extract all the characteristics you can think of and have a machine learning algorithm find the priority function IMPACT only uses four characteristics. Are there other characteristics that might be useful in creating good priority functions? We would like to extract all the characteristics we can think of and feed them to a machine learning algorithm. Let’s let a machine find a good priority function for us.

9 Hyperblock Formation x1 Maximum ops over paths x2 Dependence height
x3 Number of paths x4 Number of operations x5 Does path have subroutine calls? x6 Number of branches x7 Does path have unsafe calls? x8 Path execution ratio x9 Does path have pointer derefs? x10 Average ops executed in path x11 Issue width of processor x12 Average predictability of branches in path xN Predictability product of branches in path So that’s what we do. We extract a bunch of program characteristics and feed them to a machine learning algorithm. We then replace Trimaran’s static priority function with one that our learning algorithm learned.

10 Genetic Programming GP’s representation is a directly executable expression Basically a lisp expression (or an AST) In our case, GP variables are interesting characteristics of the program (Pictured: an expression tree with operators -, /, and * over the leaves num_ops, 2.3, predictability, and 4.1.) While there are many machine-learning techniques out there, we chose to use genetic programming because it suits our needs well. Like priority functions, GP’s representation is a directly executable expression (a lisp expression). The GP variables (not to be confused with program variables) are program characteristics that we think might be useful in creating an effective priority function.
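A "directly executable expression" can be sketched concretely as nested tuples standing in for the lisp/AST representation. The specific tree below is one plausible reading of the one pictured on the slide, and the operator set is an illustrative subset.

```python
import operator

# Sketch of a GP expression: nested tuples as a lisp-like AST. Leaves
# are either program characteristics (looked up in `env`) or constants.
OPS = {"add": operator.add, "sub": operator.sub,
       "mul": operator.mul, "div": lambda a, b: a / b if b else 0.0}

def evaluate(expr, env):
    if isinstance(expr, tuple):
        op, lhs, rhs = expr
        return OPS[op](evaluate(lhs, env), evaluate(rhs, env))
    return env.get(expr, expr)  # variable name, or the constant itself

# One reading of the pictured tree: (- (/ num_ops 2.3) (* predictability 4.1))
tree = ("sub", ("div", "num_ops", 2.3), ("mul", "predictability", 4.1))
```

Because the representation is directly executable, a candidate priority function can be dropped into the compiler and run without any code generation step.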

11 Genetic Programming Searching algorithm analogous to natural selection
Maintain a population of expressions Selection The fittest expressions in the population are more likely to reproduce Sexual reproduction Crossing over subexpressions of two expressions Mutation We don’t want to take the analogy too far, but GP is a searching algorithm akin to natural selection. It maintains a population of expressions (in our case 400). Just as with natural selection, the fittest expressions in the population are more likely to reproduce and send their ‘genes’ to the next generation. It features a crossover operation that creates new expressions from two existing expressions. There’s also a mutation operation to introduce diversity into a possibly stagnant population. Unfortunately, I don’t have more time to talk about GP in this talk. The theory is, expressions should ‘evolve’ such that the overall fitness of the population increases every generation.

12 Genetic Programming Create initial population (initial solutions)
Most expressions in the initial population are randomly generated It is also seeded with the compiler writer’s best guesses (Flowchart: create initial population → evaluation → selection → generation of variants (mutation and crossover) → repeat while generations < limit → END.) Here’s the flow of genetic programming. Our initial population is mostly random (399/400 expressions). We do, however, seed it with the priority function that comes with Trimaran.

13 Genetic Programming Create initial population (initial solutions)
Each expression is evaluated by compiling and running benchmark(s) Fitness is the relative speedup over the baseline on benchmark(s) The baseline expression is the one that’s distributed with Trimaran The algorithm then evaluates each expression in the population. It evaluates an expression by compiling and running the benchmark(s) in our benchmark suite. It then assigns a fitness to each expression based on the relative speedup over the baseline priority function – the one that’s distributed with Trimaran.
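Fitness as described here is a speedup ratio over the baseline, aggregated across benchmarks. The geometric mean is a natural aggregation for ratios, though the exact aggregation the system uses is an assumption of this sketch.

```python
import math

# Fitness sketch: relative speedup over the baseline priority function,
# aggregated over the benchmark suite with a geometric mean (assumed).
def fitness(baseline_times, candidate_times):
    speedups = [b / c for b, c in zip(baseline_times, candidate_times)]
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```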

14 Genetic Programming Just as with natural selection, the fittest individuals are more likely to survive and reproduce. The selection phase simply sorts the expressions by fitness.

15 Genetic Programming
If the algorithm hasn’t reached a user-defined limit on the number of ‘generations’, it continues with another round of evaluation, selection, and generation of variants…

16 Genetic Programming Use crossover and mutation to generate new expressions From the existing population, the algorithm creates a new generation. It does so via crossover (the analog of sexual reproduction) and mutation. With the exception of the best expression, any expression in the population can be replaced (this is referred to as an elitist policy).
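The loop described across slides 12 through 16 can be sketched end to end. The tournament scheme (sampling parents from the top half) and the mutation rate are assumptions of this sketch, not the talk's actual settings; `crossover` and `mutate` are assumed helpers over the expression representation.

```python
import random

# End-to-end sketch of the GP loop: evaluate, sort by fitness, keep the
# best expression (elitist policy), refill with crossover and mutation.
def evolve(population, fitness, crossover, mutate,
           generations, mutation_rate=0.05):
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        next_gen = [ranked[0]]  # elitism: the best expression survives
        while len(next_gen) < len(population):
            # Fitter expressions are more likely to reproduce: draw
            # parents from the top half of the ranking (assumed scheme).
            p1, p2 = random.choices(ranked[:len(ranked) // 2], k=2)
            child = crossover(p1, p2)
            if random.random() < mutation_rate:
                child = mutate(child)
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)
```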

17 Hyperblock Results Compiler Specialization
(Bar chart: per-benchmark speedups for the train data set versus an alternate data set, over 129.compress, g721decode, g721encode, huff_dec, huff_enc, mpeg2dec, rawcaudio, rawdaudio, and toast; average speedups 1.54 and 1.23. Two of the evolved priority functions are shown, e.g. (add (div $d20 $d5) (tern $b2 $d0 $d9)).) Here are some of the results we have collected. This graph shows the results of ‘specializing’ a compiler for a given application. It’s basically a limit study – the best one could hope to get out of a general-purpose priority function. In other words, we ‘train’ the priority function using one benchmark: for each benchmark we try to find a unique priority function that maximizes its performance. Using the same input data set that we used to train each benchmark, the overall speedup we get is 54%. This is unfair, though, since it essentially partially evaluates the benchmark. If we apply an alternate data set to each benchmark, we get a 23% improvement. But note that hyperblock selection is very susceptible to variations in input data.

18 Hyperblock Results A General Purpose Priority Function
Here we try to find a single general-purpose priority function that works well for all the benchmarks in this graph. In machine-learning terms, all of the benchmarks in this graph comprise our ‘training set’. Once again, the dark bars represent the speedups when applying the same data sets to the benchmarks that were used to train the priority function. When applying an alternate data set (one the GP algorithm hasn’t seen yet), we get a 25% improvement. This is still a somewhat unfair comparison: a compiler has to work well for a wide variety of benchmarks, not just the 12 shown on this slide. Though it may be a useful (but skanky!) way for compiler writers to improve their SPEC numbers – just throw all the SPEC benchmarks into the training set!

19 Cross Validation Testing General Purpose Applicability
To really test the general applicability of the priority function, we have to apply it to benchmarks that the GP algorithm hasn’t seen before. The machine learning community refers to this as cross validation. Applying this priority function to all the benchmarks in this graph, we get a 9% improvement. We achieve speedups on all the benchmarks in this ‘test’ set except for two: unepic and 085.cc1.

20 Case Study II: Register Allocation A General Purpose Priority Function
Here are some results for register allocation, a well-studied optimization (see Chow and Hennessy). The benchmarks on this slide were used to find a single priority function that maximized (normalized) performance across them. Even for this well-studied optimization, we found a 3% improvement. And when we apply an alternate data set, the improvement stays at 3%. This implies that register allocation is not as susceptible to variations in input as hyperblock formation is.

21 Register Allocation Results Cross Validation
When we apply the priority function to an unrelated ‘test’ set, we see that the performance gains remain. For a 32-register machine, on which we trained the priority function, we get a 2% improvement. When we apply the priority function to these benchmarks on a 64-register machine, the speedup is 3%. This suggests that the register allocation priority function is stable across benchmarks, input sets, and even architectural variations.

22 Conclusion Machine learning techniques can identify effective priority functions ‘Proof of concept’ by evolving two well known priority functions Human cycles v. computer cycles In conclusion, we provide ‘proof of concept’ here that priority functions can be learned by machines. Why not trade off human cycles for machine cycles?

23 GP Hyperblock Solutions General Purpose
(add (sub (mul exec_ratio_mean ) ) (mul (cmul (not has_pointer_deref) (mul num_paths) (mul (add (sub (mul (div num_ops dependence_height) ) exec_ratio) (sub (mul (cmul has_unsafe_jsr predict_product_mean ) (sub num_ops_max)) (sub (mul dependence_height_mean num_branches_max) num_paths))))))) SAFETY SLIDES FOLLOW: the general-purpose priority function for hyperblock formation. The highlighted subexpression is an intron that doesn’t affect the solution.

24 GP Hyperblock Solutions General Purpose
(add (sub (mul exec_ratio_mean ) ) (mul (cmul (not has_pointer_deref) (mul num_paths) (mul (add (sub (mul (div num_ops dependence_height) ) exec_ratio) (sub (mul (cmul has_unsafe_jsr predict_product_mean ) (sub num_ops_max)) (sub (mul dependence_height_mean num_branches_max) num_paths))))))) Favor paths that don’t have pointer dereferences

25 GP Hyperblock Solutions General Purpose
(add (sub (mul exec_ratio_mean ) ) (mul (cmul (not has_pointer_deref) (mul num_paths) (mul (add (sub (mul (div num_ops dependence_height) ) exec_ratio) (sub (mul (cmul has_unsafe_jsr predict_product_mean ) (sub num_ops_max)) (sub (mul dependence_height_mean num_branches_max) num_paths))))))) Favor highly parallel (fat) paths

26 GP Hyperblock Solutions General Purpose
(add (sub (mul exec_ratio_mean ) ) (mul (cmul (not has_pointer_deref) (mul num_paths) (mul (add (sub (mul (div num_ops dependence_height) ) exec_ratio) (sub (mul (cmul has_unsafe_jsr predict_product_mean ) (sub num_ops_max)) (sub (mul dependence_height_mean num_branches_max) num_paths))))))) If a path calls a subroutine that may have side effects, penalize it

27 Case Study I: IMPACT’s Algorithm
(CFG example: blocks A–G with edge execution counts – 4k, 24k, 22k, 2k, 25, 10, 28k – and a table of per-path characteristics: execution ratio, hazard, ops, dependence height, and resulting priority for paths A-B-D-F-G, A-B-F-G, A-C-F-G, A-C-E-F-G, and A-C-E-G.) Detailed example of IMPACT’s hyperblock formation algorithm.

28 Case Study I: IMPACT’s Algorithm
(Same CFG example and path table as the previous slide.) The priority function determines which paths to merge into a predicated hyperblock.

