Improving Compiler Heuristics with Machine Learning Mark Stephenson Una-May O’Reilly Martin C. Martin Saman Amarasinghe Massachusetts Institute of Technology.



System Complexities
Compiler complexity:
- Open Research Compiler: ~3.5 million lines of C/C++ code
- Trimaran's compiler: ~800,000 lines of C code
Architecture complexity, from the Pentium (~3M transistors) to the Pentium 4 (~55M transistors):
- Superscalar execution
- Branch prediction
- Speculative execution
- MMU and improved FPU

NP-Completeness
- Many compiler problems are NP-complete
- Thus, implementations can't be optimal
- Compiler writers rely on heuristics
- In practice, heuristics perform well...
- ...but they require a lot of tweaking
- Heuristics often have a focal point: they rely on a single priority function

Priority Functions: A Heuristic's Achilles Heel
- A single priority (or cost) function often dictates the efficacy of a heuristic
- Priority functions rank the options available to a compiler heuristic:
  - List scheduling: which instruction in the worklist to schedule first
  - Graph-coloring register allocation: which nodes to spill
  - Hyperblock formation: which paths to include
- Any priority function is legal
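To make the ranking role concrete, here is a hedged sketch of the list-scheduling case: rank ready instructions by the length of the longest latency-weighted dependence chain below them. The function names, the toy DAG, and the latencies are illustrative, not Trimaran's actual heuristic.

```python
# Illustrative priority function for list scheduling: prefer the ready
# instruction that heads the longest remaining dependence chain.

def critical_path_length(instr, successors, latency, memo=None):
    """Longest latency-weighted path from instr to any leaf of the DAG."""
    if memo is None:
        memo = {}
    if instr in memo:
        return memo[instr]
    succs = successors.get(instr, [])
    tail = max(
        (critical_path_length(s, successors, latency, memo) for s in succs),
        default=0,
    )
    memo[instr] = latency[instr] + tail
    return memo[instr]

def pick_next(ready, successors, latency):
    """The priority function ranks the worklist; schedule the best first."""
    return max(ready, key=lambda i: critical_path_length(i, successors, latency))

# Tiny dependence DAG: a -> b -> d, a -> c; per-instruction latencies.
successors = {"a": ["b", "c"], "b": ["d"]}
latency = {"a": 1, "b": 3, "c": 1, "d": 2}
print(pick_next(["a", "c"], successors, latency))  # prints "a": it heads the longer chain
```

Swapping in a different priority function changes the schedule without touching any of the legality machinery, which is exactly what makes these functions attractive targets for search.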

Our Proposal
- Use machine-learning techniques to automatically search the priority function space
- Increases the compiler's performance
- Reduces compiler design complexity

Qualities of Priority Functions
- Can focus on a small portion of an optimization algorithm
- No need to worry about legality checking
- A small change can yield big payoffs
- Clear specification in terms of input/output
- Prevalent in compiler heuristics

An Example Optimization: Hyperblock Scheduling
- Conditional execution is potentially very expensive on a modern architecture
- Modern processors try to dynamically predict the outcome of the condition
- This works great for predictable branches...
- ...but some conditions can't be predicted
- When the prediction is wrong, a lot of time is wasted

Example Optimization: Hyperblock Scheduling
[Figure: schedule for if (a[1] == 0) ... else ..., assuming a[1] is 0, plotted as time vs. resources; a misprediction wastes both]

Example Optimization: Hyperblock Scheduling
[Figure: the same if (a[1] == 0) ... else ... schedule, assuming a[1] is 0, plotted as time vs. resources]

Example Optimization: Hyperblock Scheduling (using predication)
[Figure: predicated schedule for if (a[1] == 0) ... else ..., assuming a[1] is 0, plotted as time vs. resources]
The processor simply discards the results of instructions that weren't supposed to run.
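The discard-on-false-predicate behavior can be modeled in a few lines. This is an illustrative software model of the semantics, not a hardware description; the function name is made up.

```python
# Illustrative model of predicated execution: both arms of the branch
# are issued unconditionally, each instruction carries a predicate, and
# results whose predicate is false are simply discarded.

def predicated_select(cond, then_arm, else_arm):
    p = cond            # predicate register set by the comparison
    t = then_arm()      # executes regardless of p
    e = else_arm()      # executes regardless of p
    return t if p else e  # only the predicated-true result commits

a = [7, 0, 3]
r = predicated_select(a[1] == 0, lambda: "then-arm", lambda: "else-arm")
print(r)  # prints "then-arm": a[1] is 0, the else-arm's work is thrown away
```

The cost model follows directly: predication spends execution resources on both arms but never pays a misprediction penalty, while a branch pays only when the predictor is wrong.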

Example Optimization: Hyperblock Scheduling
- There are unclear tradeoffs:
  - In some situations, hyperblocks are faster than traditional execution
  - In others, hyperblocks impair performance
- Many factors affect this decision:
  - Accuracy of the branch predictor
  - Availability of parallel execution resources
  - Effectiveness of the compiler's scheduler
  - Parallelizability and predictability of the program
- Hard to model

Example Optimization: Hyperblock Scheduling [Mahlke]
- Find predicable regions of control flow
- Enumerate the paths of control in each region (exponential, but okay in practice)
- Prioritize paths based on four path characteristics (this is the priority function we want to optimize)
- Add paths to the hyperblock in priority order
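The enumeration step can be sketched as a depth-first walk over an acyclic region. The toy CFG and the function name are illustrative; real regions can make this exponential, as the slide notes.

```python
# Sketch of path enumeration: list every control path through a small
# acyclic region from its entry to its exit.

def enumerate_paths(cfg, node, exit_node, prefix=()):
    prefix = prefix + (node,)
    if node == exit_node:
        return [prefix]
    paths = []
    for succ in cfg.get(node, []):
        paths.extend(enumerate_paths(cfg, succ, exit_node, prefix))
    return paths

# Diamond-shaped region: entry branches to b or c, both rejoin at exit.
cfg = {"entry": ["b", "c"], "b": ["exit"], "c": ["exit"]}
paths = enumerate_paths(cfg, "entry", "exit")
for p in paths:
    print(" -> ".join(p))
# entry -> b -> exit
# entry -> c -> exit
```

Each enumerated path is then scored by the priority function and added to the hyperblock in priority order.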

Trimaran's Priority Function
- Favor frequently executed paths
- Favor short paths
- Penalize paths with hazards
- Favor parallel paths
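A priority function combining these four ingredients might look like the sketch below. Only the four characteristics come from the slide; the weights, formula, and names are invented for illustration, and Trimaran's real function differs.

```python
# Hypothetical path-priority sketch built from the four slide ingredients:
# execution frequency, path length, hazards, and parallelism.

def path_priority(exec_ratio, num_ops, dependence_height, has_hazard):
    parallelism = num_ops / max(dependence_height, 1)  # "fat" paths score higher
    hazard_penalty = 0.25 if has_hazard else 1.0       # penalize hazardous paths
    length_penalty = 1.0 / (1.0 + num_ops)             # favor short paths
    return exec_ratio * parallelism * length_penalty * hazard_penalty

hot_short = path_priority(exec_ratio=0.9, num_ops=4, dependence_height=2, has_hazard=False)
cold_hazardous = path_priority(exec_ratio=0.1, num_ops=12, dependence_height=10, has_hazard=True)
print(hot_short > cold_hazardous)  # prints True
```

Exactly how to weigh and combine these features is the hard, hand-tuned part, which is what motivates searching the space automatically.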

Our Approach
- Trimaran uses four characteristics
- What are the important characteristics of a hyperblock formation priority function?
- Our approach: extract all the characteristics you can think of, and let a learning technique find the priority function

Genetic Programming
- A search algorithm analogous to Darwinian evolution
- Maintains a population of expressions, each represented as an expression tree, e.g. (num_ops - 2.3) * (predictability / 4.1)

Genetic Programming
- A search algorithm analogous to Darwinian evolution
- Maintains a population of expressions
- Selection: the fittest expressions in the population are more likely to reproduce
- Reproduction: crossing over subexpressions of two expressions
- Mutation

General Flow
Create initial population (initial solutions) → Evaluation → Selection → Create Variants → done?
- The randomly generated initial population is seeded with the compiler writer's best guess

General Flow
Create initial population → Evaluation → Selection → Create Variants → done?
- The compiler is modified to use the given expression as a priority function
- Each expression is evaluated by compiling and running the benchmark(s)
- Fitness is the relative speedup over Trimaran's priority function on the benchmark(s)

General Flow
Create initial population → Evaluation → Selection → Create Variants → done?
- Just as with natural selection, the fittest individuals are more likely to survive

General Flow
Create initial population → Evaluation → Selection → Create Variants → done?
- Use crossover and mutation to generate new expressions
- And thus generate new compilers
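The four-stage loop (create, evaluate, select, create variants) can be sketched end to end. Everything here is illustrative: the features, operators, and simplified crossover and mutation are made up, and the fitness function is a stand-in for "compile with this priority function, run the benchmarks, measure speedup".

```python
import random

random.seed(42)  # keep this sketch reproducible

FEATURES = ["num_ops", "exec_ratio", "dependence_height"]
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}

def random_expr(depth=3):
    """Create: a random expression tree over features and constants."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(FEATURES) if random.random() < 0.5 else round(random.uniform(0, 5), 1)
    op = random.choice(list(OPS))
    return (op, random_expr(depth - 1), random_expr(depth - 1))

def evaluate(expr, env):
    if isinstance(expr, tuple):
        op, a, b = expr
        return OPS[op](evaluate(a, env), evaluate(b, env))
    return env[expr] if isinstance(expr, str) else expr

def fitness(expr):
    """Stand-in objective: prefer expressions that rank hot paths higher."""
    cold = {"num_ops": 4, "exec_ratio": 0.1, "dependence_height": 2}
    hot = {"num_ops": 4, "exec_ratio": 0.9, "dependence_height": 2}
    return evaluate(expr, hot) - evaluate(expr, cold)

def crossover(p1, p2):
    """Reproduction (simplified): graft a subexpression of p2 into p1."""
    if isinstance(p1, tuple) and isinstance(p2, tuple):
        op, a, b = p1
        return (op, a, p2[2])
    return p2

def mutate(expr):
    """Mutation: replace the expression (or a child) with a fresh subtree."""
    if isinstance(expr, tuple) and random.random() < 0.5:
        op, a, b = expr
        return (op, mutate(a), b)
    return random_expr(depth=2)

def evolve(generations=10, pop_size=30):
    pop = [random_expr() for _ in range(pop_size)]       # Create
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)              # Evaluate + Select
        survivors = pop[: pop_size // 2]
        children = [
            mutate(crossover(random.choice(survivors), random.choice(survivors)))
            for _ in range(pop_size - len(survivors))
        ]                                                # Create Variants
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
print(best, fitness(best))
```

In the real system each fitness evaluation is a full compile-and-run of the benchmarks, which is why the later running-time slide measures training in processor-days rather than seconds.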

Experimental Setup
- Collect results using Trimaran
- Simulate a VLIW machine:
  - 64 GPRs, 64 FPRs, 128 PRs
  - 4 fully pipelined integer FUs
  - 2 fully pipelined floating-point FUs
  - 2 memory units (latencies L1: 2, L2: 7, L3: 35)
- Replace priority functions in IMPACT with our GP expression parser and evaluator

Outline of Results
- High-water mark: create compilers specific to a given application and a given data set (essentially partially evaluating the application)
- Application-specific compilers: compiler trained for a given application and data set, but run with an alternate data set
- General-purpose compiler: compiler trained on multiple applications and tested on an unrelated set of applications

Training the Priority Function
[Diagram: the compiler, using a candidate priority function, compiles benchmarks A.c, B.c, C.c, D.c into executables A, B, C, D]

Training the Priority Function: Application-Specific Compilers
[Diagram: a separate priority function is trained for each of the benchmarks A.c, B.c, C.c, D.c]

Hyperblock Results: Application-Specific Compilers (High-Water Mark)
[Chart: speedup on the training input and a novel input for compress, g721encode, g721decode, huff_dec, huff_enc, rawcaudio, rawdaudio, toast, and mpeg2dec, plus the average]
Example evolved priority functions:
(add (sub (cmul (gt (cmul $b $d17)…$d7)) (cmul $b $d28)))
(add (div $d20 $d5) (tern $b2 $d0 $d9))

Training the Priority Function: General-Purpose Compilers
[Diagram: a single priority function is trained across all of the benchmarks A.c, B.c, C.c, D.c]

Hyperblock Results: General-Purpose Compiler
[Chart: speedup on the training data set and a novel data set for decodrle4, codrle4, g721decode, g721encode, rawdaudio, rawcaudio, toast, mpeg2dec, 124.m88ksim, 129.compress, huff_enc, and huff_dec, plus the average]

Validation of Generality: Testing General-Purpose Applicability
[Chart: speedup for unepic, djpeg, rasta, 023.eqntott, 132.ijpeg, 052.alvinn, 147.vortex, 085.cc1, art, 130.li, osdemo, and mipmap, plus the average]

Running Time
- Application-specific compilers: ~1 day using 15 processors
- General-purpose compilers: ~1 week using 15 processors
  - Dynamic Subset Selection [Gathercole]: run on a subset of the training benchmarks at a time
  - Memoize fitnesses
- This is a one-time process, performed by the compiler vendor!

GP Hyperblock Solutions: General Purpose
(add (sub (mul exec_ratio_mean ) ) (mul (cmul (not has_pointer_deref) (mul num_paths) (mul (add (sub (mul (div num_ops dependence_height) ) exec_ratio) (sub (mul (cmul has_unsafe_jsr predict_product_mean ) (sub num_ops_max)) (sub (mul dependence_height_mean num_branches_max) num_paths)))))))
Highlighted: an intron that doesn't affect the solution

GP Hyperblock Solutions: General Purpose
(Same expression; the (not has_pointer_deref) term is highlighted.)
Favor paths that don't have pointer dereferences

GP Hyperblock Solutions: General Purpose
(Same expression; the (div num_ops dependence_height) term is highlighted.)
Favor highly parallel (fat) paths

GP Hyperblock Solutions: General Purpose
(Same expression; the has_unsafe_jsr term is highlighted.)
If a path calls a subroutine that may have side effects, penalize it

Our Proposal
- Use machine-learning techniques to automatically search the priority function space
- Increases the compiler's performance
- Reduces compiler design complexity

Eliminate the Human from the Loop
- So far we have tried to improve existing priority functions
- Still, a lot of person-hours were spent creating the initial priority functions
- Observation: the human-created priority functions are often eliminated in the 1st generation
- What if we start from a completely random population (no human-generated seed)?

Another Example: Register Allocation
- An old, established problem; hundreds of papers on the subject
- Priority-based register allocation [Chow, Hennessy] uses a priority function to determine the worth of allocating a register
- Let's throw our GP system at the problem and see what it comes up with
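In its textbook form, Chow and Hennessy's priority is the estimated savings from keeping a live range in a register, normalized by the size of the live range. The cost constants and the helper below are illustrative, not the paper's exact model.

```python
# Hedged sketch of a Chow/Hennessy-style priority: estimated cycles saved
# by allocating a live range to a register, per unit of live-range size.
# All cost constants are illustrative assumptions.

LOAD_SAVE = 2   # assumed cycles saved per avoided load
STORE_SAVE = 2  # assumed cycles saved per avoided store
MOVE_COST = 1   # assumed cost of a register move at a region boundary

def live_range_priority(uses, defs, moves, span):
    """Savings from allocating this live range, normalized by its span."""
    savings = uses * LOAD_SAVE + defs * STORE_SAVE - moves * MOVE_COST
    return savings / span

# A short, heavily used range beats a long, lightly used one.
hot = live_range_priority(uses=10, defs=2, moves=1, span=4)
cold = live_range_priority(uses=1, defs=1, moves=2, span=20)
print(hot > cold)  # prints True
```

As with hyperblock formation, the GP system replaces exactly this scoring expression while leaving the rest of the allocator untouched.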

Register Allocation Results: General-Purpose Compiler
[Chart: speedup on the training data set and a novel data set for rawcaudio, rawdaudio, huff_enc, huff_dec, 129.compress, mpeg2dec, g721encode, and g721decode, plus the average]

Validation of Generality: Testing General-Purpose Applicability
[Chart: speedup for decodrle4, codrle4, 130.li, djpeg, unepic, 124.m88ksim, 023.eqntott, 132.ijpeg, 085.cc1, and 147.vortex, plus the average]

Importance of Priority Functions
[Chart: speedup over a constant priority function, for both Trimaran's function and GP's function, on rawcaudio, rawdaudio, huff_enc, huff_dec, 129.compress, mpeg2dec, g721encode, and g721decode, plus the average]

Advantages our System Provides
- Engineers can focus their energy on more important tasks
- Can quickly retune for architectural changes
- Can quickly retune when the compiler changes
- Can provide compilers catered toward specific application suites (e.g., a customer may want a compiler that excels on scientific benchmarks)

Related Work
- Calder et al. [TOPLAS-19]: fine-tuned static branch prediction heuristics; requires a priori classification by a supervisor
- Monsifrot et al. [AIMSA-02]: classify loops based on amenability to unrolling; also used a priori classification
- Cooper et al. [Journal of Supercomputing-02]: use GAs to solve phase-ordering problems

Conclusion
- This was both a performance talk and a complexity talk
- Take a huge compiler, optimize one priority function with GP, and get exciting speedups
- Take a well-known heuristic and create priority functions for it from scratch
- There's a lot left to do

Why Genetic Programming?
- Many learning techniques rely on having pre-classified data (labels), e.g., statistical learning, neural networks, decision trees
- Priority functions require another approach: reinforcement learning or unsupervised learning
- Several techniques might work well, e.g., hill climbing, active learning, simulated annealing

Why Genetic Programming
- Benefits of GP:
  - Capable of searching high-dimensional spaces
  - It is a distributed algorithm
  - The solutions are human-readable
- Nevertheless... there are other learning techniques that may also perform well

Genetic Programming: Reproduction
[Diagram: crossover between two parent expression trees, e.g. (num_ops - 2.3) * (predictability / 4.1) and (branches * 7.5) / 1.1, swapping subtrees to produce offspring that mix their parents' subexpressions]