PREDICTING UNROLL FACTORS USING SUPERVISED LEARNING

PREDICTING UNROLL FACTORS USING SUPERVISED LEARNING
Mark Stephenson & Saman Amarasinghe
Massachusetts Institute of Technology
Computer Science and Artificial Intelligence Lab

INTRODUCTION & MOTIVATION
- Compiler heuristics rely on detailed knowledge of the system
- Compiler interactions are not well understood
- Architectures are complex: feature sets keep growing from the Pentium® (3M transistors) to the Pentium 4 (55M transistors), with superscalar execution, hyperthreading, speculative execution, and an improved FPU

This talk is really motivated by system complexity. To do a good job, the compiler has to have detailed knowledge about the system it is compiling for. So things are getting harder, not easier.

HEURISTIC DESIGN
- The current approach to heuristic development is somewhat ad hoc
- Can compiler writers learn anything from baseball?
- Is it feasible to deal with empirical data?
- Can we use statistics and machine learning to build heuristics?

Just like baseball players, these systems are too complicated to model completely.

CASE STUDY: LOOP UNROLLING
- Code expansion can degrade performance
- Increased live ranges, register pressure
- A myriad of interactions with other passes
- Requires categorization into multiple classes, i.e., what is the unroll factor?

ORC'S HEURISTIC (UNKNOWN TRIP COUNT)

    if (trip_count_tn == NULL) {
      UINT32 ntimes = MAX(1, OPT_unroll_times - 1);
      INT32 body_len = BB_length(head);
      while (ntimes > 1 &&
             ntimes * body_len > CG_LOOP_unrolled_size_max)
        ntimes--;
      Set_unroll_factor(ntimes);
    } else {
      ...
    }

ORC'S HEURISTIC (KNOWN TRIP COUNT)

    } else {
      BOOL const_trip = TN_is_constant(trip_count_tn);
      INT32 const_trip_count = const_trip ? TN_value(trip_count_tn) : 0;
      INT32 body_len = BB_length(head);
      CG_LOOP_unroll_min_trip = MAX(CG_LOOP_unroll_min_trip, 1);
      if (const_trip && CG_LOOP_unroll_fully &&
          (body_len * const_trip_count <= CG_LOOP_unrolled_size_max ||
           CG_LOOP_unrolled_size_max == 0 &&
           CG_LOOP_unroll_times_max >= const_trip_count)) {
        Set_unroll_fully();
        Set_unroll_factor(const_trip_count);
      } else {
        UINT32 ntimes = OPT_unroll_times;
        ntimes = MIN(ntimes, CG_LOOP_unroll_times_max);
        if (!is_power_of_two(ntimes)) {
          ntimes = 1 << log2(ntimes);
        }
        while (ntimes > 1 && ntimes * body_len > CG_LOOP_unrolled_size_max)
          ntimes /= 2;
        if (const_trip) {
          while (ntimes > 1 && const_trip_count < 2 * ntimes)
            ntimes /= 2;
        }
        Set_unroll_factor(ntimes);
      }
    }

Our approach is going to be similar: we are going to extract a set of these characteristics and then let a machine learning algorithm learn the heuristic for us.

SUPERVISED LEARNING
- Supervised learning algorithms try to find a function F(X) → Y
- X: a vector of characteristics that describe a loop
- Y: the empirically found best unroll factor

[Figure: a training set of loops is fed to the learner, which produces a function F(X) mapping each loop to an unroll factor between 1 and 8.]

Here I try to pictorially show how supervised learning works. We start with a training set of examples, where each example contains a description of a loop and the empirically found best option for that loop. We then let the learning algorithm find a function F(X) that tries to minimize the error on the training set. Note that the mapping usually won't be perfect, because the learning algorithm will often sacrifice some performance on the training set to achieve a better generalization error. Once the mapping is in place, we can predict the label for unseen examples. I'll talk more about the learning algorithms we use later.

EXTRACTING THE DATA
- Extract features
  - Most features are readily available in ORC
  - Kitchen-sink approach
- Finding the labels (best unroll factors)
  - Added an instrumentation pass: assembly instructions are inserted to time loops, with calls to a timing library at all loop exit points
  - Compile and run at all unroll factors (1..8)
  - For each loop, choose the best factor as the label

Features and labels are collected at the same time, and it is a fully automated process. A sketch of the labeling loop follows below.
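The following is a minimal sketch of that labeling step, not the authors' harness; the `orcc` driver name, the `-unroll` and `-instrument-loops` flags, and the output format of the instrumentation are assumptions made purely for illustration.

    import subprocess

    UNROLL_FACTORS = range(1, 9)  # candidate unroll factors 1..8

    def best_unroll_factor(benchmark, loop_id):
        """Compile and run `benchmark` at every unroll factor and return
        the factor whose measured time for `loop_id` is smallest; that
        factor becomes the loop's label."""
        timings = {}
        for factor in UNROLL_FACTORS:
            # Hypothetical driver invocation: fix the unroll factor and
            # enable the loop-timing instrumentation pass.
            subprocess.run(["orcc", "-O3", f"-unroll={factor}",
                            "-instrument-loops", "-o", "bench", benchmark],
                           check=True)
            out = subprocess.run(["./bench"], capture_output=True, text=True,
                                 check=True).stdout
            # Assume the instrumentation prints "<loop_id> <cycles>" lines
            # when the benchmark exits.
            for line in out.splitlines():
                name, cycles = line.split()
                if name == loop_id:
                    timings[factor] = int(cycles)
        return min(timings, key=timings.get)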

LEARNING ALGORITHMS
- Prototyped in Matlab
- Two learning algorithms classified our data set well: near neighbors and support vector machines (SVMs)
- Both algorithms classify quickly
- Train "at the factory": no increase in compilation time

Matlab allowed us to use a plug-and-chug methodology: we downloaded classifiers, or in the case of near neighbors wrote our own simple one, and quickly tested them out.

NEAR NEIGHBORS

[Figure: loops plotted by # FP operations vs. # branches, with each point labeled "unroll" or "don't unroll"; a new loop takes the label of its nearest neighbors.]

Here I show the operation of nearest neighbors for two classes, but extending this to multiple classes is trivial; it works the same way. This is a purely fabricated example. Training the algorithm is as simple as populating a database. The intuition behind this algorithm is that we want to treat similar loops similarly.
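As a rough illustration of that intuition (the paper's prototype was written in Matlab and is not shown here), a k-nearest-neighbors classifier over loop feature vectors might look like this; the feature values, labels, and k are invented for the example.

    import numpy as np
    from collections import Counter

    def knn_predict(train_X, train_y, x, k=3):
        """Classify the loop with feature vector x by majority vote among
        the k training loops whose feature vectors are closest in Euclidean
        distance. "Training" is just storing (train_X, train_y)."""
        dists = np.linalg.norm(train_X - x, axis=1)
        nearest = np.argsort(dists)[:k]
        return Counter(train_y[nearest]).most_common(1)[0][0]

    # Toy data: features are (# FP operations, # branches); labels are the
    # empirically best unroll factors.
    train_X = np.array([[12.0, 1.0], [3.0, 4.0], [20.0, 0.0], [2.0, 6.0]])
    train_y = np.array([8, 1, 8, 1])
    print(knn_predict(train_X, train_y, np.array([15.0, 1.0])))  # -> 8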

SUPPORT VECTOR MACHINES
- Map the original feature space into a higher-dimensional space (using a kernel)
- Find a hyperplane that maximally separates the data

SUPPORT VECTOR MACHINES

[Figure: loops plotted by # branches vs. # FP operations, and again after a kernel mapping (e.g. squared features), where a hyperplane separates "unroll" from "don't unroll".]
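An illustrative sketch only, not the authors' Matlab prototype: a kernel SVM over the same kind of loop features, written with scikit-learn; the feature set, kernel choice, and parameters are assumptions.

    import numpy as np
    from sklearn.svm import SVC

    # Toy data: rows are loop feature vectors
    # (# FP operations, # branches, # memory operations); labels are the
    # empirically best unroll factors, so this is a multi-class problem.
    X = np.array([[12, 1, 4], [3, 4, 9], [20, 0, 2], [2, 6, 7], [15, 1, 3]],
                 dtype=float)
    y = np.array([8, 1, 4, 1, 8])

    # The kernel implicitly maps the features into a higher-dimensional
    # space; the SVM then finds a maximally separating hyperplane there.
    # scikit-learn's SVC handles multiple classes with one-vs-one voting.
    clf = SVC(kernel="rbf", C=10.0, gamma="scale")
    clf.fit(X, y)
    print(clf.predict([[14.0, 1.0, 3.0]]))  # e.g. [8]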

PREDICTION ACCURACY
- Leave-one-out cross validation
- Filter out ambiguous training examples
  - Only keep examples where one unroll factor is obviously better (by at least 1.05x)
  - Throw away obviously noisy examples

              NN    SVM   ORC
    Accuracy  62%   65%   16%

This is a first-order approximation of how well a classifier works as a predictor. A sketch of the filtering and validation steps follows below.
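The following is a minimal sketch of those two steps under the assumption that each loop already carries its per-factor timings; the 1.05x threshold comes from the slide, everything else is illustrative.

    import numpy as np
    from sklearn.model_selection import LeaveOneOut
    from sklearn.svm import SVC

    def filter_ambiguous(loops, threshold=1.05):
        """Keep a loop only if its best unroll factor beats the runner-up
        by at least `threshold`. Each entry in `loops` is
        (feature_vector, {unroll_factor: measured_time})."""
        kept = []
        for features, timings in loops:
            best, runner_up = sorted(timings.values())[:2]
            if runner_up / best >= threshold:
                kept.append((features, min(timings, key=timings.get)))
        return kept

    def loocv_accuracy(X, y):
        """Leave-one-out cross validation: train on all loops but one,
        predict the held-out loop, and report the fraction correct."""
        correct = 0
        for train_idx, test_idx in LeaveOneOut().split(X):
            clf = SVC(kernel="rbf", C=10.0, gamma="scale")
            clf.fit(X[train_idx], y[train_idx])
            correct += int(clf.predict(X[test_idx])[0] == y[test_idx][0])
        return correct / len(y)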

REALIZING SPEEDUPS (SWP DISABLED)

To stress the loop unroller we disabled software pipelining (SWP); otherwise all -O3 optimizations are performed. Note that we could not get fma3d to compile correctly with our instrumentation library, and we cannot compile eon because it is a C++ benchmark. Similar to LOOCV, when we compile a benchmark we only use filtered training examples from the other benchmarks, excluding all examples from the benchmark being compiled. We see a 5% speedup overall and 9% on the SPECfp benchmarks (the oracle result for SPECfp is 12%). The blue bars represent the best that we could hope to achieve. One note is that you can also use our system as a glorified profiler.

FEATURE SELECTION
- Feature selection is a way to identify the best features
- Start with loads of features
- Small feature sets are better:
  - Learning algorithms run faster
  - They are less prone to overfitting the training data
  - Useless features can confuse learning algorithms

FEATURE SELECTION CONT.: MUTUAL INFORMATION SCORE
- Measures the reduction of uncertainty in one variable given knowledge of another variable
- Does not tell us how features interact with each other

We use the kitchen-sink approach to feature extraction: we extract all features that we think might be important. But learning algorithms often work better on smaller feature sets; large feature sets can lead to overfitting the data, and unimportant features can flat-out confuse learning algorithms. So we use feature selection to find the most informative features for us. We can use the mutual information score to tell us how informative a given feature is, that is, how well it can predict the unroll factor. A small sketch follows below.
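As a hedged sketch of what such a score computes (the paper does not give its exact estimator), the mutual information between a discretized feature and the best unroll factor can be estimated from empirical counts:

    import numpy as np
    from collections import Counter

    def mutual_information(feature_values, labels):
        """Estimate I(X; Y) = sum over (x, y) of
        p(x, y) * log2(p(x, y) / (p(x) * p(y))) from empirical counts.
        Feature values are assumed to be discretized (e.g. binned)."""
        n = len(labels)
        p_xy = Counter(zip(feature_values, labels))
        p_x = Counter(feature_values)
        p_y = Counter(labels)
        mi = 0.0
        for (x, y), count in p_xy.items():
            pxy = count / n
            mi += pxy * np.log2(pxy / ((p_x[x] / n) * (p_y[y] / n)))
        return mi

    # A feature that perfectly predicts the label scores the label entropy;
    # an irrelevant feature scores near 0.
    print(mutual_information([0, 0, 1, 1], [1, 1, 8, 8]))  # 1.0 bit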

FEATURE SELECTION CONT.: GREEDY FEATURE SELECTION
- Choose the single best feature
- Choose another feature that, together with the best feature, improves classification accuracy the most (sketched below)
- …
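A minimal sketch of greedy forward selection, assuming an `evaluate(subset)` function that reports cross-validated accuracy using only the given feature columns (for example, the LOOCV routine above restricted to those columns):

    def greedy_select(all_features, evaluate, max_features=5):
        """Greedy forward selection: repeatedly add the feature that most
        improves classification accuracy of the current subset."""
        selected = []
        best_score = 0.0
        while len(selected) < max_features:
            candidates = [f for f in all_features if f not in selected]
            if not candidates:
                break
            score, feature = max((evaluate(selected + [f]), f)
                                 for f in candidates)
            if score <= best_score:   # adding any feature no longer helps
                break
            selected.append(feature)
            best_score = score
        return selected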

FEATURE SELECTION: THE BEST FEATURES

    Rank  Mutual Information Score            Greedy Feature Selection with SVM
    1.    # FP operations (0.59)
    2.    # Operands (0.49)                    Loop nest level
    3.    Instruction fan-in in DAG (0.34)
    4.    # Live ranges (0.20)                 # Branches
    5.    # Memory operations (0.13)

RELATED WORK
- Monsifrot et al., "A Machine Learning Approach to Automatic Production of Compiler Heuristics," 2002
- Calder et al., "Evidence-Based Static Branch Prediction Using Machine Learning," 1997
- Cavazos et al., "Inducing Heuristics to Decide Whether to Schedule," 2004
- Moss et al., "Learning to Schedule Straight-Line Code," 1997
- Cooper et al., "Optimizing for Reduced Code Space Using Genetic Algorithms," 1999
- Puppin et al., "Adapting Convergent Scheduling Using Machine Learning," 2003
- Stephenson et al., "Meta Optimization: Improving Compiler Heuristics with Machine Learning," 2003

Monsifrot et al. performed supervised binary classification; ours is multiclass.

CONCLUSION
- Supervised classification can effectively find good heuristics, even for multi-class problems
- SVM and near neighbors perform well
- The approach can potentially have a big impact
  - We spent very little time tuning the learning parameters
  - We let a machine learning algorithm tell us which features are best

THE END

SOFTWARE PIPELINING
- ORC has been tuned with SWP in mind
- Every major release of ORC has had a different unrolling heuristic for SWP
- The current one is 205 lines long
- Can we learn a heuristic that outperforms ORC's SWP unrolling heuristic?

REALIZING SPEEDUPS (SWP ENABLED)

We get a 1% improvement! But before you get down on our research, let me explain why this is the case. A big problem is noise: there are several cases where the oracle is outperformed by ORC or by one of the learned classifiers. We assumed that loops were independent of each other, and this assumption is acceptable but not totally correct. The other reason is that loop unrolling does not contribute much when SWP is enabled; the oracle, even though it is a bit noisy, only achieves a 4.5% speedup. SWP can handle even long dependence chains, as long as there are enough iterations to fill the pipeline; its main limitations are control flow and procedure calls, and it has dedicated hardware support. Similar to LOOCV, when we compile a benchmark we only use filtered training examples from the other benchmarks, excluding all examples from the benchmark being compiled.

HURDLES
- The compiler writer must extract features
- Acquiring labels takes time (instrumentation library, ~2 weeks to collect data)
- Predictions are confined to the training labels
- Have to tweak the learning algorithms
- Noise

We believe that machine learning techniques have the potential to radically alter compiler construction methods. We learned a heuristic for a very small portion of a several-million-line compiler and got a 9% speedup on the SPECfp benchmarks, so this approach shows promise. However, there are still many issues that need to be cleared up before this can become a truly viable technique.

First, the compiler writer must extract features, and this takes effort and may not always be easy. For us, most of the features we used were readily available from the ORC infrastructure. We believe that as these techniques are shown to be viable, compiler passes will provide feature extraction tools, much as compiler infrastructures provide generic data flow analysis tools.

Getting the labels also takes time. The instrumentation library did take some effort, but again, in the future we expect timing utilities to be included in an infrastructure. Once the instrumentation is in place, the whole process is completely automated.

Predictions are confined to the labels with which the classifiers were trained. Future work will consider regression, which does not have this limitation.

We use learning algorithms so that we do not have to tweak compiler heuristics, but the learning algorithms themselves have parameters that must be tweaked. Once the training data set has been gathered, it is fairly easy to write a meta-tuner that looks at a subset of the examples and tunes the learning parameters. Near neighbors has one parameter to tune and the SVM has two, so this is a much smaller search space than any compiler heuristic.

Noise is the single biggest detriment to our work. The finer the granularity at which you measure, the noisier the measurements will be. Modern architectures are helping our cause by including improved performance counters. Future research will explore ways to reduce noisy measurements.