PREDICTING UNROLL FACTORS USING SUPERVISED LEARNING


1 PREDICTING UNROLL FACTORS USING SUPERVISED LEARNING
Mark Stephenson & Saman Amarasinghe
Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory

2 INTRODUCTION & MOTIVATION
Compiler heuristics rely on detailed knowledge of the system
Compiler interactions are not well understood
Architectures are complex
  Features: Pentium® (3M) vs. Pentium 4 (55M)
  Superscalar, hyperthreading, speculative execution, improved FPU
This talk is really motivated by system complexity. To do a good job, the compiler has to have detailed knowledge about the system it is compiling for. So things are getting harder, not easier.

3 HEURISTIC DESIGN
The current approach to heuristic development is somewhat ad hoc
Can compiler writers learn anything from baseball?
  Is it feasible to deal with empirical data?
  Can we use statistics and machine learning to build heuristics?
Just like baseball players, these systems are too complicated to model completely.

4 CASE STUDY: LOOP UNROLLING
Code expansion can degrade performance
  Increased live ranges, register pressure
A myriad of interactions with other passes
Requires categorization into multiple classes (i.e., what's the unroll factor?)

5 ORC’S HEURISTIC (UNKNOWN TRIPCOUNT)
if (trip_count_tn == NULL) {
    UINT32 ntimes = MAX(1, OPT_unroll_times - 1);
    INT32 body_len = BB_length(head);
    while (ntimes > 1 && ntimes * body_len > CG_LOOP_unrolled_size_max)
        ntimes--;
    Set_unroll_factor(ntimes);
} else {
    ...
}

6 ORC’S HEURISTIC (KNOWN TRIPCOUNT)
} else {
    BOOL const_trip = TN_is_constant(trip_count_tn);
    INT32 const_trip_count = const_trip ? TN_value(trip_count_tn) : 0;
    INT32 body_len = BB_length(head);
    CG_LOOP_unroll_min_trip = MAX(CG_LOOP_unroll_min_trip, 1);
    if (const_trip && CG_LOOP_unroll_fully &&
        (body_len * const_trip_count <= CG_LOOP_unrolled_size_max ||
         (CG_LOOP_unrolled_size_max == 0 &&
          CG_LOOP_unroll_times_max >= const_trip_count))) {
        Set_unroll_fully();
        Set_unroll_factor(const_trip_count);
    } else {
        UINT32 ntimes = OPT_unroll_times;
        ntimes = MIN(ntimes, CG_LOOP_unroll_times_max);
        if (!is_power_of_two(ntimes)) {
            ntimes = 1 << log2(ntimes);
        }
        while (ntimes > 1 && ntimes * body_len > CG_LOOP_unrolled_size_max)
            ntimes /= 2;
        if (const_trip) {
            while (ntimes > 1 && const_trip_count < 2 * ntimes)
                ntimes /= 2;
        }
        Set_unroll_factor(ntimes);
    }
}
Our approach is going to be similar: we extract a bunch of these loop characteristics, then let a machine learning algorithm learn the heuristic for us.

7 SUPERVISED LEARNING
Supervised learning algorithms try to find a function F(X) → Y
  X: a vector of characteristics that describe a loop
  Y: the empirically found best unroll factor
Here I try to show pictorially how supervised learning works. We start with a training set of examples, where each example contains a description of a loop and the empirically found best unroll factor for that loop. We then let the learning algorithm find a function F(X) that tries to minimize the error on the training set. Note that the mapping usually won't be perfect, because the learning algorithm will often sacrifice performance on the training set to achieve better generalization error. Once the mapping is in place, we can predict the label for unseen examples. I'll talk more about the learning algorithms we use later.
[Figure: loops on the left mapped by F(X) to unroll factors 1–8 on the right]
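To make the F(X) → Y framing concrete, here is a minimal sketch in Python with scikit-learn; the authors prototyped in Matlab, so this code, its feature values, and its labels are purely illustrative, not theirs.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training set: each row X describes one loop
# (say, [# FP operations, # branches, # memory operations, loop nest level]),
# and each label Y is the empirically best unroll factor for that loop.
X_train = np.array([[12, 1, 4, 2],
                    [ 0, 3, 8, 1],
                    [25, 0, 2, 3],
                    [ 3, 2, 6, 1]])
y_train = np.array([8, 1, 4, 2])   # best unroll factors found by timing each loop

# Learn F(X) -> Y from the training set.
model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Predict an unroll factor for an unseen loop.
new_loop = np.array([[10, 1, 3, 2]])
print(model.predict(new_loop))     # e.g., array([8])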

8 EXTRACTING THE DATA
Extract features
  Most features are readily available in ORC
  Kitchen-sink approach
Finding the labels (best unroll factors)
  Added an instrumentation pass
  Assembly instructions inserted to time loops
  Calls to a timing library at all loop exit points
  Compile and run at all unroll factors (1..8)
  For each loop, choose the best one as the label
Features and labels are collected at the same time, and it is a fully automated process.
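A hedged sketch of the labeling step described above. The driver is hypothetical: compile_and_time stands in for the real flow of forcing one unroll factor for a loop, running the instrumented benchmark, and reading back that loop's time; none of these names come from the original system.

# Hypothetical labeling driver: for each loop, try every unroll factor, keep the fastest.

def compile_and_time(benchmark, loop_id, unroll_factor):
    """Placeholder: rebuild `benchmark` with `unroll_factor` forced for `loop_id`,
    run it, and return the time the instrumentation measured for that loop."""
    raise NotImplementedError

def best_unroll_factor(benchmark, loop_id, max_factor=8):
    times = {f: compile_and_time(benchmark, loop_id, f) for f in range(1, max_factor + 1)}
    return min(times, key=times.get)   # the label is the factor with the smallest time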

9 LEARNING ALGORITHMS
Prototyped in Matlab
Two learning algorithms classified our data set well
  Near neighbors
  Support vector machine (SVM)
Both algorithms classify quickly
  Train "at the factory"
  No increase in compilation time
Matlab allowed us to use a plug-and-chug methodology: we downloaded classifiers (or, in the case of near neighbors, wrote our own simple one) and quickly tested them out.

10 NEAR NEIGHBORS
[Figure: loops plotted by # FP operations vs. # branches, each labeled "unroll" or "don't unroll"]
Here I show the operation of nearest neighbors for two classes, but extending it to multiple classes is trivial; it works the same way. This is a purely fabricated example. Training the algorithm is as simple as populating a database. The intuition behind this algorithm is that we want to treat similar loops similarly.
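A minimal from-scratch sketch of the near-neighbor idea above (illustrative only; this is not the authors' Matlab code): "training" populates a database of labeled loops, and a new loop gets the label of its closest stored neighbor.

import math

# "Training" is just populating a database of (feature vector, label) pairs,
# here using the two features from the figure: (# FP operations, # branches).
database = [
    ((12.0, 1.0), "unroll"),
    ((11.0, 0.0), "unroll"),
    (( 1.0, 4.0), "don't unroll"),
    (( 0.0, 3.0), "don't unroll"),
]

def classify(features):
    """Return the label of the nearest stored loop (Euclidean distance)."""
    nearest = min(database, key=lambda entry: math.dist(entry[0], features))
    return nearest[1]

print(classify((10.0, 1.0)))   # -> "unroll"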

11 SUPPORT VECTOR MACHINES
Map the original feature space into a higher-dimensional space (using a kernel)
Find a hyperplane that maximally separates the data

12 SUPPORT VECTOR MACHINES
[Figure: the same loops replotted in a kernel-expanded space, e.g. (# branches, # FP operations, # branches²), where the "unroll" and "don't unroll" classes become separable by a hyperplane]
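A hedged sketch of the same idea with scikit-learn (illustrative, not the authors' Matlab prototype): a polynomial kernel implicitly adds features such as # branches², and the SVM finds the maximum-margin separating hyperplane in that expanded space. The feature values and labels below are made up.

import numpy as np
from sklearn.svm import SVC

# Hypothetical loops described by [# FP operations, # branches];
# label 1 = "unroll", 0 = "don't unroll".
X = np.array([[12, 1], [11, 0], [9, 2], [1, 4], [0, 3], [2, 5]])
y = np.array([1, 1, 1, 0, 0, 0])

# A degree-2 polynomial kernel implicitly works in a space that includes
# squared terms (e.g., branches**2); the SVM maximizes the margin there.
clf = SVC(kernel="poly", degree=2, C=1.0).fit(X, y)

print(clf.predict([[10, 1], [1, 6]]))   # e.g., [1 0]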

13 PREDICTION ACCURACY
Leave-one-out cross validation
Filter out ambiguous training examples
  Only keep examples where one unroll factor is obviously better (at least 1.05x)
  Throw away obviously noisy examples

          NN    SVM   ORC
Accuracy  62%   65%   16%

This is a first-order approximation of how well a classifier works as a predictor.
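A sketch of leave-one-out cross validation plus the filtering step. It assumes (my assumption, for illustration) that each example carries the measured time at every unroll factor, so "obviously better" can be checked against the 1.05x threshold; the helper names are hypothetical.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def filter_examples(features, labels, times, threshold=1.05):
    """Keep only loops whose best unroll factor beats the runner-up by >= threshold."""
    keep = []
    for i, per_factor_times in enumerate(times):
        ordered = np.sort(per_factor_times)
        if ordered[1] / ordered[0] >= threshold:    # clear winner, likely not noise
            keep.append(i)
    return features[keep], labels[keep]

def loocv_accuracy(features, labels):
    """Leave-one-out: train on all examples but one, test on the held-out example."""
    hits = 0
    for i in range(len(labels)):
        mask = np.arange(len(labels)) != i
        model = KNeighborsClassifier(n_neighbors=1).fit(features[mask], labels[mask])
        hits += int(model.predict(features[i:i + 1])[0] == labels[i])
    return hits / len(labels)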

14 REALIZING SPEEDUPS (SWP DISABLED)
To stress the loop unroller we disabled SWP; otherwise all -O3 optimizations are performed. Note that we couldn't get fma3d to compile correctly with our instrumentation library, and we can't compile eon because it is a C++ benchmark. Similar to LOOCV, when we compile a benchmark we only use filtered training examples from the other benchmarks, excluding all examples from the benchmark being compiled. We see a 5% speedup overall and 9% on the SPECfp benchmarks (a perfect predictor achieves 12%). The blue bars represent the best that we could hope to achieve. One note is that you can use our system as a glorified profiler.

15 FEATURE SELECTION
Feature selection is a way to identify the best features
Start with loads of features
Small feature sets are better
  Learning algorithms run faster
  They are less prone to overfitting the training data
  Useless features can confuse learning algorithms

16 FEATURE SELECTION CONT. MUTUAL INFORMATION SCORE
Measures the reduction in uncertainty about one variable given knowledge of another variable
Does not tell us how features interact with each other
We use the kitchen-sink approach to feature extraction: we extract every feature we think might be important. But learning algorithms often work better on smaller feature sets; large feature sets can lead to overfitting, and unimportant features can flat-out confuse learning algorithms. So we use feature selection to find the most informative features. The mutual information score tells us how informative a given feature is, that is, how well it can predict the unroll factor on its own.
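For reference, the mutual information between a (discretized) feature F and the label Y is I(F;Y) = Σ p(f,y) · log₂( p(f,y) / (p(f)·p(y)) ). A minimal sketch of estimating it from counts (illustrative, not the authors' implementation):

import math
from collections import Counter

def mutual_information(feature_values, labels):
    """I(F;Y) in bits, estimated from empirical joint and marginal frequencies."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    f_counts = Counter(feature_values)
    y_counts = Counter(labels)
    mi = 0.0
    for (f, y), count in joint.items():
        p_fy = count / n
        mi += p_fy * math.log2(p_fy / ((f_counts[f] / n) * (y_counts[y] / n)))
    return mi

# A feature that perfectly determines the label carries maximal information;
# an independent feature carries none.
print(mutual_information([0, 0, 1, 1], [2, 2, 8, 8]))   # 1.0 bit
print(mutual_information([0, 1, 0, 1], [2, 2, 8, 8]))   # 0.0 bits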

17 FEATURE SELECTION CONT. GREEDY FEATURE SELECTION
Choose the single best feature
Choose another feature that, together with the best feature, improves classification accuracy the most
…
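A hedged sketch of this greedy (forward) selection loop. It reuses the loocv_accuracy helper from the earlier cross-validation sketch as its scoring function and stops when no remaining feature improves accuracy; both of these are illustrative choices, not necessarily the authors'.

import numpy as np

def greedy_feature_selection(features, labels, score=None, max_features=5):
    """Greedily add the feature column that most improves the score (LOOCV accuracy)."""
    if score is None:
        score = loocv_accuracy             # from the earlier LOOCV sketch
    remaining = list(range(features.shape[1]))
    chosen, best_so_far = [], 0.0
    while remaining and len(chosen) < max_features:
        trial = {f: score(features[:, chosen + [f]], labels) for f in remaining}
        best_feature = max(trial, key=trial.get)
        if trial[best_feature] <= best_so_far:
            break                           # no single additional feature helps
        chosen.append(best_feature)
        remaining.remove(best_feature)
        best_so_far = trial[best_feature]
    return chosen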

18 FEATURE SELECTION THE BEST FEATURES
Rank  Mutual Information Score            Greedy Feature Selection with SVM
1.    # FP operations (0.59)
2.    # Operands (0.49)                   Loop nest level
3.    Instruction fan-in in DAG (0.34)
4.    # Live ranges (0.20)                # Branches
5.    # Memory operations (0.13)

19 RELATED WORK
Monsifrot et al., "A Machine Learning Approach to Automatic Production of Compiler Heuristics," 2002
Calder et al., "Evidence-Based Static Branch Prediction Using Machine Learning," 1997
Cavazos et al., "Inducing Heuristics to Decide Whether to Schedule," 2004
Moss et al., "Learning to Schedule Straight-Line Code," 1997
Cooper et al., "Optimizing for Reduced Code Space Using Genetic Algorithms," 1999
Puppin et al., "Adapting Convergent Scheduling Using Machine Learning," 2003
Stephenson et al., "Meta Optimization: Improving Compiler Heuristics with Machine Learning," 2003
Monsifrot et al. did supervised binary classification; ours is multiclass.

20 CONCLUSION
Supervised classification can effectively find good heuristics
  Even for multi-class problems
SVM and near neighbors perform well
Potentially a big impact
  We spent very little time tuning the learning parameters
Let a machine learning algorithm tell us which features are best

21 THE END

22 SOFTWARE PIPELINING
ORC has been tuned with SWP in mind
Every major release of ORC has had a different unrolling heuristic for SWP
The current heuristic is 205 lines long
Can we learn a heuristic that outperforms ORC's SWP unrolling heuristic?

23 REALIZING SPEEDUPS (SWP ENABLED)
We get a 1% improvement! But before you get down on our research, let me explain why this is the case. A big problem is noise: there are several cases where the oracle is outperformed by ORC or by one of the learned classifiers. We also assumed that loops are independent of each other; this assumption is acceptable, but not totally correct. The other reason is that loop unrolling does not contribute much when SWP is enabled: the oracle, even though it is a bit noisy, only achieves a 4.5% speedup. If you have long dependence chains, SWP can still handle that, as long as you have enough iterations to fill the pipeline. Control flow, procedure calls. Dedicated hardware support. Similar to LOOCV, when we compile a benchmark we only use filtered training examples from the other benchmarks, excluding all examples from the benchmark being compiled.

24 HURDLES
The compiler writer must extract features
Acquiring labels takes time
  Instrumentation library
  ~2 weeks to collect data
Predictions are confined to the training labels
Have to tweak the learning algorithms
Noise
We believe that machine learning techniques have the potential to radically alter compiler construction methods. We learned a very small portion of a several-million-line compiler and got a 9% speedup on the SPECfp benchmarks, so this approach shows promise. However, there are still many issues that need to be cleared up before this can become a truly viable technique. First, the compiler writer must extract features, and this takes effort and may not always be easy. For us, most of the features we used were readily available from the ORC infrastructure. We believe that as these techniques are shown to be viable, compiler passes will provide feature extraction tools, much like compiler infrastructures provide generic data-flow analysis tools. Getting the labels also takes time. The instrumentation library did take some effort, but again, in the future we expect timing utilities to be included in an infrastructure. Once the instrumentation is in place, data collection is a completely automated process. Predictions are confined to the labels on which the classifiers were trained; future work will consider regression, which does not have this limitation. We're using learning algorithms so that we don't have to tweak compiler heuristics, but the learning algorithms have parameters that must be tweaked. Once the training data set has been gathered, it's pretty easy to write a meta-tuner that looks at a subset of the examples and tunes the learning parameters; near neighbors has one parameter to tune and the SVM has two, so this is a much smaller search space than any compiler heuristic's. Noise is the single biggest detriment to our work: the finer the granularity at which you measure, the noisier the measurements will be. Modern architectures are helping our cause by including improved performance counters. Future research will explore ways to reduce noisy measurements.

