1 Predicting Unroll Factors Using Supervised Classification
Abhishek Kumar, Christian Smith, Thomas Oliver

2 Loop Unrolling
A loop transformation that increases speed by replicating the loop body and decreasing the number of branches.
Pros:
- Reduces loop overhead: the branch penalty is paid less often, and fewer instructions are executed overall (e.g., repeated accesses to the same memory location can be merged)
- Exposes instruction-level parallelism (ILP), giving the scheduler more flexibility
- Creates more opportunities for other optimizations, e.g., common subexpression elimination
Cons:
- Code expansion, which can degrade instruction-cache (I$) performance
- Longer live ranges, which increase register pressure
- Interacts with the scheduler, register allocation, the memory system, and other optimizations; depending on the circumstances, it may adversely affect them and reduce overall performance
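For concreteness, a minimal C sketch of unrolling by a factor of 4 (an invented example, not from the paper; the compiler normally performs this transformation itself):

```c
#include <stddef.h>

/* Original (rolled) loop: one branch test per element. */
void scale_rolled(float *a, size_t n, float k) {
    for (size_t i = 0; i < n; i++)
        a[i] *= k;
}

/* Unrolled by a factor of 4: one branch test per four elements,
   at the cost of a larger body and an epilogue loop for trip
   counts that are not a multiple of 4. */
void scale_unrolled4(float *a, size_t n, float k) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a[i]     *= k;
        a[i + 1] *= k;
        a[i + 2] *= k;
        a[i + 3] *= k;
    }
    for (; i < n; i++)  /* epilogue: leftover iterations */
        a[i] *= k;
}
```

The four independent multiplies in the unrolled body are what give the scheduler the extra ILP mentioned above.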

3 Histogram of Optimal Unroll Factors

4 Heuristics
A static, conservative approach:
- Tuned by humans; does not account for all the parameters and their interdependencies
- Time-consuming to develop and machine-dependent
- Results in under-utilization of the machine
g++:
- Unrolls loops that roll a constant number of times, or whose roll count can be computed at compile time; only unrolls the remaining loops if asked
- Unrolls only inner loops
- Unrolls by a fixed factor that is a power of 2, and only loops with small bodies
LLVM:
- Compares the cost of the rolled vs. unrolled loop
- Fully or partially unrolls loops with a static trip count; the rest are unrolled based on a threshold
ORC:
- Heuristics are redone with every release as software pipelining (SWP) changes
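A minimal sketch of what such a hand-tuned static heuristic might look like (the thresholds and structure are invented for illustration; this is not g++'s or LLVM's actual code):

```c
/* Choose an unroll factor from simple static properties of a loop.
   All thresholds here are hypothetical. */
int choose_unroll_factor(int is_innermost, int trip_count_known,
                         long trip_count, int body_size)
{
    if (!is_innermost)
        return 1;                   /* unroll only inner loops */
    if (trip_count_known && trip_count <= 8)
        return (int)trip_count;     /* small static trip count: fully unroll */
    if (body_size <= 16)
        return 4;                   /* small body: fixed power-of-2 factor */
    return 1;                       /* otherwise, leave the loop rolled */
}
```

Every constant here is a human judgment call, which is exactly the brittleness the paper's learned approach tries to remove.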

5 Supervised Learning
- Maps an input to an output based on example input-output pairs
- Performed on training examples: each example <xi, yi> is composed of a feature vector xi and a corresponding label yi
- The feature vector contains measurable characteristics of the object, e.g., loop characteristics: the trip count of the loop, the number of operations in the loop body, the programming language the loop is written in, etc. (38 features were collected for these experiments, for every unrollable loop in the benchmarks)
- The label indicates which optimization is best for each training example, e.g., the unroll factor: each loop is measured with eight different unroll factors (1, 2, ..., 8), and its label is the unroll factor that yields the best performance
- The classifier learns how to best map loop characteristics to the observed labels using the training examples
- Once trained, it should be able to accurately classify examples not in the training set
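In code, a training example might be represented as follows (a sketch; the field names and layout are assumptions, only the 38-feature count and the 1-8 label range come from the paper):

```c
#define NUM_FEATURES 38   /* features collected per unrollable loop */

/* One training example <x_i, y_i>: a loop's feature vector and the
   empirically best unroll factor found for it. */
struct training_example {
    double features[NUM_FEATURES];  /* e.g., trip count, # ops in body, ... */
    int    best_unroll_factor;      /* label y_i: 1..8 */
};
```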

6 Training Data
- 72 benchmarks: SPEC 2000, SPEC '95, and SPEC '92, Mediabench applications, the Perfect suite, etc.
- 3 languages: C, Fortran, and Fortran 90
- Compiled with the Open Research Compiler (ORC), an open-source research compiler that targets Itanium architectures, first with software pipelining disabled and then enabled
- Measure the runtime of each loop for each unroll factor: inserted assembly instructions capture the processor's cycle counter at loop entry and exit and accumulate the difference in the loop's associated counter
- Record the fastest unroll factor as the label for that loop
- The instrumentation affects execution time, so only loops that run for at least 50,000 cycles are used: a loop that runs for only a few thousand cycles and takes a cache miss (on the boundary of the I$) would introduce a lot of noise into the measured runtime
- Run each benchmark 30 times for all unroll factors up to eight
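The instrumentation described above might look roughly like this (a sketch; read_cycle_counter() is a hypothetical stand-in for the inserted assembly that reads the Itanium cycle counter):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrapper around the inserted assembly instruction
   that reads the processor's cycle counter. */
extern uint64_t read_cycle_counter(void);

/* Accumulated cycles for one instrumented loop. */
static uint64_t loop_cycles;

void instrumented_loop(float *a, size_t n) {
    uint64_t start = read_cycle_counter();        /* captured at entry */
    for (size_t i = 0; i < n; i++)
        a[i] *= 2.0f;
    loop_cycles += read_cycle_counter() - start;  /* captured at exit */
}
```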

7 Machine Learning Algorithms
The paper uses two supervised learning algorithms to map loop features to unroll factors:
- Near Neighbor classification
- Support Vector Machines (SVMs)

8 Near Neighbor Classification
- A basic and intuitive technique
- Easy and fast to train
Algorithm:
- Given a set of training examples <xi, yi> and an input feature vector x
- Let S = the set of all training examples <xi, yi> with Euclidean distance ||xi - x|| < d
- The value of d is configurable; the paper uses 0.3
- The answer is the most common y among all examples in S
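A direct C sketch of the algorithm above (the tie-breaking rule and the fallback when S is empty are assumptions, not from the paper):

```c
#include <math.h>
#include <stddef.h>

#define NUM_FEATURES 38
#define MAX_FACTOR   8   /* unroll factors 1..8 */

struct example {
    double features[NUM_FEATURES];
    int    label;        /* best unroll factor, 1..8 */
};

int near_neighbor(const struct example *train, size_t n,
                  const double *x, double d /* paper uses 0.3 */)
{
    int votes[MAX_FACTOR + 1] = {0};

    /* Collect a vote from every training example within distance d. */
    for (size_t i = 0; i < n; i++) {
        double dist2 = 0.0;
        for (int j = 0; j < NUM_FEATURES; j++) {
            double diff = train[i].features[j] - x[j];
            dist2 += diff * diff;
        }
        if (sqrt(dist2) < d)
            votes[train[i].label]++;
    }

    /* Return the most common label; fall back to 1 (no unrolling)
       when no training example lies within the radius. */
    int best = 1;
    for (int y = 2; y <= MAX_FACTOR; y++)
        if (votes[y] > votes[best]) best = y;
    return best;
}
```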

10 Support Vector Machines
- Detailed description not given here
- Maps the d-dimensional feature space to a higher-dimensional space, where the data is easier to separate
- Finds boundaries that maximally separate the classes

11 Support Vector Machines
- More complex and slower to train than near neighbor classification
- SVMs are binary classifiers, so some additional work is needed to make them handle multi-class problems (see the "From Binary to Multi-Class" slides)
- Less prone to overfitting than near neighbor classification, which results in better accuracy on novel examples after training

12 Feature Selection
- 38 extracted features from which to select
- Need to determine which features are used as inputs to the model
- Too many features: longer training times, possible overfitting to the training set, and a model that is harder to understand
- Too few features: a less accurate model
- Goal: eliminate all redundant and irrelevant features

13 Mutual Information Score (MIS)
- Measures how much knowing the best unroll factor reduces the uncertainty (entropy) in a feature, and vice versa; a higher score means the feature is more informative
- Does not provide information about how one feature interacts with another
- Only measures information content, not how helpful the feature is to a specific classifier
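For reference, the underlying quantity is the standard mutual information between a feature F and the label Y:

```latex
I(F;Y) \;=\; \sum_{f}\sum_{y} p(f,y)\,\log\frac{p(f,y)}{p(f)\,p(y)}
       \;=\; H(F) - H(F \mid Y)
```

i.e., the reduction in the entropy of F once Y is known (the measure is symmetric in F and Y).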

14 Greedy Feature Selection
- Picks the single best remaining feature for a specific classifier and adds it to the set of features the classifier uses
- Repeats a user-specified number of times
- The selected set differs for each type of classifier (see the sketch below)
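A sketch of the greedy loop (train_and_score() is a hypothetical helper that trains the target classifier on a feature subset and returns its accuracy):

```c
#include <stdbool.h>

#define NUM_FEATURES 38

/* Hypothetical: trains the classifier of interest on the selected
   feature subset and returns its measured accuracy. */
extern double train_and_score(const bool selected[NUM_FEATURES]);

/* Forward greedy selection: each round, add the single remaining
   feature that most improves this classifier. */
void greedy_select(bool selected[NUM_FEATURES], int rounds)
{
    for (int r = 0; r < rounds; r++) {
        int best_f = -1;
        double best_score = -1.0;
        for (int f = 0; f < NUM_FEATURES; f++) {
            if (selected[f]) continue;
            selected[f] = true;                    /* tentatively add f */
            double score = train_and_score(selected);
            selected[f] = false;
            if (score > best_score) { best_score = score; best_f = f; }
        }
        if (best_f >= 0)
            selected[best_f] = true;               /* commit the winner */
    }
}
```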

15 Results of Feature Selection
- Used the union of the features in the tables for their classifier
- The number of instructions in a loop is the de facto standard feature used by hand-written heuristics

16 Oracle
- Makes the best decision based on the timing of each individual loop
- Represents the theoretically best performance for a classifier
- Not always the best overall, because it only accounts for the unrolling of a single loop at a time (ignoring interactions between loops)

17 [figure-only slide]

18 [figure-only slide]

19 Results
Without SWP:
- SVM finds the optimal unroll factor 65% of the time and a nearly-optimal one 14% of the time; 79% of the time the classification is within 7% of optimal
- SVM is faster on 19 of 24 benchmarks: an average speedup of 5% on SPEC in general and 9% on SPECfp
- Nearest neighbor finds the optimal unroll factor 62% of the time and is faster on 16 of 24 benchmarks, with an average speedup of 4%
With SWP:
- SVM is 1% faster; the oracle is 4.4% faster
- Software pipelining is very important to ORC (its loop unrolling heuristic is redone every major release), so even this small improvement is notable

20 Thanks! Any questions?

21 From Binary to Multi-Class
- SVMs are binary classifiers: they only support giving a "yes or no" answer
- Mapping loops to unroll factors is not a binary classification problem, but a multi-class classification problem
- We can break a multi-class classification problem into multiple binary classification problems, allowing the use of SVMs
- The technique is called "output codes" (detailed on the next slide)

22 From Binary to Multi-Class: Output Codes
- Each binary classifier gives a "yes or no" answer for a single class, e.g., "should we apply an unroll factor of 4 to this loop?"
- The output code of an input x is the concatenation of the results of all classifiers
- Each class also has a unique binary codeword: 1 for that class's own classifier and 0 for all the other classifiers
- We assign x to the class with the most similar codeword (according to Hamming distance), as in the sketch below
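A sketch of the decoding step, assuming one-vs-rest codewords (the paper's exact code matrix may differ):

```c
#define NUM_CLASSES 8   /* one binary SVM per unroll factor 1..8 */

/* Given the 0/1 outputs of all binary classifiers for an input x,
   return the class whose codeword (1 for its own classifier, 0 for
   all others) is nearest in Hamming distance. */
int decode_output_code(const int outputs[NUM_CLASSES])
{
    int best_class = 0;
    int best_dist  = NUM_CLASSES + 1;
    for (int c = 0; c < NUM_CLASSES; c++) {
        int dist = 0;
        for (int k = 0; k < NUM_CLASSES; k++) {
            int codeword_bit = (k == c) ? 1 : 0;
            dist += (outputs[k] != codeword_bit);
        }
        if (dist < best_dist) { best_dist = dist; best_class = c; }
    }
    return best_class + 1;  /* map class index 0..7 to unroll factor 1..8 */
}
```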

