Predicting Unroll Factors Using Supervised Classification


Predicting Unroll Factors Using Supervised Classification
Abhishek Kumar, Christian Smith, Thomas Oliver

Loop Unrolling
A loop transformation that increases speed by replicating the loop body and decreasing the number of branches.
Pros:
- Reduced branch penalty and loop overhead
- Fewer executed instructions, e.g. by reusing the same memory access across iterations
- Exposes ILP: more flexible instruction scheduling
- Opens up opportunities for other optimizations, e.g. common subexpression elimination
Cons:
- Code expansion: can decrease instruction-cache performance
- Longer live ranges: increased register pressure
- Interacts with the scheduler, register allocation, the memory system, and other optimizations; depending on the circumstances it may adversely affect them and reduce overall performance
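The transformation can be illustrated with a small sketch (Python here for readability; in a real compiler this happens on low-level IR, and the function names below are made up for the example):

```python
def sum_rolled(a):
    total = 0
    for i in range(len(a)):       # one branch test per element
        total += a[i]
    return total

def sum_unrolled4(a):
    total = 0
    n = len(a)
    i = 0
    # main unrolled loop: four body copies per branch test
    while i + 4 <= n:
        total += a[i]
        total += a[i + 1]
        total += a[i + 2]
        total += a[i + 3]
        i += 4
    # epilogue: handle the remaining n mod 4 elements
    while i < n:
        total += a[i]
        i += 1
    return total
```

Note the trade-off the slide describes: the unrolled version executes one branch test per four elements instead of one per element, but the body is roughly four times larger.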

Histogram of Optimal Unroll Factors

Heuristics
A static, conservative approach: tuned by humans, and does not account for all the parameters and their interdependencies. Time-consuming to develop and machine-dependent; results in under-utilization of the hardware.
g++:
- Fully unrolls loops whose trip count is constant or computable at compile time; only unrolls the remaining loops if explicitly asked
- Unrolls only inner loops
- Unrolls by a fixed factor that is a power of 2, and only loops with a small body
LLVM:
- Compares the cost of the rolled vs. unrolled loop
- Fully or partially unrolls loops with a static trip count; the rest are unrolled based on a threshold
In ORC, the unrolling heuristics are redone every release as software pipelining (SWP) changes.
https://gcc.gnu.org/onlinedocs/gcc-3.4.4/gcc/Optimize-Options.html
https://github.com/gcc-mirror/gcc/blob/master/gcc/loop-unroll.c

Supervised Learning
Learns to map an input to an output from example input-output pairs. Each training example <xi, yi> consists of a feature vector xi and a corresponding label yi.
- The feature vector contains measurable characteristics of the object, e.g. loop characteristics: the trip count of the loop, the number of operations in the loop body, the programming language the loop is written in, etc. The authors collect 38 such features for every unrollable loop in the benchmarks.
- The label indicates which optimization is best for that example, e.g. the unroll factor. Each loop is measured with eight different unroll factors (1, 2, ..., 8), and its label is the unroll factor that yields the best performance.
The classifier learns how to best map loop characteristics to the observed labels using the training examples. Once trained, it should (hopefully) accurately classify examples not in the training set.
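Concretely, one training example might look like the sketch below (the feature names are illustrative; the paper uses 38 features in total):

```python
# One training example <x_i, y_i>: a feature vector describing a loop,
# plus the empirically best unroll factor as its label.
example = {
    "features": {
        "trip_count": 256,         # iterations of the loop
        "num_ops": 12,             # operations in the loop body
        "num_mem_ops": 4,          # loads/stores in the body (illustrative)
        "language_is_fortran": 1,  # indicator for the source language
    },
    "label": 4,  # best unroll factor found by measurement (1..8)
}
```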

Training Data
- 72 benchmarks from SPEC 2000, SPEC '95, SPEC '92, Mediabench applications, the Perfect suite, etc.
- 3 languages: C, Fortran, Fortran 90
- Compiled with the Open Research Compiler (ORC), an open-source research compiler that targets Itanium architectures, first with software pipelining disabled and then with it enabled
- Measure the runtime of each loop for every unroll factor up to eight: inserted assembly instructions capture the processor's cycle counter at loop entry and exit and accumulate the difference in the loop's associated counter
- Record the fastest unroll factor as the label for that loop
- The instrumentation affects execution time, so only loops that run for at least 50,000 cycles are used: a loop that runs for only a few thousand cycles and takes a cache miss (around the instruction-cache boundary) would introduce too much noise into the measured runtime
- Each benchmark is run 30 times for all unroll factors
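The labeling step reduces to "pick the factor with the lowest measured runtime." A minimal sketch, with `time.perf_counter_ns` standing in for the hardware cycle counter the paper reads via inserted assembly:

```python
import time

def time_loop(run_loop, f):
    # Stand-in for the paper's cycle-counter instrumentation:
    # run_loop(f) is a hypothetical callable that executes the loop
    # compiled with unroll factor f.
    start = time.perf_counter_ns()
    run_loop(f)
    return time.perf_counter_ns() - start

def best_unroll_factor(timings):
    """timings maps unroll factor -> list of measured runtimes
    (e.g. from the 30 repeated runs). The label is the factor
    whose best (minimum) observed runtime is lowest."""
    return min(timings, key=lambda f: min(timings[f]))
```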

Machine Learning Algorithms The paper uses two supervised learning algorithms to map loop features to unroll factors: Near Neighbor classification Support Vector Machines

Near Neighbor Classification
A basic and intuitive technique; easy and fast to train.
Algorithm: given a set of training examples <xi, yi> and an input feature vector x:
- Let S be the set of all training examples <xi, yi> with Euclidean distance ||xi - x|| < d
- The value of d is configurable; the paper uses 0.3
- The answer is the most common y among all examples in S
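The rule above fits in a few lines. A sketch (the behavior when no training example falls within distance d is not specified on the slide; returning None here is our assumption):

```python
import math
from collections import Counter

def near_neighbor_predict(train, x, d=0.3):
    """Near neighbor classification.

    train: list of (feature_vector, label) pairs
    x:     feature vector to classify
    d:     distance threshold (the paper uses 0.3)
    """
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    # S = all training examples within Euclidean distance d of x
    nearby = [y for xi, y in train if dist(xi, x) < d]
    if not nearby:
        return None  # no example close enough (assumption, see above)
    # answer: the most common label among the examples in S
    return Counter(nearby).most_common(1)[0][0]
```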

Support Vector Machines
A detailed description is not given here.
- Maps the d-dimensional feature space to a higher-dimensional space in which the data are easier to separate
- Finds the boundaries that maximally separate the classes

Support Vector Machines
- More complex and slower to train than near neighbor classification
- SVMs are binary classifiers, so additional work is needed to apply them to multi-class problems
- Less prone to overfitting than near neighbor classification, which results in better accuracy on novel examples after training

Feature Selection
38 extracted features from which to select; we need to determine which features are used as inputs to the model.
Too many features:
- Longer training times
- Possible overfitting to the training set
- Harder-to-understand models
Too few features:
- Less accurate
Goal: eliminate all redundant and irrelevant features.

Mutual Information Score (MIS)
- Measures how much knowing the best unroll factor reduces the uncertainty (entropy) in a feature; a higher score means the feature is more informative
- Does not capture how one feature interacts with another
- Measures only information content, not helpfulness for a specific classifier

Greedy Feature Selection
- Repeatedly picks the single best remaining feature for a specific classifier and adds it to the set of features that classifier uses
- Repeated a user-specified number of times
- The selected set differs for each type of classifier
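The loop above is standard greedy forward selection. A sketch, where `evaluate` is a stand-in for training the target classifier on a feature subset and measuring its (cross-validated) accuracy:

```python
def greedy_select(features, evaluate, k):
    """Greedy forward feature selection.

    features: candidate feature names
    evaluate: callable(list_of_features) -> accuracy of the target
              classifier trained with exactly those features
    k:        number of features to pick (user-specified)
    """
    chosen = []
    remaining = list(features)
    for _ in range(k):
        # pick the single remaining feature that helps the classifier most
        best = max(remaining, key=lambda f: evaluate(chosen + [f]))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Because `evaluate` is classifier-specific, running this for the SVM and for near neighbor yields different feature sets, as the slide notes.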

Results of Feature Selection
- The union of the features in the tables was used for their classifier
- The number of instructions in a loop is the de facto standard feature for hand-tuned heuristics

Oracle
- Makes the best decision based on the timing of each individual loop
- Represents the theoretically best performance for a classifier
- Not always the best overall, because it accounts only for the unrolling of a single loop in isolation

Results
Without SWP:
- SVM finds the optimal unroll factor 65% of the time, and a nearly-optimal one another 14% of the time; 79% of the time the classification is within 7% of optimal
- SVM is faster on 19 of 24 benchmarks: an average speedup of 5% on SPEC in general, and 9% on SPECfp
- Nearest neighbor finds the optimal factor 62% of the time and is faster on 16 of 24 benchmarks, for an average speedup of 4%
With SWP:
- SVM is 1% faster; the oracle is 4.4% faster
- Software pipelining is very important to ORC (its loop unrolling heuristic is redone every major release), so even this small improvement is notable

Thanks! Any questions?

From Binary to Multi-Class
- SVMs are binary classifiers: they only support giving a "yes or no" answer
- Mapping loops to unroll factors is not a binary classification problem, but a multi-class classification problem
- We can break a multi-class classification problem into multiple binary classification problems, allowing the use of SVMs; the technique is called "output codes"
- Each binary classifier gives a "yes or no" answer for a single class, e.g. "should we apply an unroll factor of 4 to this loop?"

From Binary to Multi-Class: Output Codes
- Each binary classifier gives a "yes or no" answer for a single class, e.g. "should we apply an unroll factor of 4 to this loop?"
- The output code of an input x is the concatenation of all the classifiers' results
- Each class also has a unique binary code: 1 from that class's classifier, and 0 from all the others
- We assign x to the class whose code is most similar to the output code, measured by Hamming distance
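The decoding step above can be sketched as follows (each classifier would be a trained binary SVM in the paper; here any 0/1-valued callable works, and the tie-breaking by smallest class is an implementation choice, not from the slides):

```python
def output_code_predict(binary_classifiers, x):
    """Output-code (one-vs-rest) decoding with Hamming distance.

    binary_classifiers: dict mapping class label -> callable(x) -> 0 or 1
    """
    classes = sorted(binary_classifiers)
    # output code of x: concatenated yes/no answers of every classifier
    code = [binary_classifiers[c](x) for c in classes]

    # each class's ideal code: 1 for its own classifier, 0 for all others
    def hamming_to(c):
        ideal = [1 if k == c else 0 for k in classes]
        return sum(a != b for a, b in zip(code, ideal))

    # assign x to the class with the most similar code
    return min(classes, key=hamming_to)
```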