ARCHITECTURE-ADAPTIVE CODE VARIANT TUNING

Presentation transcript:

ARCHITECTURE-ADAPTIVE CODE VARIANT TUNING
Saurav Muralidharan, Amit Roy, Mary Hall, Michael Garland*, Piyush Rai**
University of Utah, *NVIDIA Research, and **IIT Kanpur

Introduction
Some computations may have multiple implementations, or code variants: e.g., different algorithms, data representations, parallelization strategies, etc.
Performance may depend on the execution context (input and architecture characteristics).
Current autotuning systems can handle input sensitivity. What about architecture sensitivity?
[Figure: training inputs are run exhaustively over Variant 1 and Variant 2 to produce labeled training data, which feeds a learning model]
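To make the idea concrete, here is a minimal sketch of model-driven variant selection, assuming a computation with two interchangeable implementations and a previously trained classifier; the variant names, input features, and API are illustrative and are not the Nitro interface.

# Hypothetical sketch of model-driven code variant selection (illustrative only).
import numpy as np

def spmv_csr_scalar(A, x):
    # Variant 1 placeholder: e.g., one thread per row in the real CUDA kernel.
    return A @ x

def spmv_ell(A, x):
    # Variant 2 placeholder: e.g., an ELL-format kernel in the real implementation.
    return A @ x

VARIANTS = [spmv_csr_scalar, spmv_ell]

def input_features(A):
    # Simple input features: matrix dimensions and density (illustrative choice).
    rows, cols = A.shape
    return np.array([rows, cols, np.count_nonzero(A) / (rows * cols)])

def run_best(model, A, x):
    # 'model' is a classifier trained offline on labeled (features -> best variant) data.
    idx = int(model.predict(input_features(A).reshape(1, -1))[0])
    return VARIANTS[idx](A, x)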

Motivation

Architecture-Adaptive Tuning
Given:
    A set of source architectures and a target architecture
    Labeled training data on the source architectures
    Device features characterizing both the sources and the target
Goal: build a code variant selection model for the target architecture.
Constraint: no training data collection on the target architecture.
[Figure: offline, training data and device features from the source architectures are used to build the target model; at runtime, the model is queried with the input and the target's device features to select a variant]

Multi-Task Learning
Learn multiple tasks simultaneously to capture the intrinsic relatedness among tasks.
Approach: treat each architecture as a task and perform feature vector concatenation (see the sketch below).
[Figure: the training data (T) of each source architecture is concatenated with that architecture's device features (DF), e.g. (T_SOURCE1, DF_SOURCE1), (T_SOURCE2, DF_SOURCE2), ..., to form the final training data for the target model]
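A minimal sketch of this concatenation scheme, assuming each source GPU contributes (input features, best variant) pairs plus a device-feature vector; the classifier choice and function names are illustrative, not the paper's exact formulation.

# Hypothetical sketch: multi-task learning via feature-vector concatenation.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_target_model(source_data):
    """source_data: list of (X_input, y_variant, device_features), one per source GPU."""
    X_all, y_all = [], []
    for X_input, y_variant, df in source_data:
        # Append the source's device-feature vector to every input-feature vector.
        X_all.append(np.hstack([X_input, np.tile(df, (X_input.shape[0], 1))]))
        y_all.append(y_variant)
    return DecisionTreeClassifier().fit(np.vstack(X_all), np.concatenate(y_all))

def predict_variant(model, target_df, input_features):
    # At runtime, query with the input features concatenated with the TARGET's device features.
    x = np.hstack([input_features, target_df]).reshape(1, -1)
    return int(model.predict(x)[0])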

Device Features
Obtained via deviceQuery, from published specifications (static), or via micro-benchmarks (custom).
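For illustration, a device-feature vector for one GPU might look like the following; the feature names follow the talk, while the static/custom grouping and the concrete values are illustrative rather than measured.

# Illustrative device-feature vector for a single GPU (values are placeholders, not measurements).
DEVICE_FEATURES_EXAMPLE = {
    "mem_clock_rate": 3.5e9,            # static: from deviceQuery / published specs (Hz)
    "mem_bus_width":  256,              # static: bits
    "l2_cache_size":  2 * 1024 * 1024,  # static: bytes
    "peak_gbps":      224.0,            # custom: measured by a bandwidth micro-benchmark
    "global_atomic":  1.0,              # custom: normalized global-atomic throughput micro-benchmark
    "shared_atomic":  1.0,              # custom: normalized shared-memory atomic micro-benchmark
}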

Profile Device Feature Selection
Idea: use profiling data.
    Develop models for application proxies on the source architectures.
    Relate each proxy to the computation to find the relevant device features.
Application proxies:
    Each of the 5 proxies is a small application that stresses a separate set of hardware components.
    For intensity X, a proxy produces roughly X% instructions of a particular kind.
Example: the SP-FLOP proxy at 40% intensity
    // 6 loads and stores, 4 floating-point instructions
    A[i]   = A[i+1]*beta + alpha;
    A[i+1] = A[i+2]*beta + alpha;
    A[i+2] = A[i+3];

Profile Device Feature Selection
For each proxy, build a model that maps the proxy's profiling metrics to its intensity.
Use the models to predict the relevant hardware components and their associated device features.
[Figure: profiling data from the SP-FLOP proxy at 20%, 40%, 60%, 80%, and 100% intensity trains an SP-FLOP proxy model; the application's profiling data is then fed to the proxy models (SP-FLOP, DP-FLOP, MEM-BW, ATOMIC, SHMEM-BW), which select relevant device features such as global_atomic, shared_atomic, peak_gbps, mem_clock_rate, mem_bus_width, l2_cache_size]
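A minimal sketch of this step, assuming each proxy's runs are summarized as profiling-metric vectors with known intensity labels; the regression model, the component-to-feature map, and the threshold are illustrative assumptions, not the paper's exact procedure.

# Hypothetical sketch: per-proxy models map profiling metrics -> intensity, then
# the application's metrics are scored against each proxy to select device features.
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative association of hardware components with device features.
COMPONENT_FEATURES = {
    "MEM-BW": ["peak_gbps", "mem_clock_rate", "mem_bus_width", "l2_cache_size"],
    "ATOMIC": ["global_atomic", "shared_atomic"],
}

def train_proxy_model(proxy_metrics, proxy_intensity):
    # proxy_metrics: (runs x metrics) profiling data for one proxy at known intensities.
    return LinearRegression().fit(proxy_metrics, proxy_intensity)

def select_device_features(proxy_models, app_metrics, threshold=0.25):
    # Keep the features of every component whose predicted intensity for the
    # application exceeds the (illustrative) threshold.
    selected = []
    for component, model in proxy_models.items():
        if model.predict(app_metrics.reshape(1, -1))[0] >= threshold:
            selected.extend(COMPONENT_FEATURES.get(component, []))
    return selected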

Cross-Validation Device Feature Selection
Perform multi-task learning on the source architectures alone to find the best device features (see the sketch below).
Relies on the assumption that device features that work well on the sources also work well on the target for the same computation.
[Figure: each source architecture in turn acts as a temporary target while the others serve as sources, yielding a device-feature subset per fold (e.g., global_atomic, shared_atomic, peak_gbps for SOURCE 1; shared_atomic, peak_gbps for SOURCE 2; shared_atomic, mem_clock_rate for SOURCE 3; global_atomic, shared_atomic for SOURCE 4; shared_atomic, peak_gbps, mem_clock_rate for SOURCE 5); the per-fold subsets are combined into a final selection, e.g., shared_atomic, peak_gbps, used for the real target]
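Below is a minimal sketch of this idea as leave-one-source-out cross-validation over candidate device-feature subsets; the classifier, scoring, and subset enumeration are illustrative assumptions rather than the paper's exact algorithm.

# Hypothetical sketch: leave-one-source-out cross-validation to select device features.
from itertools import combinations
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def _fit(train_sources, subset):
    # Concatenate input features with the chosen device features for each training source.
    X, y = [], []
    for X_in, y_var, df in train_sources:
        dvec = np.array([df[f] for f in subset])
        X.append(np.hstack([X_in, np.tile(dvec, (X_in.shape[0], 1))]))
        y.append(y_var)
    return DecisionTreeClassifier().fit(np.vstack(X), np.concatenate(y))

def cv_feature_selection(source_data, candidate_features, max_size=3):
    """source_data: list of (X_input, y_variant, device_feature_dict), one per source GPU."""
    best_subset, best_score = None, -np.inf
    for size in range(1, max_size + 1):
        for subset in combinations(candidate_features, size):
            scores = []
            for i, (X_te, y_te, df_te) in enumerate(source_data):
                # The held-out source plays the role of a temporary target.
                model = _fit([d for j, d in enumerate(source_data) if j != i], subset)
                dvec = np.array([df_te[f] for f in subset])
                X_eval = np.hstack([X_te, np.tile(dvec, (X_te.shape[0], 1))])
                scores.append(model.score(X_eval, y_te))
            if np.mean(scores) > best_score:
                best_subset, best_score = subset, float(np.mean(scores))
    return list(best_subset)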

Methodology
Six NVIDIA GPUs used:
    Fermi generation: GeForce GTX 480, Tesla C2075
    Kepler generation: GeForce GTX 770, Tesla K20c
    Maxwell generation: GeForce GTX 750 Ti, GeForce GTX 980
CPU: Intel Core i7-4770
CUDA Toolkit 6.5
Profiling metrics normalized w.r.t. total issued instructions.
Training and test inputs taken from standard sets; the training and test sets are distinct, and the test set is much larger than the training set to test generalization.

Benchmarks
Input features are specific to each benchmark; details in the paper.
Histogram (CUB): (Sort, Global-Atomic, Shared-Atomic) variants × (Even-Share, Dynamic) grid mappings
SpMV (CUSP): CSR Scalar, CSR Vector, ELL, DIA
GPU Sort (CUB, ModernGPU): Merge, Locality, Radix
BFS (Back40Computing): E-C (Fused/Iterative), C-E (Fused/Iterative), 2-Phase (Fused/Iterative)
Pre-Conditioner+Solver (CULA): (CG, BiCGStab) solvers × (Jacobi, Blocked Jacobi, FAInv) pre-conditioners
Transpose (Catanzaro et al.): R2C, C2R, Skinny R2C/C2R

Architecture Sensitivity
Computations exhibit varying degrees of architecture sensitivity.
GPUs of the same generation do not always exhibit similar performance characteristics.

Prediction Performance
Performance is comparable to fixed-architecture tuning.

Restricted Source Architectures
Device feature selection makes the approach more robust when fewer source architectures, and hence less training data, are available.

Device Feature Selection Overhead
Benchmark     P-DFS (sec)   CV-DFS (sec)
Histogram     4.70          100.50
SpMV          3.36          12.42
GPU Sort      4.14          56.61
BFS           3.89          30.27
Solver        4.59          36.80
Transpose     4.26          48.24
Much faster than re-training!

Related Work
Learning from performance counters: Cavazos et al., Parello et al.
Input-adaptivity: PetaBricks, Nitro, Ding et al., G-ADAPT
Cross-architecture tuning: Magni et al.
Parameter tuning systems: Active Harmony, Orio, OpenTuner, etc.
Domain-specific autotuners: OSKI, SPIRAL, etc.

Conclusions & Future Work
A novel approach to cross-architecture autotuning.
Comparable results to fixed-architecture tuning for 6 benchmarks on 6 NVIDIA GPUs.
Future work:
    Learn on one set of resources and adapt to a different set.
    Tuning across architecture classes such as CPUs and GPUs.

Thank You!