Offloading to the GPU: An Objective Approach

Presentation transcript:

Offloading to the GPU: An Objective Approach Ajaykumar Kannan, Mario Badr, Parisa Khadem Hamedani, Natalie Enright Jerger {kannanaj, badrmari, parisa, enright}@ece.utoronto.ca University of Toronto

Overview

Offloading Regions 1 & 2 to the GPU: starting from a CPU-only application, each candidate region (Region 1, Region 2) is evaluated for its % speedup and % energy gain when offloaded.

Outline

Motivation
Overview
Methodology
Modeling the Devices
Profiling the Application
Evaluation of Sample Applications

Motivation: Improving Battery Life and the UX

(Chart: power vs. performance. Porting from the CPU to the GPU* can reduce power while improving performance. *For some types of applications.)

Is It Worth It?

Qualitative: performs better; saves energy.
Quantitative: estimate run time (T); estimate energy (E).
Goal: given an estimated T and E for a port of the application, managers can objectively decide whether to commit developers to the project.

Methodology – The Setup

Intrinsyc Snapdragon 805 Development Board
CPU: Quad-Core Krait 450
GPU: Adreno 420

Methodology – The Setup

Power measurement: total power measured with an ACS712 current sensor and a 10x amplifier circuit; 10-bit ADC read by a Linux PC.
Timing measurement: Android OS UNIX time; synchronization mechanism with the Linux PC over adb.

Application Breakdown

Instruction mix: ALU instructions (SISD, SIMD), memory ops, control flow, and other (memory initializations, system calls).

Operational Intensity & The Roofline Model [4]

Operational intensity: the number of operations performed per byte of data transferred. It is relatively static for a given application.
Roofline model: serves as a sanity check for our benchmarks.

[4] S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual performance model for multicore architectures,” Commun. ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009.

Operational Intensity & The Roofline Model [4]

(Roofline graph: performance in FLOP/s versus operational intensity.)
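The roofline bound behind this sanity check is a single min(): attainable performance is capped by memory bandwidth at low operational intensity and by peak compute at high intensity. A minimal sketch; the peak FLOP/s and bandwidth numbers are illustrative placeholders, not measurements from the Snapdragon board.

```python
def roofline(peak_flops, peak_bw, op_intensity):
    """Attainable performance (FLOP/s) under the roofline model:
    min(peak compute, memory bandwidth * operational intensity)."""
    return min(peak_flops, peak_bw * op_intensity)

# Illustrative device numbers (placeholders, not measured values).
PEAK_FLOPS = 100e9   # 100 GFLOP/s
PEAK_BW = 10e9       # 10 GB/s

low = roofline(PEAK_FLOPS, PEAK_BW, 0.5)   # memory-bound: 5 GFLOP/s
high = roofline(PEAK_FLOPS, PEAK_BW, 100)  # compute-bound: 100 GFLOP/s
```

The ridge point (peak_flops / peak_bw, here 10 ops/byte) separates the two regimes: benchmarks landing below it should scale with bandwidth, not compute.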

High-Level Overview

Micro-benchmarks → measurements → characteristics → modelling → validation. If the models do not meet the desired accuracy, refine the models and the micro-benchmarks and repeat.

Estimating From The Bottom Up

Composable model [1]: power and run time can be broken down into per-characteristic terms. The application is described by characteristics (Characteristic_1 … Characteristic_n), each weighted by a coefficient (P_1 … P_n).*

*Similar for energy (E).
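The composable model reduces to a weighted sum: each characteristic count multiplied by its per-device coefficient. A minimal sketch, where the characteristic names and counts are hypothetical; the τ values are taken from the CPU column of the timing table later in the deck.

```python
def estimate(characteristics, coefficients):
    """Composable model: sum of count_i * coefficient_i over all
    characteristics (the same form is used for time and energy)."""
    return sum(characteristics[name] * coefficients[name]
               for name in characteristics)

# Per-operation CPU time coefficients in seconds (from the timing table).
tau = {"add": 400e-12, "multiply": 140e-12, "mem": 710e-12}
# Hypothetical instruction counts from profiling.
counts = {"add": 1_000_000, "multiply": 500_000, "mem": 2_000_000}

t_est = estimate(counts, tau)   # ~1.89 ms
```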

Behind the (Look-Up) Table

1. Create parametrized micro-benchmarks.
2. Measure power and run time.
3. Solve the resulting system of equations [1–3].
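Step 3 is essentially a linear fit: run a micro-benchmark at several operation counts, measure run time, and solve for the per-operation cost plus fixed overhead. A pure-Python least-squares sketch over synthetic measurements (the 400 ps/op slope and 2 µs overhead are made up to match the fit):

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b via the normal equations.
    Here x = operation count, y = measured run time, so `a` is the
    per-operation cost and `b` the fixed overhead."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Synthetic measurements: time = 400 ps/op * ops + 2 us overhead.
ops = [1e6, 2e6, 4e6, 8e6]
times = [400e-12 * x + 2e-6 for x in ops]
per_op, overhead = fit_line(ops, times)
```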

Methodology – Parametrized Micro-Benchmarks

for each benchmark {
    for i = 1 to 3 {
        run kernel;
        sleep 2 seconds;
    }
}

Each benchmark is parametrized by data type, operation, vector width, input data size, and operational intensity.

Methodology – Example Micro-Benchmarks

int id = get_global_id(0);
float reg1 = input[id];
float reg2 = 0.0f;
reg2 = reg2 + reg1;
// ... repeat the operation `operations` number of times
output[id] = reg2;
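The "repeat" comment hints at how the kernels are parametrized: the arithmetic line is unrolled a configurable number of times to vary compute per byte transferred. A sketch of such a generator; the kernel template here is our simplified stand-in, not the authors' actual benchmark source.

```python
def make_kernel(operations: int) -> str:
    """Emit OpenCL-style kernel source with the float add repeated
    `operations` times, controlling the operational intensity."""
    body = "\n".join("    reg2 = reg2 + reg1;" for _ in range(operations))
    return (
        "__kernel void bench(__global const float *input,\n"
        "                    __global float *output) {\n"
        "    int id = get_global_id(0);\n"
        "    float reg1 = input[id];\n"
        "    float reg2 = 0.0f;\n"
        f"{body}\n"
        "    output[id] = reg2;\n"
        "}\n"
    )

src = make_kernel(4)   # kernel with 4 adds per element loaded
```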

Linear Regression and the Timing Model

Linear regression is performed for each operation, at each operational intensity, for each device; the output is a look-up table (LUT). The time estimate depends on the number of operations, the number of bytes, and the current availability of the data.

Timing Model for the CPU & GPU*

Data    Operation       τCPU (ps)   τGPU (ps)
all     –               260         14
float   add             400         13
float4  add             91          –
–       fma             510         110
–       multiply        140         –
–       divide          2000        –
–       random access   710         150

Observations: the GPU cost is roughly constant across operations, while the CPU cost varies; float4 is faster than float; bandwidth can overlap with compute or run serially, and we assume the worst case.

*Does not include transfer time.
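Reading the table: multiply the per-operation latency by the operation count for a first-order compute-time estimate on each device (transfer time excluded, as the footnote says). A sketch using the float-add row; the ten-million-add workload is an illustrative choice of ours.

```python
# Per-operation costs for a float add, from the table (seconds).
TAU_CPU = 400e-12   # 400 ps per float add on the CPU
TAU_GPU = 13e-12    # 13 ps per float add on the GPU

def compute_time(n_ops, tau):
    """First-order compute-time estimate; excludes transfer time."""
    return n_ops * tau

n = 10_000_000                      # ten million float adds
t_cpu = compute_time(n, TAU_CPU)    # ~4 ms
t_gpu = compute_time(n, TAU_GPU)    # ~0.13 ms
```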

Power Model

Data    Operation       εCPU,op (pJ)   εGPU,op (pJ)   πCPU,op (J)   πGPU,op (J)
all     –               360            740            25.4          27.2
float   add             1600           28             14.8          14.2
float4  add             260            –              18.0          2.6
–       fma             1100           39             21.2          26.0
–       –               300            37             20.1          18.6
–       multiply        -590           67             28.1          18.7
–       divide          -12000         20             37.5          32.1
–       random access   4000           2900           25.6          25.3

Similar methodology as for timing. Key factors: energy per operation, static power, and the time taken by the application.
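The three key factors combine as a dynamic term plus a static term integrated over the run time. A sketch using the float-add row of the table; interpreting the π coefficient as static power drawn over the run time is our assumption, and the operation counts are illustrative.

```python
def energy(n_ops, eps_op, static_power, run_time):
    """Energy estimate: per-operation (dynamic) energy plus
    static power integrated over the run time."""
    return n_ops * eps_op + static_power * run_time

# float-add coefficients from the tables; pi treated as static
# power in watts (an assumption about the table's units).
n = 10_000_000
e_cpu = energy(n, 1600e-12, 14.8, n * 400e-12)   # ~0.075 J
e_gpu = energy(n, 28e-12, 14.2, n * 13e-12)      # ~0.0021 J
```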

What’s the Problem with This Model?

Problem: which portion of the code do you apply it to?
Solution: apply it at basic block granularity.
Basic block: a portion of code with a single point of entry, a single point of exit, and no control flow inside it.

Profiling the Application: Basic Block Granularity

Instrument each basic block (BBL) and count: the number of runs of the BBL, the unique bytes accessed by the BBL, and the number of useful operations (ALU & memory).
From these, obtain: operational intensity (ι), working set size (σ), and instruction mix (v_op for each operation).
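The three derived quantities fall out of the raw counters directly. A sketch with made-up counter values; the exact counter semantics (per-run op counts scaled by run count) are our assumption about the instrumentation.

```python
def characterize(runs, unique_bytes, op_counts):
    """Derive per-BBL characteristics from raw counters:
    operational intensity (iota), working set size (sigma),
    and instruction-mix fractions (v_op)."""
    per_run = sum(op_counts.values())
    total_ops = per_run * runs
    iota = total_ops / unique_bytes            # ops per unique byte
    sigma = unique_bytes                       # working set size
    v = {op: c / per_run for op, c in op_counts.items()}
    return iota, sigma, v

# Hypothetical counters for one basic block.
iota, sigma, v = characterize(runs=1000,
                              unique_bytes=64_000,
                              op_counts={"add": 8, "mem": 4})
```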

Example Application: Matrix Addition

(Code snippet, compared against the coefficient table.)
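Matrix addition is a usefully extreme case for the model: one float add per element against two 4-byte loads and one 4-byte store, so its operational intensity is very low and the memory side of the model dominates. The per-element byte accounting below is ours, not taken from the slides.

```python
def matrix_add(a, b):
    """Element-wise matrix addition: per element, one add,
    two float32 loads, and one float32 store."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Operational intensity: 1 op per 12 bytes moved (float32),
# placing the kernel deep in the memory-bound roofline region.
OP_INTENSITY = 1 / 12

c = matrix_add([[1.0, 2.0]], [[3.0, 4.0]])
```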

Example Application: Matrix Addition

(Chart: kernel instruction mix.)

Example Application: Matrix Addition

(Chart: speedup.)


Sample Application: Median Filter

(Diagram: a neighboring 3x3 matrix [a1 a2 a3; b1 b2 b3; c1 c2 c3] is read from the source image and its median is written to the destination image.)
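The per-pixel operation in the diagram: gather the 3x3 neighborhood and write its median. A minimal grayscale sketch in pure Python; copying border pixels through is our choice of boundary convention, one of several common ones.

```python
def median_filter(img):
    """3x3 median filter: each interior output pixel is the median
    of its 3x3 neighborhood; border pixels are copied unchanged."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = sorted(img[y + dy][x + dx]
                            for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = window[4]   # middle of the 9 sorted values
    return out

# The isolated dark pixel is removed by the filter.
filtered = median_filter([[9, 9, 9],
                          [9, 0, 9],
                          [9, 9, 9]])
```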

Sample Application: Median Filter

(Chart: kernel instruction mix.)

Sample Application: Median Filter

(Charts: estimated time and estimated energy.)

Sample Application: Sobel Edge Detection

(Diagram: the original color input image is converted to a gray input image; a neighboring 3x3 matrix, subMatA [a1 a2 a3; b1 b2 b3; c1 c2 c3], is extracted.)
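The 3x3 neighborhood (subMatA) feeds two Sobel convolutions, and the per-pixel result is the gradient magnitude sqrt(Gx² + Gy²). A minimal grayscale sketch of our own, not the slides' kernel:

```python
import math

GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal gradient
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical gradient

def sobel_at(img, y, x):
    """Gradient magnitude at an interior pixel from its 3x3 window."""
    gx = sum(GX[dy][dx] * img[y - 1 + dy][x - 1 + dx]
             for dy in range(3) for dx in range(3))
    gy = sum(GY[dy][dx] * img[y - 1 + dy][x - 1 + dx]
             for dy in range(3) for dx in range(3))
    return math.hypot(gx, gy)

# Vertical edge: left columns dark, right column bright.
img = [[0, 0, 255],
       [0, 0, 255],
       [0, 0, 255]]
mag = sobel_at(img, 1, 1)   # strong response on the edge
```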

Sample Application: Sobel Edge Detection

(Chart: kernel instruction mix.)

Sample Application: Sobel Edge Detection

(Charts: estimated time and estimated energy.)

Wrap Up

Conclusions

Devices are characterized at different operational intensities.
Coefficients are generated for each instruction at different operational intensities.
Applications are instrumented and analyzed at basic block granularity.
The generated coefficients are applied to the profiling data obtained.

References

[1] T. Diop, N. Enright Jerger, and J. Anderson, “Power modeling for heterogeneous processors,” Proc. Workshop on General Purpose Processing Using GPUs, ACM, 2014.
[2] J. W. Choi et al., “A roofline model of energy,” Proc. 27th IEEE Int. Parallel & Distributed Processing Symposium (IPDPS), 2013.
[3] J. Choi et al., “Algorithmic time, energy, and power on candidate HPC compute building blocks,” Proc. 28th IEEE Int. Parallel and Distributed Processing Symposium (IPDPS), 2014.
[4] S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual performance model for multicore architectures,” Commun. ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009.
[5] M. Kambadur, K. Tang, and M. A. Kim, “Harmony: Collection and analysis of parallel block vectors,” Proc. 39th Annual Int. Symposium on Computer Architecture (ISCA), 2012, pp. 452–463.

Questions?

Thank you!