Adaptive Input-aware Compilation for Graphics Engines
University of Michigan, Electrical Engineering and Computer Science


Slide 1: Adaptive Input-aware Compilation for Graphics Engines. Mehrzad Samadi¹, Amir Hormati², Mojtaba Mehrara³, Janghaeng Lee¹, and Scott Mahlke¹. ¹University of Michigan, Ann Arbor; ²Microsoft Research; ³NVIDIA Research.

Slide 2: GPU Performance Gap. GPUs offer high performance at low cost, but peak performance is difficult to achieve in practice. (Chart: peak vs. in-practice performance for GeForce 7800 GTX, 8800 GTX, GTX 280, GTX 480, GTX 590, and GTX 680.)

Slide 3: TMV Performance on Various Inputs. (Chart: transposed matrix-vector multiplication performance for square and rectangular matrices.)

Slide 4: GPU Execution Model. A grid of thread blocks is distributed across the streaming multiprocessors (SM 0 through SM 7), each with its own registers and shared memory; each SM executes the threads of the blocks assigned to it.

Slide 5: Transposed Matrix-Vector Multiplication (4 x 1M). With a 4 x 1M input, only four thread blocks (Block 0 ~ 3, each with threads 0 ~ 15) are created. They occupy a single SM while the remaining SMs sit idle.

Slide 6: Transposed Matrix-Vector Multiplication (1M x 4). With a 1M x 4 input, 1,000,000 blocks are created: 125,000 blocks per SM, causing heavy contention for each SM's resources.

Slide 7: GPU Programming Challenge (Portability). Achieving the fastest matrix-vector multiplication for any GPU and any input size requires a separate hand-tuned source file for every (architecture, input size) pair:

GTX 285 (240 cores): GTX285_MV_4_1M.cu, GTX285_MV_128_32K.cu, GTX285_MV_32K_128.cu, GTX285_MV_1M_4.cu
GTX 580 (512 cores): GTX580_MV_4_1M.cu, GTX580_MV_128_32K.cu, GTX580_MV_32K_128.cu, GTX580_MV_1M_4.cu
GTX 680: GTX680_MV_4_1M.cu, GTX680_MV_128_32K.cu, GTX680_MV_32K_128.cu, GTX680_MV_1M_4.cu

Slide 8: Adaptic. Adaptive input-aware compilation for GPUs: device-portable and input-portable, so programmers can focus on the algorithm without worrying about low-level details. Built on a streaming language (e.g., StreamIt), which offers a higher level of abstraction and separates memory access from the algorithm.

Slide 9: StreamIt. A higher level of abstraction that decouples computation from memory accesses; coarse-grained, exposed parallelism and exposed communication; streaming actors communicate through buffers. Much recent work extends the portability of streaming applications.

Slide 10: Compilation Flow in Adaptic. Offline compilation takes StreamIt code, the target GPU, and the input range. After input-unaware optimization, a performance model drives three input-aware optimization passes, producing several CUDA kernels, each specialized for part of the input range (Kernel 0 for the smallest inputs through Kernel 3 for the largest). At run time, the input size determines which kernel is launched.
- Memory access optimization. Why: global memory accesses have large latency. Optimizations: memory restructuring, coalesced access, neighboring access, data reuse.
- Actor segmentation. Splits actors so that more blocks are generated, alleviating resource under-utilization. Optimizations: stream reduction, intra-actor parallelization.
- Actor integration. Merges several actors into one, alleviating high resource contention. Optimizations: vertical integration, horizontal integration.

Slides 11-12: Memory Optimization. Global memory has large access latency, and when threads do not access words in sequence, accesses cannot be coalesced. Notation: A[i, j] means actor A pops i values and pushes j values per firing. For A[4,4], each of threads 1 ~ 4 reads its own four consecutive words, so at every step adjacent threads touch words four apart in global memory and no coalescing occurs.


Slide 13: Actor Segmentation. For the 4 x 1M transposed matrix-vector multiplication, each actor's work is split across many blocks (Block 0 ~ 31 for the first row, Block 32 ~ 63, Block 64 ~ 95, Block 96 ~ ... for the rest), generating enough blocks to keep every SM busy.

Slide 14: Actor Integration. Merges several actors or threads to balance the threads' workloads. Vertical integration reduces off-chip memory traffic by keeping intermediate results in shared memory; horizontal integration reduces synchronization overhead and lets the merged actors share instructions.

Slide 15: Experimental Setup. CPU: Intel Xeon X5650. GPUs: NVIDIA Tesla C2050 (3 GB GDDR5) and NVIDIA GTX 285 (2 GB GDDR3). Benchmarks: CUBLAS library 3.2 and the NVIDIA SDK.

Slide 16: Results (Matrix-Vector Multiplication). (Chart.)

Slide 17: Results (Speedup). (Chart: speedup across input sizes.)

Slide 18: Results (BiCGSTAB). (Chart: Adaptic vs. the input-unaware version.)

Slide 19: Summary. GPU performance depends on both the GPU model and the input, but the CUDA/OpenCL programming model lacks architecture and input portability, and scientific applications use irregular inputs, making optimized performance hard to obtain. Adaptic is architecture- and input-portable via a streaming language, and shows speedups over CUBLAS and the NVIDIA SDK across a range of input sizes.

Slide 20: Q & A.