Image Convolution with CUDA


Image Convolution with CUDA (Victor Podlozhnyuk)
2017.03.23, Whiyoung Jung, Dept. of Electrical Engineering, KAIST

Contents
- Introduction
  - Image Convolution
  - Convolution with Shared Memory
- Optimizing for Memory Usage
  - Shared Memory Usage
  - Memory Coalescence
- Performance Comparison

Introduction

Image Convolution
[Figure: a neighborhood of the image is multiplied elementwise by the filter kernel (an n × m matrix) and all the products are summed to produce one output pixel; the example uses the 2 × 2 kernel [-1 1; -2 2].]
Direct 2D convolution therefore costs n × m multiplications per pixel.
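For concreteness, here is a minimal sketch of a direct convolution CUDA kernel that reads straight from global memory (no shared memory) with one thread per output pixel. The kernel name, the square filter, and the border clamping are illustrative assumptions, not the slides' or the CUDA sample's actual code; it simply makes the n × m (here (2r+1) × (2r+1)) multiplications per pixel explicit.

// Direct 2D convolution: one thread per output pixel, all reads from global memory.
#define RADIUS 1   // assumed 3 x 3 filter: (2*RADIUS + 1)^2 = 9 multiplications per pixel

__constant__ float d_Kernel2D[(2 * RADIUS + 1) * (2 * RADIUS + 1)];   // filter taps

__global__ void convolution2D_naive(float *d_Result, const float *d_Data,
                                    int dataW, int dataH)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dataW || y >= dataH) return;

    float sum = 0.0f;
    for (int ky = -RADIUS; ky <= RADIUS; ++ky)
        for (int kx = -RADIUS; kx <= RADIUS; ++kx) {
            const int sy = min(max(y + ky, 0), dataH - 1);   // clamp reads at the image border
            const int sx = min(max(x + kx, 0), dataW - 1);
            sum += d_Data[sy * dataW + sx]
                 * d_Kernel2D[(ky + RADIUS) * (2 * RADIUS + 1) + (kx + RADIUS)];
        }
    d_Result[y * dataW + x] = sum;   // one output pixel per thread
}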

Image Convolution: Separable Filter
[Figure: a separable filter kernel factors into a column filter times a row filter; in the slide's example, the column filter [1; 2] and the row filter [-1 1] give the 2 × 2 kernel [-1 1; -2 2].]

Image Convolution: Column Pass
[Figure: the image is first convolved with the column filter (an n × 1 matrix), producing an intermediate image.]
The column pass costs n multiplications per pixel.

Image Convolution: Row Pass
[Figure: the intermediate image is then convolved with the row filter (a 1 × m matrix).]
The row pass costs m multiplications per pixel, so the separable scheme needs (n + m) multiplications per pixel in total, instead of n × m for direct 2D convolution.
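The two-pass scheme is easiest to see on the CPU. The sketch below is illustrative (it is not the code of the sample's CPU reference convolutionSeparable_gold.cpp): a column pass with the n × 1 filter, then a row pass with the 1 × m filter over the intermediate image, so each output pixel costs n + m multiplications instead of n × m.

#include <algorithm>

// Separable convolution on the CPU: column pass (n multiplications per pixel),
// then row pass (m more). Out-of-image reads are clamped to the border.
void convolutionSeparableCPU(const float *in, float *tmp, float *out,
                             const float *colKernel, int n,   // n x 1 column filter
                             const float *rowKernel, int m,   // 1 x m row filter
                             int imageW, int imageH)
{
    const int rN = n / 2, rM = m / 2;

    // Column pass: in -> tmp
    for (int y = 0; y < imageH; ++y)
        for (int x = 0; x < imageW; ++x) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k) {
                const int sy = std::min(std::max(y + k - rN, 0), imageH - 1);
                sum += in[sy * imageW + x] * colKernel[k];
            }
            tmp[y * imageW + x] = sum;
        }

    // Row pass: tmp -> out
    for (int y = 0; y < imageH; ++y)
        for (int x = 0; x < imageW; ++x) {
            float sum = 0.0f;
            for (int k = 0; k < m; ++k) {
                const int sx = std::min(std::max(x + k - rM, 0), imageW - 1);
                sum += tmp[y * imageW + sx] * rowKernel[k];
            }
            out[y * imageW + x] = sum;
        }
}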

Convolution using Shared Memory
[Figure: each thread block computes one tile of output pixels. It stores into its shared memory the image pixels of that tile together with the surrounding apron pixels, which the convolution filter reaches when it is centered near the tile border.]

Convolution using Shared Memory
[Figure: for a separable filter the apron is one-dimensional. The row-convolution pass stores a horizontal strip of image pixels plus a left and right apron into the block's shared memory; the column-convolution pass stores a vertical strip plus a top and bottom apron.]
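The later slides refer to this geometry by the names KERNEL_RADIUS, ROW_TILE_W, COLUMN_TILE_W and so on. The sketch below fixes illustrative values for these constants (the concrete numbers are assumptions, not necessarily those of the CUDA sample) and records how much shared memory each pass needs.

// Illustrative tile geometry for the separable passes.
#define KERNEL_RADIUS          8   // filter half-width; filter length = 2*8 + 1 = 17
#define KERNEL_RADIUS_ALIGNED 16   // KERNEL_RADIUS rounded up to the alignment unit (16 floats)
#define ROW_TILE_W           128   // output pixels per block in the row pass
#define COLUMN_TILE_W         16   // tile width (= blockDim.x) in the column pass
#define COLUMN_TILE_H         48   // tile height in the column pass

// The filter taps live in constant memory: read-only and broadcast to all threads.
__constant__ float d_Kernel[2 * KERNEL_RADIUS + 1];

// Shared memory per block (the arrays themselves are declared inside the kernels):
//   row pass:    KERNEL_RADIUS + ROW_TILE_W + KERNEL_RADIUS = 144 floats  (one row segment + apron)
//   column pass: COLUMN_TILE_W * (KERNEL_RADIUS + COLUMN_TILE_H + KERNEL_RADIUS) = 1024 floats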

Focus of this Talk
- How to use shared memory for a separable filter
- Comparing image convolution with and without shared memory

Optimizing for Memory Usage

Optimizing Memory Usage
- Total number of pixels loaded into shared memory
- Aligned and coalesced global memory access

Shared Memory Usage
Maximize the number of image pixels and minimize the number of apron pixels loaded per block, i.e. minimize the ratio #apron pixels / #image pixels. For example, with KERNEL_RADIUS = 8 a row tile of 128 image pixels loads 16 apron pixels (ratio 1/8), while a tile of 32 image pixels loads the same 16 apron pixels (ratio 1/2), so wider tiles waste less bandwidth on the apron.

Shared Memory Usage
A larger tile means more pixels must be loaded by each thread block than it has threads, so each thread must load, and later process, more than one pixel; a minimal sketch of this pattern follows below.
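A minimal, self-contained sketch of that pattern (illustrative names and sizes; the real kernels later in this deck use the same idea): each thread strides over the tile in steps of blockDim.x, so a block of, say, 64 threads fills and processes a 192-pixel tile.

#define TILE_W 192   // assumed tile width, larger than the thread block

__global__ void loadWideTile(const float *d_In, float *d_Out)
{
    __shared__ float tile[TILE_W];
    const int tileStart = blockIdx.x * TILE_W;

    // Cooperative load: with 64 threads, each thread loads 192 / 64 = 3 pixels.
    for (int i = threadIdx.x; i < TILE_W; i += blockDim.x)
        tile[i] = d_In[tileStart + i];
    __syncthreads();   // whole tile visible to every thread before it is used

    // Each thread likewise produces several outputs (a real kernel would convolve here).
    for (int i = threadIdx.x; i < TILE_W; i += blockDim.x)
        d_Out[tileStart + i] = tile[i];
}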

Memory Coalescence
Coalesced access to global memory (row convolution): the width of the image tile can be chosen freely, but the width of the apron is fixed by the kernel radius. Since each block's loads start KERNEL_RADIUS pixels before its tile, the memory accesses of a warp may not be aligned.
[Figure: row-pass layout KERNEL_RADIUS | ROW_TILE_W | KERNEL_RADIUS.]

Memory Coalescence
Coalesced access to global memory (row convolution): put additional threads on the leading edge of the processing tile so that the thread with threadIdx.x == 0 always reads from an aligned address; then every warp accesses global memory at an aligned address. The padding is KERNEL_RADIUS_ALIGNED, the kernel radius rounded up to the alignment boundary, so the block is launched with KERNEL_RADIUS_ALIGNED + ROW_TILE_W + KERNEL_RADIUS threads (blockDim.x). This indexing appears in the row-pass kernel sketch after the algorithm slides below.
[Figure: loads start KERNEL_RADIUS_ALIGNED pixels before the tile; blockDim.x spans KERNEL_RADIUS_ALIGNED + ROW_TILE_W + KERNEL_RADIUS pixels.]

Memory Coalescence
Coalesced access to global memory (column convolution): choose a proper width of the tile (COLUMN_TILE_W = blockDim.x, a multiple of the coalescing granularity) so that every warp accesses global memory at an aligned address. Because the apron lies above and below the tile rather than to its left, no additional threads are needed for alignment.
[Figure: column-pass tile of width COLUMN_TILE_W (= blockDim.x) with KERNEL_RADIUS apron rows above and below.]
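A sketch of a column-pass kernel built on this layout, under the tile sizes assumed earlier and launched with blockDim = (COLUMN_TILE_W, 16); it is a minimal illustration, not the CUDA sample's actual code. Consecutive threads of a warp read consecutive pixels of one image row, and each row segment starts at an aligned address because COLUMN_TILE_W is a multiple of 16, so no padding threads are required.

#define KERNEL_RADIUS  8
#define COLUMN_TILE_W 16   // = blockDim.x, a multiple of the 16-float coalescing unit
#define COLUMN_TILE_H 48

__constant__ float d_Kernel[2 * KERNEL_RADIUS + 1];

// Column pass. Launch with blockDim = (COLUMN_TILE_W, 16) and
// gridDim = (dataW / COLUMN_TILE_W, dataH / COLUMN_TILE_H); dataW is assumed to be
// a multiple of COLUMN_TILE_W so every row segment starts at an aligned address.
__global__ void convolutionColumnGPU(float *d_Result, const float *d_Data,
                                     int dataW, int dataH)
{
    // Tile plus the apron rows above and below it.
    __shared__ float data[COLUMN_TILE_W * (KERNEL_RADIUS + COLUMN_TILE_H + KERNEL_RADIUS)];

    const int tileStartY  = blockIdx.y * COLUMN_TILE_H;   // first output row of the tile
    const int apronStartY = tileStartY - KERNEL_RADIUS;   // first loaded row (may be negative)
    const int x           = blockIdx.x * COLUMN_TILE_W + threadIdx.x;   // this thread's column

    // Coalesced, aligned loads: each warp reads one contiguous row segment; each thread
    // walks down its own column, blockDim.y rows at a time (4 loads per thread here).
    for (int y = apronStartY + threadIdx.y; y < tileStartY + COLUMN_TILE_H + KERNEL_RADIUS;
         y += blockDim.y) {
        const int smemPos = (y - apronStartY) * COLUMN_TILE_W + threadIdx.x;
        data[smemPos] = (y >= 0 && y < dataH) ? d_Data[y * dataW + x] : 0.0f;
    }
    __syncthreads();

    // Each thread computes COLUMN_TILE_H / blockDim.y = 3 output pixels of its column.
    for (int y = tileStartY + threadIdx.y; y < tileStartY + COLUMN_TILE_H; y += blockDim.y) {
        if (y >= dataH) continue;
        float sum = 0.0f;
        for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; ++k)
            sum += data[(y - apronStartY + k) * COLUMN_TILE_W + threadIdx.x]
                 * d_Kernel[KERNEL_RADIUS + k];
        d_Result[y * dataW + x] = sum;
    }
}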

Row Convolution Algorithm
1. Load the main data of the tile into shared memory.
2. Load the apron data on the left and right sides.

Row Convolution Algorithm
3. Do the row convolution and save the result to global memory.
A kernel sketch combining these steps is given below.
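The sketch below puts the three steps together with the aligned-load indexing from the memory coalescence slide. It is a minimal illustration under the tile sizes assumed earlier, launched with blockDim.x = KERNEL_RADIUS_ALIGNED + ROW_TILE_W + KERNEL_RADIUS = 152 threads and gridDim = (dataW / ROW_TILE_W, dataH); it is not the CUDA sample's actual code.

#define KERNEL_RADIUS          8
#define KERNEL_RADIUS_ALIGNED 16   // KERNEL_RADIUS rounded up to the alignment unit
#define ROW_TILE_W           128

__constant__ float d_Kernel[2 * KERNEL_RADIUS + 1];

// Row pass; dataW is assumed to be a multiple of ROW_TILE_W.
__global__ void convolutionRowGPU(float *d_Result, const float *d_Data,
                                  int dataW, int dataH)
{
    // One row segment plus its left and right apron.
    __shared__ float data[KERNEL_RADIUS + ROW_TILE_W + KERNEL_RADIUS];

    const int rowStart   = blockIdx.y * dataW;        // first pixel of this image row
    const int tileStart  = blockIdx.x * ROW_TILE_W;   // first output pixel of this tile
    const int apronStart = tileStart - KERNEL_RADIUS; // first pixel the tile actually needs

    // Steps 1 and 2: load main data and apron data. Loads start KERNEL_RADIUS_ALIGNED
    // (not KERNEL_RADIUS) pixels before the tile, so thread 0 reads an aligned address
    // and every warp's loads are aligned; the leading threads exist only for this padding.
    const int loadPos = (tileStart - KERNEL_RADIUS_ALIGNED) + threadIdx.x;
    if (loadPos >= apronStart) {
        const int smemPos = loadPos - apronStart;
        data[smemPos] = (loadPos >= 0 && loadPos < dataW) ? d_Data[rowStart + loadPos] : 0.0f;
    }
    __syncthreads();   // the whole tile and apron must be in shared memory before convolving

    // Step 3: do the row convolution and save. ROW_TILE_W threads each write one pixel.
    const int writePos = tileStart + (threadIdx.x - KERNEL_RADIUS_ALIGNED);
    if (threadIdx.x >= KERNEL_RADIUS_ALIGNED &&
        threadIdx.x <  KERNEL_RADIUS_ALIGNED + ROW_TILE_W) {
        const int smemPos = writePos - apronStart;   // this pixel's position in shared memory
        float sum = 0.0f;
        for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; ++k)
            sum += data[smemPos + k] * d_Kernel[KERNEL_RADIUS + k];
        d_Result[rowStart + writePos] = sum;
    }
}

Compared with the direct kernel sketched in the introduction, each input pixel is fetched from global memory once per pass and then reused up to 2 * KERNEL_RADIUS + 1 times from shared memory.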

Source Code (in the CUDA sample)
- convolutionSeparable.cu: convolution with shared memory
- convolutionSeparable_kernel.cu: convolution without shared memory
- convolutionSeparable_gold.cpp: convolution on the CPU

Performance Comparison
[Chart: measured runtimes of the convolution with shared memory vs. without shared memory.]

Conclusion
- Using shared memory improves performance.
- The memory coalescing technique can improve performance further.

Thank you