Blocked 2D Convolution Ravi Sankar P Nair 010469036.



Implement 2D Convolution Source: http://www.songho.ca/dsp/convolution/convolution2d_example.html
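A minimal CPU sketch of the 2D convolution this slide asks for, written in C++. The names N (input), M (mask) and P (output) follow the deck; the `Matrix` helper, the `convolve2d` function name, and the zero-padding at the borders are assumptions made for illustration, not the author's code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: a row-major matrix stored in a flat vector.
struct Matrix {
    int rows;
    int cols;
    std::vector<float> data;
    Matrix(int r, int c)
        : rows(r), cols(c), data(static_cast<std::size_t>(r) * c, 0.0f) {}
    float& at(int r, int c) { return data[static_cast<std::size_t>(r) * cols + c]; }
    float at(int r, int c) const { return data[static_cast<std::size_t>(r) * cols + c]; }
};

// P = N convolved with M, with P the same size as N.
// Assumptions: M has odd dimensions and is centered on each output element;
// input elements that fall outside N count as zero.
Matrix convolve2d(const Matrix& N, const Matrix& M) {
    Matrix P(N.rows, N.cols);
    const int cy = M.rows / 2;  // mask center row
    const int cx = M.cols / 2;  // mask center column
    for (int r = 0; r < N.rows; ++r) {
        for (int c = 0; c < N.cols; ++c) {
            float sum = 0.0f;
            for (int i = 0; i < M.rows; ++i) {
                for (int j = 0; j < M.cols; ++j) {
                    // Convolution flips the mask: M(i, j) pairs with
                    // N(r + cy - i, c + cx - j).
                    const int nr = r + cy - i;
                    const int nc = c + cx - j;
                    if (nr >= 0 && nr < N.rows && nc >= 0 && nc < N.cols) {
                        sum += M.at(i, j) * N.at(nr, nc);
                    }
                }
            }
            P.at(r, c) = sum;
        }
    }
    return P;
}
```

Each output element depends only on N and M, never on other elements of P, which is what makes the loop nest a natural fit for one GPU thread per output element.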

Implement 2D Convolution.cpp as a GPU kernel. Use constant memory to store the M matrix.


Performance Testing CPU vs. GPU
1. What is the measured floating-point computation rate for the CPU and GPU kernels on this application? How do they each scale with the size of the input?
Timing uses #include <sys/time.h>
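One way to turn a measured time into a floating-point rate is sketched below in C++, using `gettimeofday` from `<sys/time.h>`. The function names are hypothetical, and the operation count assumes one multiply and one add per mask element per output element; the deck does not state the mask size or how operations were counted.

```cpp
#include <cassert>
#include <sys/time.h>

// Wall-clock time in microseconds from gettimeofday (POSIX).
long long now_usec() {
    timeval tv;
    gettimeofday(&tv, nullptr);
    return static_cast<long long>(tv.tv_sec) * 1000000LL + tv.tv_usec;
}

// Runs `work` once and returns the elapsed wall-clock time in microseconds.
template <typename F>
long long time_usec(F&& work) {
    const long long start = now_usec();
    work();
    return now_usec() - start;
}

// Floating-point operations for one full convolution.
// Assumption: 2 operations (multiply + add) per mask element per output element.
double convolution_flops(int rows, int cols, int mask_rows, int mask_cols) {
    return 2.0 * rows * cols * mask_rows * mask_cols;
}

// Rate in GFLOP/s: flops / (elapsed_usec * 1e-6 s) / 1e9.
double gflops(double flops, long long elapsed_usec) {
    return flops / (static_cast<double>(elapsed_usec) * 1e3);
}
```

When the timed work is a GPU kernel, the launch returns before the kernel finishes, so the host needs a synchronization call (for example `cudaDeviceSynchronize`) inside the timed region before the stop time is read.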

Performance Testing CPU vs. GPU: alternate timer method


Performance Testing CPU vs. GPU
2. How much time is spent as an overhead cost of using the GPU for computation? Consider all code executed within your host function, with the exception of the kernel itself, as overhead. How does the overhead scale with the size of the input?

Performance Testing CPU vs. GPU
Table shows values in microseconds. Run on GTX 480 (pacman.ddns.uark.edu).
Total Setup = Setup M,N + Setup GPU call
Overhead GPU (OvH GPU) = Setup GPU call - GPU kernel
Overhead Setup (OvH Setup) = Total Setup - GPU kernel
Overhead Main (OvH Main) = Total Main program - GPU kernel
N and P    Total Main  Setup/read M,N  Setup GPU call  Total Setup  CPU Kernel  GPU Kernel  OvH GPU  OvH Setup  OvH Main
281x80          66139            1692           62634        64326        1354          50    62584      64276     66089
32x32           71080             907           70042        70949          57          36    70006      70913     71044
64x64           72355            3075           68917        71992         238          38    68879      71954     72317
128x128         85528           10438           73781        84219         985          47    73734      84172     85481
256x256        116027           29614           81206       110820        3975          82    81124     110738    115945
512x512        206072          105901           79420       185321       15977         224    79196     185097    205848
1024x1024      572661          411698           78001       489699       64130         844    77157     488855    571817
2048x2048     2061625         1644657           85603      1730260      256114        3089    82514    1727171   2058536
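The overhead definitions above are plain subtractions, which the following C++ sketch spells out. The struct and function names are hypothetical. The formulas imply that the "Setup GPU Function call" time includes the kernel itself, which is why the kernel time is subtracted to obtain the overhead.

```cpp
#include <cassert>

// One row of measurements, all in microseconds (field names are illustrative).
struct Timings {
    long long total_main;      // Total Main Program
    long long setup_files;     // Setup/read M,N files
    long long setup_gpu_call;  // Setup GPU Function call (kernel included)
    long long cpu_kernel;      // CPU Kernel
    long long gpu_kernel;      // GPU Kernel
};

struct Overheads {
    long long total_setup;   // Setup M,N + Setup GPU call
    long long gpu;           // Setup GPU call - GPU kernel
    long long setup;         // Total Setup - GPU kernel
    long long main_program;  // Total Main program - GPU kernel
};

Overheads compute_overheads(const Timings& t) {
    Overheads o;
    o.total_setup = t.setup_files + t.setup_gpu_call;
    o.gpu = t.setup_gpu_call - t.gpu_kernel;
    o.setup = o.total_setup - t.gpu_kernel;
    o.main_program = t.total_main - t.gpu_kernel;
    return o;
}
```

Feeding in the 281x80 row from the GTX 480 run (66139, 1692, 62634, 1354, 50) reproduces that row's derived columns: 64326, 62584, 64276 and 66089.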

Performance Testing CPU vs. GPU
Table shows values in microseconds. Run on GTX 480 (pacman.ddns.uark.edu), alternate timer.
Total Setup = Setup M,N + Setup GPU call
Overhead GPU (OvH GPU) = Setup GPU call - GPU kernel
Overhead Setup (OvH Setup) = Total Setup - GPU kernel
Overhead Main (OvH Main) = Total Main program - GPU kernel
N and P    Total Main  Setup/read M,N  Setup GPU call  Total Setup  CPU Kernel  GPU Kernel  OvH GPU  OvH Setup  OvH Main
281x80          66214            1681           62724        64405        1355          50    62674      64355     66164
32x32           82312             907           81274        82181          57          36    81238      82145     82276
64x64           70401            3087           66953        70040         236          38    66915      70002     70363
128x128         86663           10449           74909        85358         982          47    74862      85311     86616
256x256        115126           29564           80363       109927        3973          83    80280     109844    115043
512x512        204261          105868           77645       183513       15990         221    77424     183292    204040
1024x1024      578057          411822           83242       495064       64099         843    82399     494221    577214
2048x2048     2048527         1635106           81614      1716720      256660       3087*    78527    1713633   2045440
* The GPU Kernel value for 2048x2048 is missing from the transcript; 3087 is the value implied by all three overhead columns (for example, 81614 - 78527).

Performance Testing CPU vs. GPU Run on GTX 480 pacman.ddns.uark.edu

Performance Testing CPU vs. GPU
Table shows values in microseconds. Run on GTX 295 (stargate.uark.edu).
Total Setup = Setup M,N + Setup GPU call
Overhead GPU (OvH GPU) = Setup GPU call - GPU kernel
Overhead Setup (OvH Setup) = Total Setup - GPU kernel
Overhead Main (OvH Main) = Total Main program - GPU kernel
N and P    Total Main  Setup/read M,N  Setup GPU call  Total Setup  CPU Kernel  GPU Kernel  OvH GPU  OvH Setup  OvH Main
281x80        2796273            1335         2793075      2794410        1215          86  2792989    2794324   2796187
32x32         2820670            2127         2818379      2820506          69          64  2818315    2820442   2820606
64x64         2845781           45163         2800109      2845272         220          60  2800049    2845212   2845721
128x128       2876348           55181         2819790      2874971         875          76  2819714    2874895   2876272
256x256       2927615           91452         2831007      2922459        3459         157  2830850    2922302   2927458
512x512       3130441          275434         2834679      3110113       13832         455  2834224    3109658   3129986
1024x1024     3711026          811408         2818357      3629765       55499        1669  2816688    3628096   3709357
2048x2048     6261147         3072243         2842964      5915207      238949        6552  2836412    5908655   6254595
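To read the scaling questions off these measurements, two ratios are useful: the kernel-only speedup (CPU kernel time over GPU kernel time) and the growth factor of a timing between two input sizes. The C++ sketch below uses hypothetical function names and is only a convenience for interpreting the numbers, not part of the original program.

```cpp
#include <cassert>

// Kernel-only speedup: CPU kernel time divided by GPU kernel time.
// This ignores all host-side setup and transfer overhead.
double kernel_speedup(double cpu_kernel_usec, double gpu_kernel_usec) {
    return cpu_kernel_usec / gpu_kernel_usec;
}

// How much a timing grows between a smaller and a larger input.
// Going from 1024x1024 to 2048x2048 quadruples the element count,
// so a factor near 4 indicates time proportional to input size.
double growth_factor(double time_small_usec, double time_large_usec) {
    return time_large_usec / time_small_usec;
}
```

With the GTX 295 figures, `kernel_speedup(238949, 6552)` for 2048x2048 is roughly 36. From 1024x1024 to 2048x2048 the CPU kernel grows by about 4.3x and the GPU kernel by about 3.9x, while the "Setup GPU call" time (2818357 to 2842964) barely changes, which suggests the GPU overhead on that machine is dominated by a fixed cost rather than by input size.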

Performance Testing CPU vs. GPU Run on GTX 295 stargate.uark.edu