Fast and parallel implementation of Image Processing Algorithm using CUDA Technology On GPU Hardware Neha Patil Badrinath Roysam Department of Electrical.

Slides:

Advertisements

Similar presentations

Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.

Advertisements

GPU History CUDA. Graphics Pipeline Elements 1. A scene description: vertices, triangles, colors, lighting 2.Transformations that map the scene to a camera.

Instructor Notes We describe motivation for talking about underlying device architecture because device architecture is often avoided in conventional.

Exploiting Graphics Processors for High- performance IP Lookup in Software Routers Author: Jin Zhao, Xinya Zhang, Xin Wang, Yangdong Deng, Xiaoming Fu.

Acceleration of the Smith– Waterman algorithm using single and multiple graphics processors Author : Ali Khajeh-Saeed, Stephen Poole, J. Blair Perot. Publisher:

1 Threading Hardware in G80. 2 Sources Slides by ECE 498 AL : Programming Massively Parallel Processors : Wen-Mei Hwu John Nickolls, NVIDIA.

Programming with CUDA, WS09 Waqar Saleem, Jens Müller Programming with CUDA and Parallel Algorithms Waqar Saleem Jens Müller.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Chapter.

Name: Kaiyong Zhao Supervisor: Dr. X. -W Chu. Background & Related Work Multiple-Precision Integer GPU Computing & CUDA Multiple-Precision Arithmetic.

1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 19, 2011 Emergence of GPU systems and clusters for general purpose High Performance Computing.

CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.

Gnort: High Performance Intrusion Detection Using Graphics Processors Giorgos Vasiliadis, Spiros Antonatos, Michalis Polychronakis, Evangelos Markatos,

1 Threading Hardware in G80. 2 Sources Slides by ECE 498 AL : Programming Massively Parallel Processors : Wen-Mei Hwu John Nickolls, NVIDIA.

Accelerating Machine Learning Applications on Graphics Processors Narayanan Sundaram and Bryan Catanzaro Presented by Narayanan Sundaram.

University of Michigan Electrical Engineering and Computer Science Amir Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, and Scott Mahlke Sponge: Portable.

Introduction What is GPU? It is a processor optimized for 2D/3D graphics, video, visual computing, and display. It is highly parallel, highly multithreaded.

GPU Graphics Processing Unit. Graphics Pipeline Scene Transformations Lighting & Shading ViewingTransformations Rasterization GPUs evolved as hardware.

GPGPU overview. Graphics Processing Unit (GPU) GPU is the chip in computer video cards, PS3, Xbox, etc – Designed to realize the 3D graphics pipeline.

To GPU Synchronize or Not GPU Synchronize? Wu-chun Feng and Shucai Xiao Department of Computer Science, Department of Electrical and Computer Engineering,

Shekoofeh Azizi Spring  CUDA is a parallel computing platform and programming model invented by NVIDIA  With CUDA, you can send C, C++ and Fortran.

Motivation “Every three minutes a woman is diagnosed with Breast cancer” (American Cancer Society, “Detailed Guide: Breast Cancer,” 2006) Explore the use.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 7: Threading Hardware in G80.

CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA

Shared memory systems. What is a shared memory system Single memory space accessible to the programmer Processor communicate through the network to the.

Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS Spring 2012.

Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2012.

CUDA Advanced Memory Usage and Optimization Yukai Hung Department of Mathematics National Taiwan University Yukai Hung

MS Thesis Defense “IMPROVING GPU PERFORMANCE BY REGROUPING CPU-MEMORY DATA” by Deepthi Gummadi CoE EECS Department April 21, 2014.

General Purpose Computing on Graphics Processing Units: Optimization Strategy Henry Au Space and Naval Warfare Center Pacific 09/12/12.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lecture 3: The CUDA Memory Model.

Programming Concepts in GPU Computing Dušan Gajić, University of Niš Programming Concepts in GPU Computing Dušan B. Gajić CIITLab, Dept. of Computer Science.

Applying GPU and POSIX Thread Technologies in Massive Remote Sensing Image Data Processing By: Group 17 King Mongkut's Institute of Technology Ladkrabang.

Multiprocessing. Going Multi-core Helps Energy Efficiency William Holt, HOT Chips 2005 Adapted from UC Berkeley "The Beauty and Joy of Computing"

NVIDIA Fermi Architecture Patrick Cozzi University of Pennsylvania CIS Spring 2011.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 CS 395 Winter 2014 Lecture 17 Introduction to Accelerator.

GPU Architecture and Programming

Multi-Core Development Kyle Anderson. Overview History Pollack’s Law Moore’s Law CPU GPU OpenCL CUDA Parallelism.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 8: Threading Hardware in G80.

Introduction What is GPU? It is a processor optimized for 2D/3D graphics, video, visual computing, and display. It is highly parallel, highly multithreaded.

Jie Chen. 30 Multi-Processors each contains 8 cores at 1.4 GHz 4GB GDDR3 memory offers ~100GB/s memory bandwidth.

Some key aspects of NVIDIA GPUs and CUDA. Silicon Usage.

ICAL GPU 架構中所提供分散式運算之功能與限制. 11/17/09ICAL2 Outline Parallel computing with GPU NVIDIA CUDA SVD matrix computation Conclusion.

By Dirk Hekhuis Advisors Dr. Greg Wolffe Dr. Christian Trefftz.

1)Leverage raw computational power of GPU  Magnitude performance gains possible.

OpenCL Joseph Kider University of Pennsylvania CIS Fall 2011.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Lecture.

Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2014.

Computer Architecture Lecture 24 Parallel Processing Ralph Grishman November 2015 NYU.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 GPU.

GPGPU introduction. Why is GPU in the picture Seeking exa-scale computing platform Minimize power per operation. – Power is directly correlated to the.

My Coordinates Office EM G.27 contact time:

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1.

Processor Level Parallelism 2. How We Got Here Developments in PC CPUs.

CS 179: GPU Computing LECTURE 2: MORE BASICS. Recap Can use GPU to solve highly parallelizable problems Straightforward extension to C++ ◦Separate CUDA.

Computer Engg, IIT(BHU)

CS427 Multicore Architecture and Parallel Computing

Graphics on GPU © David Kirk/NVIDIA and Wen-mei W. Hwu,

Basic CUDA Programming

Lecture 2: Intro to the simd lifestyle and GPU internals

ECE 498AL Spring 2010 Lectures 8: Threading & Memory Hardware in G80

Mattan Erez The University of Texas at Austin

NVIDIA Fermi Architecture

Mattan Erez The University of Texas at Austin

Mattan Erez The University of Texas at Austin

Graphics Processing Unit

6- General Purpose GPU Programming

CIS 6930: Chip Multiprocessor: GPU Architecture and Programming

CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming

CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming

Presentation transcript:

Fast and parallel implementation of Image Processing Algorithm using CUDA Technology On GPU Hardware Neha Patil Badrinath Roysam Department of Electrical & Computer and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY ABSTRACT CUDA ("Compute Unified Device Architecture") is high level language for GPU programming. GPU Computing with CUDA on the GeForce 8 series is a new approach to computing where hundreds of on-chip processors simultaneously communicate and cooperate to solve complex computing problems up to 100 times faster than traditional approaches. A CUDA-enabled GPU operates as either a flexible thread processor, where thousands of computing programs called threads work together to solve complex problems, or as a streaming processor in specific applications such as imaging where threads do not communicate. CUDA-enabled applications use the GPU for fine grained data-intensive processing, and the multi-core CPUs for complicated coarse grained tasks such as control and data management. We use it here for Image processing algorithms like smoothing to achieve a faster implementation of it. It is well suited to address problems that can be expressed as data- parallel computations – the same program is executed on many data elements in parallel. EXPERIMENTAL RESULTS CUDA on G80 CUDA stands for Compute Unified Device Architecture and is a new hardware and software architecture for issuing and managing computations on the GPU as a data-parallel computing device without the need of mapping them to a graphics API. GPU AS DATA-PARALLEL COMPUTING DEVICE REFERENCES [1] NVIDIA CUDA website : GPU devotes more transistors to data processing. Same program is executed on many data elements in parallel. G80 THREAD COMPUTING PIPELINE TECHNICAL SPECIFICATIONS Maximum number of threads per block: 512 Maximum size of each dimension of a grid: 65,535 Number of streaming multiprocessors (SM): 675 MHz Device memory: 768 MB Shared memory per multiprocessor: 16KB divided in 16 banks Constant memory: 64 KB Warp size: 32 threads (16 Warps/Block) L2 FB SP L1 TF Thread Processor Vtx Thread Issue Setup / Rstr / ZCull Geom Thread IssuePixel Thread Issue Input Assembler Host SP L1 TF SP L1 TF SP L1 TF SP L1 TF SP L1 TF SP L1 TF SP L1 TF L2 FB L2 FB L2 FB L2 FB L2 FB DRAM ALU Control Cache ALU... d0d0 d1d1 d2d2 d3d3 ALU Control Cache ALU... d4d4 d5d5 d6d6 d7d7 … … No Scatter: Each shader can write only to one pixel Only gather: Can read data from other pixels DRAM ALU Shared memory Control Cache ALU... d0d0 d1d1 d2d2 d3d3 d0d0 d1d1 d2d2 d3d3 ALU Shared memory Control Cache ALU... d4d4 d5d5 d6d6 d7d7 d4d4 d5d5 d6d6 d7d7 … … The execution time using CUDA is almost 100 times faster. The capability to achieve faster speed depends upon parallelism in the program and proper parameters like block size. Currently only one kernel can run at time on the card, efforts are been made to run more than one kernel simultaneously to achieve more parallelism. Supports only single precision (32 bits). PROGRAMMING MODEL HARDWARE MODEL A set of SIMD multiprocessor with on-hip shared memory A thread has access to the device’s DRAM and on-chip memory through a set of memory spaces of various scopes The host issues a succesion of kernel invocations to the device. Each kernel is executed as a batch of threads organized as a grid of thread blocks. To measure the performance of CUDA, a code for mean filtering was written in CUDA and C++ and the execution time in both cases was found out. Mean Filtering : Center pixel of a block is average of the neighborhood pixels. Input Image and the mean filtered image. Flowchart for a CUDA program. Load Image from the disk Configure block and thread counts Allocate global memory Copy Image to GPU Call the kernel Copy output data back to the CPU Execution Time CONCLUSION It can read and write data at any location in DRAM just like in CPU. It also has a fast op-chip memory with very fast read and write access. Kernel "This work was supported in part by Gordon-CenSSIS, the Bernard M. Gordon Center for Subsurface Sensing and Imaging Systems, under the Engineering Research Centers Program of the National Science Foundation (Award Number EEC )." Badrinath Roysam, Professor Dept. of Electrical, Computer, and Systems Engineering Rensselaer Polytechnic Institute, 110 8th Street, Troy, NY Phone: (518) ; Fax: ; CONTACT INFORMATION