GPU Acceleration in ITK v4


GPU Acceleration in ITK v4
ITK v4 summer meeting, June 28, 2011
Won-Ki Jeong, Harvard University

Overview
- Introduction
- Current status
- Examples
- Future work

GPU Acceleration
- GPU as a fast co-processor
  - Massively parallel
  - Huge speed up for certain types of problems
  - Physically independent system
- Problems
  - Memory management
  - Process management
  - Implementation

Goals
- Provide a high-level GPU abstraction
  - GPU resource management
  - Transparent to existing ITK code
  - Pipeline and object factory support
- Basic CMake setup
  - GPU module

Status
- 28 new GPU classes
  - GPU image
  - GPU manager classes
  - GPU filter base classes
- 6 example GPU image filters
- Gradient anisotropic diffusion
- Demons registration

Code Development
- GitHub (most recent version)
  - https://graphor@github.com/graphor/ITK.git
  - Branch: GPU-Alpha
- Gerrit
  - http://review.source.kitware.com/#change,1923
  - Waiting for review

CMake Setup
- Enabling the GPU module
  - ITK_USE_GPU
  - Module_ITK-GPUCommon
- OpenCL source files will be copied to
  - ${ITK_BINARY_DIR}/bin/OpenCL
  - ${CMAKE_CURRENT_BINARY_DIR}/OpenCL

Naming Convention
- File: itkGPU***, e.g. itkMeanImageFilter -> itkGPUMeanImageFilter
- Class: GPU***, e.g. MeanImageFilter -> GPUMeanImageFilter
- Method: e.g. GenerateData() -> GPUGenerateData()

GPU Core Classes
- GPUContextManager
  - Manages context and command queues
- GPUKernelManager
  - Load, compile, run GPU code
- GPUDataManager
  - Data container for the GPU
  - GPUImageDataManager
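A rough usage sketch of how these classes fit together when running a single kernel by hand; the kernel name, source file, and scalar argument are placeholders, and the method names follow the GPUMeanImageFilter example later in this deck rather than a verbatim API reference:

  // Compile an OpenCL source file and obtain a kernel handle.
  GPUKernelManager::Pointer kernelManager = GPUKernelManager::New();
  kernelManager->LoadProgramFromFile( "MyFilter.cl", "#define PIXELTYPE float\n" );
  int kernelHandle = kernelManager->CreateKernel( "MyKernel" );

  // Scalar arguments are bound with SetKernelArg(); image arguments are bound
  // through their GPUDataManager with SetKernelArgWithImage().
  int width = 256;
  kernelManager->SetKernelArg( kernelHandle, 0, sizeof(int), &width );

  // Launch with explicit global/local work sizes; the OpenCL context and
  // command queue come from the GPUContextManager behind the scenes.
  size_t globalSize[1] = { 256 };
  size_t localSize[1]  = { 64 };
  kernelManager->LaunchKernel( kernelHandle, 1, globalSize, localSize );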

GPU Image Class
- Derived from itk::Image
  - Compatible with existing ITK filters
- GPUImageDataManager as a member
  - Separates the GPU implementation from the Image class
  - Graft(const GPUDataManager *)
- Implicit (automatic) memory synchronization
  - Dirty flags
  - Time stamp (Modified())
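A minimal sketch of what the implicit synchronization looks like from user code; the header name and the exact synchronization points are assumptions based on the dirty-flag description above:

  #include "itkGPUImage.h"   // header name assumed

  typedef itk::GPUImage< float, 2 > GPUImageType;

  GPUImageType::Pointer image = GPUImageType::New();
  GPUImageType::RegionType region;
  GPUImageType::SizeType   size;
  size.Fill( 256 );
  region.SetSize( size );
  image->SetRegions( region );
  image->Allocate();

  // CPU-side writes use the normal itk::Image API and only mark the GPU
  // buffer dirty through the embedded GPUImageDataManager ...
  image->FillBuffer( 1.0f );

  // ... so when a GPU filter later reads this image the CPU->GPU copy happens
  // automatically, and reading pixels on the CPU afterwards triggers GPU->CPU.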

GPU Filter Classes
- GPUImageToImageFilter
  - GPUDiscreteGaussianImageFilter
  - GPUNeighborhoodOperatorImageFilter
  - GPUBoxImageFilter
    - GPUMeanImageFilter
  - GPUInPlaceImageFilter
    - GPUUnaryFunctorImageFilter
      - GPUBinaryThresholdImageFilter
    - GPUFiniteDifferenceImageFilter
      - GPUDenseFiniteDifferenceImageFilter
        - GPUAnisotropicDiffusionImageFilter
          - GPUGradientAnisotropicDiffusionImageFilter
        - GPUPDEDeformableRegistrationFilter
          - GPUDemonsRegistrationFilter

GPU Functor/Function Classes
- GPUFunctorBase
  - GPUBinaryThreshold
- GPUFiniteDifferenceFunction
  - GPUAnisotropicDiffusionFunction
    - GPUScalarAnisotropicDiffusionFunction
      - GPUGradientNDAnisotropicDiffusionFunction
  - GPUPDEDeformableRegistrationFunction
    - GPUDemonsRegistrationFunction

GPUImageToImageFilter
- Base class for GPU image filters
  - Extends existing ITK filters
- Turn the GPU filter on/off: IsGPUEnabled()
- GPU filter implementation: GPUGenerateData()

  template< class TInputImage, class TOutputImage, class TParentImageFilter >
  class ITK_EXPORT GPUImageToImageFilter : public TParentImageFilter
  {
    ...
  };

- CRTP: The Curiously Recurring Template Pattern
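A sketch of the CPU/GPU dispatch that this design enables (simplified, not a verbatim copy of the branch): because the CPU filter is the third template parameter, the GPU filter can fall back to the parent implementation when the GPU path is disabled.

  template< class TInputImage, class TOutputImage, class TParentImageFilter >
  void
  GPUImageToImageFilter< TInputImage, TOutputImage, TParentImageFilter >
  ::GenerateData()
  {
    if( !this->IsGPUEnabled() )
    {
      // fall back to the existing CPU filter passed as the template parameter
      Superclass::GenerateData();
    }
    else
    {
      // run the GPU implementation provided by the derived filter
      this->GPUGenerateData();
    }
  }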

GPUBinaryThresholdImageFilter
- Example of functor-based filter
  - GPUUnaryFunctorImageFilter
- GPU functor
  - Per-pixel operator
- SetGPUKernelArguments()
  - Set up GPU kernel arguments
  - Returns # of arguments that have been set

  template< class TInput, class TOutput >
  class GPUBinaryThreshold : public GPUFunctorBase
  {
  public:
    GPUBinaryThreshold()
    {
      m_LowerThreshold = NumericTraits< TInput >::NonpositiveMin();
      m_UpperThreshold = NumericTraits< TInput >::max();
      m_OutsideValue   = NumericTraits< TOutput >::Zero;
      m_InsideValue    = NumericTraits< TOutput >::max();
    }
    ....
    int SetGPUKernelArguments(GPUKernelManager::Pointer KernelManager, int KernelHandle)
    {
      KernelManager->SetKernelArg(KernelHandle, 0, sizeof(TInput),  &(m_LowerThreshold));
      KernelManager->SetKernelArg(KernelHandle, 1, sizeof(TInput),  &(m_UpperThreshold));
      KernelManager->SetKernelArg(KernelHandle, 2, sizeof(TOutput), &(m_InsideValue));
      KernelManager->SetKernelArg(KernelHandle, 3, sizeof(TOutput), &(m_OutsideValue));
      return 4;
    }
  };

  template< class TInputImage, class TOutputImage, class TFunction, class TParentImageFilter >
  void
  GPUUnaryFunctorImageFilter< TInputImage, TOutputImage, TFunction, TParentImageFilter >
  ::GPUGenerateData()
  {
    ....
    // arguments set up using the functor
    int argidx = (this->GetFunctor()).SetGPUKernelArguments(this->m_GPUKernelManager,
                                                            m_UnaryFunctorImageFilterGPUKernelHandle);

    // image and size arguments set up
    this->m_GPUKernelManager->SetKernelArgWithImage(m_UnaryFunctorImageFilterGPUKernelHandle,
                                                    argidx++, inPtr->GetGPUDataManager());
    this->m_GPUKernelManager->SetKernelArgWithImage(m_UnaryFunctorImageFilterGPUKernelHandle,
                                                    argidx++, otPtr->GetGPUDataManager());
    for(int i=0; i<(int)TInputImage::ImageDimension; i++)
    {
      this->m_GPUKernelManager->SetKernelArg(m_UnaryFunctorImageFilterGPUKernelHandle,
                                             argidx++, sizeof(int), &(imgSize[i]));
    }

    // launch kernel
    this->m_GPUKernelManager->LaunchKernel(m_UnaryFunctorImageFilterGPUKernelHandle,
                                           ImageDim, globalSize, localSize);
  }

GPUNeighborhoodOperatorImageFilter
- Pixel-wise inner product of the neighborhood and the operator coefficients (convolution)
- __constant GPU buffer for the coefficients
- GPU Discrete Gaussian Filter
  - GPU NeighborhoodOperatorImageFilter using a 1D Gaussian operator per axis
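A possible user-side call, assuming the SetVariance() parameter carries over unchanged from the CPU DiscreteGaussianImageFilter; inputImage is a placeholder for an image coming from a reader or an upstream filter:

  typedef itk::GPUImage< float, 3 >                                   ImageType;
  typedef itk::GPUDiscreteGaussianImageFilter< ImageType, ImageType > GaussianFilterType;

  GaussianFilterType::Pointer gaussian = GaussianFilterType::New();
  gaussian->SetInput( inputImage );
  gaussian->SetVariance( 4.0 );   // same parameter as the CPU filter
  gaussian->Update();             // runs one 1D Gaussian pass per axis on the GPU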

GPUFiniteDifferenceImageFilter
- Base class for GPU finite difference filters
  - GPUGradientAnisotropicDiffusionImageFilter
  - GPUDemonsRegistrationFilter
- New virtual methods
  - GPUApplyUpdate()
  - GPUCalculateChange()
- Need finite difference function
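Roughly, the solver loop driven by this base class looks like the sketch below; member names other than the two new virtual methods are taken from the CPU FiniteDifferenceImageFilter and are assumptions here:

  // simplified iteration inside GPUFiniteDifferenceImageFilter
  while( !this->Halt() )
  {
    this->InitializeIteration();                  // delegated to the difference function
    TimeStepType dt = this->GPUCalculateChange(); // GPU kernel fills the update buffer
    this->GPUApplyUpdate( dt );                   // GPU kernel applies the update to the output
    ++m_ElapsedIterations;
  }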

GPUFiniteDifferenceFunction
- Base class for GPU finite difference functions
  - GPUGradientNDAnisotropicDiffusionFunction
  - GPUDemonsRegistrationFunction
- New virtual method
  - GPUComputeUpdate()
  - Compute update buffer using GPU kernel

GPUGradientAnisotropicDiffusionImageFilter
- GPUScalarAnisotropicDiffusionFunction
  - New virtual method: GPUCalculateAverageGradientMagnitudeSquared()
- GPUGradientNDAnisotropicDiffusionFunction
  - GPU function for gradient-based anisotropic diffusion
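A possible user-side call, assuming the diffusion parameters carry over unchanged from the CPU GradientAnisotropicDiffusionImageFilter; inputImage is a placeholder:

  typedef itk::GPUImage< float, 3 > ImageType;
  typedef itk::GPUGradientAnisotropicDiffusionImageFilter< ImageType, ImageType >
    DiffusionFilterType;

  DiffusionFilterType::Pointer diffusion = DiffusionFilterType::New();
  diffusion->SetInput( inputImage );
  diffusion->SetNumberOfIterations( 10 );
  diffusion->SetTimeStep( 0.0625 );          // stability limit for a 3D image
  diffusion->SetConductanceParameter( 3.0 );
  diffusion->Update();                       // each iteration runs GPUCalculateChange()/GPUApplyUpdate()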

GPUDemonsRegistrationFilter
- Contributed by Baohua from UPenn
- New method: GPUSmoothDeformationField()
- GPUReduction
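A possible user-side call, assuming the template arguments and parameters mirror the CPU DemonsRegistrationFilter; fixedImage and movingImage are placeholders:

  typedef itk::GPUImage< float, 3 >            ImageType;
  typedef itk::Vector< float, 3 >              VectorPixelType;
  typedef itk::GPUImage< VectorPixelType, 3 >  FieldType;
  typedef itk::GPUDemonsRegistrationFilter< ImageType, ImageType, FieldType >
    DemonsFilterType;

  DemonsFilterType::Pointer demons = DemonsFilterType::New();
  demons->SetFixedImage( fixedImage );
  demons->SetMovingImage( movingImage );
  demons->SetNumberOfIterations( 50 );
  demons->SetStandardDeviations( 1.0 );  // field smoothing done by GPUSmoothDeformationField()
  demons->Update();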

Performance
Intel Xeon quad-core 3.2 GHz CPU vs. NVIDIA GTX 480 GPU, 256x256x100 CT volume

            Binary Threshold   Gaussian   Anisotropic Diffusion   Mean
  CPU 1     0.09346            0.7696     24.68                   4.069
  CPU 2     0.0408             0.7546     13.83                   2.086
  CPU 3     0.02865            0.6986     10.12                   1.542
  CPU 4     0.02313            0.763       9.14                   1.572
  GPU       0.019              0.0532      0.46                   0.059
  Speed up  1.2~4.9x           13~14x     19~53x                  26~68x
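Such numbers could be measured with a simple probe around Update(); a sketch using itk::TimeProbe, where filter stands for any of the filters in the table (CPU or GPU variant, selected e.g. via the object factory described later):

  #include "itkTimeProbe.h"

  itk::TimeProbe probe;
  probe.Start();
  filter->Update();
  // For a GPU filter, force the output back to the CPU here (e.g. read a pixel)
  // if kernels may still be running asynchronously, so the GPU time is counted.
  probe.Stop();
  std::cout << "elapsed: " << probe.GetTotal() << " s" << std::endl;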

Create Your Own GPU Image Filter
- Step 1: Derive your filter from GPUImageToImageFilter, using an existing ITK image filter as the parent filter type
- Step 2: Load and compile GPU source code and create kernels in the constructor
- Step 3: Implement the filter by calling GPU kernels in GPUGenerateData()

Example: GPUMeanImageFilter
Step 1: Class declaration

  template< class TInputImage, class TOutputImage >
  class ITK_EXPORT GPUMeanImageFilter :
    public GPUImageToImageFilter< TInputImage, TOutputImage,
                                  MeanImageFilter< TInputImage, TOutputImage > >
  {
    ...
  };

Note: MeanImageFilter is the third template parameter to the GPUImageToImageFilter class.

Example: GPUMeanImageFilter
Step 2: Constructor

  template< class TInputImage, class TOutputImage >
  GPUMeanImageFilter< TInputImage, TOutputImage >::GPUMeanImageFilter()
  {
    std::ostringstream defines;
    defines << "#define DIM_" << TInputImage::ImageDimension << "\n";
    defines << "#define PIXELTYPE ";
    GetTypenameInString( typeid( typename TInputImage::PixelType ), defines );

    // OpenCL source path
    std::string oclSrcPath = "./../OpenCL/GPUMeanImageFilter.cl";

    // load and build the OpenCL program
    m_KernelManager->LoadProgramFromFile( oclSrcPath.c_str(), defines.str().c_str() );

    // create the GPU kernel
    m_KernelHandle = m_KernelManager->CreateKernel("MeanFilter");
  }

Notes:
- itk_root_path is defined in pathToOpenCLSourceCode.h
- m_KernelManager is already created in the base class constructor (GPUImageToImageFilter::GPUImageToImageFilter())

Example: GPUMeanImageFilter
Step 3: GPUGenerateData()

  template< class TInputImage, class TOutputImage >
  void GPUMeanImageFilter< TInputImage, TOutputImage >::GPUGenerateData()
  {
    typedef typename itk::GPUTraits< TInputImage >::Type   GPUInputImage;
    typedef typename itk::GPUTraits< TOutputImage >::Type  GPUOutputImage;

    // get input & output image pointers
    typename GPUInputImage::Pointer  inPtr = dynamic_cast< GPUInputImage * >( this->ProcessObject::GetInput(0) );
    typename GPUOutputImage::Pointer otPtr = dynamic_cast< GPUOutputImage * >( this->ProcessObject::GetOutput(0) );
    typename GPUOutputImage::SizeType outSize = otPtr->GetLargestPossibleRegion().GetSize();

    int radius[3], imgSize[3];
    for(int i=0; i<(int)TInputImage::ImageDimension; i++)
    {
      radius[i]  = (this->GetRadius())[i];
      imgSize[i] = outSize[i];
    }

Notes:
- GPUTraits gives the GPUImage type for a given image type.
- The images must be cast to GPUImage: with the object factory, the pointers you get are not statically typed as GPUImage.
- The filter radius comes from the parent class (MeanImageFilter).

Step 3: GPUGenerateData() (continued)

    size_t localSize[3], globalSize[3];
    localSize[0] = localSize[1] = localSize[2] = 8;
    for(int i=0; i<(int)TInputImage::ImageDimension; i++)
    {
      globalSize[i] = localSize[i]*(unsigned int)ceil((float)outSize[i]/(float)localSize[i]);
    }

    // kernel arguments set up
    int argidx = 0;
    m_KernelManager->SetKernelArgWithImage(m_KernelHandle, argidx++, inPtr->GetGPUDataManager());
    m_KernelManager->SetKernelArgWithImage(m_KernelHandle, argidx++, otPtr->GetGPUDataManager());
    for(int i=0; i<(int)TInputImage::ImageDimension; i++)
    {
      m_KernelManager->SetKernelArg(m_KernelHandle, argidx++, sizeof(int), &(radius[i]));
    }
    for(int i=0; i<(int)TInputImage::ImageDimension; i++)
    {
      m_KernelManager->SetKernelArg(m_KernelHandle, argidx++, sizeof(int), &(imgSize[i]));
    }

    // launch kernel
    m_KernelManager->LaunchKernel(m_KernelHandle, (int)TInputImage::ImageDimension, globalSize, localSize);
  }

Note: globalSize must be a multiple of localSize.

Pipeline Support
- Allows combining CPU and GPU filters
- Efficient CPU/GPU synchronization: lazy evaluation

  ReaderType::Pointer reader = ReaderType::New();
  WriterType::Pointer writer = WriterType::New();

  GPUMeanFilterType::Pointer   filter1 = GPUMeanFilterType::New();
  GPUMeanFilterType::Pointer   filter2 = GPUMeanFilterType::New();
  ThresholdFilterType::Pointer filter3 = ThresholdFilterType::New();

  filter1->SetInput( reader->GetOutput() );   // copy CPU->GPU implicitly
  filter2->SetInput( filter1->GetOutput() );
  filter3->SetInput( filter2->GetOutput() );
  writer->SetInput( filter3->GetOutput() );   // copy GPU->CPU implicitly
  writer->Update();

Pipeline: Reader -> Filter1 (GPU) -> Filter2 (GPU) -> Filter3 (CPU) -> Writer, with synchronization only at the CPU/GPU boundaries.

Object Factory Support
- Create a GPU object when possible
- No need to explicitly define GPU objects

  // register object factories for the GPU image and filter objects
  ObjectFactoryBase::RegisterFactory( GPUImageFactory::New() );
  ObjectFactoryBase::RegisterFactory( GPUMeanImageFilterFactory::New() );

  typedef itk::Image< InputPixelType,  2 >   InputImageType;
  typedef itk::Image< OutputPixelType, 2 >   OutputImageType;
  typedef itk::MeanImageFilter< InputImageType, OutputImageType > MeanFilterType;

  // the factory transparently instantiates the GPU filter here when possible
  MeanFilterType::Pointer filter = MeanFilterType::New();

Type Casting
- Images must be cast to GPUImage for auto-synchronization in a non-pipelined workflow with the object factory
- Use GPUTraits

  template <class T>
  class GPUTraits
  {
  public:
    typedef T Type;
  };

  template <class T, unsigned int D>
  class GPUTraits< Image< T, D > >
  {
  public:
    typedef GPUImage<T,D> Type;
  };

  InputImageType::Pointer img;
  typedef itk::GPUTraits< InputImageType >::Type GPUImageType;
  GPUImageType::Pointer otPtr = dynamic_cast< GPUImageType* >( img.GetPointer() );

Future Work
- Multi-GPU support
  - GPUThreadedGenerateData()
- GPUImage internal types
  - Image (texture)
- GPU ND Neighbor Iterator

Discussion