GPU Acceleration in ITK v4


GPU Acceleration in ITK v4
ITK v4 winter meeting, Feb 2nd 2011
Won-Ki Jeong, Harvard University (wkjeong@seas.harvard.edu)

Overview
Introduction
Current status of GPU support in ITK v4
GPU managers
GPU image
GPU image filters
Examples
Future work

GPU Acceleration
GPU as a fast co-processor: massively parallel; huge speedups for certain types of problems; a physically independent system.
Problems: memory management, process management, implementation.

Proposal
Provide GPU data structures and a framework for GPU computing in ITK: a GPU image filter framework, GPU filter templates, and pipelining support.
Allow developers to easily create a new GPU filter by writing only the GPU kernel code; ITK does the dirty work.
Built on OpenCL, an industry standard.

Quick Summary
Is the GPU draft implementation done? Yes: http://review.source.kitware.com/#change,800
What do we have now?
Basic GPU computing framework
GPU image and filter classes
Pipeline and object factory support
Basic CMake setup for GPU code

CMake Setup
ITK_USE_GPU: OFF by default; self-contained in Code/GPU except a few minor modifications of existing files.
OpenCL source file location: binary_dir/Code/GPU; binary_dir is written into pathToOpenCLSourceCode.h.

OpenCL Concepts
Platforms (NVIDIA, ATI, Intel), devices, contexts, programs, kernels, command queues, GPU images.

New GPU Classes
Basic GPU objects: GPUContextManager, GPUDataManager, GPUKernelManager, GPUImageDataManager.
ITK object extensions: GPUImage, GPUImageToImageFilter, GPUMeanImageFilter.

GPU Context Manager
Global GPU resource manager: one instance per process; all GPU objects should hold a pointer to it.
Resources managed: platforms, devices, contexts, command queues (GetCommandQueue(), GetNumCommandQueue()).

GPU Data Manager
Base class that manages GPU memory: a GPU data container with synchronization between CPU and GPU memory.
Data container APIs: SetBufferSize(), SetCPUBufferPointer(), Allocate(); protected: GetGPUBufferPointer().
Synchronization APIs: SetCPUBufferDirty(), SetGPUBufferDirty(), MakeCPUBufferUpToDate(), MakeGPUBufferUpToDate(), MakeUpToDate().

Synchronization
Dirty flags: lightweight; used by the pixel access functions in GPU images.
Time stamps: used by the pipeline; better to use sparingly.

Data Manager Example

unsigned int arraySize = 100;

// create CPU memory
float *a = new float[arraySize];

// create GPU memory
GPUDataManager::Pointer b = GPUDataManager::New();
b->SetBufferSize( arraySize * sizeof(float) );
b->SetCPUBufferPointer( a );
b->Allocate();

// change values in CPU memory
a[10] = 8;

// mark GPU as dirty and synchronize CPU -> GPU
b->SetGPUBufferDirty();
b->MakeUpToDate();

Data Manager Example (cont'd)

// change values in GPU memory
... (run GPU kernel)

// mark CPU as dirty and synchronize GPU -> CPU
b->SetCPUBufferDirty();
b->MakeUpToDate();

Create Your Own GPU Data Manager
Derive from GPUDataManager, e.g. GPUImageDataManager, GPUMeshDataManager, GPUVideoDataManager.
GPUImageDataManager adds/overrides: SetImagePointer(), MakeCPUBufferUpToDate(), MakeGPUBufferUpToDate(), ...

GPU Image
Derived from itk::Image: compatible with existing ITK filters.
GPUImageDataManager as a member: separates the GPU implementation from the Image class.
Implicit (automatic) synchronization: overrides the CPU buffer access functions to properly set the dirty buffer flag; provides a single view of CPU/GPU memory.

[Class diagram] GPUImage derives from itk::Image and overrides the CPU buffer access functions (FillBuffer(), GetPixel(), SetPixel(), GetBufferPointer(), GetPixelAccessor(), GetNeighborhoodAccessor(), ...); each override notifies the GPUImageDataManager (e.g. SetGPUBufferDirty(), MakeUpToDate()) so CPU and GPU memory stay consistent.

GPU Kernel Manager
Load and compile GPU source code: LoadProgramFromFile().
Create GPU kernels: CreateKernel().
Execute GPU kernels: SetKernelArg(), SetKernelArgWithImage(), LaunchKernel().

Kernel Manager Example

// create GPU images
itk::GPUImage<float,2>::Pointer srcA, srcB, dest;
srcA = itk::GPUImage<float,2>::New();
...

// create GPU kernel manager
GPUKernelManager::Pointer kernelManager = GPUKernelManager::New();

// load program and compile
kernelManager->LoadProgramFromFile( "ImageOps.cl", "#define PIXELTYPE float\n" );

// create ADD kernel
int kernel_add = kernelManager->CreateKernel("ImageAdd");

Kernel Manager Example (cont'd)

unsigned int nElem = 256*256;

// set kernel arguments
kernelManager->SetKernelArgWithImage(kernel_add, 0, srcA->GetGPUDataManager());
kernelManager->SetKernelArgWithImage(kernel_add, 1, srcB->GetGPUDataManager());
kernelManager->SetKernelArgWithImage(kernel_add, 2, dest->GetGPUDataManager());
kernelManager->SetKernelArg(kernel_add, 3, sizeof(unsigned int), &nElem);

// launch kernel: 256x256 global threads in 16x16 work groups
kernelManager->LaunchKernel2D(kernel_add, 256, 256, 16, 16);

OpenCL Source Code Example

// pixel-by-pixel addition of 2D images
__kernel void ImageAdd(__global const PIXELTYPE* a,
                       __global const PIXELTYPE* b,
                       __global PIXELTYPE* c,
                       unsigned int nElem)
{
  unsigned int width = get_global_size(0);
  unsigned int gix   = get_global_id(0);
  unsigned int giy   = get_global_id(1);
  unsigned int gidx  = giy*width + gix;

  // bound check
  if (gidx < nElem)
  {
    c[gidx] = a[gidx] + b[gidx];
  }
}

GPUImageToImageFilter
Base class for GPU image filters; extends existing ITK filters using CRTP (the Curiously Recurring Template Pattern).
IsGPUEnabled(bool): turn the GPU filter on/off.
GPUGenerateData(): the GPU filter implementation.

template< class TInputImage, class TOutputImage, class TParentImageFilter >
class ITK_EXPORT GPUImageToImageFilter : public TParentImageFilter
{ ... };

Create Your Own GPU Image Filter
Step 1: Derive your filter from GPUImageToImageFilter using an existing ITK image filter.
Step 2: Load and compile the GPU source code and create kernels in the constructor.
Step 3: Implement the filter by calling GPU kernels in GPUGenerateData().

Example: GPUMeanImageFilter, Step 1: Class declaration

template< class TInputImage, class TOutputImage >
class ITK_EXPORT GPUMeanImageFilter :
  public GPUImageToImageFilter< TInputImage, TOutputImage,
                                MeanImageFilter< TInputImage, TOutputImage > >
{ ... };

MeanImageFilter is passed as the third template parameter to GPUImageToImageFilter.

Example: GPUMeanImageFilter, Step 2: Constructor

template< class TInputImage, class TOutputImage >
GPUMeanImageFilter< TInputImage, TOutputImage >::GPUMeanImageFilter()
{
  char buf[100];

  // OpenCL source path (itk_root_path is defined in pathToOpenCLSourceCode.h)
  char oclSrcPath[100];
  sprintf(oclSrcPath, "%s/Code/GPU/GPUMeanImageFilter.cl", itk_root_path);

  // load and build OpenCL program
  // (m_KernelManager is already created in the base class constructor,
  //  GPUImageToImageFilter::GPUImageToImageFilter())
  m_KernelManager->LoadProgramFromFile( oclSrcPath, buf );

  // create GPU kernel
  m_KernelHandle = m_KernelManager->CreateKernel("MeanFilter");
}

Example: GPUMeanImageFilter, Step 3: GPUGenerateData()

template< class TInputImage, class TOutputImage >
void GPUMeanImageFilter< TInputImage, TOutputImage >::GPUGenerateData()
{
  // GPUTraits gives the GPUImage type; you must cast images to GPUImage
  // (important: with the object factory, the pointer you get is not typed as GPUImage)
  typedef typename itk::GPUTraits< TInputImage >::Type  GPUInputImage;
  typedef typename itk::GPUTraits< TOutputImage >::Type GPUOutputImage;

  // get input & output image pointers
  typename GPUInputImage::Pointer  inPtr = dynamic_cast< GPUInputImage * >( this->ProcessObject::GetInput(0) );
  typename GPUOutputImage::Pointer otPtr = dynamic_cast< GPUOutputImage * >( this->ProcessObject::GetOutput(0) );

  typename GPUOutputImage::SizeType outSize = otPtr->GetLargestPossibleRegion().GetSize();

  int radius[3], imgSize[3];
  for(int i=0; i<(int)TInputImage::ImageDimension; i++)
  {
    radius[i]  = (this->GetRadius())[i];  // filter radius from the parent class (MeanImageFilter)
    imgSize[i] = outSize[i];
  }

(Continued)

  // globalSize must be a multiple of localSize
  size_t localSize[2], globalSize[2];
  localSize[0] = localSize[1] = 16;
  globalSize[0] = localSize[0]*(unsigned int)ceil((float)outSize[0]/(float)localSize[0]);
  globalSize[1] = localSize[1]*(unsigned int)ceil((float)outSize[1]/(float)localSize[1]);

  // kernel argument setup
  int argidx = 0;
  m_KernelManager->SetKernelArgWithImage(m_KernelHandle, argidx++, inPtr->GetGPUDataManager());
  m_KernelManager->SetKernelArgWithImage(m_KernelHandle, argidx++, otPtr->GetGPUDataManager());
  for(int i=0; i<(int)TInputImage::ImageDimension; i++)
    m_KernelManager->SetKernelArg(m_KernelHandle, argidx++, sizeof(int), &(radius[i]));
  for(int i=0; i<(int)TInputImage::ImageDimension; i++)
    m_KernelManager->SetKernelArg(m_KernelHandle, argidx++, sizeof(int), &(imgSize[i]));

  // launch kernel
  m_KernelManager->LaunchKernel(m_KernelHandle, (int)TInputImage::ImageDimension, globalSize, localSize);
}

Big Picture: Collaboration Diagram
[Diagram: how the GPU Context Manager, GPU Kernel Manager, and GPU Image collaborate.]

Pipeline Support
Allows combining CPU and GPU filters with efficient CPU/GPU synchronization (lazy evaluation); currently ImageToImageFilter is supported.

ReaderType::Pointer reader = ReaderType::New();
WriterType::Pointer writer = WriterType::New();
GPUMeanFilterType::Pointer filter1 = GPUMeanFilterType::New();
GPUMeanFilterType::Pointer filter2 = GPUMeanFilterType::New();
ThresholdFilterType::Pointer filter3 = ThresholdFilterType::New();
filter1->SetInput( reader->GetOutput() );  // copy CPU -> GPU implicitly
filter2->SetInput( filter1->GetOutput() );
filter3->SetInput( filter2->GetOutput() );
writer->SetInput( filter3->GetOutput() );  // copy GPU -> CPU implicitly
writer->Update();

[Diagram: Reader -> Filter1 (GPU) -> Filter2 (GPU) -> Filter3 (CPU) -> Writer, with synchronization at the GPU/CPU boundary.]

Complicated Filter Design
Multiple kernel launches: a single kernel called multiple times, or multiple kernels.
Design choices:
Each kernel is its own filter, combined via pipelining: reusable, but with memory overhead.
Multiple kernels inside a single filter: not reusable, but less memory overhead.

Object Factory Support
Creates GPU objects when possible; no need to explicitly instantiate GPU types.

// register object factories for GPU image and filter objects
ObjectFactoryBase::RegisterFactory(GPUImageFactory::New());
ObjectFactoryBase::RegisterFactory(GPUMeanImageFilterFactory::New());

typedef itk::Image< InputPixelType,  2 >  InputImageType;
typedef itk::Image< OutputPixelType, 2 >  OutputImageType;
typedef itk::MeanImageFilter< InputImageType, OutputImageType > MeanFilterType;

// the factory returns a GPUMeanImageFilter instance
MeanFilterType::Pointer filter = MeanFilterType::New();

Type Casting
Images must be cast to GPUImage for automatic synchronization in non-pipelined workflows with the object factory. Use GPUTraits:

template <class T>
class GPUTraits
{
public:
  typedef T Type;
};

template <class T, unsigned int D>
class GPUTraits< Image< T, D > >
{
public:
  typedef GPUImage<T,D> Type;
};

InputImageType::Pointer img;
typedef itk::GPUTraits< InputImageType >::Type GPUImageType;
GPUImageType::Pointer otPtr = dynamic_cast< GPUImageType* >( img.GetPointer() );

Examples
test_itkGPUImage.cxx: simple image algebra; multiple kernels and command queues.
test_itkGPUImageFilter.cxx: GPUMeanImageFilter; pipeline and object factory.
Run with: ctest -R gpuImageFilterTest -V

ToDo List
Multi-GPU support: GPUThreadedGenerateData().
InPlace filter base class.
Grafting for GPUImage.
GPUImage internal types: buffer, image (texture).
Basic filters: level set, registration, etc.

Useful Links
Current code in Gerrit: http://review.source.kitware.com/#change,800
OpenCL:
http://www.khronos.org/opencl/
http://www.nvidia.com/object/cuda_opencl_new.html
http://developer.amd.com/zones/OpenCLZone/pages/default.aspx
http://software.intel.com/en-us/articles/intel-opencl-sdk/

Questions?