
1 CUDA ITK Won-Ki Jeong SCI Institute University of Utah

2 NVIDIA G80
New architecture for computing on the GPU
– GPU as a massively parallel multithreaded machine
One step further from the streaming model
– New hardware features:
  Unified shaders (ALUs)
  Flexible memory access (scatter)
  Fast user-controllable on-chip memory
  Integer and bitwise operations

3 NVIDIA CUDA
A C-extension GPU programming language from NVIDIA
– No graphics API overhead
– Easy to learn
– Supported by development tools
Extensions / API
– Function type qualifiers: __global__, __device__, __host__
– Variable type qualifiers: __shared__, __constant__
– cudaMalloc(), cudaFree(), cudaMemcpy(), …
– __syncthreads(), atomicAdd(), …
Program types
– Device program (kernel): runs on the GPU
– Host program: runs on the CPU and calls device programs

4 CUDA ITK
ITK powered by CUDA
– Many registration / image processing functions are still computationally expensive and parallelizable
– Current ITK parallelization is bound by the number of CPUs (cores)
Our approach
– Implement several well-known ITK image filters using NVIDIA CUDA
– Focus on 3D volume processing: CT / MRI datasets are mostly 3D volumes

5 CUDA ITK
CUDA code is integrated into ITK
– Transparent to ITK users
– No need to modify existing code that uses ITK
Check the environment variable ITK_CUDA
– Entry point: GenerateData() or ThreadedGenerateData()
– If ITK_CUDA == 0, execute the original ITK code
– If ITK_CUDA == 1, execute the CUDA code

6 ITK image space filters
Convolution filters
– Mean filter
– Gaussian filter
– Derivative filter
– Hessian of Gaussian filter
Statistical filter
– Median filter
PDE-based filter
– Anisotropic diffusion filter

7 Speed up using CUDA
– Mean filter: ~140x
– Median filter: ~25x
– Gaussian filter: ~60x
– Anisotropic diffusion: ~70x

8 Convolution filters
Separable filter
– N-dimensional convolution = N 1D convolutions
– For filter radius r, per-pixel cost drops from (2r+1)^N to N(2r+1) multiply-adds
Example
– 2D Gaussian = 2 x 1D Gaussian

9 GPU implementation
Apply a 1D convolution along each axis
– Minimize overlapping
– Stage data in shared memory
[Figure: input is read from global memory into shared memory, convolved, and written to the output in global memory]

10 Minimize overlapping
Usually the kernel width is large (> 20 for a Gaussian)
– Max block size ~ 8x8x8
– Each pixel has 6 neighbors in 3D
Use long and thin blocks to minimize overlapping
[Figure: cubic blocks overlap their neighbors' halo regions multiple times; long, thin blocks along the convolution axis do not overlap]

11 Median filter
Viola et al. [VIS 03]
– Finding the median by bisection of histogram bins
– log(# bins) iterations (e.g., 8-bit pixels: 8 iterations)
[Figure: four bisection steps over intensities 0–7, halving the candidate bin range at each step]

12 Pseudo code (GPU median filter)

Copy current block from global to shared memory
min = 0; max = 255;
pivot = (min + max) / 2.0f;
for (i = 0; i < 8; i++) {
  count = 0;
  for (j = 0; j < kernelsize; j++) {
    if (kernel[j] > pivot) count++;
  }
  if (count <= kernelsize / 2) max = floor(pivot);
  else min = ceil(pivot);
  pivot = (min + max) / 2.0f;
}
return floor(pivot);

13 Perona & Malik anisotropic PDE
Nonlinear diffusion
– Fall-off function c (conductance) controls anisotropy
– Less smoothing across high gradients
– Contrast parameter k
Numerical solution
– Euler explicit integration (iterative method)
– Finite differences for derivative computation

14 Gradient & conductance map
Half-grid x / y / z gradients and conductances for each pixel
2D example
– For an n x n block, 4(n+1)^2 + (n+2)^2 shared memory entries are required
[Figure: an n x n output block; its (n+2) x (n+2) input tile is read from global memory; shared memory holds four (n+1) x (n+1) arrays (grad x, grad y, cond x, cond y)]

15 Euler integration
Use pre-computed gradients and conductances
– Each gradient / conductance value is used twice
– Avoid redundant computation by reusing the pre-computed gradient / conductance map

16 Experiments
Test environment
– CPU: AMD Opteron dual-core 1.8 GHz
– GPU: NVIDIA Tesla C870
Input volume is 128^3

17 Result
Mean filter

Kernel size | 3      | 5      | 7      | 9
ITK         | 1.03   | 2.13   | 7.17   | 18.5
CUDA        | 0.0705 | 0.05   | 0.08   | 0.132
Speed up    | 13     | 41     | 86     | 140

Gaussian filter

Variance    | 1      | 2      | 4      | 8
ITK         | 0.773  | 1.07   | 1.36   | 2.12
CUDA        | 0.0279 | 0.0316 | 0.0317 | 0.0327
Speed up    | 27     | 33     | 42     | 64

18 Result
Median filter

Kernel size | 3      | 5      | 7      | 9
ITK         | 1.03   | 4.18   | 14.1   | 23.1
CUDA        | 0.0705 | 0.232  | 0.544  | 1.07
Speed up    | 14     | 18     | 25     | 21

Anisotropic diffusion

Iteration   | 2      | 4      | 8      | 16
ITK         | 3.21   | 6.37   | 12.7   | 25.5
CUDA        | 0.0715 | 0.106  | 0.172  | 0.306
Speed up    | 44     | 60     | 73     | 83

19 Summary
ITK powered by CUDA
– Image space filters using CUDA
– Up to 140x speed up
Future work
– GPU image class for ITK: reduce CPU-to-GPU memory I/O, pipelining support
– Image registration
– Numerical library (vnl)
– Out-of-GPU-core processing: seismic volumes (~10s to 100s of GB)

20 Questions?

