Implementation of Voxel Volume Projection Operators Using CUDA: Applications to Iterative Cone-Beam CT Reconstruction

Three questions to keep in mind during the presentation:
1. Name three differences between the GPU and the CPU that contribute to the observed performance improvements.
2. Scattered accumulation (a[ix] = a[ix] + b) can result in reduced performance and unexpected results in CUDA. Why? Please state two possible problems! (See the sketch below.)
3. Suppose that each thread in a CUDA program is responsible for writing to one unique output element. Give one possible solution for handling "odd sized" output data.
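
The following minimal sketch (ours, not from the presentation) illustrates the hazard in question 2: the read-modify-write in the first kernel is not atomic, so two threads mapping to the same index can lose updates; atomicAdd is one possible remedy.

// Hazard sketch (illustrative names): several threads may share the same
// output index ix[i]; the non-atomic read-modify-write below can then lose
// updates, and the scattered accesses are also poorly coalesced.
__global__ void racyAccumulate(float* a, const int* ix, const float* b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[ix[i]] = a[ix[i]] + b[i];   // race when two threads share ix[i]
}

// One remedy: make the update atomic. Results are then correct, but
// frequently-hit elements serialize, which can still cost performance.
__global__ void atomicAccumulate(float* a, const int* ix, const float* b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&a[ix[i]], b[i]);   // float atomicAdd requires hardware support
}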

What are voxel volume projectors? [Figure: an X-ray source, the object f, and a detector measuring p]
Object: f
Projection data: p (line integrals through f)
Forward projection: P, mapping f to p
Backprojection: P^T, mapping p to f

The tomography problem: given p, find an image/volume f such that the difference between Pf and p is small in some sense, e.g. such that z(f) = ||Pf - p||^2 attains a minimum.
Steepest descent approach:
Gradient: ∇z(f) = 2 P^T (Pf - p)
Update step: f_{k+1} = f_k - 2α P^T (P f_k - p)
We use this method only for illustration of why P and P^T are needed. In practice, faster and better regularized methods are needed.
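
For concreteness, here is a minimal host-side sketch of this iteration; the wrappers forwardProject, backProject and axpy are hypothetical stand-ins for the CUDA kernels described later.

// Hypothetical wrappers around the projection kernels described later:
void forwardProject(float* p_out, const float* f);   // p_out = P f
void backProject(float* f_out, const float* p);      // f_out = P^T p
void axpy(float a, const float* x, float* y);        // y = a*x + y (elementwise)

// Steepest descent on z(f) = ||Pf - p||^2 (illustration only; r and g are
// scratch buffers for the residual and the gradient direction).
void steepestDescent(float* f, const float* p, float* r, float* g,
                     int nIter, float alpha)
{
    for (int k = 0; k < nIter; ++k) {
        forwardProject(r, f);          // r = P f
        axpy(-1.0f, p, r);             // r = P f - p
        backProject(g, r);             // g = P^T (P f - p)
        axpy(-2.0f * alpha, g, f);     // f = f - 2*alpha*g
    }
}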

Why implement P and P^T on the GPU? Reasonable sizes of f and p are around 2 GB each, implying a size of 2 GB x 2 GB for P and P^T. Although these matrices are sparse, their nonzero elements occupy approximately 2000 GB, i.e., matrix-vector products involving these matrices are computationally demanding. The GPU offers several features that help speed up the computations:
Massive parallelism
Hardware linear interpolation
Very high memory bandwidth
Texture caches "optimized for 2D locality"

The Joseph forward projector [1]. Illustration in the 2D case: the ray is sampled at unit steps along its dominant direction, each sample x_j is obtained by linear interpolation between the two nearest pixels, and p_i = L (x_0 + x_1 + x_2 + x_3), where L is the step length along the ray. Generalization to 3D is straightforward: instead of linear interpolation, bilinear interpolation is used. [1] Joseph, P. M., An improved algorithm for reprojecting rays through pixel images, IEEE Transactions on Medical Imaging, 1982, MI-1, 192-196
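
A minimal 2D sketch of this projector (our own illustration, not the presentation's code), marching in unit steps along the x axis and interpolating linearly in y:

#include <math.h>

// Joseph projection of one x-dominant ray y(x) = y0 + slope*x through an
// nx-by-ny image (illustrative; boundary handling kept deliberately simple).
float josephRay2D(const float* img, int nx, int ny, float y0, float slope)
{
    float sum = 0.0f;
    for (int x = 0; x < nx; ++x) {
        float y  = y0 + slope * x;
        int   iy = (int)floorf(y);
        float w  = y - (float)iy;        // linear interpolation weight
        if (iy >= 0 && iy + 1 < ny)
            sum += (1.0f - w) * img[iy * nx + x] + w * img[(iy + 1) * nx + x];
    }
    float L = sqrtf(1.0f + slope * slope);  // step length along the ray
    return L * sum;                         // p_i = L * (x_0 + x_1 + ...)
}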

Implementation sketch for P. We choose to let one thread correspond to one ray. For each ray/thread:
Determine the ray origin and direction.
Determine the intersection between the ray and the voxel volume.
Step along the ray and accumulate values from the voxel volume, using the 3D texture cache.
Multiply by L and store to the output element.

Handling of "odd" detector sizes. The x-ray detector is divided into 2D blocks corresponding to CUDA thread blocks, using a static block size (16x2). To handle the case where the detector dimensions are not divisible by the block size, conditional statements were used:

// Return if outside detector
if (rowIx >= Nrows) return;
if (chIx >= Nch) return;

Although this reduces efficiency due to divergent branches, the reduction is small for detectors with a reasonably large number of channels and rows.
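
The matching launch configuration simply rounds the grid size up; the presentation does not show its launch code, so the kernel name below is hypothetical:

// Round the grid up so every detector element is covered; the guard in the
// kernel then discards the excess threads. forwardKernel is a placeholder.
dim3 block(16, 2);
dim3 grid((Nch + block.x - 1) / block.x,      // ceil(Nch / 16)
          (Nrows + block.y - 1) / block.y);   // ceil(Nrows / 2)
forwardKernel<<<grid, block>>>(/* ... */);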

Implementation details 1: calculation of source and detector positions.

// Calculate focus center and displacement vectors
float alpha = theta + D_ALPHA(chIx);
float3 fc = make_float3(Rf*cos(alpha), Rf*sin(alpha), z0+D_Z(chIx));
float3 fw = make_float3(-sin(alpha), cos(alpha), 0.f);
float3 fh = make_float3(-cos(alpha)*sin(aAngle), -sin(alpha)*sin(aAngle), cos(aAngle));

// Calculate detector center and displacement vectors
float3 dc = fc + (Rf+Rd) * make_float3(-cos(theta), -sin(theta), D_SLOPE(rowIx));
float3 dw = make_float3(sin(theta), -cos(theta), 0.f);
float3 dh = make_float3(0.f, 0.f, 1.f);

// Calculate focus position in texture index coordinates
float3 f = ((fc + fOffset.x*fw + fOffset.y*fh) - origin) / spacing;

// Calculate detector position in texture index coordinates
float3 d = ((dc + dOffset.x*dw + dOffset.y*dh) - origin) / spacing;

// Create ray struct
Ray ray;
ray.o = f;
ray.d = normalize(d - f);

Code based on the NVIDIA CUDA SDK "volumeRender" example; the float3 arithmetic relies on the operator overloads from the SDK's cutil_math.h helper header.

Implementation details 2: accumulation of contributions.

// Accumulate contributions along ray
float dValue = 0;
float t = tnear;
for (int i = 0; i < dimensions.x; i++) {
    // Calculate current position
    float3 pos = ray.o + ray.d*t + 0.5f;   // texture origin: (0.5f, 0.5f, 0.5f)

    // Read from 3D texture
    dValue += tex3D(devT_reconVolume, pos.x, pos.y, pos.z);

    // Increase t
    t += tstep;
    if (t >= tfar) break;
}

// Update detector data value
dValue *= length(ray.d*spacing) * tstep;
projectionData[rowIx*Nch + chIx] += dValue;

Code based on the NVIDIA CUDA SDK "volumeRender" example.

Experiments – forward projection (P)
Input data dimensions: 512x512x257 floats (the voxel volume f)
Output data dimensions: 672x24x3000 floats (the projection data p)
Only very small differences in accuracy, without practical implications, occur between the CPU and GPU implementations.
Calculation times:
CPU: 2500 s
GPU: 47 s
Speedup factor: approximately 50x
For a larger collection of problems, speedups between 20x and 50x have been observed.

Why is an efficient implementation of the exact adjoint operator P^T much trickier? [2D illustration of the procedure]
Same interpolation coefficients as for P, but with scattered accumulation instead of reading.
No hardware interpolation.
No 2D/3D textures.
A new parallelization setup is needed: one ray/one thread leads to corrupted results.

One slow but exact implementation of the transpose of P: let one thread represent only one accumulation from a ray, i.e.
calculation of the position in the voxel volume,
calculation of and multiplication with the interpolation coefficients,
accumulation to the 4 voxels closest to the intersection of the active ray with the voxel volume plane.
The threads are thus very short, and very many. One thread block represents a number of rays, separated so that conflicting updates do not occur [Figure: ray groups labeled Block 0, Block 1, Block 2, ...]. A simplified sketch follows below.
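
A heavily simplified 2D sketch of this scheme (our reconstruction, not the presentation's code). One thread performs one accumulation, and the caller is assumed to group rays so that rays processed concurrently never touch the same pixels:

// One thread = one accumulation of the exact adjoint: spread the ray's
// detector value, weighted with the same coefficients as the forward
// projector, onto the two pixels (4 voxels in 3D) nearest the intersection
// with image column x. Correctness relies on the caller scheduling only
// non-conflicting rays together, as in the block layout sketched above.
__global__ void exactBackproject2D(float* img, int nx, int ny,
                                   const float* p,        // one value per ray
                                   const float* rayY0,    // y(x) = y0 + slope*x
                                   const float* raySlope, int nRays)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;   // ray index
    int x = blockIdx.y * blockDim.y + threadIdx.y;   // column = one accumulation
    if (r >= nRays || x >= nx) return;

    float y  = rayY0[r] + raySlope[r] * x;
    int   iy = (int)floorf(y);
    float w  = y - (float)iy;                        // forward-projector weight
    if (iy < 0 || iy + 1 >= ny) return;

    float v = sqrtf(1.0f + raySlope[r] * raySlope[r]) * p[r];  // L * p_r
    img[iy * nx + x]       += (1.0f - w) * v;        // scattered accumulation,
    img[(iy + 1) * nx + x] += w * v;                 // safe only by construction
}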

Experiments – backprojection (P^T)
Input data dimensions: 672x24x3000 floats (the projection data p)
Output data dimensions: 512x512x257 floats (the voxel volume f)
Calculation times:
CPU: 2700 s
GPU: 700 s
Speedup factor: approximately 4x
For a larger collection of problems, speedups between 4x and 8x have been observed.

Approximate implementation of the adjoint operator P^T: bilinear interpolation on the detector. [2D illustration] Parallelization is now accomplished by letting one voxel (pixel in the 2D illustration) correspond to one thread. This method is approximate since its interpolation coefficients generally differ from those of the exact P^T. A sketch follows below.
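
A minimal sketch of this voxel-driven variant (ours; the geometry helper projectToDetector and the texture name devT_projection are hypothetical). Each thread owns one output voxel, so all writes are unique and the detector read can use hardware bilinear interpolation:

// Projection data for the current view, bound to a 2D texture with
// cudaFilterModeLinear so tex2D performs the bilinear interpolation.
texture<float, 2, cudaReadModeElementType> devT_projection;

// Hypothetical helper: map voxel (ix, iy, iz) to detector (channel, row).
__device__ float2 projectToDetector(int ix, int iy, int iz);

__global__ void approxBackproject(float* volume, int nx, int ny, int nz)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix >= nx || iy >= ny) return;

    for (int iz = 0; iz < nz; ++iz) {
        float2 det = projectToDetector(ix, iy, iz);
        // Gathered read with hardware interpolation; each thread writes only
        // its own voxels, so no scattered accumulation is needed.
        volume[((iz * ny) + iy) * nx + ix] += tex2D(devT_projection, det.x, det.y);
    }
}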

Performance comparison: approximate versus exact P^T operator.
Calculation times:
CPU P^T: 2700 s
GPU P^T (exact): 700 s
GPU P^T (approximate): 60 s
[Figure: axial images of head phantom reconstructions (window 25 HU to 75 HU), exact versus approximate]

Conclusions and future research
For operations such as P and P^T, speedups of 4x to 50x can be obtained. The amount of speedup is highly dependent on:
the possibility to read memory efficiently (coalesced, or via the texture caches / constant memory);
the possibility to use hardware interpolation;
the complexity of the kernel: using too many registers slows down the program, and lookup tables in constant memory can help reduce the number of registers.
Although scattered writes are supported by CUDA, for these problems they did not seem worthwhile from a performance point of view. Note that using them prohibits the use of cudaArray textures.
Remark: even if the approximate P^T operator gives a bad result in the example given here, there are other situations where the exact operator is superior. It is therefore of interest to find better approximations.