GPU-based iterative CT reconstruction

Slides:



Advertisements
Similar presentations
Implementation of Voxel Volume Projection Operators Using CUDA
Advertisements

1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 28, 2011 GPUMemories.ppt GPU Memories These notes will introduce: The basic memory hierarchy.
Optimization on Kepler Zehuan Wang
ECE 562 Computer Architecture and Design Project: Improving Feature Extraction Using SIFT on GPU Rodrigo Savage, Wo-Tak Wu.
2009/04/07 Yun-Yang Ma.  Overview  What is CUDA ◦ Architecture ◦ Programming Model ◦ Memory Model  H.264 Motion Estimation on CUDA ◦ Method ◦ Experimental.
Programming with CUDA, WS09 Waqar Saleem, Jens Müller Programming with CUDA and Parallel Algorithms Waqar Saleem Jens Müller.
Back-Projection on GPU: Improving the Performance Wenlay “Esther” Wei Advisor: Jeff Fessler Mentor: Yong Long April 29, 2010.
CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.
tomos = slice, graphein = to write
Accelerating Machine Learning Applications on Graphics Processors Narayanan Sundaram and Bryan Catanzaro Presented by Narayanan Sundaram.
HPEC_GPU_DECODE-1 ADC 8/6/2015 MIT Lincoln Laboratory GPU Accelerated Decoding of High Performance Error Correcting Codes Andrew D. Copeland, Nicholas.
Adnan Ozsoy & Martin Swany DAMSL - Distributed and MetaSystems Lab Department of Computer Information and Science University of Delaware September 2011.
Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU Presented by: Ahmad Lashgar ECE Department, University of Tehran.
Shekoofeh Azizi Spring  CUDA is a parallel computing platform and programming model invented by NVIDIA  With CUDA, you can send C, C++ and Fortran.
Slide 1/8 Performance Debugging for Highly Parallel Accelerator Architectures Saurabh Bagchi ECE & CS, Purdue University Joint work with: Tsungtai Yeh,
Motivation “Every three minutes a woman is diagnosed with Breast cancer” (American Cancer Society, “Detailed Guide: Breast Cancer,” 2006) Explore the use.
An approach for solving the Helmholtz Equation on heterogeneous platforms An approach for solving the Helmholtz Equation on heterogeneous platforms G.
CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA
Shared memory systems. What is a shared memory system Single memory space accessible to the programmer Processor communicate through the network to the.
BY: ALI AJORIAN ISFAHAN UNIVERSITY OF TECHNOLOGY 2012 GPU Architecture 1.
Extracted directly from:
General Purpose Computing on Graphics Processing Units: Optimization Strategy Henry Au Space and Naval Warfare Center Pacific 09/12/12.
Medical Image Analysis Image Reconstruction Figures come from the textbook: Medical Image Analysis, by Atam P. Dhawan, IEEE Press, 2003.
Parallelization of System Matrix generation code Mahmoud Abdallah Antall Fernandes.
1 Evaluation of parallel particle swarm optimization algorithms within the CUDA™ architecture Luca Mussi, Fabio Daolio, Stefano Cagnoni, Information Sciences,
GPU Architecture and Programming
Geant4 based simulation of radiotherapy in CUDA
Accelerating Error Correction in High-Throughput Short-Read DNA Sequencing Data with CUDA Haixiang Shi Bertil Schmidt Weiguo Liu Wolfgang Müller-Wittig.
IIIT Hyderabad Scalable Clustering using Multiple GPUs K Wasif Mohiuddin P J Narayanan Center for Visual Information Technology International Institute.
Introduction What is GPU? It is a processor optimized for 2D/3D graphics, video, visual computing, and display. It is highly parallel, highly multithreaded.
Some key aspects of NVIDIA GPUs and CUDA. Silicon Usage.
Adam Wagner Kevin Forbes. Motivation  Take advantage of GPU architecture for highly parallel data-intensive application  Enhance image segmentation.
QCAdesigner – CUDA HPPS project
1)Leverage raw computational power of GPU  Magnitude performance gains possible.
1 Workshop 9: General purpose computing using GPUs: Developing a hands-on undergraduate course on CUDA programming SIGCSE The 42 nd ACM Technical.
CS/EE 217 GPU Architecture and Parallel Programming Midterm Review
Debunking the 100X GPU vs. CPU Myth An Evaluation of Throughput Computing on CPU and GPU Present by Chunyi Victor W Lee, Changkyu Kim, Jatin Chhugani,
Exploiting Computing Power of GPU for Data Mining Application Wenjing Ma, Leonid Glimcher, Gagan Agrawal.
CUDA Compute Unified Device Architecture. Agent Based Modeling in CUDA Implementation of basic agent based modeling on the GPU using the CUDA framework.
3/12/2013Computer Engg, IIT(BHU)1 CUDA-3. GPGPU ● General Purpose computation using GPU in applications other than 3D graphics – GPU accelerates critical.
Fast and parallel implementation of Image Processing Algorithm using CUDA Technology On GPU Hardware Neha Patil Badrinath Roysam Department of Electrical.
Large-scale geophysical electromagnetic imaging and modeling on graphical processing units Michael Commer (LBNL) Filipe R. N. C. Maia (LBNL-NERSC) Gregory.
GPGPU Programming with CUDA Leandro Avila - University of Northern Iowa Mentor: Dr. Paul Gray Computer Science Department University of Northern Iowa.
CS/EE 217 – GPU Architecture and Parallel Programming
Computer Engg, IIT(BHU)
Computer Graphics Graphics Hardware
Introduction to CUDA Li Sung-Chi Taiwan Evolutionary Intelligence Laboratory 2016/12/14 Group Meeting Presentation.
GPU Architecture and Its Application
COMPUTER GRAPHICS CHAPTER 38 CS 482 – Fall 2017 GRAPHICS HARDWARE
Stencil-based Discrete Gradient Transform Using
A Dynamic Scheduling Framework for Emerging Heterogeneous Systems
CS427 Multicore Architecture and Parallel Computing
EECE571R -- Harnessing Massively Parallel Processors ece
GPU Memory Details Martin Kruliš by Martin Kruliš (v1.1)
Computed Tomography Image Reconstruction
tomography: from medicine to accelerator physics, and back again
GPU VSIPL: High Performance VSIPL Implementation for GPUs
Leiming Yu, Fanny Nina-Paravecino, David Kaeli, Qianqian Fang
Lecture 2: Intro to the simd lifestyle and GPU internals
Speedup over Ji et al.'s work
NVIDIA Fermi Architecture
General Programming on Graphical Processing Units
CS/EE 217 – GPU Architecture and Parallel Programming
Optimizing MapReduce for GPUs with Effective Shared Memory Usage
General Programming on Graphical Processing Units
Computer Graphics Graphics Hardware
GPU accelerated application tracing
Automatic Generation of Warp-Level Primitives and Atomic Instruction for Fast and Portable Parallel Reduction on GPU Simon Garcia De Gonzalo and Sitao.
6- General Purpose GPU Programming
Computed Tomography.
Presentation transcript:

GPU-based iterative CT reconstruction Tamás Huszár Zsolt Adam Balogh Tamas.huszar@mediso.com Zsolt.Balogh@mediso.com R & D Mediso Ltd. 20-21 May 2015 Budapest, Hungary 5th GPU WORKSHOP - THE FUTURE OF MANY-CORE COMPUTING IN SCIENCE 1

Computed Tomography 2

CUDA CUDA (Compute Unified Device Architecture) A parallel computing platform and programming model invented by NVIDIA. It enables increases in computing performance by exploiting the power of the graphics processing unit (GPU). Hardware Kepler microarchitecture GeForce GTX TITAN Z (compute capacity 3.5, memory 12 Gb, CUDA cores 5760) GeForce GTX TITAN Black (compute capacity 3.5, memory 6 GB, CUDA cores 2688) Maxwell microarchitecture GeForce GTX 980 (compute capacity 5.2, memory 4 GB, CUDA cores 2048) GeForce GTX TITAN X (compute capacity 5.2, memory 12 GB, CUDA cores 3072) 3 3

CUDA CUDA framework Wraps and hides CUDA functionality Enables fast development Features Smart Device selection Occupancy calculations Kernel launch Memory management 4

CUDA CUDA framework Smart device selection GPU filters (available memory, display cards, …) Get connected displays with NvAPI Occupancy calculations and kernel launch Calculate block size for highest occupancy Semi-automatic with CUDA 6.5 occupancy API Kernel launch macros to hide thread micromanagement 5

CUDA CUDA framework Kernel configuration macros Kernel launch macros Use cudaFuncGetAttributes to find register and shared memory usage cudaOccupancyMaxPotentialBlockSize kernel info + data size  blockSize3D, gridSize3D Kernel launch macros Sync / Async mode for debug Profiling / logging Thread index calculator for the CUDA kernel program Easy indexing for 2D and 3D image arrays using block and grid configuration Kernels must avoid over-indexing 6 6

CUDA CUDA framework / Example Kernel launch: myKernel: ADD_CONF(myKernel, 512*512*256); KERNEL_LAUNCH_ASYNC( myKernel, config["myKernel"], dataSize, parameters, ...); myKernel: uint tid = index(); uint2 pos = getXY( tid, dataSize.width ); data[pos.x, pos.y] = ...; 7 7

CUDA CUDA framework / Device memory allocation Generalized datastructures for different types array, layered array, global memory, pitched memory 1D, 2D, 3D float, double, … Transparent multi-GPU support Examples: template<typename T> DeviceMemoryArray2D<T> template<typename T> DeviceMemoryLinear3D<T> template<typename T> DeviceMemoryLayered3D<T> 8 8

CUDA Dynamic memory allocation example DeviceMemoryLinear3D<float> volume1(gpu1); DeviceMemoryArray3D<float> volume2(gpu2); volume2.copy(volume1); instead of cudaMemcpy3DParms parms; memset(&parms, 0, sizeof(cudaMemcpy3DParms)); parms.srcPos = srcPos; parms.dstPos = dstPos; parms.srcPtr = make_cudaPitchedPtr(devPtr, pitch, x, y); parms.dstArray = array; parms.extent = extent; parms.kind = cudaMemcpyDeviceToDevice; cudaMemcpy3DAsync( &parms, 0 ); 9 9

Iterative CT reconstruction Iterative CT generations 1st generation Filtered Back Projection 2nd generation Image-space Iterative Reconstruction 3rd generation Statistical Iterative Reconstruction 4th generation Model-based Reconstruction 10

Iterative CT reconstruction Forward-projection Projections FBP Start image Iterated projections Iterated image Result Back-projection 11

Iterative CT reconstruction Iterative reconstruction types Algebraic reconstruction techniques (ART, SART, SIRT) solving y=Ax, where y is sinogram data, A is system model Statistical reconstruction methods (EM, ML) (I ~ Poisson + Normal) Model-based methods (go beyond modeling statistics of the detected photons as Poisson distributed) 12 12

Iterative CT reconstruction Maximalization of the conditional probability P( μ | y ), where μ is the distribution of linear attenuation coefficients y is the vector of transmission measurements. Backprojection is the value of pixel j, is the measured sinogram along line i, is the intersection length of projection line i with pixel j. 13 13

Iterative CT reconstruction Reconstructed volume (FBP and iterative) 14 14

Iterative CT reconstruction Advantages image noise reduction dose reduction artifact reduction Disadvantages computation time (using GPU) model complexity software complexity, non-linear algorithms 15 15

Iterative CT reconstruction Projector types (parallelization) pixel-driven (ray-driven) voxel-driven distance-driven 16 16

Iterative CT reconstruction Optimizations CPU vs. GPU Memory access patterns Coalesced access Texture memory vs. global memory Use float instead of double where possible Memory size Multi-GPU 17

Measurements Reconstruction speed 18

Measurements Reconstruction speed 19

Measurements Reconstruction speed 20

Measurements Reconstruction speed 21

Measurements Titan X speedup 22

Thank you for your attention! 23 23