Stencil-based Discrete Gradient Transform Using GPU Device in Compressed Sensing MRI


Stencil-based Discrete Gradient Transform Using GPU Device in Compressed Sensing MRI
Xuelin (Nick) Cui and Hongbin Guo
FMI Medical Systems, Inc., 29001 Solon Rd., Solon, OH, U.S.A.
Email of corresponding author: nick.cui@fmimaging.com

Introduction

Total variation (TV) has been widely incorporated into iterative image reconstruction using the compressed sensing (CS) technique since its inception. Computing the TV requires the discrete gradient transform (DGT) of the image, and to pursue superior image quality a high-order approximation is employed here. High-order gradient computation, however, is time consuming, especially when programmed as central processing unit (CPU) code. The graphics processing unit (GPU) has recently brought revolutionary changes to the clinical context thanks to its extremely powerful parallelism: a GPU simultaneously executes the same set of instructions (single instruction, multiple data, or SIMD) across its massive number of threads. In this work we demonstrate that a stencil-based implementation of the high-order DGT using shared memory can achieve exceptional image quality with fast computation in CS MRI.

Results

The reconstruction framework is tested on an MRI head image. The implementation runs on a Microsoft Windows PC equipped with an NVIDIA K2000M GPU with 2 GB of memory. The MRI k-space sampling rate is 30%. The test compares analytic reconstruction, CS reconstruction using the regular DGT, and CS reconstruction using the stencil-based high-order DGT. The results are shown in Fig. 3: the top left panel shows the original image as the truth, the top right panel the analytic reconstruction, the bottom left panel the CS MRI reconstruction with the regular DGT, and the bottom right panel the high-order stencil-based CS reconstruction. It is clear that the stencil-based method gives enhanced image precision, with fewer artifacts and better contrast and resolution.
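The reconstruction pipeline compared above can be sketched in outline. The following is a minimal, illustrative NumPy sketch, not the authors' implementation: it substitutes a first-order forward-difference TV for the poster's 8th-order DGT, plain gradient descent for the actual solver, and a synthetic piecewise-constant phantom with roughly 30% random Cartesian k-space sampling; all parameter values are illustrative assumptions.

```python
import numpy as np

def tv_grad(x, eps=1e-3):
    """Gradient of a smoothed isotropic TV built from first-order forward
    differences (illustrative stand-in for the poster's 8th-order DGT)."""
    dx = np.roll(x, -1, axis=0) - x
    dy = np.roll(x, -1, axis=1) - x
    mag = np.sqrt(dx**2 + dy**2 + eps)
    px, py = dx / mag, dy / mag
    # Adjoint of the forward difference (negative divergence)
    return (np.roll(px, 1, axis=0) - px) + (np.roll(py, 1, axis=1) - py)

def recon_cs(kspace, mask, lam=0.005, iters=100, step=1.0):
    """Plain gradient descent on 0.5*||M F x - b||^2 + lam * TV(x)."""
    x = np.real(np.fft.ifft2(kspace, norm="ortho"))  # zero-filled start
    for _ in range(iters):
        resid = mask * np.fft.fft2(x, norm="ortho") - kspace
        data_grad = np.real(np.fft.ifft2(resid, norm="ortho"))
        x = x - step * (data_grad + lam * tv_grad(x))
    return x

# Synthetic piecewise-constant phantom, ~30% random k-space sampling
rng = np.random.default_rng(0)
truth = np.zeros((64, 64))
truth[16:48, 16:48] = 1.0
truth[24:40, 24:40] = 0.5
mask = rng.random((64, 64)) < 0.30
mask[0, 0] = True                 # keep the DC sample so brightness is anchored
kspace = mask * np.fft.fft2(truth, norm="ortho")

zf = np.real(np.fft.ifft2(kspace, norm="ortho"))  # zero-filled baseline
x = recon_cs(kspace, mask)
```

Even this crude first-order version typically recovers the phantom better than the zero-filled baseline; the gains reported in the poster come from replacing the first-order differences with the 8th-order stencil and moving the DGT into a shared-memory GPU kernel.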
A zoom-in of the region of interest (ROI) is displayed in Fig. 4. The analytic method shows strong noise in the brain tissue. The regular DGT-based method shows less noise, but significant anatomical detail is lost. The stencil-based method, on the other hand, shows a distinct improvement over both the inverse-Fourier-based method and the regular DGT-based method. Fig. 5 shows a halfway-profile comparison of the results, where the stencil-based high-order DGT method gives the best approximation to the truth. Fig. 6 shows that the stencil-based GPU implementation of the higher-order DGT using shared memory converges significantly faster than the CPU implementation.

Fig. 3. Recon image comparison.
Fig. 4. ROI comparison with details.

Method

The MRI system can be mathematically modeled as

Ax = b,

where A is the system matrix, x is the unknown image, and b is the data. The TV-based CS model can be described by the following constrained problem [1]:

min_x ||x||_TV  subject to  Ax = b,

which can be further converted to a Lagrange-type objective function

min_x (1/2)||Ax - b||^2 + lambda ||x||_TV.

In this optimization framework the TV prior is the key factor governing image quality. Therefore, instead of a forward or backward finite difference, the stencil-based DGT is computed here with an 8th-order approximation to pursue exceptional precision. In particular, the DGT of the image at each pixel is computed as a central finite difference over a (2M + 1)-point stencil with uniform spacing in each direction, with the stencil weights generated by Fornberg's algorithm [2]. One of the major bottlenecks in GPU usage is repeated, frequent access to global memory. As described in Fig. 1, global memory sits at the highest level of the hierarchy and is slower than every other type of GPU memory. Therefore, shared memory is used here to hold the (2M + 1)-point stencil data. Fig. 2 illustrates how the image is divided into a group of tiles, with a stencil-sized shared-memory region surrounding each tile. In this configuration the global memory (the image) is loaded only once during the entire computation; the shared memory is much faster than global memory and almost as fast as local registers.
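To make the stencil concrete, here is a NumPy sketch (the poster's actual implementation is a CUDA kernel) of the 8th-order DGT using the Fornberg weights, together with a tile-plus-halo variant that mimics the shared-memory staging of Fig. 2. Periodic boundary handling and the tile size of 16 are illustrative assumptions.

```python
import numpy as np

# 8th-order central-difference weights for the first derivative at offsets
# -4..+4, as produced by Fornberg's algorithm [2]
W = np.array([1/280, -4/105, 1/5, -4/5, 0.0, 4/5, -1/5, 4/105, -1/280])
M = 4  # stencil radius; the stencil holds 2M + 1 = 9 points

def dgt(img):
    """8th-order DGT along both axes (periodic boundaries for simplicity)."""
    gx = sum(w * np.roll(img, -k, axis=0) for k, w in zip(range(-M, M + 1), W))
    gy = sum(w * np.roll(img, -k, axis=1) for k, w in zip(range(-M, M + 1), W))
    return gx, gy

def tv(img):
    """Isotropic total variation built on the high-order DGT."""
    gx, gy = dgt(img)
    return np.sqrt(gx**2 + gy**2).sum()

def dgt_tiled(img, tile=16):
    """Same DGT, computed tile by tile with an M-wide halo staged into a
    local buffer -- a CPU stand-in for the shared-memory scheme of Fig. 2.
    Assumes `tile` divides both image dimensions."""
    n0, n1 = img.shape
    gx, gy = np.empty_like(img), np.empty_like(img)
    for i in range(0, n0, tile):
        for j in range(0, n1, tile):
            # Stage tile + halo ("shared memory"); periodic wrap at the edges
            idx0 = (np.arange(i - M, i + tile + M) % n0)[:, None]
            idx1 = (np.arange(j - M, j + tile + M) % n1)[None, :]
            buf = img[idx0, idx1]
            bgx, bgy = dgt(buf)  # wrap artifacts stay confined to the halo
            gx[i:i + tile, j:j + tile] = bgx[M:M + tile, M:M + tile]
            gy[i:i + tile, j:j + tile] = bgy[M:M + tile, M:M + tile]
    return gx, gy

# The tiled computation reproduces the full-image result exactly
rng = np.random.default_rng(1)
img = rng.standard_normal((64, 64))
gx, gy = dgt(img)
tgx, tgy = dgt_tiled(img)
assert np.allclose(gx, tgx) and np.allclose(gy, tgy)
```

On the GPU, each tile maps to a thread block: the block cooperatively loads its tile plus halo into shared memory, synchronizes, and then every thread applies the 9-point stencil from the fast on-chip copy, so the image is read from global memory only once.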
Stencil sizes were optimized experimentally so that memory coalescing on the GPU device is achieved. The optimal stencil size in this work is 4, and it handles typical clinical image sizes.

Fig. 5. Halfway-profile comparison.
Fig. 6. GPU vs. CPU performance.
Fig. 2. The image is divided into N x N tiles with stencil size 2M + 1; data in the region neighboring each tile are cached in shared memory.
Fig. 1. Diagram of the CUDA parallelism and memory architecture.

Conclusion

A high-order DGT can improve image quality in iterative reconstruction from undersampled clinical data. A GPU implementation of the high-order DGT using stencil-based shared memory significantly improves computational performance compared with a conventional implementation.

References

1. X. Cui, H. Yu, G. Wang, and L. Mili, “Total variation minimization-based multimodality medical image reconstruction,” Proc. SPIE 9212, Developments in X-Ray Tomography IX, vol. 9212, p. 11, 2014.
2. B. Fornberg, “Generation of finite difference formulas on arbitrarily spaced grids,” Mathematics of Computation, vol. 51, no. 184, pp. 699–706, 1988.
3. NVIDIA. (2016) NVIDIA CUDA C Programming Guide. [Online]. Available: https://docs.nvidia.com/cuda/cuda-c-programming-guide/