Presentation transcript:

Parallelization of System Matrix generation code
Mahmoud Abdallah, Antall Fernandes

SPECT System

Inverse Cone

Back Projection
Ref. figure: Tomographic Reconstruction of SPECT Data – Bill Amini, Magnus Björklund, Ron Dror, Anders Nygren.
Filtered Back Projection applies a ramp filter to the back-projected image. It is still widely used for its high speed and ease of implementation.

Maximum Likelihood-Expectation Maximization (ML-EM) Algorithm
Has been found to reduce noise in the reconstruction iteratively.
An iterative algorithm is used to solve the linear problem FX = P, where:
P – vector of projection data
X – voxelized image
F – projection matrix operator
Needs a large number of iterations to reconstruct an image.

EM Algorithm
The EM algorithm is given by an iterative update formula (a sketch of the standard form is given below).
The summation over k is the projection operation; the summation over j is the back projection operation.
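The update equation itself appears only as an image in the original slide; a standard form of the ML-EM update, assuming j indexes detector bins, k indexes voxels, and F = (f_{jk}), is:

    x_k^{(n+1)} = \frac{x_k^{(n)}}{\sum_j f_{jk}} \sum_j f_{jk} \frac{p_j}{\sum_{k'} f_{jk'} x_{k'}^{(n)}}

Here the inner sum over voxels k' forward-projects the current estimate, and the outer sum over bins j back-projects the measured-to-estimated ratio, matching the description above.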

System Matrix
Maps the image space to the data space.
Takes the detector geometry as input.
Generates detector data for every bin at each angle (usually there are 72 angles/frames).
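In terms of sizes (using the approximate bin counts quoted in the algorithm slide below, with the number of image voxels N left unspecified), this amounts to

    P \in \mathbb{R}^{72 \cdot 14 \cdot 64}, \quad X \in \mathbb{R}^{N}, \quad F \in \mathbb{R}^{(72 \cdot 14 \cdot 64) \times N},

so each row of F describes how one detector bin at one angle weights every voxel.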

System Matrix Algorithm
for each angle do                                        // number of angles = 72
    for each detector bin in the U direction do          // bins: around 14
        for each detector bin in the V direction do      // bins: around 64
            for each row in the inverse cone grid do     // <= 99
                for each column in the inverse cone grid do  // <= 99
                    for each voxel intersected by the ray do
                        calculate the point response
end
Number of loop iterations (excluding the innermost voxel loop) = 72 x 14 x 64 x 99 x 99 = 632,282,112 (about 6.3 x 10^8).
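For concreteness, a serial C skeleton of this loop nest might look as follows; the bounds are taken from the comments above, while build_system_matrix, compute_point_response, and the flat SysMat indexing are hypothetical names introduced here for illustration, not the authors' code:

    #include <stddef.h>

    #define N_ANGLES   72
    #define N_BINS_U   14
    #define N_BINS_V   64
    #define GRID_ROWS  99
    #define GRID_COLS  99

    /* Hypothetical helper: handles one (angle, u, v, row, col) combination
       and accumulates its point response into sysmat.                      */
    static void compute_point_response(float *sysmat, int angle, int u, int v,
                                       int row, int col)
    {
        /* Placeholder body: a real implementation would walk the voxels
           intersected by the ray and add their attenuation weights.      */
        size_t bin = ((size_t)angle * N_BINS_U + u) * (size_t)N_BINS_V + v;
        (void)row; (void)col;
        sysmat[bin] += 0.0f;
    }

    static void build_system_matrix(float *sysmat)
    {
        for (int angle = 0; angle < N_ANGLES; ++angle)              /* 72 angles */
            for (int u = 0; u < N_BINS_U; ++u)                      /* ~14 bins  */
                for (int v = 0; v < N_BINS_V; ++v)                  /* ~64 bins  */
                    for (int row = 0; row < GRID_ROWS; ++row)       /* <= 99     */
                        for (int col = 0; col < GRID_COLS; ++col)   /* <= 99     */
                            compute_point_response(sysmat, angle, u, v, row, col);
    }

Every iteration of the five outer loops is independent of the others, which is exactly the property the parallelization below exploits.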

System Matrix Parallelization
Observation: at each angle, each bin's calculations are independent of every other bin's.
Proposal: parallelize all calculations for each angle, e.g. on a GPU.

System Matrix Parallelization on GPU

Parallelized System Matrix Algorithm
Host program:
for each angle do
    run the kernels for all bins at the same time
end
GPU kernel:
for each voxel intersected by the ray do
    calculate the attenuation and store it in SysMat
end
(A minimal OpenCL sketch of this split is given below.)
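As a minimal sketch only, the GPU side could be expressed as an OpenCL C kernel with one work-item per (u, v) detector bin; the argument list, the point_response name, and the flat indexing are assumptions made here for illustration (a real kernel would also take the detector geometry and write one weight per intersected voxel):

    __kernel void point_response(__global float *sysmat,
                                 const int angle,
                                 const int n_bins_u,
                                 const int n_bins_v)
    {
        int u = (int)get_global_id(0);   /* detector bin index in U */
        int v = (int)get_global_id(1);   /* detector bin index in V */
        if (u >= n_bins_u || v >= n_bins_v)
            return;

        /* Each work-item handles one (angle, u, v) bin independently. */
        size_t bin = ((size_t)angle * n_bins_u + u) * (size_t)n_bins_v + v;

        /* Placeholder write: a real kernel would march along this bin's ray,
           compute the attenuation of each intersected voxel, and store the
           resulting weights into SysMat.                                    */
        sysmat[bin] = 0.0f;
    }

Since no two work-items touch the same bin, no atomics or synchronization are needed within an angle.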

SIMD (GPU Architecture)
Figure from: Advanced Micro Devices Inc. (AMD), Introduction to OpenCL Programming, 2010.

OpenCL
Based on ISO C99 with some extensions and restrictions.
Provides parallel computing using task-based and data-based parallelism.
Architecture: host program and kernels.

Program Architecture
Host program
Executes on the host system.
Sends kernels to execute on OpenCL™ devices via a command queue (see the sketch below).
Kernels
Similar to C functions.
Executed on OpenCL™ devices (e.g. the GPU).
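To make the host/kernel split concrete, here is a minimal host-side C sketch using standard OpenCL 1.x API calls; error handling is omitted, kernel_source is assumed to contain the point_response kernel sketched earlier, and run_one_angle is a name invented here for illustration:

    #include <CL/cl.h>
    #include <stddef.h>

    extern const char *kernel_source;  /* assumed to hold the point_response kernel */

    int run_one_angle(cl_int angle, size_t n_bins_u, size_t n_bins_v,
                      float *sysmat_host, size_t sysmat_bytes)
    {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

        cl_program program = clCreateProgramWithSource(ctx, 1, &kernel_source, NULL, NULL);
        clBuildProgram(program, 1, &device, NULL, NULL, NULL);
        cl_kernel kernel = clCreateKernel(program, "point_response", NULL);

        cl_mem sysmat = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sysmat_bytes, NULL, NULL);

        cl_int nu = (cl_int)n_bins_u, nv = (cl_int)n_bins_v;
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &sysmat);
        clSetKernelArg(kernel, 1, sizeof(cl_int), &angle);
        clSetKernelArg(kernel, 2, sizeof(cl_int), &nu);
        clSetKernelArg(kernel, 3, sizeof(cl_int), &nv);

        /* One work-item per (u, v) detector bin, all launched at once. */
        size_t global[2] = { n_bins_u, n_bins_v };
        clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(queue, sysmat, CL_TRUE, 0, sysmat_bytes, sysmat_host,
                            0, NULL, NULL);

        clReleaseMemObject(sysmat);
        clReleaseKernel(kernel);
        clReleaseProgram(program);
        clReleaseCommandQueue(queue);
        clReleaseContext(ctx);
        return 0;
    }

In the parallelized algorithm above, this launch would sit inside the loop over the 72 angles; in practice the platform, context, program, and buffers would be created once outside that loop, and only the kernel arguments and the launch would change per angle.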

Thank You