CUDA Image Registration 29 Oct 2008 Richard Ansorge Medical Image Registration A Quick Win Richard Ansorge.

Presentation transcript:

CUDA Image Registration 29 Oct 2008 Richard Ansorge Medical Image Registration A Quick Win Richard Ansorge

The problem
CT, MRI, PET and ultrasound produce 3D volume images, typically 256 x 256 x 256 = 16,777,216 voxels.
Combining modalities (inter-modality) gives extra information.
Repeated imaging over time with the same modality, e.g. MRI (intra-modality), is equally important.
In both cases the images have to be spatially registered.

Example: brain lesion imaged with CT, MRI and PET

PET-MR Fusion
The PET image shows metabolic activity. This complements the MR structural information.

Registration Algorithm
1. Transform Im B to match Im A, giving Im B′.
2. Compute the cost function between Im A and Im B′.
3. Good fit? If yes, done. If no, update the transform parameters and return to step 1.
NB: the cost function calculation dominates for 3D images and is inherently parallel.
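The loop above can be sketched on the CPU as follows. This is a minimal illustration only: the talk's example run reports an "Amoeba" optimiser, whereas this sketch swaps in a much simpler greedy coordinate search, and the names `register_params`, `step`, `min_step` and `max_sweeps` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <initializer_list>
#include <vector>

// Hypothetical stand-in for the optimiser in the flowchart (NOT the Amoeba
// method used in the talk): greedy coordinate search.
//   cost       : maps transform parameters to a scalar cost (smaller is better)
//   params     : initial guess for the transform parameters
//   step       : initial step size, halved whenever a full sweep finds no improvement
//   min_step   : stop once the step size falls below this value ("good fit")
//   max_sweeps : safety limit on the number of sweeps
std::vector<double> register_params(
    const std::function<double(const std::vector<double>&)>& cost,
    std::vector<double> params, double step, double min_step,
    int max_sweeps = 10000)
{
    assert(step > 0.0 && min_step > 0.0);
    double best = cost(params);                 // compute cost function
    for (int sweep = 0; sweep < max_sweeps && step >= min_step; ++sweep) {
        bool improved = false;
        for (std::size_t i = 0; i < params.size(); ++i) {
            for (double delta : {step, -step}) {
                std::vector<double> trial = params;
                trial[i] += delta;              // update transform parameters
                const double c = cost(trial);   // re-evaluate the cost
                if (c < best) { best = c; params = trial; improved = true; }
            }
        }
        if (!improved) step *= 0.5;             // refine the search
    }
    return params;
}
```

Every call to `cost` here corresponds to one full pass over both volumes, which is why the cost function is the part worth moving to the GPU.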

Transformations
A general affine transform has 12 parameters.
Polynomial transformations can be useful, e.g. for pincushion-type distortions.
Local, non-linear transformations, e.g. using cubic B-splines, are increasingly popular but very computationally demanding.
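For reference, a 12-parameter affine transform can be written in the homogeneous 4x4 form that the kernel later in the talk stores in c_aff. The symbol names a_ij and t_x, t_y, t_z are chosen here for illustration and do not come from the slides.

```latex
\begin{pmatrix} x' \\ y' \\ z' \\ 1 \end{pmatrix}
=
\begin{pmatrix}
a_{11} & a_{12} & a_{13} & t_x \\
a_{21} & a_{22} & a_{23} & t_y \\
a_{31} & a_{32} & a_{33} & t_z \\
0      & 0      & 0      & 1
\end{pmatrix}
\begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}
```

The nine a_ij entries carry rotation, scaling and shear, and the three t entries carry translation, which gives the 12 free parameters; the bottom row is fixed.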

We tried this before

Now: a desktop PC running Windows XP. Needs a 400 W power supply.

Free Software: CUDA & Visual C++ Express

Visual C++ SDK in action

Architecture

9600 GT Device Query
The current GTX 280 has 240 cores!

Matrix Multiply (from SDK)
NB: using 4-byte floats.

Image Registration CUDA Code

    #include <cutil_math.h>   // header name lost in extraction; the SDK's cutil_math.h is assumed (it supplies dot() for float4)

    // Target Image in texture (template arguments lost in extraction;
    // reconstructed here from the tex3D() call below)
    texture<float, 3, cudaReadModeElementType> tex1;
    __constant__ float c_aff[16];   // 4x4 Affine transform

    // Function arguments are image dimensions and pointers to output buffer b
    // and Source Image s. These buffers are in device memory
    __global__ void d_costfun(int nx,int ny,int nz,float *b,float *s)
    {
        int ix = blockIdx.x*blockDim.x + threadIdx.x;   // Thread ID matches
        int iy = blockIdx.y*blockDim.y + threadIdx.y;   // Source Image x-y
        float x = (float)ix;
        float y = (float)iy;
        float z = 0.0f;                                 // start with slice zero

        float4 v  = make_float4(x,y,z,1.0f);
        float4 r0 = make_float4(c_aff[ 0],c_aff[ 1],c_aff[ 2],c_aff[ 3]);
        float4 r1 = make_float4(c_aff[ 4],c_aff[ 5],c_aff[ 6],c_aff[ 7]);
        float4 r2 = make_float4(c_aff[ 8],c_aff[ 9],c_aff[10],c_aff[11]);
        float4 r3 = make_float4(c_aff[12],c_aff[13],c_aff[14],c_aff[15]);   // 0,0,0,1?

        float tx = dot(r0,v);   // Matrix Multiply using dot products
        float ty = dot(r1,v);
        float tz = dot(r2,v);
        float source = 0.0f;
        float target = 0.0f;
        float cost = 0.0f;

        uint is = iy*nx+ix;
        uint istep = nx*ny;
        for(int iz=0;iz<nz;iz++) {   // process all z's in same thread here
            source = s[is];
            target = tex3D(tex1, tx, ty, tz);
            is += istep;
            v.z += 1.0f;
            tx = dot(r0,v);
            ty = dot(r1,v);
            tz = dot(r2,v);
            cost += fabs(source-target);   // other costfuns here as required
        }
        b[iy*nx+ix]=cost;   // store thread sum for host
    }

    texture<float, 3, cudaReadModeElementType> tex1;   // template arguments lost in extraction; reconstructed from the tex3D() call
    __constant__ float c_aff[16];

tex1: moving image, stored as a 3D texture.
c_aff: affine transformation matrix, stored in constant memory.

    // device function declaration
    __global__ void d_costfun(int nx,int ny,int nz,float *b,float *s)

nx, ny and nz: image dimensions (assumed to be the same for both images).
b: output array for partial sums.
s: reference image (mislabelled in the code).

    int ix = blockIdx.x*blockDim.x + threadIdx.x;   // Thread ID matches
    int iy = blockIdx.y*blockDim.y + threadIdx.y;   // Source Image x-y
    float x = (float)ix;
    float y = (float)iy;
    float z = 0.0f;                                 // start with slice zero

Which thread am I? This is similar to MPI, except that there is one thread for each x-y pixel: 240 x 256 = 61,440 threads (cf. ~128 nodes for MPI).
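The index arithmetic above can be checked on the CPU by emulating a launch serially. This is a sketch only: `Dim2`, `global_xy` and `emulate_launch` are hypothetical names, and the 15 x 16 grid of 16 x 16 blocks is what the host code's sizing gives for the 240 x 256 slices of the example run.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Dim2 { int x; int y; };   // hypothetical stand-in for CUDA's dim3/uint3

// Same arithmetic as the kernel: global pixel coordinate of one thread.
Dim2 global_xy(Dim2 block_idx, Dim2 block_dim, Dim2 thread_idx) {
    return { block_idx.x * block_dim.x + thread_idx.x,
             block_idx.y * block_dim.y + thread_idx.y };
}

// Visit every (block, thread) pair of a launch in turn and return how many
// threads land inside an nx x ny image. Asserts that no pixel is hit twice.
int emulate_launch(int nx, int ny, Dim2 grid, Dim2 block) {
    std::vector<int> hits(static_cast<std::size_t>(nx) * ny, 0);
    int in_range = 0;
    for (int by = 0; by < grid.y; ++by)
      for (int bx = 0; bx < grid.x; ++bx)
        for (int ty = 0; ty < block.y; ++ty)
          for (int tx = 0; tx < block.x; ++tx) {
              const Dim2 g = global_xy({bx, by}, block, {tx, ty});
              if (g.x < nx && g.y < ny) {
                  int& h = hits[static_cast<std::size_t>(g.y) * nx + g.x];
                  assert(h == 0);   // each pixel belongs to exactly one thread
                  h = 1;
                  ++in_range;
              }
          }
    return in_range;
}
```

For example, `emulate_launch(240, 256, {15, 16}, {16, 16})` counts one thread per pixel of a 240 x 256 slice.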

    float4 v  = make_float4(x,y,z,1.0f);
    float4 r0 = make_float4(c_aff[ 0],c_aff[ 1],c_aff[ 2],c_aff[ 3]);
    float4 r1 = make_float4(c_aff[ 4],c_aff[ 5],c_aff[ 6],c_aff[ 7]);
    float4 r2 = make_float4(c_aff[ 8],c_aff[ 9],c_aff[10],c_aff[11]);
    float4 r3 = make_float4(c_aff[12],c_aff[13],c_aff[14],c_aff[15]);   // 0,0,0,1?

    float tx = dot(r0,v);   // Matrix Multiply using dot products
    float ty = dot(r1,v);
    float tz = dot(r2,v);
    float source = 0.0f;
    float target = 0.0f;
    float cost = 0.0f;      // accumulates cost function contributions
    v.z = 0.0f;             // z of first slice is zero (redundant as done above)
    uint is = iy*nx+ix;     // this is index of my voxel in first z-slice
    uint istep = nx*ny;     // stride to index same voxel in subsequent slices

Initialisations and first matrix multiply.
"v" is a 4-vector holding the current voxel's x,y,z address.
"tx,ty,tz" hold the corresponding transformed position.
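The matrix multiply by dot products can be reproduced on the CPU as below. This is a sketch under the assumption, suggested by the indexing of c_aff, that the 4x4 matrix is stored row by row; `Float4` and `transform_point` are hypothetical names standing in for CUDA's float4 and the inline kernel code.

```cpp
#include <array>

struct Float4 { float x, y, z, w; };   // hypothetical stand-in for CUDA's float4

// Four-component dot product, as provided for float4 by the SDK helper header.
inline float dot(const Float4& a, const Float4& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

// Apply a row-major 4x4 affine matrix m to the homogeneous point v and return
// the transformed (x, y, z). The fourth row is not needed for an affine map.
std::array<float, 3> transform_point(const float m[16], const Float4& v) {
    const Float4 r0{m[0], m[1], m[2],  m[3]};
    const Float4 r1{m[4], m[5], m[6],  m[7]};
    const Float4 r2{m[8], m[9], m[10], m[11]};
    return {{ dot(r0, v), dot(r1, v), dot(r2, v) }};
}
```

Because only v.z changes between slices, and by exactly 1, each step along z adds m[2], m[6] and m[10] to tx, ty and tz respectively. An incremental update would therefore be a possible alternative to recomputing the three dot products, although rounding errors would accumulate differently.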

    for(int iz=0;iz<nz;iz++) {   // process all z's in same thread here
        source = s[is];
        target = tex3D(tex1, tx, ty, tz);   // NB very FAST trilinear interpolation!!
        is += istep;
        v.z += 1.0f;                        // step to next z slice
        tx = dot(r0,v);
        ty = dot(r1,v);
        tz = dot(r2,v);
        cost += fabs(source-target);        // other costfuns here as required
    }
    b[iy*nx+ix]=cost;   // store thread sum for host

The loop sums contributions for all z values at a fixed x,y position. Each thread updates a different element of the 2D results array b.
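A CPU reference for the same cost function is useful for checking the kernel's output. The sketch below makes several assumptions that are not in the slides: the names `Volume`, `trilinear` and `abs_diff_cost` are hypothetical, out-of-range samples are clamped to the nearest edge voxel, and samples are taken at integer voxel positions. CUDA's hardware linear filtering uses a half-texel offset convention, so values near voxel boundaries will not match the GPU bit for bit.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical volume container: x varies fastest, then y, then z.
struct Volume {
    int nx, ny, nz;
    std::vector<float> v;
    float at(int x, int y, int z) const {     // clamp-to-edge addressing (assumed)
        x = std::min(std::max(x, 0), nx - 1);
        y = std::min(std::max(y, 0), ny - 1);
        z = std::min(std::max(z, 0), nz - 1);
        return v[(static_cast<std::size_t>(z) * ny + y) * nx + x];
    }
};

// Software trilinear interpolation: blend along x, then y, then z.
float trilinear(const Volume& vol, float x, float y, float z) {
    const int x0 = static_cast<int>(std::floor(x));
    const int y0 = static_cast<int>(std::floor(y));
    const int z0 = static_cast<int>(std::floor(z));
    const float fx = x - x0, fy = y - y0, fz = z - z0;
    auto lerp = [](float a, float b, float t) { return a + t * (b - a); };
    const float c00 = lerp(vol.at(x0, y0,     z0),     vol.at(x0 + 1, y0,     z0),     fx);
    const float c10 = lerp(vol.at(x0, y0 + 1, z0),     vol.at(x0 + 1, y0 + 1, z0),     fx);
    const float c01 = lerp(vol.at(x0, y0,     z0 + 1), vol.at(x0 + 1, y0,     z0 + 1), fx);
    const float c11 = lerp(vol.at(x0, y0 + 1, z0 + 1), vol.at(x0 + 1, y0 + 1, z0 + 1), fx);
    return lerp(lerp(c00, c10, fy), lerp(c01, c11, fy), fz);
}

// Sum of absolute differences between ref and the transformed moving image.
// m is a row-major 4x4 affine matrix, as in c_aff.
double abs_diff_cost(const Volume& ref, const Volume& mov, const float m[16]) {
    assert(ref.nx == mov.nx && ref.ny == mov.ny && ref.nz == mov.nz);
    double cost = 0.0;
    for (int iy = 0; iy < ref.ny; ++iy)
      for (int ix = 0; ix < ref.nx; ++ix)        // one GPU thread per (ix, iy)
        for (int iz = 0; iz < ref.nz; ++iz) {    // the kernel's z loop
            const float tx = m[0] * ix + m[1] * iy + m[2]  * iz + m[3];
            const float ty = m[4] * ix + m[5] * iy + m[6]  * iz + m[7];
            const float tz = m[8] * ix + m[9] * iy + m[10] * iz + m[11];
            cost += std::fabs(ref.at(ix, iy, iz) - trilinear(mov, tx, ty, tz));
        }
    return cost;
}
```

With the identity matrix and the same volume passed as both arguments the cost is zero, which makes a convenient first sanity check before comparing against the GPU.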

Host Code Initialization Fragment

    ...
    blockSize.x = blockSize.y = 16;   // multiples of 16 a VERY good idea
    gridSize.x = (w2+15) / blockSize.x;
    gridSize.y = (h2+15) / blockSize.y;

    // allocate working buffers, image is W2 x H2 x D2
    cudaMalloc((void**)&dbuff,w2*h2*sizeof(float));         // passed as "b" to kernel
    bufflen = w2*h2;
    Array1D shbuff = Array1D (bufflen);
    shbuff.Zero();
    hbuff = shbuff.v;
    cudaMalloc((void**)&dnewbuff,w2*h2*d2*sizeof(float));   // passed as "s" to kernel
    cudaMemcpy(dnewbuff,vol2,w2*h2*d2*sizeof(float),cudaMemcpyHostToDevice);

    e = make_float3((float)w2/2.0f,(float)h2/2.0f,(float)d2/2.0f);   // fixed rotation origin
    o = make_float3(0.0f);             // translations
    r = make_float3(0.0f);             // rotations
    s = make_float3(1.0f,1.0f,1.0f);   // scale factors
    t = make_float3(0.0f);             // tans of shears
    ...
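The grid sizing in this fragment is a round-up division. A small sketch, with the hypothetical helper names `blocks_for` and `excess_threads`, shows what it computes and how many threads fall outside the image when a dimension is not a multiple of the block size:

```cpp
#include <cassert>

// Number of blocks of size `block` needed to cover n elements (rounds up).
// With block == 16 this equals the (n + 15) / 16 used in the host code.
inline int blocks_for(int n, int block) {
    assert(n >= 0 && block > 0);
    return (n + block - 1) / block;
}

// Threads launched past the end of the image along one dimension.
inline int excess_threads(int n, int block) {
    return blocks_for(n, block) * block - n;
}
```

For the 240 x 256 slices of the example run both dimensions are multiples of 16, so no thread falls outside the image. For other sizes the kernel as shown would need a guard such as `if (ix >= nx || iy >= ny) return;` before it reads s or writes b.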

Calling the Kernel

    double nr_costfun(Array1D &a)   // a holds current transformation
    {
        static Array2D affine = Array2D (4,4);
        double sum = 0.0;

        make_affine_from_a(nr_fit,affine,a);   // convert to 4x4 matrix of floats
        cudaMemcpyToSymbol(c_aff,affine.v[0],4*4*sizeof(float));   // load constant mem

        // launch configuration lost in extraction; gridSize and blockSize
        // from the initialization fragment are assumed
        d_costfun<<<gridSize,blockSize>>>(w2,h2,d2,dbuff,dnewbuff);   // run kernel
        CUT_CHECK_ERROR("kernel failed");   // OK?
        cudaThreadSynchronize();            // make sure all done

        // copy partial sums from device to host
        cudaMemcpy(hbuff,dbuff,bufflen*sizeof(float),cudaMemcpyDeviceToHost);
        for(int iy=0;iy<h2;iy++)
            for(int ix=0;ix<w2;ix++)
                sum += hbuff[iy*w2+ix];     // final sum

        calls++;
        if(verbose>1){
            printf("call %d costfun %12.0f, a:",calls,sum);
            for(int i=0;i<a.sizex();i++)printf(" %f",a.v[i]);
            printf("\n");
        }
        return sum;
    }
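The final reduction on the host adds the per-thread float sums into a double. A minimal sketch of that step, with the hypothetical name `sum_partials`, is:

```cpp
#include <vector>

// Add up the per-pixel partial sums copied back from the device.
// Accumulating in double keeps the total exact for far longer than a float
// accumulator would, since a float carries only 24 bits of mantissa.
double sum_partials(const std::vector<float>& partial) {
    double sum = 0.0;
    for (float p : partial) sum += p;
    return sum;
}
```

With one partial sum per x-y pixel, the host adds only 61,440 values per cost evaluation for the example images, which is small next to the per-voxel work done on the GPU.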

Example Run (240x256x176 images)

    C:>airwc
    airwc v2.5
    Usage: AirWc opts(12rtdgsf)
    C:>airwc sb1 sb2 junk 1f
    NIFTI Header on File sb1.nii
    converting short to float
    NIFTI Header on File sb2.nii
    converting short to float
    Using device 0: GeForce 9600 GT
    Initial correlation
    using cost function 1 (abs-difference)
    Amoeba time: 4297, calls 802, cost:
    Cuda Total time 4297, Total calls 802
    File dofmat.mat written
    Nifti file junk.nii written, bswop=0
    Full Time 6187
    timer ms timer 1 0 ms timer ms timer ms timer 4 0 ms
    Total secs
    Final Transformation:
    Final rots and shifts
    scales and shears

Desktop 3D Registration
Registration with CUDA: 6 Seconds
Registration with FLIRT: Minutes

Comments
This is actually already very useful. Almost interactive (add visualisation).
Further speedups possible:
–Faster card
–Smarter optimiser
–Overlap IO and kernel execution
–Tweak CUDA code
Extend to non-linear local registration.

Intel Larrabee?
Figure 1: Schematic of the Larrabee many-core architecture. The number of CPU cores and the number and type of co-processors and I/O blocks are implementation-dependent, as are the positions of the CPU and non-CPU blocks on the chip.
Porting from CUDA to Larrabee should be easy.

Thank you