Parallel Beam Back Projection: Implementation
Srdjan Coric, Miriam Leeser, Eric Miller
Outline
  Annapolis Wildstar
  “Simple Architecture”
    algorithm
    datapath
  Performance Results
  Parallelism extraction
  “Advanced Architecture 4x”
  Implementation issues
  Future directions
Data Flow
  Sinogram data address generation
  Sinogram data retrieval
  Linear interpolation
  Data accumulation (read, write)
  Sinogram data prefetch
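The data flow stages above can be sketched in software. This is a minimal floating-point sketch, not the hardware design: the function name `backproject`, the detector centering, the angle spacing of pi / (number of projections), and the skipping of out-of-range samples are all assumptions made for illustration. Filtering and normalization are omitted.

```python
import math


def backproject(sinogram, image_size):
    """Accumulate all projections into a square image (hypothetical helper)."""
    num_proj = len(sinogram)
    num_samples = len(sinogram[0])
    center_img = (image_size - 1) / 2.0   # assumed image center
    center_det = (num_samples - 1) / 2.0  # assumed detector center
    image = [[0.0] * image_size for _ in range(image_size)]

    for p, projection in enumerate(sinogram):
        theta = math.pi * p / num_proj    # assumed uniform angle spacing
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        for row in range(image_size):
            y = center_img - row
            for col in range(image_size):
                x = col - center_img
                # Sinogram data address generation: integer address
                # plus fractional interpolation factor.
                t = x * cos_t + y * sin_t + center_det
                idx = math.floor(t)
                frac = t - idx
                if idx < 0 or idx + 1 >= num_samples:
                    continue  # pixel projects outside the detector
                # Sinogram data retrieval: two neighboring samples.
                s0, s1 = projection[idx], projection[idx + 1]
                # Linear interpolation between the two samples.
                value = s0 + frac * (s1 - s0)
                # Data accumulation: read the pixel, add, write it back.
                image[row][col] += value
    return image
```

With a constant sinogram, every pixel that stays inside the detector range accumulates one unit per projection, which makes a quick sanity check.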
Interpolation factor error
  Figure labels: corner starting position; LUT1 starting position; critical error-accumulation path
  Error sources: LUT1 quantization error; LUT2 quantization error; LUT3 quantization error; bit reduction error
  LUT labels: LUT2: LUT3: 15 . 2 LUT1: 10 5 1
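How a quantization error grows along an accumulation path can be illustrated with a small sketch. The slide does not say what each LUT holds, so the sketch assumes something simpler: a per-pixel address increment of cos(theta) stored with a fixed number of fractional bits and added repeatedly. The function name and parameters are hypothetical.

```python
import math


def max_accumulated_error(theta, num_pixels, frac_bits):
    """Worst address error after repeatedly adding a quantized step.

    Assumption: the step is cos(theta), rounded to frac_bits fractional bits.
    """
    scale = 1 << frac_bits
    step_exact = math.cos(theta)
    step_q = round(step_exact * scale)  # quantized step as an integer
    acc = 0
    worst = 0.0
    for n in range(1, num_pixels + 1):
        acc += step_q                   # fixed-point accumulation
        err = abs(acc / scale - n * step_exact)
        worst = max(worst, err)
    return worst
```

Under these assumptions the error is bounded by num_pixels * 0.5 / 2**frac_bits, so each extra fractional bit halves the worst case.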
“Simple Architecture” Datapath
Performance Results: Software vs. FPGA Hardware
  Software - Floating point - 450 MHz Pentium: ~240 s
  Software - Floating point - 1 GHz Dual Pentium: ~94 s
  Software - Fixed point - 450 MHz Pentium: ~50 s
  Software - Fixed point - 1 GHz Dual Pentium: ~28 s
  Hardware - 50 MHz: ~5.4 s
  Parameters:
    1024 projections
    1024 samples per projection
    512*512 pixels image
    9-bit sinogram data
    3-bit interpolation factor
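The parameters above include a 3-bit interpolation factor, so the fractional position between two sinogram samples takes one of eight values. A minimal fixed-point sketch of such an interpolation follows. The function name, the truncating shift, and the treatment of samples as plain integers are assumptions; the slides do not specify the rounding used in hardware.

```python
FRAC_BITS = 3  # 3-bit interpolation factor, from the parameters above


def interpolate_fixed(s0, s1, frac):
    """Interpolate between integer samples s0 and s1.

    frac is an integer in 0..7 that stands for frac / 8.
    The shift truncates toward negative infinity (an assumption).
    """
    if not 0 <= frac < (1 << FRAC_BITS):
        raise ValueError("frac must fit in FRAC_BITS bits")
    return s0 + (((s1 - s0) * frac) >> FRAC_BITS)
```

For example, `interpolate_fixed(100, 108, 4)` lands halfway between the two samples.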
Original image / Hardware output image
  Zoom: ~200%
  Grayscale range < Pixel value range (heart features in focus)
Original image / Hardware output image
  Zoom: ~200%
  Grayscale range < Pixel value range (lung features in focus)
Difference image: Original image - Hardware output image
Parallelism Issues
  Case 1 (no parallelism extracted): T ~ k1 * V1
  Case 2 (pixel level parallelism extracted): T ~ k1 * V2
  Case 3 (projection level parallelism extracted): T ~ k2 * V3
  k1 < k2, V2 = V3 = V1/4, T = execution time
  Figure axes: projections, image columns, image rows
  Memory bandwidth requirements at 50 MHz (for data accumulation)
    Case 1: 0.4 GB/s
    Case 2: 1.6 GB/s
    Case 3: 0.4 GB/s
    Memory bandwidth limit: 1.2 GB/s
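The bandwidth figures above can be checked with a short sketch. It assumes that accumulation bandwidth scales with the number of pixels updated per clock cycle, and that projection level parallelism combines several projections into one pixel update, so it adds no accumulation traffic. Both assumptions are consistent with the numbers on the slide but are not stated there; all names are hypothetical.

```python
BASE_GBPS = 0.4      # Case 1 accumulation bandwidth, from the slide
LIMIT_GBPS = 1.2     # memory bandwidth limit, from the slide
PARALLEL_FACTOR = 4  # V2 = V3 = V1 / 4

# Pixels updated per clock cycle in each case (assumed model).
CASES = {
    "Case 1 (no parallelism)": 1,
    "Case 2 (pixel level)": PARALLEL_FACTOR,
    "Case 3 (projection level)": 1,
}


def accumulation_bandwidth_gbps(pixels_per_cycle):
    """Accumulation traffic, assumed proportional to pixels per cycle."""
    return BASE_GBPS * pixels_per_cycle


def fits_memory(pixels_per_cycle):
    """True if the accumulation traffic stays within the memory limit."""
    return accumulation_bandwidth_gbps(pixels_per_cycle) <= LIMIT_GBPS


if __name__ == "__main__":
    for name, ppc in CASES.items():
        gbps = accumulation_bandwidth_gbps(ppc)
        print(f"{name}: {gbps:.1f} GB/s, fits: {fits_memory(ppc)}")
```

Under this model only Case 2 exceeds the 1.2 GB/s limit, which matches the choice of projection level parallelism for the advanced architecture.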
Advanced Architecture - Data Path (projection parallelism extracted)
  Figure label: Simple Architecture
Performance Results: Software vs. FPGA Hardware
  Software - Floating point - 450 MHz Pentium: ~240 s
  Software - Floating point - 1 GHz Dual Pentium: ~94 s
  Software - Fixed point - 450 MHz Pentium: ~50 s
  Software - Fixed point - 1 GHz Dual Pentium: ~28 s
  Hardware - 50 MHz: ~5.4 s
  Hardware (Advanced Architecture) - 50 MHz: ~1.3 s
  Parameters:
    1024 projections
    1024 samples per projection
    512*512 pixels image
    9-bit sinogram data
    3-bit interpolation factor
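The relative speedups follow directly from the timings above. The sketch below only divides the reported values; the dictionary keys are hypothetical labels, and since the timings are approximate, the ratios are approximate too.

```python
# Execution times in seconds, as reported on the slide (approximate).
TIMES_S = {
    "sw_float_450mhz": 240.0,
    "sw_float_1ghz_dual": 94.0,
    "sw_fixed_450mhz": 50.0,
    "sw_fixed_1ghz_dual": 28.0,
    "hw_simple_50mhz": 5.4,
    "hw_advanced_50mhz": 1.3,
}


def speedup(baseline, candidate):
    """How many times faster the candidate is than the baseline."""
    return TIMES_S[baseline] / TIMES_S[candidate]


if __name__ == "__main__":
    for name in TIMES_S:
        ratio = speedup(name, "hw_advanced_50mhz")
        print(f"{name}: {ratio:.1f}x slower than hw_advanced_50mhz")
```

The advanced architecture comes out roughly 4x faster than the simple one, in line with its 4-way parallelism, and roughly 20x faster than the fastest software run listed.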
Implementation Issues - fanout
  prj_num(3): fanout = 1565!
  routing delay = 7.913 ns (~39.99%)
Implementation Issues - fanout
  odd_2_A_4[4]: fanout = 144!
Memory Bridges Stuff
  3 architectures implemented:
    A. “Simple Architecture” = non-parallel (on slide 6)
    B. “Advanced Architecture” = 4-way parallel (slide 12)
    C. “Bridge Free Advanced Arch” = as B, but contains no memory bridges from the PCI bus to the memory banks, which are required for host-memory communication (all design buffers in BlockRAMs)
  The bridges are a separate design that is downloaded before (after) design C is downloaded, so that input data can be stored to (output data read from) the memories on the WildStar board.
  Virtex1000 resource utilization:
    11% logic, 90% BlockRAMs (with bridges)
    39% logic, 100% BlockRAMs
    21% logic, 100% BlockRAMs
Floorplan of the “Bridge Free Advanced Architecture” (design C on the previous slide)
Future Directions Graduate