University of Michigan Electrical Engineering and Computer Science Power-Efficient Medical Image Processing using PUMA Ganesh Dasika, Kevin Fan 1, Scott.

Presentation transcript:

University of Michigan Electrical Engineering and Computer Science
Power-Efficient Medical Image Processing using PUMA
Ganesh Dasika, Kevin Fan (1), Scott Mahlke
(1) Parakinetics, Inc.
University of Michigan, Advanced Computer Architecture Laboratory

Slide 2: The Advent of the GPGPU
Increasingly popular substrate for HPC
– Astrophysics
– Weather Prediction
– EDA
– Financial instrument pricing
– Medical Imaging

Slide 3: Advantages of GPGPUs
High degree of parallelism
– Data-level
– Thread-level
High bandwidth
Commodity products
Increasingly programmable

Slide 4: Disadvantages of GPGPUs
Gap between computation and bandwidth
– 933 GFLOPS : 142 GB/s bandwidth (0.15 B of data per FLOP, ~26:1 compute:mem ratio)
Very high power consumption
– Graphics-specific hardware
– Multiple thread contexts
– Large register files and memories
– Fully general datapath
Inefficiencies in all general-purpose architectures
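The two figures in parentheses on this slide can be reproduced with a few lines of arithmetic. This is only a sketch: the 4-byte word size is an assumption (it matches the 32-bit floating point code discussed later in the talk), and the variable names are illustrative.

```python
# Figures quoted on the slide.
peak_gflops = 933.0      # peak compute, GFLOPS
bandwidth_gbs = 142.0    # memory bandwidth, GB/s

# Assumption (not stated on the slide): operands are 4-byte, 32-bit floats.
BYTES_PER_WORD = 4

# Bytes of memory traffic available per floating point operation.
bytes_per_flop = bandwidth_gbs / peak_gflops

# FLOPs the chip can execute in the time it takes to fetch one word.
words_per_second = bandwidth_gbs / BYTES_PER_WORD   # in Gwords/s
compute_to_mem = peak_gflops / words_per_second

print(f"{bytes_per_flop:.2f} bytes per FLOP")
print(f"{compute_to_mem:.1f}:1 compute:mem ratio")
```

Under that word-size assumption the result rounds to the slide's 0.15 B per FLOP and roughly 26:1.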

Slide 5: Programmability vs Efficiency?
[Figure: design points on Efficiency and Flexibility axes. Labels: Loop Accelerators, ASICs; FPGAs; Domain-specific Accelerators, GPGPUs; DSPs; General Purpose Processors; "???" (highly efficient, some programmability).]

Slide 6: Medical Image Reconstruction
Compute-intensive loops
– 32-bit floating point code
– High data/bandwidth requirements
Increased demand for portability, low power
Much current research focuses on using GPGPUs for this domain

Slide 7: CT Image Reconstruction
X-ray emitters and receptors on opposite sides of the patient
Received X-ray intensity corresponds to tissue density
Multiple scans ("slices") taken around the patient are combined to reconstruct one 2D image

Slide 8: Projection & Sinogram
Projection: all ray-sums in one direction
Sinogram: all projections
[Figure: X-rays through an object f(x,y) in the x-y plane give the projection P(θ, t); the sinogram plots the projections over t and θ.]
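The two definitions on this slide can be illustrated with a toy example. The sketch below is not the authors' code: it handles only the two axis-aligned directions (0 and 90 degrees) so that ray-sums reduce to column and row sums, and the function names `project` and `sinogram` are made up for illustration.

```python
def project(image, angle_deg):
    """Ray-sums of a square image for one axis-aligned direction."""
    if angle_deg == 0:       # rays travel down the columns
        return [sum(col) for col in zip(*image)]
    if angle_deg == 90:      # rays travel along the rows
        return [sum(row) for row in image]
    raise ValueError("this sketch only handles 0 and 90 degrees")


def sinogram(image, angles_deg):
    """Stack one projection per angle; each row of the result is a projection."""
    return [project(image, a) for a in angles_deg]


f = [[1, 2],
     [3, 4]]
sino = sinogram(f, [0, 90])
print(sino)  # [[4, 6], [3, 7]]
```

A real scanner samples many more angles, and non-axis-aligned rays need interpolation, which this sketch leaves out.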

Slide 9: Example: Backprojection
[Figure: Sinogram; Backprojected Image]
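As a rough illustration of unfiltered backprojection, the sketch below smears two axis-aligned projections back across a small grid and sums them. It is an assumption-laden toy (two angles only, hypothetical function name `backproject`), not the method used to produce the slide's images.

```python
def backproject(col_sums, row_sums):
    """Smear each projection back along its rays and add the contributions."""
    n = len(col_sums)
    return [[col_sums[c] + row_sums[r] for c in range(n)] for r in range(n)]


# Projections of a 3x3 image whose only non-zero pixel is the centre (value 5).
col_sums = [0, 5, 0]   # 0-degree projection
row_sums = [0, 5, 0]   # 90-degree projection

image = backproject(col_sums, row_sums)
for row in image:
    print(row)
```

The centre pixel comes out brightest, but its row and column are also lit up. This smearing is the blur that plain backprojection leaves behind.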

Slide 10: Example: Filtered Backprojection
[Figure: Filtered Sinogram; Reconstructed Image]

Slide 11: Reconstruction: Solve for μ's
[Figure: X-ray emitter and detector values around a 4x4 grid of densities, μ11 through μ44, labeled "Human Body".]
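One way to picture "solving for" the per-pixel densities is as a small linear system: each detector value is the sum of the densities along one ray. The sketch below uses a Kaczmarz-style algebraic reconstruction (ART) on a 2x2 grid. This is a swapped-in textbook technique chosen for illustration, not necessarily the authors' algorithm, and the name `mu` for the unknowns, the ray layout, and the measurements are all assumptions.

```python
def art(rays, sums, n_unknowns, sweeps=200):
    """Kaczmarz-style ART: nudge the pixels on each ray until ray-sums match."""
    mu = [0.0] * n_unknowns
    for _ in range(sweeps):
        for ray, measured in zip(rays, sums):
            predicted = sum(mu[i] for i in ray)
            # Spread the mismatch evenly over the pixels this ray crosses.
            correction = (measured - predicted) / len(ray)
            for i in ray:
                mu[i] += correction
    return mu


# Pixels are numbered row by row: 0 1 / 2 3. Hypothetical true densities: 1 2 / 3 4.
rays = [(0, 1), (2, 3),   # two horizontal rays
        (0, 2), (1, 3),   # two vertical rays
        (0, 3)]           # one diagonal ray, needed to make the solution unique
sums = [3.0, 7.0, 4.0, 6.0, 5.0]

mu = art(rays, sums, n_unknowns=4)
print([round(m, 3) for m in mu])
```

Each sweep is one forward projection (predicting the ray-sums) followed by one backward correction.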

Slide 12: Real Reconstruction Problem
Intensity measured
Rays transmitted through multiple "pixels"
Find individual "pixel" values from transmission data
[Figure: grid of unknown ("?") pixel values. Labels: 512 values; 100's of angles.]

Slide 13: Medical Imaging Applications
Image reconstruction for MRI/CT/PET scans
Large amounts of Vector/Thread-level parallelism
FP-intensive kernels
– Often requiring math library functions
Data-intensive (~5:1 compute:mem ratio)

Benchmark             Inner-loop %Scalar/Vector           Outer-loop TLP  Compute:Mem ratio
Segmentation          Fully vectorizable                  Do-all          4:1
Laplacian Filtering   Fully vectorizable                  Do-all          3:1
Gaussian Convolution  Fully vectorizable with predicates  Do-all          6:1
MRI FH Vector         Fully vectorizable                  Do-all          6:1
MRI Q Vector          Fully vectorizable                  Do-all          5.5:1
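The "~5:1" summary figure can be checked against the per-benchmark ratios in the table, and compared with the ~26:1 ratio quoted for the GPU on slide 4. The sketch below does both. The "fraction of peak" estimate assumes a simple model in which a kernel is limited purely by memory bandwidth. That model is an assumption for illustration, not a result from the talk.

```python
# Compute:mem ratios from the benchmark table on this slide.
ratios = {
    "Segmentation": 4.0,
    "Laplacian Filtering": 3.0,
    "Gaussian Convolution": 6.0,
    "MRI FH Vector": 6.0,
    "MRI Q Vector": 5.5,
}

mean_ratio = sum(ratios.values()) / len(ratios)
print(f"mean compute:mem ratio = {mean_ratio:.1f}:1")

# GPU compute:mem ratio quoted on slide 4.
gpu_ratio = 26.0

# Assumed bandwidth-limit model: a kernel that performs r operations per word
# fetched can keep at most r / gpu_ratio of the GPU's peak compute busy.
bound_fraction = {name: min(1.0, r / gpu_ratio) for name, r in ratios.items()}
for name, frac in bound_fraction.items():
    print(f"{name}: at most {frac:.0%} of peak under this model")
```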

Slide 14: Current Concerns: Portability/Power
Currently, most scans require moving patient to imaging room
– Consumes time
– Stress on patient
Studies show benefits of portable, bed-side scanners:
– 86% increase in patients suitable for post-stroke thrombolytic therapy [Weinreb et al, RSNA]
– 80-100% drop in scan-related complications [Gunnarsson et al, J. of Neurosurgery]
New X-Ray emitters push for mAs of current use

Slide 15: Current Concerns: Performance
High-accuracy CT algorithms take too long
– Iterative forward/backward projection
– ~Hours on modern CT scanners instead of minutes
Interventional radiology
– Scans currently take minutes, but should take seconds
CT-Fluoroscopy
– Several scans done in succession

Slide 16: Flexibility
Software algorithms change over time
NRE
Time-to-market

Slide 17: PUMA
Tiled architecture
Bandwidth-matched for improved efficiency
Each tile is a "Programmable Loop Accelerator"
[Figure labels: Extern. Interface, CPU, Mem, Disk, …]

Slide 18: Programmable Loop Accelerator
Generalize accelerator without losing efficiency
[Figure: design points on "Efficiency, Performance" and "Flexibility" axes. Labels: Loop Accelerators, ASICs; Programmable Loop Accelerators; FPGAs; Domain-specific Accelerators, GPGPUs; DSPs; General Purpose Processors; "???".]

Slide 19: Designing Loop Accelerators
[Figure labels: C Code, Loop, Hardware, Point-to-point Connections, BR, CRF, +, &, *, <<, MEM, Local Mem]

Slide 20: Loop Accelerator Architecture
Hardware realization of modulo scheduled loop
Parameterized hardware:
– FUs
– Shift Register Files
– Static Control
– Point-to-point Interconnect
[Figure labels: Point-to-point Connections, +, &, MEM, Local Mem, FSM, Control signals, CRF, BR]
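In modulo scheduling, a new loop iteration starts every II cycles (the initiation interval), and the number of function units of each type puts a lower bound on II. The sketch below computes that resource-constrained bound. The FU counts are taken from the MRI.FH tile on slide 22, but the per-iteration operation counts are hypothetical, and the function name `res_mii` is illustrative. Recurrence constraints, which can raise II further, are ignored here.

```python
import math


def res_mii(op_counts, fu_counts):
    """Resource-constrained lower bound on the initiation interval (II)."""
    return max(math.ceil(op_counts[kind] / fu_counts[kind]) for kind in op_counts)


# FU counts as listed for the MRI.FH tile on slide 22.
fu_counts = {"FP-ADDSUB": 6, "FP-MPY": 9, "MEM": 9}

# Hypothetical operation counts for one loop iteration (not from the talk).
op_counts = {"FP-ADDSUB": 12, "FP-MPY": 9, "MEM": 18}

print("minimum II:", res_mii(op_counts, fu_counts))
```

With these made-up counts, the adders and memory ports each need two cycles per iteration, so II cannot drop below 2.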

Slide 21: Programmable Loop-Accelerator Architecture

Aspect         LA                      PLA
Functionality  Custom FU set           Generalized FUs + MOVs
Storage        Limited size, no addr.  Rotating Reg. Files
Connectivity   Point-to-point          Ring + Port-swapping
Control        Hardwired Control       Lit. Reg. File + Control Mem

[Figure labels: Point-to-point Connections, +/-, &/|, MEM, Local Mem, Control Memory, Control signals, CRF, BR, RR, Literals, Ring; LA elements shown for comparison: +, &, SRF, FSM]
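A rotating register file renames registers each loop iteration, so a value written in one iteration is visible under a different logical name in the next. The sketch below shows the general idea only. The class, its methods, and the choice to decrement the base each iteration follow a common textbook convention and are assumptions, not details of the PLA's actual design.

```python
class RotatingRegisterFile:
    """Toy rotating register file: logical names are offset by a rotating base."""

    def __init__(self, size):
        self.regs = [0] * size
        self.base = 0

    def _physical(self, logical):
        return (logical + self.base) % len(self.regs)

    def write(self, logical, value):
        self.regs[self._physical(logical)] = value

    def read(self, logical):
        return self.regs[self._physical(logical)]

    def rotate(self):
        """Called once per loop iteration: logical r becomes logical r + 1."""
        self.base = (self.base - 1) % len(self.regs)


rrf = RotatingRegisterFile(size=4)
rrf.write(0, 42)      # iteration i produces a value in logical register 0
rrf.rotate()          # iteration i + 1 begins
print(rrf.read(1))    # the same value is now read as logical register 1
```

This lets overlapped iterations of a modulo scheduled loop use the same register names without overwriting each other's live values.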

Slide 22: MRI.FH PLA
~0.6 mm² per tile
38 FUs
bit registers
Inter-FU BW 1 TB/sec

FU Type    #
FP-ADDSUB  6
FP-MPY     9
I-ADDSUB   8
MEM        9
I-MPY      1
Other      5

Slide 23: Performance on MRI.FH PLA
[Chart legend: II preserved, II doubled, Unschedulable]

Slide 24: Efficiency on MRI.FH PLA

Slide 25: PUMA System Design
5 systems designed around 5 benchmarks
Each composed of identical tiles
Assume same B/W as GTX280 (142 GB/s)
# Tiles based on B/W requirements of benchmark
[Figure labels: Extern. Interface, CPU, Mem, Disk, …]
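The sizing rule on this slide (tile count set by the benchmark's bandwidth needs) can be sketched as a division of the fixed 142 GB/s budget by one tile's bandwidth demand. Everything besides the 142 GB/s figure and the compute:mem ratios from slide 13 is an assumption here: the per-tile throughput of 10 GFLOPS is a made-up placeholder, the 4-byte word size is assumed, and `tiles_for` is a hypothetical helper.

```python
import math

GPU_BANDWIDTH_GBS = 142.0   # from the slide: same B/W as GTX280
BYTES_PER_WORD = 4          # assumption: 32-bit operands


def tiles_for(tile_gflops, compute_to_mem):
    """Number of identical tiles the memory system can keep fed."""
    # Words per second one tile consumes, converted to GB/s.
    tile_bandwidth_gbs = tile_gflops / compute_to_mem * BYTES_PER_WORD
    return math.floor(GPU_BANDWIDTH_GBS / tile_bandwidth_gbs)


# Hypothetical tile throughput; not a figure from the talk.
TILE_GFLOPS = 10.0

print("Segmentation (4:1):", tiles_for(TILE_GFLOPS, 4.0), "tiles")
print("MRI FH Vector (6:1):", tiles_for(TILE_GFLOPS, 6.0), "tiles")
```

Under this model, a benchmark with a higher compute:mem ratio needs less bandwidth per tile, so more tiles fit within the same budget.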

Slide 26: System Performance
[Chart labels: 4 W, 3 W, 2.8 W, 2.3 W, 2.7 W]

Slide 27: Performance vs. GPGPU
63% performance of GTX 295
2X performance of GTS 250

Slide 28: Efficiency vs. GPGPU
[Chart labels: 22X, 54X]

Slide 29: Conclusions
Power-efficient accelerator for medical imaging
ASIC-like efficiency with programmability
% of GPU performance
22-54X GPU Performance/Power efficiency

Slide 30: Thank you!! Questions?