Experiences Accelerating MATLAB Systems Biology Applications Heart Wall Tracking Lukasz Szafaryn, Kevin Skadron University of Virginia

2 Outline
MATLAB
Optimizations to MATLAB
GPU Acceleration with CUDA
Applications (Heart Wall Tracking and Myocyte Simulation)
– Problem
– Algorithm
– Optimization and performance
– Lessons
Conclusions
Future Research

3 MATLAB
Convenient but inefficient programming language of choice for scientists
- Interpreted language
- Most of the existing code and libraries are single-threaded
MATLAB Parallel Toolbox – requires an understanding of parallel programming
Jacket and GPUmat – require large parallelism to justify overhead

4 MATLAB contd.
Interpreted language optimized by JIT compiler – 2x slower than C
MATLAB Embedded Compiler has limited support – x slower than C
MEX Interface to link C code (a minimal gateway is sketched below)
- requires translating to C
- many functions must be written from scratch
- no support for the convenient OpenMP standard; need to use thread libraries
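To make the MEX path concrete, here is a minimal sketch (not from the original slides) of a gateway that wraps a hand-translated C routine so it can still be called from MATLAB; the function name, arguments, and body are hypothetical.

    /* Hypothetical MEX gateway: called from MATLAB as  loc = track_point_mex(frame, template); */
    #include "mex.h"

    static void track_point(const double *frame, const double *tmpl, double *loc, int n)
    {
        /* ...hand-translated body of the original MATLAB function... */
        (void)frame; (void)tmpl; (void)n;
        loc[0] = 0.0;
        loc[1] = 0.0;
    }

    void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
    {
        const double *frame = mxGetPr(prhs[0]);
        const double *tmpl  = mxGetPr(prhs[1]);
        int n = (int)mxGetNumberOfElements(prhs[0]);
        (void)nlhs; (void)nrhs;

        plhs[0] = mxCreateDoubleMatrix(1, 2, mxREAL);   /* tracked (x, y) location */
        track_point(frame, tmpl, mxGetPr(plhs[0]), n);
    }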

5 Acceleration
1. Translation: convert MATLAB to C
2. Parallelization:
– C for multi-core CPU
– CUDA for GPU
Experimental Setup
– CPU: 3.2 GHz quad-core Intel Core 2 Extreme
– GPU: NVIDIA GeForce GTX 280 (PCIe 2.0)
– MS Windows, MS C Compiler

6 Acceleration with GPU (CUDA)
[Diagram: the C program on the CPU drives a CUDA kernel on the GPU through six steps]
- Allocate GPU memory
- Transfer inputs
- Launch kernel
- Return to CPU
- Transfer results
- Free GPU memory
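A minimal host-side sketch of those six steps (not from the original slides; the kernel, buffer names, and 256-thread block size are assumptions):

    #include <cuda_runtime.h>

    __global__ void process(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];                       // placeholder computation
    }

    void run_on_gpu(const float *h_in, float *h_out, int n)
    {
        size_t bytes = (size_t)n * sizeof(float);
        float *d_in, *d_out;

        cudaMalloc(&d_in, bytes);                                  // 1. allocate GPU memory
        cudaMalloc(&d_out, bytes);
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);     // 2. transfer inputs

        process<<<(n + 255) / 256, 256>>>(d_in, d_out, n);         // 3. launch kernel
        cudaDeviceSynchronize();                                   // 4. wait; control returns to the CPU

        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);   // 5. transfer results
        cudaFree(d_in);                                            // 6. free GPU memory
        cudaFree(d_out);
    }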

7 Entire Heart Wall Application
- read first frame from input file, display image [0.533 s] [0.28 %]
- crop image, display image [0.089 s] [0.05 %]
- SRAD, display image [3.224 s] [1.72 %]
- detect edges, display image [0.448 s] [0.24 %]
- morphological transformation, display image [0.275 s] [0.15 %]
- dilate image, display image [0.285 s] [0.15 %]
- inner and outer ellipse parameter setup [0.001 s] [0.00 %]
- create ellipse sample points [0.726 s] [0.39 %]
- track movement of sample points in all frames [ s] [68.99 %]
- display movement of sample points through frames [36.63 s] [19.55 %]
- save outputs into file [0.035 s] [0.02 %]
- Hough Search, display images [ s] [8.47 %]

8 Heart Wall Tracking Description
Speed and shape of contractions provide important information about the body's response to stimulus
Measured by tracking inner and outer heart walls through multiple frames
[Images: Input, Tracking, Output]

9 Heart Wall Tracking Algorithm
Processing 20 inner and 30 outer heart wall points, 50 points in total (TLP)
Processing of each point is a sequence of operations on the surrounding area and template (DLP); see the CUDA sketch below
[Flow diagram over time: for each frame, read the next frame; track inner and outer points in parallel (task-level parallelism, TLP, across points; data-level parallelism, DLP, within each point); save point locations; update templates; repeated for the # of frames]
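A minimal sketch (not from the original slides) of how the two levels of parallelism can map onto CUDA: one thread block per heart wall point (TLP), with the threads of that block sharing the point's surrounding area (DLP). The sizes, names, and the sum-of-squared-differences match are illustrative assumptions; the application's real per-point operations are more involved.

    #define NUM_POINTS 50      // 20 inner + 30 outer heart wall points
    #define AREA_SIZE  1024    // pixels examined around each point (assumed)
    #define THREADS    256

    __global__ void track_points(const float *areas,      // NUM_POINTS * AREA_SIZE pixels
                                 const float *templates,  // NUM_POINTS * AREA_SIZE pixels
                                 float *partials)         // NUM_POINTS * THREADS partial scores
    {
        int point = blockIdx.x;                  // task-level parallelism: one block per point
        float sum = 0.0f;
        for (int i = threadIdx.x; i < AREA_SIZE; i += blockDim.x) {   // data-level parallelism
            float d = areas[point * AREA_SIZE + i] - templates[point * AREA_SIZE + i];
            sum += d * d;                        // compare the area against the point's template
        }
        partials[point * blockDim.x + threadIdx.x] = sum;
        // a block-wide reduction (sketched after the execution diagram below) combines the partials
    }

    // host side: track_points<<<NUM_POINTS, THREADS>>>(d_areas, d_templates, d_partials);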

10 Heart Wall Tracking Performance
Times reported for processing of 300 frames (10 s of ultrasound recording)
[Bar chart of speedups: 1.21x, 2.71x, 10.86x, 2.37x, 3.39x, 5.94x, 6.29x, 33.20x, 36.89x, 41.23x]

11 Porting Tradeoffs
Convenience
- Developing modular code
– replacing each MATLAB function with an equivalent parallelized GPU function
– common routines can possibly be reused in other applications
- Hiding specific aspects of GPU programming inside each module
– each module sets up its own GPU execution parameters
– each module performs its own I/O with the GPU transparently
Performance
- Restructuring and combining code
– writing code based on algorithm tasks rather than MATLAB statements
– overlapping parallel (often unrelated) tasks by executing them in the same kernel call
- Exposing specific aspects of GPU programming
– doing more work inside each GPU kernel call to fully exploit parallelism and avoid overhead
– performing I/O with the GPU manually to eliminate redundant data transfers (sketched below)
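A minimal sketch (not from the original slides) of the performance side of the tradeoff: device buffers are allocated once, templates are transferred once, only the new frame crosses the PCIe bus each iteration, and a single combined kernel replaces many small per-statement kernels. All names are hypothetical, and the kernel body is omitted (its internals are sketched around the neighboring slides).

    #include <cuda_runtime.h>

    #define NUM_POINTS 50

    __global__ void track_frame(const float *frame, const float *templates, float *locations);

    void track_all_frames(const float *h_frames, const float *h_templates, float *h_locations,
                          int num_frames, int frame_elems, int template_elems)
    {
        size_t frame_bytes = (size_t)frame_elems * sizeof(float);
        size_t tmpl_bytes  = (size_t)template_elems * sizeof(float);
        size_t loc_bytes   = NUM_POINTS * 2 * sizeof(float);

        float *d_frame, *d_templates, *d_locations;
        cudaMalloc(&d_frame, frame_bytes);            // device buffers allocated once,
        cudaMalloc(&d_templates, tmpl_bytes);         // reused across all frames
        cudaMalloc(&d_locations, loc_bytes);
        cudaMemcpy(d_templates, h_templates, tmpl_bytes, cudaMemcpyHostToDevice);

        for (int f = 0; f < num_frames; f++) {
            cudaMemcpy(d_frame, h_frames + (size_t)f * frame_elems, frame_bytes,
                       cudaMemcpyHostToDevice);       // only the new frame is transferred
            track_frame<<<NUM_POINTS, 256>>>(d_frame, d_templates, d_locations);
            // intermediate results and templates stay resident on the GPU between calls
        }

        cudaMemcpy(h_locations, d_locations, loc_bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_frame);
        cudaFree(d_templates);
        cudaFree(d_locations);
    }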

12 [Diagram: timeline of the combined tracking kernel across the GPU's SMs. The 20 inner and 30 outer heart points are processed in parallel; within each frame, every point repeats stages of element-wise add/sub/mul/div, a reduction, and a convolution, each separated by a SYNC, with a global SYNC between frames (Frame 1, Frame 2, ...).]
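One "SYNC / Reduction / SYNC" stage from the diagram, as a minimal sketch (not from the original slides): each thread contributes a partial value and the block combines them in shared memory, synchronizing between steps. The 256-thread block size and buffer layout are assumptions.

    __global__ void reduce_stage(const float *partials, float *result_per_point)
    {
        __shared__ float buf[256];
        int tid = threadIdx.x;

        buf[tid] = partials[blockIdx.x * blockDim.x + tid];   // partials from the element-wise stage
        __syncthreads();                                      // SYNC before the reduction

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                buf[tid] += buf[tid + stride];
            __syncthreads();                                  // SYNC after every reduction step
        }
        if (tid == 0)
            result_per_point[blockIdx.x] = buf[0];            // one combined value per heart point
    }

    // host side: reduce_stage<<<NUM_POINTS, 256>>>(d_partials, d_scores);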

13 Heart Wall Tracking Lessons
- Typical MATLAB code written by a scientist has room for optimization – 1.3x
- Conversion to C requires significant coding effort
- Selective offloading results in multiple CPU-GPU data transfer overheads
- Iterative codes require merging kernels and reusing variables to avoid overhead
- CUDA libraries cannot be used as part of GPU code
- Good performance requires significant changes to the structure of the code, which are difficult for a scientist to understand

14 Conclusions
- Limited availability of C libraries necessitates time-consuming coding
- Many systems biology applications (even those with limited parallelism) benefit from the GPU
- GPU overheads are significant (should be eliminated in new CPU-GPU architectures)
- Real-time processing is feasible in the near future
- Ultimately, acceleration of applications should be automated!

15 Future Research
- Automatic acceleration with the use of a compiler
– via use of architecture-specific libraries
– via compiling for the target architecture
- Merging of workloads
– based on resource needs
– based on dependency
- Acceleration with alternative architectures
– well suited for fine-grained parallelism
– esp. FPGAs

16 Acknowledgements
Funding provided by:
– NSF grant IIS
– SRC grant
Equipment donated by NVIDIA

17 Software
Source code will soon be available at:

18 Questions

19 Backup