A Hybrid Task Graph Scheduler for High Performance Image Processing Workflows
TIMOTHY BLATTNER, NIST | UMBC
12/15/2015, GLOBAL CONFERENCE ON SIGNAL AND INFORMATION PROCESSING

Outline
Introduction
Challenges
Image Stitching
Hybrid Task Graph Scheduler
Preliminary Results
Conclusions
Future Work

Credits
Walid Keyrouz (NIST)
Milton Halem (UMBC)
Shuvra Bhattacharyya (UMD)

Introduction
The hardware landscape is changing
Traditional software approaches to extracting performance from the hardware are reaching a complexity limit
◦Multiple GPUs on a node
◦Complex memory hierarchies
We present a novel abstract machine model
◦Hybrid task graph scheduler
◦Hybrid pipeline workflows
◦Scope: a single node with multiple CPUs and GPUs
◦Emphasis on
  ◦Execution pipelines to scale to multiple GPUs/CPU sockets
  ◦A memory interface to attach to memory hierarchies
◦Can be extended beyond a single node (clusters)

Introduction – Future Architectures
The next generation of hybrid architectures
◦A few fat cores with many more simple cores
◦Intel Knights Landing
◦POWER9 + NVIDIA Volta + NVLink (Sierra cluster)
◦Faster interconnects
◦Deeper memory hierarchies
Programming methods must present the right machine model to programmers so they can extract performance
Figure: NVIDIA Volta GPU (nvidia.com)

Introduction – Data Transfer Costs
Copying data between address spaces is expensive
◦PCI Express is the bottleneck
Current hybrid CPU+GPU systems contain multiple independent address spaces
◦Unifying the address spaces simplifies life for the programmer
◦Good for prototyping
◦But it obscures the cost of data motion
Techniques for improving hybrid utilization (the overlap technique is sketched below)
◦Have enough computation per data element
◦Overlap data motion with computation
◦Use a faster bus (80 GB/s NVLink versus 16 GB/s PCIe)
◦NVLink requires multiple GPUs to reach peak performance [NVLink Whitepaper 2014]
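As a rough illustration of overlapping data motion with computation, the sketch below pipelines a copy stage and a compute stage through a bounded queue; copyToDevice and computeOnDevice are hypothetical stand-ins for real transfer and kernel calls, which would typically be issued on separate CUDA streams.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class CopyComputeOverlap {
    // Stand-in for an asynchronous host-to-device copy of one image tile.
    static float[] copyToDevice(int tile) { return new float[1024]; }
    // Stand-in for launching a GPU kernel on the copied tile.
    static void computeOnDevice(float[] deviceData) { }

    public static void main(String[] args) throws InterruptedException {
        // A bounded queue limits the number of in-flight tiles (memory limit).
        BlockingQueue<float[]> ready = new ArrayBlockingQueue<>(4);

        Thread copier = new Thread(() -> {       // stage 1: data motion
            try {
                for (int tile = 0; tile < 100; tile++) ready.put(copyToDevice(tile));
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread computer = new Thread(() -> {     // stage 2: computation
            try {
                for (int tile = 0; tile < 100; tile++) computeOnDevice(ready.take());
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        copier.start(); computer.start();        // both stages run concurrently,
        copier.join(); computer.join();          // hiding copy time behind compute
    }
}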

Introduction – Complex Memory Hierarchies
Data locality is becoming more complex
◦Non-volatile storage devices
  ◦NVMe
  ◦3D XPoint (future)
  ◦SATA SSD
  ◦SATA HDD
◦Volatile memories
  ◦HBM / 3D stacked
  ◦DDR
  ◦GPU shared memory / L1, L2, L3 cache
We need to model these memories within programming methods (a toy model follows)
◦Utilize each tier effectively based on its size and speed
◦Hierarchy-aware programming
Figure: Memory hierarchy speed, cost, and capacity [Ang et al. 2014]
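One way to picture "utilize based on size and speed" is a table of tiers that a scheduler consults when placing a working set; this is a toy sketch, and the capacity and bandwidth numbers are rough illustrative figures, not measurements.

// Toy model of memory tiers, ordered fastest to slowest; numbers are rough
// illustrative figures only.
enum MemoryTier {
    GPU_SHARED(0.00005, 1000), HBM(16, 500), DDR(256, 60), NVME(2000, 3), SATA_HDD(8000, 0.15);

    final double capacityGiB;   // approximate capacity in GiB
    final double bandwidthGBs;  // approximate bandwidth in GB/s

    MemoryTier(double capacityGiB, double bandwidthGBs) {
        this.capacityGiB = capacityGiB;
        this.bandwidthGBs = bandwidthGBs;
    }

    // Hierarchy-aware placement: pick the fastest tier that fits the data.
    static MemoryTier placeFor(double workingSetGiB) {
        for (MemoryTier tier : values())
            if (tier.capacityGiB >= workingSetGiB) return tier;
        return SATA_HDD;
    }
}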

Key Challenges
The changing hardware landscape demands
◦Hierarchy-aware programming
  ◦Manage data locality
◦Wider data transfer channels (e.g., NVLink)
  ◦Require multi-GPU computation
◦Hybrid computing
  ◦Utilize all compute resources
A programming and execution machine model is needed to address these challenges
◦The Hybrid Task Graph Scheduler (HTGS) model
◦Expands on hybrid pipeline workflows [Blattner 2013]

Hybrid Pipeline Workflows
Hybrid pipeline workflow system
◦Schedules tasks using a multiple-producer multiple-consumer model
◦Prototyped in a 2013 Master's thesis [Blattner 2013]
  ◦Kept all GPUs busy via execution pipelines, one per GPU
  ◦Stayed within memory limits
  ◦Overlapped data motion with computation
◦Tailored for image stitching
◦Required significant programming effort to implement
  ◦Preventing race conditions, managing dependencies, and maintaining memory limits
We expand on hybrid pipeline workflows
◦Formulate a model that covers a variety of algorithms
◦Reduce programmer effort
◦The Hybrid Task Graph Scheduler (HTGS)

Hybrid Workflow Impact – Image Stitching
Image stitching
◦Addresses the scale mismatch between a microscope's field of view and the plate under study
◦Overlapping images must be 'stitched' together to form one large image
◦Three compute stages
  ◦(S1) Fast Fourier Transform (FFT) of each image
  ◦(S2) Phase Correlation Image Alignment Method (PCIAM) [Kuglin & Hines 1975], sketched below
  ◦(S3) Cross Correlation Factors (CCFs)
Figure: Image stitching dataflow graph
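For reference, the standard phase correlation formulation (not spelled out on the slide) is as follows. Given the Fourier transforms $F_1$ and $F_2$ of two overlapping tiles, PCIAM computes the normalized cross-power spectrum and transforms it back to the spatial domain:

$$ R = \frac{F_1 \, F_2^{*}}{\left| F_1 \, F_2^{*} \right|}, \qquad r = \mathcal{F}^{-1}(R) $$

The relative translation between the two tiles is then the location of the peak, $(\Delta x, \Delta y) = \arg\max_{(x,y)} r(x, y)$.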

Hybrid Workflow Impact – Image Stitching
Implementation using traditional parallel techniques (Simple-GPU)
◦Ports the computationally intensive components to the GPU
◦Copies to/from the GPU as needed
◦1.14x end-to-end speedup compared to a sequential CPU-only implementation
◦Data motion dominated the run-time
Implementation using the hybrid workflow system
◦Reuses the existing compute kernels
◦24x end-to-end speedup compared to Simple-GPU
◦Scales across multiple GPUs (~1.8x going from one to two GPUs)
◦Requires significant programming effort [Blattner et al. 2014]

HTGS Motivation
Performance gains using a hybrid pipeline workflow
Figure 1: Simple-GPU profile
Figure 2: Hybrid workflow profile

HTGS Motivation
Transforming dataflow graphs into task graphs

Dataflow and Task Graphs
Both consist of a series of vertices and edges
◦A vertex is a task/compute function
  ◦Implements a function applied to data
◦An edge is data flowing between tasks
◦The main difference between dataflow and task graphs is scheduling
◦An effective method for representing MIMD concurrency
Figure: Example dataflow graph
Figure: Example task graph

HTGS Motivation
Scale to multiple GPUs
◦Partition the task graph into sub-graphs
◦Bind each sub-graph to a separate GPU
Memory interface
◦Represents separate address spaces (CPU, GPU)
◦Managing complex memory hierarchies (future)
Overlap computation with I/O
◦Pipeline computation with I/O

Hybrid Task Graph Scheduler Model
Four primary components
◦Tasks
◦Data
◦Dependency rules
◦Memory rules
Construct task graphs using the four components
◦Vertices are tasks
◦Edges are data flow
Figure: Task graph

Hybrid Task Graph Scheduler Model
Tasks (a sketch follows)
◦The programmer implements 'execute'
  ◦Defines the functionality of the task
◦Special task types
  ◦GPU tasks bind to a device prior to execution
  ◦Bookkeeper manages dependencies
◦Threading
  ◦Each task is bound to one or more threads in a thread pool
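As a rough illustration of the task abstraction, the interface below exposes the programmer-implemented execute function; ITask and FFTTask are hypothetical names invented for this sketch, not the actual HTGS API.

// Hypothetical task abstraction: consumes one input item, produces one output.
interface ITask<I, O> {
    O execute(I input); // programmer-defined functionality of the task
}

// Example: an FFT stage from the stitching workflow.
class FFTTask implements ITask<float[], float[]> {
    @Override
    public float[] execute(float[] image) {
        // stand-in for a real FFT over the image tile
        return image.clone();
    }
}

// The scheduler binds each task to one or more threads in a thread pool; each
// thread repeatedly pulls input from the task's edge, calls execute, and sends
// the result along the output edge.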

CUDA Task
Binds a CUDA graphics card to a task (a sketch follows)
◦Provides a CUDA context and stream to the execute function
◦One CPU thread launches GPU kernels with thousands or millions of GPU threads
Figure: CUDA Task
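A minimal sketch of such a GPU task, assuming the JCuda driver bindings used by the prototype; the CudaTask class is illustrative rather than the actual API, and the only real library call is JCudaDriver.cuCtxSetCurrent, which makes a context current on the calling thread.

import jcuda.driver.CUcontext;
import jcuda.driver.CUstream;
import jcuda.driver.JCudaDriver;

// Hypothetical GPU task: the worker thread binds the task's CUDA context
// before calling execute, so kernels and copies issued inside execute run on
// the bound device and stream.
abstract class CudaTask<I, O> {
    protected final CUcontext context; // context of the GPU this task is bound to
    protected final CUstream stream;   // stream for asynchronous work on that GPU

    CudaTask(CUcontext context, CUstream stream) {
        this.context = context;
        this.stream = stream;
    }

    void bindToDevice() {
        // One CPU thread drives the GPU; kernels launched inside execute can
        // use thousands or millions of GPU threads.
        JCudaDriver.cuCtxSetCurrent(context);
    }

    abstract O execute(I input);
}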

Memory Interface
Attaches to a task needing reusable memory
Memory is freed based on memory rules
◦Programmer defined
A task requests memory from the manager
◦Blocks if no memory is available
Acts as a separate channel from the dataflow (a minimal sketch follows)
Figure: Memory Manager Interface
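A minimal sketch of the blocking behavior, assuming a fixed pool of buffers; MemoryManager and its method names are illustrative, not the HTGS API.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// A fixed pool of buffers is handed out to tasks and recycled when a
// programmer-defined memory rule declares a buffer free.
class MemoryManager {
    private final BlockingQueue<float[]> pool;

    MemoryManager(int count, int bufferSize) {
        pool = new ArrayBlockingQueue<>(count);
        for (int i = 0; i < count; i++) pool.add(new float[bufferSize]);
    }

    float[] request() throws InterruptedException {
        return pool.take(); // blocks when the pool is empty, bounding memory use
    }

    void release(float[] buffer) {
        pool.add(buffer);   // invoked once the memory rule frees the buffer
    }
}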

Hybrid Task Graph Scheduler Model
Execution pipelines
◦Encapsulate a sub-graph
◦Create duplicate instances of the sub-graph
◦Each instance is scheduled and executed using new threads
◦Instances can be distributed among the available GPUs (one instance per GPU)
Figure: Execution Pipeline Task

HTGS API
Using the model, we implement the HTGS API (a wiring sketch follows)
◦Tasks
  ◦Default
  ◦Bookkeeper
  ◦Execution pipeline
  ◦CUDA
◦Memory interface
  ◦Attaches to any task to allocate/free/update memory
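To show how the pieces connect, here is a hypothetical wiring sketch reusing the ITask interface from the earlier task sketch; TaskGraph, addEdge, and the task variables are invented names, not the actual API.

// Hypothetical graph builder: vertices are tasks, edges carry data between a
// producer's output type and a consumer's input type.
class TaskGraph {
    <A, B, C> void addEdge(ITask<A, B> producer, ITask<B, C> consumer) {
        // record the edge; the scheduler later allocates thread pools and
        // queues for each task
    }
}

// Usage for the stitching stages (readTask etc. are hypothetical tasks):
//   TaskGraph graph = new TaskGraph();
//   graph.addEdge(readTask, fftTask);   // S1
//   graph.addEdge(fftTask, pciamTask);  // S2
//   graph.addEdge(pciamTask, ccfTask);  // S3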

Prototype HTGS API – Image Stitching
Full implementation in Java
◦Uses image stitching as a test case
Figure: Image Stitching Task Graph

Preliminary Results
Machine specifications
◦Two Intel Xeon E5620s (16 logical cores)
◦Two NVIDIA Tesla C2070s and one GTX 680
◦Libraries: JCuda and JCuFFT
◦Baseline implementation: [Blattner et al. 2014]
◦Problem size: 42x59 images (70% overlap)
◦The HTGS prototype has a similar runtime to the baseline with a 23.6% reduction in code size
Table columns: HTGS | Exec Pipeline | GPUs | Runtime (s) | Lines of Code (baseline row: the hybrid pipeline workflow)

Conclusions
Prototype HTGS API
◦Reduces code size by 23.6% compared to the hybrid pipeline workflow implementation
◦Speedup of 17%
◦Enables multi-GPU execution by adding a single line of code
Coarse-grained parallelism
◦Decomposition of the algorithm and data structures
◦Memory management
◦Data locality
◦Scheduling

Conclusions
The HTGS model and API
◦Scale using multiple GPUs and CPUs
◦Overlap data motion with computation
◦Keep processors busy
◦Provide a memory interface for separate address spaces
◦Currently restricted to a single node with multiple CPUs and multiple NVIDIA GPUs
A tool for representing complex image processing algorithms that require high performance

Future Work
Release the C++ implementation of HTGS (currently in development)
Apply HTGS to other classes of algorithms
◦Out-of-core matrix multiplication and LU factorization
Expand execution pipelines to support clusters and Intel MIC
Image stitching with LIDE++
◦Lightweight dataflow environment [Shen, Plishker, & Bhattacharyya 2012]
◦Tool-assisted acceleration
  ◦Annotated dataflow graphs
  ◦Manage memory and data motion
◦Enhanced scheduling
  ◦Improved concurrency

References
[Ang et al. 2014] Ang, J. A.; Barrett, R. F.; Benner, R. E.; Burke, D.; Chan, C.; Cook, J.; Donofrio, D.; Hammond, S. D.; Hemmert, K. S.; Kelly, S. M.; Le, H.; Leung, V. J.; Resnick, D. R.; Rodrigues, A. F.; Shalf, J.; Stark, D.; Unat, D.; and Wright, N. J. 2014. Abstract machine models and proxy architectures for exascale computing. In Proceedings of the 1st International Workshop on Hardware-Software Co-Design for High Performance Computing, Co-HPC '14, 25–32. IEEE Press.
[Blattner et al. 2014] Blattner, T.; Keyrouz, W.; Chalfoun, J.; Stivalet, B.; Brady, M.; and Zhou, S. 2014. A hybrid CPU-GPU system for stitching large scale optical microscopy images. In 43rd International Conference on Parallel Processing (ICPP), 1–9.
[Blattner 2013] Blattner, T. 2013. A Hybrid CPU/GPU Pipeline Workflow System. Master's thesis, University of Maryland Baltimore County.
[Shen, Plishker, & Bhattacharyya 2012] Shen, C.; Plishker, W.; and Bhattacharyya, S. S. 2012. Dataflow-based design and implementation of image processing applications. In L. Guan, Y. He, and S.-Y. Kung, editors, Multimedia Image and Video Processing, chapter 24. CRC Press, second edition.
[Kuglin & Hines 1975] Kuglin, C. D., and Hines, D. C. 1975. The phase correlation image alignment method. In Proceedings of the 1975 IEEE International Conference on Cybernetics and Society, 163–165.
[NVLink Whitepaper 2014] NVIDIA. 2014. NVLink whitepaper.

Thank You
Questions?

Introduction – Background
Hybrid machines
◦A machine that contains one or more CPUs plus co-processors
  ◦Intel Xeon Phi
  ◦NVIDIA Tesla
Hybrid clusters are prominent in high performance computing
◦5 of the 10 fastest computers are hybrid (CPU + co-processor) (TOP500, 2015)
Figure: GPU expansion node that attaches to existing nodes to add 16 GPUs (image: Tweaktown)

Hybrid Task Graph Model Design
Transforms a dataflow representation into a task graph
◦A model for parallel execution that handles the complexities
  ◦Dependency rules
  ◦Declared memory-intensive components
    ◦Annotations to indicate memory reliance
  ◦CPU, I/O, and GPU components

Default Task
Executes the user-defined execute function
Sends output data along its output edge
Bound to one or more threads in a thread pool
◦Specified at task creation
Figure: Default Task

Bookkeeper Task
Manages complex dependencies (a DepRule sketch follows)
The programmer defines DepRules
Each DepRule can be attached to tasks
◦Enables branches
1. Data enters
2. Control passes the data to each DepRule
3. A rule may or may not pass data along its output edge
Figure: Bookkeeper Task
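A minimal sketch of a dependency rule, assuming the stitching use case; DepRule, NeighborReadyRule, and the helper types are illustrative names, not the actual API.

// Hypothetical dependency rule: inspects incoming data and decides whether to
// emit work for the next task.
interface DepRule<I, O> {
    O apply(I input); // returns data for the next task, or null if not ready
}

class FftResult { final int tileId; FftResult(int tileId) { this.tileId = tileId; } }
class TilePair { final FftResult a, b; TilePair(FftResult a, FftResult b) { this.a = a; this.b = b; } }

// Example: emit a neighboring-tile pair for alignment only once both FFTs
// exist (the west-neighbor id arithmetic is a simplification).
class NeighborReadyRule implements DepRule<FftResult, TilePair> {
    private final java.util.Map<Integer, FftResult> done = new java.util.HashMap<>();

    @Override
    public TilePair apply(FftResult fft) {
        done.put(fft.tileId, fft);
        FftResult west = done.get(fft.tileId - 1);
        return (west != null) ? new TilePair(west, fft) : null;
    }
}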

Hybrid Task Graph Scheduler Model
Data
◦Encapsulates all of the data a task needs
Dependency rules
◦Apply programmer-defined rules to data
◦May produce new data for other tasks
◦Managed by a Bookkeeper
Memory rules
◦Define how memory is allocated, freed, and updated
◦Attached to tasks as a separate channel
  ◦Separate from the dataflow
◦The attached task can request memory for its allocations
◦Should block if no memory is available

Hybrid Task Graph Scheduler Model
Execution pipelines
◦Encapsulate a sub-graph
◦Create duplicate instances of the sub-graph
◦Each instance is scheduled and executed using new threads
◦Instances can be distributed among the available GPUs (one instance per GPU)
Figure 1: Task graph without an execution pipeline
Figure 2: Task graph with an execution pipeline (duplicated 3 times)

Execution Pipeline Task
Partitions the graph into sub-task graphs (a dispatch sketch follows)
◦Duplicates the partitions
Each partition is managed by a bookkeeper
◦Defines decomposition rules
  ◦How data is distributed
Each partition is bound to a GPU
◦Shared among all GPU tasks in that sub-task graph
Figure: Execution Pipeline Task
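A minimal sketch of the decomposition idea, assuming one duplicated sub-graph per GPU fed by its own queue; ExecutionPipeline and dispatch are illustrative names, and round-robin is just one possible decomposition rule.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical execution pipeline: one input queue per duplicated sub-graph,
// with a decomposition rule routing each tile to a GPU's sub-graph.
class ExecutionPipeline {
    private final List<BlockingQueue<float[]>> inputs = new ArrayList<>();

    ExecutionPipeline(int numGpus) {
        for (int g = 0; g < numGpus; g++)
            inputs.add(new LinkedBlockingQueue<>()); // one sub-graph instance per GPU
    }

    // Decomposition rule: distribute tiles round-robin across the sub-graphs.
    void dispatch(int tileId, float[] tile) throws InterruptedException {
        inputs.get(tileId % inputs.size()).put(tile);
    }
}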