A Hybrid Task Graph Scheduler for High Performance Image Processing Workflows
Timothy Blattner
NIST | UMBC
Global Conference on Signal and Information Processing, 12/15/2015
Outline
Introduction
Challenges
Image Stitching
Hybrid Task Graph Scheduler
Preliminary Results
Conclusions
Future Work
Credits
Walid Keyrouz (NIST)
Milton Halem (UMBC)
Shuvra Bhattacharyya (UMD)
Introduction
The hardware landscape is changing
◦Multiple GPUs on a node
◦Complex memory hierarchies
◦Traditional software approaches to extracting performance from the hardware are reaching a complexity limit
We present a novel abstract machine model: the hybrid task graph scheduler
◦Builds on hybrid pipeline workflows
◦Scope: a single node with multiple CPUs and GPUs
◦Emphasis on execution pipelines, to scale across multiple GPUs/CPU sockets, and a memory interface, to attach to hierarchies of memory
◦Can be extended beyond a single node (clusters)
Introduction – Future Architectures
The next generation of hybrid architectures
◦A few fat cores alongside many more simpler cores
◦Intel Knights Landing
◦POWER9 + NVIDIA Volta + NVLink (e.g., the Sierra cluster)
◦Faster interconnects
◦Deeper memory hierarchies
Programming methods must present the right machine model to programmers so they can extract performance
Figure: NVIDIA Volta GPU (nvidia.com)
Introduction – Data Transfer Costs
Copying data between address spaces is expensive
◦PCI Express is a bottleneck
Current hybrid CPU+GPU systems contain multiple independent address spaces
◦Unifying the address spaces simplifies the programmer's job and is good for prototyping, but obscures the cost of data motion
Techniques for improving hybrid utilization
◦Provide enough computation per data element
◦Overlap data motion with computation
◦Use a faster bus (80 GB/s NVLink versus 16 GB/s PCIe)
◦NVLink requires multiple GPUs to reach peak performance [NVLink Whitepaper 2014]
Introduction – Complex Memory Hierarchies
Data locality is becoming more complex
◦Non-volatile storage devices: NVMe, 3D XPoint (future), SATA SSD, SATA HDD
◦Volatile memories: HBM / 3D-stacked, DDR, GPU shared memory / L1, L2, L3 caches
These memories need to be modeled within programming methods
◦Effectively utilize each level based on its size and speed
◦Hierarchy-aware programming
Figure: Memory hierarchy speed, cost, and capacity [Ang et al. 2014]
Key Challenges
Changing hardware landscape
◦Hierarchy-aware programming: manage data locality
◦Wider data transfer channels (NVLink): require multi-GPU computation
◦Hybrid computing: utilize all compute resources
A programming and execution machine model is needed to address these challenges
◦The Hybrid Task Graph Scheduler (HTGS) model
◦Expands on hybrid pipeline workflows [Blattner 2013]
Hybrid Pipeline Workflows
Hybrid pipeline workflow system
◦Schedules tasks using a multiple-producer, multiple-consumer model
◦Prototyped in a 2013 Master's thesis [Blattner 2013]
◦Kept all GPUs busy via execution pipelines, one per GPU
◦Stayed within memory limits
◦Overlapped data motion with computation
◦Tailored to image stitching
◦Required significant programming effort to prevent race conditions, manage dependencies, and maintain memory limits
We expand on hybrid pipeline workflows with the Hybrid Task Graph Scheduler (HTGS)
◦Formulates a model that covers a variety of algorithms
◦Reduces programmer effort
Hybrid Workflow Impact – Image Stitching
Image stitching
◦Addresses the scale mismatch between a microscope's field of view and the plate under study
◦Overlapping images must be 'stitched' to form one large image
Three compute stages
◦(S1) Fast Fourier transform (FFT) of each image
◦(S2) Phase correlation image alignment method (PCIAM) [Kuglin & Hines 1975]
◦(S3) Cross-correlation factors (CCFs)
Figure: Image stitching dataflow graph
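The PCIAM stage (S2) follows the standard phase correlation formulation of Kuglin & Hines: take the normalized cross-power spectrum of the two overlapping images' FFTs, invert it, and read the translation off the peak:

```latex
R(u,v) = \frac{F_1(u,v)\,\overline{F_2(u,v)}}{\left|F_1(u,v)\,\overline{F_2(u,v)}\right|},
\qquad
r(x,y) = \mathcal{F}^{-1}\{R\}(x,y),
\qquad
(\Delta x, \Delta y) = \arg\max_{(x,y)} r(x,y)
```

Here \(F_1\) and \(F_2\) are the stage-S1 FFTs of two neighboring images, and \((\Delta x, \Delta y)\) is the estimated relative displacement that stage S3 then refines with cross-correlation factors.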
Hybrid Workflow Impact – Image Stitching
Implementation using traditional parallel techniques (Simple-GPU)
◦Port the computationally intensive components to the GPU; copy data to/from the GPU as needed
◦1.14x end-to-end speedup compared to a sequential CPU-only implementation
◦Data motion dominated the run time
Implementation using the hybrid workflow system [Blattner et al. 2014]
◦Reuses the existing compute kernels
◦24x end-to-end speedup compared to Simple-GPU
◦Scales across multiple GPUs (~1.8x going from one to two GPUs)
◦Requires significant programming effort
HTGS Motivation
Performance gains using a hybrid pipeline workflow
Figure 1: Simple-GPU profile
Figure 2: Hybrid workflow profile
HTGS Motivation
Transforming dataflow graphs into task graphs
Dataflow and Task Graphs
Both consist of a set of vertices and edges
◦A vertex is a task/compute function that implements a function applied to data
◦An edge is data flowing between tasks
◦The main difference between dataflow and task graphs is scheduling
◦An effective method for representing MIMD concurrency
Figure: Example dataflow graph
Figure: Example task graph
HTGS Motivation
Scale to multiple GPUs
◦Partition the task graph into sub-graphs
◦Bind each sub-graph to a separate GPU
Memory interface
◦Represents separate address spaces (CPU, GPU)
◦Managing complex memory hierarchies (future)
Overlap computation with I/O
◦Pipeline computation with I/O
Hybrid Task Graph Scheduler Model
Four primary components
◦Tasks
◦Data
◦Dependency rules
◦Memory rules
Task graphs are constructed from these four components
◦Vertices are tasks
◦Edges are data flow
Figure: Task graph
Hybrid Task Graph Scheduler Model
Tasks
◦The programmer implements 'execute', which defines the functionality of the task
◦Special task types: GPU tasks (bind to a device prior to execution) and the Bookkeeper (manages dependencies)
◦Threading: each task is bound to one or more threads in a thread pool
CUDA Task
Binds a CUDA device to a task
◦Provides a CUDA context and stream to the execute function
◦One CPU thread launches GPU kernels with thousands or millions of GPU threads
Figure: CUDA Task
Memory Interface
Attaches to any task that needs reusable memory
◦Memory is freed based on programmer-defined memory rules
◦A task requests memory from the manager and blocks if no memory is available
◦Acts as a separate channel from the dataflow
Figure: Memory Manager Interface
Hybrid Task Graph Scheduler Model
Execution pipelines
◦Encapsulate a sub-graph
◦Create duplicate instances of the sub-graph
◦Each instance is scheduled and executed using new threads
◦Instances can be distributed among the available GPUs (one instance per GPU)
Figure: Execution Pipeline Task
HTGS API
Using the model, we implement the HTGS API
◦Tasks: default, Bookkeeper, execution pipeline, CUDA
◦Memory interface: attaches to any task to allocate/free/update memory
Prototype HTGS API – Image Stitching
Full implementation in Java, using image stitching as a test case
Figure: Image Stitching Task Graph
Preliminary Results
Machine specifications
◦Two Xeon E5620s (16 logical cores)
◦Two NVIDIA Tesla C2070s and one GTX 680
◦Libraries: JCuda and JCuFFT
◦Baseline implementation: [Blattner et al. 2014]
◦Problem size: 42x59 images (70% overlap)
The HTGS prototype achieves a comparable runtime to the baseline with a 23.6% reduction in code size

HTGS | Exec. Pipeline | GPUs | Runtime (s) | Lines of Code
no (baseline hybrid pipeline workflow) | n/a | 3 | 29.8 | 949
yes | no | 1 | 43.3 | 725
yes | yes | 1 | 41.4 | 726
yes | yes | 2 | 26.6 | 726
yes | yes | 3 | 24.5 | 726
Conclusions
Prototype HTGS API
◦Reduces code size by 23.6% compared to the hybrid pipeline workflow implementation
◦Speedup of 17%
◦Enables multi-GPU execution by adding a single line of code
Coarse-grained parallelism
◦Decomposition of the algorithm and data structures
◦Memory management
◦Data locality
◦Scheduling
Conclusions
The HTGS model and API
◦Scale across multiple GPUs and CPUs
◦Overlap data motion with computation to keep processors busy
◦Provide a memory interface for separate address spaces
◦Currently restricted to a single node with multiple CPUs and multiple NVIDIA GPUs
A tool for representing complex image processing algorithms that require high performance
Future Work
Release a C++ implementation of HTGS (currently in development)
Apply HTGS to other classes of algorithms
◦Out-of-core matrix multiplication and LU factorization
Expand execution pipelines to support clusters and Intel MIC
Image stitching with LIDE++
◦Lightweight dataflow environment [Shen, Plishker, & Bhattacharyya 2012]
◦Tool-assisted acceleration: annotated dataflow graphs
◦Manage memory and data motion
◦Enhanced scheduling and improved concurrency
References
[Ang et al. 2014] Ang, J. A.; Barrett, R. F.; Benner, R. E.; Burke, D.; Chan, C.; Cook, J.; Donofrio, D.; Hammond, S. D.; Hemmert, K. S.; Kelly, S. M.; Le, H.; Leung, V. J.; Resnick, D. R.; Rodrigues, A. F.; Shalf, J.; Stark, D.; Unat, D.; and Wright, N. J. 2014. Abstract machine models and proxy architectures for exascale computing. In Proceedings of the 1st International Workshop on Hardware-Software Co-Design for High Performance Computing, Co-HPC '14, 25–32. IEEE Press.
[Blattner et al. 2014] Blattner, T.; Keyrouz, W.; Chalfoun, J.; Stivalet, B.; Brady, M.; and Zhou, S. 2014. A hybrid CPU-GPU system for stitching large scale optical microscopy images. In 43rd International Conference on Parallel Processing (ICPP), 1–9.
[Blattner 2013] Blattner, T. 2013. A Hybrid CPU/GPU Pipeline Workflow System. Master's thesis, University of Maryland, Baltimore County.
[Shen, Plishker, & Bhattacharyya 2012] Shen, C.; Plishker, W.; and Bhattacharyya, S. S. 2012. Dataflow-based design and implementation of image processing applications. In L. Guan, Y. He, and S.-Y. Kung, editors, Multimedia Image and Video Processing, chapter 24, 609–629. CRC Press, second edition.
[Kuglin & Hines 1975] Kuglin, C. D., and Hines, D. C. 1975. The phase correlation image alignment method. In Proceedings of the 1975 IEEE International Conference on Cybernetics and Society, 163–165.
[NVLink Whitepaper 2014] NVIDIA. 2014. http://www.nvidia.com/object/nvlink.html
Thank You
Questions?
Introduction – Background
Hybrid machines
◦A machine that contains one or more CPUs plus co-processors
◦Intel Xeon Phi
◦NVIDIA Tesla
Hybrid clusters are prominent in high performance computing
◦5 of the 10 fastest computers are hybrid (CPU + co-processor) [Top500 2015]
Figure: GPU expansion node that attaches to existing nodes to add 16 GPUs (image: Tweaktown, http://imagescdn.tweaktown.com/news/4/4/44113_01_one-stop-systems-shows-16-gpu-monster-machine-gtc-2015.jpg)
Hybrid Task Graph Model Design
To transform a dataflow representation into a task graph, the model handles the complexities of parallel execution
◦Dependency rules
◦Declaring memory-intensive components: annotations indicate memory reliance
◦CPU, I/O, and GPU components
Default Task
Executes the user-defined execute function
Sends output data along its output edge
Bound to one or more threads in a thread pool
◦Specified at task creation
Figure: Default Task
Bookkeeper Task
Manages complex dependencies
◦The programmer defines a DepRule
◦Each DepRule can be attached to tasks, enabling branches
1. Data enters the bookkeeper
2. Control passes the data to each DepRule
3. A rule may or may not pass data along its output edge
Figure: Bookkeeper Task
Hybrid Task Graph Scheduler Model
Data
◦Encapsulates all data needed by a task
Dependency rules
◦Apply programmer-defined rules to data
◦May produce new data for other tasks
◦Managed by the Bookkeeper
Memory rules
◦Define how memory is allocated, freed, and updated
◦Attached to tasks as a separate channel from the dataflow
◦The attached task can request memory for use in an allocation, and should block if no memory is available
Hybrid Task Graph Scheduler Model
Execution pipelines
◦Encapsulate a sub-graph
◦Create duplicate instances of the sub-graph
◦Each instance is scheduled and executed using new threads
◦Instances can be distributed among the available GPUs (one instance per GPU)
Figure 1: Task graph without execution pipeline
Figure 2: Task graph with execution pipeline (duplicated 3 times)
Execution Pipeline Task
Partitions the graph into sub-task graphs and duplicates the partitions
Each partition is managed by a bookkeeper
◦Defines decomposition rules: how data is distributed
Each partition is bound to a GPU
◦Shared among all GPU tasks in that sub-task graph
Figure: Execution Pipeline Task