A Hybrid Task Graph Scheduler for High Performance Image Processing Workflows
Timothy Blattner
NIST | UMBC
Global Conference on Signal and Information Processing, 12/15/2015
Outline
Introduction
Challenges
Image Stitching
Hybrid Task Graph Scheduler
Preliminary Results
Conclusions
Future Work
Credits
Walid Keyrouz (NIST)
Milton Halem (UMBC)
Shuvra Bhattacharyya (UMD)
Introduction
The hardware landscape is changing
◦Multiple GPUs on a node
◦Complex memory hierarchies
◦Traditional software approaches to extracting performance from the hardware are reaching a complexity limit
We present a novel abstract machine model: the hybrid task graph scheduler
◦Builds on hybrid pipeline workflows
◦Scope: a single node with multiple CPUs and GPUs
◦Emphasis on execution pipelines, to scale across multiple GPUs/CPU sockets, and a memory interface, to attach to hierarchies of memory
◦Can be extended beyond a single node (clusters)
Introduction – Future Architectures
The next generation of hybrid architectures
◦A few fat cores alongside many more simpler cores
◦Intel Knights Landing
◦POWER9 + NVIDIA Volta + NVLink (e.g., the Sierra cluster)
◦Faster interconnects
◦Deeper memory hierarchies
Programming methods must present the right machine model to programmers so they can extract performance
Figure: NVIDIA Volta GPU (nvidia.com)
Introduction – Data Transfer Costs
Copying data between address spaces is expensive
◦PCI Express is a bottleneck
Current hybrid CPU+GPU systems contain multiple independent address spaces
◦Unifying the address spaces simplifies the programmer's job and is good for prototyping, but obscures the cost of data motion
Techniques for improving hybrid utilization
◦Provide enough computation per data element
◦Overlap data motion with computation
◦Use a faster bus (80 GB/s NVLink versus 16 GB/s PCIe)
◦NVLink requires multiple GPUs to reach peak performance [NVLink Whitepaper 2014]
Introduction – Complex Memory Hierarchies
Data locality is becoming more complex
◦Non-volatile storage devices: NVMe, 3D XPoint (future), SATA SSD, SATA HDD
◦Volatile memories: HBM / 3D-stacked, DDR, GPU shared memory / L1, L2, L3 caches
These memories need to be modeled within programming methods
◦Effectively utilize each level based on its size and speed
◦Hierarchy-aware programming
Figure: Memory hierarchy speed, cost, and capacity [Ang et al. 2014]
Key Challenges
Changing hardware landscape
◦Hierarchy-aware programming: manage data locality
◦Wider data transfer channels (NVLink): require multi-GPU computation
◦Hybrid computing: utilize all compute resources
A programming and execution machine model is needed to address these challenges
◦The Hybrid Task Graph Scheduler (HTGS) model
◦Expands on hybrid pipeline workflows [Blattner 2013]
Hybrid Pipeline Workflows
Hybrid pipeline workflow system
◦Schedules tasks using a multiple-producer, multiple-consumer model
◦Prototyped in a 2013 Master's thesis [Blattner 2013]
◦Kept all GPUs busy via execution pipelines, one per GPU
◦Stayed within memory limits
◦Overlapped data motion with computation
◦Tailored to image stitching
◦Required significant programming effort to prevent race conditions, manage dependencies, and maintain memory limits
We expand on hybrid pipeline workflows with the Hybrid Task Graph Scheduler (HTGS)
◦Formulates a model that covers a variety of algorithms
◦Reduces programmer effort
Hybrid Workflow Impact – Image Stitching
Image stitching
◦Addresses the scale mismatch between a microscope's field of view and the plate under study
◦Overlapping images must be 'stitched' to form one large image
Three compute stages
◦(S1) Fast Fourier transform (FFT) of each image
◦(S2) Phase correlation image alignment method (PCIAM) [Kuglin & Hines 1975]
◦(S3) Cross-correlation factors (CCFs)
Figure: Image stitching dataflow graph
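The PCIAM stage (S2) follows the standard phase correlation formulation of Kuglin & Hines: take the normalized cross-power spectrum of the two overlapping images' FFTs, invert it, and read the translation off the peak:

```latex
R(u,v) = \frac{F_1(u,v)\,\overline{F_2(u,v)}}{\left|F_1(u,v)\,\overline{F_2(u,v)}\right|},
\qquad
r(x,y) = \mathcal{F}^{-1}\{R\}(x,y),
\qquad
(\Delta x, \Delta y) = \arg\max_{(x,y)} r(x,y)
```

Here \(F_1\) and \(F_2\) are the stage-S1 FFTs of two neighboring images, and \((\Delta x, \Delta y)\) is the estimated relative displacement that stage S3 then refines with cross-correlation factors.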
Hybrid Workflow Impact – Image Stitching
Implementation using traditional parallel techniques (Simple-GPU)
◦Port the computationally intensive components to the GPU; copy data to/from the GPU as needed
◦1.14x end-to-end speedup compared to a sequential CPU-only implementation
◦Data motion dominated the run time
Implementation using the hybrid workflow system [Blattner et al. 2014]
◦Reuses the existing compute kernels
◦24x end-to-end speedup compared to Simple-GPU
◦Scales across multiple GPUs (~1.8x going from one to two GPUs)
◦Requires significant programming effort
HTGS Motivation
Performance gains using a hybrid pipeline workflow
Figure 1: Simple-GPU profile
Figure 2: Hybrid workflow profile
HTGS Motivation
Transforming dataflow graphs into task graphs
Dataflow and Task Graphs
Both consist of a set of vertices and edges
◦A vertex is a task/compute function that implements a function applied to data
◦An edge is data flowing between tasks
◦The main difference between dataflow and task graphs is scheduling
◦An effective method for representing MIMD concurrency
Figure: Example dataflow graph
Figure: Example task graph
HTGS Motivation
Scale to multiple GPUs
◦Partition the task graph into sub-graphs
◦Bind each sub-graph to a separate GPU
Memory interface
◦Represents separate address spaces (CPU, GPU)
◦Managing complex memory hierarchies (future)
Overlap computation with I/O
◦Pipeline computation with I/O
Hybrid Task Graph Scheduler Model
Four primary components
◦Tasks
◦Data
◦Dependency rules
◦Memory rules
Task graphs are constructed from these four components
◦Vertices are tasks
◦Edges are data flow
Figure: Task graph
Hybrid Task Graph Scheduler Model
Tasks
◦The programmer implements 'execute', which defines the functionality of the task
◦Special task types: GPU tasks (bind to a device prior to execution) and the Bookkeeper (manages dependencies)
◦Threading: each task is bound to one or more threads in a thread pool
CUDA Task
Binds a CUDA device to a task
◦Provides a CUDA context and stream to the execute function
◦One CPU thread launches GPU kernels with thousands or millions of GPU threads
Figure: CUDA Task
Memory Interface
Attaches to any task that needs reusable memory
◦Memory is freed based on programmer-defined memory rules
◦A task requests memory from the manager and blocks if no memory is available
◦Acts as a separate channel from the dataflow
Figure: Memory Manager Interface
Hybrid Task Graph Scheduler Model
Execution pipelines
◦Encapsulate a sub-graph
◦Create duplicate instances of the sub-graph
◦Each instance is scheduled and executed using new threads
◦Instances can be distributed among the available GPUs (one instance per GPU)
Figure: Execution Pipeline Task
HTGS API
Using the model, we implement the HTGS API
◦Tasks: default, Bookkeeper, execution pipeline, CUDA
◦Memory interface: attaches to any task to allocate/free/update memory
Prototype HTGS API – Image Stitching
Full implementation in Java, using image stitching as a test case
Figure: Image Stitching Task Graph
Preliminary Results
Machine specifications
◦Two Xeon E5620s (16 logical cores)
◦Two NVIDIA Tesla C2070s and one GTX 680
◦Libraries: JCuda and JCuFFT
◦Baseline implementation: [Blattner et al. 2014]
◦Problem size: 42x59 images (70% overlap)
The HTGS prototype achieves a comparable runtime to the baseline with a 23.6% reduction in code size

HTGS | Exec. Pipeline | GPUs | Runtime (s) | Lines of Code
no (baseline hybrid pipeline workflow) | n/a | 3 | 29.8 | 949
yes | no | 1 | 43.3 | 725
yes | yes | 1 | 41.4 | 726
yes | yes | 2 | 26.6 | 726
yes | yes | 3 | 24.5 | 726
Conclusions
Prototype HTGS API
◦Reduces code size by 23.6% compared to the hybrid pipeline workflow implementation
◦Speedup of 17%
◦Enables multi-GPU execution by adding a single line of code
Coarse-grained parallelism
◦Decomposition of the algorithm and data structures
◦Memory management
◦Data locality
◦Scheduling
Conclusions
The HTGS model and API
◦Scale across multiple GPUs and CPUs
◦Overlap data motion with computation to keep processors busy
◦Provide a memory interface for separate address spaces
◦Currently restricted to a single node with multiple CPUs and multiple NVIDIA GPUs
A tool for representing complex image processing algorithms that require high performance
Future Work
Release a C++ implementation of HTGS (currently in development)
Apply HTGS to other classes of algorithms
◦Out-of-core matrix multiplication and LU factorization
Expand execution pipelines to support clusters and Intel MIC
Image stitching with LIDE++
◦Lightweight dataflow environment [Shen, Plishker, & Bhattacharyya 2012]
◦Tool-assisted acceleration: annotated dataflow graphs
◦Manage memory and data motion
◦Enhanced scheduling and improved concurrency
References
[Ang et al. 2014] Ang, J. A.; Barrett, R. F.; Benner, R. E.; Burke, D.; Chan, C.; Cook, J.; Donofrio, D.; Hammond, S. D.; Hemmert, K. S.; Kelly, S. M.; Le, H.; Leung, V. J.; Resnick, D. R.; Rodrigues, A. F.; Shalf, J.; Stark, D.; Unat, D.; and Wright, N. J. 2014. Abstract machine models and proxy architectures for exascale computing. In Proceedings of the 1st International Workshop on Hardware-Software Co-Design for High Performance Computing, Co-HPC '14, 25–32. IEEE Press.
[Blattner et al. 2014] Blattner, T.; Keyrouz, W.; Chalfoun, J.; Stivalet, B.; Brady, M.; and Zhou, S. 2014. A hybrid CPU-GPU system for stitching large scale optical microscopy images. In 43rd International Conference on Parallel Processing (ICPP), 1–9.
[Blattner 2013] Blattner, T. 2013. A Hybrid CPU/GPU Pipeline Workflow System. Master's thesis, University of Maryland, Baltimore County.
[Shen, Plishker, & Bhattacharyya 2012] Shen, C.; Plishker, W.; and Bhattacharyya, S. S. 2012. Dataflow-based design and implementation of image processing applications. In L. Guan, Y. He, and S.-Y. Kung, editors, Multimedia Image and Video Processing, chapter 24, 609–629. CRC Press, second edition.
[Kuglin & Hines 1975] Kuglin, C. D., and Hines, D. C. 1975. The phase correlation image alignment method. In Proceedings of the 1975 IEEE International Conference on Cybernetics and Society, 163–165.
[NVLink Whitepaper 2014] NVIDIA. 2014. http://www.nvidia.com/object/nvlink.html
Thank You
Questions?
Introduction – Background
Hybrid machines
◦A machine that contains one or more CPUs plus co-processors
◦Intel Xeon Phi
◦NVIDIA Tesla
Hybrid clusters are prominent in high performance computing
◦5 of the 10 fastest computers are hybrid (CPU + co-processor) [Top500 2015]
Figure: GPU expansion node that attaches to existing nodes to add 16 GPUs (image: Tweaktown, http://imagescdn.tweaktown.com/news/4/4/44113_01_one-stop-systems-shows-16-gpu-monster-machine-gtc-2015.jpg)
Hybrid Task Graph Model Design
To transform a dataflow representation into a task graph, the model handles the complexities of parallel execution
◦Dependency rules
◦Declaring memory-intensive components: annotations indicate memory reliance
◦CPU, I/O, and GPU components
Default Task
Executes the user-defined execute function
Sends output data along its output edge
Bound to one or more threads in a thread pool
◦Specified at task creation
Figure: Default Task
Bookkeeper Task
Manages complex dependencies
◦The programmer defines a DepRule
◦Each DepRule can be attached to tasks, enabling branches
1. Data enters the bookkeeper
2. Control passes the data to each DepRule
3. A rule may or may not pass data along its output edge
Figure: Bookkeeper Task
Hybrid Task Graph Scheduler Model
Data
◦Encapsulates all data needed by a task
Dependency rules
◦Apply programmer-defined rules to data
◦May produce new data for other tasks
◦Managed by the Bookkeeper
Memory rules
◦Define how memory is allocated, freed, and updated
◦Attached to tasks as a separate channel from the dataflow
◦The attached task can request memory for use in an allocation, and should block if no memory is available
Hybrid Task Graph Scheduler Model
Execution pipelines
◦Encapsulate a sub-graph
◦Create duplicate instances of the sub-graph
◦Each instance is scheduled and executed using new threads
◦Instances can be distributed among the available GPUs (one instance per GPU)
Figure 1: Task graph without execution pipeline
Figure 2: Task graph with execution pipeline (duplicated 3 times)
Execution Pipeline Task
Partitions the graph into sub-task graphs and duplicates the partitions
Each partition is managed by a bookkeeper
◦Defines decomposition rules: how data is distributed
Each partition is bound to a GPU
◦Shared among all GPU tasks in that sub-task graph
Figure: Execution Pipeline Task