COLLABORATIVE EXECUTION ENVIRONMENT FOR HETEROGENEOUS PARALLEL SYSTEMS
Aleksandar Ilić, Leonel Sousa
2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW 2010)
Presented by 陳彥廷, 2012.05.31
Outline
Introduction
Unified execution model
Example
Experiment
Conclusion
Future work
Introduction
Recent trends in computer systems rely on heterogeneity as a basic architectural principle. Almost every commodity desktop computer available today is itself a heterogeneous system. In general, heterogeneous systems can be modeled as a set of interconnected computational resources with distributed address spaces and diverse functionalities.
Introduction (cont.)
[Figure: a heterogeneous system under the master-slave execution paradigm]
Programming challenges in heterogeneous systems
Computation partitioning: satisfy device capabilities/limitations and achieve optimal load balancing.
Data migration: significant and usually asymmetric; a potential execution bottleneck.
Synchronization: devices cannot communicate with each other, so the CPU is in charge.
Different programming models: per device type and vendor-specific high-performance libraries and software.
Application optimization: a very large set of parameters and solutions affects performance.
Abstract structure of the task
Abstract structure of the task (cont.)
Task: the coarser-grained, basic programming unit.
Primitive jobs: finer-grained, minimal program portions for parallel execution.
A task is partitioned into host code and device code:
Host code: the necessary data-arrangement operations, executed only on the host processor prior to any device kernel call.
Device code: a set of functions that drive direct on-device execution.
Abstract structure of the task (cont.)
Tasks are classified along two axes: divisible vs. not divisible into finer-grained primitive jobs, and agglomerative vs. not agglomerative grouping of primitive jobs.
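As a rough sketch of this task structure (the names and fields below are hypothetical, not the paper's actual API), a divisible task pairs host-only data arrangement with an on-device kernel and can split itself into primitive jobs:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PrimitiveJob:
    """Finer-grained, minimal portion of a task runnable in parallel."""
    index: int
    payload: range

@dataclass
class Task:
    """Coarser-grained basic programming unit: host code plus device code."""
    host_code: Callable[[], list]                # data arrangement, host-only
    device_code: Callable[[PrimitiveJob], int]   # stand-in for a device kernel
    divisible: bool = True

    def split(self, n_jobs: int) -> List[PrimitiveJob]:
        """Divide the task's data into primitive jobs (one job if indivisible)."""
        data = self.host_code()
        if not self.divisible:
            return [PrimitiveJob(0, range(len(data)))]
        chunk = max(1, len(data) // n_jobs)
        return [PrimitiveJob(i, range(i * chunk, min(len(data), (i + 1) * chunk)))
                for i in range(n_jobs)]
```

An agglomerative task would additionally allow these primitive jobs to be regrouped into larger batches before dispatch.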
Unified execution model
Task Scheduler: selects the next task for execution according to the configuration parameters, device availability, and dependencies.
Job Dispatcher: assigns a requested device to the task; initiates and controls the on-device execution; synchronizes host and device.
Device Query: identifies and examines all underlying devices; holds per-device information (resource type, status, memory management, and performance history).
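The interplay of the three components can be sketched as follows; this is a minimal illustration with made-up device records and task fields, not the environment's real interface:

```python
# Device Query state: per-device type and status records.
devices = [{"id": "cpu0", "type": "CPU", "busy": False},
           {"id": "gpu0", "type": "GPU", "busy": False}]

# Tasks with dependency sets; "done" tracks completed task names.
tasks = [{"name": "A", "deps": set()}, {"name": "B", "deps": {"A"}}]
done = set()

def schedule_next():
    """Task Scheduler: pick a task whose dependencies are all satisfied."""
    for t in tasks:
        if t["name"] not in done and t["deps"] <= done:
            return t
    return None

def dispatch(task, prefer="GPU"):
    """Job Dispatcher: assign an idle device, preferring the requested
    type and falling back to any other idle device."""
    free = [d for d in devices if not d["busy"]]
    choice = next((d for d in free if d["type"] == prefer),
                  free[0] if free else None)
    if choice is None:
        return None
    choice["busy"] = True
    return choice["id"]
```

In the real environment the dispatcher would also stage data transfers and synchronize the host with the device after each kernel completes.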
Unified execution model (cont.)
Job Queue: arranges the primitive jobs into structures according to the parameters from the task properties.
Job Dispatcher: searches over a set of primitive jobs and maps them to the requested devices.
Agglomeration: selects and groups the primitive jobs into job batches.
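The agglomeration step can be illustrated with a one-line sketch (batch size is a hypothetical tuning parameter; in practice it would come from the task properties and device performance history):

```python
def agglomerate(jobs, batch_size):
    """Group primitive jobs into job batches of at most batch_size each,
    so a fast device receives fewer, larger units of work."""
    return [jobs[i:i + batch_size] for i in range(0, len(jobs), batch_size)]
```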
Parallelism
Task-level parallelism: the scheduler is free to send independent tasks to the Job Dispatcher.
Data-level parallelism: different portions of a single task are executed on several devices simultaneously.
Nested parallelism: a multi-core device is viewed as a single device by the Job Dispatcher; exploited if provided by the application.
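Data-level parallelism implies the load-balancing problem from the introduction: how much of one task each device should receive. A simple proportional split (the speed ratios here are illustrative, not measured values from the paper) looks like:

```python
def partition(n_items, speeds):
    """Data-level parallelism: split n_items work items across devices
    proportionally to their relative speeds, giving any integer-division
    remainder to the fastest device."""
    total = sum(speeds.values())
    shares = {dev: n_items * s // total for dev, s in speeds.items()}
    fastest = max(speeds, key=speeds.get)
    shares[fastest] += n_items - sum(shares.values())
    return shares
```

A real environment would derive the ratios from the per-device performance history kept by the Device Query component.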
Example
Matrix multiplication
3D FFT (fast Fourier transform)
Matrix multiplication
Matrix multiplication
Horowitz scheme: based on divide and conquer.
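The divide-and-conquer idea can be sketched as a recursive block multiplication (a plain quadrant recursion assuming power-of-two dimensions; the paper's Horowitz scheme follows this block structure, but details here are illustrative). The four quadrant products are independent, which is what lets the environment spread them across devices:

```python
def matmul(A, B):
    """Divide-and-conquer square matrix multiply: split each matrix into
    quadrants, recurse, and combine. Assumes n is a power of two."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    def quad(M, r, c):                      # extract an h-by-h quadrant
        return [row[c:c + h] for row in M[r:r + h]]
    def add(X, Y):                          # elementwise matrix addition
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, h), quad(A, h, 0), quad(A, h, h)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, h), quad(B, h, 0), quad(B, h, h)
    # The four quadrant results are independent primitive jobs.
    C11 = add(matmul(A11, B11), matmul(A12, B21))
    C12 = add(matmul(A11, B12), matmul(A12, B22))
    C21 = add(matmul(A21, B11), matmul(A22, B21))
    C22 = add(matmul(A21, B12), matmul(A22, B22))
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```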
3D FFT
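A 3D FFT is separable: it decomposes into three sweeps of independent 1D transforms, one per axis, and each sweep is a batch of row transforms that can be spread over devices. A minimal sketch (naive DFT standing in for a real device FFT kernel; the paper's actual decomposition may differ):

```python
import cmath

def dft1d(x):
    """Naive 1D DFT; a stand-in for an optimized device FFT kernel."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def fft3d(a):
    """3D FFT as three sweeps of independent 1D transforms (z, y, x axes).
    Each sweep is a batch of row transforms: natural primitive jobs."""
    nx, ny, nz = len(a), len(a[0]), len(a[0][0])
    # sweep 1: transform along z (innermost axis)
    a = [[dft1d(a[i][j]) for j in range(ny)] for i in range(nx)]
    # sweep 2: transform along y
    for i in range(nx):
        for k in range(nz):
            col = dft1d([a[i][j][k] for j in range(ny)])
            for j in range(ny):
                a[i][j][k] = col[j]
    # sweep 3: transform along x
    for j in range(ny):
        for k in range(nz):
            col = dft1d([a[i][j][k] for i in range(nx)])
            for i in range(nx):
                a[i][j][k] = col[i]
    return a
```

Each sweep reads the result of the previous one, so data must move between devices between sweeps; this is exactly where interconnection bandwidth becomes the bottleneck seen in the experiments.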
Experimental platform
CPU: Intel Core 2 Quad Q9550, 12 MB L2 cache, running at 2.83 GHz, with 4 GB of DDR2 RAM.
GPU: NVIDIA GeForce GTX 285 with a 1.476 GHz core frequency and 1 GB of global memory.
Interconnection bus: via the Memory Controller Hub, with a 1.33 GHz Front Side Bus on the CPU side and PCI Express 2.0 x16 on the GPU side.
OS: Linux openSUSE 11.1.
Experimental results – matrix multiplication
[Result figure; MKL = Math Kernel Library]
Experimental results – FFT
[Result figures, continued over three slides]
Conclusion
This paper proposed a collaborative execution environment for heterogeneous systems, which was used to program parallel applications by exploiting task and data parallelism. Experimental results show significant performance benefits when both the CPU and the GPU are used for matrix multiplication, whereas the available interconnection bandwidth between the CPU and the GPU limits performance for FFT batches.
Future work
Systems with a higher level of heterogeneity (more GPUs, FPGAs, or special-purpose accelerators).
Performance modeling and application self-tuning.
Adoption of advanced scheduling policies.
Identification of performance-limiting factors to accommodate on-the-fly device selection (e.g., GPU vs. CPU).
Thank you for listening!
Q & A