An Effective Dynamic Scheduling Runtime and Tuning System for Heterogeneous Multi and Many-Core Desktop Platforms Authors: Alécio P. D. Binotto, Carlos E. Pereira, Arjan Kuijper, André Stork, and Dieter W. Fellner ytchen
Outline Introduction Motivation System Experiment results Related work Conclusion 2
Introduction Scientific and engineering algorithms that must meet timing constraints commonly require high-performance platforms. Both computation time and performance need to be optimized. Efficiency matters for very large domain sizes as well as for small problems. 4
Introduction Our dynamic scheduling method combines two phases. A first assignment phase schedules a set of high-level tasks (algorithms, for example) based on a pre-processing benchmark that acquires basic performance samples of the tasks on the processing units (PUs). A runtime phase then obtains real performance measurements of the tasks and feeds a performance database. 5
Motivation 3D Computational Fluid Dynamics (CFD) Large computations o velocity field o local pressure Examples o planes o cars 7
Motivation Three iterative solvers for systems of linear equations (SLEs): Jacobi, Red-Black Gauss-Seidel, and Conjugate Gradient o Jacobi: an iterative method for solving a system of linear equations whose matrix is diagonally dominant, i.e., the largest absolute value in each row and column is the diagonal element. o Red-Black Gauss-Seidel: an iterative method used to solve a linear system of equations resulting from the finite difference discretization of partial differential equations. o Conjugate Gradient: an algorithm for the numerical solution of particular systems of linear equations, namely those whose matrix is symmetric and positive-definite. 8
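As a rough illustration of the kind of task being scheduled, the Jacobi method can be sketched in a few lines of plain Python. This is not the implementation from the paper (which targets OpenCL); the function name `jacobi`, the fixed iteration count, and the small test system are assumptions made for the example.

```python
def jacobi(A, b, iterations=50):
    """Approximate the solution of A x = b with Jacobi iterations.

    Assumes A is diagonally dominant, so the iteration converges.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        # Every component is computed from the previous iterate only,
        # which is why the method parallelizes well.
        x = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
    return x


# Hypothetical 2x2 diagonally dominant system: 4x + y = 9, 2x + 5y = 16.
A = [[4.0, 1.0], [2.0, 5.0]]
b = [9.0, 16.0]
solution = jacobi(A, b)
```

Because each component depends only on the previous iterate, all components of one sweep can be computed independently, which is what makes such solvers candidates for either a CPU or a GPU.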
System overview A Unit of Allocation (UA) is represented as a task. 10
Platform Independent Programming Model: OpenCL In its basic principle, the API encapsulates implementations of a task (methods, algorithms, parts of code, etc.) for different PUs, leveraging intrinsic hardware features while keeping the task platform independent. 11
Profiler and Database The profiler monitors tasks and stores their execution times and characteristics in a timing performance database. The characteristics include input data (size and type) and data transfers between PUs, among others. 12
Profiler and Database Performance is measured in host (CPU) clock counts, which intrinsically take into account the data transfer times between the CPU and the PU, possible initialization and synchronization times on the PUs, and latency. 13
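A minimal sketch of host-side profiling, assuming an in-memory dictionary stands in for the timing performance database. The names `TimingDatabase` and `profile` are hypothetical, and wall-clock seconds from `time.perf_counter` replace the clock counts used on the slides.

```python
import time
from collections import defaultdict


class TimingDatabase:
    """Hypothetical in-memory stand-in for the timing performance database."""

    def __init__(self):
        self._samples = defaultdict(list)

    def record(self, task, pu, domain_size, seconds):
        self._samples[(task, pu, domain_size)].append(seconds)

    def mean(self, task, pu, domain_size):
        samples = self._samples.get((task, pu, domain_size))
        return sum(samples) / len(samples) if samples else None


def profile(db, task, pu, domain_size, run):
    """Time `run` on the host, so transfers and synchronization are included."""
    start = time.perf_counter()
    result = run()
    db.record(task, pu, domain_size, time.perf_counter() - start)
    return result


db = TimingDatabase()
value = profile(db, "jacobi", "CPU", 1000, lambda: sum(range(1000)))
```

Timing around the whole call on the host is what lets a single number cover transfer, initialization, and synchronization costs without instrumenting the PU itself.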
Dynamic Scheduler First, it establishes an initial scheduling guess over the PUs as soon as the application(s) start. o First Assignment Phase – FAP Second, for every newly arriving task, it schedules the task by consulting the timing database. o Runtime Assignment Phase – RAP 14
First Assignment Phase – FAP Given a set of tasks with predefined costs for the PUs stored in the database, the first assignment phase schedules the tasks over the asymmetric PUs. Objective: lowest total execution time, with o m: the number of PUs (m = 2) o n: the number of considered tasks o i: task index o j: processor index 15
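The objective can be illustrated with a brute-force search over all assignments of n tasks to m PUs. This sketch is not the paper's algorithm; it assumes that PUs run in parallel, that tasks on the same PU run one after another, and that "total execution time" means the load of the busiest PU. The cost values are made up.

```python
from itertools import product


def best_assignment(costs):
    """costs[i][j] is the cost of task i on PU j.

    Returns (makespan, assignment), where assignment[i] is the PU of task i.
    Exhaustive: tries all m**n assignments, so only usable for small n.
    """
    n = len(costs)
    m = len(costs[0])
    best = None
    for assignment in product(range(m), repeat=n):
        load = [0.0] * m
        for i, j in enumerate(assignment):
            load[j] += costs[i][j]
        makespan = max(load)  # assumed objective: busiest PU finishes last
        if best is None or makespan < best[0]:
            best = (makespan, assignment)
    return best


# Hypothetical costs for n = 3 tasks on m = 2 PUs (column 0 = GPU, 1 = CPU).
costs = [[2.0, 6.0], [3.0, 4.0], [8.0, 5.0]]
makespan, assignment = best_assignment(costs)
```

Here the search places the first two tasks on PU 0 and the third on PU 1. The exponential number of candidates is the reason an exhaustive search serves only as a reference point for cheaper heuristics.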
Runtime Assignment Phase – RAP The arrival of new tasks is modeled as a FIFO (First In, First Out) queue. Assignment reconfiguration: tasks that were already scheduled but not yet executed change their assignment if doing so yields a performance gain. When the database has no entry for a task with a specific domain size, the lookup function retrieves the data for the task at the most similar domain size. 21
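The lookup fallback and the FIFO arrival order can be sketched as follows. The database layout, the function names `lookup_cost` and `assign`, and the cost values are hypothetical; "most similar" is assumed to mean the smallest absolute difference in domain size, and the sketch ignores the current load of each PU.

```python
from collections import deque


def lookup_cost(db, task, pu, domain_size):
    """Return the stored cost at the nearest known domain size, or None."""
    sizes = [size for (t, p, size) in db if t == task and p == pu]
    if not sizes:
        return None
    nearest = min(sizes, key=lambda size: abs(size - domain_size))
    return db[(task, pu, nearest)]


def assign(db, task, domain_size, pus=("GPU", "CPU")):
    """Pick the PU with the lowest looked-up cost (queue load is ignored)."""
    known = {pu: lookup_cost(db, task, pu, domain_size) for pu in pus}
    known = {pu: cost for pu, cost in known.items() if cost is not None}
    return min(known, key=known.get) if known else None


# Hypothetical database entries: (task, PU, domain size) -> measured cost.
db = {
    ("jacobi", "GPU", 1000): 0.8,
    ("jacobi", "GPU", 100000): 3.0,
    ("jacobi", "CPU", 1000): 0.5,
    ("jacobi", "CPU", 100000): 9.0,
}

# Tasks are served in arrival order (FIFO).
queue = deque([("jacobi", 90000), ("jacobi", 2000)])
decisions = []
while queue:
    task, size = queue.popleft()
    decisions.append(assign(db, task, size))
```

Neither requested size exists in the database, so each decision falls back to the nearest stored size: the large problem goes to the GPU and the small one to the CPU.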
Experiment results Domain sizes and execution costs of the tasks on the PUs 23
Experiment results Comparison of allocation heuristics o 0 = GPU, 1 = CPU 24
Experiment results Overhead of the dynamic scheduling using ALG.2 and its gain in comparison to scheduling all tasks to the GPU 25
Experiment results Scheduling techniques for 24 tasks o Overhead: the time to perform the scheduling o Solve time: the execution time to compute the tasks o Total time: overhead + solve time o Error: the total time of a technique compared with the optimal solution without its overhead, ex: ( ) / 6130 o Optimal: exhaustive search 26
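The metrics can be written down as two small helpers. The relative-difference form of the error is an assumption, since the numerator of the slide's example did not survive; the overhead and solve time below are invented, and only the 6130 denominator comes from the slide.

```python
def total_time(overhead, solve_time):
    """Total time = scheduling overhead + time to compute the tasks."""
    return overhead + solve_time


def error(total, optimal_solve_time):
    """Assumed form: relative excess over the optimal solve time."""
    return (total - optimal_solve_time) / optimal_solve_time


# Hypothetical technique: overhead 10 and solve time 6200 (same time unit).
total = total_time(10.0, 6200.0)
err = error(total, 6130.0)
```

Under this reading, a technique is penalized both for a worse schedule and for the time it spends computing that schedule, while the optimal reference is charged no overhead.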
Experiment results Scheduling 24 tasks in the FAP + 42 tasks arriving in the RAP 27
Related work Distributed processing on a CPU-GPU platform Scheduling on a CPU-GPU platform o HEFT (Heterogeneous-Earliest-Finish-Time) 29
Related work
                   StarPU            this paper
execution model    codelets          OpenCL
method             low-level         high-level
motivation         CFD               matrix multiplication
system             runtime system    scheduling database
30
Conclusion This paper presents a context-aware runtime and tuning system based on a compromise between reducing the execution time of engineering applications and the cost of computing the scheduling. We combined a model for a first scheduling, based on an off-line performance benchmark, with a runtime model that keeps track of the real execution times of the tasks, with the goal of extending the scheduling process of OpenCL. 32
Conclusion We achieved an execution time gain of 21.77% in comparison to the static assignment of all tasks to the GPU with a scheduling error of only 0.25% compared to exhaustive search. 33
Thanks for listening! 34