An Effective Dynamic Scheduling Runtime and Tuning System for Heterogeneous Multi and Many-Core Desktop Platforms
Authors: Alécio P. D. Binotto, Carlos E. Pereira, Arjan Kuijper, André Stork, and Dieter W. Fellner

Presentation transcript:

An Effective Dynamic Scheduling Runtime and Tuning System for Heterogeneous Multi and Many-Core Desktop Platforms
Authors: Alécio P. D. Binotto, Carlos E. Pereira, Arjan Kuijper, André Stork, and Dieter W. Fellner
Presented by ytchen

Outline: Introduction, Motivation, System, Experiment results, Related work, Conclusion


Introduction
High-performance platforms are commonly required for scientific and engineering algorithms that must deal appropriately with timing constraints. Execution time therefore needs to be optimized, and efficiency matters both for huge domain sizes and for small problems.

Introduction
Our dynamic scheduling method combines two phases. A first assignment phase schedules a set of high-level tasks (algorithms, for example) based on a pre-processing benchmark that acquires basic performance samples of the tasks on the PUs. A runtime phase then obtains real performance measurements of the tasks and feeds them into a performance database.


Motivation
3D Computational Fluid Dynamics (CFD) involves large computations:
o velocity field
o local pressure
Examples: planes, cars

Motivation
Three iterative solvers for SLEs (Jacobi, Red-Black Gauss-Seidel, and Conjugate Gradient):
o Jacobi: an iterative method for systems of linear equations whose matrix is diagonally dominant, i.e., in each row the diagonal element has the largest absolute value.
o Red-Black Gauss-Seidel: an iterative method used to solve a linear system of equations resulting from the finite difference discretization of partial differential equations.
o Conjugate Gradient: an algorithm for the numerical solution of particular systems of linear equations, namely those whose matrix is symmetric and positive-definite.
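To make the first solver concrete, the following is a minimal plain-Python sketch of a Jacobi iteration (illustrative only, not the paper's OpenCL implementation); it converges when the matrix is strictly diagonally dominant:

```python
def jacobi(A, b, iterations=50):
    """Iteratively solve A x = b; A is a list of rows, b a list of values.
    Converges when A is strictly diagonally dominant."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        x_new = [0.0] * n
        for i in range(n):
            # Each unknown is updated from the *previous* iterate only,
            # which is what makes Jacobi trivially data-parallel.
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x_new[i] = (b[i] - s) / A[i][i]
        x = x_new
    return x

# 2x2 diagonally dominant system: 4x + y = 9, x + 3y = 7
print(jacobi([[4.0, 1.0], [1.0, 3.0]], [9.0, 7.0]))  # ≈ [1.8182, 1.7273]
```

Because each update reads only the previous iterate, every row can be computed independently per iteration, which is why Jacobi maps well to GPUs.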


System overview
A Unit of Allocation (UA) is represented as a task.

Platform-Independent Programming Model: OpenCL
The API encapsulates implementations of a task (methods, algorithms, parts of code, etc.) for different PUs, leveraging intrinsic hardware features while keeping the tasks platform independent.
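The encapsulation idea can be sketched as a task object that bundles one implementation per PU and is dispatched by device name. All names here (`Task`, `saxpy`, the `"cpu"`/`"gpu"` identifiers) are hypothetical; the paper's actual API is OpenCL-based and not shown in the slides:

```python
class Task:
    """A schedulable unit bundling per-PU implementations,
    mirroring the idea of platform-independent task variants."""
    def __init__(self, name):
        self.name = name
        self.impls = {}  # PU identifier -> callable

    def implement(self, pu):
        # Decorator registering an implementation for one PU type.
        def register(fn):
            self.impls[pu] = fn
            return fn
        return register

    def run(self, pu, *args):
        # Dispatch to whichever PU the scheduler selected.
        return self.impls[pu](*args)

saxpy = Task("saxpy")

@saxpy.implement("cpu")
def saxpy_cpu(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

@saxpy.implement("gpu")
def saxpy_gpu(a, x, y):
    # Stand-in for an OpenCL kernel launch on the GPU.
    return [a * xi + yi for xi, yi in zip(x, y)]

print(saxpy.run("cpu", 2.0, [1.0, 2.0], [3.0, 4.0]))  # [5.0, 8.0]
```

The scheduler then only has to choose a PU identifier; the task itself hides which concrete code runs where.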

Profiler and Database
The profiler monitors tasks and stores their execution times and characteristics in a timing performance database: input data (size and type), data transfers between PUs, among others.

Profiler and Database
Performance is measured on the host (CPU) by counting clock cycles, which intrinsically takes into account the data transfer times from/to the CPU to/from the PU, possible initialization and synchronization times on the PUs, and latency.
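Host-side measurement of this kind can be sketched as wall-clock timing around a task invocation, keyed by task, PU, and domain size. This is an illustrative sketch (the names `perf_db`, `profiled`, `estimate` are assumptions, not the paper's API):

```python
import time
from collections import defaultdict

# (task name, PU, domain size) -> list of observed times in seconds
perf_db = defaultdict(list)

def profiled(task_name, pu, size, fn, *args):
    """Run fn and record its wall-clock time on the host.
    Timing on the host side naturally includes transfer and
    synchronization overheads incurred by the chosen PU."""
    start = time.perf_counter()
    result = fn(*args)
    perf_db[(task_name, pu, size)].append(time.perf_counter() - start)
    return result

def estimate(task_name, pu, size):
    """Average of recorded samples; this is what a scheduler would consult."""
    samples = perf_db[(task_name, pu, size)]
    return sum(samples) / len(samples)

profiled("vector_sum", "cpu", 1000, sum, range(1000))
print(estimate("vector_sum", "cpu", 1000))  # a small positive number of seconds
```

Each new execution appends another sample, so the estimates improve as the application runs, which is exactly the feedback loop the runtime phase relies on.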

Dynamic Scheduler
First, it establishes an initial scheduling guess over the PUs just when the application(s) start.
o First Assignment Phase (FAP)
Second, for every newly arriving task, it performs a scheduling decision by consulting the timing database.
o Runtime Assignment Phase (RAP)

First Assignment Phase (FAP)
Given a set of tasks with predefined costs on the PUs stored in the database, the first assignment phase schedules the tasks over the asymmetric PUs, seeking the lowest total execution time:
o m: the number of PUs (here m = 2)
o n: the number of considered tasks
o i: task index
o j: processor index
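The slide's formula is not legible in this transcript; assuming the standard task-assignment formulation with the symbols defined above, the objective would read:

```latex
\min \sum_{i=1}^{n} \sum_{j=1}^{m} c_{ij}\, x_{ij}
\quad \text{s.t.} \quad
\sum_{j=1}^{m} x_{ij} = 1 \;\; \forall i \in \{1,\dots,n\}, \qquad
x_{ij} \in \{0, 1\},
```

where c_ij is the benchmarked cost of task i on PU j and x_ij = 1 iff task i is assigned to PU j.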


Runtime Assignment Phase (RAP)
The arrival of new tasks is modeled as a FIFO (First In, First Out) queue.
Assignment reconfiguration: tasks that were already scheduled but not yet executed change their assignment if doing so promotes a performance gain.
When there is no entry for a task with a specific domain size, the lookup function retrieves the data from the task with the most similar domain size.
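The nearest-domain-size lookup can be sketched as follows (names, database layout, and the example timings are illustrative assumptions): when no exact entry exists, fall back to the recorded size closest to the requested one, then assign the task to the PU with the lowest estimate.

```python
# performance database: (task, PU) -> {domain size: measured time in seconds}
perf_db = {
    ("jacobi", "gpu"): {32**3: 0.004, 64**3: 0.021, 128**3: 0.150},
    ("jacobi", "cpu"): {32**3: 0.002, 64**3: 0.045, 128**3: 0.390},
}

def lookup(task, pu, size):
    """Return the recorded time for `size`, or for the most
    similar recorded domain size when there is no exact entry."""
    entries = perf_db[(task, pu)]
    if size in entries:
        return entries[size]
    nearest = min(entries, key=lambda s: abs(s - size))
    return entries[nearest]

def assign(task, size, pus=("gpu", "cpu")):
    # Pick the PU with the lowest (possibly approximated) estimate.
    return min(pus, key=lambda pu: lookup(task, pu, size))

print(assign("jacobi", 60**3))  # no exact entry; nearest size is 64**3 -> gpu
```

Note how small domains land on the CPU (low transfer overhead dominates) while larger ones prefer the GPU, which is the behavior the paper's experiments exploit.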


Experiment results
Domain sizes and execution costs of the tasks on the PUs

Experiment results
Comparison of allocation heuristics (0 denotes the GPU, 1 the CPU)

Experiment results
Overhead of the dynamic scheduling using ALG. 2 and its gain in comparison to scheduling all tasks to the GPU

Experiment results
Scheduling techniques for 24 tasks:
o Overhead: the time to perform the scheduling
o Solve time: the execution time to compute the tasks
o Total time: overhead + solve time
o Error: a technique's total time compared to the optimal solution without its overhead, e.g., ( ) / 6130
o Optimal: exhaustive search

Experiment results
Scheduling 24 tasks in the FAP plus 42 tasks arriving in the RAP


Related work
Distributed processing on a CPU-GPU platform
Scheduling on a CPU-GPU platform:
o HEFT (Heterogeneous-Earliest-Finish-Time)

Related work StarPUthis paper execution modelcodeletsOpenCL methodlow-levelhigh-level motivationCFDmatrix multiplication systemruntime system scheduling database 30


Conclusion
This paper presents a context-aware runtime and tuning system aimed at reducing the execution time of engineering applications. We combined a model for a first scheduling phase, based on an off-line performance benchmark, with a runtime model that keeps track of the real execution times of the tasks, with the goal of extending the scheduling process of OpenCL.

Conclusion
We achieved an execution-time gain of 21.77% in comparison to statically assigning all tasks to the GPU, with a scheduling error of only 0.25% compared to exhaustive search.

Thanks for listening!