Seamlessly distributing tasks between CPUs, GPUs, disks, clusters, ... Olivier Aumage, Jérôme Clet-Ortega, Nathalie Furmento, Suraj Kumar, Marc Sergent, Samuel Thibault INRIA Storm Team-Project INRIA Bordeaux, LaBRI, University of Bordeaux
Introduction: toward heterogeneous multicore architectures. Multicore is here: hierarchical architectures, manycore, architecture specialization. Now: accelerators (GPGPUs, FPGAs), coprocessors (Xeon Phi), fusion designs, DSPs, all of the above. In the near future: many simple cores, a few full-featured cores, mixed large and small cores.
How to program these architectures? Multicore programming: pthreads, OpenMP, TBB, Cilk, ... Accelerator programming: OpenCL was supposed to be the consensus, OpenMP 4.0? (Often) a pure offloading model: CUDA, libspe, ATI Stream. Network support: MPI, PGAS. Hybrid models? Take advantage of all resources ☺ but complex interactions and distribution ☹. [Diagram: nodes with CPUs, accelerators (*PU) and their memories, connected by a network.]
Task graphs, really? A well-studied expression of parallelism, but it departs from the usual sequential programming style.
Task management: implicit task dependencies. Example: right-looking Cholesky decomposition (from PLASMA).
Write your application as a task graph, even with a sequential-looking source code: the Sequential Task Flow (STF) model. Portable performance: the algorithm remains the same in the long term, and the sequential version can be debugged as such. Only the kernels need to be rewritten (BLAS libraries, multi-target compilers); the runtime handles the parallel execution.
Task graphs are everywhere in HPC: OmpSs, PARSEC (aka DAGuE), StarPU, SuperGlue/DuctTeip, XKaapi, ... OpenMP 4.0 introduced task dependencies. PLASMA/MAGMA: state-of-the-art dense linear algebra. qr_mumps/PaStiX: state-of-the-art sparse linear algebra. ScalFMM-MORSE, ... Very portable. MORSE associate team (Matrices Over Runtime Systems @ Exascale).
Challenging issues at all stages. Applications: programming paradigm; BLAS kernels, FFT, ... Compilers: languages; code generation/optimization. Runtime systems: resource management; task scheduling. Architecture: memory interconnect. [Stack diagram: HPC applications, compiling environment and specific libraries on top of the runtime system, operating system and hardware.] The runtime system offers an expressive interface upward, contributes scheduling expertise, and feeds execution feedback back to the upper layers.
Overview of StarPU
Overview of StarPU. Rationale: dynamic task scheduling on all kinds of PU (general purpose, accelerators/specialized), and memory transfer management: eliminate redundant transfers through a software VSM (Virtual Shared Memory). [Diagram: data A and B replicated across main memory and GPU memories while computing A = A+B.]
The StarPU runtime system. [Stack diagram: HPC applications on top of a high-level data management library, an execution model and a scheduling engine, over specific drivers for CPUs, GPUs, SPUs, ...] Mastering CPUs, GPUs, SPUs, ... *PUs → StarPU.
The StarPU runtime system: the need for runtime systems. “Do dynamically what can't be done statically anymore”: compilers and libraries generate (graphs of) tasks, and additional information is welcome! StarPU provides task scheduling and memory management. [Stack diagram: HPC applications → parallel compilers / parallel libraries → StarPU → drivers (CUDA, OpenCL) → CPU, GPU, ...]
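For orientation, here is a minimal sketch of how an application drives StarPU around its task submissions (error handling and the actual data registrations and submissions, covered in the following slides, are omitted):

    #include <starpu.h>

    int main(void)
    {
        /* Initialize StarPU: detect workers, start the scheduler */
        if (starpu_init(NULL) != 0)
            return 1;

        /* ... register data and submit tasks here ... */

        /* Wait for all submitted tasks to complete */
        starpu_task_wait_for_all();

        /* Stop the workers and release StarPU's internal resources */
        starpu_shutdown();
        return 0;
    }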
Data management: StarPU provides a Virtual Shared Memory (VSM) subsystem: replication, consistency, single writer (or reduction, ...). Inputs and outputs of tasks are references to VSM data.
The StarPU runtime system: task scheduling. Tasks = data inputs and outputs (references to VSM data, e.g. f(A rw, B r, C r)), multiple implementations (e.g. CUDA + CPU), non-preemptible, with dependencies on other tasks. StarPU provides an open scheduling platform: scheduling algorithms are plug-ins.
The StarPU runtime system: who generates the code? A StarPU task is roughly a set of function pointers; StarPU does not generate code. The libraries era: PLASMA + MAGMA, FFTW + CUFFT, ... Or rely on compilers: PGI accelerators, CAPS HMPP, ...
The StarPU runtime system: execution model, illustrated on a task “A += B” when the scheduler picks a GPU. 1. The application submits the task “A += B”. 2. The scheduling engine schedules the task onto a worker (here a GPU driver). 3. The memory management (DSM) layer fetches the data: replicates of A and B are transferred to the GPU memory. 4. The computation is offloaded to the GPU. 5. The driver notifies termination. [Diagram: application, scheduling engine and memory management inside StarPU, driving CPU and GPU drivers over RAM and GPU memory.]
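As an illustration of this execution model, a hedged sketch of what the application side of the “A += B” example could look like; the accumulate kernels, the handle names and the field layout are assumptions for this example, not taken from the slides:

    /* Hypothetical codelet with a CPU and a CUDA implementation of A += B
       (field names follow the StarPU 1.x array form) */
    static struct starpu_codelet accumulate_cl = {
        .cpu_funcs  = { accumulate_cpu_func },
        .cuda_funcs = { accumulate_cuda_func },
        .nbuffers   = 2,
        .modes      = { STARPU_RW, STARPU_R },
    };

    /* A is accessed read-write, B read-only: StarPU fetches the replicates,
       offloads the kernel on the selected worker and notifies termination */
    starpu_insert_task(&accumulate_cl,
                       STARPU_RW, handle_A,
                       STARPU_R,  handle_B,
                       0);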
The StarPU runtime system: development context. History: started about 9 years ago with the PhD thesis of Cédric Augonnet. StarPU main core: ~60k lines of code, written in C. Open source: released under the LGPL, sources freely available (svn repository and nightly tarballs, see https://starpu.gforge.inria.fr/), open to external contributors. [HPPC'08], [Europar'09], [CCPE'11], ... >950 citations.
Programming interface
Scaling a vector: data registration. Register a piece of data to StarPU:

    float array[NX];
    for (unsigned i = 0; i < NX; i++)
        array[i] = 1.0f;

    starpu_data_handle vector_handle;
    starpu_vector_data_register(&vector_handle, 0, (uintptr_t) array, NX, sizeof(array[0]));

Submit tasks... then unregister the data:

    starpu_data_unregister(vector_handle);
Scaling a vector: defining a codelet. A codelet is a multi-versioned kernel: function pointers to the different implementations, plus the number of data parameters managed by StarPU:

    starpu_codelet scal_cl = {
        .cpu_func    = scal_cpu_func,
        .cuda_func   = scal_cuda_func,
        .opencl_func = scal_opencl_func,
        .nbuffers    = 1,
        .modes       = STARPU_RW
    };
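The slides reference scal_cpu_func without showing it; here is a minimal sketch of what such a CPU kernel typically looks like, assuming the standard StarPU vector accessor macros:

    /* CPU implementation of the scaling kernel: buffers[] and cl_arg are
       filled in by StarPU from the codelet and task definitions */
    void scal_cpu_func(void *buffers[], void *cl_arg)
    {
        struct starpu_vector_interface *vector = buffers[0];
        unsigned nx  = STARPU_VECTOR_GET_NX(vector);
        float *data  = (float *) STARPU_VECTOR_GET_PTR(vector);
        float factor = *(float *) cl_arg;

        for (unsigned i = 0; i < nx; i++)
            data[i] *= factor;
    }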
Scaling a vector: defining a task. Define a task that scales the vector by a constant:

    struct starpu_task *task = starpu_task_create();
    task->cl = &scal_cl;
    task->buffers[0].handle = vector_handle;

    float factor = 3.14;
    task->cl_arg = &factor;
    task->cl_arg_size = sizeof(factor);

    starpu_task_submit(task);
    starpu_task_wait(task);
Scaling a vector: defining a task with the starpu_insert_task helper:

    float factor = 3.14;
    starpu_insert_task(&scal_cl,
                       STARPU_RW, vector_handle,
                       STARPU_VALUE, &factor, sizeof(factor),
                       0);
Scaling a vector: defining a task, with OpenMP support from K'Star:

    float factor = 3.14;
    #pragma omp task depend(inout: vector)
    scal(vector, factor);
Task management: implicit task dependencies. Right-looking Cholesky decomposition (from Chameleon):

    for (k = 0 .. tiles - 1) {
        POTRF(A[k,k])
        for (m = k+1 .. tiles - 1)
            TRSM(A[k,k], A[m,k])
        for (m = k+1 .. tiles - 1) {
            SYRK(A[m,k], A[m,m])
            for (n = m+1 .. tiles - 1)
                GEMM(A[m,k], A[n,k], A[n,m])
        }
    }
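A hedged C sketch of how this loop nest maps onto starpu_insert_task calls: A[m][n] is assumed to hold the StarPU handle of each tile, and the potrf/trsm/syrk/gemm codelets are assumed to be defined elsewhere. The dependencies are inferred automatically from the STARPU_R / STARPU_RW accesses, in submission order:

    /* Sequential Task Flow Cholesky: tasks are submitted in the sequential
       order, StarPU infers the task graph from the data accesses */
    for (k = 0; k < tiles; k++) {
        starpu_insert_task(&potrf_cl, STARPU_RW, A[k][k], 0);

        for (m = k + 1; m < tiles; m++)
            starpu_insert_task(&trsm_cl, STARPU_R,  A[k][k],
                                         STARPU_RW, A[m][k], 0);

        for (m = k + 1; m < tiles; m++) {
            starpu_insert_task(&syrk_cl, STARPU_R,  A[m][k],
                                         STARPU_RW, A[m][m], 0);
            for (n = m + 1; n < tiles; n++)
                starpu_insert_task(&gemm_cl, STARPU_R,  A[m][k],
                                             STARPU_R,  A[n][k],
                                             STARPU_RW, A[n][m], 0);
        }
    }
    starpu_task_wait_for_all();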
Task Scheduling
Why do we need task scheduling? Blocked matrix multiplication: things can go (really) wrong even on trivial problems! Static mapping? Not portable, too hard for real-life problems. We need dynamic task scheduling and performance models. [Plot: blocked matrix multiplication performance on 2 Xeon cores, a Quadro FX5800 and a Quadro FX4600.]
Task scheduling: push/pop. When a task is submitted, it first goes into a pool of “frozen tasks” until all its dependencies are met. Then, the task is “pushed” to the scheduler. Idle processing units poll for work (“pop”). Various scheduling policies are available, and can even be user-defined. [Diagram: tasks pushed into the scheduler, popped by CPU and GPU workers.]
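For reference, a hedged sketch of how an application selects one of these scheduling policies at initialization time; the same choice can usually be made without recompiling through the STARPU_SCHED environment variable (e.g. STARPU_SCHED=dmda):

    /* Request the "dmda" (deque model, data aware) built-in policy */
    struct starpu_conf conf;
    starpu_conf_init(&conf);
    conf.sched_policy_name = "dmda";
    starpu_init(&conf);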
Task scheduling: component-based schedulers (S. Archipoff, M. Sergent). Schedulers are built from components: containers, priorities, switches, side effects (prefetch, ...), connected by a push/pull mechanism. [Diagram: tree of scheduling components feeding CPU and GPU workers.]
Prediction-based scheduling: load balancing. Task completion time estimation: history-based, user-defined cost function, or parametric cost model [HPPC'09]. Can be used to implement scheduling policies, e.g. Heterogeneous Earliest Finish Time (HEFT). [Gantt chart: tasks mapped over time on cpu #1-3 and gpu #1-2.]
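A hedged sketch of how a history-based performance model is attached to a codelet so that such predictions become available (field names follow the StarPU 1.x array form; the symbol string just names the calibration history kept on disk):

    /* History-based model: StarPU records execution times per kernel,
       data size and architecture, and reuses them for scheduling */
    static struct starpu_perfmodel scal_model = {
        .type   = STARPU_HISTORY_BASED,
        .symbol = "vector_scal",
    };

    static struct starpu_codelet scal_cl = {
        .cpu_funcs  = { scal_cpu_func },
        .cuda_funcs = { scal_cuda_func },
        .nbuffers   = 1,
        .modes      = { STARPU_RW },
        .model      = &scal_model,   /* used by dm/dmda-like policies */
    };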
Predicting data transfer overhead: motivations. On hybrid platforms (multicore CPUs and GPUs), the PCI-e bus is a precious resource, and data locality has to be traded against load balancing. We cannot avoid all data transfers, but we can minimize them: StarPU keeps track of data replicates and of on-going data movements. [Diagram: A and B replicated across main and GPU memories.]
Prediction-based scheduling: load balancing with data transfer times. Transfer times are sampled through off-line calibration, and can be used to better estimate the overall execution time and to minimize data movements. Further: power overhead [ICPADS'10]. [Gantt chart: tasks and transfers over time on cpu #1-3 and gpu #1-2.]
Mixing PLASMA and MAGMA with StarPU: QR decomposition. Mordor8 (UTK): 16 CPUs (AMD) + 4 GPUs (C1060).
Mixing PLASMA and MAGMA with StarPU: QR decomposition. Mordor8 (UTK): 16 CPUs (AMD) + 4 GPUs (C1060). Adding 12 CPUs brings ~200 GFlops, whereas those CPUs alone are measured at ~150 GFlops! Thanks to heterogeneity.
Mixing PLASMA and MAGMA with StarPU: “super-linear” efficiency in QR? Kernel efficiency:

    Kernel    CPU (GFlops)   GPU (GFlops)   Speedup
    sgeqrt    9              30             ~3
    stsqrt    12             37             ~3
    somqr     8.5            227            ~27
    sssmqr    10             285            ~28

Task distribution observed with StarPU: sgeqrt, 20% of tasks on GPUs; sssmqr, 92.5% of tasks on GPUs. Taking advantage of heterogeneity: only do what you are good at, don't do what you are not good at.
Sparse matrix algebra
Cluster support: master/slave mode. The master unrolls the whole task graph and schedules tasks between nodes, taking data transfer costs into account and using state-of-the-art scheduling. It currently also schedules tasks inside nodes, but we are working on leaving that to the nodes themselves. Data transfers currently use MPI, but other network drivers could easily be used. Reasonable scaling.
How about the disk? StarPU out-of-core support: a disk is a memory node, just like main memory is relative to GPU memory. Two usages: a "swap" area for least-used data (starpu_disk_register("swap")), or a huge matrix stored on disk with parts loaded and evicted on demand (starpu_disk_open("file.dat")). The disk can be a network filesystem; StarPU handles loads and stores on demand. [Diagram: GPU memory, main memory, (network) disk memory.]
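A hedged sketch of registering a disk as an additional memory node with the out-of-core API; starpu_disk_register and the starpu_disk_unistd_ops backend are part of StarPU, while the path and size below are arbitrary choices for the example:

    /* Register a 200 MB disk memory node backed by plain files in /tmp;
       StarPU can then evict least-used data there and reload it on demand */
    int disk_node = starpu_disk_register(&starpu_disk_unistd_ops,
                                         (void *) "/tmp/starpu_ooc",
                                         (starpu_ssize_t) (200 * 1024 * 1024));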
How about the disk with cluster support? StarPU out-of-core support: the network disk is a shared memory node, and the local disk acts as a cache. [Diagram: per-node RAM and local disks in front of a shared network disk memory.]
Simulation with SimGrid. Run the application natively on the target system and record its performance models. Rebuild the application against a simgrid-compiled StarPU and run again: performance model estimations are used instead of actually executing the tasks. Much faster execution time, reproducible experiments, no need to run on the target system, and the system architecture can be changed.
Theoretical “area” and critical-path bounds: we would not be able to do much better. Express the task graph as a Linear Programming or Constraint Programming problem, with heterogeneous task durations and heterogeneous resources, also taking some critical paths into account.
Conclusion: summary. Tasks are a nice programming model: keep the sequential program! Scales to large platforms. A runtime, scheduling and algorithmic playground. Used for various computations: Cholesky/QR/LU (dense and sparse), FFT, stencil, CG, FMM, ... http://starpu.gforge.inria.fr [Stack diagram: HPC applications, parallel compilers and libraries over the runtime system (scheduling expertise), operating system, CPU, GPU, ...]
Conclusion: future work. Load balancing for MPI, at submission time (it shouldn't hurt too much). Granularity is a major concern: finding the optimal block size? Offline parameter auto-tuning, dynamically adapting the block size. Parallel CPU tasks (OpenMP, TBB, PLASMA parallel tasks): how to dimension parallel sections? Divisible tasks: who decides to divide tasks? http://starpu.gforge.inria.fr/
Thanks for your attention!