1 Seamlessly distributing tasks between CPUs, GPUs, disks, clusters, ...
Olivier Aumage, Jérôme Clet-Ortega, Nathalie Furmento, Suraj Kumar, Marc Sergent, Samuel Thibault (Inria STORM team-project, Inria Bordeaux, LaBRI, University of Bordeaux)

2 Introduction: toward heterogeneous multicore architectures
Multicore is here: hierarchical architectures, manycore. Architecture specialization, now: accelerators (GPGPUs, FPGAs), coprocessors (Xeon Phi), Fusion, DSPs, and all of the above. In the near future: many simple cores alongside a few full-featured cores, i.e. mixed large and small cores.

3-6 How to program these architectures? (slide built up in four steps)
Multicore programming: pthreads, OpenMP, TBB, Cilk, ...
Accelerator programming: OpenCL was supposed to be the consensus (OpenMP 4.0?), yet CUDA, libspe and ATI Stream persist, (often) with a pure offloading model.
Network support: MPI, PGAS.
Hybrid models? Taking advantage of all resources ☺ at the price of complex interactions and distribution ☹.

7 Task graphs
A well-studied expression of parallelism, but one that departs from usual sequential programming. Really?

8-20 Task management (animation)
Implicit task dependencies, illustrated step by step on a right-looking Cholesky decomposition (from PLASMA).

21 Write your application as a task graph
Even when using sequential-looking source code: the Sequential Task Flow (STF) model gives portable performance. The algorithm remains the same over the long term, the sequential version can be debugged, and only the kernels need to be rewritten (BLAS libraries, multi-target compilers); the runtime handles the parallel execution.

22 Task graphs everywhere in HPC
OmpSs, PaRSEC (aka DAGuE), StarPU, SuperGlue/DuctTeip, XKaapi, ... OpenMP 4.0 introduced task dependencies. PLASMA/MAGMA: state-of-the-art dense linear algebra. qr_mumps/PaStiX: state-of-the-art sparse linear algebra. ScalFMM-MORSE. Very portable. MORSE associate team (Matrices Over Runtime Systems at Exascale).

23-25 Challenging issues at all stages (slide built up in three steps)
Applications: programming paradigm; BLAS kernels, FFT, ...
Compilers: languages; code generation/optimization.
Runtime systems: resource management; task scheduling.
Architecture: memory interconnect.
The stack: HPC applications atop a compiling environment and specific libraries, over the runtime system, the operating system and the hardware. The runtime exposes an expressive interface upward, contributes scheduling expertise, and returns execution feedback up the stack.

26 Overview of StarPU

27 Overview of StarPU: rationale
Task scheduling: dynamic, on all kinds of PU, general-purpose as well as accelerated/specialized. Memory transfers: eliminate redundant transfers through a software VSM (Virtual Shared Memory). Running example: A = A + B, with A and B replicated across CPU and GPU memories.

28 The StarPU runtime system
Layered design: HPC applications sit on a high-level data management library, an execution model and a scheduling engine, which drive specific drivers for CPUs, GPUs, SPUs, ... Mastering CPUs, GPUs, SPUs, ... *PUs → StarPU.

29 The StarPU runtime system
The need for runtime systems: "do dynamically what can't be done statically anymore". Compilers and libraries generate (graphs of) tasks, and additional information is welcome! StarPU provides task scheduling and memory management, sitting between parallel compilers/libraries and the CUDA/OpenCL drivers for CPUs and GPUs.

30 Data management
StarPU provides a Virtual Shared Memory (VSM) subsystem: replication, consistency, single writer (or reduction, ...). Inputs and outputs of tasks are references to VSM data.

31 The StarPU runtime system
Task scheduling: a task = its data inputs and outputs (references to VSM data) plus multiple possible implementations (e.g. CUDA + CPU), non-preemptible, with dependencies on other tasks; e.g. a task f with cpu/gpu/spu implementations accessing A in RW mode and B, C in R mode. StarPU provides an open scheduling platform: scheduling algorithms are plug-ins.

32 The StarPU runtime system
Task scheduling: who generates the code? A StarPU task is roughly a set of function pointers; StarPU doesn't generate code. The libraries era: PLASMA + MAGMA, FFTW + CUFFT, ... Or rely on compilers: PGI accelerators, CAPS HMPP, ...

33-40 The StarPU runtime system: execution model (animation)
The application submits the task "A += B" to StarPU; the scheduling engine schedules it onto a worker (here a GPU); the memory management (DSM) layer fetches A and B from RAM into the GPU memory; the GPU driver offloads the computation; StarPU notifies the application upon termination.

41 The StarPU runtime system
Development context and history: started about 9 years ago with the PhD thesis of Cédric Augonnet. The StarPU main core is ~60k lines of code, written in C. Open source, released under the LGPL; sources freely available (svn repository and nightly tarballs). Open to external contributors. [HPPC'08] [Europar'09] [CCPE'11], ... >950 citations.

42 Programming interface

43 Scaling a vector: data registration
Register a piece of data to StarPU:

    float array[NX];
    for (unsigned i = 0; i < NX; i++)
        array[i] = 1.0f;

    starpu_data_handle_t vector_handle;
    starpu_vector_data_register(&vector_handle, 0, (uintptr_t)array,
                                NX, sizeof(array[0]));

Submit tasks ... then unregister the data:

    starpu_data_unregister(vector_handle);

44 Scaling a vector: defining a codelet
A codelet is a multi-versioned kernel: function pointers to the different implementations, plus the number of data parameters managed by StarPU (a sketch of one such kernel follows below):

    struct starpu_codelet scal_cl = {
        .cpu_funcs = { scal_cpu_func },
        .cuda_funcs = { scal_cuda_func },
        .opencl_funcs = { scal_opencl_func },
        .nbuffers = 1,
        .modes = { STARPU_RW },
    };
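As a complement, here is a minimal sketch of what the CPU implementation referenced above could look like; the name scal_cpu_func comes from the codelet, while the body is an assumption based on StarPU's vector interface (STARPU_VECTOR_GET_NX / STARPU_VECTOR_GET_PTR):

    void scal_cpu_func(void *buffers[], void *cl_arg)
    {
        /* buffers[0] is the vector registered with StarPU */
        struct starpu_vector_interface *vector = buffers[0];
        unsigned n = STARPU_VECTOR_GET_NX(vector);
        float *val = (float *)STARPU_VECTOR_GET_PTR(vector);

        /* the scaling factor is passed through cl_arg (see next slide) */
        float factor = *(float *)cl_arg;

        for (unsigned i = 0; i < n; i++)
            val[i] *= factor;
    }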

45 Scaling a vector: defining a task
Define a task that scales the vector by a constant:

    struct starpu_task *task = starpu_task_create();
    task->cl = &scal_cl;
    task->handles[0] = vector_handle;
    task->detach = 0;   /* allow synchronizing with starpu_task_wait() */

    float factor = 3.14;
    task->cl_arg = &factor;
    task->cl_arg_size = sizeof(factor);

    starpu_task_submit(task);
    starpu_task_wait(task);

46 Scaling a vector: the starpu_insert_task helper
The same task, defined and submitted in one call:

    float factor = 3.14;
    starpu_insert_task(&scal_cl,
                       STARPU_RW, vector_handle,
                       STARPU_VALUE, &factor, sizeof(factor),
                       0);

47 Scaling a vector: OpenMP support through K'Star
The same task, expressed as an OpenMP dependent task:

    float factor = 3.14;
    #pragma omp task depend(inout: vector)
    scal(vector, factor);

48 Task management
Implicit task dependencies: right-looking Cholesky decomposition (from Chameleon), in tile pseudocode (an STF sketch with StarPU calls follows below):

    for (k = 0; k < tiles; k++) {
        POTRF(A[k][k]);
        for (m = k+1; m < tiles; m++)
            TRSM(A[k][k], A[m][k]);
        for (m = k+1; m < tiles; m++) {
            SYRK(A[m][k], A[m][m]);
            for (n = m+1; n < tiles; n++)
                GEMM(A[m][k], A[n][k], A[n][m]);
        }
    }
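A hedged sketch of the same loop nest in STF style with starpu_insert_task; the codelet names (potrf_cl, trsm_cl, syrk_cl, gemm_cl) and the tile-handle array A[m][n] are illustrative assumptions, not Chameleon's actual identifiers:

    for (unsigned k = 0; k < tiles; k++) {
        starpu_insert_task(&potrf_cl, STARPU_RW, A[k][k], 0);
        for (unsigned m = k+1; m < tiles; m++)
            starpu_insert_task(&trsm_cl, STARPU_R, A[k][k],
                               STARPU_RW, A[m][k], 0);
        for (unsigned m = k+1; m < tiles; m++) {
            starpu_insert_task(&syrk_cl, STARPU_R, A[m][k],
                               STARPU_RW, A[m][m], 0);
            for (unsigned n = m+1; n < tiles; n++)
                starpu_insert_task(&gemm_cl, STARPU_R, A[m][k],
                                   STARPU_R, A[n][k],
                                   STARPU_RW, A[n][m], 0);
        }
    }
    /* dependencies are inferred from the data access modes above */
    starpu_task_wait_for_all();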

49 Task Scheduling

50 Why do we need task scheduling?
Blocked matrix multiplication: things can go (really) wrong even on trivial problems! Static mapping? Not portable, and too hard for real-life problems. We need dynamic task scheduling, backed by performance models. [plot: performance on 2 Xeon cores, a Quadro FX5800 and a Quadro FX4600]

51-54 Task scheduling (animation)
When a task is submitted, it first goes into a pool of "frozen tasks" until all its dependencies are met; the task is then "pushed" to the scheduler. Idle processing units (CPU and GPU workers) poll for work ("pop"). Various scheduling policies are available, and user-defined ones can be plugged in (see the selection sketch below).
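Policies are typically chosen at launch time through the STARPU_SCHED environment variable (e.g. "eager", or the performance-model-based "dmda"); as a minimal sketch, and as an assumption about usage rather than the only way, the same choice can be made programmatically before initializing StarPU:

    #include <stdlib.h>
    #include <starpu.h>

    int main(void)
    {
        /* select the dmda (deque model data aware) policy */
        setenv("STARPU_SCHED", "dmda", 1);

        if (starpu_init(NULL) != 0)
            return 1;

        /* ... register data and submit tasks here ... */

        starpu_shutdown();
        return 0;
    }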

55-57 Task scheduling: component-based schedulers (animation)
Schedulers are assembled from components: containers, priorities, switches, side-effects (prefetch, ...), connected through a push/pull mechanism. Work by S. Archipoff and M. Sergent.

58-61 Prediction-based scheduling (animation)
Load balancing via task completion time estimation: history-based, user-defined cost function, or parametric cost model [HPPC'09]. These estimations can be used to implement scheduling policies such as Heterogeneous Earliest Finish Time. [Gantt chart: tasks mapped over time onto cpu #1-3 and gpu #1-2] A sketch of attaching a performance model to a codelet follows below.
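As a hedged, abridged variant of the scal_cl codelet from slide 44 (names assumed), a history-based performance model can be attached to a codelet; StarPU then records per-size execution times and feeds them to model-based policies such as dmda:

    static struct starpu_perfmodel scal_model = {
        .type = STARPU_HISTORY_BASED,
        .symbol = "scal",   /* key under which calibration history is stored */
    };

    static struct starpu_codelet scal_cl = {
        .cpu_funcs = { scal_cpu_func },
        .nbuffers = 1,
        .modes = { STARPU_RW },
        .model = &scal_model,   /* enables completion time estimation */
    };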

62 Predicting data transfer overhead: motivations
On hybrid platforms (multicore CPUs and GPUs), the PCI-e bus is a precious resource: data locality vs. load balancing. We cannot avoid all data transfers, but we can minimize them. StarPU keeps track of data replicates and of on-going data movements.

63 Prediction-based scheduling: data transfer time
Transfer times are sampled through off-line bus calibration. They can be used to better estimate the overall execution time and to minimize data movements. Further: power overhead [ICPADS'10].

64 Mixing PLASMA and MAGMA with StarPU
QR decomposition on Mordor8 (UTK): 16 CPUs (AMD) + 4 GPUs (C1060).

65 Mixing PLASMA and MAGMA with StarPU
QR decomposition on Mordor8 (UTK): 16 CPUs (AMD) + 4 GPUs (C1060). Adding 12 CPUs contributes ~200 GFlops, whereas those CPUs deliver only ~150 GFlops when measured on their own: a super-linear gain, thanks to heterogeneity.

66 Mixing PLASMA and MAGMA with StarPU
"Super-linear" efficiency in QR? Kernel efficiency:
sgeqrt: CPU 9 GFlops, GPU 30 GFlops (speedup ~3)
stsqrt: CPU 12 GFlops, GPU 37 GFlops (speedup ~3)
somqr: CPU 8.5 GFlops, GPU 227 GFlops (speedup ~27)
sssmqr: CPU 10 GFlops, GPU 285 GFlops (speedup ~28)
Task distribution observed in StarPU: sgeqrt, 20% of tasks on GPUs; sssmqr, 92.5% of tasks on GPUs. Taking advantage of heterogeneity: each unit only does what it is good at, and avoids what it is not good at.

67 Sparse matrix algebra

68 Cluster support: master/slave mode
The master unrolls the whole task graph and schedules tasks between nodes, taking data transfer cost into account and using state-of-the-art scheduling. It currently also schedules tasks inside nodes, but work is ongoing on leaving that to the nodes themselves. Data transfers currently use MPI, though other network drivers could easily be used. Scaling is reasonable. A sketch of the companion starpu_mpi interface follows below.
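Alongside this master/slave mode, StarPU also ships an MPI layer in which data are assigned owner ranks and transfers are inferred from task accesses; a hedged fragment, assumed to run inside main() and reusing the scal_cl codelet and vector_handle from the earlier slides (tag and owner rank are illustrative):

    #include <starpu.h>
    #include <starpu_mpi.h>

    starpu_init(NULL);
    starpu_mpi_init(&argc, &argv, 1);   /* 1: let StarPU initialize MPI */

    /* assign the registered vector an owner rank, under a unique tag */
    starpu_mpi_data_register(vector_handle, /* tag */ 42, /* owner */ 0);

    /* every rank makes this call; only the relevant ranks execute
       the task or take part in the transfers */
    float factor = 3.14;
    starpu_mpi_task_insert(MPI_COMM_WORLD, &scal_cl,
                           STARPU_RW, vector_handle,
                           STARPU_VALUE, &factor, sizeof(factor),
                           0);

    starpu_task_wait_for_all();
    starpu_mpi_shutdown();
    starpu_shutdown();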

69-70 How about the disk?
StarPU out-of-core support: the disk is just another memory node, exactly as main memory is relative to GPU memory. Two usages: a "swap" area for least-used data (starpu_disk_register("swap")), or a huge matrix stored on disk whose parts are loaded and evicted on demand (starpu_disk_open("file.dat")). The disk can be a network filesystem; StarPU handles loads/stores on demand. Memory hierarchy: GPU memory, main memory, disk memory, network disk memory. A registration sketch follows below.
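As a hedged sketch of the "swap" usage, using StarPU's stdio disk backend (the path and size here are illustrative assumptions):

    #include <starpu.h>

    /* register a 1 GiB disk memory node backed by plain files; StarPU
       can then evict least-used data there and reload it on demand */
    int disk_node = starpu_disk_register(&starpu_disk_stdio_ops,
                                         (void *)"/tmp/starpu-swap",
                                         (starpu_ssize_t)(1024ULL*1024*1024));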

71 How about the disk with cluster support?
StarPU out-of-core support over a cluster: the network disk acts as a shared memory node, while each node's local disk acts as a cache in front of it (per node: RAM, then local disk, then the network disk).

72-73 Simulation with SimGrid
Run the application natively on the target system once to record performance models. Then rebuild the application against a simgrid-compiled StarPU and run again: performance model estimations are used instead of actually executing the tasks. This yields much faster execution times and reproducible experiments, removes the need to run on the target system, and allows changing the simulated system architecture.

74-76 Theoretical "area" and CP bounds
We would not be able to do much better: express the task graph as a Linear Programming or Constraint Programming problem, with heterogeneous task durations and heterogeneous resources, additionally taking some critical paths into account in the Constraint Programming formulation.

77 Conclusion: summary
Tasks are a nice programming model: keep the sequential program! They scale to large platforms and provide a runtime, scheduling and algorithmic playground. Used for various computations: Cholesky/QR/LU (dense and sparse), FFT, stencil, CG, FMM, ...

78 Conclusion: future work
Load balancing for MPI, at submission time (shouldn't hurt too much). Granularity is a major concern: finding the optimal block size? Offline parameter auto-tuning; dynamically adapting the block size. Parallel CPU tasks (OpenMP, TBB, PLASMA parallel tasks): how to dimension parallel sections? Divisible tasks: who decides to divide tasks?

79 Thanks for your attention!

