Data Partitioning on Heterogeneous Multicore and Multi-GPU Systems Using Functional Performance Models of Data-Parallel Applications Published in: Cluster.

1 Data Partitioning on Heterogeneous Multicore and Multi-GPU Systems Using Functional Performance Models of Data-Parallel Applications
Published in: Cluster Computing (CLUSTER), 2012 IEEE International Conference on
2013/9/11

2 Outline
Introduction
Performance Measurement
Column-based matrix multiplication
FPM of multiple cores and GPUs
Experimental results

4 Introduction
Heterogeneous multiprocessor systems
– Better power efficiency
– Performance/price ratio
Multicore and GPU programming techniques
– OpenMP, MPI
– Brook+, CUDA, OpenCL

5 Introduction (cont.)
Data-parallel scientific applications
– Linear algebra routines
– Digital signal processing
– Computational fluid dynamics
Data partitioning algorithm
– Performance models of processors

6 Introduction (cont.)
Constant performance model (CPM)
– Uses the history of performance measurements
– Absolute speed of processors/devices
Functional performance model (FPM)
– Can be used with any data-parallel application
– GPU and CPU have separate memory and different programming models
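The difference between the two models can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the function names, the bisection search, and the two example speed functions (a CPU with constant speed and a GPU that slows down past 500 units, as if it had run out of device memory) are all assumptions made for the example. The FPM variant also assumes that the execution time x / s(x) does not decrease as the problem size x grows.

```python
from typing import Callable, List, Sequence

def partition_cpm(n: int, speeds: Sequence[float]) -> List[int]:
    """CPM: split n units in proportion to constant speeds."""
    total = sum(speeds)
    parts = [int(n * s / total) for s in speeds]
    # Hand out the units lost to rounding, fastest device first.
    order = sorted(range(len(speeds)), key=lambda i: -speeds[i])
    for i in range(n - sum(parts)):
        parts[order[i % len(order)]] += 1
    return parts

def partition_fpm(n: int, speed_fns: Sequence[Callable[[int], float]]) -> List[int]:
    """FPM: split n units so that all devices finish at about the same time.

    Each speed function maps a problem size x to a speed s(x).
    Assumes the time x / s(x) is non-decreasing in x.
    """
    def units_within(t: float, s: Callable[[int], float]) -> int:
        # Largest x in [0, n] whose execution time x / s(x) fits in t.
        lo, hi = 0, n
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if mid / s(mid) <= t:
                lo = mid
            else:
                hi = mid - 1
        return lo

    # Bisection on the common finish time t.
    t_lo, t_hi = 0.0, min(n / s(n) for s in speed_fns)
    for _ in range(60):
        t = (t_lo + t_hi) / 2
        if sum(units_within(t, s) for s in speed_fns) >= n:
            t_hi = t
        else:
            t_lo = t
    parts = [units_within(t_hi, s) for s in speed_fns]
    # Ties can overshoot n; take units back from the slowest finisher.
    while sum(parts) > n:
        i = max((j for j in range(len(parts)) if parts[j] > 0),
                key=lambda j: parts[j] / speed_fns[j](parts[j]))
        parts[i] -= 1
    return parts

# Hypothetical speed functions (units per second), for illustration only.
def cpu_speed(x: int) -> float:
    return 10.0

def gpu_speed(x: int) -> float:
    return 100.0 if x <= 500 else 50000.0 / x

cpm_parts = partition_cpm(1000, [10.0, 100.0])
fpm_parts = partition_fpm(1000, [cpu_speed, gpu_speed])
```

With these made-up speed functions, the CPM split uses the GPU's peak speed and overloads it, while the FPM split shifts work back to the CPU because the GPU's speed drops at large sizes.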

7 Introduction (cont.)
Load balancing algorithm
– Static algorithms
  Known as predicting-the-future
  Do not require data redistribution
  Cannot balance on non-dedicated platforms
– Dynamic algorithms
  Do not require a priori information
  Communication overhead

8 Outline
Introduction
Performance Measurement
Column-based matrix multiplication
FPM of multiple cores and GPUs
Experimental results

9 Performance Measurement
Hybrid multicore and multi-GPU node of NUMA architecture
– Multiple identical cores
– Hierarchical memory
– Heterogeneous GPUs connected via PCI Express

10 Performance Measurement
CPU
– GEMM kernel from ACML 4.4 (AMD Core Math Library)
GPU
– CUBLAS 4.1 (NVIDIA CUDA BLAS)

11 Performance Measurement (cont.)
Approach to performance measurement
– Processes are bound to cores
– Processes are synchronized
– Repeat multiple times
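The three points of this measurement approach can be sketched in Python. This is an illustrative sketch under stated assumptions, not the authors' benchmarking code: `pin_to_core` and `measure_speed` are hypothetical helper names, the binding relies on `os.sched_setaffinity` (available on Linux only), the barrier is any object with a `wait()` method shared by the participating processes (for example a `multiprocessing.Barrier`), and taking the median of the timings is an assumption, since the slide only says the measurement is repeated.

```python
import os
import statistics
import time

def pin_to_core(core: int) -> bool:
    """Bind the calling process to one core; return False where unsupported."""
    if not hasattr(os, "sched_setaffinity"):  # Linux-only API
        return False
    try:
        os.sched_setaffinity(0, {core})
    except OSError:
        return False
    return True

def measure_speed(kernel, work_units, repeats=5, barrier=None):
    """Return work units per second, based on the median of repeated timings."""
    timings = []
    for _ in range(repeats):
        if barrier is not None:
            barrier.wait()  # all participants start the kernel together
        start = time.perf_counter()
        kernel()
        timings.append(time.perf_counter() - start)
    return work_units / statistics.median(timings)
```

Each measuring process would call `pin_to_core` once and then `measure_speed` with a shared barrier, so that the cores of a socket execute the kernel at the same time and contend for shared resources as they would in the real application.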

12 Performance Measurement (cont.)
CPU
– The speed of a core depends on the number of cores executing the kernel on the same socket
– It is not affected by execution on the other socket
GPU
– One core is dedicated to the GPU, the other cores are idle
– Send / Receive matrix

13 Outline
Introduction
Performance Measurement
Column-based matrix multiplication
FPM of multiple cores and GPUs
Experimental results

14 Column-based matrix multiplication

15 Column-based matrix multiplication (cont.)
Partitioning algorithm
– Arrange the submatrices to be as square as possible
– Minimizes the total volume of communications and balances the computations
Blocking factor b
– A parameter of the application that adjusts the granularity of communications and computations
– Determined experimentally
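The role of the blocking factor b can be illustrated with a small single-process sketch. It is an assumption-laden illustration, not the parallel application itself: `blocked_matmul` is a hypothetical name, matrices are plain Python lists, and there is no data distribution or communication. At each step the product of a pivot column of A (b columns wide) and a pivot row of B (b rows high) is added to C, so a larger b means fewer, coarser steps.

```python
def blocked_matmul(A, B, b):
    """Compute C = A x B as a sequence of rank-b updates."""
    n = len(A)        # rows of A and C
    inner = len(B)    # columns of A, rows of B
    m = len(B[0])     # columns of B and C
    C = [[0.0] * m for _ in range(n)]
    for k0 in range(0, inner, b):
        k1 = min(k0 + b, inner)
        # Pivot column: A[:, k0:k1]; pivot row: B[k0:k1, :].
        for i in range(n):
            row_a = A[i]
            row_c = C[i]
            for k in range(k0, k1):
                a = row_a[k]
                row_b = B[k]
                for j in range(m):
                    row_c[j] += a * row_b[j]
    return C
```

The result does not depend on b; only the number of steps changes, which is why b can be tuned by experiment for granularity.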

16 Outline
Introduction
Related Work
Performance Measurement
Column-based matrix multiplication
FPM of multiple cores and GPUs
Experimental results

17 FPM of multiple cores and GPUs
Speed functions of multiple cores

18 FPM of multiple cores and GPUs (cont.)
Speed functions of GPUs

19 FPM of multiple cores and GPUs (cont.)
Version 1
– Pivot column A(b), pivot row B(b), and submatrix C_i are stored in the host memory
Version 2
– Submatrix C is stored and accumulated on the device until the device memory is exceeded

20 FPM of multiple cores and GPUs (cont.)
Version 3
– Overlapping communications and computations
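The benefit of overlapping can be estimated with a simple timing model. This is a sketch under assumptions that are not in the slides: per-step transfer and compute times are given as lists, transfers run back to back on their own engine, there are enough buffers for transfers to run ahead of the computations, and the transfer of results back to the host is ignored. The function names are hypothetical.

```python
def serial_time(comm, comp):
    """Total time when every transfer and every computation runs one after another."""
    return sum(comm) + sum(comp)

def overlapped_time(comm, comp):
    """Total time when the transfer for a later step proceeds during earlier computations."""
    transfer_done = 0.0
    compute_done = 0.0
    for t_comm, t_comp in zip(comm, comp):
        transfer_done += t_comm  # transfers are issued back to back
        # A step starts once its data has arrived and the previous step is finished.
        compute_done = max(compute_done, transfer_done) + t_comp
    return compute_done
```

For example, with three steps that each take 1 time unit to transfer and 2 to compute, the model gives 9 units without overlap and 7 with overlap; only the first transfer is left exposed.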

21 FPM of multiple cores and GPUs (cont.)
Speed functions of GPUs

22 Outline
Introduction
Related Work
Performance Measurement
Column-based matrix multiplication
FPM of multiple cores and GPUs
Experimental results

23 Experimental results

24 Experimental results (cont.)

25 Experimental results (cont.)

26 Experimental results (cont.)

27 Q&A

28 Thank you for listening

29
1. Performance modelling
2. The performance of the program
3. Why FPM
4. Problem size
5. Kernel
6. NUMA
7. GEMM
8. BLAS
9. GFlops

