Published by Scarlett Higgins. Modified over 9 years ago.
1
Data Partitioning on Heterogeneous Multicore and Multi-GPU Systems Using Functional Performance Models of Data-Parallel Applications
Published in: 2012 IEEE International Conference on Cluster Computing (CLUSTER)
2013/9/11
2
Outline
Introduction
Performance Measurement
Column-based matrix multiplication
FPM of multiple cores and GPUs
Experimental results
4
Introduction
Heterogeneous multiprocessor systems
  – Better power efficiency
  – Performance/price ratio
Multicore and GPU programming techniques
  – OpenMP, MPI
  – Brook+, CUDA, OpenCL
5
Introduction (cont.)
Data-parallel scientific applications
  – Linear algebra routines
  – Digital signal processing
  – Computational fluid dynamics
Data partitioning algorithm
  – Relies on performance models of the processors
6
Introduction (cont.)
Constant performance model (CPM)
  – Uses the history of performance measurements
  – Based on the absolute speed of processors/devices
Functional performance model (FPM)
  – Can be used with any data-parallel application
  – GPUs and CPUs have separate memory and different programming models
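As a minimal sketch of how a constant performance model can drive data partitioning, assume each device is described by one measured speed and the work is split in proportion to those speeds. The function name `partition_cpm` and the speed values are illustrative assumptions, not taken from the presentation.

```python
def partition_cpm(n, speeds):
    """Split n equal work units in proportion to constant device speeds."""
    total = sum(speeds)
    # Ideal fractional shares, rounded down to whole units.
    shares = [int(n * s / total) for s in speeds]
    # Hand out the units lost to rounding, fastest devices first.
    leftover = n - sum(shares)
    order = sorted(range(len(speeds)), key=lambda i: speeds[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares

# Hypothetical speeds (for example in GFlops): one GPU and two CPU cores.
print(partition_cpm(100, [60.0, 25.0, 15.0]))  # [60, 25, 15]
```

Because each speed is a single number, this split cannot react to a device slowing down at large problem sizes, which is the gap a functional model is meant to close.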
7
Introduction (cont.)
Load balancing algorithms
  – Static algorithms
      Also known as predicting-the-future algorithms
      Do not require data redistribution
      Cannot balance the load on non-dedicated platforms
  – Dynamic algorithms
      Do not require a priori information
      Incur communication overhead
9
Performance Measurement
Hybrid multicore and multi-GPU node of NUMA architecture
  – Multiple identical cores
  – Hierarchical memory
  – Heterogeneous GPUs connected via PCI Express
10
Performance Measurement
CPU
  – GEMM kernel from ACML 4.4 (AMD Core Math Library)
GPU
  – CUBLAS 4.1 (NVIDIA CUDA BLAS)
11
Performance Measurement (cont.)
Approach to performance measurement
  – Processes are bound to cores
  – Processes are synchronized
  – Measurements are repeated multiple times
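The repetition step above can be sketched as follows. This is an illustrative harness, not the authors' benchmark: the stopping rule (relative standard error below `rel_err`), the names `measure`, `gemm_gflops`, and `toy_kernel`, and all parameter defaults are assumptions. Binding processes to cores and synchronizing them are deliberately left out; on Linux the former could be done with `os.sched_setaffinity`.

```python
import statistics
import time

def measure(kernel, min_reps=5, max_reps=50, rel_err=0.05):
    """Time kernel() repeatedly; stop once the mean looks stable.

    Returns (mean_seconds, repetitions).
    """
    times = []
    for _ in range(max_reps):
        start = time.perf_counter()
        kernel()
        times.append(time.perf_counter() - start)
        if len(times) >= min_reps:
            mean = statistics.mean(times)
            # Relative standard error of the mean.
            sem = statistics.stdev(times) / len(times) ** 0.5
            if mean > 0 and sem / mean <= rel_err:
                break
    return statistics.mean(times), len(times)

def gemm_gflops(m, n, k, seconds):
    """Speed of an m x k by k x n matrix product, counting 2*m*n*k operations."""
    return 2.0 * m * n * k / seconds / 1e9

def toy_kernel():
    # Stand-in workload; a real benchmark would call the GEMM kernel here.
    sum(i * i for i in range(10_000))

mean_time, reps = measure(toy_kernel)
print(f"mean {mean_time:.6f} s over {reps} repetitions")
```

Dividing the operation count by the mean time, as `gemm_gflops` does, turns each measurement into one point of a speed function.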
12
Performance Measurement (cont.)
CPU
  – The speed of a core depends on the number of cores executing the kernel on the same socket
  – It is not affected by execution on the other socket
GPU
  – One core is dedicated to the GPU; the other cores are idle
  – Sending and receiving the matrix
14
Column-based matrix multiplication
15
Column-based matrix multiplication (cont.)
Partitioning algorithm
  – Arranges the submatrices to be as square as possible
  – Minimizes the total volume of communications and balances the computations
Blocking factor b
  – A parameter of the application that adjusts the granularity of communications and computations
  – Its value is determined experimentally
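To illustrate why "as square as possible" matters, the sketch below lays out rectangles of given relative areas in columns of the unit square and sums their half-perimeters, a common proxy for communication volume in this kind of partitioning. It only evaluates a given arrangement and does not search for the best one. The function names and the example areas are assumptions made for illustration.

```python
def column_layout(areas, columns):
    """Return {index: (width, height)} for a column-based partition of the unit square.

    areas   -- relative areas of the rectangles, summing to 1
    columns -- list of columns, each a list of indices into areas
    """
    dims = {}
    for col in columns:
        # A column spans the full height, so its width equals its total area.
        width = sum(areas[i] for i in col)
        for i in col:
            dims[i] = (width, areas[i] / width)
    return dims

def half_perimeter_sum(dims):
    """Sum of width + height over all rectangles."""
    return sum(w + h for w, h in dims.values())

areas = [0.4, 0.3, 0.2, 0.1]  # hypothetical shares of four processors
one_column = column_layout(areas, [[0, 1, 2, 3]])
two_columns = column_layout(areas, [[0, 1], [2, 3]])

print(half_perimeter_sum(one_column))   # thin strips: larger sum
print(half_perimeter_sum(two_columns))  # squarer blocks: smaller sum
```

With these areas the single-column layout gives a sum of about 5.0 and the two-column layout about 4.0, so the squarer arrangement implies less data to exchange for the same computational balance.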
17
FPM of multiple cores and GPUs
Speed functions of multiple cores
18
FPM of multiple cores and GPUs (cont.)
Speed functions of GPUs
19
FPM of multiple cores and GPUs (cont.)
Version 1
  – Pivot column A(b), pivot row B(b), and submatrix C_i are stored in the host memory
Version 2
  – Submatrix C_i is stored and accumulated in the device memory until the device memory is exceeded
20
FPM of multiple cores and GPUs (cont.)
Version 3
  – Overlapping communications and computations
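The benefit of overlapping can be estimated with a simple two-stage pipeline model, sketched below under strong assumptions: work is split into chunks, a chunk can only be computed after its transfer completes, one transfer and one computation can proceed at the same time, and sending results back is ignored. The function names and chunk times are hypothetical and do not come from the presentation.

```python
def serial_time(transfers, computes):
    """Total time when every transfer and computation runs one after another."""
    return sum(transfers) + sum(computes)

def overlapped_time(transfers, computes):
    """Total time when the transfer of chunk i+1 overlaps the computation of chunk i."""
    transfer_done = 0.0
    compute_done = 0.0
    for t, c in zip(transfers, computes):
        transfer_done += t
        # Computation starts once both the data and the device are available.
        compute_done = max(compute_done, transfer_done) + c
    return compute_done

transfers = [1.0, 1.0, 1.0, 1.0]  # hypothetical per-chunk transfer times
computes = [2.0, 2.0, 2.0, 2.0]   # hypothetical per-chunk compute times

print(serial_time(transfers, computes))      # 12.0
print(overlapped_time(transfers, computes))  # 9.0
```

In this model only the first transfer remains exposed when computation dominates, which is why the overlapped total approaches the pure compute time.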
21
FPM of multiple cores and GPUs (cont.)
Speed functions of GPUs
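A functional performance model represents the speed of a device as a function of problem size, and partitioning then aims to equalize execution times x / s(x) across devices. The sketch below uses a simple greedy stand-in for that balancing step; it is not the partitioning algorithm of the paper. The speed functions `gpu` and `cpu_core`, including the drop at 40 units standing in for exceeding device memory, are made-up assumptions.

```python
def partition_fpm(n, speed_functions):
    """Distribute n work units so that execution times x / s(x) stay balanced.

    Assumes each device's time x / s(x) does not decrease as x grows.
    """
    shares = [0] * len(speed_functions)

    def time_with(i, x):
        return x / speed_functions[i](x)

    for _ in range(n):
        # Give the next unit to the device that would finish it soonest.
        best = min(range(len(shares)), key=lambda i: time_with(i, shares[i] + 1))
        shares[best] += 1
    return shares

# Hypothetical speed functions (work units per second) of problem size x.
def gpu(x):
    # Fast while the data fits in device memory, slower beyond it.
    return 50.0 if x <= 40 else 20.0

def cpu_core(x):
    return 10.0  # flat: speed independent of problem size

print(partition_fpm(60, [gpu, cpu_core]))  # [40, 20]
```

A constant model using the speeds 50 and 10 would assign 50 of the 60 units to the GPU, past the point where its speed drops, whereas the size-dependent model stops at 40.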
23
Experimental results
24
Experimental results (cont.)
25
Experimental results (cont.)
26
Experimental results (cont.)
27
Q&A
28
Thank you for listening
29
1. Performance modelling
2. The performance of the program
3. Why FPM
4. Problem size
5. Kernel
6. NUMA
7. GEMM
8. BLAS
9. GFlops