Data Partitioning on Heterogeneous Multicore and Multi-GPU Systems Using Functional Performance Models of Data-Parallel Applications Published in: Cluster.

Data Partitioning on Heterogeneous Multicore and Multi-GPU Systems Using Functional Performance Models of Data-Parallel Applications
Published in: Cluster Computing (CLUSTER), 2012 IEEE International Conference on
2013/9/11

Outline
  Introduction
  Performance Measurement
  Column-based matrix multiplication
  FPM of multiple cores and GPUs
  Experimental results

Introduction
  Heterogeneous multiprocessor systems
    – Better power efficiency
    – Performance/price ratio
  Multicore and GPU programming techniques
    – OpenMP, MPI
    – Brook+, CUDA, OpenCL

Introduction (cont.)
  Data-parallel scientific applications
    – Linear algebra routines
    – Digital signal processing
    – Computational fluid dynamics
  Data partitioning algorithm
    – Performance models of processors

Introduction (cont.)
  Constant performance model (CPM)
    – Uses the history of performance measurements
    – Absolute speed of processors/devices
  Functional performance model (FPM)
    – Can be used with any data-parallel application
    – GPU and CPU have separate memory and different programming models
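
To make the difference between the two models concrete, the sketch below partitions n units of work once with constant speeds and once with speed functions of the problem size. The function names, the bisection search on the common finish time, and the example speed values are assumptions made for illustration; this is not the partitioning algorithm from the paper.

```python
def partition_cpm(n, speeds):
    """Split n units in proportion to constant speeds (CPM)."""
    total = sum(speeds)
    parts = [int(n * s / total) for s in speeds]
    # Hand the units lost to rounding to the fastest devices first.
    order = sorted(range(len(speeds)), key=lambda i: -speeds[i])
    for i in order[: max(0, n - sum(parts))]:
        parts[i] += 1
    return parts


def partition_fpm(n, speed_fns, iters=60):
    """Split n units so that all devices finish at about the same time.

    speed_fns[i](x) is the speed of device i at problem size x.
    Assumes each execution time x / speed(x) grows with x.
    """
    def max_size(fn, t):
        # Largest integer x in [0, n] with x / fn(x) <= t (binary search).
        lo, hi = 0, n
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if mid / fn(mid) <= t:
                lo = mid
            else:
                hi = mid - 1
        return lo

    # Bisection on the common finish time t.
    t_lo, t_hi = 0.0, max(n / fn(n) for fn in speed_fns)
    for _ in range(iters):
        t = (t_lo + t_hi) / 2
        if sum(max_size(fn, t) for fn in speed_fns) >= n:
            t_hi = t
        else:
            t_lo = t
    parts = [max_size(fn, t_hi) for fn in speed_fns]
    # Trim any surplus from the largest shares so the parts sum to n.
    while sum(parts) > n:
        i = max(range(len(parts)), key=lambda j: parts[j])
        parts[i] -= 1
    return parts


# Hypothetical speed functions (units per second), not measured data.
def cpu_speed(x):
    return 100.0                           # flat speed at every size


def gpu_speed(x):
    return 400.0 if x <= 1000 else 150.0   # drops once device memory is exceeded
```

With these assumed speeds, a constant model built from small-problem measurements gives the GPU four times the CPU's share, while the functional model accounts for the GPU slowing down on large problems and shifts work back to the CPU.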

Introduction (cont.)
  Load balancing algorithm
    – Static algorithms
        Also known as predicting-the-future algorithms
        Do not require data redistribution
        Cannot balance the load on non-dedicated platforms
    – Dynamic algorithms
        Do not require a priori information
        Incur communication overhead

Performance Measurement
  Hybrid multicore and multi-GPU node of NUMA architecture
    – Multiple identical cores
    – Hierarchical memory
    – Heterogeneous GPUs connected via PCI Express

Performance Measurement
  CPU
    – GEMM kernel from ACML 4.4 (AMD Core Math Library)
  GPU
    – CUBLAS 4.1 (NVIDIA CUDA BLAS)

Performance Measurement (cont.)
  Approach to performance measurement
    – Processes are bound to cores
    – Processes are synchronized
    – Measurements are repeated multiple times
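
A minimal sketch of such a measurement loop for a single process follows. It covers only core binding and repetition; synchronization between processes (for example, a barrier before each timed run) is left out. The function name, its parameters, and the stopping rule based on the standard error of the mean are assumptions, not the authors' benchmarking code.

```python
import os
import statistics
import time


def measure_speed(kernel, work, core=None, min_reps=5, max_reps=50, rel_err=0.05):
    """Time kernel() repeatedly and return work / mean_time.

    kernel  : callable that runs the computation once
    work    : amount of work done per call (e.g. number of flops)
    core    : optional core id to pin this process to
    rel_err : stop once the standard error falls below rel_err * mean
    """
    if core is not None and hasattr(os, "sched_setaffinity"):
        # Linux only: bind the calling process to a single core.
        os.sched_setaffinity(0, {core})

    times = []
    for _ in range(max_reps):
        start = time.perf_counter()
        kernel()
        times.append(time.perf_counter() - start)
        if len(times) >= max(2, min_reps):
            mean = statistics.mean(times)
            sem = statistics.stdev(times) / len(times) ** 0.5
            if sem <= rel_err * mean:
                break
    return work / statistics.mean(times)
```

Calling `measure_speed(lambda: sum(range(10000)), work=10000)` returns an estimate in work units per second; passing `core=0` additionally pins the process where the platform supports it.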

Performance Measurement (cont.)
  CPU
    – The speed of a core depends on the number of cores executing the kernel on the same socket
    – It is not affected by execution on the other socket
  GPU
    – One core is dedicated to the GPU; the other cores are idle
    – Matrices are sent to and received from the GPU

Column-based matrix multiplication

Column-based matrix multiplication (cont.)
  Partitioning algorithm
    – Arranges the submatrices to be as square as possible
    – Minimizes the total volume of communications and balances the computations
  Blocking factor b
    – A parameter of the application that adjusts the granularity of communications and computations
    – Its value is determined experimentally
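
The role of the blocking factor can be sketched as a sequential loop that, at each step, takes b pivot columns of A and b pivot rows of B and updates all of C. This is a simplified pure-Python illustration under the assumption of a single process; the distribution of C among devices and the communication of the pivot column and row are omitted, and the function name is hypothetical.

```python
def blocked_matmul(A, B, b):
    """Compute C = A * B in steps of b pivot columns of A and b pivot rows of B."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for k0 in range(0, m, b):
        k1 = min(k0 + b, m)
        # One step: update C with pivot column A[:, k0:k1] and pivot row B[k0:k1, :].
        for i in range(n):
            for j in range(p):
                C[i][j] += sum(A[i][k] * B[k][j] for k in range(k0, k1))
    return C
```

A larger b means fewer, larger steps (coarser communication and computation granularity); a smaller b means more, smaller steps. The result is the same for any valid b.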

Outline
  Introduction
  Related Work
  Performance Measurement
  Column-based matrix multiplication
  FPM of multiple cores and GPUs
  Experimental results

FPM of multiple cores and GPUs
  Speed functions of multiple cores

FPM of multiple cores and GPUs (cont.)
  Speed functions of GPUs

FPM of multiple cores and GPUs (cont.)
  Version 1
    – Pivot column A(b), row B(b), and submatrix C_i are stored in the host memory
  Version 2
    – Submatrix C_i is stored and accumulated in the device memory until the device memory is exceeded

FPM of multiple cores and GPUs (cont.)
  Version 3
    – Overlapping communications and computations
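
The benefit of overlapping can be estimated with a simple two-stage pipeline model, sketched below. It assumes that the data is split into chunks, that the transfer of the next chunk can proceed while the current chunk is being computed, that buffers are always available, and that transferring results back is ignored. The function names and the timings in the usage note are illustrative assumptions, not measurements from the paper.

```python
def serial_time(transfer, compute):
    """Total time when every transfer and every computation runs one after another."""
    return sum(transfer) + sum(compute)


def overlapped_time(transfer, compute):
    """Total time when the transfer of chunk i+1 overlaps the computation of chunk i."""
    transfer_done = 0.0
    compute_done = 0.0
    for t, c in zip(transfer, compute):
        # Transfers are issued back to back on the communication channel.
        transfer_done += t
        # A chunk can be computed once it has arrived and the device is free.
        compute_done = max(compute_done, transfer_done) + c
    return compute_done
```

For three chunks with a transfer time of 1 and a compute time of 2 each, the serial model gives 9 time units and the overlapped model gives 7: only the first transfer is exposed, and the rest are hidden behind computation.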

FPM of multiple cores and GPUs (cont.)
  Speed functions of GPUs

Outline
  Introduction
  Related Work
  Performance Measurement
  Column-based matrix multiplication
  FPM of multiple cores and GPUs
  Experimental results

Experimental results

Experimental results (cont.)

Q&A

Thank you for listening

1. Performance modelling
2. The performance of the program
3. Why FPM
4. Problem size
5. Kernel
6. NUMA
7. GEMM
8. BLAS
9. GFlops
