Slide 1: Thinking Outside of the Tera-Scale Box
Piotr Luszczek
Slide 2: Brief History of the Tera-flop: 1997
1997: ASCI Red
Slide 3: Brief History of the Tera-flop: 2007
1997: ASCI Red
2007: Intel Polaris
Slide 4: Brief History of the Tera-flop: GPGPU
1997: ASCI Red
2007: Intel Polaris
2008: GPGPU
Slide 5: Tera-flop: a Dictionary Perspective
To a supercomputer expert: consumes 500 kW, costs over $1M, requires distributed-memory programming, memory 1 GiB or more.
To a gaming geek: consumes under 500 W, costs under $200, requires only a CUDA kernel, memory 1 GiB or more.
Slide 6: Double-Dip Recession?
[Chart: productivity over time, with dips around 2005 and 2007]
Multicore: Cilk, Intel TBB, Intel Ct, Apple GCD, MS ConcRT, ...
Hybrid multicore: HMPP, PGI Accelerator, PathScale?
Slide 7: Locality Space in HPC Challenge
HPL (Mat-Mat-Mul), Parallel Transpose, STREAM, FFT, RandomAccess
Slide 8: Locality Space in HPC Challenge: Bounds
HPL (Mat-Mat-Mul), Parallel Transpose, STREAM, FFT, RandomAccess
Bounds: latency-bound, bandwidth-bound, compute-bound
Also marked: AORSA2D, PCIe
Slide 9: CUDA vs. OpenCL: Similar Software Stacks
Slide 10: Matrix-Matrix Multiply: the Simple Form
for i = 1:M
  for j = 1:N
    for k = 1:T
      C(i, j) += A(i, k) * B(k, j)
Tuned implementations: CUBLAS, Volkov et al., MAGMA, Nakasato
Slide 11: Fermi C2050: from CUDA to OpenCL
Slide 12: Fermi C2050: Simple OpenCL Porting
Slide 13: ATI 5870: Simple OpenCL Porting
Slide 14: Making the Case for Autotuning: Use of Texture
Slide 15: Making the Case for Autotuning: Blocking Factors Provided by NVIDIA
Slide 16: Mat-Mul Portability Summary

                        Achieved performance
                        Single precision   Double precision
ATI OpenCL              52%                57%
ATI IL                  74%                87%
NVIDIA CUDA             63%                58%
NVIDIA OpenCL           57%                49%
Multicore (vendor)      79%
Multicore (autotuned)   46%                40%
Slide 17: Now, let's think outside of the box
Slide 18: Recipe for Performance Before Multicore
- Wider superscalar instruction issue
- Increased clock frequency
- Increased number of transistors
- Load/store and instruction scheduling
Slide 19: Recipe for Performance After Multicore
- More cores
- Parallel software
Slide 20: Recipe for Future Performance
- More cores
- Better sequential software
- DAG runtime
Slide 21: From Assembly to DAG: PLASMA
Typical loop in assembly (a sequential instruction stream, handled by instruction scheduling):
  load R0, #1000
  divf F1, F2, F3
  load R1, #1000
  divf F2, F4, F5
  dec R1
  jump ...
LU factorization (a sequential task stream, handled by DAG scheduling):
  for i = 1:N
    GETRF2(A[i, i])
    for j = i+1:N
      TRSM(A[i, i], A[i, j])
      for k = i+1:N
        GEMM(A[k, i], A[i, j], A[k, j])
Task DAG nodes: GETRF2, TRSM, GEMM, with dependences following the data accesses above.
Slide 22: Scalability of LU Factorization on Multicore
[Plot: PLASMA vs. MKL 11.0 vs. LAPACK on AMD Istanbul 2.8 GHz]
6 cores = 1 socket, 24 cores = 4 sockets, 48 cores = 8 sockets
Slide 23: Combining Multicore and GPU for QR Factorization
AMD Istanbul 2.8 GHz (24 cores) with a Fermi GTX 480
Slide 24: Performance on Distributed Memory
[Plots: QR, LU, and Cholesky; Cray LibSci vs. DPLASMA vs. peak]
ORNL Kraken, Cray XT5: dual-socket 6-core AMD Istanbul 2.6 GHz; interconnect: SeaStar, 3D torus
Slide 25: People and Projects
Jack Dongarra, Mathieu Faverge, Jakub Kurzak, Hatem Ltaief, Stan Tomov, Asim YarKhan, Peng Du, Rick Weber
Projects: MAGMA, PLASMA
DPLASMA and DAGuE: George Bosilca, Aurelien Bouteiller, Anthony Danalis, Thomas Herault, Pierre Lemarinier