Published by Lorraine Hubbard. Modified over 9 years ago.
1
Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU
Presented by Chunyi
Victor W Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal and Pradeep Dubey
Throughput Computing Lab and Intel Architecture Group, Intel Corporation
ISCA’10, June 19–23, 2010, Saint-Malo, France.
2
Outline
– The Rise of GPGPU
– Compute / Bandwidth Bound app.
– 14 Computing kernels
– Optimization
– Debunking the 100X GPU vs. CPU Myth
– Summary
2011/09/30 Embedded System Lab. CCU
4
The Rise of GPGPU
– CPU: designed for a wide variety of applications and to provide fast response times to a single task
– GPU: built specifically for rendering graphics applications that have a large degree of data parallelism
5
The Rise of GPGPU
– GPUs are designed for graphics processing with many small processing elements
– The massive processing capability of the GPU has drawn programmers to explore general-purpose computing on GPUs
6
The Rise of GPGPU
                 Intel Core i7 960    nVIDIA GTX 280
Cores            4                    30*8
Peak SP Flop     105 GF/s             933 GF/s
Peak BW          30 GB/s              141 GB/s
SP: Single-Precision Floating Point
BW: Local DRAM bandwidth
7
The Rise of GPGPU
– High-level programming languages: HLSL, Cg, GLSL, CTM, BrookGPU
– Compute Unified Device Architecture (CUDA)
– Open Computing Language (OpenCL)
http://www.gpgpu.org
8
The Rise of GPGPU
– GPUs show a significant performance gain
– But they are not orders of magnitude faster than CPUs
10
Compute / Bandwidth Bound
Performance depends on two resources
– Compute does the work
– Bandwidth feeds the compute
For compute bound applications
Performance = Efficiency * Peak Compute Capability
For bandwidth bound applications
Performance = Efficiency * Peak Bandwidth Capability
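The two formulas on this slide can be written as one small function. This is only a sketch of the slide's model: the function name and the 0.5 efficiency values are hypothetical, chosen for illustration, and the peak figures are the GTX 280 numbers quoted elsewhere in this deck.

```python
def attainable_performance(efficiency, peak_capability):
    """Slide model: Performance = Efficiency * Peak Capability.

    efficiency is a fraction between 0 and 1; peak_capability is in
    GF/s for compute bound kernels or GB/s for bandwidth bound ones.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return efficiency * peak_capability


# Peak figures quoted in this deck for the GTX 280.
GTX280_PEAK_GFLOPS = 933.0  # single precision
GTX280_PEAK_GBPS = 141.0    # local DRAM bandwidth

# Hypothetical 50% efficiency, used only to show the arithmetic.
compute_bound = attainable_performance(0.5, GTX280_PEAK_GFLOPS)   # GF/s
bandwidth_bound = attainable_performance(0.5, GTX280_PEAK_GBPS)   # GB/s
```

Which of the two peaks applies depends on the kernel: a compute bound kernel is limited by the first, a bandwidth bound kernel by the second.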
12
14 Computing kernels
SGEMM, Monte Carlo, Conv, FFT, SAXPY, LBM, Solv, SpMV, GJK, Sort, RC, Search, Hist, Bilat
13
14 Computing kernels
– Conv: common image filtering operation
– SAXPY: Basic Linear Algebra Subprogram
– LBM: Lattice Boltzmann method
– Solv: game physics simulators
– GJK: physically-based animation simulation
– RC: Ray Casting, visualizes 3D datasets
– Hist: histogram computation
– Bilat: bilateral filter
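Of the kernels listed above, SAXPY is the shortest to write down: it computes a*x + y element by element. The sketch below is a plain-Python illustration, not the tuned implementation that was benchmarked, and it uses Python floats (double precision) rather than single precision.

```python
def saxpy(a, x, y):
    """Return a*x + y element-wise for two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Each output element depends on exactly one element of x and one of y.
    return [a * xi + yi for xi, yi in zip(x, y)]
```

Each element is read once and used in only two floating-point operations, so the kernel does very little arithmetic per byte moved.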
14
SGEMM (Single precision General Matrix Multiply)
– Kernel in linear algebra numerical algorithms
– Regular access patterns
– Compute Bound
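For reference, a general matrix multiply computes alpha*A*B + beta*C. The triple loop below is a minimal, unoptimized sketch assuming matrices stored as lists of rows; it runs in Python's double precision, whereas SGEMM proper uses single precision.

```python
def sgemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for row-major lists of lists.

    a is m x k, b is k x n, c is m x n.
    """
    m, k, n = len(a), len(b), len(b[0])
    result = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            # Dot product of row i of a with column j of b.
            for p in range(k):
                acc += a[i][p] * b[p][j]
            result[i][j] = alpha * acc + beta * c[i][j]
    return result
```

The loops touch memory in a fixed, predictable order (the regular access pattern), and the work grows as m*n*k while the data grows only as m*k + k*n + m*n, which is why the kernel is compute bound once the data is reused from cache.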
15
Monte Carlo
– Randomly samples a complex function
– Regular access patterns
– Compute Bound
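The idea of random sampling can be sketched with a toy integrand. The function name, the seed, and the choice of x*x on [0, 1) are assumptions made for illustration; the slide does not say which function the benchmark samples.

```python
import random


def monte_carlo_mean(func, num_samples, seed=0):
    """Estimate the mean of func over [0, 1) from uniform random samples."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    total = 0.0
    for _ in range(num_samples):
        total += func(rng.random())
    return total / num_samples


# The exact mean of x*x on [0, 1) is 1/3; the estimate should land close to it.
estimate = monte_carlo_mean(lambda x: x * x, 100_000)
```

Every sample is independent of the others, so the loop parallelizes easily, and almost all of the time goes into evaluating the function rather than fetching data.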
16
FFT (Fast Fourier Transform)
– Converts signals from the time domain to the frequency domain
– Regular access patterns
– Compute Bound
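A textbook recursive radix-2 Cooley-Tukey FFT shows the time-to-frequency conversion in a few lines. This is a sketch under the assumption that the input length is a power of two; it is not the optimized variant used for the measurements.

```python
import cmath


def fft(samples):
    """Recursive radix-2 FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])  # transform of even-indexed samples
    odd = fft(samples[1::2])   # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor applied to the odd half (the butterfly step).
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

For example, a unit impulse `[1, 0, 0, 0]` transforms to a flat spectrum in which every bin equals 1, up to rounding.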
17
SpMV (Sparse Matrix Vector Multiplication)
– Gather access patterns
– Bandwidth Bound
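The gather access pattern is easiest to see in code. The sketch below assumes the matrix is stored in compressed sparse row (CSR) form; the slide does not name a storage format, so CSR and the parameter names are assumptions.

```python
def spmv_csr(values, col_indices, row_ptr, x):
    """Multiply a CSR sparse matrix by the dense vector x.

    values      : nonzero entries, row by row
    col_indices : column index of each nonzero entry
    row_ptr     : row r occupies values[row_ptr[r]:row_ptr[r + 1]]
    """
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for idx in range(row_ptr[row], row_ptr[row + 1]):
            # Gather: the element of x to read is chosen by a stored index.
            acc += values[idx] * x[col_indices[idx]]
        y.append(acc)
    return y


# The 3x3 matrix [[10, 0, 0], [0, 20, 30], [40, 0, 50]] in CSR form.
y = spmv_csr([10.0, 20.0, 30.0, 40.0, 50.0], [0, 1, 2, 0, 2], [0, 1, 3, 5],
             [1.0, 2.0, 3.0])
```

Each nonzero is used for a single multiply-add, so the kernel spends most of its time waiting on memory rather than computing.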
18
Sort (Radix sort)
– Multi-pass sorting algorithm
– Gather / Scatter access patterns
– Compute bound
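The multi-pass structure and the scatter step can be sketched with a least-significant-digit radix sort. The 8-bit digit size and the 32-bit non-negative integer keys are assumptions for this illustration.

```python
def radix_sort(keys, bits_per_pass=8, key_bits=32):
    """LSD radix sort for non-negative integers below 2**key_bits."""
    radix = 1 << bits_per_pass
    mask = radix - 1
    for shift in range(0, key_bits, bits_per_pass):  # one pass per digit
        # Count how many keys fall into each digit bucket.
        counts = [0] * radix
        for key in keys:
            counts[(key >> shift) & mask] += 1
        # Exclusive prefix sum gives each bucket's starting offset.
        offsets = [0] * radix
        total = 0
        for digit in range(radix):
            offsets[digit] = total
            total += counts[digit]
        # Scatter: each key is written to a data-dependent position.
        out = [0] * len(keys)
        for key in keys:
            digit = (key >> shift) & mask
            out[offsets[digit]] = key
            offsets[digit] += 1
        keys = out
    return keys
```

Each pass is stable, which is what lets later passes on higher digits preserve the order established by earlier ones. The input list itself is left unmodified.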
19
Search
– In-memory tree-structured index search
– Gather / Scatter access patterns
– Compute bound for small trees, otherwise Bandwidth bound
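One simple form of an in-memory tree index is a binary search tree stored implicitly in an array, with the children of node i at 2i+1 and 2i+2. The layout and both function names are assumptions for this sketch; the slide does not describe the index structure that was measured.

```python
def build_tree(sorted_keys):
    """Lay out sorted keys as an implicit binary search tree in an array."""
    n = len(sorted_keys)
    tree = [None] * n
    keys = iter(sorted_keys)

    def fill(node):
        # In-order traversal assigns keys in ascending order.
        if node < n:
            fill(2 * node + 1)
            tree[node] = next(keys)
            fill(2 * node + 2)

    fill(0)
    return tree


def search(tree, key):
    """Return the array position of key, or -1 if it is absent."""
    node = 0
    while node < len(tree):
        if key == tree[node]:
            return node
        # Each step jumps to a child: a data-dependent memory access.
        node = 2 * node + 1 if key < tree[node] else 2 * node + 2
    return -1


tree = build_tree([1, 3, 5, 7, 9, 11, 13])
```

A small tree stays in cache, so the comparisons dominate. Once the tree outgrows the cache, each level of the descent can cost a trip to memory.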
20
Experiment
1) A 3.2GHz Core i7-960 processor
– SUSE Enterprise Server 11 operating system
– 6GB of PC1333 DDR3 memory
2) A 1.3GHz GTX280 processor (with 1GB GDDR3 memory) in the same Core i7 system
– Nvidia driver version 19.180, CUDA 2.3 toolkit
21
Experiment
[Figure: experiment results, annotated 2.5X]
23
Optimization
– CPU optimization
– GPU optimization
– Hardware Recommendations
24
Optimization
CPU optimization
– Multi-threading
– SIMDification
– Cache blocking
– Memory management
– Data structure re-arrangement
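Of the CPU techniques listed, cache blocking is the one that most changes the shape of the code. The sketch below tiles a square matrix multiply; the block size of 2 is a placeholder chosen for illustration, and Python itself will not show the speedup, only the loop structure.

```python
def blocked_matmul(a, b, block=2):
    """Multiply two square n x n matrices one block-sized tile at a time."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    # The outer three loops walk over tiles.
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                # The inner three loops stay inside one tile of a, b, and c.
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + block, n)):
                            c[i][j] += aik * b[k][j]
    return c
```

In a compiled implementation the block size would be picked so that the three active tiles fit in cache together, letting each loaded element be reused many times before it is evicted.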
25
Optimization
GPU optimization
– Multi-threading
– Branch divergence reduction
– Coalescing memory accesses
– Synchronization avoidance
– Local shared buffer optimization
26
Hardware Recommendations
– Large cache
– High memory bandwidth
– Efficient synchronization
– Cache coherence
27
Truth or Myth?
GPUs are 10–100x faster than CPUs
28
Max Speedup
                 Intel Core i7 960    nVIDIA GTX 280
Cores            4                    30*8
Peak SP Flop     105 GF/s             933 GF/s
Peak BW          30 GB/s              141 GB/s
Max Speedup: GTX 280 over Core i7 960
– Compute Bound Apps (SP): 933/102 = 9.1x
– Bandwidth Bound Apps: 141/30 = 4.7x
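The two ratios on this slide are plain divisions of peak figures, reproduced below as a sketch. The function name is hypothetical, and the CPU compute peak is taken as the 102 GF/s used in the slide's own division, although the table lists 105 GF/s.

```python
def max_speedup(gpu_peak, cpu_peak):
    """Upper bound on GPU-over-CPU speedup from a ratio of peak rates."""
    return gpu_peak / cpu_peak


# Peak figures as used in the slide's calculation.
compute_ratio = max_speedup(933.0, 102.0)   # GF/s over GF/s
bandwidth_ratio = max_speedup(141.0, 30.0)  # GB/s over GB/s

print(f"Compute bound apps:   {compute_ratio:.1f}x")    # 9.1x
print(f"Bandwidth bound apps: {bandwidth_ratio:.1f}x")  # 4.7x
```

Both bounds sit well below 100x, which is the arithmetic core of the argument: even with perfect efficiency on the GPU, the hardware peaks alone do not allow a two-orders-of-magnitude gap.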
31
Summary
– Without parallelization, neither GPU nor CPU will perform well
– Apply architecture-specific optimization
– Compare performance fairly
– Memory is the key in multicore
32
Thanks for your attention. Any questions?