Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU
Presented by Chunyi
Victor W Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal and Pradeep Dubey
Throughput Computing Lab and Intel Architecture Group, Intel Corporation
ISCA’10, June 19–23, 2010, Saint-Malo, France.

Outline
– The Rise of GPGPU
– Compute / Bandwidth Bound applications
– 14 Computing kernels
– Optimization
– Debunking the 100X GPU vs. CPU Myth
– Summary
2011/09/30, Embedded System Lab., CCU

The Rise of GPGPU
– CPU: designed for a wide variety of applications and to provide fast response times to a single task
– GPU: built specifically for rendering graphics applications that have a large degree of data parallelism

The Rise of GPGPU
– GPUs are designed for graphics processing with many small processing elements
– The massive processing capability of GPUs attracts programmers to explore general-purpose computing on GPUs

The Rise of GPGPU
                Intel Core i7 960    nVIDIA GTX 280
Cores           4                    30*8
Peak SP Flop    102 GF/s             933 GF/s
Peak BW         30 GB/s              141 GB/s
SP: Single-Precision Floating Point
BW: Local DRAM bandwidth

The Rise of GPGPU
– High-level programming languages: HLSL, Cg, GLSL, CTM, BrookGPU
– Compute Unified Device Architecture (CUDA)
– Open Computing Language (OpenCL)
http://www.gpgpu.org

The Rise of GPGPU
– GPUs deliver a significant performance gain
– They are not orders of magnitude faster than CPUs

Compute / Bandwidth Bound
Performance depends on two resources:
– Compute does the work
– Bandwidth feeds the compute
For compute-bound applications: Performance = Efficiency * Peak Compute Capability
For bandwidth-bound applications: Performance = Efficiency * Peak Bandwidth Capability
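The two formulas on this slide can be sketched in a few lines of Python. The function name and the efficiency values below are hypothetical, chosen only for illustration; the peak figures are the GTX 280 numbers quoted in this deck's comparison table.

```python
def attainable_performance(efficiency, peak):
    """Performance = Efficiency * Peak capability (compute or bandwidth)."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return efficiency * peak


# Peak figures for the GTX 280 as quoted in this deck.
GTX280_PEAK_SP_GFLOPS = 933.0   # GF/s, single precision
GTX280_PEAK_BW_GBS = 141.0      # GB/s, local DRAM bandwidth

# Hypothetical efficiencies (assumptions, not measured values).
compute_bound = attainable_performance(0.5, GTX280_PEAK_SP_GFLOPS)   # in GF/s
bandwidth_bound = attainable_performance(0.8, GTX280_PEAK_BW_GBS)    # in GB/s
```

Which formula applies depends on the kernel: a compute-bound kernel is limited by the first, a bandwidth-bound kernel by the second.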

14 Computing kernels
SGEMM, Monte Carlo, Conv, FFT, SAXPY, LBM, Solv, SpMV, GJK, Sort, RC, Search, Hist, Bilat

14 Computing kernels
Conv     Common image filtering operation
SAXPY    Basic Linear Algebra Subprogram
LBM      Lattice Boltzmann method
Solv     Game physics simulators
GJK      Physically based animation simulation
RC       Ray casting, visualizes 3D datasets
Hist     Histogram computation
Bilat    Bilateral filter
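As an illustration of the simplest kernel in this list, SAXPY computes y = a*x + y element by element. The sketch below is a plain Python rendering of that definition, not the tuned implementation evaluated in the paper; the function name and sample data are assumptions. Python floats are double precision, so the "S" (single precision) is not reproduced here.

```python
def saxpy(a, x, y):
    """Return a*x + y element-wise for two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Each element needs two loads and one store for only two
    # floating-point operations (one multiply, one add).
    return [a * xi + yi for xi, yi in zip(x, y)]


result = saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```

Because so little arithmetic is done per byte moved, a kernel of this shape is typically limited by memory bandwidth rather than by compute.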

SGEMM (Single-precision General Matrix Multiply)
– Kernel in numerical linear algebra algorithms
– Regular access patterns
– Compute bound
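A minimal sketch of what SGEMM computes, assuming the usual BLAS form C = alpha*A*B + beta*C. This is an unoptimized reference loop in Python (double precision, lists of lists), meant only to show the regular access pattern; the names are illustrative.

```python
def sgemm(alpha, A, B, beta, C):
    """Return alpha * (A @ B) + beta * C for row-major lists of lists."""
    m, k, n = len(A), len(B), len(B[0])
    out = [[beta * C[i][j] for j in range(n)] for i in range(m)]
    for i in range(m):
        for p in range(k):
            a = alpha * A[i][p]
            # Rows of B and out are walked sequentially: a regular pattern.
            for j in range(n):
                out[i][j] += a * B[p][j]
    return out


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[0.0, 0.0], [0.0, 0.0]]
product = sgemm(1.0, A, B, 0.0, C)
```

For n-by-n matrices the loop performs on the order of n^3 multiply-adds over n^2 data, which is why the kernel is compute bound once data reuse is exploited.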

Monte Carlo
– Randomly samples a complex function
– Regular access patterns
– Compute bound
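The idea of random sampling can be sketched with a generic example: estimating the mean of a function over [0, 1]. This is an assumption made for illustration and is not the workload used in the paper; the function name, the integrand and the seed are all hypothetical.

```python
import random


def monte_carlo_mean(f, num_samples, seed=0):
    """Estimate the mean of f over [0, 1) from uniform random samples."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(num_samples):
        total += f(rng.random())
    return total / num_samples


# The exact value of the integral of x^2 over [0, 1] is 1/3.
estimate = monte_carlo_mean(lambda x: x * x, 100_000)
```

Each sample is independent of the others, so the loop parallelizes naturally across threads and SIMD lanes.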

FFT (Fast Fourier Transform)
– Converts signals from the time domain to the frequency domain
– Regular access patterns
– Compute bound
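A minimal sketch of the transform, assuming a recursive radix-2 Cooley-Tukey formulation and power-of-two input lengths. The slides do not say which FFT variant was benchmarked, so this is one common choice rather than the evaluated implementation.

```python
import cmath


def fft(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor applied to the odd-indexed half.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


# A constant signal has all of its energy in the zero-frequency bin.
spectrum = fft([1.0, 1.0, 1.0, 1.0])
```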

SpMV (Sparse Matrix-Vector Multiplication)
– Gather access patterns
– Bandwidth bound
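The gather access pattern is easiest to see in code. The sketch below assumes the compressed sparse row (CSR) layout, a common storage format; the slide itself does not name a format, and the variable names are illustrative.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x is read at data-dependent positions: this is the gather.
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# CSR encoding of [[4, 0, 0],
#                  [0, 0, 2],
#                  [1, 0, 3]]
values = [4.0, 2.0, 1.0, 3.0]
col_idx = [0, 2, 0, 2]
row_ptr = [0, 1, 2, 4]
y = spmv_csr(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
```

Each nonzero contributes one multiply-add but requires loading a value, a column index and a vector element, which is consistent with the kernel being bandwidth bound.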

Sort (Radix sort)
– Multi-pass sorting algorithm
– Gather/scatter access patterns
– Compute bound
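A sketch of a multi-pass radix sort, assuming a least-significant-digit variant over non-negative integer keys. The digit width and key width are illustrative parameters, not values taken from the slides.

```python
def radix_sort(keys, bits_per_pass=8, key_bits=32):
    """LSD radix sort for non-negative integers below 2**key_bits."""
    radix = 1 << bits_per_pass
    mask = radix - 1
    for shift in range(0, key_bits, bits_per_pass):
        # Pass step 1: histogram of the current digit.
        counts = [0] * radix
        for key in keys:
            counts[(key >> shift) & mask] += 1
        # Pass step 2: exclusive prefix sum gives each digit's start offset.
        offsets = [0] * radix
        total = 0
        for digit in range(radix):
            offsets[digit] = total
            total += counts[digit]
        # Pass step 3: stable scatter of keys to their new positions.
        out = [0] * len(keys)
        for key in keys:
            digit = (key >> shift) & mask
            out[offsets[digit]] = key
            offsets[digit] += 1
        keys = out
    return keys


sorted_keys = radix_sort([170, 45, 75, 90, 802, 24, 2, 66])
```

The scatter in step 3 writes to data-dependent locations, which is the access pattern the slide refers to.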

Search
– In-memory tree-structured index search
– Gather/scatter access patterns
– Compute bound for small trees, otherwise bandwidth bound
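One way to picture an in-memory tree index is a binary search tree stored implicitly in an array, with the children of node i at positions 2i+1 and 2i+2. This layout is an assumption for illustration; the slides do not describe the index structure used, and the function names are hypothetical.

```python
def build_tree(sorted_keys):
    """Lay out sorted keys as an implicit binary search tree in an array."""
    tree = [None] * len(sorted_keys)
    remaining = iter(sorted_keys)

    def fill(node):
        # An in-order walk assigns keys so that left < node < right.
        if node < len(tree):
            fill(2 * node + 1)
            tree[node] = next(remaining)
            fill(2 * node + 2)

    fill(0)
    return tree


def search(tree, query):
    """Return the array position of query, or -1 if it is absent."""
    node = 0
    while node < len(tree):
        if query == tree[node]:
            return node
        # Each step jumps to a data-dependent position in the array.
        node = 2 * node + 1 if query < tree[node] else 2 * node + 2
    return -1


tree = build_tree([1, 3, 5, 7, 9, 11, 13])
hit = search(tree, 9)
miss = search(tree, 4)
```

A small tree fits in cache, so the comparisons dominate; once the tree outgrows the cache, each level of the descent can cost a memory access.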

Experiment
1) A 3.2GHz Core i7-960 processor
– SUSE Enterprise Server 11 operating system
– 6GB of PC1333 DDR3 memory
2) A 1.3GHz GTX280 processor (with 1GB GDDR3 memory) in the same Core i7 system
– Nvidia driver version 19.180, CUDA 2.3 toolkit

Experiment
[Figure: experimental results; slide annotation: 2.5X]

Optimization
– CPU optimization
– GPU optimization
– Hardware recommendations

Optimization
CPU optimization
– Multi-threading
– SIMDification
– Cache blocking
– Memory management
– Data structure re-arrangement
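Of the items above, cache blocking is the easiest to show as a loop transformation. The sketch below tiles a matrix multiply so that each tile of the operands is reused while it is still resident in cache. It is written in Python only to show the loop structure; the speedup would appear in a compiled language, and the block size here is a hypothetical value, not a tuned one.

```python
def matmul_blocked(A, B, block=2):
    """Matrix multiply with loops tiled into block-by-block sub-problems."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    # Outer three loops pick a tile; inner three loops work inside it.
    for ii in range(0, n, block):
        for pp in range(0, k, block):
            for jj in range(0, m, block):
                for i in range(ii, min(ii + block, n)):
                    for p in range(pp, min(pp + block, k)):
                        a = A[i][p]
                        for j in range(jj, min(jj + block, m)):
                            C[i][j] += a * B[p][j]
    return C


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
blocked = matmul_blocked(A, B)
```

The result is identical to the untiled loop; only the order in which elements are visited changes.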

Optimization
GPU optimization
– Multi-threading
– Branch divergence reduction
– Coalescing memory accesses
– Synchronization avoidance
– Local shared buffer optimization

Hardware Recommendations
– Large cache
– High memory bandwidth
– Efficient synchronization
– Cache coherence

Truth or Myth?
GPUs are 10–100x faster than CPUs

Max Speedup
                Intel Core i7 960    nVIDIA GTX 280
Cores           4                    30*8
Peak SP Flop    102 GF/s             933 GF/s
Peak BW         30 GB/s              141 GB/s
Max Speedup: GTX 280 over Core i7 960
– Compute Bound Apps (SP): 933/102 = 9.1x
– Bandwidth Bound Apps: 141/30 = 4.7x


Summary
– Without parallelization, neither GPU nor CPU will perform well
– Architecture-specific optimization
– Compare performance fairly
– Memory is the key in multicore

Thanks for your attention. Any questions?
