Emergence of GPU systems for general purpose high performance computing
ITCS 4145/5145 © Barry Wilkinson GPUIntro.ppt Oct 30, 2014
Titan Supercomputer
Oak Ridge National Laboratory, Oak Ridge, Tenn. World's fastest computer on the TOP500 list, Nov 2012 – May 2013.
18,688 NVIDIA Tesla K20X GPUs (each having 2688 cores); 20 petaflops.
Upgraded from the Jaguar supercomputer: 10 times faster and 5 times more energy efficient than the 2.3-petaflops Jaguar system while occupying the same floor space.
http://nvidianews.nvidia.com/Releases/NVIDIA-Powers-Titan-World-s-Fastest-Supercomputer-For-Open-Scientific-Research-8a0.aspx#source=pr
Tesla K20 GPU Computing Modules
Kepler architecture, introduced November 2012.
K20 – 2496 thread processors
K20X – 2688 thread processors
K40 – 2880 thread processors (2013)
GFLOPS: single precision 3519 – 4106; double precision 1173.
CPU-GPU architecture evolution
Co-processors (1970s – 1980s) -- a very old idea that appeared in the 1970s and 1980s: floating point co-processors attached to microprocessors that did not then have floating point capability. Co-processors simply executed floating point instructions that were fetched from memory.
Graphics cards -- around the same time, hardware support for displays appeared, especially with the increasing use of graphics and PC games. This led to graphics processing units (GPUs) attached to the CPU to create the video display.
2013: the Xeon Phi processor with 60 cores is described as a co-processor, although it is connected through a PCIe interface in a similar fashion to recent GPU cards.
[Early designs: CPU with memory and attached co-processor; CPU with graphics card driving a display]
Pipelined programmable GPU
Dedicated pipeline (late 1990s – early 2000s). By the late 1990s, graphics chips needed to support 3-D graphics, especially for games, through APIs such as DirectX and OpenGL. They generally had a pipeline structure, with individual stages performing specialized operations and finally loading the frame buffer for display. Individual stages may have access to graphics memory for storing intermediate computed data.
[Pipeline: Input stage → Vertex shader stage → Geometry shader stage → Rasterizer stage → Pixel shading stage → Frame buffer, with stages accessing graphics memory]
General-purpose GPU designs
High performance pipelines call for high-speed (IEEE) floating point operations. People tried to use GPU cards to speed up scientific computations, an approach known as GPGPU (general-purpose computing on graphics processing units) -- difficult to do with specialized graphics pipelines, but possible.
By the mid 2000s, it was recognized that the individual stages of the graphics pipeline could be implemented by more general purpose processor cores (although with a data-parallel paradigm).
Graphics Processing Units (GPUs) -- Brief History
1970s: Atari 8-bit computer text/graphics chip
1980s: IBM PC Professional Graphics Controller card
1990s: S3 graphics cards (single-chip 2D accelerator); OpenGL graphics API; hardware-accelerated 3D graphics; DirectX graphics API; PlayStation
2000s: GPUs with programmable shading -- NVIDIA GeForce 3 (2001); general-purpose computing on graphics processing units (GPGPU)
2010s: GPU computing
Source of information: http://en.wikipedia.org/wiki/Graphics_Processing_Unit
NVIDIA products
NVIDIA Corp. is a leader in GPUs for high performance computing. Established 1993 by Jen-Hsun Huang, Chris Malachowsky, and Curtis Priem.
GeForce series: NV1 (1995), GeForce 1 (1999), GeForce 2 series, GeForce FX series, GeForce 8 series -- including the GeForce 8800 (GT80 chip), NVIDIA's first GPU with general purpose processors -- GeForce 200 series (GTX 260/275/280/285/295), GeForce 400 series (GTX 460/465/470/475/480/485).
Tesla computing products: C870, S870, C1060, S1070, C2050, … The C2050 GPU has 448 thread processors; the Kepler K20 GPU has 2496 thread processors.
Quadro professional line.
Architectures: Tesla, Fermi, Kepler (2011), Maxwell (2013).
http://en.wikipedia.org/wiki/GeForce
NVIDIA GT80 chip / GeForce 8800 card (2006)
First GPU for high performance computing as well as graphics.
Unified processors that could perform vertex, geometry, pixel, and general computing operations.
Programs could now be written in C rather than through graphics APIs.
Single-instruction multiple-thread (SIMT) programming model.
Evolving GPU design: NVIDIA Fermi architecture (announced Sept 2009)*
Data parallel single-instruction multiple-data operation ("stream" processing).
Up to 512 cores ("stream processing engines", SPEs), organized as 16 streaming multiprocessors, each having 32 cores.
3 GB or 6 GB GDDR5 memory.
Many innovations including L1/L2 caches, unified device memory addressing, ECC memory, …
First implementation: Tesla 20 series (single-chip C2050/2070, 4-chip S2050/2070). An approximately 3-billion-transistor chip. The number of cores is limited by power considerations; the C2050 has 448 cores.
* Whitepaper: NVIDIA's Next Generation CUDA Compute Architecture: Fermi, NVIDIA, 2008
GPU performance gains over CPUs
[Chart: peak performance of NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) versus Intel CPUs (3 GHz dual-core P4, 3 GHz Core2 Duo, 3 GHz Xeon quad, Westmere)]
Source © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign
NVIDIA Kepler architecture and GPUs (2012+)
Many major new features over the earlier Fermi architecture.
K10/GK104 -- 1536 cores
K20/GK110 -- 2496 cores
K40/GK180 -- 2880 cores
CUDA Compute Capability 3.0 (see next slide).
[Image: GK104 chip with 1536 cores]
http://www.tomshardware.com/news/Nvidia-Kepler-GK104-GeForce-GTX-670-680,14691.html
NVIDIA GPUs
Stream processing -- term used to denote processing of a stream of instructions operating in a data parallel fashion.
Stream processors (SPs) -- the execution cores that execute the stream. Each stream processor has compute resources such as a register file, instruction scheduler, …
Streaming multiprocessors (SMs) -- groups of streaming processors that share control logic and cache.
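These per-device counts can be read at run time through the CUDA runtime API. A minimal sketch, assuming a machine with the CUDA toolkit installed and at least one CUDA-capable GPU (the runtime reports the SM count directly; the number of SPs per SM depends on the architecture generation, so it is not a single field):

```cuda
#include <stdio.h>

// Minimal sketch: query the streaming multiprocessor (SM) count
// and compute capability of device 0 via the CUDA runtime API.
int main(void) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "No CUDA device found\n");
        return 1;
    }
    printf("Device: %s\n", prop.name);
    printf("SM count: %d\n", prop.multiProcessorCount);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}
```

On a C2050 this would report 14 SMs; on a K20, 13 SMXs.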
NVIDIA C2050 (as on coit-grid06.uncc.edu and cci-grid07)
14 streaming multiprocessors (SMs), each with 32 streaming processors (SPs), so 448 streaming processors (cores).
Apparently Fermi was originally intended to have 512 cores (16 SMs), but the design ran too hot.
NVIDIA K20 (as on coit-grid08)
13 streaming multiprocessors (SMXs, "extreme"), each with 192 streaming processors (SPs), so 2496 streaming processors (cores).
Actually 15 SMXs (2880 cores) are fabricated on the chip to improve yield.
CUDA (Compute Unified Device Architecture)
Architecture and programming model introduced by NVIDIA in 2007. Enables GPUs to execute programs written in C. Within C programs, SIMT "kernel" routines are called that execute on the GPU. A CUDA syntax extension to C identifies a routine as a kernel. Very easy to learn, although getting the highest possible execution performance requires an understanding of the hardware architecture.
Version 3 introduced 2009.
Version 4 introduced 2011 -- significant additions including "unified virtual addressing", a single address space across GPU and host.
Most recent version 5.5 introduced July 2013.
We will go into CUDA in detail shortly and gain programming experience.
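To give the flavor ahead of the detailed treatment: a minimal vector-addition sketch in the SIMT style, assuming a CUDA-capable GPU and the `nvcc` compiler. Each GPU thread handles one array element; the `__global__` qualifier is the syntax extension that marks a routine as a kernel, and the `<<<blocks, threads>>>` notation launches it:

```cuda
#include <stdio.h>

#define N 1024

// Kernel: each thread computes one element of c (SIMT model).
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    // Allocate device copies and send inputs to the GPU.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, N * sizeof(float));
    cudaMalloc(&dB, N * sizeof(float));
    cudaMalloc(&dC, N * sizeof(float));
    cudaMemcpy(dA, a, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b, N * sizeof(float), cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover N elements, one thread each.
    vecAdd<<<(N + 255) / 256, 256>>>(dA, dB, dC, N);
    cudaMemcpy(c, dC, N * sizeof(float), cudaMemcpyDeviceToHost);

    printf("c[10] = %f\n", c[10]);  // a[10] + b[10] = 10 + 20 = 30
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

The guard `if (i < n)` matters because the launch rounds the thread count up to a whole number of blocks.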
Questions