Towards a Usable Programming Model for GPGPU. Dr. Orion Sky Lawlor, U. Alaska Fairbanks, 2011-04-19.


1 Towards a Usable Programming Model for GPGPU Dr. Orion Sky Lawlor U. Alaska Fairbanks

2 Obligatory Introductory Quote “He who controls the past, controls the future.” George Orwell, 1984

3 In Parallel Programming... “He who controls the past, controls the future.” George Orwell, 1984 “He who controls the writes, controls performance.” Orion Lawlor, 2011

4 Talk Outline
- Existing parallel programming models
- Who controls the writes?
- Charm++ and Charm--
- Charm-style GPGPU
- Conclusions

5 Existing Model: Superscalar
- Hardware parallelization of a sequential programming model
- Fetch future instructions: needs good branch prediction
- Runtime dependency analysis: load/store buffer for memory-carried dependencies
- Rename away the false dependencies (RAR, WAR, WAW); only true RAW dependencies remain
- Now "solved": little future gain remains

6 Spacetime Data Arrows [diagram: read and write arrows plotted with "time" (program order) on one axis and "space" (memory, node) on the other]

7 Read After Write Dependency [diagram: a write followed by a read of the same data across an artificial instruction boundary]

8 Read After Read: No Problem! [diagram: read/write arrows across an artificial instruction boundary; two reads of the same data need no ordering] (a small code illustration follows)
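
To make slides 5-8 concrete, a tiny illustrative C++ fragment (the variable names are mine, not from the talk) showing the dependency classes a superscalar core sorts out at runtime: only the true read-after-write dependency forces ordering; write-after-read and write-after-write are false dependencies that register renaming removes, and read-after-read needs no ordering at all.

    void dependencies(float a, float b) {
        float x, y, z;
        x = a * b;      // (1) writes x
        y = x + 1.0f;   // (2) RAW: reads the x written by (1): a true dependency, must wait
        z = a + b;      // (3) RAR: reads a and b, also read by (1): no ordering needed
        x = b * b;      // (4) WAR with (2) and WAW with (1) on x: false dependencies,
                        //     removed by register renaming, so (4) can issue early
        (void)x; (void)y; (void)z;  // silence unused-variable warnings
    }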

9 Existing Model: Shared Memory
- OpenMP, threads, shmem
- "Just" let different processors access each other's memory
- HW: cache coherence (false sharing, cache thrashing)
- SW: synchronization (locks, semaphores, fences, ...)
- Correctness is a huge issue: weird race conditions abound, and new bugs appear in 10+ year old code

10 Gather: Works Fine [diagram: distributed reads feeding centralized writes]

11 Scatter: Tough to Synchronize! [diagram: scattered writes collide: "Oops!"] (a CUDA sketch of the problem follows)
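
A minimal CUDA sketch of why scatter is hard to synchronize (the kernel name histogram_scatter and the histogram setting are my illustration, not from the talk): many threads may target the same output location, so a plain read-modify-write loses updates, while atomicAdd restores correctness at the cost of serializing colliding writes.

    // Many threads scatter increments into a shared histogram.
    __global__ void histogram_scatter(const int *bin_of, int *hist, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        // BUGGY:  hist[bin_of[i]] += 1;   // two threads hitting the same bin race ("Oops!")
        atomicAdd(&hist[bin_of[i]], 1);    // correct, but colliding writes are serialized
    }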

12 Existing Model: Message Passing
- MPI, sockets
- Explicit control of parallel reads (send) and writes (recv)
- Far fewer race conditions
- Programmability is an issue: raw byte-based interface (C style)
- High per-message cost (the latency term, alpha)
- Synchronization issues: when does MPI_Send block? (see the sketch below)
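
For contrast, a minimal MPI sketch of the explicit read/write control this slide describes (the payload and tag are arbitrary): rank 0 controls the write by sending raw bytes, rank 1 by posting the matching receive; whether MPI_Send returns immediately or blocks depends on message size and the implementation's buffering.

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        double payload[1024] = {0};
        if (rank == 0) {
            // Explicit "write": may return once buffered, or block until matched.
            MPI_Send(payload, 1024, MPI_DOUBLE, 1, /*tag=*/7, MPI_COMM_WORLD);
        } else if (rank == 1) {
            // Explicit "read": blocks until the matching send arrives.
            MPI_Recv(payload, 1024, MPI_DOUBLE, 0, 7, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %zu bytes\n", sizeof(payload));
        }
        MPI_Finalize();
        return 0;
    }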

13 Existing Model: SIMD
- SSE, AVX, and GPU
- Single Instruction, Multiple Data: far fewer fetches & decodes, far higher arithmetic intensity
- CPU programmability: N/A. Assembly language (hello, 1984!), xmmintrin.h wrappers such as _mm_add_ps, or pray for an automagic compiler! (intrinsics sketch below)
- GPU programmability: OK. Graphics side: GLSL, HLSL, Cg. GPGPU: CUDA, OpenCL, DirectX compute shaders
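
A small sketch of the intrinsic-wrapper style mentioned above (the function name add4 is mine): _mm_add_ps from <xmmintrin.h> adds four packed floats in one SSE instruction.

    #include <xmmintrin.h>   // SSE intrinsics: __m128 holds 4 packed floats

    // c[0..3] = a[0..3] + b[0..3] in a single SSE add.
    void add4(const float *a, const float *b, float *c) {
        __m128 va = _mm_loadu_ps(a);
        __m128 vb = _mm_loadu_ps(b);
        _mm_storeu_ps(c, _mm_add_ps(va, vb));
    }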

14 NVIDIA CUDA
- The CPU calls a GPU "kernel" with a "block" of threads (see the sketch below)
- Now fully programmable (throw/catch, virtual methods, recursion, etc.)
- Read and write memory anywhere
- Zero protection against multithreaded race conditions
- Manual control over a small __shared__ memory region
- Only runs on NVIDIA hardware (OpenCL is portable... sorta)
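
A minimal CUDA sketch of the model this slide describes, assuming a made-up scale_kernel and the 128-thread blocks used later in the talk: the CPU launches a grid of blocks, each block stages data in its small __shared__ region, and nothing would stop the threads from racing if they wrote to overlapping locations.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Each 128-thread block stages its chunk in the small, fast __shared__
    // region; every thread then writes exactly one output element.
    __global__ void scale_kernel(const float *in, float *out, float s, int n) {
        __shared__ float stage[128];                  // manually managed per-block memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) stage[threadIdx.x] = in[i];
        __syncthreads();                              // the whole block sees the staged data
        if (i < n) out[i] = s * stage[threadIdx.x];
    }

    int main() {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *h = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) h[i] = (float)i;

        float *d_in, *d_out;
        cudaMalloc(&d_in, bytes);
        cudaMalloc(&d_out, bytes);
        cudaMemcpy(d_in, h, bytes, cudaMemcpyHostToDevice);

        // The CPU calls the kernel with a grid of n/128 blocks of 128 threads each.
        scale_kernel<<<(n + 127) / 128, 128>>>(d_in, d_out, 2.0f, n);

        cudaMemcpy(h, d_out, bytes, cudaMemcpyDeviceToHost);
        printf("out[7] = %f\n", h[7]);
        cudaFree(d_in); cudaFree(d_out); free(h);
        return 0;
    }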

15 OpenGL: GL Shading Language
- Mostly programmable (loops, etc.)
- Can read anywhere in "textures", but only write to the "framebuffer" (2D/3D arrays)
- Reads go through the "texture cache", so performance is good (given locality)
- Writes land along a space-filling curve controlled by the graphics driver, so there cannot be synchronization bugs! (see the gather-only sketch below)
- Rich selection of texture filtering (array interpolation) modes, including mipmaps (useful for multigrid)
- GLSL runs OK on every modern GPU (well, except Intel...)
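
To keep one example language, here is the GLSL discipline expressed in CUDA rather than GLSL itself (the kernel name blur_gather is mine): a "gather-only" kernel reads anywhere in the input but writes only the one output element it owns, which is exactly why such programs cannot have write races.

    // GLSL-style gather: arbitrary reads, but each thread owns exactly one output.
    __global__ void blur_gather(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i <= 0 || i >= n - 1) return;
        out[i] = (in[i - 1] + in[i] + in[i + 1]) * (1.0f / 3.0f);  // reads three, writes only its own
    }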

16 GLSL vs CUDA [Venn diagram: the set of GLSL programs]

17 GLSL vs CUDA [Venn diagram: GLSL programs (mipmaps, texture writes) alongside CUDA programs (arbitrary writes)]

18 GLSL vs CUDA [Venn diagram adds the set of correct programs]

19 GLSL vs CUDA [Venn diagram adds the set of high-performance programs]

20 GPU/CPU Convergence
- GPU, per socket: SIMD: ?-way; SMT: 2-50 way (register limited); SMP: 4-36 way
- CPUs will get there, soon! SIMD: 8-way AVX (64-way SWAR); SMT: 2-way Intel, 4-way IBM; SMP: 6-8 way per socket already, and Intel has shown 48-way chips
- Biggest difference: the CPU has branch prediction & superscalar execution!

21 CUDA: Memory Output Bandwidth
- NVIDIA GeForce GTX 280, fixed 128 threads per block
- Kernel startup latency: 4 us
- Kernel output bandwidth: 80 GB/s
- Model: t ≈ 4000 ns per kernel + bytes written × 0.0125 ns per byte (the 80 GB/s figure; see the sketch below)
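
A small calculator for the latency + bandwidth model above; the 0.0125 ns/byte coefficient is not on the slide but follows directly from the quoted 80 GB/s (80 bytes per ns).

    #include <cstdio>

    // Time for one kernel launch that writes `bytes_written` bytes of output,
    // using the GTX 280 numbers quoted on slide 21.
    double kernel_time_ns(double bytes_written) {
        const double startup_ns  = 4000.0;      // 4 us kernel startup latency
        const double ns_per_byte = 1.0 / 80.0;  // 80 GB/s == 80 bytes per ns
        return startup_ns + bytes_written * ns_per_byte;
    }

    int main() {
        // Launch latency dominates small outputs; bandwidth dominates large ones.
        for (double b = 1e3; b <= 1e9; b *= 1000.0)
            printf("%12.0f bytes -> %12.0f ns\n", b, kernel_time_ns(b));
        return 0;
    }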

22 Charm++ and “Charm--”

23 Existing Model: Charm++
- Chares send each other messages; the runtime system does delivery (see the sketch below)
- Scheduling! Migration with efficient forwarding; cheap broadcasts
- The runtime system schedules chares, overlapping communication and computation
- Programmability is still an issue: per-message overhead, even with the message-combining library; collect up your messages (SDAG?)
- Cheap SMP reads? SIMD? GPU?
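
For readers who have not seen Charm++, a minimal from-memory sketch of the model this slide assumes; it is not code from the talk. The names Main, Hello, and greet are illustrative; the calls (CProxy_…::ckNew, proxy broadcasts, thisIndex, CkPrintf, CkExit) are the standard Charm++ API as I recall it, and the usual build boilerplate is abbreviated.

    // hello.ci : interface file the Charm++ translator turns into .decl.h/.def.h
    mainmodule hello {
      mainchare Main {
        entry Main(CkArgMsg *m);
      };
      array [1D] Hello {
        entry Hello();
        entry void greet(int fromRank);   // asynchronous entry method ("send a message")
      };
    };

    // hello.cpp
    #include "hello.decl.h"

    class Main : public CBase_Main {
    public:
      Main(CkArgMsg *m) {
        CProxy_Hello arr = CProxy_Hello::ckNew(8);  // create 8 chares; the runtime picks placement
        arr.greet(0);                               // broadcast: invoke greet on every element
      }
    };

    class Hello : public CBase_Hello {
    public:
      Hello() {}
      Hello(CkMigrateMessage *m) {}                 // lets the runtime migrate this chare
      void greet(int fromRank) {
        CkPrintf("chare %d greeted by %d\n", thisIndex, fromRank);
        if (thisIndex == 7) CkExit();
      }
    };

    #include "hello.def.h"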

24 One Charm++ Method Invocation [diagram: an entry method receives one message (but in what order?), reads and updates the chare's internal state, and sends messages; between send and receive: migration, checkpointing, ...]

25 The Future: SIMD
- AVX, SSE, AltiVec, GPU, etc.
- Thought experiment: imagine a block of 8 chares living in one SIMD register, so you deliver 8 messages at once (!)
- Or imagine 100K chares living in GPU RAM
- Locality (mapping) is important! Branch divergence carries a penalty
- Struct-of-Arrays member storage: xxxxxxxx yyyyyyyy zzzzzzzz are the members of 8 separate chares! (see the sketch below)
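
A sketch of the struct-of-arrays layout described above, using AVX intrinsics from <immintrin.h>; the names Chare8 and deliver8 are illustrative. Because the x members of eight chares sit contiguously, one SIMD add updates the same member of all eight chares, which is the "deliver 8 messages at once" idea.

    #include <immintrin.h>  // AVX intrinsics: __m256 holds 8 packed floats

    // Struct-of-Arrays: each member of 8 chares is stored contiguously.
    struct Chare8 {
        alignas(32) float x[8], y[8], z[8];
    };

    // Deliver one "message" (dx, dy, dz) to all 8 chares in three AVX adds.
    void deliver8(Chare8 &c, float dx, float dy, float dz) {
        _mm256_store_ps(c.x, _mm256_add_ps(_mm256_load_ps(c.x), _mm256_set1_ps(dx)));
        _mm256_store_ps(c.y, _mm256_add_ps(_mm256_load_ps(c.y), _mm256_set1_ps(dy)));
        _mm256_store_ps(c.z, _mm256_add_ps(_mm256_load_ps(c.z), _mm256_set1_ps(dz)));
    }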

26 Vision: Charm-- Stencil

    array [2D] stencil {
    public:
      float data;
      [entry] void average(float nbors[4] = fetchnbors()) {
        data = 0.25*(nbors[0] + nbors[1] + nbors[2] + nbors[3]);
      }
    };

27 Vision: Charm-- Explained [same code as slide 26, annotated: "Assembled into GPU arrays or SSE vectors"]

28 Vision: Charm-- Explained [same code, annotated: "Broadcast out to blocks of array elements"]

29 Vision: Charm-- Explained [same code, annotated: "Hides local synchronized reads, network, and domain boundaries"]

30 Vision: Charm-- Springs

    array [1D] sim_spring {
    public:
      float restlength;
      [entry] void netforce(sim_vertex ends[2] = fetch_ends()) {
        vec3 along = ends[1].pos - ends[0].pos;
        float f = -k*(length(along) - restlength);   // k: the spring stiffness constant
        vec3 F = f*normalize(along);
        ends[0].netforce += F;
        ends[1].netforce -= F;
      }
    };

31 One Charm-- Method Invocation [diagram: a chare on the GPU fetches together multiple messages, reads and updates internal state, and sends off network messages; a "mainchare" on the CPU coordinates]

32 Noncontiguous Communication [diagram: network data buffer scattered into the GPU target buffer] Run a scatter kernel, or fold it into the fetch (see the sketch below)
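
A sketch of the scatter-kernel step (the name unpack_scatter and the index-array interface are my assumptions): once the contiguous network buffer has been copied to the GPU, one thread per packed element redistributes it to its noncontiguous target location.

    // One thread per packed element: read contiguously from the network staging
    // buffer, write to the scattered target locations given by an index array.
    __global__ void unpack_scatter(const float *netbuf, const int *target_idx,
                                   float *target, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) target[target_idx[i]] = netbuf[i];
    }
    // Host side:  unpack_scatter<<<(n + 255) / 256, 256>>>(d_netbuf, d_idx, d_target, n);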

33 Key Charm-- Design Features
- Multiple chares receive a message at once
- The runtime block-allocates incoming and outgoing message storage (critical for SIMD, GPU, SMP)
- Receive multiple messages in one entry point, minimizing round trips to the GPU
- Explicit support for timesteps, e.g. double-buffered message storage

34 Charm-- Not Shown
- Lots of work in the "mainchare": it controls decomposition & communication and sets up the "fetch"
- Still lots of network work: gather & send off messages, distribute incoming messages
- Division of labor? The application scientist writes the Chare; the computer scientist writes the Mainchare

35 Related Work
- Charm++ Accelerator API [Wesolowski]: pipelines CUDA copies and queues kernels; a good backend for Charm--
- Intel ArBB: SIMD from a kernel, based on RapidMind (but GPU support?)
- My "GPGPU" library, based on GLSL

36 The Future

37 The Future: Memory Bandwidth
- Today: 1 TF/s of arithmetic, but only 0.1 TB/s of memory bandwidth
- Don't communicate, recompute: multistep stencil methods ("fetch" gets even more complex!)
- 64-bit -> 32-bit -> 16-bit -> 8-bit? Spend flops scaling the data
- Split solution + residual storage: most flops use fewer bits, in the residual
- Fight roundoff with stochastic rounding: add noise to improve precision (see the sketch below)
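
One common form of the stochastic rounding mentioned above, sketched on the CPU for clarity (the function name and the use of std::mt19937 are my choices): round a double down or up to a neighboring float with probability proportional to closeness, so the rounding error is zero-mean instead of a systematic bias.

    #include <cmath>
    #include <cstdio>
    #include <random>

    // Round x to one of the two neighboring floats, choosing the closer one
    // with proportionally higher probability, so errors average out to zero.
    float stochastic_round(double x, std::mt19937 &rng) {
        float f = (float)x;                        // round-to-nearest candidate
        if ((double)f == x) return f;              // exactly representable
        float lo, hi;
        if ((double)f < x) { lo = f; hi = std::nextafterf(f,  INFINITY); }
        else               { hi = f; lo = std::nextafterf(f, -INFINITY); }
        double p_up = (x - (double)lo) / ((double)hi - (double)lo);
        std::uniform_real_distribution<double> u(0.0, 1.0);
        return (u(rng) < p_up) ? hi : lo;
    }

    int main() {
        std::mt19937 rng(42);
        double x = 1.0 + 1e-9;                     // falls between two adjacent floats
        double sum = 0.0;
        for (int i = 0; i < 1000000; ++i) sum += stochastic_round(x, rng);
        printf("mean %.12f vs exact %.12f\n", sum / 1e6, x);  // the mean approaches x
        return 0;
    }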

38 Conclusions
- C++ is dead. Long live C++!
- CPU and GPU are on a collision course: SIMD + SMT + SMP + network
- Software is the bottleneck; an exciting time to build software!
- The Charm-- model: support ultra-low-grainsize chares, combine them into SIMD blocks at runtime, simplify the programmer's life, add flexibility for the runtime system
- BUT it must scale to real applications!