
1 GPU programming: eScience or engineering? Henri Bal Vrije Universiteit Amsterdam COMMIT/

2 Graphics Processing Units ● GPUs and other accelerators are taking the Top500 by storm ● Many application success stories ● But GPUs are notoriously difficult to program and optimize http://www.nvidia.com/object/tesla-case-studies.html

3 Example 1: convolution ● About half a Ph.D. thesis ● (Figure: naive vs. fully optimized performance)

4 Example 2: Auto-Tuning Dedispersion ● >100 configurations, the winner is an outlier ● (Number of threads, amount of work per thread, etc.)

5 Example 3: Parallel Programming Lab course ● Master course practical (next to lectures) ● CUDA: ● Simple image processing application on 1 node ● MPI: ● Parallel all pairs shortest path algorithms ● CUDA: 11 out of 21 passed (52 %) ● MPI: 17 out of 21 passed (80 %)

6 Questions ● Why are accelerators so difficult to program? ● What are the challenges for Computer Science? ● What role do applications play?

7 Background ● Netherlands eScience Center ● Bridge between ICT and applications (climate modeling, astronomy, water management, digital forensics, …..) ● COMMIT/ ● COMMIT (100 M€): public-private Dutch ICT program ● Distributed ASCI Supercomputer ● Testbed for Dutch Computer Science COMMIT/

8 Background (team) Ph.D. students ● Ben van Werkhoven ● Alessio Sclocco ● Ismail El Helw ● Pieter Hijma Staff ● Rob van Nieuwpoort (NLeSC) ● Ana Varbanescu (UvA) Scientific programmers ● Rutger Hofman ● Ceriel Jacobs

9 Differences between CPUs and GPUs ● Different goals produce different designs ● CPU must be good at everything, parallel or not ● GPU assumes the workload is highly parallel ● CPU: minimize latency of 1 thread ● Big on-chip caches ● Sophisticated control logic ● GPU: maximize throughput of all threads ● Multithreading can hide latency → no big caches ● Share control logic across many threads ● (Figure: control / ALU / cache areas of a CPU die vs. a GPU die)

10 Example: NVIDIA Maxwell ● 16 independent streaming multiprocessors (SM) ● 128 cores per SM (2048 total) ● 96KB shared memory

11 Thread hierarchy ● All threads execute the same sequential kernel ● Threads are grouped into thread blocks ● Threads in same block execute on same SM and can work together (synchronize) ● 32 contiguous threads execute as a warp and execute the same instruction in parallel ● Thread blocks are grouped into grid ● Many thousands of threads in total, scheduled by hardware, without preemption
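To make the hierarchy concrete, here is a minimal CUDA sketch (mine, not from the original slides): each thread derives a global index from its block and thread IDs, a guard handles grids larger than the data, and the launch picks the block and grid dimensions.

```cuda
#include <cuda_runtime.h>

// Minimal kernel: every thread computes its global index from its
// position in the block and the block's position in the grid.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard: the grid may be larger than n
        data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // 256 threads per block (8 warps); enough blocks to cover n elements.
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    scale<<<grid, block>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```

The hardware schedules the resulting thousands of threads onto the SMs without preemption, as the slide notes.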

12 Hierarchy of concurrent threads

13 Memory hierarchy (NVIDIA example) ● Shared memory ● Small, fast, on SM, allocated to thread blocks ● Register file ● On SM, private per thread ● Global memory ● Large, off-chip, slow, accessible by host ● Constant memory ● (Figure: CUDA memory spaces: grid, blocks, per-block shared memory, per-thread registers, global and constant memory)

14 Agenda ● Application case studies ● Multimedia kernel (convolution) ● Astronomy kernel (dedispersion) ● Climate modelling: optimizing multiple kernels ● Programming methodologies ● Stepwise refinement: new methodology & model ● Glasswing: MapReduce on accelerators

15 Application case study 1: Convolution operations ● Image I of size I_w × I_h ● Filter F of size F_w × F_h ● Thread block of size B_w × B_h ● CUDA kernel: per filter element each thread does 2 arithmetic operations and 2 loads (8 bytes) ● Arithmetic Intensity (AI) = 0.25
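A naive kernel along these lines (an illustrative sketch with hypothetical names, not the thesis code) makes the arithmetic-intensity count visible: each inner-loop iteration performs one multiply and one add (2 flops) against two 4-byte loads, so AI = 2/8 = 0.25.

```cuda
// Naive 2D convolution: one thread per output pixel. Assumes the input
// image is padded with a border of fw/2 by fh/2 pixels, so no boundary
// checks are needed inside the loop.
__global__ void convolve_naive(const float *image, const float *filter,
                               float *output, int iw, int ih,
                               int fw, int fh) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= iw || y >= ih) return;

    float sum = 0.0f;
    for (int j = 0; j < fh; j++)
        for (int i = 0; i < fw; i++)
            // 1 multiply + 1 add (2 flops), 1 image load + 1 filter load
            // (8 bytes) per iteration -> arithmetic intensity 0.25
            sum += image[(y + j) * (iw + fw - 1) + (x + i)]
                 * filter[j * fw + i];
    output[y * iw + x] = sum;
}
```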

16 Tiled convolution ● 16×16 thread block processing an 11×7 filter ● Filter goes into constant memory (small) ● Threads within a block cooperatively load the entire area needed by all threads in the block into shared memory

17 Analysis ● If filter size increases: ● Arithmetic Intensity increases: ● Kernel shifts from memory-bandwidth bound to compute-bound ● Amount of shared memory needed increases → fewer thread blocks can run concurrently on each SM
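A back-of-the-envelope estimate (my own, not from the slides) shows why: with the filter in constant memory, a B_w × B_h block computing one pixel per thread loads roughly a (B_w + F_w − 1) × (B_h + F_h − 1) input tile from global memory, giving

```latex
\mathrm{AI}_{\text{tiled}} \;\approx\;
\frac{2\, B_w B_h F_w F_h}{4\,(B_w + F_w - 1)(B_h + F_h - 1)}
\;\approx\; \frac{F_w F_h}{2} \ \text{flops/byte}
\quad \text{for } F_w \ll B_w,\; F_h \ll B_h .
```

So AI grows roughly with the filter area, while the shared-memory tile grows as well, which is what limits the number of concurrent thread blocks per SM.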

18 Tiling ● Each thread block computes 1×N tiles in horizontal direction + Increases amount of work per thread + Saves loading overlapping borders + Saves redundant instructions + No shared memory bank conflicts - More shared memory, fewer concurrent thread blocks

19 Adaptive tiling ● Tiling factor is selected at runtime depending on the input data and the resource limitations of the device ● Highest possible tiling factor that fits within the shared memory available (depending on filter size) ● Plus loop unrolling, memory banks, search optimal configuration
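A host-side sketch of the idea (hypothetical helper and parameter names, not the thesis implementation): query the device's shared-memory limit and pick the largest tiling factor whose input tile still fits.

```cuda
#include <cuda_runtime.h>

// Pick the largest horizontal tiling factor N such that the shared-memory
// tile (Bw*N + Fw - 1) x (Bh + Fh - 1) floats still fits on the device.
int pickTilingFactor(int Bw, int Bh, int Fw, int Fh, int maxFactor) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    size_t smemAvail = prop.sharedMemPerBlock;

    int best = 1;
    for (int n = 1; n <= maxFactor; n++) {
        size_t needed = sizeof(float) *
            (size_t)(Bw * n + Fw - 1) * (size_t)(Bh + Fh - 1);
        if (needed <= smemAvail)
            best = n;       // largest factor that still fits
    }
    return best;
}
```

A real decision system would also weigh the resulting drop in concurrent thread blocks per SM, as the slide notes.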

20 Lessons learned ● Everything must be in balance to obtain high performance ● Subtle interactions between resource limits ● Runtime decision system (adaptive tiling), in combination with standard optimizations ● Loop unrolling, memory bank conflicts Ph.D. thesis Ben van Werkhoven, 27 Oct. 2014 FGCS journal, 2014

21 Application case study 2: Dedispersion ● Auto-Tuning dedispersion for many-core Accelerators ● Used for searching pulsars in radio astronomy data Alessio Sclocco et al.: Auto-Tuning Dedispersion for Many-Core Accelerators, IPDPS 2014

22 Dedispersion ● Pulsar signals get dispersed: lower radio frequencies arrive progressively later ● Non-linear function of distance between source & receiver ● Can be reversed by shifting in time the signal’s lower frequencies (dedispersion)
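For reference, the standard cold-plasma dispersion delay used in pulsar astronomy (not stated on the slide, so treat the constant as approximate) is

```latex
\Delta t \;\approx\; 4.15 \times 10^{3}\,\mathrm{s}
\;\cdot\; \mathrm{DM}\,[\mathrm{pc\,cm^{-3}}]
\;\cdot\; \left( \nu_{\mathrm{lo}}^{-2} - \nu_{\mathrm{hi}}^{-2} \right),
\qquad \nu \text{ in MHz}.
```

Dedispersion undoes this by shifting each frequency channel back in time by its Δt before summing over channels; the DM of the source is unknown, so many trial DMs are computed.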

23 Auto-tuning ● Using auto-tuning to adapt the algorithm for: ● Different many-core platforms ● NVIDIA & AMD GPUs, Intel Xeon Phi, Xeon, … ● Different observational scenarios ● LOFAR, Apertif ● Different numbers of Dispersion Measures (DMs) ● Represents number of free electrons between source & receiver ● Measure of distance between emitting object & receiver ● Parameters: ● Number of threads per sample or DM, thread block size, number of registers per thread, …
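A generic tuning skeleton (illustrative only; the IPDPS 2014 tuner explores a much larger parameter space) times each candidate configuration with CUDA events and keeps the fastest valid one:

```cuda
#include <cuda_runtime.h>
#include <cfloat>
#include <cstdio>

// Placeholder kernel standing in for the dedispersion kernel (hypothetical).
__global__ void work(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Benchmark every candidate thread-block size and return the fastest.
// Real tuners also sweep work per thread, registers per thread, etc.
int tuneBlockSize(const float *d_in, float *d_out, int n) {
    const int candidates[] = {32, 64, 128, 256, 512, 1024};
    int bestThreads = 32;
    float bestMs = FLT_MAX;

    for (int t : candidates) {
        dim3 block(t), grid((n + t - 1) / t);
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        work<<<grid, block>>>(d_in, d_out, n);     // one timed run per config
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (cudaGetLastError() == cudaSuccess && ms < bestMs) {
            bestMs = ms;                           // invalid configs are skipped
            bestThreads = t;
        }
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    printf("best block size: %d (%.3f ms)\n", bestThreads, bestMs);
    return bestThreads;
}
```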

24 Auto-Tuning: Registers (LOFAR and Apertif scenarios)

25 Example histogram

26 Speedup over Best Fixed Configuration

27 Lessons learned ● Autotuning allows algorithms to adapt to different platforms and scenarios ● The impact that auto-tuning has on dedispersion is significant ● Guessing a good configuration without auto-tuning is difficult

28 Application case study 3: High-Resolution Global Climate Modeling ● Understand future local sea level changes ● Quantify the effect of changes in freshwater input & ocean circulation on regional sea level height in the Atlantic ● To obtain high resolution, use: ● Distributed computing (multiple resources) ● Based on research on wide-area optimizations done 16 years ago on DAS-1 (Albatross project) ● GPU Computing ● Good example of application-inspired Computer Science research COMMIT/

29 Distributed Computing ● Use Ibis to couple different simulation models ● Land, ice, ocean, atmosphere ● Wide-area optimizations like hierarchical load balancing

30 Enlighten Your Research Global award ● Sites: EMERALD (UK), KRAKEN (USA), STAMPEDE (USA), SUPERMUC (GER), CARTESIUS (NLD), including Top500 systems ranked #7 and #10 ● Connected by 10G links

31 GPU Computing ● Offload expensive kernels for Parallel Ocean Program (POP) from CPU to GPU ● Many different kernels, fairly easy to port to GPUs ● Vertical mixing / Barotropic solvers / ‘State’ calculation … ● Execution time becomes virtually 0 ● New bottleneck: moving data between CPU & GPU ● (Figure: CPU host memory and GPU device memory connected by the PCI Express link)

32 Different methods for CPU-GPU communication ● Memory copies (explicit) ● No overlap with GPU computation ● Device-mapped host memory (implicit) ● Allows fine-grained overlap between computation and communication in either direction ● CUDA Streams or OpenCL command-queues ● Allows overlap between computation and communication in different streams ● Any combination of the above
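The CUDA-streams option, for instance, can be sketched as follows (hypothetical kernel and helper names; assumes the host buffer was allocated with cudaMallocHost, otherwise the asynchronous copies silently become synchronous):

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Placeholder kernel standing in for a POP kernel (hypothetical).
__global__ void process(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 1.1f;
}

// Split the data into chunks so that the host-to-device copy, kernel,
// and device-to-host copy of different chunks can overlap.
void runWithStreams(float *h_data, float *d_data, int n, int nStreams) {
    std::vector<cudaStream_t> streams(nStreams);
    for (int s = 0; s < nStreams; s++) cudaStreamCreate(&streams[s]);

    int chunk = (n + nStreams - 1) / nStreams;
    for (int s = 0; s < nStreams; s++) {
        int offset = s * chunk;
        int len = std::min(chunk, n - offset);
        if (len <= 0) break;
        size_t bytes = (size_t)len * sizeof(float);

        // All three operations are queued on the same stream; operations
        // in different streams may overlap on the copy engine(s) and SMs.
        cudaMemcpyAsync(d_data + offset, h_data + offset, bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(len + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, len);
        cudaMemcpyAsync(h_data + offset, d_data + offset, bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    for (int s = 0; s < nStreams; s++) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```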

33 Problem ● Problem: ● Which method will be most efficient for a given GPU kernel? Implementing all of them can be a large effort ● Solution: ● Create a performance model that identifies the best implementation: ● What implementation strategy for overlapping computation and communication is best for my program? Ben van Werkhoven, Jason Maassen, Frank Seinstra & Henri Bal: Performance models for CPU-GPU data transfers, CCGrid 2014 (nominated for best paper award)

34 Analytical performance model ● What implementation strategy for overlapping computation and communication is best for my program? ● Other questions: ● How much performance is gained from overlapping? ● Is the PCIe link actually a bottleneck in my program? ● What number of streams is likely to give the best performance? ● How will future architectures impact my program’s performance?
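A deliberately simplified first-order bound (my own sketch, not the CCGrid 2014 model itself) already answers some of these questions: without overlap the three phases add up, while with enough streams the run time approaches the slowest pipeline stage, which differs for GPUs with one or two copy engines:

```latex
T_{\text{no overlap}} \;=\; T_{H \to D} + T_{\text{kernel}} + T_{D \to H},
\qquad
T_{\text{streams}} \;\gtrsim\;
\begin{cases}
\max\!\big(T_{H \to D} + T_{D \to H},\; T_{\text{kernel}}\big) & \text{(1 copy engine)}\\[4pt]
\max\!\big(T_{H \to D},\; T_{\text{kernel}},\; T_{D \to H}\big) & \text{(2 copy engines)}
\end{cases}
```

If T_kernel dominates both bounds, the PCIe link is not the bottleneck and extra streams buy little.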

35 Example result ● Implicit synchronization and 1 copy engine ● 2 POP kernels (state and buoydiff) ● GTX 680 connected over PCIe 2.0 ● (Figure: measured vs. modelled performance)

36 Different GPUs (buoydiff and state kernels)

37 MOVIE

38 Comes with spreadsheet

39 Lessons learned ● PCIe transfers can have a large performance impact ● Several methods for transferring data and overlapping computation & communication exist ● Performance modelling helps to select the best mechanism

40 Why is GPU programming hard? ● Mapping an algorithm onto the architecture is difficult, especially because the architecture itself is complex: ● Many levels of parallelism ● Limited resources (registers, shared memory) ● Less of everything than a CPU (except parallelism), especially per thread, which makes problem partitioning difficult ● Everything must be in balance to obtain performance ● Subtle interactions between resource limits

41 Why is GPU programming hard? ● Many crucial high-impact optimizations are needed: ● Data reuse ● Use shared memory efficiently ● Limited by #registers per thread, shared memory per thread block ● Memory access patterns ● Avoid shared memory bank conflicts ● Global memory coalescing ● Instruction stream optimization ● Control flow divergence ● Thread-level parallelism ● Maximize occupancy: avoid having all warps stalled ● Loop unrolling
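As an example of the memory-access-pattern point, the two generic kernels below copy the same square matrix; only the first lets the 32 threads of a warp touch consecutive addresses, so only it coalesces (bounds checks omitted for brevity; assumes the launch grid exactly covers a width × width matrix).

```cuda
// Coalesced: thread (col) reads element (row, col) -> consecutive threads
// touch consecutive addresses, so a warp's 32 loads merge into few
// memory transactions.
__global__ void copyCoalesced(const float *in, float *out, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    out[row * width + col] = in[row * width + col];
}

// Strided: consecutive threads read addresses 'width' floats apart,
// so each warp load scatters over many memory transactions.
__global__ void copyStrided(const float *in, float *out, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    out[col * width + row] = in[col * width + row];
}
```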

42 Why is GPU programming hard? ● Portability ● Optimizations are architecture-dependent, and the architectures change frequently ● Optimizations are often input-dependent ● Finding the right parameter settings is difficult ● Need better performance models ● Like Roofline and our I/O model

43 Agenda ● Application case studies ● Multimedia kernel (convolution) ● Astronomy kernel (dedispersion) ● Optimizing the I/O of multiple kernels ● Climate modelling ● Programming methodologies ● MapReduce: using existing programming model ● Stepwise refinement: new methodology & model

44 Programming methodology: stepwise refinement for performance ● Programming accelerators: ● Tension: control over hardware vs. abstraction level ● Methodology: ● Integrate hardware descriptions into the programming model ● Programmers can work on multiple levels of abstraction ● Performance feedback from the compiler, based on hardware description and kernel ● Cooperation between compiler and programmer P. Hijma et al., “Stepwise-refinement for Performance: a methodology for many-core programming,” Concurrency and Computation: Practice and Experience (accepted)

45 MCL: Many-Core Levels ● An MCL program is an algorithm mapped to hardware ● Start at a suitable abstraction level ● E.g. idealized accelerator, NVIDIA Kepler GPU, Xeon Phi ● The MCL compiler guides the programmer on which optimizations to apply at a given abstraction level, or when to move to a deeper level

46 Comparison ● High-level programming ● BSP, D&C, nested data-parallelism, array extensions, … ● Automatic optimizers ● Skeletons ● Separation of concerns ● Domain-specific languages ● Tuning-cycle approach ● CUDA/OpenCL + performance modeling/profiling

47 MCL ecosystem

48 Example

49 MCL example (Pieter)

50 Convolution in MCL

51 Examples compiler feedback ● Use local [= CONSTANT?] memory for Filter ● Use shared memory for Input (data reuse) ● Compute multiple elements per thread (2×2 tiles) ● Try to maximize the number of blocks per SMP. This depends on the number of threads, amount of shared memory and the number of registers ● Trade-off: more data-reuse, fewer blocks per SM ● Change tiling to 2 blocks per SM ● No loop-unrolling in compiler yet

52 Performance (GTX480, 9×9 filters) ● 380 GFLOPS ● MCL: 302 GFLOPS

53 Performance evaluation

54 Status ● Prototype implementation for various accelerators ● GTX480, Xeon Phi ● Various algorithms and small applications ● Source code available at: ?????? ● Current work: integration with the Satin divide-and-conquer system: ● Almost all possible levels of parallelism, from GPUs to (ultimately) wide-area systems

55 Glasswing: MapReduce on Accelerators ● Big Data revolution ● Designed for cheap commodity hardware ● Scales horizontally ● Coarse-grained parallelism ● MapReduce on modern hardware? Ismail El Helw, Rutger Hofman, Henri Bal [HPDC’2014, SC’2014]

56 MapReduce

57 MapReduce model
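The model itself fits in a few lines of sequential host code (word count, purely to illustrate the map / shuffle / reduce abstraction; Glasswing's actual API and its OpenCL kernels look different):

```cuda
#include <map>
#include <string>
#include <sstream>
#include <utility>
#include <vector>
#include <cstdio>

using KV = std::pair<std::string, int>;

// map: one input record in, zero or more (key, value) pairs out
void mapFn(const std::string &line, std::vector<KV> &out) {
    std::istringstream iss(line);
    std::string word;
    while (iss >> word) out.push_back({word, 1});   // emit (word, 1)
}

// reduce: all values of one key in, one aggregated value out
int reduceFn(const std::string &, const std::vector<int> &values) {
    int sum = 0;
    for (int v : values) sum += v;
    return sum;
}

int main() {
    std::vector<std::string> input = {"the quick fox", "the lazy dog"};

    std::vector<KV> intermediate;                   // map phase
    for (const auto &line : input) mapFn(line, intermediate);

    std::map<std::string, std::vector<int>> groups; // shuffle: group by key
    for (const auto &kv : intermediate) groups[kv.first].push_back(kv.second);

    for (const auto &g : groups)                    // reduce phase
        printf("%s: %d\n", g.first.c_str(), reduceFn(g.first, g.second));
    return 0;
}
```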

58 Rethinking MapReduce ● Use accelerators (OpenCL) as mainstream feature ● Massive out-of-core data sets ● Scale vertically & horizontally ● Maintain MapReduce abstraction

59 Glasswing Pipeline ● Overlaps computation, communication & disk access ● Supports multiple buffering levels

60 GPU optimizations ● Glasswing framework does: ● Custom memory allocators ● Shared memory optimizations (partially in framework, partially in kernels) ● Atomic operations on shared memory, aggregate, then global memory ● Data movements (cf. CUDA streams), data staging ● Programmer may do: ● Kernel optimizations (coalescing, memory banks, etc.)

61 Evaluation on DAS-4 ● 64-node cluster ● Dual quad-core Intel Xeon 2.4GHz CPUs ● 24GB of memory ● 2x1TB disks (RAID0) ● 16 nodes equipped with NVIDIA GTX480 GPUs ● QDR InfiniBand

62 Glasswing vs. Hadoop 64-node CPU cluster

63 Glasswing vs. Hadoop 16-Node GPU Cluster

64 Performance k-Means on CPU

65 Performance k-Means on GPU

66 Compute Device Comparison

67 Glasswing conclusions ● Scalable MapReduce framework combining coarse-grained and fine-grained parallelism ● Handles out-of-core data, sticks with the MapReduce model ● Overlaps kernel executions with memory transfers, network communication and disk access ● Current work: ● Machine learning applications ● Energy efficiency

68 Wrap-up: no single solution ● eScience applications need performance of GPUs ● GPU programming is very time-consuming ● Need new methodologies: ● Autotuning ● Determine optimal configuration in Dedispersion or Convolution ● Compile-time, runtime, or adaptive ● Performance modelling ● Compiler-based reasoning about performance ● Frameworks, templates, patterns (like MapReduce)

69 Future ● Challenges for CS ● Applications

70 EXTRA/NOT USED

71 A Common Programming Strategy  Partition data into subsets that fit into shared memory

72 A Common Programming Strategy  Handle each data subset with one thread block

73 A Common Programming Strategy  Load the subset from device memory to shared memory, using multiple threads to exploit memory-level parallelism

74 A Common Programming Strategy  Perform the computation on the subset from shared memory

75 A Common Programming Strategy  Copy the result from shared memory back to device memory
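The five steps above, rolled into one generic kernel (a 3-point averaging stencil; an illustrative sketch of mine, not from the course material):

```cuda
#define TILE 256   // launch with blockDim.x == TILE, one thread per element

// Steps 1+2: the launch partitions the array into TILE-sized subsets and
// assigns one thread block to each subset.
__global__ void stencil3(const float *in, float *out, int n) {
    __shared__ float tile[TILE + 2];                 // subset plus 1-element halo

    int gid = blockIdx.x * blockDim.x + threadIdx.x; // global index
    int lid = threadIdx.x + 1;                       // local index (skip halo)

    // Step 3: cooperative load of the subset (out-of-range elements become 0)
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[TILE + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();                                 // wait for the whole tile

    // Step 4: compute on the subset from shared memory
    if (gid < n)
        // Step 5: copy the result back to device memory
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}

// Example launch: stencil3<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);
```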

76 Distributed ASCI Supercomputer  Distributed common infrastructure for Dutch Computer Science  Distributed: multiple (4-6) clusters at different locations  Common: single formal owner (ASCI), single design team; users have access to the entire system  Dedicated to CS experiments (like Grid’5000): interactive (distributed) experiments, low resource utilization; able to modify/break the hardware and systems software Going Dutch: How to Share a Dedicated Distributed Infrastructure for Computer Science Research, keynote lecture at Euro-Par 2014 (Porto, 28 August 2014), http://www.cs.vu.nl/~bal/Talks/Europar2014.pptx

77 DAS generations: visions  DAS-1: Wide-area computing (1997)  Homogeneous hardware and software  DAS-2: Grid computing (2002)  Globus middleware  DAS-3: Optical Grids (2006)  Dedicated 10 Gb/s optical links between all sites  DAS-4: Clouds, diversity, green IT (2010)  Hardware virtualization, accelerators, energy measurements  DAS-5: Harnessing diversity, data-explosion (2015)  Wide variety of accelerators, larger memories and disks

78 ASCI (1995)  Research schools (Dutch product from 1990s), aims:  Stimulate top research & collaboration  Provide Ph.D. education (courses)  ASCI: Advanced School for Computing and Imaging  About 100 staff & 100 Ph.D. Students  16 PhD level courses  Annual conference

79 DAS-4 (2011) Testbed for clouds, diversity, green IT: dual quad-core Xeon E5620, InfiniBand, various accelerators, Scientific Linux, Bright Cluster Manager, built by ClusterVision. Sites (nodes): VU (74), TU Delft (32), Leiden (16), UvA/MultimediaN (16/36), ASTRON (23), connected via SURFnet6 at 10 Gb/s

80 Accelerators in the VU cluster  23 NVIDIA GTX480 GPUs  2 NVIDIA C2050 Tesla GPUs  NVIDIA GTX680 GPU  1 node with X5650 CPUs (dual 6-core, 2.67 GHz)  1 48-core (quad-socket "Magny Cours") AMD system  1 AMD Radeon HD7970  1 Intel "Sandy Bridge" E5-2630 node (2.3 GHz) with NVIDIA GTX Titan  ? Intel "Sandy Bridge" E5-2620 (2.0 GHz) nodes with K20m "Kepler" GPUs  2 Intel Xeon Phi accelerators

81 Bird’s-eye view (Figure: CPU vs. many-core chip, each with cores and memory channels)

82 Memory Spaces in CUDA (Figure: host, grid with blocks, per-block shared memory, per-thread registers, device and constant memory)

83 Key architectural ideas  Data parallel, like a vector machine  There, 1 thread issues parallel vector instructions  SIMT (Single Instruction Multiple Thread) execution  Many threads work on a vector, each on a different element  They all execute the same instruction  Hardware automatically handles divergence  Hardware multithreading  HW resource allocation & thread scheduling  HW relies on threads to hide latency  Context switching is (basically) free

84 http://www.nvidia.com/object/tesla- case-studies.html

85 CUDA Model of Parallelism  CUDA virtualizes the physical hardware  Block is a virtualized streaming multiprocessor  Thread is a virtualized scalar processor  Scheduled onto physical hardware without pre-emption  Threads/blocks launch & run to completion  Blocks should be independent (Figure: blocks with shared memory on top of device memory)

86 Alternatives (Figure: memory hierarchy: registers, L1 cache / shared memory, L2 cache, device memory, host memory, PCIe bus)

87 (Figure: per-thread local memory, per-SM shared memory, per-device global memory shared by successive kernels)

88 (Figure: host, grid with blocks, shared memory, registers, device and constant memory)

89 Performance k-Means on CPU

90 Performance k-Means on GPU

91 Rough plan: Explain why GPUs are hard to program Applications Multimedia Astronomy [illustrate why GPU programming is hard] Climate modelling [Multiple kernel, I/O: climate problem more or less solved] Programming methodologies MapReduce [existing programming model] Stepwise refinement [a step towards a solution]

92 Application case study 3: Global Climate Modeling  Netherlands eScience Center:  Builds bridges between applications & ICT (Ibis, JavaGAT)  Frank Seinstra, Jason Maassen, Maarten van Meersbergen  Utrecht University  Institute for Marine and Atmospheric research  Henk Dijkstra  VU:  Ben van Werkhoven, Henri Bal COMMIT/

93 Performance TeraSort (I/O intensive application)

94 Performance k-Means on CPU

95 Example

96 Example [OLD SLIDE]

97 Scale up? Amazon EC2

98 Performance evaluation

