1
GPU programming: eScience or engineering? Henri Bal Vrije Universiteit Amsterdam COMMIT/
2
Graphics Processing Units ● GPUs and other accelerators take top-500 by storm ● Many application success stories ● But GPUs are notoriously difficult to program and optimize http://www.nvidia.com/object/tesla-case-studies.html
3
Example 1: convolution ● About half a Ph.D. thesis [figures: naive vs. fully optimized implementation]
4
Example 2: Auto-Tuning Dedispersion ● >100 configurations, the winner is an outlier ● (Number of threads, amount of work per thread, etc.)
5
Example 3: Parallel Programming Lab course ● Master's course practical (alongside lectures) ● CUDA: ● Simple image processing application on 1 node ● MPI: ● Parallel all-pairs shortest path algorithms ● CUDA: 11 out of 21 passed (52%) ● MPI: 17 out of 21 passed (80%)
6
Questions ● Why are accelerators so difficult to program? ● What are the challenges for Computer Science? ● What role do applications play?
7
Background ● Netherlands eScience Center ● Bridge between ICT and applications (climate modeling, astronomy, water management, digital forensics, …..) ● COMMIT/ ● COMMIT (100 M€): public-private Dutch ICT program ● Distributed ASCI Supercomputer ● Testbed for Dutch Computer Science COMMIT/
8
Background (team) Ph.D. students ● Ben van Werkhoven ● Alessio Sclocco ● Ismail El Helw ● Pieter Hijma Staff ● Rob van Nieuwpoort (NLeSC) ● Ana Varbanescu (UvA) Scientific programmers ● Rutger Hofman ● Ceriel Jacobs
9
Differences between CPUs and GPUs ● Different goals produce different designs ● CPU must be good at everything, parallel or not ● GPU assumes the workload is highly parallel ● CPU: minimize latency of 1 thread ● Big on-chip caches ● Sophisticated control logic ● GPU: maximize throughput of all threads ● Multithreading can hide latency → no big caches ● Share control logic across many threads [figure: CPU die dominated by control logic and cache; GPU die dominated by ALUs]
10
Example: NVIDIA Maxwell ● 16 independent streaming multiprocessors (SM) ● 128 cores per SM (2048 total) ● 96KB shared memory
11
Thread hierarchy ● All threads execute the same sequential kernel ● Threads are grouped into thread blocks ● Threads in the same block execute on the same SM and can work together (synchronize) ● 32 contiguous threads form a warp and execute the same instruction in parallel ● Thread blocks are grouped into a grid ● Many thousands of threads in total, scheduled by hardware, without preemption
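A minimal launch sketch (ours, not from the slides) showing the hierarchy: threads indexed within blocks, blocks within a grid; kernel and variable names are illustrative.

```cuda
// Each thread computes one element; blockIdx/threadIdx locate it in the grid.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard: the grid may overshoot n
        data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    int threadsPerBlock = 256;                      // one block = 8 warps of 32 threads
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d, 2.0f, n); // launch the grid
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```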
12
Hierarchy of concurrent threads
13
Memory hierarchy (NVIDIA example) ● Shared memory ● Small, fast, on the SM, allocated to thread blocks ● Register file ● On the SM, private per thread ● Global memory ● Large, off-chip, slow, accessible by the host ● Constant memory [figure: grid with global and constant memory; per-block shared memory; per-thread registers]
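A compilable sketch (ours) of where each memory space appears in CUDA source; the kernel itself is deliberately trivial.

```cuda
__constant__ float coeff;   // constant memory: read-only, set by the host via cudaMemcpyToSymbol

__global__ void axpb(const float *x, float *y, int n) {
    __shared__ float tile[256];                   // shared memory: one copy per block, on the SM
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? x[i] : 0.0f;              // v lives in a per-thread register
    tile[threadIdx.x] = v;                        // stage in shared memory (trivially, for illustration)
    __syncthreads();
    if (i < n) y[i] = coeff * tile[threadIdx.x];  // x and y point into off-chip global memory
}
```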
14
Agenda Application case studies Multimedia kernel (convolution) Astronomy kernel (dedispersion) Climate modelling: optimizing multiple kernels Programming methodologies Stepwise refinement: new methodology & model Glasswing: MapReduce on accelerators
15
Application case study 1: Convolution operations ● Image I of size I_w × I_h ● Filter F of size F_w × F_h ● Thread block of size B_w × B_h ● The CUDA kernel does 2 arithmetic operations and 2 loads (8 bytes), so Arithmetic Intensity (AI) = 2/8 = 0.25
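As an illustration (our sketch, not the thesis code), a naive kernel with this 0.25 flop/byte ratio: each inner-loop iteration performs one multiply-add (2 flops) against two 4-byte loads from global memory.

```cuda
// Naive convolution: one thread per output pixel, everything read from global
// memory. Assumes the grid exactly covers the (iw-fw+1) x (ih-fh+1) output.
__global__ void convolve_naive(const float *image, const float *filter,
                               float *output, int iw, int ih, int fw, int fh) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= iw - fw + 1 || y >= ih - fh + 1) return;   // interior pixels only
    float sum = 0.0f;
    for (int j = 0; j < fh; j++)
        for (int i = 0; i < fw; i++)                     // 2 loads + 1 multiply-add
            sum += image[(y + j) * iw + (x + i)] * filter[j * fw + i];
    output[y * (iw - fw + 1) + x] = sum;
}
```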
16
Tiled convolution ● 16×16 thread block processing an 11×7 filter ● Filter goes into constant memory (small) ● Threads within a block cooperatively load the entire area needed by all threads in the block into shared memory
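A hedged sketch of that cooperative load (ours; dimensions follow the slide's 16×16 block and 11×7 filter; boundary handling is omitted, so it assumes the image dimensions match the grid exactly).

```cuda
#define FW 11                         // filter width/height from the slide's example
#define FH 7
#define BW 16                         // 16x16 thread block
#define BH 16

__constant__ float c_filter[FH * FW]; // small filter lives in constant memory

__global__ void convolve_tiled(const float *image, float *output, int iw, int ih) {
    // Each block needs a (BW+FW-1) x (BH+FH-1) input area; load it cooperatively.
    __shared__ float tile[BH + FH - 1][BW + FW - 1];
    int bx = blockIdx.x * BW, by = blockIdx.y * BH;
    for (int j = threadIdx.y; j < BH + FH - 1; j += BH)
        for (int i = threadIdx.x; i < BW + FW - 1; i += BW)
            tile[j][i] = image[(by + j) * iw + (bx + i)];
    __syncthreads();

    float sum = 0.0f;
    for (int j = 0; j < FH; j++)
        for (int i = 0; i < FW; i++)   // all reads now hit shared/constant memory
            sum += tile[threadIdx.y + j][threadIdx.x + i] * c_filter[j * FW + i];
    output[(by + threadIdx.y) * (iw - FW + 1) + (bx + threadIdx.x)] = sum;
}
```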
17
Analysis ● If filter size increases: ● Arithmetic Intensity increases: ● Kernel shifts from memory-bandwidth bound to compute-bound ● Amount of shared memory needed increases → fewer thread blocks can run concurrently on each SM
18
Tiling ● Each thread block computes 1×N tiles in the horizontal direction + Increases amount of work per thread + Saves loading overlapping borders + Saves redundant instructions + No shared memory bank conflicts − More shared memory, fewer concurrent thread blocks
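For illustration, a sketch of 1×N tiling continuing the previous sketch (ours; same constants and c_filter, with a hypothetical tiling factor TF = 4): each filter coefficient is loaded once and reused for TF output pixels, and adjacent threads still touch adjacent shared-memory words, so there are no bank conflicts.

```cuda
#define TF 4   // tiling factor: each thread computes TF horizontally strided pixels

__global__ void convolve_1xN(const float *image, float *output, int iw, int ih) {
    __shared__ float tile[BH + FH - 1][BW * TF + FW - 1];  // wider tile per block
    int bx = blockIdx.x * BW * TF, by = blockIdx.y * BH;
    for (int j = threadIdx.y; j < BH + FH - 1; j += BH)
        for (int i = threadIdx.x; i < BW * TF + FW - 1; i += BW)
            tile[j][i] = image[(by + j) * iw + (bx + i)];
    __syncthreads();

    float sum[TF] = {0.0f};
    for (int j = 0; j < FH; j++)
        for (int i = 0; i < FW; i++) {
            float f = c_filter[j * FW + i];        // one load, reused TF times
            for (int t = 0; t < TF; t++)           // adjacent threads: adjacent words
                sum[t] += tile[threadIdx.y + j][threadIdx.x + t * BW + i] * f;
        }
    for (int t = 0; t < TF; t++)
        output[(by + threadIdx.y) * (iw - FW + 1) + bx + threadIdx.x + t * BW] = sum[t];
}
```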
19
Adaptive tiling ● Tiling factor is selected at runtime depending on the input data and the resource limitations of the device ● Highest possible tiling factor that fits within the available shared memory (depending on filter size) ● Plus loop unrolling, memory-bank optimizations, and a search for the optimal configuration
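A minimal host-side sketch of that selection rule (our naming, not the thesis code): query the device and grow the tiling factor while the shared-memory tile still fits.

```cuda
// Pick the largest tiling factor whose shared-memory footprint still fits.
// Footprint matches the 1xN tile above: (bh+fh-1) x (bw*tf + fw-1) floats.
int pickTilingFactor(int fw, int fh, int bw, int bh, int device) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    int tf = 1;
    while ((size_t)(bh + fh - 1) * (bw * (tf + 1) + fw - 1) * sizeof(float)
           <= prop.sharedMemPerBlock)   // would tf+1 still fit?
        tf++;
    return tf;   // highest factor that fits, per the slide's rule
}
```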
20
Lessons learned ● Everything must be in balance to obtain high performance ● Subtle interactions between resource limits ● Runtime decision system (adaptive tiling), in combination with standard optimizations ● Loop unrolling, memory bank conflicts Ph.D. thesis Ben van Werkhoven, 27 Oct. 2014 FGCS journal, 2014
21
Application case study 2: Dedispersion ● Auto-tuning dedispersion for many-core accelerators ● Used for searching for pulsars in radio astronomy data Alessio Sclocco et al.: Auto-Tuning Dedispersion for Many-Core Accelerators, IPDPS 2014
22
Dedispersion ● Pulsar signals get dispersed: lower radio frequencies arrive progressively later ● A non-linear function of the distance between source & receiver ● Can be reversed by shifting the signal's lower frequencies in time (dedispersion)
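To make the operation concrete, a hedged brute-force sketch (ours, not the paper's tuned kernel): each thread sums all frequency channels for one (sample, DM) pair, shifting each channel by its dispersion delay.

```cuda
// Brute-force dedispersion sketch. shift[] holds precomputed per-channel sample
// delays for a unit DM; the delay scales linearly with the trial DM here.
// Assumes the input time series is padded past nsamples by the maximum delay.
__global__ void dedisperse(const float *input, float *output, const int *shift,
                           int nchans, int nsamples, float dmStep) {
    int s  = blockIdx.x * blockDim.x + threadIdx.x;   // output sample
    int dm = blockIdx.y;                              // trial dispersion measure
    if (s >= nsamples) return;
    float sum = 0.0f;
    for (int c = 0; c < nchans; c++) {
        int delay = (int)(shift[c] * dm * dmStep);    // delay grows with DM and channel
        sum += input[c * nsamples + s + delay];
    }
    output[dm * nsamples + s] = sum;
}
```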
23
Auto-tuning ● Using auto-tuning to adapt the algorithm for: ● Different many-core platforms ● NVIDIA & AMD GPUs, Intel Phi, Xeon, … ● Different observational scenarios ● LOFAR, Apertif ● Different numbers of Dispersion Measures (DMs) ● Represents the number of free electrons between source & receiver ● A measure of the distance between emitting object & receiver ● Parameters: ● Number of threads per sample or DM, thread block size, number of registers per thread, …
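A hedged sketch of the tuning loop itself (ours; timeConfig is a hypothetical helper that builds, runs, and times one kernel variant, it is not part of the paper's tuner).

```cuda
#include <cfloat>
#include <cstdio>

// Hypothetical helper: runs one configuration and returns its time in ms.
float timeConfig(int threadsPerBlock, int itemsPerThread);

// Exhaustive search over a small 2-D configuration space; the real tuner
// searches more dimensions (registers per thread, threads per sample/DM, ...).
void autotune() {
    float best = FLT_MAX; int bestT = 0, bestI = 0;
    for (int t = 32; t <= 1024; t *= 2)
        for (int i = 1; i <= 16; i++) {
            float ms = timeConfig(t, i);
            if (ms < best) { best = ms; bestT = t; bestI = i; }
        }
    printf("best: %d threads, %d items/thread (%.3f ms)\n", bestT, bestI, best);
}
```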
24
Auto-tuning: registers [figures: LOFAR and Apertif scenarios]
25
Example histogram
26
Speedup over Best Fixed Configuration
27
Lessons learned ● Auto-tuning allows algorithms to adapt to different platforms and scenarios ● The impact of auto-tuning on dedispersion is significant ● Guessing a good configuration without auto-tuning is difficult
28
Application case study 3: High-Resolution Global Climate Modeling ● Understand future local sea level changes ● Quantify the effect of changes in freshwater input & ocean circulation on regional sea level height in the Atlantic ● To obtain high resolution, use: ● Distributed computing (multiple resources) ● Based on research on wide-area optimizations done 16 years ago on DAS-1 (Albatross project) ● GPU Computing ● Good example of application-inspired Computer Science research COMMIT/
29
Distributed Computing ● Use Ibis to couple different simulation models ● Land, ice, ocean, atmosphere ● Wide-area optimizations like hierarchical load balancing
30
Enlighten Your Research Global award [figure: participating systems EMERALD (UK), KRAKEN (USA), STAMPEDE (USA), SUPERMUC (GER), and CARTESIUS (NLD), with top-500 ranks #7 and #10 shown, linked by 10G connections]
31
GPU Computing ● Offload expensive kernels of the Parallel Ocean Program (POP) from CPU to GPU ● Many different kernels, fairly easy to port to GPUs ● Vertical mixing / barotropic solvers / 'state' calculation … ● Execution time becomes virtually 0 ● New bottleneck: moving data between CPU & GPU [figure: CPU host memory and GPU device memory connected by a PCI Express link]
32
Different methods for CPU-GPU communication ● Memory copies (explicit) ● No overlap with GPU computation ● Device-mapped host memory (implicit) ● Allows fine-grained overlap between computation and communication in either direction ● CUDA Streams or OpenCL command-queues ● Allows overlap between computation and communication in different streams ● Any combination of the above
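For the third method, a hedged sketch (ours) of chunked transfers with CUDA streams: process is a placeholder kernel, the host buffers are assumed pinned (cudaMallocHost), and the chunk and block sizes are assumed to divide evenly.

```cuda
__global__ void process(const float *in, float *out, int n);  // placeholder kernel (assumed)

void overlapped(const float *h_in, float *h_out, float *d_in, float *d_out, int n) {
    const int NS = 4;                       // number of streams: itself a tunable
    cudaStream_t s[NS];
    for (int i = 0; i < NS; i++) cudaStreamCreate(&s[i]);
    int chunk = n / NS;                     // assumes NS divides n, h_* are pinned
    for (int i = 0; i < NS; i++) {
        int off = i * chunk;
        cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);   // copy-in of one chunk...
        process<<<chunk / 256, 256, 0, s[i]>>>(d_in + off, d_out + off, chunk);
        cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);   // ...overlaps other streams' work
    }
    for (int i = 0; i < NS; i++) { cudaStreamSynchronize(s[i]); cudaStreamDestroy(s[i]); }
}
```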
33
Problem ● Problem: ● Which method will be most efficient for a given GPU kernel? Implementing all of them can be a large effort ● Solution: ● Create a performance model that identifies the best implementation: ● What implementation strategy for overlapping computation and communication is best for my program? Ben van Werkhoven, Jason Maassen, Frank Seinstra & Henri Bal: Performance models for CPU-GPU data transfers, CCGrid 2014 (nominated for best paper award)
34
Analytical performance model ● What implementation strategy for overlapping computation and communication is best for my program? ● Other questions: ● How much performance is gained from overlapping? ● Is the PCIe link actually a bottleneck in my program? ● What number of streams is likely to give the best performance? ● How will future architectures impact my program's performance?
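A first-order way to reason about these questions (our simplification, not the paper's actual model): estimate each transfer time as bytes divided by PCIe bandwidth and compare the three phases.

```latex
% Our simplified bounds, not the CCGrid 2014 model itself.
T_{\mathrm{no\;overlap}} = T_{HtoD} + T_{kernel} + T_{DtoH},
\qquad
T_{\mathrm{overlap}} \;\ge\; \max\!\left(T_{HtoD},\, T_{kernel},\, T_{DtoH}\right)
```

Overlapping can thus save at most the sum of the two smaller terms, and the PCIe link is the bottleneck whenever a transfer term is the maximum; the paper's model refines this with the number of copy engines and the number of streams.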
35
Example result ● Implicit synchronization and 1 copy engine ● 2 POP kernels (state and buoydiff) ● GTX 680 connected over PCIe 2.0 [figures: measured vs. model]
36
Different GPUs [figures: buoydiff and state kernels on several GPUs]
37
MOVIE
38
Comes with a spreadsheet
39
Lessons learned ● PCIe transfers can have a large performance impact ● Several methods for transferring data and overlapping computation & communication exist ● Performance modelling helps to select the best mechanism
40
Why is GPU programming hard? ● Mapping the algorithm to the architecture is difficult, especially because the architecture itself is complex: ● Many levels of parallelism ● Limited resources (registers, shared memory) ● Less of everything than a CPU (except parallelism), especially per thread, which makes problem partitioning difficult ● Everything must be in balance to obtain performance ● Subtle interactions between resource limits
41
Why is GPU programming hard? ● Many crucial high-impact optimizations needed: ● Data reuse ● Use shared memory efficiently ● Limited by #registers per thread, shared memory per thread block ● Memory access patterns ● Avoid shared memory bank conflicts ● Global memory coalescing ● Instruction stream optimization ● Control flow divergence ● Thread level parallelism ● Maximize occupancy: avoid that all warps get stalled ● Loop unrolling
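To illustrate the memory-access-pattern point, a hedged pair of kernels (ours): both copy a row-major n×m matrix, but only the first lets consecutive threads in a warp touch consecutive addresses.

```cuda
// Coalesced: launch with grid (ceil(m/256), n), block 256.
__global__ void coalesced(const float *in, float *out, int n, int m) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive threads -> consecutive columns
    int row = blockIdx.y;
    if (col < m) out[row * m + col] = in[row * m + col];  // few transactions per warp
}

// Uncoalesced: launch with grid (ceil(n/256), m), block 256.
__global__ void uncoalesced(const float *in, float *out, int n, int m) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive threads -> consecutive ROWS
    int col = blockIdx.y;
    if (row < n) out[row * m + col] = in[row * m + col];  // addresses m floats apart: many transactions
}
```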
42
Why is GPU programming hard? ● Portability ● Optimizations are architecture-dependent, and the architectures change frequently ● Optimizations are often input-dependent ● Finding the right parameter settings is difficult ● Need better performance models ● Like Roofline and our I/O model
43
Agenda Application case studies Multimedia kernel (convolution) Astronomy kernel (dedispersion) Optimizing the I/O of multiple kernels Climate modelling Programming methodologies MapReduce: using existing programming model Stepwise refinement: new methodology & model
44
Programming methodology: stepwise refinement for performance ● Programming accelerators: ● Tension: control over hardware vs. abstraction level ● Methodology: ● Integrate hardware descriptions into the programming model ● Programmers can work on multiple levels of abstraction ● Performance feedback from the compiler, based on hardware description and kernel ● Cooperation between compiler and programmer P. Hijma et al., Stepwise-refinement for Performance: a methodology for many-core programming, Concurrency and Computation: Practice and Experience (accepted)
45
MCL: Many-Core Levels ● An MCL program is an algorithm mapped to hardware ● Start at a suitable abstraction level ● E.g. idealized accelerator, NVIDIA Kepler GPU, Xeon Phi ● The MCL compiler guides the programmer in which optimizations to apply at a given abstraction level, or when to move to deeper levels
46
Comparison ● High-level programming ● BSP, D&C, nested data-parallelism, array extensions, … ● Automatic optimizers ● Skeletons ● Separation of concerns ● Domain-specific languages ● Tuning-cycle approach ● CUDA/OpenCL + performance modeling/profiling
47
MCL ecosystem
48
Example
49
MCL example (Pieter)
50
Convolution in MCL
51
Examples of compiler feedback ● Use local [= CONSTANT?] memory for Filter ● Use shared memory for Input (data reuse) ● Compute multiple elements per thread (2×2 tiles) ● Try to maximize the number of blocks per SM. This depends on the number of threads, the amount of shared memory, and the number of registers ● Trade-off: more data reuse, fewer blocks per SM ● Change tiling to 2 blocks per SM ● No loop unrolling in compiler yet
52
Performance (GTX480, 9×9 filters) 380 GFLOPS MCL: 302 GFLOPS
53
Performance evaluation
54
Status ● Prototype implementation for various accelerators ● GTX480, Xeon Phi ● Various algorithms and small applications ● Source code available at: ?????? ● Current work: integration with Satin divide-and-conquer system: ● Almost all possible levels of parallelism, from GPUs to (ultimately) wide-area systems
55
Glasswing: MapReduce on Accelerators ● Big Data revolution ● Designed for cheap commodity hardware ● Scales horizontally ● Coarse-grained parallelism ● MapReduce on modern hardware? Ismail El Helw, Rutger Hofman, Henri Bal [HPDC’2014, SC’2014]
56
MapReduce
57
MapReduce model
58
Rethinking MapReduce ● Use accelerators (OpenCL) as mainstream feature ● Massive out-of-core data sets ● Scale vertically & horizontally ● Maintain MapReduce abstraction
59
Glasswing Pipeline ● Overlaps computation, communication & disk access ● Supports multiple buffering levels
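A hedged sketch of the double-buffering idea behind such a pipeline (ours, written in CUDA for consistency with the other sketches; Glasswing itself is OpenCL-based and its real pipeline has more stages): two buffer pairs and two streams let chunk k's GPU work overlap the disk read for chunk k+1.

```cuda
void readChunk(int k, float *dst);                       // disk -> host, assumed helper
__global__ void mapKernel(const float *in, float *out);  // placeholder map stage

void run(int nchunks, float *hostIn[2], float *hostOut[2],
         float *d_in[2], float *d_out[2], size_t bytes, dim3 grid, dim3 block) {
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]); cudaStreamCreate(&s[1]);
    for (int k = 0; k < nchunks; k++) {
        int b = k % 2;                        // alternate between the two buffer pairs
        cudaStreamSynchronize(s[b]);          // buffer pair b is free again
        readChunk(k, hostIn[b]);              // CPU reads from disk while the other
                                              // stream's GPU work is still in flight
        cudaMemcpyAsync(d_in[b], hostIn[b], bytes, cudaMemcpyHostToDevice, s[b]);
        mapKernel<<<grid, block, 0, s[b]>>>(d_in[b], d_out[b]);
        cudaMemcpyAsync(hostOut[b], d_out[b], bytes, cudaMemcpyDeviceToHost, s[b]);
    }
    cudaStreamSynchronize(s[0]); cudaStreamSynchronize(s[1]);
}
```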
60
GPU optimizations ● Glasswing framework does: ● Custom memory allocators ● Shared memory optimizations (partially in framework, partially in kernels) ● Atomic operations on shared memory, aggregate, then global memory ● Data movements (cf. CUDA streams), data staging ● Programmer may do: ● Kernel optimizations (coalescing, memory banks, etc.)
61
Evaluation on DAS-4 ● 64-node cluster ● Dual quad-core Intel Xeon 2.4GHz CPUs ● 24GB of memory ● 2x1TB disks (RAID0) ● 16 nodes equipped with NVIDIA GTX480 GPUs ● QDR InfiniBand
62
Glasswing vs. Hadoop 64-node CPU cluster
63
Glasswing vs. Hadoop 16-Node GPU Cluster
64
Performance k-Means on CPU
65
Performance k-Means on GPU
66
Compute Device Comparison
67
Glasswing conclusions ● Scalable MapReduce framework combining coarse-grained and fine-grained parallelism ● Handles out-of-core data, sticks with MapReduce model ● Overlaps kernel executions with memory transfers, network communication and disk access ● Current work: ● Machine learning applications ● Energy efficiency
68
Wrap-up: no single solution ● eScience applications need the performance of GPUs ● GPU programming is very time-consuming ● Need new methodologies: ● Auto-tuning ● Determine optimal configuration in Dedispersion or Convolution ● Compile-time, runtime, or adaptive ● Performance modelling ● Compiler-based reasoning about performance ● Frameworks, templates, patterns (like MapReduce)
69
Future ● Challenges for CS ● Applications
70
EXTRA/NOT USED
71
A Common Programming Strategy Partition data into subsets that fit into shared memory
72
A Common Programming Strategy Handle each data subset with one thread block
73
A Common Programming Strategy Load the subset from device memory to shared memory, using multiple threads to exploit memory-level parallelism
74
A Common Programming Strategy Perform the computation on the subset from shared memory
75
A Common Programming Strategy Copy the result from shared memory back to device memory
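Putting the five steps together, a minimal sketch (ours): a per-block sum reduction, assuming a power-of-two block size of 256.

```cuda
// Each block sums one 256-element subset of the input.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float s[256];                          // subset fits in shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // one block per subset
    s[threadIdx.x] = (i < n) ? in[i] : 0.0f;          // cooperative load into shared memory
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {  // compute in shared memory
        if (threadIdx.x < stride) s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = s[0];     // copy result back to device memory
}
```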
76
Distributed ASCI Supercomputer Distributed common infrastructure for Dutch Computer Science Distributed: multiple (4-6) clusters at different locations Common: single formal owner (ASCI), single design team Users have access to entire system Dedicated to CS experiments (like Grid’5000) Interactive (distributed) experiments, low resource utilization Able to modify/break the hardware and systems software Going Dutch: How to Share a Dedicated Distributed Infrastructure for Computer Science Research, keynote lecture at Euro-Par 2014 (Porto, 28 August 2014), http://www.cs.vu.nl/~bal/Talks/Europar2014.pptx
77
DAS generations: visions DAS-1: Wide-area computing (1997) Homogeneous hardware and software DAS-2: Grid computing (2002) Globus middleware DAS-3: Optical Grids (2006) Dedicated 10 Gb/s optical links between all sites DAS-4: Clouds, diversity, green IT (2010) Hardware virtualization, accelerators, energy measurements DAS-5: Harnessing diversity, data-explosion (2015) Wide variety of accelerators, larger memories and disks
78
ASCI (1995) Research schools (Dutch product from 1990s), aims: Stimulate top research & collaboration Provide Ph.D. education (courses) ASCI: Advanced School for Computing and Imaging About 100 staff & 100 Ph.D. Students 16 PhD level courses Annual conference
79
DAS-4 (2011) Testbed for Clouds, diversity, green IT Dual quad-core Xeon E5620 Infiniband Various accelerators Scientific Linux Bright Cluster Manager Built by ClusterVision [figure: sites connected via SURFnet6 at 10 Gb/s: VU (74 nodes), TU Delft (32), Leiden (16), UvA/MultimediaN (16/36), ASTRON (23)]
80
Accelerators VU cluster 23 NVIDIA GTX480 2 NVIDIA C2050 Tesla GPUs; NVIDIA GTX680 GPU 1 node with X5650 CPU (dual 6-cores, 2.67 GHz) 1 48-core (quad socket "Magny Cours") AMD system 1 AMD Radeon HD7970 1 Intel "Sandy Bridge" E5-2630 node (2.3 GHz), NVIDIA GTX-Titan ? Intel "Sandy Bridge" E5-2620 (2.0 GHz) nodes, K20m "Kepler" GPU 2 Intel Xeon Phi accelerators
81
Bird's-eye view [figure: CPU vs. many-core, cores and memory channels]
82
Memory Spaces in CUDA [figure: host and device; grid with device memory and constant memory, per-block shared memory, per-thread registers]
83
Key architectural ideas ● Data parallel, like a vector machine ● There, 1 thread issues parallel vector instructions ● SIMT (Single Instruction Multiple Thread) execution ● Many threads work on a vector, each on a different element ● They all execute the same instruction ● Hardware automatically handles divergence ● Hardware multithreading ● HW resource allocation & thread scheduling ● HW relies on threads to hide latency ● Context switching is (basically) free
84
http://www.nvidia.com/object/tesla-case-studies.html
85
CUDA Model of Parallelism CUDA virtualizes the physical hardware Block is a virtualized streaming multiprocessor Thread is a virtualized scalar processor Scheduled onto physical hardware without pre-emption Threads/blocks launch & run to completion Blocks should be independent [figure: blocks with shared memory on top of device memory]
86
ALTERNATIVES [figure: registers, L1 cache / shared memory, L2 cache, device memory, host memory, PCIe bus]
87
[figure: per-thread local memory, per-SM shared memory, per-device global memory shared by successive kernels]
88
[figure: host and device memory spaces; grid with device and constant memory, per-block shared memory, per-thread registers]
89
Performance k-Means on CPU
90
Performance k-Means on GPU
91
Rough plan: Explain why GPUs are hard to program Applications Multimedia Astronomy [illustrate why GPU programming is hard] Climate modelling [Multiple kernel, I/O: climate problem more or less solved] Programming methodologies MapReduce [existing programming model] Stepwise refinement [a step towards a solution]
92
Application case study 3: Global Climate Modeling Netherlands eScience Center: Builds bridges between applications & ICT (Ibis, JavaGAT) Frank Seinstra, Jason Maassen, Maarten van Meersbergen Utrecht University Institute for Marine and Atmospheric research Henk Dijkstra VU: Ben van Werkhoven, Henri Bal COMMIT/
93
Performance TeraSort (I/O intensive application)
94
Performance k-Means on CPU
95
Example
96
Example [OLD SLIDE]
97
Scale up? Amazon EC2
98
Performance evaluation