
1 GPU programming: eScience or engineering? Henri Bal Vrije Universiteit Amsterdam COMMIT/

2 Graphics Processing Units ● GPUs and other accelerators are taking the Top500 by storm ● Many application success stories ● But GPUs are notoriously difficult to program and optimize http://www.nvidia.com/object/tesla-case-studies.html

3 Example 1: convolution ● About half a Ph.D. thesis ● (Figure: naive vs. fully optimized performance)

4 Example 2: Auto-Tuning Dedispersion ● >100 configurations, the winner is an outlier ● (Number of threads, amount of work per thread, etc.)

5 Example 3: Parallel Programming Lab course ● Master course practical (next to lectures) ● CUDA: ● Simple image processing application on 1 node ● MPI: ● Parallel all pairs shortest path algorithms ● CUDA: 11 out of 21 passed (52 %) ● MPI: 17 out of 21 passed (80 %)

6 Questions ● Why are accelerators so difficult to program? ● What are the challenges for Computer Science? ● What role do applications play?

7 Background ● Netherlands eScience Center ● Bridge between ICT and applications (climate modeling, astronomy, water management, digital forensics, …..) ● COMMIT/ ● COMMIT (100 M€): public-private Dutch ICT program ● Distributed ASCI Supercomputer ● Testbed for Dutch Computer Science COMMIT/

8 Background (team) Ph.D. students ● Ben van Werkhoven ● Alessio Sclocco ● Ismail El Helw ● Pieter Hijma Staff ● Rob van Nieuwpoort (NLeSC) ● Ana Varbanescu (UvA) Scientific programmers ● Rutger Hofman ● Ceriel Jacobs

9 Differences between CPUs and GPUs ● Different goals produce different designs ● CPU must be good at everything, parallel or not ● GPU assumes the workload is highly parallel ● CPU: minimize latency of 1 thread ● Big on-chip caches ● Sophisticated control logic ● GPU: maximize throughput of all threads ● Multithreading can hide latency → no big caches ● Share control logic across many threads ● (Figure: control / ALU / cache areas of a CPU die vs. a GPU die)

10 Example: NVIDIA Maxwell ● 16 independent streaming multiprocessors (SM) ● 128 cores per SM (2048 total) ● 96KB shared memory

11 Thread hierarchy ● All threads execute the same sequential kernel ● Threads are grouped into thread blocks ● Threads in same block execute on same SM and can work together (synchronize) ● 32 contiguous threads execute as a warp and execute the same instruction in parallel ● Thread blocks are grouped into grid ● Many thousands of threads in total, scheduled by hardware, without preemption
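To make the hierarchy concrete, here is a minimal CUDA sketch (mine, not from the original slides): each thread derives a global index from its block and thread IDs, a guard handles grids larger than the data, and the launch picks the block and grid dimensions.

```cuda
#include <cuda_runtime.h>

// Minimal kernel: every thread computes its global index from its
// position in the block and the block's position in the grid.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard: the grid may be larger than n
        data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // 256 threads per block (8 warps); enough blocks to cover n elements.
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    scale<<<grid, block>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```

The hardware schedules the resulting thousands of threads onto the SMs without preemption, as the slide notes.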

12 Hierarchy of concurrent threads

13 Memory hierarchy (NVIDIA example) ● Shared memory ● Small, fast, on SM, allocated to thread blocks ● Register file ● On SM, private per thread ● Global memory ● Large, off-chip, slow, accessible by host ● Constant memory ● (Figure: CUDA memory spaces: grid, blocks, per-block shared memory, per-thread registers, global and constant memory)

14 Agenda ● Application case studies ● Multimedia kernel (convolution) ● Astronomy kernel (dedispersion) ● Climate modelling: optimizing multiple kernels ● Programming methodologies ● Stepwise refinement: new methodology & model ● Glasswing: MapReduce on accelerators

15 Application case study 1: Convolution operations ● Image I of size I_w × I_h ● Filter F of size F_w × F_h ● Thread block of size B_w × B_h ● CUDA kernel: per filter element each thread does 2 arithmetic operations and 2 loads (8 bytes) ● Arithmetic Intensity (AI) = 0.25
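A naive kernel along these lines (an illustrative sketch with hypothetical names, not the thesis code) makes the arithmetic-intensity count visible: each inner-loop iteration performs one multiply and one add (2 flops) against two 4-byte loads, so AI = 2/8 = 0.25.

```cuda
// Naive 2D convolution: one thread per output pixel. Assumes the input
// image is padded with a border of fw/2 by fh/2 pixels, so no boundary
// checks are needed inside the loop.
__global__ void convolve_naive(const float *image, const float *filter,
                               float *output, int iw, int ih,
                               int fw, int fh) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= iw || y >= ih) return;

    float sum = 0.0f;
    for (int j = 0; j < fh; j++)
        for (int i = 0; i < fw; i++)
            // 1 multiply + 1 add (2 flops), 1 image load + 1 filter load
            // (8 bytes) per iteration -> arithmetic intensity 0.25
            sum += image[(y + j) * (iw + fw - 1) + (x + i)]
                 * filter[j * fw + i];
    output[y * iw + x] = sum;
}
```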

16 Tiled convolution ● 16×16 thread block processing an 11×7 filter ● Filter goes into constant memory (small) ● Threads within a block cooperatively load the entire area needed by all threads in the block into shared memory

17 Analysis ● If filter size increases: ● Arithmetic Intensity increases: ● Kernel shifts from memory-bandwidth bound to compute-bound ● Amount of shared memory needed increases → fewer thread blocks can run concurrently on each SM
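A back-of-the-envelope estimate (my own, not from the slides) shows why: with the filter in constant memory, a B_w × B_h block computing one pixel per thread loads roughly a (B_w + F_w − 1) × (B_h + F_h − 1) input tile from global memory, giving

```latex
\mathrm{AI}_{\text{tiled}} \;\approx\;
\frac{2\, B_w B_h F_w F_h}{4\,(B_w + F_w - 1)(B_h + F_h - 1)}
\;\approx\; \frac{F_w F_h}{2} \ \text{flops/byte}
\quad \text{for } F_w \ll B_w,\; F_h \ll B_h .
```

So AI grows roughly with the filter area, while the shared-memory tile grows as well, which is what limits the number of concurrent thread blocks per SM.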

18 Tiling ● Each thread block computes 1×N tiles in horizontal direction + Increases amount of work per thread + Saves loading overlapping borders + Saves redundant instructions + No shared memory bank conflicts - More shared memory, fewer concurrent thread blocks

19 Adaptive tiling ● Tiling factor is selected at runtime depending on the input data and the resource limitations of the device ● Highest possible tiling factor that fits within the shared memory available (depending on filter size) ● Plus loop unrolling, memory banks, search optimal configuration
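A host-side sketch of the idea (hypothetical helper and parameter names, not the thesis implementation): query the device's shared-memory limit and pick the largest tiling factor whose input tile still fits.

```cuda
#include <cuda_runtime.h>

// Pick the largest horizontal tiling factor N such that the shared-memory
// tile (Bw*N + Fw - 1) x (Bh + Fh - 1) floats still fits on the device.
int pickTilingFactor(int Bw, int Bh, int Fw, int Fh, int maxFactor) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    size_t smemAvail = prop.sharedMemPerBlock;

    int best = 1;
    for (int n = 1; n <= maxFactor; n++) {
        size_t needed = sizeof(float) *
            (size_t)(Bw * n + Fw - 1) * (size_t)(Bh + Fh - 1);
        if (needed <= smemAvail)
            best = n;       // largest factor that still fits
    }
    return best;
}
```

A real decision system would also weigh the resulting drop in concurrent thread blocks per SM, as the slide notes.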

20 Lessons learned ● Everything must be in balance to obtain high performance ● Subtle interactions between resource limits ● Runtime decision system (adaptive tiling), in combination with standard optimizations ● Loop unrolling, memory bank conflicts Ph.D. thesis Ben van Werkhoven, 27 Oct. 2014 FGCS journal, 2014

21 Application case study 2: Dedispersion ● Auto-Tuning dedispersion for many-core Accelerators ● Used for searching pulsars in radio astronomy data Alessio Sclocco et al.: Auto-Tuning Dedispersion for Many-Core Accelerators, IPDPS 2014

22 Dedispersion ● Pulsar signals get dispersed: lower radio frequencies arrive progressively later ● Non-linear function of distance between source & receiver ● Can be reversed by shifting in time the signal’s lower frequencies (dedispersion)
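For reference, the standard cold-plasma dispersion delay used in pulsar astronomy (not stated on the slide, so treat the constant as approximate) is

```latex
\Delta t \;\approx\; 4.15 \times 10^{3}\,\mathrm{s}
\;\cdot\; \mathrm{DM}\,[\mathrm{pc\,cm^{-3}}]
\;\cdot\; \left( \nu_{\mathrm{lo}}^{-2} - \nu_{\mathrm{hi}}^{-2} \right),
\qquad \nu \text{ in MHz}.
```

Dedispersion undoes this by shifting each frequency channel back in time by its Δt before summing over channels; the DM of the source is unknown, so many trial DMs are computed.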

23 Auto-tuning ● Using auto-tuning to adapt the algorithm for: ● Different many-core platforms ● NVIDIA & AMD GPUs, Intel Xeon Phi, Xeon, … ● Different observational scenarios ● LOFAR, Apertif ● Different numbers of Dispersion Measures (DMs) ● Represents number of free electrons between source & receiver ● Measure of distance between emitting object & receiver ● Parameters: ● Number of threads per sample or DM, thread block size, number of registers per thread, …
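A generic tuning skeleton (illustrative only; the IPDPS 2014 tuner explores a much larger parameter space) times each candidate configuration with CUDA events and keeps the fastest valid one:

```cuda
#include <cuda_runtime.h>
#include <cfloat>
#include <cstdio>

// Placeholder kernel standing in for the dedispersion kernel (hypothetical).
__global__ void work(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Benchmark every candidate thread-block size and return the fastest.
// Real tuners also sweep work per thread, registers per thread, etc.
int tuneBlockSize(const float *d_in, float *d_out, int n) {
    const int candidates[] = {32, 64, 128, 256, 512, 1024};
    int bestThreads = 32;
    float bestMs = FLT_MAX;

    for (int t : candidates) {
        dim3 block(t), grid((n + t - 1) / t);
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        work<<<grid, block>>>(d_in, d_out, n);     // one timed run per config
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (cudaGetLastError() == cudaSuccess && ms < bestMs) {
            bestMs = ms;                           // invalid configs are skipped
            bestThreads = t;
        }
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    printf("best block size: %d (%.3f ms)\n", bestThreads, bestMs);
    return bestThreads;
}
```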

24 Auto-Tuning: Registers (LOFAR and Apertif scenarios)

25 Example histogram

26 Speedup over Best Fixed Configuration

27 Lessons learned ● Autotuning allows algorithms to adapt to different platforms and scenarios ● The impact that auto-tuning has on dedispersion is significant ● Guessing a good configuration without auto-tuning is difficult

28 Application case study 3: High-Resolution Global Climate Modeling ● Understand future local sea level changes ● Quantify the effect of changes in freshwater input & ocean circulation on regional sea level height in the Atlantic ● To obtain high resolution, use: ● Distributed computing (multiple resources) ● Based on research on wide-area optimizations done 16 years ago on DAS-1 (Albatross project) ● GPU Computing ● Good example of application-inspired Computer Science research COMMIT/

29 Distributed Computing ● Use Ibis to couple different simulation models ● Land, ice, ocean, atmosphere ● Wide-area optimizations like hierarchical load balancing

30 Enlighten Your Research Global award ● Sites: EMERALD (UK), KRAKEN (USA), STAMPEDE (USA), SUPERMUC (GER), CARTESIUS (NLD), including Top500 systems ranked #7 and #10 ● Connected by 10G links

31 GPU Computing ● Offload expensive kernels for Parallel Ocean Program (POP) from CPU to GPU ● Many different kernels, fairly easy to port to GPUs ● Vertical mixing / Barotropic solvers / ‘State’ calculation … ● Execution time becomes virtually 0 ● New bottleneck: moving data between CPU & GPU ● (Figure: CPU host memory and GPU device memory connected by the PCI Express link)

32 Different methods for CPU-GPU communication ● Memory copies (explicit) ● No overlap with GPU computation ● Device-mapped host memory (implicit) ● Allows fine-grained overlap between computation and communication in either direction ● CUDA Streams or OpenCL command-queues ● Allows overlap between computation and communication in different streams ● Any combination of the above
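The CUDA-streams option, for instance, can be sketched as follows (hypothetical kernel and helper names; assumes the host buffer was allocated with cudaMallocHost, otherwise the asynchronous copies silently become synchronous):

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Placeholder kernel standing in for a POP kernel (hypothetical).
__global__ void process(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 1.1f;
}

// Split the data into chunks so that the host-to-device copy, kernel,
// and device-to-host copy of different chunks can overlap.
void runWithStreams(float *h_data, float *d_data, int n, int nStreams) {
    std::vector<cudaStream_t> streams(nStreams);
    for (int s = 0; s < nStreams; s++) cudaStreamCreate(&streams[s]);

    int chunk = (n + nStreams - 1) / nStreams;
    for (int s = 0; s < nStreams; s++) {
        int offset = s * chunk;
        int len = std::min(chunk, n - offset);
        if (len <= 0) break;
        size_t bytes = (size_t)len * sizeof(float);

        // All three operations are queued on the same stream; operations
        // in different streams may overlap on the copy engine(s) and SMs.
        cudaMemcpyAsync(d_data + offset, h_data + offset, bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(len + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, len);
        cudaMemcpyAsync(h_data + offset, d_data + offset, bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    for (int s = 0; s < nStreams; s++) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```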

33 Problem ● Problem: ● Which method will be most efficient for a given GPU kernel? Implementing all of them can be a large effort ● Solution: ● Create a performance model that identifies the best implementation: ● What implementation strategy for overlapping computation and communication is best for my program? Ben van Werkhoven, Jason Maassen, Frank Seinstra & Henri Bal: Performance models for CPU-GPU data transfers, CCGrid 2014 (nominated for best paper award)

34 Analytical performance model ● What implementation strategy for overlapping computation and communication is best for my program? ● Other questions: ● How much performance is gained from overlapping? ● Is the PCIe link actually a bottleneck in my program? ● What number of streams is likely to give the best performance? ● How will future architectures impact my program’s performance?
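A deliberately simplified first-order bound (my own sketch, not the CCGrid 2014 model itself) already answers some of these questions: without overlap the three phases add up, while with enough streams the run time approaches the slowest pipeline stage, which differs for GPUs with one or two copy engines:

```latex
T_{\text{no overlap}} \;=\; T_{H \to D} + T_{\text{kernel}} + T_{D \to H},
\qquad
T_{\text{streams}} \;\gtrsim\;
\begin{cases}
\max\!\big(T_{H \to D} + T_{D \to H},\; T_{\text{kernel}}\big) & \text{(1 copy engine)}\\[4pt]
\max\!\big(T_{H \to D},\; T_{\text{kernel}},\; T_{D \to H}\big) & \text{(2 copy engines)}
\end{cases}
```

If T_kernel dominates both bounds, the PCIe link is not the bottleneck and extra streams buy little.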

35 Example result ● Implicit synchronization and 1 copy engine ● 2 POP kernels (state and buoydiff) ● GTX 680 connected over PCIe 2.0 ● (Figure: measured vs. modelled performance)

36 Different GPUs (buoydiff and state kernels)

37 MOVIE

38 Comes with spreadsheet

39 Lessons learned ● PCIe transfers can have a large performance impact ● Several methods for transferring data and overlapping computation & communication exist ● Performance modelling helps to select the best mechanism

40 Why is GPU programming hard? ● Mapping an algorithm onto the architecture is difficult, especially because the architecture itself is complex: ● Many levels of parallelism ● Limited resources (registers, shared memory) ● Less of everything than a CPU (except parallelism), especially per thread, which makes problem partitioning difficult ● Everything must be in balance to obtain performance ● Subtle interactions between resource limits

41 Why is GPU programming hard? ● Many crucial high-impact optimizations are needed: ● Data reuse ● Use shared memory efficiently ● Limited by #registers per thread, shared memory per thread block ● Memory access patterns ● Avoid shared memory bank conflicts ● Global memory coalescing ● Instruction stream optimization ● Control flow divergence ● Thread-level parallelism ● Maximize occupancy: avoid having all warps stalled ● Loop unrolling
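As an example of the memory-access-pattern point, the two generic kernels below copy the same square matrix; only the first lets the 32 threads of a warp touch consecutive addresses, so only it coalesces (bounds checks omitted for brevity; assumes the launch grid exactly covers a width × width matrix).

```cuda
// Coalesced: thread (col) reads element (row, col) -> consecutive threads
// touch consecutive addresses, so a warp's 32 loads merge into few
// memory transactions.
__global__ void copyCoalesced(const float *in, float *out, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    out[row * width + col] = in[row * width + col];
}

// Strided: consecutive threads read addresses 'width' floats apart,
// so each warp load scatters over many memory transactions.
__global__ void copyStrided(const float *in, float *out, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    out[col * width + row] = in[col * width + row];
}
```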

42 Why is GPU programming hard? ● Portability ● Optimizations are architecture-dependent, and the architectures change frequently ● Optimizations are often input-dependent ● Finding the right parameter settings is difficult ● Need better performance models ● Like Roofline and our I/O model

43 Agenda ● Application case studies ● Multimedia kernel (convolution) ● Astronomy kernel (dedispersion) ● Optimizing the I/O of multiple kernels ● Climate modelling ● Programming methodologies ● MapReduce: using existing programming model ● Stepwise refinement: new methodology & model

44 Programming methodology: stepwise refinement for performance ● Programming accelerators: ● Tension: control over hardware vs. abstraction level ● Methodology: ● Integrate hardware descriptions into the programming model ● Programmers can work on multiple levels of abstraction ● Performance feedback from the compiler, based on hardware description and kernel ● Cooperation between compiler and programmer P. Hijma et al., “Stepwise-refinement for Performance: a methodology for many-core programming,” Concurrency and Computation: Practice and Experience (accepted)

45 MCL: Many-Core Levels ● An MCL program is an algorithm mapped to hardware ● Start at a suitable abstraction level ● E.g. idealized accelerator, NVIDIA Kepler GPU, Xeon Phi ● The MCL compiler guides the programmer on which optimizations to apply at a given abstraction level, or when to move to a deeper level

46 Comparison ● High-level programming ● BSP, D&C, nested data-parallelism, array extensions, … ● Automatic optimizers ● Skeletons ● Separation of concerns ● Domain-specific languages ● Tuning-cycle approach ● CUDA/OpenCL + performance modeling/profiling

47 MCL ecosystem

48 Example

49 MCL example (Pieter)

50 Convolution in MCL

51 Examples compiler feedback ● Use local [= CONSTANT?] memory for Filter ● Use shared memory for Input (data reuse) ● Compute multiple elements per thread (2×2 tiles) ● Try to maximize the number of blocks per SMP. This depends on the number of threads, amount of shared memory and the number of registers ● Trade-off: more data-reuse, fewer blocks per SM ● Change tiling to 2 blocks per SM ● No loop-unrolling in compiler yet

52 Performance (GTX480, 9×9 filters) ● 380 GFLOPS ● MCL: 302 GFLOPS

53 Performance evaluation

54 Status ● Prototype implementation for various accelerators ● GTX480, Xeon Phi ● Various algorithms and small applications ● Source code available at: ?????? ● Current work: integration with the Satin divide-and-conquer system: ● Almost all possible levels of parallelism, from GPUs to (ultimately) wide-area systems

55 Glasswing: MapReduce on Accelerators ● Big Data revolution ● Designed for cheap commodity hardware ● Scales horizontally ● Coarse-grained parallelism ● MapReduce on modern hardware? Ismail El Helw, Rutger Hofman, Henri Bal [HPDC’2014, SC’2014]

56 MapReduce

57 MapReduce model
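The model itself fits in a few lines of sequential host code (word count, purely to illustrate the map / shuffle / reduce abstraction; Glasswing's actual API and its OpenCL kernels look different):

```cuda
#include <map>
#include <string>
#include <sstream>
#include <utility>
#include <vector>
#include <cstdio>

using KV = std::pair<std::string, int>;

// map: one input record in, zero or more (key, value) pairs out
void mapFn(const std::string &line, std::vector<KV> &out) {
    std::istringstream iss(line);
    std::string word;
    while (iss >> word) out.push_back({word, 1});   // emit (word, 1)
}

// reduce: all values of one key in, one aggregated value out
int reduceFn(const std::string &, const std::vector<int> &values) {
    int sum = 0;
    for (int v : values) sum += v;
    return sum;
}

int main() {
    std::vector<std::string> input = {"the quick fox", "the lazy dog"};

    std::vector<KV> intermediate;                   // map phase
    for (const auto &line : input) mapFn(line, intermediate);

    std::map<std::string, std::vector<int>> groups; // shuffle: group by key
    for (const auto &kv : intermediate) groups[kv.first].push_back(kv.second);

    for (const auto &g : groups)                    // reduce phase
        printf("%s: %d\n", g.first.c_str(), reduceFn(g.first, g.second));
    return 0;
}
```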

58 Rethinking MapReduce ● Use accelerators (OpenCL) as mainstream feature ● Massive out-of-core data sets ● Scale vertically & horizontally ● Maintain MapReduce abstraction

59 Glasswing Pipeline ● Overlaps computation, communication & disk access ● Supports multiple buffering levels

60 GPU optimizations ● Glasswing framework does: ● Custom memory allocators ● Shared memory optimizations (partially in framework, partially in kernels) ● Atomic operations on shared memory, aggregate, then global memory ● Data movements (cf. CUDA streams), data staging ● Programmer may do: ● Kernel optimizations (coalescing, memory banks, etc.)

61 Evaluation on DAS-4 ● 64-node cluster ● Dual quad-core Intel Xeon 2.4GHz CPUs ● 24GB of memory ● 2x1TB disks (RAID0) ● 16 nodes equipped with NVIDIA GTX480 GPUs ● QDR InfiniBand

62 Glasswing vs. Hadoop 64-node CPU cluster

63 Glasswing vs. Hadoop 16-Node GPU Cluster

64 Performance k-Means on CPU

65 Performance k-Means on GPU

66 Compute Device Comparison

67 Glasswing conclusions ● Scalable MapReduce framework combining coarse-grained and fine-grained parallelism ● Handles out-of-core data, sticks with the MapReduce model ● Overlaps kernel executions with memory transfers, network communication and disk access ● Current work: ● Machine learning applications ● Energy efficiency

68 Wrap-up: no single solution ● eScience applications need performance of GPUs ● GPU programming is very time-consuming ● Need new methodologies: ● Autotuning ● Determine optimal configuration in Dedispersion or Convolution ● Compile-time, runtime, or adaptive ● Performance modelling ● Compiler-based reasoning about performance ● Frameworks, templates, patterns (like MapReduce)

69 Future ● Challenges for CS ● Applications

70 EXTRA/NOT USED

71 A Common Programming Strategy  Partition data into subsets that fit into shared memory

72 A Common Programming Strategy  Handle each data subset with one thread block

73 A Common Programming Strategy  Load the subset from device memory to shared memory, using multiple threads to exploit memory-level parallelism

74 A Common Programming Strategy  Perform the computation on the subset from shared memory

75 A Common Programming Strategy  Copy the result from shared memory back to device memory
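The five steps above, rolled into one generic kernel (a 3-point averaging stencil; an illustrative sketch of mine, not from the course material):

```cuda
#define TILE 256   // launch with blockDim.x == TILE, one thread per element

// Steps 1+2: the launch partitions the array into TILE-sized subsets and
// assigns one thread block to each subset.
__global__ void stencil3(const float *in, float *out, int n) {
    __shared__ float tile[TILE + 2];                 // subset plus 1-element halo

    int gid = blockIdx.x * blockDim.x + threadIdx.x; // global index
    int lid = threadIdx.x + 1;                       // local index (skip halo)

    // Step 3: cooperative load of the subset (out-of-range elements become 0)
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[TILE + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();                                 // wait for the whole tile

    // Step 4: compute on the subset from shared memory
    if (gid < n)
        // Step 5: copy the result back to device memory
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}

// Example launch: stencil3<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);
```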

76 Distributed ASCI Supercomputer  Distributed common infrastructure for Dutch Computer Science  Distributed: multiple (4-6) clusters at different locations  Common: single formal owner (ASCI), single design team; users have access to the entire system  Dedicated to CS experiments (like Grid’5000): interactive (distributed) experiments, low resource utilization; able to modify/break the hardware and systems software Going Dutch: How to Share a Dedicated Distributed Infrastructure for Computer Science Research, keynote lecture at Euro-Par 2014 (Porto, 28 August 2014), http://www.cs.vu.nl/~bal/Talks/Europar2014.pptx

77 DAS generations: visions  DAS-1: Wide-area computing (1997)  Homogeneous hardware and software  DAS-2: Grid computing (2002)  Globus middleware  DAS-3: Optical Grids (2006)  Dedicated 10 Gb/s optical links between all sites  DAS-4: Clouds, diversity, green IT (2010)  Hardware virtualization, accelerators, energy measurements  DAS-5: Harnessing diversity, data-explosion (2015)  Wide variety of accelerators, larger memories and disks

78 ASCI (1995)  Research schools (Dutch product from 1990s), aims:  Stimulate top research & collaboration  Provide Ph.D. education (courses)  ASCI: Advanced School for Computing and Imaging  About 100 staff & 100 Ph.D. Students  16 PhD level courses  Annual conference

79 DAS-4 (2011) Testbed for clouds, diversity, green IT: dual quad-core Xeon E5620, InfiniBand, various accelerators, Scientific Linux, Bright Cluster Manager, built by ClusterVision. Sites (nodes): VU (74), TU Delft (32), Leiden (16), UvA/MultimediaN (16/36), ASTRON (23), connected via SURFnet6 at 10 Gb/s

80 Accelerators in the VU cluster  23 NVIDIA GTX480 GPUs  2 NVIDIA C2050 Tesla GPUs  NVIDIA GTX680 GPU  1 node with X5650 CPUs (dual 6-core, 2.67 GHz)  1 48-core (quad-socket "Magny Cours") AMD system  1 AMD Radeon HD7970  1 Intel "Sandy Bridge" E5-2630 node (2.3 GHz) with NVIDIA GTX Titan  ? Intel "Sandy Bridge" E5-2620 (2.0 GHz) nodes with K20m "Kepler" GPUs  2 Intel Xeon Phi accelerators

81 Bird’s-eye view (Figure: CPU vs. many-core chip, each with cores and memory channels)

82 Memory Spaces in CUDA (Figure: host, grid with blocks, per-block shared memory, per-thread registers, device and constant memory)

83 Key architectural ideas  Data parallel, like a vector machine  There, 1 thread issues parallel vector instructions  SIMT (Single Instruction Multiple Thread) execution  Many threads work on a vector, each on a different element  They all execute the same instruction  Hardware automatically handles divergence  Hardware multithreading  HW resource allocation & thread scheduling  HW relies on threads to hide latency  Context switching is (basically) free

84 http://www.nvidia.com/object/tesla- case-studies.html

85 CUDA Model of Parallelism  CUDA virtualizes the physical hardware  Block is a virtualized streaming multiprocessor  Thread is a virtualized scalar processor  Scheduled onto physical hardware without pre-emption  Threads/blocks launch & run to completion  Blocks should be independent (Figure: blocks with shared memory on top of device memory)

86 Alternatives (Figure: memory hierarchy: registers, L1 cache / shared memory, L2 cache, device memory, host memory, PCIe bus)

87 (Figure: per-thread local memory, per-SM shared memory, per-device global memory shared by successive kernels)

88 (Figure: host, grid with blocks, shared memory, registers, device and constant memory)

89 Performance k-Means on CPU

90 Performance k-Means on GPU

91 Rough plan: Explain why GPUs are hard to program Applications Multimedia Astronomy [illustrate why GPU programming is hard] Climate modelling [Multiple kernel, I/O: climate problem more or less solved] Programming methodologies MapReduce [existing programming model] Stepwise refinement [a step towards a solution]

92 Application case study 3: Global Climate Modeling  Netherlands eScience Center:  Builds bridges between applications & ICT (Ibis, JavaGAT)  Frank Seinstra, Jason Maassen, Maarten van Meersbergen  Utrecht University  Institute for Marine and Atmospheric research  Henk Dijkstra  VU:  Ben van Werkhoven, Henri Bal COMMIT/

93 Performance TeraSort (I/O intensive application)

94 Performance k-Means on CPU

95 Example

96 Example [OLD SLIDE]

97 Scale up? Amazon EC2

98 Performance evaluation

