Published by Esther Iris Terry. Modified over 9 years ago.
Slide 1/8 Performance Debugging for Highly Parallel Accelerator Architectures
Saurabh Bagchi, ECE & CS, Purdue University
Joint work with: Tsungtai Yeh, Amit Sabne, Rudolf Eigenmann (Purdue)
Presentation available at: engineering.purdue.edu/dcsl
Slide 2/8 Emerging Trend
Heterogeneous computing is gaining ground as a way to accelerate parallel applications
The buzzword is “accelerators”
– Graphics Processing Units (GPUs)
– Field Programmable Gate Arrays (FPGAs)
The attraction is a high degree of parallelism close to the main processor
– Example: Dell PowerEdge servers have 2 Kepler GPUs with a total of 2 × 1536 CUDA cores
Slide 3/8 But … not so fast
Programming models for these architectures hide many architectural details
– As they should
But these architectures make it easy to commit horrendous performance errors
– Even more so than on traditional CPU architectures
Why?
– FPGA: Constrained on-chip memory; a careless program can wipe out any performance improvement by going to the main processor
– GPU: Multiple levels of memory hierarchy with widely different access latencies; identical control flow mandated for all threads within a block
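To make the control-flow point concrete, here is a toy Python model (not a GPU simulator) of how divergent branches serialize. It assumes 32-thread warps, the granularity at which NVIDIA hardware resolves divergence inside a block; the function name `divergent_passes` is made up for this illustration.

```python
WARP_SIZE = 32  # assumption: NVIDIA warp width


def divergent_passes(branch_taken, warp_size=WARP_SIZE):
    """Count execution passes when each warp must run every distinct
    branch outcome that occurs among its threads."""
    passes = 0
    for start in range(0, len(branch_taken), warp_size):
        warp = branch_taken[start:start + warp_size]
        passes += len(set(warp))  # 1 if the warp is uniform, 2 if both paths occur
    return passes


n_threads = 128  # four warps
# Branch on thread parity: every warp contains both outcomes.
by_parity = [tid % 2 == 0 for tid in range(n_threads)]
# Branch on warp parity: every warp is uniform.
by_warp = [(tid // WARP_SIZE) % 2 == 0 for tid in range(n_threads)]

print(divergent_passes(by_parity))  # 8
print(divergent_passes(by_warp))    # 4
```

Both conditions send half of the threads down each path, yet the parity version needs twice as many passes in this model because every warp executes both sides of the branch.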
Slide 4/8 GPU Schematic
[Figure: CUDA hierarchy of threads, thread blocks, and grids, with per-thread private, per-block shared, and per-application global memory spaces]
[Figure: Memory hierarchy]
Slide 5/8 Specs Leading to Performance Problems
Shared memory and L1 cache are limited
– Configurable split: 16 KB/48 KB or 48 KB/16 KB
– Very fast access: 1+ TB/s
Global memory is accessible by all threads on the GPU
– Larger amount of memory: 8 GB
– Slower access: 320 GB/s
If communication with host memory is required (over the PCI Express bus), access is much slower
– A PLDI 12 paper shows a 5X speedup from avoiding cyclic communication
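A rough back-of-the-envelope calculation shows what these bandwidth gaps mean for moving 1 GB of data. The shared-memory and global-memory figures are the ones on the slide; the PCI Express figure of 8 GB/s is an assumption added here (the theoretical per-direction rate of a PCIe 2.0 x16 link), since the slide gives no number for it.

```python
GB = 1e9  # bytes

bandwidth_bytes_per_s = {
    "shared memory / L1 (slide: 1+ TB/s)": 1e12,
    "global memory (slide: 320 GB/s)": 320e9,
    "PCIe to host (ASSUMED 8 GB/s, not from the slide)": 8e9,
}


def transfer_ms(num_bytes, bytes_per_s):
    """Idealized transfer time in milliseconds, ignoring latency and overheads."""
    return 1e3 * num_bytes / bytes_per_s


for level, bandwidth in bandwidth_bytes_per_s.items():
    print(f"{level}: {transfer_ms(1 * GB, bandwidth):.1f} ms per GB")
```

Under these idealized peak rates, a round trip through host memory costs on the order of a hundred times more than staying in on-chip memory, which is why avoidable host transfers are such an expensive mistake.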
Slide 6/8 Common Patterns of Performance Bugs
Memory bugs
– Uncoalesced memory accesses
– Bank conflicts in shared memory
– Channel skew in global memory
– Scheduling of host-to-device memory transfers
Multi-thread bugs
– Block/thread configuration
– Branch divergence
Synchronization bugs
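As an illustration of one item in this list, shared-memory bank conflicts, the following Python sketch estimates how badly a strided access pattern serializes. It assumes 32 banks of 4-byte words with consecutive words in consecutive banks, and that threads reading the same word are served by a single broadcast; `conflict_degree` is a hypothetical helper, not a profiler API.

```python
from collections import defaultdict

NUM_BANKS = 32  # assumption: 32 banks, one 4-byte word wide
WARP_SIZE = 32


def conflict_degree(word_indices, num_banks=NUM_BANKS):
    """Worst-case number of distinct words a single bank must serve.
    A result of 1 means the access is conflict-free."""
    words_per_bank = defaultdict(set)
    for word in word_indices:
        words_per_bank[word % num_banks].add(word)
    return max(len(words) for words in words_per_bank.values())


for stride in (1, 2, 32, 33):
    accesses = [tid * stride for tid in range(WARP_SIZE)]
    print(f"stride {stride}: {conflict_degree(accesses)}-way")
# stride 1: 1-way, stride 2: 2-way, stride 32: 32-way, stride 33: 1-way
```

The stride-33 case shows why padding a 32-wide shared-memory tile by one column is a common remedy: it spreads a column access across all banks.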
Slide 7/8 Performance Debugger Workflow
[Flowchart; box labels listed in the order they were extracted]
Benchmarking (small scales or small data)
Profiling
Detect performance anomaly
Localize the problem
Automatic program transformation
Re-benchmarking
Acceptable? (branch labels: No, Yes; Break)
Program
Static Analysis
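One plausible reading of this flowchart, whose arrows did not survive in the transcript, is a loop that repeats until re-benchmarking gives an acceptable result. The Python sketch below is only that reading: every callable is a hypothetical placeholder, `find_anomaly` stands in for the profiling, detection, and localization boxes together, and the toy cost numbers are invented.

```python
def tune(program, benchmark, find_anomaly, transform, is_acceptable, max_rounds=10):
    """Benchmark, localize an anomaly, transform, re-benchmark; stop when acceptable."""
    history = [benchmark(program)]
    for _ in range(max_rounds):
        if is_acceptable(history[-1]):
            break  # "Acceptable? Yes"
        anomaly = find_anomaly(program, history[-1])
        if anomaly is None:
            break  # nothing left that the tool knows how to fix
        program = transform(program, anomaly)
        history.append(benchmark(program))  # re-benchmarking
    return program, history


# Toy stand-ins: a "program" is just the set of performance bugs it still has.
toy_program = frozenset({"uncoalesced_access", "bank_conflict"})
final_program, costs = tune(
    toy_program,
    benchmark=lambda p: 1 + 2 * len(p),                    # invented cost units
    find_anomaly=lambda p, cost: min(p) if p else None,    # pick any remaining bug
    transform=lambda p, bug: p - {bug},                    # "fix" that bug
    is_acceptable=lambda cost: cost <= 1,
)
print(sorted(final_program), costs)  # [] [5, 3, 1]
```

Capping the number of rounds keeps the loop from running forever if a transformation never reaches the acceptance threshold.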
Slide 8/8 Example of a Performance Bug
Matrix transpose on GPU
– The memory bandwidth of the GTX 280 is 140 GB/s
For a 2048 × 2048 matrix
– Naïve transpose: 2.2 GB/s
– Coalesced transpose: 17.1 GB/s
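The gap between the two transposes can be motivated with a simple transaction count. The Python sketch below assumes a row-major matrix of 4-byte elements, 128-byte memory segments, and a warp of 32 threads with consecutive x indices executing `out[x][y] = in[y][x]`; these parameters and the helper `segments_touched` are assumptions for illustration, not measurements from the talk.

```python
def segments_touched(element_indices, elem_bytes=4, segment_bytes=128):
    """Number of distinct memory segments a set of element accesses falls into."""
    return len({(i * elem_bytes) // segment_bytes for i in element_indices})


N = 2048        # matrix is N x N, as on the slide
y = 0           # row handled by this warp
xs = range(32)  # one warp: 32 consecutive x indices

naive_reads = [y * N + x for x in xs]   # in[y][x]: contiguous addresses
naive_writes = [x * N + y for x in xs]  # out[x][y]: stride of N elements

print(segments_touched(naive_reads), segments_touched(naive_writes))  # 1 32
```

In this model the naive kernel's reads fit in one segment per warp while its writes touch 32, one per thread. A coalesced version typically stages a tile in shared memory so that both the global read and the global write run along rows. For scale, the slide's numbers put the coalesced transpose at about 7.8 times the naive one (17.1 / 2.2), and still at only about 12% of the quoted 140 GB/s peak.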
Slide 9/8 Can We Do This Automatically?
Training Phase (a series of small-scale testing runs)
– Instrumentation to record observational features
– Modeling to train a model that can predict observational features from control features
Deployment Phase (large-scale production runs)
– Instrumentation to record the same features
– Detection to flag production runs with negative correlation
– Localization
   Use the trained model to reconstruct observational features
   Rank features by reconstruction error
Some lessons from our prior work [HPDC ’11] [HotDep ’12]
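The localization step can be sketched in a few lines of Python. Note the substitution: instead of the learned model used in the prior work, this sketch fits an ordinary least-squares line per observational feature against a single control feature (the scale of the run). The feature names and all numbers are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rank_by_reconstruction_error(train_scales, train_obs, prod_scale, prod_obs):
    """Reconstruct each feature at production scale and rank features by
    relative error between the observed and reconstructed values."""
    errors = {}
    for name, values in train_obs.items():
        slope, intercept = fit_line(train_scales, values)
        predicted = slope * prod_scale + intercept
        errors[name] = abs(prod_obs[name] - predicted) / max(abs(predicted), 1e-12)
    return sorted(errors.items(), key=lambda item: item[1], reverse=True)


# Training phase: small-scale runs (invented data).
train_scales = [8, 16, 32, 64]
train_obs = {
    "loop_a_count": [80, 160, 320, 640],   # grows as 10 * scale
    "branch_b_taken": [17, 33, 65, 129],   # grows as 2 * scale + 1
}
# Deployment phase: one production run at scale 1024 (invented data).
prod_obs = {"loop_a_count": 10240, "branch_b_taken": 300}

ranking = rank_by_reconstruction_error(train_scales, train_obs, 1024, prod_obs)
print(ranking[0][0])  # branch_b_taken
```

The feature at the top of the ranking is the one whose production behavior departs most from what the small-scale runs predict, which makes it the first place to look for the bug.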
Slide 10/8 Can We Do This Automatically? Maybe
Some lessons from our prior work [HPDC ’11] [HotDep ’12]
Kernel Canonical Correlation Analysis takes observational feature X and control feature Y and finds f and g such that f(X) and g(Y) are highly correlated
[Figure: axes labeled “Behavioral Feature” and “Scale of Execution”; a run with corr(f(x), g(y)) < 0 is marked “BUG!”]
Slide 11/8 ABHRANTA: a Predictive Model for Program Behavior at Large Scale
ABHRANTA replaced the non-invertible transform g used by Vrisha with a linear transform g’
The new model provides an automatic way to reconstruct “bug-free” behavior at large scale, lifting the burden of manual analysis of program scaling behavior
[Figure labels: x, f(x), g’(*), g’⁻¹(f(x))]
Slide 12/8 Results from HPC Benchmark
AMG2006 is a parallel algebraic multigrid solver for linear systems, written in 104K lines of C code
– The application is configured to solve the default 3D Laplace type problem
Train on 8-128 node runs, test at larger scales (up to 4096 nodes)
Fault injection study
– Integer overflows, buffer overflows
Control features: X, Y, Z dimensions of 3D grid
Observational features: All conditionals indexed by calling context
Slide 13/8 Can We Do This for GPU Programs?
We think we can
(Wild?) speculation!
Features that make this approach more feasible
1. More regular kernels than general-purpose programs
2. Good places to insert monitors to observe behavioral features
3. Often spare computational capacity close by
4. Types of performance bugs are limited
5. Types of program transformations are limited
Slide 14/8 Presentation available at: Dependable Computing Systems Lab (DCSL) web site
engineering.purdue.edu/dcsl