©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010 VSCSE Summer School Proven Algorithmic Techniques for Many-core Processors Lecture.

Presentation transcript:

VSCSE Summer School: Proven Algorithmic Techniques for Many-core Processors
Lecture 4: Cut-off and Binning for Regular Data Sets
©Wen-mei W. Hwu and David Kirk/NVIDIA, Urbana, Illinois, August 2-5, 2010

Scalability and Performance of Different Algorithms for Calculating the Electrostatic Potential Map
Direct summation is accurate but has poor data scalability
All cutoff implementations show the same scalability

DCS Algorithm for Electrostatic Potentials
At each grid point, sum the electrostatic potential from all atoms
Highly data-parallel
But has quadratic complexity
  –Number of grid points × number of atoms
  –Both proportional to volume

Algorithm for Electrostatic Potentials With a Cutoff
Ignore atoms beyond a cutoff distance, r_c
  –Typically 8 Å–12 Å
  –Long-range potential may be computed separately
Number of atoms within the cutoff distance is roughly constant (uniform atom density)
  –200 to 700 atoms within an 8 Å–12 Å cutoff sphere for typical biomolecular structures

Cut-off Summation
With fixed partial charge q_i, the electrostatic potential V at position r is a sum over all N atoms
[equation not recoverable from the transcript]

Implementation Challenge
For each tile of grid points, we need to identify the set of atoms that need to be examined
  –One could naively examine all atoms and use only the ones whose distance is within the given range (but this examination still takes time)
  –We need to avoid examining the atoms outside the range

Binning
A process that groups data into chunks called bins
Each bin collectively represents a property of the data in the bin
Helps problem solving through data coarsening
Examples: histogram, k-d tree, …

Binning with Uniform Cubes
Divide the simulation volume into non-overlapping cubes
Every atom in the simulation volume falls into a cube based on its spatial location
After binning, each cube has a unique index in the simulation space
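The mapping from an atom position to a cube index can be sketched as below. This is a minimal illustration, not the lecture's code: the `BinGrid` struct, the `bin_index` name, and the x-fastest linear ordering are all assumptions made for the example.

```c
#include <assert.h>

/* Hypothetical description of the binned volume (names are illustrative). */
typedef struct {
    float origin[3];   /* lower corner of the simulation volume */
    float bin_len;     /* edge length of one cube, e.g. 4.0f for 4 Å bins */
    int   dim[3];      /* number of cubes along x, y, z */
} BinGrid;

/* Map a position to the linear index of the cube that contains it.
   Returns -1 if the position lies outside the simulation volume. */
static int bin_index(const BinGrid *g, float x, float y, float z)
{
    const float p[3] = { x, y, z };
    int c[3];
    assert(g->bin_len > 0.0f);
    for (int a = 0; a < 3; a++) {
        float t = (p[a] - g->origin[a]) / g->bin_len;
        if (t < 0.0f || t >= (float)g->dim[a]) return -1;
        c[a] = (int)t;   /* truncation equals floor because t >= 0 */
    }
    /* Assumed layout: x varies fastest, then y, then z. */
    return (c[2] * g->dim[1] + c[1]) * g->dim[0] + c[0];
}
```

With a 3 × 3 × 3 grid of 4 Å cubes starting at the origin, an atom at (5.0, 0.5, 0.5) falls in cube 1 and an atom at (0.5, 5.0, 0.5) falls in cube 3.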

Spatial Sorting Using Binning
Presort atoms into bins by location in space
Each bin holds several atoms
Cutoff potential only uses bins within r_c
  –Yields a linear-complexity cutoff potential algorithm

Improving Work Efficiency
Thread block examines atom bins up to the cutoff distance
  –Use a sphere of bins
  –All threads in a block scan the same atoms
      No hardware penalty for multiple simultaneous reads of the same address
      Simplifies fetching of data

The Neighborhood Is a Volume

Neighborhood Offset List
A list of offsets enumerating the bins that are located within the cutoff distance for a given location in the simulation volume
Detection of surrounding atoms becomes realistic for output grid points
  –By visiting bins in the neighborhood offset list and iterating over the atoms they contain
Figure: 2D example centered on bin (0, 0) with the cutoff distance drawn; bin (-1, -1) is a bin in the neighborhood list, bin (1, 2) is not included
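One possible way to build such a list in 3D is sketched below. The inclusion rule is an assumption chosen to be conservative: a bin is kept if the closest pair of points between it and the center bin is nearer than the cutoff. The slide's figure may use a different rule, and the names `BinOffset`, `axis_gap`, and `build_offset_list` are hypothetical.

```c
#include <assert.h>

typedef struct { int dx, dy, dz; } BinOffset;

/* Gap, in whole bins, between the center bin and a bin d bins away along
   one axis. Bins with |d| <= 1 touch the center bin, so their gap is zero. */
static int axis_gap(int d)
{
    int a = d < 0 ? -d : d;
    return a > 1 ? a - 1 : 0;
}

/* Write every offset whose bin could hold an atom within `cutoff` of some
   point in the center bin. Returns the number of offsets written, or -1
   if `max_out` is too small. */
static int build_offset_list(float cutoff, float bin_len,
                             BinOffset *out, int max_out)
{
    assert(cutoff > 0.0f && bin_len > 0.0f);
    /* reach = ceil(cutoff / bin_len), computed without the math library */
    int reach = (int)(cutoff / bin_len);
    if ((float)reach * bin_len < cutoff) reach++;

    const float rc2 = cutoff * cutoff;
    int n = 0;
    for (int dz = -reach; dz <= reach; dz++)
        for (int dy = -reach; dy <= reach; dy++)
            for (int dx = -reach; dx <= reach; dx++) {
                float gx = (float)axis_gap(dx) * bin_len;
                float gy = (float)axis_gap(dy) * bin_len;
                float gz = (float)axis_gap(dz) * bin_len;
                if (gx * gx + gy * gy + gz * gz >= rc2) continue;
                if (n == max_out) return -1;
                out[n].dx = dx;
                out[n].dy = dy;
                out[n].dz = dz;
                n++;
            }
    return n;
}
```

The list depends only on the cutoff and the bin size, so it can be generated once and reused for every output grid point.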

Bin Design
Uniform-sized bins allow an array implementation
Bin size (capacity) should be big enough to contain all the atoms that fall into a bin
  –The cutoff test screens out atoms in the visited bins that lie beyond the cutoff distance
  –Performance penalty if too many are screened away

Pseudo Code
// 1. binning (CPU)
for each atom in the simulation volume,
    index_of_bin := atom.addr / BIN_SIZE
    bin[index_of_bin] += atom
// 2. generate the neighborhood offset list (CPU)
for each c from -cutoff to cutoff,
    if distance(0, c) < cutoff,
        nlist += c
// 3. do the computation (GPU)
for each point in the output grid,
    index_of_bin := point.addr / BIN_SIZE
    for each offset in nlist,
        for each atom in bin[index_of_bin + offset],
            point.potential += atom.charge / (distance from point to atom)

Performance
O(MN’), where M is the number of output grid points and N’ is the number of atoms in the bins of the neighborhood offset list
  –In general, N’ is small compared to the number of all atoms
Works well if the distribution of atoms is uniform

Bin Size Considerations
Capacity of atom bins needs to be balanced
  –Too large: many dummy atoms in bins
  –Too small: some atoms will not fit into bins
  –Target a bin capacity that covers more than 95% of atoms
CPU places all atoms that do not fit into bins into an overflow bin
  –Use a CPU sequential algorithm to calculate their contributions to the energy grid lattice points
  –CPU and GPU can do potential calculations in parallel
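The capacity trade-off can be made concrete with a small host-side sketch of fixed-capacity binning with an overflow list. It is an illustration under assumptions, not the lecture's implementation: `BIN_DEPTH` is set to 8 to match the 128-byte bins described later in the deck, the bin index of each atom is taken as an input, and `bin_atoms` is a hypothetical name.

```c
#include <assert.h>
#include <string.h>

#define BIN_DEPTH 8   /* slots per bin (assumed; 8 atoms of 16 bytes = 128 bytes) */

typedef struct { float x, y, z, q; } Atom;

/* Distribute atoms into fixed-capacity bins.
   bin_of[i] is the bin index of atoms[i]; how it is computed is left out.
   `bins` holds nbins * BIN_DEPTH slots; unused slots stay zeroed, so they
   act as dummy atoms with q == 0.
   Atoms that do not fit are copied to `overflow` for the CPU to process.
   Returns the number of overflow atoms. */
static int bin_atoms(const Atom *atoms, const int *bin_of, int natoms,
                     Atom *bins, int *count, int nbins, Atom *overflow)
{
    memset(bins, 0, (size_t)nbins * BIN_DEPTH * sizeof(Atom));
    memset(count, 0, (size_t)nbins * sizeof(int));
    int noverflow = 0;
    for (int i = 0; i < natoms; i++) {
        int b = bin_of[i];
        assert(b >= 0 && b < nbins);
        if (count[b] < BIN_DEPTH) {
            bins[b * BIN_DEPTH + count[b]] = atoms[i];
            count[b]++;
        } else {
            overflow[noverflow++] = atoms[i];
        }
    }
    return noverflow;
}
```

A larger `BIN_DEPTH` shrinks the overflow list but leaves more zeroed slots in sparsely filled bins, which is the balance the slide describes.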

Going from DCS Kernel to Large Bin Cut-off Kernel
Adaptation of techniques from the direct Coulomb summation kernel for a cutoff kernel
Atoms are stored in constant memory, as with the DCS kernel
CPU loops over potential map regions that are (24 Å)³ in volume (cube containing the cutoff sphere)
Large bins of atoms are appended to the constant memory atom buffer until full, then the GPU kernel is launched
Host loops over map regions, reloading constant memory and launching GPU kernels until complete

Large Bin Design Concept
Map regions are (24 Å)³ in volume
Regions are sized large enough to provide the GPU enough work in a single kernel launch
  –(48 lattice points)³ for a lattice with 0.5 Å spacing
  –Small bins don’t provide the GPU enough work to utilize all SMs or to amortize constant memory update time and kernel launch overhead

Large Bin Cut-off Kernel Code
static __constant__ float4 atominfo[MAXATOMS];
__global__ static void mgpot_shortrng_energy(…) {
    […]
    for (n = 0; n < natoms; n++) {
        float dx = coorx - atominfo[n].x;
        float dy = coory - atominfo[n].y;
        float dz = coorz - atominfo[n].z;
        float q = atominfo[n].w;
        float dxdy2 = dx*dx + dy*dy;
        float r2 = dxdy2 + dz*dz;
        if (r2 < CUTOFF2) {
            float gr2 = GC0 + r2*(GC1 + r2*GC2);
            float r_1 = 1.f/sqrtf(r2);
            accum_energy_z0 += q * (r_1 - gr2);
        }
        …

Large-bin Cutoff Kernel Evaluation
6× speedup relative to fast CPU version
Work-inefficient
  –Coarse spatial hashing into (24 Å)³ bins
  –Only 6.5% of the atoms a thread tests are within the cutoff distance
Better adaptation of the algorithm to the GPU will gain another 2.5×

Small Bin Design
For 0.5 Å lattice spacing, a (4 Å)³ cube of the potential map is computed by each thread block
  –8 × 8 × 8 potential map points
  –128 threads per block (4 points/thread)
  –34% of examined atoms are within the cutoff distance

More Design Considerations for the Cutoff Kernel
High memory throughput to atom data is essential
  –Group threads together for locality
  –Fetch bins of data into shared memory
  –Structure atom data to allow fetching
After taking care of memory demand, optimize to reduce instruction count
  –Loop and instruction-level optimization

Tiling Atom Data
Shared memory is used to reduce global memory bandwidth consumption
  –Threads in a thread block collectively load one bin at a time into shared memory
  –Once loaded, threads scan atoms in shared memory
  –Reuse: loaded bins are used 128 times
Figure: execution cycle of a thread block over time. Collectively load next bin; suspend until data is returned from global memory (another thread block runs while this one waits); ready; write bin to shared memory; threads individually compute potentials using the bin in shared memory

Coalesced Global Memory Access to Atom Data
Full global memory bandwidth only with 64-byte, 64-byte-aligned memory accesses
  –Each bin is exactly 128 bytes
  –Bins stored in a 3D array
  –32 threads in each block load one bin into shared memory, which is then processed by all threads in the block
128 bytes = 8 atoms (x, y, z, q)
  –Nearly uniform density of atoms in typical systems
      1 atom per 10 Å³
  –Bins hold atoms from (4 Å)³ of space (example)
  –Number of atoms in a bin varies
      For water test systems, 5.35 atoms in a bin on average
      Some bins overfull
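The sizes on this slide fit together as follows: four 4-byte floats per atom give 16 bytes, eight atoms give 128 bytes, and 128 bytes are 32 four-byte words, one per loading thread. At 1 atom per 10 Å³, a (4 Å)³ = 64 Å³ bin would hold about 6.4 atoms on average, which is below the 8-slot capacity. The sketch below checks the layout in plain C and emulates the cooperative load with a loop; `AtomBin` and `load_bin` are hypothetical names, and the loop index `t` stands in for a CUDA thread index.

```c
#include <assert.h>
#include <string.h>

#define BIN_DEPTH 8
#define WORDS_PER_BIN 32   /* 128 bytes / 4 bytes per float */

typedef struct { float x, y, z, q; } Atom;         /* 4 floats = 16 bytes */
typedef struct { Atom slot[BIN_DEPTH]; } AtomBin;  /* 8 atoms = 128 bytes */

/* Compile-time layout checks (C11). */
_Static_assert(sizeof(Atom) == 16, "an atom should occupy 16 bytes");
_Static_assert(sizeof(AtomBin) == 128, "a bin should occupy 128 bytes");

/* Host-side emulation of the cooperative load: each of 32 "threads"
   copies one 4-byte word of the bin into the destination buffer. */
static void load_bin(const AtomBin *src, float *dst)
{
    const unsigned char *bytes = (const unsigned char *)src;
    for (int t = 0; t < WORDS_PER_BIN; t++)   /* t plays the role of the thread index */
        memcpy(&dst[t], bytes + (size_t)t * sizeof(float), sizeof(float));
}
```

Because consecutive threads touch consecutive words, the 32 loads of one bin cover one contiguous 128-byte range, which is the access pattern the slide associates with full bandwidth.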

Handling Overfull Bins
In typical use, 2.6% of atoms exceed bin capacity
Spatial sorting puts these into a list of extra atoms
Extra atoms are processed by the CPU
  –Computed with a CPU-optimized algorithm
  –Takes about 66% as long as the GPU computation
  –Overlapping GPU and CPU computation yields additional speedup
  –CPU performs final integration of grid data

CPU Grid Data Integration
The effect of overflow atoms is added to the CPU master energygrid array
Slices of grid point values calculated by the GPU are added into the master energygrid array while removing the padded elements
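The integration step can be sketched as an element-wise add that skips the padding. The sketch assumes row-major slices with the padding at the high end of each row and column; the function name `add_slice` and its parameters are hypothetical, not taken from the lecture's code.

```c
#include <assert.h>

/* Add one slice computed on a padded (pad_nx by pad_ny) grid into the
   matching slice of the unpadded (nx by ny) master grid.
   Padded elements (i >= nx or j >= ny) are never read. */
static void add_slice(float *master, int nx, int ny,
                      const float *padded, int pad_nx, int pad_ny)
{
    assert(pad_nx >= nx && pad_ny >= ny);
    for (int j = 0; j < ny; j++)
        for (int i = 0; i < nx; i++)
            master[j * nx + i] += padded[j * pad_nx + i];
}
```

Since the master array already holds the overflow-atom contributions computed on the CPU, using `+=` rather than `=` combines both results.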

GPU Thread Coarsening
Each thread computes potentials at four potential map points
  –Reuse x and z components of the distance calculation
  –Check x and z components against the cutoff distance (cylinder test)
Exit inner loop early upon encountering the first empty slot in a bin

GPU Thread Inner Loop
for (i = 0; i < BIN_DEPTH; i++) {
    aq = AtomBinCache[i].w;
    if (aq == 0) break;                  // exit when an empty atom bin entry is encountered
    dx = AtomBinCache[i].x - x;
    dz = AtomBinCache[i].z - z;
    dxdz2 = dx*dx + dz*dz;
    if (dxdz2 >= cutoff2) continue;      // cylinder test
    dy = AtomBinCache[i].y - y;
    r2 = dy*dy + dxdz2;
    if (r2 < cutoff2)                    // cutoff test and potential value calculation
        poten0 += aq * rsqrtf(r2);       // simplified example
    dy = dy - 2 * grid_spacing;
    /* Repeat three more times */
}
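The step `dy = dy - 2 * grid_spacing` suggests that the four points of a thread are two lattice points apart in y. One thread layout consistent with that, and with 128 threads covering 8 × 8 × 8 points, is an 8 × 2 × 8 block. This layout is an assumption made for illustration and is not stated on the slides. The sketch below checks on the host that such a mapping visits every point of the tile exactly once; all names are hypothetical.

```c
#include <assert.h>

enum { TILE = 8, THREADS_Y = 2, POINTS_PER_THREAD = 4 };

/* y index (0..7) of the p-th map point handled by a thread with y index ty. */
static int point_y(int ty, int p)
{
    assert(ty >= 0 && ty < THREADS_Y);
    assert(p >= 0 && p < POINTS_PER_THREAD);
    return ty + p * THREADS_Y;   /* consecutive points are 2 lattice steps apart */
}

/* Returns 1 if 8 x 2 x 8 threads with 4 points each cover every point of
   the 8 x 8 x 8 tile exactly once, 0 otherwise. */
static int covers_tile_once(void)
{
    int hits[TILE][TILE][TILE] = {{{0}}};
    for (int tz = 0; tz < TILE; tz++)
        for (int ty = 0; ty < THREADS_Y; ty++)
            for (int tx = 0; tx < TILE; tx++)
                for (int p = 0; p < POINTS_PER_THREAD; p++)
                    hits[tz][point_y(ty, p)][tx]++;
    for (int z = 0; z < TILE; z++)
        for (int y = 0; y < TILE; y++)
            for (int x = 0; x < TILE; x++)
                if (hits[z][y][x] != 1) return 0;
    return 1;
}
```

Under this mapping the four points of a thread share x and z, which is what lets the x and z terms of the distance be computed once and reused.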

Cutoff Summation Runtime
50k–1M atom structure size
GPU cutoff with CPU overlap: 12x–21x faster than CPU core

Cutoff Summation Speedup
50k–1M atom structure size
Cutoff summation alone is 9–13× faster than CPU
Diminished overlap benefit due to limited queue size (16 entries)

Cutoff Summation Runtime
GPU cutoff with CPU overlap: 17x–21x faster than CPU core
Avoid overfilling asynchronous stream queues to maintain performance for larger problems
Source: C. Rodrigues, D. Hardy, J. Stone, K. Schulten, W. Hwu. GPU acceleration of cutoff pair potentials for molecular modeling applications. Proceedings of the 2008 Conference on Computing Frontiers, 2008.

Summary
Cutoff pair potentials are heavily used in molecular modeling applications
Use the CPU to regularize the work given to the GPU to optimize its performance
  –GPU performs very well on 64-byte-aligned array data
Run CPU and GPU concurrently to improve performance
Use shared memory as a program-managed cache