©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010 VSCSE Summer School Proven Algorithmic Techniques for Many-core Processors Lecture.


Lecture 7: Dealing with Non-Uniform Data

MRI Reconstruction
 The goal is to transform MR data samples from k-space into the image space using an inverse FFT (IFFT)
 MRI scanners increasingly use spiral trajectories in a cylindrical or spherical coordinate system ⟹ the image cannot be reconstructed by directly applying the IFFT to the non-Cartesian k-space samples

Gridding to Enable IFFT
 Instead, map the non-Cartesian samples in the frequency domain onto a 3D Cartesian grid, weighting each sample with a Kaiser-Bessel window function (gridding)
 Then perform an IFFT on the grid to transform it into the image domain

Large Number of Sample Points
Each sample s contains
–its k-space coordinates, s.coordinates
–its strength value, s.value
Given s.coordinates, it is easy to calculate the range of grid points affected
–determined by the cutoff distance

An Input-Oriented Sequential Code
 Uses a window function

for (every sample point s) {
  for (z in range) {
    for (y in range) {
      for (x in range) {
        weight = kaiser_bessel(|s.coordinates - (x, y, z)|);
        grid[z][y][x] += s.value * weight;
      }
    }
  }
}

Binning of Sample Points
For simplicity, we will use 1D gridding examples
Each sample point has
–s.x (will be represented by its bin number)
–s.value (will be omitted unless necessary)
[Figure: 1D sample points assigned to Bins 0-4, with the cutoff distance marked]

A Scatter Parallelization
Use each thread to process N sample points
Use global-memory atomic operations to accumulate into grid points
Each sample point affects all grid points within the cutoff distance
Slow, but not pathologically slow
Fermi runs this faster than its predecessors

A Faster Scatter Implementation
Algorithm:
–Sort the input data
–Each block processes one section of sample points; each thread processes a smaller subsection
–Create a window of grid points in shared memory and accumulate into it when possible; a quick inspection of the end sample points of a section determines its window
–Use atomic operations to coordinate across threads
Regularization helps scatter too.

Limited Space in Shared Memory
The sample sections vary in the span of grid points they cover
–The algorithm works best when sample points are mostly concentrated in small grid regions
Limit the size of the grid-point window because of the shared-memory capacity and the copy-merge overhead
Moderate level of contention

Accesses to Global Memory
A test of the bin value against the window span determines whether an affected grid point is in the shared-memory window
–If not, use an atomic operation to accumulate into global memory
–Low contention
At the end of block execution, threads collectively merge the window back into global memory
–Use atomic operations
–Moderate to low contention

How About Gather Parallelization?
No contention
We know we can bin the sample points
However, there can be great load imbalance
–Some grid points are affected by many more sample points than others

A Binned Gather Parallelization
Use each thread to compute the value of N grid points
Pre-sort the sample points into fixed-size bins
Each thread reads only the relevant input bins
[Figure: each grid point gathers from the nearby fixed-size bins; X marks empty (dummy) bin slots]
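A sequential sketch of the binned gather, one call per "thread" (in the CUDA version each thread would run this for its own grid points). The sentinel value for dummy slots, the triangular window, and the capacity of four samples per bin are illustrative assumptions; the sentinel assumes real sample values are non-negative.

```c
#include <math.h>

#define GRID_SIZE 8
#define BIN_CAP   4        /* fixed bin capacity; unused slots are dummies */
#define CUTOFF    1.5f
#define DUMMY     (-1.0f)  /* sentinel value marking an empty bin slot */

typedef struct { float x; float value; } Sample;

static float window(float dist) {
    return dist < CUTOFF ? 1.0f - dist / CUTOFF : 0.0f;
}

/* Gather for one grid point: read only the bins whose samples could lie
   within the cutoff distance, skipping dummy slots. No atomics are
   needed because each grid point is written by exactly one thread. */
float gather_point(int x, const Sample bins[GRID_SIZE][BIN_CAP]) {
    float acc = 0.0f;
    int lo = x - (int)ceilf(CUTOFF); if (lo < 0) lo = 0;
    int hi = x + (int)ceilf(CUTOFF); if (hi > GRID_SIZE - 1) hi = GRID_SIZE - 1;
    for (int b = lo; b <= hi; b++)
        for (int i = 0; i < BIN_CAP; i++) {
            if (bins[b][i].value == DUMMY) continue;   /* dummy cell */
            acc += bins[b][i].value * window(fabsf(bins[b][i].x - (float)x));
        }
    return acc;
}
```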

A Tiled Gather Implementation
[Figure: a tile of grid points gathers from the fixed-size bins loaded into shared memory; X marks dummy bin slots]

More on Tiled Gather
Threads cooperate to load the bins relevant to all of them from global memory into shared memory
Each thread then accesses only its relevant bins from shared memory
Uniform binning of a non-uniform distribution causes
–Large memory overhead for dummy cells
–Reduced benefit of tiling
–Many threads spending much time on dummy sample points

Implicit Binning for Gather Parallelization
 Don't use pre-allocated fixed-size bins (a multi-dimensional array)
 Instead, sort the samples into bins of varying sizes within the input array
 Empty bins are implicit: here, Bins 5, 6, and 8 simply receive no samples from the sort
[Figure: sorted 1D input array spanning variable-size Bins 0-4, 7, and 9]

Determine Start and End of Bins
Use parallel scan operations to generate an array of the starting offsets of all bins (e.g., with CUDPP)
[Figure: starting offsets of Bins 0-4 into the sorted sample array]
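The scan here is an exclusive prefix sum over the per-bin sample counts. CUDPP performs it in parallel on the GPU; the underlying logic, as a sequential sketch with illustrative names, is:

```c
/* Exclusive prefix sum over per-bin counts: starts[b] is the offset of
   Bin b's first sample in the sorted sample array, and
   starts[b] + counts[b] is one past its last sample. */
void bin_starts(const int *counts, int nbins, int *starts) {
    int sum = 0;
    for (int b = 0; b < nbins; b++) {
        starts[b] = sum;
        sum += counts[b];
    }
}
```

Empty bins get a start offset equal to the next bin's start, so a zero-sample bin costs nothing to skip at gather time.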

A Tiled Gather Implementation
[Figure: the tiled gather revisited, loading variable-size (implicit) bins into shared memory]

Controlling Load Balance
Limit the size of each bin
–Applies to both uniform and variable (implicit) bins
–The CPU places excess sample points into a CPU list
–The CPU performs gridding on the excess sample points in parallel with the GPU
–The results are eventually merged

Determining the bin-size limit can be tricky.
A higher limit creates more load imbalance on the GPU
A lower limit may cause too much CPU execution time
What is the best bin size?
[Figure: sort-reduce work split between CPU and GPU as the bin-size limit varies]

There is a range of good bin sizes for each processor.

Performance Chart
Partitioning achieves a 43% speedup over sort-reduce without partitioning

Algorithm                                            Time (sec.)   Speedup
Sequential                                           23.44         1X
Shared Atomic                                                      X
Sort-Reduce                                                        X
Sort-Reduce + Partitioning (bin-size limit = 128)                  X

Determining the Appropriate Algorithm

The answer is: it depends. There is no panacea here; the best algorithm and the optimal bin size depend on the input data distribution!

ANY FURTHER QUESTIONS?