VSCSE Summer School: Proven Algorithmic Techniques for Many-core Processors
Lecture 1: Introduction and Computational Thinking
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Course Objective
To master the most commonly used algorithm techniques and computational thinking skills needed for many-core programming – especially the simple ones!
Specifically, to understand:
– Many-core hardware limitations and constraints
– Desirable and undesirable computation patterns
– Commonly used algorithm techniques to convert undesirable computation patterns into desirable ones
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Course Staff
Instructors – Wen-mei Hwu (UIUC), David Kirk (NVIDIA), John Stratton (UIUC), John Stone (UIUC)
Keynote Speakers – Michael Garland (NVIDIA), David Kirk (NVIDIA), Lorena Barba (BU)
Guest Lecturers – Jonathan Cohen (NVIDIA), Jeremy Meredith (ORNL)
UIUC Teaching Assistants – Xiao-Long Wu, Deepthi Nandakumar, Liwen Chang, Hee-seok Kim, Nasser Anssari
Local Coordinators, Teaching Assistants, and Technical Staff
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Performance Advantage of GPUs
An enlarging peak performance advantage:
– Calculation: 1 TFLOPS vs. 100 GFLOPS
– Memory Bandwidth: GB/s vs. GB/s
– GPU in every PC and workstation – massive volume and potential impact
Courtesy: John Owens
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

[Figure: CPU vs. GPU block diagrams – control logic, cache, ALUs, and DRAM]
CPUs and GPUs have fundamentally different design philosophies.

Harvesting the Performance Benefit of Many-core GPUs Requires:
– Massive parallelism in application algorithms – data parallelism
– Regular computation and data accesses – similar work for parallel threads
– Avoidance of conflicts in critical resources – off-chip DRAM (global memory) bandwidth, conflicting parallel updates to memory locations
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Massive Parallelism - Regularity ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Main Hurdles to Overcome
– Serialization due to conflicting use of critical resources
– Oversubscription of global memory bandwidth
– Load imbalance among parallel threads
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

“CUDA is an elegant solution to the problem of representing parallelism in algorithms, not all algorithms, but enough to matter.”
– V. Natoli, “Kudos for CUDA,” HPCWire
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Computational Thinking Skills
The ability to translate/formulate domain problems into computational models that can be solved efficiently by available computing resources:
– Understanding the relationship between the domain problem and the computational models
– Understanding the strengths and limitations of the computing devices
– Designing the model implementations to steer away from the limitations

DATA ACCESS CONFLICTS ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Conflicting Data Accesses Cause Serialization and Delays
– Massively parallel execution cannot afford serialization
– Contention in accessing critical data causes serialization
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

A Simple Example
A naïve inner-product algorithm on two vectors of one million elements each:
– All multiplications can be done in one time unit (parallel)
– Additions into a single accumulator take one million time units (serial)
[Figure: timeline of parallel multiplications feeding a serial chain of additions]
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010
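To make the serialization concrete, here is a minimal CUDA sketch (illustrative only; the kernel name and launch configuration are not from the course materials) in which every thread folds its product into a single accumulator with atomicAdd. The multiplications run fully in parallel, but the updates to the one shared location contend and serialize, exactly as described above.

    // Naïve inner product: every thread atomically adds into ONE accumulator.
    // The multiplications are parallel, but the atomic updates to *result
    // serialize on the single memory location (the contention point).
    __global__ void naiveDotKernel(const float* a, const float* b, float* result, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            atomicAdd(result, a[i] * b[i]);   // one location, millions of updates
        }
    }

    // Illustrative launch for n elements (result must be zeroed beforehand):
    // naiveDotKernel<<<(n + 255) / 256, 256>>>(d_a, d_b, d_result, n);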

How much can conflicts hurt?
Amdahl’s Law
– If fraction X of a computation is serialized, the speedup cannot be more than 1/X
In the previous example, X = 50%:
– Half the calculations are serialized
– No more than 2X speedup, no matter how many computing cores are used
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010
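For reference, the bound can be written out explicitly (standard Amdahl's law, with serialized fraction X and N cores):

    \[
    \text{speedup}(N) \;=\; \frac{1}{X + \dfrac{1-X}{N}} \;\le\; \frac{1}{X},
    \qquad X = 0.5 \;\Rightarrow\; \text{speedup} \le \frac{1}{0.5} = 2.
    \]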

GLOBAL MEMORY BANDWIDTH ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Global Memory Bandwidth
[Figure: ideal vs. reality of DRAM access patterns]
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Global Memory Bandwidth
Many-core processors have limited off-chip memory access bandwidth compared to peak compute throughput.
GT200:
– 1 TFLOPS peak single-precision throughput
– 150 GB/s peak off-chip memory access bandwidth, i.e., 37.5 G single-precision operands per second
– To achieve peak throughput, a program must perform 1,000/37.5 ≈ 27 floating-point arithmetic operations for each operand value fetched from off-chip memory
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Global Memory Bandwidth
Fermi:
– 1 TFLOPS SPFP peak throughput
– 0.5 TFLOPS DPFP peak throughput
– 144 GB/s peak off-chip memory access bandwidth: 36 G SPFP operands per second, 18 G DPFP operands per second
– To achieve peak throughput, a program must perform 1,000/36 ≈ 28 SPFP (14 DPFP) arithmetic operations for each operand value fetched from off-chip memory
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010
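Both cases are instances of one ratio: the arithmetic operations required per fetched operand equal the peak compute rate divided by the operand-fetch rate. Worked out for single precision on Fermi:

    \[
    \frac{\text{arithmetic ops}}{\text{fetched operand}} \;\ge\;
    \frac{\text{peak FLOP rate}}{\text{bandwidth}/\text{bytes per operand}}
    \;=\; \frac{1000\ \text{GFLOPS}}{144\ \text{GB/s} \,/\, 4\ \text{B}} \;\approx\; 28.
    \]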

A Simple CUDA Kernel for Matrix Multiplication
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and of M
    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and of N
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];   // row-major linearized indexing
    Pd[Row * Width + Col] = Pvalue;
}
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010
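A minimal host-side launch for this kernel might look like the sketch below; it is not part of the slides, and the TILE_WIDTH value, the function name, and the assumption that Width is a multiple of TILE_WIDTH are illustrative.

    #define TILE_WIDTH 16

    // Assumes Md, Nd, Pd are device pointers to Width x Width row-major
    // matrices and that Width is a multiple of TILE_WIDTH.
    void launchMatrixMul(float* Md, float* Nd, float* Pd, int Width)
    {
        dim3 block(TILE_WIDTH, TILE_WIDTH);                    // one thread per output element
        dim3 grid(Width / TILE_WIDTH, Width / TILE_WIDTH);     // enough blocks to cover Pd
        MatrixMulKernel<<<grid, block>>>(Md, Nd, Pd, Width);
        cudaDeviceSynchronize();                               // wait for completion (error checking omitted)
    }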

How about performance on GT200?
[Figure: CUDA memory hierarchy – per-thread registers, per-block shared memory, and grid-wide global/constant memory accessed by the host]
– Two memory accesses (8 bytes) per floating-point multiply-add
– 4 B of memory traffic per FLOP
– 4 B/FLOP × 1,000 GFLOPS = 4,000 GB/s needed to achieve peak FLOP rating
– 150 GB/s limits the code to 37.5 GFLOPS
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

How about performance on Fermi?
[Figure: CUDA memory hierarchy – per-thread registers, per-block shared memory, and grid-wide global/constant memory accessed by the host]
– Two memory accesses (8 bytes) per floating-point multiply-add
– 4 B of memory traffic per FLOP
– 4 B/FLOP × 1,000 GFLOPS = 4,000 GB/s needed to achieve peak SP FLOP rating
– 8 B/FLOP × 500 GFLOPS = 4,000 GB/s needed to achieve peak DP FLOP rating
– 144 GB/s limits the code to 36 SP / 18 DP GFLOPS
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010
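Worked out explicitly, the bandwidth-limited throughput in both cases is the available bandwidth divided by the bytes of memory traffic per FLOP:

    \[
    \text{attainable GFLOPS} \;=\; \frac{\text{bandwidth}}{\text{bytes per FLOP}}:
    \qquad \frac{150}{4} = 37.5 \text{ (GT200)},\quad
    \frac{144}{4} = 36 \text{ (Fermi SP)},\quad
    \frac{144}{8} = 18 \text{ (Fermi DP)}.
    \]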

REDUCING REDUNDANCY ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Computational Efficiency
– Parallel execution sometimes requires doing redundant work
– Too much redundant work can make the parallel version take longer overall
[Figure: timeline comparing unique vs. redundant work for 2-way and 4-way parallel execution]
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Where Redundancy Originates: Direct Coulomb Summation (DCS)
At each lattice point, sum potential contributions for all atoms in the simulated structure:
    potential[j] += charge[i] / (distance to atom[i])
[Figure: lattice point being evaluated and its distance to Atom[i]]
© John Stratton, David Kirk/NVIDIA and Wen-mei W. Hwu, PUMPS Workshop, Barcelona, España

DCS Initial Kernel Structure
…
float curenergy = energygrid[outaddr];   // start global mem read very early
float coorx = gridspacing * xindex;
float coory = gridspacing * yindex;
int atomid;
float energyval = 0.0f;
for (atomid = 0; atomid < numatoms; atomid++) {
    float dx = coorx - atominfo[atomid].x;
    float dy = coory - atominfo[atomid].y;
    // atominfo[atomid].z holds the precomputed dz*dz for this slice; .w holds the charge
    energyval += atominfo[atomid].w * (1.0f / sqrtf(dx*dx + dy*dy + atominfo[atomid].z));
}
energygrid[outaddr] = curenergy + energyval;
© John Stratton, David Kirk/NVIDIA and Wen-mei W. Hwu, PUMPS Workshop, Barcelona, España

Implementation Analysis
– Parallelism: each potential lattice point is processed by a different thread
– Redundancy: the distance calculation (dx^2 + dy^2 + dz^2)
    – One kernel invocation processes all elements of a fixed z coordinate, a 2D “slice” of the full 3D lattice
    – dz^2 is precomputed for every atom
– The primary bottleneck of the first kernel is instruction issue:
    – Too many instructions for indexing and loop counting
    – Too much redundant computation among threads in x and y
© John Stratton, David Kirk/NVIDIA and Wen-mei W. Hwu, PUMPS Workshop, Barcelona, España
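One standard remedy, covered later in the course as granularity (thread) coarsening, is to let each thread evaluate several adjacent lattice points so that each atom's data is loaded once and the y/z distance term is computed once per atom. The kernel below is only an illustrative sketch: the function name, parameters, and the choice of four adjacent points per thread are assumptions, not the actual optimized DCS kernel.

    // Illustrative coarsened DCS kernel: each thread evaluates COARSEN adjacent
    // lattice points along x, so each atom's data is loaded once and the
    // dy*dy + dz*dz term is computed once per atom instead of COARSEN times.
    #define COARSEN 4

    __global__ void dcsCoarsenedKernel(float* energygrid, const float4* atominfo,
                                       int numatoms, float gridspacing,
                                       int gridwidth, int sliceoffset)
    {
        // First of COARSEN consecutive x positions handled by this thread
        int xindex = (blockIdx.x * blockDim.x + threadIdx.x) * COARSEN;
        int yindex = blockIdx.y * blockDim.y + threadIdx.y;
        float coorx = gridspacing * xindex;
        float coory = gridspacing * yindex;

        float energy[COARSEN] = {0.0f};   // remaining elements zero-initialized
        for (int atomid = 0; atomid < numatoms; atomid++) {
            float dy   = coory - atominfo[atomid].y;
            float dyz2 = dy * dy + atominfo[atomid].z;     // dz^2 precomputed into .z
            for (int j = 0; j < COARSEN; j++) {
                float dx = coorx + j * gridspacing - atominfo[atomid].x;
                energy[j] += atominfo[atomid].w * rsqrtf(dx * dx + dyz2);  // reciprocal sqrt
            }
        }
        int outaddr = sliceoffset + yindex * gridwidth + xindex;
        for (int j = 0; j < COARSEN; j++)
            energygrid[outaddr + j] += energy[j];
    }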

LOAD BALANCE ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Load Balance
The total amount of time to complete a parallel job is limited by the thread that takes the longest to finish.
[Figure: good vs. bad load balance across threads]
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

How bad can it be?
Assume that a job takes 100 units of time for one person to finish.
– If we break the job into 10 parts of 10 units each and have 10 people do them in parallel, we get a 10X speedup.
– If we break the job into parts of 50, 10, 5, 5, 5, 5, 5, 5, 5, and 5 units, the same 10 people will take 50 units to finish, with 9 of them idle most of the time. We get no more than a 2X speedup.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

How does imbalance come about?
– Non-uniform data distributions: highly concentrated spatial data areas
– Common in astronomy, medical imaging, computer vision, rendering, …
– If each thread processes the input data of a given spatial volume unit, some threads will do a lot more work than others
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Known Algorithm Techniques
– Increasing locality in dense arrays: tiling of data access and layout
– Improving efficiency and vectorization in dense arrays: granularity coarsening
– Reducing output interference: conversion from scatter to gather; parallelizing reductions and histograms
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010
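As a concrete instance of the tiling technique named above, the simple matrix-multiplication kernel shown earlier can stage TILE_WIDTH × TILE_WIDTH tiles of Md and Nd in shared memory so that each fetched value is reused TILE_WIDTH times. This is a sketch that assumes Width is a multiple of TILE_WIDTH; the names follow the earlier kernel.

    #define TILE_WIDTH 16

    // Tiled matrix multiplication: each block cooperatively loads one tile of Md
    // and one tile of Nd into shared memory, then every thread reuses those
    // values TILE_WIDTH times, cutting global-memory traffic by ~TILE_WIDTH.
    __global__ void MatrixMulTiled(float* Md, float* Nd, float* Pd, int Width)
    {
        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

        int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;
        float Pvalue = 0;

        for (int m = 0; m < Width / TILE_WIDTH; ++m) {
            // Collaborative loading of one Md tile and one Nd tile
            Mds[threadIdx.y][threadIdx.x] = Md[Row * Width + (m * TILE_WIDTH + threadIdx.x)];
            Nds[threadIdx.y][threadIdx.x] = Nd[(m * TILE_WIDTH + threadIdx.y) * Width + Col];
            __syncthreads();

            for (int k = 0; k < TILE_WIDTH; ++k)
                Pvalue += Mds[threadIdx.y][k] * Nds[k][threadIdx.x];
            __syncthreads();
        }
        Pd[Row * Width + Col] = Pvalue;
    }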

Known Algorithm Techniques (cont.)
– Dealing with non-uniform data: data sorting and binning
– Dealing with sparse data: sorting and packing
– Dealing with dynamic data: parallel queue-based algorithms
– Improving data efficiency in large data traversals: stencil and other grid-based computation
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010
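For the data sorting and binning item above, a minimal sketch of the usual first pass is shown below; the names and the 1D uniform grid are assumptions for illustration. Each thread computes its point's cell index and counts points per cell with an atomic increment; a prefix sum and a scatter pass (not shown) would then place the points into their bins.

    // Pass 1 of binning: count how many points fall into each cell of a uniform
    // 1D grid. binCounts must be zero-initialized; a later prefix-sum + scatter
    // pass (not shown) builds the actual per-bin point lists.
    __global__ void countPointsPerBin(const float* x, int n,
                                      float xmin, float binWidth, int numBins,
                                      int* binCounts)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            int bin = (int)((x[i] - xmin) / binWidth);
            if (bin < 0) bin = 0;
            if (bin >= numBins) bin = numBins - 1;   // clamp boundary points
            atomicAdd(&binCounts[bin], 1);           // contention only within a bin
        }
    }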

You can do it.
Computational thinking is not as hard as you may think it is.
– Most techniques have been explained, if at all, only at the level of computer experts.
– The purpose of this course is to make them accessible to domain scientists and engineers.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Agenda (Monday)
10:00 – 11:30  Lecture 1 – Introduction to computational thinking for many-core computing
12:30 – 2:00   Lecture 2 – Scatter-to-Gather transformation for scalability
3:00 – 4:30    Lecture 3 – Loop blocking and register tiling for locality
Lab 1 – 2D Blocking and Register Tiling
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Agenda (Tuesday)
Lecture 4 – Cut-off and Binning for Regular Data Sets
Lecture 5 – Data Layout for Grid Applications
Keynote 1 – Michael Garland (NVIDIA): Algorithm Design for GPU Computing
Lab 1 – 2D Blocking and Register Tiling
Lab 2 – Data Layout Transformation for LBM Methods and Binning for Regular Data Sets
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Agenda (Wednesday)
Lecture 6 – Dealing with non-uniform data distribution
Lecture 7 – Dealing with dynamic data sets, with guest lecture by Jonathan Cohen (NVIDIA) on PDE Solvers and OpenCurrent
Keynote 2 – David Kirk (NVIDIA): Fermi and the Future of GPU Computing Technology
Lab 2 – Data Layout Transformation for LBM Methods and Binning for Regular Data Sets
Lab 3 – Binning for non-uniform data distribution
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Agenda (Thursday)
Keynote 3 – Lorena Barba (Boston University): Multiplying speedups – fast algorithms on GPUs, a case study on biomolecular electrostatics
Lecture 8 – Transformations of key computation patterns, with guest lecture by Jeremy Meredith (Oak Ridge National Lab) on “Accelerating HPC Applications with GPUs – Two Case Studies”
Hands-on Lab 3 – Data binning for non-uniform data distribution
Hands-on lab wrap-up discussions
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Agenda (Friday)
Lecture 9 – Directions for Further Studies
Closing remarks
Certificate and student feedback online survey
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

ANY MORE QUESTIONS? ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010