©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010 VSCSE Summer School Proven Algorithmic Techniques for Many-core Processors Lecture.

Slides:



Advertisements
Similar presentations
Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.
Advertisements

ECE 598HK Computational Thinking for Many-core Computing Lecture 2: Many-core GPU Performance Considerations © Wen-mei W. Hwu and David Kirk/NVIDIA,
Prepared 5/24/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010 VSCSE Summer School Proven Algorithmic Techniques for Many-core Processors Lecture.
ME964 High Performance Computing for Engineering Applications Spring 2011 Dan Negrut Assistant Professor Department of Mechanical Engineering University.
GPUs. An enlarging peak performance advantage: –Calculation: 1 TFLOPS vs. 100 GFLOPS –Memory Bandwidth: GB/s vs GB/s –GPU in every PC and.
©Wen-mei W. Hwu and David Kirk/NVIDIA 2010 ECE 498HK Computational Thinking for Many-core Computing Lecture 15: Dealing with Dynamic Data.
Reconfigurable Application Specific Computers RASCs Advanced Architectures with Multiple Processors and Field Programmable Gate Arrays FPGAs Computational.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 Structuring Parallel Algorithms.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Chapter.
1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 19, 2011 Emergence of GPU systems and clusters for general purpose High Performance Computing.
3D Graphics Processor Architecture Victor Moya. PhD Project Research on architecture improvements for future Graphic Processor Units (GPUs). Research.
High Performance Computing 1 Parallelization Strategies and Load Balancing Some material borrowed from lectures of J. Demmel, UC Berkeley.
L12: Sparse Linear Algebra on GPUs CS6963. Administrative Issues Next assignment, triangular solve – Due 5PM, Monday, March 8 – handin cs6963 lab 3 ”
Acceleration on many-cores CPUs and GPUs Dinesh Manocha Lauri Savioja.
The Quantum 7 Dwarves Alexandra Kolla Gatis Midrijanis UCB CS
GPU-Qin: A Methodology For Evaluating Error Resilience of GPGPU Applications Bo Fang , Karthik Pattabiraman, Matei Ripeanu, The University of British.
CSE 690 General-Purpose Computation on Graphics Hardware (GPGPU) Courtesy David Luebke, University of Virginia.
gpucomputing.net is a research and development community site dedicated to fostering collaborative and interdisciplinary work on the various disciplines.
Topics Speedup Amdahl’s law Execution timeReadings January 10, 2012 CSCE 713 Computer Architecture.
By Arun Bhandari Course: HPC Date: 01/28/12. GPU (Graphics Processing Unit) High performance many core processors Only used to accelerate certain parts.
Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS Spring 2012.
Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2012.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010 VSCSE Summer School Proven Algorithmic Techniques for Many-core Processors Lecture.
CS 395 Last Lecture Summary, Anti-summary, and Final Thoughts.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 1 Programming Massively Parallel Processors Lecture Slides for Chapter 1: Introduction.
© David Kirk/NVIDIA and Wen-mei W. Hwu Urbana, Illinois, August 10-14, VSCSE Summer School 2009 Many-core processors for Science and Engineering.
GPU Programming and Architecture: Course Overview Patrick Cozzi University of Pennsylvania CIS Spring 2012.
© David Kirk/NVIDIA and Wen-mei W. Hwu Urbana, Illinois, August 18-22, 2008 VSCSE Summer School 2008 Accelerators for Science and Engineering Applications:
1 © 2012 The MathWorks, Inc. Parallel computing with MATLAB.
Programming Concepts in GPU Computing Dušan Gajić, University of Niš Programming Concepts in GPU Computing Dušan B. Gajić CIITLab, Dept. of Computer Science.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 CS 395 Winter 2014 Lecture 17 Introduction to Accelerator.
Emergence of GPU systems and clusters for general purpose high performance computing ITCS 4145/5145 April 3, 2012 © Barry Wilkinson.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 9: Memory Hardware in G80.
1 ECE408/CS483 Applied Parallel Programming Lecture 10: Tiled Convolution Analysis © David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al University.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, ECE 498AL, University of Illinois, Urbana-Champaign ECE408 / CS483 Applied Parallel Programming.
(Short) Introduction to Parallel Computing CS 6560: Operating Systems Design.
FIGURE 11.1 Mapping between OpenCL and CUDA data parallelism model concepts. KIRK CH:11 “Programming Massively Parallel Processors: A Hands-on Approach.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 CMPS 5433 Dr. Ranette Halverson Programming Massively.
SIGGRAPH 2011 ASIA Preview Seminar Rendering: Accuracy and Efficiency Shinichi Yamashita Triaxis Co.,Ltd.
1 Ceng 545 GPU Computing. Grading 2 Midterm Exam: 20% Homeworks: 40% Demo/knowledge: 25% Functionality: 40% Report: 35% Project: 40% Design Document:
1 1 What does Performance Across the Software Stack mean?  High level view: Providing performance for physics simulations meaningful to applications 
Introduction to CUDA (1 of n*) Patrick Cozzi University of Pennsylvania CIS Spring 2011 * Where n is 2 or 3.
1 Workshop 9: General purpose computing using GPUs: Developing a hands-on undergraduate course on CUDA programming SIGCSE The 42 nd ACM Technical.
GPU Programming Shirley Moore CPS 5401 Fall 2013
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/498AL, University of Illinois, Urbana-Champaign 1 ECE 408/CS483 Applied Parallel Programming.
© David Kirk/NVIDIA and Wen-mei W. Hwu Urbana, Illinois, August 10-14, VSCSE Summer School 2009 Many-core Processors for Science and Engineering.
© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 1 CS/EE 217 GPU Architecture and Parallel Programming Project.
Program Optimizations and Recent Trends in Heterogeneous Parallel Computing Dušan Gajić, University of Niš Program Optimizations and Recent Trends in Heterogeneous.
Data Structures and Algorithms in Parallel Computing Lecture 7.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010 VSCSE Summer School Proven Algorithmic Techniques for Many-core Processors Lecture.
Debunking the 100X GPU vs. CPU Myth An Evaluation of Throughput Computing on CPU and GPU Present by Chunyi Victor W Lee, Changkyu Kim, Jatin Chhugani,
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010 VSCSE Summer School Proven Algorithmic Techniques for Many-core Processors Lecture.
Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2014.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 Graphic Processing Processors (GPUs) Parallel.
C OMPUTATIONAL R ESEARCH D IVISION 1 Defining Software Requirements for Scientific Computing Phillip Colella Applied Numerical Algorithms Group Lawrence.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010 VSCSE Summer School Proven Algorithmic Techniques for Many-core Processors Lecture.
1 CS/EE 217 GPU Architecture and Parallel Programming Lecture 9: Tiled Convolution Analysis © David Kirk/NVIDIA and Wen-mei W. Hwu,
Scientific Computing Goals Past progress Future. Goals Numerical algorithms & computational strategies Solve specific set of problems associated with.
©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010 VSCSE Summer School Proven Algorithmic Techniques for Many-core Processors Lecture.
Martin Kruliš by Martin Kruliš (v1.0)1.
Defining the Competencies for Leadership- Class Computing Education and Training Steven I. Gordon and Judith D. Gardiner August 3, 2010.
Mattan Erez The University of Texas at Austin
Key to Scalable Parallelism - Regularity and Locality
VSCSE Summer School Proven Algorithmic Techniques for Many-core Processors Lecture 5: Data Layout for Grid Applications ©Wen-mei W. Hwu and David Kirk/NVIDIA.
ECE408 / CS483 Applied Parallel Programming Lecture 23: Application Case Study – Electrostatic Potential Calculation.
CSC Classes Required for TCC CS Degree
Programming Massively Parallel Processors Lecture Slides for Chapter 9: Application Case Study – Electrostatic Potential Calculation © David Kirk/NVIDIA.
Introduction to Heterogeneous Parallel Computing
Panel on Research Challenges in Big Data
Presentation transcript:

©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010 VSCSE Summer School Proven Algorithmic Techniques for Many-core Processors Lecture 9: Directions for Future Studies and Closing Remarks

Course Objective (Review) To master the most commonly used algorithm techniques and computational thinking skills needed for many-core programming –Especially the simple ones! Specifically, to understand –Many-core hardware limitations and constraints –Desirable and undesirable computation patterns –Commonly used algorithm techniques to convert undesirable computation patterns into desirable ones ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Agenda (Monday) 10:00 – 11:30Lecture 1 – Introduction to computational thinking for many-core computing 12:30 – 2:00Lecture 2 – Scatter-to-Gather transformation for scalability 3:00 – 4:30Lecture 3 - Loop blocking and register tiling for locality Lab 1 – 2D Blocking and Register tiling ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Agenda Review (Tuesday) Lecture 4 – Cut-off and Binning for regular data sets Lecture 5 – Data Layout for Grid Applications Keynote 1 – Michael Garland (NVIDIA) –Algorithm Design for GPU Computing Lab 1 – 2D Blocking and Register tiling Lab 2 - Data Layout Transformation for LBM Methods and Binning for Regular Data Sets ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Agenda Review (Wednesday) ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010 Lecture 6 – Dealing with non-uniform data distribution Lecture 7 – Dealing with dynamic data sets –With guest lecture by Jonathan Cohen (NVIDIA) on PDE Solvers and OpenCurrent Keynote 2 – David Kirk (NVIDIA) –Fermi and Future of GPU Computing Technology Lab 2 - Data Layout Transformation for LBM Methods and Binning for Regular Data Sets Lab 3 – Binning for non-uniform data distribution

Agenda Review (Thursday) ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010 Keynote 3 – Lorena Barba (Boston University) –Multiplying speedups: fast algorithms on GPUs. A case study on biomolecular electrostatics. Lecture 8 – Transformations of key computation patterns –With guest lecture from Jeremy Meredith (Oak Ridge National Lab) on Accelerating HPC Applications with GPUs – Two Case Studies" Hands-on Lab 3 – Data binning for non-uniform data distribution Hands-on Lab wrap-up discussions

7 High-end simulation in the physical sciences = 7 numerical methods : 1.Structured Grids (including locally structured grids, e.g. Adaptive Mesh Refinement) 2.Unstructured Grids 3.Fast Fourier Transform 4.Dense Linear Algebra 5.Sparse Linear Algebra 6.Particles 7.Monte Carlo Well-defined targets from algorithmic, software, and architecture standpoint Phillip Colella’s “Seven dwarfs” If add 4 for embedded, covers all 41 EEMBC benchmarks 8. Search/Sort 9. Filter 10. Combinational logic 11. Finite State Machine Note: Data sizes (8 bit to 32 bit) and types (integer, character) differ, but algorithms the same Slide from “Defining Software Requirements for Scientific Computing”, Phillip Colella, 2004

Seven Techniques in Manycore Programming Scatter to Gather transformation Granularity coarsening and register tiling Tiling for dense matrix data locality Data layout and traversal ordering Binning and cutoff Bin sorting and partitioning for non-uniform data Hierarchical queues and kernels for dynamic data ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

Benefit from other people’s experience GPU Computing Gems Vol 1 –Coming December 2010 –50 gems in 10 applications areas –Scientific simulation, life sciences, statistical modeling, emerging data-intensive applications, electronic design automation, computer vision, ray tracing and rendering, video and imaging processing, signal and audio processing, medical imaging GPU Computing Gems Vol 2 –Coming in April 2011 –50+ gems on more application domains, tools, environments ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

10 High-end simulation in the physical sciences = 7 numerical methods : 1.Structured Grids (including locally structured grids, e.g. Adaptive Mesh Refinement) 2.Unstructured Grids 3.Fast Fourier Transform 4.Dense Linear Algebra 5.Sparse Linear Algebra 6.Particles 7.Monte Carlo Well-defined targets from algorithmic, software, and architecture standpoint Phillip Colella’s “Seven dwarfs” If add 4 for embedded, covers all 41 EEMBC benchmarks 8. Search/Sort 9. Filter 10. Combinational logic 11. Finite State Machine Note: Data sizes (8 bit to 32 bit) and types (integer, character) differ, but algorithms the same Slide from “Defining Software Requirements for Scientific Computing”, Phillip Colella, 2004 Courtesy: David Patterson

How about your computation types? Are your applications very different from all the seven dwarfs? Can you think of any other algorithmic techniques that we should cover? ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

You can do it. Computational thinking is not as hard as you may think it is. –The purpose of the course is to make them accessible to domain scientists and engineers. –With practice, you can make it look easy! ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010

ANY MORE QUESTIONS? ©Wen-mei W. Hwu and David Kirk/NVIDIA Urbana, Illinois, August 2-5, 2010