Final Project Notes
ECE 498AL, University of Illinois, Urbana-Champaign
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007

Objective

Identify applications whose computing structures are suitable for massively parallel computing:
– These applications can be revolutionized by 100X more computing power
– You have access to the expertise needed to tackle these applications

Develop algorithm patterns that deliver both better efficiency and better hardware utilization.

Future Science and Engineering Breakthroughs Hinge on Computing

Computational modeling, computational chemistry, computational medicine, computational physics, computational biology, computational finance, computational geoscience, and image processing.

Faster is not “just faster”

2–3X faster is “just faster”:
– Do a little more, wait a little less
– Doesn’t change how you work

5–10X faster is “significant”:
– Worth upgrading
– Worth re-writing (parts of) the application

100X+ faster is “fundamentally different”:
– Worth considering a new platform
– Worth re-architecting the application
– Makes new applications possible
– Drives “time to discovery” and creates fundamental changes in science

How much computing power is enough?

Each jump in computing power motivates new ways of computing:
– Many apps have approximations or omissions that arose from limitations in computing power
– Every 100X jump in performance allows app developers to innovate
– Examples: graphics, medical imaging, physics simulation, etc.

A Great Opportunity for Many

Massively parallel computing allows:
– Drastic reduction in “time to discovery”
– A new, third paradigm for research: computational experimentation
– The “democratization of supercomputing”: $2,000 per teraflop (SPFP) in personal computers today, $5,000,000 per petaflop (DPFP) in clusters in two years; HW cost will no longer be the main barrier for big science
– This is a once-in-a-career opportunity for many!

The Pyramid of Parallel Programming

– Thousand-node systems with MPI-style programming: >100 TFLOPS, $M machines, allocated machine time (programmers number in the hundreds)
– Hundred-core systems with CUDA-style programming: 1–5 TFLOPS, $K machines, widely available (programmers number in the tens of thousands)
– Hundred-core systems with MATLAB-style programming: GFLOPS-class, $K machines, widely available (programmers number in the millions)

UIUC ECE 498AL Spring 2007 Projects

(Source and Kernel give lines of code; % time is the fraction of execution time spent in the kernel.)

Application | Description | Source | Kernel | % time
H.264 | SPEC ’06 version, change in guess vector | 34,811 | 194 | 35%
LBM | SPEC ’06 version, change to single precision and print fewer reports | 1,481 | 285 | >99%
RC5-72 | Distributed.net RC5-72 challenge client code | 1,979 | 218 | >99%
FEM | Finite element modeling, simulation of 3D graded materials | 1,874 | 146 | 99%
RPES | Rye Polynomial Equation Solver, quantum chemistry, 2-electron repulsion | 1,104 | 281 | 99%
PNS | Petri Net simulation of a distributed system | 322 | 160 | >99%
SAXPY | Single-precision implementation of saxpy, used in Linpack’s Gaussian elim. routine | 952 | 31 | >99%
TRACF | Two Point Angular Correlation Function | 536 | 98 | 96%
FDTD | Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation | 1,365 | 93 | 16.4%
MRI-Q | Computing a matrix Q, a scanner’s configuration, in MRI reconstruction | 490 | 33 | >99%
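
Of these, SAXPY is the simplest kernel to picture. A minimal CUDA sketch of single-precision y = a*x + y (illustrative only, not the course project’s code):

    // Minimal single-precision SAXPY (y = a*x + y).
    // Illustrative sketch; the actual project code differs.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                       // guard: the grid may overshoot n
            y[i] = a * x[i] + y[i];
    }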

Speedup of GPU-Accelerated Functions

GeForce 8800 GTX vs. 2.2 GHz Opteron 248:
– 10× speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads
– 25× to 400× speedup if the function’s data requirements and control flow suit the GPU and the application is optimized
– Keep in mind that the speedup also reflects how suitable the CPU is for executing the kernel
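
The two caveats above, occupying enough threads and counting what the baseline measures, both show up in the host-side harness. A minimal sketch, assuming the saxpy kernel above (function name and sizes are illustrative):

    #include <cuda_runtime.h>

    // Allocate, copy in, launch one thread per element, copy back.
    // The transfers count against application speedup even when the
    // kernel itself runs 10x or more faster than the CPU loop.
    void run_saxpy(int n, float a, const float *x, float *y)
    {
        float *d_x, *d_y;
        size_t bytes = n * sizeof(float);
        cudaMalloc((void **)&d_x, bytes);
        cudaMalloc((void **)&d_y, bytes);
        cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

        int threads = 256;                          // typical block size
        int blocks = (n + threads - 1) / threads;   // cover all n elements
        saxpy<<<blocks, threads>>>(n, a, d_x, d_y);

        cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_x);
        cudaFree(d_y);
    }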

Student Parallel Programming Results

(Simult. threads = simultaneously resident threads; Kernel and App columns are kernel-only and whole-application speedups.)

App | Architecture bottleneck | Simult. threads | Kernel speedup | App speedup
LBM | Shared memory capacity | 3,200 | 12.5× | 12.3×
RC5-72 | Registers | 3,072 | 17.1× | 11.0×
FEM | Global memory bandwidth | 4,096 | 11.0× | 10.1×
RPES | Instruction issue rate | 4,096 | 210× | 79.4×
PNS | Global memory capacity | 2,048 | 24.0× | 23.7×
LINPACK | Global memory bandwidth, CPU-GPU data transfer | 12,288 | 19.4× | 11.8×
TRACF | Shared memory capacity | 4,096 | 60.2× | 21.6×
FDTD | Global memory bandwidth | 1,365 | 10.5× | 1.16×
MRI-Q | Instruction issue rate | 8,192 | 457× | 431×

[HKR HotChips-2007]
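
To make one of these bottlenecks concrete: shared memory capacity (the limiter for LBM and TRACF) caps thread residency through simple division. A hypothetical kernel with assumed sizes illustrates the arithmetic:

    // Hypothetical tiled kernel: the per-block __shared__ declaration,
    // not the thread count, is what limits residency here.
    __global__ void tiled_scale(const float *in, float *out, int n)
    {
        __shared__ float tile[2048];        // 2048 * 4 B = 8 KB per block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // stage in shared memory
        __syncthreads();                    // every thread reaches the barrier
        if (i < n)
            out[i] = 0.5f * tile[threadIdx.x];       // placeholder computation
    }

G80 provides 16 KB of shared memory per SM, so at 8 KB per block at most two blocks are resident per SM; with 256-thread blocks that is only 512 simultaneous threads, far below the thousands that latency hiding calls for.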

Magnetic Resonance Imaging

3D MRI image reconstruction from non-Cartesian scan data is very accurate, but compute-intensive.

416× speedup in MRI-Q (267.6 minutes on the CPU, 36 seconds on the GPU):
– CPU: Athlon with fast math library

The MRI code runs efficiently on the GeForce 8800:
– High floating-point operation throughput, including trigonometric functions
– Fast memory subsystems: larger register file; threads simultaneously load the same value from constant memory; access coalescing, producing < 1 memory access per thread per loop iteration
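
The constant-memory point is the heart of the MRI-Q kernel: every thread in a warp reads the same scan-point value on each loop iteration, which the constant cache services as a single broadcast. A simplified sketch of the pattern (array names, chunk size, and the cosine-only accumulation are assumptions, not the actual MRI-Q source):

    #define KCHUNK 256   // scan points per kernel launch (assumed size)

    // Scan-point coordinates in constant memory: all threads of a warp
    // read the same element per iteration, so the constant cache
    // broadcasts one fetch to the whole warp.
    __constant__ float kx[KCHUNK], ky[KCHUNK], kz[KCHUNK];

    __global__ void compute_q(int nVoxels, const float *x, const float *y,
                              const float *z, float *Qr)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;  // one voxel per thread
        if (t >= nVoxels) return;
        float xt = x[t], yt = y[t], zt = z[t];          // coalesced loads
        float q = Qr[t];
        for (int k = 0; k < KCHUNK; ++k) {
            float arg = kx[k] * xt + ky[k] * yt + kz[k] * zt;
            q += cosf(arg);   // trig throughput the 8800 supplies in hardware
        }
        Qr[t] = q;
    }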

Computing Q: Performance

GPU (V8): 96 GFLOPS, a 416× speedup over the CPU (V6) at 230 MFLOPS (0.23 GFLOPS × 416 ≈ 96 GFLOPS). The total increase in performance from sequential double precision to parallel single precision is 1800×.

MRI Optimization Space Search

[Plot: execution time vs. unroll factor; each curve corresponds to a tiling factor and a thread granularity. The GPU-optimal configuration reaches 145 GFLOPS.]

What you will likely need to hit hard

Parallelism extraction requires global understanding:
– Most programmers only understand parts of an application

Algorithms need to be re-designed:
– Programmers benefit from a clear view of an algorithm’s effect on parallelism

Real but rare dependencies often need to be ignored:
– Because of error-checking code and the like, parallel code is often not strictly equivalent to the sequential code (see the sketch below)

Getting more than a small speedup over sequential code is very tricky:
– Roughly 20 versions were typically tried per application to move away from architecture bottlenecks
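
As a contrived illustration of the rare-dependency point (names and the error-handling policy are assumptions, not course code), consider a sequential loop whose only cross-iteration state is an error counter. The parallel version keeps the check but resolves it with an atomic instead of serializing:

    // Sequential original:
    //   for (int i = 0; i < n; ++i) {
    //       if (in[i] < 0.0f) { errors++; continue; }   // rare path
    //       out[i] = sqrtf(in[i]);
    //   }
    //
    // Parallel version: same check, but the counter becomes an atomic.
    // (atomicAdd on global memory needs compute capability 1.1 or later.)
    __global__ void sqrt_all(int n, const float *in, float *out, int *errors)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (in[i] < 0.0f) {
            atomicAdd(errors, 1);    // rare, so contention is negligible
            return;
        }
        out[i] = sqrtf(in[i]);
    }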