© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign

CS/EE 217 GPU Architecture and Parallel Programming
Project Kickoff

Two flavors
Application
–Implement/optimize a realistic application on GPGPUs
Architecture
–Evaluate application performance at the architecture level and its sensitivity to architectural features
–Propose new features or algorithms

Two tracks
Basic project – essentially a larger lab
–Should involve significant coding/optimization, but…
–…does not necessarily have to be original or research-related work
–Could just implement one of the default projects
Research project
–Should review related work
–Should have a goal of publishing a paper (even if you don't get there, that is OK)
–More formal project proposal
–In return, you don't have to take the final exam!

Application projects: Objective
Build up your ability to translate parallel computing power into science and engineering breakthroughs
–Identify applications whose computing structures are suitable for massively parallel execution (see the sketch below)
–10-100X more computing power can have a transformative effect on these applications
–Much better if you have a client; you can be the client – bringing an idea from your own research is encouraged
Develop algorithm patterns that can result in both better efficiency and scalability
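To make "suitable computing structure" concrete, here is a minimal sketch (an illustration added here, not from the original slides) of the pattern that maps best onto a GPU: many independent output elements, one thread per element. The kernel name saxpy and the launch parameters are assumptions of the sketch.

#include <cuda_runtime.h>
#include <stdio.h>

// Each output element is independent, so the work spreads across
// thousands of GPU threads with no synchronization between them.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));  // unified memory for brevity
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);  // enough blocks to cover n
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 4.0
    cudaFree(x); cudaFree(y);
    return 0;
}

Applications whose inner loops have this shape are the ones where 10-100X more computing power translates directly into results.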

Future Science and Engineering Breakthroughs Hinge on Computing
Computational Modeling
Computational Chemistry
Computational Medicine
Computational Physics
Computational Biology
Computational Finance
Computational Geoscience
Image Processing

GPU computing is catching on.

280 submissions to GPU Computing Gems; ~90 articles included in two volumes
Financial Analysis
Scientific Simulation
Engineering Simulation
Data Intensive Analytics
Medical Imaging
Digital Audio Processing
Computer Vision
Digital Video Processing
Biomedical Informatics
Electronic Design Automation
Statistical Modeling
Ray Tracing Rendering
Interactive Physics
Numerical Methods

Faster is not "just faster"
2-3X faster is "just faster"
–Do a little more, wait a little less
–Doesn't change how you work
5-10X faster is "significant"
–Worth upgrading
–Worth re-writing (parts of) the application
100X+ faster is "fundamentally different"
–Worth considering a new platform
–Worth re-architecting the application
–Makes new applications possible
–Drives "time to discovery" and creates fundamental changes in science

How much computing power is enough?
Each jump in computing power motivates new ways of computing
–Many apps have approximations or omissions that arose from limitations in computing power
–Every 10X jump in performance allows app developers to innovate
–Examples: graphics, medical imaging, physics simulation, etc.
Application developers do not take it seriously until they see real results.

10 Why didn’t this happen earlier? Computational experimentation is just reaching critical mass –Simulate large enough systems –Simulate long enough system time –Simulate enough details Computational instrumentation is also just reaching critical mass –Reaching high enough accuracy –Cover enough observations

A Great Opportunity for Many
New massively parallel computing is enabling
–Drastic reduction in "time to discovery"
–1st-principles-based simulation at meaningful scale
–New, 3rd paradigm for research: computational experimentation
The "democratization" of the power to discover
–$2,000/Teraflop in personal computers today
–$5,000,000/Petaflop in clusters today
–HW cost will no longer be the main barrier for big science
This is a once-in-a-career opportunity for many!

VMD/NAMD Molecular Dynamics
240X speedup over sequential code
Computational biology

15X with MATLAB CPU+GPU
Pseudo-spectral simulation of 2D isotropic turbulence
MATLAB: Language of Science

Prevalent Performance Limits
Some microarchitectural limits appear repeatedly across the benchmark suite:
Global memory bandwidth saturation
–Tasks with intrinsically low data reuse, e.g., vector-scalar addition or matrix-vector multiplication (see the sketch below)
–Computation with frequent global synchronization, converted to short-lived kernels with low data reuse; common in simulation programs
Thread-level optimization vs. latency tolerance
–Since hardware resources are divided among threads, low per-thread resource use is necessary to furnish enough simultaneously active threads to tolerate long-latency operations
–Making individual threads faster generally increases register and/or shared memory requirements
–Optimizations trade off single-thread speed for exposed latency
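A minimal sketch (an added illustration, not from the slides) of why a task like vector-scalar addition saturates global memory bandwidth: each element is loaded once, used once, and stored once, so there is no reuse for caches or shared memory to exploit. The kernel name and problem size are assumptions.

#include <cuda_runtime.h>

// 4 bytes read + 4 bytes written per single add: arithmetic intensity of
// 1 FLOP per 8 bytes, so memory bandwidth, not compute, sets the limit.
__global__ void vecScalarAdd(int n, float s, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = x[i] + s;  // the loaded value is never reused
}

int main(void) {
    const int n = 1 << 24;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    vecScalarAdd<<<(n + 255) / 256, 256>>>(n, 1.0f, x, y);
    cudaDeviceSynchronize();
    cudaFree(x); cudaFree(y);
    return 0;
}

At, say, 100 GB/s of global bandwidth, 8 bytes of traffic per add caps this kernel near 12.5 GFLOP/s regardless of the GPU's arithmetic peak; no amount of per-thread optimization changes that ceiling.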

What you will likely need to hit hard
Parallelism extraction requires global understanding
–Most programmers only understand parts of an application
Algorithms need to be re-designed
–An algorithm's effect on parallelism and locality is often hard to maneuver
Real but rare dependencies often need to be pushed aside
–Error-checking code, etc.; parallel code is often not equivalent to the sequential code (see the sketch below)
Getting more than a small speedup over sequential code is very tricky
–~20 versions are typically tried for each application
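As an illustration of "real but rare dependencies" (an added sketch, not course material): an early exit on bad input is a loop-carried dependency in sequential code. The usual parallel rewrite records the error in an atomic flag instead, which is deliberately not equivalent – it no longer stops at the first offending element.

#include <cuda_runtime.h>

// Sequential version (serialized by the early return):
//   for (int i = 0; i < n; ++i) {
//       if (x[i] < 0) return ERROR;  // rare, but forces in-order execution
//       y[i] = sqrtf(x[i]);
//   }

// Parallel version: every thread proceeds independently and only records
// that *some* error occurred; the host inspects the flag afterward.
__global__ void sqrtAll(int n, const float *x, float *y, int *errFlag) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] < 0.0f)
            atomicExch(errFlag, 1);  // push the rare dependency aside
        else
            y[i] = sqrtf(x[i]);
    }
}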

Ideas for projects
Your own research! An application you are interested in.
GPU Computing Gems
Parboil benchmark: a collection of the best parallel programming projects for CPU/GPU computing
Rodinia benchmark
Talk to me if you are interested in architecture projects

Architecture Proposals
Either a workload evaluation study running workloads in a simulator
–Explore the impact of different architecture parameters on performance
–Provide insights into what causes the differences in behavior
Or propose and evaluate improvements
–E.g., the warp scheduler competition (warpwavefront-scheduling-championship/)

Project website coming soon
Deadlines and details of deliverables
Standard programming project
Standard architecture project (scheduler competition)

ANY MORE QUESTIONS?

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign