L13: Review for Midterm

Administrative
– Project proposals due Friday at 5PM (hard deadline)
– No makeup class Friday!
– March 23: Guest Lecture, Austin Robison, NVIDIA. Topic: Interoperability between CUDA and Rendering on GPUs
– March 25: MIDTERM, in class

Outline
Questions on proposals?
– Discussion of MPM/GIMP issues
Review for Midterm
– Describe planned exam
– Go over syllabus
– Review L4: execution model

Reminder: Content of Proposal, MPM/GIMP as Example
I. Team members: Name and a sentence on expertise for each member
– For MPM/GIMP: obvious
II. Problem description
– What is the computation and why is it important?
– Abstraction of computation: equations, graphic or pseudo-code, no more than 1 page
– For MPM/GIMP: straightforward adaptation from the MPM presentation and/or code
III. Suitability for GPU acceleration
– Amdahl's Law: describe the inherent parallelism. Argue that it is close to 100% of the computation. Use measurements from CPU execution of the computation if possible (see the formula below)
– For MPM/GIMP: can measure the sequential code; remove the "history" function. Phil will provide us with a scaled-up computation that fits in 512MB
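
For reference (the standard formula, not on the original slide): Amdahl's Law bounds the achievable speedup on N processors as

    Speedup(N) = 1 / ((1 − P) + P / N)

where P is the fraction of execution time that parallelizes. This is why the proposal asks you to argue that P is close to 100%: with P = 0.95, speedup can never exceed 20x no matter how many cores the GPU has.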

Reminder: Content of Proposal, MPM/GIMP as Example
III. Suitability for GPU acceleration, cont.
– Synchronization and Communication: Discuss what data structures may need to be protected by synchronization, or communication through the host.
– For MPM/GIMP: some challenges on boundaries between nodes in the grid
– Copy Overhead: Discuss the data footprint and anticipated cost of copying to/from host memory.
– For MPM/GIMP: measure grid and patches to discover the data footprint; consider ways to combine computations to reduce copying overhead
IV. Intellectual Challenges
– Generally, what makes this computation worthy of a project? Importance of the computation, and challenges in partitioning the computation, dealing with scope, and managing copying overhead
– Point to any difficulties you anticipate at present in achieving high speedup
– For MPM/GIMP: see previous

Midterm Exam
– Goal is to reinforce understanding of CUDA and NVIDIA architecture
– Material will come from lecture notes and assignments
– In class; should not be difficult to finish

Parts of Exam
I. Definitions
– A list of 10 terms you will be asked to define
II. Constraints
– Understand constraints on numbers of threads, blocks, warps, and size of storage
III. Problem Solving
– Derive distance vectors for sequential code and use these to transform code to CUDA, making use of constant memory
– Given some CUDA code, indicate whether global memory accesses will be coalesced and whether there will be bank conflicts in shared memory
– Given some CUDA code, add synchronization to derive a correct implementation (see the sketch after this list)
– Given some CUDA code, provide an optimized version that will have fewer divergent branches
– Given some CUDA code, derive a partitioning into threads and blocks that does not exceed various hardware limits
IV. (Brief) Essay Question
– Pick one from a set of 4
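
As one concrete illustration of the synchronization item above (a minimal sketch, not an actual exam question; the kernel and its names are invented):

    // Each block reverses its tile of the input through shared memory.
    // Assumes gridDim.x * blockDim.x == n and blockDim.x <= 512.
    __global__ void reverseEachBlock(float *d_out, const float *d_in)
    {
        __shared__ float tile[512];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = d_in[i];    // each thread fills one slot
        __syncthreads();                // without this barrier, the read
                                        // below may see an unwritten slot
        d_out[i] = tile[blockDim.x - 1 - threadIdx.x];
    }

A missing __syncthreads() here is exactly the kind of bug the exam asks you to find: threads in other warps may not have written their slot yet when a thread reads it.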

How Much? How Many?
– How many threads per block? Max 512
– How many blocks per grid? Max 65535 per grid dimension
– How many threads per warp? 32
– How many warps per multiprocessor? 24
– How much shared memory per streaming multiprocessor? 16 KB
– How many registers per streaming multiprocessor? 8192
– Size of constant cache: 8 KB
(A device-query sketch follows.)
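
If you would rather check these numbers on your own card than memorize them, the CUDA runtime can report them. A small sketch, assuming device 0; note that totalConstMem reports the full constant memory size, not the 8 KB cache:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0
        printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("max grid dim x:        %d\n", prop.maxGridSize[0]);
        printf("warp size:             %d\n", prop.warpSize);
        printf("shared mem per block:  %zu bytes\n", prop.sharedMemPerBlock);
        printf("registers per block:   %d\n", prop.regsPerBlock);
        printf("total constant memory: %zu bytes\n", prop.totalConstMem);
        return 0;
    }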

Syllabus
L1 & L2: Introduction and CUDA Overview
– Not much there…
L3: Synchronization and Data Partitioning
– What does __syncthreads() do?
– Indexing to map portions of a data structure to a particular thread
L4: Hardware and Execution Model
– How are threads in a block scheduled?
– How are blocks mapped to streaming multiprocessors?
L5: Dependence Analysis and Parallelization
– Constructing distance vectors
– Determining if parallelization is safe
L6: Memory Hierarchy I: Data Placement
– What are the different memory spaces on the device, and who can read/write them?
– How do you tell the compiler that something belongs in a particular memory space? (L3 and L6 are both illustrated in the sketch below)
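
A minimal sketch tying the L3 and L6 points together (kernel and variable names invented for illustration): the index expression maps one element to each thread, and the qualifiers place data in specific memory spaces.

    __constant__ float c_scale;   // constant memory: host writes it, kernels read it

    __global__ void scaleArray(float *d_data, int n)
    {
        // L3-style indexing: one array element per thread
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            d_data[idx] *= c_scale;   // all threads read the same value,
                                      // served from the constant cache
    }

On the host, cudaMemcpyToSymbol(c_scale, &h_scale, sizeof(float)) sets the constant before launch.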

Syllabus
L7: Memory Hierarchy II: Reuse and Tiling
– Safety and profitability of tiling
L8: Memory Hierarchy III: Memory Bandwidth
– Understanding global memory coalescing (for compute capability 1.2)
– Understanding memory bank conflicts
L9: Control Flow
– Divergent branches (see the sketch after this slide)
– Execution model
L10: Floating Point
– Intrinsics vs. arithmetic operations: which is more precise?
– What operations can be performed in 4 cycles, and which operations take longer?
L11: Tools: Occupancy Calculator and Profiler
– How do they help you?
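
To make the L9 point concrete, a hedged sketch (not from the lecture) of the divergent-branch rewrite the exam describes: the first kernel splits every 32-thread warp down the middle, while the second branches at warp granularity so each warp takes a single path.

    __global__ void divergent(float *d)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0)        // even/odd lanes diverge inside every warp
            d[i] *= 2.0f;
        else
            d[i] += 1.0f;
    }

    __global__ void lessDivergent(float *d)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((threadIdx.x / 32) % 2 == 0) // whole warps take the same path
            d[i] *= 2.0f;
        else
            d[i] += 1.0f;
    }

The two kernels apply the operations to different elements, so this illustrates only the branching pattern, not an equivalent rewrite.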

Next Time
– March 23: Guest Lecture, Austin Robison
– March 25: MIDTERM, in class