L15: Review for Midterm

Presentation transcript:

L15: Review for Midterm

Administrative
- Project proposals due today at 5PM (hard deadline)
  - handin cs6963 prop
- March 31, MIDTERM in class

Outline
- Questions on proposals?
- More on strsm
- Review for Midterm
  - Describe planned exam
  - Go over syllabus

Midterm Exam
- March 31
- Goal is to reinforce understanding of CUDA and NVIDIA architecture
- Material will come from lecture notes and assignments
- In class, should not be difficult to finish
- Open notes, but no computers

Parts of Exam
I. Definitions
  - A list of 10 terms you will be asked to define
II. Constraints and Architecture
  - Understand constraints on numbers of threads, blocks, warps, size of storage
  - Understand basic GPU architecture: processors and memory hierarchy
III. Problem Solving
  - Analyze data dependences and data reuse in code and use this to guide CUDA parallelization and memory hierarchy mapping
  - Given some CUDA code, indicate whether global memory accesses will be coalesced and whether there will be bank conflicts in shared memory
  - Given some CUDA code, add synchronization to derive a correct implementation
  - Given some CUDA code, provide an optimized version that will have fewer divergent branches
  - Given some CUDA code, derive a partitioning into threads and blocks that does not exceed various hardware limits
IV. (Brief) Essay Question
  - Pick one from a set of 4
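The "fewer divergent branches" item can be checked by hand with a small host-side model. The sketch below is plain Python, not CUDA, and assumes the usual definition: a warp of 32 consecutive threads diverges when its threads disagree on a branch condition. The function name `divergent_warps` and the example predicates are hypothetical, chosen only for illustration.

```python
WARP_SIZE = 32  # threads per warp, as stated on the "How Much? How Many?" slide


def divergent_warps(threads_per_block, predicate):
    """Count warps in one block whose threads disagree on `predicate(tid)`.

    `predicate` stands in for the condition of an `if` inside a kernel.
    """
    count = 0
    for start in range(0, threads_per_block, WARP_SIZE):
        end = min(start + WARP_SIZE, threads_per_block)
        # Collect the distinct branch outcomes seen inside this warp.
        outcomes = {bool(predicate(tid)) for tid in range(start, end)}
        if len(outcomes) > 1:
            count += 1  # both paths are taken, so the warp diverges
    return count


# Branching on the parity of the thread index splits every warp.
by_thread_parity = divergent_warps(256, lambda tid: tid % 2 == 0)

# Branching on the parity of the warp index keeps each warp on one path.
by_warp_parity = divergent_warps(256, lambda tid: (tid // WARP_SIZE) % 2 == 0)
```

For a 256-thread block, the first condition makes all 8 warps diverge and the second makes none diverge, which is the kind of rewrite the exam item asks for when the computation allows work to be regrouped by warp.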

How Much? How Many?
- How many threads per block? Max 512
- How many blocks per grid? Max
- How many threads per warp? 32
- How many warps per multiprocessor? 24
- How much shared memory per streaming multiprocessor? 16 Kbytes
- How many registers per streaming multiprocessor? 8192 (G80) / 16384 (GTX/Tesla)
- Size of constant cache: 8 Kbytes
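These limits interact: the number of blocks that can be resident on one multiprocessor is bounded by whichever resource runs out first. The following Python sketch works through that arithmetic using the G80 figures from the slide. It is a simplified model, not a replacement for an occupancy calculator: it ignores register and shared memory allocation granularity, and the cap of 8 resident blocks per multiprocessor is an assumption that does not appear on the slide. The function name `blocks_per_sm` is hypothetical.

```python
WARP_SIZE = 32
MAX_THREADS_PER_BLOCK = 512
MAX_WARPS_PER_SM = 24
SHARED_BYTES_PER_SM = 16 * 1024
REGISTERS_PER_SM = 8192   # G80 figure from the slide
MAX_BLOCKS_PER_SM = 8     # assumption: not stated on the slide


def blocks_per_sm(threads_per_block, regs_per_thread, shared_bytes_per_block):
    """Estimate how many blocks fit on one multiprocessor at the same time."""
    if not 0 < threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("threads_per_block must be between 1 and 512")
    # Partial warps still occupy a full warp slot, so round up.
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    limits = [MAX_BLOCKS_PER_SM, MAX_WARPS_PER_SM // warps_per_block]
    if regs_per_thread > 0:
        limits.append(REGISTERS_PER_SM // (regs_per_thread * threads_per_block))
    if shared_bytes_per_block > 0:
        limits.append(SHARED_BYTES_PER_SM // shared_bytes_per_block)
    return min(limits)


# 256 threads, 10 registers per thread, 4 KB shared memory per block:
# warps allow 3 blocks, registers allow 3, shared memory allows 4.
example = blocks_per_sm(256, 10, 4096)
```

Raising the register count to 16 per thread in the same configuration drops the estimate to 2 blocks, which shows how a small change in register use can reduce the number of warps available to hide memory latency.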

Syllabus
L1: Introduction and CUDA Overview
  - Not much there…
L2: Hardware Execution Model
  - Difference between a parallel programming model and a hardware execution model
  - SIMD, MIMD, SIMT, SPMD
  - Performance impact of fine-grain multithreaded architecture
  - What happens during the execution of a warp?
  - How are warps selected for execution (scoreboarding)?
L3: Writing Correct Programs
  - Race condition, dependence
  - What is a reduction computation and why is it a good match for a GPU?
  - What does __syncthreads() do? (barrier synchronization)
  - Atomic operations
  - Memory Fence Instructions
  - Device emulation mode
L4 & L5: Memory Hierarchy: Locality and Data Placement
  - Memory latency and memory bandwidth optimizations
  - Reuse and locality
  - What are the different memory spaces on the device, who can read/write them?
  - How do you tell the compiler that something belongs in a particular memory space?
  - Tiling transformation (to fit data into constrained storage): Safety and profitability
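The reduction and barrier questions from L3 fit together: in a tree reduction, each step reads partial sums that other threads wrote in the previous step, so a barrier has to separate the steps. The sketch below is a sequential Python model of one block's reduction, not CUDA code. The list `shared` stands in for a `__shared__` array, the inner loop stands in for the threads that are active in a step, and the comment marks where `__syncthreads()` would go in a kernel. The function name `block_reduce_sum` is hypothetical.

```python
def block_reduce_sum(values):
    """Model a per-block tree reduction; returns (sum, number of barrier steps)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    shared = list(values)  # stands in for a __shared__ array
    stride = n // 2
    steps = 0
    while stride > 0:
        # Threads 0..stride-1 are active; each adds in its partner's value.
        for tid in range(stride):
            shared[tid] += shared[tid + stride]
        # Barrier point: in CUDA, __syncthreads() goes here so that the next
        # step only reads values that every active thread has finished writing.
        steps += 1
        stride //= 2
    return shared[0], steps


total, barrier_steps = block_reduce_sum(list(range(8)))
```

Summing n values takes log2(n) barrier-separated steps (3 steps for the 8 values above, giving 28), which is part of why reductions map well onto many threads. Within a step the model is race-free because active threads write indices below `stride` and read indices at or above it.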

Syllabus
L6 & L7: Memory Hierarchy III: Memory Bandwidth Optimization
  - Leftover from L5: Tiling (for registers)
  - Bandwidth: maximize utility of each memory cycle
  - Memory accesses in scheduling (half-warp)
  - Understanding global memory coalescing (for compute capability 1.2)
  - Understanding shared memory bank conflicts
L8: Control Flow
  - Divergent branches
  - Execution model
  - Warp vote functions
L9: Floating Point
  - Single precision versus double precision
  - IEEE Compliance: which operations are compliant?
  - Intrinsics vs. arithmetic operations, what is more precise?
  - What operations can be performed in 4 cycles, and what operations take longer?
L10 & L11: Dense Linear Algebra on GPUs
  - What are the key ideas contributing to CUBLAS 2.0 performance?
  - Concepts: high thread count vs. coarse-grain threads. When to use each?
  - Transpose in shared memory plus padding trick
L12: Sparse Linear Algebra on GPUs
  - Different sparse matrix representations
  - Stencil computations using sparse matrices
L13 & L14: Application case studies
  - Host tiling for constant cache (plus data structure reorganization)
  - Replacing trig function intrinsic calls with hardware implementations
  - Global synchronization for MPM/GIMP
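The "transpose in shared memory plus padding trick" can be understood through bank arithmetic alone. The Python sketch below models shared memory as 16 banks of 32-bit words accessed by a half-warp of 16 threads. That bank count is an assumption matching the compute capability 1.x hardware this course targets, and it is not stated on the slide. The names `conflict_degree` and `column_read` are hypothetical.

```python
NUM_BANKS = 16   # assumption: 16 banks of 32-bit words (compute capability 1.x)
HALF_WARP = 16   # shared memory requests are issued per half-warp


def conflict_degree(word_indices):
    """Return the worst-case number of distinct words requested from one bank.

    1 means conflict-free. Threads reading the same word are served together
    (broadcast), so only distinct words in a bank count toward a conflict.
    """
    per_bank = {}
    for word in word_indices:
        per_bank.setdefault(word % NUM_BANKS, set()).add(word)
    return max(len(words) for words in per_bank.values())


def column_read(row_pitch, column=0):
    """Word indices touched when thread t of a half-warp reads tile[t][column]."""
    return [t * row_pitch + column for t in range(HALF_WARP)]


unpadded = conflict_degree(column_read(16))  # tile declared as [16][16]
padded = conflict_degree(column_read(17))    # tile declared as [16][17]
```

With a row pitch of 16 words, every thread in the half-warp lands in the same bank and the column read is serialized 16 ways. Padding each row to 17 words shifts successive rows by one bank, so the same read becomes conflict-free at the cost of one unused word per row.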