Programming with CUDA WS 08/09 Lecture 11 Thu, 27 Nov, 2008.


Previously
Optimizing your code
–Instruction throughput
–Memory bandwidth
–#Threads per block
–Type of memory
–General guidelines

Today
Graded/ungraded course?
2 examples
–Matrix multiplication (straightforward)
–Parallel reduction
Final projects

Graded/ungraded

Matrix Multiplication
Inherently parallel problem
C = A * B
A: hA x wA, B: hB x wB, C: hA x wB
Each entry in C depends on one row in A and one column in B
–Assign each entry to a thread
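
The dependence of each entry on one row of A and one column of B can be written out as follows. This is a sketch using 0-based indices, and it assumes wA = hB so that the product is defined (the slide leaves this implicit):

```latex
C_{ij} \;=\; \sum_{k=0}^{w_A - 1} A_{ik}\, B_{kj},
\qquad 0 \le i < h_A, \quad 0 \le j < w_B
```

No entry of C reads any other entry of C, which is why each one can be handed to its own thread.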

Matrix Multiplication
C: hA x wB
First strategy: start a thread block with hA*wB threads
–Too many threads per block for all but small matrices (GPUs of this generation allow at most 512)

Matrix Multiplication
C: hA x wB
Better: block the problem
–Break C into tiles of BlockSize x BlockSize entries
–Assign each tile to a thread block
Recall: recommended #threads/block = 192, 256
A reasonable choice for BlockSize is 16 (16 x 16 = 256 threads per block)
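
A minimal sketch of the blocked strategy, not the code shown in the lecture. It assumes row-major storage and wA = hB. The names `matMulKernel`, `matMul` and `kBlockSize` are made up for illustration, and error checking on the CUDA calls is left out for brevity. Each thread computes one entry of C, and each 16 x 16 thread block covers one tile of C.

```cuda
#include <cuda_runtime.h>
#include <vector>

const int kBlockSize = 16;  // assumed tile width: 16 x 16 = 256 threads per block

// One thread per entry of C = A * B (row-major).
// A is hA x wA, B is wA x wB (assumes wA == hB), C is hA x wB.
__global__ void matMulKernel(const float* A, const float* B, float* C,
                             int hA, int wA, int wB)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= hA || col >= wB) return;  // threads in partial edge tiles do nothing

    float sum = 0.0f;
    for (int k = 0; k < wA; ++k)
        sum += A[row * wA + k] * B[k * wB + col];  // row of A times column of B
    C[row * wB + col] = sum;
}

// Hypothetical host wrapper: copy inputs to the device, launch, copy result back.
std::vector<float> matMul(const std::vector<float>& A,
                          const std::vector<float>& B,
                          int hA, int wA, int wB)
{
    std::vector<float> C(static_cast<size_t>(hA) * wB);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

    // Enough tiles to cover C, rounding up when hA or wB is not a multiple of 16.
    dim3 block(kBlockSize, kBlockSize);
    dim3 grid((wB + kBlockSize - 1) / kBlockSize,
              (hA + kBlockSize - 1) / kBlockSize);
    matMulKernel<<<grid, block>>>(dA, dB, dC, hA, wA, wB);

    // The blocking copy back to the host also waits for the kernel to finish.
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

Calling `matMul(A, B, hA, wA, wB)` from host code returns C as a flat row-major vector. This version reads A and B straight from global memory. Staging tiles in shared memory is a common refinement, but it goes beyond the straightforward version sketched here.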


Parallel Reduction
Reduction
–Reducing an array to a single value, e.g. sum, min, max
Slides
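
A minimal sketch of a sum reduction. It is one common shared-memory tree scheme and not necessarily the variant covered in the referenced slides. The names `reduceSumKernel`, `reduceSum` and `kThreads` are made up for illustration, and error checking is left out. Each block reduces its slice of the input to one partial sum, and the host adds up the partial sums.

```cuda
#include <cuda_runtime.h>
#include <vector>

const int kThreads = 256;  // assumed block size; must be a power of two for the halving loop

// Each block sums up to kThreads input elements into blockSums[blockIdx.x].
__global__ void reduceSumKernel(const float* in, float* blockSums, int n)
{
    __shared__ float cache[kThreads];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    cache[tid] = (i < n) ? in[i] : 0.0f;  // pad past the end with the identity for +
    __syncthreads();

    // Tree reduction: halve the number of active threads at every step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();  // outside the if, so every thread in the block reaches it
    }

    if (tid == 0)
        blockSums[blockIdx.x] = cache[0];
}

// Hypothetical host wrapper: one kernel pass, then a final sum on the CPU.
float reduceSum(const std::vector<float>& data)
{
    int n = static_cast<int>(data.size());
    if (n == 0) return 0.0f;
    int blocks = (n + kThreads - 1) / kThreads;

    float *dIn = nullptr, *dPartial = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMemcpy(dIn, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    reduceSumKernel<<<blocks, kThreads>>>(dIn, dPartial, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);

    float total = 0.0f;
    for (float p : partial) total += p;  // combine the per-block partial sums
    return total;
}
```

For min or max, the `+=` becomes a `fminf` or `fmaxf` update and the padding value becomes the matching identity. For very large inputs the partial sums could be reduced by a second kernel launch instead of on the host.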

Final Projects
Time-line
–Thu, 20 Nov: Float write-ups on ideas of Jens & Waqar
–Tue, 25 Nov: Suggest groups and topics
–Thu, 27 Nov (today): Groups and topics assigned
–Tue, 2 Dec: Last chance to change groups/topics; groups and topics finalized

See you next week!