L11: Jacobi, Tools and Project CS6963

Administrative Issues
– Office hours today: begin at 1:30
– Homework 2 graded: I’m reviewing; grades will come out later this week
– Project proposals: due 5PM, Friday, March 13 (hard deadline)
– Homework (Lab 2): due 5PM, Friday, March 6
– Where are we?

Outline
– Jacobi: how to approach; things to take care of
– Programming tools for CUDA: Occupancy Calculator, Profiler
– Project: discuss MPM/GIMP; construct list of questions

Jacobi Tiling
Sequential C code:

…
#define n 64
float a[n][n][n], b[n][n][n];
for (i=1; i<n-1; i++)
  for (j=1; j<n-1; j++)
    for (k=1; k<n-1; k++) {
      a[i][j][k] = 0.8*(b[i-1][j][k] + b[i+1][j][k] + b[i][j-1][k]
                      + b[i][j+1][k] + b[i][j][k-1] + b[i][j][k+1]);
    }

Analyze Reuse of Data
Iteration space:
  for i, 1 <= i <= n-2
  for j, 1 <= j <= n-2
  for k, 1 <= k <= n-2
Access expressions: b[i-1][j][k], b[i+1][j][k], b[i][j-1][k], b[i][j+1][k], b[i][j][k-1], b[i][j][k+1]

Reference pair                    Dist.
b[i+1][j][k],  b[i-1][j][k]       2, 0, 0
b[i][j-1][k],  b[i-1][j][k]       1,-1, 0
b[i][j+1][k],  b[i-1][j][k]       1, 1, 0
b[i][j][k-1],  b[i-1][j][k]       1, 0,-1
b[i][j][k+1],  b[i-1][j][k]       1, 0, 1
b[i+1][j][k],  b[i][j-1][k]       1, 1, 0
b[i+1][j][k],  b[i][j+1][k]       1,-1, 0
b[i+1][j][k],  b[i][j][k-1]       1, 0, 1
b[i+1][j][k],  b[i][j][k+1]       1, 0,-1
b[i][j+1][k],  b[i][j-1][k]       0, 2, 0
b[i][j][k-1],  b[i][j-1][k]       0, 1,-1
b[i][j][k+1],  b[i][j-1][k]       0, 1, 1
b[i][j+1][k],  b[i][j][k-1]       0, 1, 1
b[i][j+1][k],  b[i][j][k+1]       0, 1,-1
b[i][j][k+1],  b[i][j][k-1]       0, 0, 2
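The distance vectors above follow from a simple rule: references b with offset vectors o1 and o2 touch the same element in iterations that differ by o1 − o2. A minimal sketch of that arithmetic (the helper name `dist` is ours, not from the lecture):

```cpp
#include <array>
#include <cassert>

using Off = std::array<int, 3>;  // (i, j, k) offsets of one reference to b

// Reuse distance between references b[i+o1[0]][j+o1[1]][k+o1[2]] and
// b[i+o2[0]][j+o2[1]][k+o2[2]]: the second touches the element the first
// touched when the iteration vector advances by o1 - o2.
Off dist(const Off& o1, const Off& o2) {
    return { o1[0] - o2[0], o1[1] - o2[1], o1[2] - o2[2] };
}
```

For example, the pair b[i+1][j][k], b[i-1][j][k] has distance (2, 0, 0), matching the first row of the table.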

Tiled Sequential Code

…
#define n 64
float a[n][n][n], b[n][n][n];
for (ii=1; ii<n-1; ii+=TI)
  for (jj=1; jj<n-1; jj+=TJ)
    for (kk=1; kk<n-1; kk+=TK)
      for (i=ii; i<min(n-1,ii+TI); i++)
        for (j=jj; j<min(n-1,jj+TJ); j++)
          for (k=kk; k<min(n-1,kk+TK); k++) {
            a[i][j][k] = 0.8*(b[i-1][j][k] + b[i+1][j][k] + b[i][j-1][k]
                            + b[i][j+1][k] + b[i][j][k-1] + b[i][j][k+1]);
          }
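Tiling only reorders the iteration space; it must not change the result. A self-contained C++ check of that (sizes shrunk for a quick test, `min` from the slide spelled `std::min`, and the driver function name is ours):

```cpp
#include <algorithm>
#include <cassert>

const int N = 16, TI = 4, TJ = 4, TK = 4;  // small sizes for a quick test

// Reference: the original (untiled) Jacobi sweep.
void jacobi(float a[N][N][N], float b[N][N][N]) {
    for (int i = 1; i < N-1; i++)
        for (int j = 1; j < N-1; j++)
            for (int k = 1; k < N-1; k++)
                a[i][j][k] = 0.8f*(b[i-1][j][k] + b[i+1][j][k] + b[i][j-1][k]
                                 + b[i][j+1][k] + b[i][j][k-1] + b[i][j][k+1]);
}

// Tiled version: identical body, iteration space covered tile by tile.
void jacobi_tiled(float a[N][N][N], float b[N][N][N]) {
    for (int ii = 1; ii < N-1; ii += TI)
        for (int jj = 1; jj < N-1; jj += TJ)
            for (int kk = 1; kk < N-1; kk += TK)
                for (int i = ii; i < std::min(N-1, ii+TI); i++)
                    for (int j = jj; j < std::min(N-1, jj+TJ); j++)
                        for (int k = kk; k < std::min(N-1, kk+TK); k++)
                            a[i][j][k] = 0.8f*(b[i-1][j][k] + b[i+1][j][k] + b[i][j-1][k]
                                             + b[i][j+1][k] + b[i][j][k-1] + b[i][j][k+1]);
}

// Run both versions on the same input and compare element by element.
bool tiling_preserves_result() {
    static float b[N][N][N], a1[N][N][N] = {}, a2[N][N][N] = {};
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                b[i][j][k] = (float)((i*31 + j*7 + k) % 13);
    jacobi(a1, b);
    jacobi_tiled(a2, b);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                if (a1[i][j][k] != a2[i][j][k]) return false;
    return true;
}
```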

Relate Tiling Strategy to Computation/Data Partitioning
– Blocks can correspond to tiles (at least in two dimensions)
– Threads can correspond to the 3-D loop nest
– Dealing with capacity limitations

Correctness: Boundaries
[Figure: a 3-D tile to compute, plus the additional +1 plane of data read on each face]
Must include boundaries in tile
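The cost of those boundaries is easy to quantify: a TI x TJ x TK tile must stage (TI+2)(TJ+2)(TK+2) elements, one extra plane on each face. A tiny helper (ours, for illustration):

```cpp
#include <cassert>

// Elements of b[] one tile must stage: the TI x TJ x TK interior plus a
// one-element halo on every face (illustrative arithmetic, not from the slides).
int tile_footprint(int ti, int tj, int tk) {
    return (ti + 2) * (tj + 2) * (tk + 2);
}
```

For an 8 x 8 x 8 tile this is 1000 floats, i.e. 4000 bytes, comfortably within the 16KB of shared memory per multiprocessor on compute-capability-1.x devices.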

Basic Structure of GPU Kernel

__shared__ float b_l[TI+2][TJ+2][TK+2];  // tile of b, sized to include boundaries
// each dimension of the grid corresponds to a tile
// each dimension of a thread block corresponds to the i, j, k loops
for (portion of third dimension) {
  // copy single element to shared memory
  // copy boundaries to shared memory
  __syncthreads();
  // compute Jacobi within tile
  __syncthreads();
}
// go to next tile in third dimension
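A CPU model may make that structure concrete (all names are ours): one "block" stages its tile of b, plus the one-element halo, into a local buffer standing in for shared memory, then computes only from the staged copy. No __syncthreads() is needed in the sequential model, but the data movement is the same.

```cpp
#include <cassert>

const int N2 = 8, T = 4;  // small grid and tile edge, for illustration only

// Emulate one thread block: stage the tile plus its halo into a local buffer
// (standing in for __shared__ memory), then update a[] only from the buffer.
void jacobi_one_tile(float a[N2][N2][N2], float b[N2][N2][N2],
                     int ii, int jj, int kk) {
    float s[T + 2][T + 2][T + 2];  // tile of b, including boundaries
    for (int i = 0; i < T + 2; i++)
        for (int j = 0; j < T + 2; j++)
            for (int k = 0; k < T + 2; k++)
                s[i][j][k] = b[ii - 1 + i][jj - 1 + j][kk - 1 + k];
    for (int i = 0; i < T; i++)
        for (int j = 0; j < T; j++)
            for (int k = 0; k < T; k++)
                a[ii + i][jj + j][kk + k] =
                    0.8f * (s[i][j + 1][k + 1] + s[i + 2][j + 1][k + 1]
                          + s[i + 1][j][k + 1] + s[i + 1][j + 2][k + 1]
                          + s[i + 1][j + 1][k] + s[i + 1][j + 1][k + 2]);
}

// The staged computation must match computing directly from b.
bool staging_matches_direct() {
    static float b[N2][N2][N2], a[N2][N2][N2] = {};
    for (int i = 0; i < N2; i++)
        for (int j = 0; j < N2; j++)
            for (int k = 0; k < N2; k++)
                b[i][j][k] = (float)((i + 2 * j + 3 * k) % 7);
    jacobi_one_tile(a, b, 1, 1, 1);  // tile anchored at the first interior point
    for (int i = 1; i < 1 + T; i++)
        for (int j = 1; j < 1 + T; j++)
            for (int k = 1; k < 1 + T; k++) {
                float want = 0.8f * (b[i-1][j][k] + b[i+1][j][k] + b[i][j-1][k]
                                   + b[i][j+1][k] + b[i][j][k-1] + b[i][j][k+1]);
                if (a[i][j][k] != want) return false;
            }
    return true;
}
```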

Performance Expectations?
– Host-only original
– Global memory
– Shared memory

Tools: Occupancy Calculator
Assists in calculating how many threads and blocks to use in the computation partitioning
– Points to resource limitations
– Points to underutilized resources
Download from:

Using the Occupancy Calculator
First, what is the “compute capability” of your device? (see Appendix A of the programmer’s guide)
– Most available devices are 1.1
– GTX and Tesla are 1.3
Second, compile code with a special flag to obtain feedback from the compiler:
--ptxas-options=-v

Example of Compiler Statistics for Occupancy Calculator

$ nvcc --ptxas-options=-v \
    -I/Developer/CUDA/common/inc \
    -L/Developer/CUDA/lib mmul.cu -lcutil

Returns:
ptxas info : Compiling entry function '__globfunc__Z12mmul_computePfS_S_i'
ptxas info : Used 9 registers, bytes smem, 8 bytes cmem[1]
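The calculator's core arithmetic can also be done by hand. A sketch assuming compute-capability-1.3 limits (16384 registers and 16KB of shared memory per multiprocessor, at most 1024 resident threads and 8 resident blocks) and ignoring the hardware's allocation granularity:

```cpp
#include <algorithm>
#include <cassert>

// Blocks that fit on one multiprocessor: the tightest of the register,
// shared-memory, thread-count, and block-count limits
// (compute-capability-1.3 figures assumed; granularity ignored).
int resident_blocks(int threads_per_block, int regs_per_thread,
                    int smem_bytes_per_block) {
    int by_regs    = 16384 / (regs_per_thread * threads_per_block);
    int by_smem    = smem_bytes_per_block ? 16384 / smem_bytes_per_block : 8;
    int by_threads = 1024 / threads_per_block;
    return std::min({by_regs, by_smem, by_threads, 8});
}
```

For the 9-register kernel above with 256-thread blocks and, say, 2KB of shared memory per block, this gives 4 resident blocks, i.e. 1024 threads and full occupancy; registers (limit 7) and shared memory (limit 8) are not the bottleneck.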

Next Tool: Profiler
What it does:
– Provides access to hardware performance monitors
– Pinpoints performance issues and compares across implementations
Two interfaces:
– Text-based: built in and included with the compiler
– GUI: download from

A Look at MMUL
Where are the coalesced and non-coalesced loads and stores to global memory?
[Figure: matrix multiply, C = A × B]
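One way to reason about the question: on a compute-capability-1.x device, a half-warp's accesses coalesce when its 16 threads touch 16 consecutive words. For a row-major matrix, that means the thread index must vary along a row. A toy model (function name is ours):

```cpp
#include <cassert>

// Byte stride between the elements touched by consecutive threads of a
// half-warp reading an n x n row-major float matrix (illustrative only).
// A stride of sizeof(float) coalesces into one transaction; a stride of
// n*sizeof(float) serializes into 16 transactions on a 1.x device.
int thread_stride_bytes(int n, bool thread_index_varies_along_row) {
    // thread t reads M[row][t] (along a row) or M[t][col] (down a column)
    return (thread_index_varies_along_row ? 1 : n) * (int)sizeof(float);
}
```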

Another Example
Reverse array, from Dr. Dobb’s Journal
– reverse_global: copy from global to shared, then back to global in reverse order
– reverse_shared: copy from global into shared in reverse order, then rewrite in order to global
Output –
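A CPU model of the two variants (function names mirror the kernels described above, but the code is ours): both produce the same reversed array; on the GPU the difference is only which global-memory stores land on consecutive addresses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Variant 1: write the reversed order straight to the output
// (on the GPU, the global stores are the reversed, scattered ones).
std::vector<int> reverse_global(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    for (std::size_t t = 0; t < in.size(); t++)
        out[in.size() - 1 - t] = in[t];  // reversed store
    return out;
}

// Variant 2: reverse into a staging buffer (standing in for shared memory),
// then write the output in order (on the GPU, in-order coalesced stores).
std::vector<int> reverse_shared(const std::vector<int>& in) {
    std::vector<int> buf(in.size()), out(in.size());
    for (std::size_t t = 0; t < in.size(); t++)
        buf[in.size() - 1 - t] = in[t];  // reverse within the staging buffer
    for (std::size_t t = 0; t < in.size(); t++)
        out[t] = buf[t];                 // in-order store
    return out;
}
```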

MPM/GIMP Questions from Last Time
– Lab machine set up? Python? Gnuplot?
– Hybrid data structure to deal with updates to grid in some cases and particles in other cases

Some Strategies and Methodologies
– Note that README.txt tells how to run “impact.exe”, using a small number of cpp files
– Convert to single precision by replacing all “double” with “float”
– Discussion of data structures: see updateContribList in shape.cpp and patch.h