CS 179: Lecture 12.

Problems thus far…

What’s ahead

- Wave equation solver
- Using multiple GPUs across different computers
- Inter-process communication: the Message Passing Interface (MPI)
- The project: do something cool!

Waves!

Solving something like this…

The Wave Equation

1-D:

$$\frac{\partial^2 y}{\partial t^2} = c^2 \frac{\partial^2 y}{\partial x^2}$$

n-D:

$$\frac{\partial^2 y}{\partial t^2} = c^2 \nabla^2 y$$

Discretization (1-D)

Finite differences in t and x:

$$\frac{\partial}{\partial t}\left[\frac{y_{x,t+1} - y_{x,t}}{\Delta t}\right] = c^2\,\frac{\partial}{\partial x}\left[\frac{y_{x+1,t} - y_{x,t}}{\Delta x}\right]$$

$$\Rightarrow\quad \frac{(y_{x,t+1} - y_{x,t}) - (y_{x,t} - y_{x,t-1})}{(\Delta t)^2} = c^2\,\frac{(y_{x+1,t} - y_{x,t}) - (y_{x,t} - y_{x-1,t})}{(\Delta x)^2}$$

Discretization

Multiplying through by $(\Delta t)^2$ and solving for $y_{x,t+1}$ gives the update rule:

$$y_{x,t+1} = 2y_{x,t} - y_{x,t-1} + \left(\frac{c\,\Delta t}{\Delta x}\right)^2\left(y_{x+1,t} - 2y_{x,t} + y_{x-1,t}\right)$$
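For reference, the update rule can be written as a serial CPU step. This is a minimal sketch with hypothetical names (`wave_step`, and `courant_sq` standing for $(c\,\Delta t/\Delta x)^2$); the boundary entries are left untouched for the boundary-condition step:

```c
// Serial reference for one time step of the discretized wave equation.
// old_y = y(., t-1), cur_y = y(., t), new_y = y(., t+1); L = number of nodes.
void wave_step(const float *old_y, const float *cur_y, float *new_y,
               int L, float courant_sq) {
    for (int x = 1; x < L - 1; x++) {
        new_y[x] = 2.0f * cur_y[x] - old_y[x]
                 + courant_sq * (cur_y[x + 1] - 2.0f * cur_y[x] + cur_y[x - 1]);
    }
}
```

A handy sanity check: with `courant_sq = 1` (i.e., $c\,\Delta t = \Delta x$), this scheme propagates a traveling wave $y(x,t) = f(x - t)$ exactly.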

Boundary Conditions

Examples:
- Manual motion at an end: u(0, t) = f(t)
- Fixed (bounded) ends: u(0, t) = u(L, t) = 0 for all t

Pseudocode (general)

General idea:
- We’ll need to deal with three states at a time (all positions at t−1, t, t+1)
- Let L = number of nodes (distinct discrete positions)
- Create a 2D array of size 3*L (three regions of L values each)
- Keep pointers to where each region begins
- Cyclically overwrite the region you no longer need!

Pseudocode (general)

fill data array with initial conditions
for all times t = 0 … t_max:
    point old_data pointer where current_data used to be
    point current_data pointer where new_data used to be
    point new_data pointer where old_data used to be
        (so that we can overwrite the old values!)
    for all positions x = 1 … number of nodes − 2:
        calculate f(x, t+1)
    set any boundary conditions!
    (every so often, write results to file)
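The cyclic pointer rotation in the loop body can be sketched in C. Here `rotate` is a hypothetical helper name, and the three pointers are assumed to point at the three L-sized regions of the single 3*L array:

```c
// Hypothetical helper: rotate the three region pointers so the buffer that
// held y(., t-1) is reused for y(., t+1) -- no data is copied.
void rotate(float **old_data, float **current_data, float **new_data) {
    float *tmp = *old_data;
    *old_data = *current_data;   // y(., t) becomes the new "old"
    *current_data = *new_data;   // y(., t+1) becomes the new "current"
    *new_data = tmp;             // the stale t-1 region gets overwritten next
}
```

Three rotations bring the pointers back to where they started, so the three regions are reused forever.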

GPU Algorithm – Kernel

(Assume the kernel is launched at some time t…)

Calculate y(x, t+1):

$$y_{x,t+1} = 2y_{x,t} - y_{x,t-1} + \left(\frac{c\,\Delta t}{\Delta x}\right)^2\left(y_{x+1,t} - 2y_{x,t} + y_{x-1,t}\right)$$

Each thread handles only a few values of x! (Similar to the polynomial problem.)
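A sketch of such a kernel, using a grid-stride loop so each thread covers several values of x. The names (`waveStepKernel`, `courant_sq` for the precomputed $(c\,\Delta t/\Delta x)^2$) are illustrative, not the lab’s actual interface:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: computes y(., t+1) at the interior nodes.
// Boundary nodes (x = 0 and x = n_nodes - 1) are handled separately
// by the boundary-condition step.
__global__ void waveStepKernel(const float *old_data,     // y(., t-1)
                               const float *current_data, // y(., t)
                               float *new_data,           // y(., t+1)
                               int n_nodes, float courant_sq) {
    for (int x = blockIdx.x * blockDim.x + threadIdx.x + 1;
         x < n_nodes - 1;
         x += blockDim.x * gridDim.x) {
        new_data[x] = 2.0f * current_data[x] - old_data[x]
                    + courant_sq * (current_data[x + 1]
                                    - 2.0f * current_data[x]
                                    + current_data[x - 1]);
    }
}
```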

GPU Algorithm – The Wrong Way

Recall the old “GPU computing instructions”:
1. Setup inputs on the host (CPU-accessible memory)
2. Allocate memory for inputs on the GPU
3. Copy inputs from host to GPU
4. Allocate memory for outputs on the host
5. Allocate memory for outputs on the GPU
6. Start GPU kernel
7. Copy output from GPU to host

GPU Algorithm – The Wrong Way

fill data array with initial conditions
for all times t = 0 … t_max:
    adjust old_data pointer
    adjust current_data pointer
    adjust new_data pointer
    allocate memory on the GPU for old, current, new
    copy old, current data from CPU to GPU
    launch kernel
    copy new data from GPU to CPU
    free GPU memory
    set any boundary conditions!
    (every so often, write results to file)
end

Why it’s bad

That’s a lot of CPU-GPU-CPU copying, on every single time step! We can do something better, using similar ideas to the previous lab.

GPU Algorithm – The Right Way

allocate memory on the GPU for old, current, new
either: create initial conditions on the CPU and copy to the GPU,
    or calculate and/or set the initial conditions on the GPU!
for all times t = 0 … t_max:
    adjust old_data address
    adjust current_data address
    adjust new_data address
    launch kernel with the above addresses
    either: set boundary conditions on the CPU and copy to the GPU,
        or calculate and/or set the boundary conditions on the GPU
end
free GPU memory

GPU Algorithm – The Right Way

Everything stays on the GPU all the time! Almost…

Getting output

What if we want to get a “snapshot” of the simulation? That’s the one time we copy from GPU to CPU!

GPU Algorithm – The Right Way

allocate memory on the GPU for old, current, new
either: create initial conditions on the CPU and copy to the GPU,
    or calculate and/or set the initial conditions on the GPU!
for all times t = 0 … t_max:
    adjust old_data address
    adjust current_data address
    adjust new_data address
    launch kernel with the above addresses
    either: set boundary conditions on the CPU and copy to the GPU,
        or calculate and/or set the boundary conditions on the GPU
    every so often, copy from GPU to CPU and write to file
end
free GPU memory
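Putting the loop above together, the host side might look like the following. This is a sketch under assumptions: the kernel name `waveStepKernel`, the launch configuration, and the snapshot interval are all hypothetical, and error checking is omitted for brevity:

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

// Assumed to exist (a step kernel like the one sketched earlier);
// the name and signature are hypothetical.
__global__ void waveStepKernel(const float *old_data, const float *current_data,
                               float *new_data, int n_nodes, float courant_sq);

void runSimulation(int L, int t_max, float courant_sq, int snapshot_every) {
    size_t bytes = (size_t)L * sizeof(float);
    float *d_old, *d_cur, *d_new;
    cudaMalloc(&d_old, bytes);   // allocated once, before the time loop
    cudaMalloc(&d_cur, bytes);
    cudaMalloc(&d_new, bytes);
    // Initial conditions: y(., t-1) in d_old and y(., t) in d_cur,
    // either computed by a small kernel or copied once from the host.
    cudaMemset(d_old, 0, bytes);
    cudaMemset(d_cur, 0, bytes);

    float *h_snapshot = (float *)malloc(bytes);
    for (int t = 0; t < t_max; t++) {
        waveStepKernel<<<64, 256>>>(d_old, d_cur, d_new, L, courant_sq);
        // Boundary conditions: set here by a tiny kernel, or copy a few
        // values from the host into d_new.

        if (t % snapshot_every == 0) {
            // The only GPU-to-CPU traffic: an occasional snapshot.
            cudaMemcpy(h_snapshot, d_new, bytes, cudaMemcpyDeviceToHost);
            // ... write h_snapshot to file ...
        }

        // Rotate device pointers: no data moves, only the labels change.
        float *tmp = d_old;
        d_old = d_cur; d_cur = d_new; d_new = tmp;
    }
    free(h_snapshot);
    cudaFree(d_old); cudaFree(d_cur); cudaFree(d_new);
}
```

Note that the three device buffers are allocated exactly once and freed exactly once; the per-step work is just a kernel launch and a pointer swap.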