CS 179: GPU Computing Lecture 3 / Homework 1

Recap
Adding two arrays… a close look
– Memory: separate memory space, cudaMalloc(), cudaMemcpy(), …
– Processing: groups of threads (grid, blocks, warps); optimal parameter choice (#blocks, #threads/block)
– Kernel practices: robust handling of workload (beyond 1 thread/index)

Parallelization
What are parallelizable problems? e.g.
– Simple shading: for all pixels (i,j): replace previous color with new color according to rules
– Adding two arrays: for (int i = 0; i < N; i++) C[i] = A[i] + B[i];
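The array addition above is parallelizable because no iteration reads a value written by another iteration. The following host-side C++ sketch illustrates that property by splitting the loop across a number of workers in a strided pattern, so that one worker can cover more than one index. The names addStrided, addArrays and numWorkers are hypothetical and are not taken from the course code; the sketch assumes a and b have the same length and runs the workers one after another rather than on a GPU.

```cpp
#include <cstddef>
#include <vector>

// Adds the elements owned by one worker: indices workerId, workerId + numWorkers, ...
// No iteration reads a value written by another iteration, so workers could run
// in any order (or at the same time) and still produce the same c.
// Assumption: a, b and c all have the same length.
void addStrided(const std::vector<float>& a, const std::vector<float>& b,
                std::vector<float>& c, std::size_t workerId, std::size_t numWorkers) {
    for (std::size_t i = workerId; i < c.size(); i += numWorkers) {
        c[i] = a[i] + b[i];
    }
}

// Runs every worker once, deliberately in reverse order, to show that the
// result does not depend on the order in which the workers execute.
std::vector<float> addArrays(const std::vector<float>& a, const std::vector<float>& b,
                             std::size_t numWorkers) {
    std::vector<float> c(a.size(), 0.0f);
    for (std::size_t w = numWorkers; w-- > 0;) {
        addStrided(a, b, c, w, numWorkers);
    }
    return c;
}
```

Because each worker strides through the array, the result is the same whether there are fewer workers than elements or more.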

Parallelization What aren’t parallelizable problems? – Subtle differences!

Moving Averages

Simple Moving Average
x[n]: input (the raw signal)
y[n]: simple moving average of x[n]
Each point in y[n] is the average of the last K points!
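The slide's own equation is not part of this transcript. Reconstructed from the description above (the average of the last K input points, counting x[n] itself), the simple moving average can be written as:

$$ y[n] = \frac{1}{K} \sum_{k=0}^{K-1} x[n-k] $$

How samples before the start of the signal are treated (for example, as zero) is a convention that the transcript does not state.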

Simple Moving Average

Exponential Moving Average

Comparison

In the EMA, the calculation for y[n] depends on the calculation for y[n-1]!

Comparison
SMA pseudocode:
for n = 0 through N-1
  y[n] <- (x[n] + ... + x[n-(K-1)]) / K
– Each iteration is independent of the others
– Better GPU-acceleration
EMA pseudocode:
for n = 0 through N-1
  y[n] <- c*x[n] + (1-c)*y[n-1]
– Loop iteration n depends on iteration n-1!
– Far less parallelizable!
– Worse GPU-acceleration
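The difference between the two loops is easier to see in running code. Below is a minimal CPU sketch in C++ of both averages. The function names are hypothetical, and two boundary conventions are assumptions that the slides do not specify: samples before the start of the signal count as zero, and the EMA starts from y[-1] = 0. K is assumed to be at least 1.

```cpp
#include <cstddef>
#include <vector>

// SMA: y[n] is the mean of x[n], x[n-1], ..., x[n-(K-1)].
// Assumption: samples before the start of the signal count as zero; K >= 1.
// Every y[n] is computed from x alone, so the outer loop could be split
// across threads with no coordination.
std::vector<float> simpleMovingAverage(const std::vector<float>& x, std::size_t K) {
    std::vector<float> y(x.size(), 0.0f);
    for (std::size_t n = 0; n < x.size(); ++n) {
        float sum = 0.0f;
        for (std::size_t k = 0; k < K && k <= n; ++k) {
            sum += x[n - k];
        }
        y[n] = sum / static_cast<float>(K);
    }
    return y;
}

// EMA: y[n] = c*x[n] + (1-c)*y[n-1].
// Assumption: y[-1] = 0.
// Each iteration needs the previous output, so this loop as written must run
// in order.
std::vector<float> exponentialMovingAverage(const std::vector<float>& x, float c) {
    std::vector<float> y(x.size(), 0.0f);
    float previous = 0.0f;
    for (std::size_t n = 0; n < x.size(); ++n) {
        previous = c * x[n] + (1.0f - c) * previous;
        y[n] = previous;
    }
    return y;
}
```

With x = {2, 4, 6, 8}, simpleMovingAverage(x, 2) gives {1, 3, 5, 7} and exponentialMovingAverage(x, 0.5f) gives {1, 2.5, 4.25, 6.125}.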

Morals
Not all problems are parallelizable!
– Even similar-looking problems
Recall: Parallel algorithms have potential in GPU computing

Small-kernel convolution Homework 1 (coding portion)

Signals

Systems Given input signal(s), produce output signal(s)

Discretization
Discrete samplings of continuous signals
– Continuous audio signal -> WAV file
– Voltage -> Voltage every T milliseconds
(Will focus on discrete-time signals here)

Linear systems

Consider a tiny piece of the signal

Linear systems

Time-invariance

Time-invariance and linearity
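The equations on these slides are not part of the transcript. The standard argument they lead up to can be sketched as follows, writing \delta[n] for the unit impulse and h[n] for the system's response to it (the impulse response). Any discrete-time signal is a weighted sum of shifted impulses:

$$ x[n] = \sum_{k} x[k]\, \delta[n-k] $$

By time-invariance, the response to the shifted impulse \delta[n-k] is the shifted impulse response h[n-k]. By linearity, the response to the weighted sum is the same weighted sum of those responses:

$$ y[n] = \sum_{k} x[k]\, h[n-k] $$

This sum is the convolution of x[n] with h[n]. When h[n] is nonzero for only a small number of samples, each y[n] needs only that many multiply-adds.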

Morals

Convolution example

Computability

This assignment
Accelerate this computation!
– Fill in TODOs on assignment 1
  Kernel implementation
  Memory operations
– We give the skeleton:
  CPU implementation (a good reference!)
  Output error checks
  h[n] (default is Gaussian impulse response)
  …

The code
Framework code has two modes:
– Normal mode (AUDIO_ON zero)
  Generates random x[n]
  Can run performance measurements on different sizes of x[n]
  Can run multiple repeated trials (adjust channels parameter)
– Audio mode (AUDIO_ON nonzero)
  Reads input WAV file as x[n]
  Outputs y[n] to WAV
  Gaussian is an imperfect low-pass filter: high frequencies are attenuated!
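To make the low-pass remark concrete, here is a hedged C++ sketch of a Gaussian impulse response and of its gain for the fastest possible input, the alternating sequence +1, -1, +1, ... The function names, the tap count K, the width sigma, and the normalization (taps summing to 1) are all assumptions made for illustration; the framework's actual h[n] may be built differently.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Builds a length-K Gaussian impulse response centered on the middle tap and
// scaled so that the taps sum to 1 (a constant input then passes unchanged).
// Assumption: sigma > 0. K and sigma are illustrative parameters only.
std::vector<float> gaussianImpulseResponse(std::size_t K, float sigma) {
    std::vector<float> h(K, 0.0f);
    const float center = (static_cast<float>(K) - 1.0f) / 2.0f;
    float sum = 0.0f;
    for (std::size_t k = 0; k < K; ++k) {
        const float d = (static_cast<float>(k) - center) / sigma;
        h[k] = std::exp(-0.5f * d * d);
        sum += h[k];
    }
    for (float& tap : h) {
        tap /= sum;
    }
    return h;
}

// Steady-state gain for the alternating input +1, -1, +1, ...:
// the magnitude of sum over k of h[k] * (-1)^k.
float gainAtHighestFrequency(const std::vector<float>& h) {
    float acc = 0.0f;
    float sign = 1.0f;
    for (float tap : h) {
        acc += sign * tap;
        sign = -sign;
    }
    return std::fabs(acc);
}
```

For K = 5 and sigma = 1 the gain for a constant input is 1 while the gain for the alternating input is small but not zero, which is one way to read "imperfect low-pass filter".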

Demonstration

Debugging tips
Printf
– Beware: you have many threads!
– Set small number of threads to print
Store intermediate results in global memory
– Can copy back to host for inspection
Check error returns!
– gpuErrchk macro included; wrap it around function calls

Debugging tips
Use small convolution test case
– E.g. 5-element x[n], 3-element h[n]
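A test case of that size is small enough to check by hand. The C++ sketch below is a plain CPU convolution to compare GPU output against. It is not the course's reference implementation. The function name convolve is hypothetical, and two conventions are assumptions: the output has the same length as x, and samples before the start of x count as zero.

```cpp
#include <cstddef>
#include <vector>

// Direct convolution: y[n] = sum over k of h[k] * x[n-k].
// Assumptions: the output has the same length as x, and x[n-k] is treated as
// zero whenever n - k would fall before the start of the signal.
std::vector<float> convolve(const std::vector<float>& x, const std::vector<float>& h) {
    std::vector<float> y(x.size(), 0.0f);
    for (std::size_t n = 0; n < x.size(); ++n) {
        for (std::size_t k = 0; k < h.size() && k <= n; ++k) {
            y[n] += h[k] * x[n - k];
        }
    }
    return y;
}
```

With x = {1, 2, 3, 4, 5} and h = {0.25, 0.5, 0.25}, convolve(x, h) returns {0.25, 1, 2, 3, 4}. For example, y[2] = 0.25*3 + 0.5*2 + 0.25*1 = 2.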

Compatibility
Our machines:
– haru.caltech.edu
– (We’ll try to get more up as the assignment progresses)
CMS machines:
– Only normal mode works (fine for this assignment)
Your own system:
– Dependencies: libsndfile (audio mode)

Administrivia
Due date:
– Wednesday, 3 PM (correction)
Office hours (ANB 104):
– Kevin/Andrew: Monday, 9-11 PM
– Eric: Tuesday, 7-9 PM