CS179: GPU Programming Lecture 11: Lab 5 Recitation.

1 CS179: GPU Programming Lecture 11: Lab 5 Recitation

2 Today
- Monte-Carlo integration
- Recap on CUBLAS/CURAND
- Reductions
- Optimizing a reduction

3 Monte-Carlo Integration
- Integration is a common tool in computational math
  - Often used for finding areas
- Integration is hard on a computer
  - Difficult to do analytically
- Integration is sometimes analytically impossible
  - exp(x^2) has no antiderivative in terms of elementary functions

4 Monte-Carlo Integration
- Could use a discrete Riemann sum

5 Monte-Carlo Integration
- What if there’s no predefined function?
  - Ex.: Area of a union of shapes

6 Monte-Carlo Integration
- Solution: Monte-Carlo integration
  - Saturate the bounded space with sample points
  - Check if each point is in any shape
  - Area ≈ (# of points inside any shape / # of points total) × area of the bounded space

7 Monte-Carlo Integration
- Lab 5: Given N spheres in a bounded space, find the volume of their union
  - Possible to do analytically…
  - But very difficult!
    - Spheres have random positions, intersections to account for, etc.
- Makes good use of Monte-Carlo integration
  - Easy to check if a point is in any of the spheres
  - Easy to use CURAND to generate lots of points!

8 Lab 5
- Remember: CURAND has a host API and a device API
  - You will use both!
- volumeCUBLAS: uses host API with CUBLAS
- volumeCUDA: uses device API with reduction kernel

9 Lab 5 volumeCUBLAS
- Allocate necessary memory
  - Need memory for points
  - Need memory for 1 bool per point
    - Is point in any sphere?
- Use CURAND host API to generate lots of points
  - Create, seed, generate, destroy
- Use CheckPointsK kernel to see if each point is in a sphere
  - You must write this kernel!
- Get total # of points in a sphere using cublasDasum
  - cublasDasum(int n, double *src, int stride)
- Free allocated memory
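The volumeCUBLAS steps above can be sketched as follows. This is a minimal illustration, not the lab solution. It assumes a single unit sphere centred in the cube [-1, 1]^3, stores one double per point (rather than a bool) so that cublasDasum can sum the flags directly, and uses the handle-based cuBLAS v2 signature rather than the legacy one quoted on the slide. The names markInsideK and volumeCublasSketch are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <curand.h>

// Hypothetical stand-in for CheckPointsK: writes 1.0 if point i lies inside a
// sphere of radius r centred at the origin, else 0.0.
__global__ void markInsideK(const float *xyz, double *inside, int n, float r) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // curandGenerateUniform returns values in (0, 1]; rescale to (-1, 1].
    float x = 2.0f * xyz[3 * i + 0] - 1.0f;
    float y = 2.0f * xyz[3 * i + 1] - 1.0f;
    float z = 2.0f * xyz[3 * i + 2] - 1.0f;
    inside[i] = (x * x + y * y + z * z <= r * r) ? 1.0 : 0.0;
}

// Estimates the volume of the unit sphere from n random points.
double volumeCublasSketch(int n, unsigned long long seed) {
    float *dPts = nullptr;
    double *dInside = nullptr;
    cudaMalloc(&dPts, 3 * static_cast<size_t>(n) * sizeof(float));
    cudaMalloc(&dInside, static_cast<size_t>(n) * sizeof(double));

    // CURAND host API: create, seed, generate, destroy.
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, seed);
    curandGenerateUniform(gen, dPts, 3 * static_cast<size_t>(n));
    curandDestroyGenerator(gen);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    markInsideK<<<blocks, threads>>>(dPts, dInside, n, 1.0f);

    // All flags are 0 or 1, so the sum of absolute values is the hit count.
    cublasHandle_t handle;
    cublasCreate(&handle);
    double hits = 0.0;
    cublasDasum(handle, n, dInside, 1, &hits);
    cublasDestroy(handle);

    cudaFree(dPts);
    cudaFree(dInside);
    // The bounding cube [-1, 1]^3 has volume 8.
    return hits / n * 8.0;
}
```

Calling volumeCublasSketch(1 << 20, 1234ULL) should return a value close to 4/3 · π ≈ 4.19; the exact digits depend on the seed and the generator.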

10 Lab 5 volumeCUDA
- Allocate memory for data
  - Now, we also need memory for curandStates!
- Generate lots of points using CURAND device API
  - Call GenerateRandom3K kernel, but you must fill in the kernel!
- Check if points are in sphere
  - Same as volumeCUBLAS
- Use reduction to sum vector
  - More on this later…
- Free memory
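A rough sketch of the device-API half of volumeCUDA is shown below: one curandState per thread is initialised, then each thread draws three uniform floats. The kernel and helper names (initStatesK, generateRandom3Sketch, generatePointsSketch) and their signatures are assumptions made for illustration; the lab's GenerateRandom3K may be declared differently. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <vector>

// One curandState per thread; each thread uses its own subsequence of one seed.
__global__ void initStatesK(curandState *states, unsigned long long seed, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) curand_init(seed, i, 0, &states[i]);
}

// Hypothetical take on GenerateRandom3K: one float3 point per thread.
__global__ void generateRandom3Sketch(curandState *states, float3 *pts, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    curandState local = states[i];     // work on a register copy
    float x = curand_uniform(&local);  // each draw lies in (0, 1]
    float y = curand_uniform(&local);
    float z = curand_uniform(&local);
    pts[i] = make_float3(x, y, z);
    states[i] = local;                 // save the advanced state for later calls
}

// Host helper: generates n points on the device and copies them back.
std::vector<float3> generatePointsSketch(int n, unsigned long long seed) {
    std::vector<float3> hPts(n);
    if (n == 0) return hPts;
    curandState *dStates = nullptr;
    float3 *dPts = nullptr;
    cudaMalloc(&dStates, n * sizeof(curandState));
    cudaMalloc(&dPts, n * sizeof(float3));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    initStatesK<<<blocks, threads>>>(dStates, seed, n);
    generateRandom3Sketch<<<blocks, threads>>>(dStates, dPts, n);

    cudaMemcpy(hPts.data(), dPts, n * sizeof(float3), cudaMemcpyDeviceToHost);
    cudaFree(dStates);
    cudaFree(dPts);
    return hPts;
}
```

In the lab the points would stay on the device for the sphere check; the copy back to the host here is only so the result can be inspected.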

11 Lab 5 Kernels
- PointInSphere: Checks if a point is in a given sphere
  - Do this first!
  - Should be easy geometry
- CheckPointsK: Checks if a point is in any sphere
  - Copy spheres to shared memory, then iterate through spheres
  - Remember to make sure array entry is non-NULL
- GenerateRandom3K: Generates lots of float3 points
  - Use CURAND device API
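One possible shape for the first two kernels is sketched below. It assumes each sphere is packed into a float4 (centre in x, y, z and radius in w) and that the sphere list fits in shared memory; neither assumption comes from the lab handout, and all names ending in Sketch are hypothetical. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumed layout: (s.x, s.y, s.z) is the sphere centre, s.w its radius.
__host__ __device__ inline bool pointInSphereSketch(float3 p, float4 s) {
    float dx = p.x - s.x, dy = p.y - s.y, dz = p.z - s.z;
    // Compare squared distances to avoid a square root.
    return dx * dx + dy * dy + dz * dz <= s.w * s.w;
}

__global__ void checkPointsSketch(const float3 *pts, int nPts,
                                  const float4 *spheres, int nSpheres,
                                  unsigned char *inside) {
    extern __shared__ float4 sSpheres[];
    // Cooperative copy of the sphere list into shared memory.
    for (int j = threadIdx.x; j < nSpheres; j += blockDim.x)
        sSpheres[j] = spheres[j];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPts) return;
    bool hit = false;
    for (int j = 0; j < nSpheres && !hit; ++j)
        hit = pointInSphereSketch(pts[i], sSpheres[j]);
    inside[i] = hit ? 1 : 0;
}

// Host helper: returns how many points fall inside at least one sphere.
int countInsideSketch(const std::vector<float3> &pts,
                      const std::vector<float4> &spheres) {
    int nPts = static_cast<int>(pts.size());
    int nSph = static_cast<int>(spheres.size());
    if (nPts == 0 || nSph == 0) return 0;
    float3 *dPts = nullptr;
    float4 *dSph = nullptr;
    unsigned char *dIn = nullptr;
    cudaMalloc(&dPts, nPts * sizeof(float3));
    cudaMalloc(&dSph, nSph * sizeof(float4));
    cudaMalloc(&dIn, nPts);
    cudaMemcpy(dPts, pts.data(), nPts * sizeof(float3), cudaMemcpyHostToDevice);
    cudaMemcpy(dSph, spheres.data(), nSph * sizeof(float4), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (nPts + threads - 1) / threads;
    checkPointsSketch<<<blocks, threads, nSph * sizeof(float4)>>>(
        dPts, nPts, dSph, nSph, dIn);

    std::vector<unsigned char> hIn(nPts);
    cudaMemcpy(hIn.data(), dIn, nPts, cudaMemcpyDeviceToHost);
    cudaFree(dPts); cudaFree(dSph); cudaFree(dIn);
    int count = 0;
    for (unsigned char f : hIn) count += f;
    return count;
}
```

The host-side count at the end is only for checking the flags; in the lab that sum is what cublasDasum or the reduction kernel computes.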

12 Reduction
- Iteratively reduces an array via a reduce function (e.g. addition)

13 Reduction
- Start with size = nPts / 2
- Repeatedly call reduction on block size, halving it each time
- With main loop in host, device code is very simple…
  - Just need to add element i and element i + size for each thread
- Alternatively, could build loop into device code, and call kernel only once
- Once size == 1, we should have summed up all elements
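The host-driven variant described above might look like the sketch below. The slide starts from size = nPts / 2, which suits power-of-two inputs; this version rounds the half up so that odd lengths also work, which is an addition of this sketch. The names addHalvesK and sumWithHostLoop are hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One reduction step: thread i adds element i + half into element i.
__global__ void addHalvesK(float *data, int half, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < half && i + half < n) data[i] += data[i + half];
}

// Sums the input by launching one kernel per halving step from the host.
float sumWithHostLoop(const std::vector<float> &h) {
    int n = static_cast<int>(h.size());
    if (n == 0) return 0.0f;
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    while (n > 1) {
        int half = (n + 1) / 2;  // round up so an odd middle element survives
        int blocks = (half + threads - 1) / threads;
        // Launches on the same stream run in order, so no explicit sync is needed.
        addHalvesK<<<blocks, threads>>>(d, half, n);
        n = half;
    }

    float total = 0.0f;
    cudaMemcpy(&total, d, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return total;
}
```

Each step costs a kernel launch, which is the price of keeping the device code this simple; building the loop into the kernel trades that for shared memory and synchronisation inside each block.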

14 Reduction
- Lots of optimizations to make!
  - Avoiding thread divergence
  - Contiguous memory accesses
  - Avoiding shared memory bank conflicts
- More we haven’t discussed yet…
  - Unrolling loops
  - Templates
  - And more!

15 Optimizations
- Avoiding thread divergence
  - Avoid branches that send threads in the same warp down different paths
    - if (threadIdx.x % 2 == 0)
  - Instead, group by warps
    - if (threadIdx.x / WARP_SIZE == 0)

16 Optimizations
- Contiguous memory accesses
  - Memory is linear, can’t swap dimensions
  - Need to address non-sequential accesses…
- Shared memory banks
  - Also solved by sequential addressing!

17 Optimizations
- Example in reduction kernel: reversed loop indexing
  - Before: for (int i = 1; i < max_size; i *= 2) { … }
  - After:  for (int i = max_size / 2; i > 0; i /= 2) { … }
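To make the two loop orders concrete, here is a sketch of a per-block sum written both ways. The loop bodies are filled in following the common interleaved versus sequential addressing pattern, which is an assumption since the slide elides them. Both kernels assume blockDim.x is a power of two, and the names are hypothetical. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Forward loop (stride 1, 2, 4, ...): active threads are scattered across
// warps, so warps diverge and shared-memory accesses are strided.
__global__ void reduceInterleavedK(const float *in, float *out, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Reversed loop (stride blockDim.x / 2, ..., 2, 1): active threads are the
// contiguous range [0, s), and each reads consecutive shared-memory words.
__global__ void reduceSequentialK(const float *in, float *out, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s /= 2) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Host helper: one partial sum per block, added up on the host for brevity.
float blockReduceSketch(const std::vector<float> &h, bool sequential) {
    int n = static_cast<int>(h.size());
    if (n == 0) return 0.0f;
    const int threads = 256;  // must be a power of two for both kernels
    int blocks = (n + threads - 1) / threads;
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, blocks * sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    size_t smem = threads * sizeof(float);
    if (sequential) reduceSequentialK<<<blocks, threads, smem>>>(dIn, dOut, n);
    else            reduceInterleavedK<<<blocks, threads, smem>>>(dIn, dOut, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dOut, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn); cudaFree(dOut);
    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

Both kernels return the same sum; the difference is only in which threads are active at each step and which shared-memory words they touch.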

18 Optimizations
- Unrolling loops
  - Basic idea: once the reduction size drops below 32, the remaining work fits inside a single warp, so the loop overhead is wasted
  - Unrolling the last iterations of the loop saves that useless work

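19 Optimizations
- Unrolling loops example
  Before:
    for (int i = max_size / 2; i > 0; i /= 2) {
        if (tid < i) sdata[tid] += sdata[tid + i];
        __syncthreads();
    }
  After:
    // loop only down to the last warp
    for (int i = max_size / 2; i > 32; i /= 2) {
        if (tid < i) sdata[tid] += sdata[tid + i];
        __syncthreads();
    }
    // last warp, unrolled; relies on warp-synchronous execution, so sdata
    // must be volatile (newer GPUs also need __syncwarp() between steps)
    if (tid < 32) {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        // etc…
    }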
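A complete version of the unrolled tail, written to be safe on GPUs with independent thread scheduling, could look like the sketch below. It departs from the slide in one respect: each step reads, calls __syncwarp(), writes, and calls __syncwarp() again, instead of relying on implicit warp-synchronous execution. It assumes blockDim.x is a power of two and at least 64. The names are hypothetical and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Unrolled reduction of the last 64 shared-memory values by the first warp.
// Must be called by all 32 threads of warp 0.
__device__ void warpReduceSketch(volatile float *sdata, int tid) {
    float v = sdata[tid];
    v += sdata[tid + 32]; __syncwarp(); sdata[tid] = v; __syncwarp();
    v += sdata[tid + 16]; __syncwarp(); sdata[tid] = v; __syncwarp();
    v += sdata[tid + 8];  __syncwarp(); sdata[tid] = v; __syncwarp();
    v += sdata[tid + 4];  __syncwarp(); sdata[tid] = v; __syncwarp();
    v += sdata[tid + 2];  __syncwarp(); sdata[tid] = v; __syncwarp();
    v += sdata[tid + 1];  __syncwarp(); sdata[tid] = v; __syncwarp();
}

__global__ void reduceUnrolledK(const float *in, float *out, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    // Block-wide steps stop once only 64 values remain.
    for (int s = blockDim.x / 2; s > 32; s /= 2) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    // The first warp finishes without a loop or block-wide barriers.
    if (tid < 32) warpReduceSketch(sdata, tid);
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Host helper: one partial sum per block, added up on the host for brevity.
float sumUnrolledSketch(const std::vector<float> &h) {
    int n = static_cast<int>(h.size());
    if (n == 0) return 0.0f;
    const int threads = 256;  // power of two, at least 64
    int blocks = (n + threads - 1) / threads;
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, blocks * sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    reduceUnrolledK<<<blocks, threads, threads * sizeof(float)>>>(dIn, dOut, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dOut, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn); cudaFree(dOut);
    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```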

20 Optimizations
- Advanced unrolling: templates
  - Exploit the compiler to resolve some conditions at compile time
  - Use templated functions (as in C++)
  - Ex.:
    template <unsigned int blockSize>
    __global__ void kernel(…) {
        if (blockSize >= 512) { /* some reduction code */ }
        else if (blockSize >= 256) { /* some reduction code */ }
        // etc…
    }
  - Then, call the templated function on the host:
    kernel<blockSize><<<…>>>(…);

21 Optimizations
- Works well with a switch statement:
    switch (numThreads) {
        case 512: kernel<512><<<…>>>(…); break;
        case 256: kernel<256><<<…>>>(…); break;
        case 128: kernel<128><<<…>>>(…); break;
        // etc…
    }
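Putting the template and the switch together, a minimal sketch might read as follows. The block size is a compile-time constant, so the blockSize comparisons are resolved by the compiler and the untaken steps disappear from the generated code. For brevity the last warp is finished with __shfl_down_sync rather than the shared-memory unrolling shown earlier; that substitution, the supported sizes (128, 256, 512), and all names are assumptions of this sketch.

```cuda
#include <cuda_runtime.h>
#include <stdexcept>
#include <vector>

template <unsigned int blockSize>
__global__ void reduceTemplatedK(const float *in, float *out, int n) {
    __shared__ float sdata[blockSize];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockSize + tid;
    sdata[tid] = (i < static_cast<unsigned int>(n)) ? in[i] : 0.0f;
    __syncthreads();

    // blockSize is known at compile time, so untaken steps are compiled out.
    if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (blockSize >= 128) { if (tid < 64)  sdata[tid] += sdata[tid + 64];  __syncthreads(); }

    // Final warp: fold 64 values to 32, then reduce within the warp.
    if (tid < 32) {
        float v = sdata[tid] + sdata[tid + 32];
        for (int offset = 16; offset > 0; offset /= 2)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        if (tid == 0) out[blockIdx.x] = v;
    }
}

// Host helper: picks the template instantiation that matches numThreads.
float sumTemplatedSketch(const std::vector<float> &h, int numThreads) {
    if (numThreads != 512 && numThreads != 256 && numThreads != 128)
        throw std::invalid_argument("unsupported block size");
    int n = static_cast<int>(h.size());
    if (n == 0) return 0.0f;
    int blocks = (n + numThreads - 1) / numThreads;
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, blocks * sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    switch (numThreads) {
        case 512: reduceTemplatedK<512><<<blocks, 512>>>(dIn, dOut, n); break;
        case 256: reduceTemplatedK<256><<<blocks, 256>>>(dIn, dOut, n); break;
        case 128: reduceTemplatedK<128><<<blocks, 128>>>(dIn, dOut, n); break;
    }

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dOut, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn); cudaFree(dOut);
    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

The switch is needed because a template argument must be a compile-time constant, while numThreads is only known at run time; each case selects one precompiled instantiation.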
