Midterm Project Toby Heyn ME964 11/18/2008.

Presentation transcript:

Midterm Project Toby Heyn ME964 11/18/2008

Collision Detection
Goal: given a distribution of spheres in space, determine information about the collisions between bodies.
Algorithm: spatial subdivision.

Spatial Subdivision
Partition space into a uniform grid of cells, with cell size based on the largest object.
For each object, determine which cells the object overlaps.
Objects can only collide if they occupy the same cell.
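The cell-assignment step above can be sketched on the CPU. The grid layout and the row-major cell numbering here are assumptions for illustration; the slides do not show the actual formula used.

```cpp
#include <cmath>

// Hypothetical uniform grid: cellSize is the largest sphere diameter.
struct Grid {
    float cellSize;   // edge length of one cell
    int nx, ny, nz;   // number of cells along each axis
};

// Map a point to a flat cell ID (row-major ordering).  Objects whose
// positions map to the same ID occupy the same cell and so are the
// only candidates for collision.
int cellIdOf(const Grid& g, float x, float y, float z) {
    int cx = static_cast<int>(std::floor(x / g.cellSize));
    int cy = static_cast<int>(std::floor(y / g.cellSize));
    int cz = static_cast<int>(std::floor(z / g.cellSize));
    return (cz * g.ny + cy) * g.nx + cx;
}
```

In the GPU version each thread evaluates this for its own sphere's center (and the neighboring cells its volume overlaps).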

Spatial Subdivision
1. Construct Cell ID Array: each thread determines the cell IDs of the cells its sphere occupies and loads them into the Cell ID Array.
2. Sort Cell ID Array: radix sort algorithm.
3. Create Collision Cell List: scan the sorted Cell ID Array looking for changes in cell ID; write the Collision Cell List with Cell ID Array indices and the number of objects in each cell.
4. Traverse Collision Cell List: one thread per Collision Cell; each thread checks all collision pairs in its Collision Cell, and collisions are written to the output.

Construct Cell ID Array
One thread per body, 256 threads per block (due to register usage).
Cell ID array: type int2 (cellID, bodyID), size 8*Num_bodies.
Each thread:
- Calculates the cell ID of the center of its sphere
- Checks each of the 26 surrounding cells to see if this body is within those cells
- Loads this data into the cell ID array in the appropriate place
Each block sums the number of entries added and writes it to another array; a parallel scan is used to find the total number of entries added.

Sort Cell ID Array
Radix sort sorts the cell IDs in several passes, handling low-order bits before higher-order bits and retaining the order of entries with the same cell ID; this stability helps in a later step.
It takes 4 passes to sort the 32-bit (4-byte) integers, and it makes use of the parallel scan operation.
Used the radix sort from the 'particles' project in the SDK; future work: use an O(N) radix sort.
Elements not set in the previous step are sorted to the end of the array.

Create Collision Cell List
One thread per item in the cell ID array, 256 threads per block.
Each thread:
- Gets the cell ID for this index (cellid1)
- Gets the cell ID for the previous index (cellid2)
- If the cell IDs are different, writes the index to another array at location cellid1
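Sequentially, the boundary scan above amounts to walking the sorted array and starting a new run whenever the cell ID changes. A sketch that returns each run as a (start index, object count) pair:

```cpp
#include <vector>
#include <utility>

// Scan a sorted cell ID array and emit (startIndex, count) for each
// run of equal cell IDs -- the "Collision Cell List".
std::vector<std::pair<int, int>> collisionCells(const std::vector<int>& sortedIds) {
    std::vector<std::pair<int, int>> cells;
    for (int i = 0; i < static_cast<int>(sortedIds.size()); ++i) {
        if (i == 0 || sortedIds[i] != sortedIds[i - 1])
            cells.push_back({i, 1});   // cell ID changed: a new run starts here
        else
            ++cells.back().second;     // same cell ID: extend the current run
    }
    return cells;
}
```

In the parallel version each thread only compares its own entry with the previous one, so no thread needs more than two reads.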

Traverse Collision Cell List
One thread per collision cell.
Allocate space for collision data, assuming some constant maximum number of contacts per cell (for example, 20).
Each thread:
- Identifies the number of bodies in its cell
- Performs an exhaustive search for collisions between these bodies
- Saves any collision data to the appropriate location
Collision data can then be sorted, etc., in post-processing.
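The exhaustive per-cell search is a plain all-pairs sphere test over the handful of bodies in one cell. A sketch (the `Sphere` struct is an assumption; the deck stores positions and radii separately):

```cpp
#include <vector>
#include <utility>

struct Sphere { float x, y, z, r; };

// All-pairs contact test over the bodies of one collision cell, as a
// single thread does it.  Compares squared distances to avoid sqrt.
std::vector<std::pair<int, int>> pairsInCell(const std::vector<Sphere>& s) {
    std::vector<std::pair<int, int>> hits;
    for (std::size_t i = 0; i < s.size(); ++i)
        for (std::size_t j = i + 1; j < s.size(); ++j) {
            float dx = s[i].x - s[j].x;
            float dy = s[i].y - s[j].y;
            float dz = s[i].z - s[j].z;
            float sum = s[i].r + s[j].r;
            if (dx * dx + dy * dy + dz * dz <= sum * sum)  // touching or penetrating
                hits.push_back({static_cast<int>(i), static_cast<int>(j)});
        }
    return hits;
}
```

With at most a few bodies per cell, the quadratic cost per thread stays small even though the global algorithm avoids an N² sweep.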

Remaining Issues
Two bodies may both be in cell A and in cell B, so their collision would be found twice; it must therefore be associated with a single cell.
Solution: find the collision points on the two bodies, and associate the collision with whichever cell the midpoint falls in.
BUT: this is not working for all cases.

Preliminary Results

Bodies   CUDA contacts  Bullet contacts  CUDA time  Bullet time  Speedup
1024     335            –                0.032      0.000        –
2048     712            –                0.047      1.01E-09     –
4096     1463           1464             0.015      0.319        –
8192     2966           2969             0.016      0.340        –
16384    5807           5811             0.063      0.062        0.984
32768    11659          11672            0.094      0.156        1.660
65536    23467          23479            0.141      0.391        2.773
131072   48955          49004            0.25       0.937        3.748
262144   105819         105920           0.469      2.359        5.030
524288   190552         190747           0.875      5.172        5.911

Preliminary Results

Collision Detection: Design & Results Brandon Smith November 18, 2008 ME 964

contact_data Allocation
Possible ways to allocate the contact_data array:
1. Allocate contact_data[ N(N-1)/2 ]
2. Allocate contact_data[ n_contacts ]
To avoid creating a huge array, I chose the second method:
- 1st kernel call: find the number of contacts
- 2nd kernel call: calculate the contact_data for each contact
contact_data is sorted using qsort after being calculated.
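The count-then-fill pattern above can be sketched on the CPU. The predicate and contact constructor are stand-ins for the real collision test; the point is that the first pass establishes the exact output size so the second pass can write into a right-sized array.

```cpp
#include <vector>

// Two-pass allocation: pass 1 only counts hits, pass 2 fills the
// exactly-sized result.  Mirrors the two kernel calls on the slide.
template <class IsContact, class MakeContact>
std::vector<int> twoPass(int nPairs, IsContact isContact, MakeContact makeContact) {
    int n = 0;
    for (int p = 0; p < nPairs; ++p)      // 1st pass: count contacts
        if (isContact(p)) ++n;

    std::vector<int> contacts;
    contacts.reserve(n);                  // allocate n_contacts, not N(N-1)/2
    for (int p = 0; p < nPairs; ++p)      // 2nd pass: compute contact data
        if (isContact(p)) contacts.push_back(makeContact(p));
    return contacts;
}
```

The test is run twice, but the memory saving is large: n_contacts is far smaller than N(N-1)/2.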

Program Layout
main (C++)
  cudaCollisions
    malloc contact_data
    find_collisions<<< >>>  (__global__)
      test()  (__device__)

Kernel Call Setup
(Diagram: upper-triangular numbering of the contact tests over body indices i, j.)
The SIMD pattern is implied. The total number of contact tests is n_tests = N(N-1)/2.
The kernel is called once per body. Blocks of 256 threads are used: 3 blocks x 256 threads = a full SM (768 threads).
At least i threads are spawned, in block increments.
In the kernel the j index is found: j = bx*THREADS_PER_BLOCK + tx.
If (i==0 && j==0), the contact counter (a global variable) is set to zero.
A device function is called for the actual test, passing in: i, j, x, y, z, r, contact_data, n_contacts, pass.

Device Function
Variables are stored in shared or local memory, not global:
- Thread 0 saves the variables of body i to shared memory because every thread uses them: xi, yi, zi, ri
- Threads are synchronized
- Each thread keeps its own local variables for body j: xj, yj, zj, rj
Using local and shared memory increased speed 20x.
The distance between centers is calculated for the collision check. The first pass simply tests for contact; the second pass proceeds to calculate contact_data.
atomicAdd is used to count the number of contacts: it keeps one contact tally for all threads, so there is no need to condense results from each thread.
The normal unit vector and contact positions are calculated.
Custom build step:
nvcc.exe -ccbin "C:\Program Files\Microsoft Visual Studio 8\VC\bin" -c -arch sm_11 -D_CONSOLE -Xcompiler "/EHsc /W3 /nologo /Wp64 /O2 /Zi /MT " -I"C:\CUDA\include" -I"C:\Program Files\NVIDIA Corporation\NVIDIA CUDA SDK\common\inc" -o Release\collide.obj collide.cu

Not everything went smoothly:
- Bullet ordered IDs inconsistently: changed the Bullet source
- Validation failed after ~250k bodies because CUDA differed from Bullet by 0.1: changed the Bullet source to 0.11
- Debugging is difficult when errors only occur in "Release" mode: can't use the debugger or print statements

Timing Results

Bodies    CUDA     n² model   Bullet   nlog(n) model
1024      0.11     0.000524   0.047    0.015413
4096      0.25     0.008389   0.078    0.073981
16384     0.906    0.134218   0.281    0.345245
65536     3.875    2.147484   1.156    1.578264
262144    35.828   34.35974   6.234    7.102189
1048576   511      549.7558   –        31.56528

Passed the Bullet test. For large n, Bullet seems to scale as nlog(n) while CUDA scales as n². Bullet took too long for 1M bodies due to memory swapping on the hard drive.

Collision Detection Midterm Presentation Makarand Datar

Methodology: Brute Force Technique
Upper triangle: check for contacts. (Diagram: N x N upper-triangular matrix of pair checks.)

Methodology: Brute Force Technique – Memory Usage
Diagonal entries are loaded into shared memory; size N, with NumberOfBlocks*BLOCK_SIZE = N.
Entries to the right of each diagonal entry are read from global memory.
Disadvantages:
- The number of bodies must be a multiple of BLOCK_SIZE
- Many global memory accesses

Methodology: Brute Force Technique – Setbacks
Time-out error beyond 2^16 = 65536 bodies; works in EmuDebug mode.
Another approach tried: an array of length equal to the number of bodies was initialized to zeros on the host and copied to the device; values are incremented when a contact is detected, then copied back to the host and a scan is performed. This resulted in the same error.

CUDA and Bullet Timings

Number of Bodies   Bullet times   CUDA times
1024               15             77.814766000
2048               16             90.596352000
4096               31             148.17851300
8192               47             299.69787600
16384              110            952.95739700
32768              250            3496.0615230
65536              –              13392.455078

Bullet: Scaling of time

CUDA: Scaling of time

Chapter 5: Patterns
I think the following patterns were used in my design:
- Programming structure: "Single Program Multiple Data" (SPMD)
- Data structure pattern: "Distributed array pattern", used when a single array needs to be partitioned between multiple UEs

Broad phase collision detection with CUDA Justin Madsen ME 964 midterm project

Outline
- Change of plans
- Spatial subdivision: algorithm, and complications of a parallel implementation
- CUDA implementation: initialize and build the Cell ID array, sort the Cell ID array, Collision Cell array
- Plans for the final project

Spatial Subdivision Algorithm
Divide the collision space into 'boxes' with edge length equal to the diameter of the largest sphere.
For each object, determine which 'boxes' the centroid & volume occupy.
Check for collision only if:
- Two objects are present in a box, and
- At least 1 object centroid is present in the box, and
- The collision has not already been checked

Complications of the Parallel Version
Threads process many 'boxes' in parallel, and we don't want multiple threads updating the same object at the same time.
The box edge length is ≥ the object diameter, so a single object occupies a maximum of 8 boxes.
Do 8 passes over the spatial boxes such that adjacent 'boxes' are updated in different passes.
We also don't want to evaluate the same collision in multiple passes, so we keep track of the 'box' occupied by each object's centroid.

Cell ID Array
Initialize the spatial grid ('box' edge length, # of boxes in the x, y, z directions): sequential.
Determine which 'boxes' each object intersects and create a hash value for the 'box' IDs: 1 thread per object, up to 2^3 = 8 cell IDs per object (task parallel pattern).
The total number of cell IDs is found with an inclusive scan (recursive data pattern).

Sort Cell ID Array
Not every object intersects 8 'boxes', so there is a lot of empty space in the Cell ID array; non-empty cell IDs are consolidated to the front of the array with radix sort, and the bulk of the non-collisions are weeded out here.
Radix sort is a bit-wise, stable sorting algorithm:
- Phase 1: radix counters → task parallel
- Phase 2: prefix sum → data parallel
- Phase 3: reordering → task parallel

Collision Cell Array
The sorted Cell ID array is scanned for possible collisions for narrow phase processing. The array is scanned twice:
- First scan: count the number of possible collisions, then an exclusive scan finds the array offsets
- Second scan: create the collision cell arrays
Once the collision cell arrays are created, assign one thread per array to evaluate narrow phase collisions; this is easy for spheres (compare centroids and radii).
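The exclusive scan that turns per-cell counts into write offsets can be sketched sequentially: each output element is the sum of all earlier counts, so the second scan knows exactly where to write each cell's entries.

```cpp
#include <vector>

// Exclusive prefix sum: offsets[i] = counts[0] + ... + counts[i-1].
// Used to convert per-cell collision counts into array write offsets.
std::vector<int> exclusiveScan(const std::vector<int>& counts) {
    std::vector<int> offsets(counts.size());
    int running = 0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        offsets[i] = running;   // everything strictly before element i
        running += counts[i];
    }
    return offsets;
}
```

On the GPU this is done with a parallel scan, but the resulting offsets are identical.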

Final Project Plans
Collision detection to be used for granular dynamics: use existing algorithms to determine the dynamics of a system with many contacts.
Integrate my collision detection program into existing software (Bullet or Chrono::Engine).
Extra work to be done: the collision detection program must be able to iterate over time (i.e., be executed multiple times) and be able to compute contact forces.

ME 964 – Midterm Project Saigopal Nelaturi

Approach
Max number of collisions = n(n-1)/2; use one thread per possible collision on the GPU. For n = 2^20 bodies, the number of threads = 549,755,289,600.
Max number of threads on a GTX 280 ≈ 2048x using 2-d blocks in a 2-d grid.
SIMD paradigm.

Algorithm
All possible collisions can be characterized as binary entries in a 2-d array; the entry indices are shown below, n(n-1)/2 in total.
Use this indexing scheme to determine a bijection.

Bijective Correspondence
Given an entry id k, find a map from k to the pair of body indices; consider the first few entries.

Bijective Correspondence
In general the same mapping applies; it always works, as an example confirms.

Implementation
Allocate device memory; for too many bodies the kernel will not execute!
Otherwise, compute the entry id from the block and thread ids, apply the bijection, compute the collision, and write into global memory using atomic add. This works well but is inefficient.

Complexity

Workarounds
- Assume the number of collisions is bounded by 5*n
- Do not rely on atomicAdd(): use a vector reduction
- Others …

Midterm Project Ram Subramanian

The Task
To solve a collision detection problem: given an arbitrary number of rigid spheres with known radii, distributed in 3D space, find out which spheres are in contact/penetration with which other spheres.

Indexing Scheme 1
Every thread gets a reference body (Body A) and a comparison body (Body B).
Each block has 512 threads (assumption 1); each row in the grid has 512 blocks (assumption 2).
The total number of threads is n(n-1)/2.
Compute an index value from the thread ID and block ID; using this index value and the number of bodies (via div and mod), the indices of Body A and Body B, respectively, can be determined.

Indexing Scheme 2
Loop over every body in the list; each kernel call passes body A and the list of bodies with id value greater than A.
With every iteration of the loop we have fewer threads: n-1, n-2, n-3, …, 1.

Indexing Scheme 3
We launch n*(n-1)/2 threads with an indexing scheme that captures all the relevant test cases without repetitions.
Knowing we have n*(n-1)/2 elements, we can compute the indices of bodies A and B from each element index k:
A = floor(0.5 + sqrt(2*(float)k + 0.25));
B = k - (A*(A-1)/2);
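The closed-form mapping above is the inverse triangular-number formula: it sends each linear thread index k in [0, n(n-1)/2) to a unique unordered pair (i, j) with i < j. A CPU sketch, with a small guard against floating-point rounding at the triangular-number boundaries:

```cpp
#include <cmath>
#include <utility>

// Bijection: linear index k -> pair (i, j), i < j, where pairs are
// enumerated as k = j*(j-1)/2 + i.  One thread per pair, no repeats.
std::pair<int, int> pairFromIndex(long long k) {
    // j is the largest value with j*(j-1)/2 <= k.
    long long j = static_cast<long long>(
        std::floor(0.5 + std::sqrt(2.0 * static_cast<double>(k) + 0.25)));
    if (j * (j - 1) / 2 > k) --j;   // correct a possible sqrt overshoot
    long long i = k - j * (j - 1) / 2;
    return {static_cast<int>(i), static_cast<int>(j)};
}
```

For n = 4 the indices 0…5 map to (0,1), (0,2), (1,2), (0,3), (1,3), (2,3): every pair exactly once.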

Problems Noted
- atomicAdd
- Not enough memory
- Launching blocks and threads
- Integer overflow
- emuDebug vs Debug

Note on atomicAdd()
Test code was run with different numbers of bodies, from n = 512 up to 16,384 (number of threads = n*(n-1)/2).
Problems occur at n = 32,768, where the number of threads = 536,854,528.
When the number of threads is around 268,697,600, atomicAdd cannot be trusted.

Not Enough Memory
When allocating space for n*(n-1)/2 entries with n = 32,768 bodies, n*(n-1)/2 = 536,854,528.
Required memory = 2,147,418,112 bytes = 2 GB; available on the 8600 GTS = 268,107,776 bytes = 256 MB.

Launching Blocks and Threads
The choice of block size and thread count strongly affects the run.
When computing the size of the grid, check for round-off errors in the integer arithmetic.

Integer Overflow
Integer arithmetic overflows at 2,147,483,648 (2^31). When using n*(n-1)/2 with n > 65,536, the intermediate product n*(n-1) exceeds this limit (the result alone already reaches 2,147,450,880 at n = 65,536), causing an overflow:
int numBlocks = (numBodies*(numBodies-1)/2) / blockSize;      // overflows
int numBlocks = ((numBodies/blockSize)*(numBodies-1)/2);      // divides first
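The two lines above differ only in when the division happens. An equivalent fix, sketched here, is to widen to 64 bits before the multiply (the slide's reordered line instead divides by the block size first, which assumes numBodies is a multiple of blockSize):

```cpp
// Pair count n*(n-1)/2 without 32-bit overflow: widen n to 64 bits
// before forming the product.  n and n-1 have opposite parity, so
// the division by 2 is always exact.
long long pairCount(int n) {
    return static_cast<long long>(n) * (n - 1) / 2;
}
```

With a plain 32-bit int, the intermediate product n*(n-1) already wraps for n = 65,536, even though the final quotient still fits.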

In the end… fixing up
Even after each bug was fixed:
- Replaced most atomicAdd calls with a vector reduction (contention reduced to about 5, at most 'n')
- Used "long int"
- Checked intermediate integer computations
- Reduced the pre-allocated memory below the limit