Midterm Project Toby Heyn ME964 11/18/2008.


Collision Detection
Goal: Given a distribution of spheres in space, determine some information about the collisions between bodies
Algorithm: Spatial Subdivision

Spatial Subdivision
Partition space into a uniform grid (cells)
Size of cell based on largest object
For each object, determine which cells the object overlaps
Objects can only collide if they occupy the same cell
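As a rough illustration of the partitioning step, the sketch below maps a point to a flattened cell ID. The `Grid` struct, its field names, and the x-fastest flattening order are assumptions made for this example; the slides do not say how cell IDs are encoded.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical grid description; none of these names come from the slides.
struct Grid {
    float originX, originY, originZ; // minimum corner of the gridded region
    float cellSize;                  // edge length, chosen >= diameter of the largest sphere
    int nx, ny, nz;                  // number of cells along each axis
};

// Index along one axis of the cell that contains coordinate p.
inline int cellCoord(float p, float origin, float cellSize) {
    return static_cast<int>(std::floor((p - origin) / cellSize));
}

// Flatten (ix, iy, iz) into one integer cell ID, with x varying fastest.
inline int cellId(const Grid& g, float x, float y, float z) {
    const int ix = cellCoord(x, g.originX, g.cellSize);
    const int iy = cellCoord(y, g.originY, g.cellSize);
    const int iz = cellCoord(z, g.originZ, g.cellSize);
    // This sketch requires the point to lie inside the grid.
    assert(ix >= 0 && ix < g.nx && iy >= 0 && iy < g.ny && iz >= 0 && iz < g.nz);
    return ix + g.nx * (iy + g.ny * iz);
}
```

With a 4x4x4 grid of edge-2 cells starting at the origin, the point (2.5, 0.5, 0.5) falls in cell 1 and (0.5, 2.5, 0.5) in cell 4.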

Spatial Subdivision
Construct Cell ID Array
  Each thread determines the cell IDs of the cells its sphere occupies, loads into Cell ID Array
Sort Cell ID Array
  Radix Sort Algorithm
Create Collision Cell List
  Scan sorted Cell ID Array, look for changes in cell ID
  Write Collision Cell List with Cell ID Array indices, number of objects in the cell
Traverse Collision Cell List
  One thread per Collision Cell
  Each thread checks all collision pairs in the Collision Cell
  Collisions are written to output

Construct Cell ID Array
One thread per body
256 threads per block (due to register usage)
Cell ID array
  Type: int2 (cellID, bodyID)
  Size: 8*Num_bodies
Each thread…
  Calculates the cell ID of the center of the sphere
  Checks each of the 26 surrounding cells to see if this body is within those cells
  Loads this data into cell ID array in appropriate place
Each block…
  Sums the number of entries added, writes to another array
A parallel scan is used to find the total number of entries added
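The per-thread work described above can be sketched on the host as follows. This is a sequential illustration under stated assumptions, not the author's kernel: the grid origin is taken to be (0,0,0), there are `n` cubic cells per axis, and the overlap test (closest point of the cell's box to the sphere center) is one common choice that the slides do not spell out. The names `Sphere`, `axisGapSq` and `overlappedCells` are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Sphere { float x, y, z, r; };

// Squared distance from coordinate c to the interval [lo, hi] (0 if c is inside).
inline float axisGapSq(float c, float lo, float hi) {
    const float nearest = std::max(lo, std::min(c, hi));
    return (c - nearest) * (c - nearest);
}

// IDs of every cell the sphere overlaps, checking the center cell and its
// 26 neighbors. Assumes cell edge `cell` >= sphere diameter, so at most
// 8 cells can match (hence the 8*Num_bodies array size).
std::vector<int> overlappedCells(const Sphere& s, float cell, int n) {
    std::vector<int> ids;
    const int cx = static_cast<int>(std::floor(s.x / cell));
    const int cy = static_cast<int>(std::floor(s.y / cell));
    const int cz = static_cast<int>(std::floor(s.z / cell));
    for (int iz = cz - 1; iz <= cz + 1; ++iz)
        for (int iy = cy - 1; iy <= cy + 1; ++iy)
            for (int ix = cx - 1; ix <= cx + 1; ++ix) {
                if (ix < 0 || iy < 0 || iz < 0 || ix >= n || iy >= n || iz >= n)
                    continue; // neighbor lies outside the grid
                const float d2 = axisGapSq(s.x, ix * cell, (ix + 1) * cell)
                               + axisGapSq(s.y, iy * cell, (iy + 1) * cell)
                               + axisGapSq(s.z, iz * cell, (iz + 1) * cell);
                if (d2 < s.r * s.r)
                    ids.push_back(ix + n * (iy + n * iz));
            }
    return ids;
}
```

A sphere well inside one cell yields a single ID, while a sphere near a cell corner yields the full eight.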

Sort Cell ID Array
Radix Sort
  Sorts cell IDs in several passes
  Sorts low order bits before higher order bits, retaining order of IDs with same cell ID
    This helps in a later step
  Takes 4 passes to sort the 32 bit (4 byte) integers
  Makes use of parallel scan operation
Used radix sort from ‘particles’ project in SDK
  Future: use radix sort of O(N)
Elements not set in the previous step are sorted to the end of the array
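For reference, a sequential least-significant-digit radix sort with four 8-bit passes looks roughly like this. It is a host-side sketch, not the SDK ‘particles’ implementation; `CellEntry` and `radixSortByCell` are names invented here, and unset entries are assumed to carry the key 0xFFFFFFFF so that they end up last.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

struct CellEntry {
    std::uint32_t cellId; // sort key
    std::uint32_t bodyId; // payload carried along with the key
};

// Stable LSD radix sort on cellId: 4 passes of 8 bits each.
void radixSortByCell(std::vector<CellEntry>& a) {
    std::vector<CellEntry> tmp(a.size());
    for (int pass = 0; pass < 4; ++pass) {
        const int shift = 8 * pass;
        // Histogram of the current digit, stored one slot to the right.
        std::array<std::size_t, 257> count{};
        for (const CellEntry& e : a)
            ++count[((e.cellId >> shift) & 0xFFu) + 1];
        // Prefix sum: count[d] becomes the first output slot for digit d.
        for (int d = 0; d < 256; ++d)
            count[d + 1] += count[d];
        // Scatter in input order, which keeps equal digits in their old order.
        for (const CellEntry& e : a)
            tmp[count[(e.cellId >> shift) & 0xFFu]++] = e;
        a.swap(tmp);
    }
}
```

Because each pass is stable, entries with the same cell ID keep their relative order, which is the property the slide relies on for the later step.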

Create Collision Cell List
One thread per item in cell ID array
256 threads per block
Each thread
  Gets the cell ID for this index (cellid1)
  Gets the cell ID for the previous index (cellid2)
  If the cell IDs are different, writes the index to another array at location cellid1
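The boundary detection can be sketched sequentially as below. The comparison of each entry with its predecessor is the one described in the slide; recording a (start, count) pair per cell and keeping only cells with at least two bodies are assumptions of this example, as are the names `CellRun` and `buildCollisionCells`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct CellRun {
    std::size_t start; // index of the cell's first entry in the sorted array
    std::size_t count; // number of entries (bodies) in the cell
};

// Scan the sorted cell IDs and report every run of equal IDs that
// contains at least two bodies, since only those can hold a collision.
std::vector<CellRun> buildCollisionCells(const std::vector<std::uint32_t>& sortedIds) {
    std::vector<CellRun> runs;
    std::size_t start = 0;
    for (std::size_t i = 1; i <= sortedIds.size(); ++i) {
        // A run ends where the cell ID changes, or at the end of the array.
        if (i == sortedIds.size() || sortedIds[i] != sortedIds[i - 1]) {
            if (i - start >= 2)
                runs.push_back({start, i - start});
            start = i;
        }
    }
    return runs;
}
```

In the parallel version each comparison is independent, which is why one thread per array item works.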

Traverse Collision Cell List
One thread per collision cell
Allocate space for collision data
  Assume some constant maximum number of contacts per cell (for example 20)
Each thread…
  Identifies the number of bodies in its cell
  Performs an exhaustive search for collisions between these bodies
  Saves any collision data to the appropriate location
Collision data can then be sorted, etc. in post-processing
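What a single thread does for its cell might look like the following sketch. The fixed cap of 20 mirrors the example figure in the slide; dropping contacts once the buffer is full, and the names `Contact`, `spheresTouch` and `collideCell`, are assumptions of this illustration.

```cpp
#include <cstddef>
#include <vector>

struct Sphere { float x, y, z, r; };
struct Contact { int bodyA, bodyB; };

constexpr int kMaxContactsPerCell = 20; // the "for example 20" cap

// Two spheres touch or penetrate when the center distance is at most r1 + r2.
inline bool spheresTouch(const Sphere& a, const Sphere& b) {
    const float dx = b.x - a.x, dy = b.y - a.y, dz = b.z - a.z;
    const float reach = a.r + b.r;
    return dx * dx + dy * dy + dz * dz <= reach * reach;
}

// Exhaustive pair search over the bodies of one cell. Writes at most
// kMaxContactsPerCell contacts to `out` and returns how many were written.
int collideCell(const std::vector<Sphere>& spheres,
                const std::vector<int>& bodiesInCell,
                Contact* out) {
    int found = 0;
    for (std::size_t i = 0; i < bodiesInCell.size(); ++i)
        for (std::size_t j = i + 1; j < bodiesInCell.size(); ++j) {
            const int a = bodiesInCell[i];
            const int b = bodiesInCell[j];
            if (!spheresTouch(spheres[a], spheres[b]))
                continue;
            if (found == kMaxContactsPerCell)
                return found; // buffer full: further contacts are dropped
            out[found++] = {a, b};
        }
    return found;
}
```

Giving every cell the same fixed-size output slot is what lets each thread write without coordinating with the others.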

Remaining Issues
Two bodies may both be in cell A and in cell B
  This collision would be found twice, so it must be associated with a single cell
  Solution: Find collision points on the two bodies, associate the collision with whichever cell the midpoint falls in
  BUT: This is not working for all cases

Preliminary Results
Columns: Bodies, CUDA contacts found, Bullet contacts found, CUDA time, Bullet time, Speedup
1024 335 0.032 0.000
2048 712 0.047 1.01E-09
4096 1463 1464 0.015 0.319
8192 2966 2969 0.016 0.340
16384 5807 5811 0.063 0.062 0.984
32768 11659 11672 0.094 0.156 1.660
65536 23467 23479 0.141 0.391 2.773
131072 48955 49004 0.25 0.937 3.748
262144 105819 105920 0.469 2.359 5.030
524288 190552 190747 0.875 5.172 5.911

Preliminary Results

Collision Detection: Design & Results
Brandon Smith November 18, 2008 ME 964

contact_data Allocation
Possible ways to allocate the contactdata array:
  Allocate contactdata[ N(N-1)/2 ]
  Allocate contactdata[ n_contacts ]
To avoid creating a huge array, I chose the second method:
  1st Kernel Call: Find the number of contacts.
  2nd Kernel Call: Calculate the contactdata for each contact.
contactdata is sorted using qsort after being calculated.
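The count-then-fill idea can be illustrated with a sequential stand-in for the two kernel calls. This is only a sketch of the pattern: the real version runs the pair tests on the GPU, and `Sphere`, `Contact`, `inContact` and `findContacts` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

struct Sphere { float x, y, z, r; };
struct Contact { std::size_t i, j; };

inline bool inContact(const Sphere& a, const Sphere& b) {
    const float dx = b.x - a.x, dy = b.y - a.y, dz = b.z - a.z;
    const float reach = a.r + b.r;
    return dx * dx + dy * dy + dz * dz <= reach * reach;
}

std::vector<Contact> findContacts(const std::vector<Sphere>& s) {
    // Pass 1: only count the contacts (stands in for the 1st kernel call).
    std::size_t nContacts = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        for (std::size_t j = i + 1; j < s.size(); ++j)
            if (inContact(s[i], s[j]))
                ++nContacts;

    // Allocate exactly n_contacts entries instead of N(N-1)/2.
    std::vector<Contact> contacts;
    contacts.reserve(nContacts);

    // Pass 2: repeat the tests and record the data (2nd kernel call).
    for (std::size_t i = 0; i < s.size(); ++i)
        for (std::size_t j = i + 1; j < s.size(); ++j)
            if (inContact(s[i], s[j]))
                contacts.push_back({i, j});
    return contacts;
}
```

The trade-off is that every pair test runs twice in exchange for an output array sized to the actual number of contacts.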

Program Layout C++ main { cudaCollisions {
__global__ find_collisions<<< >>> { __device__ test() } malloc contactdata

Kernel Call Setup
SIMD pattern is implied.
The total number of contact tests is: n_tests = N(N-1)/2
Kernel is called once per body.
Blocks of 256 threads are used.
  3 blocks x 256 threads = full SM (768 threads)
At least i threads are spawned, in block increments.
In the kernel the j index is found.
  j = bx*THREADS_PER_BLOCK + tx
If (i==0 && j==0) the contact counter (global variable) is set to zero.
A device function is called for the actual test.
  Pass in: i, j, x, y, z, r, contactdata, n_contacts, pass

Device Function
Variables are stored in shared or local memory, not global:
  Thread 0 saves variables of i to shared memory because every thread uses these: xi,yi,zi,ri
  Threads are synchronized.
  Each thread keeps different local variables for j: xj,yj,zj,rj
  Using local and shared memory increased speed 20x.
Distance between centers is calculated for collision check.
  In the first pass it simply tests for contact.
  In the second pass it proceeds to calculate contactdata.
  Normal unit vector and contact positions are calculated.
atomicAdd is used to count the number of contacts
  Keeps one contact tally for all threads
  No need for condensation of results from each thread
Custom build step:
  nvcc.exe -ccbin "C:\Program Files\Microsoft Visual Studio 8\VC\bin" -c -arch sm_11 -D_CONSOLE -Xcompiler "/EHsc /W3 /nologo /Wp64 /O2 /Zi /MT " -I"C:\CUDA\include" -I"C:\Program Files\NVIDIA Corporation\NVIDIA CUDA SDK\common\inc" -o Release\collide.obj collide.cu
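The second-pass computation for one pair (i, j) could look like the host-side sketch below. The slides do not define the contactdata layout, so the `ContactData` fields, the convention that the normal points from body i to body j, and the choice of contact points on each sphere's surface along that normal are all assumptions.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };
struct Sphere { Vec3 c; float r; };

// Hypothetical contact record: unit normal from a to b, plus the point
// on each sphere's surface that lies along that normal.
struct ContactData {
    Vec3 normal;
    Vec3 pointOnA;
    Vec3 pointOnB;
};

// Returns false when the spheres are apart (or exactly concentric, where
// the normal is undefined in this sketch); otherwise fills `out`.
bool computeContact(const Sphere& a, const Sphere& b, ContactData& out) {
    const Vec3 d{b.c.x - a.c.x, b.c.y - a.c.y, b.c.z - a.c.z};
    const float dist2 = d.x * d.x + d.y * d.y + d.z * d.z;
    const float reach = a.r + b.r;
    if (dist2 > reach * reach)
        return false; // first-pass test: no contact
    const float dist = std::sqrt(dist2);
    if (dist == 0.0f)
        return false;
    out.normal = {d.x / dist, d.y / dist, d.z / dist};
    out.pointOnA = {a.c.x + a.r * out.normal.x,
                    a.c.y + a.r * out.normal.y,
                    a.c.z + a.r * out.normal.z};
    out.pointOnB = {b.c.x - b.r * out.normal.x,
                    b.c.y - b.r * out.normal.y,
                    b.c.z - b.r * out.normal.z};
    return true;
}
```

For unit spheres centered at (0,0,0) and (1.5,0,0), this gives the normal (1,0,0) and contact points at x = 1 on the first sphere and x = 0.5 on the second.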

Not everything went smoothly:
Bullet ordered IDs inconsistently.
  Changed Bullet source
Validation failed after ~250k bodies because CUDA differed from Bullet by 0.1.
  Changed Bullet source to 0.11
Debugging is difficult if errors only occur in “Release” mode.
  Can’t use the debugger or print statements

Timing Results
Passed the Bullet test.
Bodies    CUDA n^2    Bullet n log(n)
1024      0.11        0.047
4096      0.25        0.078
16384     0.906       0.281
65536     3.875       1.156
262144    35.828      6.234
          511         -
For large n, Bullet seems to scale as n log(n) while CUDA scales as n^2.
Bullet took too long for 1M bodies due to memory swapping on the hard drive.

Collision Detection Midterm Presentation
Makarand Datar

Methodology: Brute force technique
Upper Triangle: Check for contacts N

Memory Usage
Diagonal entries loaded in shared memory
  Size: N
  NumberOfBlocks*BLOCK_SIZE = N
Entries to the right of each diagonal entry are read from global
Disadvantages
  Number of bodies must be a multiple of BLOCK_SIZE
  Lots of global memory accesses

Setbacks
Time out error beyond 2^16 = 65536
  Works in EmuDebug mode
Another approach tried
  An array of length equal to the number of bodies was initialized to zeros on the host and copied to the device
  Values incremented if contact is detected
  Copied back to host and scan was performed
  Resulted in same error

CUDA and Bullet Timings
Number of Bodies Bullet times CUDA Times 1024 2048 4096 8192 16384 32768 65536 15 16 31 47 110 250

Bullet: Scaling of time

CUDA: Scaling of time

Chapter 5: Patterns
I think the following patterns were used in my design:
Programming Structure: “Single Program Multiple Data” (SPMD)
Data Structure Pattern: “Distributed array pattern”
  Used when a single array needs to be partitioned between multiple UEs

Broad phase collision detection with CUDA
Justin Madsen ME 964 midterm project

Outline
Change of plans
Spatial subdivision
  Algorithm
  Complications of parallel implementation
CUDA Implementation
  Initialize and build Cell ID array
  Sort Cell ID array
  Collision Cell array
Plans for final project

Spatial subdivision Algorithm
Divide collision space into ‘boxes’, edge length = diameter of largest sphere
For each object, determine which ‘boxes’ the centroid & volume occupy
Check for collision if:
  Two objects are present in a box, and
  At least 1 object centroid present in box, and
  Collision has not already been checked

Complications of parallel version
Threads process many ‘boxes’ in parallel
  Don’t want multiple threads updating object at the same time
  Box edge length ≥ object diameter
  Thus maximum of 8 boxes occupied by a single object
  Do 8 passes of spatial boxes such that adjacent ‘boxes’ are updated in different passes
Don’t want to evaluate same collision in multiple passes
  Keep track of ‘box’ occupied by each object’s centroid

Cell ID array
Initialize spatial grid
  ‘box’ edge length, # of boxes in x,y,z directions
  sequential
Determine which ‘boxes’ each object intersects. Create hash value for ‘box’ IDs
  1 thread per object
  Up to 2^3 = 8 cell IDs per object
  Task parallel pattern
Total number of cell IDs is found
  Inclusive scan
  Recursive data pattern

Sort Cell ID array
Not every object intersects 8 ‘boxes’
  Lots of empty space in Cell ID array
Consolidate non-empty cell IDs to front of array with Radix sort
  Bulk of non-collisions are weeded out here
  Bit-wise, stable sorting algorithm
  Phase 1: Radix counters -> task parallel
  Phase 2: Prefix sum -> data parallel
  Phase 3: Reordering -> task parallel

Collision cell array
Sorted Cell ID array is scanned for possible collisions for narrow phase processing
Array is scanned twice
  First scan: count number of possible collisions
  Exclusive scan to find array offsets
  Second scan: create collision cell arrays
Once collision cell arrays are created, assign one thread per array to evaluate narrow phase collisions
  Easy for spheres (compare centroids and radii)
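The exclusive scan that turns per-cell counts into array offsets can be written sequentially as below. This shows only the arithmetic of the step, not a parallel scan; the function name and the extra trailing element holding the total are choices made for this sketch.

```cpp
#include <cstddef>
#include <vector>

// Exclusive prefix sum: offsets[i] is the sum of counts[0..i-1], i.e. the
// slot where cell i starts writing. The extra last element is the total.
std::vector<std::size_t> exclusiveScan(const std::vector<std::size_t>& counts) {
    std::vector<std::size_t> offsets(counts.size() + 1, 0);
    for (std::size_t i = 0; i < counts.size(); ++i)
        offsets[i + 1] = offsets[i] + counts[i];
    return offsets;
}
```

For counts {2, 0, 3, 1} the offsets are {0, 2, 2, 5} with a total of 6, so the second scan can write each cell's entries into its own disjoint range.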

Final Project Plans
Collision detection to be used for granular dynamics
  Use existing algorithms to determine dynamics of a system with many contacts
  Integrate my collision detection program into existing software
    Bullet or Chrono::Engine
Extra work to be done
  Collision detection program must be able to iterate over time, i.e. be executed multiple times
  Be able to compute contact forces

ME 964 – Midterm Project Saigopal Nelaturi

Approach Max number of collisions = Use threads on the GPU
For bodies, number of threads = Max number of threads on GTX 280 ≈ 2048x using 2d blocks in a 2d grid SIMD paradigm

Algorithm All possible collisions can be characterized as binary entries in a 2-d array Entry indices shown below – totally Use this indexing scheme to determine a bijection

Bijective correspondence
Given an entry id , find a map Consider first

Bijective correspondence
In general Always works – example

Implementation Allocate device memory –
For bodies – kernel will not execute!! Otherwise – compute from block, thread id Apply bijection, compute collision, write into global memory using atomic add – works well but inefficient

Complexity

Workarounds Assume number of collisions bounded by 5*n
Do not rely on atomicAdd() – use vector reduction Others …

Midterm Project Ram Subramanian

The Task
To solve a collision detection problem:
Given an arbitrary number of rigid spheres with known radii, distributed in 3D space,
find out which spheres are in contact/penetration with which other spheres.

Indexing Scheme 1
Every thread gets a reference body (Body A) and a comparison body (Body B).
Each block has 512 threads (assumption 1).
Each row in a grid has 512 blocks (assumption 2).
Total number of threads is n(n-1)/2.
Compute the index value with the thread ID and block ID.
Using this index value and the number of bodies (with div and mod), the indices of Body A and Body B, respectively, can be determined.

Indexing Scheme 2 Loop over every body in the list
Each kernel call passes body A and the list of bodies with id value greater than A Every iteration of the loop we have fewer threads. (n-1, n-2, n-3 … 1)

Indexing Scheme 3
We launch n*(n-1)/2 threads with an indexing scheme that captures all the relevant test cases without repetitions.
Knowing we have n*(n-1)/2 elements, we can compute the bodies A and B for each element.
  A = floor(0.5 + sqrt(2*(float)n +0.25));
  B = n - (j *(j-1)/2);
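One way to read the formula is as the standard map from a linear index k in [0, n(n-1)/2) to a pair (a, b) with b < a. The sketch below takes that reading as an assumption, treating the slide's `n` inside the square root as the linear element index and `j` as the computed A. It uses double precision and an integer correction step, which the slide does not mention, to guard against rounding for large indices.

```cpp
#include <cmath>
#include <cstdint>

struct BodyPair { std::uint64_t a, b; }; // always b < a

// Map linear index k to the k-th pair of the lower triangle:
// k = 0 -> (1,0), k = 1 -> (2,0), k = 2 -> (2,1), k = 3 -> (3,0), ...
BodyPair pairFromIndex(std::uint64_t k) {
    std::uint64_t a = static_cast<std::uint64_t>(
        std::floor(0.5 + std::sqrt(2.0 * static_cast<double>(k) + 0.25)));
    // Correct a possible off-by-one caused by floating-point rounding,
    // so that a*(a-1)/2 <= k < a*(a+1)/2 holds exactly.
    while (a * (a - 1) / 2 > k) --a;
    while (a * (a + 1) / 2 <= k) ++a;
    return {a, k - a * (a - 1) / 2};
}
```

Every index in the range produces a distinct pair, so no test is repeated; for n = 32,768 the last index, 536,854,527, maps to the pair (32767, 32766).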

Problems Noted
atomicAdd
Not enough memory
Launching blocks and threads
Integer overflow
emuDebug vs Debug

Note on atomicAdd()
Test Code
  Run with different numbers of bodies, n = 512 up to 16,384. Number of threads = n*(n-1)/2
Problems occur at 32,768. Number of threads =
When the number of threads is around , atomicAdd cannot be trusted.

Not enough memory
When allocating space for n*(n-1)/2 entries:
  When n = 32,768 bodies, n*(n-1)/2 = 536,854,528
  Required memory = 2,147,418,112 bytes = 2 GB
  Available on 8600 GTS = 268,107,776 bytes = 256 MB

Launching blocks and threads
Choice of block size and thread size really affects the running. Computing size of grid Check round off errors when using integer arithmetic

Integer overflow
When using integer arithmetic, a signed 32-bit int overflows at 2,147,483,648 (2^31).
When computing n*(n-1)/2, an intermediate part of the calculation can overflow even if the result fits: for n = 65,536 the result is 2,147,450,880, just below the limit, but the intermediate product n*(n-1) = 4,294,901,760 is not.
  int numBlocks = (numBodies*(numBodies-1)/2) / blockSize;
Dividing first keeps the intermediate value small:
  int numBlocks = ( (numBodies/blockSize)*(numBodies-1)/2 );
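An alternative to reordering the expression is to do the arithmetic in 64 bits, sketched below. Using `std::int64_t` and rounding the block count up are choices made for this example rather than the author's fix; a fixed-width type is used because the width of `long int` depends on the platform.

```cpp
#include <cstdint>

// Number of unique body pairs, computed in 64-bit arithmetic so the
// intermediate product n*(n-1) cannot overflow for realistic n.
std::int64_t numPairs(std::int64_t n) {
    return n * (n - 1) / 2;
}

// Number of thread blocks needed to cover every pair, rounded up so a
// partially filled last block is still launched.
std::int64_t numBlocksFor(std::int64_t numBodies, std::int64_t blockSize) {
    const std::int64_t pairs = numPairs(numBodies);
    return (pairs + blockSize - 1) / blockSize;
}
```

For 65,536 bodies and 512 threads per block this gives 2,147,450,880 pairs and 4,194,240 blocks.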

in the end….fixing up…
Even after each bug was fixed
Replaced most atomicAdd calls with a vector reduction. (Contention reduced to about 5, at most ‘n’)
Used “long int”.
Checked intermediate integer computations
Reduced the pre-allocated value of memory below the limit
