1
GPU-based Parallel Collision Detection for Real-time Motion Planning
Jia Pan and Dinesh Manocha University of North Carolina, Chapel Hill, USA Presenter: Liangjun Zhang, Stanford University
2
Real-time Motion Planning
Dynamic/uncertain/deformable environments Complex task execution often needs real-time re-planning High-level task planning Physical robots Real-time motion planning is important for physical robots such as the PR-2. When robots operate in dynamic, uncertain, or deformable environments, they must sense the environment, then plan and execute, repeatedly and in real time. Complex task execution and high-level task planning also need real-time planning and re-planning.
3
Research Theme: Parallel Computation
Randomized motion planners are parallelizable Proximity queries/collision detection can take more than 90% of the overall time Processors are becoming parallel (the "new Moore's Law"): CPU -> multi-core (4-8 cores); GPU -> hundreds of cores E.g., the PR-2 has a multi-core CPU and a GPU Parallel algorithm design is necessary to take full advantage of commodity hardware There are several reasons why we focus on parallel algorithms for real-time motion planning; a key one is that modern processors are becoming more and more parallel.
4
Main Results An efficient parallel collision detection algorithm for real-time sampling-based motion planning using GPUs About 10X faster than prior GPU-based collision detection algorithms The algorithm is designed specifically for the GPU architecture The overall motion planner is 50-100X faster than CPU-based planners for rigid and articulated models The main result of this work is an efficient parallel collision detection algorithm using the GPU that performs thousands of collision detections simultaneously. This can be very useful for real-time randomized motion planners. Since our algorithm is designed specifically for the GPU architecture, we achieve a 10X speed-up over prior GPU-based collision detection algorithms. The overall motion planner is 50 to 100 times faster than CPU-based planners for rigid and articulated robots.
5
Outline Parallel Algorithms and GPUs Real-Time Planning Framework
Parallel Collision Detection Implementation and Results Conclusions The structure of my talk is as follows: I first briefly introduce this work. I will then discuss parallel algorithms and why we use the GPU. Next, I will present a real-time planning framework based on randomized sampling; a major component of this framework is the parallel collision detection proposed in this paper. Finally, I will cover the implementation, show results, and draw conclusions.
6
Why GPUs? GPUs can be faster/cheaper/smaller than CPUs
Latest NVIDIA Fermi GPU ($400) can provide another 2-3 times speed-up
7
GPU Architecture Many-core programmable processors Main memory
Large number of independent cores Wide vector units on each core (8-32) Hundreds of threads Main memory: high bandwidth, but high latency Synchronization between cores only via main memory; no memory consistency between the cores The GPU programming model is different from the CPU's
8
GPU Architecture The threads on GPU is organized in a hierarchical way
The number of parallelizable blocks is restricted by the shared memory used per block Grid Global Memory Block (0, 0) Shared Memory Thread (0, 0) Registers Thread (1, 0) Block (1, 0) Host
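The shared-memory occupancy limit above can be made concrete with a bit of arithmetic. A minimal Python sketch with illustrative numbers (48 KB of shared memory per core, a 32-entry stack of 4-byte entries per thread, 128 threads per block — assumptions, not measured values; the function name is hypothetical):

```python
# Sketch: how per-block shared-memory usage limits the number of
# blocks that can be resident on one GPU core at a time.

def max_parallel_blocks(shared_mem_per_core, shared_mem_per_block):
    """Blocks resident per core = available shared memory / per-block usage."""
    return shared_mem_per_core // shared_mem_per_block

# A per-thread BVH traversal stack of 32 entries x 4 bytes, with
# 128 threads per block, consumes 16,384 bytes of shared memory:
stack_bytes_per_block = 32 * 4 * 128
print(max_parallel_blocks(48 * 1024, stack_bytes_per_block))  # -> 3
```

With these numbers only 3 blocks fit per core, which is why per-thread stacks hurt parallelism.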
9
GPGPU: GPUs for non-graphics applications
GPUs are widely used to speed up general-purpose large-scale computations: numerical linear algebra; sorting [Owens et al. 2008]; Fourier transforms [Leischener et al. 2009]; acoustic wave equations [Mehra et al. 2010]; delayed duplicate detection for memory management [Edelkamp et al. 2010]; search [Kider et al. 2010]; database query processing [Govindaraju et al. 2004; He et al. 2009]; low degree-of-freedom motion planning [Hoff et al. 2000; Pisula et al. 2000]; motion planning for high-DOF robots. The GPU is a special architecture originally designed for stream operations like rendering, but nowadays GPUs are used more and more for general-purpose computation.
10
Previous Work: Parallel Motion Planning
Multi-core or multi-CPU Potential field [Barraquand et al. 1991][Challou et al. 2003][Gini 1996] PRM [Plaku et al. 2005,2007][Amato et al. 1999] RRT [Carpin et al. 2001] GPU Parallel roadmap search [Kider et al. 2010] Low-dof motion planning based on structures like Voronoi graph [Lengyel et al. 1990][Hoff et al. 2000][Sud et al. 2007] [Pisula et al. 2000][Foskey et al. 2001]
11
Previous Work: Parallel Collision Detection
Bounding volume hierarchy (BVH) based Multi-core CPU [Tang et al. 2010] GPU [Lauterbach et al. 2009] Hybrid (multi-core CPU + GPU) [Kim et al. 2009] Other acceleration structures Spatial hashing [Alcantara et al. 2009] These focus on a single collision query; randomized planners perform a high number (>10,000) of queries
12
Outline Parallel Algorithms and GPUs Real-Time Planning Framework
Parallel Collision Detection Implementation and Results Conclusions
13
G-Planner: Real-time Planning using GPUs [Pan et al. 2010a]
G-Planner uses the probabilistic roadmap method (PRM) as the underlying motion planning algorithm High-DOF robots Single query (lazy PRM) or multiple queries Being extended to handle uncertainty
14
G-Planner Architecture
15
Challenge for Real-time Planner on GPUs
Proximity queries can take more than 90% of the overall time 1. High number of collision queries: computing milestones and local planning may need 100,000 queries for many benchmarks 2. K-nearest-neighbor queries: expensive when the number of samples is large
16
Challenge for Real-time Planner on GPUs
Architecture restrictions The GPU is not an ideal Parallel Random Access Machine (PRAM) Parallel planning algorithms designed for multi-core or multiple CPUs do not map well to GPU architectures
17
Outline Parallel Algorithms and GPUs Real-Time Planning Framework
Parallel Collision Detection Implementation and Results Conclusions
18
Overlapping test of BVs
PQP: A Bounding Volume Hierarchy (BVH) based collision detection algorithm Traverse the bounding volume test tree (BVTT) [Figure: BVHs of Object 1 (nodes A, B, C) and Object 2 (nodes D, E, F); BVTT node pairs AD, AE, AF, BE, BF, CE, CF] The most efficient collision detection algorithms are based on bounding volume hierarchies. The root BV encloses the entire object (e.g., a bunny); the object is recursively split and enclosed by smaller BVs, yielding a binary tree. For two objects with BVHs, we perform collision detection efficiently by traversing the BVTT: first the root BVs of the two trees are tested for overlap, and overlapping pairs are refined recursively.
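The BVTT traversal just described can be sketched in a few lines. A minimal, illustrative Python version (hypothetical names; axis-aligned boxes stand in for PQP's tighter bounding volumes):

```python
# Sketch of BVH-based collision detection: recursively traverse the
# bounding volume test tree (BVTT), pruning node pairs whose bounding
# volumes (here, axis-aligned boxes) do not overlap.

class Node:
    def __init__(self, lo, hi, children=(), tri=None):
        self.lo, self.hi = lo, hi      # AABB min/max corners
        self.children = children       # internal node: child BVs
        self.tri = tri                 # leaf: triangle id

def overlap(a, b):
    """AABB overlap test on all three axes."""
    return all(a.lo[i] <= b.hi[i] and b.lo[i] <= a.hi[i] for i in range(3))

def traverse(a, b, out):
    if not overlap(a, b):
        return                         # prune this BVTT branch
    if not a.children and not b.children:
        out.append((a.tri, b.tri))     # candidate triangle pair
    elif a.children:
        for c in a.children:
            traverse(c, b, out)
    else:
        for c in b.children:
            traverse(a, c, out)
```

The candidate triangle pairs collected in `out` would then go to an exact triangle intersection test.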
19
GPU Architecture Restrictions
The basic parallel algorithm (one collision query per thread) is not suitable for GPU architectures: it lacks the regular memory access and coherent program branches discussed on the following slides [Figure: threads 1-4, each handling one query q1-q4]
20
GPU Memory Model Shared memory is fast, BUT limited (16 KB-48 KB)
The more shared memory used per block, the less parallelism: number of parallel blocks ≤ shared memory per core / shared memory per block The basic parallel BVH algorithm needs one traversal stack (>32 entries) per thread, i.e., 32 × n entries for an n-thread block (BAD)
21
Data-dependent Conditional Branch
(Per warp) The GPU cannot handle data-dependent conditional branches efficiently: only one branch can execute at a time, and the threads that take the other branch have to stop and wait. Unfortunately, such branches happen frequently in BVH traversal (BAD)
22
Uncoalesced Memory Access
GPUs prefer regular (coalesced) memory access BVH traversal results in uncoalesced memory access (BAD)
23
Our Solutions Parallel Collision-Packet Traversal
50%-100% speed-up over the basic GPU method (one query per thread) Simple to implement and can be used with basic parallel collision algorithms Parallel Collision Query with Workload Balancing 5-10X speed-up over the basic GPU method More complicated to implement
24
Parallel Collision-Packet Traversal
Cluster collision queries into several groups ("grouping nearby samples") Groups are further divided into small warp-sized packets Queries in the same packet traverse the BVTT in the same order One stack per block (GOOD!) Coalesced, cacheable memory access (GOOD!) No branch divergence (GOOD!)
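The packet idea above can be sketched sequentially: one shared stack drives all queries in the packet through the same BVTT order, and a node pair is pruned only when no query in the packet needs it. A minimal, illustrative Python sketch (all names hypothetical):

```python
# Sketch of collision-packet traversal: queries in one packet share a
# single stack and visit BVTT node pairs in the same order; a branch is
# pruned only if it is dead for EVERY query in the packet.

def packet_traverse(queries, root_pair, overlap, children):
    """overlap(query, pair) -> bool; children(pair) -> list of child pairs."""
    results = {q: [] for q in range(len(queries))}
    stack = [root_pair]                  # ONE stack for the whole packet
    while stack:
        pair = stack.pop()
        # every query in the packet tests the same node pair
        active = [q for q in range(len(queries)) if overlap(queries[q], pair)]
        if not active:
            continue                     # the whole packet prunes together
        kids = children(pair)
        if not kids:                     # leaf pair: record per-query hit
            for q in active:
                results[q].append(pair)
        else:
            stack.extend(kids)
    return results
```

The cost is that a query may visit node pairs only its packet-mates need, which is exactly the "unnecessary BV overlapping tests" disadvantage noted later.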
25
Query Clustering Find clusters to minimize ∑_i ∑_{x ∈ C_i} ‖x − c_i‖²,
where the c_i are the cluster centers and the clusters C_i are of bounded size This constrained clustering is difficult to solve exactly; we approximate it with k-means and then divide each cluster into chunk-size groups
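The two-step approximation can be sketched as follows: run plain k-means, then split each resulting cluster into chunk-size groups. A minimal 1-D Python sketch (pure stdlib; the data, function names, and iteration count are illustrative, not the paper's implementation):

```python
# Sketch: approximate the constrained clustering with k-means followed
# by splitting each cluster into groups of at most `chunk` queries.
import random

def kmeans(points, k, iters=20):
    """Plain 1-D k-means; returns the final groups."""
    centers = random.sample(points, k)
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: abs(p - centers[i]))].append(p)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return groups

def chunked_clusters(points, k, chunk):
    """Split each k-means cluster into chunk-size packets."""
    out = []
    for g in kmeans(points, k):
        g.sort()
        out.extend(g[i:i + chunk] for i in range(0, len(g), chunk))
    return out
```

In the planner the points would be robot configurations under a configuration-space metric rather than scalars; the chunk size would match the warp size.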
26
Packet’s Traverse Order
We need an optimal traversal order for the packet to avoid additional BV overlap tests. Greedy heuristic: define the collision probability P of a traversal order (formula omitted on the slide) and traverse the child node with the larger P first
27
Disadvantages There are unnecessary BV overlapping tests
A good clustering algorithm is required Observation: the task granularity for each thread — the amount of work one thread owns — is still too large (each thread traverses an entire BVTT)
28
Parallel Collision Query with Workload Balancing
Each thread executes more fine-grained tasks; a task is either an overlapping test between two BVs or a triangle intersection test [Figure: global queue holding BVTT node-pair tasks AD, AE, AF, BE, BF, CE, CF] To address this issue, we propose a second parallel collision detection algorithm in which each thread executes fine-grained tasks: each task performs only a single BV overlap test or triangle intersection test.
29
Workload Queue All tasks are stored in a global queue
Each block (core) keeps a local task queue [Figure: the global queue of (query, BV-pair) tasks is partitioned across the local queues of cores 1..Q; each core's traversal generates new tasks into its local queue]
30
Workload Balancing Different collision queries require different numbers of BV overlapping tests, so different local queues hold different numbers of tasks: a nearly full queue keeps its GPU core busy, while a nearly empty queue leaves its core idle
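The balancing step can be sketched as redistributing tasks through a shared pool: overfull local queues donate tasks, starved queues take them. A minimal, illustrative Python sketch (the thresholds and names are assumptions, not the paper's parameters):

```python
# Sketch of workload balancing: drain local queues above a high-water
# mark into a pool, refill queues below a low-water mark from the pool,
# and spread any remainder round-robin so no core sits idle.

def balance(queues, low=2, high=8):
    pool = []
    for q in queues:                     # drain overfull queues
        while len(q) > high:
            pool.append(q.pop())
    for q in queues:                     # refill starved queues
        while pool and len(q) < low:
            q.append(pool.pop())
    i = 0                                # spread any remainder
    while pool:
        queues[i % len(queues)].append(pool.pop())
        i += 1
    return queues
```

On the GPU this redistribution goes through the global task pools described on the next slide, since cores can only exchange data via main memory.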
31
[Figure: four cooperating kernels — the task kernel executes tasks from each core's local queue; the manage kernel monitors per-core utilization and decides whether each core aborts or continues; the balance kernel redistributes tasks between full and empty local queues via the global task pools; the pump kernel moves tasks between the external task pools and the global task pools]
32
Advantages Branch divergence and uncoalesced accesses are minimized (GOOD!)
No need for a stack (GOOD!) All GPU threads are used
33
Performance Analysis We can prove that our parallel algorithms on the GPU are work-efficient, i.e., not slower than the serial implementation: T_serial > T_basic > T_packet > T_workload ≈ T_serial / #processors
34
Outline Parallel Algorithms and GPUs Real-Time Planning Framework
Parallel Collision Detection Implementation and Results Conclusions
35
Implementation Implemented in CUDA All tests were run on:
Intel Core i7 3.2 GHz CPU, 6 GB memory NVIDIA GTX 480 GPU, 1 GB video memory
36
Benchmarks [Figures: four benchmark scenes — two 6-DOF robots, one 12-DOF robot, one 38-DOF robot]
37
Timing Results Comparison with the basic GPU method (one collision query per thread), 50,000 collision queries:

Benchmark     Robot faces  Obstacle faces  Basic GPU (ms)  Collision-packet (ms)  Workload balancing (ms)
Piano         6,540        648             224             130                    3.7
Large-piano   34,880       13,824          710             529                    15.1
Helicopter    3,612        2,840           272             226                    2.3
Humanoid      27,749       3,495           2,316           1,823                  126
38
Timing Results Comparison with the basic GPU method (one query per thread), local planning:

Benchmark     Basic GPU (ms)  Collision-packet (ms)  Workload balancing (ms)
Piano         2,076           1,344                  34
Large-piano   7,587           6,091                  66
Helicopter    7,413           4,645                  41
Humanoid      8,650           8,837                  1,964
39
Overall Performance Our parallel GPU-based algorithms can perform about 500K collision queries per second on a $400 NVIDIA Fermi card (10-50X faster than prior methods)
40
Applications to PR2 Model
Comparison with a CPU-based algorithm:

Stage                                     CPU (ms)   GPU (ms)
Milestone computation                     15,952     392
Local planning (incl. self-collision)     643,194    6,803
41
Results ~300 ms for 500 samples [Video]
42
Conclusions An efficient GPU-based parallel collision detection algorithm Real-time motion planning is possible using GPUs But we must carefully design algorithms that map well to GPU architectures
43
Ongoing and Future Work
B-spline based path smoothing Planning with environment uncertainty Probabilistic collision checking Integration with physical robots (e.g. PR2)
44
Acknowledgements Funding agencies ARO NSF DARPA/RDECOM Intel
45
Questions?