A Parallel GPU Version of the Traveling Salesman Problem
Molly A. O’Neil, Dan Tamir, and Martin Burtscher*
Department of Computer Science

The Traveling Salesman Problem
- Common combinatorial optimization problem
  - Wire routing, logistics, robot arm movement, etc.
- Given n cities, find shortest Hamiltonian tour
  - Must visit all cities exactly once and end in first city
- Usually expressed as a graph problem
  - We use complete, undirected, planar, Euclidean graph
  - Vertices represent cities
  - Edge weights reflect distances

A Parallel GPU Version of the Traveling Salesman Problem, July 2011
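As a small illustration of the problem statement above, the following Python sketch builds a Euclidean distance matrix and measures the length of a closed tour. The city coordinates and the helper names `distance_matrix` and `tour_length` are made up for this example; they are not taken from TSP_GPU.

```python
import math

def distance_matrix(cities):
    """Return the symmetric matrix of Euclidean distances between cities."""
    return [[math.dist(p, q) for q in cities] for p in cities]

def tour_length(tour, dist):
    """Length of the closed tour: visit every city once, then return to the first."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

# Four hypothetical cities on the corners of a 3 x 4 rectangle.
cities = [(0, 0), (3, 0), (3, 4), (0, 4)]
dist = distance_matrix(cities)

perimeter = tour_length([0, 1, 2, 3], dist)  # walks around the rectangle
crossing = tour_length([0, 2, 1, 3], dist)   # uses both diagonals
```

Both tours visit the same four cities, but the one that crosses itself is longer, which is exactly the kind of defect a local search tries to remove.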

TSP Algorithm
- Finding an optimal solution is NP-hard
  - Heuristic algorithms used to approximate solution
- We use an iterative hill climbing search algorithm
  - Generate k random initial tours (k climbers)
  - Iteratively refine them until local minimum reached
- In each iteration, apply best 2-opt move
  - Find best pair of edges (a,b) and (c,d) such that replacing them with (a,d) and (b,c) minimizes tour length
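The search described above can be sketched in a few lines of serial Python. This is only an illustration under stated assumptions, not the TSP_GPU code: the function names, the random seed, the rounding tolerance, and the four test cities are all hypothetical. With a tour stored as a list, the edges (a,b) and (c,d) sit at positions i, i+1 and j, j+1, and the replacement is carried out by reversing the segment between them.

```python
import math
import random

def tour_length(tour, dist):
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

def best_two_opt_move(tour, dist):
    """Return (delta, i, j) for the 2-opt move that shortens the tour most."""
    n = len(tour)
    best = (0.0, -1, -1)
    for i in range(n - 2):
        a, b = tour[i], tour[i + 1]
        for j in range(i + 2, n):
            d, c = tour[j], tour[(j + 1) % n]
            # Replace edges (a,b) and (c,d) with (a,d) and (b,c).
            delta = dist[a][d] + dist[b][c] - dist[a][b] - dist[c][d]
            if delta < best[0]:
                best = (delta, i, j)
    return best

def climb(tour, dist):
    """Apply the best 2-opt move repeatedly until no move shortens the tour."""
    while True:
        delta, i, j = best_two_opt_move(tour, dist)
        if delta >= -1e-12:  # local minimum; tolerance guards against rounding
            return tour
        tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])

def hill_climb(dist, climbers, seed=0):
    """Run independent climbers from random tours; keep the shortest result."""
    rng = random.Random(seed)
    best_tour, best_len = None, math.inf
    for _ in range(climbers):
        tour = list(range(len(dist)))
        rng.shuffle(tour)
        climb(tour, dist)
        length = tour_length(tour, dist)
        if length < best_len:
            best_tour, best_len = tour[:], length
    return best_tour, best_len

# Hypothetical input: corners of a 3 x 4 rectangle (optimal tour length 14).
cities = [(0, 0), (3, 0), (3, 4), (0, 4)]
dist = [[math.dist(p, q) for q in cities] for p in cities]
best_tour, best_len = hill_climb(dist, climbers=5)
```

Each climber touches only its own tour, so the loop over climbers is the part that can be distributed across threads.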

GPU Requirements
- Lots of data parallelism
  - Need 10,000s of 'independent' threads
- Sufficient memory access regularity
  - Sets of 32 threads should have 'nice' access patterns
- Sufficient code regularity
  - Sets of 32 threads should follow the same control flow
- Plenty of data reuse
  - At least O(n^2) operations on O(n) data

TSP_GPU Implementation
- Assuming 100-city problems & 100,000 climbers
- Climbers are independent, can be run in parallel
  - Plenty of data parallelism
  - Potential load imbalance
    - Different number of steps required to reach local minimum
- Every step determines best of 4851 2-opt moves
  - Same control flow (but different data)
  - Coalesced memory access patterns
  - O(n^2) operations on O(n) data
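The figure of 4851 moves per step can be reproduced with a short count. The sketch below assumes one particular enumeration, namely all position pairs (i, j) with i + 2 <= j < n; this matches the number on the slide for n = 100, but the actual loop bounds in TSP_GPU may be written differently. The name `two_opt_pairs` is hypothetical.

```python
def two_opt_pairs(n):
    """Enumerate position pairs (i, j) with i + 2 <= j < n, one per candidate move."""
    return [(i, j) for i in range(n - 2) for j in range(i + 2, n)]

n = 100
moves = len(two_opt_pairs(n))
closed_form = (n - 1) * (n - 2) // 2  # (n-2) + (n-3) + ... + 1
```

Under this assumption the count grows as (n-1)(n-2)/2, which is the O(n^2) work performed on a tour of only n entries.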

Code Optimizations
- Key code section: finding best 2-opt move
  - Doubly nested loop
    - Only computes difference in tour length, not absolute length
  - Highly optimized to minimize memory accesses
    - "Caches" rest of data in registers
  - Requires only 6 clock cycles per move on a Xeon CPU core
- Local minimum compared to best solution so far
  - Best solution updated if needed, otherwise tour is discarded
- Other small optimizations (see paper)
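Computing only the difference in tour length works because a 2-opt move changes exactly two edges, so four distance lookups suffice where a full recomputation needs n. The following sketch checks this on a made-up instance; the random cities, the seed, the chosen positions, and the function names are assumptions for illustration and do not come from the TSP_GPU source.

```python
import math
import random

def tour_length(tour, dist):
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

def two_opt_delta(tour, dist, i, j):
    """Change in tour length if the segment tour[i+1..j] is reversed."""
    n = len(tour)
    a, b = tour[i], tour[i + 1]
    d, c = tour[j], tour[(j + 1) % n]
    # Edges (a,b) and (c,d) are removed; edges (a,d) and (b,c) are added.
    return dist[a][d] + dist[b][c] - dist[a][b] - dist[c][d]

rng = random.Random(1)
cities = [(rng.random(), rng.random()) for _ in range(12)]
dist = [[math.dist(p, q) for q in cities] for p in cities]

tour = list(range(12))
i, j = 2, 7
before = tour_length(tour, dist)
delta = two_opt_delta(tour, dist, i, j)          # four lookups
tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])  # apply the move
after = tour_length(tour, dist)                  # n lookups, for comparison only
```

The equality `after - before == delta` (up to rounding) relies on the distance matrix being symmetric, since the edges inside the reversed segment are traversed in the opposite direction.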

GPU Optimizations
- Random tours generated in parallel on GPU
  - Minimizes data transfer to GPU
  - (CPU only generates distance matrix and prints result)
- 2D distance matrix resident in shared memory
  - Ensures hits in software-controlled fast data cache
- Tours copied to local memory in chunks of 1024
  - Enables accessing them with coalesced loads & stores

Evaluation Method
- Systems
  - NVIDIA Tesla C2050 GPU (1.15 GHz, 14 SMs w/ 32 PEs)
  - Nautilus supercomputer (2.0 GHz 8-core X7550 Xeons)
- Datasets
  - Five 100-city inputs from TSPLIB
- Implementations
  - CUDA (GPU), Pthreads (CPU), serial C (CPU)
  - Use almost identical code for finding best 2-opt move

Runtime Comparison (kroE100 Input)
- GPU is 7.8x faster than CPU with 8 cores
- One GPU chip is as fast as 16 or 32 CPU chips

Speedup over Serial (kroE100 Input)
- Pthreads code scales well to 32 threads (4 CPUs)
- CPU performance fluctuates (NUMA), GPU stable

Solution Quality
- Optimal tour found in 4 of 5 cases with 100,000 climbers
- 200,000 climbers find best solution in fifth case
- Runtime independent of input and linear in climbers

Summary
- TSP_GPU source code is freely available at http://www.cs.txstate.edu/~burtscher/research/TSP_GPU/
- TSP_GPU algorithm
  - Highly optimized implementation for GPUs
  - Evaluates almost 20 billion tour modifications per second on a single GPU (as fast as 32 8-core Xeons)
  - Produces high-quality results
  - May be better suited for GPU than ACO and GA algorithms
- Acknowledgments
  - NSF TeraGrid (NICS), NVIDIA Corp., and Intel Corp.
