1
Data Parallel Quadtree Indexing and Spatial Query Processing of Complex Polygon Data on GPUs
Jianting Zhang (1,2), Simin You (2), Le Gruenwald (3)
(1) Department of Computer Science, CUNY City College (CCNY)
(2) Department of Computer Science, CUNY Graduate Center
(3) School of Computer Science, the University of Oklahoma
CISE/IIS Medium Collaborative Research Grants 1302423/1302439: “Spatial Data and Trajectory Data Management on GPUs”
2
Outline Introduction & Background Application: Large-Scale Biodiversity Data Management Data Parallel Designs and Implementations Polygon Decomposition Quadtree Construction Spatial Query Processing Experiments Summary and Future Work
3
Parallel Computing – Hardware
[Figure: a chip-multiprocessor CPU host (cores with local caches, a shared cache, DRAM, disk and SSD) connected over PCI-E to a GPU (SIMD cores organized in thread blocks, with GDRAM) and to a MIC (4-thread in-order cores on a ring bus, with local caches and GDRAM).]
16 Intel Sandy Bridge CPU cores + 128GB RAM + 8TB disk + GTX TITAN + Xeon Phi 3120A ~ $9,994
4
ASCI Red (1997): the first system to sustain 1 TFLOPS, with 9,298 Intel Pentium II Xeon processors in 72 cabinets.
Nvidia GTX Titan (Feb. 2013): 7.1 billion transistors (551 mm²), 2,688 processors, 4.5 TFLOPS SP and 1.3 TFLOPS DP, 288.4 GB/s max bandwidth, a 250 W PCI-E peripheral device (17.98 GFLOPS/W SP), suggested retail price $999.
What can we do today with a device that is more powerful than ASCI Red was 17 years ago?
5
GeoTECI@CCNY
[Figure: the CCNY Computer Science LAN, with a “brawny” GPU cluster (a Microway server with dual 8-core CPUs, 128GB memory, an Nvidia GTX Titan, an Intel Xeon Phi 3120A and 8TB storage; an SGI Octane III with dual quad-core CPUs, 48GB memory, two Nvidia C2050s and 8TB storage; Dell T5400/T7500 workstations with Quadro 6000, GTX 480 and FX3700 GPUs; DIY boxes with a GTX Titan and an AMD/ATI 7970), a “wimpy” GPU cluster (Lenovo T400s and HP 8740w laptops with mobile GPUs), KVM, web and application servers, and a link to the CUNY HPCC.]
...building a highly-configurable experimental computing environment for innovative BigData technologies…
6
Computer Architecture vs. Spatial Data Management: how do we fill the big gap effectively?
David Wentzlaff, “Computer Architecture”, Princeton University course on Coursera
7
Parallel Computing – Languages & Libraries
http://www.macs.hw.ac.uk/cs/techreps/docs/files/HW-MACS-TR-0103.pdf
Thrust, Bolt, CUDPP, Boost, GNU Parallel Mode
8
Source: http://parallelbook.com/sites/parallelbook.com/files/SC11_20111113_Intel_McCool_Robison_Reinders.pptx
Data parallelism → parallel primitives → parallel libraries → parallel hardware
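To make this layering concrete, here is a minimal, self-contained sketch (my illustration, not from the talk) of how a data-parallel design is expressed through a parallel primitive in Thrust, one of the libraries listed above; the library maps the primitive onto the parallel hardware.

```cpp
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <cstdio>

// Squaring functor: the "map" half of a fused map-reduce primitive.
struct square {
  __host__ __device__ float operator()(float x) const { return x * x; }
};

int main() {
  // 1M elements on the GPU; the design says nothing about threads,
  // blocks, or memory layout -- the primitive handles all of that.
  thrust::device_vector<float> v(1 << 20, 2.0f);
  float sum = thrust::transform_reduce(v.begin(), v.end(),
                                       square(), 0.0f,
                                       thrust::plus<float>());
  std::printf("sum of squares = %.0f\n", sum);  // expect 4 * 2^20
  return 0;
}
```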
9
Outline Introduction & Background Application: Large-Scale Biodiversity Data Management Data Parallel Designs and Implementations Polygon Decomposition Quadtree Construction Spatial Query Processing Experiments Summary and Future Work
10
Managing Large-Scale Biodiversity Data
SELECT aoi_id, sp_id, SUM(ST_Area(inter_geom))
FROM (
  SELECT aoi_id, sp_id, ST_Intersection(sp_geom, qw_geom) AS inter_geom
  FROM SP_TB, QW_TB
  WHERE ST_Intersects(sp_geom, qw_geom)
)
GROUP BY aoi_id, sp_id
HAVING SUM(ST_Area(inter_geom)) > T;
http://geoteci.engr.ccny.cuny.edu/birds30s/BirdsQuest.html
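The GROUP BY/SUM step of this query maps naturally onto a parallel primitive. Below is a hedged sketch (my own illustration; the function and variable names are assumptions, not the authors' code) of that aggregation using thrust::reduce_by_key, assuming the per-intersection areas have already been computed and the rows are sorted by the (aoi_id, sp_id) composite key.

```cpp
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

// Sum intersection areas per (aoi_id, sp_id) group, like the SQL
// GROUP BY above. Inputs must be pre-sorted by the composite key.
void sum_area_by_group(const thrust::device_vector<int>&   aoi_id,
                       const thrust::device_vector<int>&   sp_id,
                       const thrust::device_vector<float>& area,
                       thrust::device_vector<int>&   out_aoi,
                       thrust::device_vector<int>&   out_sp,
                       thrust::device_vector<float>& out_area) {
  size_t n = area.size();
  out_aoi.resize(n); out_sp.resize(n); out_area.resize(n);
  // Zip the two key columns into one composite key sequence.
  auto key_first = thrust::make_zip_iterator(
      thrust::make_tuple(aoi_id.begin(), sp_id.begin()));
  auto out_key = thrust::make_zip_iterator(
      thrust::make_tuple(out_aoi.begin(), out_sp.begin()));
  // reduce_by_key sums each consecutive run of equal (aoi_id, sp_id).
  auto ends = thrust::reduce_by_key(key_first, key_first + n,
                                    area.begin(), out_key,
                                    out_area.begin());
  size_t groups = ends.second - out_area.begin();
  out_aoi.resize(groups); out_sp.resize(groups); out_area.resize(groups);
}
```

The HAVING clause would then be one more primitive (a compaction keeping groups whose summed area exceeds T).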
11
Indexing “Complex” Polygons
http://en.wikipedia.org/wiki/Simple_polygon
http://en.wikipedia.org/wiki/Simple_Features
“Complex” polygons: polygons with multiple rings (i.e., with holes) that are highly overlapped.
Problems with indexing MBRs: an inexpensive yet inaccurate approximation for complex polygons, with low pruning power when polygons overlap heavily.
12
Indexing “Complex” Polygons
http://xlinux.nist.gov/dads/HTML/linearquadtr.html
(Zhang et al 2009) (Zhang 2012): hours of runtime on bird range maps when extending GDAL/OGR (serial).
Fang et al 2008. Spatial indexing in Microsoft SQL Server 2008 (SIGMOD’08): uses a B-Tree to index quadrants, but it is unclear how the quadrants are derived from polygons.
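The linear quadtree referenced above identifies each quadrant by its path of child choices from the root, which is what lets an ordinary B-Tree index quadrants. A small illustrative sketch (mine, not code from the cited papers) of such a key:

```cpp
#include <cstdint>
#include <vector>

// Encode a root-to-quadrant path, e.g. {0, 2} or {1, 3}, into an
// integer key; each step appends 2 bits for the child index (0..3).
// The leading 1 bit marks the depth, so the key of path {0} differs
// from that of path {0, 0}.
uint64_t quadrant_key(const std::vector<int>& path) {
  uint64_t key = 1;                 // depth marker
  for (int child : path)            // child index in [0, 3]
    key = (key << 2) | static_cast<uint64_t>(child & 3);
  return key;
}
```

Keys built this way sort in a space-filling (Morton-like) order, so quadrants that are spatially close tend to be close in the B-Tree as well.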
13
Outline Introduction & Background Application: Large-Scale Biodiversity Data Management Data Parallel Designs and Implementations Polygon Decomposition Quadtree Construction Spatial Query Processing Experiments Summary and Future Work
14
[Figure: the processing pipeline, parallel quadtree construction followed by parallel query processing, contrasting DFS and BFS traversal orders.]
15
Parallel Polygon Decomposition
16
17
Observations:
(1) All operations are data parallel at the quadrant level;
(2) Quadrants may be at different levels and may come from the same or from different polygons;
(3) Each GPU thread processes one quadrant;
(4) Accesses to GPU memory can be coalesced for neighboring quadrants from the same polygon (see the sketch below).
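A minimal sketch of the pattern these observations describe (names and the split test are my assumptions, not the authors' code): one BFS level of quadrant expansion, with one GPU thread per quadrant and an exclusive scan turning per-quadrant child counts (4 for a split, 0 for a leaf) into coalesced write offsets.

```cpp
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/scan.h>

// A quadrant handle: its linear-quadtree key, owning polygon, and level.
struct Quadrant {
  unsigned long long key;
  int poly_id;
  int level;
};

// Placeholder split test (assumption): a real implementation would test
// the quadrant against the polygon's boundary and a maximum depth.
struct ChildCount {
  __host__ __device__ int operator()(const Quadrant& q) const {
    return q.level < 8 ? 4 : 0;   // split -> 4 children, leaf -> 0
  }
};

// One thread per quadrant; the children of quadrant i land at the
// offset computed by the scan, so writes for neighboring quadrants
// are adjacent (coalesced-friendly).
__global__ void write_children(const Quadrant* cur, const int* counts,
                               const int* offsets, int n, Quadrant* next) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n || counts[i] == 0) return;
  Quadrant q = cur[i];
  for (int c = 0; c < 4; ++c) {
    Quadrant child;
    child.key = (q.key << 2) | (unsigned long long)c;
    child.poly_id = q.poly_id;
    child.level = q.level + 1;
    next[offsets[i] + c] = child;
  }
}

// Expand one BFS level: count children, scan for offsets, scatter.
void expand_one_level(thrust::device_vector<Quadrant>& cur,
                      thrust::device_vector<Quadrant>& next) {
  int n = (int)cur.size();
  if (n == 0) return;
  thrust::device_vector<int> counts(n), offsets(n);
  thrust::transform(cur.begin(), cur.end(), counts.begin(), ChildCount());
  thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());
  int total = offsets[n - 1] + counts[n - 1];   // children this level
  next.resize(total);
  write_children<<<(n + 255) / 256, 256>>>(
      thrust::raw_pointer_cast(cur.data()),
      thrust::raw_pointer_cast(counts.data()),
      thrust::raw_pointer_cast(offsets.data()),
      n, thrust::raw_pointer_cast(next.data()));
}
```

Because each level is processed as one flat vector of quadrants, quadrants from different polygons and different levels are handled uniformly, which is exactly what observations (1) and (2) require.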
18
Quadtree Construction
Spatial Query Processing
19
Experiment Setup
Species distribution data: 4,062 bird species in the Western Hemisphere; 708,509 polygons; 77,699,991 vertices.

Polygon group | Vertex-count range | Total polygons | Total points
1             | 10–100             | 497,559        | 11,961,389
2             | 100–1,000          | 33,374         | 8,652,278
3             | 1,000–10,000       | 6,719          | 20,436,931
4             | 10,000–100,000     | 1,213          | 33,336,083

Hardware: dual 8-core Sandy Bridge CPUs (2.60 GHz), 128GB memory, Nvidia GTX Titan (6GB, 2,688 cores), Intel Xeon Phi 3120A (6GB, 57 cores), 8TB storage.
Software: CentOS 6.4 with GCC 4.7.2, TBB 4.2, ISPC 1.6, CUDA 5.5.
All vector initialization times in Thrust on GPUs are counted (newer versions of Thrust allow uninitialized device vectors); performance can vary among CUDA SDK versions.
20
Runtimes (Polygon Decomposition)
[Chart: decomposition runtimes for polygon groups G1 (10–100 vertices), G2 (100–1,000), G3 (1,000–10,000) and G4 (10,000–100,000).]
21
Comparisons with PixelBox* (Polygon Decomposition)

Runtime (ms)     | 10–100 | 100–1,000 | 1,000–10,000 | 10,000–100,000
Proposed         | 451    | 2,303     | 46,215       | 193,714
PixelBox*-shared | 2,260  | 15,732    | 338,695      | 1,686,879
PixelBox*-global | 866    | 6,023     | 124,031      | 948,560

PixelBox*: the PixelBox algorithm [5], modified and extended to decompose single polygons (rather than computing the sizes of the intersection areas of pairs of polygons) and to handle “complex” multi-ring polygons.
PixelBox*-shared: CUDA implementation using GPU shared memory for the stack; DFS traversal with a batch size of N, where N cannot be too big (shared-memory capacity) or too small (the GPU is underutilized below warp size). A sketch of this pattern follows below.
PixelBox*-global: CUDA implementation using GPU global memory for the stack; DFS traversal with different batch sizes. Coalesced global GPU memory accesses are efficient.
Proposed technique: Thrust data-parallel implementation on top of parallel primitives; BFS traversal exposes a higher degree of parallelism. Data-parallel designs (using primitives) simplify implementations; GPU shared memory is not explicitly used, which is more flexible, and coalesced global GPU memory accesses are efficient. The drawback is a large memory footprint (for the current implementation).
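For contrast with the BFS sketch earlier, here is a rough reconstruction (assumptions mine; not the paper's code) of the PixelBox*-shared pattern: each thread block runs a DFS over quadrants with an explicit stack in shared memory, popping up to blockDim.x quadrants per round. The fixed stack size is exactly the shared-memory constraint noted above, which PixelBox*-global avoids by keeping the stack in global memory.

```cpp
#include <cuda_runtime.h>

#define MAX_STACK 1024   // bounded by shared-memory capacity per block

struct Quad { unsigned long long key; int level; };

// Placeholder leaf test (assumption): real code would evaluate the
// quadrant against the polygon's rings.
__device__ bool is_leaf(const Quad& q) { return q.level >= 8; }

__global__ void dfs_decompose(const Quad* roots, int n_roots,
                              Quad* leaves, int* n_leaves) {
  __shared__ Quad stack[MAX_STACK];
  __shared__ int top;
  int root = blockIdx.x;                    // one root quadrant per block
  if (root >= n_roots) return;
  if (threadIdx.x == 0) { stack[0] = roots[root]; top = 1; }
  __syncthreads();
  while (top > 0) {
    int batch = min(top, (int)blockDim.x);  // pop a batch of quadrants
    Quad q;
    bool mine = (int)threadIdx.x < batch;
    if (mine) q = stack[top - 1 - threadIdx.x];
    __syncthreads();                        // all pops done before repush
    if (threadIdx.x == 0) top -= batch;
    __syncthreads();
    if (mine) {
      if (is_leaf(q)) {
        leaves[atomicAdd(n_leaves, 1)] = q; // emit a leaf quadrant
      } else {
        int pos = atomicAdd(&top, 4);       // reserve room for 4 children
        // Real code must guard pos + 4 <= MAX_STACK and spill or abort;
        // PixelBox*-global sidesteps the cap with a global-memory stack.
        for (int c = 0; c < 4; ++c) {
          Quad child;
          child.key = (q.key << 2) | (unsigned long long)c;
          child.level = q.level + 1;
          stack[pos + c] = child;
        }
      }
    }
    __syncthreads();                        // make the new top visible
  }
}
```

The batch size here is the tradeoff named in the bullets: small batches leave warps idle, while deep stacks overflow shared memory, which is why the BFS-over-primitives design avoids per-block stacks entirely.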
22
Summary and Future Work
Diversified hardware makes it challenging to develop efficient parallel implementations of complex, domain-specific applications across platforms. A framework of data-parallel designs on top of parallel primitives appears to be a viable solution in the context of managing and querying large-scale geo-referenced species distribution data.
Experiments on 4,000+ bird species distribution maps have shown up to 190X speedups for polygon decomposition and 27X speedups for quadtree construction over serial implementations on a high-end GPU. Comparisons with the PixelBox* variants, which are native CUDA implementations, show that efficiency and productivity can be achieved simultaneously with the primitive-based data-parallel framework.
Future work: further understand the advantages and disadvantages of data-parallel designs and implementations on parallel hardware (GPUs, MICs and CMPs) through domain-specific applications; develop more efficient polygon decomposition algorithms (e.g., scanline-based) using parallel primitives; pursue system integration and more applications.
23
Q&A jzhang@cs.ccny.cuny.edu http://www-cs.ccny.cuny.edu/~jzhang/