Data Parallel Quadtree Indexing and Spatial Query Processing of Complex Polygon Data on GPUs
Jianting Zhang 1,2, Simin You 2, Le Gruenwald 3
1 Department of Computer Science, CUNY City College (CCNY)
2 Department of Computer Science, CUNY Graduate Center
3 School of Computer Science, the University of Oklahoma
CISE/IIS Medium Collaborative Research Grants / : “Spatial Data and Trajectory Data Management on GPUs”

Outline
- Introduction & Background
- Application: Large-Scale Biodiversity Data Management
- Data Parallel Designs and Implementations
  - Polygon Decomposition
  - Quadtree Construction
  - Spatial Query Processing
- Experiments
- Summary and Future Work

Parallel Computing – Hardware
[Diagram: a heterogeneous node — a multi-core CPU host (cores with local caches, a shared cache, DRAM, SSD and disk), a GPU (SIMD cores with local caches and GDRAM) and a MIC (many 4-thread, in-order cores), connected over PCI-E and a ring bus]
Example configuration: 16 Intel Sandy Bridge CPU cores + 128GB RAM + 8TB disk + GTX TITAN + Xeon Phi 3120A ~ $9,994

ASCI Red (1997): the first 1-Teraflops (sustained) system, built from 9,298 Intel Pentium II Xeon processors in 72 cabinets.
Nvidia GTX Titan (Feb. 2013): ~7 billion transistors (551 mm²), 2,688 processors, 4.5 TFLOPS SP and 1.3 TFLOPS DP, max bandwidth ~288 GB/s, a 250 W PCI-E peripheral device (17.98 GFLOPS/W SP), suggested retail price $999.
What can we do today with a device that is more powerful than ASCI Red was 17 years ago?

CCNY Computer Science LAN
[Diagram: lab machines — a Microway workstation (dual 8-core, 128GB memory, Nvidia GTX Titan, Intel Xeon Phi 3120A, 8 TB storage), an SGI Octane III (dual quad-core, 48GB memory, Nvidia C2050 x2, 8 TB storage), Dell T5400/T7500 workstations and DIY boxes with Nvidia Quadro 6000, GTX 480, GTX Titan, FX3700 x2 and AMD/ATI 7970 GPUs, a Lenovo T400s and an HP 8740w (Nvidia Quadro 5000m), plus web/application servers, a KVM host, and "brawny"/"wimpy" GPU clusters connected to CUNY HPCC]
...building a highly-configurable experimental computing environment for innovative BigData technologies...

Computer Architecture vs. Spatial Data Management: how to fill the big gap effectively?
(David Wentzlaff, “Computer Architecture”, Princeton University course on Coursera)

Parallel Computing – Languages & Libraries
Thrust, Bolt, CUDPP, Boost, GNU Parallel Mode

Data Parallelism → Parallel Primitives → Parallel Libraries → Parallel Hardware
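To make the primitives-based design concrete, below is a minimal sketch (assuming Thrust; the array contents and layout are hypothetical, not the paper's) that composes two parallel primitives on the GPU: a sort groups quadrants by their owning polygon, and a segmented reduction (reduce_by_key) then counts quadrants per polygon, with no hand-written GPU kernel.

    // Minimal sketch of a data-parallel design built from Thrust primitives.
    // The quadrant-to-polygon array below is hypothetical illustrative data,
    // not the data layout used in the paper.
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>
    #include <vector>
    #include <iostream>

    int main() {
        // Owning polygon id of each decomposed quadrant (initially unsorted).
        std::vector<int> h_ids = {2, 0, 1, 0, 2, 2, 1, 0};
        thrust::device_vector<int> poly_id(h_ids.begin(), h_ids.end());
        thrust::device_vector<int> ones(poly_id.size(), 1);

        // Primitive 1: sort so quadrants of the same polygon become contiguous.
        thrust::sort(poly_id.begin(), poly_id.end());

        // Primitive 2: segmented reduction -> number of quadrants per polygon.
        thrust::device_vector<int> out_id(poly_id.size());
        thrust::device_vector<int> out_cnt(poly_id.size());
        auto ends = thrust::reduce_by_key(poly_id.begin(), poly_id.end(), ones.begin(),
                                          out_id.begin(), out_cnt.begin());

        size_t n = ends.first - out_id.begin();
        for (size_t i = 0; i < n; ++i)
            std::cout << "polygon " << out_id[i] << ": " << out_cnt[i] << " quadrants\n";
        return 0;
    }

The same pattern of composing sorts, scans and reductions underlies the quadtree construction and query processing designs later in the talk.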

Outline
- Introduction & Background
- Application: Large-Scale Biodiversity Data Management
- Data Parallel Designs and Implementations
  - Polygon Decomposition
  - Quadtree Construction
  - Spatial Query Processing
- Experiments
- Conclusions and Future Work

Managing Large-Scale Biodiversity Data

    SELECT aoi_id, sp_id, sum(ST_Area(inter_geom))
    FROM (
        SELECT aoi_id, sp_id, ST_Intersection(sp_geom, qw_geom) AS inter_geom
        FROM SP_TB, QW_TB
        WHERE ST_Intersects(sp_geom, qw_geom)
    ) AS sub
    GROUP BY aoi_id, sp_id
    HAVING sum(ST_Area(inter_geom)) > T;

Indexing “Complex” Polygons
Problems in indexing MBRs:
- Inexpensive yet inaccurate approximation for complex polygons
- Low pruning power when polygons are highly overlapped
“Complex” polygons:
- Polygons with multiple rings (with holes)
- Highly overlapped
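As a point of reference, here is a tiny, hypothetical sketch (not code from the talk) of the standard MBR filter test: two boxes overlap exactly when they overlap on both axes. The test is cheap and never misses a true intersection, but for multi-ring, heavily overlapping range maps it admits many candidate pairs that the expensive refinement step later rejects, which is the low pruning power noted above.

    // Hypothetical sketch of the coarse MBR (minimum bounding rectangle) filter.
    // It never rejects a truly intersecting pair, but for complex, highly
    // overlapped polygons it passes many pairs that the expensive geometric
    // refinement step later throws away.
    #include <iostream>

    struct MBR { float xmin, ymin, xmax, ymax; };

    bool mbr_overlap(const MBR& a, const MBR& b) {
        return a.xmin <= b.xmax && b.xmin <= a.xmax &&   // overlap on the x axis
               a.ymin <= b.ymax && b.ymin <= a.ymax;     // overlap on the y axis
    }

    int main() {
        MBR species {0.0f, 0.0f, 10.0f, 10.0f};   // bounding box of a species range polygon
        MBR query   {9.0f, 9.0f, 12.0f, 12.0f};   // bounding box of a query window
        std::cout << (mbr_overlap(species, query) ? "candidate pair\n" : "pruned\n");
        return 0;
    }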

Indexing “Complex” Polygons
- (Zhang et al. 2009), (Zhang 2012): hours of runtime on bird range maps by extending GDAL/OGR (serial)
- Fang et al., spatial indexing in Microsoft SQL Server (SIGMOD’08): uses a B-Tree to index quadrants, but it is unclear how the quadrants are derived from polygons

Outline
- Introduction & Background
- Application: Large-Scale Biodiversity Data Management
- Data Parallel Designs and Implementations
  - Polygon Decomposition
  - Quadtree Construction
  - Spatial Query Processing
- Experiments
- Conclusions and Future Work

Parallel Quadtree Construction and Parallel Query Processing: DFS → BFS
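A minimal sketch, assuming Thrust, of the DFS-to-BFS idea: instead of recursing into one quadrant at a time, every quadrant of the current level is tested in parallel; quadrants that still need refinement are expanded into their four children, which become the next level. The Quad layout, the NeedsSplit predicate and the Morton-style keys are hypothetical placeholders for the paper's real decomposition test.

    // Sketch of level-by-level (BFS) refinement built from Thrust primitives.
    // 'NeedsSplit' is a hypothetical stand-in for the real test that decides
    // whether a quadrant must be subdivided further.
    #include <thrust/device_vector.h>
    #include <thrust/copy.h>
    #include <thrust/remove.h>
    #include <thrust/transform.h>
    #include <cstdint>
    #include <iostream>

    struct Quad { uint32_t key; int level; };     // Morton-style quadrant key + level

    struct NeedsSplit {                           // placeholder refinement test
        __host__ __device__ bool operator()(const Quad& q) const {
            return q.level < 3 && (q.key & 3u) != 0u;
        }
    };

    struct MakeChild {                            // expand a quadrant into child c (0..3)
        int c;
        __host__ __device__ Quad operator()(const Quad& q) const {
            return Quad{(q.key << 2) | (uint32_t)c, q.level + 1};
        }
    };

    int main() {
        thrust::device_vector<Quad> level(1, Quad{1u, 0});   // start from the root quadrant
        thrust::device_vector<Quad> leaves;

        while (!level.empty()) {                             // one iteration per tree level (BFS)
            size_t n = level.size();

            // Split the current level into quadrants that refine further and finished leaves.
            thrust::device_vector<Quad> splitting(n);
            size_t n_split = thrust::copy_if(level.begin(), level.end(),
                                             splitting.begin(), NeedsSplit()) - splitting.begin();

            size_t old = leaves.size();
            leaves.resize(old + (n - n_split));
            thrust::remove_copy_if(level.begin(), level.end(),
                                   leaves.begin() + old, NeedsSplit());   // keep the non-splitting ones

            // Generate the next level: four children per splitting quadrant, all in parallel.
            thrust::device_vector<Quad> next(4 * n_split);
            for (int c = 0; c < 4; ++c)
                thrust::transform(splitting.begin(), splitting.begin() + n_split,
                                  next.begin() + c * n_split, MakeChild{c});
            level.swap(next);
        }
        std::cout << "leaf quadrants: " << leaves.size() << "\n";
        return 0;
    }

Each level is processed with a handful of primitive calls, so the amount of parallelism grows with the number of quadrants per level rather than being bounded by a per-thread recursion stack.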

Parallel Polygon Decomposition

Observations:
(1) All operations are data parallel at the quadrant level;
(2) Quadrants may be at different levels and come from the same or different polygons;
(3) Each GPU thread processes a quadrant;
(4) Accesses to GPU memory can be coalesced for neighboring quadrants from the same polygon.
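A hedged CUDA sketch of observations (3) and (4), using a hypothetical quadrant layout and placeholder per-quadrant work: each thread handles exactly one quadrant, and because quadrants of the same polygon are stored contiguously, neighboring threads read neighboring array elements and the global-memory accesses coalesce.

    // Sketch: one GPU thread per quadrant (hypothetical data layout).
    // Quadrants from the same polygon are stored contiguously, so neighboring
    // threads touch neighboring array elements and global-memory reads coalesce.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void process_quadrants(const unsigned int* quad_keys,  // Morton-style quadrant keys
                                      const int*          quad_poly,  // owning polygon id per quadrant
                                      int*                out,        // one result slot per quadrant
                                      int                 n_quads)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_quads) return;                  // quadrants may sit at different tree levels;
                                                   // each thread handles exactly one of them
        // Placeholder per-quadrant work: derive the quadrant's level from its key.
        unsigned int key = quad_keys[i];
        int level = 0;
        while (key > 1u) { key >>= 2; ++level; }
        out[i] = quad_poly[i] * 100 + level;       // illustrative result only
    }

    int main() {
        const int n = 8;
        unsigned int h_keys[n] = {4, 5, 6, 7, 16, 17, 18, 19};
        int h_poly[n]          = {0, 0, 0, 0,  1,  1,  1,  1};

        unsigned int* d_keys; int* d_poly; int* d_out;
        cudaMalloc(&d_keys, n * sizeof(unsigned int));
        cudaMalloc(&d_poly, n * sizeof(int));
        cudaMalloc(&d_out,  n * sizeof(int));
        cudaMemcpy(d_keys, h_keys, n * sizeof(unsigned int), cudaMemcpyHostToDevice);
        cudaMemcpy(d_poly, h_poly, n * sizeof(int), cudaMemcpyHostToDevice);

        process_quadrants<<<1, 256>>>(d_keys, d_poly, d_out, n);

        int h_out[n];
        cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; ++i) printf("quadrant %d -> %d\n", i, h_out[i]);
        cudaFree(d_keys); cudaFree(d_poly); cudaFree(d_out);
        return 0;
    }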

Quadtree Construction / Spatial Query Processing

Experiment Setup
Species distribution data: 4,062 bird species in the Western Hemisphere; 708,509 polygons; 77,699,991 vertices.

Polygon group   Num of vertices (range)   Total num of polygons   Total num of points
G1              10-100                    …,559                   11,961,…
G2              100-1,000                 33,374                  8,652,…
G3              1,000-10,000              6,719                   20,436,…
G4              10,000-100,000            1,213                   33,336,083

Hardware: dual 8-core Sandy Bridge CPUs (2.60 GHz), 128GB memory, Nvidia GTX Titan (6GB, 2,688 cores), Intel Xeon Phi 3120A (6GB, 57 cores), 8 TB storage.
Software: CentOS 6.4 with GCC 4.7.2, TBB 4.2, ISPC 1.6, CUDA 5.5.
Notes: all vector initialization times in Thrust on GPUs are counted (newer versions of Thrust allow uninitialized device vectors); performance can vary among CUDA SDKs.

Runtimes (Polygon Decomposition)
[Chart: decomposition runtimes for polygon groups G1 (10-100 vertices), G2 (100-1,000), G3 (1,000-10,000) and G4 (10,000-100,000)]

Comparisons with PixelBox* (Polygon Decomposition)
[Chart: runtimes in milliseconds for the proposed technique, PixelBox*-shared and PixelBox*-global]
PixelBox*: the PixelBox algorithm [5] modified and extended to decompose single polygons (rather than computing the sizes of intersection areas of polygon pairs) and to handle “complex” multi-ring polygons.
PixelBox*-shared: CUDA implementation using GPU shared memory for the stack; DFS traversal with a batch size of N. N cannot be too big (shared memory capacity) or too small (the GPU is underutilized if N is less than the warp size).
PixelBox*-global: CUDA implementation using GPU global memory for the stack; DFS traversal with different batch sizes. Coalesced global GPU memory accesses are efficient.
Proposed technique: Thrust data parallel implementation on top of parallel primitives; BFS traversal with higher degrees of parallelism. Data parallel designs (using primitives) simplify implementations; GPU shared memory is not explicitly used, which is more flexible; coalesced global GPU memory accesses are efficient. But: large memory footprint (for the current implementation).
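For contrast with the BFS sketch above, here is a drastically simplified, hypothetical illustration of the DFS-with-explicit-stack pattern that PixelBox*-shared relies on: the traversal stack lives in shared memory, so its capacity (and therefore the batch size) is bounded by the per-block shared memory. In this toy version a single thread drives the stack over a synthetic quadtree; the real algorithm expands batches of quadrants cooperatively across the block.

    // Toy illustration (not PixelBox) of a DFS traversal whose stack lives in
    // GPU shared memory: the stack capacity, and hence the batch size, is limited
    // by the shared memory available to a thread block.
    #include <cuda_runtime.h>
    #include <cstdio>

    #define STACK_CAP 64          // hypothetical capacity, bounded by shared memory

    __global__ void dfs_count_leaves(int max_level, unsigned long long* leaf_count)
    {
        __shared__ unsigned int stack[STACK_CAP];
        int top = 0;

        // Simplified: thread 0 drives the traversal over a synthetic quadtree.
        if (threadIdx.x == 0) {
            unsigned long long leaves = 0;
            stack[top++] = 1u;                                   // push the root quadrant key
            while (top > 0) {
                unsigned int key = stack[--top];                 // pop
                int level = 0;
                for (unsigned int k = key; k > 1u; k >>= 2) ++level;
                if (level == max_level) { ++leaves; continue; }  // leaf quadrant
                for (unsigned int c = 0; c < 4 && top < STACK_CAP; ++c)
                    stack[top++] = (key << 2) | c;               // push the four children
            }
            atomicAdd(leaf_count, leaves);
        }
    }

    int main() {
        unsigned long long* d_count;
        cudaMalloc(&d_count, sizeof(unsigned long long));
        cudaMemset(d_count, 0, sizeof(unsigned long long));

        dfs_count_leaves<<<1, 32>>>(3, d_count);                 // depth-3 synthetic quadtree

        unsigned long long h_count = 0;
        cudaMemcpy(&h_count, d_count, sizeof(h_count), cudaMemcpyDeviceToHost);
        printf("leaf quadrants: %llu\n", h_count);               // expect 4^3 = 64
        cudaFree(d_count);
        return 0;
    }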

Summary and Future Work
Summary:
- Diversified hardware makes it challenging to develop efficient parallel implementations of complex domain-specific applications across platforms. A framework of data parallel designs on top of parallel primitives appears to be a viable solution in the context of managing and querying large-scale geo-referenced species distribution data.
- Experiments on bird species distribution data have shown up to 190X speedups for polygon decomposition and 27X speedups for quadtree construction over serial implementations on a high-end GPU.
- Comparisons with the PixelBox* variations, which are native CUDA implementations, have shown that efficiency and productivity can be achieved simultaneously with the data parallel framework using parallel primitives.
Future work:
- Further understand the advantages and disadvantages of data parallel designs/implementations on parallel hardware (GPUs, MICs and CMPs) through domain-specific applications.
- More efficient polygon decomposition algorithms (e.g., scanline-based) using parallel primitives.
- System integration and more applications.

Q&A