
1 Parallel Selectivity Estimation for Optimizing Multidimensional Spatial Join Processing on GPUs
Jianting Zhang1,2, Simin You2,4, Le Gruenwald3
1The City College of New York 2CUNY Graduate Center 3University of Oklahoma 4Pitney Bowes Inc.

2 Overview
Introduction & Background
Parallel Spatial Join Framework on GPUs
Parallel Selectivity Estimation on GPUs
Experiments and Discussions
Conclusion and Future Work

3 Introduction: NYC Taxi Trip Data
Taxicabs: 13,000 medallion taxi cabs; a medallion license was priced at $600,000 in 2007; car services and taxi services are separate.
Taxi trip records: ~170 million trips (300 million passengers) in 2009, roughly 1/5 of subway ridership and 1/3 of bus ridership in NYC.
Spatial data volumes are increasing rapidly due to advances in locating, sensing and simulation techniques. If you have come to NYC and hailed a taxi or yellow cab, you have probably seen this. For every taxi trip, dozens of columns of attributes are recorded, including pick-up and drop-off locations and times. There were about 170 million trips in NYC in 2009, and we will use the pickup locations in our experiments. Many of you have a smartphone with an embedded GPS; given that nearly 1.5 billion smartphones were sold in 2015, you can imagine how much spatial data is generated every minute, hour, day and year. Very often, different spatial datasets need to be joined to derive new information and knowledge to support decision making. For example, GPS traces can be better interpreted when aligned with urban infrastructure, such as road networks and Points of Interest (POIs), through spatial joins.

4 Introduction: More Spatial Data
In addition to spatial data related to location-dependent services, spatial data is virtually everywhere and at multiple scales, from cosmology, earth sciences, regional ecology and urban systems all the way down to the human body, organisms, cells and atoms. Processing large-scale spatial data, including spatial joins, is challenging due to the uniqueness of spatial data compared to relational data. Spatial data is inherently multi-dimensional and lacks a total ordering that preserves proximity. Geometry operators, such as the point-in-polygon test, typically involve considerable floating-point computation and are much more compute-intensive than relational queries, in addition to being data intensive.

5 Introduction: Existing Techniques
Example 1: Loading 170 million taxi pickup locations into PostgreSQL
UPDATE t SET PUGeo = ST_SetSRID(ST_Point("PULong","PuLat"),4326);
105.8 hours! (Tuple/row-based, disk-resident, object-relational database)
Example 2: Finding the nearest tax blocks for 170 million taxi pickup locations using the open-source libspatialindex + GDAL
30.5 hours! (Serial algorithms on uniprocessors, with excessive memory allocation/deallocation and other overheads)
Although existing techniques, including spatial databases and open-source packages, can process spatial data, the legacy code is mostly built on serial algorithms for uniprocessors in a disk-resident storage environment, and it is highly inefficient for large-scale spatial data. In the first example, assuming all 170 million NYC taxi records from 2009 have been imported into a PostgreSQL database, a geometry column must be created before spatial queries can be processed by the PostGIS extension. Due to the tuple/row-based data layout in PostgreSQL, creating the geometry column alone takes more than 100 hours. In the second example, our implementation that computes the nearest tax blocks (polygons) for the 170 million pickup locations, using libspatialindex for spatial indexing and GDAL for point-to-polygon distance calculation, takes more than 30 hours, even though only the pickup location (latitude/longitude) is used, which already eliminates unnecessary disk I/O and database overheads. A preliminary analysis shows that GDAL calls the GEOS library for geometry operations, which allocates and deallocates memory frequently and cannot make full use of the large memory capacities of modern hardware. GEOS was optimized for hardware of more than two decades ago, when memory capacities were very small (on the order of MBs) and memory operations were far less expensive relative to floating-point computation than they are on modern hardware.

6 ASCI Red (1997) vs. Nvidia GTX Titan (Feb. 2013)
ASCI Red (1997): the first 1-Teraflops (sustained) system, built with Intel Pentium II Xeon processors in 72 cabinets. $$$? Space? Power?
Nvidia GTX Titan (Feb. 2013): 7.1 billion transistors (551 mm²); 2,688 processors; max bandwidth GB/s; PCI-E peripheral device; 250 W (17.98 GFLOPS/W, single precision); suggested retail price $999.
What can we do today using a device that is more powerful than ASCI Red was 20 years ago?
The right side of the slide shows an Nvidia GTX Titan released more than four years ago, which has nearly 3,000 processors and provides a Teraflops of computing power, yet consumes only 250 W and cost under $1,000. This is a significant improvement in cost, space and power consumption over the world's first 1-Teraflops system from 1997, shown on the left. The question is what we can do with such a device to accelerate spatial data processing, and spatial joins in particular.

7 Parallel Spatial Join Framework on GPUs
SELECT * FROM T1, T2 WHERE ST_OP(T1.the_geom, T2.the_geom)
Spatial data processing and spatial data management (spatial databases) rely on spatial indexing and spatial joins; many spatial indexing and spatial join techniques have been proposed in the past few decades.
Spatial indexing is built on quadrants (for points) or on the MBBs (Minimum Bounding Boxes) of polylines/polygons.
A spatial join includes a filtering phase and a refinement phase:
The filtering phase uses spatial indices to match pairs of MBBs in T1 and T2 based on spatial intersection (overlap), reducing complexity from O(n1*n2) to O(n1) or O(n2).
The refinement phase performs geometric operations (e.g., point-to-polyline distance or point-in-polygon test) on the matched pairs.
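The filter-and-refine pattern just described can be sketched in Python as a CPU reference; this is an illustrative sketch with hypothetical names, not the GPU implementation, and it uses a brute-force filter where the framework would use grid-file indexing:

```python
def mbb_intersects(a, b):
    # MBBs as (x1, y1, x2, y2); the overlap test used in the filtering phase
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def spatial_join(t1, t2, refine):
    # Filtering: match pairs whose MBBs overlap (a real system would use
    # a spatial index here to avoid the O(n1*n2) scan).
    candidates = [(i, j) for i, a in enumerate(t1)
                         for j, b in enumerate(t2) if mbb_intersects(a, b)]
    # Refinement: run the exact geometry operation on matched pairs only.
    return [(i, j) for i, j in candidates if refine(t1[i], t2[j])]

# Example: join with a trivial refinement predicate (always true).
t1 = [(0, 0, 2, 2), (5, 5, 6, 6)]
t2 = [(1, 1, 3, 3), (10, 10, 11, 11)]
print(spatial_join(t1, t2, lambda a, b: True))  # → [(0, 0)]
```

The refinement predicate is a placeholder for the expensive geometry operation (e.g., point-in-polygon test) that only runs on pairs surviving the filter.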

8 Parallel Spatial Join Framework on GPUs
Spatial filtering (a global operation, using spatial indexing); spatial refinement (a local operation, batch-friendly, low memory footprint, compute-intensive). Lightweight selectivity estimation saves memory; flat grid-file based indexing is simple yet memory-intensive.
This slide gives more details on a spatial join framework that uses on-demand, flat grid-file based spatial indexing for spatial joins on GPUs. We use the framework to illustrate the role of the proposed selectivity estimation technique. Selectivity estimation (top-right part) is a vital component of query optimization in both relational and spatial databases. Given a set of query items in T1, selectivity estimation techniques estimate the number of items in T2 that are likely to be joined with each query item. Fast and accurate selectivity estimates help database query optimizers choose better query plans under resource constraints. Some selectivity estimation techniques rely on sophisticated spatial indexing for accurate estimates. In this study, we assume there are no pre-existing spatial indices for either of the two input datasets, and we use a simple flat grid-file based structure for on-the-fly spatial indexing. The grid-file index works well with the selectivity estimation technique, as both use regularly spaced grids. Flat grid-file based indexing is easy to parallelize, implement and use, but is known to have a large memory footprint, which grows with the grid level k (a 2^k*2^k grid). As such, it is important to find a suitable k that provides the best spatial filtering while respecting the memory budget (more details on the next slide).
Here is more information on spatial filtering using flat grid-file based spatial indexing. In the middle of the figure, after the MBBs of the geometric objects are aligned to one or more grid cells, generating (P, Q) pairs can be transformed into a binary search problem. For each grid cell in the VPC vector, which stores the one-to-many mappings from the MBB of a geometric object in T1 to the grid cells that the MBB intersects, we search for the cell in the VQC vector, which stores the corresponding mappings for T2 (center). The corresponding object identifiers in VPP and VQQ form a (P, Q) pair, and the matched objects in T1 and T2 are paired for refinement. Clearly, for MBB pairs that cover multiple grid cells, the (P, Q) pairs will be duplicated, and the duplicates must be removed to avoid redundant spatial refinements. During the refinement phase, unique (P, Q) pairs are assigned to GPU computing blocks for parallel processing (bottom).
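The binary-search pairing over the VPC/VQC vectors described above can be mimicked on the CPU. The following Python sketch assumes cells are identified by (row, column) and object ids play the role of VPP/VQQ; all names are hypothetical and the set-based de-duplication stands in for the GPU's duplicate-removal step:

```python
from bisect import bisect_left, bisect_right

def rasterize(mbbs, cell):
    # Map each MBB to the grid cells it intersects; returns parallel
    # vectors of cell ids (like VPC/VQC) and object ids (like VPP/VQQ).
    vc, vo = [], []
    for oid, (x1, y1, x2, y2) in enumerate(mbbs):
        for r in range(int(y1 // cell), int(y2 // cell) + 1):
            for c in range(int(x1 // cell), int(x2 // cell) + 1):
                vc.append((r, c)); vo.append(oid)
    return vc, vo

def pair_by_cells(t1, t2, cell):
    vpc, vpp = rasterize(t1, cell)
    vqc, vqq = rasterize(t2, cell)
    # Sort T2's (cell, object) entries by cell id so each VPC cell can be
    # located in VQC with binary search.
    order = sorted(range(len(vqc)), key=lambda k: vqc[k])
    vqc_sorted = [vqc[k] for k in order]
    vqq_sorted = [vqq[k] for k in order]
    pairs = set()   # de-duplicate pairs that span multiple cells
    for cell_id, p in zip(vpc, vpp):
        lo = bisect_left(vqc_sorted, cell_id)
        hi = bisect_right(vqc_sorted, cell_id)
        for k in range(lo, hi):
            pairs.add((p, vqq_sorted[k]))
    return sorted(pairs)

t1 = [(0.5, 0.5, 1.5, 1.5)]          # spans 4 cells at cell size 1.0
t2 = [(1.2, 1.2, 1.8, 1.8), (9, 9, 9.5, 9.5)]
print(pair_by_cells(t1, t2, 1.0))    # → [(0, 0)]
```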

9 Parallel Spatial Join Framework on GPUs
Using flat grid-file based spatial filtering: the number of possible matches within a grid cell i is bounded by |B1i|*|B2i|, where |B1i| and |B2i| are the numbers of MBBs from T1 and T2 that overlap cell i. The total number of pairs after the binary search but before removing duplicates is S = Ʃi |B1i|*|B2i|, which is the bottleneck for the GPU memory footprint. Our selectivity estimation technique computes S for multiple grid levels and chooses the one with the minimum memory footprint.
A major issue we encountered in applying the framework to practical spatial join applications is the difficulty of choosing the grid cell resolution, which has a significant impact on GPU memory consumption in the filtering phase. When a large cell size is chosen (coarse grid), more MBBs from both input datasets are associated with each non-empty grid cell. When a small cell size is chosen (fine grid), |B1i| and |B2i| are likely to be smaller, but the number of cells usually grows quadratically, which may also produce a large number of intermediate pairs. In this small example, the third (optimal) grid yields the smallest number of pairs; storing those pairs consumes the least memory and is therefore preferred. As such, before the flat grid-file based spatial indexing step, we perform a selectivity estimation step that helps choose the best grid level, one that also respects the memory budget for indexing, by computing S at multiple grid levels. It turns out this can be computed conveniently and efficiently in parallel on GPU hardware. We have implemented several spatial indexing and spatial refinement techniques on GPUs, and we refer readers to the sources of the related works.
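The grid-level selection just described can be sketched in Python as a CPU reference (all function names are hypothetical; the paper computes S with parallel primitives on the GPU):

```python
from collections import Counter

def cells_covered(mbb, cell):
    # Grid cells intersected by an MBB (x1, y1, x2, y2) at a given cell size.
    x1, y1, x2, y2 = mbb
    return [(r, c) for r in range(int(y1 // cell), int(y2 // cell) + 1)
                   for c in range(int(x1 // cell), int(x2 // cell) + 1)]

def estimated_pairs(t1, t2, cell):
    # S = sum over cells i of |B1i| * |B2i|: the candidate pairs the
    # filtering phase would materialize before de-duplication.
    b1 = Counter(c for m in t1 for c in cells_covered(m, cell))
    b2 = Counter(c for m in t2 for c in cells_covered(m, cell))
    return sum(n * b2[c] for c, n in b1.items())

def best_grid_level(t1, t2, levels, extent=16.0):
    # Evaluate S at several grid levels (cell size = extent / 2^k) and pick
    # the level with the smallest estimated memory footprint.
    return min(levels, key=lambda k: estimated_pairs(t1, t2, extent / 2**k))

t1 = [(0, 0, 3, 3)]
t2 = [(1, 1, 2, 2)]
print(best_grid_level(t1, t2, [1, 2, 3]))  # → 1
```

Ties are broken toward the coarser level here; the real system would also check each level's footprint against the memory budget.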

10 Parallel Selectivity Estimation on GPUs
Basic idea: compute the |B1i| and |B2i| values efficiently on GPUs so that S can be computed by a parallel reduction (prefix sum), which is known to be efficient on GPUs.
How: treat the numbers of MBB corners that fall within grid cells as pixel values and apply the Summed-Area-Table algorithm from image processing, which is known to be parallelizable. The value at any point (x, y) in a summed-area table is the sum of all the pixels above and to the left of (x, y), with the origin at (0,0).
Ni = Hll(x2,y2) - Hlr(x1-1,y2) - Hul(x2,y1-1) + Hur(x1-1,y1-1) (Jin et al., ICDE'00 [8]); setting x1=x2=c and y1=y2=r gives |B1i| at cell (r,c).
The lower-left part shows the definition and an illustration of the summed-area table. Assuming A, B, C and D are the values at the four corners of rectangle ABCD in the summed-area table, the sum of all pixel values within rectangle ABCD is D-B-C+A. The Summed-Area-Table has been applied to spatial selectivity estimation in the form of a cumulative histogram that handles MBBs instead of pixels. Ref [8] provides a formula to calculate the number of MBBs that intersect a query window by using four summed-area tables, one for each of the four corners of the MBBs. The lower-right part shows a concrete example: assuming the summed-area tables of the four MBB corners have been computed, the number of MBBs that intersect a query window (x1,y1,x2,y2) can be computed by the equation shown in the right-middle part of the slide. Applying this to selectivity estimation for the cell at row r and column c simply sets x1=x2=c and y1=y2=r.
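As an illustration of the cumulative-histogram idea, here is a small CPU sketch in Python (hypothetical names; on the GPU the tables are built with the scan/transpose primitives shown on the following slides). It builds the four corner tables and applies the inclusion-exclusion formula above:

```python
def sat(points, n):
    # Summed-area table of a point-count histogram on an n x n grid:
    # sat[y][x] = number of points with coordinates <= (x, y).
    h = [[0] * n for _ in range(n)]
    for x, y in points:
        h[y][x] += 1                      # point aggregation
    for y in range(n):                    # prefix sum along each row
        for x in range(1, n):
            h[y][x] += h[y][x - 1]
    for y in range(1, n):                 # prefix sum down each column
        for x in range(n):
            h[y][x] += h[y - 1][x]
    return h

def count_intersecting(mbbs, window, n):
    # Four SATs, one per MBB corner, then the inclusion-exclusion formula.
    hll = sat([(x1, y1) for x1, y1, x2, y2 in mbbs], n)
    hlr = sat([(x2, y1) for x1, y1, x2, y2 in mbbs], n)
    hul = sat([(x1, y2) for x1, y1, x2, y2 in mbbs], n)
    hur = sat([(x2, y2) for x1, y1, x2, y2 in mbbs], n)
    x1, y1, x2, y2 = window
    def at(h, x, y):                      # SAT lookup, 0 outside the grid
        return h[y][x] if x >= 0 and y >= 0 else 0
    return (at(hll, x2, y2) - at(hlr, x1 - 1, y2)
            - at(hul, x2, y1 - 1) + at(hur, x1 - 1, y1 - 1))

mbbs = [(1, 1, 2, 2), (5, 5, 6, 6)]
print(count_intersecting(mbbs, (0, 0, 3, 3), 8))  # → 1
```

Setting x1 = x2 = c and y1 = y2 = r degenerates the query window to a single cell and yields |B1i| for the cell at row r, column c, exactly as on the slide.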

11 Parallel Selectivity Estimation on GPUs
An example of generating Hll (Hlr, Hul and Hur are generated similarly):
Point aggregation (origin at (0,0)) → scan by row → transpose → scan by row → transpose → Hll.
We use a simple example to illustrate how Hll is computed. First we extract the coordinates of the lower-left corner (x1,y1) of every MBB and determine the cell each falls within; after this step each cell records the number of points falling within it, which we call the "Point Aggregation" step. We next perform a scan (prefix sum) along each row and then transpose the resulting matrix. The same scan+transpose combination is applied again to generate the final Hll. Note that in this example the result of the second transpose is unchanged; this is because the matrix is symmetric, with the 1s along the diagonal. A better example would have multiple points within some grid cells so that the matrix is not symmetric.
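The pipeline above (point aggregation, scan by row, transpose, scan by row, transpose) can be mirrored step by step on the CPU. This Python sketch uses hypothetical names, with each function corresponding to one GPU primitive, and deliberately uses a non-symmetric example with multiple points in one cell:

```python
def point_aggregation(points, n):
    # Count MBB lower-left corners falling in each grid cell.
    g = [[0] * n for _ in range(n)]
    for x, y in points:
        g[y][x] += 1
    return g

def scan_by_row(g):
    # Inclusive prefix sum along each row (one row per GPU thread block).
    return [[sum(row[:c + 1]) for c in range(len(row))] for row in g]

def transpose(g):
    return [list(col) for col in zip(*g)]

def build_hll(points, n):
    # scan + transpose, applied twice, yields the summed-area table Hll.
    g = point_aggregation(points, n)
    return transpose(scan_by_row(transpose(scan_by_row(g))))

# Two points share cell (1, 1), so the matrix is not symmetric.
hll = build_hll([(0, 0), (1, 1), (1, 1), (2, 0)], 3)
print(hll[1][1])  # → 3: three corners lie at or below cell (row 1, col 1)
```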

12 Parallel Selectivity Estimation on GPUs
Simple in design, highly parallelizable and easy to implement.
Functor: a function object. In Thrust, functors work with parallel primitives (transform in this case) and define the logic applied to each and every element of a vector (in parallel).
Our technique is simple in design, highly parallelizable and easy to implement. The algorithm has a main procedure (selectivity_estimation) and three modules, all of which are straightforward. The whole process can be considered as chaining several well-understood and efficiently implemented parallel primitives. The lines underlined in red are direct applications of the Thrust API (e.g., scatter); the lines underlined in purple combine the Thrust transform parallel primitive with user-defined functors that specify how each element of the input vectors is processed to generate the output vector (i.e., map in MapReduce); the lines underlined in green could also be implemented with Thrust APIs, but we have implemented them natively in CUDA (examples on the next slides) for efficiency.

13 Parallel Selectivity Estimation on GPUs
Tile-based CUDA implementation of matrix transpose using shared memory, based on:

#define TILE_DIM 32
#define BLOCK_ROWS 8

template <typename T>
__global__ void transposeKernel(T *odata, T *idata, int width, int height)
{
    __shared__ T tile[TILE_DIM][TILE_DIM + 1];   // +1 padding avoids bank conflicts
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in = xIndex + yIndex * width;
    xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
    yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
    int index_out = xIndex + yIndex * height;
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        tile[threadIdx.y + i][threadIdx.x] = idata[index_in + i * width];
    __syncthreads();
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        odata[index_out + i * height] = tile[threadIdx.x][threadIdx.y + i];
}

template <typename T>
void cudaTranspose(T *odata, T *idata, int size_x, int size_y)
{
    dim3 grid(size_x / TILE_DIM, size_y / TILE_DIM), threads(TILE_DIM, BLOCK_ROWS);
    transposeKernel<<<grid, threads>>>(odata, idata, size_x, size_y);
}

Thrust implementation of matrix transpose (using gather and a functor):

struct transpose_index : public thrust::unary_function<size_t, size_t>
{
    size_t m, n;
    __host__ __device__ transpose_index(size_t _m, size_t _n) : m(_m), n(_n) {}
    __host__ __device__ size_t operator()(size_t linear_index)
    {
        size_t i = linear_index / n;
        size_t j = linear_index % n;
        return m * j + i;
    }
};

thrust::counting_iterator<size_t> indices(0);
thrust::gather(thrust::make_transform_iterator(indices, transpose_index(n, m)),
               thrust::make_transform_iterator(indices, transpose_index(n, m)) + dst.size(),
               src.begin(), dst.begin());

The CUDA native implementation is several times faster!
Now let us look at some of the implementation details on GPUs (skip if time is limited). The first panel shows the CUDA implementation of the 2D matrix transpose parallel primitive as a CUDA kernel. By dividing a large array into tiles of TILE_DIM*TILE_DIM elements, a thread block with TILE_DIM*BLOCK_ROWS threads processes one tile. Since TILE_DIM (32) is a multiple of BLOCK_ROWS (8), a thread performs TILE_DIM/BLOCK_ROWS loop iterations for both read-in and write-out. Shared memory (__shared__ T tile) is used for coalesced memory accesses among threads (not perfect, but best effort). (Optional detail: in the two for loops, index_in and index_out, after expansion, have threadIdx.x as the fastest-changing term; this means that both reads from global memory into shared memory and writes from shared memory to global memory are coalesced, using the 2D shared-memory tile as a scratch pad, which is faster and supports random accesses more efficiently. Explaining the choice of BLOCK_ROWS is more complicated; address it if the question is raised.) The second panel shows the Thrust implementation, using the gather parallel primitive and the transpose_index functor. It is much easier to understand, but its performance is several times worse than the CUDA implementation. We initially implemented the Thrust version and then developed the CUDA version as an optimization; the experiments use the CUDA version for efficiency.

14 Parallel Selectivity Estimation on GPUs
CUDA native implementation of scan (prefix sum) by row, allowing arbitrary row length: break a long row into segments, then scan each segment in shared memory.

template <typename T>
__global__ void horizontalScanKernel(size_t width, T *data)
{
    // One thread block scans one row; a long row is broken into segments
    // of blockDim.x elements, with the running total carried across segments.
    extern __shared__ unsigned char pool[];
    T *s_Data = (T *)pool;
    size_t elements = width;
    int baseIdx = blockIdx.x * width;
    int shift = 0;
    int num_threads = blockDim.x;
    __shared__ T prev;
    if (threadIdx.x == 0) prev = 0;
    while (elements > 0) {
        __syncthreads();
        int maxIdx = (elements > (size_t)num_threads) ? num_threads : (int)elements;
        // Load a segment from global memory into shared memory
        T val = (threadIdx.x < maxIdx) ? data[baseIdx + threadIdx.x + shift] : 0;
        if (threadIdx.x == 0) val += prev;   // carry from the previous segment
        s_Data[threadIdx.x] = val;
        __syncthreads();
        // Hillis-Steele inclusive scan within the segment
        for (int offset = 1; offset < maxIdx; offset <<= 1) {
            T tmp = 0;
            if (threadIdx.x >= offset) tmp = s_Data[threadIdx.x - offset];
            __syncthreads();
            s_Data[threadIdx.x] += tmp;
            __syncthreads();
        }
        if (threadIdx.x < maxIdx)
            data[baseIdx + threadIdx.x + shift] = s_Data[threadIdx.x];
        if (threadIdx.x == maxIdx - 1) prev = s_Data[threadIdx.x];
        elements -= maxIdx;
        shift += maxIdx;
    }
}

Similarly, instead of using the inclusive_scan_by_key parallel primitive with the row index as the key (our first implementation) to implement "scan-by-row", we developed a native CUDA implementation as an optimization. The CUDA version lets each thread block perform the scan of a single row. Since a row may be much longer than the number of threads in a thread block, we break the row into segments and scan each segment in parallel; the prefix sum of the previous segment is carried into the next segment as its initial value (prev). Shared memory is used for the per-segment scan for efficiency.

15 Experiment
Datasets
MBBs of point quadrants of 170 million taxi trip pickup locations in NYC (2009): |Taxi| = 1,746,795
MBBs of NYC MapPLUTO tax lots: |Pluto| = 735,488
Platform
2013 Nvidia GTX Titan GPU device with 2,688 cores and 6 GB memory
Grid Levels
2^n*2^n for n = 10, 11, 12, 13 (1024*1024 to 8192*8192)

16 Experiment: Saves 34.8% memory using 321 ms of computation time!
SUM(Ts) = 321 ms; (AvgN - minN)/AvgN = 34.8%

Grid Level k   Grid Size    # of Estimated Pairs (N)   Runtime Ts (ms)
13             8192*8192                               205
12             4096*4096                               63
11             2048*2048                               31
10             1024*1024                               22

The gen_sat time dominates the overall runtime for large grid sizes but is insignificant for small grid sizes; using smaller grid sizes gives faster but possibly sub-optimal selectivity estimations.
Assuming each grid resolution is equally likely to be picked for spatial filtering, the expected number of estimated pairs is AvgN = ƩNi/4. After applying the selectivity estimation algorithm, we are able to pick the grid resolution with the minimum number of pairs, minN = min(Ni); thus the benefit is (AvgN - minN)/AvgN = 34.8%. The data transfer time from CPU to GPU (~5 ms for a volume under 30 MB) is negligible and is not counted.
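The benefit computation can be replayed with a tiny Python sketch; the per-level pair counts below are made up for illustration only, since the actual N values come from the selectivity estimation step:

```python
# Hypothetical per-level estimated pair counts N_i (illustrative numbers,
# not the values from the experiment).
n_pairs = [40_000_000, 26_000_000, 22_000_000, 35_000_000]

avg_n = sum(n_pairs) / len(n_pairs)  # expected pairs if a level is picked at random
min_n = min(n_pairs)                 # pairs at the level the estimator selects
benefit = (avg_n - min_n) / avg_n    # relative memory saving
print(f"{benefit:.1%}")              # → 28.5%
```

With the real measured N values this ratio evaluates to the 34.8% reported on the slide.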

17 Conclusion and Future Work
Summary
We provide a simple parallel selectivity estimation technique that reduces the memory footprint of spatial join processing on GPUs, where memory capacity is typically a limiting factor.
Experiments joining two MBB sets, each on the order of millions of MBBs, show that our technique reduces the memory footprint by 34.8% in about 1/3 of a second, which is desirable and practically useful.
Future work: many performance optimization opportunities remain:
Simultaneous multi-level point aggregation
Using compressed point aggregation results to reduce runtime
Kernel fusion to reduce GPU global memory access overheads among the 10+ parallel primitives
A multi-level approach to extend the technique, using CPU memory as a buffer and dynamically bringing blocks into GPU memory for block-wise processing, then combining blocks to support large grids

18 Thanks Q&A

