Parallel Selectivity Estimation for Optimizing Multidimensional Spatial Join Processing on GPUs
Jianting Zhang1,2, Simin You2,4, Le Gruenwald3
1The City College of New York  2CUNY Graduate Center  3University of Oklahoma  4Pitney Bowes Inc.

Overview
Introduction & Background
Parallel Spatial Join Framework on GPUs
Parallel Selectivity Estimation on GPUs
Experiments and Discussions
Conclusion and Future Work

Introduction: NYC Taxi Trip Data
Taxicabs: 13,000 medallion taxi cabs; a medallion license was priced at $600,000 in 2007; car services and taxi services are separate.
Taxi trip records: ~170 million trips (300 million passengers) in 2009, about 1/5 of subway ridership and 1/3 of bus ridership in NYC.
Spatial data volumes are increasing rapidly due to advances in locating, sensing and simulation techniques. If you have visited NYC and hailed a yellow cab, you have probably seen this. For every taxi trip, dozens of attribute columns are recorded, including pick-up and drop-off locations and times. There were about 170 million trips in NYC in 2009, and we use the pickup locations in our experiments. Many of you carry a smartphone with an embedded GPS; given that nearly 1.5 billion smartphones were sold in 2015, you can imagine how much spatial data is generated every minute, hour, day and year. Very often, different spatial datasets need to be joined to derive new information and knowledge that support decision making. For example, GPS traces can be better interpreted when aligned with urban infrastructure, such as road networks and Points of Interest (POIs), through spatial joins.

Introduction: More Spatial Data
In addition to spatial data related to location-dependent services, spatial data is virtually everywhere and at multiple scales: cosmology, earth sciences, regional ecology and urban studies, all the way down to the human body, organisms, cells and atoms. Processing large-scale spatial data, including spatial joins, is challenging because spatial data differs fundamentally from relational data. Spatial data is inherently multi-dimensional and lacks a total ordering that preserves proximity. Geometry operators, such as the point-in-polygon test, typically involve considerable floating-point computation and are much more compute-intensive than relational queries, in addition to being data-intensive.

Introduction: Existing Techniques
Example 1: loading 170 million taxi pickup locations into PostgreSQL
UPDATE t SET PUGeo = ST_SetSRID(ST_Point("PULong","PuLat"),4326);
105.8 hours! (a tuple/row-based, disk-resident object-relational database)
Example 2: finding the nearest tax blocks for 170 million taxi pickup locations using the open-source libspatialindex + GDAL
30.5 hours! (a serial algorithm on a uniprocessor, with excessive memory allocation/deallocation and other overheads)
Although existing techniques, including spatial databases and open-source packages, can process spatial data, the legacy code is mostly built on serial algorithms for uniprocessors in a disk-resident storage environment, and it is highly inefficient for large-scale spatial data. In the first example, assuming all 170 million NYC taxi records from 2009 have been imported into a PostgreSQL database, a Geometry column must be created before spatial queries can be processed by the PostGIS extension. Due to the tuple/row-based data layout in PostgreSQL, creating the Geometry column alone takes more than 100 hours. In the second example, our implementation that computes the nearest tax blocks (polygons) for the 170 million pickup locations, using libspatialindex for spatial indexing and GDAL for point-to-polygon distance calculation, takes more than 30 hours, even though only the pickup locations (latitude/longitude) are used, which already eliminates unnecessary disk I/O and database overheads. A preliminary analysis shows that GDAL calls the GEOS library for geometry operations, which allocates and deallocates memory frequently and cannot make full use of the large memory capacities of modern hardware. GEOS was optimized for hardware of more than two decades ago, when memory capacities were very small (on the order of MBs) and memory-related operations were far less expensive relative to floating-point computation than they are today.

ASCI Red (1997): the first system to sustain 1 Teraflops, with 9,298 Intel Pentium II Xeon processors in 72 cabinets. Cost? Space? Power?
Nvidia GTX Titan (Feb. 2013): 7.1 billion transistors (551 mm²), 2,688 processors, 288.4 GB/s max bandwidth, a PCI-E peripheral device drawing 250 W (17.98 GFLOPS/W single-precision), suggested retail price $999.
The right side of the slide shows an Nvidia GTX Titan released more than four years ago, which has nearly 3,000 processors and provides a Teraflops of computing power, yet consumes only 250 W and costs under $1,000. With respect to cost, space and power consumption, this is a significant improvement over the world's first 1-Teraflops system from 1997, shown on the left. The question is what we can do with such a device to accelerate spatial data processing, and spatial joins in particular.
What can we do today using a device that is more powerful than the ASCI Red of 20 years ago?

Parallel Spatial Join Framework on GPUs
SELECT * FROM T1, T2 WHERE ST_OP (T1.the_geom, T2.the_geom)
Spatial data processing and spatial data management (spatial databases) rely on spatial indexing and spatial joins. Spatial indexing is built on point quadrants or on the MBBs (Minimum Bounding Boxes) of polylines/polygons. A spatial join consists of a filtering phase and a refinement phase. Spatial filtering uses spatial indices to match pairs of MBBs from T1 and T2 based on spatial intersection (overlap), reducing the complexity from O(n1*n2) to O(n1) or O(n2). Spatial refinement performs geometric operations (e.g., point-to-polyline distance or point-in-polygon test) on the matched pairs. Many spatial indexing and spatial join techniques have been proposed in the past few decades.

Parallel Spatial Join Framework on GPUs
Lightweight selectivity estimation to save memory. Flat grid-file based indexing: simple yet memory-intensive. Spatial filtering (a global operation, using spatial indexing). Spatial refinement (a local operation: batch-friendly, low memory footprint, compute-intensive).
This slide gives more detail on a spatial join framework that uses on-demand, flat grid-file based spatial indexing for spatial joins on GPUs. We use the framework to illustrate the role of the proposed selectivity estimation technique. Selectivity estimation (top-right part) is considered a vital component of query optimization in both relational and spatial databases. Given a set of query items in T1, selectivity estimation techniques estimate the number of items in T2 that are likely to be joined with each query item. Fast and accurate selectivity estimates help database query optimizers choose better query plans under resource constraints. Some selectivity estimation techniques rely on sophisticated spatial indexing for accurate estimates. In this study, we assume that no pre-existing spatial index is available for either input dataset, and we use a simple flat grid-file based structure for on-the-fly spatial indexing. The grid-file index works well with the selectivity estimation technique, as both use regularly spaced grids. Flat grid-file based indexing is easy to parallelize and easy to implement and use, but it is known to have a large memory footprint, which grows with the grid level (2^k*2^k cells). As such, it is important to find a suitable k that provides the best spatial filtering while respecting the memory budget (more details on the next slide).
Here is more information on spatial filtering using flat grid-file based spatial indexing. In the middle of the figure, after the MBBs of the geometric objects are aligned to one or more grid cells, generating (P, Q) pairs can be transformed into a binary search problem. The VPC vector stores the one-to-many mappings between the MBB of each geometric object in T1 and the grid cells that the MBB intersects; the VQC vector stores the same mappings for T2. For each grid cell in VPC, we search for the cell in VQC; the corresponding object identifiers in VPP and VQQ form a (P, Q) pair (a sketch of this step follows below). The matched objects in T1 and T2 are then paired for refinement. Clearly, for MBB pairs that cover multiple grid cells, the (P, Q) pairs will be duplicated, and the duplicates must be removed to avoid redundant spatial refinements. During the refinement phase, unique (P, Q) pairs are assigned to GPU computing blocks for parallel processing (bottom).
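Below is a minimal Thrust sketch of this cell-matching step, assuming VPC and VQC hold the grid-cell ids that the MBBs of T1 and T2 were rasterized to (vector names follow the slide; the expansion into an explicit (P, Q) pair list and the duplicate removal are omitted):

#include <thrust/device_vector.h>
#include <thrust/binary_search.h>

void match_cells(const thrust::device_vector<int>& VPC,  // cell ids from T1 (any order)
                 const thrust::device_vector<int>& VQC,  // cell ids from T2 (sorted)
                 thrust::device_vector<int>& lo,         // first matching position in VQC
                 thrust::device_vector<int>& hi)         // one-past-last matching position
{
  lo.resize(VPC.size());
  hi.resize(VPC.size());
  // Vectorized binary searches: for each cell id in VPC, locate the run of
  // equal cell ids in VQC; the VQC entries in [lo[i], hi[i]) pair with VPC[i].
  thrust::lower_bound(VQC.begin(), VQC.end(), VPC.begin(), VPC.end(), lo.begin());
  thrust::upper_bound(VQC.begin(), VQC.end(), VPC.begin(), VPC.end(), hi.begin());
  // hi[i] - lo[i] candidate pairs are contributed by VPC[i]; an exclusive scan
  // over these counts yields the write offsets for the expanded (P, Q) list.
}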

Parallel Spatial Join Framework on GPUs
Using flat grid-file based spatial filtering, the number of possible matches within a grid cell i is bounded by |B1i|*|B2i|, where |B1i| and |B2i| are the numbers of MBBs from T1 and T2 that overlap cell i. The total number of pairs after the binary search but before removing duplicates is S = Σi (|B1i| * |B2i|), which is the bottleneck for the GPU memory footprint. Our selectivity estimation technique computes S for multiple grid levels and chooses the level with the minimum memory footprint (see the sketch below).
A major issue we encountered in applying the framework to practical spatial join applications is the difficulty of choosing the grid cell resolution, which has a significant impact on GPU memory consumption in the filtering phase. When a large cell size is chosen (a coarse grid), more MBBs from both input datasets are associated with each non-empty grid cell. When a small cell size is chosen (a fine grid), |B1i| and |B2i| are likely to be smaller, but the number of cells grows quadratically with the grid dimension, which may also produce a large number of intermediate pairs. In the small example on the slide, the third (optimal) grid produces the smallest number of pairs; storing these pairs consumes the least memory and is therefore preferred. As such, before the flat grid-file based spatial indexing step, we perform a selectivity estimation step that helps choose the best grid level within the memory budget by computing S at multiple grid levels. It turns out this can be computed conveniently and efficiently in parallel on GPU hardware. We have also implemented several spatial indexing and spatial refinement techniques on GPUs, and we refer readers to the sources of the related works.
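As a concrete illustration, the estimate for a single grid level reduces to one Thrust call. B1 and B2 (hypothetical names) are assumed to hold the per-cell MBB counts of T1 and T2 over the same 2^k*2^k grid; the driver would evaluate this for each candidate k and keep the level with the smallest S:

#include <thrust/device_vector.h>
#include <thrust/inner_product.h>
#include <thrust/functional.h>

unsigned long long estimate_pairs(const thrust::device_vector<unsigned int>& B1,
                                  const thrust::device_vector<unsigned int>& B2)
{
  // inner_product multiplies element-wise and then reduces, i.e., exactly
  // S = sum_i |B1i| * |B2i|; the 0ULL initial value forces a 64-bit sum.
  return thrust::inner_product(B1.begin(), B1.end(), B2.begin(), 0ULL,
                               thrust::plus<unsigned long long>(),
                               thrust::multiplies<unsigned long long>());
}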

Parallel Selectivity Estimation on GPUs
Basic idea: compute the |B1i| and |B2i| values efficiently on GPUs so that S can be computed on GPUs by parallel reductions and prefix sums, which are known to be efficient on GPUs.
How: treat the numbers of MBB corners that fall within grid cells as pixel values, and then apply the Summed-Area-Table algorithm from image processing, which is known to be parallelizable.
Ni = Hll(x2,y2) - Hlr(x1-1,y2) - Hul(x2,y1-1) + Hur(x1-1,y1-1) (Jin et al., ICDE'00 [8]); setting x1=x2=c and y1=y2=r gives |B1i| at cell (r,c).
The value at any point (x, y) in a summed area table is the sum of all the pixels above and to the left of (x, y) (https://en.wikipedia.org/wiki/Summed_area_table). Assuming A, B, C and D are the values at the four corners of rectangle ABCD in the summed area table, the sum of all pixel values within rectangle ABCD is D-B-C+A. The summed area table has been applied to spatial selectivity estimation in the form of a cumulative histogram that handles MBBs instead of pixels. Ref. [8] provides a formula to calculate the number of MBBs that intersect a query window by using four summed area tables, one for each of the four MBB corners. Assuming the summed area tables of the four corner types have been computed, the number of MBBs that intersect a query window (x1, y1, x2, y2) is given by the equation above; applying it to the single cell at row r and column c amounts to setting x1=x2=c and y1=y2=r, as sketched below.
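The query against the four tables is a handful of lookups. Below is a hedged sketch (the function names are ours, not the paper's) that treats out-of-range indices as zero, which handles the x1-1 and y1-1 terms on the grid border:

__host__ __device__
inline unsigned int sat_at(const unsigned int* H, int dim, int x, int y)
{
  // summed-area-table lookup; indices of -1 contribute nothing
  return (x < 0 || y < 0) ? 0u : H[y * dim + x];
}

__host__ __device__
inline unsigned int mbb_count(const unsigned int* Hll, const unsigned int* Hlr,
                              const unsigned int* Hul, const unsigned int* Hur,
                              int dim, int x1, int y1, int x2, int y2)
{
  // Jin et al. [8]: number of MBBs intersecting the window (x1,y1)-(x2,y2)
  return sat_at(Hll, dim, x2, y2) - sat_at(Hlr, dim, x1 - 1, y2)
       - sat_at(Hul, dim, x2, y1 - 1) + sat_at(Hur, dim, x1 - 1, y1 - 1);
}

// per-cell count, e.g., |B1i| at row r, column c:
//   unsigned int b1i = mbb_count(Hll, Hlr, Hul, Hur, dim, c, r, c, r);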

Parallel Selectivity Estimation on GPUs
An example of generating Hll (Hlr, Hul and Hur are generated similarly): point aggregation, scan by row, transpose, scan by row, transpose.
We use a simple example to illustrate how Hll is computed. First, we extract the coordinates of the lower-left corner (x1, y1) of every MBB and determine the cell each corner falls within; after this "point aggregation" step, each cell records the number of corners that fall within it. We next perform a scan (prefix sum) along each row and then transpose the resulting matrix. The same scan+transpose combination is applied again to generate the final Hll. Note that in this example the result of the second transpose looks unchanged; that is because the example matrix is symmetric, with the 1s along the diagonal. A better example would place multiple points in some grid cells so that the matrix is not symmetric. (A sketch of the pipeline follows below.)
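The pipeline chains a histogram with two scan+transpose rounds. Below is a minimal sketch, mostly in Thrust (the paper replaces the scan and transpose primitives with native CUDA kernels, shown on the next slides); cell_ids is assumed to hold the precomputed row-major cell index of each MBB's lower-left corner:

#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/gather.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// step 1, point aggregation: histogram the corners over the n*n cells
__global__ void aggregate_corners(const int* cell_ids, int num, unsigned int* H)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < num) atomicAdd(&H[cell_ids[i]], 1u);
}

struct row_of : public thrust::unary_function<int,int> {        // linear index -> row index
  int n;
  __host__ __device__ row_of(int n_) : n(n_) {}
  __host__ __device__ int operator()(int i) const { return i / n; }
};
struct transpose_map : public thrust::unary_function<int,int> { // dst index -> src index
  int n;
  __host__ __device__ transpose_map(int n_) : n(n_) {}
  __host__ __device__ int operator()(int i) const { return (i % n) * n + i / n; }
};

// steps 2-5: scan by row, transpose, scan by row, transpose
void scans_and_transposes(thrust::device_vector<unsigned int>& H, int n)
{
  thrust::device_vector<unsigned int> T(H.size());
  thrust::counting_iterator<int> idx(0);
  auto keys = thrust::make_transform_iterator(idx, row_of(n));
  auto tmap = thrust::make_transform_iterator(idx, transpose_map(n));
  for (int pass = 0; pass < 2; ++pass) {
    // prefix-sum each row independently (the row index serves as the scan key)
    thrust::inclusive_scan_by_key(keys, keys + n * n, H.begin(), H.begin());
    // transpose so the next row scan runs down the original columns
    thrust::gather(tmap, tmap + n * n, H.begin(), T.begin());
    H.swap(T);
  }
  // H[r*n + c] now equals the sum of corner counts over cells (0..r, 0..c): Hll
}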

Parallel Selectivity Estimation on GPUs
Simple in design, highly parallelizable and easy to implement.
Functor: a function object (https://en.wikipedia.org/wiki/Function_object). In Thrust, functors work with parallel primitives (transform, in this case) and define the logic for what to do with each and every element of a vector (in parallel).
Our technique is simple in design, highly parallelizable and easy to implement. The algorithm has a main procedure (selectivity_estimation) and three modules, all of which are straightforward. The whole process can be viewed as a chain of well-understood and efficiently implemented parallel primitives. The lines underlined in red are direct applications of the Thrust API (e.g., scatter); the lines underlined in purple combine the Thrust transform primitive with user-defined functors that specify how to process each element of the input vectors to generate the output vector (i.e., a map in MapReduce terms); a functor example follows below. The lines underlined in green could also be implemented with Thrust APIs, but we implemented them natively in CUDA (examples on the next slides) for efficiency.
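As a hedged, self-contained illustration of the pattern (not the paper's actual functor), the following maps each MBB's lower-left corner to a grid-cell id, element by element, in parallel; the grid parameters are hypothetical:

#include <thrust/device_vector.h>
#include <thrust/transform.h>

struct corner_to_cell {
  float x0, y0, cell;  // grid origin and cell size (assumed)
  int   n;             // cells per row
  __host__ __device__ corner_to_cell(float x0_, float y0_, float cell_, int n_)
    : x0(x0_), y0(y0_), cell(cell_), n(n_) {}
  __host__ __device__ int operator()(float x, float y) const {
    int c = (int)((x - x0) / cell);   // column of the corner
    int r = (int)((y - y0) / cell);   // row of the corner
    return r * n + c;                 // row-major cell id
  }
};

// one cell id per MBB, computed element-wise by the transform primitive:
// thrust::transform(xs.begin(), xs.end(), ys.begin(), cells.begin(),
//                   corner_to_cell(x0, y0, cell, n));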

Parallel Selectivity Estimation on GPUs
Tile-based CUDA implementation of matrix transpose using shared memory (based on https://github.com/parallel-forall/code-samples/blob/master/series/cuda-cpp/transpose/transpose.cu):

#define TILE_DIM 32
#define BLOCK_ROWS 8

template <typename T>
__global__ void transposeKernel(T *odata, T *idata, int width, int height)
{
  // the +1 padding avoids shared-memory bank conflicts on the transposed reads
  __shared__ T tile[TILE_DIM][TILE_DIM+1];
  int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
  int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
  int index_in = xIndex + yIndex * width;
  xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
  yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
  int index_out = xIndex + yIndex * height;
  // each thread copies TILE_DIM/BLOCK_ROWS elements of the tile
  for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
    tile[threadIdx.y+i][threadIdx.x] = idata[index_in + i*width];
  __syncthreads();
  // write the tile back out transposed
  for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
    odata[index_out + i*height] = tile[threadIdx.x][threadIdx.y+i];
}

template <typename T>
void cudaTranspose(T *odata, T *idata, int size_x, int size_y)
{
  dim3 grid(size_x/TILE_DIM, size_y/TILE_DIM), threads(TILE_DIM, BLOCK_ROWS);
  transposeKernel<<<grid, threads>>>(odata, idata, size_x, size_y);
}

Thrust implementation of matrix transpose (using gather and a functor):

struct transpose_index : public thrust::unary_function<size_t,size_t> {
  size_t m, n;
  __host__ __device__
  transpose_index(size_t _m, size_t _n) : m(_m), n(_n) {}
  __host__ __device__
  size_t operator()(size_t linear_index) {
    size_t i = linear_index / n;
    size_t j = linear_index % n;
    return m * j + i;
  }
};

// indices is a counting iterator over the destination's linear indices
thrust::counting_iterator<size_t> indices(0);
thrust::gather(thrust::make_transform_iterator(indices, transpose_index(n, m)),
               thrust::make_transform_iterator(indices, transpose_index(n, m)) + dst.size(),
               src.begin(), dst.begin());

The CUDA native implementation is several times faster!
Now let us look at some details of the GPU implementations. The first panel shows the CUDA implementation of the 2D matrix transpose parallel primitive as a CUDA kernel. A large array is divided into tiles of TILE_DIM*TILE_DIM elements, and a thread block with TILE_DIM*BLOCK_ROWS threads processes each tile. Since TILE_DIM (32) is a multiple of BLOCK_ROWS (8), each thread performs TILE_DIM/BLOCK_ROWS loop iterations for both the read-in and the write-out. Shared memory (__shared__ T tile) is used to achieve (best-effort, if not perfect) coalesced memory accesses among threads. (Optional detail: in the two for loops, index_in and index_out, after expansion, have threadIdx.x as the unit-stride term, so both reading from global memory into shared memory and writing from shared memory back to global memory are coalesced, with the 2D shared-memory tile serving as a scratch pad that is faster and better at random accesses.) (Explaining the choice of BLOCK_ROWS is more complicated; please email us if the question comes up.) The second panel shows the Thrust implementation using the gather parallel primitive and the transpose_index functor. It is much easier to understand, but its performance is several times worse than the CUDA implementation. We initially implemented the Thrust version and then developed the CUDA version as an optimization; the experiments use the CUDA version for efficiency.
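A hedged usage sketch of the wrapper above (the sizes are illustrative, not the paper's); this simple version assumes both matrix dimensions are multiples of TILE_DIM:

unsigned int *d_in, *d_out;
cudaMalloc(&d_in,  4096 * 4096 * sizeof(unsigned int));
cudaMalloc(&d_out, 4096 * 4096 * sizeof(unsigned int));
// ... produce the row-scanned matrix in d_in ...
cudaTranspose<unsigned int>(d_out, d_in, 4096, 4096);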

Parallel Selectivity Estimation on GPUs
CUDA native implementation of scan (prefix sum) by row (allows arbitrary row length): break a long row into segments and scan each segment in shared memory.

template <typename T>
__global__ void horizontalScanKernel(size_t width, T *data)
{
  // typed view of the dynamically sized shared-memory pool
  extern __shared__ unsigned char pool[];
  T *s_Data = (T *)pool;
  size_t elements = width;
  int baseIdx = blockIdx.x * width;   // this block scans row blockIdx.x
  int shift = 0;
  int maxIdx;
  int num_threads = blockDim.x;
  __shared__ T prev;                  // row prefix carried across segments
  if (threadIdx.x == 0) prev = 0;
  while (elements > 0) {
    __syncthreads();
    maxIdx = (elements > num_threads) ? num_threads : elements;
    T tmp, val;
    // first load this segment from global memory to shared memory
    val = (threadIdx.x < maxIdx) ? data[baseIdx + (threadIdx.x + shift)] : 0;
    if (threadIdx.x == 0) val += prev;  // carry in the previous segment's total
    s_Data[threadIdx.x] = val;
    __syncthreads();
    // Kogge-Stone scan steps with doubling offsets
    if (maxIdx > 1) {
      if (threadIdx.x >= 1) { tmp = s_Data[threadIdx.x - 1]; val += tmp; }
      __syncthreads();
      s_Data[threadIdx.x] = val;
      __syncthreads();
    }
    if (maxIdx > 2) {
      if (threadIdx.x >= 2) { tmp = s_Data[threadIdx.x - 2]; val += tmp; }
      __syncthreads();
      s_Data[threadIdx.x] = val;
      __syncthreads();
    }
    … // further steps with offsets 4, 8, ..., up to blockDim.x/2 (elided on the slide)
    if (threadIdx.x < maxIdx) data[baseIdx + (threadIdx.x + shift)] = val;
    if (threadIdx.x == maxIdx - 1) prev = val;   // carry out to the next segment
    elements = (elements > num_threads) ? elements - num_threads : 0;  // avoid size_t underflow
    shift += num_threads;
  }
}

Similarly, instead of implementing scan-by-row with the inclusive_scan_by_key parallel primitive using the row index as the key (our first implementation), we developed a native CUDA implementation as an optimization. The CUDA version lets each thread block perform the scan of a single row. Since a row may be much longer than the number of threads in a thread block, we break the row into segments and scan each segment in parallel; the prefix sum of the previous segment is carried into the next segment as its initial value (prev). Within each segment, shared memory is used for efficiency.
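A hedged sketch of a matching launch configuration (names and sizes are assumptions, not the paper's exact settings): one thread block scans one row, and the dynamic shared memory holds one segment of blockDim.x elements:

const int threads = 256;                        // segment length per pass (power of two)
size_t shmem = threads * sizeof(unsigned int);  // one segment in shared memory
// grid.x = number of rows; each block walks its row segment by segment
horizontalScanKernel<unsigned int><<<num_rows, threads, shmem>>>(width, d_matrix);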

Experiment
Datasets: MBBs of the point quadrants of 170 million taxi trip pickup locations in NYC (2009), |Taxi| = 1,746,795; MBBs of NYC MapPLUTO tax lots, |Pluto| = 735,488.
Platform: 2013 Nvidia GTX Titan GPU with 2,688 cores and 6 GB memory.
Grid levels: 2^n * 2^n for n = 10, 11, 12, 13 (1024*1024 to 8192*8192).

Experiment
Saves 34.8% of memory using 321 ms of computation time! SUM(Ts) = 321 ms; (AvgN - minN)/AvgN = 34.8%.

Grid Level k   Grid Size    # of Estimated Pairs (N)   Runtime (Ts) (ms)
13             8192*8192    78,328,554                 205
12             4096*4096    40,414,590                 63
11             2048*2048    43,121,125                 31
10             1024*1024    86,103,593                 22

The gen_sat time dominates the overall runtime for large grid sizes but is insignificant for small grid sizes; using smaller grid sizes gives faster but possibly sub-optimal selectivity estimates. Assuming each grid resolution is equally likely to be picked for spatial filtering, the expected number of estimated pairs is AvgN = Σ Ni / 4. After applying the selectivity estimation algorithm, we can pick the grid resolution with the minimum number of pairs, minN = min(Ni); the benefit is therefore (AvgN - minN)/AvgN = 34.8%. The data transfer time from CPU to GPU (~5 ms for a volume under 30 MB) is negligible and is not counted.
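For concreteness, the 34.8% and 321 ms figures follow directly from the table:

AvgN = (78,328,554 + 40,414,590 + 43,121,125 + 86,103,593) / 4 ≈ 61,991,966
minN = 40,414,590 (the k = 12 grid)
(AvgN - minN) / AvgN ≈ 21,577,376 / 61,991,966 ≈ 0.348
SUM(Ts) = 205 + 63 + 31 + 22 = 321 ms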

Conclusion and Future Work
Summary: we presented a simple parallel selectivity estimation technique that reduces the memory footprint of spatial join processing on GPUs, where memory capacity is typically a limiting factor. Experiments joining two MBB sets, each with MBBs on the order of millions, show that our technique reduces the memory footprint by 34.8% in about 1/3 of a second, which is desirable and practically useful.
Future work: there are many performance optimization opportunities, including simultaneous multi-level point aggregation; using compressed point-aggregation results to reduce runtime; kernel fusion to reduce GPU global-memory access overheads across the 10+ parallel primitives; and a multi-level approach that extends the technique to large grids by using CPU memory as a buffer, dynamically bringing blocks into GPU memory for block-wise processing, and combining blocks to support large grids.

Thanks Q&A