1
Parallel Geospatial Data Management for Multi-Scale Environmental Data Analysis on GPUs
Speaker: Dr. Jianting Zhang, Dept. of Computer Science, The City College of New York & CUNY Graduate Center. Host: Dr. Dali Wang
2
Ecological Informatics
[Diagram: Ecological Informatics at the intersection of Computer Science (Spatial Databases, Data Mining), Geography (GIS, Remote Sensing), and Environmental Sciences applications.]
3
A conceptual framework of high-performance environmental data management
[Framework diagram: high-resolution satellite imagery, in-situ observation sensor data, and global/regional climate model outputs feed data assimilation and zonal statistics over ecological, environmental and administrative zones, producing temporal trends on a high-end computing facility (GPU thread blocks mapped to regions of interest).]
4
Why GPU Computing? Today's GPUs closely resemble the supercomputer architectures of the past and are essential building blocks of future exascale computing.
ASCI Red (1997): the first system to sustain 1 teraflops, built from Intel Pentium II Xeon processors in 72 cabinets. Cost? Space? Power?
A single GPU (Feb. 2013; the GTX Titan used in this work): 7.1 billion transistors (551 mm²), 2,688 processors, 288 GB/s maximum memory bandwidth, a PCI-E peripheral device drawing 250 W (17.98 SP GFLOPS/W), suggested retail price $999.
5
Zonal statistics (min/max/avg/median/…) can be expressed in SQL:

SELECT COUNT(*) FROM T_O, T_Z WHERE ST_WITHIN(T_O.the_geom, T_Z.the_geom) GROUP BY T_Z.z_id;

To derive a complete histogram, the serial code for a point or raster cell with value v in zone z is simply his[z][v]++.

How about zonal statistics with polygonal zones over raster cells?
Approach 1: treat each raster cell as a point and use a point-in-polygon test, such as:

int pnpoly(int npol, float *xp, float *yp, float x, float y)
{
    int i, j, c = 0;
    for (i = 0, j = npol - 1; i < npol; j = i++) {
        if ((((yp[i] <= y) && (y < yp[j])) || ((yp[j] <= y) && (y < yp[i]))) &&
            (x < (xp[j] - xp[i]) * (y - yp[i]) / (yp[j] - yp[i]) + xp[i]))
            c = !c;
    }
    return c;
}

Approach 2: rasterize the polygons and then use a raster-based technique.
A serial sketch of Approach 1 is given below.
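The serial kernel of Approach 1 can be summarized as follows; this is a minimal sketch, assuming non-overlapping zones and illustrative names (Cell, Polygon, zonal_histograms) that are not from the original slides:

typedef struct { float x, y; int v; } Cell;            /* cell center and integer value */
typedef struct { int npol; float *xp, *yp; } Polygon;  /* vertex lists, as used by pnpoly() */

/* For every raster cell, find the zone polygon that contains it and
 * update that zone's value histogram: his[z][v]++ */
void zonal_histograms(const Cell *cells, long ncells,
                      const Polygon *zones, int nzones,
                      int nbins, int **his)
{
    for (long i = 0; i < ncells; i++) {
        for (int z = 0; z < nzones; z++) {
            if (pnpoly(zones[z].npol, zones[z].xp, zones[z].yp,
                       cells[i].x, cells[i].y)) {
                int v = cells[i].v;
                if (v >= 0 && v < nbins) his[z][v]++;
                break;  /* assume zones do not overlap */
            }
        }
    }
}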
6
Data and Computing Challenges
NASA Shuttle Radar Topography Mission (SRTM) data: an 11-day mission in February 2000 that collected elevation data on a near-global scale, producing the most complete high-resolution digital topographic database of Earth; 1 arc-second resolution (~30 meters); total raw SRTM data volume 12.3 TB.
Continental US (tiles 01-06) SRTM: 20*10^9 (20 billion) raster cells (~40 GB raw, ~15 GB compressed TIFF).
Zones: 3,141 counties with 87,097 vertices.
Brute-force point-in-polygon test:
RT = (# of points) * (# of vertices) * (# of ops per point-in-polygon test) / (# of ops per second)
   = 20*10^9 * 87,097 * 20 / (10*10^9) ≈ 3.7*10^6 seconds ≈ 40 days
Even using all of Titan's 18,688 nodes: ~200 seconds. Flops utilization is typically low: it can be <1% for data-intensive applications (and is typically <10% in HPC).
7
Hybrid Spatial Databases + HPC Approach
Observation: only points/cells that are close enough to polygons need to be tested. Question: how do we pair neighboring points/cells with polygons?
Step 1: divide a raster tile into blocks and generate per-block histograms.
Step 2: derive polygon minimum bounding boxes (MBRs) and pair MBRs with raster blocks through a box-in-polygon test, classifying each block as (A) intersecting, (B) inside, or (C) outside the polygon.
Step 3: aggregate per-block histograms into per-polygon histograms for blocks that are inside polygons.
Step 4: for each intersecting polygon/block pair, perform a point(cell)-in-polygon test for every raster cell in the block and update the respective polygon histogram.
A sketch of how Steps 2-4 fit together is given below.
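A minimal sketch of the block-level filtering, assuming hypothetical helpers classify_block() (the box-in-polygon test of Step 2) and test_cells_in_block() (the per-cell test of Step 4) that are not from the original slides:

enum BlockRel { OUTSIDE = 0, INSIDE = 1, INTERSECT = 2 };

/* hypothetical helpers, declared here for completeness */
enum BlockRel classify_block(int block_id, int zone_id);             /* Step 2 */
void test_cells_in_block(int block_id, int zone_id, int *zone_his);  /* Step 4 */

void aggregate_blocks(int nblocks, int nzones, int nbins,
                      int **blk_his,   /* Step 1: per-block histograms   */
                      int **his)       /* output: per-polygon histograms */
{
    for (int b = 0; b < nblocks; b++) {
        for (int z = 0; z < nzones; z++) {
            enum BlockRel rel = classify_block(b, z);
            if (rel == INSIDE) {
                /* Step 3: add the whole per-block histogram at once */
                for (int v = 0; v < nbins; v++) his[z][v] += blk_his[b][v];
            } else if (rel == INTERSECT) {
                /* Step 4: fall back to per-cell point-in-polygon tests */
                test_cells_in_block(b, z, his[z]);
            }
            /* OUTSIDE blocks are skipped entirely */
        }
    }
}

In the actual approach, only the candidate block/polygon pairs produced by the MBR pairing of Step 2 would be visited, rather than all block/zone combinations as in this simplified loop.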
8
Hybrid Spatial Databases + HPC Approach
Per-block aggregation vs. per-cell modification. Advantage: raster cells in blocks that lie within polygons do not need to be tested individually.
9
GPU Implementations (1)
Identifying parallelisms and mapping them to hardware: the four computing steps (deriving per-block histograms, block-in-polygon testing, aggregating histograms for "within" blocks, and point-in-polygon testing for individual cells) are each parallelized on the GPU. A GPU thread block is mapped to a raster block, and histogram updates use atomicAdd. Polygon vertices and points are laid out for perfectly coalesced memory accesses, utilizing the GPU's floating-point power. The block-in-polygon test combines point-in-polygon tests for each of the cell's four corners with all-pair edge intersection tests between the polygon and the cell. A CUDA sketch of the per-cell test and histogram update is given below.
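As a hedged illustration of the thread-block-to-raster-block mapping and the atomicAdd-based histogram update of Step 4 (kernel and parameter names are illustrative and not the authors' code; the real implementation would launch one thread block per intersecting polygon/block pair):

__device__ int d_pnpoly(int npol, const float *xp, const float *yp, float x, float y)
{
    int i, j, c = 0;
    for (i = 0, j = npol - 1; i < npol; j = i++) {
        if ((((yp[i] <= y) && (y < yp[j])) || ((yp[j] <= y) && (y < yp[i]))) &&
            (x < (xp[j] - xp[i]) * (y - yp[i]) / (yp[j] - yp[i]) + xp[i]))
            c = !c;
    }
    return c;
}

/* One thread block processes the cells of one raster block against one polygon;
 * threads stride over the cells and update the polygon's histogram atomically. */
__global__ void cell_in_poly_hist(const short *vals,                 /* cell values       */
                                  const float *cx, const float *cy,  /* cell centers      */
                                  int ncells,
                                  const float *xp, const float *yp, int npol,
                                  int *poly_his)                     /* polygon histogram */
{
    for (int i = threadIdx.x; i < ncells; i += blockDim.x) {
        if (d_pnpoly(npol, xp, yp, cx[i], cy[i]))
            atomicAdd(&poly_his[vals[i]], 1);
    }
}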
10
GPU Implementations (2)
After incorporating the spatial database technique, the task becomes I/O bound: reading the raw data (40 GB) from disk to memory takes 400+ seconds (~100 MB/s), the CPU memory footprint is large even though the operations are done on GPUs, and data transfer between CPUs and GPUs is significant (limited by PCI-E bandwidth of 2-8 GB/s).
A possible solution: use compression libraries (e.g., zlib) to compress the raw data. Advantage: good compression ratio, saves disk I/O. Disadvantage 1: requires decompression on CPUs. Disadvantage 2: the CPU memory footprint and CPU-GPU data transfer overhead remain the same.
Our solution: reuse the Bitplane Quadtree (BPQ-Tree) for data compression/indexing. Idea: chop M-bit rasters into M binary bitmaps and then build a quadtree for each bitmap. BPQ-Tree achieves a competitive compression ratio but is much more parallelization-friendly on GPUs. Advantage 1: compressed data is streamed from disk to GPU without requiring decompression on CPUs, reducing CPU memory footprint and data transfer time. Advantage 2: it can be used in conjunction with CPU-based compression to further improve the compression ratio (verified but not presented). A sketch of the bitplane decomposition is given below.
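A minimal sketch of the bitplane decomposition behind BPQ-Tree, assuming a 16-bit raster; the quadtree construction over each bitmap is not shown, and the function name is illustrative:

#include <stdint.h>

/* Split an M-bit raster of n cells into M binary bitmaps (one per bit plane);
 * a quadtree is then built over each bitmap. */
void split_bitplanes(const uint16_t *raster, long n, int M,
                     uint8_t **bitplanes)   /* M arrays of n 0/1 flags */
{
    for (int b = 0; b < M; b++)
        for (long i = 0; i < n; i++)
            bitplanes[b][i] = (uint8_t)((raster[i] >> b) & 1u);
}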
11
GPU Implementations (3)
(Zhang et al 2011)
12
GPU Implementations (4)
BPQ-Tree decoding on GPGPUs (Zhang et al 2011): all the threads assigned to a computing block are bundled together to process a quadrant of the matrices in a BPQ-Tree pyramid during decoding. The collective process loops over all the quadrants and all levels of the pyramid, i.e., Process Collectively and Loop (PCL). The starting positions of the threads in the byte streams are calculated efficiently on the fly in GPGPU shared memory; only the starting position of the computing block needs to be pre-generated. A simplified sketch of this offset derivation is given below.
[Figure: step-by-step illustration of PCL decoding (Steps 0-3) omitted.]
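A simplified CUDA sketch of how per-thread starting positions could be derived collectively in shared memory, assuming per-segment byte lengths are known after parsing a pyramid level; this illustrates the idea only and is not the paper's decoder:

#define TPB 256   /* threads per computing block */

/* Each computing block turns its threads' segment lengths into starting offsets
 * via a shared-memory prefix sum, so only block_start[] must be precomputed. */
__global__ void pcl_offsets(const int *seg_len, const long *block_start, long *thread_start)
{
    __shared__ int s[TPB];
    int t = threadIdx.x;
    s[t] = seg_len[blockIdx.x * TPB + t];
    __syncthreads();

    /* naive inclusive scan (Hillis-Steele) in shared memory */
    for (int d = 1; d < TPB; d <<= 1) {
        int v = (t >= d) ? s[t - d] : 0;
        __syncthreads();
        s[t] += v;
        __syncthreads();
    }

    /* exclusive offset = sum of the preceding threads' lengths */
    long excl = (t == 0) ? 0 : (long)s[t - 1];
    thread_start[blockIdx.x * TPB + t] = block_start[blockIdx.x] + excl;
}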
13
Experiments and Evaluations (1)
(Geospatial Technologies and Environmental Cyberinfrastructure)
SGI Octane III: 2-node mini-cluster (each node with dual quad-core CPUs, 48 GB memory, 4 TB disk, and 2 C2050 GPUs).
Single Node Configuration 1: Dell T5400 workstation with dual quad-core Intel Xeon E5405 processors (2.00 GHz), 16 GB memory, PCI-E Gen2, 3*500 GB 7200 RPM disks with 32 MB cache ($5,000); Nvidia Quadro 6000 GPU with 448 Fermi cores (574 MHz), 6 GB memory, 144 GB/s bandwidth ($4,500).
Single Node Configuration 2: DIY workstation with an Intel Core i5 650 dual-core processor (hyperthreading enabled), 8 GB memory, PCI-E Gen3, 500 GB 7200 RPM disk with 32 MB cache (recycled) ($1,000); Nvidia GTX Titan GPU with Kepler cores (837 MHz), 6 GB memory, 288 GB/s bandwidth ($1,000).
14
Experiments and Evaluations (2)
GPU Cluster 1: OLCF Titan, 18,688 nodes, 299,008 cores, 20+ PFlops.
GPU Cluster 2: CUNY HPCC "Andy", 744 Nehalem CPU cores, 96 Fermi GPUs.
GPU Cluster 3: CUNY HPCC "Penzias", 1,172 Sandy Bridge CPU cores, 144 Kepler GPUs.
15
Experiments and Evaluations (3)
Results on data pre-processing and BPQ-Tree compression.
Fixed parameters: raster block size 0.1*0.1 degree = 360*360 cells (resolution is 1*1 arc-second); maximum histogram bins: 5,000; raster chunk size for encoding/decoding: 4096*4096 (a chunk is assigned to a block); thread block size: 256.

Data format                    Volume (GB)
Original (raw)                 38
TIFF compression               15
gzip compression [1]           8.3
BPQ-Tree compression [2]       7.3
BPQ-Tree + gzip compression    5.5

Tile #   Dimension       Partition schema
1        54000*43200     2*2
2        50400*43200     2*2
3        50400*43200     2*2
4        82800*36000     2*2
5        61200*46800     2*2
6        68400*111600    4*4
Total    20,165,760,000 cells   36 partitions

[1] gzip is asymmetrical: encoding (430 s) is ~10X slower than decoding (47 s). [2] BPQ-Tree is conceptually symmetrical; in practice, encoding takes 195 s and decoding 239 s, which is desirable for on-the-fly coding of intermediate model output data.
16
Experiments and Evaluations (4)
Wall-clock end-to-end runtimes:

Configuration                                     Cold cache [1]   Hot cache [2]
Single Node Config 1 (Dell T5400 + Quadro 6000)   180 s            78 s
Single Node Config 2 (DIY + GTX Titan)            101 s            46 s

[1] Disk read speed measured with "hdparm -t" (on the order of MB/s); [2] cached read speed measured with "hdparm -T" (on the order of GB/s).

Observations: the end-to-end runtimes are disk I/O bound, although BPQ-Tree has significantly reduced the I/O time (400+ s if the raw data were read directly); parallel I/O (e.g., ADIOS) on clusters is promising for I/O-intensive applications. The GTX Titan (based on Kepler) is significantly faster than the Quadro 6000 (based on Fermi): more cores and larger caches, the same graphics memory capacity (6 GB), and a much lower price tag, which is desirable.
17
Experiments and Evaluations (5)
Breakdowns of the major computing components (hot cache), in seconds:

                                                    Quadro 6000   GTX Titan   8-core CPU
(Step 0) Raster decompression                       16.2          8.30        131.9
Step 1: Per-block histogramming                     21.5          13.4        /
Step 2: Block-in-polygon test                       0.11          0.07
Step 3: "Within-block" histogram aggregation        0.14
Step 4: Cell-in-polygon test and histogram update   29.7          11.4
Total, major steps                                  67.7          33.3
Wall-clock end-to-end                               85            46

Discussion: parameters (e.g., block/chunk size and number of bins) may affect performance; additional experiments are left for future work. GPU hardware architecture may also affect performance significantly: 8.1X speedup for the Quadro 6000, 15.9X speedup for the GTX Titan, and 5.8X speedup for the C2050 (Zhang et al 2011), all relative to the 8-core CPU.
18
Experiments and Evaluations (6)
Wall-clock end-to-end runtimes on Titan. Tiles are assigned to MPI ranks round-robin, and the slowest rank's time is reported via MPI_Allreduce with MPI_MAX:

#include <mpi.h>
#include <stdio.h>
#include <assert.h>

#ifndef NUM_TILE
#define NUM_TILE 36   /* assumed: one work unit per raster partition; value not given on the slide */
#endif

int main(int argc, char **argv)
{
    int myrank, commsize;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &commsize);
    printf("%d %d\n", myrank, commsize);
    assert(commsize > 0);
    int round = (NUM_TILE - 1) / commsize + 1;   /* tiles assigned round-robin to ranks */
    double start = MPI_Wtime();
    for (int zz = 0; zz < round; zz++) {
        int ww = zz * commsize + myrank;
        if (ww >= NUM_TILE) break;
        /* process tile #ww: zonal statistics on one SRTM tile (GPU code not shown) */
    }
    double finish = MPI_Wtime();
    double tot_time = finish - start;
    double longest_tot_time;
    MPI_Allreduce(&tot_time, &longest_tot_time, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
    if (myrank == 0) printf("longest_tot_time=%10.3f\n", longest_tot_time);
    MPI_Finalize();
    return 0;
}

# of nodes   Runtime (s)
1            60.7
2            31.3
4            17.9
8            10.2
16           7.6
19
Take Home Messages http://www-cs.ccny.cuny.edu/~jzhang/
We aim to design novel data structures and algorithms that are efficient on new GPU parallel hardware to speed up the processing of large-scale geospatial data.
Experiments on zonal statistics, using 20+ billion cells of NASA SRTM 30-meter-resolution DEM data and US county boundaries, demonstrated an impressive end-to-end runtime of 180 s or better on a single commodity GPU device.
Additional experiments on the OLCF Titan GPU cluster using 1-16 nodes demonstrated good scalability of the proposed technique. The end-to-end runtime can be further reduced to 10 s or better using 8 or more nodes, making interactive exploration of large-scale geospatial data possible on high-end computing facilities.
It is possible to further improve performance by fine-tuning the relevant parameters; however, from an application perspective, it is more important to understand the level of performance that end users can expect before investing money and time. This study provides a concrete example.
Personal view: geospatial technologies have reached a tipping point due to hardware advances and changed cost models (parallel hardware, large memory capacity, and BigData-driven applications). There are great research, development, and application opportunities in building the next generation of high-performance GIS/spatial databases on different hardware platforms and in different types of applications (e.g., embedding a DBMS in CESM running on clusters).
20
Special Thanks to: Dali Wang (ORNL Host) (Ecosystem Simulation Science –ESD) Yaxing Wei (Environmental Science Data/Systems - ESD) Norbert Podhorszki (Scientific Data Group – CSM) Seyong Lee (Future Technology Group –CSM) Ben Mayer (Computational Earth Sciences –CSM) Matt Norman (Scientific Computing/NCCS) Anil Cheriyadat (GIST – CSE) Dilip Reddy Patlolla (GIST – CSE) Jiangye Yuan (GIST- CSE) Varun Chandola (GIST – CSE)