Published by Hilja Tuominen. Modified over 6 years ago.
1
SpatialHadoop: A MapReduce Framework for Spatial Data
Ahmed Eldawy and Mohamed F. Mokbel, ICDE 2015
Presented by: Tin K. Vu, Oct 20, 2016
2
Introduction
Related works
SpatialHadoop Architecture
Experiments
Conclusion & comments
Future work
3
Introduction
4
Big Data
5
Spatial Data
Satellites
Smartphones
Medical devices
6
Big Spatial Data
7
Big Spatial Data
8
Big Spatial Data?
Volume: data from hundreds of GB to TB.
Velocity: fast data.
Variety: data from different sources.
9
Big Spatial Data?
Distributed systems
High scalability
Support spatial queries.
10
Related works
11
Related works
Parallel-Secondo
MD-HBase
Hadoop-GIS
Hadoop is employed as a black box.
12
SpatialHadoop Architecture
13
SpatialHadoop
Full-fledged MapReduce framework.
Native support for spatial data.
Injects spatial data awareness into Hadoop.
14
Architecture
15
Storage layer
16
This slide is copied from a presentation of Prof. Ahmed in ICDE2015
17
Index structures: Grid, R-Tree, R+-Tree
The indexing process is performed with MapReduce.
18
Index Building: Grid
Applies to uniform data.
Number of partitions = input file size / block capacity
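The partition-count rule above can be sketched in a few lines. This is only an illustration: the function names, the rounding up, and the near-square grid layout are assumptions made for the example, not SpatialHadoop's actual code.

```python
import math

def grid_partitions(file_size_bytes, block_capacity_bytes):
    """Number of partitions = size / block capacity (rounded up: an assumption)."""
    return math.ceil(file_size_bytes / block_capacity_bytes)

def grid_cell(x, y, mbr, cols, rows):
    """Map a point to the (col, row) cell of a uniform grid laid over mbr.

    mbr is (min_x, min_y, max_x, max_y). Points on the upper border are
    clamped into the last cell.
    """
    min_x, min_y, max_x, max_y = mbr
    col = min(int((x - min_x) / (max_x - min_x) * cols), cols - 1)
    row = min(int((y - min_y) / (max_y - min_y) * rows), rows - 1)
    return col, row

# Hypothetical numbers: a 1 GiB input file and 64 MiB blocks.
n = grid_partitions(1 << 30, 64 << 20)   # 16 partitions
side = math.ceil(math.sqrt(n))           # assumed layout: a 4 x 4 grid
cell = grid_cell(5, 5, (0, 0, 10, 10), side, side)   # (2, 2)
```

A uniform grid keeps partitions balanced only when the data itself is roughly uniform, which is why the slide restricts it to that case.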
19
Index Building: R-Tree
Applies to skewed data.
Partition process:
Step 1: Sampling to find partition boundaries.
20
Index Building: R-Tree
Applies to skewed data.
Partition process:
Step 1: Sampling to find partition boundaries.
Step 2: Scan the input file and insert each record into its partition.
Step 3: Local indexing for each partition.
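The three steps can be illustrated with a small single-machine sketch. It is not the SpatialHadoop implementation: it handles points only, derives boundaries with a simple sort-and-tile pass over the sample (an assumed stand-in for bulk-loading an R-tree on the sample), and uses sorting as a placeholder for the local index. All function names are hypothetical.

```python
import bisect
import math
import random

def str_boundaries(sample, n_partitions):
    """Step 1: derive partition boundaries from a non-empty sample of points.

    Cuts the sample into vertical strips by x, then each strip into tiles by y.
    Returns the x cut positions and, per strip, the y cut positions.
    """
    strips = math.ceil(math.sqrt(n_partitions))
    tiles = math.ceil(n_partitions / strips)
    by_x = sorted(sample)
    per_strip = math.ceil(len(by_x) / strips)
    x_cuts, y_cuts = [], []
    for i in range(0, len(by_x), per_strip):
        strip = sorted(by_x[i:i + per_strip], key=lambda p: p[1])
        per_tile = math.ceil(len(strip) / tiles)
        if i > 0:
            x_cuts.append(by_x[i][0])
        y_cuts.append([strip[j][1] for j in range(per_tile, len(strip), per_tile)])
    return x_cuts, y_cuts

def partition_of(point, x_cuts, y_cuts):
    """Locate the (strip, tile) partition that a point falls into."""
    s = bisect.bisect_right(x_cuts, point[0])
    t = bisect.bisect_right(y_cuts[s], point[1])
    return s, t

def build_index(points, n_partitions, sample_size, seed=0):
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    x_cuts, y_cuts = str_boundaries(sample, n_partitions)
    partitions = {}
    for p in points:  # Step 2: scan the input, route each record to its partition
        partitions.setdefault(partition_of(p, x_cuts, y_cuts), []).append(p)
    for records in partitions.values():  # Step 3: placeholder for a local index
        records.sort()
    return x_cuts, y_cuts, partitions

# Skewed (Gaussian) input: dense regions get narrower partitions.
rng = random.Random(1)
pts = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]
x_cuts, y_cuts, parts = build_index(pts, n_partitions=9, sample_size=100)
```

Because the boundaries follow the sample's distribution, each partition receives a roughly similar number of records even when the data is skewed, which is the motivation the slide gives for this index.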
21
MapReduce layer
MapReduce in Hadoop
22
MapReduce layer
MapReduce in SpatialHadoop
23
MapReduce layer
Improvements compared to Hadoop?
SpatialFileSplitter: exploits the global index by pruning non-relevant partitions.
SpatialRecordReader: exploits local indexes to access records more efficiently.
24
Operations layer
Basic operations: Range query, kNN, Spatial join
Computational geometry operations: Polygon Union, Skyline, Convex Hull, Farthest/Closest pair
Other operations could be added.
25
Range query
26
Range query
SpatialFileSplitter prunes blocks outside the query range.
27
Range query
SpatialFileSplitter prunes blocks outside the query range.
SpatialRecordReader passes local indexes to the map function.
28
Range query
SpatialFileSplitter prunes blocks outside the query range.
SpatialRecordReader passes local indexes to the map function.
Map function selects records in range.
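The range query flow described on these slides can be mimicked in plain Python. This is a sketch under assumptions: the global index is modelled as a dictionary from a block id to that block's bounding rectangle, the local index is not modelled (the map step scans the block linearly), and every name is hypothetical rather than taken from the SpatialHadoop API.

```python
def intersects(a, b):
    """True if rectangles a and b, given as (min_x, min_y, max_x, max_y), overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def contains(rect, p):
    """True if point p lies inside rect (borders included)."""
    return rect[0] <= p[0] <= rect[2] and rect[1] <= p[1] <= rect[3]

def split_filter(global_index, query):
    """Stand-in for the file splitter: drop blocks outside the query range."""
    return [block_id for block_id, mbr in global_index.items()
            if intersects(mbr, query)]

def map_range(records, query):
    """Stand-in for the map function: select the records in range."""
    return [p for p in records if contains(query, p)]

def range_query(global_index, blocks, query):
    result = []
    for block_id in split_filter(global_index, query):
        result.extend(map_range(blocks[block_id], query))
    return result

# Two hypothetical blocks; the query rectangle touches only the first one.
global_index = {"b0": (0, 0, 5, 5), "b1": (5, 0, 10, 5)}
blocks = {"b0": [(1, 1), (4, 4)], "b1": [(6, 1), (9, 4)]}
hits = range_query(global_index, blocks, (3, 3, 4.5, 4.5))   # [(4, 4)]
```

The saving comes from the pruning step: block `b1` is never read, so the cost depends on the query size rather than on the full input size.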
29
kNN
30
kNN
SpatialFileSplitter selects the block that contains the query point.
31
kNN
SpatialFileSplitter selects the block that contains the query point.
Map function performs kNN in the selected block.
32
kNN
SpatialFileSplitter selects the block that contains the query point.
Map function performs kNN in the selected block.
Check result.
33
kNN
SpatialFileSplitter selects the block that contains the query point.
Map function performs kNN in the selected block.
Check result.
34
kNN
SpatialFileSplitter selects the block that contains the query point.
Map function performs kNN in the selected block.
Check result.
Revise result.
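One way to read the "check" and "revise" steps is sketched below. The interpretation is an assumption: the check compares the distance to the current k-th neighbour against the distance from the query point to every other block, and the revise step re-ranks candidates from the blocks that might hold a closer point. The data layout and function names are hypothetical, and the query point is assumed to lie inside some block.

```python
import heapq
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def min_dist(rect, q):
    """Smallest distance from point q to rectangle (min_x, min_y, max_x, max_y)."""
    dx = max(rect[0] - q[0], 0, q[0] - rect[2])
    dy = max(rect[1] - q[1], 0, q[1] - rect[3])
    return math.hypot(dx, dy)

def knn(global_index, blocks, q, k):
    # Select the block that contains the query point.
    home = next(b for b, r in global_index.items() if min_dist(r, q) == 0)
    # Initial answer: kNN inside that block only.
    answer = heapq.nsmallest(k, blocks[home], key=lambda p: dist(p, q))
    # Check: can another block hold a point closer than the current k-th neighbour?
    radius = dist(answer[-1], q) if len(answer) == k else math.inf
    others = [b for b, r in global_index.items()
              if b != home and min_dist(r, q) < radius]
    # Revise: re-rank with the records of the overlapping blocks.
    if others:
        candidates = answer + [p for b in others for p in blocks[b]]
        answer = heapq.nsmallest(k, candidates, key=lambda p: dist(p, q))
    return answer

global_index = {"b0": (0, 0, 5, 5), "b1": (5, 0, 10, 5)}
blocks = {"b0": [(1, 1), (4.5, 2)], "b1": [(5.5, 2), (9, 4)]}
near_border = knn(global_index, blocks, (4.8, 2), k=2)   # needs the revise step
interior = knn(global_index, blocks, (1, 1.5), k=1)      # answered from b0 alone
```

For the query near the block border the initial answer contains a far point from `b0`, so the check fails and `b1` is consulted; for the interior query the first block is enough.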
35
Language layer: Pigeon
Hides the complexity of the system with a high-level language.
Extends Pig Latin with OGC-compliant primitives:
Spatial data types (e.g., Polygon)
Basic operations (e.g., Area)
Spatial predicates (e.g., Touches)
Spatial analysis (e.g., Union)
Spatial aggregate functions (e.g., Convex Hull)
This slide is copied from a presentation of Prof. Ahmed in ICDE2015
36
Spatial Data types
Load a file with spatial attributes
Perform primitive operations
This slide is copied from a presentation of Prof. Ahmed in ICDE2015
37
Spatial operations
Range query
kNN
Spatial join
This slide is copied from a presentation of Prof. Ahmed in ICDE2015
38
Experiments
39
Experiments
Goal: Evaluate the scalability and efficiency of SpatialHadoop compared to traditional Hadoop.
Hardware: Amazon EC2 cluster of up to 100 nodes (default is 20 nodes).
Datasets:
TIGER files (US map) of up to 60 GB (70M polygons).
OpenStreetMap data of up to 70 GB (164M polygons).
Generated data of up to 128 GB (2 billion rectangles).
Satellite data of up to 4.6 TB (120 billion points).
40
Performance with query size (TIGER data)
41
Scalability with input size (Generated Data)
42
Spatial Join performance with TIGER files
43
Indexing time with satellite data
44
Conclusion & comments
45
Conclusion
Supports big spatial data with a distributed, large-scale system.
Overcomes limitations of previous systems.
Provides efficient tools to work with spatial data.
46
Comments
SpatialHadoop only supports static data.
The indexing process must be executed before other tasks.
Data types may be more complex: check-in data (complex point), road network with traffic jams (complex polygon).
Thus, the indexing process may need to depend on additional data rather than on primitive data types alone.
47
Future work
48
Future work
Dynamic indexing for spatial data.
Support other demands: location-based search, routing with cost...
49
Thank you!