SpatialHadoop: A MapReduce Framework for Spatial Data

Ahmed Eldawy and Mohamed F. Mokbel, ICDE 2015
Presented by: Tin K. Vu, Oct 20, 2016

Outline
Introduction
Related works
SpatialHadoop Architecture
Experiments
Conclusion & comments
Future work

Introduction

Big Data

Spatial data sources: satellites, smartphones, medical devices

Big Spatial Data
Volume: data from hundreds of GB to TB.
Velocity: fast data.
Variety: data from different sources.

What does big spatial data call for?
Distributed systems
High scalability
Support for spatial queries

Related works

Parallel-Secondo, MD-HBase, Hadoop-GIS

Related works
Parallel-Secondo: Hadoop is employed as a black box.

SpatialHadoop Architecture

SpatialHadoop
Full-fledged MapReduce framework.
Native support for spatial data.
Injects spatial data awareness into Hadoop.

Architecture

Storage layer

This slide is copied from a presentation of Prof. Ahmed in ICDE2015

Index structures: Grid, R-Tree, R+-Tree
The indexing process runs as MapReduce jobs.

Index Building: Grid
Suitable for uniformly distributed data.
Number of partitions = file size / block capacity
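The partition formula above can be sketched in a few lines of Python. This is only an illustration: the function names, the optional `overhead` parameter (default 0, so the result matches the slide's formula), and the choice of a square grid with ceil(sqrt(n)) cells per side are assumptions made for this example, not something stated on the slide.

```python
import math

def grid_partition_count(file_size, block_capacity, overhead=0.0):
    """Number of partitions = file size / block capacity, rounded up.

    `overhead` is a hypothetical extra-space ratio; with the default 0.0
    this is exactly the formula on the slide.
    """
    return max(1, math.ceil(file_size * (1 + overhead) / block_capacity))

def grid_cell(x, y, mbr, cells_per_side):
    """Map a point to its (column, row) cell in a uniform grid over `mbr`.

    `mbr` is (x_min, y_min, x_max, y_max). Points on the upper boundary
    are clamped into the last cell.
    """
    x_min, y_min, x_max, y_max = mbr
    col = min(int((x - x_min) / (x_max - x_min) * cells_per_side), cells_per_side - 1)
    row = min(int((y - y_min) / (y_max - y_min) * cells_per_side), cells_per_side - 1)
    return col, row

# Assumed layout: a square grid with enough cells to hold all partitions.
n = grid_partition_count(file_size=1024, block_capacity=64)
cells_per_side = math.ceil(math.sqrt(n))
```

Because every cell covers the same area, this scheme balances partitions only when the data is close to uniform, which is why the slide restricts the grid index to uniform data.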

Index Building: R-Tree
Suitable for skewed data.
Partitioning process:
Step 1: Sample the input to find partition boundaries.
Step 2: Scan the input file and insert each record into its partition.
Step 3: Build a local index for each partition.
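The three steps can be sketched as follows. This is a deliberately simplified, single-machine illustration: it cuts the space into vertical strips at sample quantiles instead of deriving boundaries from an R-tree built on the sample, and it uses a sorted list as a stand-in for the local index. All function names are hypothetical.

```python
import bisect
import random

def sample_boundaries(points, num_partitions, sample_size, seed=0):
    """Step 1: sample the input and pick x cut points at sample quantiles."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    xs = sorted(p[0] for p in sample)
    return [xs[len(xs) * i // num_partitions] for i in range(1, num_partitions)]

def assign_to_partitions(points, boundaries):
    """Step 2: scan the input and insert each record into its partition."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for p in points:
        partitions[bisect.bisect_right(boundaries, p[0])].append(p)
    return partitions

def build_local_index(partition):
    """Step 3: build a local index (here simply the records sorted by x, y)."""
    return sorted(partition)

points = [(float(i), 0.0) for i in range(100)]
boundaries = sample_boundaries(points, num_partitions=4, sample_size=100)
partitions = assign_to_partitions(points, boundaries)
local_indexes = [build_local_index(part) for part in partitions]
```

Because the boundaries follow the sample rather than a fixed grid, dense regions are cut into more, smaller partitions, which is what makes this approach suitable for skewed data.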

MapReduce layer
MapReduce in Hadoop

MapReduce layer
MapReduce in SpatialHadoop

MapReduce layer
Improvements over Hadoop:
SpatialFileSplitter: exploits the global index to prune partitions that are not relevant to the query.
SpatialRecordReader: exploits local indexes to access records more efficiently.

Operations layer
Basic operations: range query, kNN, spatial join.
Computational geometry operations: polygon union, skyline, convex hull, farthest/closest pair.
Other operations could be added.

Range query
SpatialFileSplitter prunes blocks outside the query range.
SpatialRecordReader passes local indexes to the map function.
The map function selects the records in range.
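The range query flow above can be sketched in plain Python. This is not SpatialHadoop's API: the global index is modeled as a list of dictionaries holding a block's bounding rectangle (`mbr`) and its points, the local index is not modeled (each surviving block is scanned linearly), and all names are hypothetical.

```python
def intersects(a, b):
    """True if rectangles a and b, given as (x1, y1, x2, y2), overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def contains(rect, point):
    """True if the point (x, y) lies inside rect, borders included."""
    return rect[0] <= point[0] <= rect[2] and rect[1] <= point[1] <= rect[3]

def spatial_file_splitter(global_index, query):
    """Prune blocks whose bounding rectangle lies outside the query range."""
    return [block for block in global_index if intersects(block["mbr"], query)]

def map_range_query(block, query):
    """Map function: select the records of one block that fall in range."""
    return [p for p in block["records"] if contains(query, p)]

def range_query(global_index, query):
    """Run the map function only on the blocks that survive pruning."""
    results = []
    for block in spatial_file_splitter(global_index, query):
        results.extend(map_range_query(block, query))
    return results
```

The saving comes from the first step: blocks that do not overlap the query rectangle are never read, so the work scales with the query size rather than with the file size.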

kNN
SpatialFileSplitter selects the block that contains the query point.
The map function performs kNN in the selected block.
Check result: draw a circle centered at the query point with radius equal to the distance to the k-th neighbor; if the circle stays inside the selected block, the answer is final.
Revise result: otherwise, also search the blocks that the circle overlaps.
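A minimal single-machine sketch of the kNN steps follows. It assumes the check uses a circle around the query point whose radius is the distance to the k-th candidate, and it models blocks as (mbr, points) pairs; the function names and data layout are hypothetical and do not come from SpatialHadoop's code.

```python
import heapq
import math

def contains(mbr, q):
    """True if point q lies inside mbr = (x1, y1, x2, y2), borders included."""
    return mbr[0] <= q[0] <= mbr[2] and mbr[1] <= q[1] <= mbr[3]

def distance_to_mbr(mbr, q):
    """Minimum distance from point q to the rectangle mbr (0 if inside)."""
    dx = max(mbr[0] - q[0], 0.0, q[0] - mbr[2])
    dy = max(mbr[1] - q[1], 0.0, q[1] - mbr[3])
    return math.hypot(dx, dy)

def circle_inside(mbr, q, radius):
    """True if the circle around q with the given radius stays inside mbr."""
    return (q[0] - radius >= mbr[0] and q[0] + radius <= mbr[2]
            and q[1] - radius >= mbr[1] and q[1] + radius <= mbr[3])

def knn(blocks, q, k):
    def dist(p):
        return math.dist(p, q)

    # Select the block that contains the query point and answer locally.
    home_mbr, home_points = next(b for b in blocks if contains(b[0], q))
    candidates = heapq.nsmallest(k, home_points, key=dist)

    # Check: is the circle through the k-th candidate fully inside the block?
    radius = dist(candidates[-1]) if len(candidates) == k else math.inf
    if circle_inside(home_mbr, q, radius):
        return candidates

    # Revise: search every block that the circle overlaps.
    pool = [p for mbr, pts in blocks
            if distance_to_mbr(mbr, q) <= radius for p in pts]
    return heapq.nsmallest(k, pool, key=dist)
```

The check matters because a closer point may sit just across the block border; the revise step is needed only when the circle crosses that border.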

Language layer: Pigeon
Hides the complexity of the system behind a high-level language.
Extends Pig Latin with OGC-compliant primitives:
Spatial data types (e.g., Polygon)
Basic operations (e.g., Area)
Spatial predicates (e.g., Touches)
Spatial analysis (e.g., Union)
Spatial aggregate functions (e.g., Convex Hull)
This slide is copied from a presentation of Prof. Ahmed in ICDE2015

Spatial data types
Load a file with spatial attributes.
Perform primitive operations.
This slide is copied from a presentation of Prof. Ahmed in ICDE2015

Spatial operations: range query, kNN, spatial join
This slide is copied from a presentation of Prof. Ahmed in ICDE2015

Experiments

Goal: Evaluate the scalability and efficiency of SpatialHadoop compared to traditional Hadoop.
Hardware: Amazon EC2 cluster of up to 100 nodes (default is 20 nodes).
Datasets:
TIGER files (US map) of up to 60 GB (70M polygons).
OpenStreetMap data of up to 70 GB (164M polygons).
Generated data of up to 128 GB (2 billion rectangles).
Satellite data of up to 4.6 TB (120 billion points).

Performance with query size (TIGER data)

Scalability with input size (Generated Data)

Spatial Join performance with TIGER files

Indexing time with satellite data

Conclusion & comments

Conclusion
Supports big spatial data with a distributed, large-scale system.
Overcomes limitations of previous systems.
Provides efficient tools for working with spatial data.

Comments
SpatialHadoop supports only static data.
The indexing process must be executed before other tasks.
Data types may be more complex: check-in data (complex point), road network with traffic jams (complex polygon).
Thus, the indexing process may depend on additional data rather than on primitive data types alone.

Future work

Dynamic indexing for spatial data.
Support for other demands: location-based search, routing with cost...

Thank you!
