Presented by: Omar Alqahtani Fall 2016

LocationSpark: A Distributed In-Memory Data Management System for Big Spatial Data Presented by: Omar Alqahtani Fall 2016

Paper Information Authors: Publication: VLDB 2016 Type: Demo Paper

Outline What is LocationSpark? Related Work
Overview and features of LocationSpark: data model and data types, spatial queries, query scheduler, query executor, spatial indexing, memory management. Evaluation

What is LocationSpark? LocationSpark is a spatial data processing system built on top of Apache Spark. It provides spatial query APIs on top of the standard dataflow operators. Why not use Spark directly? Spark on its own lacks spatial indexing, cannot handle spatial data skew, lacks spatial query optimization, and incurs unnecessary network communication due to spatial data overlap.

Cont. To speed up spatial processing on Spark, the authors introduce a range of optimizations: global and immutable local spatial indexes over in-memory data (e.g., grid, R-tree, quadtree, and IR-tree) to support efficient spatial queries; an automatic skew analyzer and handler; and a Bloom filter to reduce communication cost.

Related Work Two categories:
Systems using Hadoop, such as [4, 13], Hadoop-GIS, SpatialHadoop, and MD-HBase. However, Hadoop MapReduce has to write intermediate data to HDFS. Systems using Apache Spark, such as GeoSpark, SpatialSpark, Magellan, and GeoTrellis. These mostly suffer from query skew and from excessive, unoptimized network and I/O communication.

Data Model and Data Types
LocationSpark stores spatial data as key-value pairs. The key can be a two-dimensional point, a line segment, a polyline, a rectangle, or a polygon. The value type is specified by the user, for example text. Spatial queries: it supports spatial range, spatial kNN, spatial join, and kNN join queries. It also provides analysis functions, including spatial data clustering, spatial skyline computation, and spatio-textual topic summarization.
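The key-value model can be illustrated with a minimal sketch in plain Python. This is not LocationSpark's API: the names `Point`, `Rectangle`, and `range_query` are hypothetical, and the range query is a linear scan with no index, shown only to make the data model concrete.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass(frozen=True)
class Point:
    """Hypothetical spatial key: a two-dimensional point."""
    x: float
    y: float


@dataclass(frozen=True)
class Rectangle:
    """Hypothetical axis-aligned query range."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, p: Point) -> bool:
        return self.x_min <= p.x <= self.x_max and self.y_min <= p.y <= self.y_max


def range_query(records: List[Tuple[Point, Any]], region: Rectangle) -> List[Tuple[Point, Any]]:
    """Return every (key, value) pair whose key lies inside the region (linear scan)."""
    return [(key, value) for key, value in records if region.contains(key)]


# Spatial key, user-defined value (here: text).
records = [
    (Point(1.0, 1.0), "coffee shop"),
    (Point(5.0, 5.0), "library"),
    (Point(9.0, 2.0), "train station"),
]
hits = range_query(records, Rectangle(0.0, 0.0, 6.0, 6.0))
```

Here `hits` keeps the two records whose keys fall inside the query rectangle; the value travels with its spatial key unchanged.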

Query Scheduler Skew is a major issue in spatial data.
Two types of skew are addressed. Unbalanced data partitioning is solved by spatial indexes. Query skew is solved by the query scheduler. How? It dynamically collects statistics from each partition (the number of queries). A cost model is used to evaluate the overhead of repartitioning the hotspot partitions. The scheduler can then choose a set of partitions to be reallocated to workers at an affordable cost.
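The scheduling idea can be sketched as a toy cost model. Everything below is an assumption made for illustration, not the cost model from the paper: the cost of a partition is taken as points times queries, splitting is assumed to halve both, and the shuffle cost is assumed proportional to the number of points moved. `PartitionStats`, `split_benefit`, and `choose_partitions_to_split` are hypothetical names.

```python
from typing import Dict, List, NamedTuple


class PartitionStats(NamedTuple):
    num_points: int   # data objects stored in the partition
    num_queries: int  # queries routed to the partition so far


def execution_cost(s: PartitionStats) -> float:
    # Assumption: every query touches every point in the partition.
    return float(s.num_points * s.num_queries)


def split_benefit(s: PartitionStats, shuffle_cost_per_point: float = 1.0) -> float:
    # Assumption: a split yields two halves that run in parallel, each with
    # half of the points and half of the queries.
    half = PartitionStats(s.num_points // 2, s.num_queries // 2)
    saved = execution_cost(s) - execution_cost(half)
    return saved - shuffle_cost_per_point * s.num_points


def choose_partitions_to_split(stats: Dict[int, PartitionStats], budget_points: int) -> List[int]:
    """Greedily pick the most beneficial hotspots whose data fits the shuffle budget."""
    ranked = sorted(((split_benefit(s), pid) for pid, s in stats.items()), reverse=True)
    chosen: List[int] = []
    remaining = budget_points
    for benefit, pid in ranked:
        if benefit <= 0:
            break  # repartitioning would cost more than it saves
        if stats[pid].num_points <= remaining:
            chosen.append(pid)
            remaining -= stats[pid].num_points
    return chosen


stats = {
    0: PartitionStats(num_points=1000, num_queries=500),  # hotspot
    1: PartitionStats(num_points=1000, num_queries=2),    # rarely queried
    2: PartitionStats(num_points=800, num_queries=300),   # hotspot
}
to_split = choose_partitions_to_split(stats, budget_points=2000)
```

With a budget of 2000 points, the two hotspots are selected and the rarely queried partition is left alone because its data no longer fits the remaining budget.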

Overview and features of LocationSpark: data model and data types, spatial queries, query scheduler, query executor (this part was not clear to the presenter), spatial indexing, memory management. Evaluation

Spatial Indexing Global Index Local Index
It first samples the data to learn the distribution, then builds the global index. A grid or a region quadtree is used for the global index. The type of the local index can be specified by the user: a grid, an R-tree, a variant of the quadtree, or an IR-tree.
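The two-level scheme can be sketched as follows. This is a simplified illustration under stated assumptions: the global index is a uniform grid fitted to the bounding box of a random sample, and the "local index" is just a list sorted by x as a stand-in for an R-tree or quadtree. The function names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

PointT = Tuple[float, float]


def build_global_grid(sample: List[PointT], cells_per_side: int) -> dict:
    """Fit a uniform grid to the bounding box of the sampled points."""
    xs = [p[0] for p in sample]
    ys = [p[1] for p in sample]
    return {"x_min": min(xs), "x_max": max(xs),
            "y_min": min(ys), "y_max": max(ys), "n": cells_per_side}


def partition_of(grid: dict, p: PointT) -> Tuple[int, int]:
    """Map a point to a grid cell; points outside the sampled bounds are clamped."""
    def axis(v: float, lo: float, hi: float) -> int:
        if hi == lo:
            return 0
        i = int((v - lo) / (hi - lo) * grid["n"])
        return min(max(i, 0), grid["n"] - 1)
    return (axis(p[0], grid["x_min"], grid["x_max"]),
            axis(p[1], grid["y_min"], grid["y_max"]))


def build_local_index(points: List[PointT]) -> List[PointT]:
    # Stand-in for a real local index: points sorted by x-coordinate.
    return sorted(points)


random.seed(0)
points = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(1000)]
sample = random.sample(points, 100)          # step 1: learn the distribution
grid = build_global_grid(sample, 4)          # step 2: build the global index

partitions: Dict[Tuple[int, int], List[PointT]] = defaultdict(list)
for p in points:                             # step 3: route data to partitions
    partitions[partition_of(grid, p)].append(p)
local_indexes = {cell: build_local_index(pts) for cell, pts in partitions.items()}
```

A uniform grid does not adapt to skewed data; a region quadtree built from the same sample would subdivide dense areas further, which is presumably why both options are offered for the global index.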

Cont. To support data updates, each version of a spatial index can be persisted to disk for fault tolerance. The spatial indexes are therefore immutable and are implemented using the path-copying approach.
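Path copying can be shown on a small immutable binary search tree. This is a generic sketch of the technique, not LocationSpark's index code; one-dimensional integer keys are used for brevity, and `Node`, `insert`, and `keys` are illustrative names.

```python
from typing import List, NamedTuple, Optional


class Node(NamedTuple):
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], key: int) -> Node:
    """Return a new version of the tree; only nodes on the search path are copied."""
    if root is None:
        return Node(key)
    if key < root.key:
        return root._replace(left=insert(root.left, key))
    if key > root.key:
        return root._replace(right=insert(root.right, key))
    return root  # key already present: reuse the existing version


def keys(root: Optional[Node]) -> List[int]:
    """In-order traversal."""
    if root is None:
        return []
    return keys(root.left) + [root.key] + keys(root.right)


v1: Optional[Node] = None
for k in (50, 30, 70):
    v1 = insert(v1, k)

v2 = insert(v1, 20)  # new version; v1 is untouched
```

After the update, `v1` still describes the old contents, while `v2` shares the untouched right subtree with `v1` and copies only the nodes on the path to the new key. Because old versions are never modified, any of them can be written to disk and recovered later.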

Spatial Bloom Filter A Bloom filter, in general, is a probabilistic data structure used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not. The authors embed a spatial Bloom filter into the global spatial index; it can answer whether a spatial point is contained inside a spatial range. How?
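The slides do not say how the spatial variant is built. One plausible approach, shown here purely as an assumption, is to discretize space into grid cells and insert the cell identifiers of the stored points into an ordinary Bloom filter. `BloomFilter`, `cell_of`, and `partition_may_hold` are hypothetical names, and the filter sizes are arbitrary.

```python
import hashlib
import math
from typing import Iterator


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several hash positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))


def cell_of(x: float, y: float, cell_size: float = 1.0) -> str:
    """Assumed discretization: identify a point by the grid cell it falls in."""
    return f"{math.floor(x / cell_size)}:{math.floor(y / cell_size)}"


partition_points = [(1.2, 3.4), (1.9, 3.1), (7.5, 0.2)]
cell_filter = BloomFilter()
for x, y in partition_points:
    cell_filter.add(cell_of(x, y))


def partition_may_hold(x: float, y: float) -> bool:
    """If False, the partition holds no data in that cell and need not be contacted."""
    return cell_filter.might_contain(cell_of(x, y))
```

A negative answer lets a query skip the partition without any network round trip, which matches the stated goal of reducing communication cost; a positive answer still has to be verified against the actual data.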

Memory Management It dynamically caches frequently accessed data in memory and stores less frequently used data on disk. How? Access frequencies and the corresponding timestamps are recorded in the spatial index, and the access frequencies are then aggregated.
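A minimal sketch of frequency-based placement, assuming accesses are counted over a recent time window and the most frequently accessed partitions are kept in memory. The windowing rule, the tie-breaking, and the names `AccessTracker` and `plan` are assumptions for illustration, not details taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class AccessTracker:
    def __init__(self) -> None:
        # partition id -> timestamps of recorded accesses
        self.log: Dict[str, List[float]] = defaultdict(list)

    def record(self, partition_id: str, timestamp: float) -> None:
        self.log[partition_id].append(timestamp)

    def frequency(self, partition_id: str, now: float, window: float) -> int:
        """Aggregate: number of accesses within the last `window` time units."""
        return sum(1 for t in self.log[partition_id] if now - t <= window)

    def plan(self, now: float, window: float, memory_slots: int) -> Tuple[Set[str], Set[str]]:
        """Split partitions into (keep in memory, spill to disk)."""
        ranked = sorted(self.log, key=lambda pid: (-self.frequency(pid, now, window), pid))
        return set(ranked[:memory_slots]), set(ranked[memory_slots:])


tracker = AccessTracker()
for pid, ts in [("A", 1), ("A", 8), ("A", 9), ("B", 2), ("C", 7), ("C", 9)]:
    tracker.record(pid, ts)

in_memory, on_disk = tracker.plan(now=10, window=5, memory_slots=2)
```

In this run partition B was last accessed outside the window, so it is the one placed on disk while A and C stay cached.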

Evaluation Two real spatial datasets are used:
Twitter dataset: gathered from January 2013 to July 2014; its size is 250 GB. OpenStreetMap: each spatial object has coordinates (longitude, latitude) and an object ID; the dataset contains 1.7 billion points and takes 62.3 GB of disk space. Experiments were run on a cluster of 6 Dell compute nodes, each with two 8-core Intel E5-2650v2 CPUs, 32 GB of memory, and 48 TB of local storage. The cluster runs Spark on YARN.

Cont. Experiments compare the performance of LocationSpark with GeoSpark and SpatialSpark.

Questions
