Selectivity Estimation of Big Spatial Data

Zacharias (Harry) Chasparis CS 267: New Trends in Database Systems

Wait… What???
  Initial project: Visualization of Big Spatial Data on Spark
  New project:
    Executable in 3 weeks
    Limited machine needs

Outline
  Introduction
  Selectivity Estimation Problem
  Existing Solutions
  Current Project
  Implementation
  Experimental evaluation
  Future Steps

Big Spatial Data
  Information about the locations and shapes of geographic features and the relationships between them
  Huge amounts of data produced by many devices
  Major need for storing, processing, and visualizing this data

Selectivity Estimation problem
  Range query: returns the number of elements in the dataset that lie in the given range
  Large amount of data
    Needs storage capacity
    Needs huge processing power
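For reference, the quantity being estimated can be written as a brute-force count. This is only a minimal sketch of the exact answer, assuming 2D points and an axis-aligned query rectangle; the names `exact_count`, `Point`, and `Rect` are illustrative and do not come from the slides.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout


def exact_count(points: Iterable[Point], rect: Rect) -> int:
    """Count the points that lie inside the rectangle (borders included)."""
    x_min, y_min, x_max, y_max = rect
    return sum(1 for x, y in points if x_min <= x <= x_max and y_min <= y <= y_max)


points = [(1.0, 1.0), (2.5, 3.0), (4.0, 4.0), (9.0, 0.5)]
print(exact_count(points, (0.0, 0.0, 5.0, 5.0)))
```

A scan like this touches every record, which is what becomes impractical at the data sizes discussed here and motivates estimating the count instead.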


Existing solutions
  Very old proposals that may not apply to the current size and/or format of data
  Most solutions do not take the available memory into account
  Implemented exclusively for big clusters
  Focus mainly on either quality or performance
  Most important: no study makes clear
    Which selectivity estimation method works better
    What the limitations of these methods are

Project
  Main idea: implement and, above all, compare 2 different methods for solving the selectivity estimation problem
  Methods
    Sampling
    Binning
  Run the whole process in memory
  2 main aspects to consider
    Memory budget
    Selectivity ratio

1st part – Implementation

Sampling
  Way to sample:
    Assign a random number to each record
    Keep the N records with the greatest numbers
  Sample size depends on the memory budget
    Each record: 3 * 8 bytes = 24 bytes
    For memory budget = 10 MB: 10 MB / 24 bytes ≈ 416,667 sample records
  Pros: reads the whole dataset from disk only once
  Pros: actual records are kept in memory
  Cons: gives an estimate of the result
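The sampling steps above can be sketched in a single pass with a bounded min-heap. This is a minimal illustration, not the project's actual code: the function names, the use of `heapq`, and the scale-up rule in `estimate_count` (sample hit fraction times total record count) are assumptions made for the example.

```python
import heapq
import random
from typing import Iterable, List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout


def sample_top_n(points: Iterable[Point], n: int, rng: random.Random) -> Tuple[List[Point], int]:
    """One pass: tag each record with a random number, keep the n greatest tags."""
    heap: List[Tuple[float, Point]] = []  # min-heap; heap[0] holds the smallest kept tag
    total = 0
    for p in points:
        total += 1
        tag = rng.random()
        if len(heap) < n:
            heapq.heappush(heap, (tag, p))
        elif tag > heap[0][0]:
            heapq.heapreplace(heap, (tag, p))  # evict the smallest tag
    return [p for _, p in heap], total


def estimate_count(sample: List[Point], total: int, rect: Rect) -> float:
    """Scale the fraction of sample points inside rect up to the full dataset (assumed rule)."""
    if not sample:
        return 0.0
    x_min, y_min, x_max, y_max = rect
    hits = sum(1 for x, y in sample if x_min <= x <= x_max and y_min <= y <= y_max)
    return hits / len(sample) * total


rng = random.Random(0)
data = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(100_000)]
sample, total = sample_top_n(data, 5_000, rng)
print(len(sample), total, estimate_count(sample, total, (0.0, 0.0, 5.0, 5.0)))
```

Because every record gets an independent random tag, keeping the N largest tags yields a uniform random sample of size N without knowing the dataset size in advance, which fits the single read from disk listed under the pros.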

Binning
  Way for binning: inspired by AQWA [1]
  [Figure: animation across several slides of a grid over the x and y axes; cell values are filled in step by step (1, 3, 8, 12, ...)]

[1] Ahmed M. Aly et al: AQWA: Adaptive Query-Workload-Aware Partitioning of Big Spatial Data. PVLDB 8(13): (2015)

Binning
  1st step: Split the grid
    Splitting depends on memory
    Each bin record: 8 bytes
    For memory budget = 10 MB: 10 MB / 8 bytes = 1,250,000 bins
  2nd step: Compute the number of points in each grid cell
  Pros: gives a very good estimate of the result
  Cons: needs to read the dataset from disk twice
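The two binning steps can be sketched as follows. Several things here are assumptions rather than details from the slides: a square grid with `isqrt(bins)` cells per side, a first pass that finds the bounding box (one plausible reason the dataset is read twice), and an estimation rule that weights each cell's count by the fraction of the cell the query overlaps. All function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout
Origin = Tuple[float, float, float, float]  # (x0, y0, cell_width, cell_height)


def cells_per_side(memory_budget_bytes: int, bin_bytes: int = 8) -> int:
    """Largest square grid whose bins fit in the budget (square grid is an assumption)."""
    return math.isqrt(memory_budget_bytes // bin_bytes)


def build_grid(points: Sequence[Point], k: int) -> Tuple[List[List[int]], Origin]:
    # Pass 1: bounding box of the data.
    x0 = min(p[0] for p in points)
    x1 = max(p[0] for p in points)
    y0 = min(p[1] for p in points)
    y1 = max(p[1] for p in points)
    w = (x1 - x0) / k or 1.0  # fall back to 1.0 when all x values are equal
    h = (y1 - y0) / k or 1.0
    # Pass 2: count the points that fall in each cell.
    counts = [[0] * k for _ in range(k)]
    for x, y in points:
        i = min(int((x - x0) / w), k - 1)  # clamp points on the upper border
        j = min(int((y - y0) / h), k - 1)
        counts[j][i] += 1
    return counts, (x0, y0, w, h)


def estimate_count(counts: List[List[int]], origin: Origin, rect: Rect) -> float:
    """Sum cell counts weighted by overlap area, assuming uniformity inside a cell."""
    x0, y0, w, h = origin
    qx1, qy1, qx2, qy2 = rect
    est = 0.0
    for j, row in enumerate(counts):
        oy = max(0.0, min(qy2, y0 + (j + 1) * h) - max(qy1, y0 + j * h))
        if oy == 0.0:
            continue
        for i, c in enumerate(row):
            ox = max(0.0, min(qx2, x0 + (i + 1) * w) - max(qx1, x0 + i * w))
            est += c * (ox * oy) / (w * h)
    return est


pts = [(i + 0.5, j + 0.5) for i in range(10) for j in range(10)]
grid, origin = build_grid(pts, 3)
print(cells_per_side(10_000_000), estimate_count(grid, origin, (0.5, 0.5, 3.5, 3.5)))
```

Only the counts are kept in memory, so the memory budget controls the grid resolution rather than the number of stored records. Queries aligned with cell borders are answered from the counts directly, while partially covered cells rely on the uniformity assumption.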

2nd part - Experiments

Experiments
  Measuring:
    Accuracy/Quality
    Performance
  Experimental setup
    2 datasets
      96.4 GB (2,682,401,763 records)
      424 MB (10,507,403 records)
    Windows 7 64-bit, i7 2nd generation, 8 GB RAM
    External disk, read speed ≃ 75 MB/sec

Range query
  Selectivity ratio: what portion of the whole data to count
  A point: compute the square around it
  For:
    selectivity = 1%
    total area = 100
    central point = (5,4)
  selectivity = edge² / total_area (area of the query square over the total area), so edge = 1
  [Figure: grid with axis ticks 1 to 10 showing the square around the central point]
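The query construction for this example can be sketched as below. It assumes that selectivity is the area of the query square divided by the total area, so the edge is the square root of `selectivity * total_area`; the function name `query_square` and the rectangle layout are illustrative.

```python
import math
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout


def query_square(center: Tuple[float, float], selectivity: float, total_area: float) -> Rect:
    """Square centered on `center` whose area is selectivity * total_area (assumed definition)."""
    edge = math.sqrt(selectivity * total_area)
    half = edge / 2.0
    cx, cy = center
    return (cx - half, cy - half, cx + half, cy + half)


# Values from the slide: selectivity = 1%, total area = 100, central point = (5, 4).
rect = query_square((5.0, 4.0), 0.01, 100.0)
print(rect)
```

With these values the square has edge 1 and spans 4.5 to 5.5 on x and 3.5 to 4.5 on y.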

Comparison
  Compare the 2 different approaches
  4 experimental results for both approaches
    Measures: Accuracy/Quality, Performance (time)
    Each measured
      with a specific selectivity and different memory budgets
      with a specific memory budget and different selectivities

Possible future steps
  Compare with more techniques
  Use a bigger memory budget
  Parallelize the execution
    Possible on Hadoop, Spark

Summary
  Problem of selectivity estimation
  Course project
  Methods used
  Experiments in progress
  Improvements
