Selectivity Estimation of Big Spatial Data

1 Selectivity Estimation of Big Spatial Data
Zacharias (Harry) Chasparis CS 267: New Trends in Database Systems

2 Wait… What??? Initial project: Visualization of Big Spatial Data on Spark. New project: executable in 3 weeks, with limited machine needs.

3 Outline: Introduction; Selectivity Estimation Problem; Existing Solutions; Current Project; Implementation; Experimental Evaluation; Future Steps

4 Big Spatial Data Information about the locations and shapes of geographic features and the relationships between them. A huge amount of data is produced by many devices. Major need for storing, processing, and visualizing it.

5 Selectivity Estimation problem
Range query → the number of elements in the dataset that lie in the given range. Large amount of data: need for storage capacity and for huge processing power.
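
To make the problem concrete, the following is a minimal sketch of the exact answer that selectivity estimation tries to approximate: a full scan that counts the points inside a query rectangle. The names `exact_range_count`, `Point`, and `Rect` are hypothetical and not taken from the project; on a dataset of billions of records, this scan is the expensive step the estimators avoid.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def exact_range_count(points: Iterable[Point], rect: Rect) -> int:
    """Count the points that lie inside rect (borders included) by scanning all of them."""
    x_min, y_min, x_max, y_max = rect
    return sum(
        1 for x, y in points
        if x_min <= x <= x_max and y_min <= y <= y_max
    )


points = [(1.0, 1.0), (2.5, 3.0), (5.0, 4.0), (9.0, 9.0)]
print(exact_range_count(points, (2.0, 2.0, 6.0, 5.0)))  # 2
```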

6 Selectivity Estimation problem

7 Existing solutions Very old proposals that may not apply to the current size and/or format of data. Most solutions do not take the available memory into account. Implemented exclusively for big clusters. Focus mainly on either quality or performance. Most important: no study is clear on which selectivity estimation method works better, or on the limitations of these methods.

8 Project Main idea: implement and, mainly, compare 2 different methods for solving the selectivity estimation problem. Methods: sampling and binning. Do all the processing in memory. 2 main aspects to consider: memory budget and selectivity ratio.

9 1st part – Implementation

10 Sampling Way to sample: assign a random number to each record and keep the N greatest. The sample size depends on the memory budget: each record takes 3 × 8 bytes = 24 bytes, so a memory budget of 10 MB gives 10 MB / 24 bytes ≈ 416,667 sample records. Pros: reads the whole dataset from disk only once; keeps actual records in memory. Cons: gives only an estimate of the result.
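
The sampling step above can be sketched as follows. This is a minimal illustration, not the project's code: the function names are hypothetical, the min-heap is one possible way to keep the N greatest random keys in a single pass, and scaling the sample hit ratio by the total record count is an assumed way of turning the sample into a count estimate. The budget is floored here, so 10 MB (taken as 10,000,000 bytes) yields 416,666 records rather than the rounded 416,667.

```python
import heapq
import random
from typing import Iterable, List, Optional, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

RECORD_BYTES = 3 * 8  # 24 bytes per sampled record, as on the slide


def sample_size(memory_budget_bytes: int) -> int:
    """Number of sample records that fit in the memory budget (floored)."""
    return memory_budget_bytes // RECORD_BYTES


def sample_top_n(points: Iterable[Point], n: int,
                 seed: Optional[int] = None) -> List[Point]:
    """Single pass: give each record a random key and keep the n greatest keys."""
    if n <= 0:
        return []
    rng = random.Random(seed)
    heap: List[Tuple[float, float, float]] = []  # min-heap of (key, x, y)
    for x, y in points:
        key = rng.random()
        if len(heap) < n:
            heapq.heappush(heap, (key, x, y))
        elif key > heap[0][0]:
            # The new key beats the smallest kept key, so replace it.
            heapq.heapreplace(heap, (key, x, y))
    return [(x, y) for _, x, y in heap]


def estimate_count(sample: List[Point], total_records: int, rect: Rect) -> float:
    """Scale the number of sample hits up to the full dataset size."""
    if not sample:
        return 0.0
    x_min, y_min, x_max, y_max = rect
    hits = sum(1 for x, y in sample
               if x_min <= x <= x_max and y_min <= y <= y_max)
    return hits * total_records / len(sample)


print(sample_size(10_000_000))  # 416666
```

Because every record receives an independent random key, keeping the N greatest keys yields a uniform sample without knowing the dataset size in advance, which is what allows the single read from disk.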

11-16 Binning Way of binning: inspired by AQWA [1].
[Figure, built up step by step across slides 11 to 16: a grid over the x and y axes with cell values 1, 2, 5, 4, 3, 6, 8, 9, 7, filled in progressively with the values 1, 3, 8, 12, ...]
[1] Ahmed M. Aly et al.: AQWA: Adaptive Query-Workload-Aware Partitioning of Big Spatial Data. PVLDB 8(13) (2015)

17 Binning 1st step: split the space into a grid. The splitting depends on memory: each record (one per bin) takes 8 bytes, so a memory budget of 10 MB gives 10 MB / 8 bytes = 1,250,000 bins. 2nd step: compute the number of points in each grid cell. Pros: gives a very good estimate. Cons: needs to read the dataset from disk twice.
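
The two binning steps can be sketched as below. Several details are assumptions made for illustration and are not stated on the slides: the grid is square (side = integer square root of the bin count), the first pass over the data finds the bounding box, the bounding box has positive width and height, and a partially covered cell contributes in proportion to the covered area (a uniformity assumption). All function names are hypothetical.

```python
import math
from typing import Iterable, List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

BIN_BYTES = 8  # one 8-byte counter per bin, as on the slide


def grid_side(memory_budget_bytes: int) -> int:
    """Side length of the largest square grid whose counters fit in the budget."""
    return math.isqrt(memory_budget_bytes // BIN_BYTES)


def bounding_box(points: Iterable[Point]) -> Rect:
    """Pass 1 (assumed): find the extent of the data."""
    pts = list(points)
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    return (min(xs), min(ys), max(xs), max(ys))


def build_grid(points: Iterable[Point], bbox: Rect, side: int) -> List[List[int]]:
    """Pass 2: count the points falling in each grid cell."""
    x_min, y_min, x_max, y_max = bbox
    w = (x_max - x_min) / side  # cell width, assumed > 0
    h = (y_max - y_min) / side  # cell height, assumed > 0
    counts = [[0] * side for _ in range(side)]
    for x, y in points:
        col = min(int((x - x_min) / w), side - 1)  # clamp points on the max border
        row = min(int((y - y_min) / h), side - 1)
        counts[row][col] += 1
    return counts


def estimate_count(counts: List[List[int]], bbox: Rect, rect: Rect) -> float:
    """Sum the cell counts, weighting each cell by the fraction the query covers."""
    x_min, y_min, x_max, y_max = bbox
    side = len(counts)
    w = (x_max - x_min) / side
    h = (y_max - y_min) / side
    qx0, qy0, qx1, qy1 = rect
    total = 0.0
    for row in range(side):
        cy0 = y_min + row * h
        overlap_y = max(0.0, min(qy1, cy0 + h) - max(qy0, cy0))
        if overlap_y == 0.0:
            continue
        for col in range(side):
            cx0 = x_min + col * w
            overlap_x = max(0.0, min(qx1, cx0 + w) - max(qx0, cx0))
            total += counts[row][col] * (overlap_x * overlap_y) / (w * h)
    return total


print(grid_side(10_000_000))  # 1118
```

A query that aligns with cell borders is answered exactly under this scheme; the error comes only from cells that the query rectangle cuts through.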

18 2nd part - Experiments

19 Experiments Measuring: accuracy/quality and performance. Experimental setup: 2 datasets, 96.4 GB (2,682,401,763 records) and 424 MB (10,507,403 records); Windows 7 64-bit, 2nd-generation i7, 8 GB RAM; external disk with read speed ≈ 75 MB/s.

20 Range query Selectivity ratio: what portion of the whole data to count. A point: compute the square around it. For selectivity = 1%, total area = 100, and central point = (5, 4): selectivity = edge² / total_area, so edge = 1. [Figure: grid with axis ticks 1 to 10.]
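
The query construction can be sketched as follows, assuming the selectivity ratio is the area of the query square (edge²) divided by the total area, so that edge = sqrt(selectivity × total_area). The function name `query_square` is hypothetical.

```python
import math
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def query_square(center: Tuple[float, float], selectivity: float,
                 total_area: float) -> Rect:
    """Square centered on `center` whose area is selectivity * total_area."""
    edge = math.sqrt(selectivity * total_area)
    half = edge / 2.0
    cx, cy = center
    return (cx - half, cy - half, cx + half, cy + half)


# Slide example: selectivity = 1%, total area = 100, central point = (5, 4).
print(query_square((5.0, 4.0), 0.01, 100.0))
```

For the slide's numbers this gives a unit square spanning roughly (4.5, 3.5) to (5.5, 4.5). A square near the border of the data space may extend past it, which this sketch does not handle.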

21 Comparison Compare the 2 different approaches: 4 experimental results for both approaches. Accuracy/quality and performance (time), each measured with a fixed selectivity and varying memory budget, and with a fixed memory budget and varying selectivity.

22 Possible future steps Compare with more techniques. Use a bigger memory budget. Parallelize the execution, possibly on Hadoop or Spark.

23 Summary Problem of selectivity estimation. Course project. Methods used. Experiments in progress. Improvements.
