AQWA: Adaptive Query-Workload-Aware Partitioning of Big Spatial Data. Dimosthenis Stefanidis, Stelios Nikolaou.

Presentation transcript:

AQWA: Adaptive Query-Workload-Aware Partitioning of Big Spatial Data
Dimosthenis Stefanidis, Stelios Nikolaou

Introduction
1. Spatial data: data or information that identifies the geographic location (coordinates) of features and boundaries on Earth. Because of the large number of location-aware devices, large amounts of spatial information are created every day.

Introduction
2. What is AQWA? AQWA (Adaptive Query-Workload-Aware partitioning) is a data-partitioning mechanism that minimizes the processing time of spatial queries. AQWA updates the partitioning according to:
1. changes in the data (SpatialHadoop, by contrast, requires recreating the partitions),
2. the query workload.

Introduction
AQWA keeps a lower bound on the size of each partition (traditional spatial index structures have unbounded decomposition). Allowing too many small partitions can be harmful to the overall health of a computing cluster.

Introduction
AQWA supports spatial range and kNN queries. For kNN queries, AQWA guarantees a correct answer in a single round of computation while minimizing the amount of data scanned during query processing (existing approaches require two rounds of processing). AQWA can also react to different types of query workloads (e.g., multiple hotspots).

Main Goal
Partition the data in a way that minimizes the query-processing cost, where:
L = the set of partitions,
Oq(p) = the number of queries that overlap partition p,
N(p) = the count of points in partition p.
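
The cost expression itself did not survive in this transcript; reconstructed from the definitions above (each partition contributes its point count once for every query that overlaps it), it is:

```latex
\mathrm{Cost}(L) = \sum_{p \in L} O_q(p) \cdot N(p)
```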

Split Queue (Priority Queue)
AQWA maintains a priority queue of candidate partitions. Partitions in the split-queue are kept in a max-heap, ordered by the cost reduction that would result from splitting them.
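
A minimal sketch of such a split-queue using Python's heapq (a min-heap, so the cost reduction is negated to obtain max-heap behavior); the class and method names are illustrative, not AQWA's actual code:

```python
import heapq

class SplitQueue:
    """Max-heap of candidate partitions keyed by estimated cost reduction."""
    def __init__(self):
        self._heap = []                     # entries are (-reduction, partition_id)

    def push(self, partition_id, cost_reduction):
        # heapq is a min-heap, so negate the key to pop the largest reduction first
        heapq.heappush(self._heap, (-cost_reduction, partition_id))

    def pop_best(self):
        # return (partition_id, cost_reduction) for the most beneficial split
        neg_reduction, partition_id = heapq.heappop(self._heap)
        return partition_id, -neg_reduction

# usage: sq = SplitQueue(); sq.push("E", 15); sq.pop_best()  ->  ("E", 15)
```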

K-d Tree
A space-partitioning data structure for organizing points in a k-dimensional space.

Prefix Sum

Initialization

Initialization
1. Divide the space into a grid G[i, j], where each cell stores the total number of points whose coordinates fall inside its boundaries.
2. Apply a k-d tree decomposition (recursively) over the grid to identify the best partition layout, balancing the number of points across the partitions (see the sketch below).
3. Run a MapReduce job that reads the entire dataset and assigns each data point to its corresponding partition (i.e., physically creates the partitions).
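
A minimal sketch of step 2, assuming the grid counts are available as a 2D list; the function names, the fixed maximum depth, and the median-split rule are illustrative simplifications, not AQWA's actual implementation:

```python
def region_count(grid, x0, x1, y0, y1):
    """Total number of points in grid cells [x0, x1) x [y0, y1)."""
    return sum(grid[i][j] for i in range(x0, x1) for j in range(y0, y1))

def kd_partition(grid, x0, x1, y0, y1, depth=0, max_depth=4):
    """Recursively split a region of the count grid so that each side of a split
    holds roughly half of the points, alternating the split axis (k-d style).
    Returns a list of (x0, x1, y0, y1) cell ranges, one per partition."""
    total = region_count(grid, x0, x1, y0, y1)
    if depth == max_depth or total == 0 or (x1 - x0 <= 1 and y1 - y0 <= 1):
        return [(x0, x1, y0, y1)]                       # leaf = one partition
    split_on_x = (depth % 2 == 0 and x1 - x0 > 1) or (y1 - y0 <= 1)
    if split_on_x:
        running, split = 0, x1 - 1                      # default: last legal split
        for i in range(x0, x1 - 1):
            running += region_count(grid, i, i + 1, y0, y1)
            if running >= total / 2:
                split = i + 1
                break
        return (kd_partition(grid, x0, split, y0, y1, depth + 1, max_depth) +
                kd_partition(grid, split, x1, y0, y1, depth + 1, max_depth))
    else:
        running, split = 0, y1 - 1
        for j in range(y0, y1 - 1):
            running += region_count(grid, x0, x1, j, j + 1)
            if running >= total / 2:
                split = j + 1
                break
        return (kd_partition(grid, x0, x1, y0, split, depth + 1, max_depth) +
                kd_partition(grid, x0, x1, split, y1, depth + 1, max_depth))
```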

Efficient Search via Aggregation
1. Perform horizontal aggregation.
2. Perform vertical aggregation.
Together, the two passes turn the grid of raw counts into two-dimensional prefix sums, so the number of points (or queries) inside any rectangular region can be computed in constant time (see the sketch below).
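
A minimal sketch of the two aggregation passes and the resulting constant-time rectangle count; function names are illustrative:

```python
def build_prefix_sums(grid):
    """Horizontal pass, then vertical pass, turns raw cell counts into 2D prefix
    sums: P[i][j] = number of points in all cells (a, b) with a <= i and b <= j."""
    rows, cols = len(grid), len(grid[0])
    P = [row[:] for row in grid]
    for i in range(rows):                 # 1) horizontal aggregation
        for j in range(1, cols):
            P[i][j] += P[i][j - 1]
    for j in range(cols):                 # 2) vertical aggregation
        for i in range(1, rows):
            P[i][j] += P[i - 1][j]
    return P

def rect_count(P, i0, j0, i1, j1):
    """Number of points in cells [i0..i1] x [j0..j1], in O(1) via inclusion-exclusion."""
    total = P[i1][j1]
    if i0 > 0:
        total -= P[i0 - 1][j1]
    if j0 > 0:
        total -= P[i1][j0 - 1]
    if i0 > 0 and j0 > 0:
        total += P[i0 - 1][j0 - 1]
    return total
```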

Query Execution

Query Execution
1. Select the partitions that are relevant to (i.e., overlap) the invoked query.
2. Pass the selected partitions as input to a MapReduce job that determines the actual data points belonging to the query answer.
3. Possibly decide to repartition the data; if so, update the affected partitions' values in the priority queue.
A sketch of step 1 appears below.
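
A minimal sketch of step 1, assuming each partition is described by an axis-aligned bounding rectangle; the data layout and names are illustrative:

```python
def overlaps(part_rect, query_rect):
    """Axis-aligned rectangle overlap test; both are (x_min, y_min, x_max, y_max)."""
    px0, py0, px1, py1 = part_rect
    qx0, qy0, qx1, qy1 = query_rect
    return px0 <= qx1 and qx0 <= px1 and py0 <= qy1 and qy0 <= py1

def relevant_partitions(partitions, query_rect):
    """Step 1: keep only the partitions whose rectangles overlap the query.
    Only these partitions are fed to the MapReduce job of step 2."""
    return [name for name, rect in partitions.items() if overlaps(rect, query_rect)]

# hypothetical usage:
# parts = {"A": (0, 0, 10, 10), "E": (10, 0, 20, 10)}
# relevant_partitions(parts, (8, 2, 12, 6))  ->  ["A", "E"]
```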

Data Repartitioning After a Query
Three factors affect this decision:
1. The cost gain that would result from splitting a partition.
2. The overhead of reading and writing the contents of the affected partitions.
3. The sizes of the resulting partitions.

Splitting Partitions
Example: a query overlaps partitions A (20 points), E (30 points), and D (15 points). The cost is 20 × 1 + 30 × 1 + 15 × 1 = 65. If E is split into E1 and E2 with 15 points each, and the query overlaps only one of the two halves, the new cost is 20 × 1 + 15 × 1 + 15 × 1 = 50.

Should We Split a Partition?
Decreased cost: C_d(E) = C(E) − C(E1) − C(E2), where C(p) = Oq(p) × N(p).
Read/write overhead: C_rw(E) = 2 × N(E), where N(E) is the number of points in E.
Sizes of the resulting partitions: N(E1) > minCount and N(E2) > minCount, where minCount = block size / (number of bytes per data point).
Split E only if the decreased cost exceeds the read/write overhead, i.e., C_d(E) > C_rw(E). A sketch of this decision rule follows.
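
A minimal sketch of the decision rule above; the parameter names are illustrative, and in AQWA the point counts and overlapping-query counts would come from the aggregated grid and the query history:

```python
def should_split(n_E, n_E1, n_E2, q_E, q_E1, q_E2, min_count):
    """Decide whether splitting partition E into E1 and E2 is worthwhile.
    n_* are point counts N(.), q_* are overlapping-query counts Oq(.)."""
    cost_E  = q_E  * n_E                        # C(E)  = Oq(E)  * N(E)
    cost_E1 = q_E1 * n_E1                       # C(E1) = Oq(E1) * N(E1)
    cost_E2 = q_E2 * n_E2                       # C(E2) = Oq(E2) * N(E2)
    decreased_cost = cost_E - cost_E1 - cost_E2
    read_write_overhead = 2 * n_E               # C_rw(E) = 2 * N(E)
    big_enough = n_E1 > min_count and n_E2 > min_count
    return big_enough and decreased_cost > read_write_overhead
```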

Time-Fading Weights
AQWA keeps the history of all queries that have been processed. In each cell of grid G it maintains two counts: C_old for the old queries (received before the last T time units) and C_new for the recent queries (received within the last T time units), where T is the time-fading cycle. Every T time units, C_old is divided by c (c > 1) and C_new is added to C_old (then C_new is reset to zero). The number of queries in a region is estimated as C_new + C_old.

Time-Fading Weights
AQWA processes the partitions in a round-robin fashion, handling only Np/T partitions every T time units (Np = number of partitions). For each of these Np/T partitions, it recalculates the cost and reinserts the partition into the split-queue.
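
A minimal sketch of the time-fading update and the round-robin re-costing, assuming grid cells are dictionaries holding C_old and C_new; split_queue matches the sketch given earlier, and cost_reduction is a hypothetical caller-supplied function that recomputes C_d(p):

```python
def fade_weights(grid_cells, c):
    """End-of-cycle update for every grid cell: decay the old count and fold in
    the queries received during the last T time units."""
    for cell in grid_cells:
        cell["C_old"] = cell["C_old"] / c + cell["C_new"]
        cell["C_new"] = 0

def cell_weight(cell):
    """Estimated number of queries relevant to a cell: recent plus decayed history."""
    return cell["C_new"] + cell["C_old"]

def refresh_slice(partitions, split_queue, cycle_index, num_slices, cost_reduction):
    """Round-robin maintenance: re-cost only one slice of the partitions per cycle
    and push them back into the split-queue with the fresh key."""
    for idx, part in enumerate(partitions):
        if idx % num_slices == cycle_index:
            split_queue.push(part, cost_reduction(part))
```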

Data Acquisition

Data Acquisition
Issue a MapReduce job that appends each new data point to its corresponding partition according to the current layout of the partitions. The point counts in the grid are then incremented by the corresponding counts of the new batch.
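
A minimal single-machine sketch of what this batch-append step does conceptually; in AQWA the routing happens inside a MapReduce job, and cell_of (mapping a point to its grid cell) is an assumed helper:

```python
def ingest_batch(batch, partitions, grid, cell_of):
    """Route each new point to the partition whose rectangle contains it and
    increment the count of the grid cell the point falls into."""
    for x, y in batch:
        for part in partitions:
            x0, y0, x1, y1 = part["rect"]
            if x0 <= x <= x1 and y0 <= y <= y1:
                part["points"].append((x, y))
                break
        i, j = cell_of(x, y)
        grid[i][j] += 1
```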

Support for kNN Queries
The boundaries that contain the answer of the query are unknown until the query is executed. The spatial region that contains the answer of a kNN query depends on:
1. the value of k,
2. the location of the query focal point,
3. the distribution of the data.

Support for kNN Queries
1. Scan the grid cells outward from the query focal point, counting the points in the encountered cells.
2. Once the accumulated count reaches k, mark the largest distance between the query focal point and any encountered cell.
3. Treat the kNN query as a range query once the rectangular bounds enclosing the answer are determined.
A sketch of this bound estimation appears below.
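
A simplified sketch of the bound estimation, not AQWA's exact procedure: it expands square rings of cells around the query's focal cell until at least k points have been seen, then returns a conservative rectangle (the radius over-approximates the farthest counted cell plus the query's offset within its own cell); all names are illustrative:

```python
import math

def knn_bounds(grid, focal_cell, k, cell_size):
    """Estimate a rectangle guaranteed to contain the k nearest neighbors of a
    query point located somewhere inside focal_cell = (row, col)."""
    rows, cols = len(grid), len(grid[0])
    fi, fj = focal_cell
    seen, ring = 0, 0
    while seen < k and ring <= max(rows, cols):
        # count cells whose Chebyshev distance from the focal cell equals `ring`
        for i in range(max(0, fi - ring), min(rows, fi + ring + 1)):
            for j in range(max(0, fj - ring), min(cols, fj + ring + 1)):
                if max(abs(i - fi), abs(j - fj)) == ring:
                    seen += grid[i][j]
        ring += 1
    # conservative radius: covers every counted cell plus the query's own cell
    radius = (ring + 0.5) * cell_size * math.sqrt(2)
    cx, cy = (fj + 0.5) * cell_size, (fi + 0.5) * cell_size   # focal cell center
    return (cx - radius, cy - radius, cx + radius, cy + radius)
```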

Concurrency Control
Problem: while a partition is being split, a new query may arrive that also triggers a split of the very same partitions being altered.
Solution: a simple locking mechanism. Whenever a query q triggers a split, q tries to acquire a lock on each of the partitions to be altered. If q succeeds in acquiring all the locks, it is allowed to alter the partitions; the locks are released after the partitions are completely altered. If q cannot acquire all the locks, the decision to alter the partitions is cancelled.
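
A minimal sketch of this all-or-nothing locking with Python's threading primitives; the lock registry and function names are illustrative, not AQWA's actual code:

```python
import threading

partition_locks = {}        # one lock per partition name (hypothetical registry)

def try_split(partition_names, do_split):
    """Acquire every needed lock without blocking; if any acquisition fails,
    release what was taken and cancel the split decision."""
    acquired = []
    for name in partition_names:
        lock = partition_locks.setdefault(name, threading.Lock())
        if lock.acquire(blocking=False):
            acquired.append(lock)
        else:
            for held in acquired:       # another split owns a partition: back off
                held.release()
            return False                # the split is cancelled
    try:
        do_split()                      # safe: this query holds all needed locks
    finally:
        for held in acquired:
            held.release()
    return True
```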

Experiments
7-node cluster running Hadoop 2.2 over Red Hat Enterprise Linux 6. Each node is a Dell r720xd server with 16 Intel E5-2650v2 cores, 64 GB of memory, 48 TB of local storage, and a 40 Gigabit Ethernet interconnect.
Data size: small scale (250 GB) and large scale (2.5 TB).
K-d tree and grid-based partitioning are chosen as baselines because they contrast AQWA against two extreme partitioning schemes: 1) pure spatial decomposition (uniform grid) and 2) data decomposition (k-d tree).

Initialization
The initialization cost of AQWA is the same as that of the k-d tree partitioning.

Range Query Performance
System throughput indicates the number of queries that can be answered per unit time. Split overhead indicates the time required to perform the split operations.

kNN Query Performance
The performance of kNN queries for different values of k.

Handling Multiple Query-Workloads
This set of experiments studies the effect of having two or more query workloads, simulating the migration of the workload from one hotspot to another.

Conclusion
AQWA addresses several performance and system challenges:
1. the limitations of Hadoop,
2. the overhead of rebuilding the partitions in HDFS,
3. the dynamic nature of the data, where new batches are created every day,
4. workload-awareness, where the query workload can change.