On Spatial Joins in MapReduce

Slides:

Advertisements

Similar presentations

Equality Join R X R.A=S.B S : : Relation R M PagesN Pages Relation S Pr records per page Ps records per page.

Advertisements

CS4432: Database Systems II

LIBRA: Lightweight Data Skew Mitigation in MapReduce

Query Execution, Concluded Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November 18, 2003 Some slide content may.

University of Minnesota CG_Hadoop: Computational Geometry in MapReduce Ahmed Eldawy* Yuan Li* Mohamed F. Mokbel*$ Ravi Janardan* * Department of Computer.

Cost-Based Plan Selection Choosing an Order for Joins Chapter 16.5 and16.6 by:- Vikas Vittal Rao ID: 124/227 Chiu Luk ID: 210.

Parallel K-Means Clustering Based on MapReduce The Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences Weizhong Zhao, Huifang.

Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc

Hadoop: The Definitive Guide Chap. 8 MapReduce Features

USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION Tools to build your big data application Ameya Kanitkar.

CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop

CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.

SpatialHadoop: A MapReduce Framework for Spatial Data

EXPOSE GOOGLE APP ENGINE AS TASKTRACKER NODES AND DATA NODES.

Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.

Relational Operator Evaluation. Overview Index Nested Loops Join If there is an index on the join column of one relation (say S), can make it the inner.

Performance Prediction for Random Write Reductions: A Case Study in Modelling Shared Memory Programs Ruoming Jin Gagan Agrawal Department of Computer and.

MC 2 : Map Concurrency Characterization for MapReduce on the Cloud Mohammad Hammoud and Majd Sakr 1.

Computer Science and Engineering Parallelizing Defect Detection and Categorization Using FREERIDE Leonid Glimcher P. 1 ipdps’05 Scaling and Parallelizing.

Lecture 15- Parallel Databases (continued) Advanced Databases Masood Niazi Torshiz Islamic Azad University- Mashhad Branch

A Fault-Tolerant Environment for Large-Scale Query Processing Mehmet Can Kurt Gagan Agrawal Department of Computer Science and Engineering The Ohio State.

CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.

DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters Nanyang Technological University Shanjiang Tang, Bu-Sung Lee, Bingsheng.

IBM Research ® © 2007 IBM Corporation Introduction to Map-Reduce and Join Processing.

MapReduce Joins Shalish.V.J. A Refresher on Joins A join is an operation that combines records from two or more data sets based on a field or set of fields,

Relational Operator Evaluation. overview Projection Two steps –Remove unwanted attributes –Eliminate any duplicate tuples The expensive part is removing.

SQL Server Statistics DEMO SQL Server Statistics SREENI JULAKANTI,MCTS.MCITP,MCP. SQL SERVER Database Administration.

Lecture 3 – MapReduce: Implementation CSE 490h – Introduction to Distributed Computing, Spring 2009 Except as otherwise noted, the content of this presentation.

BAHIR DAR UNIVERSITY Institute of technology Faculty of Computing Department of information technology Msc program Distributed Database Article Review.

Spatial Data Management

Presented by: Omar Alqahtani Fall 2016

Big Data is a Big Deal!.

Chapter 7. Classification and Prediction

Hadoop Aakash Kag What Why How 1.

By Chris immanuel, Heym Kumar, Sai janani, Susmitha

MapReduce Types, Formats and Features

Scaling SQL with different approaches

A paper on Join Synopses for Approximate Query Answering

TABLE OF CONTENTS. TABLE OF CONTENTS Not Possible in single computer and DB Serialised solution not possible Large data backup difficult so data.

Database Management Systems (CS 564)

Chapter 12: Query Processing

SpatialHadoop: A MapReduce Framework for Spatial Data

Dynamic Indexing in SpatialHadoop

Evaluation of Relational Operations

Sameh Shohdy, Yu Su, and Gagan Agrawal

MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner

COST ESTIMATION FOR THE RELATIONAL ALGEBRA OPERATIONS MIT 813 GROUP 15 PRESENTATION.

Database Applications (15-415) Hadoop Lecture 26, April 19, 2016

Database Management Systems (CS 564)

Evaluation of Relational Operations: Other Operations

Spatial Online Sampling and Aggregation

Author: Ahmed Eldawy, Mohamed F. Mokbel, Christopher Jonathan

Genetic Improvement CHORDS GROUP Stirling University

MapReduce: Data Distribution for Reduce

Communication and Memory Efficient Parallel Decision Tree Construction

External Sorting The slides for this text are organized into chapters. This lecture covers Chapter 11. Chapter 1: Introduction to Database Systems Chapter.

CS110: Discussion about Spark

Optimizing MapReduce for GPUs with Effective Shared Memory Usage

Selected Topics: External Sorting, Join Algorithms, …

Chapters 15 and 16b: Query Optimization

Data processing with Hadoop

Lecture 2- Query Processing (continued)

Yi Wang, Wei Jiang, Gagan Agrawal

DATABASE HISTOGRAMS E0 261 Jayant Haritsa

MAPREDUCE TYPES, FORMATS AND FEATURES

Evaluation of Relational Operations: Other Techniques

COMP755 Advanced Operating Systems

Evaluation of Relational Operations: Other Techniques

Map Reduce, Types, Formats and Features

Presentation transcript:

On Spatial Joins in MapReduce Ibrahim Sabek & Mohamed F. Mokbel Presented by: Achyuth Madhav Diwakar

Problem Statement and Motivation To develop a query optimizer for MapReduce based spatial join algorithms. Traditional spatial join optimizations NOT suitable for MapReduce based algorithms

Related Work Traditional Spatial Join Algorithms: Numerous algorithms categorized based on indexing of inputs. Non-indexed, one dataset indexed and both indexed Several benchmarking studies performed to find the best algorithm for the given inputs.

Related Work Big Spatial Data: Research towards supporting range-queries, KNN, Spatial joins and computational geometry

Related Work MapReduce based Spatial Join Algorithms Standalone algorithms or part of spatial data engines like: Hadoop GIS or Spatial Hadoop Developed for targeted scenarios Spatial Hadoop assumes both datasets are spatially partitioned while other algorithms assume non-partitioned datasets. Standard Approach: First partition and then join using any of the algorithms. No query optimization efforts done at the time of this research

MapReduce - WordCount Splits data Generates Key-value pairs by mapping Sorts the pairs by key Reduces the pairs with same key into key, frequency pairs

Hadoop & Spatial Hadoop Involves partitioning of data Join performed in the reduce phase by looking for overlapping MBRs Can perform global joins and local joins SpatialHadoop: A MapReduce Framework for Spatial Data∗ Ahmed Eldawy Mohamed F. Mokbel

Spatial Join with MapReduce Three cases based on input partitioning R & S are two datasets stored in HDFS Output: (r,s), r ∈ R and s ∈ S where r and s satisfy some spatial predicate

Partitioning Phase To ensure that the HDFS blocks of both input files R and S are partitioned in a way that each block includes data that are spatially related Performance depends on minimizing the number of partitions joined Decision points perform that role

Case 1: No Spatial Partitioning Default Hadoop HDFS partitioning: Fixed block sizes (say 128 MB) with no regard to spatial data. Considered Non – partitioned Partitioning Phase: Should avoid nested loop joins! Partitions the two input files R and S using the same partitioning scheme P Both datasets must be taken together by size – proportional sampling and one in memory index P is built Fire stations Houses Fig. from Dr. Magdy’s slides on spatial queries

Case 1: No Spatial Partitioning Decision Point 1: Partitioning Options Option1:Grid: Creates equal sized cells Option2:QuadTree: in-memory quadtree using the sampled data. Option 3: KDTree: constructs an in- memory KDTree using sampled data. Option 4: STR: builds the in-memory index by bulk loading an R-tree using the STR bulk loading algorithm Option 5: STR+. This technique is similar to STR, except that objects that overlap with multiple cells are replicated.

Case 1: No Spatial Partitioning Joining Phase: R & S will have the same MBRs. Therefore, one-to-one join can be performed. Duplicate avoidance technique to ensure only one pair of joined objects

Spatial Join with MapReduce

Case 2: One Spatial Partition Partitioning Phase: Decision Point 2: To ignore R-partitioning and repartition both like Case 1? OR Partition just S on the same scheme as R Options: Repartition R & S using options from Case 1 Partition S using the boundaries of R

Case 2: One Spatial Partition Joining Phase: All options result in same partitioning scheme for R & S One-to-one join on corresponding partitions

Spatial Join with MapReduce

Case 3: Two Spatial Partitions Partitioning Phase: Decision Point 3: To ignore R-partitioning and repartition both like Case 1? OR Partition just S on the same scheme as R OR Partition just R on the same scheme as S OR Skip partitioning and join directly

Case 3: Two Spatial Partitions Partitioning Phase: Options: Repartition R & S using options from Case 1 Results in one-to-one join Partition S using the boundaries of R OR vice versa from Case 2 Perform join directly Could be one-to-one or one-to-many join

Case 3: Two Spatial Partitions Joining Phase: Direct join shown here One-to-many join occurs

Optimizer Flavors Aims to find the least cost approach to do a spatial join for two input datasets R and S Evaluates the cost of all possible options for the three decision points Cost based Optimizer: abstracts the developed cost model from cost based approach to three simple easy-to-check heuristic rules. Rule based Optimizer:

Cost Optimizer Cost of Spatial Join Cost of Partitioning = Cost of Partitioning + Cost of Joining Cost of Spatial Join = Cost of estimating the partitioning boundaries + Actual cost of partitioning Cost of Partitioning = cost of moving the pairs of overlapping partitions into the same node + Actual cost of joining Cost of Joining

Evaluating Cost in Case 1 Estimation Cost: ZERO for grid Similar for all other options, calculated experimentally Partitioning Cost: Depends on the number of spatial partitions Number of partitions determined by: No. of partitions for larger input Block Utilization: HDFS block util. as a range between 0 and 1 Data Replication: Avg. number of partitions that input ‘r’ would overlap with Number of allocated partitions = 2NRURDp Costs divided by KxH which is the number of running Map tasks Decision Point 1: Partitioning Options Option1:Grid Option2:QuadTree Option 3: KDTree Option 4: STR Option 5: STR+

Evaluating Cost in Case 1 Preparation Cost: cost of shuffling the new spatial partitions of R and S such that all pairs of joined partitions are in the same machine One-to-one join: need to shuffle only the partitions of one of these relations, say R Joining Cost: need to do an in-memory one-to- one join for all the spatial partitions in each input dataset Costs divided by LxH which is the number of running Reduce tasks

Evaluating Cost in Case 2 Estimation Cost: ZERO as partitions of R are being used Partitioning Cost: Calculated in a similar fashion as Case 1 Cost is halved as only one dataset is being partitioned Costs divided by KxH which is the number of running Map tasks

Evaluating Cost in Case 2 Preparation Cost: Similar to Case 1, except we know the exact number of partitions to move (Nr) Joining Cost: In memory one-to-one join Costs divided by KxH which is the number of running Map tasks

Evaluating Cost in Case 3 Partitioning Cost: ZERO as there is no partitioning Joining Cost: Here the number of overlapping partitions is estimated and a selectivity estimation is multiplied to the Nr X Ns. Selectivity estimation is a factor from 0 to 1

Rule Based Query Optimizer Cost-based optimizer is accurate but expensive as it collects and maintains statistics as well as estimates parameters to calculate its cost equations Rule-based optimizer has thresholds which determine the partitioning decisions: Rule 1: Nr and Ns are less than a given threshold t, then we partition both R and S on Grid (Option 1), otherwise, we partition both R and S on STR (Option 4) Rule 2: Nr and Ns are less than a given threshold t, then we partition both R and S on Grid (Option 1), otherwise, we partition S on the same partitioning scheme of R (Option 6)

Rule based Query Optimizer Rule 3: if βRSNS ≤ 1 where ‘beta’ is the selectivity estimation and R is the larger dataset then partition is skipped If Nr and Ns are both less than threshold ‘t’ then Grid partitioning is used If both the above conditions fail, Case2 optimization is used where, S is partitioned on the scheme of R

Experiments Estimation of parameters for Cost Model Map Reduce Parameters Datasets. All experiments are based on collected data from Open- StreetMap

Experiments Significantly outperforms AsterixDB as it is not designed for Spatial join optimizations Buildings and Roads: AsterixDB selects STR type optimization whereas Cost- Optz selects Grid type optimization which is better for uniform distribution Rule-Optz also selects STR but performs better due to SpatialHadoop efficiency

Experiments Input Data Characteristics Option1:Grid Option2:QuadTree Option 3: KDTree Option 4: STR Option 5: STR+

Experiments Uniform vs. Skewed Datasets

Conclusions Delivered a full-fledged query optimizer for MapReduce-based spatial join algorithms Covers extensive cases through a thorough classification detailed cost model is developed for each possible case provided a rule-based query optimizer that abstracts the developed cost model Exhaustive experiments based on a real deployment inside SpatialHadoop, and based on real datasets of up to 500GB

Thank You!