SIMILARITY JOIN COP6731 Advanced Database Systems.



Basic Similarity Queries
- Range Query: find the items similar to a query item, i.e., within a given distance of it
- k-Nearest-Neighbor (kNN) Query: find the k items most similar to a query item
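
The two query types can be sketched by brute force as follows. This is only an illustration: Euclidean distance and the names `range_query` and `knn_query` are assumptions, not taken from the slides.

```python
import heapq
import math

def range_query(points, q, eps):
    """Return every point whose Euclidean distance to q is at most eps."""
    return [p for p in points if math.dist(p, q) <= eps]

def knn_query(points, q, k):
    """Return the k points closest to q, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))

points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0), (0.5, 0.5)]
near = range_query(points, (0.0, 0.0), 1.0)   # everything within distance 1
top2 = knn_query(points, (0.0, 0.0), 2)       # the two closest points
```

A range query can return any number of items, while a kNN query always returns k items (if at least k exist).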

Similarity Join
- Given two sets, R and S, of data points
- Find all pairs (r, s) ∈ R × S such that d(r, s) ≤ ε
- Applications: duplicate detection, similarity comparison, etc.

Similarity Join - SQL-like Notation
    SELECT * FROM R, S WHERE d(R.r, S.s) ≤ ε
- ε too small: no results
- ε too large: very large result set
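
A minimal sketch of the query above, assuming Euclidean distance for d and using made-up sample points, shows how the result size depends on ε:

```python
import math
from itertools import product

def similarity_join(R, S, eps):
    """All pairs (r, s) from R x S with d(r, s) <= eps (brute force)."""
    return [(r, s) for r, s in product(R, S) if math.dist(r, s) <= eps]

R = [(0.0, 0.0), (2.0, 2.0)]
S = [(0.0, 1.0), (2.0, 3.0), (5.0, 5.0)]

for eps in (0.5, 1.0, 10.0):
    # 0.5 yields no pairs, 1.0 yields two, 10.0 yields the full cross product
    print(eps, len(similarity_join(R, S, eps)))
```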

k-Closest Pair Query
- Given two sets, R and S, of data points
- Find the k pairs (r, s) with the smallest distances d(r, s)
- In the closest pair, r and s are nearest neighbors (NN) of each other
- This is also called a distance join

k-Closest Pair Query - SQL-like Notation
    SELECT * FROM R, S ORDER BY d(R.r, S.s) UNTIL k
Applications
- Find the pairs of people who have the most similar interests
- Find the music scores that are most similar to each other
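
As a rough illustration of the ORDER BY ... UNTIL k semantics, the sketch below ranks all pairs by distance and keeps the first k. Euclidean distance, the function name, and the sample data are assumptions.

```python
import heapq
import math
from itertools import product

def k_closest_pairs(R, S, k):
    """Return the k pairs from R x S with the smallest distance, closest first."""
    return heapq.nsmallest(k, product(R, S), key=lambda pair: math.dist(*pair))

R = [(0.0, 0.0), (2.0, 2.0)]
S = [(0.0, 1.0), (2.0, 3.0), (5.0, 5.0)]
best3 = k_closest_pairs(R, S, 3)
```

Unlike the similarity join, no ε has to be chosen; the result size is fixed at k.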

k-Nearest Neighbor Join
- Combine each point with its k nearest neighbors from the other data set
- SQL-like Notation:
    SELECT * FROM R, S GROUP BY R.r GROUP SIZE k ORDER BY d(R.r, S.s)
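
A brute-force sketch of the k-NN join, assuming Euclidean distance; the dictionary result format is a choice made here for readability, not something the slides prescribe.

```python
import heapq
import math

def knn_join(R, S, k):
    """Map each point r in R to its k nearest neighbors in S, nearest first."""
    return {r: heapq.nsmallest(k, S, key=lambda s: math.dist(r, s)) for r in R}

R = [(0.0, 0.0), (2.0, 2.0)]
S = [(0.0, 1.0), (2.0, 3.0), (5.0, 5.0)]
joined = knn_join(R, S, 2)
```

The operation is not symmetric: `knn_join(R, S, k)` and `knn_join(S, R, k)` generally produce different pairs.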


k-NN Join Applications
- k-means clustering
  1. Select k initial centers at random
  2. Assign each database point to its nearest center
  3. Recompute the center of each cluster
  4. Repeat Steps 2 and 3 until convergence
- Classify new objects according to the majority of their k nearest neighbors
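
Both applications can be sketched in a few lines. Step 2 of k-means is a 1-NN join of the points against the centers. To keep the example deterministic, the initial centers are passed in rather than drawn at random; that choice, the function names, and the sample data are assumptions.

```python
import heapq
import math
from collections import Counter

def k_means(points, centers, max_iter=100):
    """Steps 2 to 4 of k-means, starting from the given initial centers."""
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center (a 1-NN join).
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        # Step 3: recompute each center as the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else old
            for cl, old in zip(clusters, centers)
        ]
        if new_centers == centers:   # Step 4: stop once nothing changes
            break
        centers = new_centers
    return centers, clusters

def knn_classify(labeled, q, k):
    """Majority label among the k labeled points nearest to q."""
    nearest = heapq.nsmallest(k, labeled, key=lambda item: math.dist(item[0], q))
    return Counter(label for _, label in nearest).most_common(1)[0][0]

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers, clusters = k_means(points, [(0.0, 0.0), (10.0, 0.0)])
label = knn_classify([((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((5.0, 5.0), "b")],
                     (0.2, 0.2), 3)
```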

Nested Loop Join
- Simple nested loop
  - For each R-point, iterate over the S-points
  - Scans S |R| times, very expensive
- Nested block loop
  - For each page of R-points, iterate over the S-points
  - Scans S only |R|/|page| times, more cost-effective
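
The difference in the number of S scans can be made concrete with a small in-memory sketch. Pages are modeled as list slices, and a page size of 1 reproduces the simple nested loop; the function names and data are hypothetical.

```python
import math

def pages(points, page_size):
    """Yield consecutive pages (slices) of the point list."""
    for i in range(0, len(points), page_size):
        yield points[i:i + page_size]

def nested_block_loop_join(R, S, eps, page_size):
    """Join R and S, scanning S once per page of R. Returns (pairs, S scans)."""
    result, s_scans = [], 0
    for r_page in pages(R, page_size):
        s_scans += 1                      # one full pass over S per R page
        for s in S:
            for r in r_page:
                if math.dist(r, s) <= eps:
                    result.append((r, s))
    return result, s_scans

R = [(float(i), 0.0) for i in range(5)]
S = [(float(i), 0.5) for i in range(5)]
simple, simple_scans = nested_block_loop_join(R, S, 0.6, page_size=1)  # 5 scans
block, block_scans = nested_block_loop_join(R, S, 0.6, page_size=2)    # 3 scans
```

Both variants return the same pairs; only the number of passes over S changes.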

Indexed Nested Loop Join
- For each R-point, determine the matches in S using the index
- With many dimensions and/or many qualifying pairs (large ε), not as competitive as the nested loop join

Spatial Join vs Similarity Join
- Represent each data point as a hypercube of edge length 0.71·ε (≈ ε/√2, the value for two dimensions)
- Map the similarity join w.r.t. ε to a spatial join on the hypercubes
  - If two hypercubes overlap, the corresponding points are within distance ε of each other
  - That is, they are neighbors w.r.t. ε
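
The sketch below checks the overlap argument numerically. It assumes Euclidean distance and cubes centered on their points, neither of which the slide states explicitly. Under those assumptions two cubes of edge length e overlap exactly when the points differ by at most e in every coordinate, so the distance is at most e·√d. Setting e = ε/√d (0.71·ε for d = 2) therefore guarantees a distance of at most ε.

```python
import math

def edge_length(eps, dim):
    """Edge length for which overlapping cubes imply distance <= eps."""
    return eps / math.sqrt(dim)

def cubes_overlap(p, q, edge):
    """Axis-aligned cubes of the given edge, centered on p and q, overlap
    (or touch) iff the centers differ by at most `edge` in every coordinate."""
    return all(abs(a - b) <= edge for a, b in zip(p, q))

eps = 1.0
edge = edge_length(eps, 2)                  # about 0.71 in two dimensions

p, q = (0.0, 0.0), (0.7, 0.7)
overlap_pq = cubes_overlap(p, q, edge)      # cubes overlap, distance about 0.99

u, v = (0.0, 0.0), (0.9, 0.0)
overlap_uv = cubes_overlap(u, v, edge)      # no overlap, yet distance is 0.9
```

The second pair shows that the implication runs in one direction only: two points can be within ε of each other although their cubes do not overlap.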

R-tree Spatial Join (RSJ)
- Assumption: indexes preconstructed on R and S, with equal tree height
    Procedure RSJ(R, S: page)
      for each r ∈ R.children do
        for each s ∈ S.children do
          if (r ∩ s ≠ ∅) then RSJ(r, s);
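
A runnable sketch of the RSJ recursion over a toy in-memory tree. The `Node` class, the rectangle representation, and the base case that reports intersecting leaf entries are assumptions added here; the pseudocode on the slide leaves the base case implicit.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    lo: tuple                                      # lower corner of the rectangle
    hi: tuple                                      # upper corner of the rectangle
    children: list = field(default_factory=list)   # empty for a leaf entry

def intersects(a, b):
    """True if the two axis-aligned rectangles share at least one point."""
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(a.lo, a.hi, b.lo, b.hi))

def rsj(R, S, out):
    """Descend both trees in lockstep, following only intersecting pairs."""
    for r in R.children:
        for s in S.children:
            if intersects(r, s):
                if r.children and s.children:
                    rsj(r, s, out)        # both are inner nodes: recurse
                else:
                    out.append((r, s))    # assumed base case: leaf entries

a1 = Node((0.0, 0.0), (1.0, 1.0))
a2 = Node((1.5, 1.5), (2.0, 2.0))
b1 = Node((5.0, 5.0), (6.0, 6.0))
c1 = Node((0.5, 0.5), (1.2, 1.2))
c2 = Node((2.5, 2.5), (3.0, 3.0))
r_root = Node((0.0, 0.0), (6.0, 6.0),
              [Node((0.0, 0.0), (2.0, 2.0), [a1, a2]),
               Node((5.0, 5.0), (6.0, 6.0), [b1])])
s_root = Node((0.5, 0.5), (3.0, 3.0),
              [Node((0.5, 0.5), (3.0, 3.0), [c1, c2])])
pairs = []
rsj(r_root, s_root, pairs)    # only a1 and c1 intersect
```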

Adapt RSJ for Similarity Join
- Distance predicate rather than intersection
- mindist(R, S) computes the least possible distance between a point in R and a point in S
    Procedure RsimJ(R, S, ε)
      if IsDirPg(R) ∧ IsDirPg(S) then
        for each r ∈ R.children do
          for each s ∈ S.children do
            if mindist(r, s) ≤ ε then RsimJ(r, s, ε);  /* recursive */
      else  /* R & S are data pages */
        for each p ∈ R.points do
          for each q ∈ S.points do
            if d(p, q) ≤ ε then output(p, q);
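
The procedure translates almost line by line into Python. In this sketch the `Page` class, the bounding-box form of page regions, and the Euclidean `mindist` are assumptions; the slides do not fix how page regions are represented.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Page:
    lo: tuple                                      # lower corner of the page region
    hi: tuple                                      # upper corner of the page region
    children: list = field(default_factory=list)   # set for directory pages
    points: list = field(default_factory=list)     # set for data pages

def mindist(a, b):
    """Smallest possible distance between a point in box a and one in box b."""
    gaps = (max(bl - ah, al - bh, 0.0)
            for al, ah, bl, bh in zip(a.lo, a.hi, b.lo, b.hi))
    return math.sqrt(sum(g * g for g in gaps))

def rsimj(R, S, eps, out):
    if R.children and S.children:              # both are directory pages
        for r in R.children:
            for s in S.children:
                if mindist(r, s) <= eps:       # prune pairs that are too far apart
                    rsimj(r, s, eps, out)
    else:                                      # both are data pages
        for p in R.points:
            for q in S.points:
                if math.dist(p, q) <= eps:
                    out.append((p, q))

r_root = Page((0.0, 0.0), (11.0, 11.0), children=[
    Page((0.0, 0.0), (1.0, 1.0), points=[(0.0, 0.0), (1.0, 1.0)]),
    Page((10.0, 10.0), (11.0, 11.0), points=[(10.0, 10.0)])])
s_root = Page((1.5, 1.0), (21.0, 21.0), children=[
    Page((1.5, 1.0), (2.0, 2.0), points=[(1.5, 1.0), (2.0, 2.0)]),
    Page((20.0, 20.0), (21.0, 21.0), points=[(20.0, 20.0)])])
result = []
rsimj(r_root, s_root, 0.6, result)
```

Only one of the four page pairs passes the `mindist` test here, so three data-page comparisons are skipped entirely.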

Performance Issues in R-tree Join
- Cost is dominated by point-distance computations (CPU-bound)
- Random page accesses can make it perform worse than the nested block loop join

Parallel Similarity Join
- A task corresponds to a pair of tree nodes (data pages or directory pages)
- Various task assignment strategies
  - Round robin
  - Static range assignment
  - Dynamic task assignment to achieve load balancing
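
The effect of the three strategies on load balance can be simulated with task costs alone. This is a toy model under stated assumptions: each task (node pair) has a known cost, "static range" is read as giving each worker a contiguous range of the task list, and "dynamic" gives the next task to whichever worker is free first. The slides do not spell out these details.

```python
import heapq

def round_robin(costs, n_workers):
    """Task i goes to worker i mod n_workers."""
    loads = [0] * n_workers
    for i, c in enumerate(costs):
        loads[i % n_workers] += c
    return loads

def static_range(costs, n_workers):
    """Each worker gets one contiguous range of the task list."""
    chunk = -(-len(costs) // n_workers)          # ceiling division
    return [sum(costs[w * chunk:(w + 1) * chunk]) for w in range(n_workers)]

def dynamic(costs, n_workers):
    """The next task goes to the worker with the least accumulated load."""
    heap = [(0, w) for w in range(n_workers)]
    for c in costs:
        load, w = heapq.heappop(heap)
        heapq.heappush(heap, (load + c, w))
    return [load for load, _ in sorted(heap, key=lambda t: t[1])]

costs = [8, 1, 8, 1, 8, 1]                       # made-up task costs
rr = round_robin(costs, 2)                       # [24, 3]
sr = static_range(costs, 2)                      # [17, 10]
dy = dynamic(costs, 2)                           # [17, 10]
```

The join finishes when the most heavily loaded worker finishes, so the maximum of each list is the figure to compare.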

Breadth-First R-tree Join
- Shortcomings of RsimJ
  - Depth-first traversal is sequential in nature
  - No strategy for improving locality in the inner loop, resulting in an inefficient page access pattern
- Solution
  - Proceed level by level (i.e., breadth-first traversal)
  - Determine all relevant pairs for the next level
  - Access these relevant pairs in the order of their physical locations in storage
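
A level-by-level variant can be sketched as follows. The `addr` field stands in for a page's physical location and is hypothetical, as are the `Page` class and the bounding-box `mindist`; equal tree heights are assumed, as in RSJ.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Page:
    addr: int                                      # hypothetical position on disk
    lo: tuple
    hi: tuple
    children: list = field(default_factory=list)   # set for directory pages
    points: list = field(default_factory=list)     # set for data pages

def mindist(a, b):
    gaps = (max(bl - ah, al - bh, 0.0)
            for al, ah, bl, bh in zip(a.lo, a.hi, b.lo, b.hi))
    return math.sqrt(sum(g * g for g in gaps))

def by_address(pair):
    return (pair[0].addr, pair[1].addr)

def breadth_first_join(r_root, s_root, eps):
    level = [(r_root, s_root)]
    while level and level[0][0].children:          # still at a directory level
        level.sort(key=by_address)                 # visit pairs in storage order
        level = [(r, s) for R, S in level          # all relevant pairs, next level
                 for r in R.children for s in S.children
                 if mindist(r, s) <= eps]
    level.sort(key=by_address)
    return [(p, q) for R, S in level
            for p in R.points for q in S.points
            if math.dist(p, q) <= eps]

r_root = Page(0, (0.0, 0.0), (11.0, 11.0), children=[
    Page(1, (0.0, 0.0), (1.0, 1.0), points=[(0.0, 0.0), (1.0, 1.0)]),
    Page(2, (10.0, 10.0), (11.0, 11.0), points=[(10.0, 10.0)])])
s_root = Page(0, (1.5, 1.0), (21.0, 21.0), children=[
    Page(3, (1.5, 1.0), (2.0, 2.0), points=[(1.5, 1.0), (2.0, 2.0)]),
    Page(4, (20.0, 20.0), (21.0, 21.0), points=[(20.0, 20.0)])])
result = breadth_first_join(r_root, s_root, 0.6)
```

Because the full list of pairs for a level is known before any of them is read, it can be sorted by address first, which is what the depth-first recursion cannot do.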

Reducing Random Access in Breadth-First Traversal
- Space is regularly tiled, and a space-filling curve (e.g., the Hilbert curve) is defined over the tiles
- Store the index tree level by level
- Within each level, store the tree nodes in their space-filling-curve order
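
For a two-dimensional grid of n × n tiles (n a power of two), a tile's position along the Hilbert curve can be computed with the widely used iterative bit-manipulation method sketched below. The function name and the 4 × 4 example are illustrative; the slides do not say which formulation the authors use.

```python
def hilbert_index(n, x, y):
    """Position of tile (x, y) along the Hilbert curve on an n x n grid."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate the quadrant so the curve connects
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

n = 4
tiles = [(x, y) for x in range(n) for y in range(n)]
storage_order = sorted(tiles, key=lambda t: hilbert_index(n, *t))
```

Consecutive tiles in `storage_order` are always grid neighbors, so nodes that are close in space tend to be stored close together.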

Without Preconstructed Index (1)
- Tree construction time is often much less than join time, so it can be amortized during the join
- Indexes can be constructed temporarily for the join
- Techniques include the Hilbert R-tree and the ε-kdB tree
  - Hilbert R-tree: sort the points by space-filling-curve (SFC) value and pack adjacent points into pages
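
The sort-and-pack step can be sketched as below. For brevity this example orders points by Z-order (Morton code) rather than by the Hilbert curve, so it is a stand-in for the technique and not the Hilbert R-tree itself; any other curve can be supplied through the `key` parameter. Integer grid coordinates and all names are assumptions.

```python
def z_order(x, y, bits=16):
    """Morton code: interleave the bits of x and y."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return code

def pack_leaves(points, page_size, key=z_order):
    """Sort points along the curve and cut the sequence into leaf pages.
    Each leaf is (lower corner, upper corner, points)."""
    ordered = sorted(points, key=lambda p: key(*p))
    leaves = []
    for i in range(0, len(ordered), page_size):
        chunk = ordered[i:i + page_size]
        lo = tuple(min(c) for c in zip(*chunk))
        hi = tuple(max(c) for c in zip(*chunk))
        leaves.append((lo, hi, chunk))
    return leaves

points = [(0, 0), (3, 3), (1, 0), (2, 3), (0, 1), (3, 2)]
leaves = pack_leaves(points, page_size=3)
```

Upper tree levels can be built the same way by packing the leaf rectangles, which is why construction is cheap compared with inserting points one at a time.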

Without Preconstructed Index (2)
ε-kdB tree:
- Space is partitioned into grid cells with grid line distance ε
- The tree structure is specific to the given ε and must be constructed for each join
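
Why a grid line distance of ε is useful can be shown with a flat hash grid. This is a simplification and not the ε-kdB tree itself: points are bucketed into cells of width ε, and because any two points within ε of each other differ by at most one cell index per dimension, each point only has to be compared with the same and the adjacent cells. Euclidean distance and all names are assumptions.

```python
import math
from collections import defaultdict
from itertools import product

def cell_of(p, eps):
    """Index of the grid cell (of width eps) that contains point p."""
    return tuple(math.floor(c / eps) for c in p)

def grid_join(R, S, eps):
    """Similarity join that only compares points in the same or adjacent cells."""
    cells = defaultdict(list)
    for s in S:
        cells[cell_of(s, eps)].append(s)
    result = []
    for r in R:
        home = cell_of(r, eps)
        for offset in product((-1, 0, 1), repeat=len(r)):
            neighbor = tuple(h + o for h, o in zip(home, offset))
            for s in cells.get(neighbor, ()):
                if math.dist(r, s) <= eps:
                    result.append((r, s))
    return result

R = [(0.0, 0.0), (2.0, 2.0)]
S = [(0.0, 1.0), (2.0, 3.0), (5.0, 5.0)]
pairs = grid_join(R, S, 1.0)
```

The grid has to be rebuilt when ε changes, which mirrors the point above that the structure is specific to the given ε.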