IMinMax B.C. Ooi, K.-L Tan, C. Yu, S. Stephen. Indexing the Edges -- A Simple and Yet Efficient Approach to High dimensional Indexing. ACM SIGMOD-SIGACT-

Presentation transcript:

iMinMax
B. C. Ooi, K.-L. Tan, C. Yu, S. Stephen. Indexing the Edges -- A Simple and Yet Efficient Approach to High-Dimensional Indexing. ACM SIGMOD-SIGACT-SIGART 19th Symposium on Principles of Database Systems (PODS), 2000.

Query Requirement
Window/range query: retrieve the data points that fall within a given range along each dimension.
The index is designed to support range retrieval and to facilitate joins and similarity search (if applicable).

Strategies
Increase fan-out by increasing the node size, as in X-trees. Does this reduce indexing to a semi-sequential scan?
Increase fan-out by using approximations in the nodes, as in A-trees. Updates are expensive.
Map high-dimensional points to single-dimensional points using space-filling curves.
– The transformation is expensive.
– The number of sub-queries generated can be very large.
Sequential scan needs to search the whole data file: the affected volume is 100%.

Pyramid Technique
The Pyramid Technique leads to 2d sub-queries on a B+-tree.
(Figure: 2-dimensional example of the pyramids.)

VA-file (Vector Approximation file)
Map objects to approximate bit strings, e.g. (0.1, 0.8) → (00, 11).
Basic idea: divide the data space into 2^b cells, each represented by a bit string.
Scan the whole signature file sequentially.
Weakness: the precision of the data is lost, which affects the size of the query window. E.g. (0.2, 0.8) → (00, 11), which is the same approximation as for (0.0, 1.0).
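The quantization step can be sketched as follows. This is a minimal illustration, not the VA-file implementation: it assumes a unit data space, the same number of bits for every dimension, and equal-width cells. The function name `va_approximation` and the parameter `bits_per_dim` are hypothetical.

```python
def va_approximation(point, bits_per_dim=2):
    """Quantize each coordinate in [0, 1] to a fixed-width bit string.

    Assumption: equal-width cells and the same bit budget per dimension.
    """
    cells = 1 << bits_per_dim  # number of cells along one dimension
    codes = []
    for value in point:
        # Clamp so that the value 1.0 falls into the last cell.
        cell = min(int(value * cells), cells - 1)
        codes.append(format(cell, f"0{bits_per_dim}b"))
    return tuple(codes)


print(va_approximation((0.1, 0.8)))
print(va_approximation((0.2, 0.8)))
print(va_approximation((0.0, 1.0)))
```

All three points receive the same approximation, which is the loss of precision the slide refers to: a query window has to be widened to whole cells.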

iMinMax -- Basic Concept
'Edge': the maximum or minimum attribute of a data point, whichever is closer to the edge of the data space than the other attributes.
A data point whose 'edge' is not included in the query range is not an answer.
Consider the unit data space ([0,1], [0,1], …, [0,1]). E.g. point A (0.1, 0.6) has edge = 0.1; point B (0.6, 0.9) has edge = 0.9.
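A small sketch of the edge selection for a point in the unit data space. The function name `edge` is hypothetical, and the tie-breaking rule (prefer the maximum when both are equally close) is an assumption, since the slide does not state one.

```python
def edge(point):
    """Return the attribute of `point` that lies closest to a boundary
    of the unit data space [0, 1]^d."""
    x_min, x_max = min(point), max(point)
    # x_min is compared against the lower boundary 0,
    # x_max against the upper boundary 1.
    # Assumption: on a tie the maximum attribute is chosen.
    return x_min if x_min < 1.0 - x_max else x_max


print(edge((0.1, 0.6)))  # point A from the slide
print(edge((0.6, 0.9)))  # point B from the slide
```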

Basic Concept
Indexing points using one of their attributes.
(Figure labels: d, d+1.)

The probability of finding an attribute with a very large (or very small) value increases with dimensionality.
E.g. under a uniform distribution on [0..1], the probability that at least one attribute x_i exceeds 0.9 is 1 - 0.9^2 = 0.19 in 2-dimensional space and 1 - 0.9^30 ≈ 0.958 in 30-dimensional space.
However, not all queries search for large values.
Use this fact to "prune" away data points that are of no interest!
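The 0.19 figure follows from assuming independent, uniformly distributed attributes: the chance that no attribute exceeds a threshold t is t^d, so the chance that at least one does is 1 - t^d. A short sketch (the function name is hypothetical) makes the growth with dimensionality visible:

```python
def prob_some_attribute_exceeds(threshold, dims):
    """P(at least one of `dims` independent uniform [0, 1] attributes
    is greater than `threshold`)."""
    return 1.0 - threshold ** dims


for d in (2, 10, 30):
    print(d, round(prob_some_attribute_exceeds(0.9, d), 3))
```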

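iMax or iMin
Use the max. or min. attribute to build the index.
At most d sub-queries are needed.
The algorithm is very simple.
Data point: x = (x_0, ..., x_{d-1}); range query: q = ([a_0, b_0], ..., [a_{d-1}, b_{d-1}]), with dimensions numbered from 0.
E.g. iMax: the key is d_max + x_max, where x_max is the largest attribute and d_max is its dimension; the sub-query on dimension i is [i + max_j a_j, i + b_i].
* Similar arguments apply for iMin.
(Figure: 2-dimensional data space with axes x and y, split into the regions x > y and y > x.)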

iMinMax -- Examples
Points A(0.2, 0.5) and B(0.87, 0.25) in 2-dimensional space. Query: ([0.26, 0.75], [0.13, 0.6]).
Sub-queries:
iMax:     [0.26, 0.75], [1.26, 1.6]
iMin:     [0.26, 0.6],  [1.13, 1.6]
iMinMax:  [0.26, 0.75], [1.13, 1.6]
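The sub-query intervals in this example can be reproduced with the sketch below. The rules are inferred from the numbers on the slide rather than quoted from it, so treat them as an assumption: dimensions are numbered from 0, iMax raises every lower bound to the largest lower bound of the query, and iMin lowers every upper bound to the smallest upper bound. All function names are hypothetical.

```python
def imax_subqueries(query):
    """Every answer has all attributes >= their lower bounds, so its
    maximum attribute is at least the largest lower bound."""
    lo = max(a for a, _ in query)
    return [(i + lo, i + b) for i, (_, b) in enumerate(query)]


def imin_subqueries(query):
    """Symmetric argument: the minimum attribute of an answer is at most
    the smallest upper bound."""
    hi = min(b for _, b in query)
    return [(i + a, i + hi) for i, (a, _) in enumerate(query)]


def iminmax_subqueries(query):
    """The key may come from either edge, so each range is kept as given."""
    return [(i + a, i + b) for i, (a, b) in enumerate(query)]


query = [(0.26, 0.75), (0.13, 0.6)]
for name, fn in [("iMax", imax_subqueries),
                 ("iMin", imin_subqueries),
                 ("iMinMax", iminmax_subqueries)]:
    print(name, [(round(lo, 2), round(hi, 2)) for lo, hi in fn(query)])
```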

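iMinMax's Principle
The iMinMax algorithm is still very simple.
iMinMax key: d_min + x_min if the minimum attribute x_min is closer to 0 than the maximum attribute x_max is to 1; otherwise d_max + x_max (d_min and d_max are the dimensions of x_min and x_max, numbered from 0).
Sub-query on dimension i for the query range [a_i, b_i]: [i + a_i, i + b_i].
For a d-dimensional space, at most d sub-queries are needed.
The union of the answers of all sub-queries yields the total answer.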

B+-tree
Operations:
Range search: a query is transformed into d sub-queries, and a normal B+-tree range search is performed for each.
Point search: an attribute is selected based on the iMinMax criteria, and an exact-match search is performed.
Update: similar to that of B+-trees.
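The range-search operation can be sketched end to end. This is an illustration under stated assumptions, not the authors' implementation: a sorted Python list searched with `bisect` stands in for the B+-tree, dimensions are numbered from 0, and the key rule (minimum edge when it is closer to 0 than the maximum is to 1) is reconstructed from the examples. All function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def iminmax_key(point):
    """Single-dimensional key: dimension number plus the edge attribute."""
    x_min, x_max = min(point), max(point)
    if x_min < 1.0 - x_max:
        return point.index(x_min) + x_min
    return point.index(x_max) + x_max


def build_index(points):
    """Sorted (key, point) pairs; a stand-in for the B+-tree leaf level."""
    return sorted((iminmax_key(p), p) for p in points)


def range_search(index, query):
    """Run one key-range scan per dimension and filter out false drops."""
    keys = [k for k, _ in index]
    seen, answers = set(), []
    for i, (a, b) in enumerate(query):
        start = bisect_left(keys, i + a)
        stop = bisect_right(keys, i + b)
        for pos in range(start, stop):
            point = index[pos][1]
            inside = all(lo <= x <= hi for x, (lo, hi) in zip(point, query))
            if inside and pos not in seen:  # guard against boundary overlap
                seen.add(pos)
                answers.append(point)
    return answers


points = [(0.2, 0.5), (0.87, 0.25), (0.3, 0.5)]
index = build_index(points)
print(range_search(index, [(0.26, 0.75), (0.13, 0.6)]))
```

The candidate check against the full query is what removes false drops: a key inside a sub-query range only guarantees that one attribute matches.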

Data Distributions -- Uniform Distribution
(a) 2-dim. uniform data space
(b) iMinMax keys of 30-dim. uniform data set (showing any 2 dimensions)

Data Distributions -- Normal Skewed Distribution
(a) 2-dim. normal skewed data space (mean = 0.6)
(b) iMinMax keys of 30-dim. normally skewed data set

Data Distributions -- Exponential Distribution
(a) 2-dim. exponential skewed data space
(b) iMinMax keys of 30-dim. exponentially skewed data set

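iMinMax(θ)'s Principle
Introduce θ to tune iMinMax for better performance.
iMinMax(θ) key: d_min + x_min if x_min + θ < 1 - x_max; otherwise d_max + x_max (d_min and d_max are the dimensions of the minimum and maximum attributes, numbered from 0).
E.g. set θ = 0.2, point (0.1, 0.8), query ([0, 0.6], [0.1, 0.7]):
             key   sub-queries            point
iMinMax      0.1   [0, 0.6], [1.1, 1.7]   checked
iMinMax(θ)   1.8   [0, 0.6], [1.1, 1.7]   not checked
The sub-query on dimension i is still [i + a_i, i + b_i] for the query range [a_i, b_i] (independent of θ).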
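The effect of the tuning parameter in this example can be checked with a short sketch. The key rule used here (take the minimum edge when x_min plus the tuning value is smaller than 1 - x_max, otherwise the maximum edge) is an assumption chosen because it reproduces the keys 0.1 and 1.8 on the slide; the function names are hypothetical.

```python
def iminmax_theta_key(point, theta=0.0):
    """Key of `point` under the tuning value `theta`.

    A positive theta pushes more points to their maximum edge, a negative
    one to their minimum edge (assumed rule, see lead-in).
    """
    x_min, x_max = min(point), max(point)
    if x_min + theta < 1.0 - x_max:
        return point.index(x_min) + x_min
    return point.index(x_max) + x_max


def is_checked(key, query):
    """True if the key falls inside any sub-query [i + a_i, i + b_i]."""
    return any(i + a <= key <= i + b for i, (a, b) in enumerate(query))


point = (0.1, 0.8)
query = [(0.0, 0.6), (0.1, 0.7)]
for theta in (0.0, 0.2):
    key = iminmax_theta_key(point, theta)
    print(theta, round(key, 2), is_checked(key, query))
```

The point is not an answer (0.8 lies outside [0.1, 0.7]), so skipping it under the tuned key avoids a false drop.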

Performance Study
Default data distribution: uniform on [0..1]
Default query selectivity: 0.1%
Default data set size: 500K

Performance Study -- iMinMax(θ) on skewed data
             Normal distribution   Exponential distribution
Data size    100K, 500K            500K
Dimension
Query side
θ            -0.3 ~ 0.3 (0.5)      -0.1 ~ 0.4
* The distribution of the query ranges is the same as that of the data set.
* Data sets are skewed to different degrees, so the tuning effects differ.
* The tuning 'knob', θ, enables iMinMax to scatter the skewed data points to reduce false drops.

Performance of iMinMax(θ)
Skewed normal distribution. Performance gain is up to 66%.
(Figure panels: data set size = 100K; data set size = 500K.)

Performance of iMinMax(θ)
Skewed exponential distribution, data size = 500K.