The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation Yasushi Sakurai (NTT Cyber Space Laboratories) Masatoshi Yoshikawa.

Slides:

Advertisements

Similar presentations

When Is Nearest Neighbors Indexable? Uri Shaft (Oracle Corp.) Raghu Ramakrishnan (UW-Madison)

Advertisements

Spatial Indexing SAMs. Spatial Indexing Point Access Methods can index only points. What about regions? Z-ordering and quadtrees Use the transformation.

Multimedia Database Systems

Indexing and Range Queries in Spatio-Temporal Databases

Searching on Multi-Dimensional Data

1 NNH: Improving Performance of Nearest- Neighbor Searches Using Histograms Liang Jin (UC Irvine) Nick Koudas (AT&T Labs Research) Chen Li (UC Irvine)

Similarity Search for Adaptive Ellipsoid Queries Using Spatial Transformation Yasushi Sakurai (NTT Cyber Space Laboratories) Masatoshi Yoshikawa (Nara.

Mario Rodriguez Revollo School of Computer Science, UCSP SlimSS-tree: A New Tree Combined SS- tree With Slim-down Algorithm Lifang Yang, Xianglin Huang,

Image Indexing and Retrieval using Moment Invariants Imran Ahmad School of Computer Science University of Windsor – Canada.

Spatial Mining.

Indexing Network Voronoi Diagrams*

2-dimensional indexing structure

DIMENSIONALITY REDUCTION BY RANDOM PROJECTION AND LATENT SEMANTIC INDEXING Jessica Lin and Dimitrios Gunopulos Ângelo Cardoso IST/UTL December

Multiple-key indexes Index on one attribute provides pointer to an index on the other. If V is a value of the first attribute, then the index we reach.

Liang Jin (UC Irvine) Nick Koudas (AT&T) Chen Li (UC Irvine)

Visual Querying By Color Perceptive Regions Alberto del Bimbo, M. Mugnaini, P. Pala, and F. Turco University of Florence, Italy Pattern Recognition, 1998.

1 R-Trees for Spatial Indexing Yanlei Diao UMass Amherst Feb 27, 2007 Some Slide Content Courtesy of J.M. Hellerstein.

Chapter 3: Data Storage and Access Methods

Techniques and Data Structures for Efficient Multimedia Similarity Search.

Spatio-Temporal Databases. Introduction Spatiotemporal Databases: manage spatial data whose geometry changes over time Geometry: position and/or extent.

B + -Trees (Part 1) Lecture 20 COMP171 Fall 2006.

B + -Trees (Part 1) COMP171. Slide 2 Main and secondary memories  Secondary storage device is much, much slower than the main RAM  Pages and blocks.

R-Trees 2-dimensional indexing structure. R-trees 2-dimensional version of the B-tree: B-tree of maximum degree 8; degree between 3 and 8 Internal nodes.

Spatial Indexing I Point Access Methods. Spatial Indexing Point Access Methods (PAMs) vs Spatial Access Methods (SAMs) PAM: index only point data Hierarchical.

Spatio-Temporal Databases. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases …..

1 Overview of Storage and Indexing Chapter 8 1. Basics about file management 2. Introduction to indexing 3. First glimpse at indices and workloads.

Storage and Indexing February 26 th, 2003 Lecture 19.

Fast Subsequence Matching in Time-Series Databases Christos Faloutsos M. Ranganathan Yannis Manolopoulos Department of Computer Science and ISR University.

R-Trees: A Dynamic Index Structure for Spatial Data Antonin Guttman.

Indexing structures for files D ƯƠ NG ANH KHOA-QLU13082.

What Is the Most Efficient Way to Select Nearest Neighbor Candidates for Fast Approximate Nearest Neighbor Search? Masakazu Iwamura, Tomokazu Sato and.

AAU A Trajectory Splitting Model for Efficient Spatio-Temporal Indexing Presented by YuQing Zhang  Slobodan Rasetic Jorg Sander James Elding Mario A.

Join-Queries between two Spatial Datasets Indexed by a Single R*-tree Join-Queries between two Spatial Datasets Indexed by a Single R*-tree Michael Vassilakopoulos.

The X-Tree An Index Structure for High Dimensional Data Stefan Berchtold, Daniel A Keim, Hans Peter Kriegel Institute of Computer Science Munich, Germany.

PMLAB Finding Similar Image Quickly Using Object Shapes Heng Tao Shen Dept. of Computer Science National University of Singapore Presented by Chin-Yi Tsai.

R ++ -tree: an efficient spatial access method for highly redundant point data Martin Šumák, Peter Gurský University of P. J. Šafárik in Košice.

Introduction to The NSP-Tree: A Space-Partitioning Based Indexing Method Gang Qian University of Central Oklahoma November 2006.

A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.

A Query Adaptive Data Structure for Efficient Indexing of Time Series Databases Presented by Stavros Papadopoulos.

1 Overview of Storage and Indexing Chapter 8 (part 1)

Fast Subsequence Matching in Time-Series Databases Author: Christos Faloutsos etc. Speaker: Weijun He.

Nearest Neighbor Queries Chris Buzzerd, Dave Boerner, and Kevin Stewart.

Indexing and hashing Azita Keshmiri CS 157B. Basic concept An index for a file in a database system works the same way as the index in text book. For.

Exact indexing of Dynamic Time Warping

Marwan Al-Namari Hassan Al-Mathami. Indexing What is Indexing? Indexing is a mechanisms. Why we need to use Indexing? We used indexing to speed up access.

CS848 Similarity Search in Multimedia Databases Dr. Gisli Hjaltason Content-based Retrieval Using Local Descriptors: Problems and Issues from Databases.

R-Trees: A Dynamic Index Structure For Spatial Searching Antonin Guttman.

Database Management Systems, R. Ramakrishnan 1 Algorithms for clustering large datasets in arbitrary metric spaces.

Database Seminar The Gauss-Tree: Efficient Object Identification in Databases of Probabilistic Feature Vectors Authors : Christian Bohm, Alexey Pryakhin,

R-trees: An Average Case Analysis. R-trees - performance analysis How many disk (=node) accesses we ’ ll need for range nn spatial joins why does it matter?

1 CSIS 7101: CSIS 7101: Spatial Data (Part 1) The R*-tree ： An Efficient and Robust Access Method for Points and Rectangles Rollo Chan Chu Chung Man Mak.

Storage and Indexing. How do we store efficiently large amounts of data? The appropriate storage depends on what kind of accesses we expect to have to.

Indexing. 421: Database Systems - Index Structures 2 Cost Model for Data Access q Data should be stored such that it can be accessed fast q Evaluation.

A Spatial Index Structure for High Dimensional Point Data Wei Wang, Jiong Yang, and Richard Muntz Data Mining Lab Department of Computer Science University.

Rethinking Choices for Multi-dimensional Point Indexing You Jung Kim and Jignesh M. Patel University of Michigan.

Chapter 11 Indexing And Hashing (1) Yonsei University 1 st Semester, 2016 Sanghyun Park.

1 Spatial Query Processing using the R-tree Donghui Zhang CCIS, Northeastern University Feb 8, 2005.

Spatial Data Management

CS522 Advanced database Systems

Record Storage, File Organization, and Indexes

Azita Keshmiri CS 157B Ch 12 indexing and hashing

Indexing ? Why ? Need to locate the actual records on disk without having to read the entire table into memory.

Spatial Indexing I Point Access Methods.

KD Tree A binary search tree where every node is a

Spatio-Temporal Databases

The BIRCH Algorithm Davitkov Miroslav, 2011/3116

Storage and Indexing.

R-trees: An Average Case Analysis

Liang Jin (UC Irvine) Nick Koudas (AT&T Labs Research)

Donghui Zhang, Tian Xia Northeastern University

Presentation transcript:

The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation Yasushi Sakurai (NTT Cyber Space Laboratories) Masatoshi Yoshikawa (Nara Institute of Science and Technology) Shunsuke Uemura (Nara Institute of Science and Technology) Haruhiko Kojima (NTT Cyber Solutions Laboratories)

Introduction n Demand –High-performance multimedia database systems –Content-based retrieval with high speed and accuracy n Multimedia databases –Large size –Various features, high-dimensional data n More efficient spatial indices for high- dimensional data

Our Approach n VA-File and SR-tree are excellent search methods for high-dimensional data. n Comparisons of them motivated the concept of the A-tree. –No comparisons of them have been reported. –We performed experiments using various data sets n Approximation tree (A-tree) –Relative approximation: MBRs and data objects are approximated based on their parent MBR. –About 77% reduction in the number of page accesses compared with VA-File and SR-tree

Related Work (1) R5R3R4R8R6R7 R1R2 Non-leaf Node Leaf Node n R-tree family –Tree structure using MBRs (Minimum Bounding Rectangles) and/or MBSs (Minimum Bounding Spheres) –SR-tree: Structured by both MBRs and MBSs Outperforms SS-tree and R*-tree for 16-dimensional data R1 R2 R3 R4 R5 R6 R7 R8

Related Work (2) n VA-File (Vector Approximation File) –Use approximation file and vector file 1. Divide the entire data space into cells 2. Approximate vector data by using the cells, then create the approximation file 3. Select candidate vectors by scanning the approximation file 4. Access to the candidate vectors in the vector file –Better than X-tree and R*-tree beyond dimensionality of ApproximationVector Data

Experimental Results and Analysis n Structure suitable for non-uniformly distributed data –Structure changes according to data distribution. n Large entry size for high-dimensional spaces –Large entries small fanout many node accesses n Changing node size and fanout –Larger node size does NOT lead to low IO cost. –Larger fanout always contributes to the reduction in node accesses. n MBS contribution –The contribution of MBSs in node pruning is small in high- dimensional spaces. --- Properties of the SR-tree ---

Experimental Results and Analysis n Data skew degenerates search performance. –Absolute approximation: the approximation is independent of data distribution. –Effective for uniformly distributed data –Unsuitable for non-uniformly distributed data A large amount of dense data tends to be approximated by the same value. Absolute approximation leads to large approximation errors. --- Properties of the VA-File ---

The A-tree (Approximation tree) n Ideas from the SR-tree and VA-File comparison: –Tree structure Tree structures are suitable for non-uniformly distributed data. –Relative approximation MBRs and data objects are approximated based on their parent bounding rectangle. Small approximation error Small entry size and large fanout low IO cost –Partial usage of MBSs in high-dimensional searches MBSs are not stored in the A-tree. The centroid of data objects in a subtree is used only for update.

(28, 20) (28, 4)(4, 4) (4, 20) Rectangle A Rectangle B VBR C (22, 16) (22, 10) (10, 10) (10, 16) (11, 15) (11, 11)(21, 11) (21, 15) Virtual Bounding Rectangle (VBR) n C approximates a rectangle B. n C is calculated from rectangles A and B. n Search using VBRs guarantees the same result as that of MBRs.

i-th dimensional coordinate axis Edge of rectangle A Edge of rectangle B Subspace Code n Subspace code represents a VBR. n The edge of child MBR B is quantized in relation to the edge of parent MBR A. n The edge of B is approximated as a pair of 8- ary codes (1, 2) or binary codes (001, 010).

Rectangle B VBR C Subspace Code n C is the VBR of B in A n C is represented by the subspace codes: S = (010, 011, 101, 101) Rectangle A

The A-tree Structure n Relative approximation: –MBRs and data objects in child nodes are approximated based on parent MBR. n Configuration –One node contains partial information of rectangles in two consecutive generations. M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) CD1CD2 CD3CD4 V2 V1 M1 M2 V4 M4 V3 M3 P2 P1 C1 C2 R (Entire space)

The A-tree Structure P1 and P2: data objects, M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) CD1CD2 CD3CD4 V2 V1 M1 M2 V4 M4 V3 M3 P2 P1 C1 C2 R (Entire space)

The A-tree Structure P1 and P2: data objects, M1 -- M4: MBRs M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) CD1CD2 CD3CD4 V2 V1 M1 M2 V4 M4 V3 M3 P2 P1 C1 C2 R (Entire space)

The A-tree Structure P1 and P2: data objects, M1 -- M4: MBRs SC(V1) -- SC(V4): subspace codes of VBRs for the MBRs M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) CD1CD2 CD3CD4 V2 V1 M1 M2 V4 M4 V3 M3 P2 P1 C1 C2 R (Entire space)

The A-tree Structure P1 and P2: data objects, M1 -- M4: MBRs SC(V1) -- SC(V4): subspace codes of VBRs for the MBRs M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) CD1CD2 CD3CD4 V2 V1 M1 M2 V4 M4 V3 M3 P2 P1 C1 C2 R (Entire space)

The A-tree Structure P1 and P2: data objects, M1 -- M4: MBRs SC(V1) -- SC(V4): subspace codes of VBRs for the MBRs M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) CD1CD2 CD3CD4 V2 V1 M1 M2 V4 M4 V3 M3 P2 P1 C1 C2 R (Entire space)

The A-tree Structure P1 and P2: data objects, M1 -- M4: MBRs SC(V1) -- SC(V4): subspace codes of VBRs for the MBRs M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) CD1CD2 CD3CD4 V2 V1 M1 M2 V4 M4 V3 M3 P2 P1 C1 C2 R (Entire space)

The A-tree Structure P1 and P2: data objects, M1 -- M4: MBRs SC(V1) -- SC(V4): subspace codes of VBRs for the MBRs SC(C1) and SC(C2): subspace codes of VBRs for the data objects M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) CD1CD2 CD3CD4 V2 V1 M1 M2 V4 M4 V3 M3 P2 P1 C1 C2 R (Entire space)

The A-tree Structure P1 and P2: data objects, M1 -- M4: MBRs SC(V1) -- SC(V4): subspace codes of VBRs for the MBRs SC(C1) and SC(C2): subspace codes of VBRs for the data objects M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) CD1CD2 CD3CD4 V2 V1 M1 M2 V4 M4 V3 M3 P2 P1 C1 C2 R (Entire space)

The A-tree Structure P1 and P2: data objects, M1 -- M4: MBRs SC(V1) -- SC(V4): subspace codes of VBRs for the MBRs SC(C1) and SC(C2): subspace codes of VBRs for the data objects CD1 -- CD4: centroid of the data objects in the subtree M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) CD1CD2 CD3CD4 V2 V1 M1 M2 V4 M4 V3 M3 P2 P1 C1 C2 R (Entire space)

The A-tree Structure M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) CD1CD2 CD3CD4 Index nodes Data nodes n Data nodes n Index nodes –leaf nodes –intermediate nodes –root node

The A-tree Structure M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) CD1CD2 CD3CD4 Index nodes Data nodes n Data node –data objects –pointers to the data description records Data node

The A-tree Structure M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) CD1CD2 CD3CD4 Index nodes Data nodes n Leaf node –an MBR –a pointer to the data node –subspace codes of VBRs Leaf nodes

The A-tree Structure M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) CD1CD2 CD3CD4 Index nodes Data nodes n Intermediate node –an MBR –a list of entries a pointer to the child node the subspace code of a VBR the centroid of data objects in the subtree the number of the data objects Intermediate nodes

The A-tree Structure M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) CD1CD2 CD3CD4 Index nodes Data nodes n Root node: –a list of entries a pointer to the child node the subspace code of a VBR the centroid of data objects in the subtree the number of the data objects Root node

Search Algorithm n Basic ideas: –VBRs are calculated from parent MBR and the subspace codes. –Exception: the entire space is used in the root node. –The algorithm uses calculated VBRs for pruning. M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) V2 V1 M1 M2 V4 M4 V3 M3 P2 P1 C1 C2 Root node R (Entire space)

Search Algorithm n Calculate V1 and V2 from R, SC(V1) and SC(V2) M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) V2 V1 M1 M2 V4 M4 V3 M3 P2 P1 C1 C2 R (Entire space) Query point

Search Algorithm n Calculate V1 and V2 from R, SC(V1) and SC(V2) n Calculate V3 and V4 from M1, SC(V3) and SC(V4) M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) V2 V1 M1 M2 V4 M4 V3 M3 P2 P1 C1 C2 R (Entire space) Query point

Search Algorithm n Calculate V1 and V2 from R, SC(V1) and SC(V2) n Calculate V3 and V4 from M1, SC(V3) and SC(V4) n Calculate C1 and C2 from M3, SC(C1) and SC(C2) M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) V2 V1 M1 M2 V4 M4 V3 M3 P2 P1 C1 C2 R (Entire space) Query point

Search Algorithm n Calculate V1 and V2 from R, SC(V1) and SC(V2) n Calculate V3 and V4 from M1, SC(V3) and SC(V4) n Calculate C1 and C2 from M3, SC(C1) and SC(C2) n Access to P1 M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) V2 V1 M1 M2 V4 M4 V3 M3 P2 P1 C1 C2 R (Entire space) Query point

Update Algorithm n Basic idea: –Based on the update algorithm of the SR-tree, but: –Needs to update subspace codes CD3M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) CD4 CD1CD2 P3 SC(C3)

Code Calculation Parent MBR VBRs

Code Calculation Inserted point Parent MBR VBRs n If parent MBR does not change, calculate the subspace code for the inserted data object.

Code Calculation Inserted point Parent MBR VBRs Inserted point n If parent MBR does not change, calculate the subspace code for the inserted data object. n If parent MBR changes, calculate all subspace codes

n Update data node and leaf node –Insert a new data object P3 –Update M3 Update Algorithm CD3M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) CD4 CD1CD2 P3 SC(C3)

n Update data node and leaf node –Insert a new data object P3 –Update M3 –If M3 does not change, calculate SC(C3). Update Algorithm CD3M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) CD4 CD1CD2 P3 SC(C3)

n Update data node and leaf node –Insert a new data object P3 –Update M3 –If M3 does not change, calculate SC(C3). –If M3 changes, calculate SC(C1), SC(C2) and SC(C3). Update Algorithm CD3M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) CD4 CD1CD2 P3 SC(C3)

n Update intermediate node –If M3 changes, update M1. –If M3 changes but M1 does not change, calculate SC(V3). –If M1 changes, calculate SC(V3), SC(V4). –Calculate CD3 Update Algorithm CD3M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) CD4 CD1CD2 P3 SC(C3)

n Update root node –If M1 changes, calculate SC(V1) –Calculate CD1 Update Algorithm CD3M1 SC(V3)SC(V4) M2 M4M3 SC(C1) P1P2 SC(V1)SC(V2) SC(C2) CD4 CD1CD2 P3 SC(C3)

Performance Test n Data sets: real data set (hue histogram image data), uniformly distributed data set, cluster data set. n Data size: 100,000 n Dimension: varies from 4 to 64 n Page size: 8 KB n 20-nearest neighbor queries n Evaluation is based on the average for 1,000 insertion or query points. n CPU: 296 MHz n Code length: –The code length that gave the best performance was chosen. –A-tree: code length varies from 4 to 12. –VA-File: code length varies from 4 to 8 according to [18].

Search Performance n A-tree gives significantly superior performance! n 77% reduction in number of page accesses for 64-dimensional real data n Relative approximation –Small entry size and large fanout low IO cost Real dataUniformly distributed data

Approximation error ε: error of the distance between p and V i during a search p : query point, S : the number of visited VBRs, V i : visited VBRs, M i : the MBRs corresponding to V i n Optimum code length depends on dimensionality and data distribution Influence of Code Length

VA-File/A-tree Comparison n VA-File (absolute approximation) –approximated using the entire space edge length 2 -l n A-tree (relative approximation) –approximated using parent MBR smaller VBR size, fewer object accesses Edge length of VBRs/cellsNumber of data object accesses

CPU-time n CPU-time for real data –Similar to the SR-tree and outperforms the VA-File n VA-File –Calculates the approximated position coordinate for all objects n A-tree –Reducing node accesses leads to low CPU cost.

Insertion and Storage Cost n Increase in the insertion cost is modest. n About 20% less storage cost for 64-dimensional data (1) VBRs need only small storage volumes. (2) The number of index nodes is extremely small. Insertion costStorage cost

n The A-tree offers excellent search performance for high-dimensional data –Relative approximation MBRs and data objects in child nodes are approximated based on parent MBR. –About 77% reduction in the number of page accesses compared with VA-File and SR-tree n Future work –Cost model for finding optimum code length Conclusions

Contribution of MBSs for Pruning n SR-tree contains both MBRs and MBSs but: the frequency of the usage of MBSs decreases as dimensionality increases.