The X-Tree: An Index Structure for High-Dimensional Data. Stefan Berchtold, Daniel A. Keim, Hans-Peter Kriegel. Institute of Computer Science, Munich, Germany.



Outline. Introduction. Problems of R-tree based structures. X-tree structure. X-tree algorithms. Overlap-minimal split. Performance evaluation.

Introduction. Objective: to index point and spatial data in high-dimensional space. Dimensionality: a few tens to hundreds. Spatial data: hyper-rectangles. Application fields: CAD, molecular biology. The X-tree improves upon the R*-tree. Approach: a minimal-overlap split and a directory organization based on supernodes. Performance is better than that of the R*-tree and the TV-tree by up to two orders of magnitude.

Previous work (on high-dimensional data). Reduce dimensionality; there are two basic approaches. First: the data is highly clustered and correlated and occupies only some of the space, so algorithms transform it to a lower dimension and it is indexed using traditional multi-dimensional index structures (e.g., SS-tree). Second: a small number of dimensions contain most of the information (e.g., TV-tree). But the reduced dimensionality may still be too high.

Problem with the R*-tree. Why R*-trees? They handle both point and spatial data, and spatial data is not transformed to point data. However, performance deteriorates rapidly with growing dimension. Detailed evaluations found that overlap in the directory increases rapidly with growing dimensionality: at dimension 5, overlap is already 90%. As overlap increases, query performance decreases.

Defining Overlap. Intuitively: the percentage of volume covered by more than one directory hyper-rectangle. Overlap is defined for an R-tree node that contains n hyper-rectangles {R_1, R_2, ..., R_n}. Overlap corresponds directly to query performance only if query objects are uniformly distributed. The query distribution is estimated by the data distribution. In high-dimensional data, queries and data are clustered.
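The intuitive definition can be tried out on a toy node. The following is a minimal sketch, not the authors' code: the function name `overlap` and the `(lo, hi)` corner-tuple representation of a hyper-rectangle are assumptions made for illustration. It computes the volume covered by two or more rectangles divided by the volume covered by at least one, using coordinate compression. That is exact but visits a number of cells exponential in the dimension, so it only suits small examples.

```python
from itertools import product

def overlap(rects):
    """Fraction of the covered volume that lies in two or more rectangles.

    rects: list of (lo, hi) pairs, each a tuple of per-dimension coordinates
    (a hypothetical representation chosen for this sketch).
    """
    dims = len(rects[0][0])
    # Every rectangle boundary becomes a cut, so each grid cell is either
    # entirely inside or entirely outside any given rectangle.
    cuts = [sorted({r[0][d] for r in rects} | {r[1][d] for r in rects})
            for d in range(dims)]
    covered = multi = 0.0
    for idx in product(*(range(len(c) - 1) for c in cuts)):
        lo = [cuts[d][i] for d, i in enumerate(idx)]
        hi = [cuts[d][i + 1] for d, i in enumerate(idx)]
        vol = 1.0
        for a, b in zip(lo, hi):
            vol *= b - a
        # Number of rectangles that contain this cell.
        count = sum(
            all(r[0][d] <= lo[d] and hi[d] <= r[1][d] for d in range(dims))
            for r in rects
        )
        if count >= 1:
            covered += vol
        if count >= 2:
            multi += vol
    return multi / covered if covered else 0.0

# Two 2x2 squares sharing a 1x1 corner: union area 7, overlapping area 1.
print(overlap([((0, 0), (2, 2)), ((1, 1), (3, 3))]))
```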

Defining Overlap (contd). Weighted overlap: the percentage of data objects that lie in overlapping space. This measure is more accurate.
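Weighted overlap swaps volume for a count of data objects. The sketch below assumes point data and the same hypothetical `(lo, hi)` rectangle representation; the names `contains` and `weighted_overlap` are illustrative, not taken from the slides.

```python
def contains(rect, point):
    """True if the point lies inside the axis-aligned rectangle (inclusive)."""
    lo, hi = rect
    return all(l <= x <= h for l, x, h in zip(lo, point, hi))

def weighted_overlap(rects, points):
    """Share of the covered points that fall in two or more rectangles."""
    in_any = in_multi = 0
    for p in points:
        hits = sum(contains(r, p) for r in rects)
        in_any += hits >= 1
        in_multi += hits >= 2
    return in_multi / in_any if in_any else 0.0

rects = [((0, 0), (2, 2)), ((1, 1), (3, 3))]
points = [(0.5, 0.5), (1.5, 1.5), (2.5, 2.5), (5, 5)]
# Three points are covered; one of them lies in both rectangles.
print(weighted_overlap(rects, points))
```

Because it counts objects rather than space, this measure reflects clustered data: a large overlapping region that holds no data contributes nothing.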

Defining Overlap (contd). Multi-overlap: how many R_i's share the overlapping space?
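One way to look at the multi-overlap question is to count, for each data point, how many directory rectangles cover it. The sketch below is an illustration under assumptions, not the formal definition from the paper: the helper names and the `(lo, hi)` rectangle representation are hypothetical.

```python
from collections import Counter

def covering_count(rects, point):
    """Number of rectangles that contain the point (inclusive bounds)."""
    return sum(
        all(l <= x <= h for l, x, h in zip(lo, point, hi))
        for lo, hi in rects
    )

def coverage_histogram(rects, points):
    """Map 'number of covering rectangles' to 'number of points'."""
    return Counter(covering_count(rects, p) for p in points)

rects = [((0, 0), (2, 2)), ((1, 1), (3, 3))]
points = [(0.5, 0.5), (1.5, 1.5), (2.5, 2.5), (5, 5)]
hist = coverage_histogram(rects, points)
print(sorted(hist.items()))
```

For a point query, every covering rectangle is a subtree that must be searched, so a histogram skewed toward high counts means many paths per query.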

Overlap in the R*-tree. As dimensionality increases, overlap increases, so multiple paths need to be searched for each query.

X-Tree (eXtended node tree). Goal: efficient query processing of high-dimensional point and spatial data. General approach: avoid overlap when splitting, and create supernodes, which are extendible, variable-size directory nodes. Solution: dynamically organize the directory as a hybrid of two directory structure organizations. Low dimension (no or low overlap): hierarchical. High dimension (high overlap): linear. A linear scan is cheaper than traversing multiple paths.

X-Tree Structure. Three types of nodes: directory node, supernode, data node. Split history. Minimal-overlap splits.

X-Tree. Storing supernodes: in memory, with nodes replaced using a priority function. Storage utilization (uniformly distributed data): normal directory nodes 66%, supernodes 88% (m=5). Extreme cases of the X-tree: with no supernodes it behaves like an R-tree and is completely hierarchical (low-dimensional, non-overlapping data); with one large root supernode it is a linear directory (high-dimensional, highly overlapping data).

Algorithm

Insert Algorithm. Determines the combination of hierarchical and linear structure. Tries to find a topological or overlap-minimal split using R*-tree heuristics. If no split is obtained, the current directory node is extended to become a supernode of twice the standard block size. If the current node is already a supernode and is full, an additional block is added to it.
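The overflow handling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `BLOCK_CAPACITY`, `DirectoryNode`, `add_entry`, and the `try_split` callback (which returns two entry lists, or `None` when no acceptable split exists) are hypothetical names, and the real algorithm also maintains MBRs and propagates splits up the tree.

```python
from dataclasses import dataclass, field

BLOCK_CAPACITY = 4  # hypothetical: directory entries per disk block

@dataclass
class DirectoryNode:
    entries: list = field(default_factory=list)
    blocks: int = 1  # 1 = normal directory node, more than 1 = supernode

    @property
    def capacity(self):
        return self.blocks * BLOCK_CAPACITY

    @property
    def is_supernode(self):
        return self.blocks > 1

def blocks_for(n):
    """Smallest number of blocks that holds n entries (at least one)."""
    return max(1, -(-n // BLOCK_CAPACITY))

def add_entry(node, entry, try_split):
    """Add an entry; on overflow, split if possible, otherwise grow the node.

    Returns the new sibling node if a split happened, else None.
    """
    node.entries.append(entry)
    if len(node.entries) <= node.capacity:
        return None
    parts = try_split(node.entries)
    if parts is None:
        # No acceptable split: extend by one block. A normal node becomes
        # a two-block supernode; a full supernode gains another block.
        node.blocks += 1
        return None
    left, right = parts
    node.entries, node.blocks = left, blocks_for(len(left))
    return DirectoryNode(entries=right, blocks=blocks_for(len(right)))

node = DirectoryNode()
for i in range(9):
    add_entry(node, i, lambda entries: None)  # a split is never possible here
print(node.blocks, node.is_supernode)
```

Passing the split routine in as a callback keeps the sketch independent of any particular split heuristic.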

Insert Algorithm

X-Trees in different dimensions

Split Algorithm. Adding an MBR (minimum bounding rectangle) to a node may cause an overflow and trigger a split. Criteria to split a node: first, find a split based on the topological and geometrical properties of the MBRs. If that split results in too much overlap, try the overlap-minimal split instead. If the overlap-minimal split results in under-filled nodes, do not perform a split, and return false.
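The three-step decision above can be sketched as a small function. Everything here is an assumption made for illustration: the two split strategies and the overlap measure are passed in as callbacks, the threshold values are placeholders, and `None` stands in for the "return false" case.

```python
MAX_OVERLAP = 0.2   # placeholder threshold for acceptable overlap
MIN_FANOUT = 0.35   # placeholder minimum fill fraction per resulting node

def split(entries, topological_split, overlap_minimal_split, overlap_of):
    """Return (s1, s2) for an acceptable split, or None if the node
    should instead be extended into a supernode."""
    # Step 1: split on topological and geometrical properties.
    s1, s2 = topological_split(entries)
    if overlap_of(s1, s2) <= MAX_OVERLAP:
        return s1, s2
    # Step 2: too much overlap, so fall back to the overlap-minimal split.
    s1, s2 = overlap_minimal_split(entries)
    # Step 3: reject the split if either side would be under-filled.
    if min(len(s1), len(s2)) < MIN_FANOUT * len(entries):
        return None
    return s1, s2

entries = list(range(10))
halves = lambda e: (e[:5], e[5:])
lopsided = lambda e: (e[:1], e[1:])
# High overlap after the first split and an unbalanced fallback: no split.
print(split(entries, halves, lopsided, lambda a, b: 0.9))
```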

Parameters. MIN_FANOUT: similar to the value used in other index structures (approximately 35% to 45%). MAX_OVERLAP: a system constant, chosen as the balance point between reading a supernode of twice the block size and reading two blocks with probability MaxO and one block with probability (1 - MaxO). For T_IO = 20 ms, T_Tr = 4 ms, T_CPU = 1 ms: MaxO = 20%.
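The balance point can be worked out under a simple cost model, which is an assumption of this sketch: each read pays one seek (T_IO), and each block pays one transfer (T_Tr) plus one processing step (T_CPU). Reading a two-block supernode then costs T_IO + 2(T_Tr + T_CPU), while the split alternative costs (1 + MaxO)(T_IO + T_Tr + T_CPU) in expectation. Setting the two equal gives MaxO = (T_Tr + T_CPU) / (T_IO + T_Tr + T_CPU). The function names below are hypothetical.

```python
def supernode_cost(t_io, t_tr, t_cpu):
    """One seek, then transfer and process two blocks."""
    return t_io + 2 * (t_tr + t_cpu)

def split_cost(max_o, t_io, t_tr, t_cpu):
    """Two single-block reads with probability max_o, otherwise one."""
    one_block = t_io + t_tr + t_cpu
    return max_o * 2 * one_block + (1 - max_o) * one_block

def max_overlap(t_io, t_tr, t_cpu):
    """Overlap probability at which both alternatives cost the same."""
    return (t_tr + t_cpu) / (t_io + t_tr + t_cpu)

# Assumed timings in milliseconds: 20 ms seek, 4 ms transfer, 1 ms CPU.
m = max_overlap(20, 4, 1)
print(m, supernode_cost(20, 4, 1), split_cost(m, 20, 4, 1))
```

With these assumed timings the threshold comes out at 20%: above that overlap, a single larger supernode is the cheaper read.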

Delete and Update. An update is a combination of a delete and an insert. If a deletion causes a supernode to underflow, its blocks are merged to form a single directory node. Hence, the structure is dynamic.

Overlap Minimal Split

Determining the Overlap-Minimal Split. Partition the set S of MBRs in a directory node into two subsets (S_1, S_2) such that the MBRs of the two subsets overlap minimally. For point data, an overlap-free split is possible, but a balanced split cannot be guaranteed.

Lemma 1. For uniformly distributed point data, an overlap-free split is possible if and only if there is a dimension according to which all MBRs in the node have previously been split.

Split History. Used to determine the dimension according to which all MBRs in S have been split. The storage requirement is a few bits. Each split produces two MBRs from one, so the history is represented as a binary tree, the 'split tree'. Leaf nodes correspond to the MBRs in S. Internal nodes correspond to old MBRs that no longer exist, each labeled with the split axis used. Characteristics: MBRs in the left subtree have lower coordinates in the split dimension than those in the right subtree (they are disjoint). The path to a leaf records which dimensions that MBR has been split by. The root node holds the split dimension common to all MBRs.
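A split tree with these properties can be sketched as a small binary tree. The class and function names below are hypothetical, and MBRs are reduced to string labels to keep the example short. The sketch shows two things: the dimensions on the path to each leaf, and how cutting at the root yields two groups that are disjoint in the root's split dimension.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SplitNode:
    axis: Optional[int] = None          # split dimension (internal nodes)
    left: Optional["SplitNode"] = None  # lower coordinates in `axis`
    right: Optional["SplitNode"] = None
    mbr: Optional[str] = None           # label of a current MBR (leaves)

def leaves(node):
    """Labels of all current MBRs below this node, left to right."""
    if node.mbr is not None:
        return [node.mbr]
    return leaves(node.left) + leaves(node.right)

def split_dims(node, path=()):
    """Map each MBR label to the set of dimensions it has been split by."""
    if node.mbr is not None:
        return {node.mbr: set(path)}
    result = {}
    result.update(split_dims(node.left, path + (node.axis,)))
    result.update(split_dims(node.right, path + (node.axis,)))
    return result

def overlap_free_split(root):
    """Cut at the root: returns (axis, S1, S2), or None for a single MBR."""
    if root.mbr is not None:
        return None
    return root.axis, leaves(root.left), leaves(root.right)

# History: one MBR was split along dimension 2, then its lower half was
# split again along dimension 1, leaving the current MBRs A, B and C.
root = SplitNode(axis=2,
                 left=SplitNode(axis=1,
                                left=SplitNode(mbr="A"),
                                right=SplitNode(mbr="B")),
                 right=SplitNode(mbr="C"))
print(split_dims(root))
print(overlap_free_split(root))
```

Every path starts at the root, so the root's axis is in every leaf's set. In this example it is the only dimension shared by A, B and C.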

Lemma 2. For point data, an overlap-free split always exists. Probability of a second overlap-free split axis. The probability that a split algorithm chooses the right split axis coincidentally is very low, e.g., for the R*-tree (which uses different criteria) or for a random choice.

Performance Evaluation

X-Tree vs TV-Tree / R*-Tree. Insertion into the X-tree is faster: 10.45 times faster than into the R*-tree, at about 170 insertions per second for a 150 MB index containing 16-dimensional point data. The speed-up in search time for point queries grows with increasing dimension, because with the high overlap in high dimensions the R*-tree accesses most of the directory pages. (All experiments were carried out on an HP735 workstation with 64 MB main memory and 10 GB of index space on disk.)

Number of Page Accesses, CPU Time

Number of Page Accesses, CPU Time. The X-tree outperforms the TV-tree and the R*-tree by up to orders of magnitude for point and nearest-neighbor queries on both synthetic and real data. Because nearest-neighbor queries require sorting on the min-max distance, the CPU time is much higher, but still better than that of an R-tree. Because extended spatial objects induce some overlap in the X-tree as well, the speed-up of the X-tree over the R*-tree is lower than for point data (a factor of about 8 for D=16).

Performance - Speed up on Real Point Data

Performance - Speed up on Real Spatial Data

Conclusions. R-tree based index structures do not behave well in high-dimensional spaces. The X-tree, with its concepts of supernodes (directory nodes extended beyond the block size to avoid degeneration of the index) and the overlap-minimal split, provides a higher speed-up for point and nearest-neighbor queries. Because the total search time of the X-tree grows logarithmically with the database size, it scales well to very large databases.

Questions?