A Spatial Index Structure for High Dimensional Point Data Wei Wang, Jiong Yang, and Richard Muntz Data Mining Lab Department of Computer Science University.

Presentation transcript:

A Spatial Index Structure for High Dimensional Point Data Wei Wang, Jiong Yang, and Richard Muntz Data Mining Lab Department of Computer Science University of California, Los Angeles

Outline
– Introduction
– Structure of PK-tree
– Operations on PK-tree
– Performance
– Conclusions

Introduction
Dynamic spatial index methods have been an active research area.
– Index structures based on spatial decomposition: PR-Quad tree, K-D tree, K-D-B tree, ... There is no overlap among sibling nodes, but achieving high disk page utilization at large dimensionality with skewed data distributions remains a challenge.
– The R-tree family of index structures: R*-tree, SR-tree, X-tree, ... Overlap among sibling nodes increases with dimensionality and degrades performance severely.

Introduction
PK-tree
– Spatial decomposition: no overlap among sibling nodes
– Bound on height
– Bounds on number of children
– Uniqueness for any data set, independent of the order of insertion and deletion
– Solid theoretical foundation
– Fast retrieval and updates

Structure of PK-tree
The space is divided recursively and rectilinearly.
[Figure: a two-dimensional space (axes dim 1, dim 2) divided from the ith level to the (i+1)th level]
Set notation (e.g., $\in$, $\subset$, $\cup$) is used to express relationships among cells.

Structure of PK-tree
Space is recursively divided until a level $L_D$ is reached such that each cell contains at most one point.
[Figure: the same space shown at successive levels, starting from Level 1]
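The recursive division can be illustrated with a minimal sketch. It assumes, for illustration only, that every level splits every dimension into `r` equal parts and that the points are distinct; the slides leave the exact split policy open. The names `cell_of` and `leaf_level` are hypothetical.

```python
def cell_of(point, level, r, lower, upper):
    """Grid coordinates of the level-`level` cell that contains `point`.

    Assumption: each level splits every dimension into r equal parts,
    so level `level` has r**level cells per dimension.
    """
    parts = r ** level
    coords = []
    for x, lo, hi in zip(point, lower, upper):
        idx = int((x - lo) / (hi - lo) * parts)
        coords.append(min(idx, parts - 1))  # clamp points on the upper boundary
    return tuple(coords)


def leaf_level(points, r, lower, upper, max_level=32):
    """Smallest level L_D at which every cell holds at most one point."""
    for level in range(max_level + 1):
        cells = [cell_of(p, level, r, lower, upper) for p in points]
        if len(set(cells)) == len(cells):  # no two points share a cell
            return level
    raise ValueError("points not separated within max_level (duplicates?)")
```

For example, with `r = 2` on the unit square, the points (0.1, 0.1) and (0.2, 0.2) still share a cell at level 2 and are first separated at level 3.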

Structure of PK-tree
Point cell: a non-empty cell at level $L_D$.
A cell $C$ is K-instantiable iff
– $C$ is a point cell, or
– there do not exist $K-1$ or fewer K-instantiable sub-cells $C_1, \dots, C_{K-1} \subset C$ such that $\forall d \in D\ (d \in C \Rightarrow d \in \bigcup_{i=1}^{K-1} C_i)$.
[Figure: example with K = 3, showing cells from Level 1 down to Level 3 ($L_D$)]

Structure of PK-tree
[Figure: example of a PK-tree of rank 3, showing the root, the instantiated cells at Level 1 and Level 2, and the point cells at Level 3 ($L_D$) on a grid whose columns are labelled a to h]

Structure of PK-tree
Given a finite set of points $D$ over index space $C_0$ and dividing ratio $R$, a PK-tree of rank $K$ ($K > 1$) is defined as follows.
– The cell at level 0 ($C_0$) is always instantiated and serves as the root of the PK-tree.
– Every other node in the PK-tree is mapped one-to-one to a K-instantiable cell.
– For any two nodes $C_1$ and $C_2$ in the PK-tree, $C_1$ is a child of $C_2$ (or $C_2$ is the parent of $C_1$) iff $C_1$ is a proper sub-cell of $C_2$, i.e., $C_1 \subset C_2$, and there does not exist $C_3$ in the PK-tree such that $C_1 \subset C_3$ and $C_3 \subset C_2$.
Properties: existence and uniqueness, bounds on node outdegree, bounded storage space, bounds on expected height, no overlap among sibling nodes, and so on.
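The definition can be made concrete with a small bottom-up construction. This is a sketch under stated assumptions, not the authors' algorithm: points are integer grid coordinates at level `depth` (standing in for $L_D$), every level splits each dimension into `r` parts, and a non-empty cell is treated as K-instantiable when its maximal K-instantiable proper sub-cells number at least K, which appears to be equivalent to the covering condition above. The name `build_pk_tree` and the `(level, coords)` cell encoding are hypothetical.

```python
from collections import defaultdict


def build_pk_tree(points, K, r, depth):
    """Return {cell: list of child cells} for a PK-tree of rank K.

    A cell is (level, coords). Level 0 is the root C_0 and level `depth`
    holds the point cells. Assumes integer coordinates in [0, r**depth).
    """
    def parent(cell):
        level, coords = cell
        return (level - 1, tuple(c // r for c in coords))

    # cover[c] = maximal K-instantiable cells inside c (c itself if instantiated)
    cover = {(depth, tuple(p)): [(depth, tuple(p))] for p in set(map(tuple, points))}
    tree = {cell: [] for cell in cover}  # point cells are leaves

    for level in range(depth, 0, -1):
        pending = defaultdict(list)
        for cell, cells in cover.items():
            pending[parent(cell)].extend(cells)
        cover = {}
        for cell, cells in pending.items():
            is_root = level - 1 == 0
            if is_root or len(cells) >= K:
                tree[cell] = cells       # instantiate: cells become its children
                cover[cell] = [cell]
            else:
                cover[cell] = cells      # not instantiated: pass the cover upward
    return tree
```

Because the result depends only on the set of points, any insertion order yields the same tree, which matches the uniqueness property listed on the slide.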

Properties of PK-tree
Expected Height of a PK-tree
[Figure: the longest path, of length $H$, from the root (N points) to a leaf; each node on the path is annotated "at least K-1"; for nested cells $C_i$ and $C_{i+1}$ on the path, $P(d \in C_{i+1} \mid d \in C_i) < 1$]

Properties of PK-tree
M-Level Clustering Spatial Distribution
– 0-level: uniform distribution over $C_0$, with $P(d \in C_{i+1} \mid d \in C_i) = 1/r$.
– 1-level: let $A \subset C_0$ be some subset of $C_0$ and $A^c = C_0 - A$. The distributions of points in $A$ and in $A^c$ are each 0-level clustering spatial distributions.
[Figure: region $A$ and its complement $A^c$ within $C_0$, with nested cells $C_i$ and $C_{i+1}$]

Operations on PK-tree
Pagination of the PK-tree
– Pick the parameter K and the number of dimensions to split at each level such that the maximum node size is close to a page size.
– Allocate one node to a page.
– Space utilization can be guaranteed to be at least 50% and is much more than 50% in experiments.
Insertion
– First follow the path from the root to locate all (potential) ancestors of the inserted leaf cell.
– Then go from the leaf level back to the root along the same path to make all necessary changes (e.g., instantiate or de-instantiate cells).
Search
– K Nearest Neighbor Query
– Range Query
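The slide only names the search operations, so the following is a generic sketch of how a range query might run on a non-overlapping spatial decomposition tree such as this one; it is not taken from the presentation. The `Node` class, its fields, and `range_query` are hypothetical names, and each node is assumed to store the bounds of its cell.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    lo: tuple                              # lower corner of the cell
    hi: tuple                              # upper corner of the cell
    children: list = field(default_factory=list)
    point: Optional[tuple] = None          # set only for point cells (leaves)


def intersects(lo, hi, qlo, qhi):
    """True if the box [lo, hi] overlaps the query box [qlo, qhi]."""
    return all(l <= qh and ql <= h for l, h, ql, qh in zip(lo, hi, qlo, qhi))


def range_query(node, qlo, qhi):
    """Collect every stored point that lies inside the query box."""
    if not intersects(node.lo, node.hi, qlo, qhi):
        return []                          # prune the whole subtree
    if node.point is not None:
        inside = all(ql <= x <= qh for x, ql, qh in zip(node.point, qlo, qhi))
        return [node.point] if inside else []
    found = []
    for child in node.children:
        found.extend(range_query(child, qlo, qhi))
    return found
```

Since sibling cells do not overlap, a subtree is visited only when its cell actually intersects the query box, so a small query box prunes most of the tree.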

Performance
Setup: Sparc 10 workstation (SunOS 5.5) with 208 MB of main memory and a local disk with 9 GB capacity.
Synthetic data sets (each contains 100,000 points):
– u: uniform distribution
– c1, c2: 20% of the data are uniformly distributed and 80% are distributed in disjoint clusters
[Table: height of generated trees]

Performance Size of index in MB with 100,000 points

Performance Range query on uniform data distribution

Performance Range query on clustered data distribution

Performance KNN query on uniform data distribution

Performance KNN query on clustered data distribution

Performance
Real data set: NASA Sky Telescope Data
– 200,000 two-dimensional points (the coordinates of crater locations on the surface of Mars)

Conclusions
PK-tree: employs spatial decomposition to ensure no overlap among sibling nodes, while avoiding the large number of nodes that usually results from a skewed spatial distribution of objects.
– The total number of nodes in a PK-tree is O(N) and the expected height of a PK-tree is O(log N) under some general conditions.
– Other properties: uniqueness, bounds on number of children.
– Empirical studies show that the PK-tree outperforms the SR-tree and X-tree by a wide margin.