A Spatial Index Structure for High Dimensional Point Data Wei Wang, Jiong Yang, and Richard Muntz Data Mining Lab Department of Computer Science University of California, Los Angeles
Outline Introduction Structure of PK-tree Operations on PK-tree Performance Conclusions
Introduction Dynamic spatial index methods have been an active research area. –Index structures based on spatial decomposition: PR-Quad tree, K-D tree, K-D-B tree,... There is no overlap among sibling nodes, but achieving high disk page utilization at high dimensionality with skewed data distributions remains a challenge. –R-tree family of index structures: R*-tree, SR-tree, X-tree,... Overlap among sibling nodes grows with dimensionality and degrades performance severely.
Introduction PK-tree –Spatial decomposition: no overlap among sibling nodes –Bound on height –Bounds on number of children –Uniqueness for any data set, independent of the order of insertions and deletions –Solid theoretical foundation –Fast retrieval and updates
Structure of PK-tree The space is divided recursively and rectilinearly. [Figure: a cell at the ith level divided along dim 1 and dim 2 into cells at the (i+1)th level.] Set notation (e.g., $\in$, $\subset$, $\cup$, $\cap$) is used to express relationships among cells.
Structure of PK-tree Space is recursively divided until a level $L_D$ at which each cell contains at most one point. [Figure: successive levels of the decomposition.]
Structure of PK-tree Point cell: a non-empty cell at level $L_D$. A cell $C$ is K-instantiable iff –$C$ is a point cell, or –there do not exist $(K-1)$ or fewer K-instantiable sub-cells $C_1, \ldots, C_{K-1} \subset C$ such that $\forall d \in D\ (d \in C \Rightarrow d \in \bigcup_{i=1}^{K-1} C_i)$. [Figure: example with K = 3 over Level 1, Level 2, and Level 3 ($L_D$).]
Structure of PK-tree [Figure: example of a PK-tree of rank 3, showing the cell grids at Level 1, Level 2, and Level 3 ($L_D$) and the resulting tree from the root down to the point cells.]
Structure of PK-tree Given a finite set of points $D$ over index space $C_0$ and dividing ratio $R$, a PK-tree of rank $K$ ($K > 1$) is defined as follows. –The cell at level 0 ($C_0$) is always instantiated and serves as the root of the PK-tree. –Every other node in the PK-tree is mapped one-to-one to a K-instantiable cell. –For any two nodes $C_1$ and $C_2$ in the PK-tree, $C_1$ is a child of $C_2$ (or $C_2$ is the parent of $C_1$) iff $C_1$ is a proper sub-cell of $C_2$, i.e., $C_1 \subset C_2$, and there does not exist $C_3$ in the PK-tree such that $C_1 \subset C_3$ and $C_3 \subset C_2$. Properties: existence and uniqueness, bounds on node outdegree, bounded storage space, bounds on expected height, no overlap among sibling nodes, and so on.
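The definition above can be read as a bottom-up construction. The following is a minimal Python sketch, not the authors' implementation. It assumes a two-dimensional integer grid whose side is a power of two, each dimension halved at every level (so a cell has four sub-cells), and distinct points, so that level $L_D$ is the level of unit cells. The names `Node`, `collect`, `top_instantiable`, `build_pk_tree`, and `leaves` are hypothetical. The sketch uses the observation that a non-point cell is K-instantiable exactly when its points fall into at least K top-most K-instantiable proper sub-cells.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    level: int          # depth of the cell in the decomposition
    origin: tuple       # lower corner of the cell, in grid units
    size: int           # side length of the cell, in grid units
    children: list = field(default_factory=list)

def collect(points, origin, size, level, k):
    """Top-most K-instantiable cells among the proper sub-cells of a cell."""
    half = size // 2
    found = []
    for dx in (0, half):
        for dy in (0, half):
            ox, oy = origin[0] + dx, origin[1] + dy
            inside = [p for p in points
                      if ox <= p[0] < ox + half and oy <= p[1] < oy + half]
            found += top_instantiable(inside, (ox, oy), half, level + 1, k)
    return found

def top_instantiable(points, origin, size, level, k):
    """Return the cell itself if it is K-instantiable, else what lies below it."""
    if not points:
        return []                              # empty cells are never instantiated
    if size == 1:
        return [Node(level, origin, 1)]        # point cell at level L_D
    found = collect(points, origin, size, level, k)
    # Instantiate only if K-1 or fewer instantiable sub-cells cannot cover the points.
    return [Node(level, origin, size, found)] if len(found) >= k else found

def build_pk_tree(points, grid_size, k):
    """The root cell C_0 is always instantiated, whatever its child count."""
    pts = sorted(set(points))
    return Node(0, (0, 0), grid_size, collect(pts, (0, 0), grid_size, 0, k))

def leaves(node):
    """Point cells below a node (a childless node counts as a leaf)."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]

# Example: five points on an 8 x 8 grid, rank K = 3.
example_points = [(0, 0), (1, 0), (0, 1), (7, 7), (6, 1)]
example_tree = build_pk_tree(example_points, 8, 3)
```

Because the construction depends only on the set of points, building from the same points in a different order gives an identical tree, which illustrates the uniqueness property listed above. Incremental insertion and deletion are not covered by this sketch.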
Properties of PK-tree Expected height of a PK-tree. For a cell $C_i$ and its sub-cell $C_{i+1}$, $P(d \in C_{i+1} \mid d \in C_i) < 1$. [Figure: the longest path, of length $H$, from the root (N points) to a leaf; each node on the path is annotated "at least K-1".]
Properties of PK-tree M-Level Clustering Spatial Distribution –0-level: uniform distribution over $C_0$, with $P(d \in C_{i+1} \mid d \in C_i) = 1/r$ –1-level: Let $A \subset C_0$ be some subset of $C_0$ and $A^c = C_0 - A$. The distributions of points in $A$ and in $A^c$ are each 0-level clustering spatial distributions. [Figure: regions $A$ and $A^c$ with cells $C_i$ and $C_{i+1}$.]
Operations on PK-tree Pagination of the PK-tree –Pick the parameter K and the number of dimensions to split at each level such that the maximum node size is close to the page size. –Allocate one node to a page. –Space utilization is guaranteed to be at least 50% and is well above 50% in experiments. Insertion –First follow the path from the root to locate all (potential) ancestors of the inserted leaf cell. –Then walk back from the leaf level to the root along the same path, making all necessary changes (e.g., instantiating or de-instantiating cells). Search –K Nearest Neighbor Query –Range Query
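The two search operations can be sketched generically for any tree of non-overlapping cells. This is a minimal Python sketch under stated assumptions, not the authors' algorithm: the `Cell` type, `range_query`, `lower_bound2`, `knn_query`, and the parameter `n_neighbors` are hypothetical names, each node stores the corners of its cell, and leaves store a single point. The range query prunes subtrees whose cells miss the query box. The nearest-neighbor query is a standard best-first search ordered by a lower bound on distance.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

@dataclass
class Cell:
    lo: tuple                         # lower corner of the cell
    hi: tuple                         # upper corner of the cell
    children: list = field(default_factory=list)
    point: tuple = None               # set only for leaf (point) cells

def range_query(node, qlo, qhi):
    """Return all points p with qlo <= p <= qhi in every dimension."""
    # Prune any subtree whose cell does not intersect the query box.
    if any(h < ql or l > qh
           for l, h, ql, qh in zip(node.lo, node.hi, qlo, qhi)):
        return []
    if node.point is not None:
        inside = all(ql <= x <= qh for x, ql, qh in zip(node.point, qlo, qhi))
        return [node.point] if inside else []
    hits = []
    for child in node.children:
        hits += range_query(child, qlo, qhi)
    return hits

def lower_bound2(node, q):
    """Squared distance from q to a leaf's point, or to the nearest part of a cell."""
    if node.point is not None:
        return sum((x - y) ** 2 for x, y in zip(node.point, q))
    return sum(max(l - x, 0, x - h) ** 2
               for l, h, x in zip(node.lo, node.hi, q))

def knn_query(root, q, n_neighbors):
    """Best-first search: always expand the entry with the smallest lower bound."""
    tie = count()                     # tie-breaker so the heap never compares Cells
    heap = [(lower_bound2(root, q), next(tie), root)]
    result = []
    while heap and len(result) < n_neighbors:
        _, _, node = heapq.heappop(heap)
        if node.point is not None:
            result.append(node.point)  # no unexplored cell can hold a closer point
            continue
        for child in node.children:
            heapq.heappush(heap, (lower_bound2(child, q), next(tie), child))
    return result

def leaf(p):
    """Unit point cell holding p (illustrative helper)."""
    return Cell(lo=p, hi=tuple(x + 1 for x in p), point=p)

# Hand-built example tree over an 8 x 8 space.
cluster = Cell(lo=(0, 0), hi=(2, 2),
               children=[leaf((0, 0)), leaf((1, 0)), leaf((0, 1))])
root = Cell(lo=(0, 0), hi=(8, 8),
            children=[cluster, leaf((6, 1)), leaf((7, 7))])
```

Since sibling cells do not overlap, a point query follows a single path from the root, and the pruning test in `range_query` discards every sibling that lies outside the query box.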
Performance Setup: Sparc 10 workstation (SunOS 5.5) with 208 MB main memory and a local disk with 9 GB capacity. Synthetic data sets (each contains 100,000 points) –u: uniform distribution –c1, c2: 20% of the data are uniformly distributed and 80% are distributed in disjoint clusters. [Table or figure: height of generated trees.]
Performance Size of index in MB with 100,000 points
Performance Range query on uniform data distribution
Performance Range query on clustered data distribution
Performance KNN query on uniform data distribution
Performance KNN query on clustered data distribution
Performance Real data set: NASA Sky Telescope Data –200,000 two-dimensional points (they are the coordinates of crater locations on the surface of Mars)
Conclusions PK-tree: employs spatial decomposition to ensure no overlap among sibling nodes, while avoiding the large number of nodes that usually results from a skewed spatial distribution of objects. –The total number of nodes in a PK-tree is O(N) and the expected height of a PK-tree is O(log N) under some general conditions. Other properties: uniqueness, bounds on number of children. Empirical studies show that the PK-tree outperforms the SR-tree and X-tree by a wide margin.
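The O(N) bound on the number of nodes can be made concrete with a short counting argument. This is a sketch under stated assumptions, not the authors' proof: the N points are distinct, so the tree has N leaves (point cells), and every non-root internal node has at least K children, which follows from the definition of a K-instantiable cell. The symbol $I$, the number of non-root internal nodes, is introduced here for illustration. Counting edges once by child and once by parent gives:

```latex
\begin{align*}
% Each leaf and each non-root internal node has exactly one parent (left side);
% each non-root internal node has at least K children and the root at least one (right side).
N + I &\ge K I + 1 \\
I &\le \frac{N - 1}{K - 1} \\
% Total nodes: N leaves, I non-root internal nodes, and the root.
N + I + 1 &\le N + 1 + \frac{N - 1}{K - 1} \le 2N \qquad (K \ge 2,\ N \ge 1)
\end{align*}
```

Under these assumptions the tree has at most $2N$ nodes for any rank $K \ge 2$, consistent with the O(N) claim.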