
High-dimensional Indexing based on Dimensionality Reduction Students: Qing Chen Heng Tao Shen Sun Ji Chun Advisor: Professor Beng Chin Ooi

Outline: Introduction; Global Dimensionality Reduction; Local Dimensionality Reduction; Indexing the Reduced-Dim Space; Effects of Dimensionality Reduction; Behaviors of Distance Metrics; Conclusion and Future Work

Introduction High-dim applications: multimedia, time-series, scientific, market basket, etc. Various trees proposed: R-tree, R*, R+, X, Skd, SS, M, KDB, TV, Buddy, Grid File, Hybrid, iDistance, etc. Dimensionality curse: efficiency drops quickly as dimensionality increases.

Introduction Dimensionality reduction techniques: GDR, LDR. High-dim indexing on the RDS: existing indexing on a single RDS, global indexing on multiple RDSs. Side effects of DR. Different behaviors of distance metrics. Conclusion and future work.

GDR performs reduction on the whole dataset.

GDR improves query accuracy by applying principal component analysis (PCA).
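As a concrete sketch of this step (the function name and interface are illustrative, not from the slides), GDR via PCA reduces the whole dataset with one global transform:

```python
import numpy as np

def gdr_pca(X, k):
    """Global dimensionality reduction: PCA over the whole dataset.

    X: (n, d) data matrix, k: target dimensionality.
    Returns the (n, k) reduced data plus the mean and principal
    components needed to project queries into the same space.
    """
    mean = X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal
    # directions, ordered by decreasing captured variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    components = Vt[:k]
    return (X - mean) @ components.T, mean, components

# A query must be mapped with the same transform:
# q_reduced = (q - mean) @ components.T
```

Because the single transform is fitted globally, it is only as good as the global correlation structure of the data, which motivates the LDR approach below.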

GDR Using Aggregate Data for Reduction in Dynamic Spaces [8].

GDR works for globally correlated data, but may cause significant information loss on real data, which is often only locally correlated.

LDR [5] Find locally correlated data clusters, then perform dimensionality reduction on the clusters individually.

LDR - Definitions Cluster and subspace Reconstruction Distance

LDR - Constraints on clusters Reconstruction distance bound, i.e. MaxReconDist. Dimensionality bound, i.e. MaxDim. Size bound, i.e. MinSize.

LDR - Clustering Algo Construct spatial clusters: determine the max number of clusters M; determine the cluster range e; choose a set of well-scattered points as the centroids C of the spatial clusters; assign every data point P satisfying Distance(P, C_closest) <= e; update the centroids of the clusters.

LDR - Clustering Algo (cont) Compute principal components (PC): perform PCA individually on all clusters; compute the mean of each cluster's points, i.e. E_i. Determine subspace dimensionality: progressively check each point against MaxReconDist and MaxDim; decide the optimal dimensionality for each cluster.

LDR - Clustering Algo (cont) Recluster points: insert each point into a suitable cluster, i.e. one with ReconDist(P, S) <= MaxReconDist, or into the outlier set O.

LDR - Clustering Algo (cont) Finally, apply the size bound to eliminate clusters with too small a population, redistributing their points to other clusters or to the set O.
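The clustering loop above can be sketched as follows (centroid seeding and the final redistribution are simplified assumptions; the bound names MaxReconDist, MaxDim, MinSize follow the slides):

```python
import numpy as np

def ldr_cluster(X, M, e, MaxReconDist, MaxDim, MinSize, seed=0):
    """Sketch of LDR clustering [5]: spatial clusters, per-cluster
    PCA, subspace-dimensionality selection, and an outlier set O.
    Centroid seeding is simplified to a random choice."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=M, replace=False)]
    members = [[] for _ in range(M)]
    outliers = []
    for p in X:
        d = np.linalg.norm(centroids - p, axis=1)
        c = int(np.argmin(d))
        # Distance(P, C_closest) <= e, else the point is an outlier
        (members[c] if d[c] <= e else outliers).append(p)
    clusters = []
    for pts in members:
        if len(pts) < MinSize:       # size bound MinSize
            outliers.extend(pts)     # redistribute to O (simplified)
            continue
        P = np.asarray(pts)
        mean = P.mean(axis=0)        # cluster mean E_i
        _, _, Vt = np.linalg.svd(P - mean, full_matrices=False)
        # smallest k <= MaxDim meeting the reconstruction bound
        k = min(MaxDim, Vt.shape[0])
        for kk in range(1, k + 1):
            proj = (P - mean) @ Vt[:kk].T @ Vt[:kk]
            if np.linalg.norm(P - mean - proj, axis=1).max() <= MaxReconDist:
                k = kk
                break
        clusters.append((mean, Vt[:k]))
    return clusters, outliers
```

Each returned (mean, components) pair defines one reduced-dimensional space (RDS) in its own axis system.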

LDR - Compare to GDR LDR improves retrieval efficiency and effectiveness by capturing more detail of the local data, but it incurs a higher computational cost during the reduction steps.

LDR LDR cannot discover all the possible correlated clusters.

Indexing RDS GDR: one RDS only; apply an existing multi-dim indexing structure, e.g. R-tree, M-tree. LDR: several RDSs in different axis systems; use a Global Indexing Structure.

Global Indexing Each RDS corresponds to one tree.
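A minimal sketch of this global-indexing idea, with flat arrays standing in for the per-RDS trees (the class and its interface are illustrative assumptions, not the structure from the slides):

```python
import numpy as np

class GlobalIndex:
    """Sketch of global indexing: one index per reduced-dim space
    (RDS). A KNN query is projected into each RDS, per-space
    candidates are collected, re-ranked by exact distance in the
    original space, and merged. A real implementation would search
    actual trees (one tree per RDS) instead of flat arrays."""
    def __init__(self):
        self.spaces = []  # (mean, components, reduced, originals)

    def add_space(self, mean, components, originals):
        reduced = (originals - mean) @ components.T
        self.spaces.append((mean, components, reduced, originals))

    def knn(self, q, k):
        cands = []
        for mean, comps, reduced, originals in self.spaces:
            qr = (q - mean) @ comps.T           # project query into RDS
            d = np.linalg.norm(reduced - qr, axis=1)
            for i in np.argsort(d)[:k]:         # per-space candidates
                cands.append((np.linalg.norm(originals[i] - q),
                              originals[i]))
        cands.sort(key=lambda t: t[0])          # merge by exact distance
        return [v for _, v in cands[:k]]
```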

Side Effects of DR Information loss -> lower precision. Possible improvement? In the text domain, DR -> qualitative improvement. Least information loss -> highest precision -> highest qualitative improvement.

Side Effects of DR Latent Semantic Indexing (LSI) [9,10,11]: the SVD factors U and V give similarity for documents, and similarity and correlation for terms.
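A sketch of LSI via truncated SVD of the term-by-document matrix (the function name is illustrative; scaling both sides by the singular values is one common convention):

```python
import numpy as np

def lsi(A, k):
    """LSI sketch: truncated SVD of a term-by-document matrix A.
    Rows of term_vecs compare terms, rows of doc_vecs compare
    documents, both in the k-dim latent ("concept") space."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    term_vecs = U[:, :k] * S[:k]  # term-term sim: term_vecs @ term_vecs.T
    doc_vecs = Vt[:k].T * S[:k]   # doc-doc sim:   doc_vecs @ doc_vecs.T
    return term_vecs, doc_vecs
```

With full rank k the document similarities reproduce A^T A exactly; truncating k keeps only the dominant "concepts".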

Side Effects of DR DR effectively improves the data representation by understanding the data in terms of concepts rather than words. Keeping the directions with the greatest variance results in the use of the semantic aspects of the data.

Side Effects of DR Dependency among attributes results in poor measurements when using L-norm metrics. Dimensions with the largest eigenvalues have the highest quality [2]. So what else do we have to consider? Inter-correlations.

Mahalanobis Distance: d(x, y) = sqrt((x - y)^T Σ^-1 (x - y)), where Σ is the covariance matrix. Normalized Mahalanobis Distance.
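The (unnormalized) Mahalanobis distance can be computed directly from the covariance matrix; a minimal sketch:

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance sqrt((x - y)^T cov^{-1} (x - y)).
    Unlike the L-norm metrics, it accounts for the variance and
    inter-correlation of the attributes via the covariance matrix."""
    diff = x - y
    # solve(cov, diff) computes cov^{-1} @ diff without forming
    # the explicit inverse
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))
```

With an identity covariance it reduces to the Euclidean (L2) distance; a larger variance along an attribute down-weights differences in that attribute.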

Mahalanobis vs. L-norm

Mahalanobis distance takes local shape into consideration by computing variance and covariance. It tends to group points into elliptical clusters, which define a multi-dim space whose boundaries determine the range of degrees of correlation suitable for dim reduction, and define the standard-deviation boundary of the cluster.

Incremental Ellipse aims to discover all the possible correlated clusters with different sizes, densities and elongations.

Behaviors of Distance Metrics in High-dim Space Is KNN meaningful in high-dim space? [1] The ratio of farthest-neighbor to nearest-neighbor distance approaches 1 -> poor discrimination [4]. One criterion is the relative contrast: (Dmax - Dmin) / Dmin.
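The relative contrast is easy to measure empirically; a small sketch (function name illustrative):

```python
import numpy as np

def relative_contrast(X, q, p=2):
    """Relative contrast (Dmax - Dmin) / Dmin of a query q against
    dataset X under the L_p norm [1, 4]. As dimensionality grows,
    this ratio shrinks toward 0: the nearest and farthest neighbors
    become nearly indistinguishable."""
    d = np.linalg.norm(X - q, ord=p, axis=1)
    return (d.max() - d.min()) / d.min()
```

On uniform data the contrast typically drops by orders of magnitude between, say, 2 and 100 dimensions, and at a fixed high dimensionality L1 usually yields a larger contrast than L2, which is the point of the next slides.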

Behaviors of Distance Metrics in High-dim Space Relative contrast on different dimensionalities for the different metrics.

Behaviors of Distance Metrics in High-dim Space Relative Contrast on L-norm Metrics.

Behaviors of Distance Metrics in High-dim Space For higher dimensionality, the relative contrast provided by a norm with a smaller parameter is more likely to dominate one with a larger parameter. So an L-norm metric with a smaller parameter is a better choice for KNN search in high-dim space.

Conclusion Two dimensionality reduction methods: GDR and LDR. Indexing methods: existing structures, and a global indexing structure. Side effects of DR: qualitative improvement; both intra-variance and inter-correlation must be considered. Different behaviors for different metrics: a smaller k (in the L_k norm) achieves higher quality.

Future work Propose a new tree for true high-dimensional indexing without reduction, for datasets without correlations (e.g., beneath iDistance, further prune the search sphere using an LB-tree). Reduce the dimensionality of data points that are combinations of multiple features, such as images (shape, color, text, etc.).

References
[1] Charu C. Aggarwal, Alexander Hinneburg, Daniel A. Keim: On the Surprising Behavior of Distance Metrics in High Dimensional Spaces. ICDT 2001.
[2] Charu C. Aggarwal: On the Effects of Dimensionality Reduction on High Dimensional Similarity Search. PODS 2001.
[3] Alexander Hinneburg, Charu C. Aggarwal, Daniel A. Keim: What Is the Nearest Neighbor in High Dimensional Spaces? VLDB 2000.
[4] K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft: When Is Nearest Neighbors Meaningful? ICDT.
[5] K. Chakrabarti, S. Mehrotra: Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces. VLDB.
[6] R. Weber, H. Schek, S. Blott: A Quantitative Analysis and Performance Study for Similarity Search Methods in High Dimensional Spaces. VLDB.
[7] C. Yu, B. C. Ooi, K.-L. Tan, H. V. Jagadish: Indexing the Distance: An Efficient Method to KNN Processing. VLDB 2001.

References (cont.)
[8] K. V. R. Kanth, D. Agrawal, A. K. Singh: Dimensionality Reduction for Similarity Searching in Dynamic Databases. SIGMOD.
[9] Jon M. Kleinberg, Andrew Tomkins: Applications of Linear Algebra in Information Retrieval and Hypertext Analysis. PODS 1999.
[10] Christos H. Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala: Latent Semantic Indexing: A Probabilistic Analysis. PODS 1998.
[11] Chris H. Q. Ding: A Similarity-Based Probability Model for Latent Semantic Indexing. SIGIR 1999.
[12] Alexander Hinneburg, Charu C. Aggarwal, Daniel A. Keim: What Is the Nearest Neighbor in High Dimensional Spaces? VLDB 2000.