The Curse of Dimensionality Richard Jang Oct. 29, 2003

2 Preliminaries – Nearest Neighbor Search. Given a collection of data points and a query point in an m-dimensional metric space, find the data point that is closest to the query point. Variation: k-nearest neighbor. Relevant to clustering and similarity search. Applications: Geographical Information Systems, similarity search in multimedia databases.
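As a concrete illustration of the problem statement, here is a minimal brute-force sketch of NN and k-NN search in Python/NumPy using the Euclidean metric (function names and data sizes are illustrative, not from the sources):

```python
import numpy as np

def nearest_neighbor(data, query):
    """Index of the data point closest to the query (Euclidean distance)."""
    dists = np.linalg.norm(data - query, axis=1)  # distance from query to every point
    return int(np.argmin(dists))

def k_nearest_neighbors(data, query, k=5):
    """Indices of the k closest data points, nearest first."""
    dists = np.linalg.norm(data - query, axis=1)
    return np.argsort(dists)[:k]

# Example: 10,000 points in a 16-dimensional space
rng = np.random.default_rng(0)
data = rng.random((10_000, 16))
query = rng.random(16)
print(nearest_neighbor(data, query), k_nearest_neighbors(data, query, k=3))
```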

3 NN Search Con’t Source: [2]

4 Problems with High Dimensional Data. A point’s nearest neighbor (NN) loses meaning. Source: [2]

5 Problems Con’t. NN query cost degrades – there are more strong candidates to compare with. In as few as 10 dimensions, a linear scan outperforms some multidimensional indexing structures (e.g., the SS-tree, R*-tree, and SR-tree). Biology and genomic data can have dimensionality in the 1000s.

6 Problems Con’t. The presence of irrelevant attributes decreases the tendency for clusters to form. Points in high dimensional space have a high degree of freedom; they can be so scattered that they appear uniformly distributed.

7 Problems Con’t In which cluster does the query point fall?

8 The Curse. Refers to the decrease in query-processing performance as dimensionality increases. The focus of this talk is on the quality issues of NN search, not on performance issues. In particular, under certain conditions, the distance between the nearest point and the query point approaches the distance between the farthest point and the query point as dimensionality approaches infinity.

9 Curse Con’t Source: N. Katayama, S. Satoh. Distinctiveness Sensitive Nearest Neighbor Search for Efficient Similarity Retrieval of Multimedia Information. ICDE Conference, 2001.

10 Unstable NN-Query. A nearest neighbor query is unstable for a given ε > 0 if the distance from the query point to most data points is less than (1 + ε) times the distance from the query point to its nearest neighbor. Source: [2]
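A small sketch of how this definition could be checked empirically on uniform data; interpreting "most" as a 50% fraction, and the sample sizes, are illustrative assumptions, not taken from [2]:

```python
import numpy as np

def is_unstable(data, query, eps, fraction=0.5):
    """True if at least `fraction` of the points lie within (1 + eps) times
    the distance from the query to its nearest neighbor."""
    dists = np.linalg.norm(data - query, axis=1)
    return np.mean(dists <= (1 + eps) * dists.min()) >= fraction

rng = np.random.default_rng(1)
for dim in (2, 20, 200, 2000):
    data = rng.random((5_000, dim))
    query = rng.random(dim)
    print(dim, is_unstable(data, query, eps=0.5))
```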

11 Notation

12 Definitions

13 Theorem 1. Under the conditions of the above definitions, if the spread of the distance distribution vanishes relative to its magnitude as dimensionality grows, then for any ε > 0 the probability that the farthest point lies within (1 + ε) times the distance to the nearest point converges to 1. If the distance distribution behaves in this way, then as dimensionality increases all points approach the same distance from the query point.
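A reconstruction of the statement following [2] (not a verbatim quote): X_m denotes the distance from the query point to a randomly drawn data point in dimensionality m, p > 0 is a constant, and DMIN_m and DMAX_m are the distances from the query point to its nearest and farthest data points.

```latex
% Theorem 1 (Beyer et al. [2]), reconstructed
\text{If}\quad \lim_{m \to \infty} \operatorname{Var}\!\left( \frac{X_m^{\,p}}{E\!\left[X_m^{\,p}\right]} \right) = 0,
\quad\text{then for every } \varepsilon > 0,\quad
\lim_{m \to \infty} \Pr\!\left[\, \mathrm{DMAX}_m \le (1+\varepsilon)\,\mathrm{DMIN}_m \,\right] = 1 .
```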

14 Theorem Con’t Source: [2]

15 Theorem Con’t Source: [1]

16 Rate of Convergence. At what dimensionality do NN-queries become unstable? Not easy to answer, so experiments were performed on real and synthetic data. If the conditions of the theorem are met, DMAX_m/DMIN_m should decrease toward 1 with increasing dimensionality.
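A small simulation in the spirit of those experiments, on IID uniform data (the sample sizes and dimensionalities are arbitrary choices): the ratio shrinks toward 1 as the dimensionality grows.

```python
import numpy as np

rng = np.random.default_rng(2)
n_points = 10_000

# DMAX_m / DMIN_m for a random query against IID uniform data, per dimensionality
for dim in (2, 4, 8, 16, 32, 64, 128, 256):
    data = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(data - query, axis=1)
    print(f"dim={dim:4d}  DMAX/DMIN = {dists.max() / dists.min():.2f}")
```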

17 Empirical Results Source: [2]

18 An Aside. Assuming that Theorem 1 holds, when using the Euclidean distance metric, and assuming that the data and query point distributions are the same, the performance of any convex indexing structure degenerates into scanning the entire data set for NN queries; i.e., P(number of points fetched using any convex indexing structure = n) converges to 1 as m goes to ∞.

19 Alternative Statement of Theorem 1. The distance between the nearest and farthest points does not increase as fast as the distance between the query point and its NN as dimensionality approaches infinity. Note: Dmax_d – Dmin_d does not necessarily go to 0.

20 Alternative Statement Con’t

21 Background for Theorems 2 and 3. L_k norm: L_k(x, y) = ( Σ_{i=1..d} |x_i − y_i|^k )^{1/k}, where x, y ∈ R^d and k ∈ Z. L_1: Manhattan, L_2: Euclidean. L_f norm: L_f(x, y) = ( Σ_{i=1..d} |x_i − y_i|^f )^{1/f}, where x, y ∈ R^d and f ∈ (0, 1).
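Both families can be computed with the same expression; a minimal sketch (illustrative names), noting that for 0 < f < 1 the result is not a true metric because the triangle inequality fails:

```python
import numpy as np

def l_k(x, y, k):
    """L_k distance for k >= 1; for 0 < k < 1 this is the fractional
    dissimilarity L_f (not a true metric)."""
    return np.sum(np.abs(x - y) ** k) ** (1.0 / k)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.5])
print(l_k(x, y, 1))    # L_1, Manhattan
print(l_k(x, y, 2))    # L_2, Euclidean
print(l_k(x, y, 0.5))  # fractional, f = 0.5
```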

22 Theorem 2. Dmax_d – Dmin_d grows at rate d^((1/k) − (1/2)).

23 Theorem 2 Con’t. For L_1, Dmax_d – Dmin_d diverges. For L_2, Dmax_d – Dmin_d converges to a constant. For L_k with k >= 3, Dmax_d – Dmin_d converges to 0; here, NN-search is meaningless in high dimensional space.
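A quick empirical check of this behavior on IID uniform data (a sketch with arbitrary sample sizes): the L_1 contrast grows with d, the L_2 contrast stays roughly constant, and the L_3 contrast slowly shrinks.

```python
import numpy as np

def contrast(dim, k, n_points=10_000, seed=3):
    """Dmax_d - Dmin_d under the L_k metric for a random query against
    IID uniform data in [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    data = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.sum(np.abs(data - query) ** k, axis=1) ** (1.0 / k)
    return dists.max() - dists.min()

for k in (1, 2, 3):
    print(f"L_{k}:", [round(contrast(dim, k), 3) for dim in (10, 100, 1000)])
```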

24 Theorem 2 Con’t Source: [1]

25 Theorem 2 Con’t. Does this contradict Theorem 1? No; Dmin_d grows faster than Dmax_d – Dmin_d as d increases.

26 Theorem 3. Same as Theorem 2, except k is replaced by the fraction f. The smaller the fraction, the better the contrast. A meaningful distance metric should result in accurate classification and be robust against noise.

27 Empirical Results. Fractional metrics improve the effectiveness of clustering algorithms such as k-means. Source: [3]

28 Empirical Results Con’t Source: [3]

29 Empirical Results Con’t Source: [3]

30 Some Scenarios that Satisfy the Conditions of Theorem 1. Broader than the common IID assumption for the dimensions. Sc 1: For P = (P_1, …, P_m) and Q = (Q_1, …, Q_m), the P_i's are IID (likewise the Q_i's), and moments up to the (2p)-th are finite. Sc 2: P_i's, Q_i's not IID; the distribution in every dimension is unique and correlated with all other dimensions.

31 Scenarios Con’t. Sc 3: P_i's, Q_i's independent, not identically distributed, and the variance contributed by each added dimension converges to 0. Sc 4: The distance distribution cannot be described as the distance in a lower dimensionality plus a new component from the new dimension; this situation does not obey the law of large numbers.

32 A Scenario that does not Satisfy the Condition. Sc 5: Same as Sc 1 except the P_i's are completely dependent (i.e., the value in dimension 1 equals the value in dimension 2, and so on); likewise for the Q_i's. Can be converted into a 1-D NN problem. Source: [2]
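A quick illustration of Sc 5 (a sketch with arbitrary values): because every dimension is an exact copy of the first, all distances simply scale by sqrt(dim), so the DMAX/DMIN contrast does not shrink no matter how many dimensions are added.

```python
import numpy as np

rng = np.random.default_rng(4)
base = rng.random(10_000)      # one underlying value per data point
query_base = rng.random()

for dim in (1, 10, 100, 1000):
    data = np.tile(base[:, None], (1, dim))       # every dimension is an exact copy
    query = np.full(dim, query_base)
    dists = np.linalg.norm(data - query, axis=1)  # = sqrt(dim) * |base - query_base|
    print(f"dim={dim:5d}  DMAX/DMIN = {dists.max() / dists.min():.2f}")
```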

33 Scenarios in Practice that are Likely to Give Good Contrast Source: [2]

34 Good Scenarios Con’t Source: [2]

35 Good Scenarios Con’t. When the number of meaningful/relevant dimensions is low, do NN-search on those attributes instead. Projected NN-search: for a given query point, determine which combination of dimensions (axis-parallel projection) is the most meaningful. Meaningfulness is measured by a quality criterion.

36 Projected NN-Search. Quality criterion: a function that rates the quality of a projection based on the query point, the database, and the distance function. Automated approach: determine how similar the histogram of the distance distribution is to a two-peak distance distribution. Two peaks = meaningful projection.
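One simple way such a criterion could be implemented; this is an illustrative sketch (splitting the sorted distances at their largest gap), not the specific quality criterion used in [1]:

```python
import numpy as np

def projection_quality(data, query, dims):
    """Two-peak score for an axis-parallel projection onto `dims`:
    the largest gap in the sorted distances, normalized by the total spread.
    A score near 1 means the distances split into clear 'near' and 'far'
    groups; a score near 0 means no such separation."""
    dists = np.sort(np.linalg.norm(data[:, dims] - query[dims], axis=1))
    spread = dists[-1] - dists[0]
    return float(np.diff(dists).max() / spread) if spread > 0 else 0.0

rng = np.random.default_rng(5)
data = rng.random((2_000, 20))
data[:1_000, 0] += 2.0               # dimension 0 separates the points into two groups
query = rng.random(20)
print(projection_quality(data, query, dims=[0]))  # meaningful projection, higher score
print(projection_quality(data, query, dims=[7]))  # uninformative projection, low score
```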

37 Projected NN-Search Con’t. Since the number of combinations of dimensions is exponential, a heuristic algorithm was used: for the first 3 to 5 dimensions, use a genetic algorithm; a greedy search adds further dimensions; stop after a fixed number of iterations. Alternative to the automated approach: relevant dimensions depend not only on the query point but also on the intentions of the user, so the user should have some say in which dimensions are relevant.

38 Conclusions. Make sure there is enough contrast between the query and data points: if the distance to the NN is not much different from the average distance, the NN may not be meaningful. When evaluating high-dimensional indexing techniques, use data that do not satisfy the conditions of Theorem 1, and compare against a linear scan. Meaningfulness also depends on how the object represented by the data point is described (i.e., the feature vector).

39 Other Issues. After selecting relevant attributes, the dimensionality could still be high. Reporting cases where the data does not yield any meaningful nearest neighbor, i.e., indistinctive nearest neighbors.

40 References
1. Alexander Hinneburg, Charu C. Aggarwal, Daniel A. Keim: What Is the Nearest Neighbor in High Dimensional Spaces? VLDB 2000.
2. Kevin S. Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft: When Is ''Nearest Neighbor'' Meaningful? ICDT'99.
3. Charu C. Aggarwal, Alexander Hinneburg, Daniel A. Keim: On the Surprising Behavior of Distance Metrics in High Dimensional Spaces. ICDT'01.