
Feature Selection
David Mount
CMSC 828K: Algorithms and Data Structures for Information Retrieval

Introduction
Information retrieval for multimedia and other non-text databases:
– Image and multimedia: images, video, audio.
– Medical databases: ECGs, X-rays, MRI scans.
– Time-series data: financial, sales, meteorological, geological data.
– Biology: genomes, proteins.
Query-by-example: find objects that are similar to a given query object.

Distance and Metrics
Similarity is defined in terms of a distance function; metrics are an important subclass.
Metric space: a pair (X, d) where X is a set and d is a distance function such that for all x, y, z in X:
– $d(x,y) \ge 0$, with $d(x,y) = 0$ if and only if $x = y$,
– $d(x,y) = d(y,x)$,
– $d(x,z) \le d(x,y) + d(y,z)$ (triangle inequality).
Hilbert space: a vector space with a norm |v|, which defines a metric d(u, v) = |v - u|.

Examples of Metrics
Minkowski L_p metric: let X = R^k.
– L_1 (Manhattan): $L_1(x,y) = \sum_{i=1}^{k} |x_i - y_i|$
– L_2 (Euclidean): $L_2(x,y) = \left( \sum_{i=1}^{k} (x_i - y_i)^2 \right)^{1/2}$
– L_inf (Chessboard): $L_\infty(x,y) = \max_{1 \le i \le k} |x_i - y_i|$
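Not part of the original slides: a minimal NumPy sketch of the three Minkowski distances above; the function name minkowski is chosen for illustration.

```python
import numpy as np

def minkowski(x, y, p):
    """L_p distance between two vectors; p = np.inf gives the chessboard metric."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        return diff.max()
    return (diff ** p).sum() ** (1.0 / p)

x, y = [0.0, 3.0], [4.0, 0.0]
print(minkowski(x, y, 1))        # 7.0 (Manhattan)
print(minkowski(x, y, 2))        # 5.0 (Euclidean)
print(minkowski(x, y, np.inf))   # 4.0 (Chessboard)
```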

More Metrics
Hamming distance: let X = {0,1}^k. The number of 1-bits in the exclusive-or of x and y, i.e., the number of positions in which x and y differ.
Edit distance: let X = Σ*. The minimum number of single-character changes (insert, delete, substitute) needed to convert x into y.
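A sketch (not from the slides) of the edit-distance computation referenced here and on the next slide; this is the standard dynamic-programming recurrence, which indeed takes quadratic time.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions needed to turn x into y (quadratic time)."""
    m, n = len(x), len(y)
    prev = list(range(n + 1))        # distances between x[:0] and prefixes of y
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete x[i-1]
                          curr[j - 1] + 1,     # insert y[j-1]
                          prev[j - 1] + cost)  # substitute
        prev = curr
    return prev[n]

print(edit_distance("kitten", "sitting"))  # 3
```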

Objective
Efficiency issues:
– Edit distance requires quadratic time to compute.
– Data structures: many efficient data structures exist for storing and searching vector data, but not for arbitrary metric data.
– Curse of dimensionality: most nearest-neighbor algorithms have running times that grow exponentially with the dimension.
Objective: embed the objects into a low-dimensional space and use a Minkowski metric, allowing a small distortion in distances.

Isometries and Embeddings
Isometry: a distance-preserving mapping f from metric space (A, d_A) to (B, d_B):
$d_B(f(x), f(y)) = d_A(x, y)$
Contractive: a mapping f that does not increase distances:
$d_B(f(x), f(y)) \le d_A(x, y)$
For a contractive embedding,
$\frac{d_A(x, y)}{c} \le d_B(f(x), f(y)) \le d_A(x, y)$
where c is the distortion of the embedding.

Feature Selection
Main problem: given a finite metric space (X, d) with |X| = n, map it to some Hilbert space, say (R^k, L_p). Ideally both the dimension and the distortion should be as low as possible.
Feature selection: we can think of each coordinate of the resulting vector as describing some feature of the object. Henceforth, "feature" means "coordinate."

Overview
We present methods for mapping a metric space (X, d) to (R^k, L_p). Let |X| = n.
– Multidimensional scaling (MDS): a classical method for embedding metric spaces into Hilbert spaces.
– Lipschitz embeddings: from any finite metric space to (R^k, L_p), where k is O(log^2 n), with O(log n) distortion.
– SparseMap: a practical variant of LLR embeddings.
– KL-transform and FastMap: methods based on projections onto lower-dimensional spaces.

Multidimensional Scaling
Finding the best mapping of a metric space (X, d) to a k-dimensional Hilbert space (R^k, d') is a nonlinear optimization problem. The objective function is the stress between the original and embedded distances:
$\mathrm{stress} = \sqrt{\frac{\sum_{i,j} \big(d'(x_i, x_j) - d(x_i, x_j)\big)^2}{\sum_{i,j} d(x_i, x_j)^2}}$
Standard iterative methods (e.g., steepest descent) can be used to search for a mapping of minimal stress.
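The following sketch is not part of the original slides; it illustrates the steepest-descent idea by minimizing the (unnormalized) sum of squared residuals between the embedded and given distances, a simplification of the stress formula above. The function name mds_stress, the learning rate, and the iteration count are illustrative.

```python
import numpy as np

def mds_stress(D, k=2, iters=500, lr=0.01, seed=0):
    """Gradient-descent MDS sketch: place n points in R^k so that their
    Euclidean distances approximate the given distance matrix D."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    Y = rng.normal(size=(n, k))                      # random initial configuration
    for _ in range(iters):
        diff = Y[:, None, :] - Y[None, :, :]         # pairwise coordinate differences
        dist = np.linalg.norm(diff, axis=2)          # current embedded distances
        np.fill_diagonal(dist, 1.0)                  # avoid division by zero
        err = dist - D                               # residuals vs. target distances
        np.fill_diagonal(err, 0.0)
        grad = (err / dist)[:, :, None] * diff       # gradient direction (up to a constant)
        Y -= lr * grad.sum(axis=1)
    return Y

# Four corners of a unit square: MDS should recover the square up to rotation.
D = np.array([[0, 1, 1, np.sqrt(2)],
              [1, 0, np.sqrt(2), 1],
              [1, np.sqrt(2), 0, 1],
              [np.sqrt(2), 1, 1, 0]])
Y = mds_stress(D, k=2)
```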

Limitations
MDS has a number of limitations that make it difficult to apply to information retrieval:
– All O(n^2) distances in the database need to be computed in the embedding process. This is too expensive for most real databases.
– To map a query point into the embedded space, a similar optimization problem must be solved, requiring the computation of O(n) distances.

Lipschitz Embeddings
Each coordinate (feature) is the distance to the closest point of some subset of X.
Bourgain (1985): any n-point metric space (X, d) can be embedded into O(log n)-dimensional Euclidean space with O(log n) distortion.
Linial, London, and Rabinovich (1995): generalized the result to any L_p metric and showed how to compute the embedding with a randomized algorithm. The dimension increases to O(log^2 n).

LLR Embedding Algorithm
Let q = O(log n). [The constant affects the probability of success.]
For i = 1, 2, ..., lg n do
  For j = 1 to q do
    A_{i,j} = a random subset of X of size 2^i.
Map x to the vector {d_{i,j}} / Q, where d_{i,j} is the distance from x to the closest point in A_{i,j} and Q is the total number of subsets.
[Figure: a small example point set with sampled subsets A_{1,1}, ..., A_{2,3}.]
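A minimal sketch of the algorithm above, assuming an L_1 target metric and NumPy arrays of points; the function name llr_embedding and the specific choice q = ceil(log2 n) are illustrative, not from the slides.

```python
import numpy as np

def llr_embedding(points, dist, q=None, seed=0):
    """LLR (Bourgain-style) embedding sketch: each coordinate of the image of x
    is the (scaled) distance from x to the closest point of a random subset,
    with subset sizes 2^1, 2^2, ..., each size sampled q times."""
    rng = np.random.default_rng(seed)
    n = len(points)
    lg_n = int(np.ceil(np.log2(n)))
    q = q or max(1, int(np.ceil(np.log2(n))))        # q = O(log n)
    subsets = []
    for i in range(1, lg_n + 1):
        for _ in range(q):
            size = min(2 ** i, n)
            subsets.append(rng.choice(n, size=size, replace=False))
    Q = len(subsets)
    # Coordinate c of the embedding of x: d(x, A_c) / Q.
    emb = np.empty((n, Q))
    for x in range(n):
        for c, A in enumerate(subsets):
            emb[x, c] = min(dist(points[x], points[a]) for a in A) / Q
    return emb

# Example with Euclidean input points; the embedded space is used with L_1.
pts = np.random.default_rng(1).normal(size=(16, 3))
E = llr_embedding(pts, dist=lambda a, b: np.linalg.norm(a - b))
```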

Why does it work?
Assume the target is the L_1 metric. Let d(x, A) denote the distance from x to the closest point in A. Observe that for any x, y and any subset A, by the triangle inequality we have
$|d(x, A) - d(y, A)| \le d(x, y)$
Hence the mapping is contractive.
[Figure: points x, y and a subset A, illustrating d(x, y), d(x, A), and d(y, A).]

Why does it work?
Let B(x, r_x) denote the set of points within distance r_x of x. Suppose r_y < r_x and A is a subset such that
$A \cap B(x, r_x) = \emptyset \quad\text{and}\quad A \cap B(y, r_y) \ne \emptyset,$
then $d(x, A) \ge r_x$ and $d(y, A) \le r_y$, so $|d(x, A) - d(y, A)| \ge r_x - r_y$.
We will show that this happens often.
[Figure: balls of radius r_x around x and r_y around y, with A avoiding the first and hitting the second.]

Why does it work?
Fix points x and y. For t = 1, 2, ..., let r_t be the smallest value such that
$|B(x, r_t)| \ge 2^t \quad\text{and}\quad |B(y, r_t)| \ge 2^t.$
Repeat as long as r_t < d(x, y)/2. Note that for all such r_t, the balls B(x, r_t) and B(y, r_t) are disjoint.
[Figure: disjoint balls of radius r_t around x and y, shown for t = 2.]

Why does it work?
When the algorithm picks a random subset A of size roughly n/2^t, with constant probability (> 1/10)
$A \cap B(x, r_t) = \emptyset \quad\text{and}\quad A \cap B(y, r_{t-1}) \ne \emptyset,$
or vice versa (with the roles of x and y exchanged). In either case $|d(x, A) - d(y, A)| \ge r_t - r_{t-1}$.
Repeating this q = O(log n) times, the expected number of such subsets is at least q/10.
[Figure: the ball of radius r_t around x and the ball of radius r_{t-1} around y, with A hitting only the latter.]

Why does it work?
Repeating over all q subsets in this size group, with high probability we have
$\sum_{j=1}^{q} |d(x, A_{t,j}) - d(y, A_{t,j})| \ge \frac{q}{10}\,(r_t - r_{t-1}).$
Repeating this for all the size groups t = 1, 2, ..., with high probability we have
$\sum_{i,j} |d(x, A_{i,j}) - d(y, A_{i,j})| \ge \frac{q}{10} \sum_t (r_t - r_{t-1}) \ge \frac{q}{10} \cdot \frac{d(x, y)}{2}.$

Review of the proof
– The sum above is Ω(q · d(x,y)) = Ω(d(x,y) log n); since each coordinate is divided by Q = O(log^2 n), the embedded distance is Ω(d(x,y)/log n), and together with contractiveness the final distortion is O(log n).
– When we sample subsets of size n/2^t, we have a constant probability of finding subsets whose corresponding coordinates differ by at least r_t - r_{t-1}.
– We repeat the sample O(log n) times so that this occurs with high probability.
– There are O(log n) size groups, so the total dimension is O(log^2 n).
– The r_t - r_{t-1} terms telescope, totaling d(x,y)/2, and the repeated sampling adds an O(log n) factor.

How low can you go?
Folklore: any n-point metric space can be embedded into (R^n, L_inf) with distortion 1. (Each point is mapped to the vector of its distances to all the points.)
Bourgain: there is an n-point metric space such that any embedding into (R^k, L_2) has distortion at least Ω(log n / log log n).
Johnson-Lindenstrauss: you can embed n points of (R^t, L_2) into (R^k, L_2), where k = O((log n)/ε^2), with distortion 1+ε. (Select k unit vectors of R^t at random; each coordinate is the length of the orthogonal projection onto one of these vectors.)
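A sketch of the Johnson-Lindenstrauss idea, using Gaussian random directions rather than exact random unit vectors (a common variant); the constant 4 in the choice of k and the scaling by 1/sqrt(k) are illustrative.

```python
import numpy as np

def jl_project(X, eps=0.25, seed=0):
    """Project n points in R^t onto k = O(log n / eps^2) random directions;
    pairwise Euclidean distances are preserved up to a factor 1 +/- eps
    with high probability."""
    n, t = X.shape
    k = int(np.ceil(4 * np.log(n) / eps ** 2))   # constant 4 is illustrative
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(t, k)) / np.sqrt(k)     # random Gaussian directions, scaled
    return X @ R

X = np.random.default_rng(1).normal(size=(100, 1000))
Y = jl_project(X)
# Compare a pairwise distance before and after projection.
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(Y[0] - Y[1]))
```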

Limitations
Limitations of the LLR embedding:
– O(log n) distortion: experiments show that the actual distortions may be much smaller.
– O(log^2 n) dimension: this is a real problem.
– O(n^2) distance computations must be performed to build the embedding, and embedding a query point requires O(n) distance computations: too expensive if the distance function is costly to evaluate.

SparseMap
SparseMap (Hristescu and Farach-Colton) is a variant of the LLR embedding:
– Incremental construction of features: once the first k coordinates have been computed, they (rather than the original distance function) can be used to perform...
– Distance approximation: for computing the distance from a point x to a subset A_{i,j}.
– Greedy resampling: keep only the best (most discriminating) coordinates.

Distance Approximation
Suppose that the first k' coordinates have been selected. This gives a partial distance estimate d'_{k'}(x, y), the distance between the first k' coordinates of the embedded points. To compute d(x, A) for a new subset A:
– Compute d'_{k'}(x, y) for each y in A.
– Select the y with the smallest such distance estimate.
– Return d(x, y).
Only one true distance calculation is performed, as opposed to |A|. The total number of distance computations is O(kn).
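An illustrative sketch (not from the SparseMap paper) of this distance-approximation step: rank the candidates by the partial embedded distance and spend only one true distance computation. The names approx_dist_to_subset, emb, and dist are assumptions, and the partial estimate is taken to be an L_1 distance on the coordinates computed so far.

```python
import numpy as np

def approx_dist_to_subset(x_idx, A, emb, dist, points):
    """Approximate d(x, A): rank candidates y in A by the partial embedded
    distance (emb holds the first k' coordinates of every point), then
    evaluate the true distance only for the best-ranked candidate."""
    partial = [np.abs(emb[x_idx] - emb[y]).sum() for y in A]   # L1 on known coords
    best = A[int(np.argmin(partial))]
    return dist(points[x_idx], points[best])                   # one true distance call
```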

Greedy Resampling
Suppose that k coordinates have been computed and we want to keep only the best ones:
– Sample some number of random pairs (x, y) from the database and compute their true distances d(x, y).
– Select the coordinate (subset A) that minimizes the stress between the true and estimated distances on these pairs.
– Repeat, selecting the coordinate which, together with the previously selected coordinates, minimizes the stress.
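A sketch of greedy resampling under the assumption that the coordinate-wise estimate is an L_1 distance and that stress is measured as defined on the MDS slide; the function and variable names are illustrative.

```python
import numpy as np

def greedy_resample(emb, pairs, true_d, keep):
    """From the k computed coordinates in emb (shape n x k), greedily keep the
    `keep` coordinates that, added one at a time, best reduce the stress on a
    sample of pairs with known true distances true_d."""
    def stress(coords):
        est = np.array([np.abs(emb[i, coords] - emb[j, coords]).sum()
                        for i, j in pairs])                 # L1 estimate on chosen coords
        return np.sqrt(((est - true_d) ** 2).sum() / (true_d ** 2).sum())

    chosen, remaining = [], list(range(emb.shape[1]))
    for _ in range(keep):
        best = min(remaining, key=lambda c: stress(chosen + [c]))
        chosen.append(best)
        remaining.remove(best)
    return chosen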

KL-Transform
Karhunen-Loève transform: given n points (vectors) in R^t, project them into a lower-dimensional space R^k so as to minimize the mean squared error.

KL-Transform
– Translate the points so that their mean coincides with the origin. Let X denote the matrix whose columns are the resulting vectors.
– Compute the covariance matrix $\Sigma = X X^T$.
– Compute the eigenvectors $e_i$ and eigenvalues $\lambda_i$ of $\Sigma$, sorted in decreasing order of eigenvalue.
– Project the columns of X orthogonally onto the subspace spanned by $e_1, \ldots, e_k$.
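A compact sketch of these steps using NumPy's symmetric eigendecomposition; it stores one point per row (so the covariance is computed as X^T X rather than X X^T as on the slide), and the name kl_transform is illustrative.

```python
import numpy as np

def kl_transform(points, k):
    """KL/PCA sketch: center the points, eigendecompose the covariance matrix,
    and project onto the k eigenvectors with the largest eigenvalues.
    `points` has one point per row (n x t)."""
    X = points - points.mean(axis=0)             # translate the mean to the origin
    cov = X.T @ X / len(X)                       # t x t covariance matrix
    evals, evecs = np.linalg.eigh(cov)           # symmetric eigendecomposition (ascending)
    top = evecs[:, np.argsort(evals)[::-1][:k]]  # k leading eigenvectors
    return X @ top                               # n x k projected coordinates

pts = np.random.default_rng(0).normal(size=(200, 10))
low = kl_transform(pts, k=3)
```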

FastMap
The KL-transform assumes that the points are in R^t. What if we only have a finite metric space (X, d)?
Faloutsos and Lin (1995) proposed FastMap as a metric analogue to the KL-transform. Imagine that the points lie in a Euclidean space:
– Select two pivot points x_a and x_b that are far apart.
– Compute a pseudo-projection of the remaining points along the "line" x_a x_b.
– "Project" the points onto an orthogonal subspace and recurse.

Selecting the Pivot Points
The pivot points should lie along the principal axes, and hence should be far apart:
– Select any point x_0.
– Let x_1 be the point furthest from x_0.
– Let x_2 be the point furthest from x_1.
– Return (x_1, x_2).
[Figure: points x_0, x_1, x_2, with x_1 and x_2 at opposite extremes of the point set.]

Pseudo-Projections
Given the pivots (x_a, x_b), for any third point y we use the law of cosines to determine the position of y along the line x_a x_b. The pseudo-projection of y is
$c_y = \frac{d_{a,y}^2 + d_{a,b}^2 - d_{b,y}^2}{2\, d_{a,b}}$
This is the first coordinate.
[Figure: the triangle x_a, x_b, y with side lengths d_{a,y}, d_{b,y}, d_{a,b}; c_y is the projection of y onto x_a x_b.]

"Project to orthogonal plane"
Given the coordinates along x_a x_b, we can compute the distances within the "orthogonal hyperplane" using the Pythagorean theorem:
$d'(y', z')^2 = d(y, z)^2 - (c_y - c_z)^2$
Using d'(., .) as the new distance function, recurse until k features have been chosen.
[Figure: points y, z and their projections y', z' onto the hyperplane orthogonal to x_a x_b.]
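A sketch of the full FastMap recursion, combining the pivot heuristic, the pseudo-projection formula, and the orthogonal-plane update; it works on squared distances so the update is a single subtraction. The function name, the random pivot seed, and the single furthest-point pass are illustrative simplifications.

```python
import numpy as np

def fastmap(points, dist, k, seed=0):
    """FastMap sketch: embed a finite metric space into R^k.  Each round picks
    two far-apart pivots, computes the law-of-cosines coordinate c_y for every
    point, and recurses on the residual squared distances."""
    rng = np.random.default_rng(seed)
    n = len(points)
    coords = np.zeros((n, k))
    # Squared distance matrix; squared form makes the orthogonal-plane update a subtraction.
    d2 = np.array([[dist(points[i], points[j]) ** 2 for j in range(n)] for i in range(n)])

    for col in range(k):
        # Pivot heuristic: start anywhere, take the furthest point, then its furthest point.
        a = int(rng.integers(n))
        b = int(np.argmax(d2[a]))
        a = int(np.argmax(d2[b]))
        dab2 = d2[a, b]
        if dab2 <= 0:                      # residual distances have collapsed
            break
        dab = np.sqrt(dab2)
        # Pseudo-projection (law of cosines) for every point.
        c = (d2[a] + dab2 - d2[b]) / (2 * dab)
        coords[:, col] = c
        # Residual squared distances in the orthogonal hyperplane.
        d2 = d2 - (c[:, None] - c[None, :]) ** 2

    return coords

pts = np.random.default_rng(1).normal(size=(50, 8))
emb = fastmap(pts, dist=lambda u, v: np.linalg.norm(u - v), k=3)
```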

Experimental Results
Faloutsos and Lin present experiments showing that FastMap is faster than MDS for a given level of stress, but has much higher stress than MDS. All data sets were relatively small.
Hristescu and Farach-Colton present experiments on protein data showing that SparseMap is considerably faster than FastMap and performs marginally better with respect to stress and cluster preservation.

Summary
Lipschitz embeddings: embed metric spaces using distances to subsets.
– O(log^2 n) dimensions and O(log n) distortion.
– Likely the best from a theoretical point of view.
– SparseMap is a practical variant of this idea.
Johnson-Lindenstrauss: any n-point set in Euclidean space can be embedded into O(log n) dimensions (with 1+ε distortion).
KL-transform: an embedding that minimizes the mean squared error; FastMap mimics this idea in metric spaces.

Bibliography
G. Hjaltason and H. Samet, "Contractive embedding methods for similarity searching in general metric spaces", manuscript.
J. Bourgain, "On Lipschitz embedding of finite metric spaces in Hilbert space", Israel J. of Math., 52, 1985.
G. Hristescu and M. Farach-Colton, "Cluster preserving embeddings of proteins", DIMACS Technical Report.

Bibliography (continued)
N. Linial, E. London and Y. Rabinovich, "The Geometry of Graphs and some of its algorithmic applications", Combinatorica, 15, 1995.
C. Faloutsos and K.-I. Lin, "FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets", Proc. ACM SIGMOD, 1995.
K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press.
F. W. Young and R. M. Hamer, Multidimensional Scaling: History, Theory and Applications, Lawrence Erlbaum Associates, Hillsdale, NJ, 1987.