Slide 1: Dimensionality Reduction
Given N vectors in n dimensions, find the k most important axes to project them onto; k is user-defined (k < n).
Applications in information retrieval and indexing:
- identify the k most important features, or
- reduce the indexing dimensions for faster retrieval (low-dimensional indices are faster).
Slide 2: Techniques
Eigenvalue analysis techniques [NR'92]:
- Karhunen-Loeve (K-L) transform
- Singular Value Decomposition (SVD)
- both need O(N^2) time
FastMap [Faloutsos & Lin 95]:
- dimensionality reduction and mapping of objects to vectors
- O(N) time
Slide 3: Mathematical Preliminaries
For an n x n square matrix S, a unit vector x, and a scalar value λ:
  Sx = λx
- x: eigenvector of S
- λ: eigenvalue of S
The eigenvectors of a symmetric matrix (S = S^T) are mutually orthogonal and its eigenvalues are real.
Rank r of a matrix: the maximum number of linearly independent columns or rows.
Slide 4: Example 1
Intuition: S defines an affine transform y = Sx that involves scaling and rotation.
- eigenvectors: unit vectors along the new directions
- eigenvalues denote the scaling along each direction
[Figure: the eigenvector of the major axis.]
Slide 5: Example 2
If S is real and symmetric (S = S^T), then it can be written as S = UΛU^T, where
- the columns of U are the eigenvectors of S
- U is column-orthonormal (UU^T = I)
- Λ is diagonal, with the eigenvalues of S on the diagonal.
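As a quick numerical illustration (not from the slides), the decomposition can be checked with NumPy; numpy.linalg.eigh is specialized for symmetric matrices:

```python
import numpy as np

# A small real, symmetric matrix S
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh returns real eigenvalues (ascending) and orthonormal eigenvectors
eigenvalues, U = np.linalg.eigh(S)
Lambda = np.diag(eigenvalues)

# Verify S = U Λ U^T and U U^T = I
assert np.allclose(U @ Lambda @ U.T, S)
assert np.allclose(U @ U.T, np.eye(2))
print(eigenvalues)   # [1. 3.] -- real, as expected for S = S^T
```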
Slide 6: Karhunen-Loeve (K-L)
Project onto a k-dimensional space (k < n), minimizing the error of the projections (sum of squared differences).
K-L gives a linear combination of axes, sorted by importance:
- keep the first k dimensions
[Figure: 2-dimensional points and the two K-L directions; for k = 1, keep x'.]
Slide 7: Computation of K-L
1. Put the N vectors in the rows of A = [a_ij].
2. Compute B = [a_ij − ā_j], where ā_j is the average of column j.
3. Covariance matrix: C = B^T B.
4. Compute the eigenvectors of C.
5. Sort them in decreasing order of eigenvalue.
6. Approximate each object by its projections on the directions of the first k eigenvectors.
(A minimal NumPy sketch of these steps follows.)
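A minimal sketch of the computation above (illustrative, not the author's code); it assumes the N input vectors sit in the rows of a NumPy array A:

```python
import numpy as np

def kl_transform(A, k):
    """Project the rows of A (N x n) onto the first k K-L directions."""
    B = A - A.mean(axis=0)                   # subtract column averages
    C = B.T @ B                              # attribute-to-attribute (covariance) matrix
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    order = np.argsort(eigenvalues)[::-1]    # decreasing eigenvalue order
    W = eigenvectors[:, order[:k]]           # n x k: the k most important directions
    return B @ W                             # N x k projections

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
X = kl_transform(A, k=2)                     # 100 vectors, reduced to 2 dims
print(X.shape)                               # (100, 2)
```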
Slide 8: Intuition
- B shifts the origin to the center of gravity of the vectors (by subtracting the column averages) and has zero column mean.
- C represents attribute-to-attribute similarity; C is square, real, and symmetric.
- The eigenvectors and eigenvalues are computed on C, not on A.
- C denotes the affine transform that minimizes the error.
- Approximate each vector by its projections along the first k eigenvectors.
Slide 9: Example
Input vectors: [1 2], [1 1], [0 0].
The column averages are 2/3 and 1.
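Continuing the example numerically (a sketch; the intermediate values in the comments follow directly from the definitions on slide 7):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [1.0, 1.0],
              [0.0, 0.0]])

B = A - A.mean(axis=0)       # subtract column averages [2/3, 1]
C = B.T @ B                  # covariance matrix: [[2/3, 1], [1, 2]]
eigenvalues, eigenvectors = np.linalg.eigh(C)
print(eigenvalues)           # ~[0.13, 2.54]; for k = 1, keep the direction of the larger one
```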
Slide 10: SVD
Works for general rectangular matrices: an N x n matrix (N vectors, n dimensions).
- groups similar entities (documents) together
- groups similar terms together; each group of terms corresponds to a concept
Given an N x n matrix A, write it as A = UΛV^T, where
- U: N x r column-orthonormal matrix (r: the rank of A)
- Λ: r x r diagonal matrix of singular values (non-negative, in descending order)
- V: n x r column-orthonormal matrix (so V^T is r x n).
Slide 11: SVD (cont'd)
Equivalently, A = λ_1 u_1 v_1^T + λ_2 u_2 v_2^T + ... + λ_r u_r v_r^T, where u_i, v_i are the column vectors of U and V.
SVD identifies rectangular "blobs" of related values in A; the rank r of A is the number of blobs.
Slide 12: Example
Two types of documents: CS and Medical. Two concepts (groups of terms):
- CS: data, information, retrieval
- Medical: brain, lung

Term/Document   data  information  retrieval  brain  lung
CS-TR1            1        1           1        0      0
CS-TR2            2        2           2        0      0
CS-TR3            1        1           1        0      0
CS-TR4            5        5           5        0      0
MED-TR1           0        0           0        2      2
MED-TR2           0        0           0        3      3
MED-TR3           0        0           0        1      1
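Running an off-the-shelf SVD on this matrix exposes the two blobs (an illustrative sketch; note that NumPy's svd returns V^T directly):

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10))   # numerical rank: 2 (two blobs/concepts)
print(r)                     # 2
print(np.round(s[:2], 2))    # the two non-zero singular values (~9.64, ~5.29)
print(np.round(Vt[:2], 2))   # rows = concepts: CS terms load on one, medical on the other
```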
Slide 13: Example (cont'd)
[Figure: the factors U, Λ, V^T of the term/document matrix above, with rank r = 2.]
- U: document-to-concept similarity matrix
- V: term-to-concept similarity matrix
- v_12 = 0: the term "data" has zero similarity with the 2nd (medical) concept.
Slide 14: SVD and LSI
SVD leads to "Latent Semantic Indexing" (see http://lsi.research.telcordia.com/lsi/LSIpapers.html).
- Terms that occur together are grouped into concepts.
- When a user searches for a term, the system determines the relevant concepts to search.
- LSI maps concepts to vectors in the concept space instead of the n-dimensional document space.
- The concept space has lower dimensionality.
Slide 15: Examples of Queries
Find documents with the term "data":
- translate the query vector q to the concept space
- the query turns out to be related to the CS concept and unrelated to the medical concept
- LSI therefore also returns documents containing the terms "retrieval" and "information", which are not specified by the query.
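A sketch of the query translation (illustrative; it reuses the SVD of the slide-12 matrix and projects the term vector q onto the concept axes; singular vectors are defined only up to sign):

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0], [0, 0, 0, 2, 2], [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt[:2].T                       # n x 2 term-to-concept matrix

q = np.array([1.0, 0, 0, 0, 0])    # query: just the term "data"
q_concept = q @ V                  # translate to concept space
print(np.round(q_concept, 2))      # ~[0.58, 0.0] up to sign: CS concept only
```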
Slide 16: FastMap
Works with distances only and has two roles:
1. Maps objects to vectors so that their distances are preserved (then apply Spatial Access Methods, SAMs, for indexing).
2. Dimensionality reduction: given N vectors with n attributes each, find N vectors with k attributes such that the distances are preserved as much as possible.
Slide 17: Main Idea
Pretend that the objects are points in some unknown n-dimensional space:
- project these points on k mutually orthogonal axes
- compute the projections using distances only
The heart of FastMap is the method that projects two objects on a line:
- take two objects that are far apart (the pivots)
- project all objects on the line that connects the pivots.
Slide 18: Project Objects on a Line
O_a, O_b: pivots; O_i: any object; d_ij: shorthand for D(O_i, O_j); x_i: the first coordinate in the k-dimensional space.
If O_i is close to O_a, then x_i is small.
Applying the cosine law to the triangle (O_a, O_i, O_b):
  d_bi^2 = d_ai^2 + d_ab^2 − 2 x_i d_ab
and solving for x_i:
  x_i = (d_ai^2 + d_ab^2 − d_bi^2) / (2 d_ab)
(A one-function sketch follows.)
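The projection formula in code (a sketch; dist is an assumed distance function over object identifiers):

```python
import math

def project_on_line(i, a, b, dist):
    """First coordinate of object i on the line through pivots a, b (cosine law)."""
    d_ai, d_bi, d_ab = dist(a, i), dist(b, i), dist(a, b)
    return (d_ai**2 + d_ab**2 - d_bi**2) / (2.0 * d_ab)

# Usage on Euclidean points, where the result can be checked by eye:
points = {0: (0.0, 0.0), 1: (4.0, 0.0), 2: (1.0, 2.0)}
dist = lambda p, q: math.dist(points[p], points[q])
print(project_on_line(2, 0, 1, dist))   # 1.0: the x-coordinate of (1, 2) on the 0-1 line
```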
Slide 19: Choose Pivots
Heuristic (as in the FastMap paper):
1. Pick an arbitrary object O_b.
2. Let O_a be the object farthest from O_b.
3. Let O_b be the object farthest from O_a.
Complexity: O(N); the optimal algorithm (the truly farthest pair) would require O(N^2) time.
Steps 2-3 can be repeated 4-5 times to improve the accuracy of the selection (see the sketch below).
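A sketch of the heuristic (illustrative; objects is a list of identifiers and dist the same assumed distance function as above):

```python
import random

def choose_pivots(objects, dist, iterations=5):
    """Linear-time heuristic for finding two far-apart objects."""
    b = random.choice(objects)
    a = b
    for _ in range(iterations):                        # repeating improves the choice
        a = max(objects, key=lambda o: dist(b, o))     # farthest from b
        b = max(objects, key=lambda o: dist(a, o))     # farthest from a
    return a, b
```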
Slide 20: Extension to Many Dimensions
Consider the (n−1)-dimensional hyperplane H that is perpendicular to the line O_a O_b:
- project the objects on H and apply the previous step
- choose two new pivots
- the new x_i becomes the next coordinate of each object
- repeat until k-dimensional vectors are obtained.
The distance on H is not D; call D' the distance between the projected objects.
Slide 21: Distance on the Hyperplane H
D' on H can be computed from the Pythagorean theorem:
  D'(O_i', O_j')^2 = D(O_i, O_j)^2 − (x_i − x_j)^2
The ability to compute D' allows a second line to be computed on H, and so on recursively.
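In code, the reduced distance is one line (a sketch; x_i, x_j are the coordinates computed at the previous step):

```python
import math

def dist_on_hyperplane(d_ij, x_i, x_j):
    """Distance between the projections on H, from the original distance d_ij."""
    # clamp at 0: guards against non-Euclidean inputs (see slide 24)
    return math.sqrt(max(d_ij**2 - (x_i - x_j)**2, 0.0))
```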
Slide 22: Algorithm
[The slide shows the FastMap pseudocode as a figure; a self-contained sketch follows.]
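A compact sketch of the whole algorithm under the assumptions above (objects are identifiers, dist a distance function; the pivot heuristic and projection formula are those of slides 18-21). This is an illustration, not the authors' code:

```python
import math
import random

def fastmap(objects, dist, k):
    """Map each object to a k-dimensional vector, approximately preserving distances."""
    coords = {o: [] for o in objects}
    pivots = []                                # recorded per dimension, for later queries

    def d(i, j, level):                        # D' at the given recursion level
        d2 = dist(i, j)**2
        for t in range(level):                 # Pythagorean reduction per earlier axis
            d2 -= (coords[i][t] - coords[j][t])**2
        return math.sqrt(max(d2, 0.0))         # clamp negatives (non-Euclidean data)

    for level in range(k):
        b = random.choice(objects)             # pivot heuristic: farthest-from-farthest
        for _ in range(5):
            a = max(objects, key=lambda o: d(b, o, level))
            b = max(objects, key=lambda o: d(a, o, level))
        pivots.append((a, b))
        d_ab = d(a, b, level)
        for o in objects:
            if d_ab == 0.0:                    # all projected points coincide
                x = 0.0
            else:                              # cosine-law projection on the pivot line
                x = (d(a, o, level)**2 + d_ab**2 - d(b, o, level)**2) / (2.0 * d_ab)
            coords[o].append(x)
    return coords, pivots
```

For instance, with Euclidean points and dist = lambda i, j: math.dist(points[i], points[j]), fastmap(list(points), dist, 2) returns 2-dimensional coordinates plus the recorded pivots, so a query can later be projected on the same pivot lines with O(1) distance computations per dimension (slide 23).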
Slide 23: Observations
Complexity: O(kN) distance calculations:
- k: the desired dimensionality
- k recursive calls, each taking O(N).
The algorithm records the pivots of each call (dimension) to facilitate queries:
- a query is mapped to a k-dimensional vector by projecting it on the pivot lines of each dimension
- O(1) distance computations per step: no need to recompute pivots.
Slide 24: Observations (cont'd)
- The projected vectors can be indexed; mapping to 2-3 dimensions also allows visualization of the data space.
- FastMap assumes a Euclidean space (triangle inequality), which is not always true (at least after the second step).
- Pivots are approximate, and some computed squared distances can become negative; turn negative distances into 0.
Slide 25: Application: Document Vectors
[Figure: documents mapped to vectors.]
Slide 26: FastMap on 10 Documents
[Figure: FastMap applied to 10 documents, for (a) k = 2 and (b) k = 3.]
Slide 27: References
- C. Faloutsos, Searching Multimedia Databases by Content, Kluwer, 1996.
- W. Press et al., Numerical Recipes in C, Cambridge Univ. Press, 1988.
- LSI website: http://lsi.research.telcordia.com/lsi/LSIpapers.html
- C. Faloutsos and K.-I. Lin, "FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets", Proc. ACM SIGMOD, 1995.