E.G.M. Petrakis, Dimensionality Reduction

- Given N vectors in n dimensions, find the k most important axes on which to project them
- k is user defined (k < n)
- Applications: information retrieval and indexing
  - identify the k most important features, or
  - reduce the indexing dimensions for faster retrieval (low-dimensional indices are faster)

Techniques
- Eigenvalue analysis techniques [NR’92]
  - Karhunen-Loeve (K-L) transform
  - Singular Value Decomposition (SVD)
  - both need O(N^2) time
- FastMap [Faloutsos & Lin 95]
  - dimensionality reduction and
  - mapping of objects to vectors
  - O(N) time

Mathematical Preliminaries
- For an n x n square matrix S, a unit vector x and a scalar λ such that Sx = λx:
  - x: eigenvector of S
  - λ: eigenvalue of S
- The eigenvectors of a symmetric matrix (S = S^T) are mutually orthogonal and its eigenvalues are real
- Rank r of a matrix: the maximum number of linearly independent columns (or rows)

Example 1
- Intuition: S defines an affine transform y = Sx that involves scaling and rotation
  - eigenvectors: unit vectors along the new directions
  - eigenvalues: the scaling along each of these directions
[Figure: eigenvector of the major axis]

Example 2
- If S is real and symmetric (S = S^T), then it can be written as S = U Λ U^T
  - the columns of U are the eigenvectors of S
  - U: column orthogonal (U U^T = I)
  - Λ: diagonal, with the eigenvalues of S
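As a small illustration of the factorization S = U Λ U^T, the following worked example uses a 2 x 2 symmetric matrix chosen here for convenience (it is an assumption of this sketch, not the matrix from the original slide).

```latex
S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix},
\qquad
\lambda_1 = 3,\; u_1 = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix},
\qquad
\lambda_2 = 1,\; u_2 = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix}

U \Lambda U^T
= \tfrac{1}{2}
  \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
  \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}
  \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
= \tfrac{1}{2} \begin{pmatrix} 4 & 2 \\ 2 & 4 \end{pmatrix}
= S
```

The two eigenvectors are orthogonal (u_1 · u_2 = 0) and both eigenvalues are real, as expected for a symmetric matrix.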

Karhunen-Loeve (K-L)
- Project onto a k-dimensional space (k < n) while minimizing the error of the projections (sum of squared differences)
- K-L gives a linear combination of axes
  - sorted by importance
  - keep the first k dimensions
[Figure: 2-dimensional points and the 2 K-L directions; for k = 1 keep x’]

Computation of K-L
- Put the N vectors as rows in A = [a_ij]
- Compute B = [a_ij - a_p], where a_p is the average of column j: a_p = (1/N) Σ_i a_ij
- Covariance matrix: C = B^T B
- Compute the eigenvectors of C
- Sort them in decreasing eigenvalue order
- Approximate each object by its projections on the directions of the first k eigenvectors
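The steps above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the function name `kl_transform` is hypothetical, and the eigenvectors are found by power iteration with deflation, a technique swapped in here for brevity (it assumes C is symmetric and positive semi-definite, which holds for C = B^T B); a numerical library would normally be used instead.

```python
import math

def kl_transform(A, k, iters=500):
    """Return ([(eigenvalue, eigenvector), ...], projections) for the top k axes."""
    n_rows, n = len(A), len(A[0])
    # Step 1: subtract the column averages so every column of B has zero mean.
    means = [sum(row[j] for row in A) / n_rows for j in range(n)]
    B = [[row[j] - means[j] for j in range(n)] for row in A]
    # Step 2: covariance matrix C = B^T B (n x n, symmetric).
    C = [[sum(row[i] * row[j] for row in B) for j in range(n)] for i in range(n)]
    axes = []
    for _axis in range(k):
        # Step 3: power iteration converges to the dominant eigenvector of C.
        v = [1.0 / math.sqrt(n)] * n
        for _step in range(iters):
            w = [sum(C[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * C[i][j] * v[j] for i in range(n) for j in range(n))
        axes.append((lam, v))
        # Deflation removes the found direction, so the next pass finds the
        # next largest eigenvalue (axes come out in decreasing order).
        C = [[C[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    # Step 4: approximate each object by its projections on the k axes.
    projections = [[sum(b * e for b, e in zip(row, vec)) for _, vec in axes]
                   for row in B]
    return axes, projections

A = [[1, 2], [1, 1], [0, 0]]
axes, projections = kl_transform(A, k=1)
```

Because the columns of B have zero mean, the projections on each axis also sum to zero, and their sum of squares equals the corresponding eigenvalue.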

Intuition
- B shifts the origin to the center of gravity of the vectors (by a_p), so every column of B has zero mean
- C represents attribute-to-attribute similarity
  - C is square, real and symmetric
- Eigenvectors and eigenvalues are computed on C, not on A
- C denotes the affine transform that minimizes the error
- Approximate each vector with its projections along the first k eigenvectors

Example
- Input vectors: [1 2], [1 1], [0 0]
- The column averages are then 2/3 and 1
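Carrying the example through the K-L steps by hand gives the following. The arithmetic is worked out here from the three input vectors and is rounded to two decimals.

```latex
A = \begin{pmatrix} 1 & 2 \\ 1 & 1 \\ 0 & 0 \end{pmatrix},
\qquad
B = \begin{pmatrix} 1/3 & 1 \\ 1/3 & 0 \\ -2/3 & -1 \end{pmatrix},
\qquad
C = B^T B = \begin{pmatrix} 2/3 & 1 \\ 1 & 2 \end{pmatrix}

\det(C - \lambda I) = \lambda^2 - \tfrac{8}{3}\lambda + \tfrac{1}{3} = 0
\;\Rightarrow\;
\lambda_{1,2} = \frac{4 \pm \sqrt{13}}{3} \approx 2.54,\; 0.13

u_1 \approx (0.47,\; 0.88)^T
```

For k = 1, each centered vector is replaced by its projection on u_1, the direction that carries almost all of the variance.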

SVD
- For general rectangular matrices
  - N x n matrix (N vectors, n dimensions)
  - groups similar entities (documents) together
  - groups similar terms together; each group of terms corresponds to a concept
- Given an N x n matrix A, write it as A = U Λ V^T
  - U: N x r column-orthogonal matrix (r: rank of A)
  - Λ: r x r diagonal matrix (non-negative entries, in descending order)
  - V: n x r column-orthogonal matrix (so V^T is r x n)

SVD (cont'd)
- A = λ_1 u_1 v_1^T + λ_2 u_2 v_2^T + … + λ_r u_r v_r^T
  - u_i, v_i are the column vectors of U, V
- SVD identifies rectangular blobs of related values in A
- The rank r of A: the number of blobs

Example
- Two types of documents: CS and Medical
- Two concepts (groups of terms)
  - CS: data, information, retrieval
  - Medical: brain, lung

Document   data  information  retrieval  brain  lung
CS-TR1     1     1            1          0      0
CS-TR2     2     2            2          0      0
CS-TR3     1     1            1          0      0
CS-TR4     5     5            5          0      0
MED-TR1    0     0            0          2      2
MED-TR2    0     0            0          3      3
MED-TR3    0     0            0          1      1

Example (cont'd)
[Figure: the decomposition A = U Λ V^T of the example matrix, with r = 2]
- U: document-to-concept similarity matrix
- V: term-to-concept similarity matrix
- v_12 = 0: the term "data" has 0 similarity with the 2nd concept
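The decomposition of the CS/Medical example can be checked numerically. The sketch below is an illustration under stated assumptions: because the matrix consists of two disjoint blocks, the singular values and vectors are written in closed form (derived by hand here, not taken from the slide), and the variable names are hypothetical.

```python
import math

# Rows: CS-TR1..CS-TR4, MED-TR1..MED-TR3.
# Columns: data, information, retrieval, brain, lung.
A = [
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
]

# Term-to-concept vectors (columns of V): one per block of terms.
s3, s2 = 1 / math.sqrt(3), 1 / math.sqrt(2)
v = [[s3, s3, s3, 0.0, 0.0],   # CS concept
     [0.0, 0.0, 0.0, s2, s2]]  # medical concept

# Singular values (diagonal of Lambda), in descending order.
lam = [math.sqrt(93), math.sqrt(28)]

# Document-to-concept vectors (columns of U): u_c = A v_c / lambda_c.
u = [[sum(a * t for a, t in zip(row, v[c])) / lam[c] for row in A]
     for c in range(2)]

# Spectral form: A = lam_1 u_1 v_1^T + lam_2 u_2 v_2^T (rank r = 2, two blobs).
recon = [[sum(lam[c] * u[c][i] * v[c][j] for c in range(2)) for j in range(5)]
         for i in range(7)]

max_error = max(abs(recon[i][j] - A[i][j]) for i in range(7) for j in range(5))
print([round(x, 2) for x in lam], max_error)
```

The entry of the medical concept vector for "data" is 0, which is the v_12 = 0 observation: the term has no similarity with the second concept.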

SVD and LSI
- SVD leads to “Latent Semantic Indexing” (LSI)
- Terms that occur together are grouped into concepts
- When a user searches for a term, the system determines the relevant concepts to search
- LSI maps concepts to vectors in the concept space instead of the n-dimensional document space
- The concept space is a lower-dimensionality space

Examples of Queries
- Find documents with the term “data”
  - translate the query vector q to the concept space
  - the query is related to the CS concept and unrelated to the medical concept
- LSI returns documents that also contain the terms “retrieval” and “information”, which are not specified by the query
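The query translation can be illustrated with a short sketch. It assumes the term-to-concept matrix V of the CS/Medical example (entries 1/√3 for the CS terms and 1/√2 for the medical terms, derived by hand here); the helper names `to_concept_space` and `cosine` and the sample document vector are hypothetical.

```python
import math

# Term order: data, information, retrieval, brain, lung.
S3, S2 = 1 / math.sqrt(3), 1 / math.sqrt(2)
V = [[S3, 0.0], [S3, 0.0], [S3, 0.0], [0.0, S2], [0.0, S2]]  # 5 terms x 2 concepts

def to_concept_space(term_vector):
    """Translate an n-dimensional term vector into concept space (q V)."""
    return [sum(t * row[c] for t, row in zip(term_vector, V)) for c in range(2)]

def cosine(a, b):
    """Cosine similarity; defined as 0 when either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

q = [1, 0, 0, 0, 0]    # query: the single term "data"
doc = [0, 1, 1, 0, 0]  # hypothetical document: "information", "retrieval"

q_c = to_concept_space(q)      # about (0.58, 0): CS concept only
doc_c = to_concept_space(doc)  # about (1.15, 0)

print(cosine(q, doc))      # no shared term in the original space
print(cosine(q_c, doc_c))  # same concept in concept space
```

In the term space the query and the document share no term, so their similarity is 0; in the concept space both lie on the CS axis, which is why the document is retrieved.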

FastMap
- Works with distances and has two roles:
  1. Maps objects to vectors so that their distances are preserved (then spatial access methods, SAMs, can be applied for indexing)
  2. Dimensionality reduction: given N vectors with n attributes each, find N vectors with k attributes such that distances are preserved as much as possible

Main idea
- Pretend that the objects are points in some unknown n-dimensional space
  - project these points on k mutually orthogonal axes
  - compute the projections using distances only
- The heart of FastMap is the method that projects objects on a line
  - take 2 objects which are far apart (pivots)
  - project on the line that connects the pivots

Project Objects on a Line
- O_a, O_b: pivots; O_i: any object
- d_ij: shorthand for D(O_i, O_j)
- x_i: first coordinate in a k-dimensional space
- If O_i is close to O_a, x_i is small
- Apply the cosine law to the triangle O_a O_i O_b:
  d_bi^2 = d_ai^2 + d_ab^2 - 2 x_i d_ab, hence x_i = (d_ai^2 + d_ab^2 - d_bi^2) / (2 d_ab)
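A minimal sketch of the projection step, using only the three pairwise distances. The function name `project_on_pivot_line` and the sample points are assumptions made for illustration; the formula is the cosine-law expression x_i = (d_ai^2 + d_ab^2 - d_bi^2) / (2 d_ab).

```python
import math

def project_on_pivot_line(d_ai, d_bi, d_ab):
    """Coordinate of O_i along the line O_a -> O_b, measured from O_a."""
    return (d_ai ** 2 + d_ab ** 2 - d_bi ** 2) / (2.0 * d_ab)

# Sample 2-d points, used only to obtain distances; FastMap itself never
# looks at the coordinates of the objects.
o_a, o_b, o_i = (0.0, 0.0), (4.0, 0.0), (1.0, 2.0)
x_i = project_on_pivot_line(math.dist(o_a, o_i),
                            math.dist(o_b, o_i),
                            math.dist(o_a, o_b))
print(x_i)  # approximately 1.0, the foot of the perpendicular from O_i
```

The pivot O_a itself maps to 0 and O_b maps to d_ab, so objects near O_a receive small coordinates.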

Choose Pivots
- Complexity: O(N)
- The optimal algorithm would require O(N^2) time
- Steps 2 and 3 can be repeated 4-5 times to improve the accuracy of the selection
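One common way to pick two far-apart pivots in linear time is sketched below. It is an assumption that this matches the steps referred to on the slide: start from an arbitrary object, take the object farthest from it, then the object farthest from that one, and repeat a few times. The function name `choose_distant_objects` and its parameters are hypothetical.

```python
def choose_distant_objects(objects, dist, start=0, rounds=5):
    """Heuristic pivot selection: O(N) distance computations per round."""
    indices = range(len(objects))
    b = start  # arbitrary initial object
    # The object farthest from the current b becomes pivot a.
    a = max(indices, key=lambda i: dist(objects[i], objects[b]))
    for _ in range(rounds):
        # The object farthest from a becomes pivot b.
        b = max(indices, key=lambda i: dist(objects[a], objects[i]))
        # Re-check a against the new b; stop when the pair is stable.
        new_a = max(indices, key=lambda i: dist(objects[i], objects[b]))
        if new_a == a:
            break
        a = new_a
    return a, b

points = [0.0, 1.0, 5.0, 9.0]
pivots = choose_distant_objects(points, lambda p, q: abs(p - q), start=1)
print(pivots)  # indices of the two extreme points
```

An exact search for the farthest pair would compare all pairs of objects, which is the O(N^2) cost the heuristic avoids.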

Extension for Many Dimensions
- Consider the (n-1)-dimensional hyperplane H that is perpendicular to the line O_a O_b
- Project the objects on H and apply the previous step
  - choose two new pivots
  - the new x_i is the next object coordinate
  - repeat this step until k-dimensional vectors are obtained
- The distance on H is not D
  - D’: distance between the projected objects

Distance on the Hyper-Plane H
- D’ on H can be computed from the Pythagorean theorem:
  D’(O_i’, O_j’)^2 = D(O_i, O_j)^2 - (x_i - x_j)^2
- The ability to compute D’ allows for computing a second line on H, and so on

Algorithm
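The pieces described so far (pivot choice, projection on the pivot line, and distances on the hyperplane) can be combined as in the following minimal sketch. It is an illustration under stated assumptions, not the original listing: the function name `fastmap`, the iterative formulation (a loop over dimensions instead of recursion), and the pivot heuristic are choices made here.

```python
import math

def fastmap(objects, dist, k, pivot_rounds=5):
    """Map each object to a k-dimensional vector using only distances."""
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]
    pivots = []  # one (a, b) index pair per dimension, recorded for queries

    def d2(i, j, col):
        # Squared distance on the current hyperplane: the original distance
        # minus the coordinates found so far (Pythagorean theorem).
        val = dist(objects[i], objects[j]) ** 2
        for c in range(col):
            val -= (coords[i][c] - coords[j][c]) ** 2
        return max(val, 0.0)  # negative values are turned to 0

    for col in range(k):
        # Choose two distant objects as pivots (linear-time heuristic).
        b = 0
        a = max(range(n), key=lambda i: d2(i, b, col))
        for _ in range(pivot_rounds):
            b = max(range(n), key=lambda i: d2(a, i, col))
            new_a = max(range(n), key=lambda i: d2(i, b, col))
            if new_a == a:
                break
            a = new_a
        pivots.append((a, b))
        dab2 = d2(a, b, col)
        if dab2 == 0.0:
            break  # all remaining distances are 0; later coordinates stay 0
        dab = math.sqrt(dab2)
        # Project every object on the pivot line (cosine law).
        for i in range(n):
            coords[i][col] = (d2(a, i, col) + dab2 - d2(b, i, col)) / (2 * dab)
    return coords, pivots

pts = [(0, 0), (3, 0), (0, 4), (3, 4), (1, 1)]
coords, pivots = fastmap(pts, math.dist, k=2)
```

For points that really are Euclidean and 2-dimensional, k = 2 reproduces all pairwise distances up to rounding; with non-Euclidean distances or smaller k, the mapping is only approximate.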

Observations
- Complexity: O(kN) distance calculations
  - k: desired dimensionality
  - k recursive calls, each taking O(N)
- The algorithm records the pivots in each call (dimension) to facilitate queries
  - the query is mapped to a k-dimensional vector by projecting it on the pivot lines for each dimension
  - O(1) computation per step: no need to compute pivots
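Mapping a query with the recorded pivots might look like the sketch below. The record layout (pivot objects stored together with their already computed coordinate vectors), the names `map_query` and `residual_sq`, and the sample values are assumptions made for illustration; the sketch also assumes the two pivots of each dimension are distinct on the current hyperplane.

```python
import math

def residual_sq(d, u, w, col):
    """Squared distance left after removing the first `col` coordinates."""
    return max(d * d - sum((u[c] - w[c]) ** 2 for c in range(col)), 0.0)

def map_query(query, pivot_records, dist):
    """pivot_records: one (o_a, x_a, o_b, x_b) tuple per dimension, where
    x_a and x_b are the k-dimensional images of the pivots o_a and o_b."""
    x_q = []
    for col, (o_a, x_a, o_b, x_b) in enumerate(pivot_records):
        # Two new distance computations per dimension: query to each pivot.
        d2_qa = residual_sq(dist(query, o_a), x_q, x_a, col)
        d2_qb = residual_sq(dist(query, o_b), x_q, x_b, col)
        # The pivot-to-pivot distance could be stored once at mapping time.
        d2_ab = residual_sq(dist(o_a, o_b), x_a, x_b, col)
        x_q.append((d2_qa + d2_ab - d2_qb) / (2.0 * math.sqrt(d2_ab)))
    return x_q

# Hypothetical records for a 2-d data set mapped with k = 2.
records = [
    ((3, 4), (0.0, 2.4), (0, 0), (5.0, 2.4)),
    ((3, 0), (3.2, 0.0), (0, 4), (1.8, 4.8)),
]
x_q = map_query((1, 1), records, math.dist)
print(x_q)  # approximately [3.6, 2.2]
```

No pivot search is needed at query time, which is why each dimension costs a constant number of distance computations.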

Observations (cont'd)
- The projected vectors can be indexed
  - mapping on 2-3 dimensions allows for visualization of the data space
- Assumes a Euclidean space (triangle rules)
  - not always true (at least after the second step)
- Approximation of pivots
  - some distances are negative
  - turn negative distances to 0

Application: Document Vectors

[Figure: FastMap on 10 documents for 2 and 3 dimensions: (a) k = 2 and (b) k = 3]

References
- C. Faloutsos, Searching Multimedia Databases by Content, Kluwer, 1996
- W. Press, Numerical Recipes in C, Cambridge Univ. Press, 1988
- LSI website:
- C. Faloutsos, K.-I. Lin, FastMap: A Fast Algorithm for Indexing, Data Mining and Visualization of Traditional and Multimedia Datasets, Proc. of SIGMOD, 1995

