CpSc 881: Machine Learning PCA and MDS
2 Copyright Notice: Most slides in this presentation are adapted from textbook slides and various other sources. The copyright belongs to the original authors. Thanks!
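3 Background: Covariance X = Temperature, Y = Humidity. Covariance measures how X and Y vary together. cov(X,Y) = 0: X and Y are uncorrelated (independence implies zero covariance, but not the reverse). cov(X,Y) > 0: X and Y move in the same direction. cov(X,Y) < 0: X and Y move in opposite directions.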
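4 Background: Covariance Matrix Contains the covariance values between all possible pairs of dimensions (= attributes). Example for three attributes (x, y, z):
C = | cov(x,x) cov(x,y) cov(x,z) |
    | cov(y,x) cov(y,y) cov(y,z) |
    | cov(z,x) cov(z,y) cov(z,z) |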
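As a rough illustration of the covariance matrix for three attributes, the sketch below builds it with NumPy and again from the definition. The data values are made up for the example and do not come from the slides.

```python
import numpy as np

# Hypothetical measurements: rows are observations, columns are (x, y, z).
data = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.4],
    [3.0, 6.2, 0.9],
    [4.0, 7.9, 0.1],
])

# rowvar=False tells NumPy that each column is one attribute.
C = np.cov(data, rowvar=False)

# The same matrix from the definition: centre each column, then
# average the products of deviations (sample covariance, divisor N - 1).
centered = data - data.mean(axis=0)
C_manual = centered.T @ centered / (data.shape[0] - 1)

print(C)
```

The diagonal holds the variances of x, y and z, and the matrix is symmetric because cov(x,y) = cov(y,x).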
5 Background: Eigenvalues and Eigenvectors Eigenvectors e: C e = λ e. How to calculate e and λ: calculate det(C - λI), which yields a polynomial of degree n; determine the roots of det(C - λI) = 0; the roots are the eigenvalues λ. See any linear algebra textbook, such as Elementary Linear Algebra by Howard Anton (John Wiley & Sons), or use a math package such as MATLAB.
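In practice the eigenvalues are rarely found by expanding the determinant by hand. A minimal sketch with NumPy, using a small hypothetical symmetric matrix in place of a real covariance matrix, checks both the defining equation and the determinant condition:

```python
import numpy as np

# Hypothetical symmetric matrix standing in for a covariance matrix C.
C = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is intended for symmetric matrices; eigenvalues are returned in
# ascending order and the eigenvectors are the columns of the second result.
eigenvalues, eigenvectors = np.linalg.eigh(C)

for lam, e in zip(eigenvalues, eigenvectors.T):
    # Each pair satisfies C e = lambda e ...
    assert np.allclose(C @ e, lam * e)
    # ... and each eigenvalue is a root of det(C - lambda I) = 0.
    assert abs(np.linalg.det(C - lam * np.eye(2))) < 1e-9

print(eigenvalues)
```

For this matrix the characteristic polynomial is (2 - λ)^2 - 1, whose roots are 1 and 3.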
6 An Example Table columns: X1, X2, X1', X2'. Mean1 = 24.1, Mean2 = 53.8
7 Covariance Matrix C. Using MATLAB, we find the eigenvectors and eigenvalues: e1 = (-0.98, 0.21), λ1 = 51.8; e2 = (0.21, 0.98), λ2 = 560.2. Thus the second eigenvector is more important, because its eigenvalue is much larger.
8 Principal Component Analysis (PCA) Used for visualization of complex data. Principal Component Analysis: project onto the subspace with the most variance. Developed to capture as much of the variation in the data as possible. Generic features of principal components: summary variables; linear combinations of the original variables; uncorrelated with each other; capture as much of the original variance as possible.
9 PCA Algorithm 1. X ← create the N x d data matrix, with one row vector x_n per data point. 2. X ← subtract the mean x̄ from each row vector x_n in X. 3. Σ ← covariance matrix of X. 4. Find the eigenvectors and eigenvalues of Σ. 5. PCs ← the M eigenvectors with the largest eigenvalues.
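The algorithm above can be sketched in a few lines of NumPy. This is one possible reading of the steps, not the course's reference code; the function name `pca` and the synthetic data are assumptions made for the illustration.

```python
import numpy as np

def pca(X, M):
    """Return the mean, the top-M principal directions, their eigenvalues,
    and the data projected onto those directions."""
    mean = X.mean(axis=0)                     # step 2: compute the mean ...
    Xc = X - mean                             # ... and subtract it from each row
    cov = np.cov(Xc, rowvar=False)            # step 3: d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # step 4: eigen-decomposition
    order = np.argsort(eigvals)[::-1][:M]     # step 5: largest eigenvalues first
    components = eigvecs[:, order]            # d x M, one PC per column
    return mean, components, eigvals[order], Xc @ components

# Synthetic 2-D data stretched along the direction (1, 1), plus small noise.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, t]) * 3.0 + rng.normal(scale=0.3, size=(200, 2))

mean, components, eigvals, projected = pca(X, M=1)
print(components[:, 0], eigvals)
```

The first principal direction should come out close to (1, 1)/sqrt(2), up to sign, and the variance of the projected data equals the corresponding eigenvalue.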
10 Principal components 1st principal component (PC1): the direction along which there is the greatest variation. 2nd principal component (PC2): the direction with the maximum variation left in the data, orthogonal to the direction (i.e. vector) of PC1. 3rd principal component (PC3): the direction with the maximal variation left in the data, orthogonal to the plane of PC1 and PC2 (rarely used). Etc.
11 Geometric Rationale of PCA The objective of PCA is to rigidly rotate the axes of this p-dimensional space to new positions (principal axes) that have the following properties: they are ordered such that principal axis 1 has the highest variance, axis 2 has the next highest variance, ..., and axis p has the lowest variance; the covariance between each pair of principal axes is zero (the principal axes are uncorrelated).
12 Example: 3 dimensions => 2 dimensions
13 PCA on all Genes Leukemia data, precursor B and T Plot of 34 patients, 8973 dimensions (genes) reduced to 2
14 How many components? Check the distribution of the eigenvalues. Take enough eigenvectors to cover a chosen percentage of the variance.
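One way to apply this rule is to look at the cumulative share of variance carried by the sorted eigenvalues. The sketch below uses the two eigenvalues from the earlier example (51.8 and 560.2); the 90% target and the helper name `components_needed` are assumptions chosen only for illustration.

```python
import numpy as np

def components_needed(eigenvalues, target):
    """Smallest number of components whose eigenvalues cover at least
    `target` (a fraction between 0 and 1) of the total variance."""
    vals = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(vals) / vals.sum()
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(vals)), cumulative

# Eigenvalues from the two-attribute example; the 0.90 target is illustrative.
k, cumulative = components_needed([51.8, 560.2], target=0.90)
print(k, cumulative)
```

Here the larger eigenvalue alone accounts for 560.2 / 612.0 of the variance, so a single component already exceeds the illustrative 90% target.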
15 Problems and limitations What if the data are very high-dimensional? e.g., images (d ≥ 10^4). Problem: the covariance matrix Σ has size d^2; for d = 10^4, |Σ| = 10^8. Singular Value Decomposition (SVD)! Efficient algorithms are available (Matlab); some implementations find just the top N eigenvectors.
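A sketch of how SVD sidesteps the d x d covariance matrix: the principal directions can be read off the thin SVD of the centred data matrix itself. The sizes and random data below are hypothetical, and NumPy's dense SVD computes all singular vectors rather than only the top few.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 500                      # hypothetical: few samples, many dimensions
X = rng.normal(size=(N, d))
Xc = X - X.mean(axis=0)             # centre the data

# Thin SVD of the centred data: Xc = U diag(s) Vt.
# The d x d covariance matrix is never formed.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt[:2]                 # top-2 principal directions (as rows)
eigenvalues = s ** 2 / (N - 1)      # eigenvalues of the covariance matrix
projected = Xc @ components.T       # N x 2 coordinates in PC space

print(projected.shape, eigenvalues[:2])
```

The squared singular values divided by N - 1 match the eigenvalues that an explicit covariance eigen-decomposition would give, so both routes lead to the same principal components.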
17 Singular Value Decomposition Problem: #1: Find concepts in text #2: Reduce dimensionality
18 SVD - Definition A [n x m] = U [n x r] Λ [r x r] (V [m x r])^T A: n x m matrix (e.g., n documents, m terms) U: n x r matrix (n documents, r concepts) Λ: r x r diagonal matrix (strength of each 'concept') (r: rank of the matrix) V: m x r matrix (m terms, r concepts)
19 SVD - Properties THEOREM [Press+92]: it is always possible to decompose a matrix A into A = U Λ V^T, where U, V: unique (*); U, V: column orthonormal (i.e., columns are unit vectors, orthogonal to each other), so U^T U = I and V^T V = I (I: identity matrix); Λ: singular values are positive and sorted in decreasing order
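These properties are easy to check numerically. The sketch below runs NumPy's SVD on a hypothetical random matrix; note that NumPy returns V already transposed and the singular values as a vector rather than a diagonal matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))          # hypothetical n x m matrix

# full_matrices=False gives the thin form: U is n x r, Vt is r x m.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

# Rebuild A from the three factors.
reconstruction = U @ np.diag(s) @ Vt

print(np.allclose(reconstruction, A))
print(np.allclose(U.T @ U, np.eye(4)), np.allclose(V.T @ V, np.eye(4)))
print(s)
```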
20 SVD - Properties 'spectral decomposition' of the matrix: A = [u1 u2] x diag(λ1, λ2) x [v1 v2]^T = λ1 u1 v1^T + λ2 u2 v2^T
21 SVD - Interpretation 'documents', 'terms' and 'concepts': U: document-to-concept similarity matrix. V: term-to-concept similarity matrix. Λ: its diagonal elements give the 'strength' of each concept. Projection: best axis to project on ('best' = min sum of squares of projection errors)
22 SVD - Example A = U Λ V^T - example: [figure: document-term matrix A with term columns data, inf., retrieval, brain, lung and document rows grouped as CS and MD, written as the product U x Λ x V^T]
23 SVD - Example A = U Λ V^T - example: [same figure, highlighting U, the doc-to-concept similarity matrix, with its CS-concept and MD-concept columns]
24 SVD - Example A = U Λ V^T - example: [same figure, highlighting the 'strength' of the CS-concept in Λ]
25 SVD - Example A = U Λ V^T - example: [same figure, highlighting V, the term-to-concept similarity matrix, and its CS-concept]
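Since the matrix in the figure did not survive, the sketch below uses a made-up document-term matrix with the same five terms and a CS/MD split to show how the two concepts emerge. The counts are hypothetical and are not the values from the slide.

```python
import numpy as np

terms = ["data", "inf.", "retrieval", "brain", "lung"]

# Hypothetical term counts: the first four rows are CS documents,
# the last three are MD documents.
A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the numerically non-zero singular values: r = rank = number of concepts.
r = int(np.sum(s > 1e-10))
U, s, Vt = U[:, :r], s[:r], Vt[:r]

print("concept strengths:", s)
for concept in Vt:
    # Term-to-concept weights; the overall sign of each row is arbitrary.
    print(dict(zip(terms, np.round(np.abs(concept), 2))))
```

With this block structure the matrix has rank 2: the stronger concept loads only on data, inf. and retrieval (the CS terms), and the weaker one only on brain and lung (the MD terms).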
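26 SVD - Dimensionality reduction Q: how exactly is dim. reduction done? A: set the smallest singular values to zero. [figure: A = U x Λ x V^T]
27 SVD - Dimensionality reduction [figure: A ~ U x Λ x V^T with the smallest singular values set to zero]
28 SVD - Dimensionality reduction [figure: A ~ the resulting lower-rank approximation]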
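Zeroing the smallest singular values can be sketched as follows. The matrix is random and hypothetical, and `truncate_svd` is a name invented for this example.

```python
import numpy as np

def truncate_svd(A, k):
    """Rank-k approximation of A: keep the k largest singular values,
    set the rest to zero, and multiply the factors back together."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_k = s.copy()
    s_k[k:] = 0.0
    return U @ np.diag(s_k) @ Vt, s

rng = np.random.default_rng(2)
A = rng.normal(size=(8, 5))          # hypothetical data matrix
A_k, s = truncate_svd(A, k=2)

# Frobenius norm of what was discarded.
error = np.linalg.norm(A - A_k)
print(np.linalg.matrix_rank(A_k), error)
```

The approximation error in the Frobenius norm equals the square root of the sum of the squared singular values that were dropped, which is why discarding the smallest ones loses the least.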
29 Multidimensional Scaling Procedures Similar in spirit to PCA, but takes dissimilarities between objects as input
30 Multidimensional Scaling Procedures The purpose of multidimensional scaling (MDS) is to map the distances between points in a high dimensional space into a lower dimensional space without too much loss of information.
31 Math MDS seeks values z_1,...,z_N in R^k to minimize the so-called stress function S(z_1,...,z_N) = sum_{i != j} (d_ij - ||z_i - z_j||)^2, where d_ij is the dissimilarity between objects i and j. This is known as least squares or Kruskal-Shephard scaling. A gradient descent algorithm is used to minimize S. A variant of MDS is Sammon's (1969) non-linear mapping. Here the following stress function is minimized: S_Sm(z_1,...,z_N) = sum_{i != j} (d_ij - ||z_i - z_j||)^2 / d_ij
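A minimal gradient descent sketch for the least squares case, assuming the stress is the sum over all pairs i != j of (d_ij - ||z_i - z_j||)^2. The step size, iteration count, starting point and test configuration are arbitrary choices for illustration, and plain gradient descent may stop in a local minimum.

```python
import numpy as np

def pairwise_distances(Z):
    """Euclidean distance between every pair of rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(D, Z):
    """S = sum over i != j of (d_ij - ||z_i - z_j||)^2 (diagonal terms are zero)."""
    return ((D - pairwise_distances(Z)) ** 2).sum()

def mds_gradient_descent(D, k, steps=2000, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(D.shape[0], k))      # random starting configuration
    history = [stress(D, Z)]
    for _ in range(steps):
        diff = Z[:, None, :] - Z[None, :, :]  # diff[i, j] = z_i - z_j
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(dist, 1.0)           # avoid 0/0; diagonal diff is zero anyway
        weights = (D - dist) / dist
        np.fill_diagonal(weights, 0.0)
        # dS/dz_i = -4 * sum_j (d_ij - dist_ij) * (z_i - z_j) / dist_ij
        grad = -4.0 * (weights[:, :, None] * diff).sum(axis=1)
        Z = Z - lr * grad
        history.append(stress(D, Z))
    return Z, history

# Hypothetical planar configuration used only to generate dissimilarities.
points = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.5, 0.5]])
D = pairwise_distances(points)

Z, history = mds_gradient_descent(D, k=2)
print(history[0], history[-1])
```

The recovered coordinates Z are only determined up to rotation, reflection and translation, so it is the stress, not the coordinates themselves, that should be compared between runs.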
32 We use MDS to visualize the dissimilarities between objects. The procedures are very exploratory and their interpretations are as much art as they are science.
33 Examples The “points” that are represented in multidimensional space can be just about anything. These objects might be people, in which case MDS can identify clusters of people who are “close” versus “distant” in some real or psychological sense.
34 Multidimensional Scaling Procedures As long as the “distance” between the objects can be assessed in some fashion, MDS can be used to find the lowest dimensional space that still adequately captures the distances between objects. Once the number of dimensions is identified, a further challenge is identifying the meaning of those dimensions. Basic data representation in MDS is a dissimilarity matrix that shows the distance between every possible pair of objects. The goal of MDS is to faithfully represent these distances with the lowest possible dimensional space.
35 Multidimensional Scaling Procedures The mathematics behind MDS can be daunting to understand. Two types: classical (metric) multidimensional scaling and non-metric scaling. Example: Distances between cities on the globe
36 Multidimensional Scaling Procedures This table lists the distances between European cities. A multidimensional scaling of these data should be able to recover the two dimensions (North-South x East-West) that we know must underlie the spatial relations among the cities.
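When the dissimilarities are Euclidean distances, one standard way to recover such coordinates is classical scaling: double-centre the squared distances and take the leading eigenvectors. The sketch below uses hypothetical planar points in place of the city table, which is not reproduced here.

```python
import numpy as np

def pairwise_distances(Z):
    """Euclidean distance between every pair of rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def classical_mds(D, k):
    """Classical (metric) scaling of a distance matrix D into k dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centring matrix
    B = -0.5 * J @ (D ** 2) @ J               # inner-product (Gram) matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]     # k largest eigenvalues
    lam = np.clip(eigvals[order], 0.0, None)  # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(lam)

# Hypothetical "map" positions standing in for cities.
points = np.array([[0.0, 0.0], [3.0, 1.0], [1.0, 4.0], [5.0, 5.0], [2.0, 2.0]])
D = pairwise_distances(points)

Z = classical_mds(D, k=2)
print(np.allclose(pairwise_distances(Z), D))
```

The recovered layout reproduces all pairwise distances, although it may be rotated or mirrored relative to the original positions, just as an MDS map of cities can come out with north pointing sideways.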
37 Multidimensional Scaling Procedures MDS begins by restricting the dimension of the space and then seeking an arrangement of the objects in that restricted space that minimizes the difference between the distances in that space compared to the actual distances.
38 Multidimensional Scaling Procedures An appropriate number of dimensions is identified… The objects can be plotted in the multidimensional space… Determine which objects cluster together and why they might cluster together. The latter issue concerns the meaning of the dimensions and often requires additional information.
39 Multidimensional Scaling Procedures In the cities data, the meaning is quite clear. The dimensions refer to the North-South x East-West surface area across which the cities are dispersed. We would expect MDS to faithfully recreate the map relations among the cities.
40 Multidimensional Scaling Procedures This arrangement provides the best fit for a one-dimensional model. How good is the fit? We use a statistic called “stress” to judge the goodness-of-fit.
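The slides do not say which stress formula the software reports. A common normalized choice is Kruskal's stress-1, and the sketch below assumes that variant with the input dissimilarities used directly as targets; other packages normalize differently. The points are hypothetical, and the one-dimensional arrangement simply keeps the first coordinate rather than being an optimized fit.

```python
import numpy as np

def pairwise_distances(Z):
    """Euclidean distance between every pair of rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def kruskal_stress1(D, Z):
    """sqrt( sum (d_ij - fitted_ij)^2 / sum fitted_ij^2 ) over pairs i < j,
    where fitted_ij are the distances in the configuration Z."""
    fitted = pairwise_distances(Z)
    iu = np.triu_indices_from(D, k=1)         # each pair counted once
    return np.sqrt(((D[iu] - fitted[iu]) ** 2).sum() / (fitted[iu] ** 2).sum())

# Hypothetical planar objects and their true distances.
points = np.array([[0.0, 0.0], [3.0, 1.0], [1.0, 4.0], [5.0, 5.0], [2.0, 2.0]])
D = pairwise_distances(points)

stress_2d = kruskal_stress1(D, points)        # the full 2-D configuration
stress_1d = kruskal_stress1(D, points[:, :1]) # a crude 1-D arrangement

print(stress_1d, stress_2d)
```

A configuration that reproduces every distance has stress 0, while squeezing planar data onto a line leaves a clearly positive stress.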
41 Multidimensional Scaling Procedures Smaller stress values indicate better fit. Some rules of thumb for degree of fit are:
Stress  Fit
.20     Poor
.10     Fair
.05     Good
.02     Excellent
42 Multidimensional Scaling Procedures The stress for the one-dimensional model of the cities data is .31, clearly a poor fit. The poor fit can also be seen in a plot of the actual distances versus the distances in the one-dimensional model, known as a Shepard plot. In a good-fitting model, the points lie along a line sloping upward to the right, showing a one-to-one correspondence between distances in the model space and actual distances. That is clearly not the case here.
43 Multidimensional Scaling Procedures A two-dimensional model fits very well. The stress value is also quite small (.00902), indicating an exceptional fit. Of course, this is no great surprise for these data.
44 Multidimensional Scaling Procedures There is no room for a three-dimensional model to improve matters. The stress is .00918, indicating that a third dimension does not help at all.
45 MDS Example: Clusters among Prostate Samples