1
Neural Computation 0368-4149-01 Prof. Nathan Intrator
Tuesday 16:00-19:00 Schreiber 7 Office hours: Wed 4-5 (c) Tralvex Yeap. All Rights Reserved
2
Outline
Goals for neural learning - Unsupervised
Goals for statistical/computational learning
PCA
ICA
Exploratory Projection Pursuit
Search for non-Gaussian distributions
Practical implementations
3
Statistical Approach to Unsupervised Learning
Understanding the nature of data variability
Modeling the data (sometimes with a very flexible model)
Understanding the nature of the noise
Applying prior knowledge
Extracting features based on: prior knowledge, class prediction, unsupervised learning
4
Principal Component Analysis.
Włodzisław Duch SCE, NTU, Singapore
5
Neuronal Goal
We look for axes which minimise projection errors and maximise the variance after projection.
n-dimensional vectors are mapped to m-dimensional vectors, m < n.
Ex: transform from 2 to 1 dimension.
6
Algorithm (cont’d)
Preserve as much of the variance as possible.
[Figure labels: more information (variance); rotate; less information; project]
7
Linear transformations – example
2D vectors X in a unit circle with mean (1,1); Y = A*X, where A is a 2x2 matrix.
The shape is elongated and rotated, and the mean is shifted.
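A minimal sketch of this example in plain Python. The matrix `A` and the choice of 12 evenly spaced points on the circle are assumptions made only for illustration; the slide does not specify them.

```python
import math

def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# 12 points evenly spaced on the unit circle centred at (1, 1)
angles = [2 * math.pi * k / 12 for k in range(12)]
X = [[1 + math.cos(t), 1 + math.sin(t)] for t in angles]

A = [[2.0, 1.0],
     [0.0, 1.0]]  # hypothetical matrix, chosen only for illustration
Y = [matvec(A, x) for x in X]

mean_x = mean_vector(X)  # approximately (1, 1)
mean_y = mean_vector(Y)  # approximately A applied to (1, 1), i.e. (3, 1)

# Distances from the mean are all 1 for X but unequal for Y (elongated shape)
radii_y = [math.dist(y, mean_y) for y in Y]
print(mean_x, mean_y, min(radii_y), max(radii_y))
```

The unequal radii of the transformed points show the elongation, and the new mean shows the shift.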
8
Invariant distances
Euclidean distance is not invariant to general linear transformations $Y = AX$:
$$\|Y^{(1)} - Y^{(2)}\|^2 = (X^{(1)} - X^{(2)})^T A^T A \,(X^{(1)} - X^{(2)})$$
It is invariant only for orthonormal matrices, $A^T A = I$, which perform rigid rotations without stretching or shrinking distances.
Idea: standardize the data in some way to create invariant distances.
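A small numerical check of this claim, as a sketch in plain Python. The two points, the 30-degree rotation `R` and the matrix `A` are hypothetical values chosen for illustration.

```python
import math

def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

p, q = [1.0, 2.0], [4.0, 6.0]  # hypothetical points, Euclidean distance 5

theta = math.radians(30)
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]  # rotation: orthonormal, R^T R = I
A = [[2.0, 1.0],
     [0.0, 1.0]]                            # general matrix, not orthonormal

d_original = math.dist(p, q)
d_rotated = math.dist(matvec(R, p), matvec(R, q))  # unchanged by the rotation
d_general = math.dist(matvec(A, p), matvec(A, q))  # changed by the general map
print(d_original, d_rotated, d_general)
```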
9
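Data standardization
For each component of the vectors $X^{(j)T} = (X_1^{(j)}, \ldots, X_d^{(j)})$, $j = 1 \ldots n$, calculate the mean and std, where n is the number of vectors and d is their dimension.
Vector of mean feature values (averages over the rows of the d x n data matrix):
$$\bar{X}_i = \frac{1}{n} \sum_{j=1}^{n} X_i^{(j)}, \quad i = 1 \ldots d$$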
10
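Standard deviation
Calculate the standard deviation of each feature, where $\bar{X}_i$ is the mean of feature i:
$$\sigma_i^2 = \frac{1}{n-1} \sum_{j=1}^{n} \left(X_i^{(j)} - \bar{X}_i\right)^2$$
Variance = square of the standard deviation (std): the sum of squared deviations from the mean value, divided by n-1.
Transform X => Z, standardized data vectors:
$$Z_i^{(j)} = \frac{X_i^{(j)} - \bar{X}_i}{\sigma_i}$$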
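The transformation X => Z can be sketched with Python's `statistics` module. The data values are hypothetical, and the sample standard deviation (dividing by n-1) is assumed.

```python
import statistics

# Hypothetical data: n = 5 vectors with d = 2 features each
X = [[2.0, 10.0], [4.0, 30.0], [6.0, 20.0], [8.0, 50.0], [10.0, 40.0]]

d = len(X[0])
features = [[x[i] for x in X] for i in range(d)]  # one list of values per feature
means = [statistics.mean(f) for f in features]
stds = [statistics.stdev(f) for f in features]    # sample std, divides by n - 1

# Z: each component shifted by its feature mean and divided by its feature std
Z = [[(x[i] - means[i]) / stds[i] for i in range(d)] for x in X]

z_features = [[z[i] for z in Z] for i in range(d)]
print([statistics.mean(f) for f in z_features])      # close to 0 for every feature
print([statistics.variance(f) for f in z_features])  # close to 1 for every feature
```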
11
Std data
Std data: zero mean and unit variance.
Standardize the data after applying a data transformation.
Effect: the data become invariant to scaling only (diagonal transformations).
Distances are then invariant, but is the data distribution the same?
How can the data be made invariant to any linear transformation?
12
Terminology (Covariance)
How two dimensions vary from the mean with respect to each other:
cov(X,Y) > 0: dimensions increase together
cov(X,Y) < 0: one increases, one decreases
cov(X,Y) = 0: dimensions are uncorrelated (independent dimensions have zero covariance, but zero covariance alone does not imply independence)
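The three cases can be illustrated with a short plain-Python sketch. The helper `cov` and the data are hypothetical; the last case uses y = x^2, where y is fully determined by x and the covariance is still zero.

```python
def cov(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

x = [-2.0, -1.0, 0.0, 1.0, 2.0]

together = cov(x, [2 * v for v in x])  # y grows with x: positive covariance
opposite = cov(x, [-v for v in x])     # y falls as x grows: negative covariance
parabola = cov(x, [v * v for v in x])  # y = x^2 depends on x, yet covariance is 0
print(together, opposite, parabola)
```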
13
Terminology (Covariance Matrix)
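Contains covariance values between all possible dimensions.
Example for three dimensions (x,y,z); the matrix is always symmetric:
$$C = \begin{pmatrix} cov(x,x) & cov(x,y) & cov(x,z) \\ cov(y,x) & cov(y,y) & cov(y,z) \\ cov(z,x) & cov(z,y) & cov(z,z) \end{pmatrix}$$
cov(x,x) is the variance of component x.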
14
Properties of the Cov matrix
Can be used for creating a distance that is not sensitive to linear transformations
Can be used to find directions which maximize the variance
Determines a Gaussian distribution uniquely (up to a shift)
15
Data standardization example
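For our example Y = AX, assume the X features have means = 1 and variances = 1.
Transformation: the vector of mean feature values becomes $\bar{Y} = A\bar{X}$.
Variance: for uncorrelated X features with unit variance, $\sigma^2(Y_i) = \sum_k A_{ik}^2$ (check it!).
How to make this invariant?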
16
Covariance matrix
Variance (spread around the mean value) + correlation between features:
$$C_X = \frac{1}{n-1} X X^T$$
$C_X$ is d x d, where X is the d x n matrix of vectors shifted to their means.
The covariance matrix is symmetric, $C_{ij} = C_{ji}$, and positive definite (strictly, positive semi-definite when some features are linearly dependent).
Diagonal elements are variances (squares of std), $\sigma_i^2 = C_{ii}$.
Pearson correlation coefficient:
$$r_{ij} = \frac{C_{ij}}{\sigma_i \sigma_j}$$
A spherical distribution of data has $C = I$ (unit matrix).
Elongated ellipsoids: large off-diagonal elements, strong correlations between features.
17
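Mahalanobis distance
Linear combinations of features lead to rotations and scaling of the data.
Mahalanobis distance:
$$\|X^{(1)} - X^{(2)}\|^2_{C_X^{-1}} = (X^{(1)} - X^{(2)})^T C_X^{-1} (X^{(1)} - X^{(2)})$$
is invariant to invertible linear transformations $Y = AX$, since $C_Y = A C_X A^T$:
$$\|Y^{(1)} - Y^{(2)}\|^2_{C_Y^{-1}} = (X^{(1)} - X^{(2)})^T A^T (A C_X A^T)^{-1} A \,(X^{(1)} - X^{(2)}) = \|X^{(1)} - X^{(2)}\|^2_{C_X^{-1}}$$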
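The invariance can be checked numerically. The following is a sketch in plain Python for 2D data; the data points, the matrix `A` and all helper names are hypothetical, and `A` is assumed to be invertible.

```python
def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def cov_matrix_2d(points):
    """2 x 2 sample covariance matrix of a list of 2D points."""
    n = len(points)
    m = [sum(p[i] for p in points) / n for i in range(2)]
    def c(i, j):
        return sum((p[i] - m[i]) * (p[j] - m[j]) for p in points) / (n - 1)
    return [[c(0, 0), c(0, 1)], [c(1, 0), c(1, 1)]]

def inverse_2x2(M):
    """Inverse of a 2 x 2 matrix with non-zero determinant."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mahalanobis_sq(u, v, C):
    """Squared Mahalanobis distance (u - v)^T C^{-1} (u - v)."""
    diff = [ui - vi for ui, vi in zip(u, v)]
    w = matvec(inverse_2x2(C), diff)
    return sum(di * wi for di, wi in zip(diff, w))

X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [5.0, 4.0], [4.0, 8.0]]  # hypothetical data
A = [[2.0, 1.0], [0.0, 1.0]]                                      # hypothetical invertible matrix
Y = [matvec(A, x) for x in X]

d_x = mahalanobis_sq(X[0], X[3], cov_matrix_2d(X))
d_y = mahalanobis_sq(Y[0], Y[3], cov_matrix_2d(Y))
print(d_x, d_y)  # the two values agree up to rounding
```

The covariance matrix is re-estimated from the transformed data, so the agreement does not rely on knowing `A`.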
18
Principal components
How to avoid correlated features?
Correlations mean that the covariance matrix is non-diagonal!
Solution: diagonalize it, then use the transformation that makes it diagonal to de-correlate the features:
$$Y = Z^T X, \qquad C_X Z^{(i)} = \lambda_i Z^{(i)}$$
where the columns $Z^{(i)}$ of Z are the eigenvectors of $C_X$.
In matrix form, X and Y are d x n; Z, $C_X$ and $C_Y$ are d x d:
$$C_X Z = Z \Lambda, \qquad C_Y = Z^T C_X Z = \Lambda$$
C is a symmetric, positive definite matrix, $X^T C X > 0$ for $\|X\| > 0$; its eigenvectors are orthonormal, $Z^{(i)T} Z^{(j)} = \delta_{ij}$, and its eigenvalues are all non-negative.
Z is the matrix of orthonormal eigenvectors (orthonormal because C is real and symmetric); it transforms X into Y with diagonal $C_Y$, i.e. decorrelated features.
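A sketch of the decorrelation step for 2D data in plain Python. It uses the closed-form eigen-decomposition of a symmetric 2 x 2 matrix instead of a general eigen-solver; the data and helper names are hypothetical.

```python
import math

def cov_matrix_2d(points):
    """2 x 2 sample covariance matrix of a list of 2D points."""
    n = len(points)
    m = [sum(p[i] for p in points) / n for i in range(2)]
    def c(i, j):
        return sum((p[i] - m[i]) * (p[j] - m[j]) for p in points) / (n - 1)
    return [[c(0, 0), c(0, 1)], [c(1, 0), c(1, 1)]]

X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [5.0, 4.0], [4.0, 8.0]]  # hypothetical correlated data
C_X = cov_matrix_2d(X)
a, b, c = C_X[0][0], C_X[0][1], C_X[1][1]

# Closed-form eigen-decomposition of the symmetric matrix [[a, b], [b, c]]
theta = 0.5 * math.atan2(2 * b, a - c)    # angle of the eigenvector with the larger eigenvalue
Z = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]  # columns are orthonormal eigenvectors
half_gap = math.hypot((a - c) / 2, b)
eigenvalues = [(a + c) / 2 + half_gap, (a + c) / 2 - half_gap]

# Y = Z^T X, applied to every data vector
Y = [[Z[0][0] * x[0] + Z[1][0] * x[1],
      Z[0][1] * x[0] + Z[1][1] * x[1]] for x in X]
C_Y = cov_matrix_2d(Y)
print(C_Y)  # off-diagonal entries close to 0, diagonal close to the eigenvalues
```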
19
Matrix form
Eigenproblem for the C matrix in matrix form:
$$C_X Z = Z \Lambda, \qquad \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$$
20
Principal components
PCA: an old idea, K. Pearson (1901), H. Hotelling (1933).
Y: principal components, i.e. vectors X transformed using the eigenvectors of $C_X$.
The covariance matrix of the transformed vectors is diagonal => ellipsoidal distribution of data.
Result: PCs are linear combinations of all features, providing new uncorrelated features, with a diagonal covariance matrix whose entries are the eigenvalues.
Small $\lambda_i$ means small variance: the data change little in direction $Y_i$.
PCA minimizes C matrix reconstruction errors: the eigenvectors $Z^{(i)}$ for the k largest $\lambda_i$ are sufficient to get
$$C_X \approx \sum_{i=1}^{k} \lambda_i Z^{(i)} Z^{(i)T}$$
because vectors for small eigenvalues make a very small contribution to the covariance matrix.
21
Two components for visualization
Diagonalization methods: see Numerical Recipes.
New coordinate system: axes ordered according to variance = size of the eigenvalue.
The first k dimensions account for the fraction
$$V_k = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i}$$
of all variance (please note that the $\lambda_i$ are variances); frequently 80-90% is sufficient for a rough description.
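A short sketch of how the number of retained dimensions might be chosen from the eigenvalues. The eigenvalues below and the helper `components_needed` are hypothetical.

```python
from itertools import accumulate

eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]  # hypothetical variances, sorted in descending order
total = sum(eigenvalues)

# Fraction of all variance captured by the first k dimensions, for k = 1, 2, ...
explained = [s / total for s in accumulate(eigenvalues)]

def components_needed(threshold):
    """Smallest k whose first k eigenvalues reach the given variance fraction."""
    return next(k for k, frac in enumerate(explained, start=1) if frac >= threshold)

print(explained)
print(components_needed(0.8), components_needed(0.9))
```

With these assumed eigenvalues, three dimensions are needed to pass 80% of the variance and four to pass 90%.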
22
Solving for Eigenvalues & Eigenvectors
Vectors x having the same direction as Ax are called eigenvectors of A (A is an n by n matrix).
In the equation $Ax = \lambda x$, $\lambda$ is called an eigenvalue of A.
$Ax = \lambda x \Leftrightarrow (A - \lambda I)x = 0$
How to calculate x and $\lambda$:
Calculate $\det(A - \lambda I)$, which yields a polynomial in $\lambda$ (degree n)
Determine the roots of $\det(A - \lambda I) = 0$; the roots are the eigenvalues
Solve $(A - \lambda I)x = 0$ for each $\lambda$ to obtain the eigenvectors x
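For a 2 x 2 matrix these steps can be carried out by hand or in a few lines of plain Python. The matrix below is a hypothetical symmetric example, and the helper `eigenvector` assumes a non-zero off-diagonal entry.

```python
import math

A = [[2.0, 1.0],
     [1.0, 2.0]]  # hypothetical symmetric matrix

# For a 2 x 2 matrix, det(A - lambda*I) = lambda^2 - trace*lambda + det
trace = A[0][0] + A[1][1]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
disc = math.sqrt(trace * trace - 4 * det)  # real for a symmetric matrix
eigenvalues = [(trace + disc) / 2, (trace - disc) / 2]

def eigenvector(lam):
    """Solve (A - lam*I) x = 0 using the first row (assumes A[0][1] != 0)."""
    x = [A[0][1], lam - A[0][0]]
    norm = math.hypot(*x)
    return [xi / norm for xi in x]

eigenvectors = [eigenvector(lam) for lam in eigenvalues]
print(eigenvalues, eigenvectors)
```

For larger matrices, finding polynomial roots is numerically fragile, so library eigen-solvers are normally used instead.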
23
PCA properties
PC Analysis (PCA) may be achieved by:
a transformation making the covariance matrix diagonal;
projecting the data on a line for which the sum of squared distances from the original points to their projections is minimal;
an orthogonal transformation to new variables that have stationary variances.
True covariance matrices are usually not known; they are estimated from data.
This works well on single-cluster data; more complex structure may require local PCA, separately for each cluster.
PCA is useful for: finding new, more informative, uncorrelated features; reducing dimensionality (rejecting low-variance features); reconstructing covariance matrices from low-dimensional data.
24
PCA Wisconsin example
Wisconsin Breast Cancer data:
Collected at the University of Wisconsin Hospitals, USA.
699 cases, 458 (65.5%) benign (red), 241 malignant (green).
9 features: quantized cell properties, e.g. Clump Thickness, Uniformity of Cell Size, Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses.
2D scatterograms do not show any structure no matter which subspaces are taken!
25
Example cont.
PC gives useful information already in 2D.
Taking the first PCA component of the standardized data:
If (Y1>0.41) then benign else malignant
18 errors in 699 cases, i.e. 97.4% accuracy.
Transformed vectors are not standardized.
Eigenvalues converge slowly, but the classes are separated well.
26
PCA disadvantages
Useful for dimensionality reduction, but:
The largest variance determines which components are used, but does not guarantee an interesting viewpoint for clustering data.
The meaning of features is lost when linear combinations are formed.
Analysis of the coefficients in Z1 and other important eigenvectors may show which original features are given much weight.
PCA may also be done efficiently by performing singular value decomposition of the standardized data matrix.
PCA is also called the Karhunen-Loève transformation.
Many variants of PCA are described in A. Webb, Statistical Pattern Recognition, J. Wiley 2002.
27
Exercise (will be part of Ex. 1)
How would you efficiently calculate the PCA of data where the dimensionality d is much larger than the number of vector observations n?
28
2 skewed distributions
PCA transformation for 2D data:
The first component will be chosen along the line of largest variance; both clusters will strongly overlap along it, so no interesting structure will be visible.
In fact, projection onto the axis orthogonal to the first PCA component has much more discriminating power.
Discriminant coordinates should be used to reveal class structure.
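This effect can be reproduced with a small synthetic example, sketched here in plain Python. The two clusters are hypothetical: parallel elongated point sets, offset from each other across their long axis.

```python
import math

# Two hypothetical elongated clusters, parallel and offset from each other
ts = range(-5, 6)
cluster_a = [[float(t), 0.5 * t + 1.0] for t in ts]
cluster_b = [[float(t), 0.5 * t - 1.0] for t in ts]
data = cluster_a + cluster_b

# Sample covariance of the pooled data (class labels are ignored, as in PCA)
n = len(data)
mx = sum(p[0] for p in data) / n
my = sum(p[1] for p in data) / n
var_x = sum((p[0] - mx) ** 2 for p in data) / (n - 1)
var_y = sum((p[1] - my) ** 2 for p in data) / (n - 1)
cov_xy = sum((p[0] - mx) * (p[1] - my) for p in data) / (n - 1)

theta = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)  # direction of largest variance
pc1 = [math.cos(theta), math.sin(theta)]
pc2 = [-math.sin(theta), math.cos(theta)]            # orthogonal to pc1

def project(points, axis):
    """Project centred points onto a unit axis."""
    return [(p[0] - mx) * axis[0] + (p[1] - my) * axis[1] for p in points]

a1, b1 = project(cluster_a, pc1), project(cluster_b, pc1)
a2, b2 = project(cluster_a, pc2), project(cluster_b, pc2)

overlap_on_pc1 = min(a1) < max(b1) and min(b1) < max(a1)  # clusters mixed along PC1
separated_on_pc2 = min(a2) > max(b2)                      # clusters split along PC2
print(overlap_on_pc1, separated_on_pc2)
```

The second component separates the clusters here only because of how the data were constructed; PCA itself never looks at the class labels.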