Neural Computation Prof. Nathan Intrator

Neural Computation 0368-4149-01, Prof. Nathan Intrator. Tuesday 16:00-19:00, Schreiber 007. Office hours: Wed 4-5. nin@tau.ac.il

Outline
- Goals for neural learning (unsupervised)
- Goals for statistical/computational learning
- PCA
- ICA
- Exploratory Projection Pursuit: search for non-Gaussian distributions
- Practical implementations

Statistical Approach to Unsupervised Learning
- Understanding the nature of data variability
- Modeling the data (sometimes very flexible model)
- Understanding the nature of the noise
- Applying prior knowledge
- Extracting features based on: prior knowledge, class prediction, or unsupervised learning

Principal Component Analysis. Włodzisław Duch SCE, NTU, Singapore http://www.ntu.edu.sg/home/aswduch

Linear transformations – example: 2D vectors X in a unit circle with mean (1,1); Y = A*X, where A is a 2x2 matrix. The shape is elongated, rotated, and the mean is shifted.
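A minimal numpy sketch of this example; the matrix A below is an arbitrary illustrative choice, not the one used on the original slide:

```python
import numpy as np

# Points on a unit circle centred at (1, 1), stored as a 2 x n matrix
theta = np.linspace(0.0, 2.0 * np.pi, 200)
X = np.stack([1.0 + np.cos(theta), 1.0 + np.sin(theta)])

# An arbitrary 2 x 2 linear transformation (illustrative only)
A = np.array([[2.0, 1.0],
              [0.5, 1.5]])
Y = A @ X

print("mean of X:", X.mean(axis=1))   # approximately (1, 1)
print("mean of Y:", Y.mean(axis=1))   # shifted to approximately A @ (1, 1)
```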

Invariant distances Euclidean distance is not invariant to general linear transformations Y = AX. It is preserved only by orthonormal matrices, A^T A = I, which perform rigid rotations without stretching or shrinking distances. Idea: standardize the data in some way to create invariant distances.
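Written out explicitly (a standard identity, included here to make the claim concrete): for Y = AX,
\[ \|Y^{(1)} - Y^{(2)}\|^2 = \left(X^{(1)} - X^{(2)}\right)^T A^T A \left(X^{(1)} - X^{(2)}\right), \]
which equals ||X^{(1)} - X^{(2)}||^2 for every pair of vectors only if A^T A = I.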

Data standardization For each feature of the data vectors X^{(j)} = (X_1^{(j)}, ..., X_d^{(j)})^T, j = 1..n, calculate the mean (n – number of vectors, d – their dimension):
\[ \bar{X}_i = \frac{1}{n} \sum_{j=1}^{n} X_i^{(j)}, \qquad i = 1, \dots, d. \]
This is the vector of mean feature values: averages over the rows of the d x n data matrix.

Standard deviation Calculate the standard deviation of each feature:
\[ \sigma_i^2 = \frac{1}{n-1} \sum_{j=1}^{n} \left(X_i^{(j)} - \bar{X}_i\right)^2. \]
Variance = square of the standard deviation (std), a sum of all deviations from the mean value. Transform X => Z, the standardized data vectors:
\[ Z_i^{(j)} = \frac{X_i^{(j)} - \bar{X}_i}{\sigma_i}. \]

Std data Standardized data have zero mean and unit variance. Standardize the data after making any data transformation. Effect: the result is invariant to scaling only (diagonal transformations); distances are invariant and the data distribution is the same. How can the data be made invariant to arbitrary linear transformations?
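A minimal numpy sketch of this standardization, using the d x n (feature x sample) layout assumed in the following slides and synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(5, 100))   # d = 5 features, n = 100 samples

mean = X.mean(axis=1, keepdims=True)                # vector of mean feature values
std = X.std(axis=1, ddof=1, keepdims=True)          # per-feature standard deviation
Z = (X - mean) / std                                # standardized data vectors

print(Z.mean(axis=1).round(6))                      # ~0 for every feature
print(Z.std(axis=1, ddof=1).round(6))               # 1 for every feature
```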

Data standardization example For our example Y = AX, assuming the components of X have means = 1 and variances = 1: the transformed mean is A times the mean vector (1,1)^T, and the transformed covariance is C_Y = A C_X A^T (check it!). How to make this invariant?

Covariance matrix Variance (spread around the mean value) + correlation between features:
\[ C_X = \frac{1}{n-1} X X^T, \]
where C_X is d x d and X is the d x n matrix of vectors shifted to their means. The covariance matrix is symmetric, C_ij = C_ji, and positive semi-definite. Diagonal elements are the variances (squares of the std), σ_i^2 = C_ii. A spherical distribution of data has C_X = I (unit matrix). Elongated ellipsoids: large off-diagonal elements, strong correlations between features.
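A short numpy sketch of this definition on synthetic data, with a check that it matches numpy's built-in estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 500))                 # d = 3 features, n = 500 samples
Xc = X - X.mean(axis=1, keepdims=True)        # shift every feature to zero mean

C = Xc @ Xc.T / (Xc.shape[1] - 1)             # d x d covariance matrix
print(np.allclose(C, np.cov(X)))              # True: same as numpy's estimator
print(np.allclose(C, C.T))                    # True: covariance matrix is symmetric
print(np.diag(C).round(3))                    # diagonal = per-feature variances
```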

Mahalanobis distance Linear combinations of features lead to rotations and scaling of the data. The Mahalanobis distance
\[ D_M\!\left(X^{(1)}, X^{(2)}\right)^2 = \left(X^{(1)} - X^{(2)}\right)^T C_X^{-1} \left(X^{(1)} - X^{(2)}\right) \]
is invariant to linear transformations Y = AX, because C_Y = A C_X A^T.
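A small numpy check of that invariance; the matrix A is an arbitrary invertible example:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 1000))                 # d = 3, n = 1000
A = np.array([[2.0, 0.3, 0.0],
              [0.1, 1.5, 0.4],
              [0.0, 0.2, 0.8]])                # arbitrary invertible transformation
Y = A @ X

def mahalanobis_sq(u, v, C):
    d = u - v
    return d @ np.linalg.solve(C, d)

CX, CY = np.cov(X), np.cov(Y)
print(mahalanobis_sq(X[:, 0], X[:, 1], CX))    # distance in the original coordinates
print(mahalanobis_sq(Y[:, 0], Y[:, 1], CY))    # same value after Y = A X
```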

Principal components How to avoid correlated features? Correlations => the covariance matrix is non-diagonal! Solution: diagonalize it, then use the transformation that makes it diagonal to de-correlate the features. In matrix form, X and Y are d x n, while Z, C_X, C_Y are d x d. C_X is a symmetric matrix, so its eigenvectors can be chosen orthonormal; it is positive semi-definite (v^T C_X v >= 0 for every vector v), so its eigenvalues are all non-negative. Z, the matrix of orthonormal eigenvectors of C_X (orthonormal because C_X is real and symmetric), transforms X into Y = Z^T X with diagonal C_Y, i.e. decorrelated features.

Matrix form Eigen problem for the C matrix in matrix form:
\[ C_X Z = Z \Lambda, \qquad \Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d), \qquad Z^T Z = I, \]
so that
\[ C_Y = Z^T C_X Z = \Lambda. \]
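A compact numpy sketch of this diagonalization (a generic illustration on synthetic data, not code from the course):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal(mean=[0.0, 0.0, 0.0],
                            cov=[[4.0, 2.0, 0.0],
                                 [2.0, 3.0, 1.0],
                                 [0.0, 1.0, 2.0]],
                            size=1000).T              # d x n data matrix
Xc = X - X.mean(axis=1, keepdims=True)

CX = np.cov(Xc)
lam, Z = np.linalg.eigh(CX)                           # ascending eigenvalues, columns = eigenvectors
lam, Z = lam[::-1], Z[:, ::-1]                        # reorder by decreasing variance

Y = Z.T @ Xc                                          # principal components
print(np.round(np.cov(Y), 3))                         # ~diagonal, with lam on the diagonal
```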

Principal components PCA: old idea, C. Pearson (1901), H. Hotelling (1933). Y – principal components, the vectors X transformed using the eigenvectors of C_X. The covariance matrix of the transformed vectors is diagonal => ellipsoidal distribution of data. Result: PCs are linear combinations of all features, providing new uncorrelated features with a diagonal covariance matrix whose entries are the eigenvalues. Small λ_i => small variance => the data change little in direction Y_i. PCA minimizes the C matrix reconstruction error: the Z_i vectors for large λ_i are sufficient to get
\[ C_X = \sum_{i=1}^{d} \lambda_i Z_i Z_i^T \approx \sum_{i=1}^{k} \lambda_i Z_i Z_i^T, \]
because the vectors for small eigenvalues contribute very little to the covariance matrix.
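A quick numpy check of that reconstruction claim, on synthetic data with two deliberately low-variance directions:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 2000))
X[3:] *= 0.05                                   # make the last two features low-variance
CX = np.cov(X)

lam, Z = np.linalg.eigh(CX)
lam, Z = lam[::-1], Z[:, ::-1]

k = 3                                           # keep only the largest-variance eigenvectors
C_approx = (Z[:, :k] * lam[:k]) @ Z[:, :k].T    # sum of lam_i * Z_i Z_i^T over the top k
print(round(float(np.linalg.norm(CX - C_approx)), 4))   # small reconstruction error
```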

Two components for visualization Diagonalization methods: see Numerical Recipes, www.nr.com. New coordinate system: axes ordered according to variance = size of the eigenvalue. The first k dimensions account for the fraction
\[ \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i} \]
of all the variance (note that the λ_i are variances); frequently 80-90% is sufficient for a rough description.
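A sketch of that rule of thumb, choosing the smallest k that retains 90% of the variance; the 90% target and the synthetic data are arbitrary example choices:

```python
import numpy as np

rng = np.random.default_rng(5)
scales = np.linspace(3.0, 0.1, 10)[:, None]          # decaying feature scales
X = rng.normal(size=(10, 500)) * scales
lam = np.linalg.eigvalsh(np.cov(X))[::-1]            # eigenvalues, largest first

explained = np.cumsum(lam) / lam.sum()               # cumulative fraction of total variance
k = int(np.searchsorted(explained, 0.90)) + 1        # smallest k reaching 90%
print(explained.round(3))
print("components needed for 90% of the variance:", k)
```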

PCA properties PC Analysis (PCA) may be achieved by:
- a transformation making the covariance matrix diagonal;
- projecting the data on a line for which the sum of squared distances from the original points to their projections is minimal;
- an orthogonal transformation to new variables that have stationary variances.
True covariance matrices are usually not known; they are estimated from data. This works well on single-cluster data; more complex structure may require local PCA, performed separately for each cluster. PCA is useful for:
- finding new, more informative, uncorrelated features;
- reducing dimensionality: rejecting low-variance features;
- reconstructing covariance matrices from low-dimensional data.

PCA Wisconsin example Wisconsin Breast Cancer data, collected at the University of Wisconsin Hospitals, USA: 699 cases, 458 (65.5%) benign (red), 241 malignant (green). 9 features describing cell properties, each quantized 1, 2, ..., 10: Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses. 2D scattergrams do not show any structure no matter which subspaces are taken!

Example cont. PCA gives useful information already in 2D. Taking the first principal component of the standardized data: if (Y1 > 0.41) then benign else malignant; 18 errors / 699 cases = 97.4% accuracy. The transformed vectors are not standardized (their std's were given on the original slide). The eigenvalues decrease slowly, but the classes are separated well.
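For illustration only, a sketch of this kind of single-component rule. Note that scikit-learn ships the WDBC variant of the Wisconsin data (569 cases, 30 continuous features), not the 699-case, 9-feature set used on the slide, so the 0.41 threshold and the 18-error count will not reproduce here; the median threshold below is an ad hoc choice:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()                    # WDBC variant, not the slide's dataset
X, y = data.data.T, data.target                # d x n layout; y = 1 means benign

Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)
lam, V = np.linalg.eigh(np.cov(Z))
pc1 = V[:, -1] @ Z                             # projection on the largest-variance direction

if np.corrcoef(pc1, y)[0, 1] < 0:              # orient so that large values mean benign
    pc1 = -pc1
pred_benign = pc1 > np.median(pc1)             # ad hoc threshold (the slide tunes 0.41 on its data)
print("accuracy of the one-component rule:",
      round(float((pred_benign == (y == 1)).mean()), 3))
```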

PCA disadvantages Useful for dimensionality reduction, but: The largest variance determines which components are used, yet it does not guarantee an interesting viewpoint for clustering the data. The meaning of the features is lost when linear combinations are formed, although analysis of the coefficients in Z_1 and other important eigenvectors may show which original features are given much weight. PCA may also be done in an efficient way by performing a singular value decomposition of the standardized data matrix. PCA is also called the Karhunen-Loève transformation. Many variants of PCA are described in A. Webb, Statistical Pattern Recognition, J. Wiley 2002.
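A brief sketch of the SVD route on synthetic data; it yields the same principal directions and variances as the eigendecomposition of the covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(4, 300))
Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)

U, s, Vt = np.linalg.svd(Z, full_matrices=False)   # Z = U * diag(s) * Vt
components = U                                     # columns: principal directions
variances = s**2 / (Z.shape[1] - 1)                # equal to the covariance eigenvalues

print(np.allclose(np.sort(variances),
                  np.sort(np.linalg.eigvalsh(np.cov(Z)))))   # True
```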

2 skewed distributions PCA transformation for 2D data: the first component will be chosen along the largest-variance line, both clusters will strongly overlap, and no interesting structure will be visible. In fact the projection onto the axis orthogonal to the first PCA component has much more discriminating power. Discriminant coordinates should be used to reveal class structure.

High Dimensional Data: dimension reduction, visualisation, classification, analysis, feature extraction.

Projection Pursuit – what: an automated procedure that seeks interesting low-dimensional projections of a high-dimensional cloud by numerically maximizing an objective function or projection index (Huber, 1985).

Projection Pursuit – why: the curse of dimensionality brings less robustness, worse mean squared error, greater computational cost, slower convergence to limiting distributions, … The required number of labelled samples increases with dimensionality.

What is an interesting projection In general: the projection that reveals more information about the structure. In pattern recognition: a projection that maximises class separability in a low dimensional subspace.

Projection Pursuit for dimensionality reduction: find lower-dimensional projections of a high-dimensional point cloud to facilitate classification. Exploratory Projection Pursuit: reduce the dimension of the problem to facilitate visualization.

Projection Pursuit choices: How many dimensions to use, for visualization or for classification/analysis? Which projection index to use:
- measure of variation (principal components)
- departure from normality (negative entropy)
- class separability (distance, Bhattacharyya, Mahalanobis, ...)
- …

Projection Pursuit Which optimization method to choose? We are trying to find the global optimum among many local ones: hill-climbing methods (e.g. simulated annealing), or regular optimization routines with random starting points.
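A toy sketch of such a search: a kurtosis-style "departure from normality" index maximized by a crude random-restart search over directions. Both the index and the optimizer are illustrative simplifications on synthetic data, not the specific choices discussed in the lecture:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
# 2D data: Gaussian along one axis, bimodal (the "interesting" structure) along the other
bimodal = np.concatenate([rng.normal(-3.0, 1.0, n // 2), rng.normal(3.0, 1.0, n // 2)])
X = np.stack([rng.normal(size=n), bimodal])
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

def projection_index(w, X):
    """Absolute excess kurtosis of the 1D projection: ~0 for Gaussian data."""
    p = w @ X
    p = (p - p.mean()) / p.std()
    return abs(np.mean(p**4) - 3.0)

best_w, best_val = None, -np.inf
for _ in range(200):                               # random restarts instead of hill climbing
    w = rng.normal(size=2)
    w /= np.linalg.norm(w)
    val = projection_index(w, X)
    if val > best_val:
        best_w, best_val = w, val

print("best direction:", best_w.round(3), "projection index:", round(best_val, 3))
```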

Timetable for dimensionality reduction:
- Begin: 16 April 1998
- Report on the state of the art: 1 June 1998
- Begin software implementation: 15 June 1998
- Prototype software presentation: 1 November 1998