2806 Neural Computation: Principal Component Analysis. Lecture 8, 2005. Ari Visa.

Agenda
- Some historical notes
- Some theory
- Principal component analysis
- Conclusions

Some Historical Notes
Pearson (1901) introduced principal component analysis in a biological context to recast linear regression analysis into a new form. Hotelling (1933) developed it further in work done on psychometry. Karhunen (1947) considered it in the setting of probability theory, and the theory was subsequently generalized by Loève (1963).

Some Historical Notes
- Ljung (1977), Kushner & Clark (1978): the asymptotic stability theorem.
- Földiák (1989) expanded the neural network configuration for principal components analysis by including anti-Hebbian feedback connections.
- The APEX model (Kung & Diamantaras, 1990).
- Hebbian networks (Karhunen & Joutsensalo, 1995).
- Nonlinear PCA (Diamantaras & Kung, 1996).

Some Theory
- Global order can arise from local interactions (Turing, 1952).
- Network organization takes place at two levels that interact with each other in the form of a feedback loop:
- Activity: certain activity patterns are produced by a given network in response to input signals.
- Connectivity: the connection strengths (synaptic weights) of the network are modified in response to neuronal signals in the activity patterns, due to synaptic plasticity.
- The following principles provide the neurobiological basis for the adaptive algorithms for principal component analysis:

Some Theory
1. Modifications in synaptic weights tend to self-amplify (von der Malsburg, 1990).
2. Limitation of resources leads to competition among synapses and therefore the selection of the most vigorously growing synapses (i.e., the fittest) at the expense of the others (von der Malsburg, 1990).
3. Modifications in synaptic weights tend to cooperate (Barlow, 1989).
4. Order and structure in the activation patterns represent redundant information that is acquired by the neural network in the form of knowledge, which is a necessary prerequisite to self-organized learning.

Some Theory
- Consider the transformation from data space to feature space: is there an invertible linear transform T such that the truncation of Tx is optimum in the mean-squared-error sense? Yes: principal component analysis (the Karhunen-Loève transformation).
- Let X denote an m-dimensional random vector representing the environment of interest, and assume E[X] = 0.
- Let q denote a unit vector of dimension m onto which the vector X is to be projected.
- The projection A = X^T q = q^T X is a random variable with a mean and variance related to the statistics of the random vector X:
- E[A] = q^T E[X] = 0
- σ² = E[A²] = q^T E[X X^T] q = q^T R q
- The m-by-m matrix R = E[X X^T] is the correlation matrix of the random vector X.
- R is symmetric, R^T = R, so a^T R b = b^T R a for any m-by-1 vectors a and b.
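
As an illustration (not part of the original lecture), the identity σ² = q^T R q can be checked numerically; the covariance matrix and the unit vector q below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero-mean random vectors X in R^m (m = 3 and the covariance are arbitrary)
m, N = 3, 10_000
X = rng.multivariate_normal(np.zeros(m),
                            [[3.0, 1.0, 0.0],
                             [1.0, 2.0, 0.0],
                             [0.0, 0.0, 1.0]], size=N)

# Estimate of the correlation matrix R = E[X X^T]
R = X.T @ X / N

# Any unit vector q; the projection A = q^T X has zero mean and variance q^T R q
q = np.array([1.0, 1.0, 0.0])
q /= np.linalg.norm(q)

a = X @ q
print(a.mean())        # close to 0
print(a.var())         # empirical variance of the projection
print(q @ R @ q)       # q^T R q, matches the empirical variance
```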

Some Theory
- The problem can now be posed as the eigenvalue problem Rq = λq.
- This problem has nontrivial solutions (q ≠ 0) only for special values of λ, called the eigenvalues of the correlation matrix R; the associated vectors q are called eigenvectors.
- R q_j = λ_j q_j, j = 1, 2, ..., m
- Let the eigenvalues be arranged in decreasing order, λ_1 > λ_2 > ... > λ_j > ... > λ_m, so that λ_1 = λ_max.
- Let the associated eigenvectors be used to construct an m-by-m matrix Q = [q_1, q_2, ..., q_j, ..., q_m].
- Then RQ = QΛ, where Λ is a diagonal matrix defined by the eigenvalues of R: Λ = diag[λ_1, λ_2, ..., λ_j, ..., λ_m].
- The matrix Q is an orthogonal (unitary) matrix in the sense that its column vectors satisfy the conditions of orthonormality: q_i^T q_j = 1 if i = j, 0 if i ≠ j. Hence Q^T Q = I and Q^T = Q^-1.
- The orthogonal similarity transformation: Q^T R Q = Λ, or q_j^T R q_k = λ_j if k = j, 0 if k ≠ j.
- The correlation matrix R may itself be expressed in terms of its eigenvalues and eigenvectors as R = Σ_{i=1}^m λ_i q_i q_i^T (the spectral theorem).
- These are two equivalent representations of the eigendecomposition of the correlation matrix R.
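
A short NumPy check (my addition; the matrix R is an arbitrary symmetric example) of the orthonormality condition Q^T Q = I, the similarity transformation Q^T R Q = Λ, and the spectral theorem.

```python
import numpy as np

# An arbitrary symmetric (correlation-like) matrix for the sketch
R = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 1.0]])

# eigh handles symmetric matrices; it returns eigenvalues in ascending order,
# so reverse to get lambda_1 >= lambda_2 >= ... >= lambda_m
lam, Q = np.linalg.eigh(R)
lam, Q = lam[::-1], Q[:, ::-1]
Lambda = np.diag(lam)

print(np.allclose(Q.T @ Q, np.eye(3)))       # Q^T Q = I
print(np.allclose(Q.T @ R @ Q, Lambda))      # Q^T R Q = Lambda
print(np.allclose(R, Q @ Lambda @ Q.T))      # R = sum_i lambda_i q_i q_i^T
```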

Some Theory
- The eigenvectors of the correlation matrix R pertaining to the zero-mean random vector X define the unit vectors q_j, representing the principal directions along which the variance probes have their extremal values.
- The associated eigenvalues define the extremal values of the variance probes.
- The practical value of principal component analysis is that it provides an effective technique for dimensionality reduction.
- Let the data vector x denote a realization of the random vector X. With the principal components a_j = q_j^T x, the original data vector may be reconstructed as x = Σ_{j=1}^m a_j q_j.
- Let λ_1, λ_2, ..., λ_l denote the largest l eigenvalues of the correlation matrix R. We may approximate the data vector x by truncating the expansion after l terms: x̂ = Σ_{j=1}^l a_j q_j, l ≤ m.

Some Theory
- The approximation error vector e equals the difference between the original data vector x and the approximating data vector x̂: e = x - x̂ = Σ_{j=l+1}^m a_j q_j.
- The error vector e is orthogonal to the approximating data vector x̂.
- Σ_{j=1}^m σ_j² = Σ_{j=1}^m λ_j (the variances σ_j² of the principal components sum to the total of the eigenvalues).
- To perform dimensionality reduction on some input data, we compute the eigenvalues and eigenvectors of the correlation matrix of the input data vector, and then project the data orthogonally onto the subspace spanned by the eigenvectors belonging to the dominant eigenvalues (subspace decomposition).
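
A hedged sketch of subspace decomposition on synthetic data (the dimensions, sample size, and number l of retained components are arbitrary choices), verifying that the error vector is orthogonal to the approximation and that the average squared error equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)

m, N, l = 5, 5_000, 2
C = rng.standard_normal((m, m))
X = rng.standard_normal((N, m)) @ C            # zero-mean data, rows are x^T

R = X.T @ X / N                                 # estimated correlation matrix
lam, Q = np.linalg.eigh(R)
lam, Q = lam[::-1], Q[:, ::-1]                  # eigenvalues in decreasing order

x = X[0]                                        # one realization of X
a = Q.T @ x                                     # principal components a_j = q_j^T x
x_hat = Q[:, :l] @ a[:l]                        # truncated expansion (l terms)
e = x - x_hat                                   # approximation error

print(np.isclose(e @ x_hat, 0.0))               # e is orthogonal to x_hat
# Average reconstruction error over the sample = sum of discarded eigenvalues
E = X - X @ Q[:, :l] @ Q[:, :l].T
print((E ** 2).sum(axis=1).mean(), lam[l:].sum())
```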

Principal Component Analysis
- Hebbian-based maximum eigenfilter.
- The neuron receives a set of m input signals x_1, x_2, ..., x_m through a corresponding set of m synapses with weights w_1, w_2, ..., w_m, respectively.
- The output is y = Σ_{i=1}^m w_i x_i.

Principal Component Analysis
- In accordance with Hebb's postulate of learning, a synaptic weight w_i varies with time, growing strong when the presynaptic signal x_i and the postsynaptic signal y coincide with each other:
- w_i(n+1) = w_i(n) + η y(n) x_i(n), i = 1, 2, ..., m, where n denotes time and η is the learning-rate parameter.
- This rule leads to saturation, so normalization is needed (Oja, 1982):
- w_i(n+1) = [w_i(n) + η y(n) x_i(n)] / {Σ_{i=1}^m [w_i(n) + η y(n) x_i(n)]²}^(1/2)
- Assuming that the learning-rate parameter η is small, this becomes
- w_i(n+1) = w_i(n) + η y(n)[x_i(n) - y(n) w_i(n)] + O(η²), which consists of the Hebbian term and a stabilizing term.
- With the effective input x'_i(n) = x_i(n) - y(n) w_i(n), the rule reads w_i(n+1) = w_i(n) + η y(n) x'_i(n).
- Positive feedback provides self-amplification and therefore growth of the synaptic weight w_i(n) according to its external input x_i(n); negative feedback due to -y(n) controls the growth, thereby stabilizing the synaptic weight w_i(n).
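
A minimal sketch of this single-neuron rule (Oja's rule) in NumPy; the two-dimensional data, learning rate, and sample size are my own choices, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(2)

# Zero-mean inputs with an anisotropic covariance (arbitrary choice)
m, N, eta = 2, 40_000, 0.005
X = rng.multivariate_normal([0.0, 0.0], [[5.0, 2.0], [2.0, 1.0]], size=N)

w = rng.standard_normal(m)
w /= np.linalg.norm(w)

for x in X:
    y = w @ x
    w += eta * y * (x - y * w)      # Hebbian term plus stabilizing term

# Compare with the dominant eigenvector q_1 of the estimated correlation matrix
R = X.T @ X / N
q1 = np.linalg.eigh(R)[1][:, -1]    # eigenvector of the largest eigenvalue
print(np.linalg.norm(w))            # approaches 1
print(abs(w @ q1))                  # approaches 1 (w aligns with q_1 up to sign)
```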

Principal Component Analysis
- Matrix formulation of the algorithm:
- x(n) = [x_1(n), x_2(n), ..., x_m(n)]^T
- w(n) = [w_1(n), w_2(n), ..., w_m(n)]^T
- y(n) = x^T(n) w(n) = w^T(n) x(n)
- w(n+1) = w(n) + η y(n)[x(n) - y(n) w(n)] = w(n) + η [x(n) x^T(n) w(n) - w^T(n) x(n) x^T(n) w(n) w(n)]
- This represents a nonlinear stochastic difference equation.

Principal Component Analysis
- The goal of the procedure described here is to associate a deterministic ordinary differential equation (ODE) with the stochastic nonlinear difference equation.
- The asymptotic stability theorem then gives lim_{n→∞} w(n) = q_1 infinitely often with probability 1.

Principal Component Analysis
A single linear neuron governed by the self-organized learning rule w(n+1) = w(n) + η y(n)[x(n) - y(n) w(n)] converges with probability 1 to a fixed point, which is characterized as follows:
1. The variance of the model output approaches the largest eigenvalue of the correlation matrix R: lim_{n→∞} σ²(n) = λ_1.
2. The synaptic weight vector of the model approaches the associated eigenvector: lim_{n→∞} w(n) = q_1, with lim_{n→∞} ||w(n)|| = 1.

Principal Component Analysis
- Hebbian-based principal components analysis.
- The single linear neuronal model may be expanded into a feedforward network with a single layer of linear neurons for the purpose of principal components analysis of arbitrary size on the input.

Principal Component Analysis
- The only aspect of the network that is subject to training is the set of synaptic weights {w_ji}, connecting source nodes i in the input layer to computation nodes j in the output layer, where i = 1, 2, ..., m and j = 1, 2, ..., l.
- The output y_j(n) of neuron j at time n, produced in response to the set of inputs {x_i(n) | i = 1, 2, ..., m}, is given by y_j(n) = Σ_{i=1}^m w_ji(n) x_i(n), j = 1, 2, ..., l.
- The synaptic weight w_ji(n) is adapted in accordance with the generalized Hebbian algorithm (GHA):
- Δw_ji(n) = η [y_j(n) x_i(n) - y_j(n) Σ_{k=1}^j w_ki(n) y_k(n)], i = 1, 2, ..., m and j = 1, 2, ..., l, where Δw_ji(n) is the change applied to the synaptic weight w_ji(n) at time n and η is the learning-rate parameter.
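
A hedged implementation sketch of the GHA update for the first l principal components; the synthetic data (constructed to have distinct eigenvalues), learning rate, and sample size are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)

m, l, N, eta = 4, 2, 60_000, 0.002
eigs = np.array([5.0, 3.0, 1.0, 0.5])                 # distinct eigenvalues (arbitrary)
Q_true, _ = np.linalg.qr(rng.standard_normal((m, m)))
X = (rng.standard_normal((N, m)) * np.sqrt(eigs)) @ Q_true.T

W = 0.01 * rng.standard_normal((l, m))                # row j holds w_j^T

for x in X:
    y = W @ x                                         # outputs y_j(n), j = 1, ..., l
    for j in range(l):
        # delta w_ji = eta * [ y_j x_i - y_j * sum_{k<=j} w_ki y_k ]
        W[j] += eta * y[j] * (x - W[: j + 1].T @ y[: j + 1])

# Compare with the leading eigenvectors of the estimated correlation matrix
R = X.T @ X / N
Q = np.linalg.eigh(R)[1][:, ::-1]
print([round(abs(W[j] @ Q[:, j]), 3) for j in range(l)])   # each close to 1
```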

Principal Component Analysis

Principal Component Analysis
- Rewriting the GHA:
- Δw_ji(n) = η y_j(n)[x'_i(n) - w_ji(n) y_j(n)], i = 1, 2, ..., m, j = 1, 2, ..., l, where x'_i(n) = x_i(n) - Σ_{k=1}^{j-1} w_ki(n) y_k(n).
- Rewriting once again: Δw_ji(n) = η y_j(n) x''_i(n), where x''_i(n) = x'_i(n) - w_ji(n) y_j(n).
- Note that w_ji(n+1) = w_ji(n) + Δw_ji(n), and w_ji(n) = z^-1[w_ji(n+1)].

Principal Component Analysis
- GHA in matrix notation:
- Δw_j(n) = η y_j(n) x'(n) - η y_j²(n) w_j(n), j = 1, 2, ..., l, where x'(n) = x(n) - Σ_{k=1}^{j-1} w_k(n) y_k(n).
- The vector x'(n) represents a modified form of the input vector.
- The GHA finds the first l eigenvectors of the correlation matrix R, assuming that the associated eigenvalues are distinct.

Principal Component Analysis
Summary of the GHA

Principal Component Analysis
- Adaptive principal components extraction (APEX).
- The APEX algorithm uses both feedforward and feedback connections.
- The algorithm is iterative in nature: given the first (j-1) principal components, the jth principal component is computed.

Principal Component Analysis
- Feedforward connections run from the input nodes to each of the neurons 1, 2, ..., j, with j < m. Of particular interest here are the feedforward connections to neuron j, represented by the weight vector w_j(n) = [w_j1(n), w_j2(n), ..., w_jm(n)]^T. The feedforward connections operate in accordance with a Hebbian learning rule; they are excitatory and therefore provide for self-amplification.
- Lateral connections run from the individual outputs of neurons 1, 2, ..., j-1 to neuron j, thereby applying feedback to the network. These connections are represented by the feedback weight vector a_j(n) = [a_j1(n), a_j2(n), ..., a_j,j-1(n)]^T. The lateral connections operate in accordance with an anti-Hebbian learning rule, which has the effect of making them inhibitory.

Principal Component Analysis
- The output y_j(n) of neuron j is given by y_j(n) = w_j^T(n) x(n) + a_j^T(n) y_{j-1}(n).
- The feedback signal vector y_{j-1}(n) is defined by the outputs of neurons 1, 2, ..., j-1: y_{j-1}(n) = [y_1(n), y_2(n), ..., y_{j-1}(n)]^T.
- The input vector x(n) is drawn from a stationary process whose correlation matrix R has distinct eigenvalues arranged in decreasing order. It is further assumed that neurons 1, 2, ..., j-1 of the network have already converged to their respective stable conditions:
- w_k(0) = q_k, k = 1, 2, ..., j-1
- a_k(0) = 0, k = 1, 2, ..., j-1
- y_{j-1}(n) = Q x(n), where Q is the matrix whose rows are q_1^T, ..., q_{j-1}^T.
- The requirement is to use neuron j in the network to compute the next largest eigenvalue λ_j of the correlation matrix R of the input vector x(n) and the associated eigenvector q_j.

Principal Component Analysis
- w_j(n+1) = w_j(n) + η [y_j(n) x(n) - y_j²(n) w_j(n)]
- a_j(n+1) = a_j(n) - η [y_j(n) y_{j-1}(n) + y_j²(n) a_j(n)]
- The learning-rate parameter η should be assigned a sufficiently small value to ensure that lim_{n→∞} w_j(n) = q_j and lim_{n→∞} σ_j²(n) = λ_j.
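
A sketch of these APEX updates for one additional neuron (j = 2), assuming neuron 1 has already converged to q_1; the data, learning rate, and sample size are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

m, N, eta = 3, 80_000, 0.002
eigs = np.array([6.0, 2.0, 0.5])                      # distinct eigenvalues (arbitrary)
Q_true, _ = np.linalg.qr(rng.standard_normal((m, m)))
X = (rng.standard_normal((N, m)) * np.sqrt(eigs)) @ Q_true.T

R = X.T @ X / N
lam, Q = np.linalg.eigh(R)
lam, Q = lam[::-1], Q[:, ::-1]

j = 2
W_prev = Q[:, : j - 1].T                  # neurons 1..j-1 assumed converged: w_k = q_k
w = 0.01 * rng.standard_normal(m)         # feedforward weights of neuron j
a = np.zeros(j - 1)                       # lateral (anti-Hebbian) weights, start at 0

for x in X:
    y_prev = W_prev @ x                   # outputs of neurons 1..j-1
    y = w @ x + a @ y_prev                # output of neuron j
    w += eta * (y * x - y ** 2 * w)       # Hebbian feedforward update
    a -= eta * (y * y_prev + y ** 2 * a)  # anti-Hebbian lateral update

print(abs(w @ Q[:, j - 1]))               # alignment with q_j, close to 1
print(np.linalg.norm(a))                  # lateral weights decay toward 0
```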

Some Theory
- Reestimation algorithms (only feedforward connections).
- Decorrelating algorithms (both feedforward and feedback connections).
- GHA is a reestimation algorithm, because w_j(n+1) = w_j(n) + η y_j(n)[x(n) - x̂_j(n)], where x̂_j(n) is the reestimator.
- APEX is a decorrelating algorithm.

Some Theory
- Batch and adaptive methods.
- Eigendecomposition and singular value decomposition belong to the batch category; GHA and APEX belong to the adaptive category.
- In theory, eigendecomposition is based on the ensemble-averaged correlation matrix R of a random vector X(n); in practice it uses the estimate R̂(N) = (1/N) Σ_{n=1}^N x(n) x^T(n).
- From a numerical perspective, a better method is to use singular value decomposition (SVD), applied directly to the data matrix. For the set of observations {x(n)}_{n=1}^N, the data matrix is defined by A = [x(1), x(2), ..., x(N)]^T.

Some Theory
- The SVD of the data matrix has the form A = U Σ V^T with k nonzero singular values, where k ≤ m and m is the dimension of the observation vector. The numbers σ_1, σ_2, ..., σ_k are called the singular values of the data matrix A.
- The columns of U are the left singular vectors and the columns of V are the right singular vectors of A.
- The singular values of the data matrix A are the square roots of the eigenvalues of the estimate R̂(N).
- The right singular vectors of A are the eigenvectors of R̂(N).
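
A numerical comparison of the two batch routes (my addition). To make the singular values of the data matrix equal the square roots of the eigenvalues of R̂(N) exactly, the data matrix is scaled by 1/√N here; that scaling is an assumption about the convention rather than something stated on the slide.

```python
import numpy as np

rng = np.random.default_rng(5)

m, N = 3, 2_000
X = rng.multivariate_normal(np.zeros(m),
                            [[4.0, 1.0, 0.0],
                             [1.0, 2.0, 0.5],
                             [0.0, 0.5, 1.0]], size=N)

A = X / np.sqrt(N)                       # data matrix with rows x(n)^T, scaled by 1/sqrt(N)
R_hat = X.T @ X / N                      # estimated correlation matrix; note R_hat = A^T A

U, s, Vt = np.linalg.svd(A, full_matrices=False)
lam, Q = np.linalg.eigh(R_hat)
lam, Q = lam[::-1], Q[:, ::-1]

print(np.allclose(s ** 2, lam))          # squared singular values = eigenvalues of R_hat
# Rows of Vt (the singular vectors on the observation side) match the
# eigenvectors of R_hat up to sign; SVD avoids forming X^T X explicitly.
print(np.allclose(np.abs(np.diag(Vt @ Q)), np.ones(m)))
```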

Some Theory
- Adaptive methods work with an arbitrarily large sample size N.
- The storage requirement of such methods is relatively modest (intermediate values of eigenvalues and associated eigenvectors do not have to be stored).
- In a nonstationary environment, they have an inherent ability to track gradual changes.

Principal Component Analysis
- Kernel principal component analysis.
- The computations are performed in a feature space that is nonlinearly related to the input space.
- Kernel PCA is nonlinear, but its implementation relies on linear algebra.
- Let the vector φ(x_j) denote the image of an input vector x_j induced in a feature space defined by the nonlinear map φ: R^m0 → R^m1, where m_0 is the dimensionality of the input space and m_1 is the dimensionality of the feature space.
- Given the set of examples {x_i}_{i=1}^N, we have a corresponding set of feature vectors {φ(x_i)}_{i=1}^N. We may then define an m_1-by-m_1 correlation matrix in the feature space, denoted by R̃:
- R̃ = (1/N) Σ_{i=1}^N φ(x_i) φ^T(x_i)
- R̃ q̃ = λ̃ q̃

Principal Component Analysis
- Expanding the eigenvector as q̃ = Σ_j α_j φ(x_j) gives Σ_{i=1}^N Σ_{j=1}^N α_j φ(x_i) K(x_i, x_j) = N λ̃ Σ_{j=1}^N α_j φ(x_j), where K(x_i, x_j) is an inner-product kernel defined in terms of the feature vectors.
- This leads to K²α = N λ̃ Kα, where the squared matrix K² denotes the product of K with itself.
- Let λ_1 ≥ λ_2 ≥ ... ≥ λ_N denote the eigenvalues of the kernel matrix K; that is, λ_j = N λ̃_j, j = 1, 2, ..., N, where λ̃_j is the jth eigenvalue of the correlation matrix R̃. The problem then reduces to Kα = λα.
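
A compact kernel PCA sketch built directly on the eigenproblem Kα = λα (my addition). For simplicity it assumes the feature-space data are zero mean, matching the correlation-matrix definition above; a full treatment would also center the kernel matrix. The function and variable names are illustrative, not from the lecture.

```python
import numpy as np

def kernel_pca(X, kernel, n_components):
    """Project the training points onto the leading kernel principal components."""
    N = X.shape[0]
    # Kernel (Gram) matrix K_ij = K(x_i, x_j)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])

    # Solve K alpha = lambda alpha; eigh returns eigenvalues in ascending order
    lam, alpha = np.linalg.eigh(K)
    lam = lam[::-1][:n_components]
    alpha = alpha[:, ::-1][:, :n_components]

    # Normalize so each feature-space eigenvector q~ = sum_i alpha_i phi(x_i)
    # has unit norm: ||q~||^2 = alpha^T K alpha = lambda for a unit-norm alpha
    alpha = alpha / np.sqrt(lam)

    # Projection of phi(x_k) onto q~ is sum_i alpha_i K(x_k, x_i) = (K alpha)_k
    return K @ alpha, lam / N            # lam / N are the eigenvalues of R~
```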

Principal Component Analysis
- Example: two-dimensional data consisting of components x_1 and x_2 are used. The x_1 values have a uniform distribution in the interval [-1, 1]; the x_2 values are nonlinearly related to the x_1 values by the formula x_2 = x_1² + v, where v is additive Gaussian noise of zero mean and small variance.
- The results of kernel PCA were obtained using polynomial kernels K(x, x_i) = (x^T x_i)^d, d = 1, 2, 3, 4.
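
A hedged reconstruction of this toy experiment, reusing the kernel_pca sketch above; the sample size and noise level are my own choices since the slide does not give them.

```python
import numpy as np

rng = np.random.default_rng(6)

# x1 uniform in [-1, 1]; x2 = x1^2 + Gaussian noise (std 0.2 is an assumption)
N = 200
x1 = rng.uniform(-1.0, 1.0, size=N)
x2 = x1 ** 2 + rng.normal(0.0, 0.2, size=N)
X = np.column_stack([x1, x2])

# Polynomial kernels K(x, x_i) = (x^T x_i)^d for d = 1, 2, 3, 4 (d = 1 is linear PCA)
for d in (1, 2, 3, 4):
    projections, eigvals = kernel_pca(X, lambda x, y, d=d: (x @ y) ** d, n_components=2)
    print(d, np.round(eigvals, 4))
```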

Principal Component Analysis
- Linear PCA fails to provide an adequate representation of the nonlinear input data.
- The first principal component varies monotonically along the parabola that underlies the input data.
- In kernel PCA, the second and third principal components exhibit a behavior that appears somewhat similar for different values of the polynomial degree d.

Summary
- The Hebbian-based algorithms are motivated by ideas taken from neurobiology.
- How useful is principal components analysis?
- If the main objective is to achieve good data compression while preserving as much information about the inputs as possible, then principal component analysis offers an effective technique for doing so.
- If it happens that there are a few clusters in the data set, then the leading principal axes found by principal component analysis will tend to pick projections of clusters with good separations.