Independent Component Analysis


Independent Component Analysis. Reference: Independent Component Analysis: A Tutorial by Aapo Hyvärinen, http://www.cis.hut.fi/projects/ica

Motivation of ICA - The Cocktail-Party Problem: At a party, three people are speaking simultaneously at different locations (S). Their voices are mixed together, so we cannot tell who said what. Three microphones placed at different spots in the room record the sound (X). Can the microphone recordings (X) be separated back into the three original speech signals (S)? This poses the cocktail-party problem. Demo

Formulation of ICA: Two speech signals s1(t) and s2(t) are received by two microphones, which record the mixed signals x1(t) and x2(t):
x1(t) = a11 s1(t) + a12 s2(t)    (1)
x2(t) = a21 s1(t) + a22 s2(t)    (2)
It would be very useful if we could estimate the original signals s1(t) and s2(t) from only the recorded signals x1(t) and x2(t). (Define the problem.)

Formulation of ICA: Suppose the coefficients aij are known; then solving the linear Equations 1 and 2 recovers s1(t) and s2(t). The problem is that we do not know the aij. One approach is to use information about the statistical properties of the signals si(t) to estimate the aij. Assuming s1(t) and s2(t) are statistically independent, Independent Component Analysis techniques can recover s1(t) and s2(t) from the mixtures x1(t) and x2(t).

Example figures: original signals s1(t), s2(t); mixture signals x1(t), x2(t); recovered signals for s1(t), s2(t).

Definition of ICA: For n linear mixtures x1, …, xn of n independent components,
xj = aj1 s1 + aj2 s2 + … + ajn sn,  for all j,
or in vector-matrix notation, x = As. The independent components si are latent variables, meaning that they cannot be directly observed, and the mixing matrix A is assumed to be unknown. We would like to estimate both A and s using only the observable random vector x and some statistical assumptions.

Definition of ICA: x = As; y = Bx = BAs = Cs. The output y is a copy of s if C = BA is non-mixing. A square matrix is said to be non-mixing if it has one and only one non-zero entry in each row and each column.

Illustration of ICA: We use two independent components with uniform distributions to illustrate the ICA model:
p(si) = 1/(2*sqrt(3)) if |si| <= sqrt(3), and 0 otherwise,
so each distribution has zero mean and variance equal to one. Let us mix these two independent components with a mixing matrix A0. This gives two mixed variables x1 and x2. The mixed data have a uniform distribution on a parallelogram, but x1 and x2 are no longer independent: when x1 attains its maximum or minimum, this also determines the value of x2.
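A minimal numpy sketch of this illustration, assuming a hypothetical example mixing matrix A0 (the matrix shown on the original slide is not reproduced in the text): it draws two unit-variance uniform sources and mixes them, after which the mixtures are no longer uncorrelated.

```python
# Sketch: two independent uniform sources with zero mean and unit variance,
# mixed by an example matrix A0 (values chosen for illustration only).
import numpy as np

rng = np.random.default_rng(0)
n_samples = 5000

# Uniform on [-sqrt(3), sqrt(3)] gives zero mean and unit variance.
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, n_samples))

A0 = np.array([[2.0, 3.0],
               [2.0, 1.0]])   # hypothetical example mixing matrix
x = A0 @ s                    # observed mixtures x1, x2

print("covariance of s (close to I):\n", np.cov(s))
print("covariance of x (no longer diagonal):\n", np.cov(x))
```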

Illustration of ICA. Fig 5: joint density distribution of the original signals s1 and s2 (uniform on a square). Fig 6: joint density distribution of the observed mixtures x1 and x2 (uniform on a parallelogram).

Illustration of ICA: The problem of estimating the ICA data model is now to estimate the mixing matrix A0 using only the information contained in the mixtures x1 and x2. Fig 6 suggests an intuitive way of estimating A0: the edges of the parallelogram are in the directions of the columns of A0. That is, we could estimate the ICA model by first estimating the joint density of x1 and x2 and then locating the edges. However, this only works for random variables with uniform distributions; we need a method that works for any type of distribution.

Ambiguities of ICA: Because y = Bx is only a copy of s: we cannot determine the variances (energies) of the independent components, and we cannot determine the order of the independent components. Applying a permutation matrix P to x = As, i.e., x = AP^-1 Ps: Ps is still a set of signals like the originals, and AP^-1 is just a new unknown mixing matrix to be solved by the ICA algorithm, so the order of s may be changed.

Properties of ICA - Independence: The variables y1 and y2 are said to be independent if information on the value of y1 does not give any information on the value of y2, and vice versa. Let p(y1, y2) be the joint probability density function (pdf) of y1 and y2, and let p1(y1) be the marginal pdf of y1. Then y1 and y2 are independent if and only if the joint pdf is factorizable:
p(y1, y2) = p1(y1) p2(y2).
Thus, given two functions h1 and h2, we always have
E{h1(y1) h2(y2)} = E{h1(y1)} E{h2(y2)}.

Properties of ICA - Uncorrelated variables are only partly independent: Two variables y1 and y2 are said to be uncorrelated if their covariance is zero,
cov(y1, y2) = E{y1 y2} - E{y1} E{y2} = 0.
If the variables are independent, they are uncorrelated, but the reverse is not true. For example, for x uniformly distributed on [0, 2*pi], sin(x) and cos(x) both depend on x, yet cov(sin(x), cos(x)) = 0.
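A quick numerical check of this example, assuming x uniform on [0, 2*pi]: the sample covariance of sin(x) and cos(x) is essentially zero, yet the factorization property of independent variables fails.

```python
# sin(x) and cos(x) are uncorrelated but not independent for x ~ U[0, 2*pi].
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 2.0 * np.pi, size=1_000_000)
y1, y2 = np.sin(x), np.cos(x)

print("cov(y1, y2)      ~", np.mean(y1 * y2) - np.mean(y1) * np.mean(y2))  # ~ 0
print("E{y1^2 y2^2}     ~", np.mean(y1**2 * y2**2))                        # ~ 0.125
print("E{y1^2} E{y2^2}  ~", np.mean(y1**2) * np.mean(y2**2))               # ~ 0.25
# The last two values differ, so E{h1(y1) h2(y2)} != E{h1(y1)} E{h2(y2)}:
# the variables are uncorrelated yet dependent.
```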

Gaussian variables are forbidden: The fundamental restriction of ICA is that the independent components must be nongaussian for ICA to be possible. Assume the mixing matrix is orthogonal and the si are gaussian; then x1 and x2 are gaussian, uncorrelated, and of unit variance. Their joint pdf is
p(x1, x2) = (1/(2*pi)) exp(-(x1^2 + x2^2)/2).
The distribution is completely symmetric (shown in the figure on the next page), so it does not contain any information on the directions of the columns of the mixing matrix A. Thus A cannot be estimated.

Fig 7. Multivariate distribution of two independent gaussian variables

ICA basics: Source separation by ICA must go beyond second-order statistics. Ignoring any time structure, the information contained in the data is exhaustively represented by the sample distribution of the observed vector; source separation can then be obtained by optimizing a 'contrast function', i.e., a function that measures independence.

Measures of independence - Nongaussian is independent: The key to estimating the ICA model is nongaussianity. The central limit theorem (CLT) tells us that the distribution of a sum of independent random variables tends toward a gaussian distribution. In other words, a mixture of two independent signals usually has a distribution that is closer to gaussian than either of the two original signals. Suppose we want to estimate y, one of the independent components of s, from x; let us denote this by y = w^T x = sum_i wi xi, where w is a vector to be determined. How can we use the CLT to determine w so that it equals one of the rows of the inverse of A?

Nongaussian is independent: Let us make a change of variables, z = A^T w. Then y = w^T x = w^T A s = z^T s = sum_i zi si; thus y = z^T s is more gaussian than the original variables si. y becomes least gaussian when it equals one of the si; a trivial way to achieve this is to let only one element zi of z be nonzero. Maximizing the nongaussianity of w^T x therefore gives us one of the independent components.

Measures of nongaussianity - Kurtosis: To use nongaussianity in ICA, we must have a quantitative measure of the nongaussianity of a random variable y. The classical measure of nongaussianity is kurtosis, or the fourth-order cumulant:
kurt(y) = E{y^4} - 3 (E{y^2})^2.
If y is of unit variance, then kurt(y) = E{y^4} - 3, so kurtosis is essentially a normalized fourth moment E{y^4}. For a gaussian y, the fourth moment equals 3(E{y^2})^2; thus the kurtosis of a gaussian random variable is zero.

Kurtosis: Kurtosis can be both positive and negative. Random variables with negative kurtosis are called subgaussian; subgaussian RVs typically have a flat pdf that is rather constant near zero and very small for larger values. The uniform distribution is a typical subgaussian example. Supergaussian RVs (positive kurtosis) have a spiky pdf with heavy tails; the Laplace distribution is a typical supergaussian example.
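A short sketch comparing sample kurtosis values (kurt(y) = E{y^4} - 3 after normalizing to unit variance) for gaussian, uniform (subgaussian), and Laplace (supergaussian) data:

```python
# Sample kurtosis: ~0 for gaussian, negative for uniform, positive for Laplace.
import numpy as np

def kurtosis(y):
    y = (y - y.mean()) / y.std()   # normalize to zero mean, unit variance
    return np.mean(y**4) - 3.0

rng = np.random.default_rng(0)
n = 1_000_000
print("gaussian:", kurtosis(rng.standard_normal(n)))   # ~  0
print("uniform :", kurtosis(rng.uniform(-1, 1, n)))    # ~ -1.2 (subgaussian)
print("laplace :", kurtosis(rng.laplace(size=n)))      # ~ +3   (supergaussian)
```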

Kurtosis (cont.): Typically nongaussianity is measured by the absolute value of kurtosis. Kurtosis can be estimated from the fourth moments of the sample data. If x1 and x2 are two independent RVs, it holds that
kurt(x1 + x2) = kurt(x1) + kurt(x2)  and  kurt(a x1) = a^4 kurt(x1).
To illustrate with a simple example what the optimization landscape for kurtosis looks like, consider a 2-D model x = As. We seek one of the independent components as y = w^T x. Let z = A^T w; then y = w^T x = w^T A s = z^T s = z1 s1 + z2 s2.

Kurtosis (cont.): Using the additivity property of kurtosis, we have kurt(y) = kurt(z1 s1) + kurt(z2 s2) = z1^4 kurt(s1) + z2^4 kurt(s2). Let us constrain the variance of y to equal 1, the same assumption made for s1 and s2. For z this means E{y^2} = z1^2 + z2^2 = 1, i.e., the vector z is constrained to the unit circle in the 2-D plane. The optimization problem becomes: what are the maxima of the function |kurt(y)| = |z1^4 kurt(s1) + z2^4 kurt(s2)| on the unit circle? The maxima are the points where exactly one element of z is +1 or -1 and the other is 0; these points correspond to y equal to si or -si.
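A small sketch of this optimization landscape, assuming one subgaussian (uniform) and one supergaussian (Laplace) source: scanning z = (cos a, sin a) around the unit circle, |kurt(y)| peaks where one element of z is +/-1 and the other 0.

```python
# Scan the unit circle z = (cos a, sin a) and evaluate |kurt(z1*s1 + z2*s2)|.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
s1 = rng.uniform(-np.sqrt(3), np.sqrt(3), n)     # subgaussian, unit variance
s2 = rng.laplace(scale=1/np.sqrt(2), size=n)     # supergaussian, unit variance

def kurt(y):
    return np.mean(y**4) - 3.0 * np.mean(y**2) ** 2

angles = np.linspace(0.0, 2.0 * np.pi, 721)
vals = [abs(kurt(np.cos(a) * s1 + np.sin(a) * s2)) for a in angles]
print("angle of max |kurt|:", np.degrees(angles[int(np.argmax(vals))]))
# Expected: close to a multiple of 90 degrees, i.e. y = +/- s1 or +/- s2.
```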

Kurtosis (cont.): In practice we could start from a weight vector w, compute the direction in which the kurtosis of y = w^T x is growing or decreasing most strongly based on the available sample x(1), …, x(T) of the mixture vector x, and use a gradient method to find a new vector w. However, kurtosis has drawbacks: the main problem is that it can be very sensitive to outliers, i.e., kurtosis is not a robust measure of nongaussianity. In the following sections we introduce negentropy, whose properties are rather opposite to those of kurtosis.

Negentropy: Negentropy is based on the information-theoretic quantity of entropy. The entropy of a RV is a measure of the degree of randomness of the observed variable: the more unpredictable and unstructured the variable is, the larger its entropy. For a continuous RV y with density f(y), the (differential) entropy is defined as
H(y) = - integral of f(y) log f(y) dy.
A fundamental result of information theory is that a gaussian variable has the largest entropy among all random variables of equal variance. Thus, entropy can be used to measure nongaussianity.

Negentropy: To obtain a measure of nongaussianity that is zero for a gaussian variable and always nonnegative, one often uses negentropy J, defined as
J(y) = H(y_gauss) - H(y),    (22)
where y_gauss is a gaussian RV with the same covariance matrix as y. The advantage of negentropy is that it is in some sense the optimal estimator of nongaussianity, as far as statistical properties are concerned. The problem with negentropy is that it is computationally very difficult to evaluate, so simpler approximations of negentropy are necessary and useful.

Approximations of negentropy: The classical method of approximating negentropy uses higher-order moments, for example
J(y) ~ (1/12) E{y^3}^2 + (1/48) kurt(y)^2,    (23)
where the RV y is assumed to be of zero mean and unit variance. This approach suffers from the same nonrobustness as kurtosis. Another approximation was developed based on the maximum-entropy principle:
J(y) ~ sum_i ki [E{Gi(y)} - E{Gi(v)}]^2,    (25)
where v is a gaussian variable of zero mean and unit variance, the ki are positive constants, and the Gi are nonquadratic functions.

Approximations of negentropy: Taking G(y) = y^4, (25) becomes (23). Suppose instead G is chosen to be slowly growing, as in the following contrast functions:
G1(u) = (1/a1) log cosh(a1 u), with 1 <= a1 <= 2,    G2(u) = -exp(-u^2/2).
These approximations are conceptually simple, fast to compute, and, above all, robust. A practical algorithm based on these contrast functions will be presented in Section 6.
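A minimal sketch of approximation (25) with a single term and G1(u) = log cosh(u) (a1 = 1); the positive constant k1 is dropped, since only relative comparisons matter here (an assumption of this sketch):

```python
# Negentropy approximation J(y) ~ (E{G(y)} - E{G(v)})^2 with G(u) = log cosh(u).
import numpy as np

def negentropy_approx(y, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    y = (y - y.mean()) / y.std()            # zero mean, unit variance
    G = lambda u: np.log(np.cosh(u))        # contrast function G1 with a1 = 1
    v = rng.standard_normal(1_000_000)      # standard gaussian reference variable
    return (np.mean(G(y)) - np.mean(G(v))) ** 2

rng = np.random.default_rng(1)
print("gaussian:", negentropy_approx(rng.standard_normal(500_000)))  # ~ 0
print("laplace :", negentropy_approx(rng.laplace(size=500_000)))     # > 0
```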

Preprocessing - centering: Some preprocessing techniques make the ICA estimation problem simpler and better conditioned. Centering: center the variable x, i.e., subtract its mean vector m = E{x}, so as to make x a zero-mean variable. This preprocessing is done solely to simplify the ICA algorithms. After estimating the mixing matrix A with centered data, we can complete the estimation by adding the mean vector of s back to the centered estimates of s; the mean vector of s is given by A^-1 m, where m is the mean vector that was subtracted in the preprocessing.
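A tiny centering sketch, assuming the data matrix X stores one variable per row and one observation per column:

```python
import numpy as np

def center(X):
    """Subtract the sample mean of each row (variable) of X."""
    m = X.mean(axis=1, keepdims=True)   # estimate of the mean vector E{x}
    return X - m, m                     # centered data and the subtracted mean
```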

Preprocessing - whitening: Another preprocessing step is to whiten the observed variables. Whitening means transforming the variable x linearly so that the new variable x~ is white, i.e., its components are uncorrelated and their variances equal unity. In other words, x~ being white means that its covariance matrix equals the identity matrix:
E{x~ x~^T} = I.

Preprocessing - whitening: The correlation rho between two variables x and y is
rho(x, y) = Cov(x, y) / (sigma_x sigma_y).
The covariance between x and y is
Cov(x, y) = E{(x - E{x})(y - E{y})},
which can be computed as Cov(x, y) = E{xy} - E{x} E{y}. If two variables are uncorrelated then rho(x, y) = Cov(x, y) = 0. A covariance matrix equal to I means that Cov(xi, xj) = 0 whenever i is not equal to j (and each variance equals 1): if a random vector's covariance matrix is the identity, its components are uncorrelated.

Preprocessing - whitening: Although uncorrelated variables are only partly independent, decorrelation (using second-order information) can be used to reduce the problem to a simpler form. The unwhitened mixing matrix A has n^2 free parameters, whereas the new, orthogonal mixing matrix obtained after whitening has only about half as many, n(n-1)/2.

Fig 10 (compare Figs 5 and 6): the data of Fig 6 after whitening. The square shows that the distribution is clearly a rotated version of the original square in Fig 5. All that is left is to estimate the single angle that gives the rotation.

Preprocessing - whitening: Whitening can be computed via the eigenvalue decomposition (EVD) of the covariance matrix,
E{xx^T} = E D E^T,
where E is the orthogonal matrix of eigenvectors of E{xx^T} and D is the diagonal matrix of its eigenvalues, D = diag(d1, …, dn). Note that E{xx^T} can be estimated in a standard way from the available sample x(1), …, x(T).

Preprocessing - whitening: Whitening can now be computed by
x~ = E D^(-1/2) E^T x,    (34)
where D^(-1/2) = diag(d1^(-1/2), …, dn^(-1/2)). It is easy to show that x~ is white: using (34) and E{xx^T} = E D E^T,
E{x~ x~^T} = E D^(-1/2) E^T E{xx^T} E D^(-1/2) E^T = E D^(-1/2) E^T (E D E^T) E D^(-1/2) E^T = I.  QED.
Since x = As, whitening transforms the mixing matrix into a new matrix A~:
x~ = E D^(-1/2) E^T A s = A~ s,
and the new mixing matrix A~ is orthogonal, since
E{x~ x~^T} = A~ E{ss^T} A~^T = A~ A~^T = I.
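A whitening sketch following (34), again assuming one variable per row of the data matrix X:

```python
# Whitening via EVD of the sample covariance: x~ = E D^(-1/2) E^T x.
import numpy as np

def whiten(X):
    X = X - X.mean(axis=1, keepdims=True)     # center first
    cov = X @ X.T / X.shape[1]                # estimate of E{xx^T}
    d, E = np.linalg.eigh(cov)                # cov = E diag(d) E^T
    V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T   # whitening matrix E D^(-1/2) E^T
    return V @ X, V                           # whitened data has covariance ~ I
```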

The FastICA Algorithm - FastICA for one unit: The FastICA learning rule finds a direction, i.e., a unit vector w, such that the projection w^T x maximizes nongaussianity, measured here by the approximation of negentropy J(w^T x). The variance of y = w^T x must be constrained to unity; for whitened data this is equivalent to constraining the norm of w to unity, i.e., E{(w^T x)^2} = ||w||^2 = 1. In the following algorithm, g denotes the derivative of the nonquadratic function G.

FastICA for one unit: The FastICA algorithm:
1) Choose an initial (e.g., random) weight vector w.
2) Let w+ = E{x g(w^T x)} - E{g'(w^T x)} w.
3) Let w = w+ / ||w+|| (normalization improves stability).
4) If not converged, go back to 2.
The derivation is as follows: the optima of E{G(w^T x)} under the constraint E{(w^T x)^2} = ||w||^2 = 1 are obtained at points where
F(w) = E{x g(w^T x)} - beta w = 0.    (40)
We solve this equation by Newton's method, w+ = w - [JF(w)]^(-1) F(w), where the Jacobian matrix of F is
JF(w) = dF/dw = E{xx^T g'(w^T x)} - beta I.

FastICA for one unit: To simplify the inversion of this matrix, its first term is approximated. Since the data is sphered (whitened),
E{xx^T g'(w^T x)} ~ E{xx^T} E{g'(w^T x)} = E{g'(w^T x)} I,
so the Jacobian becomes diagonal and can easily be inverted. The Newton step can then be approximated by
w+ = w - [E{x g(w^T x)} - beta w] / [E{g'(w^T x)} - beta].
Multiplying both sides by beta - E{g'(w^T x)} and simplifying algebraically gives the FastICA iteration
w+ = E{x g(w^T x)} - E{g'(w^T x)} w.
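A one-unit FastICA sketch on whitened data Z, using G1(u) = log cosh(u) so that g(u) = tanh(u) and g'(u) = 1 - tanh(u)^2:

```python
import numpy as np

def fastica_one_unit(Z, max_iter=200, tol=1e-6, rng=None):
    """Z: whitened data, one variable per row, one observation per column."""
    if rng is None:
        rng = np.random.default_rng(0)
    w = rng.standard_normal(Z.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        wx = w @ Z                                        # projections w^T x
        # w+ = E{x g(w^T x)} - E{g'(w^T x)} w
        w_new = (Z * np.tanh(wx)).mean(axis=1) - (1.0 - np.tanh(wx) ** 2).mean() * w
        w_new /= np.linalg.norm(w_new)                    # renormalize
        if abs(abs(w_new @ w) - 1.0) < tol:               # converged (up to sign)
            return w_new
        w = w_new
    return w
```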

FastICA for one unit (cont.) - Discussion: In practice, the expectations must be replaced by estimates, i.e., sample means. To compute a sample mean, ideally all available data should be used, but to keep the computational cost down only a part of the samples may be used; if convergence is not satisfactory, one can then increase the sample size.

FastICA for several units: To estimate several independent components, we run the FastICA algorithm with several units, with weight vectors w1, …, wn. To prevent different vectors from converging to the same maximum, we must decorrelate the outputs w1^T x, …, wn^T x after every iteration. A simple way to achieve this decorrelation is to estimate the independent components one by one: when p independent components w1, …, wp have been estimated, run the one-unit fixed-point algorithm for w_{p+1}, subtract from w_{p+1} the "projections" (w_{p+1}^T C wj) wj, j = 1, …, p, onto the previously estimated vectors, and then renormalize w_{p+1}:
1. w_{p+1} = w_{p+1} - sum_{j=1..p} (w_{p+1}^T C wj) wj
2. w_{p+1} = w_{p+1} / sqrt(w_{p+1}^T C w_{p+1})

FastICA for several units: The covariance matrix C = E{xx^T} is equal to I if the data has been sphered (whitened).
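A deflation sketch for estimating several units on whitened data (so C = I), corresponding to the two steps above:

```python
import numpy as np

def deflate(w_new, W_prev):
    """Decorrelate w_new against previously estimated unit-norm vectors."""
    for w_j in W_prev:
        w_new = w_new - (w_new @ w_j) * w_j   # subtract projection (w^T w_j) w_j
    return w_new / np.linalg.norm(w_new)      # renormalize (C = I after whitening)
```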

Applications of ICA - Finding Hidden Factors in Financial Data: Financial data such as currency exchange rates or daily returns of stocks may have common underlying factors, and ICA might reveal driving mechanisms that otherwise remain hidden. In a recent study of a stock portfolio, ICA was found to be a complementary tool to PCA, allowing the underlying structure of the data to be observed more readily.

Term project: Use PCA, JADE, and FastICA to analyze Taiwan stock returns for underlying factors. JADE and FastICA packages can be found by searching on the web. Data are available at the course web site. Due: