
1 Independent Component Analysis
Reference: Independent Component Analysis: A Tutorial, by Aapo Hyvärinen and Erkki Oja.

2 Motivation of ICA The Cocktail-Party Problem
Three people at a party stand at different positions and speak at the same time (S). Their voices mix together, making it impossible to tell who said what. Three microphones, placed at different locations, record the sound in the room (X). Can the recorded signals (X) be separated back into the three speakers' original speech signals (S)? (State the problem: the cocktail-party problem.) Demo

3 Formulation of ICA Two speech signals s1(t) and s2(t) are received by two microphones; the recorded mixtures are x1(t) and x2(t): x1(t) = a11 s1(t) + a12 s2(t) (1) and x2(t) = a21 s1(t) + a22 s2(t) (2). It would be very useful if we could estimate the original signals s1(t) and s2(t) from only the recorded signals x1(t) and x2(t). (Define the problem.)

4 Formulation of ICA Suppose the mixing coefficients aij are known; then solving the linear Equations 1 and 2 retrieves s1(t) and s2(t). The problem is that we do not know the aij. One approach is to use some information about the statistical properties of the signals si(t) to estimate the aij. If we assume that s1(t) and s2(t) are statistically independent, then Independent Component Analysis techniques can retrieve s1(t) and s2(t) from the mixtures x1(t) and x2(t).
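The mixing model can be written as a tiny numerical sketch. Everything below (the sinusoid and square-wave "speakers", the coefficients in A) is made up for illustration; only the structure x = As comes from the slides.

```python
import numpy as np

# Hypothetical sources and mixing coefficients, chosen only for illustration.
t = np.linspace(0, 1, 1000)
s1 = np.sin(2 * np.pi * 5 * t)              # "speaker 1": a sinusoid
s2 = np.sign(np.sin(2 * np.pi * 3 * t))     # "speaker 2": a square wave
S = np.vstack([s1, s2])                     # sources, shape (2, T)

A = np.array([[0.7, 0.3],                   # a11, a12  (unknown in the real problem)
              [0.4, 0.6]])                  # a21, a22
X = A @ S                                   # x1(t), x2(t): what the microphones record

# If A were known, Equations 1 and 2 could simply be inverted:
S_back = np.linalg.inv(A) @ X
print(np.allclose(S, S_back))               # True
```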

5 Example figures: original signals s1(t) and s2(t); mixture signals x1(t) and x2(t); recovered signals for s1(t) and s2(t).

6 Definition of ICA For n linear mixtures x1, …, xn of n independent components, xj = aj1 s1 + aj2 s2 + … + ajn sn for all j, i.e., x = As. The independent components si are latent variables, meaning that they cannot be directly observed, and the mixing matrix A is assumed to be unknown. We would like to estimate both A and s using only the observable random vector x and some statistical assumptions.

7 Definition of ICA x = As; y = Bx = BAs = Cs.
The estimate y is a copy of s if C = BA is non-mixing. A square matrix is said to be non-mixing if it has exactly one nonzero entry in each row and each column.

8 Illustration of ICA We use two independent components with uniform distributions to illustrate the ICA model. Each distribution has zero mean and variance equal to one. Let us mix these two independent components with the following mixing matrix: This gives us two mixed variables x1 and x2. The mixed data has a uniform distribution on a parallelogram, but x1 and x2 are no longer independent: when x1 attains its maximum or minimum, this also determines the value of x2.
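A quick numerical check of this slide. The mixing-matrix values below are illustrative only (the slide's matrix is not reproduced in the transcript); the point is that mixing makes the components correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two independent uniform components with zero mean and unit variance:
# uniform on [-sqrt(3), sqrt(3)] has variance exactly 1.
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 100_000))

A = np.array([[2.0, 3.0],        # illustrative mixing matrix (assumed values)
              [2.0, 1.0]])
x = A @ s                        # the samples of (x1, x2) fill a parallelogram

print(np.corrcoef(s)[0, 1])      # ~0: the sources are uncorrelated
print(np.corrcoef(x)[0, 1])      # clearly nonzero: x1 and x2 are no longer independent
```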

9 Illustration of ICA Fig 5. Joint density of the original signals s1 and s2. Fig 6. Joint density of the observed mixtures x1 and x2.

10 Illustration of ICA The problem of estimating the ICA data model is now to estimate the mixing matrix A0 using only the information contained in the mixtures x1 and x2. Fig 6 suggests an intuitive way of estimating A: the edges of the parallelogram lie in the directions of the columns of A. That is, we could estimate the ICA model by first estimating the joint density of x1 and x2 and then locating the edges. However, this only works for random variables with uniform distributions; we need a method that works for any type of distribution.

11 Ambiguities of ICA Because y = Bx is just a copy of s:
We cannot determine the variances (energies) of the independent components. We cannot determine the order of the independent components: applying a permutation matrix P to x = As, i.e., writing x = AP^-1 Ps, Ps is still a vector of independent components just like the original s, and AP^-1 is just a new unknown mixing matrix to be solved by the ICA algorithm, so the order of the components of s may be changed.

12 Properties of ICA Independence
The variables y1 and y2 are said to be independent if information on the value of y1 does not give any information on the value of y2, and vice versa. Let p(y1, y2) be the joint probability density function (pdf) of y1 and y2, and let p(y1) be the marginal pdf of y1. Then y1 and y2 are independent if and only if the joint pdf is factorizable: p(y1, y2) = p(y1) p(y2). Thus, given any two functions h1 and h2, we always have E{h1(y1) h2(y2)} = E{h1(y1)} E{h2(y2)}.

13 Properties of ICA Uncorrelated variables are only partly independent
Two variables y1 and y2 are said to be uncorrelated if their covariance is zero. If the variables are independent, they are uncorrelated, but the reverse is not true! For example, take x uniform on [0, 2π]: sin(x) and cos(x) are dependent (they are tied together by sin^2(x) + cos^2(x) = 1), yet cov(sin(x), cos(x)) = 0.
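A numerical check of this example, assuming x is uniform on [0, 2π]: the covariance vanishes, yet the independence factorization E{h1(y1) h2(y2)} = E{h1(y1)} E{h2(y2)} fails when h1 and h2 are both the square function.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 2 * np.pi, 1_000_000)
y1, y2 = np.sin(x), np.cos(x)

print(np.cov(y1, y2)[0, 1])      # ~0: uncorrelated
# Independence would require E{h1(y1) h2(y2)} = E{h1(y1)} E{h2(y2)} for all h1, h2.
# With h1(u) = h2(u) = u**2 this fails, because y1**2 + y2**2 = 1 ties the two together:
print(np.mean(y1**2 * y2**2) - np.mean(y1**2) * np.mean(y2**2))   # ~ -0.125, not 0
```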

14 Gaussian variables are forbidden
The fundamental restriction in ICA is that the independent components must be nongaussian for ICA to be possible. Assume the mixing matrix is orthogonal and the si are gaussian; then x1 and x2 are gaussian, uncorrelated, and of unit variance. Their joint pdf is p(x1, x2) = (1/(2π)) exp(-(x1^2 + x2^2)/2). This distribution is completely symmetric (shown in the figure on the next page), so it does not contain any information on the directions of the columns of the mixing matrix A. Thus A cannot be estimated.

15 Fig 7. Multivariate distribution of two independent gaussian variables

16 ICA Basic source separation by ICA must go beyond second-order statistics. Ignoring any time structure, the information contained in the data is exhaustively represented by the sample distribution of the observed vector, and source separation can then be obtained by optimizing a 'contrast function', i.e., a function that measures independence.

17 Measures of independence
Nongaussian is independent The key to estimating the ICA model is nongaussianity. The central limit theorem (CLT) tells us that the distribution of a sum of independent random variables tends toward a gaussian distribution. In other words, a mixture of two independent signals usually has a distribution that is closer to gaussian than either of the two original signals. Suppose we want to estimate y, one of the independent components of s, from x; let us denote this by y = wTx = Σi wi xi, where w is a vector to be determined. How can we use the CLT to determine w so that it equals one of the rows of the inverse of A?

18 Nongaussian is independent
Let us make a change of variables, z = ATw; then we have y = wTx = wTAs = zTs = Σi zi si. Thus y = zTs is more gaussian than the original variables si, and y becomes least gaussian when it equals one of the si. A trivial way to achieve this is to let only one element zi of z be nonzero. Maximizing the nongaussianity of wTx therefore gives us one of the independent components.

19 Measures of nongaussianity
To use nongaussianity in ICA, we must have a quantitative measure of the nongaussianity of a random variable y. Kurtosis: the classical measure of nongaussianity is kurtosis, or the fourth-order cumulant. Assume y is of zero mean and unit variance; then kurt(y) = E{y^4} - 3, so kurtosis is simply a normalized fourth moment E{y^4}. For a gaussian y, the fourth moment equals 3(E{y^2})^2; thus kurtosis is zero for a gaussian random variable.

20 Kurtosis Kurtosis can be both positive and negative
RVs with negative kurtosis are called subgaussian. A subgaussian RV typically has a flat pdf, which is rather constant near zero and very small for larger values; the uniform distribution is a typical example. A supergaussian RV (positive kurtosis) has a spiky pdf with heavy tails; the Laplace distribution is a typical example.
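A small sample-based check of these claims; the helper below simply estimates kurt(y) = E{y^4} - 3(E{y^2})^2 on centered data.

```python
import numpy as np

def kurt(y):
    """Sample estimate of the fourth cumulant E{y^4} - 3 (E{y^2})^2."""
    y = y - y.mean()
    return np.mean(y**4) - 3 * np.mean(y**2)**2

rng = np.random.default_rng(2)
n = 500_000
print(kurt(rng.normal(size=n)))            # ~0      : gaussian
print(kurt(rng.uniform(-1, 1, size=n)))    # negative: subgaussian (uniform)
print(kurt(rng.laplace(size=n)))           # positive: supergaussian (Laplace)
```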

21 Kurtosis (c) Typically, nongaussianity is measured by the absolute value of kurtosis. Kurtosis can be estimated using the fourth moments of the sample data. If x1 and x2 are two independent RVs, it holds that kurt(x1 + x2) = kurt(x1) + kurt(x2) and kurt(a x1) = a^4 kurt(x1). To illustrate with a simple example what the optimization landscape for kurtosis looks like, let us look at a 2-D model x = As. We seek one of the independent components as y = wTx. Let z = ATw; then y = wTx = wTAs = zTs = z1 s1 + z2 s2.

22 Kurtosis (c) Using the additive property of kurtosis, we have
kurt(y) = kurt(z1 s1) + kurt(z2 s2) = z1^4 kurt(s1) + z2^4 kurt(s2). Let us apply the constraint that the variance of y equals 1, which is the same assumption made for s1 and s2. This constrains z: E{y^2} = z1^2 + z2^2 = 1, meaning that the vector z lies on the unit circle in the 2-D plane. The optimization problem becomes: what are the maxima of the function |kurt(y)| = |z1^4 kurt(s1) + z2^4 kurt(s2)| on the unit circle? The maxima are the points where exactly one element of z is ±1 and the other is zero, i.e., the points where y equals one of ±si.
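The location of these maxima can be checked numerically. The sketch below sweeps z = (cos θ, sin θ) around the unit circle with two synthetic unit-variance sources (one uniform, one Laplace, chosen only for illustration) and reports where |kurt(y)| peaks.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
s1 = rng.uniform(-np.sqrt(3), np.sqrt(3), n)     # subgaussian source, unit variance
s2 = rng.laplace(scale=1 / np.sqrt(2), size=n)   # supergaussian source, unit variance

def kurt(y):
    return np.mean(y**4) - 3 * np.mean(y**2)**2

thetas = np.linspace(0, 2 * np.pi, 721)
abs_kurt = [abs(kurt(np.cos(t) * s1 + np.sin(t) * s2)) for t in thetas]
z_best = thetas[int(np.argmax(abs_kurt))]
print(round(np.cos(z_best), 2), round(np.sin(z_best), 2))   # ~(+/-1, 0) or (0, +/-1)
```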

23 Kurtosis (c) In practice, we could start from a weight vector w, compute the direction in which the kurtosis of y = wTx grows or decreases most strongly based on the available sample x(1), …, x(T) of the mixture vector x, and use a gradient method to find a new vector w. However, kurtosis has some drawbacks; the main one is that kurtosis can be very sensitive to outliers, i.e., kurtosis is not a robust measure of nongaussianity. In the following sections we introduce negentropy, whose properties are rather the opposite of those of kurtosis.

24 Negentropy Negentropy is based on the information-theoretic entropy.
The entropy of a RV is a measure of the degree of randomness of the observed variable: the more unpredictable and unstructured the variable is, the larger its entropy. Entropy is defined for a discrete RV Y as H(Y) = -Σi P(Y = ai) log P(Y = ai), and analogously with an integral over the density for a continuous variable. A fundamental result of information theory is that a gaussian variable has the largest entropy among all random variables of equal variance. Thus, entropy can be used to measure nongaussianity.

25 Negentropy To obtain a measure of nongaussianity that is zero for a gaussian variable and always nonnegative, one often uses negentropy J, defined as J(y) = H(ygauss) - H(y), (22) where ygauss is a gaussian RV with the same covariance matrix as y. The advantage of using negentropy is that it is in some sense the optimal estimator of nongaussianity, as far as statistical properties are concerned. The problem with negentropy is that it is computationally very difficult, so simpler approximations of negentropy are necessary and useful.

26 Approximations of negentropy
The classical method of approximating negentropy uses higher-order moments, for example J(y) ≈ (1/12) E{y^3}^2 + (1/48) kurt(y)^2. (23) The RV y is assumed to be of zero mean and unit variance. This approach still suffers from the same nonrobustness as kurtosis. Another approximation was developed based on the maximum-entropy principle: J(y) ≈ Σi ki [E{Gi(y)} - E{Gi(v)}]^2, (25) where v is a gaussian variable of zero mean and unit variance and the Gi are nonquadratic functions.

27 Approximations of negentropy
Taking G(y) = y^4, (25) reduces to (23). Instead, G is chosen to grow slowly; commonly used contrast functions are G1(u) = (1/a1) log cosh(a1 u), with 1 ≤ a1 ≤ 2, and G2(u) = -exp(-u^2/2). This approximation is conceptually simple, fast to compute, and, above all, robust. A practical algorithm based on these contrast functions is presented in Section 6.
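A sketch of this approximation for a single contrast function, J(y) ≈ [E{G(y)} - E{G(v)}]^2 with G(u) = log cosh(u) (i.e., a1 = 1); the constant k is dropped since only relative comparisons matter here.

```python
import numpy as np

rng = np.random.default_rng(4)
nu = rng.normal(size=1_000_000)                    # standard gaussian reference variable v

def negentropy_approx(y, G=lambda u: np.log(np.cosh(u))):
    y = (y - y.mean()) / y.std()                   # zero mean, unit variance, as assumed
    return (np.mean(G(y)) - np.mean(G(nu)))**2

n = 200_000
print(negentropy_approx(rng.normal(size=n)))           # ~0 for a gaussian variable
print(negentropy_approx(rng.laplace(size=n)))          # > 0 (supergaussian)
print(negentropy_approx(rng.uniform(-1, 1, size=n)))   # > 0 (subgaussian)
```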

28 Preprocessing - centering
Some preprocessing techniques make the problem of ICA estimation simpler and better conditioned. Centering: center the variable x, i.e., subtract its mean vector m = E{x}, so as to make x a zero-mean variable. This preprocessing is done solely to simplify the ICA algorithms. After estimating the mixing matrix A with the centered data, we can complete the estimation by adding the mean vector of s back to the centered estimates of s; the mean vector of s is given by A^-1 m, where m is the mean that was subtracted in the preprocessing.
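A minimal centering sketch; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(loc=3.0, size=(2, 10_000))    # observed mixtures with a nonzero mean

m = X.mean(axis=1, keepdims=True)            # m = E{x}
Xc = X - m                                   # centered data handed to the ICA algorithm
print(Xc.mean(axis=1))                       # ~ [0, 0]

# After ICA has produced an estimate A_hat of the mixing matrix and centered
# source estimates S_hat, the mean of s is restored by adding back A^{-1} m:
#   S_hat + np.linalg.inv(A_hat) @ m
```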

29 Preprocessing - whitening
Another preprocessing step is to whiten the observed variables. Whitening means transforming the variable x linearly so that the new variable x~ is white, i.e., its components are uncorrelated and their variances equal unity. In other words, x~ is white if its covariance matrix equals the identity matrix: E{x~x~T} = I.

30 Preprocessing - whitening
The correlation  between two variables x and y is The covariance between x and y is The covariance Cov (x, y) can be computed by If two variable are uncorrelated then (x, y)= Cov(x, y) =0 Covariance matrix = I means that if x not equal to y, then Cov(x,y)=0. if a matrix’s covariance matrix is white, then it is uncorrelated.

31 Preprocessing - whitening
Although uncorrelated variables are only partly independent, decorrelation (which uses only second-order information) can be used to reduce the problem to a simpler form. The unwhitened mixing matrix A has n^2 free parameters, whereas the orthogonal mixing matrix of the whitened data has only about half as many, n(n-1)/2.

32 Fig 10. The data of Fig 6 after whitening. The square-shaped distribution is clearly a rotated version of the original square in Fig 5. All that is left to estimate is a single angle giving the rotation.

33 Preprocessing - whitening
Whitening can be computed via the eigenvalue decomposition (EVD) of the covariance matrix: E{xxT} = E D ET, where E is the orthogonal matrix of eigenvectors of E{xxT} and D is the diagonal matrix of its eigenvalues, D = diag(d1, …, dn). Note that E{xxT} can be estimated in the standard way from the available sample x(1), …, x(T).

34 Preprocessing - whitening
Whitening can now be computed by x~ = E D^(-1/2) ET x, (34) where D^(-1/2) = diag(d1^(-1/2), …, dn^(-1/2)). It is easy to show that E{x~x~T} = I: using (34) and E{xxT} = E D ET, we get E{x~x~T} = E D^(-1/2) ET E{xxT} E D^(-1/2) ET = E D^(-1/2) ET E D ET E D^(-1/2) ET = I. QED. Since x = As, whitening transforms the mixing matrix into a new matrix A~ = E D^(-1/2) ET A, so x~ = A~ s, and E{x~x~T} = A~ E{ssT} A~T = A~ A~T = I, i.e., the new mixing matrix A~ is orthogonal.
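A sketch of the whole whitening step on synthetic data; the mixing matrix below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 100_000))  # unit-variance independent sources
A = np.array([[2.0, 3.0], [2.0, 1.0]])                       # illustrative mixing matrix
X = A @ S                                                    # zero-mean mixtures

C = np.cov(X)                              # sample estimate of E{x x^T}
d, E = np.linalg.eigh(C)                   # C = E D E^T
V = E @ np.diag(d ** -0.5) @ E.T           # whitening matrix V = E D^{-1/2} E^T
Z = V @ X                                  # whitened data x~ = V x

print(np.cov(Z).round(2))                  # ~ identity
A_tilde = V @ A                            # new mixing matrix for the whitened data
print((A_tilde @ A_tilde.T).round(2))      # ~ identity: A~ is (approximately) orthogonal
```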

35 The FastICA Algorithm - FastICA for one unit
The FastICA learning rule finds a direction, i.e., a unit vector w, such that the projection wTx maximizes nongaussianity, measured here by the approximation of negentropy J(wTx). The variance of y = wTx must be constrained to unity; for whitened data this is equivalent to constraining the norm of w to unity, i.e., E{(wTx)^2} = ||w||^2 = 1. In the following algorithm, g denotes the derivative of the nonquadratic function G.

36 FastICA for one unit The FastICA algorithm
1) Choose an initial (e.g., random) weight vector w. 2) Let w+ = E{xg(wTx)} - E{g'(wTx)}w. 3) Let w = w+ / ||w+|| (the normalization improves stability). 4) If not converged, go back to 2). The derivation is as follows: the optima of E{G(wTx)} under the constraint E{(wTx)^2} = ||w||^2 = 1 are obtained at points where F(w) = E{xg(wTx)} - bw = 0. (40) Solving this equation by Newton's method gives w+ = w - JF(w)^-1 F(w), where the Jacobian matrix (the Hessian of the objective) is JF(w) = ∂F/∂w = E{xxT g'(wTx)} - bI.

37 FastICA for one unit In order to simplify the inversion of this matrix, its first term is approximated. Since the data is sphered, E{xxT g'(wTx)} ≈ E{xxT} E{g'(wTx)} = E{g'(wTx)}I. The matrix thus becomes diagonal and can easily be inverted, and the Newton step becomes w+ = w - [E{xg(wTx)} - bw] / [E{g'(wTx)} - b]. Multiplying both sides by b - E{g'(wTx)} and simplifying algebraically gives the FastICA iteration w+ = E{xg(wTx)} - E{g'(wTx)}w.
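A sketch of the one-unit iteration on whitened data, using G(u) = log cosh(u), so g(u) = tanh(u) and g'(u) = 1 - tanh^2(u); the convergence test and the names below are my own choices.

```python
import numpy as np

def fastica_one_unit(Z, max_iter=200, tol=1e-6, seed=0):
    """Z: whitened data of shape (n_signals, n_samples). Returns one unit vector w."""
    rng = np.random.default_rng(seed)
    n, _ = Z.shape
    w = rng.normal(size=n)
    w /= np.linalg.norm(w)                                  # 1) random unit-norm initial vector
    for _ in range(max_iter):
        y = w @ Z                                           # projections w^T x over the sample
        g, g_prime = np.tanh(y), 1.0 - np.tanh(y)**2
        w_new = (Z * g).mean(axis=1) - g_prime.mean() * w   # 2) w+ = E{x g(w^T x)} - E{g'(w^T x)} w
        w_new /= np.linalg.norm(w_new)                      # 3) normalize
        if abs(w_new @ w) > 1 - tol:                        # 4) stop when the direction stops changing
            return w_new
        w = w_new
    return w

# Usage on the whitened data Z from the previous sketch:
#   w = fastica_one_unit(Z); one recovered source (up to sign) is w @ Z
```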

38 FastICA for one unit (c)
Discussion: the expectations must be replaced by estimates, i.e., sample means. To compute a sample mean, ideally all of the available data should be used, but to limit the computational cost only a part of the sample, or a small batch, is often used. If convergence is not satisfactory, one may then increase the sample size.

39 FastICA for several units
To estimate several independent components, we need to run the FastICA algorithm with several units, using weight vectors w1, …, wn. To prevent different vectors from converging to the same maximum, we need to decorrelate the outputs w1Tx, …, wnTx after every iteration. A simple way to decorrelate is to estimate the independent components one by one: when p independent components, i.e., w1, …, wp, have been estimated, run the one-unit fixed-point algorithm for wp+1, subtract from wp+1 the projections (wp+1T C wj) wj, j = 1, …, p, onto the previously estimated p vectors, and then renormalize wp+1:

40 FastICA for several units
The covariance matrix C = E{xxT} equals I if the data has been sphered (whitened), so the projection simplifies to (wp+1T wj) wj.
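A sketch of that decorrelation step for whitened data (C = I), written as a stand-alone helper.

```python
import numpy as np

def decorrelate(w_new, W_found):
    """Subtract from w_new its projections on the already-estimated vectors
    w_1, ..., w_p (the list W_found) and renormalize; C = I for whitened data."""
    for wj in W_found:
        w_new = w_new - (w_new @ wj) * wj    # subtract (w_{p+1}^T C w_j) w_j with C = I
    return w_new / np.linalg.norm(w_new)

# In the deflation scheme this is called after every one-unit iteration:
#   w = decorrelate(w, [w_1, ..., w_p])
```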

41 Applications of ICA - Finding Hidden Factors in Financial Data
Financial data such as currency exchange rates or daily returns of stocks may share some common underlying factors. ICA might reveal driving mechanisms that otherwise remain hidden. In a recent study of a stock portfolio, it was found that ICA is a complementary tool to PCA, allowing the underlying structure of the data to be observed more readily.

42 Term project Use PCA, JADE, and FastICA to analyze Taiwan stock returns for underlying factors. JADE and FastICA packages can be found by searching the web. Data are available at the course web site. Due:
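As a hedged starting point, scikit-learn's FastICA is just one of the freely available implementations (the slide does not name a particular package), and "returns.csv" is a placeholder for the course data, assumed to hold one stock's daily return series per column.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

returns = np.loadtxt("returns.csv", delimiter=",")   # shape (n_days, n_stocks); placeholder file

pca = PCA(n_components=5)
pc_factors = pca.fit_transform(returns)              # principal-component factors

ica = FastICA(n_components=5, random_state=0)
ic_factors = ica.fit_transform(returns)              # estimated independent components
loadings = ica.mixing_                               # how each hidden factor loads on the stocks
```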

