An Introduction to Independent Component Analysis (ICA) 吳育德 陽明大學放射醫學科學研究所 台北榮總整合性腦功能實驗室.

The Principle of ICA: a cocktail-party problem

x_1(t) = a_11 s_1(t) + a_12 s_2(t) + a_13 s_3(t)
x_2(t) = a_21 s_1(t) + a_22 s_2(t) + a_23 s_3(t)
x_3(t) = a_31 s_1(t) + a_32 s_2(t) + a_33 s_3(t)
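As a minimal illustration of this mixing model (an addition, not part of the original slides), the sketch below builds three assumed example sources and mixes them with an arbitrary 3x3 matrix A; the signal shapes and the matrix entries are placeholders, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three example source signals s_1, s_2, s_3 (rows of S): a sine, a square
# wave, and uniform noise. These are placeholders, not the lecture's signals.
t = np.linspace(0, 1, 1000)
S = np.vstack([np.sin(2 * np.pi * 5 * t),
               np.sign(np.sin(2 * np.pi * 3 * t)),
               rng.uniform(-1, 1, t.size)])

# An arbitrary (assumed) 3x3 mixing matrix with entries a_ij.
A = np.array([[0.8, 0.3, 0.5],
              [0.2, 0.9, 0.4],
              [0.6, 0.1, 0.7]])

# Each "microphone" records a linear mixture: x_i(t) = sum_j a_ij * s_j(t).
X = A @ S
print(X.shape)  # (3, 1000): three observed mixtures
```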

Independent Component Analysis. Reference: A. Hyvärinen, J. Karhunen, and E. Oja (2001), Independent Component Analysis, John Wiley & Sons.

Central limit theorem: the distribution of a sum of independent random variables tends toward a Gaussian distribution. Observed signal = m_1 IC_1 + m_2 IC_2 + ... + m_n IC_n: the sum tends toward a Gaussian, while each independent component is non-Gaussian.

Central Limit Theorem. Let x_k = z_1 + z_2 + ... + z_k be the partial sum of a sequence {z_i} of independent and identically distributed random variables. Since the mean and variance of x_k can grow without bound as k → ∞, consider instead of x_k the standardized variable y_k = (x_k − m_k)/σ_k, where m_k and σ_k are the mean and standard deviation of x_k. The distribution of y_k converges to a Gaussian distribution with zero mean and unit variance as k → ∞.
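A small numeric sketch of this theorem (an addition, with assumed sample sizes): the standardized partial sum of i.i.d. Uniform(0, 1) variables has excess kurtosis approaching 0, the Gaussian value, as k grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def standardized_partial_sum(k, n_samples=100_000):
    """Return n_samples realizations of y_k = (x_k - mean) / std, where
    x_k is the sum of k i.i.d. Uniform(0, 1) random variables."""
    z = rng.uniform(0, 1, size=(n_samples, k))
    x = z.sum(axis=1)
    return (x - x.mean()) / x.std()

for k in (1, 2, 10, 50):
    y = standardized_partial_sum(k)
    excess_kurtosis = np.mean(y**4) - 3.0  # zero for a Gaussian
    print(f"k = {k:3d}: excess kurtosis ~ {excess_kurtosis:+.3f}")
# For k = 1 (a single uniform) this is about -1.2; it approaches 0 as k grows.
```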

How to estimate the ICA model. Principle for estimating the ICA model: maximization of non-Gaussianity.

Measures of non-Gaussianity: kurtosis, kurt(x) = E{(x − μ)^4} − 3[E{(x − μ)^2}]^2. Super-Gaussian: kurtosis > 0; Gaussian: kurtosis = 0; sub-Gaussian: kurtosis < 0. For independent x_1 and x_2: kurt(x_1 + x_2) = kurt(x_1) + kurt(x_2) and kurt(αx_1) = α^4 kurt(x_1).
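The following sketch (an addition; the example distributions are only illustrative) estimates this kurtosis from samples and checks its sign for a super-Gaussian, a Gaussian, and a sub-Gaussian variable.

```python
import numpy as np

rng = np.random.default_rng(2)

def kurtosis(x):
    """kurt(x) = E{(x - mu)^4} - 3 * [E{(x - mu)^2}]^2, estimated from samples."""
    xc = x - x.mean()
    return np.mean(xc**4) - 3.0 * np.mean(xc**2) ** 2

n = 200_000
print(kurtosis(rng.laplace(size=n)))         # super-Gaussian: positive
print(kurtosis(rng.normal(size=n)))          # Gaussian: approximately zero
print(kurtosis(rng.uniform(-1, 1, size=n)))  # sub-Gaussian: negative
```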

Whitening process. Assume the measurement x = As is zero mean with E{ss^T} = I. Let D and E be the eigenvalue and eigenvector matrices of the covariance matrix of x, i.e. E{xx^T} = E D E^T. Then V = D^(−1/2) E^T is a whitening matrix: for z = Vx = D^(−1/2) E^T x,

E{zz^T} = V E{xx^T} V^T = D^(−1/2) E^T (E D E^T) E D^(−1/2) = I.
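A minimal whitening sketch along these lines, assuming data are arranged with one signal per row; the toy mixing matrix in the check is arbitrary.

```python
import numpy as np

def whiten(X):
    """Whiten zero-mean data X (rows = signals, columns = samples) so that
    E{zz^T} ~ I, using the eigendecomposition E{xx^T} = E D E^T and the
    whitening matrix V = D^(-1/2) E^T."""
    C = np.cov(X)                 # sample covariance of the rows of X
    d, E = np.linalg.eigh(C)      # eigenvalues d, orthonormal eigenvectors E
    V = np.diag(d ** -0.5) @ E.T  # V = D^(-1/2) E^T
    return V @ X, V

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    X = rng.standard_normal((3, 3)) @ rng.uniform(-1, 1, size=(3, 10_000))  # toy mixtures
    X = X - X.mean(axis=1, keepdims=True)                                   # centering
    Z, V = whiten(X)
    print(np.round(np.cov(Z), 3))  # approximately the 3x3 identity matrix
```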

Importance of whitening. For the whitened data z, find a vector w such that the linear combination y = w^T z has maximum non-Gaussianity under the constraint E{y^2} = 1. Since the data are whitened, E{(w^T z)^2} = ||w||^2, so this amounts to maximizing |kurt(w^T z)| under the simpler constraint ||w|| = 1.

Constrained optimization: maximize F(w) subject to ||w||^2 = 1. Form the Lagrangian L(w, λ) = F(w) + λ(1 − ||w||^2), where w^T w = ||w||^2 = 1. Setting ∂L(w, λ)/∂w = 0 gives ∂F(w)/∂w − 2λw = 0, i.e. ∂F(w)/∂w = 2λw. At the stable point, the gradient of F(w) must point in the direction of w, i.e. equal w multiplied by a scalar.

Gradient of kurtosis. For whitened data, F(w) = kurt(w^T z) = E{(w^T z)^4} − 3[E{(w^T z)^2}]^2 = E{(w^T z)^4} − 3||w||^4. Its gradient is

∂|kurt(w^T z)|/∂w = 4 sign(kurt(w^T z)) [E{z (w^T z)^3} − 3||w||^2 w],

where the expectation E{z (w^T z)^3} is estimated in practice by the sample average (1/T) Σ_{t=1}^{T} z_t (w^T z_t)^3.

Fixed-point algorithm using kurtosis. Gradient step: w_{k+1} = w_k + η ∂F(w_k)/∂w. At a stable point the gradient ∂F(w_k)/∂w = 2λw_k points in the direction of w_k, so adding it to w_k does not change the direction: w_{k+1} = w_k + η(2λw_k) = (1 + 2ηλ) w_k. Therefore we may use the fixed-point iteration

w ← E{z (w^T z)^3} − 3||w||^2 w, followed by w ← w/||w||.

Convergence criterion: |w_{k+1}^T w_k| = 1, since w_k and w_{k+1} are unit vectors.

Fixed-point algorithm using kurtosis (one-by-one estimation, fixed-point iteration):
1. Centering.
2. Whitening.
3. Choose m, the number of ICs to estimate. Set counter p ← 1.
4. Choose an initial guess of unit norm for w_p, e.g. randomly.
5. Let w_p ← E{z (w_p^T z)^3} − 3 w_p.
6. Do deflation decorrelation: w_p ← w_p − Σ_{j=1}^{p−1} (w_p^T w_j) w_j.
7. Let w_p ← w_p / ||w_p||.
8. If w_p has not converged (|w_p^T w_p^old| is not close to 1), go back to step 5.
9. Set p ← p + 1. If p ≤ m, go back to step 4.
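A sketch of this one-by-one (deflation) procedure in code, assuming the data Z have already been centered and whitened; the tolerance, iteration cap, and random initialization scheme are assumptions, not prescribed by the slides.

```python
import numpy as np

def fastica_kurtosis(Z, m, max_iter=200, tol=1e-6, seed=0):
    """One-by-one (deflation) estimation on whitened data Z (rows = signals),
    using the kurtosis-based fixed-point update
        w <- E{z (w^T z)^3} - 3 w,   w <- w / ||w||.
    The tolerance, iteration cap, and random start are assumptions."""
    rng = np.random.default_rng(seed)
    n, _ = Z.shape
    W = np.zeros((m, n))
    for p in range(m):
        w = rng.standard_normal(n)
        w /= np.linalg.norm(w)                     # step 4: unit-norm initial guess
        for _ in range(max_iter):
            w_old = w
            y = w @ Z                              # w^T z for every sample
            w = (Z * y**3).mean(axis=1) - 3.0 * w  # step 5: fixed-point update
            w = w - W[:p].T @ (W[:p] @ w)          # step 6: deflation decorrelation
            w /= np.linalg.norm(w)                 # step 7: renormalize
            if abs(w @ w_old) > 1 - tol:           # step 8: |w_new^T w_old| ~ 1
                break
        W[p] = w
    return W  # estimated unmixing rows for the whitened data
```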

Fixed-point algorithm using negentropy. Kurtosis (for zero-mean, unit-variance x, kurt(x) = E{x^4} − 3) is very sensitive to outliers, which may be erroneous or irrelevant observations. For example, a random variable with sample size 1000, zero mean and unit variance that contains one value equal to 10 has kurtosis at least 10^4/1000 − 3 = 7. We therefore need a more robust measure of non-Gaussianity: an approximation of negentropy.

Fixed-point algorithm using negentropy. Entropy: H(y) = −∫ f(y) log f(y) dy. Negentropy: J(y) = H(y_gauss) − H(y), where y_gauss is a Gaussian variable with the same covariance as y. Approximation of negentropy: J(y) ≈ c [E{G(y)} − E{G(ν)}]^2, where ν is a standardized Gaussian variable and G is a nonquadratic function.
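A small sketch of this negentropy approximation (an addition), using G(u) = log cosh u and dropping the proportionality constant; the test distributions are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def negentropy_approx(y, n_gauss=1_000_000):
    """J(y) ~ [E{G(y)} - E{G(v)}]^2 with G(u) = log cosh u and v standard
    Gaussian; y is assumed standardized, and the constant factor is dropped."""
    G = lambda u: np.log(np.cosh(u))
    v = rng.standard_normal(n_gauss)
    return (G(y).mean() - G(v).mean()) ** 2

n = 200_000
gauss = rng.standard_normal(n)
laplace = rng.laplace(size=n) / np.sqrt(2)              # unit variance, super-Gaussian
uniform = rng.uniform(-np.sqrt(3), np.sqrt(3), size=n)  # unit variance, sub-Gaussian
print(negentropy_approx(gauss))    # close to 0
print(negentropy_approx(laplace))  # clearly positive
print(negentropy_approx(uniform))  # clearly positive
```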

w  E{zg(w T z)}  E{g '(w T z)} w w  w/||w|| Fixed-point algorithm using negentropy Convergence : | |=1 Max J(y)

Fixed-point algorithm using negentropy (one-by-one estimation, fixed-point iteration):
1. Centering.
2. Whitening.
3. Choose m, the number of ICs to estimate. Set counter p ← 1.
4. Choose an initial guess of unit norm for w_p, e.g. randomly.
5. Let w_p ← E{z g(w_p^T z)} − E{g'(w_p^T z)} w_p.
6. Do deflation decorrelation: w_p ← w_p − Σ_{j=1}^{p−1} (w_p^T w_j) w_j.
7. Let w_p ← w_p / ||w_p||.
8. If w_p has not converged, go back to step 5.
9. Set p ← p + 1. If p ≤ m, go back to step 4.
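A corresponding sketch using the negentropy-based update with g = tanh (so G(u) = log cosh u); as before, the tolerance, iteration cap, and initialization are assumptions.

```python
import numpy as np

def fastica_negentropy(Z, m, max_iter=200, tol=1e-6, seed=0):
    """One-by-one (deflation) estimation on whitened data Z (rows = signals),
    using the negentropy-based update
        w <- E{z g(w^T z)} - E{g'(w^T z)} w,   w <- w / ||w||,
    with g = tanh, i.e. G(u) = log cosh u. Parameters are assumptions."""
    rng = np.random.default_rng(seed)
    n, _ = Z.shape
    W = np.zeros((m, n))
    for p in range(m):
        w = rng.standard_normal(n)
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            w_old = w
            y = w @ Z
            g = np.tanh(y)                                 # g(w^T z)
            g_prime = 1.0 - g ** 2                         # g'(w^T z)
            w = (Z * g).mean(axis=1) - g_prime.mean() * w  # Newton-type fixed point
            w = w - W[:p].T @ (W[:p] @ w)                  # deflation decorrelation
            w /= np.linalg.norm(w)
            if abs(w @ w_old) > 1 - tol:
                break
        W[p] = w
    return W
```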

Implementation: create two uniform sources

Implementation: two mixed observed signals

Implementation: centering

Implementation: whitening

Implementation: fixed-point iteration using kurtosis

Implementation: fixed-point iteration using negentropy
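Since the original figures for these implementation slides are not reproduced here, the following end-to-end sketch (with an assumed 2x2 mixing matrix and sample size) ties the steps together: create two uniform sources, mix, center, whiten, and run the kurtosis-based fixed-point iteration for one component.

```python
import numpy as np

rng = np.random.default_rng(5)

# 1. Create two uniform (sub-Gaussian) sources with zero mean and unit variance.
T = 10_000
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, T))

# 2. Mix them with an arbitrary (assumed) 2x2 mixing matrix.
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = A @ S

# 3. Centering.
X = X - X.mean(axis=1, keepdims=True)

# 4. Whitening via the eigendecomposition of the covariance matrix.
d, E = np.linalg.eigh(np.cov(X))
Z = np.diag(d ** -0.5) @ E.T @ X

# 5. Fixed-point iteration using kurtosis, for a single component.
w = rng.standard_normal(2)
w /= np.linalg.norm(w)
for _ in range(100):
    w_old = w
    w = (Z * (w @ Z) ** 3).mean(axis=1) - 3.0 * w
    w /= np.linalg.norm(w)
    if abs(w @ w_old) > 1 - 1e-6:
        break

# The recovered component should match one source up to sign and scale.
y = w @ Z
print(np.round([np.corrcoef(y, S[0])[0, 1], np.corrcoef(y, S[1])[0, 1]], 2))
```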

Fixed-point algorithm using negentropy. Entropy: H(y) = −∫ f(y) log f(y) dy. A Gaussian variable has the largest entropy among all random variables of equal variance. Negentropy: J(y) = H(y_gauss) − H(y), which is 0 for a Gaussian y and > 0 for a non-Gaussian y. Negentropy is the optimal estimator of non-Gaussianity, but it is computationally difficult since it requires an estimate of the pdf; hence we use an approximation of negentropy.

Fixed-point algorithm using negentropy. Higher-order cumulant approximation: J(y) ≈ (1/12) E{y^3}^2 + (1/48) kurt(y)^2. Since it is quite common that most random variables have an approximately symmetric distribution, the first (skewness) term usually vanishes, leaving the outlier-sensitive kurtosis term. Therefore, replace the polynomial functions by nonpolynomial functions G_i, e.g. G_1 odd and G_2 even, giving J(y) ≈ k_1 [E{G_1(y)}]^2 + k_2 [E{G_2(y)} − E{G_2(ν)}]^2.

Fixed-point algorithm using negentropy. Maximize E{G(w^T z)} subject to ||w||^2 = 1. According to the Lagrange multiplier method, at the optimum the gradient must point in the direction of w, i.e. F(w) = E{z g(w^T z)} − βw = 0 for some scalar β, where g = G'.

Fixed-point algorithm using negentropy. The plain gradient iteration does not have good convergence properties, because the nonpolynomial moments do not have the same nice algebraic properties as cumulants. Instead, solve for w by an approximative Newton method. The real Newton method is fast, needing only a small number of steps to converge, but it requires a matrix inversion at every step, which is a large computational load. The special properties of the ICA problem allow an approximative Newton method that needs no matrix inversion yet converges in roughly the same number of steps as the real Newton method.

Fixed-point algorithm using negentropy. According to the Lagrange condition, the gradient must point in the direction of w: F(w) = E{z g(w^T z)} − βw = 0. Solve this equation by Newton's method: JF(w) Δw = −F(w), i.e. Δw = [JF(w)]^(−1) [−F(w)]. Since the data are whitened, E{zz^T g'(w^T z)} ≈ E{zz^T} E{g'(w^T z)} = E{g'(w^T z)} I, which diagonalizes the Jacobian, and the Newton step becomes w ← w − [E{z g(w^T z)} − βw] / [E{g'(w^T z)} − β]. Multiplying both sides by the scalar β − E{g'(w^T z)} (which only rescales w) gives the fixed-point update

w ← E{z g(w^T z)} − E{g'(w^T z)} w, followed by w ← w/||w||.

w  E{zg(w T z)}  E{g ' (w T z)} w w  w/||w|| Fixed-point algorithm using negentropy Convergence : | |=1 Because ICs can be defined only up to a multiplication sign