1 2. The PARAFAC model Quimiometria Teórica e Aplicada Instituto de Química - UNICAMP
2 Example: fluorescence data (1) Each fluorescence spectrum is a matrix of emission vs excitation wavelengths: $X_i$ (201 × 61)
3 Example: fluorescence data (2) Each spectrum is a linear sum of three components: tryptophan, phenylalanine and tyrosine.
$X_i = a_{i1} b_1 c_1^T + a_{i2} b_2 c_2^T + a_{i3} b_3 c_3^T + E_i$
where $a_{i1}$ is the concentration of tryptophan in sample i, $b_1$ is the emission spectrum of pure tryptophan and $c_1$ is the excitation spectrum of pure tryptophan.
4 Example: fluorescence data (3) Five samples were measured and stacked to give a three-way array: X (5 × 201 × 61), i.e. 5 samples × 201 emission λ's × 61 excitation λ's.
$x_{ijk} = a_{i1} b_{j1} c_{k1} + a_{i2} b_{j2} c_{k2} + a_{i3} b_{j3} c_{k3} + e_{ijk}$
where the vector $a_1$ holds the concentration of tryptophan in each sample.
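As a rough illustration of how such a three-way array is built from its components, the sketch below assembles a noise-free array of the same size and checks that one slice equals the sum of three rank-one matrices. The loadings `A`, `B`, `C` are random stand-ins chosen for the example, not the measured fluorescence profiles.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, R = 5, 201, 61, 3   # samples, emission λ's, excitation λ's, components

# Hypothetical non-negative loadings standing in for the real profiles.
A = rng.random((I, R))       # concentrations
B = rng.random((J, R))       # emission spectra
C = rng.random((K, R))       # excitation spectra

# x_ijk = sum_r a_ir * b_jr * c_kr  (noise-free, so E = 0)
X = np.einsum("ir,jr,kr->ijk", A, B, C)

# Slice i is the sum of R rank-one matrices a_ir * b_r c_r^T.
i = 2
X_i = sum(A[i, r] * np.outer(B[:, r], C[:, r]) for r in range(R))
slices_agree = np.allclose(X[i], X_i)
```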
5 Example: fluorescence data (4) If we are given a set of fluorescence spectra, X, how can we determine:
–How many chemical species are present?
–Which chemical species are present? What are their pure excitation and emission spectra? i.e. self-modelling curve resolution (SMCR)
–What is the concentration of each species in each sample? i.e. (second-order) calibration
Answer: use the PARAFAC model!
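6 The PARAFAC model (1) The array X (I × J × K) is modelled as a sum of R triads plus residuals E:
$x_{ijk} = a_{i1} b_{j1} c_{k1} + a_{i2} b_{j2} c_{k2} + \dots + a_{iR} b_{jR} c_{kR} + e_{ijk}$
Each triad is built from one column of each loading matrix: $a_r$ from A, $b_r$ from B and $c_r$ from C.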
7 The PARAFAC model (2)
Loadings
–A (I × R) describes variation in the first mode.
–B (J × R) describes variation in the second mode.
–C (K × R) describes variation in the third mode.
Residuals
–E (I × J × K) are the model residuals.
8 Example: fluorescence data (5) X is 5 samples × 201 emission λ's × 61 excitation λ's.
Loadings
–A (5 × 3) describes the component concentrations.
–B (201 × 3) describes the pure component emission spectra.
–C (61 × 3) describes the pure component excitation spectra.
Residuals
–E (5 × 201 × 61) describes instrument noise.
9 Example: fluorescence data (6) A 3-component PARAFAC model describes 99.94% of X.
[Figure: B-loadings (201 × 3) and C-loadings (61 × 3), with profiles labelled tryptophan, tyrosine and phenylalanine.]
10 Example: fluorescence data (7) The A-loadings describe the relative amounts of species 1 (tryptophan), 2 (tyrosine) and 3 (phenylalanine) in each sample. In order to know the absolute amounts, it is necessary to use a standard of known concentrations, i.e. sample 5.
[Table: A (5 × 3) and concentrations (ppm).]
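The conversion from relative to absolute amounts can be sketched as a per-component rescaling against the standard. All numbers below are invented for illustration (the slide's actual A-loadings and ppm values are not reproduced here), and the names `A`, `known_ppm` and `conc_ppm` are assumptions of this example.

```python
import numpy as np

# Hypothetical A-loadings (5 samples x 3 components); relative amounts only.
A = np.array([[0.20, 0.90, 0.10],
              [0.40, 0.30, 0.60],
              [0.10, 0.50, 0.80],
              [0.70, 0.20, 0.30],
              [0.50, 0.40, 0.20]])

# Hypothetical known concentrations (ppm) of the standard, here sample 5.
known_ppm = np.array([2.5, 4.0, 3.0])

# One scale factor per component: ppm per unit of loading.
scale = known_ppm / A[4]

# Absolute concentrations for every sample and component.
conc_ppm = A * scale
```

By construction the last row of `conc_ppm` reproduces the standard, and every other row is expressed in the same units.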
11 The PARAFAC formula
Data array
–X (I × J × K) is matricized into $X_{I \times JK}$ (I × JK)
$X_{I \times JK} = A (C \odot B)^T + E_{I \times JK}$
where $\odot$ is the Khatri-Rao matrix product.
Loadings
–A (I × R) describes variation in the first mode
–B (J × R) describes variation in the second mode
–C (K × R) describes variation in the third mode
Residuals
–E (I × J × K) is matricized into $E_{I \times JK}$ (I × JK)
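The Khatri-Rao product is a column-wise Kronecker product: column r of the result is the Kronecker product of column r of C with column r of B. A minimal NumPy sketch follows; the helper name `khatri_rao` and the small integer matrices are assumptions made for this example.

```python
import numpy as np

def khatri_rao(C, B):
    """Column-wise Kronecker product: column r is kron(C[:, r], B[:, r])."""
    K, R = C.shape
    J, R_b = B.shape
    if R != R_b:
        raise ValueError("C and B must have the same number of columns")
    # Broadcast to (K, J, R), then stack the K blocks of J rows each.
    return (C[:, None, :] * B[None, :, :]).reshape(K * J, R)

C = np.array([[1, 2],
              [3, 4]])            # K = 2, R = 2
B = np.array([[1, 0],
              [0, 1],
              [2, 2]])            # J = 3, R = 2

Z = khatri_rao(C, B)              # shape (K*J, R) = (6, 2)

# Each column agrees with the ordinary Kronecker product of the two columns.
columns_match = all(
    np.array_equal(Z[:, r], np.kron(C[:, r], B[:, r])) for r in range(2)
)

# With A of size I x R, the model part A @ Z.T has size I x JK.
A = np.ones((4, 2))
model_shape = (A @ Z.T).shape
```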
12 PCA vs PARAFAC
PCA
–Bilinear model: $X = AB^T + E$
–Components are calculated sequentially in order of importance.
–Solution has rotational freedom.
–Orthogonal, i.e. $B^T B = I$.
PARAFAC
–Trilinear model: $X_{I \times JK} = A(C \odot B)^T + E_{I \times JK}$
–Components are calculated simultaneously in random order.
–Solution is unique (i.e. not possible to rotate factors without losing fit).
–Not (usually) orthogonal.
13 Rotational freedom The bilinear model $X = AB^T + E$ contains rotational freedom. There are many sets of loadings (and scores) which give exactly the same residuals, E:
$X = AB^T + E = A R R^{-1} B^T + E = A^* B^{*T} + E$, with $A^* = AR$ and $B^{*T} = R^{-1} B^T$ for any invertible square matrix R.
This model is not unique: there are many different sets of loadings which give the same % fit.
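The rotation argument is easy to check numerically. In the sketch below the matrices are random stand-ins, and `R_rot` is an arbitrary invertible matrix (made diagonally dominant so that it is guaranteed to be invertible); none of these values come from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((5, 3))                 # scores
B = rng.random((8, 3))                 # loadings

# Arbitrary invertible 3 x 3 matrix (strictly diagonally dominant).
R_rot = rng.random((3, 3)) + 3 * np.eye(3)

A_star = A @ R_rot                     # A* = A R
B_star = B @ np.linalg.inv(R_rot).T    # B*^T = R^{-1} B^T

# Both pairs reproduce the same model part, hence the same residuals E.
same_model = np.allclose(A @ B.T, A_star @ B_star.T)
loadings_differ = not np.allclose(B, B_star)
```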
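14 PARAFAC solution is unique The trilinear model $X = A(C \odot B)^T + E$ is said to be unique, because it is not possible to rotate the loadings without changing the residuals, E:
$X = A(C \odot B)^T + E = A R R^{-1} (C \odot B)^T + E = A^*(C^* \odot B^*)^T + E^*$
In general $R^{-1}(C \odot B)^T$ no longer has Khatri-Rao structure, so a trilinear model built on the rotated loadings $A^* = AR$ has different residuals, $E^* \neq E$. This is why PARAFAC is able to find the correct fluorescence profiles: the unique solution is close to the true solution.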
15 Spot the difference! [Figure: PCA loadings compared with PARAFAC loadings.]
16 Alternating least squares (ALS) How to estimate the PCA model $X = AB^T + E$?
Step 0 - Initialize B
Step 1 - Estimate A using least squares: $A = XB(B^TB)^{-1}$
Step 2 - Estimate B using least squares: $B = X^TA(A^TA)^{-1}$
Step 3 - Check for convergence - if not, go to Step 1.
Each update must reduce (or at least not increase) the sum-of-squares of the residuals, $\|E\|^2 = \sum_{ij} e_{ij}^2$.
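The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a production PCA routine: the function name `bilinear_als`, the stopping rule and the synthetic rank-2 test data are all choices made for the example. The least-squares updates are written with the pseudoinverse, which equals $XB(B^TB)^{-1}$ when B has full column rank.

```python
import numpy as np

def bilinear_als(X, n_components, n_iter=200, tol=1e-12, seed=0):
    """Fit X ~ A @ B.T by ALS; returns A, B and the sum-of-squares history."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((X.shape[1], n_components))  # Step 0: initialize B
    history = []
    for _ in range(n_iter):
        A = X @ np.linalg.pinv(B).T      # Step 1: least-squares A for fixed B
        B = X.T @ np.linalg.pinv(A).T    # Step 2: least-squares B for fixed A
        history.append(float(np.sum((X - A @ B.T) ** 2)))
        # Step 3: stop once the sum of squares no longer decreases noticeably.
        if len(history) > 1 and history[-2] - history[-1] <= tol * history[0]:
            break
    return A, B, history

# Synthetic rank-2 data with a little noise (hypothetical test case).
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 12))
X += 0.01 * rng.standard_normal((30, 12))

A, B, history = bilinear_als(X, n_components=2)
```

Note that the result spans the same subspace as the first two principal components but is not orthogonalized, which is the rotational freedom of the bilinear model showing up in practice.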
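17 Three different unfoldings – the formula is symmetric
$X_{I \times JK} = A(C \odot B)^T + E_{I \times JK}$
$X_{J \times KI} = B(A \odot C)^T + E_{J \times KI}$
$X_{K \times IJ} = C(B \odot A)^T + E_{K \times IJ}$
[Figure: the three unfolded matrices $X_{I \times JK}$, $X_{J \times KI}$ and $X_{K \times IJ}$.]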
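The symmetry of the three unfoldings can be verified on a small noise-free array. In this sketch the array sizes, the random loadings and the helper `khatri_rao` are assumptions of the example; the column ordering of each unfolding is chosen so that it matches the corresponding Khatri-Rao product.

```python
import numpy as np

def khatri_rao(P, Q):
    """Column-wise Kronecker product; column r is kron(P[:, r], Q[:, r])."""
    return (P[:, None, :] * Q[None, :, :]).reshape(-1, P.shape[1])

rng = np.random.default_rng(0)
I, J, K, R = 4, 5, 6, 2
A, B, C = rng.random((I, R)), rng.random((J, R)), rng.random((K, R))
X = np.einsum("ir,jr,kr->ijk", A, B, C)          # noise-free, so E = 0

# Unfold along each mode; the fastest-running column index is noted per line.
X_IJK = X.transpose(0, 2, 1).reshape(I, K * J)   # columns (k, j), j fastest
X_JKI = X.transpose(1, 0, 2).reshape(J, I * K)   # columns (i, k), k fastest
X_KIJ = X.transpose(2, 1, 0).reshape(K, J * I)   # columns (j, i), i fastest

checks = [
    np.allclose(X_IJK, A @ khatri_rao(C, B).T),
    np.allclose(X_JKI, B @ khatri_rao(A, C).T),
    np.allclose(X_KIJ, C @ khatri_rao(B, A).T),
]
```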
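18 How is the PARAFAC model calculated? How to estimate the model $X = A(C \odot B)^T + E$?
Step 0 - Initialize B & C
Step 1 - Estimate A: $A = X_{I \times JK} Z (Z^T Z)^{-1}$, where $Z = C \odot B$
Step 2 - Estimate B in same way: $B = X_{J \times KI} Z (Z^T Z)^{-1}$, where $Z = A \odot C$
Step 3 - Estimate C in same way: $C = X_{K \times IJ} Z (Z^T Z)^{-1}$, where $Z = B \odot A$
Step 4 - Check for convergence. If not, go to Step 1.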
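A compact sketch of this ALS loop is given below. It is an illustration under stated assumptions rather than a reference implementation: the names `parafac_als` and `khatri_rao`, the random initialization, the relative stopping rule and the synthetic data are all choices made for the example, and no constraints or line search are included. Each update uses the pseudoinverse of the Khatri-Rao matrix, which gives the ordinary least-squares solution when that matrix has full column rank.

```python
import numpy as np

def khatri_rao(P, Q):
    """Column-wise Kronecker product; column r is kron(P[:, r], Q[:, r])."""
    return (P[:, None, :] * Q[None, :, :]).reshape(-1, P.shape[1])

def parafac_als(X, n_components, n_iter=500, tol=1e-10, seed=0):
    """Fit a PARAFAC model to a three-way array X by alternating least squares."""
    I, J, K = X.shape
    # The three unfoldings used by the three updates.
    X_A = X.transpose(0, 2, 1).reshape(I, K * J)        # I x JK
    X_B = X.transpose(1, 0, 2).reshape(J, I * K)        # J x KI
    X_C = X.transpose(2, 1, 0).reshape(K, J * I)        # K x IJ
    rng = np.random.default_rng(seed)
    B = rng.random((J, n_components))                   # Step 0: initialize B & C
    C = rng.random((K, n_components))
    history = []
    for _ in range(n_iter):
        A = X_A @ np.linalg.pinv(khatri_rao(C, B)).T    # Step 1: estimate A
        B = X_B @ np.linalg.pinv(khatri_rao(A, C)).T    # Step 2: estimate B
        Z = khatri_rao(B, A)
        C = X_C @ np.linalg.pinv(Z).T                   # Step 3: estimate C
        history.append(float(np.sum((X_C - C @ Z.T) ** 2)))
        # Step 4: stop when the relative decrease in sum of squares is tiny.
        if len(history) > 1 and history[-2] - history[-1] <= tol * history[-2]:
            break
    return A, B, C, history

# Synthetic three-component data with a little noise (hypothetical test case).
rng = np.random.default_rng(42)
A0, B0, C0 = (rng.standard_normal((n, 3)) for n in (5, 20, 10))
X = np.einsum("ir,jr,kr->ijk", A0, B0, C0)
X += 0.01 * rng.standard_normal(X.shape)

A, B, C, history = parafac_als(X, n_components=3)
fit_percent = 100 * (1 - history[-1] / np.sum(X ** 2))
print(f"fit: {fit_percent:.2f}% after {len(history)} iterations")
```

The recovered columns may come out in a different order and with different scaling than `A0`, `B0` and `C0`, since the model fixes the components only up to permutation and scaling.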
19 Good initialization is sometimes important
Initialization methods
–random numbers (do this ten times and compare models)
–use another method to give rough estimate (e.g. DTLD, MCR)
–use sensible guesses (e.g. elution profiles are Gaussian)
[Figure: response surface with ALS paths from two initializations (B & C; B* & C*), one reaching a good solution and the other a local minimum.]
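Comparing models from different random starts needs some care, because PARAFAC components come out in arbitrary order and with arbitrary scaling. One possible way to compare two sets of loadings, sketched below, is the absolute cosine between columns combined with a brute-force search over column orderings. This is an assumption of the example rather than the method used in the slides, and the helper names `congruence` and `best_match` are hypothetical. The brute-force search is only practical for a small number of components.

```python
import itertools
import numpy as np

def congruence(P, Q):
    """Absolute cosine between every column of P and every column of Q."""
    Pn = P / np.linalg.norm(P, axis=0)
    Qn = Q / np.linalg.norm(Q, axis=0)
    return np.abs(Pn.T @ Qn)

def best_match(P, Q):
    """Ordering of Q's columns that best matches P, plus the matched cosines."""
    S = congruence(P, Q)
    R = S.shape[0]
    best = max(
        itertools.permutations(range(R)),
        key=lambda perm: sum(S[r, perm[r]] for r in range(R)),
    )
    return list(best), [float(S[r, best[r]]) for r in range(R)]

# Two hypothetical runs that found the same profiles, reordered and rescaled.
rng = np.random.default_rng(3)
B_run1 = rng.random((6, 3))
B_run2 = B_run1[:, [2, 0, 1]] * np.array([-2.0, 0.5, 3.0])

order, cosines = best_match(B_run1, B_run2)
```

Matched cosines close to 1 for every component suggest that the two runs reached the same solution; clearly lower values hint that at least one run stopped in a local minimum.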
20 Conclusions (1) The PARAFAC model decomposes a three-way array into three sets of loadings, one for each ‘mode’. Each set of loadings describes the variation in that mode, e.g. differences in concentration, changes in time, spectral profiles etc. PARAFAC components are calculated together and have no particular order. PARAFAC components are (usually) not orthogonal and cannot be rotated without losing fit. PARAFAC can be used for curve resolution and for calibration.
21 Conclusions (2) Some data sets have a chemical structure which is particularly suitable for the PARAFAC model, e.g. fluorescence spectroscopy. The PARAFAC model can also be used for four-way, five-way, N-way etc. data by simply using more sets of loadings.
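To make the N-way extension concrete, a four-way model simply adds a fourth loading matrix: $x_{ijkl} = \sum_r a_{ir} b_{jr} c_{kr} d_{lr} + e_{ijkl}$. The short sketch below builds such a noise-free array; the sizes, the random loadings and the name `D` for the fourth set of loadings are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_components = 2

# One loading matrix per mode (hypothetical sizes 4, 5, 6 and 3).
A, B, C, D = (rng.random((n, n_components)) for n in (4, 5, 6, 3))

# x_ijkl = sum_r a_ir * b_jr * c_kr * d_lr  (noise-free, so E = 0)
X4 = np.einsum("ir,jr,kr,lr->ijkl", A, B, C, D)

# Spot-check one element against the explicit sum over components.
element = sum(A[1, r] * B[2, r] * C[3, r] * D[0, r] for r in range(n_components))
element_matches = np.isclose(X4[1, 2, 3, 0], element)
```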