Multiway Data Analysis


1 Multiway Data Analysis
Johan Westerhuis, Biosystems Data Analysis, Swammerdam Institute for Life Sciences, Universiteit van Amsterdam
Speaker notes: Briefly mention how nice it is to be back, that a lot has changed, but that they are now housed in beautiful buildings.

2 The “future” science faculty of the Universiteit van Amsterdam

3 The Biosystems Data Analysis group officially started in 2004 as a follow-up of the process analysis group at the Universiteit van Amsterdam. Its aims are: development and validation of new data analysis methods for summarizing and visualizing complex structured biological data (Metabolomics / Proteomics).

4 Three-way Data Three-way Models Three-way Applications

5 Three-way Data

6 Three-way data
Three-way data are a set of two-way matrices measured on the same objects and the same variables. IR, Raman and NMR spectra of the same samples do not give a three-way data set but a multi-block data set, because the variables differ between the blocks.

7 Examples of three-way data
Batch process: Batches × Process variables × Time
Fluorescence: Samples × Excitation × Emission
Chromatography: Samples × Chromatogram × UV
Sensory analysis: Products × Attributes × Judges
Image analysis: Image × Image × RGB

8 From noway to multi-way
[Figure: scalar; 1-way array (I); 2-way array (I × J); 3-way array (I × J × K); 4-way array (I × J × K × L); 5-way array (I × J × K × L × M).]

9 Slabs and tubes
Slabs: horizontal slab, vertical slab, frontal slab.
Tubes: horizontal tube, vertical tube, lateral tube.

10 Three slabs of fluorescence data 5 Samples x 60 Excitation x 200 Emission

11 Three-way batch process data
‘Engineering’ process data, e.g. temperature, pressure, flow rate.
Spectroscopic process data, e.g. NIR, Raman, UV-Vis.
One batch: X (J × K), process variable × time.
A series of batches: X (I × J × K), batch × process variable × time.

12 SBR batch process data Engineering variables

13 Spectroscopic three-way batch data
Two batch runs of a reaction followed with UV-Vis spectroscopy for 45 minutes

14 Batch fermentation in two steps: three-way multiblock
[Figure: two three-way blocks, Inoculum and Fermentation, each Batches × Variables × Time; product label: API.]

15 Four-way data in combinatorial catalysis
What we measure ... Composition Conditions Composition Conditions What we want

16 Multiway data from the Omics age
Metabolomics: Metabolites × Experiments × Time
Transcriptomics: Gene expression × Experiments × Time

17 Three-way Models

18 Some history M.C. Escher: Small problem with orthogonality

19 More history
Psychometrics (1944-1980)
Cattell 1944: Parallel Proportional Profiles (common factors fitted simultaneously to many data matrices).
Tucker 1964: Tucker models.
Carroll & Chang 1970: Canonical Decomposition (CANDECOMP).
Harshman 1970: Parallel Factor Analysis (PARAFAC).
Chemistry
Ho 1978: Rank Annihilation (close to Parafac) on fluorescence data.
Late 1980s to early 1990s: three-way methods to resolve LC-UV data.

20 Multiway PCA: Unfolding of three-way data
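[Figure: a three-way array X (I × J × K) unfolded into a two-way matrix, either I × JK or IK × J; the two unfoldings are labelled Wold and MacGregor.]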

21 Two ways of unfolding
Different assumptions in MSPC
Wold: nonlinear behavior remains in the data; batch trajectories are monitored; on-line monitoring.
MacGregor: nonlinearities removed; the whole batch is considered one measurement; off-line monitoring.
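The two unfoldings can be sketched with NumPy reshapes. This is a minimal illustration on a toy array, not the authors' code; the sizes and variable names are assumptions. Treating the whole batch as one measurement corresponds to one row per batch (I × JK), while monitoring trajectories corresponds to one row per batch and time point (IK × J).

```python
import numpy as np

# Toy sizes (assumed): I batches, J process variables, K time points.
I, J, K = 4, 3, 5
X = np.arange(I * J * K, dtype=float).reshape(I, J, K)

# Batch-wise unfolding: every batch becomes one row of length J*K.
X_batch = X.reshape(I, J * K)

# Variable-wise unfolding: every (batch, time) pair becomes one row of length J.
X_var = X.transpose(0, 2, 1).reshape(I * K, J)

# Where element x[i, j, k] ends up in each unfolded matrix.
i, j, k = 2, 1, 3
same_in_batch = X_batch[i, j * K + k] == X[i, j, k]
same_in_var = X_var[i * K + k, j] == X[i, j, k]
```

Column centring means something different in the two layouts: in the I × JK layout it removes the average trajectory of every variable, whereas in the IK × J layout it removes only one overall mean per variable.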

22 Extension of SVD to Parafac
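SVD: $X = U S V^T = s_1 u_1 v_1^T + s_2 u_2 v_2^T$
Parafac: $\underline{X} = a_1 \circ b_1 \circ c_1 + a_2 \circ b_2 \circ c_2$, with loading matrices $A = [a_1\ a_2]$, $B = [b_1\ b_2]$, $C = [c_1\ c_2]$ and a superdiagonal core $G$.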

23 Parafac / Candecomp
Parafac is not sequential: the whole model must be re-estimated when more components are calculated (no deflation).
The Parafac solution is unique: there is no rotational freedom, and changing the parameters will reduce the fit.
NB! A PCA model is not unique: $X = T P^T + E = T R R^{-1} P^T + E = C S^T + E$
Unique ≠ true
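The rotational freedom of a bilinear model is easy to check numerically. The sketch below uses random scores and loadings and an arbitrary invertible matrix R; all values are hypothetical and serve only to illustrate the identity on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.normal(size=(6, 2))      # scores (hypothetical values)
P = rng.normal(size=(4, 2))      # loadings (hypothetical values)
X = T @ P.T                      # noise-free bilinear data

R = np.array([[2.0, 1.0],
              [0.5, 3.0]])       # any invertible 2 x 2 matrix
C = T @ R                        # rotated scores
S = P @ np.linalg.inv(R).T       # rotated loadings, so that S^T = R^{-1} P^T

same_fit = np.allclose(X, C @ S.T)   # both parameter sets reproduce X
```

Both (T, P) and (C, S) describe X equally well, so the fit alone cannot choose between them. A trilinear Parafac model does not allow such a transformation without loss of fit.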

24 Extension of Two Mode component Analysis (TMCA)
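[Figure: two-mode component analysis, $X = A G C^T$ with core $G$ (P × R); Tucker III adds a loading matrix $B$ for the third mode and a three-way core $\underline{G}$ (P × Q × R).]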

25 Tucker models
Tucker I: one mode is reduced (equals MPCA).
Tucker II: two modes are reduced (loadings A and C, core G).
Tucker III: all three modes are reduced (loadings A, B and C, core G).

26 Tucker models
The core array can be fully filled: P × Q × R triads (1,1,1 / 1,1,2 / 1,2,1 etc.).
Not unique: rotational freedom. Components can be rotated towards orthogonality.
Not sequential.
Restricted Tucker models can be developed using prior chemical knowledge.

27 Number of parameters
Example: X (I × J × K) with I = 50, J = 9, K = 100 and P = Q = R = 3
Parafac: R × (I + J + K) = 477
Tucker3: P × I + Q × J + R × K + P × Q × R = 504
MPCA: R × (I + JK) = 2850
Fit: MPCA > Parafac (overfit?)
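The parameter counts follow directly from the three formulas. A short check with the example dimensions from the slide (variable names are illustrative):

```python
# Example dimensions from the slide.
I, J, K = 50, 9, 100
P = Q = R = 3

n_parafac = R * (I + J + K)                  # three loading matrices
n_tucker3 = P * I + Q * J + R * K + P * Q * R  # loadings plus full core
n_mpca = R * (I + J * K)                     # scores plus unfolded loadings

print(n_parafac, n_tucker3, n_mpca)          # 477 504 2850
```

MPCA spends roughly six times as many parameters as Parafac on the same array, which is why its higher fit should be read with some suspicion.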

28 Soft models vs hard models
Two-way bilinear model: Beer’s law (hard, no orthogonality constraints); PCA (soft, orthogonality constraints).
Trilinear model: Parafac, e.g. fluorescence (no orthogonality constraints).

29 Multiway Regression I
Two-step approach:
Step 1: decomposition of X into scores A and a model. This can be Parafac, Tucker, MPCA, etc.
Step 2: regression of y on A.
No information from y is used in the decomposition. Similar to the PCR method.
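The two-step approach can be sketched with MPCA as the decomposition step (Parafac or Tucker scores could be used in the same way). The data below are random and the sizes are assumptions; the point is only the order of operations: decompose X without looking at y, then regress y on the scores.

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K, R = 30, 4, 10, 3            # toy sizes and number of components (assumed)
X = rng.normal(size=(I, J, K))       # hypothetical three-way predictor array
y = rng.normal(size=I)               # hypothetical response

# Step 1: decompose X only. Here: unfold to I x JK, centre, keep R components.
Xu = X.reshape(I, J * K)
Xc = Xu - Xu.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
A = U[:, :R] * s[:R]                 # score matrix A (I x R)

# Step 2: ordinary least-squares regression of centred y on the scores.
yc = y - y.mean()
b, *_ = np.linalg.lstsq(A, yc, rcond=None)
y_hat = A @ b + y.mean()
residual = yc - A @ b
```

Because the scores are fixed before y is seen, the components that explain most of X are not necessarily the ones that predict y best.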

30 Multiway Regression II
Direct approach: X is now decomposed with y in mind. This leads to a suboptimal decomposition of X but an improved fit of y.

31 When data are not exactly 3-way
[Figure: batch × process variable × time array, with plots of variable versus time, indicator variable versus time, and variable versus indicator variable.]

32 Alignment problems
Peak shifts in LC-MS / GC-MS.
Warping methods to align the peaks: Dynamic Time Warping, Correlation Optimized Warping.
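As an illustration of the warping idea, here is a minimal sketch of textbook dynamic time warping on two synthetic one-dimensional profiles with a shifted peak. It is the plain unconstrained algorithm, not correlation optimized warping and not any specific chromatographic implementation; the function name and the signals are assumptions.

```python
import numpy as np

def dtw(a, b):
    """Return the DTW distance and the warping path between 1-D signals a and b."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)   # accumulated cost, padded with inf
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    # Backtrack from the end to recover which points are matched.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]

# Synthetic example: the same Gaussian peak, shifted by 5 points.
t = np.arange(50)
reference = np.exp(-0.5 * ((t - 20) / 3.0) ** 2)
shifted = np.exp(-0.5 * ((t - 25) / 3.0) ** 2)

distance, path = dtw(reference, shifted)
unaligned = np.abs(reference - shifted).sum()   # cost without any warping
```

The warping path can then be used to resample the shifted profile onto the reference time axis before the aligned profiles are stacked into a three-way array.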

33 Three-way Applications

34 Fluorescence data 5 samples with varying concentration of tyrosine, tryptophan and phenylalanine dissolved in phosphate buffered water. Excitation wavelength: 240 – 300 nm Emission wavelength: 250 – 450 nm

35 Unfold PCA model of Fluorescence data
99.97% explained with 3 PCs. Loadings refolded into excitation / emission form. Overfit of the data: loading 2 has negative parts, which does not agree with fluorescence theory.

36 Parafac model of Fluorescence data
99.93% explained variation: good fit. Loadings are very well interpretable. Intensity in the A mode can be related to concentration.
[Figure panels: B and C mode; A mode.]

37 Fluorescence data
Fluorescence data perfectly fit the trilinear model that is applied by Parafac. Due to the uniqueness property of Parafac, the loadings found will perfectly resemble the emission spectra and excitation spectra of the three compounds in the mixtures. This is a nice example of mathematical chromatography.
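The trilinear structure can be illustrated with a small alternating least squares (ALS) Parafac fit on synthetic data. This is a bare-bones sketch under stated assumptions: the concentrations and Gaussian "spectra" are invented, the array is much smaller than the real 5 × 60 × 200 data set, and no constraints, convergence checks or line search are included. The helper names `khatri_rao` and `parafac_als` are hypothetical.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product; row index is u * V.shape[0] + v."""
    return (U[:, None, :] * V[None, :, :]).reshape(-1, U.shape[1])

def parafac_als(X, n_comp, n_iter=500, seed=0):
    """Plain ALS for x_ijk = sum_r a_ir * b_jr * c_kr (no constraints)."""
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    B = rng.random((J, n_comp))
    C = rng.random((K, n_comp))
    X1 = X.reshape(I, J * K)                      # mode-1 unfolding
    X2 = X.transpose(1, 0, 2).reshape(J, I * K)   # mode-2 unfolding
    X3 = X.transpose(2, 0, 1).reshape(K, I * J)   # mode-3 unfolding
    for _ in range(n_iter):
        # Each step is a linear least-squares problem in one loading matrix.
        A = np.linalg.lstsq(khatri_rao(B, C), X1.T, rcond=None)[0].T
        B = np.linalg.lstsq(khatri_rao(A, C), X2.T, rcond=None)[0].T
        C = np.linalg.lstsq(khatri_rao(A, B), X3.T, rcond=None)[0].T
    return A, B, C

# Synthetic "fluorescence" data: 5 samples, 12 excitation and 20 emission channels.
rng = np.random.default_rng(42)
conc = rng.random((5, 3)) + 0.1                   # hypothetical concentrations
ex = np.exp(-0.5 * ((np.arange(12)[:, None] - [3, 6, 9]) / 1.5) ** 2)
em = np.exp(-0.5 * ((np.arange(20)[:, None] - [5, 10, 15]) / 2.0) ** 2)
X = np.einsum('ir,jr,kr->ijk', conc, ex, em)      # exactly trilinear, noise-free

A, B, C = parafac_als(X, n_comp=3)
X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
```

The recovered columns of A, B and C match the true profiles only up to permutation and scaling of the components, which is the usual indeterminacy left by the uniqueness property.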

38 Batch reaction monitoring
Pseudo-first-order reaction: A + B → C → D + E
UV-Vis spectrum ( nm) measured every 10 seconds. Obeys the Lambert-Beer law.
35 NOC (normal operating conditions) batches: X (35 × 201 × 271)
In addition, some disturbed batches were measured: pH disturbance during the reaction, temperature change, impurity.
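To make the kinetic and spectroscopic assumptions concrete, the sketch below simulates one batch for a consecutive first-order scheme A → C → D and converts concentrations to spectra with the Lambert-Beer law. The rate constants, initial concentration and Gaussian pure spectra are hypothetical, only three absorbing species are modelled, and reading 271 as the number of time points (45 min at 10 s intervals) and 201 as the number of wavelength channels is an assumption based on the array dimensions.

```python
import numpy as np

# Hypothetical values, chosen only for illustration (not from the slides).
k1, k2 = 5e-3, 1e-3          # rate constants in 1/s
a0 = 1.0                     # initial concentration of A (arbitrary units)

t = np.arange(0, 45 * 60 + 1, 10)   # one spectrum every 10 s for 45 min

# Analytical solution of consecutive first-order reactions A -> C -> D.
conc_a = a0 * np.exp(-k1 * t)
conc_c = a0 * k1 / (k2 - k1) * (np.exp(-k1 * t) - np.exp(-k2 * t))
conc_d = a0 - conc_a - conc_c       # mass balance
conc = np.column_stack([conc_a, conc_c, conc_d])   # time x species

# Hypothetical pure spectra: one Gaussian band per species on 201 channels.
channels = np.arange(201)
centres = np.array([50.0, 100.0, 150.0])
spectra = np.exp(-0.5 * ((channels[:, None] - centres) / 15.0) ** 2)  # channel x species

# Lambert-Beer: absorbance is bilinear in concentration and pure spectrum.
absorbance = conc @ spectra.T       # time x channel matrix for one batch
```

Stacking such matrices for many batches gives the batch × wavelength × time array, and the known form of the concentration profiles is exactly the kind of external knowledge a hard model can exploit.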

39 Aims and goals of research I
Data modelling: improve understanding of the process by interpretation of model parameters.
Analysis of historical batches: are the current process measurements able to distinguish between ‘good’ and ‘bad’ batches?
On-line monitoring: rapid fault detection; easier fault diagnosis (what is the cause of the fault?); prediction of batch duration.
Speaker notes: The model built from the historical batches must be robust, because it is intended to be used for a long time. That means it must withstand small disturbances in the process, yet be very sensitive to disturbances that are new. Second, the model must be easy to interpret, which is very important for interpreting process disturbances. The first step in building the model is the selection of good batches, and the model must be able to distinguish good batches from bad ones. The ultimate goal is on-line monitoring of the batch process. Every 10 seconds a measurement comes in, and on that basis it is checked whether the batch is still in control. The advantages of on-line monitoring are: rapid fault detection; easier diagnosis, because you know exactly when in the process the fault occurs; and prediction of the batch duration (in case batches do not all have the same duration).

40 Aims and goals of research II
Which batch is different ?

41 Unfold PCA model
Unfold keeping the batch direction (I × JK): $X = T P^T + E$

42 Unfold PCA model Many parameters estimated, likely to overfit the data

43 Unrestricted Parafac model
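The simplest three-way model is the PARAFAC model: $x_{ijk} = \sum_{r=1}^{R} a_{ir} b_{jr} c_{kr} + e_{ijk}$, i.e. loadings A, B and C with a superidentity core I and residuals E. The modes of X are batch, time and wavelengths.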

44 Unrestricted Parafac model
Loadings are highly correlated - solution may be unstable. Model is difficult to interpret. 99.4% fit Can external knowledge of the process be used to improve the model?

45 Grey Modelling of batch data
‘Black-box’ or ‘soft’ models are empirical models which aim to fit the data as well as possible, e.g. PCA, neural networks. Good fit, but difficult to interpret.
‘White’ or ‘hard’ models use known external knowledge of the process, e.g. a physicochemical model, mass-energy balances. Easy to interpret, but not always available.
‘Grey’ or ‘hybrid’ models combine the two. Good fit.

46 Modelling batch data
X = white part + black part + E
Total variation = systematic variation due to known causes + systematic variation due to unknown causes + unsystematic variation

47 External information
Incorporating external information can increase model interpretability and increase model stability.
Sources of external information: pure spectra, reaction kinetics.

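48 Restricted ‘white’ model
External information is introduced in the form of parameter restrictions: reaction kinetics, known spectra, Lambert-Beer law.
Model: $x_{ijk} = \sum_{p} \sum_{q} \sum_{r} g_{pqr}\, a_{ip} b_{jq} c_{kr} + e_{ijk}$, with loadings A, B, C, core G and residuals E. The modes of X are batch, time and wavelengths.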

49 Restricted Tucker model
Model is stable. 97.6% fit - lower than for black model Some systematic variation in the data is left unexplained by this model.

50 Grey model
White components: describe known effects.
Black components: can be interpreted.
99.8% fit (corresponds well with the estimated level of spectral noise of ≈ 0.13%)

51 Core array of restricted Tucker model
G is a 3 × 5 × 5 core array. Only these combinations are allowed to be non-zero:
g111: a1, b1, c1
g122: a1, b2, c2
g133: a1, b3, c3
g244: a2, b4, c4
g355: a3, b5, c5
All other core elements are fixed to 0.
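The restricted core can be written down directly. The sketch below builds a 3 × 5 × 5 core with only the five listed elements non-zero and shows that the Tucker reconstruction then reduces to a sum of five triads. Loading values, array sizes and the core values of 1.0 are placeholders, not estimates from the real data.

```python
import numpy as np

rng = np.random.default_rng(3)
I, J, K = 4, 6, 7                 # toy sizes, not the real data dimensions
A = rng.random((I, 3))            # 3 components in the first mode
B = rng.random((J, 5))            # 5 components in the second mode
C = rng.random((K, 5))            # 5 components in the third mode

# 0-based positions of g111, g122, g133, g244, g355.
allowed = [(0, 0, 0), (0, 1, 1), (0, 2, 2), (1, 3, 3), (2, 4, 4)]
G = np.zeros((3, 5, 5))
for p, q, r in allowed:
    G[p, q, r] = 1.0              # placeholder; real values are estimated

# Full Tucker3 reconstruction with the restricted core.
X_hat = np.einsum('pqr,ip,jq,kr->ijk', G, A, B, C)

# The same model written as an explicit sum of the five allowed triads.
X_triads = sum(
    G[p, q, r] * np.einsum('i,j,k->ijk', A[:, p], B[:, q], C[:, r])
    for p, q, r in allowed
)
```

Note that a1 takes part in three triads, which a Parafac model with its superdiagonal core cannot express; this is what the restricted Tucker structure adds.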

52 Grey model residuals

53 Properties of grey models
White and black model parts can be calculated:
simultaneously (via a restricted core matrix), with better % fit;
sequentially, with better diagnostics, which allows partitioning of variance: 100% = 97.1% + 1.9% + 0.2%;
simultaneously but with orthogonality restrictions, which also allow partitioning of variance.

54 Off-line batch monitoring
NOC: batches 1-32
Validation: batches 33-35
pH disturbed: batch 36
Temperature problem: batch 37
Impurity: batch 38

55 On-line monitoring of a validation batch
[Figure: on-line monitoring of batch 33. Top panel: ln(D-statistic) versus time with 95% and 99% confidence limits. Bottom panel: ln(SPE) versus time with 95% and 99% confidence limits.]

56 On-line monitoring of the pH disturbed batch
After 23 minutes SPE goes outside control limits pH was disturbed after 21 minutes Only small change in D-statistic
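The two monitoring statistics can be sketched for a simple PCA model of NOC data. This is an illustration on synthetic data, not the grey model from the slides: the data, sizes and fault are invented, the D-statistic is computed as a Hotelling-type sum of squared scores scaled by their NOC variance, and confidence limits are left out. The simulated fault lies outside the model plane, so it shows up in the SPE while the D-statistic stays the same, which mirrors the pattern described for the pH-disturbed batch.

```python
import numpy as np

rng = np.random.default_rng(2)
n_noc, n_var, n_comp = 32, 40, 2       # NOC batches, unfolded variables, components

# Synthetic NOC data: two latent directions plus small noise (hypothetical).
true_loadings = rng.normal(size=(n_var, n_comp))
X_noc = rng.normal(size=(n_noc, n_comp)) @ true_loadings.T
X_noc += 0.05 * rng.normal(size=(n_noc, n_var))

# PCA model of the centred NOC data.
mean = X_noc.mean(axis=0)
U, s, Vt = np.linalg.svd(X_noc - mean, full_matrices=False)
P = Vt[:n_comp].T                       # loadings (n_var x n_comp)
score_var = s[:n_comp] ** 2 / (n_noc - 1)

def d_and_spe(x):
    """D-statistic (scaled squared scores) and SPE (squared residual) of one sample."""
    xc = x - mean
    t = xc @ P                          # projection onto the model plane
    e = xc - t @ P.T                    # part of x the model cannot explain
    return float(np.sum(t ** 2 / score_var)), float(e @ e)

# A new in-control sample, and the same sample with a fault orthogonal to the model.
x_ok = rng.normal(size=n_comp) @ true_loadings.T + 0.05 * rng.normal(size=n_var)
x_fault = x_ok + 2.0 * Vt[n_comp]       # direction not used by the PCA model

d_ok, spe_ok = d_and_spe(x_ok)
d_fault, spe_fault = d_and_spe(x_fault)
```

In on-line use the same two numbers would be recomputed each time a new measurement arrives and compared with limits derived from the NOC batches.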

57 On-line monitoring of the temperature disturbed batch
Temperature slowly decreasing from start of reaction Rate constant k1 lower than usual. Contribution plot shows difference spectrum between reactant (too high) and intermediate (too low)

58 Want to know more?
Look at Rasmus Bro’s website

