Multivariate Analysis For Omics Data

Presentation on theme: "Multivariate Analysis For Omics Data"— Presentation transcript:

1 Multivariate Analysis For Omics Data
Example of PCA and OPLS-DA applied to Metabolomics Dr Pierre Pétriacq

2 A complex situation…
What's the situation? People use fancy names for the fields of science that focus on particular biological levels: genomics, transcriptomics, proteomics and metabolomics. From this picture we can clearly see that complexity increases as we shift from genomics to post-genomic sciences such as protein studies and metabolite analyses: we go from four different bases to a far larger and more chemically diverse set of compounds. Credit: David S Wishart, Bioanalysis.

3 Characteristics of metabolomics data
Unknown number and identity of metabolites (contrast with microarray/sequence data). High degree of collinearity: biological and analytical correlations. Multiple analytical methods: NMR, LC-MS, GC-MS, CE-MS, MALDI-MS, etc. High throughput, high dimensionality (N variables >> N samples). Sensitive to many effects: error, bias, confounders. How to analyse?
What are the characteristics of metabolomics data? Techniques used in metabolomics are often high throughput and therefore produce data of high dimensionality: the number of variables (here, analytes) is much larger than the number of samples (which can itself be large). Collinearity is also high, with many correlations at the biological level (different metabolites from the same pathway, degradation products) and at the analytical level; for instance, one metabolite will ionise as different adducts, with different isotopic clusters, and so on. Metabolomics is the science of untargeted profiling, so we are working with unknown metabolites whose number can change drastically between experiments and/or conditions. This contrasts with microarray data, for which the sequence is known and identity is not speculated. Quite often, multiple analytical techniques are used for a single biological question; here it is essentially MALDI-MS and ESI-MS, with or without LC. Finally, metabolomics data are sensitive to many effects: without being too technical, errors and biases are common (lock-mass issues, instrument calibration, type of samples, etc.), as are confounders in a complex multifactor design. Considering this intricate situation, how can we analyse metabolomics data?

4 Why use multivariate methods?
They are not an end point, but continuing with univariate analysis is often misleading and inefficient (t-tests on hundreds of variables).
IT IS ABOUT: extracting information from data with multiple variables by using all the variables simultaneously.
MUCH LESS ABOUT: how to structure the problem, which variables to measure, which observations to measure.

5 Fundamental data analysis objectives
Overview (UNSUPERVISED: PCA, PCA-Class/SIMCA): trends, outliers, quality control, biological diversity, patient monitoring, pattern recognition.
Classification / discrimination / regression (SUPERVISED: PLS-DA, OPLS-DA, O2PLS-DA, O2-PLS): diagnostics (healthy/diseased), toxicity mechanisms, disease progression, discrimination between multiple groups, biomarker candidates, comparing studies or instrumentation, comparing blocks of omics data (metabolomic vs proteomic vs genomic, etc.), correlation spectroscopy (STOCSY).
It is your call to choose a particular MVA technique. However, many studies have proven useful in selecting typical workflows that require less hard work and provide great output for metabolomics. I am going to focus on two main methods: one unsupervised, PCA, and one supervised (i.e. it works with class information), OPLS-DA. PCA is broadly used in all omics areas and gives the best overview of a complex problem; it can detect trends and outliers and is commonly used for QC. OPLS-DA is used for discrimination between multiple groups (two or more) and is best suited for biomarker discovery.

6 Principal Component Analysis (PCA)
UNSUPERVISED MVA based on projection methods; the main tool used in chemometrics. It extracts and displays the systematic variation in the data. Each PC is a linear combination of the original data parameters, and each successive PC explains the maximum variance possible not accounted for by the previous PCs; the PCs are orthogonal to each other. Decomposition of the original data yields two matrices, known as scores and loadings. The scores (T) define a low-dimensional plane that closely approximates X; they are linear combinations of the original variables, and each point represents a single sample spectrum. The loading plot (P) shows the influence (weight) of the individual X-variables in the model; each point represents a different spectral intensity. The part of X that is not explained by the model forms the residuals (E):
X = TP' + E = t1p1' + t2p2' + … + E
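As an illustration (not from the original slides), here is a minimal numpy sketch of the X = TP' + E decomposition on toy, mean-centred data; the matrix sizes and variable names are arbitrary assumptions.

```python
import numpy as np

def pca_scores_loadings(X, n_components=2):
    """Return scores T, loadings P and residuals E for the first PCs of a centred matrix X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]   # scores: one point per sample
    P = Vt[:n_components].T                      # loadings: one weight per variable
    E = X - T @ P.T                              # residual matrix (unexplained variation)
    return T, P, E

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))                   # toy data: 20 samples, 100 variables
Xc = X - X.mean(axis=0)                          # mean-centring
T, P, E = pca_scores_loadings(Xc, n_components=2)
explained = 1 - (E**2).sum() / (Xc**2).sum()     # fraction of variance captured by 2 PCs
```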

7 PCA, a matrix factorisation tool (X = TP' + E)
[Figure: the data matrix X (N samples × K variables) is factorised into a score matrix T (vectors t1, t2, …, ti) and a loading matrix P' (vectors p'1, p'2, …, p'i), illustrated with a scores scatter plot of t[1] vs t[2].]
Scores (T): summarise the relationship (similarities/differences) between samples.
Loadings (P'): summarise the relationship (positive/negative correlations) between variables.
Credit: Henrik Antti / Umea University

8 Building models: geometrical view
1 biological sample → 1 metabolic profile (e.g. LC-MS trace) → 1 row of the data matrix → 1 point in 'metabolic space'. [Figure: spectra divided into M regions form the data matrix X; each row maps to a point with coordinates (x1, x2, x3) in the metabolic space.]
MVA methods are built on geometrical views. If you consider one biological sample, your metabolomics output is one metabolic profile from your favourite technique; in the data matrix, this is one row. The beauty of MVA is to convert this row into a point in a metabolic space whose variables define planes. MVA methods are therefore good techniques for data reduction: from the many variables that characterise the samples, typically two or three geometrical variables in the metabolic plane will explain the distribution of your samples and/or metabolites.

9 Partial Least Squares (PLS)
SUPERVISED learning method, recommended for two-class cases. The principles are those of PCA, but in PLS a second piece of information is used, namely the labelled set of class identities. Two data tables are considered: X (input data from samples) and Y (containing qualitative values such as class membership or treatment of the samples). The quantitative relationship between the two tables is sought:
X = TP' + E
Y = TC' + F
The PLS algorithm maximises the covariance between the X variables and the Y variables.
The next level is PLS (Partial Least Squares), an extension of multivariate linear regression. It is a supervised learning method and is therefore prone to overfitting. Standard PLS is recommended for two-class cases. It takes a different path from PCA: instead of ordering components by variance, it assumes that certain underlying latent factors cause the variance, and it maximises covariance. In a setup with factors X and responses Y, we try to extract the latent variables S and T via regression of X and Y onto S and T respectively. PLS can be used where traditional methods fail, for example when the number of predictors exceeds the number of observations. PLS-DA handles the case where the response encodes class membership, and OPLS-DA is a further generalisation of the same. What you will use depends on your problem setup and on what exactly is required.
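To make this concrete, here is a hedged sketch of PLS-DA with scikit-learn: PLS regression of X onto a dummy-coded class matrix Y. The data, class labels and number of components are toy assumptions, not values from the presentation.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 50))              # toy data: 20 samples, 50 metabolite intensities
y = np.array([0] * 10 + [1] * 10)          # two classes (e.g. control vs treated)
Y = np.column_stack([(y == 0), (y == 1)]).astype(float)  # dummy coding of class membership

pls = PLSRegression(n_components=2, scale=True)
pls.fit(X, Y)
scores = pls.transform(X)                  # sample scores in the latent-variable space
pred_class = pls.predict(X).argmax(axis=1) # assign each sample to the class with the larger predicted Y
```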

10 Orthogonal PLS
The OPLS method is a more recent modification of PLS that helps overcome its pitfalls. The main idea is to separate the systematic variation in X into two parts: one linearly related to Y and one unrelated (orthogonal) to it, the unrelated part being a conceptualisation of data noise. The model therefore comprises two modelled variations, the Y-predictive (TpPp') and the Y-orthogonal (ToPo') components; only the Y-predictive variation is used for modelling of Y.
X = TpPp' + ToPo' + E
Y = TpCp' + F
E and F are the residual matrices of X and Y.
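For readers who want to see the idea in code, below is a minimal single-response sketch in the spirit of the OPLS orthogonal-filtering step: it extracts one Y-orthogonal component and removes it from X. It is a simplified illustration under assumed centred inputs, not the reference algorithm as implemented in commercial software.

```python
import numpy as np

def opls_one_orthogonal_component(X, y):
    """Sketch of single-y OPLS: remove one Y-orthogonal component from centred X.
    X: (n_samples, n_vars) mean-centred data; y: (n_samples,) mean-centred response."""
    w = X.T @ y                              # Y-related weight vector
    w /= np.linalg.norm(w)
    t = X @ w                                # predictive score
    p = X.T @ t / (t @ t)                    # loading of X on the predictive score
    w_orth = p - (w @ p) * w                 # part of p orthogonal to the Y-related direction
    w_orth /= np.linalg.norm(w_orth)
    t_orth = X @ w_orth                      # Y-orthogonal score
    p_orth = X.T @ t_orth / (t_orth @ t_orth)
    X_filtered = X - np.outer(t_orth, p_orth)  # X with the Y-orthogonal variation removed
    return X_filtered, t, t_orth
```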

11 Proposed workflow
Data transformation: centring and scaling; log-transformation.
PCA: quality control; trends; outliers.
O(2)PLS-DA: class discrimination; biomarker discovery.
Model validation: R2, Q2 and cross-validation; back up with univariate analyses for selected variables.
Here is a workflow that is typically used in chemometrics. After inputting your dataset, there is a transformation step (developed later). First, a PCA is applied to check the scaling, for quality control, and to highlight putative trends and outliers. When the model is considered robust, more sophisticated learning methods can be applied: OPLS-DA is typically employed for group discrimination and for biomarker candidates. Bear in mind that models must be validated with specific parameters.

12 Transforming data
Data transformations are often ignored or considered of low importance in metabolomic studies; many papers do not even fully report what transformation and/or scaling was done!
What tends to be done in practice? No scaling (mean-centring only); Pareto scaling (division by the square root of the standard deviation); autoscaling (to unit variance), which is the default in SIMCA.
What is often left out? Log transformation is not used as frequently as you might expect.

13 Centring and scaling
Centring – move the centre of the point swarm to the origin. [Figure: a three-variable point swarm before and after centring around the mean.]
Centering converts all the concentrations to fluctuations around zero instead of around the mean of the metabolite concentrations. Hereby, it adjusts for differences in the offset between high and low abundant metabolites. It is therefore used to focus on the fluctuating part of the data [8,9], and leaves only the relevant variation (being the variation between the samples) for analysis. Centering is applied in combination with all the methods described below.
Credit: Henrik Antti / Umea University

14 Centring and scaling
Scaling – put each variable on an equal footing, e.g. make the standard deviations equal (not the only way). [Figure: the same three-variable point swarm before and after scaling.]
Scaling based on data dispersion: the scaling methods tested that use a dispersion measure for scaling were autoscaling [9], Pareto scaling [10], range scaling [11], and vast scaling [12] (Table 1). Autoscaling, also called unit or unit-variance scaling, is commonly applied and uses the standard deviation as the scaling factor [9]. After autoscaling, all metabolites have a standard deviation of one and therefore the data is analysed on the basis of correlations instead of covariances, as is the case with centering. Pareto scaling [10] is very similar to autoscaling; however, instead of the standard deviation, the square root of the standard deviation is used as the scaling factor. Now, large fold changes are decreased more than small fold changes, thus the large fold changes are less dominant compared to clean data. Furthermore, the data does not become dimensionless as after autoscaling (Table 1).
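The transformations discussed here can be written in a few lines. The sketch below (not part of the original slides) applies centring, unit-variance (auto)scaling, Pareto scaling and a log transformation column-wise to a samples-by-metabolites matrix; the offset before the log is an assumption to guard against zero intensities.

```python
import numpy as np

def centre(X):
    """Move the centre of the point swarm to the origin (column-wise mean-centring)."""
    return X - X.mean(axis=0)

def autoscale(X):
    """Unit-variance scaling: centre, then divide each column by its standard deviation."""
    return centre(X) / X.std(axis=0, ddof=1)

def pareto_scale(X):
    """Pareto scaling: centre, then divide by the square root of the standard deviation."""
    return centre(X) / np.sqrt(X.std(axis=0, ddof=1))

def log_transform(X, offset=1.0):
    """Log transformation; the offset is an assumed guard against zeros, not from the slides."""
    return np.log(X + offset)
```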

15 Centring and scaling
Most analytical data has two regimes:
Low intensity (close to the detection limit): std. dev. = const.
High intensity: std. dev. / mean = const.
Above the noise, large peaks vary more than small ones, and large peaks can dominate models → scaling/transformation can help.

16 Log-transforming data
Obviously, this converts a log-normal distribution to a normal one, or at least to a more symmetric distribution. Transformations are generally applied to correct for heteroscedasticity [7], to convert multiplicative relations into additive relations, and to make skewed distributions (more) symmetric. In biology, relations between variables are not necessarily additive but can also be multiplicative [13]; a transformation is then necessary to identify such a relation with linear techniques. Since the log transformation and the power transformation reduce large values in the data set relatively more than the small values, these transformations also have a pseudo-scaling effect, as differences between large and small values in the data are reduced.

17 Log-transforming data
Obviously, this converts a log-normal distribution to a normal one, but it also makes it easier to compare two different variables.

18 What is the practical value?
Reduce noise in the data and thereby enhance information content and quality. Example set of real samples: UPLC-qTOF-MS data, two groups, water vs 1 mM salicylic acid.

19 [Figure: PCA score plots of the example data set compared across transformations: no scaling, Pareto-scaled, Pareto-scaled + log-transformed, centre-scaled.]

20 What is the practical value?
Without some kind of transformation, PCA/PLS results are dominated by high-concentration metabolites. UV scaling can strongly overweight low-intensity peaks; this is a particular problem when there are noise regions in the data (the situation is somewhat different once peak detection has been applied). Log transformation and Pareto scaling are best: log reduces the relative abundance of large values, while Pareto keeps the data structure closer to its original measurement. The choice of transformation has a strong effect on which peaks are identified as influential.

21 One last suggestion for transformation…
Scale to unit variance in the control set only. Very simple thing to do!
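A minimal sketch of this suggestion, assuming a hypothetical boolean mask marking the control samples (names and shapes are illustrative, not from the slides):

```python
import numpy as np

def scale_to_control(X, control_mask):
    """Centre and scale all samples using the mean and standard deviation
    computed from the control samples only."""
    mu = X[control_mask].mean(axis=0)
    sd = X[control_mask].std(axis=0, ddof=1)
    return (X - mu) / sd        # unit variance with respect to the control set
```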


23 Outliers
Strong outliers: found in scores plots; detection tool: Hotelling's T2.
Moderate outliers: found in residuals plots; detection tool: DModX threshold.

24 Strong outlier detection: Hotelling's T2
Hotelling's T2 is a multivariate generalisation of Student's t-test; an ellipse of constant T2 defines a confidence region. Strong outliers are found as deviating points in the score scatter plot. The Hotelling's T2 region, shown as an ellipse, defines the 95% confidence interval of the model variation; observations outside the confidence ellipse are considered outliers. [Figure: two score plots, one without strong outliers and one with strong outliers falling outside the Hotelling's ellipse.]
Credit: Henrik Antti / Umea University
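A hedged code sketch of this idea: each sample's T2 is computed from its variance-normalised PCA scores and compared with an F-distribution-based 95% limit. The exact limit formula varies slightly between software packages, so treat this as an approximation.

```python
import numpy as np
from scipy import stats

def hotellings_t2(T, alpha=0.95):
    """T: (n_samples, n_components) PCA score matrix.
    Returns the per-sample T2 statistic and an approximate F-based limit."""
    n, a = T.shape
    s2 = T.var(axis=0, ddof=1)                        # variance of each score vector
    t2 = (T**2 / s2).sum(axis=1)                      # T2 statistic per sample
    limit = a * (n - 1) / (n - a) * stats.f.ppf(alpha, a, n - a)
    return t2, limit                                  # samples with t2 > limit are flagged
```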

25 Moderate outliers: DModX
[Figure: a two-variable point cloud with the first loading vector p1 and the residual ei1 of observation (i), i.e. the perpendicular distance from the point to the model.]
Distance to model: moderate outliers are detected through the residuals. A residual is the perpendicular distance from the model to the observed point; together the residuals form E, the variation not explained by the PCs. Moderate outliers can be detected with the DModX plot, a statistical test for outliers based on the model residual variance. The distribution of the residuals is approximately normal, and an F-test specifies the cut-off at e.g. 99% confidence.
Credit: Henrik Antti / Umea University
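As an illustration, a DModX-style distance can be computed from the PCA residual matrix as below: each sample's residual standard deviation is compared with the pooled residual standard deviation of the model. Normalisation constants differ between implementations, so this is only a sketch.

```python
import numpy as np

def dmodx(E, n_components):
    """E: (n_samples, n_vars) residual matrix from a PCA with n_components.
    Returns the normalised distance to the model for each sample."""
    n, k = E.shape
    s_obs = np.sqrt((E**2).sum(axis=1) / (k - n_components))                 # per-sample residual sd
    s0 = np.sqrt((E**2).sum() / ((n - n_components - 1) * (k - n_components)))  # pooled residual sd
    return s_obs / s0       # values well above ~1 suggest moderate outliers
```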

26 Model validation
Can we trust conclusions based on the model? Statistical and biological validation.
Statistical validation: goodness of fit to the data = R2; goodness of prediction for new data = Q2; cross-validation for O(2)PLS-DA = CV-ANOVA.
Often ignored, but a vital stage of the modelling process! The goodness of a model is all about perspective…

27 Conclusion: a mixed bag
Do not just apply a single transformation to your data when analysing it; Pareto + log tends to give the best results. Combine unbiased, unsupervised methods with supervised ones. Multivariate and univariate analyses are both useful: back up MVA with traditional statistical analyses of key biomarkers/variables. Make sure your own results are robust (we usually know when we are squeezing data beyond a sensible point…). Model validation is crucial.

28 Want to know more?
Tapp HS, Kemsley EK (2009) Notes on the practical utility of OPLS. TrAC Trends Anal Chem 28: 1322–1327.
Trygg J, Holmes E, Lundstedt T (2007) Chemometrics in Metabonomics. J Proteome Res 6: 469–479.
Wiklund S (2008) Multivariate Data Analysis for Omics. Umetrics, Umeå, Sweden.
van den Berg RA, Hoefsloot HCJ, Westerhuis JA, Smilde AK, van der Werf MJ (2006) Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics 7: 142.
Metabolomics Fiehn Lab:

29 Projection of data
The algorithm defines the position of the light source.
Principal Component Analysis (PCA): unsupervised, maximises variance (X).
Partial Least Squares / Projection to Latent Structures (PLS): supervised, maximises covariance (Y ~ X).


31 Model validation (2) - train/test
Split the data: a training set to build the model and a test set to validate it. The test set should be independent, and typically requires >1/3 of the data. All model parameters (e.g. number of components, variables selected) are optimised on the training set. The goodness-of-fit statistic on the test data indicates the predictive quality of the model. [Figure: samples 1–8 split into a training set used to build the model and a test set used for prediction.]

32 Model validation (3) - Cross-validation
General principle: remove some data; build the model on the remaining data; predict the removed data; repeat until all samples have been removed once. Compute the predictions and residuals (eik) for each sample when it is left out, then calculate PRESS and Q2 from all the residuals. This can be done for X or Y. [Figure: '3-fold' cross-validation, with samples 1–8 shuffled into rounds of training and test sets.]

33 Model validation (4) - R2 & Q2
R2  how much of the total variance is explained by the model R2 = 1 – RESS / TSS where RESS = Residual Error Sum of Squares = Sum(eik2) and TSS = Total Sum of Squares = Sum(xik2) Q2  how much variance is predictable by the model Or…how robust model is to removing data Q2 = 1 – PRESS / TSS where PRESS = Predicted Residual Error Sum of Squares = Sum(ê2) Residual for a predicted sample

34 Cross-validation - R2 and Q2
R2 & Q2 plot from the SIMCA-P software. R2 rises with each component; Q2 rises, then reaches a plateau or falls. Extra components are fitting structure which is unstable → noise. [Figure: R2 and Q2 plotted against the number of components.]

