Introduction to metabolomics and data integration Tom Lawson
Experience: Hands up who plans on doing metabolomics work? Hands up if you have already performed metabolomics analysis/experiments? Hands up if you have R experience?
birmingham.ac.uk/bmtc @BirmMetTrain bmtc@contacts.bham.ac.uk Providing training to empower the next generation of metabolomics researchers. The Birmingham Metabolomics Training Centre provides training to the metabolomics community in both analytical and computational methods. A combination of face-to-face and online courses is provided. 2017 Course List. Face-to-Face Courses: Introduction to Metabolomics for the Clinical Scientist (21st July 2017, 1st December 2017); Quality Assurance and Quality Control in Metabolomics (12th – 13th October 2017); Metabolite identification with the Q Exactive and LTQ Orbitrap (15th – 16th May 2017, 14th – 15th December 2017); Multiple Biofluid and Tissue Types, From Sample Preparation to Analysis Strategies for Metabolomics (5th – 7th June 2017, 6th – 8th December 2017); Metabolomics with the Q Exactive (3rd – 5th April 2017, 6th – 8th November 2017); Introduction to Metabolomics for the Microbiologist (20th – 22nd November 2017). Online Courses: Metabolomics: Understanding Metabolism in the 21st Century (8th May – 2nd June 2017); Metabolomics Data Processing and Data Analysis (20th February – 17th March 2017).
Outline General introduction to metabolomics 30 - 40 mins Data processing and analysis used in Metabolomics (with a focus on mass spectrometry) 45-50 mins Data integration 20-30 mins
Outline Introduction to metabolomics General overview Sample preparation Measurement Technologies Separation technologies Metabolomic data repositories
Metabolomics: Study of all low molecular weight compounds (metabolites) in a biological system. Metabolites have crucial functions: signalling, stimulation effects on enzymes, fuel. Studying the metabolome provides system-wide understanding of biological mechanisms and pathways (genome, transcriptome, proteome, metabolome). Types of metabolites: peptides, oligonucleotides, sugars, nucleosides, organic acids, ketones, aldehydes, amines, amino acids, lipids, steroids, alkaloids, foods, food additives, toxins, pollutants, drugs and drug metabolites.
Where do the metabolites come from? Metabolites synthesized from small molecule precursors Exogenous compounds coming from the diet, including those common with human metabolites Pharmaceuticals including antibiotics that alter and are altered by the microbiome Metabolite pool in tissues and biofluids Metabolites arising from commensal bacteria in the human gut (and other microbiomes) Environmental chemicals and toxins Metabolites specific to invasive, infecting microorganisms Taken from Stephen Barnes slides, UAB metabolomics workshop 2013 https://www.uab.edu/proteomics/metabolomics/workshop/2013/8%20Barnes%20-%20Untargeted%20and%20translational%20metabolomics.pdf
Metabolomic experiment typical goals Differentiate groups Can we see metabolite differences between sample groups (e.g. wild type & mutant)? Quantification Can we measure metabolite differences? Identification Can we identify the metabolites that have changed? Systems biology integration How do these metabolites interact with DNA, RNA and proteins?
Untargeted vs targeted metabolomics Literature Untargeted experiment Other omics Untargeted experiment Targeted experiment Measure unexpected changes in known (and unknown) metabolites Test hypothesis of expected changes in known metabolites Global metabolite profile with relative quantification >1000s metabolites measured Small number identified No chemical standards required Quantification of specific metabolites Approx. 20 metabolites measured Identify all with confidence Requires chemical standard Generate hypothesis for these metabolites https://www.futurelearn.com/courses/metabolomics/0/steps/10688
Metabolomics untargeted workflow Alonso, Arnald, Sara Marsal, and Antonio Julià. "Analytical methods in untargeted metabolomics: state of the art in 2015." Frontiers in bioengineering and biotechnology 3 (2015): 23.
Metabolomics untargeted workflow 1st part of training Alonso, Arnald, Sara Marsal, and Antonio Julià. "Analytical methods in untargeted metabolomics: state of the art in 2015." Frontiers in bioengineering and biotechnology 3 (2015): 23.
Outline Introduction to metabolomics General overview Sample preparation Measurement Technologies Separation technologies Metabolomic data repositories
Sample preparation: Metabolite pre-extraction. Quenching (stopping unwanted biochemical reactions): inhibits enzymatic activity. How? Cold methanol (< −40°C) or liquid nitrogen causes a sudden temperature shock. Drying: inhibits enzymatic activity and microbial growth. Water can affect the “solvation power of the extraction solvents”. Blast nitrogen onto samples. Mushtaq, Mian Yahya, et al. "Extraction for metabolomics: access to the metabolome." Phytochemical analysis 25.4 (2014): 291-306.
Sample preparation: Metabolite extraction. The metabolome can be divided into metabolites that are water-soluble and those that are not. Extraction methods depend on sample type and the metabolites under investigation. For polar metabolites (e.g. sugars, amino acids, alkaloids): 2.5 methanol : 1 water. For non-polar metabolites (e.g. lipids): 1 methanol : 1 chloroform : 0.9 water [1]. This is biphasic (polar and non-polar phases are separated), so it can actually be used for both polar and non-polar extraction. [1] Bligh and Dyer 1959
Outline Introduction to metabolomics General overview Sample preparation Measurement Technologies Separation technologies Metabolomic data repositories
Measurement Technologies Nuclear Magnetic Resonance (NMR) spectroscopy Mass spectrometry (MS) Imaging MS
NMR - Background 1948, Varian, founded in San Francisco, measured the gyro-magnetic ratio of certain atoms. This effect later became known as nuclear magnetic resonance http://www.agilent.com/labs/features/2013_101_nmr.html
NMR - Background: The atomic nucleus is a spinning charged particle that generates a magnetic field. Without an external field, the spin orientation is random. When an external magnetic field is applied, the nuclei align with or against the field. Applying a radio frequency corresponding to a specific set of nuclei causes a flip from the alpha to the beta spin state [1]. Relaxation of the nuclei to their original spin state emits characteristic electromagnetic signals, captured as a function of signal intensity vs time. The time domain is converted into the frequency domain through Fourier transformation (FT). “The location, shape, and area of the signals in each spectrum provide spatial and connectivity information about the nuclei in the sample” [2]. [1] http://www.chem.ucla.edu/~harding/notes/notes_14C_nmr02.pdf [2] http://www.agilent.com/labs/features/2013_101_nmr.html
NMR - metabolomics One-dimensional (1D) 1H NMR is the most widely used NMR approach in metabolomics 13C, 15N, and 31P can also be used Two dimensional (2D) NMR methods offer unambiguous identification of metabolites. NMR Benefits: Very reproducible Non destructive of sample Typically no separation required [1] [1] Hao, Jie, et al. "Bayesian deconvolution and quantification of metabolites in complex 1D NMR spectra using BATMAN." Nature protocols 9.6 (2014): 1416-1427.
NMR - popular manufacturers AVANCE III HD Fourier 300 HD JNM-ECZS
MS - Background Joseph John Thomson (1856 – 1940). Amongst other things, created the first mass spectrometer (then called a parabola spectrograph) for the determination of mass-to-charge ratios of ions ca. 1912 at University of Cambridge. [1] [2] [1] GWS - The Great War: The Standard History of the All Europe Conflict (volume four) edited by H. W. Wilson and J. A. Hammerton [2] Cambridge station ca. 1850 http://www.cambridge-news.co.uk/news/history/how-cambridges-trains-first-took-12460397
MS – Simple schematic Simplistically, a mass spectrometer has 3 components (ion source, m/z analyser and detector) inlet Ion source m/z analyser Detector Data system
MS – ion source: A mass spectrometer can only detect charged ions, so we need to generate gas-phase ions with an ion source. Upon ionisation of a sample molecule (M), a molecular ion (M+ or M−) is formed. Common types: atmospheric pressure chemical ionisation (APCI), electron ionisation (EI), electrospray ionisation (ESI), matrix-assisted laser desorption/ionisation (MALDI). EI is used for GC-MS; ESI and APCI are used for LC-MS.
MS – Mass analyser: Separation of the ions based on their mass-to-charge (m/z) ratio, using either magnetic or electric fields. Common types: triple quadrupole (QQQ), time-of-flight (TOF), quadrupole TOF (qTOF), Orbitrap, Fourier transform ion cyclotron resonance (FT-ICR).
MS – Detector: Detects the ion beams generated from the m/z analyser. Types: electron multipliers, Faraday cups, photographic plates.
Example MS spectra: Arginine (neutral exact mass: 174.11168). Measure the ions:
Adduct | Mass shift | Exact m/z (expected) | m/z (observed)
[M+H]+ | +1.007276 | 175.118956 | 175.118965
[M+Na]+ | +22.989218 | 197.100898 | 197.100918
http://mona.fiehnlab.ucdavis.edu/spectra/display/BML80770 (data originally from MassBank, LC-ESI-QTOF)
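The expected values in the arginine example can be reproduced with a few lines of Python. This is a minimal sketch: the function names `expected_mz` and `ppm_error` are illustrative and not part of any metabolomics package, and the mass shifts are the ones quoted on the slide for singly charged ions.

```python
# Mass shifts for two common positive-mode adducts (values from the slide).
ADDUCT_SHIFTS = {
    "[M+H]+": 1.007276,
    "[M+Na]+": 22.989218,
}


def expected_mz(neutral_mass, adduct):
    """Expected m/z of a singly charged adduct ion."""
    return neutral_mass + ADDUCT_SHIFTS[adduct]


def ppm_error(observed, expected):
    """Mass error of an observed m/z in parts per million."""
    return (observed - expected) / expected * 1e6


arginine = 174.11168  # neutral exact mass from the slide
observed_peaks = {"[M+H]+": 175.118965, "[M+Na]+": 197.100918}

for adduct, observed in observed_peaks.items():
    expected = expected_mz(arginine, adduct)
    print(f"{adduct}: expected {expected:.6f}, observed {observed:.6f}, "
          f"error {ppm_error(observed, expected):.3f} ppm")
```

The ppm error is the usual way to judge whether an observed peak is close enough to the expected m/z for the instrument in use.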
Isotopes Atoms with the same # protons but different # neutrons 12C : Mass of atom: 12.00000, natural abundance: 98.93 % 13C : Mass of atom: 13.00335, natural abundance: 1.07 % 12C 13C C6H12O6: Theoretical isotopic distribution and mass spec http://www.sisweb.com/mstools/isotope.htm
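As a rough illustration of where the isotopic distribution comes from, the sketch below estimates the carbon contribution to the pattern of C6H12O6 with a binomial model. It is a simplification under stated assumptions: only 12C/13C is considered (hydrogen and oxygen isotopes are ignored), and the function name `carbon_isotope_pattern` is made up for this example.

```python
from math import comb

P_13C = 0.0107  # natural abundance of 13C from the slide


def carbon_isotope_pattern(n_carbons, max_heavy=3):
    """Probability that exactly k of the carbons are 13C, for k = 0..max_heavy."""
    return [
        comb(n_carbons, k) * P_13C**k * (1 - P_13C) ** (n_carbons - k)
        for k in range(max_heavy + 1)
    ]


# Glucose (C6H12O6) has six carbons.
pattern = carbon_isotope_pattern(6)
for k, probability in enumerate(pattern):
    # Report each peak relative to the monoisotopic (all-12C) peak.
    print(f"M+{k}: {probability / pattern[0] * 100:.3f} % of the M peak")
```

Under this model the M+1 peak scales with the number of carbons, which is why the M+1/M ratio gives a rough hint of the carbon count.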
Tandem MS (MS/MS) Overview. MS1: samples are ionised, separated by m/z and then detected. Isolation: isolate an ion of interest. Fragmentation: create fragment ions by collision-induced dissociation (CID) or other methods. MS2: the fragment ions are separated by m/z and then detected. Tandem in space: quadrupole, time of flight (TOF). Tandem in time: ion trap, which allows MSn. To achieve detailed fragmentation with soft ionization techniques, such as electrospray ionization (ESI) in conjunction with liquid chromatography (LC), a collision step is normally required, which can be attained through collision-induced dissociation (CID) or higher-energy collisional dissociation (HCD). The term tandem mass spectrometry (MS/MS or MS2) is used when a single collision step is used, but fragment ions can be isolated for further collision to provide MS3 spectra or more.
MS – popular manufacturers: Waters, Thermo Fisher, Agilent, AB Sciex, Bruker. Example instruments: Thermo Fisher Q Exactive (quadrupole Orbitrap), Thermo Fisher Orbitrap Elite (ion trap Orbitrap), Waters Xevo G2-XS QTof (quadrupole time-of-flight).
MS – a lot of choice! This is just with Thermo Fisher Orbitrap systems…
NMR vs MS
Property | NMR | MS
Reproducibility | Very good | Fair
Sensitivity | Less sensitive (≈100 metabolites) | Very sensitive (>1000 metabolites)
Sample recovery | Non-destructive | Destructive
Sample preparation | Minimal (no separation required) | Depends (typically separation performed but not always)
Sample used | Typically 200-400 µL, but microcoil probes can use 5-10 µL | Very low µL
Popularity (number of papers 2013-2017 in PubMed*) | 1780 | 4644
* PubMed queries: (((Mass spectrometry) OR MS) AND Metabolomics) AND ("2013"[Date - Publication] : "3000"[Date - Publication]); (((Nuclear Magnetic Resonance) OR NMR) AND Metabolomics) AND ("2013"[Date - Publication] : "3000"[Date - Publication])
MS vs NMR Can we use both and integrate? Yes but expensive and requires expertise Bingol, Kerem, and Rafael Brüschweiler. "Two elephants in the room: new hybrid nuclear magnetic resonance and mass spectrometry approaches for metabolomics." Current opinion in clinical nutrition and metabolic care 18.5 (2015): 471-477.
Outline Introduction to metabolomics General overview Sample preparation Measurement Technologies Separation technologies Metabolomic data repositories
Separation technology: Very complex mixtures of compounds (peptides, oligonucleotides, sugars, nucleosides, organic acids, ketones, aldehydes, amines, lipids etc.). Some metabolites have the same mass but different structures (isomers). For more accurate annotation/identification of metabolites a separation technique is required, e.g.: solid phase extraction (SPE), gas chromatography (GC), capillary electrophoresis (CE), high performance liquid chromatography (HPLC), ultra high performance liquid chromatography (UHPLC).
LC-MS spectra 3 Dimensions Time Intensity m/z
Outline Introduction to metabolomics General overview Sample preparation Measurement Technologies Separation technologies Metabolomic data repositories
Metabolomic data repositories: Experimental data and metadata from metabolomic studies are stored in MetaboLights (here at EBI Cambridge) and Metabolomics Workbench. This makes research more reproducible and open, and facilitates new research!
MetaboLights repository: Requires submission of studies in ISA-Tab file format, a hierarchical structure of files for recording experimental, sample and study design information: Investigation file, Study file(s), Assay file(s).
mzML2ISA & nmrML2ISA software: Submission to MetaboLights is time consuming and error prone. Most instrument metadata is available in the open source file format. The mzML2ISA & nmrML2ISA software automatically creates semi-complete ISA-Tab files, which reduces time and user error. API, CLI, GUI and Galaxy interface. Documentation: http://2isa.readthedocs.io/en/latest/ Youtube: https://www.youtube.com/watch?v=xy3uusQRkbI
Workflow Documentation: http://2isa.readthedocs.io/en/latest/ Youtube: https://www.youtube.com/watch?v=xy3uusQRkbI
Outline General introduction to metabolomics 20 - 25 mins Data processing and analysis used in Metabolomics (focus on LC-MS) 30 mins Data integration 20 mins
Outline Metabolomics data processing and analysis (focus on mass spectrometry) Open source vs proprietary Spectral processing Data analysis Annotation
Open source vs proprietary software: Vendors (Thermo, AB Sciex, Waters, Agilent) all provide their own software, e.g. Xcalibur, LipidSearch, Compound Discoverer, MassLynx, Progenesis. Open source tools are available for most things though. A recent survey [1] found that most researchers use open source software and the commercial software bundled with the instrument. XCMS is the most popular metabolomics software. [1] Weber, Ralf JM, Thomas N. Lawson, Reza M. Salek, Timothy MD Ebbels, Robert C. Glen, Royston Goodacre, Julian L. Griffin et al. "Computational tools and workflows in metabolomics: An international survey highlights the opportunity for harmonisation through Galaxy." Metabolomics 13, no. 2 (2017): 12.
Outline Metabolomics data processing and analysis (focus on mass spectrometry) Open source vs proprietary Spectral processing Data analysis Annotation
Metabolomics untargeted workflow Alonso, Arnald, Sara Marsal, and Antonio Julià. "Analytical methods in untargeted metabolomics: state of the art in 2015." Frontiers in bioengineering and biotechnology 3 (2015): 23.
Spectral processing (focus on LC-MS) Alonso, Arnald, Sara Marsal, and Antonio Julià. "Analytical methods in untargeted metabolomics: state of the art in 2015." Frontiers in bioengineering and biotechnology 3 (2015): 23.
LC-MS spectral processing and adduct workflow: raw files → msconvert → mzML files → XCMS (feature detection → grouping → retention time alignment → grouping) → CAMERA (groupFWHM → findIsotopes → groupCorr → findAdducts) → adduct annotated peaklist
Raw file conversion (raw files → msconvert → mzML files): Convert “raw” spectra (proprietary vendor format) to the open source format mzML, the current open standard for mass spectrometry metabolomics and proteomics data. msconvert (ProteoWizard) tool: GUI and CLI. Profile vs centroid. http://workflow4metabolomics.org/sites/workflow4metabolomics.org/files/files/w4m_xcms_CAMERA_LC-MS.pdf
The Benefits of Open Source File Formats Makes it easy for software developers and bioinformaticians Only need to write code for 1 file format Standardised data (and metadata) Don’t need to have access to proprietary software Open Source File Format instrumentation .mzML mass spectrometry .imzML imaging mass spectrometry .nmrML Nuclear magnetic resonance spectroscopy
Peak picking with XCMS: Very popular open source data pre-processing software for LC-MS and GC-MS. Chromatographic peak picking algorithms: matchedFilter (the original algorithm; use for profile, low resolution MS data) and centWave (use for centroid, high resolution data). Parameters: XCMS Online. Workflow so far: raw files → msconvert → mzML files → XCMS feature detection
Peak picking Raw files mzML files msconvert XCMS Feature detection
Grouping: The feature detection stage works on one file (sample or replicate) at a time. We need to group chromatographic features between files based on their m/z and retention time range. Workflow so far: raw files → msconvert → mzML files → XCMS feature detection → grouping
Retention time alignment: There can be drift between LC-MS runs, which might need correction; this can be required for larger studies. Locally weighted scatterplot smoothing (LOESS): uses “well behaved” peak groups to calculate retention time deviations. OBI-warp: dynamic time warping algorithm. Workflow so far: raw files → msconvert → mzML files → XCMS feature detection → grouping → retention time alignment → grouping
What parameters to use? Can use XCMS online as starting point R documentation for parameters Workflow4metabolomics Galaxy implementation of XCMS and CAMERA Provide user friendly documentation of the tools [1] IPO R package can optimize parameters [2] [1] http://workflow4metabolomics.org/sites/workflow4metabolomics.org/files/files/w4m_xcms_CAMERA_LC-MS.pdf [2] https://bioconductor.org/packages/devel/bioc/vignettes/IPO/inst/doc/IPO.html
Try it yourself! XCMS processing [1]: Create a user account for XCMS Online https://xcmsonline.scripps.edu/ and create a single job.
Try it yourself! XCMS processing [1]: Open RStudio
Adduct annotation: Initial grouping (CAMERA groupFWHM). Using the most intense features, the data are divided into rough retention time groups, based on 60% of the chromatographic peak FWHM (full width at half maximum).
Adduct annotation: Isotopes (CAMERA findIsotopes). Looks for 12C/13C isotope differences of m/z 1.0033 and checks whether the intensity profile matches for [M]+ to [M+1]+.
Adduct annotation: correlation (CAMERA groupCorr). The extracted ion chromatograms (EIC) for each feature are used to calculate two types of correlation between features: correlation across samples (CAS) and correlation within samples (CPSi).
groupCorr: CAMERA relationship scoring Pearson's correlation CAS: Intensity correlation across samples CPSi: Peakshape correlation for sample i ISO: if isotope relationship (1 or 0)
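To make the correlation step concrete, the sketch below computes Pearson's correlation between the intensities of two features across samples (the CAS idea). It is not CAMERA's implementation, which is an R package with its own scoring; the feature intensities are invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical intensities of three features measured in five samples.
feature_a = [1000, 1500, 800, 2000, 1200]   # e.g. an [M+H]+ ion
feature_b = [210, 290, 170, 405, 250]       # tracks feature_a closely
feature_c = [900, 300, 1100, 500, 700]      # behaves differently

print(f"CAS(a, b) = {pearson(feature_a, feature_b):.3f}")
print(f"CAS(a, c) = {pearson(feature_a, feature_c):.3f}")
```

Features that come from the same compound (adducts, isotopes, fragments) are expected to rise and fall together across samples, so a high correlation supports placing them in the same group.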
groupCorr: CAMERA relationship scoring: The relationship map can be used to build a network. Nodes: features (peaks). Weighted edges: the score (above a threshold). If features have many close connections, they are likely to be from the same compound. The networks are broken into smaller, more connected groups using the “highly connected subgraphs” algorithm. This creates peak correlation (peak cluster) groups, e.g. the red and blue colours, which are potentially 2 different closely eluting compounds.
Adduct annotation: find the adducts (CAMERA findAdducts). All m/z differences within a peak correlation group are matched against a list of rules.
Adduct annotation: find the adducts (CAMERA findAdducts). Matches with the same molecular mass hypothesis (below a given relative error) are combined into “groups”.
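The idea of a shared molecular mass hypothesis can be sketched as follows. This is a simplified illustration, not CAMERA's algorithm: it uses only two rules, a pairwise comparison and a 5 ppm tolerance, all of which are assumptions made for the example. The first two peaks are the arginine ions from earlier in the slides; the third m/z is invented.

```python
# Two illustrative adduct rules (mass shifts for singly charged ions).
RULES = {"[M+H]+": 1.007276, "[M+Na]+": 22.989218}


def mass_hypotheses(mzs, rules=RULES):
    """Every (neutral mass, peak index, adduct) combination, sorted by mass."""
    return sorted(
        (mz - shift, index, adduct)
        for index, mz in enumerate(mzs)
        for adduct, shift in rules.items()
    )


def group_hypotheses(mzs, ppm=5.0):
    """Pair up hypotheses from different peaks that agree on the neutral mass."""
    hypotheses = mass_hypotheses(mzs)
    groups = []
    for a in range(len(hypotheses)):
        for b in range(a + 1, len(hypotheses)):
            mass_a, peak_a, adduct_a = hypotheses[a]
            mass_b, peak_b, adduct_b = hypotheses[b]
            agrees = abs(mass_b - mass_a) / mass_a * 1e6 <= ppm
            if peak_a != peak_b and agrees:
                groups.append(((mass_a + mass_b) / 2, {peak_a: adduct_a, peak_b: adduct_b}))
    return groups


# Peaks from one correlation group.
peaks = [175.118965, 197.100918, 210.0500]
groups = group_hypotheses(peaks)
for neutral_mass, annotation in groups:
    print(f"neutral mass {neutral_mass:.5f}: {annotation}")
```

Here the first two peaks support the same neutral mass (one as [M+H]+, the other as [M+Na]+), while the third peak is left unannotated.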
Adduct annotation: results! Raw files mzML files msconvert CAMERA XCMS groupFWHM Feature detection findIsotopes Grouping Retention time alignment groupCorr Grouping findAdducts Adduct annotated peaklist
Direct infusion mass spectrometry (DIMS) No chromatography High throughput alternative to LC-MS or GC-MS Protocol for complete experimental and data analysis workflow using spectral-stitching method [1] https://github.com/Viant-Metabolomics/Galaxy-M Other software to process DI-MS data XCMS msPurity MI-pack (annotation) Southam, A. D., Weber, R. J., Engel, J., Jones, M. R., & Viant, M. R. (2016). A complete workflow for high-resolution spectral-stitching nanoelectrospray direct-infusion mass-spectrometry-based metabolomics and lipidomics. Nature Protocols, 12(2), 255-273
Try it yourself! CAMERA
Other important processing steps: Blank filter: sample intensity should be > 5 times the blank intensity. RSD < 20% for samples and QCs*? Feature found in QC samples? *QC (quality control) samples: pool of all the samples.
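The blank and RSD filters can be sketched for a single feature as below. The thresholds (5-fold, 20%) come from the slide; comparing mean intensities and using the sample standard deviation are assumptions, since the slide does not say which summary statistics to use.

```python
from statistics import mean, stdev


def passes_blank_filter(sample_intensities, blank_intensities, fold=5.0):
    """Keep a feature if its mean sample intensity exceeds `fold` x the mean blank."""
    return mean(sample_intensities) > fold * mean(blank_intensities)


def rsd_percent(values):
    """Relative standard deviation (coefficient of variation) in percent."""
    return stdev(values) / mean(values) * 100


def passes_rsd_filter(qc_intensities, max_rsd=20.0):
    """Keep a feature if it is measured reproducibly in the pooled QC samples."""
    return rsd_percent(qc_intensities) < max_rsd


# Hypothetical intensities for one feature.
samples = [1000, 1200, 950]
blanks = [100, 150]
qcs = [100, 110, 90]

print("blank filter:", passes_blank_filter(samples, blanks))
print(f"QC RSD: {rsd_percent(qcs):.1f} %, keep: {passes_rsd_filter(qcs)}")
```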
Outline Metabolomics data processing and analysis (focus on mass spectrometry) Open source vs proprietary Spectral processing Data analysis Annotation
Metabolomics untargeted workflow
Data analysis
Feature matrix Similar data analysis approaches used for metabolomics and other ‘omics technologies. Large data matrices of samples vs features Look for fold changes between sample groups Univariate statistics Multivariate statistics samples mz/RT features
Univariate statistics
Experimental design | Normal distribution (compare means) | Far from normal (compare medians)
Compare two unpaired groups | Unpaired t-test | Mann-Whitney
Compare two paired groups | Paired t-test | Wilcoxon signed-rank
Compare more than two unmatched groups | One-way ANOVA with multiple comparison | Kruskal-Wallis
Compare more than two matched groups | Repeated-measures ANOVA | Friedman
Multiple testing correction, e.g. Benjamini-Hochberg
Vinaixa, Maria, et al. "A guideline to univariate statistical analysis for LC/MS-based untargeted metabolomics-derived data." Metabolites 2.4 (2012): 775-795.
Multivariate statistics: Simultaneously observe multiple characteristics. Some classic multivariate model assumptions are not fulfilled for chemometric / ‘omic data: fewer observations than variables, and correlations between variables. However, various methods can still be used or have been adapted for chemometrics / ‘omics.
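Because a univariate test is run for every feature, multiple testing correction matters. Below is a minimal sketch of the Benjamini-Hochberg adjustment written in plain Python for illustration; in practice an established statistics package would normally be used, and the p-values here are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical p-values from t-tests on four features.
p_values = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(p_values, benjamini_hochberg(p_values)):
    print(f"p = {p:.3f} -> adjusted p = {q:.3f}")
```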
Multivariate data matrix convention mz RT features samples
Principal component analysis (PCA): Addresses the following problems with multivariate datasets: visualisation of data with more than 3 variables is not possible; high correlation between variables makes many statistical methods inapplicable; many variables contain very little information. A good first visual check for any multivariate dataset (outliers). Classic multivariate statistics states that we should have more samples than variables. PCA is unsupervised, so it is less susceptible to this problem (compared to PLS-DA).
PCA (basics): Principal component analysis is a good name! It takes the principal components of the data: the directions where there is the most variance (where the data are most spread out). Simple example with 2 variables: the variance is spread equally across both source variables in figure 1. If we use two new axes (figure 2), then we can explain the majority of the variance in 1 dimension. This reduces the dimensionality of the data, which can be very useful when there are many variables. Eigenvectors and eigenvalues: the eigenvector is the direction of the line; the eigenvalue is the spread of the data along the line (variance). Principal component 1 (PC1) is the eigenvector with the largest amount of variance. The maximum number of PCs is the smaller of the number of rows and the number of columns of the matrix. Figures from http://sqlblog.com/blogs/dejan_sarka/archive/2015/06/02/data-mining-algorithms-principal-component-analysis.aspx
PCA (plots). Scree plot: shows how much variance is explained by the PCs. Scores plot: projections of each sample onto the PCs. Loadings plot: contribution of each metabolite to the PCs. Biplot: combines the scores and loadings to determine which metabolite is “driving the separation”. Data and plots from MetaboAnalyst.ca
Partial least squares discriminant analysis (PLS-DA): Supervised method. Describes the difference between classes of samples (e.g. wild type, mutant). Maximises the covariance between the X variables and the Y variables. Validation and cross-validation steps are required. Use R2 (and Q2) to measure how well the prediction performed. R2: coefficient of determination (how well the regression fits). Q2: the cross-validated equivalent of R2, a similar measure of goodness of prediction. Use PLS-DA with caution: a low number of samples gives a high risk of overfitting, and R2 and Q2 are not ideal for categorical data. Gromski, Piotr S., et al. "A tutorial review: Metabolomics and partial least squares-discriminant analysis–a marriage of convenience or a shotgun wedding." Analytica chimica acta 879 (2015): 10-23.
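The two-variable case can be worked through by hand, because the eigenvalues of a 2 × 2 covariance matrix have a closed form. The sketch below does exactly that in plain Python; it only illustrates the idea and is not how PCA is computed for real feature matrices (where a linear algebra library would be used). The data are invented.

```python
from math import sqrt


def pca_2d(xs, ys):
    """PCA for two variables via the closed-form eigenvalues of the covariance matrix."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sample variances and covariance (the data are mean centered here).
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

    # Eigenvalues of [[var_x, cov_xy], [cov_xy, var_y]].
    half_trace = (var_x + var_y) / 2
    root = sqrt(((var_x - var_y) / 2) ** 2 + cov_xy**2)
    eigenvalues = (half_trace + root, half_trace - root)

    # Eigenvector belonging to the largest eigenvalue (the direction of PC1).
    if cov_xy != 0:
        vx, vy = cov_xy, eigenvalues[0] - var_x
    elif var_x >= var_y:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    length = sqrt(vx * vx + vy * vy)
    pc1 = (vx / length, vy / length)

    explained_by_pc1 = eigenvalues[0] / (var_x + var_y)
    return eigenvalues, pc1, explained_by_pc1


# Two correlated variables: most of the variance lies along one direction.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
eigenvalues, pc1, explained = pca_2d(xs, ys)
print(f"PC1 direction: ({pc1[0]:.3f}, {pc1[1]:.3f})")
print(f"variance explained by PC1: {explained * 100:.1f} %")
```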
Missing value imputation: Important for PCA and PLS-DA. Missing values in metabolomics datasets are observed when: (1) the metabolite is not present in the sample; (2) the metabolite is present in the sample but below the limit of detection (LOD); (3) the metabolite is present in the sample but is missed during processing. A typical first step is to filter out metabolites or samples that have > x % missing values. For case 3 we assume the distribution of missing values is random and can use standard missing value imputation: small value replacement, K-nearest neighbour imputation (KNN), Random Forest imputation (RF). However, cases 1 and 2 should really be treated differently. In practice this is difficult and, following filtering, only 1 type of missing value imputation is typically performed. Centering is always needed for PCA and PLS-DA. See [1] for recommendations for PCA, PLS-DA and univariate statistics.
Normalisation, transformations and scaling
Normalisation: Removes variation in the measured response that is unrelated to the biological differences between samples, e.g. slight differences in preparation and collection for each sample. Use the sum, mean or median as a normalisation factor, or probabilistic quotient normalisation (PQN).
Transformation: Corrects for skewed data and heteroscedasticity (metabolites with large intensities typically have more variation than those with lower intensities). Logarithmic transformations: natural logarithm (nLog), generalised logarithm (gLog). Often a constant value is added to cope with near-zero values.
Scaling: Adjusts for differences in fold change between metabolites which may be caused by large differences in the variation of the measured responses. After scaling, the values are not dependent on the absolute abundance. Autoscaling: every peak is mean centered and divided by the standard deviation of the column. Pareto scaling: every peak is mean centered and divided by the square root of the standard deviation of the column.
Heteroscedasticity: “variability of a variable is unequal across the range of values of a second variable that predicts it”, i.e. does the variability vary as the intensity increases?
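The filtering step and the simplest imputation option can be sketched as follows. The 20% cut-off and the choice of half the minimum observed value as the "small value" are assumptions made for this example; the slide leaves both open.

```python
def missing_fraction(values):
    """Fraction of entries that are missing (represented here as None)."""
    return sum(value is None for value in values) / len(values)


def filter_features(matrix, max_missing=0.2):
    """Drop features (rows) with more than `max_missing` missing values."""
    return {
        feature: values
        for feature, values in matrix.items()
        if missing_fraction(values) <= max_missing
    }


def small_value_replace(values, fraction=0.5):
    """Replace missing entries with a fraction of the smallest observed value."""
    observed = [value for value in values if value is not None]
    fill = fraction * min(observed)
    return [fill if value is None else value for value in values]


# Hypothetical feature matrix: feature name -> intensities across five samples.
matrix = {
    "M175T120": [1000, None, None, 900, 1100],
    "M197T120": [200, 250, 180, 220, None],
}
kept = filter_features(matrix)
imputed = {name: small_value_replace(values) for name, values in kept.items()}
print(imputed)
```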
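The sum normalisation and the two scaling definitions translate directly into code. This is a plain-Python sketch for a single sample (normalisation) or a single feature column (scaling); using the sample standard deviation is an assumption.

```python
from math import sqrt
from statistics import mean, stdev


def sum_normalise(sample):
    """Divide every intensity in one sample by the sample's total intensity."""
    total = sum(sample)
    return [value / total for value in sample]


def autoscale(column):
    """Mean center a feature column and divide by its standard deviation."""
    centre, spread = mean(column), stdev(column)
    return [(value - centre) / spread for value in column]


def pareto_scale(column):
    """Mean center a feature column and divide by the square root of its standard deviation."""
    centre, spread = mean(column), stdev(column)
    return [(value - centre) / sqrt(spread) for value in column]


# Hypothetical intensities of one feature across three samples.
column = [2.0, 4.0, 6.0]
print("autoscaled:", autoscale(column))
print("Pareto scaled:", pareto_scale(column))
print("sum normalised sample:", sum_normalise([1.0, 1.0, 2.0]))
```

Pareto scaling shrinks large-variance features less aggressively than autoscaling, so abundant metabolites keep somewhat more influence on the model.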
Try it yourself! Load in csv file from previous R session into MetaboAnalyst
Outline Metabolomics data processing and analysis (focus on mass spectrometry) Open source vs proprietary Spectral processing Data analysis Annotation
Metabolomics untargeted workflow
Level of annotation
Level | Name
1 | Identified metabolites *
2 | Putatively annotated compounds
3 | Putatively characterised compound classes
4 | Unknown compounds
* 2 or more orthogonal properties of a chemical standard compared to experimental data using the same analytical methods
Often the first step is to search compound libraries.
Salek, R. M., Steinbeck, C., Viant, M. R., Goodacre, R., & Dunn, W. B. (2013). The role of reporting standards for metabolite annotation and identification in metabolomic studies. GigaScience, 2(1), 13.
Useful metabolomic databases for annotation Pathway databases KEGG MetaCyc Compound databases PubChem ChEBI ChemSpider Spectral databases HMDB MassBank Metlin mzCloud Download spectral libraries http://mona.fiehnlab.ucdavis.edu/downloads
Rumsfeld Annotation Quadrant
 | Expected by analyst | Unexpected by analyst
Identified by library | Known knowns: expected and found | Unknown knowns: not expected but found
Not identified by library | Known unknowns: expected but not found | Unknown unknowns: not expected and not found
Neutral mass lookup: Match the neutral exact mass against the exact masses found in databases of known compounds. Biologically relevant: KEGG, HMDB, MetaCyc, ChEBI. All compounds: PubChem, ChemSpider. Often many compounds will match 1 neutral mass, so it is difficult to determine which compound to choose for the annotation. Software: MI-pack, PUTMEDID, Metabosearch.
MS/MS and MSn spectral search: Fragmentation spectra can be more reliable than a simple neutral mass lookup from MS1 spectra. Fragmentation libraries can be from real experimental data or generated in silico. The predictable fragmentation patterns of lipids allow for in silico libraries (e.g. LipidBlast). Search collected experimental spectra against library spectra using spectral matching methods. LipidBlast: http://fiehnlab.ucdavis.edu/projects/LipidBlast
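A neutral mass lookup is essentially a tolerance search. The sketch below uses a tiny toy database (three hexose isomers sharing one exact mass, plus arginine) to show why one mass often returns several candidates. The 5 ppm tolerance and the function names are assumptions; real tools query KEGG, HMDB, PubChem and similar resources.

```python
# Toy database of neutral monoisotopic masses (illustrative subset only).
TOY_DATABASE = {
    "arginine": 174.11168,
    "glucose": 180.06339,
    "fructose": 180.06339,
    "galactose": 180.06339,
}


def ppm_window(mass, ppm):
    """Lower and upper mass bounds for a given ppm tolerance."""
    delta = mass * ppm / 1e6
    return mass - delta, mass + delta


def lookup(neutral_mass, database=TOY_DATABASE, ppm=5.0):
    """Names of all database compounds within the ppm window of the query mass."""
    low, high = ppm_window(neutral_mass, ppm)
    return sorted(name for name, mass in database.items() if low <= mass <= high)


print(lookup(174.1117))   # a single candidate
print(lookup(180.0634))   # several isomers share this exact mass
```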
Spectral matching methods: Compare query spectra to library spectra. Vectors can be m/z, intensity or weighted intensity (most common). Weighted intensity: $w = [\text{peak intensity}]^{m} \times [\text{mass } (m/z)]^{n}$, where $m$ and $n$ are weight factors (different recommendations in the literature). Calculate the weighted intensity vector for the query ($\boldsymbol{x}$) and the library ($\boldsymbol{y}$). Example: intensity = (100, 50, 10), m/z = (100, 200, 300), $m = 0.5$, $n = 2$: weighted = $(100^{0.5} \times 100^{2},\ 50^{0.5} \times 200^{2},\ 10^{0.5} \times 300^{2})$ = (100000, 282842.7, 284605.0). Horai, Hisayuki, Masanori Arita, and Takaaki Nishioka. "Comparison of ESI-MS spectra in MassBank database." 2008 International Conference on BioMedical Engineering and Informatics. Vol. 2. IEEE, 2008.
Spectral matching methods. Count peaks: the simplest method; count matching peaks between the query ($\boldsymbol{x}$) and library ($\boldsymbol{y}$) vectors. Dot product: $dp = \boldsymbol{x} \cdot \boldsymbol{y} = x_1 y_1 + x_2 y_2 + \dots + x_n y_n$, where $n$ is the length of the vector. Dot product cosine (cosine similarity): $dpc = \dfrac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$, which bounds the result between 0 and 1 (when non-negative values are used), 1 being a perfect match. Example: $\boldsymbol{x} = (2, 4, 3)$, $\boldsymbol{y} = (2, 1, 1)$: $dp = (2 \times 2) + (4 \times 1) + (3 \times 1) = 11$; $dpc = 11 / (\sqrt{2^2 + 4^2 + 3^2} \times \sqrt{2^2 + 1^2 + 1^2}) = 0.834$. Many other methods: PBM, pMatch, machine learning techniques. Horai, Hisayuki, Masanori Arita, and Takaaki Nishioka. "Comparison of ESI-MS spectra in MassBank database." 2008 International Conference on BioMedical Engineering and Informatics. Vol. 2. IEEE, 2008.
Spectral matching (aligning): But first the peaks of the query and library spectra need to be aligned! This is often not discussed in spectral matching methods but is very important. A very good explanation of the whole process is given here (Waters and MassBank approach): http://www.nonlinear.com/progenesis/qi/v2.0/faq/database-fragmentation-algorithm.aspx
Spectral matching limitations: Only a small number of metabolites are covered in spectral libraries. Different instrument types will produce different spectra. Different instrument parameters will affect the spectra, e.g. collision-induced dissociation (CID) vs higher-energy collisional dissociation (HCD). The precursor ion purity: what else are you fragmenting?
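The weighted intensity, dot product and cosine calculations can be checked with a short script. This is a minimal sketch in plain Python using the example vectors from the slides; the function names are illustrative, and the default weights (m = 0.5, n = 2) are simply the ones used in the worked example.

```python
from math import sqrt


def weighted_intensity(intensities, mzs, m=0.5, n=2):
    """Weighted intensity: intensity**m * mz**n for every peak."""
    return [intensity**m * mz**n for intensity, mz in zip(intensities, mzs)]


def dot_product(x, y):
    """Sum of the element-wise products of two aligned vectors."""
    return sum(a * b for a, b in zip(x, y))


def cosine_similarity(x, y):
    """Dot product divided by the product of the two vector lengths."""
    return dot_product(x, y) / (sqrt(dot_product(x, x)) * sqrt(dot_product(y, y)))


# Worked examples from the slides.
print(weighted_intensity((100, 50, 10), (100, 200, 300)))
x, y = (2, 4, 3), (2, 1, 1)
print("dot product:", dot_product(x, y))
print(f"cosine similarity: {cosine_similarity(x, y):.3f}")
```

In practice the query and library vectors passed to `cosine_similarity` would be the weighted intensities of aligned peaks rather than raw numbers.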
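One simple way to align two spectra before scoring is to pair peaks whose m/z values agree within a tolerance and to pad unmatched peaks with zeros. The sketch below does this greedily; it is an assumption-laden illustration (fixed 0.01 Da tolerance, nearest-peak matching) and not the Waters or MassBank procedure described at the link above.

```python
def align_spectra(query, library, tolerance=0.01):
    """Align two peak lists of (mz, intensity) into equal-length intensity vectors.

    Each query peak is paired with the closest unused library peak within
    `tolerance`; peaks without a partner are paired with zero intensity.
    """
    unused = list(library)
    query_vector, library_vector = [], []
    for mz, intensity in query:
        best = None
        for candidate in unused:
            distance = abs(candidate[0] - mz)
            if distance <= tolerance and (best is None or distance < abs(best[0] - mz)):
                best = candidate
        query_vector.append(intensity)
        if best is None:
            library_vector.append(0.0)
        else:
            library_vector.append(best[1])
            unused.remove(best)
    # Library peaks that were never matched still count against the score.
    for _, intensity in unused:
        query_vector.append(0.0)
        library_vector.append(intensity)
    return query_vector, library_vector


# Hypothetical fragmentation spectra as (m/z, intensity) pairs.
query = [(100.000, 50), (150.000, 100)]
library = [(100.005, 40), (200.000, 10)]
print(align_spectra(query, library))
```

The two returned vectors have the same length and peak order, so they can be passed straight to a dot product or cosine score.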
Typical MS/MS DDA run [Figure: MS1 spectrum (scan 1) followed by MS2 spectrum (scan 2)]
msPurity. Simple metric to assess the contribution of the selected precursor to what is isolated for fragmentation. 1 = most pure (all contribution from precursor); 0 = least pure (no contribution from precursor). Interpolates the metric.
Precursor purity. Isolation window around the peak: for m/z of 100.00 +/- 0.5 Da, the window is 99.5 to 100.5. Target I = 1224854; Total I = 2187218; Purity (target/total) = 0.56. The resulting MS/MS spectra are referred to as chimeric (the Chimera, a creature in Greek mythology, was composed of parts of more than one animal). [Figure: intensity against m/z across the isolation window, 99.5 to 100.5]
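The target/total calculation can be sketched in base R. This is a simplified illustration and not the msPurity implementation (which also handles interpolation, noise and isotopes); the function name `precursor_purity` is hypothetical, and the contaminating peaks are invented so that the window total matches the slide's value of 2187218.

```r
# Precursor purity: intensity of the target peak divided by the total
# intensity inside the isolation window. Simplified sketch only.
precursor_purity <- function(mz, intensity, target_mz, half_width = 0.5) {
  in_window <- abs(mz - target_mz) <= half_width
  if (!any(in_window)) return(NA_real_)
  idx <- which(in_window)
  # The target is taken as the peak in the window closest to the precursor m/z
  target <- idx[which.min(abs(mz[idx] - target_mz))]
  intensity[target] / sum(intensity[in_window])
}

# Hypothetical MS1 peaks; only 99.8, 100.0 and 100.3 fall inside 99.5 to 100.5
mz        <- c(99.2, 99.8, 100.0, 100.3, 101.1)
intensity <- c(500000, 400000, 1224854, 562364, 300000)

purity <- precursor_purity(mz, intensity, target_mz = 100.0)
print(round(purity, 2))
```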
Interpolate the metric
Isolation efficiency. Thermo Scientific Q Exactive™ Focus, 0.5 Da window. A: Isolation efficiency profile of ions 195.0875, 524.2649 and 922.0086 m/z. B: A simple linear model using B-spline polynomials to predict isolation efficiency based on isolation window position (negative values have been zeroed; adjusted R²: 0.9812, F: 210, p-value < 0.001).
Do we have to rely on fragmentation spectral libraries? (nope) MetFrag: combinatorial fragmenter using various heuristics to speed up the process; for each suspected compound it produces in silico fragmentation spectra and matches them to the experimental fragmentation spectra. CSI:FingerID: machine learning techniques; determines a “fingerprint” of the experimental spectra and searches the fingerprint against molecular structure databases. MSnPy: creates fragmentation tree networks to explain the spectra; determines possible molecular formulas based on the trees (to be published). Many more: SIRIUS, FT-Blast.
Try it yourself! Neutral exact mass lookup (1). PubChem (https://pubchem.ncbi.nlm.nih.gov/): search for neutral exact mass. Use advanced search. Search for neutral exact mass 272.06847. How many hits? Search for a range around the neutral mass: 272.0681:272.0687[EXMASS]. Choose the first hit. What is the molecular weight? Why is it different from the exact mass? See REST access to use programmatically: https://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html
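The difference between exact (monoisotopic) mass and molecular weight can be illustrated with a short calculation. The formula C15H12O5 is an assumption chosen only because its monoisotopic mass matches the search value above; it is not taken from the slides. The exact mass sums the masses of the most abundant isotope of each element, while the molecular weight sums the isotope-averaged atomic weights.

```r
# Monoisotopic masses (most abundant isotope) and average atomic weights
monoisotopic <- c(C = 12.000000, H = 1.00782503, O = 15.99491462)
average      <- c(C = 12.011,    H = 1.008,      O = 15.999)

# Illustrative formula (assumption): C15H12O5
formula <- c(C = 15, H = 12, O = 5)

exact_mass <- sum(formula * monoisotopic[names(formula)])
mol_weight <- sum(formula * average[names(formula)])

print(round(exact_mass, 5))
print(round(mol_weight, 3))
```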
Try it yourself! Neutral exact mass lookup (2). MetaboSearch: programmatic access to multiple libraries of compounds for exact mass lookup. Download: http://omics.georgetown.edu/metabosearch.html Use your grouped_peaklist.csv from the previous activity. Annotate the peak list using the standard settings.
Try it yourself! Spectral matching dot product cosine [1]. Query spectrum: mz = [200.43, 100.32, 98.5], intensity = [1000, 100, 200]. Library spectrum: mz = [200.43, 160.32, 98.5], intensity = [1000, 20, 10]. Calculate weighted vectors for the query and library spectra.
Try it yourself! Spectral matching dot product cosine [2]. Calculate the dot product cosine of the two weighted vectors: $dpc = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}} \, \sqrt{\sum_{i=1}^{n} y_i^{2}}}$. What other similarity measure could we use?
Try it yourself! Spectral matching with MassBank: http://www.massbank.jp/SearchPage.html Use the file saved as spectra_for_massbank.txt, which consists of one column of mz and one of intensity. What is the best hit? What is its match score?
Try it yourself! MetFrag
Try it yourself! msPurity: Calculate precursor purity

    library(msPurity)
    library(xcms)

    ######################################
    # Calculate purity of MS/MS spectra
    ######################################
    msPurityDataPth <- system.file("extdata", "lcms", "mzML", package = "msPurityData")
    msmsPths <- list.files(msPurityDataPth, full.names = TRUE, pattern = "MSMS")
    msPths <- list.files(msPurityDataPth, full.names = TRUE, pattern = "LCMS_")

    pa <- purityA(msmsPths,
                  mostIntense = TRUE,   # use the most intense peak for precursor
                  interpol = "linear",  # linear interpolation
                  iwNorm = TRUE,        # use default isolation window normalisation
                  ilim = 0.05,          # remove noise from calculation
                  isotopes = TRUE)      # remove isotopes from calculation
    print(head(pa@puritydf))

Try it yourself! msPurity: Link MS/MS to XCMS feature

    ######################################
    # Link MS/MS spectra to XCMS feature
    ######################################
    xset <- xcmsSet(msmsPths)
    xset <- group(xset)
    xset <- retcor(xset)

    # Link XCMS feature to MS/MS
    pa <- frag4feature(pa, xset)
    print(head(pa@grped_df))
    print(pa@grped_ms2[2:3])

Note: this is also possible with the xcmsFragments function from the xcms package.
Metabolomics untargeted workflow
Metabolite pathway analysis (http://impala.molgen.mpg.de/). Over-representation analysis (hypergeometric test). Input: list of differentially observed metabolites and list of all metabolites measured (and annotated). Output: p-value (and q-value) of over-represented pathways. Enrichment analysis. Input: list of every metabolite measured (and annotated) with an associated value, e.g. log fold change, or two values (one for each phenotype).
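The hypergeometric test behind over-representation analysis can be sketched in base R. The counts and pathway names below are invented for illustration, `ora_pvalue` is a hypothetical helper, and the Benjamini-Hochberg adjustment is used here as one common way to obtain q-values; IMPaLA's own implementation may differ.

```r
# Over-representation p-value: probability of seeing at least k pathway
# members among n_selected differential metabolites, when K of the N
# measured (background) metabolites belong to the pathway.
ora_pvalue <- function(k, n_selected, K, N) {
  phyper(k - 1, K, N - K, n_selected, lower.tail = FALSE)
}

N <- 200          # all metabolites measured and annotated (background list)
n_selected <- 30  # differentially observed metabolites

# Hypothetical pathways: K = members in the background, k = members selected
pathways <- data.frame(
  pathway = c("pathway_A", "pathway_B", "pathway_C"),
  K = c(20, 15, 40),
  k = c(8, 2, 6)
)

pathways$p <- ora_pvalue(pathways$k, n_selected, pathways$K, N)
pathways$q <- p.adjust(pathways$p, method = "BH")
print(pathways)
```

The choice of background list (`N`) matters: using every metabolite in a database rather than those actually measured changes the p-values.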
Outline General introduction to metabolomics 20 - 25 mins Data processing and analysis used in Metabolomics (with a focus on mass spectrometry) 30 mins Data integration 20 mins
Outline Type of study Types of data integration Types of statistical data integration Genomic scale reconstructions
Data integration review paper The following slides rely heavily on the review paper: Cavill, Rachel, Danyel Jennen, Jos Kleinjans, and Jacob Jan Briedé. "Transcriptomic and metabolomic data integration." Briefings in bioinformatics 17, no. 5 (2016): 891-901.
Type of multi-omic study
Repeated study. Perform a different ‘omics study on separately prepared representative samples at a different time/place. Advantages: simple; potentially easier for multiple laboratories; measurements considered statistically different. Negatives: batch effects (difficult or impossible to handle over very different technologies and experiments).
Split sample study. Samples originate from the same biological source material, e.g. tissue is homogenised and half goes to metabolomics and half goes to transcriptomics. Often the ideal situation. Advantages: limits batch effects between ‘omics studies (note: will not remove within-‘omic batch effects); looking at the same “thing”. Negatives: feasibility; sample volume; the preparation procedure might not allow RNA and metabolites to be extracted at the same time.
Source-match study. Use different fractions of the biological system for different ‘omics experiments, e.g. blood and urine for metabolomics and tissue for RNA analysis. Advantages: limits batch effects between studies; different sources lend themselves better to different techniques. Negatives: additional consideration required for analysis.
Replicate-matched study. Perform a different ‘omics study on separately prepared representative samples at the same time/place. Advantages: limits batch effects between studies; can be used if there is insufficient sample to do a split-sample study. Negatives: not looking at the same replicate (adds unwanted variation).
Type of data integration
Types of statistical data integration
Correlation-based integration. Standard correlation coefficients: Pearson’s (parametric), Spearman’s (non-parametric). Goodman and Kruskal gamma test: only takes into account the up/down regulation of each metabolite/gene. Partial correlations: evaluate those correlations that are independent of the other co-linear measurements, e.g. what is the independent correlation of gene A and metabolite B, given that they are both correlated to gene C?
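The gene A / metabolite B / gene C question can be illustrated on simulated data in base R. Everything here is an assumption made for the sketch: the data are random, A and B are both driven by C, and the partial correlation is computed by correlating the residuals after regressing each variable on C (one standard way of obtaining it).

```r
set.seed(1)
n <- 200

# Simulated data (assumption): gene A and metabolite B both depend on gene C
gene_c  <- rnorm(n)
gene_a  <- gene_c + rnorm(n, sd = 0.5)
metab_b <- gene_c + rnorm(n, sd = 0.5)

# Standard correlation coefficients
pearson  <- cor(gene_a, metab_b, method = "pearson")
spearman <- cor(gene_a, metab_b, method = "spearman")

# Partial correlation of A and B given C:
# correlate what is left of A and B after removing the linear effect of C
res_a <- resid(lm(gene_a ~ gene_c))
res_b <- resid(lm(metab_b ~ gene_c))
partial <- cor(res_a, res_b)

print(round(c(pearson = pearson, spearman = spearman, partial = partial), 3))
```

In this simulation the raw correlation between A and B is strong, but it largely disappears once C is accounted for, which is the situation partial correlations are meant to reveal.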
Correlation-based integration: problems. Biochemical pathways often don’t correlate with the expected corresponding enzymes [1]. Changes in the metabolome and the transcriptome will not be simultaneous; align the data through time first [2]? [1] Ter Kuile BH, Westerhoff HV. Transcriptome meets metabolome: hierarchical and metabolic regulation of the glycolytic pathway. [2] Cavill R, Kleinjans JCS, Briedé JJ. Dynamic time warping for omics. PLoS One 2013;8:e71823.
Dataset concatenation-based integration. Simply concatenate the datasets together. Use fold changes for each ‘omic dataset. Perform analysis (typically multivariate) on the new dataset. Problems: dominated by one dataset (e.g. 1000 transcripts vs 100 metabolites); typically get separate clusters for each ‘omic dataset, i.e. you don’t get as many clusters that span the ‘omics datasets. Recent example of concatenation-based integration: Huang, Susie SY, et al. "A multi-omic approach to elucidate low-dose effects of xenobiotics in zebrafish (Danio rerio) larvae." Aquatic Toxicology 182 (2017): 102-112.
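The "dominated by one dataset" problem can be made concrete with a small base R sketch on random data. The matrices are simulated, and the block scaling shown (dividing each block by the square root of its number of variables) is one commonly used mitigation that is not taken from the slides; it is included only to show the contrast.

```r
set.seed(1)
n_samples <- 12

# Simulated fold-change matrices (assumption): 1000 transcripts, 100 metabolites
transcripts <- matrix(rnorm(n_samples * 1000), nrow = n_samples)
metabolites <- matrix(rnorm(n_samples * 100),  nrow = n_samples)
is_transcript <- rep(c(TRUE, FALSE), times = c(1000, 100))

# Autoscale each variable, then concatenate column-wise
combined <- cbind(scale(transcripts), scale(metabolites))
v <- apply(combined, 2, var)
share_transcripts <- sum(v[is_transcript]) / sum(v)

# Block scaling: give each block the same total variance before concatenating
block_scaled <- cbind(scale(transcripts) / sqrt(1000),
                      scale(metabolites) / sqrt(100))
v_scaled <- apply(block_scaled, 2, var)
share_transcripts_scaled <- sum(v_scaled[is_transcript]) / sum(v_scaled)

# Multivariate analysis on the concatenated data (PCA as an example)
pca <- prcomp(block_scaled, center = FALSE)

print(round(c(unscaled = share_transcripts, block_scaled = share_transcripts_scaled), 3))
```

Without block scaling the transcripts carry about 91% of the total variance simply because there are ten times more of them.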
Multivariate-based integration Perform multivariate analysis but do not concatenate Many methods and can get very complicated However, possibly the most appropriate route for analysis
Multivariate-based integration: O2PLS Two-way Orthogonal partial least squares Describe the relationships between two (or more) datasets: Map the connected variation between datasets Find unique variation within a dataset One of many methods for multivariate-based integration [1] [1] Bylesjö, Max, et al. "Data integration in plant biology: the O2PLS method for combined modeling of transcript and metabolite data." The Plant Journal 52.6 (2007): 1181-1191. [2] El Bouhaddani, Said, et al. "Evaluation of O2PLS in Omics data integration." BMC bioinformatics. Vol. 17. No. Suppl 2. BioMed Central Ltd, 2016. [3] MKS data analytics solutions tutorial https://www.youtube.com/watch?v=Utj1OZ4W0hc
Multivariate-based integration: O2PLS example. Projection of the individuals (scores plot): 30 mg (light grey), 180 mg (dark grey), 1100 mg (black), blank (white). Loadings plot of metabolite (grey dots) and transcript (black dots) levels. Magnified view of the most strongly upregulated (Ellipse 1) and downregulated (Ellipse 2) variables. Eveillard, Alexandre, et al. "Identification of potential mechanisms of toxicity after di-(2-ethylhexyl)-phthalate (DEHP) adult exposure in the liver using a systems biology approach." Toxicology and applied pharmacology 236.3 (2009): 282-292.
Pathway-based integration. Pathway enrichment / over-representation analysis. Consensus of the phenotype between datasets. Considerations: how to combine the p-values? What ‘background lists’ to use?
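One possible answer to the p-value question is Fisher's method, sketched below in base R. This is an illustration of one option and not necessarily what any particular tool does; it assumes the metabolite-level and gene-level p-values for a pathway are independent, and the function name `fisher_combine` and the example p-values are hypothetical.

```r
# Fisher's method: -2 * sum(log(p)) follows a chi-squared distribution
# with 2k degrees of freedom when the k p-values are independent.
fisher_combine <- function(p) {
  stopifnot(all(p > 0 & p <= 1))
  stat <- -2 * sum(log(p))
  pchisq(stat, df = 2 * length(p), lower.tail = FALSE)
}

# Hypothetical p-values for one pathway: metabolite-level and gene-level
p_metabolite <- 0.01
p_gene <- 0.04

p_combined <- fisher_combine(c(p_metabolite, p_gene))
print(signif(p_combined, 3))
```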
Pathway-based integration [Figure: significant pathways for metabolites, for genes, and combined]
Try it yourself! Pathway-based integration. IMPaLA: http://impala.molgen.mpg.de/ MetaboAnalyst: http://www.metaboanalyst.ca/faces/Secure/upload/PathUploadView.xhtml
Type of data integration
Genome-Wide Metabolic Reconstruction A structured representation of all biochemical metabolic reactions that take place within an organism. Bottom up approach to modelling metabolic capabilities of an organism Takes the annotated genome and finds enzyme encoding genes Infer metabolic pathways from the enzymes This is known as a draft reconstruction Further manual curation required to achieve high accuracy
Available software. Pathway Tools (http://brg.ai.sri.com/ptools/): used for BioCyc; requires reasonably good annotation of the genome. METRONOME pipeline (METabolic Reconstruction Of New genOMe sEquences): Python package for draft reconstruction where there is minimal annotation.
That’s it! Thank you for listening. Acknowledgments: Ralf Weber, Martin Jones, James Bradbury, Warwick Dunn
PCA further reading
https://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf
http://www.lauradhamilton.com/introduction-to-principal-component-analysis-pca
https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/
https://cran.r-project.org/web/packages/chemometrics/vignettes/chemometrics-vignette.pdf
http://stats.stackexchange.com/questions/222/what-are-principal-component-scores
http://pubs.rsc.org/en/content/chapterhtml/2012/bk9781849731638-00001?isbn=978-1-84973-163-8#sect694
van den Berg, R. A., Hoefsloot, H. C., Westerhuis, J. A., Smilde, A. K., & van der Werf, M. J. (2006). Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC genomics, 7(1), 142.
http://pubs.rsc.org/en/content/articlehtml/2014/ay/c3ay41907j