Introduction to metabolomics and data integration

Tom Lawson

Experience
- Hands up who plans on doing metabolomics work?
- Hands up if you have already performed metabolomics analysis/experiments?
- Hands up if you have R experience?

birmingham.ac.uk/bmtc @BirmMetTrain bmtc@contacts.bham.ac.uk

Outline
- General introduction to metabolomics
- Data processing and analysis used in metabolomics (with a focus on mass spectrometry): 45-50 mins
- Data integration

Outline: Introduction to metabolomics
- General overview
- Sample preparation
- Measurement technologies
- Separation technologies
- Metabolomic data repositories

Metabolomics
- Study of all low molecular weight compounds (metabolites) in a biological system
- Metabolites have crucial functions: signalling, stimulation effects on enzymes, fuel
- Studying the metabolome provides a system-wide understanding of biological mechanisms and pathways
- Genome -> transcriptome -> proteome -> metabolome
- Types of metabolites: peptides, oligonucleotides, sugars, nucleosides, organic acids, ketones, aldehydes, amines, amino acids, lipids, steroids, alkaloids, foods, food additives, toxins, pollutants, drugs and drug metabolites

Where do the metabolites come from?
Sources feeding the metabolite pool in tissues and biofluids:
- Metabolites synthesized from small molecule precursors
- Exogenous compounds coming from the diet, including those common with human metabolites
- Pharmaceuticals, including antibiotics, that alter and are altered by the microbiome
- Metabolites arising from commensal bacteria in the human gut (and other microbiomes)
- Environmental chemicals and toxins
- Metabolites specific to invasive, infecting microorganisms
Taken from Stephen Barnes' slides, UAB metabolomics workshop 2013

Metabolomic experiment typical goals
- Differentiate groups: can we see metabolite differences between sample groups (e.g. wild type & mutant)?
- Quantification: can we measure metabolite differences?
- Identification: can we identify the metabolites that have changed?
- Systems biology integration: how do these metabolites interact with DNA, RNA and proteins?

Untargeted vs targeted metabolomics
Untargeted experiment
- Measure unexpected changes in known (and unknown) metabolites
- Global metabolite profile with relative quantification
- >1000s of metabolites measured
- Small number identified
- No chemical standards required
Targeted experiment
- Test hypothesis of expected changes in known metabolites
- Quantification of specific metabolites
- Approx. 20 metabolites measured
- Identify all with confidence
- Requires chemical standards
Literature, untargeted experiments and other omics generate the hypotheses for the metabolites measured in a targeted experiment.

Metabolomics untargeted workflow
Alonso, Arnald, Sara Marsal, and Antonio Julià. "Analytical methods in untargeted metabolomics: state of the art in 2015." Frontiers in bioengineering and biotechnology 3 (2015): 23.

1st part of training: see the untargeted workflow of Alonso et al. (2015) above.

Sample preparation: Metabolite pre-extraction
Quenching (stopping unwanted biochemical reactions)
- Inhibits enzymatic activity
- How? Cold methanol (< −40°C) or liquid nitrogen, which causes a sudden temperature shock
Drying
- Inhibits enzymatic activity and microbial growth
- Water can affect the “solvation power of the extraction solvents”
- How? Blast nitrogen onto the samples
Mushtaq, Mian Yahya, et al. "Extraction for metabolomics: access to the metabolome." Phytochemical analysis 25.4 (2014).

Sample preparation: Metabolite extraction
- The metabolome can be divided into metabolites that are water-soluble and those that are not
- Extraction methods depend on the sample type and the metabolites under investigation
- For polar metabolites (e.g. sugars, amino acids, alkaloids): 2.5 methanol : 1 water
- For non-polar metabolites (e.g. lipids): 1 methanol : 1 chloroform : 0.9 water [1]
- The second method is biphasic (polar and non-polar fractions are separated), so it can actually be used for both polar and non-polar extraction
[1] Bligh and Dyer 1959

Measurement Technologies
- Nuclear Magnetic Resonance (NMR) spectroscopy
- Mass spectrometry (MS)
- Imaging MS

NMR - Background 1948, Varian, founded in San Francisco, measured the gyro-magnetic ratio of certain atoms. This effect later became known as nuclear magnetic resonance

NMR - Background
- The atomic nucleus is a spinning charged particle, so it generates a magnetic field
- Without an external field, the spin orientation is random
- When an external magnetic field is applied, the nuclei align with or against the field
- A radio frequency pulse corresponding to a specific set of nuclei causes a flip from the alpha to the beta spin state
- Relaxation of the nuclei to their original spin state emits characteristic electromagnetic signals
- These are captured as a function of signal intensity vs time
- The time-domain signal is converted into the frequency domain through Fourier transformation (FT)
- “The location, shape, and area of the signals in each spectrum provide spatial and connectivity information about the nuclei in the sample” [2]

NMR - metabolomics
- One-dimensional (1D) 1H NMR is the most widely used NMR approach in metabolomics
- 13C, 15N, and 31P can also be used
- Two-dimensional (2D) NMR methods offer unambiguous identification of metabolites
NMR benefits:
- Very reproducible
- Non-destructive of sample
- Typically no separation required
[1] Hao, Jie, et al. "Bayesian deconvolution and quantification of metabolites in complex 1D NMR spectra using BATMAN." Nature protocols 9.6 (2014).

NMR - popular manufacturers
- Bruker: AVANCE III HD
- Bruker: Fourier 300 HD
- JEOL: JNM-ECZS

MS - Background
Joseph John Thomson (1856 – 1940), amongst other things, created the first mass spectrometer (then called a parabola spectrograph) for the determination of mass-to-charge ratios of ions at the University of Cambridge.
Image credits: [1] GWS - The Great War: The Standard History of the All Europe Conflict (volume four), edited by H. W. Wilson and J. A. Hammerton. [2] Cambridge station.

MS – Simple schematic
Simplistically, a mass spectrometer has 3 components (ion source, m/z analyser and detector).
Schematic: inlet -> ion source -> m/z analyser -> detector -> data system

MS – ion source
- A mass spectrometer can only detect charged species, so we need to generate gas-phase ions with an ion source
- Upon ionisation of a sample molecule (M), a molecular ion (M+ or M-) is formed
Common types:
- Atmospheric pressure chemical ionisation (APCI)
- Electron ionisation (EI)
- Electrospray ionisation (ESI)
- Matrix-assisted laser desorption/ionisation (MALDI)
EI is used for GC-MS; ESI and APCI are used for LC-MS.

MS – Mass analyser
- Separates the ions based on their mass-to-charge (m/z) ratio
- Uses either magnetic or electric fields
Common types:
- Triple quadrupole (QQQ)
- Time-of-flight (TOF)
- Quadrupole TOF (qTOF)
- Orbitrap
- Fourier Transform Ion Cyclotron Resonance (FT-ICR)

MS - Detector
Detects the ion beams coming from the m/z analyser.
Types:
- Electron multipliers
- Faraday cups
- Photographic plates

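Example MS spectra: arginine
Measure the ions: the neutral exact mass of arginine gives the expected exact mass of each adduct, [M+H]+ and [M+Na]+, which is compared with the observed m/z in the spectrum.
(Figure: arginine spectrum with expected and observed [M+H]+ and [M+Na]+ peaks; data originally from MassBank, LC-ESI-QTOF)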
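The expected adduct masses can be recomputed from the molecular formula. The following is a minimal sketch, assuming arginine's formula C6H14N4O2 and standard monoisotopic atomic masses; none of these numbers are taken from the slide, and `adduct_mz` is a hypothetical helper.

```r
# Monoisotopic atomic masses (standard values, not taken from the slide)
atom_mass <- c(C = 12.000000, H = 1.00782503, N = 14.00307401, O = 15.99491462)

# Arginine, assumed formula C6H14N4O2
arginine <- c(C = 6, H = 14, N = 4, O = 2)
neutral_mass <- sum(atom_mass[names(arginine)] * arginine)

# Mass added by each singly charged positive adduct (atom mass minus one electron)
electron <- 0.00054858
adduct_shift <- c("[M+H]+"  = 1.00782503 - electron,
                  "[M+Na]+" = 22.98976928 - electron)

# Hypothetical helper: expected m/z for singly charged adducts
adduct_mz <- function(mass, shift) mass + shift

expected_mz <- adduct_mz(neutral_mass, adduct_shift)
print(round(neutral_mass, 4))
print(round(expected_mz, 4))
```

Comparing these expected values with the observed peaks gives the mass error, usually reported in ppm.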

Isotopes
Atoms with the same number of protons but a different number of neutrons.
- 12C and 13C differ in atomic mass and natural abundance (13C: 1.07 %)
(Figure: C6H12O6, theoretical isotopic distribution and mass spectrum)
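The carbon contribution to an isotopic distribution can be approximated with a binomial model. This is a rough sketch for C6H12O6 using the 1.07 % abundance of 13C quoted above; it deliberately ignores hydrogen and oxygen isotopes, so it is not a full isotope pattern calculator.

```r
# Probability that k of the 6 carbons in C6H12O6 are 13C (binomial model).
# Simplification: hydrogen and oxygen isotopes are ignored.
n_carbon <- 6
p_13c <- 0.0107                      # natural abundance of 13C from the slide

k <- 0:n_carbon
prob <- dbinom(k, size = n_carbon, prob = p_13c)

# Express each isotopologue relative to the monoisotopic (all-12C) peak
relative_intensity <- 100 * prob / prob[1]
names(relative_intensity) <- paste0("M+", k)
print(round(relative_intensity[1:3], 2))
```

The M+1 peak grows roughly in proportion to the number of carbons, which is why isotope ratios help constrain the molecular formula.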

Tandem MS (MS/MS) overview
- MS1: samples are ionised, separated by m/z and then detected
- Isolation: isolate an ion of interest
- Fragmentation: create fragment ions by collision-induced dissociation (CID) or other methods
- MS2: the fragment ions are separated by m/z and then detected
Tandem in space
- Quadrupole
- Time of flight (TOF)
Tandem in time
- Ion trap
- Allows MSn
To achieve detailed fragmentation with soft ionization techniques, such as electrospray ionization (ESI) in conjunction with liquid chromatography (LC), a collision step is normally required, which can be attained through collision-induced dissociation (CID) or higher-energy collisional dissociation (HCD). The term tandem mass spectrometry (MS/MS or MS2) is used when a single collision step is used, but fragment ions can be isolated for further collision to provide MS3 spectra or more.

MS – popular manufacturers
Manufacturers: Waters, Thermo Fisher, Agilent, AB Sciex, Bruker
Example instruments:
- Thermo Fisher Q Exactive (quadrupole orbitrap)
- Thermo Fisher Orbitrap Elite (ion trap orbitrap)
- Waters Xevo G2-XS QTof (quadrupole time-of-flight)

MS – a lot of choice! This is just the Thermo Fisher Orbitrap systems…

NMR vs MS
- Reproducibility: NMR very good; MS fair
- Sensitivity: NMR less sensitive (≈100 metabolites); MS very sensitive (>1000 metabolites)
- Sample recovery: NMR non-destructive; MS destructive
- Sample preparation: NMR minimal (no separation required); MS depends (typically separation is performed, but not always)
- Sample used: NMR typically µL, but microcoil probes can be 5-10 µL; MS very low µL
- Popularity (number of papers in PubMed*): NMR 1780; MS 4644
* PubMed queries:
(((Mass spectrometry) OR MS) AND Metabolomics) AND ("2013"[Date - Publication] : "3000"[Date - Publication])
(((Nuclear Magnetic Resonance) OR NMR) AND Metabolomics) AND ("2013"[Date - Publication] : "3000"[Date - Publication])

MS vs NMR Can we use both and integrate?
Yes but expensive and requires expertise Bingol, Kerem, and Rafael Brüschweiler. "Two elephants in the room: new hybrid nuclear magnetic resonance and mass spectrometry approaches for metabolomics." Current opinion in clinical nutrition and metabolic care 18.5 (2015):

Separation technology
- Biological samples are very complex mixtures of compounds: peptides, oligonucleotides, sugars, nucleosides, organic acids, ketones, aldehydes, amines, lipids etc.
- Some metabolites have the same mass but a different structure (isomers)
- For more accurate annotation/identification of metabolites a separation technique is required, e.g.:
  - Solid phase extraction (SPE)
  - Gas chromatography (GC)
  - Capillary electrophoresis (CE)
  - High Performance Liquid Chromatography (HPLC)
  - Ultra High Performance Liquid Chromatography (UHPLC)

LC-MS spectra have 3 dimensions: time, intensity and m/z

Metabolomic data repositories
Experimental data and metadata from metabolomic studies are stored in:
- MetaboLights (here at EBI Cambridge)
- Metabolomics Workbench
Repositories make research more reproducible and open, and facilitate new research!

MetaboLights repository
- Requires submission of studies in the ISA-Tab file format
- Hierarchical structure of files for recording experimental, sample and study design information:
  - Investigation file
  - Study file(s)
  - Assay file(s)

mzML2ISA & nmrML2ISA software
- Submission to MetaboLights requires ISA-Tab files, which are time consuming and error prone to create
- Most instrument metadata is already available in the open source file formats (mzML, nmrML)
- mzML2ISA & nmrML2ISA automatically create semi-complete ISA-Tab files
- Reduces time and user error
- API, CLI, GUI and Galaxy interface

Workflow Documentation: http://2isa.readthedocs.io/en/latest/
Youtube:

Outline
- Data processing and analysis used in metabolomics (focus on LC-MS): 30 mins
- Data integration: 20 mins

Outline: Metabolomics data processing and analysis (focus on mass spectrometry)
- Open source vs proprietary
- Spectral processing
- Data analysis
- Annotation

Open source vs proprietary software
- Vendors (Thermo, AB Sciex, Waters, Agilent) all provide their own software, e.g. Xcalibur, LipidSearch, Compound Discoverer, MassLynx, Progenesis
- Open source tools are available for most tasks, though
- A recent survey [1] found most researchers use open source software alongside the commercial software bundled with the instrument
- XCMS is the most popular metabolomics software
[1] Weber, Ralf JM, Thomas N. Lawson, Reza M. Salek, Timothy MD Ebbels, Robert C. Glen, Royston Goodacre, Julian L. Griffin et al. "Computational tools and workflows in metabolomics: An international survey highlights the opportunity for harmonisation through Galaxy." Metabolomics 13, no. 2 (2017): 12.

Spectral processing (focus on LC-MS)

LC-MS spectral processing and adduct workflow
Raw files -> msconvert -> mzML files
XCMS: feature detection -> grouping -> retention time alignment -> grouping
CAMERA: groupFWHM -> findIsotopes -> groupCorr -> findAdducts
Result: adduct annotated peaklist

Raw file conversion
Workflow stage: raw files -> msconvert -> mzML files
- Convert “raw” spectra (proprietary vendor format) to the open source format mzML
- mzML is the current open source standard for mass spectrometry metabolomics and proteomics data
- msconvert (ProteoWizard) tool, with GUI and CLI
- Profile vs centroid

The Benefits of Open Source File Formats
- Makes it easy for software developers and bioinformaticians: only need to write code for 1 file format
- Standardised data (and metadata)
- No need to have access to proprietary software
Open source file formats by instrumentation:
- .mzML: mass spectrometry
- .imzML: imaging mass spectrometry
- .nmrML: nuclear magnetic resonance spectroscopy

Peak picking
Workflow stage: XCMS feature detection
- XCMS: very popular open source data pre-processing software for LC-MS and GC-MS (also available as XCMS online)
- Chromatographic peak picking algorithms:
  - matchedFilter: the original algorithm; use for profile, low resolution MS data
  - centWave: use for centroid, high resolution data
- Parameters

Grouping
Workflow stage: XCMS grouping
- The feature detection stage works on one file (sample or replicate) at a time
- We need to group chromatographic features between files based on their m/z and retention time range
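The idea of grouping can be illustrated with a toy example. This is a simplified gap-based sketch in base R, not the density-based method XCMS itself uses; the peak table, the tolerances and the helper `gap_groups` are all hypothetical.

```r
# Toy peak table pooled from three samples (hypothetical values)
peaks <- data.frame(
  sample = c("A", "B", "C", "A", "B", "C", "A"),
  mz     = c(100.001, 100.002, 100.000, 250.100, 250.101, 250.099, 100.001),
  rt     = c(60.2, 61.0, 59.8, 120.5, 121.1, 119.9, 300.0)
)

# Hypothetical helper: start a new group whenever the sorted gap exceeds tol
gap_groups <- function(x, tol) {
  ord <- order(x)
  grp <- cumsum(c(TRUE, diff(x[ord]) > tol))
  grp[order(ord)]                     # return groups in the original row order
}

mz_tol <- 0.01   # Da, assumed tolerance
rt_tol <- 5      # seconds, assumed tolerance

mz_grp <- gap_groups(peaks$mz, mz_tol)
# Split each m/z group further by retention time
rt_grp <- ave(peaks$rt, mz_grp, FUN = function(rt) gap_groups(rt, rt_tol))
peaks$feature <- as.integer(factor(paste(mz_grp, rt_grp)))
print(peaks)
```

The last peak shares its m/z with the first three but elutes much later, so it ends up as a separate feature.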

Retention time alignment
Workflow stage: XCMS retention time alignment (followed by a second grouping step)
- There can be retention time drift between LC-MS runs, which might need correction
- Can be required for larger studies
- Locally weighted scatterplot smoothing (LOESS): uses “well behaved” peak groups to calculate retention time deviations
- OBI-warp: dynamic time warping algorithm
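The LOESS approach can be sketched in base R. The example below simulates one run with a slow drift and corrects it; the data, the span and the variable names are assumptions for illustration, and XCMS handles the selection of "well behaved" peak groups itself.

```r
set.seed(1)
# Hypothetical "well behaved" peak groups: reference RT vs RT observed in one run
rt_reference <- seq(30, 600, by = 30)
true_drift   <- 2 + 0.01 * rt_reference          # simulated slow drift (seconds)
rt_observed  <- rt_reference + true_drift + rnorm(length(rt_reference), sd = 0.2)

# Fit the deviation as a smooth function of observed retention time
deviation <- rt_observed - rt_reference
fit <- loess(deviation ~ rt_observed, span = 0.75, degree = 1)

# Correct every observed retention time by subtracting the predicted deviation
rt_corrected <- rt_observed - predict(fit, newdata = data.frame(rt_observed = rt_observed))

print(round(mean(abs(rt_observed - rt_reference)), 2))   # error before correction
print(round(mean(abs(rt_corrected - rt_reference)), 2))  # error after correction
```

Because retention times change after correction, features are grouped again afterwards.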

What parameters to use?
- Can use XCMS online as a starting point
- R documentation for parameters
- Workflow4metabolomics [1]: Galaxy implementation of XCMS and CAMERA, provides user friendly documentation of the tools
- The IPO R package can optimize parameters [2]

Try it yourself! XCMS processing [1]
- Create a user account for XCMS online
- Create a single job

Try it yourself! XCMS processing [1]
- Open RStudio

Adduct annotation: Initial grouping
Workflow stage: CAMERA groupFWHM
- Using the most intense features, the data is divided into rough retention time groups
- Groups are based on 60% of the chromatographic peak FWHM (full width at half-maximum)

Adduct annotation: Isotopes
Workflow stage: CAMERA findIsotopes
- Look for 12C/13C isotope differences in m/z
- Check whether the intensity profile matches for [M]+ to [M+1]+

Adduct annotation: correlation
Workflow stage: CAMERA groupCorr
The extracted ion chromatograms (EIC) of each feature are used to calculate two types of correlation between features:
- Correlation across samples: CAS
- Correlation within samples: CPSi

groupCorr: CAMERA relationship scoring
Relationships are scored with Pearson's correlation:
- CAS: intensity correlation across samples
- CPSi: peak shape correlation for sample i
- ISO: isotope relationship (1 or 0)

groupCorr: CAMERA relationship scoring
- The relationship map can be used to build a network
  - Nodes: features (peaks)
  - Weighted edges: the score (above a threshold)
- If features have many closely connected relationships they are likely to be from the same compound
- The network is broken into smaller, more connected groups using the “highly connected subgraphs” algorithm
- This creates peak correlation (peak cluster) groups, e.g. the red and blue colours in the figure: potentially 2 different closely eluting compounds

Adduct annotation: find the adducts
Workflow stage: CAMERA findAdducts
All m/z differences within a peak correlation group are matched against a list of rules.

Adduct annotation: find the adducts
Workflow stage: CAMERA findAdducts
Matches with the same molecular mass hypothesis (below a given relative error) are combined into “groups”.
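The rule matching can be illustrated as follows. This is a simplified sketch of the idea rather than CAMERA's implementation: the m/z values, the three-rule table and the 10 ppm tolerance are assumptions, and the adduct shifts are standard monoisotopic values.

```r
# m/z values in one peak correlation group (hypothetical; positive mode)
group_mz <- c(175.1190, 197.1009, 213.0748, 301.2000)

# Small rule table: mass added to the neutral molecule M for each adduct
rules <- c("[M+H]+" = 1.007276, "[M+Na]+" = 22.989218, "[M+K]+" = 38.963158)

# Every peak/rule pair gives one hypothesis for the neutral mass M
hyp <- expand.grid(peak = seq_along(group_mz), rule = names(rules),
                   stringsAsFactors = FALSE)
hyp$neutral <- group_mz[hyp$peak] - rules[hyp$rule]

# Cluster hypotheses whose neutral masses agree within a relative error (ppm)
ppm_tol <- 10
hyp <- hyp[order(hyp$neutral), ]
gap_ppm <- diff(hyp$neutral) / head(hyp$neutral, -1) * 1e6
hyp$cluster <- cumsum(c(TRUE, gap_ppm > ppm_tol))

# Keep clusters supported by more than one distinct peak
support <- tapply(hyp$peak, hyp$cluster, function(p) length(unique(p)))
annotated <- hyp[hyp$cluster %in% names(support)[support > 1], ]
print(annotated)
```

Here the first three peaks agree on one neutral mass as [M+H]+, [M+Na]+ and [M+K]+, while the fourth peak stays unannotated.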

Adduct annotation: results!
Complete workflow: raw files -> msconvert -> mzML files -> XCMS (feature detection, grouping, retention time alignment, grouping) -> CAMERA (groupFWHM, findIsotopes, groupCorr, findAdducts) -> adduct annotated peaklist

Direct infusion mass spectrometry (DIMS)
- No chromatography
- High throughput alternative to LC-MS or GC-MS
- Protocol for a complete experimental and data analysis workflow using the spectral-stitching method [1]
- Other software to process DI-MS data: XCMS, msPurity, MI-pack (annotation)
[1] Southam, A. D., Weber, R. J., Engel, J., Jones, M. R., & Viant, M. R. (2016). A complete workflow for high-resolution spectral-stitching nanoelectrospray direct-infusion mass-spectrometry-based metabolomics and lipidomics. Nature Protocols, 12(2).

Try it yourself! CAMERA

Other important processing steps
- Blank filter: sample intensity should be > 5 times the blank intensity
- RSD < 20% for samples and QCs*?
- Is the feature found in the QC samples?
* QC (quality control) samples: a pool of all the samples
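These filters are simple to express on a feature table. The sketch below uses a hypothetical intensity matrix and applies the RSD threshold to the QC injections only, which is one possible reading of the rule above.

```r
# Hypothetical intensity matrix: rows are features, columns are injections
intensity <- rbind(
  f1 = c(blank = 100,  s1 = 1000, s2 = 1100, qc1 = 1050, qc2 = 1000, qc3 = 1020),
  f2 = c(blank = 500,  s1 = 900,  s2 = 800,  qc1 = 850,  qc2 = 870,  qc3 = 860),
  f3 = c(blank = 10,   s1 = 400,  s2 = 2000, qc1 = 300,  qc2 = 1200, qc3 = 700)
)
samples <- c("s1", "s2")
qcs     <- c("qc1", "qc2", "qc3")

# Blank filter: mean sample intensity must exceed 5 times the blank intensity
pass_blank <- rowMeans(intensity[, samples]) > 5 * intensity[, "blank"]

# RSD filter: relative standard deviation across QC injections below 20 %
rsd <- function(x) 100 * sd(x) / mean(x)
qc_rsd   <- apply(intensity[, qcs], 1, rsd)
pass_rsd <- qc_rsd < 20

keep <- pass_blank & pass_rsd
print(data.frame(pass_blank, qc_rsd = round(qc_rsd, 1), keep))
```

Only f1 survives: f2 is too close to the blank and f3 is too variable across the QCs.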

Data analysis

Feature matrix
- Similar data analysis approaches are used for metabolomics and other ‘omics technologies
- Large data matrices of samples vs m/z-RT features
- Look for fold changes between sample groups
- Univariate statistics
- Multivariate statistics

Univariate statistics
Choice of test by experimental design, for normally distributed data (compare means) and for data far from normal (compare medians):
- Compare two unpaired groups: unpaired t-test (normal); Mann-Whitney (far from normal)
- Compare two paired groups: paired t-test (normal); Wilcoxon signed-rank (far from normal)
- Compare more than two unmatched groups: one-way ANOVA with multiple comparison (normal); Kruskal-Wallis (far from normal)
- Compare more than two matched groups: repeated-measures ANOVA (normal); Friedman (far from normal)
Multiple testing correction, e.g. Benjamini-Hochberg
Vinaixa, Maria, et al. "A guideline to univariate statistical analysis for LC/MS-based untargeted metabolomics-derived data." Metabolites 2.4 (2012).
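For two unpaired groups, the tests and the correction above map directly onto base R functions. The following sketch runs them feature by feature on simulated data; the matrix, group labels and effect size are made up for illustration.

```r
set.seed(42)
# Hypothetical feature matrix: 6 wild type and 6 mutant samples, 50 features
group <- factor(rep(c("wt", "mutant"), each = 6))
x <- matrix(rnorm(12 * 50, mean = 10), nrow = 12,
            dimnames = list(NULL, paste0("feature", 1:50)))
x[group == "mutant", 1:5] <- x[group == "mutant", 1:5] + 4   # spike in 5 real changes

# Unpaired t-test (means) and Mann-Whitney test (medians) for every feature
p_t  <- apply(x, 2, function(v) t.test(v ~ group)$p.value)
p_mw <- apply(x, 2, function(v) wilcox.test(v ~ group)$p.value)

# Benjamini-Hochberg correction for multiple testing
q_t <- p.adjust(p_t, method = "BH")
print(names(q_t)[q_t < 0.05])
```

With thousands of features, reporting the adjusted values rather than raw p-values keeps the false discovery rate under control.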

Multivariate statistics
- Simultaneously observe multiple characteristics
- Some classic multivariate model assumptions are not fulfilled for chemometric / ‘omic data:
  - Fewer observations than variables
  - Correlations between variables
- However, various methods can still be used or have been adapted for chemometrics / ‘omics

Multivariate data matrix convention
(Figure: data matrix of samples against m/z-RT features)

Principal component analysis (PCA)
Addresses the following problems with multivariate datasets:
- Visualisation of data with more than 3 variables is not possible
- High correlation between variables makes many statistical methods not applicable
- Many variables contain very little information
A good first visual check for any multivariate dataset (outliers).
Classic multivariate statistics states that we should have more samples than variables. PCA is unsupervised, so it is less susceptible to this problem (compared to PLS-DA).

PCA (basics)
- Principal component analysis is a good name! It takes the principal components of the data: the directions where there is the most variance (where the data is most spread out)
- Simple example with 2 variables: in figure 1 the variance is spread equally across both source variables
- If we use two new axes (figure 2) then we can explain the majority of the variance in 1 dimension
- This reduces the dimensionality of the data, which can be very useful when there are many variables
Eigenvectors and eigenvalues
- Eigenvector: the direction of the line
- Eigenvalue: the spread of the data along the line (variance)
- Principal component 1 (PC1) is the eigenvector with the largest amount of variance
- The maximum number of PCs is the smaller of the number of rows and the number of columns of the matrix
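The link between PCA and eigenvectors can be checked numerically. This sketch simulates two correlated variables (hypothetical data) and compares `prcomp` with an eigen-decomposition of the covariance matrix.

```r
set.seed(7)
# Two correlated variables, as in the two-variable example (hypothetical data)
v1 <- rnorm(100)
v2 <- v1 + rnorm(100, sd = 0.3)
x  <- cbind(v1, v2)

# PCA on mean-centred data
pca <- prcomp(x, center = TRUE, scale. = FALSE)

# The same result from the eigen-decomposition of the covariance matrix
eig <- eigen(cov(x))

print(pca$sdev^2)                    # variance along each principal component
print(eig$values)                    # eigenvalues: the same numbers
print(pca$sdev^2 / sum(pca$sdev^2))  # proportion of variance explained (scree plot)
```

Almost all of the variance falls on PC1, so a single dimension describes most of this two-variable dataset.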

PCA (plots)
- Scree plot: shows how much variance is explained by the PCs
- Scores plot: projections of the PCs onto each sample
- Loadings plot: projections of the PCs onto each metabolite
- Biplot: combines the scores and loadings, to determine which metabolite is “driving the separation”
Data and plots from MetaboAnalyst.ca

Partial least squares discriminant analysis (PLS-DA)
- Supervised method
- Describes the difference between classes of samples (e.g. wild type, mutant)
- Maximizes the covariance between the X variables and the Y variables
- Validation and cross validation step required
- Use R2 (and Q2) to measure how well the prediction performed
  - R2: coefficient of determination (how well the regression fits)
  - Q2: the cross-validated counterpart of R2, a measure of goodness of prediction
Use PLS-DA with caution…
- A low number of samples gives a high risk of overfitting
- R2 and Q2 are not ideal for categorical data
(Figure: PCA vs PLS-DA)
Gromski, Piotr S., et al. "A tutorial review: Metabolomics and partial least squares-discriminant analysis–a marriage of convenience or a shotgun wedding." Analytica chimica acta 879 (2015).

Missing value imputation
Important for PCA and PLS-DA.
Missing values in metabolomics datasets are observed when:
1. The metabolite is not present in the sample
2. The metabolite is present in the sample but below the limit of detection (LOD)
3. The metabolite is present in the sample but is missed during data processing
- A typical first step is to filter out metabolites or samples that have > x % missing values
- For case 3 we assume the distribution of missing values is random and can use standard missing value imputation:
  - Small value replacement
  - K-nearest neighbour imputation (KNN)
  - Random Forest imputation (RF)
- However, cases 1 and 2 should really be treated differently
- In practice this is difficult and, following filtering, only 1 type of missing value imputation is typically performed
- Centering is always needed for PCA and PLS-DA
See [1] for recommendations for PCA, PLS-DA and univariate statistics
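The filter-then-impute sequence can be sketched in a few lines. The matrix, the 50 % threshold and the half-minimum replacement rule are assumptions for illustration; KNN and Random Forest imputation need dedicated packages and are not shown.

```r
# Hypothetical feature matrix with missing values (rows: samples, columns: features)
x <- cbind(
  f1 = c(10, 12, NA, 11, 9),
  f2 = c(NA, NA, NA, 5, NA),
  f3 = c(100, 120, 110, NA, 90)
)

# Step 1: drop features with more than max_missing fraction of missing values
max_missing <- 0.5                         # assumed threshold (the "x %" above)
frac_missing <- colMeans(is.na(x))
x_filtered <- x[, frac_missing <= max_missing, drop = FALSE]

# Step 2: small value replacement, here half of each feature's minimum
impute_small <- function(v) {
  v[is.na(v)] <- min(v, na.rm = TRUE) / 2
  v
}
x_imputed <- apply(x_filtered, 2, impute_small)
print(x_imputed)
```

Small value replacement suits values missing because they are below the LOD; it is a poor model for peaks that were simply missed.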

Normalisation, transformations and scaling
Normalisation
- Removes variation in the measured response that is unrelated to the biological differences between samples, e.g. slight differences in preparation and collection for each sample
- Use the sum, mean or median as a normalisation factor
- Probabilistic quotient normalisation (PQN)
Transformation
- Corrects for skewed data and heteroscedasticity: metabolites with large intensities typically have more variation than those with lower intensities
- Logarithmic transformations: natural logarithm (nLog), generalised logarithm (gLog); often a constant value is added to cope with near-zero values
- Heteroscedasticity: “variability of a variable is unequal across the range of values of a second variable that predicts it”, i.e. does the variability vary as the intensity increases?
Scaling
- Adjusts for differences in fold change between metabolites, which may be caused by large differences in the variation of the measured responses
- After scaling, the values are not dependent on the absolute abundance
- Autoscaling: every peak is mean centered and divided by the standard deviation of the column
- Pareto scaling: every peak is mean centered and divided by the square root of the standard deviation of the column
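The three steps can be chained on a samples-by-peaks matrix. This is a minimal sketch with made-up intensities; it uses the median spectrum as the PQN reference and log(x + 1) as the transformation, both of which are choices made for the example only.

```r
# Hypothetical intensity matrix: rows are samples, columns are peaks
x <- rbind(
  s1 = c(100, 200, 400, 800),
  s2 = c(210, 390, 820, 1500),   # roughly a 2x more concentrated sample
  s3 = c(90,  220, 380, 900)
)

# Probabilistic quotient normalisation (PQN)
reference <- apply(x, 2, median)                 # reference spectrum
quotients <- sweep(x, 2, reference, "/")         # peak-wise quotients
dilution  <- apply(quotients, 1, median)         # most probable dilution factor
x_pqn     <- sweep(x, 1, dilution, "/")

# Log transformation; the added constant copes with near-zero values
x_log <- log(x_pqn + 1)

# Autoscaling and Pareto scaling (both mean centre each column first)
centred  <- sweep(x_log, 2, colMeans(x_log), "-")
col_sd   <- apply(x_log, 2, sd)
x_auto   <- sweep(centred, 2, col_sd, "/")
x_pareto <- sweep(centred, 2, sqrt(col_sd), "/")
print(round(dilution, 2))
```

After autoscaling every peak has unit variance, whereas Pareto scaling keeps part of the original intensity structure.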

Try it yourself! Load in csv file from previous R session into MetaboAnalyst

Level of annotation
- Level 1: Identified metabolites *
- Level 2: Putatively annotated compounds
- Level 3: Putatively characterised compound classes
- Level 4: Unknown compounds
* 2 or more orthogonal properties of a chemical standard compared to experimental data using the same analytical methods
Salek, R. M., Steinbeck, C., Viant, M. R., Goodacre, R., & Dunn, W. B. (2013). The role of reporting standards for metabolite annotation and identification in metabolomic studies. GigaScience, 2(1), 13.

Level of annotation
Often the first step is to search compound libraries.

Useful metabolomic databases for annotation
- Pathway databases: KEGG, MetaCyc
- Compound databases: PubChem, ChEBI, ChemSpider
- Spectral databases: HMDB, MassBank, Metlin, mzCloud
- Download spectral libraries

Rumsfeld Annotation Quadrant
Identified by library
- Expected by analyst: Known knowns (expected and found)
- Unexpected by analyst: Unknown knowns (not expected but found)
Not identified by library
- Expected by analyst: Known unknowns (expected but not found)
- Unexpected by analyst: Unknown unknowns (not expected and not found)

Neutral mass lookup
- Match the measured neutral exact mass against the exact masses in databases of known compounds
  - Biologically relevant: KEGG, HMDB, MetaCyc, ChEBI
  - All compounds: PubChem, ChemSpider
- Often many compounds will match 1 neutral mass
- It is then difficult to determine which compound to choose for the annotation
- Software: MI-pack, PUTMEDID, Metabosearch
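A lookup of this kind is essentially a tolerance search. The sketch below uses a four-row stand-in for a compound database and a hypothetical helper `lookup_mass`; the masses are standard monoisotopic values and the 5 ppm default is an assumption.

```r
# Tiny stand-in for a compound database (monoisotopic masses; illustrative only)
compounds <- data.frame(
  name = c("Arginine", "Glucose", "Fructose", "Citric acid"),
  exact_mass = c(174.11168, 180.06339, 180.06339, 192.02700)
)

# Hypothetical helper: return all database entries within ppm_tol of a query mass
lookup_mass <- function(query, db, ppm_tol = 5) {
  ppm_error <- (db$exact_mass - query) / query * 1e6
  hits <- db[abs(ppm_error) <= ppm_tol, , drop = FALSE]
  hits$ppm_error <- ppm_error[abs(ppm_error) <= ppm_tol]
  hits
}

# Isomers share a neutral mass, so one query can return several candidates
print(lookup_mass(180.0634, compounds))
```

Glucose and fructose are both returned, which shows why mass alone rarely gives an unambiguous annotation.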

MS/MS and MSn spectral search
- Fragmentation spectra can be more reliable than a simple neutral mass lookup from MS1 spectra
- Fragmentation libraries can come from real experimental data or be generated in silico
- The predictable fragmentation patterns of lipids allow for in silico libraries (e.g. LipidBlast)
- Search collected experimental spectra against library spectra
- Use spectral matching methods

Spectral matching methods
- Compare query spectra to library spectra
- Vectors can be m/z, intensity or weighted intensity (most common)
- Weighted intensity: $w = [\text{peak intensity}]^{m} \times [\text{mass } (m/z)]^{n}$
- $m$ and $n$ are weight factors (different recommendations in the literature)
- Calculate the weighted intensity vector for the query ($\mathbf{x}$) and the library ($\mathbf{y}$)
Example: i = (100, 50, 10), mz = (100, 200, 300), m = 0.5, n = 2
weighted = (100^0.5 × 100^2, 50^0.5 × 200^2, 10^0.5 × 300^2) = (100000, 282842.7, 284605.0)
Horai, Hisayuki, Masanori Arita, and Takaaki Nishioka. "Comparison of ESI-MS spectra in MassBank database." 2008 International Conference on BioMedical Engineering and Informatics. Vol. 2. IEEE, 2008.
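The weighting is a one-line vectorised calculation in R. The helper name `weight_spectrum` is hypothetical; the numbers are those of the worked example.

```r
# Hypothetical helper: weighted intensity = intensity^m * mz^n
weight_spectrum <- function(intensity, mz, m = 0.5, n = 2) {
  intensity^m * mz^n
}

intensity <- c(100, 50, 10)
mz        <- c(100, 200, 300)

weighted <- weight_spectrum(intensity, mz, m = 0.5, n = 2)
print(round(weighted, 1))
```

With n = 2 the high m/z peaks dominate even though their intensities are lower, which is the intended effect of the mass weighting.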

Spectral matching methods
Count peaks
- Simplest method
- Count matching peaks between the query ($\mathbf{x}$) and library ($\mathbf{y}$) vectors
Dot product ($n$ = length of vector)
$dp = \mathbf{x} \cdot \mathbf{y} = x_1 y_1 + x_2 y_2 + \dots + x_n y_n$
Dot product cosine (cosine similarity)
- Bounds the result between 0 and 1 (when non-negative values are used), 1 being a perfect match
$dpc = \dfrac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}$
Example: x = (2, 4, 3), y = (2, 1, 1)
dp = (2×2) + (4×1) + (3×1) = 11
dpc = 11 / (√29 × √6) = 0.834
Many other methods: PBM, pMatch, machine learning techniques
Horai, Hisayuki, Masanori Arita, and Takaaki Nishioka. "Comparison of ESI-MS spectra in MassBank database." 2008 International Conference on BioMedical Engineering and Informatics. Vol. 2. IEEE, 2008.
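The worked example can be reproduced directly. This sketch assumes the two vectors are already aligned and of equal length; `dot_product_cosine` is a hypothetical helper name.

```r
# Dot product cosine between two equal-length, already aligned vectors
dot_product_cosine <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

x <- c(2, 4, 3)   # query
y <- c(2, 1, 1)   # library

print(sum(x * y))                          # dot product
print(round(dot_product_cosine(x, y), 3))  # dot product cosine
```

Unlike the raw dot product, the cosine does not change if one spectrum is scaled, so absolute intensity differences between instruments do not affect the score.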

Spectral matching (aligning)
- But first the peaks of the query and library spectra need to be aligned!
- Often not discussed in the methods of spectral matching, but very important
- Very good explanation of the whole process here (Waters and MassBank approach): algorithm.aspx
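One simple way to align two centroided spectra is a greedy nearest-peak match within an m/z tolerance. The sketch below is an illustration under that assumption, not the Waters or MassBank procedure; `align_spectra`, the spectra and the 0.01 Da tolerance are all hypothetical.

```r
# Hypothetical helper: align two centroided spectra on a shared m/z axis.
# Peaks within tol (Da) are paired greedily; unmatched peaks get intensity 0.
align_spectra <- function(q_mz, q_int, l_mz, l_int, tol = 0.01) {
  lib_for_query <- rep(NA_integer_, length(q_mz))
  for (i in seq_along(q_mz)) {
    d <- abs(l_mz - q_mz[i])
    d[seq_along(l_mz) %in% lib_for_query] <- Inf     # library peak already used
    if (length(d) > 0 && min(d) <= tol) lib_for_query[i] <- which.min(d)
  }
  matched  <- !is.na(lib_for_query)
  lib_only <- setdiff(seq_along(l_mz), lib_for_query)
  data.frame(
    mz      = c(q_mz, l_mz[lib_only]),
    query   = c(q_int, rep(0, length(lib_only))),
    library = c(ifelse(matched, l_int[lib_for_query], 0), l_int[lib_only])
  )
}

aligned <- align_spectra(q_mz = c(70.065, 116.071, 175.119), q_int = c(30, 100, 60),
                         l_mz = c(70.066, 130.098, 175.118), l_int = c(25, 10, 80))
print(aligned)
```

The `query` and `library` columns are now equal-length vectors that can be weighted and scored with any of the matching methods.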

Spectral matching limitations
- Only a small number of metabolites are covered in spectral libraries
- Different instrument types will produce different spectra
- Different instrument parameters will affect the spectra, e.g. the fragmentation method:
  - Collision-induced dissociation (CID)
  - Higher-energy collisional dissociation (HCD)
- The precursor ion purity: what else are you fragmenting?

Typical MS/MS DDA run
(Figure: MS1 spectrum, scan 1, followed by MS2 spectrum, scan 2)

msPurity
- Simple metric to assess the contribution of the selected precursor for fragmentation
- 1 = most pure (all contribution from precursor)
- 0 = least pure (no contribution from precursor)
- Interpolates

Precursor purity
- Isolation window around the peak: for an m/z of 100 +/- 0.5 Da, the window is 99.5 to 100.5
- Purity = target intensity / total intensity in the window, e.g. 0.56
- MS/MS spectra resulting from impure precursors are referred to as chimeric
- The Chimera, a creature in Greek mythology, was composed of parts of more than one animal
(Figure: intensity vs m/z across the isolation window, 99.5 to 100.5)
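The purity ratio is straightforward to compute from the MS1 peaks inside the isolation window. The intensities below are hypothetical and were chosen so that the result matches the 0.56 quoted above; this simple ratio also leaves out the interpolation and isolation efficiency steps.

```r
# Hypothetical MS1 peaks around a precursor at m/z 100
mz        <- c(99.20, 99.70, 100.00, 100.30, 101.00)
intensity <- c(500,   300,   1000,   480,    900)

target_mz <- 100
half_window <- 0.5                       # isolation window of +/- 0.5 Da

in_window <- mz >= target_mz - half_window & mz <= target_mz + half_window
target_i  <- intensity[which.min(abs(mz - target_mz))]   # precursor intensity
total_i   <- sum(intensity[in_window])                   # everything isolated

purity <- target_i / total_i
print(round(purity, 2))
```

Peaks outside the window (99.20 and 101.00 here) do not affect the score, however intense they are.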

Interpolate the metric

Isolation efficiency
Thermo Scientific Q Exactive™ Focus, 0.5 Da isolation window.
(Figure A: isolation efficiency profile of ions at several m/z values. Figure B: a simple linear model using B-spline polynomials to predict isolation efficiency based on isolation window position; negative values have been zeroed; F: 210, p-value < 0.001.)

Do we have to rely on fragmentation spectral libraries? (nope)
- MetFrag
  - Combinatorial fragmenter using various heuristics to speed up the process
  - For each suspected compound, produces in silico fragmentation spectra
  - Matches these to the experimental fragmentation spectra
- CSI:FingerID
  - Machine learning techniques
  - Determines a “fingerprint” of the experimental spectra
  - Searches the fingerprint against molecular structure databases
- MSnPy (to be published)
  - Creates fragmentation tree networks to explain the spectra
  - Determines possible molecular formulas based on the trees
- Many more: SIRIUS, FT-Blast

Try it yourself! Neutral exact mass lookup (1)
PubChem search for neutral exact mass
- Use the advanced search
- Search for the neutral exact mass. How many hits?
- Search for a range around the neutral mass using the [EXMASS] field
- Choose the first hit. What is the molecular weight? Why is it different to the exact mass?
- See REST access to use programmatically

Try it yourself! Neutral exact mass lookup (2)
Metabosearch: programmatic access to multiple libraries of compounds to do exact mass lookup
- Download Metabosearch
- Use your grouped_peaklist.csv from the previous activity
- Annotate the peak list using the standard settings

Try it yourself! Spectral matching dot product cosine [1]
Query spectrum:
mz = [200.43, , 98.5]
intensity = [1000, 100, 200]
Library spectrum:
mz = [200.43, , 98.5]
intensity = [1000, 20, 10]
Calculate the weighted vectors for the query and library spectrum

Try it yourself! Spectral matching dot product cosine [2]
Calculate the dot product cosine of the two weighted vectors:
$dpc = \dfrac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}$
What other similarity measure could we use?

Try it yourself! Spectral matching with MassBank
- Use the file saved as spectra_for_massbank.txt
- It consists of one column of mz and one of intensity
- What is the best hit? What is its match score?

Try it yourself! MetFrag

Try it yourself! msPurity: Calculate precursor purity

library(msPurity)
library(xcms)

######################################
# Calculate purity of MS/MS spectra
######################################
msPurityDataPth <- system.file("extdata", "lcms", "mzML", package = "msPurityData")
msmsPths <- list.files(msPurityDataPth, full.names = TRUE, pattern = "MSMS")
msPths <- list.files(msPurityDataPth, full.names = TRUE, pattern = "LCMS_")

pa <- purityA(msmsPths,
              mostIntense = TRUE,   # use the most intense peak for precursor
              interpol = "linear",  # linear interpolation
              iwNorm = TRUE,        # use the default isolation window normalisation
              ilim = 0.05,          # remove noise from calculation
              isotopes = TRUE)      # remove isotopes from calculation

msPurity: Link MS/MS to XCMS feature
Try it yourself!
######################################
# Link MS/MS spectra to XCMS feature
######################################
xset <- xcmsSet(msmsPths)
xset <- group(xset)
xset <- retcor(xset)
xset <- group(xset)  # re-group after retention time correction

# Link XCMS feature to MS/MS
pa <- frag4feature(pa, xset)

Note: this is also possible with the xcmsFragments function from the xcms package.

Metabolite pathway analysis
Over-representation analysis
Input: list of differentially observed metabolites; list of all metabolites measured (and annotated)
Output: p-value (and q-value) of over-represented pathways
Enrichment analysis
Input: list of every metabolite measured (and annotated) with an associated value, e.g. log fold change, or two values (one for each phenotype)
Output Hypergeometric
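Over-representation analysis with a hypergeometric test can be sketched as below. This is a minimal sketch under stated assumptions: the counts in the example are hypothetical, the function names are made up for illustration, and Benjamini-Hochberg is used for the q-values as one common choice (the slides do not say which correction is intended).

```python
from math import comb


def ora_p_value(n_background, n_in_pathway, n_selected, n_overlap):
    """P(X >= n_overlap) for a hypergeometric draw.

    n_background: all metabolites measured (and annotated)
    n_in_pathway: background metabolites that belong to the pathway
    n_selected:   differentially observed metabolites
    n_overlap:    differentially observed metabolites in the pathway
    """
    upper = min(n_in_pathway, n_selected)
    favourable = sum(
        comb(n_in_pathway, i) * comb(n_background - n_in_pathway, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_background, n_selected)


def benjamini_hochberg(p_values):
    """Return q-values in the same order as the input p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotonic.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        q_values[index] = running_min
    return q_values


# Hypothetical counts: 200 measured, 10 in the pathway, 20 significant, 5 overlap.
p = ora_p_value(200, 10, 20, 5)
print(f"pathway p-value = {p:.5f}")
print(benjamini_hochberg([0.01, 0.04, 0.03]))
```

The choice of background list matters: using all metabolites in a database rather than only those measured and annotated changes `n_background` and can make pathways look more significant than they are.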

Data processing and analysis used in metabolomics (with a focus on mass spectrometry): 30 mins
Data integration: 20 mins

Outline
Type of study
Types of data integration
Types of statistical data integration
Genome-scale reconstructions

Data integration review paper
The following slides rely heavily on the review paper: Cavill, Rachel, Danyel Jennen, Jos Kleinjans, and Jacob Jan Briedé. "Transcriptomic and metabolomic data integration." Briefings in bioinformatics 17, no. 5 (2016):

Type of multi-omic study

Repeated study
Perform a different ‘omics study on separately prepared representative samples at a different time/place.
Advantages: simple; potentially easier for multiple laboratories; measurements considered statistically different.
Negatives: batch effects (difficult or impossible to handle across very different technologies and experiments).

Split sample study
Samples originate from the same biological source material, e.g. tissue is homogenised and half goes to metabolomics and half goes to transcriptomics. Often the ideal situation.
Advantages: limits batch effects between ‘omics studies (note: will not remove within-‘omic batch effects); looking at the same “thing”.
Negatives: feasibility; sample volume; the preparation procedure might not allow RNA and metabolites to be extracted at the same time.

Source-match study
Use different fractions of the biological system for different ‘omics experiments, e.g. blood and urine for metabolomics and tissue for RNA analysis.
Advantages: limits batch effects between studies; different sources lend themselves better to different techniques.
Negatives: additional consideration required for analysis.

Replicate-matched study
Perform a different ‘omics study on separately prepared representative samples at the same time/place.
Advantages: limits batch effects between studies; can be used if there is insufficient sample to do a split-sample study.
Negatives: not looking at the same replicate (adds unwanted variation).

Type of data integration

Types of statistical data integration

Correlation-based integration
Standard correlation coefficients: Pearson’s (parametric); Spearman’s (non-parametric).
Goodman and Kruskal gamma test: only takes into account the up/down regulation of each metabolite/gene.
Partial correlations: evaluate those correlations that are independent of the other co-linear measurements, e.g. what is the independent correlation of gene A and metabolite B, given that they are both correlated to gene C?
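The difference between a standard and a partial correlation can be illustrated with a short Python sketch. The data are hypothetical and constructed so that gene A and metabolite B are related only through gene C; the first-order partial correlation formula used here is the standard textbook one, and the function names are made up for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Ranks starting at 1, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def partial_correlation(r_ab, r_ac, r_bc):
    """Correlation of A and B after removing the linear effect of C."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


# Hypothetical data: A and B both follow C, plus unrelated noise.
gene_c = [1, 2, 3, 4, 5, 6]
gene_a = [2, 1, 3, 4, 4, 7]
metabolite_b = [1.5, 2.5, 2, 3, 5.5, 6.5]

r_ab = pearson(gene_a, metabolite_b)
r_partial = partial_correlation(r_ab,
                                pearson(gene_a, gene_c),
                                pearson(metabolite_b, gene_c))
print(f"Pearson r(A, B)      = {r_ab:.3f}")
print(f"Spearman rho(A, B)   = {spearman(gene_a, metabolite_b):.3f}")
print(f"partial r(A, B | C)  = {r_partial:.3f}")
```

In this constructed example the raw correlation between A and B is high, but it disappears once gene C is accounted for, which is the situation partial correlations are meant to expose.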

Correlation-based integration
Problems:
Biochemical pathways often don’t correlate with the expected corresponding enzymes [1].
Changes in the metabolome and the transcriptome will not be simultaneous. Align the data through time first [2]?
[1] Ter Kuile BH, Westerhoff HV. Transcriptome meets metabolome: hierarchical and metabolic regulation of the glycolytic pathway.
[2] Cavill R, Kleinjans JCS, Briedé JJ. Dynamic time warping for omics. PLoS One 2013;8:e Lu

Dataset concatenation-based integration
Simply concatenate the datasets together.
Use fold changes for each ‘omic dataset.
Perform analysis (typically multivariate) on the new dataset.
Problems: dominated by one dataset (e.g transcripts vs 100 metabolites); typically get separate clusters for each ‘omic dataset, i.e. you don’t get as many clusters that span the ‘omics datasets.
Recent example of concatenation-based integration: Huang, Susie SY, et al. "A multi-omic approach to elucidate low-dose effects of xenobiotics in zebrafish (Danio rerio) larvae." Aquatic Toxicology 182 (2017):
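The dominance problem can be made concrete with a small Python sketch. Everything here is an illustrative assumption: the fold-change values are hypothetical, and the block scaling step (dividing each dataset by the square root of its total variance) is one commonly used remedy that is not taken from the slides.

```python
import math


def total_variance(rows):
    """Sum of the per-variable (column) variances of a samples x variables table."""
    n = len(rows)
    total = 0.0
    for column in zip(*rows):
        mean = sum(column) / n
        total += sum((value - mean) ** 2 for value in column) / n
    return total


def block_scale(rows):
    """Scale a block so that its total variance equals 1."""
    scale = math.sqrt(total_variance(rows))
    return [[value / scale for value in row] for row in rows]


def concatenate(*blocks):
    """Join blocks sample by sample (all blocks must share the sample order)."""
    return [[value for row in rows for value in row] for rows in zip(*blocks)]


def share(block, other):
    """Fraction of the combined variance contributed by `block`."""
    return total_variance(block) / (total_variance(block) + total_variance(other))


# Hypothetical log fold changes: 4 samples, 6 transcripts and 2 metabolites.
transcripts = [
    [1.0, -0.5, 0.8, 1.2, -1.1, 0.3],
    [0.9, -0.4, 0.7, 1.0, -0.9, 0.2],
    [-1.0, 0.6, -0.8, -1.1, 1.0, -0.4],
    [-0.8, 0.5, -0.9, -1.3, 1.2, -0.2],
]
metabolites = [
    [0.4, -0.3],
    [0.5, -0.2],
    [-0.4, 0.3],
    [-0.6, 0.1],
]

combined = concatenate(transcripts, metabolites)
share_raw = share(transcripts, metabolites)
share_scaled = share(block_scale(transcripts), block_scale(metabolites))

print(f"variables per sample after concatenation: {len(combined[0])}")
print(f"transcript share of variance, raw:    {share_raw:.2f}")
print(f"transcript share of variance, scaled: {share_scaled:.2f}")
```

Without scaling, any variance-driven method such as PCA applied to `combined` will mostly describe the larger dataset; with block scaling each ‘omic dataset contributes equally to the total variance.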

Multivariate-based integration
Perform multivariate analysis but do not concatenate.
Many methods, and they can get very complicated.
However, this is possibly the most appropriate route for analysis.

Multivariate-based integration: O2PLS
Two-way orthogonal partial least squares.
Describes the relationships between two (or more) datasets: maps the connected variation between datasets; finds unique variation within a dataset.
One of many methods for multivariate-based integration [1].
[1] Bylesjö, Max, et al. "Data integration in plant biology: the O2PLS method for combined modeling of transcript and metabolite data." The Plant Journal 52.6 (2007):
[2] El Bouhaddani, Said, et al. "Evaluation of O2PLS in Omics data integration." BMC Bioinformatics. Vol. 17. No. Suppl 2. BioMed Central Ltd, 2016.
[3] MKS data analytics solutions tutorial

Multivariate-based integration: O2PLS example
Projection of the individuals (scores plot): 30 mg (light grey), 180 mg (dark grey), 1100 mg (black), blank (white).
Loadings plot of metabolite (grey dots) and transcript (black dots) levels.
Magnified view of the most strongly upregulated (Ellipse 1) and downregulated (Ellipse 2) variables.
Eveillard, Alexandre, et al. "Identification of potential mechanisms of toxicity after di-(2-ethylhexyl)-phthalate (DEHP) adult exposure in the liver using a systems biology approach." Toxicology and applied pharmacology 236.3 (2009):

Pathway-based integration
Pathway enrichment / over-representation analysis.
Consensus of the phenotype between datasets.
Considerations: how to combine the p-values? What ‘background lists’ to use?
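One possible answer to the question of how to combine p-values is Fisher's method, sketched below in Python. This is an assumption chosen for illustration, not necessarily the approach intended in the slides; it also assumes that the metabolite and gene p-values for a pathway are independent. The p-values in the example are hypothetical.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom, whose upper tail has a closed form for even
    degrees of freedom.
    """
    if not p_values or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in the interval (0, 1]")
    k = len(p_values)
    half_statistic = -sum(math.log(p) for p in p_values)
    term = 1.0
    series = 1.0
    for i in range(1, k):
        term *= half_statistic / i
        series += term
    return math.exp(-half_statistic) * series


# Hypothetical pathway p-values from the metabolite and gene analyses.
p_metabolite = 0.05
p_gene = 0.05
p_combined = fisher_combine([p_metabolite, p_gene])
print(f"combined p-value = {p_combined:.5f}")
```

Two moderately significant results for the same pathway give a smaller combined p-value, which is the kind of consensus between datasets that pathway-based integration looks for.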

[Figure: significant pathways (Metabolite), significant pathways (Gene) and significant pathways (combined)]

Pathway based integration
Try it yourself!
IMPaLA
MetaboAnalyst

Type of data integration

Genome-Wide Metabolic Reconstruction
A structured representation of all biochemical metabolic reactions that take place within an organism.
Bottom-up approach to modelling the metabolic capabilities of an organism:
Takes the annotated genome and finds enzyme-encoding genes.
Infers metabolic pathways from the enzymes.
This is known as a draft reconstruction.
Further manual curation is required to achieve high accuracy.
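The bottom-up steps above can be sketched with a toy example in Python. This is only a schematic illustration: the gene names and annotation are hypothetical, the lookup tables cover three glycolytic enzymes rather than a real reaction database, and real tools use far richer evidence than an EC number match.

```python
# Toy reaction database keyed by EC number (three glycolytic enzymes).
EC_TO_REACTION = {
    "2.7.1.1": "glucose + ATP -> glucose 6-phosphate + ADP",
    "5.3.1.9": "glucose 6-phosphate <=> fructose 6-phosphate",
    "2.7.1.11": "fructose 6-phosphate + ATP -> fructose 1,6-bisphosphate + ADP",
}

# Toy pathway definition: which enzymes make up each pathway.
PATHWAYS = {
    "upper glycolysis": ["2.7.1.1", "5.3.1.9", "2.7.1.11"],
}

# Hypothetical genome annotation: gene -> EC number (None = no enzyme call).
genome_annotation = {
    "geneA": "2.7.1.1",
    "geneB": "5.3.1.9",
    "geneC": None,
}


def draft_reconstruction(annotation):
    """Infer reactions and pathway coverage from enzyme-encoding genes."""
    found = {ec for ec in annotation.values() if ec in EC_TO_REACTION}
    reactions = sorted(EC_TO_REACTION[ec] for ec in found)
    report = {}
    for pathway, enzymes in PATHWAYS.items():
        missing = [ec for ec in enzymes if ec not in found]
        report[pathway] = {
            "coverage": (len(enzymes) - len(missing)) / len(enzymes),
            "missing": missing,  # gaps to resolve during manual curation
        }
    return reactions, report


reactions, report = draft_reconstruction(genome_annotation)
for reaction in reactions:
    print(reaction)
print(report)
```

The missing enzyme in the report is the kind of gap that manual curation addresses, either by finding an unannotated gene or by concluding that the pathway is incomplete in the organism.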

Available software
Pathway Tools: used for BioCyc; requires reasonably good annotation of the genome.
METRONOME pipeline (METabolic Reconstruction Of New genOMe sEquences): Python package for draft reconstruction where there is minimal annotation.

That’s it! Thank you for listening
Acknowledgments: Ralf Weber, Martin Jones, James Bradbury, Warwick Dunn


PCA further
dummies-eigenvectors-eigenvalues-and-dimension-reduction/
vignette.pdf #sect694
van den Berg, R. A., Hoefsloot, H. C., Westerhuis, J. A., Smilde, A. K., & van der Werf, M. J. (2006). Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics, 7(1), 142.
