Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops www.bioinformatics.ca


Module 7: Metabolomic Data Analysis Using MetaboAnalyst. David Wishart. Informatics and Statistics for Metabolomics, June 16-17, 2014.

Learning Objectives: To become familiar with the standard metabolomics data analysis workflow. To become aware of key elements such as data integrity checking, outlier detection, quality control, normalization and scaling. To learn how to use MetaboAnalyst to facilitate data analysis.

A Typical Metabolomics Experiment

2 Routes to Metabolomics: quantitative (targeted) methods and chemometric (profiling) methods. [Figure: NMR spectra (ppm axis), one with peaks assigned to hippurate, urea, allantoin, creatinine, 2-oxoglutarate, citrate, TMAO, succinate, fumarate, taurine and water; and a PCA scores plot (PC1 vs. PC2) with the groups PAP, ANIT and Control.]

Metabolomics Data Workflow. Chemometric methods: data integrity check; spectral alignment or binning; data normalization; data QC/outlier removal; data reduction & analysis; compound ID. Targeted methods: data integrity check; compound ID and quantification; data normalization; data QC/outlier removal; data reduction & analysis.

Data Integrity/Quality: LC-MS and GC-MS produce a high number of false positive peaks. Problems arise from adducts (LC), extra derivatization products (GC), isotopes, breakdown products (ionization issues), etc. This is not usually a problem with NMR. Check using replicates and adduct calculators: MZedDB http://maltese.dbs.aber.ac.uk:8888/hrmet/index.html and HMDB http://www.hmdb.ca/search/spectra?type=ms_search
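To illustrate what an adduct calculator does, here is a minimal Python sketch. It is not the MZedDB or HMDB implementation; the function names, the 10 ppm tolerance and the peak list are assumptions made for illustration. It adds the standard monoisotopic mass shifts of a few common singly charged adducts to a neutral mass and flags observed peaks that match.

```python
# Monoisotopic mass shifts (Da) for a few common singly charged adducts.
ADDUCT_SHIFTS = {
    "[M+H]+": 1.007276,
    "[M+NH4]+": 18.033823,
    "[M+Na]+": 22.989218,
    "[M+K]+": 38.963158,
    "[M-H]-": -1.007276,
}


def adduct_mz(neutral_mass):
    """Return the expected m/z of each adduct for a neutral monoisotopic mass."""
    return {name: round(neutral_mass + shift, 4)
            for name, shift in ADDUCT_SHIFTS.items()}


def match_adducts(peaks, neutral_mass, tol_ppm=10.0):
    """List (peak, adduct) pairs whose m/z agree within tol_ppm (hypothetical default)."""
    hits = []
    for name, mz in adduct_mz(neutral_mass).items():
        for peak in peaks:
            if abs(peak - mz) / mz * 1e6 <= tol_ppm:
                hits.append((peak, name))
    return hits


glucose = 180.06339                      # neutral monoisotopic mass of C6H12O6
peaks = [181.0707, 203.0526, 250.1000]   # made-up peak list
print(match_adducts(peaks, glucose))
```

Two of the three peaks are explained as adducts of the same compound, so they should not be counted as separate metabolites.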

Data/Spectral Alignment: Important for LC-MS and GC-MS studies. Not so important for NMR (pH variation). Many programs are available (XCMS, ChromA, MZmine); most are based on time warping algorithms. http://mzmine.sourceforge.net/ http://bibiserv.techfak.uni-bielefeld.de/chroma http://metlin.scripps.edu/xcms/

Binning (3000 pts to 14 bins). [Figure: a spectrum divided into consecutive bins (bin1, bin2, bin3, ...); each point (xi, yi) is annotated, e.g. x = 232.1 (AOC), y = 10 (bin #).]
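The binning step can be sketched in a few lines of Python. This is only an illustration with equal-width bins and a made-up flat spectrum; the function name is an assumption, and real binning tools may use different bin edges or adaptive widths.

```python
def bin_spectrum(intensities, n_bins):
    """Sum consecutive points into n_bins bins of (nearly) equal width."""
    n = len(intensities)
    edges = [round(i * n / n_bins) for i in range(n_bins + 1)]
    return [sum(intensities[edges[i]:edges[i + 1]]) for i in range(n_bins)]


spectrum = [1.0] * 3000            # made-up spectrum with 3000 points
bins = bin_spectrum(spectrum, 14)  # 3000 points reduced to 14 bins
print(len(bins), sum(bins))
```

Summing within each bin preserves the total signal while reducing 3000 variables to 14.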

Data Normalization/Scaling: Can scale to sample or scale to feature. Scaling to the whole sample controls for dilution. Options: normalize to integrated area, probabilistic quotient method, internal standard, or a sample-specific value (weight or volume of sample). The choice depends on the sample and circumstances. [Figure caption: Same or different?]

Data Normalization/Scaling: Can scale to sample or scale to feature. Scaling to feature(s) helps manage outliers. Several feature scaling options are available: log transformation, auto-scaling, Pareto scaling, probabilistic quotient, and range scaling. MetaboAnalyst http://www.metaboanalyst.ca; Dieterle F et al. Anal Chem. 2006 Jul 1;78(13):4281-90.

Data QC, Outlier Removal & Data Reduction: Data filtering (remove solvent peaks, noise filtering, false positives, outlier removal; outlier removal needs justification). Dimensional reduction or feature selection to reduce the number of features or factors to consider (PCA or PLS-DA). Clustering to find similarity.

MetaboAnalyst: Web server designed to handle large sets of LC-MS, GC-MS or NMR-based metabolomic data. Supports both univariate and multivariate data processing, including t-tests, ANOVA, PCA and PLS-DA. Identifies significantly altered metabolites, produces colorful plots, and provides detailed explanations and summaries. Links significant metabolites to pathways via SMPDB. http://www.metaboanalyst.ca

MetaboAnalyst Workflow: data pre-processing; data normalization; data analysis; data annotation.

MetaboAnalyst flowchart (labels recovered from the figure):
Data input: GC/LC-MS raw spectra; peak lists; spectral bins; concentration table.
Data processing: spectra processing; peak processing; noise filtering; missing value estimation.
Data integrity check.
Data normalization: row-wise normalization; column-wise normalization; combined approach.
Statistical exploration, two/multi-group analysis: univariate analysis; correlation analysis; chemometric analysis; feature selection; cluster analysis; classification.
Statistical exploration, time-series analysis: data overview; two-way ANOVA; ANOVA-SCA; time-course analysis.
Functional interpretation, enrichment analysis: over representation analysis; single sample profiling; quantitative enrichment analysis.
Functional interpretation, pathway analysis: enrichment analysis; topology analysis; interactive visualization.
Quality checking: methods comparison; temporal drift; batch effect; biological checking.
Image center: resolution 150/300/600 dpi; format png, tiff, pdf, svg, ps.
Other utilities: peak searching; pathway mapping; name/ID conversion; lipidomics.
Outputs: processed data; result tables; analysis report; images.

MetaboAnalyst Overview: Raw data processing using MetaboAnalyst. Data reduction & statistical analysis. Functional enrichment analysis using MSEA in MetaboAnalyst. Metabolic pathway analysis using MetPA in MetaboAnalyst.

Example Datasets: Click the “Data Formats” link.

Example Datasets: Right-click the “download” link of the first example to save it to your local computer.

Metabolomic Data Processing

Common Tasks. Purpose: to convert various raw data forms into data matrices suitable for statistical analysis. Supported data formats: concentration tables (targeted analysis); peak lists (untargeted); spectral bins (untargeted); raw spectra (untargeted).

Data Upload: Go back to the home page and click “click here to start” to upload the file.

Alternatively …

Data Set Selected: Here we will be selecting a data set from dairy cattle fed different proportions of cereal grains (0%, 15%, 30%, 45%). The rumen was analyzed by NMR spectroscopy using quantitative metabolomic techniques. High grain diets are thought to be stressful on cows.

Data Integrity Check

Data Normalization

Data Normalization: At this point, the data has been transformed to a matrix with the samples in rows and the variables (compounds/peaks/bins) in columns. MetaboAnalyst offers three types of normalization: row-wise normalization, column-wise normalization and combined normalization. Row-wise normalization aims to make the samples (rows) comparable to each other (e.g. urine samples with different dilution effects).
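Row-wise normalization can be sketched as follows. This is a minimal illustration, not MetaboAnalyst's code: the function names and the three-sample table are made up, and the probabilistic quotient method is shown in its simplest form, with the median spectrum as reference and no prior integral normalization.

```python
from statistics import median


def normalize_by_sum(table, target=100.0):
    """Scale each sample (row) so that its values add up to `target`."""
    return [[target * v / sum(row) for v in row] for row in table]


def pqn(table):
    """Probabilistic quotient normalization against the median spectrum."""
    n_features = len(table[0])
    reference = [median(row[j] for row in table) for j in range(n_features)]
    result = []
    for row in table:
        quotients = [row[j] / reference[j]
                     for j in range(n_features) if reference[j] > 0]
        dilution = median(quotients)      # most probable dilution factor
        result.append([v / dilution for v in row])
    return result


samples = [                 # made-up data: rows are samples, columns are compounds
    [10.0, 20.0, 30.0],
    [20.0, 40.0, 60.0],     # same profile, twice as concentrated
    [5.0, 10.0, 15.0],      # same profile, half as concentrated
]
print(normalize_by_sum(samples))
print(pqn(samples))
```

After either method the three rows become identical, which is the intended effect when samples differ only by dilution.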

Data Normalization: Column-wise normalization aims to make the variables (columns) comparable in scale to each other, and tends to give each variable a more “normal” distribution. This procedure is useful when variables are of very different orders of magnitude. Four methods have been implemented for this purpose: log transformation, autoscaling, Pareto scaling and range scaling.
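The four column-wise methods can be written out with their usual textbook definitions. The sketch below is an assumption-laden illustration rather than MetaboAnalyst's implementation: the log base (2), the use of the sample standard deviation and the example column are all choices made here.

```python
from math import log2, sqrt
from statistics import mean, stdev


def log_transform(column):
    """Log transformation (base 2 assumed); values must be strictly positive."""
    return [log2(v) for v in column]


def autoscale(column):
    """Mean-center, then divide by the standard deviation."""
    m, s = mean(column), stdev(column)
    return [(v - m) / s for v in column]


def pareto_scale(column):
    """Mean-center, then divide by the square root of the standard deviation."""
    m, s = mean(column), stdev(column)
    return [(v - m) / sqrt(s) for v in column]


def range_scale(column):
    """Mean-center, then divide by the range (max minus min)."""
    m, span = mean(column), max(column) - min(column)
    return [(v - m) / span for v in column]


column = [1.0, 2.0, 4.0, 8.0, 16.0]   # made-up concentrations of one compound
for method in (log_transform, autoscale, pareto_scale, range_scale):
    print(method.__name__, [round(v, 3) for v in method(column)])
```

Autoscaling gives every variable unit variance, while Pareto scaling shrinks large variables less aggressively and so keeps more of the original data structure.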

Normalization Result

Quality Control. Dealing with outliers: detected mainly by visual inspection; may be corrected by normalization; may be excluded. Noise reduction: more of a concern for spectral bins/peak lists; usually improves downstream results.

Visual Inspection: What does an outlier look like? Finding outliers via PCA. Finding outliers via heatmap.

Outlier Removal

Noise Reduction

Noise Reduction (cont.): Characteristics of noise & uninformative features: low intensities; low variances (default).
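A variance-based filter of the kind described above could look like the following sketch. The function name, the `keep_fraction` parameter and the data are hypothetical; the filters actually offered by MetaboAnalyst and their thresholds are not reproduced here.

```python
from statistics import variance


def filter_low_variance(table, names, keep_fraction=0.75):
    """Keep the most variable features; `table` has samples in rows."""
    columns = list(zip(*table))
    ranked = sorted(range(len(names)),
                    key=lambda j: variance(columns[j]), reverse=True)
    n_keep = max(1, round(keep_fraction * len(names)))
    kept = sorted(ranked[:n_keep])        # restore the original column order
    filtered = [[row[j] for j in kept] for row in table]
    return filtered, [names[j] for j in kept]


table = [                                 # made-up data
    [1.0, 5.0, 0.10, 9.0],
    [2.0, 5.1, 0.10, 3.0],
    [3.0, 4.9, 0.11, 6.0],
]
names = ["A", "B", "C", "D"]
filtered, kept_names = filter_low_variance(table, names, keep_fraction=0.5)
print(kept_names)
```

Features B and C barely change across samples, so they are dropped as uninformative. Because raw variance depends on scale, a filter like this is usually applied before column-wise scaling.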

Data Reduction and Statistical Analysis

Common tasks: To identify important features. To detect interesting patterns. To assess differences between the phenotypes. To facilitate classification or prediction. NOW ON YOUR OWN.

ANOVA

View Individual Compounds

Questions Q: Which compounds show significant difference among all the neighboring groups (0-15, 15-30, and 30-45)? Q: For Uracil, are groups 15, 30, 45 significantly different from each other?

Overall correlation pattern

High resolution image: specify format; specify resolution; specify size.

Question Q: In untargeted metabolomics using NMR, researchers often look for region(s) of the spectra showing the biggest change in their correlation patterns under different conditions. Can you do that in MetaboAnalyst? Hint: check the available parameters of Correlation analysis.

Template Matching: Looking for compounds showing interesting patterns of change. Essentially a method to look for linear trends or periodic trends in the data. Best for data that has 3 or more groups.

Template Matching (cont.): Strong positive linear correlation with grain %. Strong negative linear correlation with grain %.

Question Q: Identify compounds that decrease in the first three groups but increase in the last group.
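The idea behind template matching can be sketched as a Pearson correlation between each compound and a pattern defined over the groups. This is a minimal illustration with invented compound names and values; it assumes a linear 1-2-3-4 template across the four grain levels and does not reproduce MetaboAnalyst's own implementation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def match_template(profiles, groups, template):
    """Correlate each compound with a per-group template, strongest first."""
    pattern = [template[g] for g in groups]
    scores = {name: pearson(values, pattern) for name, values in profiles.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


groups = [0, 0, 15, 15, 30, 30, 45, 45]     # grain % of each sample
template = {0: 1, 15: 2, 30: 3, 45: 4}      # linear increase with grain %
profiles = {                                # made-up concentrations
    "compound_up": [1.0, 1.2, 2.1, 1.9, 3.0, 3.2, 4.1, 3.9],
    "compound_down": [4.0, 4.2, 3.1, 2.9, 2.0, 2.1, 1.0, 0.9],
    "compound_flat": [2.0, 2.1, 1.9, 2.0, 2.1, 1.9, 2.0, 2.0],
}
for name, r in match_template(profiles, groups, template):
    print(f"{name}: r = {r:+.2f}")
```

Other patterns are found by changing the template values assigned to the groups, for example a periodic template instead of a linear one.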

PCA Scores Plot

PCA Loading Plot: compounds most responsible for separation.

3D-PCA

Question Q: Identify compounds that contribute most to the separation between group 15 and 45

PLS-DA Score Plot

Evaluation of PLS-DA Model: The PLS-DA model is evaluated by cross-validation, using Q2 and R2. Adding more components to the model improves the quality of fit (R2), but try to keep the number of components small. A 3-component model seems to be a good compromise here, with good R2/Q2 (>0.7).
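As a rough guide to how R2 and Q2 differ, the sketch below computes both from the same formula, 1 minus the ratio of residual to total sum of squares. R2 uses predictions from a model fitted on all samples, while Q2 uses predictions made for samples that were held out during cross-validation. The numbers are invented, no PLS-DA model is actually fitted, and the single shared mean in the denominator is a simplification.

```python
def r_squared(observed, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_obs = sum(observed) / len(observed)
    rss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    tss = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - rss / tss


y = [0, 0, 0, 1, 1, 1]                             # class labels coded as 0/1
fitted = [0.1, 0.0, 0.2, 0.9, 1.0, 0.8]            # made-up fit on all samples
cross_validated = [0.3, 0.1, 0.4, 0.6, 0.9, 0.7]   # made-up held-out predictions

r2 = r_squared(y, fitted)
q2 = r_squared(y, cross_validated)
print(f"R2 = {r2:.3f}, Q2 = {q2:.3f}")
```

R2 can only go up as components are added, whereas Q2 typically peaks and then falls once the model starts fitting noise, which is why the number of components is chosen where Q2 is still high.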

Important Compounds

Model Validation

Questions Q: What does p < 0.01 mean? Q: How many permutations need to be performed if you want to claim p value < 0.0001?
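The link between the number of permutations and the smallest p value that can be reported can be seen in a generic permutation test. The sketch below is an assumption-based illustration: it uses the difference in group means as the test statistic, not the statistic MetaboAnalyst uses for PLS-DA, and the function name, seed and data are made up.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=1000, seed=0):
    """Empirical p value for the absolute difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)               # randomly reassign the class labels
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:      # tolerance for rounding error
            extreme += 1
    # Count the observed labelling itself, so p can never be exactly zero.
    return (extreme + 1) / (n_permutations + 1)


low = [1.0, 1.2, 0.9, 1.1, 1.0]           # made-up measurements
high = [2.0, 2.2, 1.9, 2.1, 2.0]
print(permutation_p_value(low, high, n_permutations=999))
```

With this convention the smallest possible result is 1 / (n_permutations + 1), so the achievable resolution of the p value is set directly by how many permutations are run.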

Heatmap Visualization Note that the Heatmap is not being clustered on Rows (i.e. the % grain in diet)

Heatmap Visualization (cont.)

Question Q: Identify compounds with a low concentration in groups 0 and 15 that increase in groups 30 and 45. Q: Which compound is the only one significantly increased in group 45?

Download Results

Analysis Report

Metabolite Set Enrichment Analysis

Metabolite Set Enrichment Analysis (MSEA): Web tool designed to handle lists of metabolites (with or without concentration data). Modeled after Gene Set Enrichment Analysis (GSEA). Supports over representation analysis (ORA), single sample profiling (SSP) and quantitative enrichment analysis (QEA). Contains a library of 6300 pre-defined metabolite sets, including 85 pathway sets & 850 disease sets. http://www.msea.ca or MetaboAnalyst.

Enrichment Analysis. Purpose: to test whether there are biologically meaningful groups of metabolites that are significantly enriched in your data. Biologically meaningful groups: pathways; disease; localization. Currently only supports human metabolomic data.

MSEA accepts 3 kinds of input files: 1) a list of metabolite names only (ORA); 2) a list of metabolite names + concentration data from a single sample (SSP); 3) a concentration table with a list of metabolite names + concentrations for multiple samples/patients (QEA).

The MSEA approach (labels recovered from the flowchart):
Over Representation Analysis: compound concentrations; compound selection (t-tests, clustering); important compound lists (ORA input for MSEA).
Single Sample Profiling: compound concentrations; compare to normal references; abnormal compounds.
Quantitative Enrichment Analysis: compound concentrations; assess metabolite sets directly.
All three routes: find enriched biological themes using the metabolite set libraries; biological interpretation.

Data Set Selected: Here we are using a collection of metabolites identified by NMR (compound list + concentrations) in urine from 77 lung and colon cancer patients, some of whom were suffering from cachexia (muscle wasting).

Start with a Compound List

Upload Compound List: Normally GSEA would require a list of all known genes for the given platform. Here we just use the list of metabolites found in KEGG. ORA is a “weak” analysis in MSEA.
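Over representation analysis is commonly based on a hypergeometric test, which asks how likely it is to see at least the observed number of listed compounds in a metabolite set by chance, given the size of the reference list. The sketch below assumes that test and uses invented counts; it is not taken from the MSEA source code.

```python
from math import comb


def ora_p_value(n_background, n_in_set, n_hits_total, n_hits_in_set):
    """P(X >= n_hits_in_set) under the hypergeometric distribution.

    n_background  : compounds in the reference list (e.g. all KEGG metabolites)
    n_in_set      : compounds of that reference belonging to the metabolite set
    n_hits_total  : compounds in the uploaded list
    n_hits_in_set : uploaded compounds that fall in the metabolite set
    """
    total = comb(n_background, n_hits_total)
    upper = min(n_in_set, n_hits_total)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_hits_total - k)
        for k in range(n_hits_in_set, upper + 1)
    )
    return tail / total


# Made-up counts: 3 of 10 uploaded compounds fall in a 20-compound pathway,
# against a reference list of 1000 compounds.
print(ora_p_value(1000, 20, 10, 3))
```

The result depends strongly on the reference list, which is why the choice of background (here, the metabolites found in KEGG) matters for ORA.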

Compound Name Standardization

Name Standardization (cont.)

Select a Metabolite Set Library

Result

Result (cont.)

The Matched Metabolite Set

Single Sample Profiling (Basically used by a physician to analyze a patient)

Single Sample Profiling (cont.)

Concentration Comparison

Concentration Comparison (cont.)

Quantitative Enrichment Analysis

Result

The Matched Metabolite Set

Question Q: Are these metabolites increased or decreased in the cachexia group?

Metabolic Pathway Analysis with MetPA

Pathway Analysis. Purpose: to extend and enhance metabolite set enrichment analysis for pathways by considering the pathway structures and supporting pathway visualization. Currently supports 15 organisms.

Data Upload

Data Set Selected: Here we are using a collection of metabolites identified by NMR (compound list + concentrations) in urine from 77 lung and colon cancer patients, some of whom were suffering from cachexia (muscle wasting).

Normalization

Pathway Libraries

Network Topology Analysis

Position Matters: Which positions are important? Hubs: nodes that are highly connected (red ones). Bottlenecks: nodes on many shortest paths between other nodes (blue ones). Graph theory: degree centrality; betweenness centrality. Junker et al. BMC Bioinformatics 2006.

Which Node is More Important? High degree centrality or high betweenness centrality?
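The two centrality measures can be computed on a toy network to show that a bottleneck need not be a hub. The sketch below uses Brandes' algorithm for betweenness on an unweighted, undirected graph; the network (two four-node cliques joined through a bridge node "B") and all names are invented, and MetPA's own scoring is not reproduced.

```python
from collections import deque
from itertools import combinations


def build_graph(edges):
    """Adjacency sets for an undirected graph given as (u, v) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def degree_centrality(graph):
    """Number of direct neighbours of each node (unnormalized)."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def betweenness_centrality(graph):
    """Unnormalized betweenness via Brandes' algorithm (unweighted, undirected)."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        stack = []
        predecessors = {node: [] for node in graph}
        n_paths = {node: 0 for node in graph}
        distance = {node: -1 for node in graph}
        n_paths[source], distance[source] = 1, 0
        queue = deque([source])
        while queue:                      # breadth-first search from source
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    n_paths[w] += n_paths[v]
                    predecessors[w].append(v)
        dependency = {node: 0.0 for node in graph}
        while stack:                      # accumulate from the farthest nodes back
            w = stack.pop()
            for v in predecessors[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    return {node: value / 2 for node, value in score.items()}  # pairs counted twice


left = ["a1", "a2", "a3", "a4"]
right = ["b1", "b2", "b3", "b4"]
edges = list(combinations(left, 2)) + list(combinations(right, 2))
edges += [("a1", "B"), ("B", "b1")]       # "B" bridges the two cliques
graph = build_graph(edges)

degree = degree_centrality(graph)
betweenness = betweenness_centrality(graph)
print("B :", degree["B"], betweenness["B"])
print("a1:", degree["a1"], betweenness["a1"])
```

Here "B" has the lowest degree in the network but the highest betweenness, because every shortest path between the two cliques passes through it.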

Pathway Visualization

Pathway Visualization (cont.)

Question Q: Which pathway do you think is likely to be affected the most? Why?

Result

Not Everything Was Covered: Clustering (K-means, SOM). Classification (SVM, random forests). Time-series data analysis. Two-factor data analysis. Data quality checks. Peak searching. ….

Time Series Analysis in MetaboAnalyst

Quality Checking Module