Canadian Bioinformatics Workshops www.bioinformatics.ca.

Slides:



Advertisements
Similar presentations
This demo will show the analysis functionality of Phenom-Networks based on a dataset generated in the Hebrew University, the Faculty of Agriculture in.
Advertisements

Regression analysis Relating two data matrices/tables to each other Purpose: prediction and interpretation Y-data X-data.
Data Mining Methodology 1. Why have a Methodology  Don’t want to learn things that aren’t true May not represent any underlying reality ○ Spurious correlation.
From the homework: Distribution of DNA fragments generated by Micrococcal nuclease digestion mean(nucs) = bp median(nucs) = 110 bp sd(nucs+ = 17.3.
Visual Documentation v User Interface Active class (for selection and some processes)
Cluster Analysis.  What is Cluster Analysis?  Types of Data in Cluster Analysis  A Categorization of Major Clustering Methods  Partitioning Methods.
10/17/071 Read: Ch. 15, GSF Comparing Ecological Communities Part Two: Ordination.
Exploring Microarray data Javier Cabrera. Outline 1.Exploratory Analysis Steps. 2.Microarray Data as Multivariate Data. 3.Dimension Reduction 4.Correlation.
PATTERN RECOGNITION : PRINCIPAL COMPONENTS ANALYSIS Prof.Dr.Cevdet Demir
Hub Queue Size Analyzer Implementing Neural Networks in practice.
Visual Documentation v User Interface Active class (for selection and some processes)
Gene Set Enrichment Analysis Petri Törönen petri(DOT)toronen(AT)helsinki.fi.
Version 4 for Windows NEX T. Welcome to SphinxSurvey Version 4,4, the integrated solution for all your survey needs... Question list Questionnaire Design.
Metabolomic Data Processing & Statistical Analysis
Analyzing Metabolomic Datasets Jack Liu Statistical Science, RTP, GSK
Microarray Gene Expression Data Analysis A.Venkatesh CBBL Functional Genomics Chapter: 07.
8/20/2015Slide 1 SOLVING THE PROBLEM The two-sample t-test compare the means for two groups on a single variable. the The paired t-test compares the means.
CENTRE FOR INNOVATION, RESEARCH AND COMPETENCE IN THE LEARNING ECONOMY Session 2: Basic techniques for innovation data analysis. Part I: Statistical inferences.
© 2013 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Copyright 2000, Media Cybernetics, L.P. Array-Pro ® Analyzer Software.
The Tutorial of Principal Component Analysis, Hierarchical Clustering, and Multidimensional Scaling Wenshan Wang.
2007 GeneSpring MS GeneSpring for Metabolite BioMarker Analysis using Mass Spectrometry data Agilent Q-TOF VIP Visit Jan 16-17, 2007 Santa Clara, CA Thon.
Analysis and Management of Microarray Data Dr G. P. S. Raghava.
DNA microarray technology allows an individual to rapidly and quantitatively measure the expression levels of thousands of genes in a biological sample.
Basic concepts in ordination
ArrayCluster: an analytic tool for clustering, data visualization and module finder on gene expression profiles 組員:李祥豪 謝紹陽 江建霖.
StAR web server tutorial for ROC Analysis. ROC Analysis ROC Analysis: This module allows the user to input data for several classifiers to be tested.
Networks and Interactions Boo Virk v1.0.
Basic features for portal users. Agenda - Basic features Overview –features and navigation Browsing data –Files and Samples Gene Summary pages Performing.
1 Labs for course #412 Analyzing Microarray Data using the mAdb System July 15-16, :00pm- 4:00pm First, look at the questions on the bottom of each.
Microarray data analysis David A. McClellan, Ph.D. Introduction to Bioinformatics Brigham Young University Dept. Integrative Biology.
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
1 (21) EZinfo Introduction. 2 (21) EZinfo  A Software that makes data analysis easy  Reveals patterns, trends, groups, outliers and complex relationships.
Metabolomics Metabolome Reflects the State of the Cell, Organ or Organism Change in the metabolome is a direct consequence of protein activity changes.
Multiple Regression Petter Mostad Review: Simple linear regression We define a model where are independent (normally distributed) with equal.
Clustering Features in High-Throughput Proteomic Data Richard Pelikan (or what’s left of him) BIOINF 2054 April
1 ArrayTrack Demonstration National Center for Toxicological Research U.S. Food and Drug Administration 3900 NCTR Road, Jefferson, AR
Experiment Design Overview Number of factors 1 2 k levels 2:min/max n - cat num regression models2k2k repl interactions & errors 2 k-p weak interactions.
Chapter 3 Response Charts.
Innovative Paths to Better Medicines Design Considerations in Molecular Biomarker Discovery Studies Doris Damian and Robert McBurney June 6, 2007.
Radial Basis Function ANN, an alternative to back propagation, uses clustering of examples in the training set.
Project of CZ5225 Zhang Jingxian:
Analyzing Expression Data: Clustering and Stats Chapter 16.
Spatial Smoothing and Multiple Comparisons Correction for Dummies Alexa Morcom, Matthew Brett Acknowledgements.
PATTERN RECOGNITION : PRINCIPAL COMPONENTS ANALYSIS Richard Brereton
The Broad Institute of MIT and Harvard Differential Analysis.
Tutorial I: Missing Value Analysis
Tutorial 8 Gene expression analysis 1. How to interpret an expression matrix Expression data DBs - GEO Clustering –Hierarchical clustering –K-means clustering.
Metabolomics MS and Data Analysis PCB 5530 Tom Niehaus Fall 2015.
Statistical Analysis for Expression Experiments Heather Adams BeeSpace Doctoral Forum Thursday May 21, 2009.
Applying MetaboAnalyst
CCLE Cancer Cell Line Encyclopedia Alexey Erohskin.
Multivariate statistical methods. Multivariate methods multivariate dataset – group of n objects, m variables (as a rule n>m, if possible). confirmation.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Chapter 14 EXPLORATORY FACTOR ANALYSIS. Exploratory Factor Analysis  Statistical technique for dealing with multiple variables  Many variables are reduced.
Strategies for Metabolomic Data Analysis Dmitry Grapov, PhD.
Advanced Strategies for Metabolomic Data Analysis Dmitry Grapov, PhD.
Micro array Data Analysis. Differential Gene Expression Analysis The Experiment Micro-array experiment measures gene expression in Rats (>5000 genes).
Metabolomics Data Analysis
CellExpress Tutorial A Comprehensive Microarray-Based Cancer Cell Line and Clinical Sample Gene Expression Analysis Online System :8080 NTU.
A new R package statTarget Hemi Luan Hong Kong Baptist University.
Exploring Microarray data
Fig. 1. proFIA approach for peak detection and quantification
Canadian Bioinformatics Workshops
Volume 3, Issue 1, Pages (July 2016)
Single Sample Expression-Anchored Mechanisms Predict Survival in Head and Neck Cancer Yang et al Presented by Yves A. Lussier MD PhD The University.
The Omics Dashboard.
Cecal metabolome during C. difficile colonization and infection.
Assignment 1: Classification by K Nearest Neighbors (KNN) technique
Presentation transcript:

Canadian Bioinformatics Workshops

2Module #: Title of Module

Module 5 Metabolomic Data Analysis Using MetaboAnalyst

Learning Objectives To become familiar with the standard metabolomics data analysis workflow To become aware of key elements such as: data integrity checking, outlier detection, quality control, normalization, scaling, etc. To learn how to use MetaboAnalyst to facilitate data analysis

A Typical Metabolomics Experiment

2 Routes to Metabolomics ppm Quantitative (Targeted) Methods Chemometric (Profiling) Methods

Metabolomics Data Workflow Data Integrity Check Spectral alignment or binning Data normalization Data QC/outlier removal Data reduction & analysis Compound ID Data Integrity Check Compound ID and quantification Data normalization Data QC/outlier removal Data reduction & analysis Chemometric Methods Targeted Methods

Data Integrity/Quality LC-MS and GC-MS have high number of false positive peaks Problems with adducts (LC), extra derivatization products (GC), isotopes, breakdown products (ionization issues), etc. Not usually a problem with NMR Check using replicates and adduct calculators MZedDB HMDB

Data/Spectral Alignment Important for LC-MS and GC-MS studies Not so important for NMR (pH variation) Many programs available (XCMS, ChromA, Mzmine) Most based on time warping algorithms

bin1 bin2 bin3 bin4 bin5 bin6 bin7 bin8... x i,y i x = (AOC) y = 10 (bin #) Binning (3000 pts to 14 bins)

Data Normalization/Scaling Can scale to sample or scale to feature Scaling to whole sample controls for dilution Normalize to integrated area, probabilistic quotient method, internal standard, sample specific (weight or volume of sample) Choice depends on sample & circumstances Same or different?

Data Normalization/Scaling Can scale to sample or scale to feature Scaling to feature(s) helps manage outliers Several feature scaling options available: log transformation, auto- scaling, Pareto scaling, probabilistic quotient, and range scaling MetaboAnalyst Dieterle F et al. Anal Chem Jul 1;78(13):

Data QC, Outlier Removal & Data Reduction Data filtering (remove solvent peaks, noise filtering, false positives, outlier removal -- needs justification) Dimensional reduction or feature selection to reduce number of features or factors to consider (PCA or PLS-DA) Clustering to find similarity

MetaboAnalyst A comprehensive web server designed to process & analyze LC-MS, GC-MS or NMR-based metabolomic data

MetaboAnalyst History 2009 v1.0 - Supports both univariate and multivariate data processing, including t- tests, ANOVA, PCA, PLS-DA, colorful plots, with detailed explanations & summaries 2012 v2.0 - Identifies significantly altered functions & pathways 2015 v3.0 – Better performance, better graphical interactivity, biomarker analysis, power analysis, integration with gene expression data …

MetaboAnalyst Overview Raw data processing Data reduction & statistical analysis Functional enrichment analysis Metabolic pathway analysis Power analysis and sample size estimation Biomarker analysis Integrative analysis

MetaboAnalyst Modules 17 Data pre- processing Data normalization Data analysis Data interpretation

MetaboAnalyst Modules

Example Datasets

Metabolomic Data Processing

Common Tasks Purpose: to convert various raw data forms into data matrices suitable for statistical analysis Supported data formats –Concentration tables (Targeted Analysis) –Peak lists (Untargeted) –Spectral bins (Untargeted) –Raw spectra (Untargeted)

Select a Module (Statistical Analysis)

Data Upload

Alternatively …

Data Set Selected Here we have selected a data set from dairy cattle fed different proportions of cereal grains (0%, 15%, 30%, 45%) The rumen was analyzed using NMR spectroscopy using quantitative metabolomic techniques High grain diets are thought to be stressful on cows

Data Integrity Check

Data Normalization Samples = rows Compounds = columns

Data Normalization At this point, the data has been transformed to a matrix with the samples in rows and the variables (compounds/peaks/bins) in columns MetaboAnalyst offers three types of normalization, row-wise normalization, column-wise normalization and combined normalization Row-wise normalization aims to make each sample (row) comparable to each other (i.e. urine samples with different dilution effects)

Data Normalization Column-wise normalization aims to make each variable (column) comparable in scale to each other, thereby generating a “normal” distribution This procedure is useful when variables are of very different orders of magnitude Four methods have been implemented for this purpose – log transformation, autoscaling, Pareto scaling and range scaling

Normalization Result

Data Normalization You cannot know a priori what the best normalization protocol will be MetaboAnalyst allows you to interactively explore different normalization protocols and to visually inspect the degree of “normality” or Gaussian behavior This example is nicely normalized

Next Steps After normalization has been completed it is a good idea to look at your data a little further to identify outliers or noise that could/should be removed

Quality Control Dealing with outliers –Detected mainly by visual inspection –May be corrected by normalization –May be excluded Noise reduction –More of a concern for spectral bins/ peak lists –Usually improves downstream results

Visual Inspection What does an outlier look like? Finding outliers via PCAFinding outliers via Heatmap

Outlier Removal (Data Editor)

Noise Reduction (Data Filtering)

Noise Reduction (cont.) Characteristics of noise & uninformative features –Low intensities –Low variances (default)

Data Reduction and Statistical Analysis

Common Tasks To identify important features To detect interesting patterns To assess difference between the phenotypes To facilitate classification or prediction We will look at ANOVA, Multivariate Analysis (PCA, PLS-DA) and Clustering

ANOVA Looking at 4 different dairy cow populations –0% grain in diet –15% grain in diet –30% grain in diet –45% grain in diet Try to identify those metabolites that are different between all groups or just between 0% and everything else

ANOVA Click this spot and the 3-PP graph pops up Click this to view the table

View Individual Compounds Click this to see the uracil graphs

What’s Next? Click and compare different compounds to see which ones are most different or most similar between the 4 groups Click on the Correlation link (under the ANOVA link) to generate a heat map that displays the pairwise compound correlations and compound clusters

Overall Correlation Pattern Click this to save a high res. image

High Resolution Image

What’s Next? When looking at >2 groups it is often useful to look for patterns or trends within particular metabolites Use Pattern Hunter to find these trends

Pattern Matching Looking for compounds showing interesting patterns of change Essentially a method to look for linear trends or periodic trends in the data Best for data that has 3 or more groups

Pattern Matching (cont.) Strong linear + correlation to grain % Strong linear - correlation to grain %

Multivariate Analysis Use PCA option to view the separation (if any) in the 4 groups Look at the 2D PCA Score Plot –2 most significant principal components Look at the 2D PCA Loading Plot Look at the PCA Plot in 3D –3 most significant principal components Options for viewing are located in the top tabs

PCA Scores Plot

PCA Loading Plot Compounds most responsible for separation Click on a point to view

3D Score Plot 55 Drag to rotate Mouse over to see sample names

Multivariate Analysis Use PLS-DA option to view the separation of the 4 (labeled) groups PLS-DA “rotates” the PCA axes to maximize separation Look at the 2D PLS Scores Plot Look at the Q 2 and R 2 Values (Cross Validation) Use the VIP plot to ID important metabolites (VIP > 1.2)

PLS-DA Score Plot

Evaluation of PLS-DA Model PLS-DA Model evaluated by cross validation of Q 2 and R 2 Using too many components can over-fit 3 component model seems to be a good compromise here Good R 2 /Q 2 (>0.7)

Important Compounds

Model Validation Note, permutation is computationally intensive. It is not performed by default. Users need to set the permutation number and press the submit button

Hierarchical Clustering (Heat Maps) An alternative way of viewing or clustering multivariate data Allows one to look at the behavior of individual metabolites Can ask questions such as: which compounds have a low concentration in group 0, 15 but increase in the group 35 and 45? or which compound is the only one significantly increased in group 45?

Heatmap Visualization Note that the Heatmap is not being clustered on Rows. It is ordered by the class labels

Heatmap Visualization (cont.)

What’s Next? Most of the multivariate analysis is now done MetaboAnalyst has been keeping track of the plots or graphs you have generated Now its time to generate a printed report that summarizes what you’ve done and what you’ve found

Download Results

Analysis Report

Select a Module (Enrichment Analysis)

Metabolite Set Enrichment Analysis (MSEA) Now part of Metaboanalyst Designed to handle lists of metabolites (with or without concentration data) Modeled after Gene Set Enrichment Analysis (GSEA) Supports over representation analysis (ORA), single sample profiling (SSP) and quantitative enrichment analysis (QEA) Contains a library of 6300 pre-defined metabolite sets including 85 pathway sets & 850 disease sets

Enrichment Analysis Purpose: To test if there are biologically meaningful groups of metabolites that are significantly enriched in your data Biological meaningful in terms of: –Pathways –Disease –Localization Currently, MSEA only supports human metabolomic data

MSEA Accepts 3 kinds of input files –list of metabolite names only (ORA – over representation analysis) –list of metabolite names + concentration data from a single sample (SSP – single sample profiling) –a concentration table with a list of metabolite names + concentrations for multiple samples/patients (QEA – quantitative enrichment analysis)

The MSEA Approach 73 Assess metabolite sets directly Compound concentrations Compound selection (t-tests, clustering) Abnormal compounds Compound concentrations Metabolite set libraries Over Representation Analysis Quantitative Enrichment Analysis Biological interpretation Find enriched biological themes Compound concentrations Important compound lists Single Sample Profiling Compare to normal references ORA input For MSEA ORA SSP QEA

Data Set Selected Here we are using a collection of metabolites identified by NMR (compound list + concentrations) from the urine from 77 lung and colon cancer patients, some of whom were suffering from cachexia (muscle wasting)

Start with a Compound List for ORA

Upload Compound List Normally GSEA would require a list of all known genes for the given platform. Here we just use the list of metabolites found in KEGG. ORA is a “weak” analysis in MSEA

Perform Compound Name Standardization

Name Standardization (cont.)

Select a Metabolite Set Library

Result

Result (cont.) Click on details to see more

The Matched Metabolite Set Click on SMPDB to see more information

Phenylalanine and Tyrosine Metabolism in SMPDB

Single Sample Profiling (SSP) (Basically used by a physician to analyze a patient)

Concentration Comparison

Concentration Comparison (cont.)

Quantitative Enrichment Analysis (QEA)

Result Click on details to see more

The Matched Metabolite Set

Select a Module (Pathway Analysis)

Pathway Analysis Purpose: to extend and enhance metabolite set enrichment analysis for pathways by –Considering pathway structures –Supporting pathway visualization Currently supports analysis for 21 diverse (model) organisms such as humans, mouse, drosophila, arabadopsis, E. coli, yeast, etc. (KEGG pathways only)

Data Set Selected Here we are using a collection of metabolites identified by NMR (compound list + concentrations) from the urine from 77 lung and colon cancer patients, some of whom were suffering from cachexia (muscle wasting)

Pathway Analysis Module

Data Upload

Perform Data Normalization

Select Pathway Libraries

Perform Network Topology Analysis

Pathway Position Matters 98 Junker et al. BMC Bioinformatics 2006  Which positions are important?  Hubs  Nodes that are highly connected (red ones)  Bottlenecks  Nodes on many shortest paths between other nodes (blue ones)  Graph theory  Degree centrality  Betweenness centrality

Which Node is More Important? High betweenness centrality High degree centrality

Pathway Visualization

Pathway Visualization (cont.)

Pathway Impact Incorporates parameters such as the log fold-change of the DE metabolites, the statistical significance of the set of pathway genes and the topology of the signaling pathway Combines the pathway topology with the over-representation evidence

Result

Select a Module (Biomarker Analysis)

Biomarker Analysis Purpose is to find biomarkers using ROC (receiver operator characteristic) curves with high sensitivity and specificity Maximize AUC under ROC curve while minimizing the number of metabolites used in the biomarker panel 3 different modules (univariate – single marker at a time, multivariate – many combinations of biomarkers, manual – user choice)

Select Test Data Set 1 Click Here

Data Set Selected 90 patients (expectant mothers) at 3 months pregnancy Serum samples 45 patients went on to develop pre- eclampsia at 6-7 months 45 patients had normal pregancies Trying to find biomarkers for predicting early pre-eclampsia

Perform Data Integrity Check Click Here

Perform Log Normalization Click Here

Check That It’s Normally Distributed Click Here beforeafter

Select Multivariate Option Click Here

View ROC Curve

Choose a Model (95% conf.) Select model

95% Confidence Interval

Select Sig. Features Tab Click Here

View VIP Plot

Select a Module (Power Analysis)

Statistical Power Statistical power is the ability of a test to detect an effect, if the effect actually exists –A power of 0.8 in a clinical trial means that the study has a 80% chance of ending up with a statistically significant treatment effect if there really was an important difference between treatments. To answer research questions: –How powerful is my study? –How many samples do I need to have for what I want to get from the study?

Statistical Power (cont.) The statistical power of a test depends: 1.Sample size, 2.Significance criterion (alpha) 3.Effect size Increase power Effect size Sample size Decrease Power Significance criterion

The Approach How do we get these values? –Effect size can be estimated from a pilot data; –Significance criteria Single metabolite - p value cutoff (i.e. 0.05, 0.01) Metabolomics data – FDR (i.e. 0.1) –Sample size is our interest –Power value is our interest You need to upload a pilot data, and set the criteria, MetaboAnalyst will compute a power vs. sample size curve by computing power values for a range of sample sizes [3, 1000]

Power vs. Sample size At least 60 samples/group will needed to get a power of 0.8

Not Everything Was Covered Clustering (K-means, SOM) Classification (SVM, randomForests) Time-series data analysis Two factor data analysis Integrative pathway analysis (gene and metabolite)

Time Series Analysis in MetaboAnalyst 123

Integrative Pathway Analysis