Multivariate Analysis of Pathways. Multivariate Approaches to Gene Set Selection.

Presentation transcript:

Multivariate Analysis of Pathways

Multivariate Approaches to Gene Set Selection

Key Multivariate Ideas
• PCA (Principal Components Analysis)
• SVD (Singular Value Decomposition)
• MDS (Multi-dimensional Scaling)
• Hotelling's T²

PCA
• Three correlated variables
• PC1 lies along the direction of maximal variation; PC2 lies at right angles to PC1, along the direction of next-highest variation
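A minimal sketch of finding PC1 for three correlated variables, assuming plain Python without a numerical library. The helper names (`covariance_matrix`, `first_pc`) and the toy data are illustrative, and power iteration stands in here for the SVD that a real analysis would normally use.

```python
import math

def covariance_matrix(rows):
    """Sample covariance of the columns of `rows` (samples x variables)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

def first_pc(cov, iterations=200):
    """Leading eigenvector and eigenvalue of `cov` by power iteration."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p))
    return v, eigenvalue

# Toy data (hypothetical): three variables that rise together.
data = [[1.0, 2.1, 2.9], [2.0, 3.9, 6.2], [3.0, 6.1, 8.8],
        [4.0, 8.0, 12.1], [5.0, 9.9, 15.0]]
cov = covariance_matrix(data)
pc1, variance_along_pc1 = first_pc(cov)
total_variance = sum(cov[i][i] for i in range(3))
print(pc1, variance_along_pc1 / total_variance)
```

The printed fraction is the share of total variation captured by PC1; for strongly correlated variables it is close to 1.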

Multivariate Representation of Pathways
• BAD pathway (figure legend: Normal, IBC, Other BC)
• Clear separation between groups
• Variation differences

Hotelling's T²
• Compute the distance between sample means using the (common) metric of covariation
• (The formula and its definitions were not recovered in this transcript)
• Multidimensional analog of the t (actually F) statistic
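For reference, the standard two-sample form of the statistic is sketched below. The notation is assumed here rather than taken from the slide: $n_1$, $n_2$ are the group sizes, $p$ is the number of genes in the set, $S_1$, $S_2$ are the group covariance matrices, and $S$ is the pooled covariance.

```latex
T^2 = \frac{n_1 n_2}{n_1 + n_2}\,
      (\bar{x}_1 - \bar{x}_2)^{\top} S^{-1} (\bar{x}_1 - \bar{x}_2),
\qquad
S = \frac{(n_1 - 1)\, S_1 + (n_2 - 1)\, S_2}{n_1 + n_2 - 2}

\frac{n_1 + n_2 - p - 1}{p\,(n_1 + n_2 - 2)}\; T^2 \;\sim\; F_{p,\; n_1 + n_2 - p - 1}
```

The second line is the sense in which the statistic is "actually F": under the usual normality and equal-covariance assumptions, a rescaled $T^2$ follows an F distribution.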

Principles of Kong et al. Method
• Normal covariation generally acts to preserve homeostasis
• The transcription of genes that participate in many processes will be changed
• The joint changes in genes will be most distinctive for those genes active in pathways that are working differently

Critiques of Hotelling's T²
• Small samples: unreliable Σ estimates
  – N < p
• Estimates of μ and Σ not robust to outliers
• Assumes same covariance in each sample
  – Σ₁ = Σ₂? Usually not in disease
  – Kong et al. propose an analog of the Welch t-test
  – Permutation of samples for significance

Making it Stable
1. Insufficient information to capture all relationships: too much correlation!
  – Power of Hotelling's method comes from identifying directions of rare variation
  – Many (spurious) directions of 0 variation
2. Random variation in data leads to random variation in PCA
• Regularization strategy: force covariance to be more like IID
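One common way to "force covariance to be more like IID" is to shrink the sample covariance toward a scaled identity matrix. The sketch below assumes that reading; the function name `shrink_covariance` and the weight `alpha` are illustrative choices, not the presenter's stated method.

```python
def shrink_covariance(cov, alpha):
    """Blend `cov` with a scaled identity: (1 - alpha) * cov + alpha * (trace / p) * I.

    Assumption: shrinkage toward the identity is one possible regularizer;
    the slide does not say which one was used.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    p = len(cov)
    mean_variance = sum(cov[i][i] for i in range(p)) / p
    return [[(1.0 - alpha) * cov[i][j] + (alpha * mean_variance if i == j else 0.0)
             for j in range(p)] for i in range(p)]

# A singular 2 x 2 covariance: perfect correlation, so one direction has 0 variation.
singular = [[1.0, 1.0], [1.0, 1.0]]
regularized = shrink_covariance(singular, alpha=0.2)
determinant = (regularized[0][0] * regularized[1][1]
               - regularized[0][1] * regularized[1][0])
print(regularized, determinant)
```

After shrinkage the determinant is positive, so the matrix can be inverted and the spurious direction of zero variation no longer dominates the statistic.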

Making it Robust
• Microarray data have many outliers
• Multivariate methods are badly distorted by outliers
• Robust estimates of covariance could give robust PCA
• Simple approach: trim outliers
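As an illustration of the "trim outliers" idea, here is a small sketch of a trimmed mean for one gene. The function name, the 10% trimming fraction, and the expression values are assumptions made for the example.

```python
def trimmed_mean(values, trim_fraction=0.1):
    """Mean after dropping the lowest and highest `trim_fraction` of the values."""
    if not 0.0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must lie in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)   # number dropped from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical log-expression values for one gene, with a single outlier.
expression = [7.9, 8.1, 8.0, 8.2, 7.8, 8.0, 8.1, 7.9, 8.0, 14.5]
plain = sum(expression) / len(expression)
robust = trimmed_mean(expression, 0.1)
print(plain, robust)
```

The plain mean is pulled upward by the single outlier, while the trimmed mean stays near the bulk of the data. The same trimming can be applied before estimating covariances.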

Handling Changes of Covariance
• Power of Hotelling's method comes from identifying directions of rare variation
• If one group shows little covariation in one direction but the other shows a lot, how do we test for changes?
• If one group is the control, then its directions of rare covariation should be taken as the standard
  – Robust measure of means in both groups

Detecting changes of covariance

Meaning of Covariance Change
• Meaning of covariance across individuals
  – Homeostasis in face of individual variation
  – e.g. BAD pathway: largest loadings of PC1 on PRKARB & ADCY1
  – PRKARB represses CREB1; ADCY activates CREB1
• Gene sets whose covariance diminishes may
  – be responding to different inputs
  – have escaped their usual regulatory control
• Characteristic of cancers

Testing Covariance Changes
• Idea: directions of small variation in one group should match directions of small variation in the other
• Mathematical approach
  – Find solutions $\lambda$ of $\det(S_1 - \lambda S_2) = 0$
  – Solutions should all be near 1, if no change
  – Test statistic: easily computed
• Computational approach
  – Ratio of largest to smallest: $\lambda_{\max} / \lambda_{\min}$
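A small sketch of the eigenvalue comparison for a two-gene set, where the problem has a closed form. It assumes the slide's "solutions" are the generalized eigenvalues, that is, the roots of det(S1 - lambda * S2) = 0; the function name and the two covariance matrices are hypothetical.

```python
import math

def generalized_eigenvalues_2x2(s1, s2):
    """Roots lambda of det(s1 - lambda * s2) = 0 for symmetric 2 x 2 matrices."""
    det1 = s1[0][0] * s1[1][1] - s1[0][1] ** 2
    det2 = s2[0][0] * s2[1][1] - s2[0][1] ** 2
    if det2 <= 0:
        raise ValueError("s2 must be non-singular with positive determinant")
    # Expanding the determinant gives: det2 * lambda^2 - cross * lambda + det1 = 0
    cross = s1[0][0] * s2[1][1] + s1[1][1] * s2[0][0] - 2.0 * s1[0][1] * s2[0][1]
    disc = max(cross * cross - 4.0 * det1 * det2, 0.0)  # guard against rounding
    root = math.sqrt(disc)
    return (cross - root) / (2.0 * det2), (cross + root) / (2.0 * det2)

control = [[2.0, 1.0], [1.0, 2.0]]
disease = [[2.0, 0.0], [0.0, 2.0]]   # same variances, covariation lost (hypothetical)
lam_min, lam_max = generalized_eigenvalues_2x2(disease, control)
same_min, same_max = generalized_eigenvalues_2x2(control, control)
print(lam_min, lam_max, lam_max / lam_min)
```

When the two matrices are identical both roots equal 1. Losing the covariation spreads the roots apart, which is what the ratio of largest to smallest picks up.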

Network Connectivity Methods

Network Topology
• Connections represent interactions:
  – Regulatory (one-way)
  – Protein interaction (two-way)
• Hubs are genes with many connections
• Bottlenecks are single genes that connect two parts of a functional network
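The two definitions can be illustrated on a toy network. This sketch treats every connection as two-way, which is a simplification for regulatory links, and assumes the network is connected; the gene labels and the degree cutoff are made up.

```python
from collections import defaultdict

def build_graph(edges):
    """Undirected adjacency sets from (gene, gene) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)

def reachable(graph, start, removed=None):
    """Nodes reachable from `start` when `removed` is deleted from the graph."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def hubs(graph, min_degree):
    """Genes with at least `min_degree` connections."""
    return sorted(g for g, nbrs in graph.items() if len(nbrs) >= min_degree)

def bottlenecks(graph):
    """Genes whose removal disconnects the rest of a connected network."""
    result = []
    for gene in graph:
        others = [g for g in graph if g != gene]
        if others and len(reachable(graph, others[0], removed=gene)) < len(others):
            result.append(gene)
    return sorted(result)

# Hypothetical network: two modules joined only through gene X.
edges = [("H", "A"), ("H", "B"), ("H", "C"), ("H", "D"), ("A", "B"), ("B", "C"),
         ("C", "D"), ("C", "X"), ("D", "X"), ("X", "E"), ("X", "F"),
         ("E", "F"), ("E", "G"), ("F", "G")]
graph = build_graph(edges)
print(hubs(graph, min_degree=4), bottlenecks(graph))
```

In this toy network H is a hub but not a bottleneck, because its neighbors stay connected without it, while X is both.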

Devising Tests Based on Topology
• Issues: how to weight more heavily the genes that are hubs
• How to assess directionality of change
• How to measure co-operativity (activation or repression changes in appropriate ways)

Draghici et al. Approach
• Overall measure
• Effective contribution (perturbation factor)
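The formulas for this slide did not survive in the transcript. As a hedged reminder only, the perturbation factor in the published impact-analysis method is usually written in the form below. The symbols are the commonly used ones and are not taken from the slide: $\Delta E(g)$ is the measured expression change of gene $g$, $US_g$ is the set of genes directly upstream of $g$, $\beta_{ug}$ is the sign and strength of the interaction from $u$ to $g$, and $N_{ds}(u)$ is the number of genes directly downstream of $u$.

```latex
PF(g) = \Delta E(g) + \sum_{u \in US_g} \beta_{ug}\, \frac{PF(u)}{N_{ds}(u)}
```

Read this way, a gene's effective contribution is its own change plus a share of the perturbation inherited from each upstream regulator.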

Analysis of Outliers

Outliers: Clues to Disease Process?
• Outliers usually reflect idiosyncratic events
• Recurrent outliers reflect rare events that are selected
• If a particular pathway is disrupted in disease, but by many different mechanisms, then the expression profiles should
  – Lose healthy covariance
  – Show recurrent outliers
• How to test for 'consistent' outliers?
• COPA: a method for flagging recurrent outliers in expression data
  – Finds consistent fusion gene

A Test Statistic for Consistent Outliers
• Ratio of quantile differences to normal variation: $(q_{.90} - q_{.10})_{\text{tumor}} \,/\, \max\big((q_{.90} - q_{.10})_{\text{normal}},\ 0.4\big)$
• Compare to null distribution by permutation
• Many genes show much higher ratios
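The ratio can be sketched for a single gene as follows. The quantile rule (linear interpolation), the function names, and the expression values are assumptions for the example; the 0.4 floor comes from the slide.

```python
def quantile(values, q):
    """Linear-interpolation quantile of `values` for 0 <= q <= 1."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])

def quantile_ratio(tumor, normal, floor=0.4):
    """(q.90 - q.10) in tumor divided by max((q.90 - q.10) in normal, floor)."""
    tumor_spread = quantile(tumor, 0.9) - quantile(tumor, 0.1)
    normal_spread = quantile(normal, 0.9) - quantile(normal, 0.1)
    return tumor_spread / max(normal_spread, floor)

# Hypothetical values: nine normals, eleven tumors with three gross outliers.
normal = [8.0, 8.1, 7.9, 8.0, 8.2, 7.8, 8.0, 8.1, 7.9]
tumor = [8.0, 8.1, 7.9, 8.0, 12.0, 12.4, 8.1, 7.8, 8.0, 13.0, 8.2]
ratio = quantile_ratio(tumor, normal)
print(ratio)
```

The floor keeps genes with almost no variation in the normals from producing huge ratios out of measurement noise alone.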

Statistical Significance
• Find confidence limits on the number of false positives by permutation
• Several hundred genes appear significant at 10-20% FDR
  – Actual scores: 267 scores are greater than 5, whereas 90% of permutations have fewer than 34 scores over 5
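The permutation comparison can be sketched on simulated data. Everything here is illustrative: the data are generated, the quantiles use a simple nearest-rank rule, and the helper names are hypothetical. Only the threshold of 5, the 0.4 floor, and the 90% limit echo the slides.

```python
import random

def spread(values):
    """Difference between the 90th and 10th percentiles (nearest-rank rule)."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered[int(0.9 * (n - 1))] - ordered[int(0.1 * (n - 1))]

def count_high_scores(matrix, is_tumor, threshold, floor=0.4):
    """Number of genes whose tumor/normal spread ratio exceeds `threshold`."""
    count = 0
    for gene_values in matrix:
        tumor = [v for v, t in zip(gene_values, is_tumor) if t]
        normal = [v for v, t in zip(gene_values, is_tumor) if not t]
        if spread(tumor) / max(spread(normal), floor) > threshold:
            count += 1
    return count

def permutation_limit(matrix, is_tumor, threshold, n_perm=200, level=0.9, seed=0):
    """Count that `level` of label permutations do not exceed."""
    rng = random.Random(seed)
    labels = list(is_tumor)
    counts = []
    for _ in range(n_perm):
        rng.shuffle(labels)   # reassign tumor/normal labels at random
        counts.append(count_high_scores(matrix, labels, threshold))
    counts.sort()
    return counts[int(level * (n_perm - 1))]

# Simulated data (hypothetical): 9 normals, 21 tumors, 60 genes.
rng = random.Random(1)
is_tumor = [False] * 9 + [True] * 21
matrix = []
for g in range(60):
    values = [rng.gauss(8.0, 0.2) for _ in is_tumor]
    if g < 10:   # first ten genes: gross over-expression in three random tumors
        for s in rng.sample(range(9, 30), 3):
            values[s] += 10.0
    matrix.append(values)

observed = count_high_scores(matrix, is_tumor, threshold=5.0)
limit = permutation_limit(matrix, is_tumor, threshold=5.0)
print(observed, limit)
```

Permuted counts are not zero, because the outlying samples still exist after relabeling. The comparison is between the observed count and the upper range of the permuted counts, as in the 267 versus 34 figures above.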

A Test for Functional Groups
• For each group G of genes:
  s_G <- sum(scores[G]) / sqrt(length(G))
• Scores: t-scores or range ratios
• PAGE (BMC Bioinformatics, 2005)
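The group score translates directly to Python. In this sketch the gene names and scores are hypothetical, and unlike the R expression the function skips group members that have no score, which is an added assumption.

```python
import math

def group_score(scores, group):
    """Sum of member scores divided by the square root of the group size."""
    members = [scores[g] for g in group if g in scores]   # unscored genes are skipped
    if not members:
        raise ValueError("group has no scored genes")
    return sum(members) / math.sqrt(len(members))

# Hypothetical per-gene scores (t-scores or range ratios).
scores = {"geneA": 2.0, "geneB": 1.0, "geneC": 3.0, "geneD": -0.5}
s_G = group_score(scores, ["geneA", "geneB", "geneC", "geneD"])
print(s_G)
```

Dividing by the square root of the group size, rather than by the size itself, keeps the variance of the score comparable across groups of different sizes when the member scores are independent with equal variance.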

Do Genes Make Sense?

Quantile Ratio
[1] "DNA replication"
[2] "response to pathogenic fungi"
[6] "cleavage of lamin"
[7] "spindle organization and biogenesis"
[15] "response to osmotic stress"
[16] "nutrient import"
[22] "response to mercury ion"

T-test
[2] "sodium ion homeostasis"
[3] "leukocyte adhesive activation"
[4] "positive regulation of calcium-independent cell-cell adhesion"
[5] "oxytocin receptor activity"
[6] "ADP biosynthesis"
[7] "dADP biosynthesis"
[10] "regulation of muscle contraction"
[11] "caveolar membrane"
[12] "response to cold"
[16] "stress fiber formation"
[18] "positive regulation of complement activation"
[19] "astrocyte activation"
[22] "regulation of long-term neuronal synaptic plasticity"
[24] "positive regulation of endocytosis"
[25] "embryonic hemopoiesis"

Cancer Functional Groups
• Do very probable cancer genes show high discrepancy in few samples?
• Program: identify genes that might contribute to cancer processes: growth signaling, loss of cell-matrix adhesion, apoptosis
1. Do most samples from these categories show at least one gross mis-regulation?
2. Are they the same genes in most samples?

Example: Cell Growth
• Select genes in GO: 'regulation of cell growth'
• Expect most samples to have at least one very seriously mis-regulated gene from this category
• Compute maximum aberration score across category
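The transcript does not define the aberration score, so the sketch below assumes one plausible form: distance from the normal median in units of normal spread, capped at 4 to match the color scale on the next slide. The function names, the floor, and the data are hypothetical.

```python
import statistics

def aberration_scores(values, normal_values, floor=0.4, cap=4.0):
    """|value - normal median| / max(normal stdev, floor), capped at `cap`.

    Assumed definition: the slides do not say how the score is computed.
    """
    center = statistics.median(normal_values)
    scale = max(statistics.stdev(normal_values), floor)
    return [min(abs(v - center) / scale, cap) for v in values]

def max_aberration_per_sample(score_rows):
    """Largest score in each sample; `score_rows` holds one list per gene."""
    return [max(column) for column in zip(*score_rows)]

# Hypothetical category of two genes measured in three tumor samples.
gene1 = aberration_scores([8.1, 9.0, 12.0], normal_values=[8.0, 8.2, 7.8])
gene2 = aberration_scores([7.0, 5.0, 5.2], normal_values=[5.0, 5.5, 4.5])
per_sample_max = max_aberration_per_sample([gene1, gene2])
print(per_sample_max)
```

Taking the maximum over the category asks whether each sample has at least one badly mis-regulated gene, without requiring it to be the same gene in every sample.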

Aberrations
• Aberration score indicated by color: vanilla: 0; red: 4
• Nine normals at left
• No gene misregulated in even 50% of samples
• BUT: Only a few genes commonly misregulated

Simplest Summary
• Maximum aberration score for samples

Testing the Pathway for Outliers
• Many genes show aberrations in tumor group
• Null distribution: medians of maxima from randomly selected gene groups of size 37
• P < .01
• NB: The results for cell-matrix interaction are very similar; angiogenesis not so strong
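The null distribution described above can be sketched with simulated scores. The data, the helper names, and the number of random draws are assumptions; the group size of 37 and the "median of maxima" summary follow the slide.

```python
import random
import statistics

def median_of_maxima(score_rows, gene_indices):
    """Median over samples of the maximum score within the chosen genes."""
    maxima = [max(score_rows[g][s] for g in gene_indices)
              for s in range(len(score_rows[0]))]
    return statistics.median(maxima)

def random_group_p_value(score_rows, group, n_draws=1000, seed=0):
    """Fraction of random same-size gene groups scoring at least as high as `group`."""
    rng = random.Random(seed)
    observed = median_of_maxima(score_rows, group)
    all_genes = range(len(score_rows))
    hits = sum(
        median_of_maxima(score_rows, rng.sample(all_genes, len(group))) >= observed
        for _ in range(n_draws)
    )
    return (hits + 1) / (n_draws + 1)   # add-one rule avoids reporting p = 0

# Simulated aberration scores (hypothetical): 300 genes x 12 tumor samples.
rng = random.Random(2)
n_genes, n_samples = 300, 12
score_rows = [[rng.uniform(0.0, 1.0) for _ in range(n_samples)]
              for _ in range(n_genes)]
group = list(range(37))        # hypothetical category of 37 genes
for s in range(n_samples):     # in each sample a different group gene is grossly aberrant
    score_rows[3 * s][s] = 4.0

p_value = random_group_p_value(score_rows, group)
print(p_value)
```

Because each sample carries its aberration in a different gene, no single gene stands out, yet the category as a whole scores far above random groups of the same size.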