Pathway Analysis

Goals
Characterize the biological meaning of joint changes in gene expression
Organize expression (or other) changes into meaningful ‘chunks’ (themes)
Identify crucial points in a process where intervention could make a difference
Why? Biology is redundant! Often sets of genes with related functions change together

Gene Sets
Gene Ontology
–Biological Process
–Molecular Function
–Cellular Component (location)
Pathway Databases
–KEGG
–BioCarta
–Broad Institute

Other Gene Sets
Transcription factor targets
–All the genes regulated by particular TFs
Protein complex components
–Sets of genes whose protein products function together
Ion channel receptors
RNA / DNA Polymerase
Paralogs
–Families of genes descended (in eukaryotic times) from a common ancestor

Approaches
Univariate:
–Derive summary statistics for each gene independently
–Group statistics of genes by gene group
Multivariate:
–Analyze covariation of genes in groups across individuals
–More adaptable to continuous statistics

Univariate Approaches
Discrete tests: enrichment for groups in gene lists
–Select genes differentially expressed at some cutoff
–For each gene group, cross-tabulate
–Test for significance (hypergeometric or Fisher test)
Continuous tests: from gene scores to group scores
–Compare distribution of scores within each group to random selections
–GSEA (Gene Set Enrichment Analysis)
–PAGE (Parametric Analysis of Gene Set Enrichment)

Multivariate Approaches
Classical multivariate methods
–Multi-dimensional Scaling
–Hotelling’s T²
Informativeness
–Topological score relative to network
–Prediction by machine learning tool, e.g. ‘random forest’

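Contingency Table – 2 x 2

                    Signif. Genes   NS Genes        Total
Group of Interest   k               n-k             n
Others              K-k             (N-n)-(K-k)     N-n
Total               K               N-K             N

Hypergeometric probability of the observed table, with margins fixed: $P = \binom{K}{k}\binom{N-K}{n-k} \Big/ \binom{N}{n}$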
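The enrichment p-value for this table can be sketched in a few lines. This is a minimal illustration, not the presenter's code: the function name `hypergeom_enrichment_p` is hypothetical, and it assumes the usual one-sided question (at least k significant genes in the group) with k, n, K, N as in the table.

```python
from math import comb

def hypergeom_enrichment_p(k, n, K, N):
    """Upper-tail P(X >= k): n genes in the group, K significant genes, N genes in total."""
    total = comb(N, n)                      # all ways to choose the group's n genes
    upper = min(n, K)                       # the overlap cannot exceed either margin
    tail = sum(comb(K, j) * comb(N - K, n - j) for j in range(k, upper + 1))
    return tail / total

# Illustrative numbers only: 8 of 40 group genes are significant,
# against 500 significant genes among 10000 on the array.
p_value = hypergeom_enrichment_p(8, 40, 500, 10000)
```

Exact integer arithmetic is used on purpose: `math.comb` avoids the overflow and rounding problems that make the exact test awkward for large n or k.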

Categorical Analysis
Fisher’s Exact Test
–Condition on the margins being fixed
 Of all tables with the same margins, how many show dependence as or more extreme?
–Hard to compute when n or k are large
Approximations
–Binomial (when k/n is small)
–Chi-square (when expected values > 5)
–G² (log-likelihood ratio; compare to χ²)
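The two large-sample approximations can be sketched as follows. The function names are hypothetical, the table layout (k, n, K, N) is the one from the contingency-table slide, and the sketch assumes no margin is zero. The p-value uses the fact that a chi-square variable with 1 degree of freedom is a squared standard normal.

```python
from math import erfc, log, sqrt

def two_by_two_stats(k, n, K, N):
    """Return (Pearson chi-square, G^2) for the 2 x 2 gene-group table."""
    observed = [[k, n - k], [K - k, (N - n) - (K - k)]]
    row_totals = [n, N - n]
    col_totals = [K, N - K]
    chi2 = 0.0
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / N
            o = observed[i][j]
            chi2 += (o - expected) ** 2 / expected
            if o > 0:                       # a zero cell contributes nothing to G^2
                g2 += 2.0 * o * log(o / expected)
    return chi2, g2

def chi2_sf_1df(x):
    """P(chi-square with 1 df > x), via the standard normal tail."""
    return erfc(sqrt(x / 2.0))
```

Both statistics are compared to the same χ² reference distribution with 1 degree of freedom; they tend to agree when expected counts are large and can diverge when cells are sparse.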

Issues in Assessing Significance
P-value or FDR?
–Heuristic only; use FDR
If a child category is significant, how to assess significance of the parent category?
–Include child category
–Consider only genes outside child category
What is the appropriate null distribution?
–Random sets of genes? Or
–Random assignments of samples?
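The slide does not say which FDR procedure is meant. As one common choice, and purely as an assumption here, the Benjamini-Hochberg adjustment over the per-group p-values can be sketched like this (the function name is hypothetical):

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Gene groups whose adjusted value falls below the chosen FDR level (say 0.05) would be reported. The procedure assumes independent or positively dependent tests, which overlapping gene groups only approximately satisfy.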

Critiques of Discrete Approach
No use of information about size of change
Continuous procedures usually have twice the power of analogous discrete procedures on discretized continuous data
No use of covariation
–Knowing covariation usually improves power of test

(2003)

GSEA
Uses the Kolmogorov-Smirnov (K-S) test of distribution equality to compare t-scores for a selected gene group with those of all genes

Update Fixes a Problem
Sometimes ranks are concentrated in the middle
Hack: ad hoc weighting by scores emphasizes peaks at the extremes
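A running-sum version of the statistic makes both the K-S form and the weighted update concrete. This is a sketch under assumptions, not the published implementation: the function name and the `weight` parameter are hypothetical, the gene list must already be sorted by score, and the set must contain at least one gene inside and one outside (with a nonzero score inside when `weight > 0`).

```python
def enrichment_score(ranked_scores, in_set, weight=1.0):
    """Signed maximum deviation of the running sum along the ranked list.

    ranked_scores: per-gene scores, sorted from highest to lowest.
    in_set: parallel booleans, True where the gene belongs to the group.
    weight: 0 gives the unweighted K-S-like walk; 1 weights hits by |score|.
    """
    n_miss = sum(1 for hit in in_set if not hit)
    hit_total = sum(abs(s) ** weight for s, hit in zip(ranked_scores, in_set) if hit)
    running = 0.0
    best = 0.0
    for s, hit in zip(ranked_scores, in_set):
        if hit:
            running += abs(s) ** weight / hit_total   # step up at a group gene
        else:
            running -= 1.0 / n_miss                   # step down otherwise
        if abs(running) > abs(best):
            best = running
    return best
```

With `weight=0` every group gene contributes the same step, so a group clustered in the middle of the ranking can still produce a sizeable deviation. With `weight=1` genes near zero contribute little, which is the emphasis on the extremes described above.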

Group Z- or T-Scores
Under the null hypothesis, each gene’s z-score $z_i$ is distributed N(0,1)
Hence the sum over genes in a group G, if the genes are independent: $\sum_{i \in G} z_i \sim N(0, |G|)$
Identify which groups have the highest scores
Same issues as discrete:
–Null distribution: permute which indices?
–Hierarchy
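Dividing the sum by the square root of the group size gives a score that is again N(0,1) under the null, provided the genes are independent. A minimal sketch, with a hypothetical function name and a dictionary of per-gene z-scores as the assumed input format:

```python
from math import erfc, sqrt

def group_z(z_scores, group):
    """Return (group z-score, two-sided normal p-value) for the genes in `group`.

    z_scores: dict mapping gene name to its z-score.
    group: iterable of gene names in the gene set.
    """
    members = list(group)
    total = sum(z_scores[g] for g in members)
    z_group = total / sqrt(len(members))            # N(0,1) if genes are independent
    p_two_sided = erfc(abs(z_group) / sqrt(2.0))    # 2 * (1 - Phi(|z|))
    return z_group, p_two_sided
```

Because genes in a pathway are usually correlated, the normal p-value is at best a screening value; the permutation question raised on the slide still applies.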

Issues for Pathway Methods
How to assess significance?
–Null distribution by permutations
–Permute genes or samples?
How to handle activators and inhibitors in the same pathway?
–Variance Test
–Other approaches
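The gene-permutation option can be sketched as below. All names are hypothetical. The mean-square statistic is only a stand-in for a test that responds to changes in both directions; it is not claimed to be the variance test the slide refers to. Permuting samples instead would require recomputing the per-gene scores for every permutation, which this sketch does not do.

```python
import random

def gene_permutation_p(z_scores, group_idx, statistic, n_perm=2000, seed=0):
    """Permutation p-value: compare the group's statistic to random gene sets of equal size.

    z_scores: list of per-gene scores for all genes.
    group_idx: indices of the genes in the group of interest.
    statistic: function mapping a list of scores to one number (larger = more extreme).
    """
    rng = random.Random(seed)
    observed = statistic([z_scores[i] for i in group_idx])
    m = len(group_idx)
    count = 0
    for _ in range(n_perm):
        draw = rng.sample(z_scores, m)       # random gene set, drawn without replacement
        if statistic(draw) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)        # add-one correction avoids p = 0

def mean_score(values):
    """Directional statistic: up- and down-regulated genes cancel."""
    return sum(values) / len(values)

def mean_square(values):
    """Direction-free statistic: large changes of either sign add up."""
    return sum(v * v for v in values) / len(values)
```

A pathway with one strongly induced activator and one strongly repressed inhibitor has a mean score near zero but a large mean square, so only the second statistic flags it.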

Pathway Analysis of Genotype Data

The Pathways Proposal
Complex disease ensues from the malfunction of one or a few specific signaling pathways
Alternatives:
1. Common variants of several genes in the pathway each contribute moderate risk
2. Rare de novo variants confer great risk and persist for generations in LD with typed markers within unidentified subpopulations of the study group

Approach 1 – Adaptation of GSEA
Order log-odds ratios or linkage p-values for all SNPs
Map SNPs to genes, and genes to groups
Use linkage p-values in place of t-scores in GSEA
–Compare distribution of log-odds ratios for SNPs in group to randomly selected SNPs from the chip

Possible Association Models
1. Each of several genes may have a variant that confers increased RR independent of other genes
2. Several genes in the pathway contribute additively to the malfunction of the pathway
3. There are several distinct combinations of gene variants that increase RR, but only modest increases in risk for any single variant

Approach 2 – Combining p-values
1. Compute gene-wise p-value:
–Select most likely variant: ‘best’ p-value
–Selected minimum p-value is biased downward
–Assign ‘gene-wise’ p-value by permutations (Westfall-Young)
 Permute samples and compute the ‘best’ p-value for each permutation
 Compare candidate SNP p-values to this null distribution of ‘best’ p-values
2. Combine p-values by Fisher’s method
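Both steps can be sketched briefly. The function names are hypothetical, and the null minima passed to `genewise_p` are assumed to come from a separate sample-permutation loop that is not shown. Fisher's method refers $-2\sum \ln p_i$ to a chi-square distribution with 2k degrees of freedom; for an even number of degrees of freedom the tail has a closed form, which the sketch uses. It assumes the gene-wise p-values are independent and strictly positive.

```python
from math import exp, log

def genewise_p(observed_snp_pvalues, null_min_pvalues):
    """Gene-wise p-value: rank the best observed SNP p-value among permutation minima."""
    best = min(observed_snp_pvalues)
    hits = sum(1 for q in null_min_pvalues if q <= best)
    return (hits + 1) / (len(null_min_pvalues) + 1)

def fisher_combined_p(pvalues):
    """Fisher's method: tail probability of -2 * sum(log p) under chi-square with 2k df."""
    k = len(pvalues)
    half = -sum(log(p) for p in pvalues)     # equals x / 2 for x = -2 * sum(log p)
    term = 1.0
    total = 1.0
    for j in range(1, k):                    # sum_{j < k} (x/2)^j / j!
        term *= half / j
        total += term
    return exp(-half) * total
```

The permutation step corrects the downward bias of the minimum p-value before the gene-wise values are combined across the genes of a pathway.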

Methods – 2
Additive model: $\operatorname{logit} P(\text{case}) = \beta_0 + \sum_{i \in G} \beta_i n_i$
–Where $n_i$ is the number of B alleles of a SNP in gene i in the gene set G
–Select subset of most likely SNPs
–Fit by logistic regression (glm() in R)
Significance by permutations
–Permute sample outcomes
–Select genes and fit logistic regression again
 Assess goodness of fit each time
–Compare observed goodness of fit

Multivariate Approaches to Gene Set Analysis

Key Multivariate Ideas
PCA (Principal Components Analysis)
SVD (Singular Value Decomposition)
MDS (Multi-dimensional Scaling)
Hotelling’s T²

PCA
Three correlated variables
PC1 lies along the direction of maximal variation; PC2 lies at right angles to it, along the direction of next-highest variation.
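To make "direction of maximal variation" concrete, the first component can be found by power iteration on the sample covariance matrix. This is a teaching sketch with a hypothetical function name, not a substitute for a library SVD. It assumes the starting vector is not orthogonal to the leading direction and that the data are not constant.

```python
def first_principal_component(rows, n_iter=200):
    """Return (unit direction, variance along it) for the leading principal component.

    rows: list of samples, each a list of p variable values.
    """
    n = len(rows)
    p = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    v = [1.0 + 0.1 * j for j in range(p)]            # arbitrary, slightly uneven start
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]                    # renormalize each step
    variance = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p)) for a in range(p))
    return v, variance
```

The returned variance is the largest eigenvalue of the covariance matrix. Repeating the procedure on the data with this direction projected out would give the second component.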

Multi-Dimensional Scaling
Aim: to represent graphically, in 2 (or 3) dimensions, the most information about relationships among samples with multi-dimensional attributes
Algorithm:
–Transform distances into cross-product matrix
–Initial PCA onto 2 (or 3) axes
–Deform until better representation
 Minimize ‘strain’ measure
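For reference, the first two algorithm steps have a standard closed form in classical MDS. The notation below ($D^{(2)}$ for the matrix of squared distances, $J$, $B$, $X$) is assumed here and does not come from the slides, and the strain expression is the usual textbook form rather than a recovered copy of the slide's own formula.

```latex
\begin{aligned}
J &= I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top, \qquad
B = -\tfrac{1}{2}\, J\, D^{(2)} J, \\
B &= V \Lambda V^\top, \qquad
X = V_k \Lambda_k^{1/2} \quad (k = 2 \text{ or } 3), \\
\operatorname{Strain}(X) &= \left( \frac{\sum_{i,j} \left( b_{ij} - x_i^\top x_j \right)^2}{\sum_{i,j} b_{ij}^2} \right)^{1/2}
\end{aligned}
```

Here $V_k$ and $\Lambda_k$ hold the k leading eigenvectors and eigenvalues of $B$, and $x_i$ is the i-th row of $X$. When the distances are Euclidean, this initial configuration coincides with a PCA projection of the samples.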

Separating Using MDS
[Figure] Left: distributions of individual variables. Right: MDS plot (in this case PCA).

Multivariate Approaches to Selection
Visualizing differences by MDS
Hotelling’s T-squared

MDS for Pathways
[Figure] BAD pathway; groups shown: Normal, IBC, Other BC
Clear separation between groups
Variation differences

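Hotelling’s T²
Compute distance between sample means using (common) metric of covariation:
$T^2 = \frac{n_1 n_2}{n_1 + n_2} (\bar{x}_1 - \bar{x}_2)^\top S^{-1} (\bar{x}_1 - \bar{x}_2)$
Where $S = \frac{(n_1 - 1) S_1 + (n_2 - 1) S_2}{n_1 + n_2 - 2}$ is the pooled covariance matrix
Multidimensional analog of the t (actually F) statistic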
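A two-gene version shows the computation without any matrix library. The function name is hypothetical, the restriction to p = 2 variables is a simplification made here so the covariance matrix can be inverted by hand, and the pooled covariance is assumed to be non-singular. The conversion to an F statistic is the standard one, with p and n1 + n2 - p - 1 degrees of freedom.

```python
def hotelling_t2_two_genes(group1, group2):
    """Two-sample Hotelling T^2 and its F statistic for p = 2 variables.

    group1, group2: lists of (gene_a, gene_b) expression pairs, one per sample.
    """
    def mean(rows):
        n = len(rows)
        return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

    def scatter(rows, m):
        # Sum of outer products of deviations from the group mean.
        s = [[0.0, 0.0], [0.0, 0.0]]
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for a in range(2):
                for b in range(2):
                    s[a][b] += d[a] * d[b]
        return s

    n1, n2 = len(group1), len(group2)
    m1, m2 = mean(group1), mean(group2)
    s1, s2 = scatter(group1, m1), scatter(group2, m2)
    pooled = [[(s1[a][b] + s2[a][b]) / (n1 + n2 - 2) for b in range(2)]
              for a in range(2)]
    det = pooled[0][0] * pooled[1][1] - pooled[0][1] * pooled[1][0]
    inv = [[pooled[1][1] / det, -pooled[0][1] / det],
           [-pooled[1][0] / det, pooled[0][0] / det]]
    d = [m1[0] - m2[0], m1[1] - m2[1]]
    quad = sum(d[a] * inv[a][b] * d[b] for a in range(2) for b in range(2))
    t2 = n1 * n2 / (n1 + n2) * quad
    p = 2
    f_stat = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return t2, f_stat
```

For a real pathway with many genes and few samples the pooled matrix cannot be inverted at all, which is the small-sample problem noted in the critiques.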

Principles of Kong et al. Method
Normal covariation generally acts to preserve homeostasis
The transcription of genes that participate in many processes will be changed
The joint changes in genes will be most distinctive for those genes active in pathways that are working differently

Critiques of Hotelling’s T²
Not robust to outliers
Assumes same covariance in each sample
–Σ₁ = Σ₂? Usually not in disease
Small samples: unreliable Σ estimates
–N < p