Gene Set Enrichment Analysis Petri Törönen petri(DOT)toronen(AT)helsinki.fi.

What, Why, How…
– Gene expression data/analysis
– Problems with gene expression data analysis
– Earlier solutions
– My solution
– Comparisons
– Conclusions / Warnings

Genome-wide gene expression
– Genome-wide Gene Expression (GE) analysis is a standard lab tool
– Various methods
– Aim: to understand biological differences across the samples at the gene level
– If you don't work with GE data: Gene Set Methods can be used with most other large-scale data sets

Typical pipelines
Pipeline 1:
– Generate the GE data
– Pre-processing (normalization etc.)
– Define Differentially Expressed genes
– Find over-represented biological processes
– Draw biological conclusions
Pipeline 2:
– Generate the GE data
– Pre-processing (normalization etc.)
– Define Differentially Expressed genes
– Cluster selected genes
– Draw biological conclusions
Pipeline 3:
– Generate the GE data
– Pre-processing (normalization etc.)
– Define Differentially Expressed genes
– Generate a classification of samples using GE profiles of genes
– Draw biological conclusions
– Classify unknown samples

What can go wrong?
– Is the definition of Differentially Expressed genes always reasonable?
  – datasets with large noise levels
  – p-value thresholds
  – sudden jump to significant regulation
  – genes with weak regulation
– Is the set of Differentially Expressed genes the main goal?
  => Biological processes are usually more informative.

What can go wrong?
Analysis of data with one threshold: a biological process with weak regulation goes unnoticed.

Solution
– Analyze sets of genes instead of single genes
– Gene set: genes belonging to the same pathway, biological process, complex and/or Gene Ontology class
– Benefits:
  – A group of genes is less sensitive to error than a single gene*
  – Easy interpretation of the results
  – Something to support the gene-based analysis

Gene set analysis pipeline
Gene level:
– Generate the GE data (expression data, class data / sample labels)
– Pre-processing (normalization etc.)
– Define a continuous Diff. Expr. score for genes
Gene set level:
– Calculate a gene set score for each pre-defined gene set
– Generate permuted data and calculate the gene set score for each gene set again
– Look for gene sets that show a stronger signal in the real data than in the permuted data
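
The pipeline on this slide can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the diff. expr. score is a plain difference of class means, the gene set score is the mean of the member scores, and the names `diff_expr_scores`, `set_score` and `permutation_p_value` are hypothetical, not taken from any package.

```python
import random
from statistics import mean

def diff_expr_scores(expr, labels):
    """Continuous score per gene: mean of class 1 minus mean of class 0 (assumed score)."""
    scores = {}
    for gene, values in expr.items():
        class1 = [v for v, lab in zip(values, labels) if lab == 1]
        class0 = [v for v, lab in zip(values, labels) if lab == 0]
        scores[gene] = mean(class1) - mean(class0)
    return scores

def set_score(scores, gene_set):
    """Gene set score: mean of the member scores (assumed, for illustration)."""
    return mean(scores[g] for g in gene_set)

def permutation_p_value(expr, labels, gene_set, n_perm=200, seed=0):
    """Compare the real gene set score with scores from label-permuted data."""
    rng = random.Random(seed)
    observed = abs(set_score(diff_expr_scores(expr, labels), gene_set))
    at_least = 0
    for _ in range(n_perm):
        shuffled = labels[:]          # permute the sample labels, keep the data
        rng.shuffle(shuffled)
        null = abs(set_score(diff_expr_scores(expr, shuffled), gene_set))
        at_least += null >= observed
    return (at_least + 1) / (n_perm + 1)

# Toy expression data: 3 genes x 6 samples (hypothetical values).
expr = {
    "g1": [1.0, 1.1, 0.9, 3.0, 3.1, 2.9],
    "g2": [2.0, 2.1, 1.9, 4.0, 4.2, 3.8],
    "g3": [5.0, 5.1, 4.9, 5.0, 5.2, 4.8],
}
labels = [0, 0, 0, 1, 1, 1]
p = permutation_p_value(expr, labels, {"g1", "g2"})
print(p)
```

With only six samples there are just 20 distinct label arrangements, so the p-value cannot get very small however strong the signal is.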

Methods for gene set scoring
– Average based methods
– Rank based methods
– Other methods (omitted here)

Average based methods
– Calculate the average regulation of the gene set (Tian et al., PNAS)
– Can something go wrong with it?
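
A small sketch of what can go wrong with an average. It assumes the gene set score is the plain mean of the gene-level scores, which is a simplification and not necessarily the exact statistic of Tian et al. The gene names and scores are made up.

```python
from statistics import mean

def average_score(scores, gene_set):
    """Mean differential expression score over the members of a gene set."""
    members = [scores[g] for g in gene_set if g in scores]
    return mean(members) if members else 0.0

# Hypothetical gene-level scores.
scores = {"g1": 2.0, "g2": -2.0, "g3": 0.1, "g4": 1.0, "g5": 2.0}
coherent = {"g4", "g5"}   # both members up-regulated
mixed = {"g1", "g2"}      # one member up, one member down

print(average_score(scores, coherent))  # 1.5
print(average_score(scores, mixed))     # 0.0
```

The mixed set contains two strongly regulated genes, but their opposite signs cancel and the average reports no signal.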

Rank based methods
Steps:
– order genes by differential expression
– test every possible threshold in the ordered list
– look for over(/under)-representation of the gene set above the threshold
– select the strongest score
Expression values are (often) discarded!
Examples: Iterative Group Analysis, Kolmogorov-Smirnov test (KS), modified KS (Gene Set Enrichment Analysis package, MIT)
[Figure: ordered gene expression data with a threshold marking the analyzed subset, next to the analyzed gene classes. Black = class member, white = not a member.]
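
The four steps can be sketched as a threshold scan with a hypergeometric over-representation test at each position. This is in the spirit of the rank based methods listed above, not a reimplementation of any of them; `hypergeom_tail` and `best_threshold` are illustrative names.

```python
from math import comb, log10

def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N genes, K of which are set members."""
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return favourable / comb(N, n)

def best_threshold(ranked_genes, gene_set):
    """Scan every threshold in the ranked list; return (best -log10 p, threshold position)."""
    N = len(ranked_genes)
    K = sum(g in gene_set for g in ranked_genes)
    best = (0.0, 0)
    hits = 0
    for n, gene in enumerate(ranked_genes, start=1):
        hits += gene in gene_set          # set members above the current threshold
        score = -log10(hypergeom_tail(hits, n, K, N))
        if score > best[0]:
            best = (score, n)
    return best

# Genes already ordered by differential expression (hypothetical ids).
ranked = ["a", "b", "c", "d", "e", "f", "g", "h"]
result = best_threshold(ranked, {"a", "b", "c"})
print(result)
```

Note that only the order of the genes enters the score; the expression values themselves are not used, which is the point made on the slide.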

Permutations
– Needed to evaluate significance
– Two types:
  – Row Randomization: mix the gene set / gene class labels
  – Column Randomization: mix the sample labels that are used to calculate diff. expr.
– Column Randomization is preferred
[Figure: row randomization vs. column randomization]
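
The two permutation types differ only in what gets shuffled. A minimal sketch with hypothetical helper names:

```python
import random

def permute_columns(sample_labels, rng):
    """Column randomization: shuffle the sample labels; gene sets stay intact."""
    shuffled = list(sample_labels)
    rng.shuffle(shuffled)
    return shuffled

def permute_rows(gene_ids, gene_set, rng):
    """Row randomization: move set membership to random genes, keeping the set size."""
    size = sum(g in gene_set for g in gene_ids)
    return set(rng.sample(list(gene_ids), size))

rng = random.Random(1)
labels = [0, 0, 0, 1, 1, 1]
genes = ["g1", "g2", "g3", "g4", "g5"]

new_labels = permute_columns(labels, rng)         # diff. expr. must be recomputed
new_set = permute_rows(genes, {"g1", "g2"}, rng)  # diff. expr. scores stay as they are
print(new_labels, new_set)
```

Column randomization leaves the expression matrix untouched, so correlations between genes are still present in the permuted data. Row randomization treats genes as independent, which is one common argument for preferring the column version.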

Summary of methods
– Average-based methods are weak with non-coherent regulation
– Rank-based methods usually omit gene expression values => steps between all genes are treated as equally significant

My brilliant proposal
Combine the two method groups:
– Order genes by diff. expr. scores
– Test every threshold position
– At each threshold calculate a difference score (Diff)
– Scale the difference with STD and average estimates (Törönen et al. 2009)
– Get a Z-score scaling for the difference => Gene Set Z-score (GSZ)

My brilliant proposal
– GSZ is an over-representation (hypergeometric) score weighted with the diff. expr. score
– GSZ compares the Diff to the mean and STD obtained when the class is randomly distributed in the ordered list
– Considers both the variance in the expression values and the variance in the number of gene set members in the list
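
A rough sketch of the idea, with several assumptions stated up front. Diff is taken here as the sum of the scores of gene set members above the threshold minus the sum of the scores of non-members above it. The mean and STD are estimated by randomly redistributing the class in the ordered list (Monte Carlo), which stands in for the estimates used in the published method. The `prior_var` constant that keeps the denominator away from zero is also an assumption of this sketch, as are all function names.

```python
import random
from statistics import mean, pstdev

def diff_profile(ordered_scores, is_member):
    """Diff at each threshold: member scores minus non-member scores seen so far."""
    profile, running = [], 0.0
    for score, member in zip(ordered_scores, is_member):
        running += score if member else -score
        profile.append(running)
    return profile

def gsz_profile(ordered_scores, is_member, n_null=500, prior_var=0.1, seed=0):
    """Z-scaled Diff at every threshold, using a shuffled-membership null."""
    rng = random.Random(seed)
    observed = diff_profile(ordered_scores, is_member)
    flags = list(is_member)
    nulls = []
    for _ in range(n_null):
        rng.shuffle(flags)  # class randomly distributed in the ordered list
        nulls.append(diff_profile(ordered_scores, flags))
    z = []
    for k, diff in enumerate(observed):
        column = [profile[k] for profile in nulls]
        mu, var = mean(column), pstdev(column) ** 2
        z.append((diff - mu) / (var + prior_var) ** 0.5)
    return z

# Genes ordered by diff. expr. score; the first three belong to the gene set.
ordered_scores = [3.0, 2.5, 2.0, 0.5, 0.2, -0.1, -0.4, -1.0, -2.0, -2.2]
is_member = [True, True, True] + [False] * 7
z = gsz_profile(ordered_scores, is_member)
print(max(z), z.index(max(z)) + 1)
```

Unlike a pure rank based score, the size of each step in the profile depends on the expression score of the gene at that position.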

My brilliant proposal
Many popular gene set scoring methods are variants of the GSZ method:
– hypergeometric testing
– Pearson correlation
– Max-Mean (Efron, Tibshirani)
– Random Sets (Newton et al.)

GSZ profile from ALL data (Chiaretti et al.) for one GO class vs. 7 quantiles (0, 5, 25, 50, 75, 95, 100) from 500 permutations. Different positions correspond to other competing methods.

Evaluation
– How stable are the scores as the threshold goes through the gene list?
– Red line: strongest signal from the positive data (across all GO classes)
– Blue lines: various quantiles (same as before) across all GO classes
– Compared with KS and modified KS (right column; MIT, PNAS and Nature Gen.)
– Same data, same permutations!
– GSZ is shown with different parameter values; the third box shows the default parameter values
– Pay attention to the stability of the blue lines

More evaluation
– GSZ is also stable against gene set size variations; most methods are not
– Several gene set scoring methods were tested with artificial positive and random datasets
  – GSZ showed the best overall ability to separate the two dataset types
– Methods were evaluated by splitting the real data into two halves and testing how well the results match
  – GSZ was best in predicting its own results from the other half
  – GSZ was best in predicting the summary of all methods from the other half

More evaluation
– Compare different gene set scoring functions
– Test with two popular datasets against GO classes
– Calculate the empirical -log(p-values) for the strongest GO classes from each method
– Blue line = GSZ, green line = T-test, red = KS, magenta = iGA, cyan = modified KS
[Figure panels: ALL dataset, p53 dataset; pooled data, class data]
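
An empirical -log(p-value) can be computed from permutation scores roughly as follows. The slide does not give the details, so the base-10 logarithm and the +1 correction (which avoids taking the log of zero) are assumptions of this sketch, and the function name is hypothetical.

```python
from math import log10

def empirical_neg_log_p(observed, null_scores):
    """-log10 of the empirical p-value of an observed gene set score."""
    exceed = sum(s >= observed for s in null_scores)
    p = (exceed + 1) / (len(null_scores) + 1)
    return -log10(p)

# Hypothetical scores of one GO class in 9 permuted datasets.
null_scores = [0.1, 0.5, 1.2, 2.0, 0.3, 0.8, 1.5, 0.9, 0.4]
value = empirical_neg_log_p(1.8, null_scores)
print(value)
```

With this correction the largest reachable value is log10(number of permutations + 1), so the number of permutations limits how strong a class can look.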

More evaluation
– Select biologically relevant GO classes as biologically positive
– Look at how many such classes each method finds across the top ranks (GSZ = blue line)
– Shown here: ALL dataset. GSZ outperforms the others at bigger ranks
– Similar results were obtained with the p53 dataset

Comparison with other programs
– Selected SignalPathway (green line), GSEA (cyan) and GSA (black) for comparison
– Evaluation was again done using the biologically positive classes
– Comparing programs is less clear (more variables)
– Shown here again: ALL dataset. Similar results with p53
– GSZ outperforms others at large

Summary
– GSZ: a weighted over-representation score
– Mathematical link to many other popular methods
– Stable across GO class sizes and across gene list positions
– Good performance in artificial datasets
– Best performance in many evaluations on two real datasets

Other applications
– siRNA data vs. gene IDs (discussed)
– Linkage data vs. biological processes (discussed)
– BLAST result list vs. descriptions (in use)
– BLAST result list vs. GO classes (in use)

Warnings
– Quality of gene expression data
– Enough samples for permutations
– Each gene should occur only once in the expression data
– Filter genes without annotations (with GO data)
– Use Column Permutations
– Quality of gene sets / annotations
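
Two of these warnings are easy to handle in code before scoring. The sketch below drops genes without annotations and keeps one score per gene. How duplicates should be collapsed is not stated on the slide; keeping the score with the largest absolute value is an assumption made here, and `prepare_scores` is a hypothetical helper.

```python
def prepare_scores(rows, annotated):
    """rows: iterable of (gene_id, score); annotated: gene ids that have annotations."""
    best = {}
    for gene, score in rows:
        if gene not in annotated:
            continue  # filter genes without annotations
        if gene not in best or abs(score) > abs(best[gene]):
            best[gene] = score  # assumed rule: keep the strongest duplicate
    return best

# Hypothetical input: g1 appears twice, g3 has no annotation.
rows = [("g1", 0.5), ("g1", -2.0), ("g2", 1.0), ("g3", 3.0)]
cleaned = prepare_scores(rows, annotated={"g1", "g2"})
print(cleaned)
```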

Wake up!!