Gene Set Enrichment Analysis Petri Törönen petri(DOT)toronen(AT)helsinki.fi.


1 Gene Set Enrichment Analysis Petri Törönen petri(DOT)toronen(AT)helsinki.fi

2 What, Why, How…
– Gene expression data/analysis
– Problems with gene expression data analysis
– Earlier solutions
– My solution
– Comparisons
– Conclusions / Warnings

3 Genome-wide gene expression
Genome-wide Gene Expression (GE) analysis:
– Standard lab tool
– Various methods
– Aims to understand biological differences across the samples at the gene level
If you don’t work with GE data:
– Gene set methods can be used with most other large-scale data sets

4 Typical pipelines
– Generate the GE data → Pre-processing (normalization etc.) → Define differentially expressed genes → Find over-represented biological processes → Draw biological conclusions
– Generate the GE data → Pre-processing (normalization etc.) → Define differentially expressed genes → Cluster selected genes → Draw biological conclusions
– Generate the GE data → Pre-processing (normalization etc.) → Define differentially expressed genes → Generate a classification of samples using GE profiles of genes → Draw biological conclusions / Classify unknown samples

5 What can go wrong?
Is the definition of differentially expressed genes always reasonable?
– datasets with large noise levels
– p-value thresholds
– sudden jump to significant regulation
– genes with weak regulation
Is the set of differentially expressed genes the main goal?

6 What can go wrong?
Is the definition of differentially expressed genes always reasonable?
– datasets with large noise levels
– p-value thresholds
– genes with weak regulation
Is the set of differentially expressed genes the main goal?
=> Biological processes are usually more informative.

7 What can go wrong? When the data are analyzed with a single threshold, a biological process with weak regulation goes unnoticed.
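The single-threshold problem can be illustrated with a toy over-representation test. This is only a sketch under stated assumptions: the gene names, the score values and the cut-off of 2.0 are invented, and `hypergeom_pvalue` is a hypothetical helper written for this note, not part of any package mentioned in the talk.

```python
from math import comb

def hypergeom_pvalue(total_genes, set_size, n_selected, n_hits):
    """P(at least n_hits gene set members among n_selected randomly drawn genes)."""
    max_hits = min(set_size, n_selected)
    tail = sum(
        comb(set_size, k) * comb(total_genes - set_size, n_selected - k)
        for k in range(n_hits, max_hits + 1)
    )
    return tail / comb(total_genes, n_selected)

# Invented toy data: 100 genes, background score 0.0.
scores = {f"g{i}": 0.0 for i in range(100)}
gene_set = {"g0", "g1", "g2", "g3", "g4"}
for gene in gene_set:
    scores[gene] = 1.0          # the whole set is weakly up-regulated
for gene in ("g50", "g51", "g52"):
    scores[gene] = 3.0          # three unrelated genes are strongly up-regulated

threshold = 2.0                 # assumed cut-off for "differentially expressed"
selected = {g for g, s in scores.items() if s > threshold}
hits = len(selected & gene_set)

# No set member passes the cut-off, so the test sees nothing (p-value 1.0),
# although every gene in the set is shifted in the same direction.
p_value = hypergeom_pvalue(len(scores), len(gene_set), len(selected), hits)
```

Lowering the threshold would rescue this particular set, but then the choice of threshold becomes part of the result, which is the issue the slide raises.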

8 Solution
Analyze sets of genes instead of single genes.
Gene set: genes belonging to the same pathway, biological process, complex and/or Gene Ontology class
Benefits:
– A group of genes is less sensitive to error than a single gene*
– Easy interpretation of the results
– Something to support the gene-based analysis

9 Gene set analysis pipeline
Inputs: expression data, class data (sample labels), pre-defined gene sets
Gene level:
– Generate the GE data
– Pre-processing (normalization etc.)
– Define a continuous differential expression score for genes
Gene set level:
– Calculate a gene set score for each gene set
– Generate permuted data and calculate the gene set score for each gene set in it
– Look for gene sets that show a stronger signal in the real data than in the permuted data
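The gene-level step of the pipeline needs a continuous differential expression score instead of a yes/no call. The slides do not say which score is used, so the sketch below assumes a Welch t-statistic between two sample classes; the function names and the data layout (a dict of expression rows plus a list of 0/1 sample labels) are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, variance

def t_score(values_a, values_b):
    """Welch t-statistic; positive when class A is higher than class B.

    Assumes at least two samples per class and non-zero variance in at
    least one class (otherwise the denominator is zero).
    """
    var_term = variance(values_a) / len(values_a) + variance(values_b) / len(values_b)
    return (mean(values_a) - mean(values_b)) / sqrt(var_term)

def gene_scores(expression, labels):
    """Map each gene to its t-score; labels hold 1 for class A and 0 for class B."""
    scores = {}
    for gene, row in expression.items():
        class_a = [x for x, lab in zip(row, labels) if lab == 1]
        class_b = [x for x, lab in zip(row, labels) if lab == 0]
        scores[gene] = t_score(class_a, class_b)
    return scores

# Invented example: six samples, three per class.
expression = {
    "up":   [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],
    "flat": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
}
labels = [1, 1, 1, 0, 0, 0]
scores = gene_scores(expression, labels)
```

Every gene keeps a score, so weakly regulated genes still contribute to the gene set level instead of being cut away.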

10 Methods for gene set scoring
– Average-based methods
– Rank-based methods
– Other methods (omitted here)

11 Average-based methods Calculate the average regulation of the gene set (Tian et al., PNAS). Can something go wrong with it?
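One thing that can go wrong is cancellation. The sketch below uses a plain arithmetic mean as the set score, which is a simplification and not necessarily the exact statistic of Tian et al.; the gene names and scores are invented.

```python
from statistics import mean

def average_set_score(scores, gene_set):
    """Plain mean of the differential expression scores of the set members."""
    return mean(scores[g] for g in gene_set)

# Invented scores: one set regulated coherently, one regulated in both directions.
scores = {
    "a": 2.0, "b": 2.5, "c": 1.5, "d": 2.0,      # all up
    "e": 2.0, "f": -2.0, "g": 2.5, "h": -2.5,    # half up, half down
}
coherent = average_set_score(scores, ["a", "b", "c", "d"])   # 2.0
mixed = average_set_score(scores, ["e", "f", "g", "h"])      # 0.0
```

The second set is regulated as strongly as the first, but the up- and down-regulated members cancel and the average reports no signal.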

12 Rank-based methods
Steps:
– order genes by differential expression
– test every possible threshold in the ordered list
– look for over- (or under-)representation of the gene set above the threshold
– select the strongest score
Expression values are (often) discarded!
Examples: Iterative Group Analysis, Kolmogorov-Smirnov test (KS), modified KS (Gene Set Enrichment Analysis package, MIT)
[Figure: ordered gene expression data with a threshold marking the analyzed subset, shown next to the analyzed gene classes. Black = class member, white = not a member]
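The steps above can be sketched as a Kolmogorov-Smirnov-style running sum. This is a minimal illustration of the unweighted idea, not the code of the GSEA package or of Iterative Group Analysis; `ks_running_sum` is a name chosen for this note.

```python
def ks_running_sum(ranked_genes, gene_set):
    """Largest deviation of a running sum over the ordered gene list.

    The sum steps up by 1/n_in at each set member and down by 1/n_out at
    each non-member, so it starts and ends at zero. Every position in the
    list acts as a threshold, and the strongest deviation is returned.
    """
    members = [g in gene_set for g in ranked_genes]
    n_in = sum(members)
    n_out = len(members) - n_in
    if n_in == 0 or n_out == 0:
        raise ValueError("the list must contain both members and non-members")
    running = 0.0
    best = 0.0
    for is_member in members:
        running += 1.0 / n_in if is_member else -1.0 / n_out
        if abs(running) > abs(best):
            best = running
    return best

ranked = list("abcdefgh")                       # genes ordered by diff. expr. score
top_heavy = ks_running_sum(ranked, {"a", "b"})  # members at the top of the list
bottom_heavy = ks_running_sum(ranked, {"g", "h"})
```

Only the order of the genes enters the calculation. Changing the expression scores without changing the order gives the same result, which is the loss of information the slide points out.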

13 Permutations
Needed to evaluate significance. Two types:
– Row randomization: mix the gene set / gene class labels
– Column randomization: mix the sample labels that are used to calculate differential expression
Column randomization is preferred.
[Figure: row randomization vs. column randomization]
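The two permutation types can be sketched as follows. The function names and data layout are assumptions for illustration. After a column permutation the differential expression scores and the gene set scores have to be recomputed from the shuffled labels; a row permutation only redraws which genes belong to the set.

```python
import random

def permute_columns(labels, rng):
    """Column randomization: shuffle the sample labels, keep the data as is."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled

def permute_rows(all_genes, gene_set, rng):
    """Row randomization: draw a random gene set of the same size."""
    return set(rng.sample(sorted(all_genes), len(gene_set)))

rng = random.Random(0)                      # fixed seed for a reproducible example
labels = [1, 1, 1, 0, 0, 0]
all_genes = {f"g{i}" for i in range(20)}
gene_set = {"g0", "g1", "g2"}

permuted_labels = permute_columns(labels, rng)
permuted_set = permute_rows(all_genes, gene_set, rng)
```

Column randomization leaves each gene's expression profile intact, so correlations between genes are preserved in the null data. Row randomization treats genes as independent, which is a common argument for preferring column permutations.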

14 Summary of methods
– Average-based methods are weak with non-coherent regulation
– Rank-based methods usually omit the gene expression values => the steps between all genes are treated as equally significant

15 My brilliant proposal
Combine the two method groups:
– Order genes by their differential expression scores
– Test every threshold position
– At each threshold calculate a difference score (Diff)
– Scale the difference with STD and average estimates (Toronen et al. 2009)
– Get a Z-score scaling for the difference
=> Gene Set Z-score (GSZ)

16 My brilliant proposal
GSZ is an over-representation (hypergeometric) score weighted with the differential expression score.
GSZ compares the Diff to the mean and STD we obtain when the class is randomly distributed in the ordered list.
It considers both the variance in the expression values and the variance in the number of gene set members in the list.
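A rough sketch of this idea follows. It is not the published GSZ implementation. It assumes that Diff at a threshold is the sum of the scores of the set members above the threshold minus the sum of the scores of the non-members above it, and it takes the mean and variance of Diff under a uniformly random placement of the K set members among the M genes. With p = K/M, S the sum of the scores above the threshold and Q the sum of their squares, sampling without replacement gives E[Diff] = (2p - 1) S and Var[Diff] = 4 p (1 - p) (M Q - S^2) / (M - 1). The tunable parameters mentioned in the evaluation are not modelled, and the function names are made up for this note.

```python
from math import sqrt

def gsz_profile(ranked_scores, membership):
    """Z-score of Diff at every threshold position of the ordered list.

    ranked_scores: diff. expr. scores ordered from strongest to weakest.
    membership: booleans in the same order, True for gene set members.
    """
    m = len(ranked_scores)
    p = sum(membership) / m
    diff = total = total_sq = 0.0
    profile = []
    for score, is_member in zip(ranked_scores, membership):
        diff += score if is_member else -score   # members add, non-members subtract
        total += score
        total_sq += score * score
        expected = (2 * p - 1) * total
        var = 4 * p * (1 - p) * (m * total_sq - total * total) / (m - 1)
        # Guard against a zero variance (for example, all scores equal to zero).
        profile.append((diff - expected) / sqrt(var) if var > 1e-12 else 0.0)
    return profile

def gsz_like_score(ranked_scores, membership):
    """Strongest Z-score (largest absolute value) over all thresholds."""
    return max(gsz_profile(ranked_scores, membership), key=abs)

scores = [5.0, 4.0, 3.0, 2.0, 1.0, -1.0, -2.0, -3.0, -4.0, -5.0]
at_top = [True, True, True] + [False] * 7
scattered = [True, False, False, False, True, False, False, False, False, True]

top_score = gsz_like_score(scores, at_top)
scattered_score = gsz_like_score(scores, scattered)
```

Unlike a pure rank statistic, both the position of the members and the size of their scores enter the result, so a set concentrated among the strongest scores gets a clearly larger value than a set spread over the list.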

17 My brilliant proposal
Many popular gene set scoring methods are variants of the GSZ method:
– hypergeometric testing
– Pearson correlation
– Max-Mean (Efron, Tibshirani)
– Random Sets (Newton et al.)

18 GSZ profile from the ALL data (Chiaretti et al.) for one GO class vs. 7 quantiles (0, 5, 25, 50, 75, 95, 100) from 500 permutations. Different positions correspond to other competing methods.

19 Evaluation
How stable are the scores as the threshold goes through the gene list?
– Red line: strongest signal from the positive data (across all GO classes)
– Blue lines: various quantiles (same as before) across all GO classes
– GSZ with different parameter values; the third box shows the default parameter values
– Right column: comparison with KS and modified KS (MIT; PNAS and Nature Gen.)
Same data, same permutations!! Pay attention to the stability of the blue lines.

20 More evaluation
– GSZ is also stable against variations in gene set size; most methods are not
– Several gene set scoring methods were tested with artificial positive and random datasets; GSZ showed the best overall ability to separate the two dataset types
– Methods were evaluated by splitting the real data into two halves and testing how well the results match:
  – GSZ was best at predicting its own results from the other half
  – GSZ was best at predicting the summary of all methods from the other half

21 More evaluation
– Compare different gene set scoring functions
– Test with two popular datasets against GO classes
– Calculate the empirical -log(p-values) for the strongest GO classes from each method
Blue line = GSZ, green line = T-test, red = KS, magenta = iGA, cyan = modified KS
[Figure panels: ALL dataset, p53 dataset; pooled data, class data]
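An empirical p-value of this kind can be sketched as below. The slides do not give the exact estimator, so the add-one correction, the two-sided comparison on absolute values and the base-10 logarithm are assumptions of this note, and the permuted scores are invented.

```python
from math import log10

def empirical_pvalue(observed, permuted_scores):
    """Share of permuted scores at least as extreme as the observed one.

    The +1 in numerator and denominator keeps the estimate above zero
    when no permuted score reaches the observed value.
    """
    hits = sum(1 for s in permuted_scores if abs(s) >= abs(observed))
    return (hits + 1) / (len(permuted_scores) + 1)

observed = 3.0
permuted = [0.5, -1.0, 2.0, -3.5]        # invented scores from four permutations
p_value = empirical_pvalue(observed, permuted)
neg_log_p = -log10(p_value)
```

With n permutations the smallest reachable p-value is 1/(n + 1), which is one reason why enough samples and permutations are needed.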

22 More evaluation
– Select biologically relevant GO classes as biologically positive
– Look at how many such classes each method finds across the top ranks (GSZ = blue line)
Shown here: ALL dataset. GSZ outperforms the others at bigger ranks. Similar results were obtained with the p53 dataset.

23 Comparison with other programs
– SignalPathway (green line), GSEA (cyan) and GSA (black) were selected for comparison
– Evaluation was again done using the biologically positive classes
– Comparing programs is less clear (more variables)
Shown here again: ALL dataset. Similar results with p53. GSZ outperforms others at large

24 Summary
– GSZ is a weighted over-representation score
– Mathematical link to many other popular methods
– Stable across GO class sizes and across gene list positions
– Good performance in artificial datasets
– Best performance in many evaluations with two real datasets

25 Other applications
– siRNA data vs. gene IDs (discussed)
– Linkage data vs. biological processes (discussed)
– BLAST result list vs. descriptions (in use)
– BLAST result list vs. GO classes (in use)

26 Warnings
– Quality of the gene expression data
– Enough samples for permutations
– Each gene should occur only once in the expression data
– Filter out genes without annotations (with GO data)
– Use column permutations
– Quality of the gene sets / annotations

27 Wake up!!

