Gene Ontology as a tool for the systematic analysis of large-scale gene-expression data Stefan Bentink Joint group meeting Klipp/Spang
Overview Microarrays and the Gene Ontology (GO) database Scoring differential gene expression in GO groups Checking scores against different null hypotheses Sample data (two types of breast cancer) and results
Microarrays: sample scheme [Figure: genes A, B, C, D; genes B and C are transcribed to mRNA (differential gene expression)] RNA isolation and synthesis of cDNA with labeled nucleotides (reverse transcription) Hybridisation of the labeled cDNA (B, C) to the array Fluorescence indicates that gene B and gene C are transcribed
Microarrays: comparative analysis Samples: tissue I (1, 2, ...) and tissue II (1, 2, ...) For each gene (gene 1, gene 2, gene 3, ...): mean in tissue I vs. mean in tissue II => t-value t-values => ranking?
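The per-gene comparison on this slide can be sketched as follows. This is a minimal illustration, not the code used for the talk: the expression values are made up, and the Welch (unequal-variance) form of the two-sample t statistic is an assumption, since the slide does not say which variant was used.

```python
import math
import statistics

def t_value(group_a, group_b):
    """Welch two-sample t statistic (unequal variances; an assumption)."""
    mean_a, mean_b = statistics.fmean(group_a), statistics.fmean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / standard_error

# Hypothetical expression values: gene -> (tissue I samples, tissue II samples)
expression = {
    "gene 1": ([5.1, 4.9, 5.3], [5.0, 5.2, 4.8]),
    "gene 2": ([2.0, 2.2, 1.9], [4.1, 3.8, 4.0]),
    "gene 3": ([7.5, 7.9, 7.7], [6.9, 7.0, 7.2]),
}

t_values = {gene: t_value(a, b) for gene, (a, b) in expression.items()}

# Rank genes by the absolute size of the t-value, largest first.
ranking = sorted(t_values, key=lambda gene: abs(t_values[gene]), reverse=True)
```

Ranking by the absolute t-value treats up- and downregulation alike; the sign of each t-value still tells the direction.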
How to interpret the data? Long list of significant genes Which genes are of interest? Solution: pooling genes into functional classes provides a general overview The Gene Ontology database provides such a functional classification
The Gene Ontology database
GO is a database of terms for genes Known genes are annotated to the terms Terms are connected as a directed acyclic graph Levels represent the specificity of the terms
The Gene Ontology database [Figure: example graph from the root to a specific term: Gene Ontology; Molecular function; Apoptosis regulator and Enzyme activator; Apoptosis activator and Protease activator; Apoptotic protease activator]
The Gene Ontology database Every child-term is a member of its parent-term GO contains three different sub-ontologies: Molecular function, Biological process, Cellular component Unique identifier for every term: GO: (root=Gene Ontology)
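Because every child-term is a member of its parent-term, a gene annotated to a specific term also belongs to every ancestor of that term. The sketch below shows one way to build the GO-groups under that rule. The parent links are one assumed reading of the apoptosis example, and the gene names and function names are hypothetical.

```python
def ancestors(term, parents):
    """Return all terms reachable from `term` by following parent links."""
    found = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

def genes_per_term(direct_annotations, parents):
    """Map every term to the genes annotated to it or to any of its descendants."""
    groups = {}
    for gene, terms in direct_annotations.items():
        for term in terms:
            for member_of in {term} | ancestors(term, parents):
                groups.setdefault(member_of, set()).add(gene)
    return groups

# Assumed parent links (a term may have more than one parent in a DAG).
parents = {
    "apoptotic protease activator": ["apoptosis activator", "protease activator"],
    "apoptosis activator": ["apoptosis regulator"],
    "protease activator": ["enzyme activator"],
    "apoptosis regulator": ["molecular function"],
    "enzyme activator": ["molecular function"],
    "molecular function": ["Gene Ontology"],
}

# Hypothetical direct annotations.
direct_annotations = {
    "gene A": ["apoptotic protease activator"],
    "gene B": ["enzyme activator"],
}

groups = genes_per_term(direct_annotations, parents)
```

Here "gene A" ends up in all seven groups, while "gene B" only appears in "enzyme activator" and the two terms above it.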
Gene Ontology and microarrays Hypothesis: Functionally related, differentially expressed genes should accumulate in the corresponding GO-group. Problem: Find a method that scores the accumulation of differential gene expression in a node of the Gene Ontology.
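Gene Ontology and microarrays [Figure: expression matrix with genes in rows and samples in columns (tissue types 1 and 2); genes are mapped to the GO-groups GO:1, GO:2, GO:3, GO:4] P-value for every gene by a two-sample t-test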
GO: Scoring methods 1. Number of significant genes in a GO-group 2. Sum of the negative logarithms of all p-values in the group (Σ -log P) 3. sup|P(n) - F(n)| according to Kolmogorov-Smirnov
The p-value cdf: cumulative distribution function of the t-statistic t < 0 => p = cdf(t); t > 0 => p = 1 - cdf(t) => p in (0, 0.5] m = 2*p => m in (0, 1]
Sum-of-log score (Pavlidis, Lewis, Noble 2001; Zien, Küffner, Zimmer, Lengauer) 2*p -> 1 => -log(2*p) -> 0 Small p-values, high score
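A minimal sketch of the sum-of-log score for one GO-group, assuming the one-sided p-values lie in (0, 0.5] so that 2*p lies in (0, 1]. The natural logarithm is an assumption, and the function name is hypothetical.

```python
import math

def sum_log_score(one_sided_p_values):
    """Sum of -log(2*p) over all genes of a GO-group (natural log assumed)."""
    return sum(-math.log(2.0 * p) for p in one_sided_p_values)

# Genes without any shift (p = 0.5) contribute nothing to the score.
no_signal = sum_log_score([0.5, 0.5, 0.5])

# A few small p-values dominate the score.
some_signal = sum_log_score([0.5, 0.005, 0.0005])
```

Because every gene contributes at least zero, large groups tend to get larger scores; this is why the score is compared with a background distribution for groups of the same size rather than read on its own.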
Kolmogorov-Smirnov score Hypothesis: the calculated p-values (multiplied by 2) are uniformly distributed between 0 and 1. [Figure: empirical vs. theoretical distribution of the values on the interval from 0 to 1, once for evenly spread values and once for values accumulating near 0] S = sup|P(n) - F(n)| P(n): p-values for genes that fall into a GO-group. F(n): uniformly distributed values between 0 and 1.
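The score can be sketched as the standard one-sample Kolmogorov-Smirnov statistic against the uniform distribution on [0, 1]. This is an assumed reading of the formula on the slide; the function name is hypothetical, and the input is the list of doubled p-values (m = 2*p) of one GO-group.

```python
def ks_score(values):
    """Largest gap between the empirical cdf of `values` and the uniform cdf on [0, 1]."""
    ordered = sorted(values)
    n = len(ordered)
    # The empirical cdf jumps from i/n to (i+1)/n at the i-th smallest value x,
    # while the uniform cdf equals x there, so both sides of the jump are checked.
    return max(
        max((i + 1) / n - x, x - i / n)
        for i, x in enumerate(ordered)
    )

evenly_spread = ks_score([0.125, 0.375, 0.625, 0.875])
piled_up_near_zero = ks_score([0.01, 0.02, 0.03, 0.04])
```

Evenly spread values give a small score, values piled up near 0 give a score close to 1. Unlike the sum-of-log score, this one reacts to the shape of the whole distribution and not only to a few extreme p-values.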
Null hypotheses 1. The significant genes (according to Bonferroni: α=0.05/n) are distributed over the GO-groups by chance 2. The existing differential gene expression is distributed over the GO-groups by chance 3. There is no differential gene expression in a GO-group
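The count of significant genes behind the first null hypothesis can be sketched as below. The slide gives the Bonferroni level α = 0.05/n; taking n as the number of genes tested is an assumption, the p-values are made up, and the function name is hypothetical.

```python
def count_significant(group_p_values, n_tests, alpha=0.05):
    """Number of genes in a GO-group whose p-value is below alpha / n_tests."""
    threshold = alpha / n_tests
    return sum(1 for p in group_p_values if p < threshold)

# Hypothetical GO-group with three genes, out of 7000 genes tested in total.
n_significant = count_significant([1e-7, 2e-4, 0.3], n_tests=7000)
```

With 7000 tests the threshold is about 7.1e-6, so a gene with a p-value of 2e-4 does not count even though it may well be differentially expressed. This is the information the other two scores try to keep.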
Checking H0 by permutation [Figure: data matrix with genes in rows and samples in columns] Permutation of rows: the mapping of p-values into GO-groups is randomized. H0: distribution of differential gene expression Permutation of columns: the level of p-values is randomized. H0: no differential gene expression in a GO-group
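The column permutation can be sketched as follows: the tissue labels of the samples are shuffled, the per-gene statistic is recomputed, and the group score is compared with the observed one. To keep the example short and free of dependencies, the absolute difference of group means stands in for the t-test p-value and the plain sum stands in for the group score; both are stand-ins, not the scores from the talk. All names and data are hypothetical.

```python
import random
import statistics

def abs_mean_difference(row, labels):
    """Stand-in per-gene statistic: |mean(type 1) - mean(type 2)|."""
    first = [x for x, label in zip(row, labels) if label == 1]
    second = [x for x, label in zip(row, labels) if label == 2]
    return abs(statistics.fmean(first) - statistics.fmean(second))

def column_permutation_fraction(rows, labels, n_perm=1000, seed=0):
    """Fraction of label permutations whose group score exceeds the observed one."""
    rng = random.Random(seed)

    def group_score(current_labels):
        return sum(abs_mean_difference(row, current_labels) for row in rows)

    observed = group_score(labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign tissue types to the samples
        if group_score(shuffled) > observed:
            exceed += 1
    return exceed / n_perm

# Hypothetical GO-group with two genes, three samples per tissue type.
rows = [[1, 1, 1, 5, 5, 5], [2, 2, 2, 9, 9, 9]]
labels = [1, 1, 1, 2, 2, 2]
fraction = column_permutation_fraction(rows, labels)
```

Shuffling the labels destroys any real difference between the tissue types, so the background scores reflect a data set without differential expression.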
Checking H0 by permutation 1000 random permutations => background distributions H0: distribution of significant genes => randomizing GO-groups (rows) H0: distribution of all p-values => randomizing GO-groups (rows) H0: level of p-values => permutation of columns
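For a single GO-group, permuting the rows amounts to drawing a random gene set of the same size from all genes. The sketch below builds the background distribution that way and reports the fraction of permutations whose score exceeds the observed one. The function names, the example score, and the p-values are hypothetical.

```python
import math
import random

def neg_log_sum(p_values):
    """Example score: sum of negative (natural) logarithms of the p-values."""
    return sum(-math.log(p) for p in p_values)

def row_permutation_fraction(group_genes, gene_p_values, score_fn, n_perm=1000, seed=0):
    """Fraction of random gene sets of equal size that score higher than the group."""
    rng = random.Random(seed)
    observed = score_fn([gene_p_values[gene] for gene in group_genes])
    all_values = list(gene_p_values.values())
    exceed = 0
    for _ in range(n_perm):
        # A random draw without replacement mimics a shuffled gene-to-group mapping.
        background = rng.sample(all_values, len(group_genes))
        if score_fn(background) > observed:
            exceed += 1
    return exceed / n_perm

# Hypothetical p-values for ten genes; the group holds the two smallest ones.
gene_p_values = {
    "g1": 0.001, "g2": 0.002, "g3": 0.1, "g4": 0.2, "g5": 0.3,
    "g6": 0.4, "g7": 0.5, "g8": 0.6, "g9": 0.7, "g10": 0.8,
}
fraction = row_permutation_fraction(["g1", "g2"], gene_p_values, neg_log_sum)
```

The p-values themselves stay fixed under this permutation, so the test only asks whether the existing differential expression is concentrated in the group.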
Methods (summary) Data => p-values => three scores: number of significant genes, sum of -log P, sup|P(n) - F(n)| Check against 1000 permutations of rows (GO-groups) Check against 1000 permutations of columns (samples => level of p-values)
Results: Data (Breast Cancer) Two major subclasses: estrogen receptor positive (ER+) and estrogen receptor negative (ER-) Estrogen receptor positive: susceptible to Tamoxifen, slightly better survival rate Great molecular differences between the two types
Results: Data (Breast Cancer) Data: 25 ER+, 24 ER- Array: Affymetrix HuGeneFL ~ 7000 genes, ~ 4000 annotated to GO-terms Data were normalized by variance stabilization (Heydebreck et al. 2001)
Results: Pre-conditions A GO-group is considered significant if fewer than 5% of the random permutations exceed its score Only GO-groups with more than 5 and fewer than 1000 genes were taken into account
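The two pre-conditions can be combined into one filter, sketched below. The inputs are assumed to be the group sizes and, per group, the fraction of permutations that exceeded the observed score; the term names, values, and function name are hypothetical.

```python
def significant_groups(group_sizes, exceed_fractions,
                       min_genes=5, max_genes=1000, level=0.05):
    """GO-groups with more than 5 and fewer than 1000 genes and an exceed fraction below 5%."""
    return sorted(
        term
        for term, size in group_sizes.items()
        if min_genes < size < max_genes and exceed_fractions[term] < level
    )

group_sizes = {"term A": 5, "term B": 37, "term C": 37, "term D": 1000}
exceed_fractions = {"term A": 0.0, "term B": 0.01, "term C": 0.05, "term D": 0.0}

selected = significant_groups(group_sizes, exceed_fractions)
```

Both size limits and the 5% level are strict here, following the wording "more than", "fewer than" and "fewer than 5%".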
Results: Number of significant genes According to the pre-conditions 16 GO-groups were found
Results: Permutation of rows (distribution hypothesis) [Figure panels: Sum of -log P; Kolmogorov-Smirnov]
Results: Permutation of columns (differential gene-expression hypothesis) [Figure panels: Sum of -log P; Kolmogorov-Smirnov]
Results The column permutation leads to a very low background distribution => many "significant" GO-groups May help to find functional groups without differential gene expression Different scoring methods seem to be complementary, as indicated by the results of the row permutation
Results: Permutation of the rows Sum of log: 44 GO-groups were found (5% cond.,...) KS-score: 77 GO-groups were found (5% cond.,...) GO: M-Phase of mitotic cell-cycle (37 genes)
Results: Comparing the scoring methods (from the row permutation) A: counting of significant genes in GO-groups; B: Kolmogorov-Smirnov; C: sum of logarithms [Venn diagram of A, B and C] A: 16; B: 77; C: 43 A and B: 3; A and C: 13; C and B: 13; A, B and C: 3 C without A: 30; B without A: 74
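The derived counts on this slide follow from the set sizes and overlaps by simple set arithmetic, which can be checked as below. Only the numbers from this slide are used; the variable names are hypothetical.

```python
# Set sizes and overlaps as given on the slide.
size_a, size_b, size_c = 16, 77, 43
a_and_b, a_and_c, b_and_c = 3, 13, 13
a_and_b_and_c = 3

# Groups found by one method but not by counting significant genes (A).
c_without_a = size_c - a_and_c
b_without_a = size_b - a_and_b

# Inclusion-exclusion: groups found by at least one of the three methods.
found_by_any = (size_a + size_b + size_c
                - a_and_b - a_and_c - b_and_c
                + a_and_b_and_c)
```

Most groups found by B or C are not found by A, which is what makes the scoring methods complementary.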
Browsing the results
Results: Interesting GO-term (M-Phase) Contains a couple of interesting proliferative genes (p-value ~5*10^-4 => "not significant") E.g.: polo-like kinase, t-value: -3.45; p-value: 5.59*10^-4 Would not have been found by a single-gene approach Correlation with the estrogen receptor could be found in the literature (Wolf et al., 2000)
Summary/outlook GO provides a general view on large-scale gene-expression data Less deregulated but very interesting genes could be found Third null hypothesis => differential gene expression over a wide range of genes (outlook: which GO-groups contain no differential gene expression) No bias of scores by top-level genes (outlook: leaving out top-level genes for scoring) Possible modification of scoring methods: up- and downregulation