The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand.


1 The Statistical Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

2 Glycolysis Pathway Glycolysis Clusters Clostridium acetobutylicum

3 Gene Clustering for Functional Inference in Bacterial Genomes The Use of Gene Clusters to Infer Functional Coupling, Overbeek et al., PNAS 96: 2896-2901, 1999.

4 Gene content and order are preserved rearrangement, mutation Similarity in gene content Neither content nor order is strictly preserved large scale duplication or speciation event original genome

5 “Evolution of gene order conservation in prokaryotes” Tamames, Genome Biology 2, 2001

6 “Evolution of gene order conservation in prokaryotes” Tamames, Genome Biology 2, 2001 Gene insertion/loss

7 “Evolution of gene order conservation in prokaryotes” Tamames, Genome Biology 2, 2001 Gene insertion/loss Local rearrangement

8 Two Possible Questions 1. Given a set of genes that we believe are functionally related, determine whether they cluster together spatially more than we would expect by chance. 2. Identify all significantly conserved gene clusters as a starting point for making functional inferences.

9 Two Possible Questions 1. Given a set of genes that we believe are functionally related, determine whether they cluster together spatially more than we would expect by chance (reference set scenario). 2. Identify all significantly conserved gene clusters as a starting point for making functional inferences (whole genome comparison).

10 Reference Set Scenario

11 Model of a genome –G = 1, …, n; an ordered set of n unique genes –assume genes do not overlap –chromosome breaks ignored

12 Model of a genome –G = 1, …, n; an ordered set of n unique genes –assume genes do not overlap –chromosome breaks ignored Reference gene scenario: –m genes of interest (in red) are pre-specified –want to find clusters of (a subset of) these genes Reference Set Scenario

13 Whole Genome Scenario Given: two genomes, G = 1, …, n and H = 1, …, n. Find all significant clusters of at least k homologs in close proximity in both genomes.

14 Outline What formalisms do we need to address these questions? –Definitions: formulate a cluster definition –Algorithms: identifying clusters in real data –Statistics: assess the significance of one or more clusters Reference set scenario Whole genome comparison Conclusion

15 Why develop a formal statistical model? Understand trends and verify that they match our expectations Choose parameters effectively Statistical tests for data analysis Typically researchers use randomization tests to estimate statistical significance

16 Cluster Definitions An intuitive notion of a cluster is a group of genes –occurring in close proximity –neither gene content nor order is strictly conserved Algorithms and statistics require a formal definition. –What properties are desirable? –Do existing definitions have these properties?

17 Possible Cluster Parameters –size: number of red genes in the cluster Example: cluster size ≥ 3 size = 3 genes

18 Possible Cluster Parameters –size: number of red genes in the cluster Example: cluster size ≥ 3 –length: number of genes between first and last red genes Example: cluster length ≤ 6 length = 6

19 Possible Cluster Parameters –size: number of red genes in the cluster Example: cluster size ≥ 3 –length: number of genes between first and last red genes Example: cluster length ≤ 6 length = 6

20 Possible Cluster Parameters –size: number of red genes in the cluster Example: cluster size ≥ 3 –length: number of genes between first and last red genes Example: cluster length ≤ 6 –density: proportion of red genes (size/length) Example: density ≥ 0.5 density = 6/11

21 Possible Cluster Parameters –size: number of red genes in the cluster Example: cluster size ≥ 3 –length: number of genes between first and last red genes Example: cluster length ≤ 6 –density: proportion of red genes (size/length) Example: density ≥ 0.5 density = 6/11

22 Possible Cluster Parameters –size: number of red genes in the cluster Example: cluster size ≥ 3 –length: number of genes between first and last red genes Example: cluster length ≤ 6 –density: proportion of red genes (size/length) –compactness: maximum gap between adjacent red genes gap ≤ 4 genes
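The four candidate parameters above can be computed directly from the positions of the red genes. The following is a minimal sketch, not code from the talk: the function name `cluster_parameters` is hypothetical, positions are assumed to be 1-based indices into the genome, length is assumed to count both end genes (which is consistent with the slide's "density = 6/11" example), and a gap is assumed to be the number of non-red genes between two adjacent red genes.

```python
def cluster_parameters(positions):
    """Return (size, length, density, max_gap) for red genes at the given positions.

    Assumptions (not stated in the slides): positions are integer gene indices,
    length counts both end genes, and a gap is the number of non-red genes
    between two adjacent red genes.
    """
    pos = sorted(positions)
    size = len(pos)
    length = pos[-1] - pos[0] + 1          # span from first to last red gene
    density = size / length                # proportion of red genes in the span
    gaps = [b - a - 1 for a, b in zip(pos, pos[1:])]
    max_gap = max(gaps, default=0)         # a single gene has no gaps
    return size, length, density, max_gap


# Six red genes spread over a span of eleven genes.
print(cluster_parameters([1, 2, 4, 7, 9, 11]))
```

With these assumptions the example call describes a cluster of size 6, length 11, density 6/11 and maximum gap 2.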

23 Max-Gap Cluster Commonly used in analysis of genomic data Desirable properties –Ensures minimum local density –Extensible: doesn’t artificially limit cluster length –Disjoint: clusters will not overlap gap ≤ g
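In the reference set scenario, max-gap clusters can be read off with a single pass over the sorted positions of the reference genes: a new cluster starts whenever the gap to the previous reference gene exceeds g. The sketch below is illustrative only; `max_gap_clusters` is a hypothetical name, and the gap is again assumed to be the number of intervening genes.

```python
def max_gap_clusters(ref_positions, g):
    """Split reference-gene positions into max-gap clusters.

    Two adjacent reference genes stay in the same cluster when at most g
    other genes lie between them (assumed gap definition).
    """
    clusters = []
    current = []
    for p in sorted(ref_positions):
        if current and p - current[-1] - 1 > g:
            clusters.append(current)       # gap too large: close the cluster
            current = []
        current.append(p)
    if current:
        clusters.append(current)
    return clusters


print(max_gap_clusters([1, 2, 5, 20, 22, 40], g=2))
```

Because every reference gene lands in exactly one group, the output also illustrates the two properties listed on the slide: clusters are disjoint, and nothing limits how long a cluster may grow.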

24 Outline Formalisms Reference set scenario Whole genome comparison Conclusion

25 Formalisms Definitions: formulate a cluster definition Algorithms: identify clusters in real data Statistics: assess the significance of a cluster

26 A Statistical Model Given –a genome: G = 1, …, n unique genes –a set of m reference genes –a maximum-gap size g Null hypothesis: –Random gene order Alternate hypotheses: –Evolutionary history –Functional selection

27 Statistics of Max-Gap Gene Clusters We provide –analytical and dynamic programming solutions –to determine cluster significance exactly –for the reference set scenario Hoberman, Sankoff and Durand. In "Proceedings of the RECOMB Satellite Workshop on Comparative Genomics", J. Lagergren, ed., Lecture Notes in Bioinformatics, Springer Verlag, in press. Hoberman, Sankoff and Durand. Submitted to RECOMB 2005.

28 Test Statistic: Complete Clusters The probability of observing all m reference genes in a max-gap cluster in G

29 Test Statistic: Incomplete Clusters The probability of observing at least h of the m reference genes in a max-gap cluster in G
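For very small genomes, both test statistics can be checked by brute force. Under the null hypothesis of random gene order, every choice of m positions out of n is equally likely to hold the reference genes, so the probability is the fraction of m-subsets that contain a max-gap cluster of at least h reference genes. This is not the analytical or dynamic programming solution from the talk, only an exhaustive sketch; the function names are hypothetical, m is assumed to be at least 1, and the gap is assumed to count intervening genes.

```python
from itertools import combinations
from math import comb


def largest_cluster_size(positions, g):
    """Size of the largest max-gap cluster among sorted reference positions."""
    best = run = 1
    for a, b in zip(positions, positions[1:]):
        run = run + 1 if b - a - 1 <= g else 1
        best = max(best, run)
    return best


def prob_cluster_at_least(n, m, g, h):
    """P(at least h of the m reference genes form a max-gap cluster).

    Exhaustive enumeration over all C(n, m) placements; only feasible for
    small n. Setting h = m gives the complete-cluster statistic.
    """
    hits = sum(
        1
        for pos in combinations(range(n), m)   # tuples come out sorted
        if largest_cluster_size(pos, g) >= h
    )
    return hits / comb(n, m)


print(prob_cluster_at_least(n=5, m=2, g=0, h=2))
```

In the example call, 4 of the 10 possible placements of two reference genes in a five-gene genome put them next to each other, so the probability is 4/10.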

30 Cluster significance [Figure: two plots of cluster significance, one for n = 1000, m = 50 and one for n = 500, h = m/2] n = number of genes in each genome; m = number of genes shared between the two genomes; g = maximum allowed gap size; h = size of cluster (e.g. number of red genes)

31 Significant Parameter Values (α = 0.0001) n = 500

32 Significant Parameter Values (α = 0.0001) n = 500

33 Outline Formalisms Reference set scenario Whole genome comparison Conclusion

34 Formalisms Definitions: formulate a cluster definition Algorithms: identify clusters in real data Statistics: assess the significance of one or more clusters

35 Whole genome comparison Find all sets of genes that form max-gap clusters in both genomes. g ≤ 10

36 Properties of Max-Gap Clusters for Whole Genome Comparison Clusters are locally dense in both genomes Clusters are still guaranteed to be disjoint. The definition is symmetric with respect to genome Most existing cluster algorithms are not symmetric!

37 If g = 2 There is no valid max-gap cluster of size two or three There is a valid max-gap cluster of size four Algorithms: Finding Max-Gap Clusters

38 A consequence of this is that a greedy iterative approach will not find all max-gap clusters –Specifically, larger clusters that don’t contain smaller ones will not be found Algorithms: Finding Max-Gap Clusters
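The failure of the greedy approach can be seen on a toy example by enumerating every gene set directly. The sketch below is a brute-force illustration, not the divide-and-conquer algorithm cited in the talk: it lists all sets of shared genes whose members are separated by at most g intervening genes in both genomes, without filtering for maximality. The function names and the two four-gene genomes are hypothetical, and the run time is exponential in the number of shared genes.

```python
from itertools import combinations


def is_chain(genes, genome, g):
    """True if the given genes have at most g other genes between neighbours."""
    index = {gene: i for i, gene in enumerate(genome)}
    pos = sorted(index[x] for x in genes)
    return all(b - a - 1 <= g for a, b in zip(pos, pos[1:]))


def common_chains(G, H, g, min_size=2):
    """All gene sets of size >= min_size that satisfy the max-gap rule in G and H."""
    shared = sorted(set(G) & set(H))
    found = []
    for k in range(min_size, len(shared) + 1):
        for subset in combinations(shared, k):
            if is_chain(subset, G, g) and is_chain(subset, H, g):
                found.append(subset)
    return found


# Hypothetical genomes: no pair of genes is adjacent in both orders.
G = [1, 2, 3, 4]
H = [2, 4, 1, 3]
print(common_chains(G, H, g=0))
```

For these two genomes with g = 0, no set of size two or three qualifies, yet the full set of four genes does. A method that grows clusters by extending smaller ones would therefore never reach it.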

39 There is an efficient divide-and-conquer algorithm to find all max-gap clusters (Bergeron et al, 2002) Since algorithms are generally not stated formally in application papers, we don’t know whether people are actually getting what they think they’re getting Algorithms: Finding Max-Gap Clusters

40 Formalisms Definitions: formulate a cluster definition Algorithms: identify clusters in real data Statistics: assess the significance of one or more clusters Work in Progress…

41 Statistics: Whole genome comparison What is the probability that at least k genes form a max-gap cluster in both genomes? g ≤ 10

42 Statistics: Whole genome comparison What is the probability that at least k genes form a max-gap cluster in both genomes? Assuming identical gene content, the probability of finding a max-gap cluster of size at least k is always one! g ≤ 10

43 An Example Example: g =1

44 An Example

45 Example: g =1 An Example A cluster of size k does not necessarily contain a cluster of size k-1

46 Example: g =1 An Example

47 When gene content is identical, there will always be a cluster of size n Example: g =1 An Example

48 When gene content is identical, there will always be a cluster of size n Therefore, for all k, there will always be a cluster of size at least k Example: g =1 An Example

49 When gene content is identical, there will always be a cluster of size n Therefore, for all k, there will always be a cluster of size at least k Therefore, the probability of finding a cluster of size at least k is always one! Example: g =1 An Example

50 Relaxing the Assumption of Identical Gene Content Assume only m of the n genes in each genome are shared If the longest run of “non-shared” genes is less than g then we are still guaranteed to find a complete cluster

51 More generally… Simulations of randomly ordered genomes show that large clusters may be very likely to occur merely by chance
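A simulation of this kind can be sketched in a few lines for the simplest case, a complete cluster that contains all m shared genes. The code below is an assumption-laden illustration rather than the authors' simulation: it places the m shared genes at uniformly random positions in each of two genomes of n genes, and counts a trial as a hit when no two neighbouring shared genes are separated by more than g non-shared genes in either genome. The function names, the default trial count and the seed are all hypothetical choices.

```python
import random


def all_gaps_within(positions, g):
    """True if neighbouring positions have at most g other genes between them."""
    pos = sorted(positions)
    return all(b - a - 1 <= g for a, b in zip(pos, pos[1:]))


def estimate_complete_cluster_prob(n, m, g, trials=1000, seed=0):
    """Monte Carlo estimate of P(all m shared genes form a max-gap cluster
    in both genomes) under random gene order."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        in_g = rng.sample(range(n), m)     # shared-gene positions in genome G
        in_h = rng.sample(range(n), m)     # independent positions in genome H
        if all_gaps_within(in_g, g) and all_gaps_within(in_h, g):
            hits += 1
    return hits / trials


print(estimate_complete_cluster_prob(n=200, m=50, g=20))
```

The order of the shared genes within each genome does not matter for a complete cluster, which is why only the sets of positions are sampled. Estimating the probability of an incomplete cluster of size at least k would additionally require a search procedure for clusters in each simulated pair of genomes.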

52 Unexpected Statistical Trends There can be a significant probability of finding a cluster that includes all homologous gene pairs The significance of a cluster of size k can be less than that of a cluster of size k-1 Probabilities are not monotonic Large clusters may not be significant n = 1000, m = 250, g=20 Probability of a cluster of size 250 ~ 50%

53 Outline Formalisms Reference set scenario Whole genome comparison Conclusion

54 Clusters Are Used in Many Other Applications Inferring functional coupling of genes in bacteria (Overbeek et al 1999) Recent polyploidy in Arabidopsis (Blanc et al 2003) Sequence of the human genome (Venter et al 2001) Duplications in Arabidopsis through comparison with rice (Vandepoele et al 2002) Duplications in Eukaryotes (Vision et al 2000) Identification of horizontal transfers (Lawrence and Roth 1996) Evolution of gene order conservation in prokaryotes (Tamames 2001) Ancient yeast duplication (Wolfe and Shields 1997) Genomic duplication during early chordate evolution (McLysaght et al 2002) Comparing rates of rearrangements (Coghlan and Wolfe 2002) Genome rearrangements after duplication in yeast (Seoighe and Wolfe 1998) Operon prediction in newly sequenced bacteria (Chen et al 2004) Breakpoints as phylogenetic features (Blanchette et al 1999)...

55 Max-Gap Clusters are Especially Common Inferring functional coupling of genes in bacteria (Overbeek et al 1999) Recent polyploidy in Arabidopsis (Blanc et al 2003) Sequence of the human genome (Venter et al 2001) Duplications in Arabidopsis through comparison with rice (Vandepoele et al 2002) Duplications in Eukaryotes (Vision et al 2000) Identification of horizontal transfers (Lawrence and Roth 1996) Evolution of gene order conservation in prokaryotes (Tamames 2001) Ancient yeast duplication (Wolfe and Shields 1997) Genomic duplication during early chordate evolution (McLysaght et al 2002) Comparing rates of rearrangements (Coghlan and Wolfe 2002) Genome rearrangements after duplication in yeast (Seoighe and Wolfe 1998) Operon prediction in newly sequenced bacteria (Chen et al 2004) Breakpoints as phylogenetic features (Blanchette et al 1999)...

56 Formal statistical models allow us to –understand trends and verify that they match our expectations, –choose parameters effectively –conduct statistical tests for data analysis Formal statistical models require –a formal cluster definition –a search procedure to find clusters These issues are more complicated than they might seem!

57 Summary Results: statistical tests of significance for max-gap clusters –Reference set scenario –Genome comparison (work in progress) We need to –explicitly consider the cluster properties we would like our definitions to satisfy –rigorously evaluate whether our definition meets these requirements –carefully prove that our search procedures match our stated definitions

58 Thank You

