Presentation is loading. Please wait.

Presentation is loading. Please wait.

A Statistical Framework for Spatial Comparative Genomics Thesis Proposal Rose Hoberman Carnegie Mellon University, August 2005 Thesis Committee.

Similar presentations


Presentation on theme: "A Statistical Framework for Spatial Comparative Genomics Thesis Proposal Rose Hoberman Carnegie Mellon University, August 2005 Thesis Committee."— Presentation transcript:

1 A Statistical Framework for Spatial Comparative Genomics Thesis Proposal Rose Hoberman Carnegie Mellon University, August 2005 Thesis Committee Dannie Durand (chair) Andrew Moore Russell Schwartz Jeffrey Lawrence (Univ. of Pittsburgh, Dept. of Biological Sciences) David Sankoff (Univ. of Ottawa, Dept. of Math & Statistics) I want to start by explaining what I mean by spatial comparative genomics

2 Genome: the complete set of genetic material of an organism or species
Noncoding DNA: Large stretches of DNA with unknown function. CCCCCCGACACTTCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAGAGGGGCGGGCCCCCGCCCCCCGCCCCCCCCCCCCCCCCC CCCCCCCTGTGAAGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTCTCCCCGCCCGGGGGCGGGGGGCGGGGGGGGGGGGGGGGG Over time mutations occur When comparing distantly related genomes, we’ll see similarity primarily in sequences that play a functional role, because mutations in those regions are often detrimental, and thus selected against. Thus, rather than comparing the sequence directly, I use an abstraction in which a chromosome is represented simply as a list of genes. A gene then serves as a marker for a particular chromosomal segment. Regulatory regions: Regions of DNA where regulatory proteins bind Genes: DNA sequences that code for a specific functional product, most commonly proteins.

3 Genome Evolution speciation
species 2 species 1 following both speciation and large scale duplication, two genome copies will diverge over time due to a wide range of evolutionary processes acting on the genome at different scales local point mutations and small insertions and delections. Larger scale genome rearrangements, such as transolocations, transpositions, and inversions of large regions of a chromosome, shuffle genes withing and between chromosomes, and scramble gene order with respect to the ancestral genome. Gene duplications and losses also occur. Genomes will start out highly similar, but will evolve independently by both mutation and rearrangement Sequence Mutation + Chromosomal Rearrangements

4 Chromosomal Rearrangements
Species 1 1 2 3 3 4 5 6 7 17 16 19 20 18 8 9 11 12 13 10 14 15 17 16 19 20 18 8 9 11 12 13 10 14 15 4 5 3 1 6 2 4 3 1 2 8 9 7 11 12 10 13 14 15 17 16 19 20 18 13 14 15 17 16 19 18 20 Most people understand how the DNA sequence evolves by local mutation, but are less aware of the many rearrangements that cause changes at the chromosomal level. Duplications Species 2 Inversions Loss

5 My focus: Spatial Comparative Genomics
Understanding genome structure, especially how the spatial arrangement of elements within the genome changes and evolves. In the face of all these different rearrangements, how can we compare genomes, and what can such comparisons tell us about the genome structure, and how it changes and evolves.

6 Terminology Homologous: related through common ancestry
Orthologous: related through speciation Paralogous: related through duplication Species 1 1 2 3 4 5 7 17 16 19 20 18 3 15 14 13 12 11 10 9 8 orthologs 1 1 2 2 3 3 4 5 6 8 9 7 11 12 10 1 2 3 4 13 14 15 17 16 20 Species 2 paralogs

7 An Essential Task for Spatial Comparative Genomics
Identify homologous blocks, chromosomal regions that correspond to the same chromosomal region in an ancestral genome 1 2 3 4 5 7 17 16 19 20 18 3 15 14 13 12 11 10 9 8 1 1 2 2 3 3 4 4 5 6 8 9 7 11 12 10 1 2 3 4 13 14 15 17 16 20 My thesis: how to find and statistically validate homologous blocks

8 More distantly related segments:
However, as the gene order and the gene complement of duplicated regions diverge progressively due to insertions, deletions and rearrangements, it becomes increasingly difficult to distinguish remnants of common ancestral gene order from coincidental similarities in genomic organization. Traces of homology will be evident only as gene clusters… In this context, statistical and algorithmic issues are more challenging, and even how to define what we’re looking for is difficult. Gene Clusters: similar gene content, but neither gene content nor order is strictly conserved

9 Gene Clusters are Used in Many Types of Genomic Analysis
Inferring functional coupling of genes in bacteria (Overbeek et al 1999) Recent polyploidy in Arabidopsis (Blanc et al 2003) Sequence of the human genome (Venter et al 2001) Duplications in Arabidopsis through comparison with rice (Vandepoele et al 2002) Duplications in Eukaryotes (Vision et al 2000) Identification of horizontal transfers (Lawrence and Roth 1996) Evolution of gene order conservation in prokaryotes (Tamames 2001) Ancient yeast duplication (Wolfe and Shields 1997) Genomic duplication during early chordate evolution (McLysaght et al 2002) Comparing rates of rearrangements (Coghlan and Wolfe 2002) Genome rearrangements after duplication in yeast (Seoighe and Wolfe 1998) Operon prediction in newly sequenced bacteria (Chen et al 2004) Breakpoints as phylogenetic features (Blanchette et al 1999) ... highlight some of the variation... in no particular order (but just said we’re not going to require conserved order...) Commonly used definition Here I’ve shown some of the recent genomic analyses that have come out in the past few? Years The ones in yellow all appear to use the max-gap definition Most use monte-carlo (show in color, italics?) Wouldn’t it be nice if there were analytical statistical tests for max-gap clusters?

10 Spatial Comparative Genomics
reconstruct the history of chromosomal rearrangements infer an ancestral genetic map build phylogenies transfer knowledge A common approach to building phylogenetic trees is to analyze a single gene that’s present in all the species of interest. However, the history of a gene may not follow the history of the species, due to horiz. Transfer or duplications. What we’d really like to do is build trees based on evidence from the whole genome. One way to do that is to compare spatial organization. Transfer knowledge from one species to another. Mouse and rat are common model organisms in which much biological experimentation is conducted, whereas obviously there are many exper. We can’t do in human. Transfer information about gene function or interactions among genes or diseases processes from a model organism to other species. Spatial info can help by helping to identifying the corresponding elements, and also when it’s appropriate to transfer certain types of knowledge from one species to another. Very simplistic example is Downs syndrome, which results from a trisomy of chromosome 21, which is scrambled in mouse, vs studying the x chromsome, which has internal rearrangements but it almost entirely conserved. Guillaume Bourque et al. Genome Res. 2004; 14:

11 Spatial Comparative Genomics
Function Snel, Bork, Huynen. PNAS 2002 Consider evolution as an enormous experiment Unimportant structure is randomized or lost Exploit evolutionary patterns to infer functional associations in bacteria, functionally related genes tend to cluster together on the chromosome clusters can be used to infer functional associations between genes If we see conserved structure in distantly related organisms, we can infer that their must be some functional reason that structure was conserved.

12 Outline Introduction and Applications
Formal framework for gene clusters Genome representation Gene homology mapping Cluster definition Introduction to Statistical Issues Preliminary work: Testing cluster significance Proposed work

13 Basic Genome Model a sequence of unique genes
distance between genes is equal to the number of intervening genes gene orientation unknown a single, linear chromosome physical distances often unknown, and differ substantially between organisms. eliminates the need to model gene-rich and gene poor regions differently any type of marker, including hrnrd basic model only relies on type of info available from a linkage map, which except for model species is still the only type of data available for most eukaryotes Extend this model in my proposed work

14 Gene Homology Identification of homologous gene pairs Assumptions
generally based on sequence similarity still an imprecise science preprocessing step Assumptions matches are binary (similarity scores are discarded) each gene is homologous to at most one other gene in the other genome explain color lines here biologically an equivalence relation, but in practice detectable homology is often

15 Where are the gene clusters?
Intuitive notions of what clusters look like Enriched for homologous gene pairs Neither gene content nor order is perfectly preserved Need a more rigorous definition Or a larger, but less dense cluster

16 Cluster Definitions gap= 3 size = 4 Descriptive: length=10
common intervals r-window max-gap Constructive: LineUp CloseUp FISH length=10 Cluster properties order size length density gaps say definitions from literature

17 Max-Gap: a common cluster definition
A set of genes form a max-gap cluster if the gap between adjacent genes is never greater than g on either genome gene rich gene poor 

18 no formal statistical model for max-gap clusters
Why Max-Gap? Allows extensive rearrangement of gene order Allows limited gene insertion and deletions Allows the cluster to grow to its natural size It’s the most widely used in genomic analyses no formal statistical model for max-gap clusters

19 Outline Introduction and Applications
Formal framework for gene clusters Introduction to statistical issues Preliminary work: Testing cluster significance Proposed work Significance of a cluster depends on how it was found

20 Detecting Homologous Chromosomal Segments
Formally define a “gene cluster” Devise an algorithm to identify clusters Verify that clusters indicate common ancestry ...modeling alg to decide where to put boxes stats to determine which are significant Our goal here is to calculate the probability of seeing a gene cluster with certain properties by chance. A region is enriched for paralogous pairs, but it also has some mismatches. As it gets more and more degraded, how can we decide…. I hope to show that studying the statistical issues provide insight into algorithmic and modeling issues as well. ...algorithms ...statistics

21 Statistical Testing Provides Additional Evidence for Common Ancestry
How can we verify that a gene cluster indicates common ancestry? True histories are rarely known Experimental verification is often not possible Rates and patterns of large-scale rearrangement processes are not well understood We have an algorithm, returns clusters, how do we know what we found has any biological significance. In many areas, functional questions, experiments in the lab. Synthetic data. Statistical Testing Provides Additional Evidence for Common Ancestry

22 Statistical Testing Goal: distinguish ancient homologies from chance similarities Hypothesis testing Alternate hypothesis: shared ancestry Null hypothesis: random gene order Determine the probability of seeing a cluster by chance under the null hypothesis An example… If we cannot reject a null hypothesis of random gene order, it’s unlikely we’ll be able to reject any more biologically motivated null hypothesis.

23 Whole Genome Self-Comparison
McLysaght, Hokamp, Wolfe. Nature Genetics, 2002. Compared all human chromosomes to all other chromosome to find gene clusters Identified 96 clusters of size 6 or greater Chromosome 17 10 genes duplicated out of ~100 Notice that like my cartoon it’s rearranged. Two or more clusters of similar genes found in distinct regions on the same genome are often presented as evidence of large scale duplication. Alternate hypothesis: a set of homologous genes arose through a single duplication event Null hypothesis: homologs arose through multiple, independent duplications  homologs are distributed randomly in the genome 29 genes Chromosome 3 Could two regions display this degree of similarity simply by chance?

24 McLysaght, Hokamp, Wolfe. Nature Genetics, 2002.
Chromosome 17 Clusters with similarity to human chromosome 17 Are larger clusters more likely to occur by chance? Are there other duplicated segments that their method did not detect?

25 Cluster Significance: Related Work
Randomization tests most common approach generally compare clusters by size Very simple models Excessively strict simplifying assumptions Overly conservative cluster definitions simple models developed mostly in the context of trying to analyze a particular dataset, weren’t meant to be generally applicable models Citations in proposal

26 Cluster Significance: Related Work
Calabrese et al, 2003 statistics introduced in the context of developing a heuristic search for clusters Durand and Sankoff, 2003 definition: m homologs in a window of size r My thesis max-gap definition Others: Pseudo-Gibbs sampler (Mau et al, 2003?) For a (weighted) sampling of homolog pairs, how often do the genes form a cluster with order conserved? FISH (Calabrese et al, 2004?) heuristic definition, assumes gene family size is binomially distributed (xxx et al, 2005) Conserved intervals (gap = 0)

27 Outline Introduction and Applications
Formal framework for gene clusters Introduction to statistical issues Preliminary work: max-gap cluster statistics reference set whole-genome comparison Proposed work Significance of a cluster depends on how it was found

28 Cluster statistics depend on how the cluster was found
1 2 3 4 5 7 17 16 19 20 18 3 15 14 13 12 11 10 9 8 1 1 2 2 3 3 4 4 5 6 8 9 7 11 12 10 1 2 3 4 13 14 15 17 16 20 Whole genome comparison: find all (maximal) sets of genes that are clustered together in both genomes.

29 Cluster statistics depend on how the cluster was found
Reference set: does a particular set of genes cluster together in one genome? complete cluster: contains all genes in the set incomplete cluster: contains only a subset Do see people presenting this model. Also, it’s mathematically simpler and helpful for more complex WGC case.

30 Preliminary results: Max-Gap Cluster Statistics
Reference set complete clusters complete clusters with length restriction incomplete clusters Whole genome comparison upper bound lower bound Hoberman, Sankoff, and Durand. Journal of Computational Biology 2005. Hoberman and Durand. RECOMB Comparative Genomics 2005. Hoberman, Sankoff, and Durand. RECOMB Comparative Genomics 2004.

31 Reference set, complete clusters
Given: a genome: G = 1, …, n unique genes a set of m genes of interest (in blue) m = 5 Do all m blue genes form a significant cluster?

32 Reference set, complete clusters
g = 2 m = 5 Test statistic: the maximum gap observed between adjacent blue genes P-value: the probability of observing a maximum gap ≤ g, under the null hypothesis

33 Compute probabilities by counting
All possible unlabeled permutations The problem is how to count this Permutations where the maximum gap ≤ g conceptual space of ways of positioning yellow and blue dots basic approach is counting things in event space of all possible ways of organizing, how many of them have m genes clustered with a gap no bigger than m, p-value

34 number of ways to start a cluster, e. g
number of ways to start a cluster, e.g. ways to place the first gene and still have w-1 slots left w = (m-1)g + m

35 ways to place the remaining m-1 blue genes, so that no gap exceeds g
number of ways to start a cluster, e.g. ways to place the first gene and still have w-1 slots left ways to place the remaining m-1 blue genes, so that no gap exceeds g g

36 ways to place the remaining m-1 blue genes, so that no gap exceeds g
number of ways to start a cluster, e.g. ways to place the first gene and still have w-1 slots left ways to place the remaining m-1 blue genes, so that no gap exceeds g edge effects The number of ways of adding gaps between any two adjacent genes is g+1, so the total number of ways of adding gaps between m genes is… w = (m-1)g + m

37 Counting clusters at the end of the genome
Gaps are constrained: And sum of gaps is constrained: l = w-1 l = m

38 A known solution: l < w g1 g2 g3 gm-1
number of ways of packing the window is the same as dice, well known and it’s this lovely equation

39 Counting clusters at the end of the genome
Gaps are constrained: And sum of gaps is constrained: l = w-1 l = m

40 Cluster Length l = w-1 l = m … m m+1 w-2 w-1 w Line of Symmetry
d(m,g,m) + d(m,g,m+1) … d(m,g,w-2) + d(m,g,w-1) l = w-1 d(m,g,m) + d(m,g,m+1) … d(m,g,w-2) d(m,g,m) + d(m,g,m+1) + + … d(m,g,m) d(m,g,m+1) + d(m,g,m) d(m,g,m+1) + d(m,g,m) l = m Line of Symmetry

41 Exploiting Symmetry = = = l= w m m+1 w-1 m+2 w-2 g g g 1 g-1 1 g-1 g 2

42 Cluster Length (g+1)m-1 l = w-1 (g+1)m-1 (g+1)m-1 l = m … m m+1 w-2
d(m,g,m) + d(m,g,m+1) … d(m,g,w-2) + d(m,g,w-1) (g+1)m-1 l = w-1 d(m,g,m) + d(m,g,m+1) … d(m,g,w-2) (g+1)m-1 d(m,g,m) + d(m,g,m+1) + (g+1)m-1 + … d(m,g,m) d(m,g,m+1) + d(m,g,m) d(m,g,m+1) + d(m,g,m) l = m

43 Adding edge effects… Starting Starting Ways to place positions
near end Starting positions Ways to place remaining m-1 Hoberman, Durand, Sankoff. Journal of Computational Biology 2005.

44 Probability of a complete cluster
n = 500

45 Using statistics to choose parameter values
Significant Parameter Values (α = 0.001) n = 500 Treacherous waters (white/dragon), significant parameter space (black) you want to design an experiment to find clusters say what gap size should we choose? e.g. if only 5 genes are of interest, then gap size must be less than

46 Preliminary Results: Max-Gap Cluster Statistics
Reference set complete clusters complete clusters with length restriction incomplete clusters Whole genome comparison upper and lower bounds Hoberman, Sankoff, Durand. Journal of Computational Biology 2005. Hoberman and Durand. RECOMB Comparative Genomics 2005. Hoberman, Sankoff, Durand. RECOMB Comparative Genomics 2004.

47 Whole genome comparison
A surprising result If gene content is identical, the probability of a max-gap cluster is 1 (regardless of the allowed gap size) In

48 Whole Genome Comparison: m ≤ n
Two genomes of n genes with with m homologous genes pairs g 3 g 3 What is the probability of observing a maximal max-gap cluster of size exactly h, if the genes in both genomes are randomly ordered? A cluster is maximal if it is not a subset of a larger cluster consider all permutations of random gene order h genes that form a maximal max-gap cluster in both genomes, all genes are of interest here really incomplete is really question of interest given 2 genomes, want to find all clusters of at least size k want to find all signif clusters in whole genome comparison find them, all different sizes. significant? what’s a question you might ask i say in both genomes make all genome comparison dots bigger

49 A constructive approach
All configurations of two genomes Configurations that contain a cluster of exactly size h Enumerate all permutations that do not contain any clusters of size h or larger ??

50 Constructive Approach
Number of configurations that contain a cluster of exactly size h number of ways to place m-h remaining genes so they do not extend the cluster number of ways to place h genes so they form a cluster in both genomes not enumerating literally consider how many ways to

51 A tricky case… gap > 1 g = 1 h = 3 gap > 1 Where can we place the pink and green genes so that they do not extend this cluster of size three? With this placement, the cluster cannot be extended lot of similarities between alg to find the clusters and how we do counting this particular case there’s this catch-22 that causes problems both in alg and stats. more difficult for stat? top-down doesn’t work? they also comment on this, they were also able to solve using top-down approach, I can’t do that look at this interesting example. this is why can’t use greedy approach before placed white genes around me, alg not always used in practice: in at least one case a greedy heuristic was used instead let’s try a different approach, address this clusters are not “nested”: a cluster of size k doesn’t necessarily contain a cluster of size k-1

52 A tricky case… gap = 1 g = 1 h = 3 gap = 1 Moving genes further away from the cluster may make them more likely to extend the cluster lot of similarities between alg to find the clusters and how we do counting this particular case there’s this catch-22 that causes problems both in alg and stats. more difficult for stat? top-down doesn’t work? they also comment on this, they were also able to solve using top-down approach, I can’t do that look at this interesting example. this is why can’t use greedy approach before placed white genes around me, alg not always used in practice: in at least one case a greedy heuristic was used instead let’s try a different approach, address this clusters are not “nested”: a cluster of size k doesn’t necessarily contain a cluster of size k-1

53 My whole-genome comparison results
I derived upper and lower bounds on the probability of observing a cluster containing h homologs, via whole genome comparison Lower bound: guarantees no tricky cases Upper bound: a few tricky cases sneak in Hoberman, Sankoff, Durand. Journal of Computational Biology 2005.

54 Whole-genome comparison cluster statistics
n=1000, m=250 g=10 g=20 Cluster Probability When g=10, we see just what we expect. As the cluster size increases, the prob of finding a cluster of that size decreases, given random gene order. And this trend continues even for larger clusters. Interestingly, we see different behavior when g=20. Bounds aren’t as tight for middle cluster sizes, which is unfortunate, but what the bounds do show us is that, like before, we see nonmonotonic probabilities, which is disturbing. Intuitively, researchers seek a cluster definition that guarantees that larger clusters (containing more matching genes) are always more significant, which suggests that the max-gap definition is less desirable than it may at first seem. Cluster size

55 E. coli vs B. Subtilis clusters above the orange line
Algorithm: Bergeron et al, 02 Statistics: Hoberman et al, 05 Complete cluster doesn’t form until g=110 Typical operon sizes I stopped the orange line at g=20 because null model predicts by that point should see only one large cluster. Gaps within operons are small, so gaps between operons must be larger. clusters above the orange line are significant at the .001 level Under null hypothesis, by g=25 all genes should form a single cluster

56 Summary of preliminary work
Developed statistical tests using a combinatoric approach reference region whole genome comparison Some surprising results Results raise concerns about current methods used in comparative genomics studies

57 Larger clusters do not always imply greater significance
A max-gap cluster containing many genes may be more likely to occur by chance than one containing few genes Of concern because randomization studies use cluster size as test statistic. In traditional hypothesis testing, the probability of observing value of the test statistic always decreases as the text statistic becomes more extreme. Suggests that this is bad test statistic, or perhaps that using only a gap constraint but not evaluating the total cluster density is problematic.

58 Algorithms and Definition Mismatch
Greedy, bottom-up algorithms will not find all max-gap clusters There is an efficient divide-and-conquer algorithm to find maximal max-gap clusters (Bergeron et al, WABI, 2002) May miss many clusters Could draw incorrect conclusions about whether disordered clusters exist There’s not standard software to find these, and many groups state informally that gap small, and don’t describe their method in detail…. example of challenges for bioinformatics community, presented in algorithms workshop

59 Extending the Model Directions for generalization Circular chromosomes
Multiple chromosomes Genome self-comparison Gene order and orientation Gene families different species

60 Outline Introduction and Applications
Formal framework for gene clusters An introducton to statistical issues Preliminary work: Testing cluster significance Proposed work

61 Proposed Work Outline Generalizing the model
At least one of the following: Joint detection of orthologous genes and chromosomal regions Finding and assessing clusters in multiple genomes Detecting selection for spatial organization Validation

62 Joint Identification of Orthologous Genes and Chromosomal Regions
The identification of orthologous genes is a prerequisite for a marker-based approach Orthology identification is often difficult to determine from gene sequence alone is an important unsolved research problem can be improved by incorporating genomic context Of interest in its own right

63 An example: Which gene is the true ortholog?
Most similar Least similar 1st of 4 Species 2 2nd of 4 1st of 1 3rd of 4 1st of 1 1st of 1 1st of 1 1st of 1 4th of 4 Query Gene Species 1

64 Solution: Identify orthologs and gene clusters simultaneously
Problem: for more diverged genomes, unambiguous orthologs will be sparse and clusters will be more rearranged Solution: Identify orthologs and gene clusters simultaneously Identify homologous genes Similar genomic context Find gene clusters

65 Work that combines sequence similarity and genomic context
Bansal, Bioinformatics 99 Kellis et al, J Comp Biol 04 Bourque et al, RECOMB Comp Genomics 05 Chen et al, ACM/IEEE Trans Comput Biol and Bioinf 05 Limitations No flexible cluster definitions No statistical approaches Little real evaluation others

66 Possible computational approaches:
Expectation Maximization (EM) treat ortholog assignment as a hidden variable Maximal bipartite matching use an objective function that incorporates both sequence similarity and spatial clustering

67 Proposed Work Generalizing the model At least one of the following:
Joint detection of orthologous genes and chromosomal regions Finding and assessing clusters in multiple genomes Detecting selection for spatial organization Validation

68 Comparing Multiple Genomes Simultaneously
Comparison of multiple genomes offers significantly more power to detect highly diverged homologous segments Arabidopsis thaliana Homologous regions may share few or no genes explain venn diagram! Rice Arabidopsis thaliana Vandepoele et al, 2002

69 Current Approaches Identify clusters based on conserved pairs of genes, using heuristics Limitation: A highly rearranged cluster may have no pairs in proximity

70 Current Approaches Identify clusters with conserved gene order,
ordered (Vandepoele et al, TIG 02; Wolf et al, Genome Res 01 Limitation: rearranged clusters will not be detected

71 Current Approaches Will lead to a reduction
Search for max-gap clusters, but require the cluster to be found in its entirety in all genomes Will lead to a reduction in power as more genomes are added max gap (Beal et al, Theoret. Comput. Sci. 04; Chen, NAR 04) …No formal statistics

72 Initial Investigations
Modeling: Maximum gap between genes with a match in any of the regions must be small Algorithms: how to find such clusters Statistics: choice of test statistic; i.e., how to weight genes that occur in only a subset of the regions

73 Proposed Work Generalizing the model At least one of the following:
Joint detection of orthologous genes and chromosomal regions Finding and assessing clusters in multiple genomes Detecting selection for spatial organization Validation

74 Tests for Selective Pressure on Spatial Organization
Preliminary work: Null hypothesis: random gene order Alternate hypothesis: common ancestry Proposed work: Null hypothesis: common ancestry Alternate hypothesis: functional selection In bacteria, conservation of spatial organization may indicate selective pressure to maintain gene proximity How do we know whether grouped because of… Probability of finding a cluster under the null hypothesis now depends on the phylogenetic distance between the species

75 Tests for selective pressure must consider phylogenetic distance
E. coli Salmonella Quite likely to occur by chance. Haemophilus influenzae B. subtilis Less likely to occur by chance.

76 Current Approaches Discard closely related genomes, and test against random gene order E. coli Salmonella Haemophilus influenzae Comparative genomics approaches for detection of functional clusters is an active area of research Limitations Do not analyze clusters of multiple genes Do not take phylogenetic distance between species into account Rarely statistical in nature B. subtilis

77 Current Approaches Some formal statistical tests, but based on gene pairs only. Limitation: considering only pairs of genes could result in a loss of power Comparative genomics approaches for detection of functional clusters is an active area of research Limitations Do not analyze clusters of multiple genes Do not take phylogenetic distance between species into account Rarely statistical in nature

78 Detecting Selective Pressure on Spatial Organization Initial Explorations
Searching for evidence of selective pressure to maintain non-operon structure in bacteria Locations of clusters with respect to origin and terminus left and right arm of chromosome functional classification

79 Proposed Work Generalizing the model At least one of the following:
Joint detection of orthologous genes and chromosomal regions Finding and assessing clusters in multiple genomes Detecting selection for spatial organization Validation

80 How Should Gene Cluster Statistics be Validated?
No established benchmarks True evolutionary histories are rarely known Rearrangement processes are not yet understood We’d like to evaluate Discriminatory power Parameter selection strategies Possible strategies depend on specific problem Synthetic data Hand-curated ortholog databases Databases of experimentally verified operons

81 Initial Investigations
timeline S 9 O 10 N 11 D 12 J 1 F 2 M 3 A 4 5 6 7 8 2005 2006 Loose ends Model Extensions Selected Problem(s) And Validation Initial Investigations Writing

82 Acknowledgements My Thesis Committee
Barbara Lazarus Fellowship The Sloan Foundation The Durand Lab

83 Advantages of an analytical approach
Analyzing incomplete datasets Principled parameter selection Efficiency Understanding statistical trends Insight into tradeoffs between definitions

84 The Max-Gap Definition is the Most Widely Used in Genomic Analyses
Blanc et al 2003, recent polyploidy in Arabidopsis Venter et al 2001, sequence of the human genome Overbeek et al 1999, inferring functional coupling of genes in bacteria Vandepoele et al 2002, duplications in Arabidopsis through comparison with rice Vision et al 2000, duplications in Eukaryotes Lawrence and Roth 1996, identification of horizontal transfers Tamames 2001, evolution of gene order conservation in prokaryotes Wolfe and Shields 1997, ancient yeast duplication McLysaght02, genomic duplication during early chordate evolution Coghlan and Wolfe 2002, comparing rates of rearrangements Seoighe and Wolfe 1998, genome rearrangements after duplication in yeast Chen et al 2004, operon prediction in newly sequenced bacteria Blanchette et al 1999, breakpoints as phylogenetic features ...


Download ppt "A Statistical Framework for Spatial Comparative Genomics Thesis Proposal Rose Hoberman Carnegie Mellon University, August 2005 Thesis Committee."

Similar presentations


Ads by Google