How to Define a Cluster?

Constructive (algorithmic) definition
  e.g. greedy agglomerative clustering: merge anything that’s within a mean distance of g
  Difficult to assess the characteristics of the resulting clusters, so it is hard to ensure that the clusters have the properties we desire (e.g. most such definitions are not disjoint or symmetric)
  Difficult to reason about the statistical behavior of the clusters

Mathematical definition
  e.g. a set of at least k red genes such that the distance between adjacent red genes is never more than g
  Easier to reason about cluster properties, but still difficult to select a definition that captures all desired properties
  Easier to reason about statistical behavior, but a “natural” definition may still have unexpected statistical properties
  Need to devise a separate search procedure to find all such clusters, and need to verify that the search is complete

Directions for Future Work
Statistics for more complex models
  Gene families
  Multiple genomes
Design of more sensitive cluster definitions
  to better capture the properties of homologous regions
  to better discriminate true homologous regions from chance similarities

Significant Parameter Values (α = 0.0001)
Notes: suppose you want to design an experiment to find clusters. What gap size should we choose? e.g. if only 5 genes are of interest, then gap size must be less than

Given the location of the first red gene, how many ways are there to place the remaining m-1 red genes so that they form a max-gap cluster?
Example: m = 5, with placements ranging from all gaps zero to all gaps equal to g
(g+1)^(m-1) ways to choose m-1 gaps between 0 and g

How many ways can we form a max-gap cluster of size m with length = l ?
[Figure: a cluster of m red genes with gaps g1, g2, g3, …, gm-1 and total length l < w]
Number of ways of choosing m-1 gaps between 0 and g so that their sum = l-m
= Number of ways of rolling m-1 dice with faces labeled 0 to g so that the sum of the faces is l-m
Notes: the number of ways of packing the window is the same as the dice problem, which is well known.
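The dice count can be sketched in a few lines of Python. This is a minimal illustration, not the talk's own code: the function name `count_gap_sequences` is hypothetical, and gaps are assumed to be whole numbers of genes. It tallies the ways m-1 dice with faces 0 to g can sum to l-m with a small dynamic program.

```python
def count_gap_sequences(m, g, length):
    """Count max-gap clusters of m red genes with exactly `length` genes.

    Equivalent to rolling m-1 dice with faces 0..g so the faces sum to
    length - m (the number of non-red genes inside the cluster).
    """
    target = length - m
    if m < 1 or target < 0:
        return 0
    # ways[s] = number of ways the dice rolled so far sum to s
    ways = [1] + [0] * target
    for _ in range(m - 1):
        nxt = [0] * (target + 1)
        for s, count in enumerate(ways):
            if count:
                # Each new gap adds between 0 and g genes, never past target.
                for face in range(min(g, target - s) + 1):
                    nxt[s + face] += count
        ways = nxt
    return ways[target]


if __name__ == "__main__":
    m, g = 5, 2
    w = m + (m - 1) * g  # longest possible cluster
    for length in range(m, w + 1):
        print(length, count_gap_sequences(m, g, length))
```

Summing the counts over every length from m to w recovers all (g+1)^(m-1) gap choices, which is a quick sanity check on the recurrence.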

Counting clusters at the end of the genome
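Gaps are constrained: each gap is between 0 and g
And the sum of the gaps is constrained: a cluster whose first red gene is a distance l < w from the end of the genome can have length at most l, so the gaps can sum to at most l-m
(cluster lengths of m, …, w-1 are affected)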

Compute probability by counting
All configurations of two genomes
Configurations with at least one shared cluster of size exactly h
Number of configurations with h genes in a cluster and m-h genes not in the cluster
Enumerate all permutations that do not contain any clusters of size h or larger

Why is counting harder in this case?
There are no other homologs within g of this cluster on both genomes, yet this cluster is not maximal
Greedy agglomerative approach doesn’t find all clusters
There is an efficient divide-and-conquer algorithm to find them (Bergeron, Corteel, Raffinot 2002)
Clusters are not “nested”: a cluster of size k doesn’t necessarily contain a cluster of size k-1
Notes: When I’m trying to count, I might think I can start from a cluster and look in a neighborhood in both directions. I didn’t show that in detail, but that was the approach before. Let me show you an example of why that’s not going to work here. There are a lot of similarities between the algorithm to find the clusters and how we do the counting. In this particular case there’s a catch-22 that causes problems for both the algorithm and the statistics (more difficult for the statistics?). They also comment on this; they were able to solve it using a top-down approach, but I can’t do that here. Look at this interesting example: this is why we can’t use the greedy approach from before, where I placed white genes around the cluster. The algorithm is not always used in practice: in at least one case a greedy heuristic was used instead. Let’s try a different approach to address this.

Alternative Approach
Enumerate all permutations that do not contain any clusters of size h or larger
Dynamic programming
Iteratively place “red” or “blue” genes, making sure not to create any cluster of size h or larger by judicious placement of “blue” genes

Detecting Homologous Chromosomal Segments
A marker-based approach
Formally define a “cluster”
Devise an algorithm to identify gene clusters
Verify that clusters are statistically significant
Notes: if we keep this slide, then add animations: a cluster box, the algorithm to decide where to put the boxes, and the statistics to determine which are significant.

Clusters are Commonly Used in Genomic Analyses
Blanc et al 2003, recent polyploidy in Arabidopsis
Venter et al 2001, sequence of the human genome
Overbeek et al 1999, inferring functional coupling of genes in bacteria
Vandepoele et al 2002, duplications in Arabidopsis through comparison with rice
Vision et al 2000, duplications in Eukaryotes
Lawrence and Roth 1996, identification of horizontal transfers
Tamames 2001, evolution of gene order conservation in prokaryotes
Wolfe and Shields 1997, ancient yeast duplication
McLysaght et al 2002, genomic duplication during early chordate evolution
Coghlan and Wolfe 2002, comparing rates of rearrangements
Seoighe and Wolfe 1998, genome rearrangements after duplication in yeast
Chen et al 2004, operon prediction in newly sequenced bacteria
Blanchette et al 1999, breakpoints as phylogenetic features
...
Need to assess cluster significance
Notes: Here I’ve shown some of the recent genomic analyses that have come out in the past few years, in no particular order. Highlight some of the variation (but I just said we’re not going to require conserved order). The ones in yellow all appear to use the max-gap definition, a commonly used definition. Most use Monte Carlo (show in color, italics?). Wouldn’t it be nice if there were analytical statistical tests for max-gap clusters?

Whole-genome comparison
[Figure: probability of observing a maximal max-gap cluster of size h by chance; two panels, n=1000, m=250, g=20 and n=1000, m=250, g=10]

Summary of Reference Region
Complete clusters
  test statistic is the maximum gap observed
Incomplete clusters
  gap size is a fixed parameter
  test statistic: cluster size
  Probabilities are not monotonic with respect to cluster size
Notes: I’ve shown you some results; let me just briefly summarize them and how they fit together. we used, but note that we could have used

Example from Genomic Analyses
(the same list of studies as on the “Clusters are Commonly Used in Genomic Analyses” slide)
Notes: I’m not expecting you to read all these, but this is just to show that there are lots of papers that use this sort of approach.

Detecting Homologous Chromosomal Segments
using genes as genomic markers
Similarity in gene content
Neither gene content nor order is strictly preserved
Formally define a “cluster”
Devise an algorithm to identify gene clusters
Verify that clusters are statistically significant

Max-Gap Cluster
[Figure: a cluster in which every gap between adjacent genes of interest is at most g]
Commonly used in genomic analyses
Several nice properties (cite one of my papers)
  Efficient algorithms to find them (Bergeron, Corteel, Raffinot 2002)
  Expandable, ensures minimum local and global density
Notes: This is their algorithm for paralogon identification as I understand it. They “Compared blastp hits with those of neighboring proteins, scanning them for matches within the same remote chromosomal location”. The “same chromosomal location” was determined by limiting the gap size, d, which is defined as the number of unrelated genes between two paralogs. So here you see two different chromosomes in the same genome. Two regions share pairs of paralogs. The distance between any two adjacent paralogs in the same region cannot be more than d. MHW chose a value of d=30 empirically. In the Monte Carlo test, compare the number of clusters containing m paralogs in the observed data and in the shuffled data. There are times when Monte Carlo is not the right thing: selecting parameter values takes time, and you need something to shuffle.

Can we find a significant cluster of (a subset of) the m homologs?
Given:
  a genome: G = 1, …, n unique genes
  a set of m “special” genes
Notes: This is a warm-up question, also helpful for whole-genome comparison. Mention gene order before this (cluster vs. conserved region). Introduce incomplete clusters here? Discuss order?
I am going to start out by constructing a very simple model of a gene cluster and then extend it to address a variety of more complex biological situations. Our definition is based on the following scenario: we observe a region containing m genes on one chromosome, and paralogs of those m genes are found in a window of size r on another chromosome in the same species (or at a distant location on the same chromosome). Here we have five genes in a region of interest, and we find homologs in a window of size ten. Notice that in this case the cluster is characterized by two parameters: the number of paralogs and the size of the window.
Let’s look at just one model, which is the reference region. Stated formally, in this model we treat the genome as an ordered list of genes, ignoring intergene distances. We want to calculate the probability of observing m genes in a window of size r. We can calculate the probability of this event occurring in a random genome using simple combinatorics. To determine the probability that the m prespecified genes appear in a window of size r in any order, note that at least one endpoint of the window must be occupied by a gene in our set of m genes. The remaining m-1 genes may appear anywhere in the remaining r-1 slots. We have n-r full-size windows in the genome. The second term deals with edge effects at the end of the genome. The numerator represents all possible ways of placing m-1 genes in r-1 slots. The normalization factor in the denominator is the number of ways m-1 genes can be placed anywhere in the genome. If the order is also prespecified, then the probability of observing the cluster by chance is reduced by a factor of m!
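The slide's formula itself is not reproduced in these notes, so the following is only one way to carry out the counting they describe, under stated assumptions: placements are enumerated by the slot of the leftmost gene, a "full window" term is added to an "edge effects" term, and the total is normalized by all C(n, m) placements of the m genes. The function names are hypothetical, and the grouping of terms may differ from the slide's. A brute-force enumeration is included as a check.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def count_windowed_placements(n, m, r):
    """Count placements of m genes in n slots that fit in a window of r slots."""
    r = min(r, n)
    if m < 1 or m > r:
        return 0
    # Leftmost gene has r-1 slots to its right: n-r+1 possible starting slots.
    full_windows = (n - r + 1) * comb(r - 1, m - 1)
    # Edge effects: fewer than r-1 slots remain between the gene and the end.
    edge = sum(comb(slots_left, m - 1) for slots_left in range(r - 1))
    return full_windows + edge


def window_probability(n, m, r):
    """Probability that m prespecified genes, placed at random, span <= r slots."""
    return Fraction(count_windowed_placements(n, m, r), comb(n, m))


def brute_force_count(n, m, r):
    """Direct enumeration, feasible only for small n."""
    return sum(1 for c in combinations(range(n), m) if c[-1] - c[0] + 1 <= r)


if __name__ == "__main__":
    print(window_probability(10, 3, 5))
    print(brute_force_count(10, 3, 5) == count_windowed_placements(10, 3, 5))
```

If the gene order is also prespecified, the notes say the probability drops by a further factor of m!; that factor is not applied here.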

Whole Genome Comparison
Given g, what is the probability that h genes form a max-gap cluster in both genomes, if the genes are randomly ordered?
Notes: Here all genes are of interest. Given two genomes, we want to find all clusters of at least size k. In a whole-genome comparison we find clusters of many different sizes: which of them are significant? That is a question you might ask. Stress that the cluster must appear in both genomes. (Make all the genome comparison dots bigger.)

If g=2, the maximal cluster is of size h=4
There is no set of two or three genes that form a max-gap chain in both genomes
Greedy agglomerative approach doesn’t find all clusters
There is an efficient divide-and-conquer algorithm to find them (Bergeron, Corteel, Raffinot 2002)
Clusters are not “nested”: a cluster of size k doesn’t necessarily contain a cluster of size k-1
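The lack of nesting can be checked by brute force on a toy case. The two four-gene orders below are a hypothetical example (not the genomes from the slide's figure) and use g=0 instead of g=2 to keep things small: the four genes are contiguous in both genomes, yet no pair or triple of them is. The function names are assumptions made for this sketch.

```python
from itertools import combinations


def is_max_gap_chain(genes, genome, g):
    """True if the given genes have at most g other genes between neighbors."""
    pos = sorted(genome.index(x) for x in genes)
    return all(b - a - 1 <= g for a, b in zip(pos, pos[1:]))


def shared_chains(genome1, genome2, g, size):
    """All gene sets of the given size that form a max-gap chain in both genomes."""
    return [
        c
        for c in combinations(sorted(genome1), size)
        if is_max_gap_chain(c, genome1, g) and is_max_gap_chain(c, genome2, g)
    ]


if __name__ == "__main__":
    # Hypothetical toy genomes: same four genes, different orders.
    genome1 = list("abcd")
    genome2 = list("bdac")
    for size in (2, 3, 4):
        print(size, shared_chains(genome1, genome2, g=0, size=size))
```

A greedy procedure that grows clusters from shared pairs would never get started on this input, which is the catch-22 the slide describes.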

[Formula annotations: ways to choose the first gene and still have w-1 genes left; ways to place the m-1 gaps; edge effects]
Notes: Make it very clear that the first gene in the window must be red. Stress that we are enumerating by starting point!

m=5 h=3 g=1

What are the processes by which genomes evolve?
What do homologous regions look like?
What features best discriminate homologous regions from random similarities?
Cluster definitions and/or algorithms
Statistical tests of cluster significance

n = 1000, m = 250, g=20

Let S(m,g,n) denote the set of clusters of size m

[Figure: grid of cluster length (m, m+1, …, w-1, w) against distance from the end of the genome (m, m+1, …, w-2, w-1), with X marks in some cells; one region of the grid is labeled “unconstrained”]

Enumerating all Possible Complete Clusters
At how many positions in the genome can we place the leftmost red gene?
Given the location of the first red gene, how many ways are there to place the remaining m-1 red genes so that each gap is no greater than g?
Notes: mention that since the cluster is complete, obviously only one cluster is possible. Either there is some window of size w that contains all the red genes, or there isn’t a cluster.

Possible Cluster Parameters
size: number of red genes in the cluster
  Example: cluster size ≥ 3
length: number of genes between first and last red genes
  Example: cluster length ≤ 6
density: proportion of red genes (size/length)
  Example: density ≥ 0.5 (in the figure, density = 6/11)
maximum gap: maximum number of blue genes between adjacent red genes
  Example: maximum gap ≤ 4 (in the figure, gap ≤ 4 genes)
Notes: Discuss distance in base pairs. Say “our model”: genes don’t overlap, and we ignore chromosome breaks. Discuss the window model and others. Max-gap is used in practice more than any other definition, yet no one has done much to investigate how this model behaves.
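The four parameters are easy to compute from the positions of the red genes. The sketch below is illustrative only: the function name is hypothetical, and length is assumed to count every gene from the first red gene to the last, inclusive, which is what makes a density such as 6/11 possible. The example positions are made up to reproduce that density and are not taken from the slide's figure.

```python
from fractions import Fraction


def cluster_parameters(red_positions):
    """Return size, length, density and maximum gap for one cluster.

    red_positions: integer gene indices of the red genes in the cluster.
    Assumes length is measured inclusively from first to last red gene.
    """
    pos = sorted(red_positions)
    size = len(pos)
    length = pos[-1] - pos[0] + 1
    # Blue genes strictly between each pair of adjacent red genes.
    gaps = [b - a - 1 for a, b in zip(pos, pos[1:])]
    return {
        "size": size,
        "length": length,
        "density": Fraction(size, length),
        "max_gap": max(gaps, default=0),
    }


if __name__ == "__main__":
    # Hypothetical cluster: 6 red genes spread over 11 consecutive genes.
    print(cluster_parameters([0, 2, 3, 6, 8, 10]))
```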

P-value: the probability that the maximum gap observed between the m genes in a randomly ordered permutation of n genes is no greater than g
= (the number of ways of positioning the m red genes in the genome so that the maximum gap between them is no greater than g) / (the total number of ways of positioning the m red genes in the genome)
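The counting behind this p-value can be sketched as follows, assuming a linear genome of n slots and a total of C(n, m) equally likely placements. The function names are hypothetical. Placements are enumerated by the slot of the leftmost red gene; near the end of the genome the sum of the gaps is capped by the remaining room, which is the edge effect. A brute-force enumeration is included for checking small cases.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def count_complete_clusters(n, m, g):
    """Number of ways to place m red genes in n slots with every gap <= g."""
    max_sum = (m - 1) * g
    # ways[s] = number of sequences of m-1 gaps, each in 0..g, that sum to s
    ways = [1] + [0] * max_sum
    for _ in range(m - 1):
        nxt = [0] * (max_sum + 1)
        for s, count in enumerate(ways):
            if count:
                for gap in range(min(g, max_sum - s) + 1):
                    nxt[s + gap] += count
        ways = nxt
    total = 0
    for start in range(n):  # slot of the leftmost red gene
        room = n - start - m  # blue genes that still fit inside the cluster
        if room < 0:
            break
        total += sum(ways[: min(room, max_sum) + 1])
    return total


def complete_cluster_p_value(n, m, g):
    """P(max gap <= g) for m red genes placed uniformly at random in n slots."""
    return Fraction(count_complete_clusters(n, m, g), comb(n, m))


def brute_force_count(n, m, g):
    """Direct enumeration, feasible only for small n."""
    return sum(
        1
        for c in combinations(range(n), m)
        if all(b - a - 1 <= g for a, b in zip(c, c[1:]))
    )


if __name__ == "__main__":
    print(complete_cluster_p_value(10, 3, 2))
    print(float(complete_cluster_p_value(1000, 5, 20)))
```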

An example: testing for large-scale duplication
Alternate hypothesis: a set of homologous genes arose through a single duplication event
Null hypothesis: homologs arose through multiple, independent duplications ⇒ homologs are distributed randomly in the genome
Test statistic: ?
Notes: The homologs are all close together in this genome and in that genome. Is that surprising?

in practice test statistic is often based on intuitive beliefs about what clusters will look like
should be selected to maximize discriminatory power

Max-gap clusters are no longer nested
If g=1, there is a cluster of size = 5 and size = 7 …but no max-gap cluster of size 6

Detecting Homologous Chromosomal Segments
A marker-based approach
Identify location of all genes in genome
Determine gene homology
Formally define a “cluster”
Devise an algorithm to identify gene clusters in data
Verify that clusters are statistically significant

Model of a genome G = 1, …, n; an ordered set of n unique genes
assume genes do not overlap
chromosome breaks ignored

Possible Extensions
Tandem duplications
Gene families
Extensions for prokaryotic genomes
  Gene orientation
  Circular genomes
  Physical distance between genes
Whole genome comparison

Gene Cluster Statistics
Rose Hoberman Joint work with Dannie Durand and David Sankoff

Spatial Comparative Genomics
Understand genome evolution
Construct comparative maps
Transfer knowledge from model organisms
Reconstruct chromosomal rearrangements, learn rearrangement rates
Infer ancestral genetic map
Phylogeny reconstruction
Genome self-comparisons to detect ancient whole-genome duplications
Determine gene function and regulation in bacteria
Predict operons
Identify horizontal transfers
Learn relationship between spatial organization and functional selection
PREREQUISITE: Identification of Homologous Segments

Conserved Segments
Distinct chromosomal regions with identical gene content and order.
[Figure: genes P Q R C D E F A B S T U V]
Notes: This results in conserved segments like this one: two identical sets of homologous genes, in the same order, in distinct chromosomal regions. The yeast genome contains many blocks like this, and the order is so well conserved that you can pick them out by eye. The data provide compelling evidence for a recent whole genome duplication. However, the events under study in vertebrates are much older. Small-scale mutations can result in changes in gene content and local disruptions in gene order. For example…

Genome duplication and divergence
An Example:
[Figure: genes S T U V C D E F A B P Q R A B S T U V P Q R C D E F, labeled “Local mutations”]
Notes: However, small-scale mutations can result in changes in gene content and local disruptions in gene order. For example…

Gene Clusters
Distinct chromosomal regions with similar gene content. Gene content and order are not strictly preserved.
[Figure: genes P Q R C X D E F S T U V C E D F A B S T U V Y]
Notes: The results of this kind of mutation are gene clusters: distinct chromosomal regions with similar gene content, but not exactly the same genes, in which gene order is not preserved. For example, these red regions both contain C, D, E and F, but the lower region also contains gene X, and D and E are not in the same order.

1. Sequence anchors are identified through direct sequence comparison
2. Nearby anchors are chained together
Pevzner, Tesler Genome Research 2003

Possible Extensions
Tandem duplications
Gene families
Extensions for prokaryotic genomes
  Gene orientation
  Circular genomes
  Physical distance between genes
Exact statistics for whole genome comparison
More than two genomes

Acknowledgements
Claire D’Aire, Smell U
Seymour Butts, CMU
Hugo First, Cliff College
NHGRI, David and Lucille Packard Foundation

Gene Families
Given: a reference region containing m genes, is it significant to find m paralogs with a maximum gap of g?
Notes: This is a very simple model, and it addresses the issue of cluster density that was a problem in the paralogon definition, but in order to make it realistic we need to extend it to take more biological scenarios into account. One is gene families. Suppose this green gene has been duplicated more than once. We may not be able to determine which paralog is most closely related. In that case we need to consider the possibility that it matches either one of these green genes. This could result in two possible clusters. We have two windows of size ten, each of which contains m homologs to the reference region. This causes a problem because now there is more than one cluster that matches the reference region.

“Evolution of gene order conservation in prokaryotes”
Tamames, Genome Biology 2, 2001

Conserved Segments
Most strict definition: distinct chromosomal regions with identical gene content and order.
[Figure: genes P Q R C D E F A B S T U V]

Gene Clusters
More flexible definition: distinct chromosomal regions with similar gene content. Gene content and order are not strictly preserved.
[Figure: genes P Q R C X D E F S T U V C E D F A B S T U V Y]
Notes: This is a more flexible, intuitive definition. There are many different formal definitions that constrain different features.

Gene Clusters vs. Conserved Segments
Conserved segments: identical gene content and order.
Gene clusters: similar gene content. Order is not strictly conserved.

Tests for whole genome comparison
Aggregate cluster counts: given a minimum size and a maximum gap, is it significant to observe k clusters?
Individual clusters: what’s the probability of observing at least one max-gap cluster of size at least k?
probability = 1

Can we find a significant cluster of the red genes?
Given:
  a genome: G = 1, …, n unique genes
  a set of m special genes
How do we formally define a cluster?

Gene Clusters Provide Evidence of Large-Scale Duplication
[Figure from Simillion, Cedric et al. (2002) Proc. Natl. Acad. Sci. USA 99, Copyright ©2002 by the National Academy of Sciences]
Notes: Say what this is evidence for: understanding genome evolution, understanding gene function, differential gene loss, divergence of function…

Statistical Tests
Reference region model with m genes of interest
  Complete and incomplete clusters
  Relationship between cluster parameters and significance
Implications for whole-genome comparison
Notes: present, discuss, conclude with. Discuss some of the challenges of extending. Analytical???

n = 1000, m = 250

Probability of Observing a Complete Cluster
Notes: as g increases, the probability increases; as m increases, prob. Describe the axes and point out the problematic zone (10^-5).

“Synteny of orthologous genes conserved in mammals, snake, fly, nematode and fission yeast.”
Trachtulec and Forejt et al., Mammalian Genome 12: 227-231, 2001.
[Figure panels: mouse, nematode, human, fly, fission yeast]
Notes: This is a study of a gene cluster that is conserved in genomes which span the eukaryote lineage from yeast to mammals. The figure is hard to see. The point I am emphasizing here is that the cluster contains roughly a dozen genes and appears in many different genomes, with variations in gene content and order from genome to genome.

What cluster definition should we use? How do we evaluate significance?
Notes: list proposals here? How we evaluate significance depends on the definition; sometimes people use shuffling.
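A shuffling test can be sketched as below. This is a hedged illustration rather than any particular paper's procedure: it assumes the test statistic is whether the m genes of interest form a complete max-gap cluster, and it draws the m positions uniformly at random, which is equivalent to shuffling the gene order and reading off where those genes land. The function name, the trial count and the fixed seed are all assumptions.

```python
import random


def monte_carlo_p_value(n, m, g, trials=20000, seed=0):
    """Estimate P(all m genes have adjacent gaps <= g) under random gene order."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = 0
    for _ in range(trials):
        # Positions of the m genes of interest in one shuffled genome.
        pos = sorted(rng.sample(range(n), m))
        if all(b - a - 1 <= g for a, b in zip(pos, pos[1:])):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # For n=10, m=3, g=2 the exact value is 54/120 = 0.45.
    print(monte_carlo_p_value(10, 3, 2))
```

The estimate cannot resolve probabilities much smaller than 1/trials, which is one reason analytical tests are attractive when very small significance levels are needed.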

Max-gap Cluster
[Figure: two regions, each labeled with its length l]
Notes: Remember that the definition of paralogon used in this paper is a pair of regions that share paralogous genes, where the distance between any two adjacent paralogs in the region is <= d. Paralogons are then classified by the number of paralogs, m, that they contain. Here we have two regions with four paralogs. One of them is of size four while the other is of size 13. In general, the size of a region ranges from m to roughly (m-1) times d. So for a gap size of 30, a paralogon of six paralogs could range in size from 6 to 156 (6 + 5 × 30).

McLysaght, Hokamp, Wolfe, Nature Genetics, 2002
Spatial Evidence for Duplication: an excess of regions that share significant similarity in gene content and order. [Figure: human chromosome 17 with matching gene clusters on other chromosomes.] This is an example of what their data looks like. This shows all clusters with six or more duplicates on chromosome 17 in human. These colored boxes show matching gene clusters on other chromosomes. Matching this chromosome, there is one cluster on chromosome 22, three on 12, two on 2, two on 3, one on 7, one on 10 and this very small one on chromosome 1. If we look at this cluster in detail, we see six genes in a window of eight on chromosome 1 and six genes in a window of 9 on chromosome 17.

Conserved Segments
Most strict definition: distinct chromosomal regions with identical gene content and order. [Figure: gene order diagram; gene labels P Q R C D E F A B S T U V.] This results in conserved segments like this one: two identical sets of homologous genes in the same order in distinct chromosomal regions. The yeast genome contains many blocks like this and the order is so well conserved that you can pick them out by eye. The data provides compelling evidence for a recent whole genome duplication. However, the events under study in vertebrates are much older. Small scale mutations can result in changes in gene content and local disruptions in gene order. For example…

Gene Clusters
More flexible definition: distinct chromosomal regions with similar gene content. Gene content and order are not preserved. [Figure: gene order diagram; gene labels P Q R C X D E F S T U V and C E D F A B S T U V Y.] The results of this kind of mutation are gene clusters: distinct chromosomal regions with similar gene content, but not exactly the same genes, in which gene order is not preserved. For example, these red regions both contain c, d, e and f, but the lower region also contains gene x, and d and e are not in the same order. (More flexible, intuitive definition; many different formal definitions that constrain different features.)

Conserved Segments vs. Gene Clusters
Conserved segments: identical gene content and order. Gene clusters: similar gene content; order is not strictly conserved. [Figure: the two example diagrams side by side.]

Cluster Properties: different definitions ensure different properties
Is the gap between genes in the cluster smaller than the distance between the cluster and other genes of interest? Overlapping clusters that cannot be merged.
Do clusters of comparable size have comparable length? Wide range of global densities.
Do the gap sizes vary substantially? Wide range of local densities.
A second problem is modeling the distribution of overlapping clusters. In the case of individual clusters, we need this in order to get better estimates of the probability of observing at least one cluster. In whole genome comparison, we'd like to be able to compute p-values. We need a probability distribution for the event of observing k clusters in a whole genome comparison.

A gap between statistical tests and defs in practice
Genomic Analyses: Blanc03, Chen04, McLysaght02, HumanGenomeScience, Overbeek99, Simillion99, Tamames01a, Vandepoele02b, Vision00, ... (list how many here?)
Statistical Models: Calabrese03, Danchin03, Durand03b, Ehrlich97, Trachtulec01. Others?

Alternative approach: enumerate all permutations that do not contain any clusters of size h or larger. Simple dynamic programming algorithm. [Figure: example state j = 2, m = 1, c = 2, n = 4.] (Note: the figure doesn't actually show re-use of dynamic programming subproblems.)

Alternative Approach: Dynamic programming
Enumerate all permutations that do not contain any clusters of size h or larger.
n: total number of genes to place
m: number of “special” genes
c: size of cluster so far
j: gap size to previous cluster
[Figure: example state j = 2, m = 1, c = 2, n = 4.]

Complete Clusters
If the first red gene is located at the beginning of the genome, how many ways are there to choose the positions of the remaining red genes so that they form a max-gap cluster?

Complete Clusters
In how many ways can we choose the starting position of a cluster of size m? [Figure: a window of length w placed along the genome.]

Dynamic programming Algorithm
r: total number of genes to place
c: size of cluster so far
k: number of special genes
j: distance to previous cluster
Example: n = 6, m = 4, h = 3, g = 1.
[Figure: recursion tree. The root is r = 6, k = 4, c = 0, j = ∞. Its two children are r = 5, k = 3, c = 1, j = 0 and r = 5, k = 4, c = 0, j = ∞. Each level places one more gene, and branches marked X are pruned.]
(Mention recursion = left + right. Separate into multiple slides.)
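The recursion on this slide can be sketched in code. The block below is a minimal illustration rather than the authors' implementation: the function names are hypothetical, a gap is assumed to be the number of ordinary genes between two adjacent special genes, and g + 1 is used as a stand-in for the "infinite" gap that appears at the root of the tree. Each call branches on whether the next slot holds a special gene ("left") or an ordinary gene ("right"), and memoization supplies the re-use of subproblems.

```python
from functools import lru_cache
from itertools import combinations
from math import comb


def count_cluster_free(n, m, h, g):
    """Count placements of m special genes in n slots that contain no
    max-gap cluster (adjacent gaps <= g) of size h or larger."""

    @lru_cache(maxsize=None)
    def ways(r, k, c, j):
        # r: slots left, k: special genes left,
        # c: size of the current cluster, j: gap since the last special gene
        if k > r:
            return 0  # not enough slots for the remaining special genes
        if r == 0:
            return 1  # all slots filled without reaching size h
        # "right" branch: the next slot holds an ordinary gene
        if c > 0 and j + 1 <= g:
            total = ways(r - 1, k, c, j + 1)
        else:
            total = ways(r - 1, k, 0, g + 1)  # cluster closed; gap is "infinite"
        # "left" branch: the next slot holds a special gene
        if k > 0 and c + 1 < h:
            total += ways(r - 1, k - 1, c + 1, 0)
        return total

    return ways(n, m, 0, g + 1)


def brute_force_cluster_free(n, m, h, g):
    """Same count by explicit enumeration; only for checking small cases."""
    free = 0
    for pos in combinations(range(n), m):
        run = best = 1 if pos else 0
        for a, b in zip(pos, pos[1:]):
            run = run + 1 if b - a - 1 <= g else 1
            best = max(best, run)
        free += best < h
    return free


def prob_at_least_one_cluster(n, m, h, g):
    """Probability that a random placement has a cluster of size >= h."""
    return 1 - count_cluster_free(n, m, h, g) / comb(n, m)
```

For the slide's parameters (n = 6, m = 4, h = 3, g = 1) this sketch finds a single cluster-free placement, with the special genes in the first two and last two slots.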

Aggregate cluster counts: given a minimum size and a maximum gap, is it significant to observe k clusters?
Individual clusters: what's the probability of observing at least one max-gap cluster of size at least k? probability = 1

Given: m genes of interest
How close together do the m homologs have to be so that we are convinced that this cluster cannot be attributed to chance? (Mention gene order before this: cluster vs. conserved region.) I am going to start out by constructing a very simple model of a gene cluster and then extend it to address a variety of more complex biological situations. Our definition is based on the following scenario: a region containing m genes is seen on one chromosome, and paralogs of those m genes are found in a window of size r on another chromosome in the same species (or at a distant location on the same chromosome). Here we have five genes in a region of interest and we find homologs in a window of size ten. Notice that in this case the cluster is characterized by two parameters: the number of paralogs and the size of the window. (Introduce incomplete clusters here? Discuss order?)
Let's look at just one model, the reference region. Stated formally, in this model we treat the genome as an ordered list of genes, ignoring intergene distances. We want to calculate the probability of observing m genes in a window of size r. We can calculate the probability of this event occurring in a random genome using simple combinatorics. To determine the probability that the m prespecified genes appear in a window of size r in any order, note that at least one endpoint of the window must be occupied by a gene in our set of m genes. The remaining m-1 genes may appear anywhere in the remaining r-1 slots. We have n-r full-size windows in the genome. The second term deals with edge effects at the end of the genome. The numerator represents all possible ways of placing m-1 genes in r-1 slots. The normalization factor in the denominator is the number of ways m-1 genes can be placed anywhere in the genome. If the order is also prespecified, then the probability of observing the cluster by chance is reduced by a factor of m!
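The counting argument in these notes can be made concrete with a short sketch. This is one way to set up the count, not necessarily the exact expression on the slide: it fixes the leftmost of the m genes at the start of a window, counts n - r + 1 full-size window starts (the notes say n - r, which may reflect a different indexing convention), and normalizes by all placements of m genes rather than m - 1. The function names are hypothetical, and the brute-force routine is included only to check small cases.

```python
from itertools import combinations
from math import comb


def window_cluster_count(n, m, r):
    """Number of ways to place m genes among n slots so that all of them
    fall within some window of r consecutive slots (requires m <= r <= n)."""
    # Leftmost gene opens a full-size window: the other m-1 genes go in r-1 slots.
    full_windows = (n - r + 1) * comb(r - 1, m - 1)
    # Edge effect: the last r-1 start positions have truncated windows.
    # Summing comb(j, m-1) for j = 0..r-2 gives comb(r-1, m).
    edge = comb(r - 1, m)
    return full_windows + edge


def window_cluster_probability(n, m, r):
    """Probability of the event under a uniformly random placement."""
    return window_cluster_count(n, m, r) / comb(n, m)


def brute_force_probability(n, m, r):
    """Same probability by explicit enumeration; only for small n."""
    hits = sum(1 for pos in combinations(range(n), m)
               if pos[-1] - pos[0] + 1 <= r)
    return hits / comb(n, m)
```

If the gene order is also prespecified, the notes' factor of m! would be applied on top of this probability.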

How do we quantify “close together”?
Example: size = 5 genes, length = 10 genes, density = 5/10 = 0.5, gap = 3 genes.
length: number of genes between first and last gene
density: proportion of genes of interest (size/length)
compactness: maximum gap between the genes
(Discuss distance in base pairs. Say “our model”: genes don't overlap, ignore chromosome breaks.)
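As a small illustration of these three measures, the sketch below computes them from a list of gene positions. The function name and the example positions are hypothetical; positions are assumed to be indices in an ordered gene list, length is assumed to count both end genes, and a gap is assumed to be the number of other genes between two adjacent genes of interest.

```python
def cluster_stats(positions):
    """Size, length, density and maximum gap for genes of interest,
    given their indices in an ordered list of genes."""
    pos = sorted(positions)
    size = len(pos)
    # Length spans the first through the last gene of interest, inclusive.
    length = pos[-1] - pos[0] + 1
    # Gap = number of other genes between adjacent genes of interest.
    gaps = [b - a - 1 for a, b in zip(pos, pos[1:])]
    return {
        "size": size,
        "length": length,
        "density": size / length,
        "max_gap": max(gaps, default=0),
    }


# Hypothetical layout chosen to reproduce the slide's numbers.
example = cluster_stats([0, 1, 5, 7, 9])
```

With the positions 0, 1, 5, 7, 9 the result is size 5, length 10, density 0.5 and maximum gap 3, the same values as in the slide's example.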

Tests for incomplete gene clusters
Enumerating clusters by starting position will lead to overcounting: clusters of size h may overlap, and there may be more than one incomplete cluster. Example: m = 6, h = 3, g = 1. What's the probability of observing at least one cluster of size at least 3? (Remember the complete-cluster approach: in how many places can the cluster start and, given the chosen start position, how many ways are there to fill it in? Why can't we use that approach here? The question is still “at least one”: choose the starting position of the cluster, then count the number of ways to fill in the cluster.) To summarize the analysis of individual gene clusters, we modeled the probability of finding k copies of a partial cluster containing a subset of h of m gene families in a window of size r in a random genome. Our test statistics include the expected number of such clusters, the probability of finding one such cluster and the probability that a sample of windows shares h homologous genes.

A max-gap cluster of size m
Cluster Length. Example: m = 5, with four gaps of size g spanning a window of length w. A max-gap cluster of size m has a maximum length of w = m + (m-1)g. (Give example with numbers?)
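A numeric example may help here. The sketch below assumes that a gap counts the genes lying between two adjacent red genes, so a cluster of m red genes has m - 1 gaps and its length runs from m (all gaps zero) to m + (m-1)g (all gaps equal to g). The function names and the R/_ drawing are illustrative only.

```python
def length_range(m, g):
    """Shortest and longest possible length of a max-gap cluster of
    m genes when each of the m-1 gaps holds between 0 and g genes."""
    return m, m + (m - 1) * g


def widest_layout(m, g):
    """Draw the longest cluster: R is a red gene, _ is any other gene."""
    return ("R" + "_" * g) * (m - 1) + "R"


shortest, longest = length_range(5, 3)   # m = 5 red genes, gap limit g = 3
picture = widest_layout(5, 3)            # "R___R___R___R___R"
```

For m = 5 and g = 3 the longest cluster spans 17 genes. The same formula gives the range 6 to 156 for six paralogs with a gap size of 30.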

In how many ways can we choose the starting position of a cluster of size m?
Make it very clear that first gene in window must be red
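The two counting steps for complete clusters (choose where the first red gene sits, then choose the m - 1 gaps) can be combined in a short sketch. This is an illustration under stated assumptions, not the closed-form expression from the talk: function names are hypothetical, the genome is a list of n slots, and a gap is the number of other genes between adjacent red genes. Start positions near the end of the genome leave less room, so for each start the sketch counts only gap choices whose total fits, using the dice-style count of gap sums.

```python
from itertools import combinations
from math import comb


def gap_sum_counts(m, g):
    """counts[t] = number of ways to choose m-1 gaps, each between 0 and g,
    whose sum is t (like rolling m-1 dice with faces 0..g)."""
    counts = [1]  # with no gaps, the only possible sum is 0
    for _ in range(m - 1):
        new = [0] * (len(counts) + g)
        for total, ways in enumerate(counts):
            for face in range(g + 1):
                new[total + face] += ways
        counts = new
    return counts


def count_complete_clusters(n, m, g):
    """Placements of m red genes in n slots where every adjacent gap is <= g."""
    counts = gap_sum_counts(m, g)
    total = 0
    for start in range(n - m + 1):       # slot of the first red gene
        slack = n - start - m            # non-red slots available after it
        total += sum(counts[:slack + 1])
    return total


def brute_force_complete(n, m, g):
    """Same count by explicit enumeration; only for small n."""
    return sum(1 for pos in combinations(range(n), m)
               if all(b - a - 1 <= g for a, b in zip(pos, pos[1:])))


def prob_complete_cluster(n, m, g):
    """Probability that m randomly placed red genes form one max-gap cluster."""
    return count_complete_clusters(n, m, g) / comb(n, m)
```

Away from the end of the genome every start position contributes (g+1)^(m-1) placements, which is the total of all entries returned by gap_sum_counts.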

Final probability:

Almost Complete Clusters
When h > m/2, only one cluster of size h is possible, so we can use the same counting approach as for complete clusters. The expression for the probability is in the paper. (Add picture of 7/20 red.)

More generally... Number of clusters is not monotonic with cluster size
When g = 0

Cluster Definitions and Algorithms Related Work
conserved segment: contiguous sequence of genes in the same order
r-window: pair of windows of r genes, of which k genes are shared
common interval: set of genes adjacent in both genomes
max-gap: set of genes, gap between adjacent genes is small
graph-theoretic defs: such as connected components or minimum-weight paths
heuristics: sets of genes forming a rough diagonal on the dot-plot
