Tests for Gene Clustering


1 Tests for Gene Clustering
Written by: Dannie Durand and David Sankoff. Presented by: Hessah Alsubaie

2 November 18, 2018 Introduction:
- The size of the gene families in the genome where the cluster is found.
- Simple cluster probabilities.
- Significance of individual gene clusters. The significance of a putative cluster depends on how it was found:
1- Reference region.
- Case in which G does not contain gene families.
- Case in which clusters in genomes contain gene families.
- Case of incomplete clusters.
2- Window sampling.
3- Whole genome comparison.

3 Notation:
- The probability of observing a cluster in a random genome is denoted by q( ).
- The expected number of clusters in a random genome is denoted by S( ).
- The probability of observing at least one cluster in a random genome is denoted by P( ).
- Test statistics based on window sampling are subscripted with a W.
- The superscripts o and p indicate orthologous and paralogous clusters, respectively.
- The superscripts [...] and F refer to gene families.
- The superscript H refers to incomplete clusters.

4 Definitions:
- Gene family: a set of genes with similar sequence and function that arose through duplication of genetic material.
- Gene cluster: two or more genes within a genome that encode similar functions or attributes; a cluster is part of a gene family.
- Conserved cluster: a set of two or more distinct (non-overlapping) chromosomal regions that have m genes in common.
- Paralogous: the cluster is paralogous if both regions are in the same genome.
- Orthologous: the cluster is orthologous if the regions are in different genomes.

5 SIMPLE CLUSTER PROBABILITIES
The probability of observing such a cluster in a genome with uniform random gene order:
- Let G = (1, …, n) be an ordered set of n genes.
- Let M be a preselected set of m genes.
In the case that the genes of M are found, in any order, in a window of exactly r slots in G, the first and last of the r slots contain 2 of the m genes. The window can start at any of n - r + 1 positions and the remaining m - 2 genes occupy the r - 2 interior slots, so the probability of this case is:
$$\frac{(n-r+1)\binom{r-2}{m-2}}{\binom{n}{m}}$$
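The counting argument for the exact-window case can be checked numerically. The sketch below is an illustration, not code from the presentation: it assumes the count is (n - r + 1) window positions times the ways to place the other m - 2 genes in the r - 2 interior slots, divided by the number of ways to place m genes in n slots. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def exact_window_probability(n: int, m: int, r: int) -> Fraction:
    """Probability that m preselected genes span exactly r of n slots."""
    if not (2 <= m <= r <= n):
        raise ValueError("requires 2 <= m <= r <= n")
    # Both end slots of the window hold genes of M; the other m - 2 genes
    # fall somewhere in the r - 2 interior slots.
    return Fraction((n - r + 1) * comb(r - 2, m - 2), comb(n, m))


def brute_force_exact(n: int, m: int, r: int) -> Fraction:
    """Same probability by enumerating every placement (small n only)."""
    hits = sum(
        1
        for positions in combinations(range(n), m)  # tuples come out sorted
        if positions[-1] - positions[0] + 1 == r
    )
    return Fraction(hits, comb(n, m))
```

Summing `exact_window_probability(n, m, r)` over r from m to n gives 1, which is a quick sanity check that every placement is counted once.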

6 The probability of observing such a cluster in a genome with uniform random gene order:
In the case that the m genes span at most r slots in G, it is sufficient that one of the end points of the window be occupied by one of the m genes. As a result, the probability of this case is:
$$q(n,m,r) = \frac{(n-r+1)\binom{r-1}{m-1} + \binom{r-1}{m}}{\binom{n}{m}}$$
where the second binomial term in the numerator addresses the edge effects.
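A minimal sketch of the "at most r slots" probability follows, assuming the count is made by fixing the leftmost gene: for each of the n - r + 1 interior start positions the other m - 1 genes fall in the next r - 1 slots, and the last r - 1 start positions (the edge of the genome) contribute one extra binomial term. The brute-force helper is a hypothetical check for small n, not part of the original slides.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def q(n: int, m: int, r: int) -> Fraction:
    """Probability that m preselected genes span at most r of n slots."""
    if not (1 <= m <= r <= n):
        raise ValueError("requires 1 <= m <= r <= n")
    interior = (n - r + 1) * comb(r - 1, m - 1)  # leftmost gene away from the edge
    edge = comb(r - 1, m)                        # leftmost gene in the last r - 1 slots
    return Fraction(interior + edge, comb(n, m))


def brute_force_q(n: int, m: int, r: int) -> Fraction:
    """Same probability by enumerating every placement (small n only)."""
    hits = sum(
        1
        for positions in combinations(range(n), m)  # tuples come out sorted
        if positions[-1] - positions[0] + 1 <= r
    )
    return Fraction(hits, comb(n, m))
```

Using `Fraction` keeps the comparison with the brute-force count exact; for large genomes a floating-point or log-space version would be more practical.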

7 In the case that the genes in M appear in a given order, the probability is q(n, m, r)/m!, since each of the m! orderings is equally likely under uniform random gene order. The probability of finding a cluster by chance: this probability depends on the size of the window relative to the genome size and on the fraction of slots in the window that are occupied by intruders. For large n and m < r, we can obtain the probability by applying Stirling's approximation to this equation:

8 Then we obtain an expression in two parameters, introduced to represent window proportion and window sparsity, respectively.

9 The significance of a putative cluster depends on how it was found:
- Reference region: the investigator is interested in a particular genomic region and looks for additional regions that contain the same genes.
- Window sampling: a cluster may be found by selecting a pair of windows and comparing their gene content for shared homologs.
- Whole genome comparison: individual gene clusters may be found through whole genome scans.

10 Test statistics: The probability of observing a single cluster of prespecified genes, such as q(n, m, r), is a measure of its statistical significance. In other cases, the probability of observing at least one such cluster may be used to test significance. This probability is difficult to calculate, because some sets of genes that meet the criterion intersect, so the events under consideration are not independent. In contrast, the expected number of clusters of a given type is generally easy to calculate.

11 Reference Regions
In the following case: suppose a chromosomal region in one genome is of particular interest, and a second region is found by scanning a different genome, G, for genes found in the first region. M is defined to be the set of genes found in the reference region. Test: the probability for m prespecified genes can be used to test the significance of a cluster found in this way.

12 In this case: suppose G does not contain gene families and all m genes are found in the window in G. Test: we use the same equations as for the simple cluster probabilities to test whether a specific set of m genes is more highly clustered than expected by chance.

13 In the case that clusters in genomes contain gene families:
- Orthologous clusters in genomes with gene families.
- M is defined to be a set of m prespecified gene families, M = {f1, …, fm}.
- No two genes are members of the same gene family.
- Let ℳ be the set of distinct sets of genes in Gi that are homologous to M. Each such set takes one gene from each family, so the number of sets in ℳ is the product of the sizes of the m families in Gi.

14 For each of these sets, the probability that it spans at most r slots is q(n, m, r). As a result, the expected number of homologous clusters is the number of such sets multiplied by q(n, m, r). This expression does not give the expected number of distinct clusters. Test: the test depends on the probability of observing at least one homologous cluster, that is, the probability of the union of the events Ei, where Ei is the event that the i-th set of homologous genes is found in a window of size r.
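The expected-count calculation for genomes with gene families can be sketched as below. This is an illustration under stated assumptions, not the presenters' code: the number of homologous gene sets is taken to be the product of the family sizes (one gene chosen per family), each set is assigned the probability q(n, m, r), and `family_sizes` is a hypothetical input listing the size of each of the m families in the scanned genome. The bound in `upper_bound_at_least_one` is the standard union bound, which holds even though the events are not independent.

```python
from fractions import Fraction
from math import comb, prod


def q(n: int, m: int, r: int) -> Fraction:
    """Probability that m preselected genes span at most r of n slots."""
    if not (1 <= m <= r <= n):
        raise ValueError("requires 1 <= m <= r <= n")
    return Fraction((n - r + 1) * comb(r - 1, m - 1) + comb(r - 1, m), comb(n, m))


def expected_homologous_clusters(n: int, family_sizes: list[int], r: int) -> Fraction:
    """Expected number of homologous gene sets that span at most r slots."""
    m = len(family_sizes)
    number_of_sets = prod(family_sizes)  # one gene picked from each family
    return number_of_sets * q(n, m, r)   # linearity of expectation


def upper_bound_at_least_one(n: int, family_sizes: list[int], r: int) -> Fraction:
    """Union bound: P(at least one cluster) cannot exceed the expected count."""
    return min(Fraction(1), expected_homologous_clusters(n, family_sizes, r))
```

With all family sizes equal to 1 the expected count reduces to q(n, m, r), matching the case without gene families.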

15 A better approximation can be obtained by using the inclusion-exclusion rule to correct for overlapping clusters:
Let Ei be the event that the i-th set of homologous genes is found in a window of size r in G. Then, by the inclusion-exclusion rule, the probability is:
$$P\Big(\bigcup_i E_i\Big) = \sum_i P(E_i) - \sum_{i<j} P(E_i \cap E_j) + \cdots$$
The second term is estimated by the expected number of windows of size 2r-m+1 containing an extra member of one of the m families.

16 This expression only approximates the second term, since not every such window is the union of two windows of size r that each contain a complete cluster. The result is a first-order approximation to the probability that at least one cluster appears.

17 In the case of an incomplete cluster:
Let H be the set of all subsets of M of size h < m, and assume there are no gene families. The probability that a specific subset in H appears in a window spanning at most r slots is q(n, h, r). There are $\binom{m}{h}$ such subsets, so the expected number of them that appear is:
$$S^H(n,h,m,r) = \binom{m}{h}\, q(n,h,r)$$
As in the previous calculation, the probability of observing at least one incomplete cluster is estimated with the inclusion-exclusion rule, to correct for overlapping clusters.
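The gap between the expected number of incomplete clusters and the probability of seeing at least one can be illustrated on a small genome. The sketch below assumes the expected count is the number of size-h subsets of the m genes times q(n, h, r); the exhaustive enumeration in `exact_small_case` is a hypothetical check that is feasible only for small n.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def q(n: int, m: int, r: int) -> Fraction:
    """Probability that m preselected genes span at most r of n slots."""
    if not (1 <= m <= r <= n):
        raise ValueError("requires 1 <= m <= r <= n")
    return Fraction((n - r + 1) * comb(r - 1, m - 1) + comb(r - 1, m), comb(n, m))


def expected_incomplete_clusters(n: int, h: int, m: int, r: int) -> Fraction:
    """Expected number of size-h subsets of the m genes spanning at most r slots."""
    return comb(m, h) * q(n, h, r)


def exact_small_case(n: int, h: int, m: int, r: int) -> tuple[Fraction, Fraction]:
    """Enumerate every placement of the m genes; return (mean count, P(count > 0))."""
    total = comb(n, m)
    count_sum = 0
    placements_with_cluster = 0
    for placement in combinations(range(n), m):
        clusters = sum(
            1
            for subset in combinations(placement, h)  # sorted positions
            if subset[-1] - subset[0] + 1 <= r
        )
        count_sum += clusters
        placements_with_cluster += clusters > 0
    return Fraction(count_sum, total), Fraction(placements_with_cluster, total)
```

For example, `exact_small_case(8, 3, 4, 4)` returns a mean that equals `expected_incomplete_clusters(8, 3, 4, 4)`, while the probability of at least one incomplete cluster is smaller, because overlapping subsets are counted more than once in the expectation.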

18 In this case:
The first term of this equation is SH(n, h, m, r). The remaining term is calculated from pairs of subsets of size h-1; it is the expected number of windows of size 2r-h+1 that contain h+1 of the m genes. From SH(n, h, m, r) and S’H(n, h, m, r) we obtain:

19 This represents a first-order approximation of the probability that at least one incomplete cluster of size h appears. Finally, a simplified model can be obtained under the assumption that the gene family size is uniform over all gene families. In this case, the product of the family sizes can be replaced by the m-th power of the common family size, and the probability will be:

20 To summarize:

21 Thank you for listening 

