Trees & Topologies Chapter 3, Part 2

1 Trees & Topologies Chapter 3, Part 2

2 A simple lineage
Consider a given gene in a sample of size n. How long does it take before this gene coalesces with another gene in the sample? Let W_n denote the time until the gene merges with another gene; for n = 2, E(W_2) = E(T_2) = 1.
For a given n, the gene is involved in the first coalescent event with probability 2/n. In that case W_n = T_n, which provides the first term on the right-hand side of the recursion E(W_n) = (2/n) E(T_n) + (1 - 2/n) (E(T_n) + E(W_{n-1})). With probability 1 - 2/n the gene is not involved in the first coalescent event, which implies that W_n = T_n + W_{n-1}: T_n plus the time W_{n-1} until the gene coalesces in a sample of size n-1. Since E(T_n) = 2/(n(n-1)), the recursion simplifies to E(W_n) = 2/(n(n-1)) + (1 - 2/n) E(W_{n-1}) and has the solution E(W_n) = 2/n.
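The recursion above can be checked numerically. The following is a minimal sketch using exact fractions; the function name `expected_wait` is made up for illustration and is not from the slides.

```python
from fractions import Fraction

def expected_wait(n):
    """E(W_n) from E(W_n) = E(T_n) + (1 - 2/n) E(W_{n-1}), starting at E(W_2) = 1."""
    w = Fraction(1)  # E(W_2) = E(T_2) = 1
    for m in range(3, n + 1):
        e_t = Fraction(2, m * (m - 1))        # E(T_m): mean time spent with m lineages
        w = e_t + (1 - Fraction(2, m)) * w    # gene is not in the first event w.p. 1 - 2/m
    return w

# Compare the recursion with the stated solution 2/n for a few sample sizes.
for n in (2, 3, 5, 10, 50):
    print(n, expected_wait(n), Fraction(2, n))
```

Each printed row shows the recursion value next to 2/n, so any disagreement would be visible immediately.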

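3 Single Lineage
How many coalescent events pass before the gene coalesces with another gene? Let C_n denote this number. The same recursion applies with E(T_n) replaced by 1 and W_n by C_n: E(C_n) = 1 + (1 - 2/n) E(C_{n-1}). The initial condition is E(C_2) = 1 because there is only one coalescent event in the history of a sample of size 2.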
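The slide does not state a solution for the event-count recursion, so here is a small sketch that iterates it (replace E(T_n) by 1, start at E(C_2) = 1). The function name `expected_events` is hypothetical. For the values computed here the result agrees with (n + 1)/3, which is an observation from this calculation rather than something given on the slide.

```python
from fractions import Fraction

def expected_events(n):
    """E(C_n) from E(C_n) = 1 + (1 - 2/n) E(C_{n-1}), starting at E(C_2) = 1."""
    c = Fraction(1)  # a sample of size 2 has exactly one coalescent event
    for m in range(3, n + 1):
        c = 1 + (1 - Fraction(2, m)) * c  # one event passes, then continue w.p. 1 - 2/m
    return c

# Show the recursion value next to (n + 1)/3 for comparison.
for n in (2, 3, 4, 5, 10):
    print(n, expected_events(n), Fraction(n + 1, 3))
```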

4 Disjoint subsamples
Consider a sample of size n that is divided into two disjoint subsamples, A and B, of sizes k and n-k, respectively. Last time we talked about nested subsamples, where the genes in B were also in A. This is the opposite situation, where none of the genes in B are in A.
Figure: Illustration of the division of a sample into disjoint subsamples A and B, where all genes in subsample A coalesce before any of them coalesces with a gene in subsample B. T(A-1) denotes the time of the MRCA of subsample A; T(A-0) denotes the time of absorption of the last lineage of A into the genealogy of B.

5 Disjoint Subsamples (cont’d)
The probability that all genes in A find a MRCA before coalescing with any gene in B is q_{k,n-k} (equation on the slide). The probability that one of the two subsamples finds a MRCA before coalescing with members of the other subsample is q_{k,n-k} + q_{n-k,k} minus the term shown in the red box on the slide. Here q_{n-k,k} is the probability that all genes in B find a MRCA before coalescing with any gene in A, so the roles of A and B are reversed. The term in the red box is the probability that all genes in A and all genes in B find MRCAs before the two subsamples find a common MRCA. In other words, it is the probability that the k and n-k labelled genes subtend from the deepest split in the tree. This term is subtracted because it is included in both other terms (illustrated on the next slide).
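The equations for these probabilities are missing from this transcript, but the three quantities can be computed directly from the coalescent without them. The sketch below assumes only that every pair of lineages is equally likely to be the next to coalesce; the function names are made up for illustration.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def prob_a_first(i, j):
    """P(the i lineages of A reach one ancestor before any A-B coalescence), with j lineages of B."""
    if i == 1:
        return Fraction(1)
    pairs = comb(i + j, 2)  # all pairs are equally likely to coalesce next
    p = Fraction(comb(i, 2), pairs) * prob_a_first(i - 1, j)       # coalescence within A
    if j > 1:
        p += Fraction(comb(j, 2), pairs) * prob_a_first(i, j - 1)  # coalescence within B
    return p  # an A-B coalescence contributes 0

@lru_cache(maxsize=None)
def prob_both_first(i, j):
    """P(A and B each reach one ancestor before any A-B coalescence)."""
    if i == 1 and j == 1:
        return Fraction(1)
    pairs = comb(i + j, 2)
    p = Fraction(0)
    if i > 1:
        p += Fraction(comb(i, 2), pairs) * prob_both_first(i - 1, j)
    if j > 1:
        p += Fraction(comb(j, 2), pairs) * prob_both_first(i, j - 1)
    return p

def prob_either_first(k, n):
    """Inclusion-exclusion: q_{k,n-k} + q_{n-k,k} minus the doubly counted term."""
    return prob_a_first(k, n - k) + prob_a_first(n - k, k) - prob_both_first(k, n - k)

print(prob_a_first(2, 2), prob_both_first(2, 2), prob_either_first(2, 4))
```

The subtraction in `prob_either_first` mirrors the explanation on the slide: the event that both subsamples find their own MRCA first is contained in each of the other two events.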

6 Disjoint Subsamples (cont’d)
Figure: Illustration of the division of the sample into disjoint subsamples A and B, such that all genes in A and all genes in B find MRCAs before the two subsamples find a common MRCA. The probability of this event is given by the last term of the equation on the previous slide.

7 Jump Process of Disjoint Subsamples
We talked about the jump process last time: it describes which pair of genes coalesces at each coalescent event. Here the jumps are:
(i,j) -> (i-1,j) with probability (i+1)/(i+j)
(i,j) -> (i,j-1) with probability (j-1)/(i+j)
The process starts in (k, n-k) and continues until it reaches (1,j) for some j. It eventually jumps to (0,j) for some j and finally reaches (0,1), where 0 denotes that subsample A has been fully absorbed into B. To simulate the genealogy of A and B, one might first simulate the jump process and then simulate the times between events.
Three quantities are of interest:
T(A-1) = time of the MRCA of A
T(A-0) = time until the last lineage of A is absorbed into the genealogy of B
ANC_B = the number of ancestors of B at the time of the MRCA of A
These quantities are key numbers in the description of the genealogy of the whole sample. One can give a full probabilistic treatment of these quantities by deriving the densities of all three; however, we are just going to look at their mean values. If both k and n-k are large, E(ANC_B) ≈ 3/x - 2, where x = k/n is the frequency of A in the sample.
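The jump probabilities above sum to one, so they can be iterated directly to get the distribution of ANC_B, read off as the value of j when the process first reaches (1, j). The sketch below does this with exact fractions and prints the mean next to the large-sample approximation 3/x - 2. The function name and the choice of k and n are illustrative assumptions.

```python
from fractions import Fraction

def anc_b_distribution(k, n):
    """Distribution of j at the first visit to (1, j), starting from (k, n - k)."""
    layer = {(k, n - k): Fraction(1)}  # all states in a layer share the same i + j
    result = {}
    while layer:
        nxt = {}
        for (i, j), p in layer.items():
            if i == 1:
                # A has reached its MRCA; j is the number of ancestors of B
                result[j] = result.get(j, Fraction(0)) + p
                continue
            total = i + j
            # (i, j) -> (i - 1, j) with probability (i + 1)/(i + j)
            nxt[(i - 1, j)] = nxt.get((i - 1, j), 0) + p * Fraction(i + 1, total)
            # (i, j) -> (i, j - 1) with probability (j - 1)/(i + j); impossible when j = 1
            if j > 1:
                nxt[(i, j - 1)] = nxt.get((i, j - 1), 0) + p * Fraction(j - 1, total)
        layer = nxt
    return result

k, n = 50, 200  # illustrative values, not from the slides
dist = anc_b_distribution(k, n)
mean = sum(j * p for j, p in dist.items())
print("exact mean:", float(mean), "approximation 3/x - 2:", 3 / (k / n) - 2)
```

Trying several values of k and n shows how quickly the approximation settles as both subsamples grow.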

8 Disjoint Subsamples Example
4,200 bp of the PDHA1 gene were sequenced in 35 male individuals from different human populations: 16 individuals were from African populations and 19 were from non-African populations. The gene tree, constructed after removing one incompatible site and rooted using a chimpanzee sequence, is shown on the slide. Note that the MRCA is estimated to be almost 2 million years old. Haplotypes A and B were found in non-African populations only, the remaining haplotypes in African populations only. What is the probability that a subsample A of size k = 19 finds its MRCA before coalescing with an ancestor of the n - k = 16 remaining genes? Applying the equation for q_{k,n-k} from slide 5 gives 4.5 × 10^-11.
Figure: Gene tree of the PDHA1 gene from a sample of Africans and non-Africans.
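The equation from slide 5 is not reproduced in this transcript. The closed form used below, q_{k,n-k} = 2 (k-1)! (n-k)! / ((k+1) (n-1)!), is therefore an assumption. It is chosen because it is consistent with the jump probabilities on slide 7 and because it gives a value of about 4.5 × 10^-11 for k = 19 and n = 35, matching the number quoted above. The function name `q_closed` is hypothetical.

```python
from fractions import Fraction
from math import factorial

def q_closed(k, n):
    """Assumed closed form for q_{k,n-k}: subsample A (size k) finds its MRCA
    before any of its lineages coalesces with the other n - k genes."""
    numerator = 2 * factorial(k - 1) * factorial(n - k)
    denominator = (k + 1) * factorial(n - 1)
    return Fraction(numerator, denominator)

# The example from the slide: 19 non-African sequences out of 35.
print(f"q for k=19, n=35: {float(q_closed(19, 35)):.1e}")
```

The same function with k = 16 gives the probability for the reversed roles, where the African subsample finds its MRCA first.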

9 A sample partitioned by a mutation
Now consider a sample of size n where a polymorphism divides the sample into two disjoint subsamples, A and B, of sizes k and n-k, respectively. Assume that just one mutation separates them and that the scaled mutation rate is very low. This is often believed to be the case for many SNPs, where the mutation rate for a single locus might be as low as u = 10^-8 per generation. If we put N = 10^4, then θ = 4Nu = 4 × 10^-4, which is almost zero.
Let us for now assume that we know which allele is the oldest; say subsample B carries the oldest allele. In order for the whole sample to find a MRCA, subsample A must find a MRCA before coalescing with any sequence in B and, further, a mutation must have occurred on the branch connecting the genealogy of A with that of B (figure on slide 4). In this case, T(A-0) is replaced by T_MUT, the age of the mutation. The expressions for the mean values are fairly unintuitive and difficult to handle, but become more manageable if k and n-k are large. The figure on the next slide compares the mean values conditioned on a mutation to the values obtained without conditioning on a mutation causing the split between A and B.

10 Comparing the mean values
The figure compares the mean values conditioned on the mutation to the values obtained without conditioning on a mutation causing the split between A and B. The dotted curve is the ratio of E(T_MUT | Mutation) to E(T(A-0)). The solid curve is the ratio of E(ANC_B | Mutation) to E(ANC_B). Since T_MUT and T(A-0) do not measure exactly the same time, the dotted curve falls below 1 for large frequencies.
Jump processes:
(i,j) -> (i-1,j) with probability i/(i+j-1)
(i,j) -> (i,j-1) with probability (j-1)/(i+j-1)
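The ratio shown by the solid curve can be approximated by running both jump processes and comparing mean values. The sketch below makes two assumptions that the transcript does not confirm: that the jump probabilities listed here describe the process conditioned on the mutation, and that ANC_B is read off as j when the chain first reaches (1, j) in both cases. The probabilities used without a mutation are the ones from slide 7. All function names are made up for illustration.

```python
from fractions import Fraction

def mean_anc_b(k, n, prob_a_jump):
    """Mean of j at the first visit to (1, j), starting from (k, n - k).

    prob_a_jump(i, j) is the probability of the jump (i, j) -> (i - 1, j);
    the remaining probability goes to (i, j) -> (i, j - 1).
    """
    layer = {(k, n - k): Fraction(1)}
    mean = Fraction(0)
    while layer:
        nxt = {}
        for (i, j), p in layer.items():
            if i == 1:
                mean += j * p
                continue
            pa = prob_a_jump(i, j)
            nxt[(i - 1, j)] = nxt.get((i - 1, j), 0) + p * pa
            if pa < 1:  # both rules give pa = 1 when j = 1, so j never reaches 0
                nxt[(i, j - 1)] = nxt.get((i, j - 1), 0) + p * (1 - pa)
        layer = nxt
    return mean

def without_mutation(i, j):
    return Fraction(i + 1, i + j)      # slide 7

def given_mutation(i, j):
    return Fraction(i, i + j - 1)      # this slide

n = 100  # illustrative sample size
for k in (10, 30, 50, 70, 90):
    ratio = mean_anc_b(k, n, given_mutation) / mean_anc_b(k, n, without_mutation)
    print(f"x = {k / n:.1f}  ratio = {float(ratio):.3f}")
```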

11 Unknown ancestral state
If we do not know which of the two alleles is older, we have a slightly different situation. The probability that an allele found in k out of n genes is the oldest is k/n, so the probability that A carries the mutant allele is 1 - k/n = (n-k)/n. The jump processes become:
(i,j) -> (i-1,j) with probability i/(i+j)
(i,j) -> (i,j-1) with probability j/(i+j)
Either i or j can become zero (but never both of them).
j/(i+j) is the probability that A carries the mutant allele; i/(i+j) is the probability that B carries the mutant allele.
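One way to read these jump probabilities is that the subsample whose count reaches zero first is the one carrying the mutant allele. Under that reading (an interpretation, not something the slide states explicitly), the chance that i reaches zero before j should equal j/(i+j), which is (n-k)/n at the starting state. The sketch below checks this with exact fractions; `prob_a_absorbed` is a hypothetical name.

```python
from fractions import Fraction
from functools import lru_cache

@lru_cache(maxsize=None)
def prob_a_absorbed(i, j):
    """P(i reaches 0 before j does) under jumps i/(i+j) and j/(i+j)."""
    if i == 0:
        return Fraction(1)  # A has been absorbed
    if j == 0:
        return Fraction(0)  # B has been absorbed instead
    total = i + j
    return (Fraction(i, total) * prob_a_absorbed(i - 1, j)
            + Fraction(j, total) * prob_a_absorbed(i, j - 1))

n = 35
for k in (5, 16, 19, 30):
    print(k, prob_a_absorbed(k, n - k), Fraction(n - k, n))
```

Each row prints the absorption probability next to (n-k)/n so the two can be compared directly.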

12 The age of the MRCA for two sequences
Before, we assumed that there was only one mutation in the sample and that the mutation rate was negligible. Now consider two sequences with S_2 = k segregating sites; we want to evaluate the time until their MRCA, conditional on the k mutations. In particular, the conditional mean (equation on the slide) is larger or smaller than the unconditional mean time E(T_2) = 1 according to whether k ≥ θ or k ≤ θ, respectively.
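The equation for the conditional mean is missing from this transcript. Under the usual assumptions (T_2 is exponential with mean 1 and, given T_2 = t, the number of mutations on the two lineages is Poisson with mean θt), the conditional distribution of T_2 is a gamma distribution with mean (1 + k)/(1 + θ). That expression exceeds 1 exactly when k > θ, which agrees with the statement above. The sketch below checks it by numerical integration; the function name and the integration settings are illustrative choices.

```python
from math import exp

def conditional_mean_t2(k, theta, upper=80.0, steps=200_000):
    """E(T_2 | S_2 = k) by trapezoidal integration of t * f(t) / f(t),
    where f(t) is proportional to t^k * exp(-(1 + theta) * t)."""
    h = upper / steps
    num = den = 0.0
    for s in range(steps + 1):
        t = s * h
        end_weight = 0.5 if s in (0, steps) else 1.0
        # Poisson likelihood (theta t)^k e^{-theta t} times prior e^{-t};
        # constant factors cancel in the ratio
        f = end_weight * t**k * exp(-(1.0 + theta) * t)
        num += t * f
        den += f
    return num / den

theta = 2.0  # illustrative value
for k in (0, 2, 5):
    print(k, conditional_mean_t2(k, theta), (1 + k) / (1 + theta))
```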

13 Probability of going from n ancestors to k ancestors
In the equation on the slide, n_[i] = n(n-1)…(n-i+1) and n_(i) = n(n+1)…(n+i-1).
Figure: Probability of different numbers of ancestors, starting with seven ancestors at time 0. The curve starting with probability 1 at time 0 and then descending towards 0 is the probability of seven ancestors. The curve that peaks first is the probability of six ancestors, followed by the curve for five ancestors, etc. The curve that starts at 0 but converges to 1 as time increases is the probability of only one ancestor of the sample. At each time point the curves sum to 1.
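The equation itself is missing from this transcript. A standard closed form that uses exactly these falling and rising factorials is Tavaré's formula, and the sketch below assumes that this is the equation the slide shows. Time is measured so that each pair of lineages coalesces at rate 1, which is consistent with E(T_2) = 1 used earlier. The function names are made up for illustration.

```python
from math import exp, factorial

def falling(n, i):
    """n_[i] = n (n-1) ... (n-i+1)"""
    out = 1
    for r in range(i):
        out *= n - r
    return out

def rising(n, i):
    """n_(i) = n (n+1) ... (n+i-1)"""
    out = 1
    for r in range(i):
        out *= n + r
    return out

def prob_ancestors(n, k, t):
    """P(n genes at time 0 have exactly k ancestors at time t), assumed closed form."""
    total = 0.0
    for i in range(k, n + 1):
        coef = (2 * i - 1) * (-1) ** (i - k) * rising(k, i - 1) * falling(n, i)
        coef /= factorial(k) * factorial(i - k) * rising(n, i)
        total += exp(-i * (i - 1) * t / 2) * coef
    return total

# Seven ancestors at time 0, as in the figure: one row per time point.
for t in (0.1, 0.5, 1.0, 2.0):
    row = [prob_ancestors(7, k, t) for k in range(1, 8)]
    print(t, [round(p, 3) for p in row], "sum =", round(sum(row), 6))
```

Each row gives the probabilities of 1 to 7 ancestors at one time point, which corresponds to a vertical slice through the curves in the figure.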

14 Probability of going from n ancestors to k ancestors
Since the coalescent process is Markovian, it is easy to find the probability of a descending number of ancestors at a series of increasing times. First I show the probability that there are m ancestors at time t1 and k at time t. From this follows the probability of m ancestors at time t1, given that we started with n at time 0 and ended with k at time t: P(m|n… ). The figure shows the case n = 7, k = 4.
Figure: Probability of different numbers of ancestors, starting with seven ancestors at time 0 and ending with four ancestors at a later time. The curve that has probability 1 at time 0 is the probability of seven ancestors. The curve that has probability 1 at the right end of the interval is the probability of four ancestors. The curve that peaks first is the probability of six ancestors, followed by the curve for five ancestors. At each time point the curves add up to 1.
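By the Markov property, the conditional probability described above can be written as P_{n,m}(t1) P_{m,k}(t - t1) / P_{n,k}(t), where P_{a,b}(s) is the probability of going from a to b ancestors in time s. The sketch below evaluates this for n = 7 and k = 4. The transition probability is computed with Tavaré's closed form, which is an assumption about the equation on the previous slide; the function names and time points are illustrative.

```python
from math import exp, factorial, prod

def trans(a, b, s):
    """P(a ancestors become b ancestors in time s), assumed closed form."""
    def rising(x, i):
        return prod(x + r for r in range(i))

    def falling(x, i):
        return prod(x - r for r in range(i))

    return sum(
        exp(-i * (i - 1) * s / 2)
        * (2 * i - 1) * (-1) ** (i - b) * rising(b, i - 1) * falling(a, i)
        / (factorial(b) * factorial(i - b) * rising(a, i))
        for i in range(b, a + 1)
    )

def bridge_prob(m, n, k, t1, t):
    """P(m ancestors at time t1 | n at time 0 and k at time t), for 0 <= t1 <= t."""
    return trans(n, m, t1) * trans(m, k, t - t1) / trans(n, k, t)

n, k, t = 7, 4, 1.0
for t1 in (0.1, 0.5, 0.9):
    row = [bridge_prob(m, n, k, t1, t) for m in range(k, n + 1)]
    print(t1, [round(p, 3) for p in row])
```

Each row lists the probabilities of 4, 5, 6 and 7 ancestors at the intermediate time t1.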

15 Probability of going from n ancestors to k ancestors
Figure: The probability that a sample of three genes has two ancestors at time t. Each circle is the event that two particular genes have not found a common ancestor by time t, and has probability e^{-t}. Each area is labelled with a set of sequences that have not found any common ancestor. The area labelled 1,2,3 is the intersection of all three circles: all three genes still have distinct ancestors at time t, which has probability e^{-3t}. The three intersections of two circles that exclude the third circle each have probability (e^{-t} - e^{-3t})/2.
Note that the figure can be misleading, because the parts of a circle outside the intersections are empty: if two pairs of genes have found common ancestors by time t, then all three pairs have. Therefore the probabilities of all areas are determined, and it is a simple addition-subtraction exercise to obtain the probabilities of interest. The event that the three genes have two ancestors consists of the areas contained in exactly two circles. The area outside all circles is the event that all three genes have found a common ancestor.
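The area probabilities can be checked with a small simulation. The sketch below assumes the standard coalescent waiting times (rate 3 while there are three lineages, rate 1 while there are two) and compares simulated frequencies with the values read off the figure: e^{-3t} for three ancestors and 3 (e^{-t} - e^{-3t})/2 for two. The function names, seed and number of replicates are illustrative choices.

```python
import random
from math import exp

def simulate_ancestors(t, reps=200_000, seed=1):
    """Estimate P(1, 2 or 3 ancestors at time t) for a sample of three genes."""
    rng = random.Random(seed)
    counts = {1: 0, 2: 0, 3: 0}
    for _ in range(reps):
        t3 = rng.expovariate(3.0)  # time with 3 lineages: 3 pairs, each at rate 1
        t2 = rng.expovariate(1.0)  # time with 2 lineages: 1 pair at rate 1
        if t < t3:
            counts[3] += 1
        elif t < t3 + t2:
            counts[2] += 1
        else:
            counts[1] += 1
    return {a: c / reps for a, c in counts.items()}

def exact(t):
    """Probabilities obtained by adding up the areas in the figure."""
    p3 = exp(-3 * t)                          # intersection of all three circles
    p2 = 3 * (exp(-t) - exp(-3 * t)) / 2      # three areas lying in exactly two circles
    return {3: p3, 2: p2, 1: 1 - p3 - p2}     # the rest lies outside every circle

t = 0.5
print("simulated:", simulate_ancestors(t))
print("exact:    ", exact(t))
```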

16 Questions? Slides are available on the Wiki at:

