Population genetics
coalesce: 1. To grow together; fuse. 2. To come together so as to form one whole; unite: "The rebel units coalesced into one army to fight the invaders."
The coalescent: 4Nu (denoted θ) determines the level of variation under the neutral model.
Any two alleles share a common ancestor, so the ancestry of a sample can be represented by a tree.
The genealogy of the sample. The alleles may or may not be identical by state.
[Figure: coalescent tree of four sampled alleles s1, s2, s3, s4, with the present at T = 0 and coalescent events at T = t1, T = t2, and T = t3.]
Define $T_i$ to be the time needed to reduce a coalescent with $i$ alleles to one with $i-1$ alleles. Thus, $T_4 = t_1$, $T_3 = t_2 - t_1$, and $T_2 = t_3 - t_2$.
The total time in the coalescent (the sum of all branch lengths) is: $T_c = 4t_1 + 3(t_2 - t_1) + 2(t_3 - t_2)$.
Joining these equations we obtain: $T_c = 4T_4 + 3T_3 + 2T_2$.
Or in general for $n$ alleles: $T_c = \sum_{i=2}^{n} i\,T_i$.
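The bookkeeping for the total time can be sketched in a few lines of Python. This is only an illustration: the function name `total_tree_length` and the numeric coalescence times are hypothetical and do not come from the slides.

```python
def total_tree_length(coalescence_times):
    """Total branch length T_c of a coalescent tree.

    coalescence_times: increasing times [t_1, ..., t_(n-1)], measured back
    from the present (T = 0), at which the number of lineages drops by one.
    """
    lineages = len(coalescence_times) + 1   # n sampled alleles
    previous = 0.0
    total = 0.0
    for t in coalescence_times:
        interval = t - previous             # this is T_i for i = lineages
        total += lineages * interval        # i lineages each contribute T_i
        previous = t
        lineages -= 1
    return total


# Hypothetical times for a four-allele tree: t_1 = 1, t_2 = 3, t_3 = 7.
# T_c = 4*1 + 3*(3 - 1) + 2*(7 - 3) = 18
print(total_tree_length([1.0, 3.0, 7.0]))
```

Each interval is weighted by the number of lineages alive during it, which is the sum of $i\,T_i$ over the intervals.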
$T_c$ is a function of $N$ (the population size) and $n$ (the number of alleles in the sample). We can compute $T_c$ assuming the infinite allele model.
Focus on the last generation, in which $n$ alleles are reduced to $n-1$ alleles. For 2 alleles, what is the probability that they have different ancestors in the previous generation? With $2N$ gene copies in the population, it is $1 - \frac{1}{2N}$.
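We have $n$ alleles. What is the probability that they all have different ancestors in the previous generation?
$\left(1 - \frac{1}{2N}\right)\left(1 - \frac{2}{2N}\right)\cdots\left(1 - \frac{n-1}{2N}\right)$
Assuming $N$ is very large, and thus ignoring terms in which $N^2$ (or a higher power) appears in the denominator, we obtain:
$1 - \frac{1 + 2 + \cdots + (n-1)}{2N} = 1 - \frac{n(n-1)}{4N}$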
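To see how good the large-$N$ approximation is, the exact product can be compared numerically with $1 - n(n-1)/4N$. The sketch below assumes $2N$ gene copies in the population. The function names are made up for illustration.

```python
def prob_all_distinct_exact(n, N):
    """Exact probability that n alleles have n different ancestors
    one generation back, with 2N gene copies in the population."""
    prob = 1.0
    for k in range(1, n):
        prob *= 1.0 - k / (2.0 * N)
    return prob


def prob_all_distinct_approx(n, N):
    """Large-N approximation: drop every term with N**2 or higher
    powers of N in the denominator."""
    return 1.0 - n * (n - 1) / (4.0 * N)


for N in (100, 1000, 10000):
    exact = prob_all_distinct_exact(10, N)
    approx = prob_all_distinct_approx(10, N)
    print(N, exact, approx, exact - approx)
```

The gap between the two values shrinks roughly like $1/N^2$, which is why the dropped terms do not matter for large populations.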
The probability that $n$ alleles have different ancestors in the previous generation: $1 - \frac{n(n-1)}{4N}$.
The probability that at least 2 alleles out of $n$ alleles have a common ancestor in the previous generation: $\frac{n(n-1)}{4N}$.
This is the probability of a coalescent event in each generation.
The probability of a coalescent in a single generation is: $p = \frac{n(n-1)}{4N}$.
The number of generations until a coalescent is geometrically distributed with $p = n(n-1)/4N$. Thus, the expected time until a coalescent event is $1/p = 4N/(n(n-1))$.
In other words, for the step from $i$ alleles to $i-1$ alleles: $E(T_i) = \frac{4N}{i(i-1)}$.
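The geometric waiting time can be checked with a small simulation. This is a minimal sketch, not part of the original slides: each generation is treated as an independent trial with success probability $p = n(n-1)/4N$. The values of `n`, `N`, the seed, and the replicate count are arbitrary choices.

```python
import random


def coalescence_probability(n, N):
    """Per-generation probability that some pair of n alleles coalesces."""
    return n * (n - 1) / (4.0 * N)


def simulate_waiting_time(n, N, rng):
    """Count generations until the first coalescent event (geometric)."""
    p = coalescence_probability(n, N)
    generations = 1
    while rng.random() >= p:     # no coalescence in this generation
        generations += 1
    return generations


def mean_waiting_time(n, N, replicates=20000, seed=1):
    """Average simulated waiting time over many replicates."""
    rng = random.Random(seed)
    total = sum(simulate_waiting_time(n, N, rng) for _ in range(replicates))
    return total / replicates


# Illustrative values: n = 5 alleles, N = 100, so 1/p = 4N / (n(n-1)) = 20.
print(mean_waiting_time(5, 100), 1.0 / coalescence_probability(5, 100))
```

The simulated mean should land close to $4N/(n(n-1))$.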
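From the following two equations, we can obtain $E(T_c)$:
$T_c = \sum_{i=2}^{n} i\,T_i$ and $E(T_i) = \frac{4N}{i(i-1)}$
Hence $E(T_c) = \sum_{i=2}^{n} i \cdot \frac{4N}{i(i-1)} = 4N \sum_{i=2}^{n} \frac{1}{i-1} = 4N \sum_{i=1}^{n-1} \frac{1}{i}$.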
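The result for $E(T_c)$ can be checked by simulating whole genealogies. The sketch below makes one labeled assumption: the geometric waiting times are replaced by exponential ones with the same mean (the usual continuous-time approximation for large $N$). All names and parameter values are illustrative.

```python
import random


def expected_total_time(n, N):
    """Closed form: E(T_c) = 4N * sum_{i=1}^{n-1} 1/i."""
    return 4.0 * N * sum(1.0 / i for i in range(1, n))


def simulate_total_time(n, N, rng):
    """One simulated T_c = sum over i of i * T_i.

    Assumption: T_i is drawn from an exponential distribution with rate
    i(i-1)/4N, i.e. the same mean 4N / (i(i-1)) as the geometric model.
    """
    total = 0.0
    for i in range(n, 1, -1):                 # i = n, n-1, ..., 2 lineages
        rate = i * (i - 1) / (4.0 * N)
        total += i * rng.expovariate(rate)
    return total


def mean_simulated_total_time(n, N, replicates=20000, seed=7):
    rng = random.Random(seed)
    return sum(simulate_total_time(n, N, rng) for _ in range(replicates)) / replicates


# Illustrative values: n = 5, N = 100, so E(T_c) = 400 * (1 + 1/2 + 1/3 + 1/4).
print(expected_total_time(5, 100), mean_simulated_total_time(5, 100))
```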
The coalescent: adding mutation.
[Figure: the same coalescent tree of four alleles s1, s2, s3, s4, with T = 0, T = t1, T = t2, and T = t3 marked.]
The n alleles are either identical by state or not. Each mutation in the history of these alleles results in a segregating site. If there was one mutation, there is one segregating site. If there were 2 mutations, there are 2 segregating sites (the infinite allele model). In general: k mutations -> k segregating sites.
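Let $u$ be the mutation rate per generation. Thus, the total number of mutations in a coalescent is, on average, $u\,E(T_c)$, which is:
$u \cdot 4N \sum_{i=1}^{n-1} \frac{1}{i} = \theta \sum_{i=1}^{n-1} \frac{1}{i}$, where $\theta = 4Nu$.
But this is exactly the expectation of the number of segregating sites, $S$:
$E(S) = \theta \sum_{i=1}^{n-1} \frac{1}{i}$
Since $E(S)$ can be estimated from the sample (i.e., by the number of segregating sites observed), we can get an estimate of θ:
$\hat{\theta} = S \Big/ \sum_{i=1}^{n-1} \frac{1}{i}$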
Example: Assume 11 sequences, each 768 nucleotides long, were sampled and 14 segregating sites were found. Estimate θ for each allele (sequence) and for each nucleotide site.
Here $n = 11$, so the sum is $\sum_{i=1}^{10} 1/i = 2.929$. $E(S)$ is estimated to be 14, and hence the estimate of θ is $14/2.929 = 4.78$.
Hence 4Nu is estimated to be 4.78 when $u$ is the mutation rate per allele (sequence). When $u$ denotes the mutation rate per nucleotide, 4Nu is estimated to be $4.78/768 \approx 0.0062$.
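The arithmetic of this example is easy to reproduce. The following sketch uses the sample values from the slide (11 sequences, 768 sites, 14 segregating sites). The function names are made up for illustration.

```python
def harmonic_sum(n):
    """sum_{i=1}^{n-1} 1/i for a sample of n alleles."""
    return sum(1.0 / i for i in range(1, n))


def theta_from_segregating_sites(S, n):
    """Estimate theta = 4Nu as S divided by the harmonic sum."""
    return S / harmonic_sum(n)


n_sequences = 11
sequence_length = 768
segregating = 14

theta_per_sequence = theta_from_segregating_sites(segregating, n_sequences)
theta_per_site = theta_per_sequence / sequence_length

print(round(harmonic_sum(n_sequences), 3))   # 2.929
print(round(theta_per_sequence, 2))          # 4.78
print(round(theta_per_site, 4))              # 0.0062
```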
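A few words about the harmonic series $\sum_{i \ge 1} 1/i$:
1. The sum is infinite. Proof (grouping terms): $1 + \frac{1}{2} + \left(\frac{1}{3} + \frac{1}{4}\right) + \left(\frac{1}{5} + \cdots + \frac{1}{8}\right) + \cdots \ge 1 + \frac{1}{2} + \frac{1}{2} + \frac{1}{2} + \cdots$
2. The partial sum converges in the sense that $\lim_{n \to \infty} \left( \sum_{i=1}^{n} \frac{1}{i} - \ln n \right) = \gamma \approx 0.577$. So the rate of growth of the series is the same as that of $\ln(n)$.
For the sum to reach about 3, one needs about 10 samples. For the sum to reach about 4, one already needs about 30 samples.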
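The slow, logarithmic growth of the harmonic sum can be checked directly. This sketch counts terms of the series rather than samples (a sample of $n$ alleles uses $n-1$ terms). The constant `EULER_GAMMA` is the Euler-Mascheroni constant, typed in by hand.

```python
import math

EULER_GAMMA = 0.5772156649   # Euler-Mascheroni constant (hand-entered)


def harmonic_number(m):
    """H_m = 1 + 1/2 + ... + 1/m."""
    return sum(1.0 / i for i in range(1, m + 1))


def terms_needed(target):
    """Smallest m such that H_m >= target."""
    total, m = 0.0, 0
    while total < target:
        m += 1
        total += 1.0 / m
    return m


print(harmonic_number(10), harmonic_number(30))   # close to 3 and to 4
print(terms_needed(3), terms_needed(4))           # first m with H_m >= 3, >= 4
print(harmonic_number(1000) - math.log(1000))     # close to EULER_GAMMA
```

Going from a sum of about 3 to about 4 roughly triples the number of terms. This means that adding more sequences improves the estimate of θ only slowly.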
We thus have 2 methods for estimating θ:
1. Based on the general heterozygosity.
2. Based on the number of segregating sites: $\hat{\theta} = S \Big/ \sum_{i=1}^{n-1} \frac{1}{i}$
The estimation based on general heterozygosity does not use the information from each site. The contrast between the two formulas can be used to test the neutral theory (Tajima's D test).
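The contrast between the two estimates can be illustrated on a toy alignment. The sketch below rests on an assumption not spelled out in the slides: the heterozygosity-based estimate is taken to be the mean number of pairwise differences between sequences, the quantity Tajima's test compares with the segregating-sites estimate. The alignment `toy` is invented. Only the raw difference is computed, and the variance normalization of the actual D statistic is omitted.

```python
from itertools import combinations


def segregating_sites(sequences):
    """Number of alignment columns with more than one distinct state."""
    return sum(1 for column in zip(*sequences) if len(set(column)) > 1)


def mean_pairwise_differences(sequences):
    """Average number of differing sites over all pairs of sequences
    (assumed here to play the role of the heterozygosity-based estimate)."""
    pairs = list(combinations(sequences, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)


def theta_from_segregating_sites(sequences):
    """S divided by sum_{i=1}^{n-1} 1/i."""
    n = len(sequences)
    return segregating_sites(sequences) / sum(1.0 / i for i in range(1, n))


# Invented toy alignment of four sequences (not real data).
toy = ["AAGT", "AAGT", "ACGT", "ACGA"]

pairwise = mean_pairwise_differences(toy)
watterson = theta_from_segregating_sites(toy)
print(pairwise, watterson, pairwise - watterson)   # unnormalized contrast
```

Under neutrality both quantities estimate the same θ, so a large normalized difference in either direction is evidence against the neutral model.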