Getting Parameters from Data
Comp 790 – Coalescence with Mutations
1. Introduction No fun
2. Estimating θ
Unlike ρ, θ does not shape the genealogy; it only modifies the genetic types.
Mutation-rich samples allow more accurate estimation of the 'genealogical shape' parameters: more segregating sites mean more information.
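2.1 Watterson's estimator
– One of the two most popular estimators of θ.
– With $S_n$ the number of segregating sites in a sample of n sequences: $\hat\theta_W = S_n \big/ \sum_{j=1}^{n-1} \frac{1}{j}$
– The estimator has mean θ: $E[\hat\theta_W] = \theta$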
– Watterson's estimator is unbiased, and its variance decreases with increasing ρ: recombination breaks up linkage and reduces the correlation between sites.
– Under exponential growth, it is biased downwards.
– Under migration, it is biased upwards, because the MRCA tends to be pushed further back in time.
2.2 Tajima's estimator
– $\pi_{ij}$: the number of sites that differ between sequences i and j.
– $\pi_{ij}$ has mean θ, because $\pi_{ij}$ is the number of segregating sites in a sample of size two.
– As a consequence, the average over all pairs, $\hat\theta_T = \binom{n}{2}^{-1} \sum_{i<j} \pi_{ij}$, also has mean θ.
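The pairwise average can be computed directly from aligned sequences. The following is a minimal sketch, assuming sequences are given as equal-length strings; the function name `tajima_estimator` and the sample data are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def tajima_estimator(sequences):
    """Average number of pairwise differences over all pairs of sequences."""
    n = len(sequences)
    if n < 2:
        raise ValueError("need at least two sequences")
    total = 0
    for a, b in combinations(sequences, 2):
        # pi_ij: number of sites at which sequences i and j differ
        total += sum(x != y for x, y in zip(a, b))
    return total / (n * (n - 1) / 2)


# Hypothetical sample of four aligned 0/1 haplotypes (assumption, not from the slides)
sample = ["00110", "00100", "10100", "00101"]
print(tajima_estimator(sample))
```

The six pairwise counts in this sample sum to 9, so the estimate is 9/6.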
Code for Watterson Estimator
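The original code for this slide is not available in the text. As a stand-in, here is a minimal sketch of Watterson's estimator, the number of segregating sites divided by the harmonic sum over 1, ..., n-1, assuming aligned equal-length sequences. The function names and sample data are hypothetical.

```python
def segregating_sites(sequences):
    """Count alignment columns in which more than one allele is present."""
    return sum(len(set(column)) > 1 for column in zip(*sequences))


def watterson_estimator(sequences):
    """S_n divided by a_n = 1 + 1/2 + ... + 1/(n-1)."""
    n = len(sequences)
    if n < 2:
        raise ValueError("need at least two sequences")
    a_n = sum(1.0 / j for j in range(1, n))
    return segregating_sites(sequences) / a_n


# Hypothetical sample of four aligned 0/1 haplotypes (assumption, not from the slides)
sample = ["00110", "00100", "10100", "00101"]
print(segregating_sites(sample), watterson_estimator(sample))
```

For this sample there are 3 segregating sites and a_4 = 11/6, so the estimate is 18/11.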
2.3. Fu 1994
Fu's other estimators:
– i-Mutation: only good with large n
– UPBLUE: does not generalize to settings with recombination
3. Estimating ρ
Estimating ρ is hard, both statistically and computationally.
– Under the infinite sites model, all mutation events can be listed, whereas all recombination events cannot. A recombination can only be inferred with certainty if all four gametes are present.
– The number of possible genealogical relationships between sequences subject to recombination is unlimited.
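The four-gamete condition mentioned above can be checked mechanically. This is a minimal sketch, assuming biallelic sites coded as 0/1 characters in equal-length strings; the function name `four_gamete_pairs` and the sample data are hypothetical.

```python
from itertools import combinations


def four_gamete_pairs(sequences):
    """Return the pairs of site indices at which all four gametes occur."""
    n_sites = len(sequences[0])
    incompatible = []
    for i, j in combinations(range(n_sites), 2):
        # Set of two-site haplotypes (gametes) observed at sites i and j
        gametes = {(s[i], s[j]) for s in sequences}
        if len(gametes) == 4:
            incompatible.append((i, j))
    return incompatible


# Hypothetical data: sites 0 and 1 show 00, 01, 10 and 11
sample = ["001", "011", "101", "111"]
print(four_gamete_pairs(sample))
```

Under the infinite sites model, a non-empty result implies at least one recombination event between the listed sites. An empty result does not rule recombination out.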
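Both $T_M$ and $H_M$ might overlook important information in the data. They only provide lower bounds on the number of recombination events. ($T_M$: the least number of gene trees required to explain the sample. $H_M$: )
Assume two non-recombining loci with recombination rate ρ between them.
Likelihood function: $L(\theta_1, \theta_2, \rho) = P_{\theta_1, \theta_2, \rho}(S_{1,n} = k_1, S_{2,n} = k_2)$
For n = 2, the two extreme cases:
– $L(\theta_1, \theta_2, 0) = \binom{k_1 + k_2}{k_1} \frac{\theta_1^{k_1} \theta_2^{k_2}}{(1 + \theta_1 + \theta_2)^{k_1 + k_2 + 1}}$ (both loci share one tree)
– $L(\theta_1, \theta_2, \infty) = \prod_{i=1}^{2} \frac{\theta_i^{k_i}}{(1 + \theta_i)^{k_i + 1}}$ (the two trees are independent)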
Assume $u_1 = u_2$ and that the two genes have the same length, so $\theta_1 = \theta_2 = \theta$.
If $k_1 = 1$ and $k_2 = 7$:
– $L(\theta, \theta, 0) = 8\theta^8 / (1 + 2\theta)^9$
– $L(\theta, \theta, \infty) = \theta^8 / (1 + \theta)^{10}$
– $\hat\theta = 4$ is the maximum likelihood estimate for both ρ = 0 and ρ = ∞.
– The likelihood supports two unlinked loci (ρ = ∞) more than two completely linked loci (ρ = 0).
– ρ > 0 is supported even though the data pass the four-gamete test and have $T_M = H_M = 0$: recombination can be inferred even in the absence of incompatibilities.
If $k_1 = k_2 = 4$, the likelihood supports two completely linked loci.
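The comparison can be checked numerically. The sketch below assumes the standard n = 2 closed forms with θ1 = θ2 = θ: C(k1+k2, k1) θ^(k1+k2) / (1+2θ)^(k1+k2+1) for ρ = 0 (one shared exponential tree) and θ^(k1+k2) / (1+θ)^(k1+k2+2) for ρ = ∞ (independent trees). The function names and the grid search are illustrative choices, not part of the slides.

```python
from math import comb


def lik_linked(theta, k1, k2):
    """L(theta, theta, 0): both loci share a single two-sequence tree."""
    k = k1 + k2
    return comb(k, k1) * theta ** k / (1 + 2 * theta) ** (k + 1)


def lik_unlinked(theta, k1, k2):
    """L(theta, theta, infinity): the two loci have independent trees."""
    k = k1 + k2
    return theta ** k / (1 + theta) ** (k + 2)


def grid_mle(lik, k1, k2):
    """Crude maximiser over a grid of theta values (illustrative only)."""
    grid = [i / 100 for i in range(1, 2001)]
    return max(grid, key=lambda t: lik(t, k1, k2))


for k1, k2 in [(1, 7), (4, 4)]:
    print(k1, k2,
          grid_mle(lik_linked, k1, k2), lik_linked(4.0, k1, k2),
          grid_mle(lik_unlinked, k1, k2), lik_unlinked(4.0, k1, k2))
```

Under these assumptions both likelihoods peak at θ = 4. The unlinked model scores higher for (1, 7) and the linked model scores higher for (4, 4).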
Recombination is difficult to take into account in an analysis, and it is tempting to ignore it or to assume that its effects are minor. Unfortunately, that assumption is not justified.
3.1 Estimators based on summary statistics
– Wakeley's estimator, Wakeley (1997): a complicated function of a complicated function of ρ (the form is fully known, though). It has large variance, and its expectation does not depend strongly on ρ.
– Likelihood and summary statistics, Wall (2000): infer ρ based on the likelihood of $(S_n, K_n, T_M)$, where $K_n$ is the number of haplotypes in the sample.
3.2 Pseudo-likelihood estimators: Hudson (2001b), Fearnhead & Donnelly (2001)
– Consider all pairs of segregating sites, ignoring all non-polymorphic sites.
– Let $n_{ij}$ denote the vector of gamete counts for sites i and j (00, 01, …), and let $\rho_{ij}$ be the scaled recombination rate between them.
– Probability of obtaining $n_{ij}$ given that both i and j are polymorphic: $P_{\rho_{ij}}(n_{ij} \mid i, j \text{ polymorphic})$
– The proposed pseudo-likelihood function is the product over all pairs: $PL(\rho) = \prod_{i<j} P_{\rho_{ij}}(n_{ij} \mid i, j \text{ polymorphic})$
– ρ is then estimated by maximizing this pseudo-likelihood. It depends on ρ only, because $\rho_{ij}$ is assumed proportional to sequence length.
It is a pseudo-likelihood because:
– 1. Only the likelihoods of pairs of segregating sites are considered.
– 2. Pairs are treated as independent of each other.
– 3. The likelihood of a pair is conditioned on the pair being segregating at both loci.
4. Monte Carlo methods
The principle: throw dice several times and calculate the average.
– $E[g(X)] \approx \frac{1}{M} \sum_{i=1}^{M} g(x_i)$, with $x_1, \ldots, x_M$ simulated from the distribution of X
– The variance of the estimate is Var(g(X))/M
Use Monte Carlo integration to find $P(S_n = k)$. A naïve approach:
– Simulate genealogies of n genes, add mutations, and count the number of times a genealogy has exactly k mutations.
– Many simulated genealogies will not contribute to the sum.
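The naïve approach can be sketched as follows. The sketch assumes the standard coalescent, with time scaled so that each pair of lineages coalesces at rate 1 and mutations fall at rate θ/2 per unit of branch length. All function names are hypothetical, and the Poisson draw is hand-rolled to keep the example within the standard library.

```python
import random


def total_branch_length(n, rng):
    """Sum of all branch lengths: j lineages persist for an Exp(j(j-1)/2) time."""
    return sum(j * rng.expovariate(j * (j - 1) / 2) for j in range(2, n + 1))


def poisson_draw(mean, rng):
    """Poisson variate: count unit-rate exponential arrivals before time `mean`."""
    count = 0
    t = rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def naive_estimate(n, theta, k, M, seed=0):
    """Fraction of M simulated genealogies that carry exactly k mutations."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(M):
        length = total_branch_length(n, rng)
        if poisson_draw(theta * length / 2, rng) == k:
            hits += 1
    return hits / M


print(naive_estimate(n=5, theta=2.0, k=3, M=10000))
```

Every simulation with a mutation count other than k contributes nothing, which is the inefficiency noted above.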
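A better approach: write $P(S_n = k)$ in the form of an integral, with $f_{L_n}$ the density of $L_n$:
– $P(S_n = k) = \int_0^\infty \frac{(\theta l / 2)^k}{k!} e^{-\theta l / 2} f_{L_n}(l) \, dl$
– Rearrange the terms and we get $P(S_n = k) = \frac{2}{\theta} E[f_{L_n}(X)]$, where X is gamma distributed with parameters k+1 (shape) and θ/2 (rate).
– Or: $P(S_n = k) = E\left[\frac{(\theta L_n / 2)^k}{k!} e^{-\theta L_n / 2}\right]$, where $L_n$ is the sum of all branch lengths in the coalescent tree.
– It is better because every simulated value counts.
Comp 790 – Continuous-Time Coalescence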
4.1 Likelihood curve
– Monte Carlo methods become more useful when evaluating $P(S_n = k)$ for a whole range of θ values.
– E.g., calculate $L(\theta) = P_\theta(S_n = k)$ for a large range of θ values and single out the θ value with the highest probability.
– Recall: $P_\theta(S_n = k) = E\left[\frac{(\theta L_n / 2)^k}{k!} e^{-\theta L_n / 2}\right]$
– Simulate $y_1, \ldots, y_M$ from $L_n$ and calculate the empirical average: $L(\theta) \approx \frac{1}{M} \sum_{i=1}^{M} \frac{(\theta y_i / 2)^k}{k!} e^{-\theta y_i / 2}$
– Note that only one set of simulations is performed, and it is used to calculate the likelihood for all θ.
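A minimal sketch of the likelihood curve, assuming the standard coalescent with mutation rate θ/2 per unit of branch length, so that the number of mutations given a tree of total length y is Poisson with mean θy/2. The function names, sample size, and θ grid are hypothetical choices.

```python
import math
import random


def simulate_tree_lengths(n, M, seed=0):
    """Draw M total branch lengths L_n from the standard coalescent."""
    rng = random.Random(seed)
    return [
        sum(j * rng.expovariate(j * (j - 1) / 2) for j in range(2, n + 1))
        for _ in range(M)
    ]


def likelihood_curve(lengths, k, thetas):
    """Average the Poisson(theta * y / 2) probability of k over the same y's."""
    curve = []
    for theta in thetas:
        total = 0.0
        for y in lengths:
            mean = theta * y / 2
            # Poisson pmf computed on the log scale for numerical stability
            total += math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))
        curve.append(total / len(lengths))
    return curve


lengths = simulate_tree_lengths(n=10, M=5000)  # one set of simulations
thetas = [0.5 * i for i in range(1, 41)]       # reused for every theta
curve = likelihood_curve(lengths, k=8, thetas=thetas)
print(max(zip(curve, thetas))[1])              # theta with the highest estimate
```

Because every simulated length contributes a positive term for every θ, no simulation is wasted.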
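Alternatively, recall: $P_\theta(S_n = k) = \frac{2}{\theta} E[f_{L_n}(X)]$, where X is gamma(k+1, θ/2) distributed and $f_{L_n}$ is the density of $L_n$.
One can extend it this way: consider some fixed $\theta_0$. For any θ,
$P_\theta(S_n = k) = \frac{2}{\theta_0} \left(\frac{\theta}{\theta_0}\right)^k E\left[e^{-(\theta - \theta_0) X / 2} f_{L_n}(X)\right]$, with X gamma(k+1, $\theta_0$/2) distributed.
Using the Monte Carlo technique, the expectation can be approximated by $\frac{1}{M} \sum_{i=1}^{M} e^{-(\theta - \theta_0) x_i / 2} f_{L_n}(x_i)$, where $x_1, \ldots, x_M$ are M values obtained from the proposal distribution gamma(k+1, $\theta_0$/2).
$\theta_0$ is called the 'driving value'. An appropriate choice of $\theta_0$ could be a simple estimator of θ, e.g. Watterson's estimator.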
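A minimal sketch of the driving-value idea, under two stated assumptions. First, the proposal is gamma with shape k+1 and rate θ0/2. Second, under the standard coalescent the total branch length L_n is distributed as the maximum of n-1 independent exponentials with mean 2, which gives the density used below. The function names are hypothetical.

```python
import math
import random


def tree_length_density(x, n):
    """Density of L_n, assumed to be the max of n-1 exponentials with mean 2."""
    return (n - 1) / 2 * math.exp(-x / 2) * (1 - math.exp(-x / 2)) ** (n - 2)


def driving_value_curve(n, k, theta0, thetas, M, seed=0):
    """Estimate P_theta(S_n = k) for each theta from one gamma(k+1, theta0/2) sample."""
    rng = random.Random(seed)
    # random.gammavariate takes (shape, scale); scale = 1 / rate = 2 / theta0
    xs = [rng.gammavariate(k + 1, 2 / theta0) for _ in range(M)]
    dens = [tree_length_density(x, n) for x in xs]
    curve = []
    for theta in thetas:
        factor = (2 / theta0) * (theta / theta0) ** k
        avg = sum(
            f * math.exp(-(theta - theta0) * x / 2) for f, x in zip(dens, xs)
        ) / M
        curve.append(factor * avg)
    return curve


print(driving_value_curve(n=10, k=8, theta0=3.0, thetas=[2.0, 3.0, 4.0], M=5000))
```

The weights grow more variable as θ moves away from θ0, which is why a driving value near a simple estimate of θ is a sensible choice.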
4.2 Monte Carlo integration and the coalescent
– Full likelihood of a sample under a coalescent model: $L(\theta) = P_\theta(D) = \sum_H P_\theta(D \mid H) P_\theta(H)$
– H: history; D: data
– Let's define H here:
– There are $n!(n-1)!/2^{n-1}$ different coalescent topologies for n sequences.
– It is impossible to sum over all of these (and we have not even considered recombination).
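To see how quickly the count n!(n-1)!/2^(n-1) grows, it can be evaluated exactly with integer arithmetic. This is a small illustrative sketch; the function name is hypothetical.

```python
from math import factorial


def coalescent_topologies(n):
    """Number of coalescent topologies: n! (n-1)! / 2^(n-1), as an exact integer."""
    return factorial(n) * factorial(n - 1) // 2 ** (n - 1)


for n in (2, 3, 4, 10, 20):
    print(n, coalescent_topologies(n))
```

Already for n = 10 the count exceeds two billion, so exhaustive summation is impractical for realistic sample sizes.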
A naïve Monte Carlo approach: simulate histories $H_1, \ldots, H_M$ from $P_\theta(H)$ and use $P_\theta(D) \approx \frac{1}{M} \sum_{i=1}^{M} P_\theta(D \mid H_i)$
– It is not efficient, because most coalescent topologies will not be compatible with D: most simulations do not contribute to the likelihood.
– A four-sequence example: (1/3 compatible)
Importance sampling:
– Reduces the variance of the estimated probability
– Reduces the number of simulations that contribute little to the likelihood
Instead of choosing histories from the distribution $P_\theta(H)$, sample histories $H_1, \ldots, H_M$ from a proposal distribution Q(H).
The likelihood of the data can now be approximated by: $P_\theta(D) \approx \frac{1}{M} \sum_{i=1}^{M} P_\theta(D \mid H_i) \frac{P_\theta(H_i)}{Q(H_i)}$
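Ideally, one would like to sample from $Q(H) = P_\theta(H \mid D)$, where $P_\theta(H \mid D) = P_\theta(D \mid H) P_\theta(H) / P_\theta(D)$, because in that case the approximation becomes exact: every term $P_\theta(D \mid H) P_\theta(H) / Q(H)$ equals $P_\theta(D)$.
This is not a feasible approach.
Instead, use a proposal distribution between $P_\theta(H)$ and $P_\theta(H \mid D)$: Griffiths and Tavaré (1994), Stephens and Donnelly (2000).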
Griffiths and Tavaré (1994): let's go back to the infinite sites model.
Griffiths and Tavaré (1994):
– H is defined as a path through the diagram.
– H has probability defined by the product of the weights attached to the edges that belong to H.
– E.g., H' follows the rightmost path:
– 1st term of Q(H'):
– $\theta_0$ is the driving value
– Last five terms are all
$P_\theta(D \mid H') = 1$
$P_\theta(H')$ is the product of the coalescent probabilities of the events defining the history:
– Coalescence -> mutation -> coalescence …
The factor in front of a fraction is the probability that a mutation happens in a given lineage (or lineages), or that a coalescent event happens among a certain pair of genes.
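4.3 Markov Chain Monte Carlo
– Kuhner et al. (1995, 1998)
– Finite sites model
– All coalescent topologies are compatible with the data
– Likelihood ratio: $\frac{L(\theta)}{L(\theta_0)} = \sum_H \frac{P_\theta(H)}{P_{\theta_0}(H)} P_{\theta_0}(H \mid D) \approx \frac{1}{M} \sum_{i=1}^{M} \frac{P_\theta(H_i)}{P_{\theta_0}(H_i)}$, with $H_1, \ldots, H_M$ sampled from $P_{\theta_0}(H \mid D)$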
The importance sampling function is: $Q(H) = P_{\theta_0}(H \mid D)$
Use the Metropolis-Hastings algorithm to construct a Markov chain with stationary distribution Q(H).
The benefit of the approach is that the Markov chain tends to stay in areas of the tree space that support the data well before moving to another area.