Presentation is loading. Please wait.

Presentation is loading. Please wait.

Getting Parameters from data 2016-6-5Comp 790– Coalescence with Mutations1.

Similar presentations


Presentation on theme: "Getting Parameters from data 2016-6-5Comp 790– Coalescence with Mutations1."— Presentation transcript:

1 Getting Parameters from data 2016-6-5Comp 790– Coalescence with Mutations1

2 1. Introduction No fun

3 2. Estimating θ Unlike ρ,θ does not shape the genealogy, it modifies genetic types Mutation-rich samples allow for more accurate estimation of ‘genealogical shape’-parameters. More segregating sites -> More information

4 2. Estimating θ 2.1 Watterson’s estimator – One of the two very popular estimators – The estimator has mean θ:

5 2. Estimating θ The estimator is unbiased and variance decreases with increasing ρ: recombination breaks up linkage and reduces correlation between sites Under exponential growth, it is downwards biased. Under migration, it is upwards biased, because the MRCA tends to be pushed further back in time.

6 2. Estimating θ 2.2 Tajima’s estimator – π ij : The number of sites that differ between sequence i and j. – π ij has mean θ, because π ij is the number of segregating sites in a sample of size two. – As a consequence,

7 2. Estimating θ

8 Code for Watterson Estimator

9 2. Estimating θ

10

11 2.3. Fu 1994

12 2. Estimating θ 2.3. Fu 1994

13 2. Estimating θ Fu’s other estimators: i-Mutation: only good with large n UPBLUE: does not generalize to settings with recombination

14 3. Estimating ρ Estimating ρ is hard, both statistically and computationally. – Using infinite side model, all mutation events can be listed whereas all recombination events cannot. A recombination can only be inferred with certainty if all four gametes are present. – The number of possible genealogical relationship between sequences subject to recombination is unlimited.

15 3. Estimating ρ Both T M and H M might overlook important information in the data. They only provides lower bounds to the number of recombination events. (T M: Least number of gene trees required to explain the sample. H M : 5.11.3) Assume two non-recombining loci with rate ρbetween them. Likelihood function: L(θ1, θ2, ρ) = P θ1, θ2, ρ (S 1,n =k1,S 2,n =k2) For n=2, the two extreme cases: L(θ1, θ2, 0) = L(θ1, θ2, ∞) =

16 3. Estimating ρ Assume u1=u2, two genes are of same length, then θ1=θ2. If k1=1 and k2=7, L(θ1, θ2, 0)=0.0014 L(θ1, θ2, ∞) =0.0067 (θL = 4 is the maximum likelihood estimator for both ρ=0 and ∞) The likelihood supports two unlinked loci(ρ= ∞) more than two completely linked loci(ρ=0).  ρ>0 even though the data passes the four gamete test and has T M = H M = 0.  recombination can be inferred even in the absence of incompatibilities. If k1=k2=4, the likelihood supports two complete linked loci.

17 3. Estimating ρ Recombination is difficult to take into account in an analysis and it is tempting to ignore it or assume that its effects are minor. Unfortunately it is not true. 3.1 Estimators based on summary statistics – Wakeley’s estimator: Wakeley(1997) A complicated function of a complicated function of ρ. ( the form is fully know though ) Large variance. The expectation doesn’t strongly depend on ρ. – Likelihood and summary statistics: Wall(2000) Infer ρ based on the likelihood of (Sn, Kn, T M ) where Kn is the number of haplotypes in the sample.

18 3. Estimating ρ 3.2 Pseudo-likelihood estimators: Hudson (2001b), Fearnhead & Donnelly(2001) – Consider all pairs of segregating sites, ignoring all non-polymorphic sites – Let n ij denote the vector of gamete counts for sites i and j (00,01,…). Let ρ ij be the scaled recombination rate: – Probability of obtaining n ij given that both i,j are polymorphic – The proposed pseudo-likelihood function is: – Then to estimate ρ from this pseudo-likelihood. It depends on ρ only. Because ρ ij is assumed proportional to sequence length.

19 3. Estimating ρ It is pseudo: – 1.Only likelihood of pairs of segregating sites are considered – 2. Pairs are treated as independent of each other – 3.The likelihood of a pair is conditioned on the pair being segregating in both loci

20 4. Monte Carlo methods The principle: throw dice several times and calculate the average – Var(g(X))/M Use Monte Carlo integration to find P(Sn=k) : A naïve approach – Simulate genealogies of n genes, add mutations and count the number of times a genealogy has exactly k mutations. – Many simulated genealogies will not contribute to the sum.

21 4. Monte Carlo methods A better approach: Write P(Sn=k) in the form of an integral – Rearrange the terms and we can get: Where X is gamma distributed with parameters k+1 and θ/2 – Or: Where Ln is the sum of all branches in the coalescent tree. – It’s better because every simulated values counts. 2016-6-5Comp 790– Continuous-Time Coalescence21

22 4. Monte Carlo methods 2016-6-5Comp 790– Continuous-Time Coalescence22

23 4. Monte Carlo methods 4.1 Likelihood curve – Monte Carlo methods becomes more useful in evaluating P(Sn=k) for a whole range of θ values – E.g.: to calculate L(θ) = P θ (Sn=k) for a large range of θs and single out the θ value with highest probability. – Recall: – Simulate y1,….y M from Ln and calcuate the empirical average: – Note that only one set of simulations is performed and is used to calculate the likelihood for all θ 2016-6-5Comp 790– Continuous-Time Coalescence23

24 4. Monte Carlo methods Alternatively, recall: one can extend it this way: consider some fixed θ 0. for any θ Using Monte Carlo technique, the integral can be approximated by: Where x1,…x M are M values obtained from the proposal distribution gamma(k+1, θ 0 ). θ 0 is called the ‘driving value’. An appropriate choice of θ 0 could be a simple estimator of θ. E.g. Watterson’s estimator. 2016-6-5Comp 790– Continuous-Time Coalescence24

25 4. Monte Carlo methods 2016-6-5Comp 790– Continuous-Time Coalescence25

26 4. Monte Carlo methods 4.2 Monte Carlo integration and the coalescent – Full likelihood of a sample under a coalescent model: – H: historyD: data – Let’s define H here: – N!(n-1)!/2^(n-1) different coalescent topologies – Impossible to sum up all these. ( we haven't even considered recomb) 2016-6-5Comp 790– Continuous-Time Coalescence26

27 4. Monte Carlo methods A naïve Monte Carlo approach: – It’s not efficient for most of the coalescent topologies will not be compatible with D.  most simulations do not contribute to the likelihood. – A four sequence example: (1/3 compatible ) 2016-6-5Comp 790– Continuous-Time Coalescence27

28 4. Monte Carlo methods Importance Sampling: – Reduce the variance of the estimated probability – Reduce the number of simulations that contribute little to likelihood Instead of choosing histories from distribution P θ (H), sample histories from a proposal distribution Q(H) Now the likelihood of data can be approximated by: 2016-6-5Comp 790– Continuous-Time Coalescence28

29 4. Monte Carlo methods Ideally, one would like to sample from, where because in that case the approximation becomes exact: Not feasible approach. A proposal distribution between and Giffiths and Tavare(1994), Stephens and Donnelly(2000) 2016-6-5Comp 790– Continuous-Time Coalescence29

30 4. Monte Carlo methods Giffiths and Tavare(1994): Let’s go back to 2.4.2 Infinite Site Model 2016-6-5Comp 790– Continuous-Time Coalescence30

31 Giffiths and Tavare(1994): H is defined as a path through the diagram. H has probability defined by the product of weights attached to the edges that belong to H. E.g.H’ follows the rightmost path : – 1 st term of Q(H’): – θ 0 is the driving value – Last five terms are all 1 2016-6-5Comp 790– Continuous-Time Coalescence31

32 4. Monte Carlo methods P θ (D|H’) = 1 ??? is the product of coalescent probabilities of the events defining the history: – Coalescent -> mutation -> coalescent … The factor in front of a fraction is the probability that a mutation happens in a given lineage(s) or that a coalescent event happens amongst certain pair of genes. 2016-6-5Comp 790– Continuous-Time Coalescence32

33 4. Monte Carlo methods 4.3 Markov Chain Monte Carlo – Kuhner et al.(1995,1998) – Finite sites model – All coalescent topologies are compatible with data – Likelihood ratio: 2016-6-5Comp 790– Continuous-Time Coalescence33

34 4. Monte Carlo methods The importance sampling function is: Use Metropolis-Hastings algorithm to construct a Markov Chain with distribution Q(H) The benefit of the approach is that the Markov Chain tends to stay in areas of the tree space that suport the data well before moving to another area. 2016-6-5Comp 790– Continuous-Time Coalescence34


Download ppt "Getting Parameters from data 2016-6-5Comp 790– Coalescence with Mutations1."

Similar presentations


Ads by Google