Getting Parameters from data (2016-6-5). Comp 790 – Coalescence with Mutations.


1. Introduction No fun

2. Estimating θ
– Unlike ρ, θ does not shape the genealogy; it modifies the genetic types.
– Mutation-rich samples allow for more accurate estimation of ‘genealogical shape’ parameters.
– More segregating sites -> more information.

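2. Estimating θ
2.1 Watterson’s estimator
– One of the two very popular estimators.
– The estimator has mean θ: $\hat{\theta}_W = S_n / \sum_{j=1}^{n-1} \frac{1}{j}$, with $E[\hat{\theta}_W] = \theta$, where $S_n$ is the number of segregating sites in a sample of size n.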

2. Estimating θ
– Watterson’s estimator is unbiased, and its variance decreases with increasing ρ: recombination breaks up linkage and reduces the correlation between sites.
– Under exponential growth, it is biased downwards.
– Under migration, it is biased upwards, because the MRCA tends to be pushed further back in time.

2. Estimating θ
2.2 Tajima’s estimator
– π_ij: the number of sites that differ between sequences i and j.
– π_ij has mean θ, because π_ij is the number of segregating sites in a sample of size two.
– As a consequence, the average over all pairs, $\hat{\theta}_T = \binom{n}{2}^{-1} \sum_{i<j} \pi_{ij}$, also has mean θ.
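The pairwise-difference average described on this slide can be sketched in a few lines. This is a minimal illustration, assuming aligned sequences of equal length given as strings; the function name `tajima_estimator` and the four-sequence sample are hypothetical and not taken from the slides.

```python
from itertools import combinations


def tajima_estimator(sequences):
    """Average number of pairwise differences over all pairs of sequences."""
    n = len(sequences)
    if n < 2:
        raise ValueError("need at least two sequences")
    total = 0
    for a, b in combinations(sequences, 2):
        # pi_ij: number of sites at which sequences i and j differ
        total += sum(x != y for x, y in zip(a, b))
    return total / (n * (n - 1) / 2)


# Hypothetical aligned sample (0 = ancestral, 1 = derived).
sample = ["00110", "00100", "10100", "00101"]
print(tajima_estimator(sample))
```

The six pairwise differences in this toy sample sum to 9, so the estimate is 9/6.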

2. Estimating θ

Code for Watterson Estimator
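The code shown on the original slide did not survive extraction. As a stand-in, here is a minimal sketch of Watterson's estimator, assuming aligned sequences of equal length given as strings; the function names and the sample data are hypothetical.

```python
def segregating_sites(sequences):
    """Count alignment columns in which more than one allele is present."""
    return sum(len(set(column)) > 1 for column in zip(*sequences))


def watterson_estimator(sequences):
    """S_n divided by the harmonic sum 1 + 1/2 + ... + 1/(n-1)."""
    n = len(sequences)
    if n < 2:
        raise ValueError("need at least two sequences")
    a_n = sum(1.0 / j for j in range(1, n))
    return segregating_sites(sequences) / a_n


# Hypothetical aligned sample (0 = ancestral, 1 = derived).
sample = ["00110", "00100", "10100", "00101"]
print(segregating_sites(sample), watterson_estimator(sample))
```

In this toy sample three columns are segregating and the harmonic sum for n = 4 is 11/6, so the estimate is 18/11.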

2. Estimating θ

2. Estimating θ 2.3. Fu 1994

2. Estimating θ
Fu’s other estimators:
– i-Mutation: only good with large n.
– UPBLUE: does not generalize to settings with recombination.

3. Estimating ρ
Estimating ρ is hard, both statistically and computationally.
– Under the infinite sites model, all mutation events can be listed, whereas all recombination events cannot. A recombination can only be inferred with certainty if all four gametes are present.
– The number of possible genealogical relationships between sequences subject to recombination is unlimited.
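The four-gamete condition mentioned above can be checked directly. The sketch below assumes biallelic sites coded as "0"/"1" in aligned strings; the function name `four_gamete_violations` is hypothetical.

```python
from itertools import combinations


def four_gamete_violations(sequences):
    """Return the pairs of site indices at which all four gametes
    (00, 01, 10, 11) are observed in the sample."""
    n_sites = len(sequences[0])
    violations = []
    for i, j in combinations(range(n_sites), 2):
        gametes = {(s[i], s[j]) for s in sequences}
        if len(gametes) == 4:
            violations.append((i, j))
    return violations


# Hypothetical sample: sites 0 and 1 show all four gametes.
print(four_gamete_violations(["00", "01", "10", "11"]))
print(four_gamete_violations(["00", "01", "11"]))
```

Under the infinite sites model, a pair reported here cannot be explained without at least one recombination event between the two sites; an empty result does not rule recombination out.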

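3. Estimating ρ
– Both T_M and H_M might overlook important information in the data. They only provide lower bounds on the number of recombination events. (T_M: the least number of gene trees required to explain the sample. H_M: )
– Assume two non-recombining loci with recombination rate ρ between them.
– Likelihood function: $L(\theta_1, \theta_2, \rho) = P_{\theta_1, \theta_2, \rho}(S_{1,n} = k_1, S_{2,n} = k_2)$
– For n = 2, the two extreme cases are:
$L(\theta_1, \theta_2, 0) = \binom{k_1 + k_2}{k_1} \frac{\theta_1^{k_1} \theta_2^{k_2}}{(1 + \theta_1 + \theta_2)^{k_1 + k_2 + 1}}$
$L(\theta_1, \theta_2, \infty) = \prod_{i=1}^{2} \left( \frac{\theta_i}{1 + \theta_i} \right)^{k_i} \frac{1}{1 + \theta_i}$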

3. Estimating ρ
– Assume u1 = u2 and that the two genes are of the same length, so θ1 = θ2 = θ.
– If k1 = 1 and k2 = 7, then θ̂ = 4 is the maximum likelihood estimate for both ρ = 0 and ρ = ∞, and L(θ̂, θ̂, 0) < L(θ̂, θ̂, ∞): the likelihood supports two unlinked loci (ρ = ∞) more than two completely linked loci (ρ = 0).
-> ρ > 0 is supported even though the data pass the four-gamete test and have T_M = H_M = 0.
-> Recombination can be inferred even in the absence of incompatibilities.
– If k1 = k2 = 4, the likelihood supports two completely linked loci.
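The comparison on this slide can be reproduced numerically. The sketch below assumes the standard coalescent for n = 2, with the pairwise coalescence time exponentially distributed with mean 1 and mutations at locus i arising as a Poisson process with mean θ_i per unit of that time. Under these assumptions, ρ = 0 gives both loci the same tree and ρ = ∞ gives them independent trees. The function names are hypothetical.

```python
from math import comb


def lik_linked(theta1, theta2, k1, k2):
    """rho = 0: both loci share one coalescence time (n = 2)."""
    return (comb(k1 + k2, k1) * theta1**k1 * theta2**k2
            / (1 + theta1 + theta2) ** (k1 + k2 + 1))


def lik_unlinked(theta1, theta2, k1, k2):
    """rho = infinity: the loci have independent coalescence times (n = 2)."""
    def geometric(theta, k):
        return (theta / (1 + theta)) ** k / (1 + theta)
    return geometric(theta1, k1) * geometric(theta2, k2)


# Grid search for the maximum likelihood value of theta (theta1 = theta2).
grid = [i / 10 for i in range(1, 201)]
for k1, k2 in [(1, 7), (4, 4)]:
    best0 = max(grid, key=lambda t: lik_linked(t, t, k1, k2))
    best_inf = max(grid, key=lambda t: lik_unlinked(t, t, k1, k2))
    print(k1, k2, best0, lik_linked(best0, best0, k1, k2),
          best_inf, lik_unlinked(best_inf, best_inf, k1, k2))
```

With these formulas the grid maximum is at θ = 4 in every case, the unlinked model has the higher likelihood for (k1, k2) = (1, 7), and the linked model has the higher likelihood for (4, 4), in line with the statements above.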

3. Estimating ρ
Recombination is difficult to take into account in an analysis, and it is tempting to ignore it or to assume that its effects are minor. Unfortunately, that assumption does not hold.
3.1 Estimators based on summary statistics
– Wakeley’s estimator, Wakeley (1997): a complicated function of a complicated function of ρ (the form is fully known, though). Large variance. The expectation doesn’t depend strongly on ρ.
– Likelihood and summary statistics, Wall (2000): infer ρ based on the likelihood of (S_n, K_n, T_M), where K_n is the number of haplotypes in the sample.

3. Estimating ρ
3.2 Pseudo-likelihood estimators: Hudson (2001b), Fearnhead & Donnelly (2001)
– Consider all pairs of segregating sites, ignoring all non-polymorphic sites.
– Let n_ij denote the vector of gamete counts for sites i and j (00, 01, …), and let ρ_ij be the scaled recombination rate between sites i and j.
– Let P(n_ij | ρ_ij) be the probability of obtaining n_ij given that both i and j are polymorphic.
– The proposed pseudo-likelihood function is the product over all pairs: $PL(\rho) = \prod_{i<j} P(n_{ij} \mid \rho_{ij})$
– ρ is then estimated by maximizing this pseudo-likelihood. It depends on ρ only, because ρ_ij is assumed proportional to the length of sequence between sites i and j.

3. Estimating ρ
It is a pseudo-likelihood because:
– 1. Only the likelihoods of pairs of segregating sites are considered.
– 2. Pairs are treated as independent of each other.
– 3. The likelihood of a pair is conditioned on both sites of the pair being segregating.
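The bookkeeping behind a pairwise pseudo-likelihood can be sketched as follows. This is only a skeleton under stated assumptions: the two-site likelihood itself is supplied by the caller as `pair_loglik` (a hypothetical callable; in practice it would come from precomputed two-locus sampling probabilities), and ρ_ij is taken to be ρ times the distance between the sites divided by the total span, which is an assumption of this sketch.

```python
from collections import Counter
from itertools import combinations

GAMETES = (("0", "0"), ("0", "1"), ("1", "0"), ("1", "1"))


def gamete_counts(sequences, i, j):
    """Vector n_ij of counts for gametes 00, 01, 10, 11 at sites i and j."""
    counts = Counter((s[i], s[j]) for s in sequences)
    return tuple(counts[g] for g in GAMETES)


def pseudo_log_likelihood(sequences, positions, rho, pair_loglik):
    """Sum of pairwise log-likelihoods over all pairs of segregating sites.

    positions: sorted site coordinates with at least two distinct values.
    pair_loglik(counts, rho_ij): hypothetical user-supplied function.
    """
    span = positions[-1] - positions[0]
    segregating = [i for i in range(len(positions))
                   if len({s[i] for s in sequences}) > 1]
    total = 0.0
    for i, j in combinations(segregating, 2):
        rho_ij = rho * (positions[j] - positions[i]) / span
        total += pair_loglik(gamete_counts(sequences, i, j), rho_ij)
    return total


# Toy usage with a placeholder pair log-likelihood (not a real model).
seqs = ["0010", "0110", "1100"]
print(pseudo_log_likelihood(seqs, [0, 10, 20, 40], 4.0, lambda c, r: -r))
```

Summing log-likelihoods over pairs is what "treated as independent" amounts to; maximizing the sum over a grid of ρ values would give the estimate.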

4. Monte Carlo methods
– The principle: throw dice several times and calculate the average, $E[g(X)] \approx \frac{1}{M} \sum_{i=1}^{M} g(x_i)$. The variance of this estimate is Var(g(X))/M.
– Use Monte Carlo integration to find P(S_n = k). A naïve approach:
– Simulate genealogies of n genes, add mutations and count the number of times a genealogy has exactly k mutations.
– Many simulated genealogies will not contribute to the sum.
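The naïve approach can be sketched with the standard library alone. The sketch assumes the basic coalescent with time scaled so that j lineages coalesce at rate j(j-1)/2 and mutations fall on the tree at rate θ/2 per unit of branch length; all function names are hypothetical.

```python
import random


def total_branch_length(rng, n):
    """Simulate L_n: while j lineages remain, the waiting time is
    exponential with rate j(j-1)/2 and contributes j branches."""
    return sum(j * rng.expovariate(j * (j - 1) / 2) for j in range(2, n + 1))


def poisson_draw(rng, mean):
    """Poisson variate: count rate-1 exponential arrivals before time `mean`."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def naive_prob_segregating(n, k, theta, M, seed=0):
    """Fraction of M simulated genealogies carrying exactly k mutations."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(M):
        length = total_branch_length(rng, n)
        if poisson_draw(rng, theta * length / 2) == k:
            hits += 1  # only these simulations contribute to the estimate
    return hits / M


print(naive_prob_segregating(n=5, k=3, theta=2.0, M=10000))
```

Each simulation contributes either 0 or 1, which is why most of the work is wasted when P(S_n = k) is small.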

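4. Monte Carlo methods
A better approach:
– Write P(S_n = k) in the form of an integral: $P(S_n = k) = \int_0^\infty P(S_n = k \mid L_n = l) f(l) \, dl = \int_0^\infty \frac{(\theta l / 2)^k}{k!} e^{-\theta l / 2} f(l) \, dl$, where f is the density of L_n.
– Rearrange the terms and we get: $P(S_n = k) = \frac{2}{\theta} E[f(X)]$, where X is gamma distributed with parameters k+1 (shape) and θ/2 (rate).
– Or: $P(S_n = k) = E\left[ \frac{(\theta L_n / 2)^k}{k!} e^{-\theta L_n / 2} \right]$, where L_n is the sum of all branch lengths in the coalescent tree.
– It’s better because every simulated value counts.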

4. Monte Carlo methods

4. Monte Carlo methods
4.1 Likelihood curve
– Monte Carlo methods become more useful when evaluating P(S_n = k) for a whole range of θ values.
– E.g., calculate L(θ) = P_θ(S_n = k) for a large range of θ values and single out the θ value with the highest likelihood.
– Recall: $P_\theta(S_n = k) = E\left[ \frac{(\theta L_n / 2)^k}{k!} e^{-\theta L_n / 2} \right]$
– Simulate y_1, …, y_M from the distribution of L_n and calculate the empirical average: $L(\theta) \approx \frac{1}{M} \sum_{i=1}^{M} \frac{(\theta y_i / 2)^k}{k!} e^{-\theta y_i / 2}$
– Note that only one set of simulations is performed, and it is used to calculate the likelihood for all θ.
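The likelihood curve described above can be sketched by simulating the total branch length once and reusing it for every θ. The sketch assumes the basic coalescent (j lineages coalesce at rate j(j-1)/2) and a mutation rate of θ/2 per unit of branch length; the function names and the grid are hypothetical.

```python
import math
import random


def simulate_tree_lengths(n, M, seed=0):
    """Draw M independent values of L_n, the total branch length."""
    rng = random.Random(seed)
    return [sum(j * rng.expovariate(j * (j - 1) / 2) for j in range(2, n + 1))
            for _ in range(M)]


def likelihood_curve(lengths, k, thetas):
    """For each theta, average the Poisson probability of k mutations
    given branch length y over the same simulated lengths."""
    k_factorial = math.factorial(k)
    curve = []
    for theta in thetas:
        terms = [math.exp(-theta * y / 2) * (theta * y / 2) ** k / k_factorial
                 for y in lengths]
        curve.append(sum(terms) / len(terms))
    return curve


lengths = simulate_tree_lengths(n=5, M=5000, seed=1)  # one set of simulations
thetas = [0.5 * i for i in range(1, 21)]
curve = likelihood_curve(lengths, k=3, thetas=thetas)
print(max(zip(curve, thetas)))  # (highest estimated likelihood, its theta)
```

Every simulated length contributes a positive term, in contrast to the hit-or-miss counting of the naïve approach.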

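4. Monte Carlo methods
– Alternatively, recall: $P_\theta(S_n = k) = \frac{2}{\theta} E[f(X)]$, where X is gamma distributed with parameters k+1 and θ/2, and f is the density of L_n.
– One can extend it this way: consider some fixed θ_0. For any θ, $L(\theta) = \frac{2}{\theta} \int_0^\infty f(x) \frac{g_\theta(x)}{g_{\theta_0}(x)} g_{\theta_0}(x) \, dx$, where g_θ denotes the gamma(k+1, θ/2) density.
– Using the Monte Carlo technique, the integral can be approximated by: $L(\theta) \approx \frac{2}{\theta} \cdot \frac{1}{M} \sum_{i=1}^{M} f(x_i) \frac{g_\theta(x_i)}{g_{\theta_0}(x_i)}$, where x_1, …, x_M are M values obtained from the proposal distribution gamma(k+1, θ_0/2).
– θ_0 is called the ‘driving value’. An appropriate choice of θ_0 could be a simple estimator of θ, e.g. Watterson’s estimator.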
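The driving-value idea can be sketched as follows. Several things here are assumptions of the sketch rather than statements from the slides: the gamma proposal is parameterized by shape k+1 and rate θ_0/2 (so that it matches the gamma(k+1, θ/2) form used earlier), and the density of L_n is taken from the standard result that L_n is distributed as the maximum of n-1 independent exponentials with mean 2. Function names are hypothetical.

```python
import math
import random


def tree_length_density(x, n):
    """Density of L_n under the basic coalescent (assumed closed form)."""
    return (n - 1) / 2 * math.exp(-x / 2) * (1 - math.exp(-x / 2)) ** (n - 2)


def likelihood_driving_value(n, k, thetas, theta0, M, seed=0):
    """Estimate L(theta) for every theta from one gamma sample drawn at theta0."""
    rng = random.Random(seed)
    # random.gammavariate takes (shape, scale); scale = 1 / rate = 2 / theta0.
    xs = [rng.gammavariate(k + 1, 2 / theta0) for _ in range(M)]
    fx = [tree_length_density(x, n) for x in xs]
    estimates = []
    for theta in thetas:
        # Ratio of gamma densities g_theta(x) / g_theta0(x).
        weights = [(theta / theta0) ** (k + 1) * math.exp(-(theta - theta0) * x / 2)
                   for x in xs]
        mean = sum(f * w for f, w in zip(fx, weights)) / M
        estimates.append(2 / theta * mean)
    return estimates


print(likelihood_driving_value(n=5, k=3, thetas=[1.0, 2.0, 3.0],
                               theta0=2.0, M=5000, seed=1))
```

The weights grow quickly when θ is far below θ_0, so the estimate is most reliable near the driving value; this is one reason a reasonable first guess for θ_0 matters.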

4. Monte Carlo methods

4. Monte Carlo methods
4.2 Monte Carlo integration and the coalescent
– Full likelihood of a sample under a coalescent model: $L(\theta) = P_\theta(D) = \int P_\theta(D \mid H) P_\theta(H) \, dH$
– H: history. D: data.
– Let’s define H here:
– There are n!(n-1)!/2^(n-1) different coalescent topologies.
– It is impossible to sum over all of these (and we haven’t even considered recombination).
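To get a feel for how fast the count n!(n-1)!/2^(n-1) grows, it can simply be tabulated. The function name `coalescent_topologies` is hypothetical.

```python
from math import factorial


def coalescent_topologies(n):
    """Number of coalescent topologies for n genes: n!(n-1)!/2^(n-1)."""
    return factorial(n) * factorial(n - 1) // 2 ** (n - 1)


for n in (2, 3, 4, 5, 10, 20):
    print(n, coalescent_topologies(n))
```

The count is the product of the number of possible pairs at each coalescence step, so the integer division is exact.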

4. Monte Carlo methods
A naïve Monte Carlo approach:
– Simulate histories H_1, …, H_M from P_θ(H) and average: $L(\theta) \approx \frac{1}{M} \sum_{i=1}^{M} P_\theta(D \mid H_i)$
– It is not efficient, because most coalescent topologies will not be compatible with D -> most simulations do not contribute to the likelihood.
– A four-sequence example: (1/3 compatible)

4. Monte Carlo methods
Importance sampling:
– Reduces the variance of the estimated probability.
– Reduces the number of simulations that contribute little to the likelihood.
– Instead of choosing histories from the distribution P_θ(H), sample histories H_1, …, H_M from a proposal distribution Q(H).
– Now the likelihood of the data can be approximated by: $L(\theta) \approx \frac{1}{M} \sum_{i=1}^{M} P_\theta(D \mid H_i) \frac{P_\theta(H_i)}{Q(H_i)}$

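4. Monte Carlo methods
– Ideally, one would like to sample from $Q(H) = P_\theta(H \mid D)$, where $P_\theta(H \mid D) = P_\theta(D \mid H) P_\theta(H) / P_\theta(D)$, because in that case the approximation becomes exact: every term of the average equals $P_\theta(D)$.
– This is not a feasible approach, since it requires knowing $P_\theta(D)$.
– Use a proposal distribution between $P_\theta(H)$ and $P_\theta(H \mid D)$: Griffiths and Tavaré (1994), Stephens and Donnelly (2000).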

4. Monte Carlo methods
Griffiths and Tavaré (1994): let’s go back to the infinite sites model.

Griffiths and Tavaré (1994):
– H is defined as a path through the diagram.
– H has probability defined by the product of the weights attached to the edges that belong to H.
– E.g. H’ follows the rightmost path:
– 1st term of Q(H’):
– θ_0 is the driving value
– Last five terms are all

4. Monte Carlo methods
– P_θ(D|H’) = 1
– ??? is the product of coalescent probabilities of the events defining the history: coalescent -> mutation -> coalescent …
– The factor in front of a fraction is the probability that a mutation happens in a given lineage (or lineages), or that a coalescent event happens amongst a certain pair of genes.

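4. Monte Carlo methods
4.3 Markov Chain Monte Carlo
– Kuhner et al. (1995, 1998)
– Finite sites model
– All coalescent topologies are compatible with the data
– Likelihood ratio: $\frac{L(\theta)}{L(\theta_0)} = \int \frac{P_\theta(H)}{P_{\theta_0}(H)} P_{\theta_0}(H \mid D) \, dH \approx \frac{1}{M} \sum_{i=1}^{M} \frac{P_\theta(H_i)}{P_{\theta_0}(H_i)}$, with H_1, …, H_M sampled from $P_{\theta_0}(H \mid D)$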

4. Monte Carlo methods
– The importance sampling function is: $Q(H) = P_{\theta_0}(H \mid D)$
– Use the Metropolis-Hastings algorithm to construct a Markov chain with stationary distribution Q(H).
– The benefit of the approach is that the Markov chain tends to stay in areas of the tree space that support the data well before moving to another area.
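The accept/reject rule of Metropolis-Hastings can be illustrated on a toy problem. The sketch below is not the tree sampler of Kuhner et al.: it runs a random-walk chain on the real line with a symmetric Gaussian proposal and a standard normal target, both chosen purely for illustration. In the coalescent setting the state would be a history H and the target Q(H). The function name and parameters are hypothetical.

```python
import math
import random


def metropolis_hastings(log_target, x0, n_steps, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal.

    log_target: log of the target density, up to an additive constant.
    """
    rng = random.Random(seed)
    x, logp = x0, log_target(x0)
    chain = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step)
        logp_new = log_target(proposal)
        # Symmetric proposal: accept with probability min(1, target ratio).
        if logp_new >= logp or rng.random() < math.exp(logp_new - logp):
            x, logp = proposal, logp_new
        chain.append(x)  # a rejected move repeats the current state
    return chain


# Toy target: standard normal, known only up to its normalizing constant.
chain = metropolis_hastings(lambda x: -0.5 * x * x, x0=3.0, n_steps=20000)
kept = chain[2000:]  # discard an initial burn-in stretch
print(sum(kept) / len(kept))
```

Only ratios of the target appear in the acceptance step, which is why the normalizing constant (here, the role played by the unknown P(D)) never has to be computed.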