Gil McVean Department of Statistics, Oxford Approximate genealogical inference.


Motivation: We have a genome’s worth of data on genetic variation. We would like to use these data to make inferences about multiple processes: recombination, mutation, natural selection and demographic history.

Example I: Recombination. In humans, the recombination rate varies along a chromosome. Recombination has characteristic influences on patterns of genetic variation. We would like to estimate the profile of recombination from the variation data, and to learn about the factors influencing the rate.

Example II: Genealogical inference. The genealogical relationships between sequences are highly informative about underlying processes. We would like to estimate these relationships from DNA sequences. We could use them to learn about history, selection and the location of disease-associated mutations.

Modelling genetic variation: We have a probabilistic model that can describe the effects of diverse processes on genetic variation: the coalescent. Coalescent modelling describes the distribution of genealogical relationships between sequences sampled from idealised populations. Patterns of genetic variation result from mapping mutations onto the genealogy.

Where do these trees come from? [Figure label: present day]

Ancestry of current population. [Figure label: present day]

Ancestry of sample. [Figure label: present day]

The coalescent: a model of genealogies. [Figure labels: time, coalescence, most recent common ancestor (MRCA), ancestral lineages, present day]

Coalescent modelling describes the distribution of genealogies

…and data
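As a minimal sketch of what "a distribution of genealogies" means, the following Python draws genealogies under the standard (Kingman) coalescent without recombination. It assumes time is measured in units of 2N generations, so that k ancestral lineages coalesce at total rate k(k-1)/2. The function names are illustrative and do not come from the talk.

```python
import random


def simulate_genealogy(n, rng):
    """Simulate one coalescent genealogy for n sampled sequences.

    Returns a list of (time, merged_lineage) events, where each lineage is
    the sorted tuple of sample labels that descend from it.
    Assumption: time is in units of 2N generations (standard coalescent).
    """
    lineages = [(i,) for i in range(n)]
    t = 0.0
    events = []
    while len(lineages) > 1:
        k = len(lineages)
        # Waiting time to the next coalescence is exponential with rate C(k, 2).
        t += rng.expovariate(k * (k - 1) / 2.0)
        # A uniformly chosen pair of lineages finds a common ancestor.
        i, j = rng.sample(range(k), 2)
        merged = tuple(sorted(lineages[i] + lineages[j]))
        lineages = [l for idx, l in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
        events.append((t, merged))
    return events


def mean_tmrca(n, reps, seed=1):
    """Average time to the most recent common ancestor over many genealogies."""
    rng = random.Random(seed)
    return sum(simulate_genealogy(n, rng)[-1][0] for _ in range(reps)) / reps
```

Under these assumptions the expected time to the MRCA is 2(1 - 1/n), so `mean_tmrca(10, 20000)` should be close to 1.8.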

Generalising the coalescent: The impact of many different forces can readily be incorporated into coalescent modelling. With recombination, the history of the sample is described by a complex graph in which local genealogical trees are embedded, called the ancestral recombination graph (ARG). [Figure label: ancestral chromosome recombines]

Genealogical trees vary along a chromosome

Coalescent-based inference: We would like to use the coalescent model to drive inference about underlying processes. Generally, we would like to calculate the likelihood function. However, there is a many-to-one mapping of ARGs (and genealogies) to data. Consequently, we have to integrate out the ‘missing data’ of the ARG. Except in trivial examples, this can only be done using Monte Carlo methods.
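One of the trivial cases can illustrate the idea of integrating out the genealogy. For two sequences without recombination, the genealogy is a single coalescence time T, and the sketch below assumes T is exponential with mean 1 (units of 2N generations) and that the number of differences S given T is Poisson with mean θT, where θ = 4Nμ. Averaging the Poisson probability over simulated T values is a Monte Carlo estimate of the likelihood, which in this case can be checked against the closed form θ^s / (1 + θ)^(s+1). The names and parameterisation are assumptions for illustration, not the method used in the talk.

```python
import math
import random


def poisson_pmf(s, mean):
    """Probability of s events under a Poisson distribution with the given mean."""
    if mean == 0.0:
        return 1.0 if s == 0 else 0.0
    return math.exp(-mean + s * math.log(mean) - math.lgamma(s + 1))


def mc_likelihood(s, theta, reps, rng):
    """Monte Carlo estimate of P(S = s | theta), integrating over the genealogy."""
    total = 0.0
    for _ in range(reps):
        t = rng.expovariate(1.0)           # draw a genealogy (coalescence time)
        total += poisson_pmf(s, theta * t)  # probability of the data given it
    return total / reps


def exact_likelihood(s, theta):
    """Closed form for the two-sequence case, used only as a check."""
    return theta ** s / (1.0 + theta) ** (s + 1)
```

With recombination and larger samples no such closed form is available, which is why the genealogies have to be sampled.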

A problem and a possible solution: Efficient exploration of the space of ARGs is a difficult problem. The difficulties of performing efficient exact genealogical inference (at least within a coalescent framework) currently seem insurmountable. There are several possible solutions: dimension reduction, approximating the model, or approximating the likelihood function. One approach that has proved useful is composite likelihood: combining information from subsets of the data for which the likelihood function can be estimated.

Example I: Recombination rate estimation. We can estimate the likelihood function for the recombination fraction separating two SNPs. To approximate the likelihood for the whole data set, we simply multiply the marginal likelihoods (Hudson 2001). The method performs well in point estimation.
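Multiplying the pairwise marginal likelihoods is the same as summing their logarithms over all pairs of SNPs. The sketch below shows that bookkeeping and a simple grid search for the point estimate. The pairwise log-likelihood is passed in as a function; the `toy_pair_loglik` used here is a hypothetical stand-in chosen so the answer is known, not the coalescent pairwise likelihood of Hudson (2001), and a constant rate per kb is assumed.

```python
def composite_loglik(rho_per_kb, positions_kb, pair_loglik):
    """Sum of pairwise log-likelihoods over all SNP pairs.

    pair_loglik(i, j, rho_ij) must return the log-likelihood of SNP pair
    (i, j) given the recombination rate rho_ij between them.
    """
    total = 0.0
    for i in range(len(positions_kb)):
        for j in range(i + 1, len(positions_kb)):
            distance = positions_kb[j] - positions_kb[i]
            total += pair_loglik(i, j, rho_per_kb * distance)
    return total


def maximise_on_grid(grid, positions_kb, pair_loglik):
    """Point estimate: the grid value with the highest composite log-likelihood."""
    return max(grid, key=lambda rho: composite_loglik(rho, positions_kb, pair_loglik))


# Hypothetical toy example: every pair "prefers" a rate of 2.0 per kb.
positions_kb = [0.0, 1.0, 2.5, 4.0]


def toy_pair_loglik(i, j, rho_ij):
    target = 2.0 * (positions_kb[j] - positions_kb[i])
    return -(rho_ij - target) ** 2


grid = [0.5 * k for k in range(1, 9)]
best = maximise_on_grid(grid, positions_kb, toy_pair_loglik)
```

Because overlapping pairs share data, the curvature of this function overstates the information available, so it is suited to point estimation rather than to interval estimation.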

[Figure: ln L plotted against R for the full likelihood and for the composite-likelihood approximation]

Good and bad things about CL. Good things: estimation using CL can be made very efficient; it performs well in simulations; it can generalise to variable recombination rates. Bad things: it throws away information; it is NOT a true likelihood; it typically underestimates uncertainty because of ‘double-counting’.

Fitting a variable recombination rate: Use a reversible-jump MCMC approach (Green 1995). Moves: split blocks, merge blocks, change block size, change block rate. [Figure labels: cold, hot, SNP positions]

Acceptance rates: The acceptance ratio is the product of the composite likelihood ratio, the Hastings ratio, the ratio of priors, and the Jacobian of partial derivatives relating changes in parameters to sampled random numbers. Include a prior on the number of change points that encourages smoothing.
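The terms on this slide combine in the usual reversible-jump way (Green 1995): the move is accepted with probability min(1, product of the four ratios). A minimal sketch in log space follows; the function and argument names are illustrative, and the log Jacobian is taken to be zero for moves that do not change dimension.

```python
import math
import random


def log_acceptance(log_cl_new, log_cl_old, log_prior_new, log_prior_old,
                   log_hastings, log_jacobian=0.0):
    """Log acceptance probability for a proposed move.

    Sums the log composite-likelihood ratio, the log ratio of priors, the
    log Hastings ratio and the log Jacobian, then caps the result at 0.
    """
    log_ratio = ((log_cl_new - log_cl_old)
                 + (log_prior_new - log_prior_old)
                 + log_hastings
                 + log_jacobian)
    return min(0.0, log_ratio)


def accept(log_alpha, rng):
    """Accept the move with probability exp(log_alpha)."""
    return rng.random() < math.exp(log_alpha)
```

A prior that penalises extra change points enters through `log_prior_new - log_prior_old`, which is how it discourages split moves that the data do not support.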

rjMCMC in action: 200 kb of the HLA region, with strong evidence of LD breakdown.

How do you validate the method? Concordance with rate estimates from sperm-typing experiments at fine scales. Concordance with pedigree-based genetic maps at broad scales.

Strong concordance between fine-scale rate estimates from sperm and from genetic variation. Rates estimated from sperm: Jeffreys et al. (2001). Rates estimated from genetic variation: McVean et al. (2004).

We have generated a map of hotspots across the human genome (Myers et al. 2005).

We have identified DNA sequence motifs that explain 40% of all hotspots (Myers et al. 2008).

Example II: Estimating local genealogies. [Figure labels: age of mutation, date of population founding, migration and admixture]

The decay of a tree by recombination

Two-sequence case: Any pair of haplotypes will have regions of high and low divergence. We can combine HMM structures with numerical techniques (Gaussian quadrature) to estimate the marginal likelihood surface at a given position, x. We can further approximate the likelihood surface by fitting a scaled gamma distribution. This massively reduces the computational load of subsequent steps. In the case of no recombination, the true surface is a scaled gamma distribution.
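The slides do not say how the scaled gamma is fitted. One simple possibility, shown here purely as an assumption, is moment matching: given the likelihood surface evaluated on a grid of coalescence times, take the scale factor from its integral and the gamma shape and scale from its mean and variance. All names below are illustrative.

```python
import math


def gamma_pdf(t, shape, scale):
    """Gamma density with the given shape and scale (0 for t <= 0)."""
    if t <= 0.0:
        return 0.0
    return math.exp((shape - 1.0) * math.log(t) - t / scale
                    - math.lgamma(shape) - shape * math.log(scale))


def fit_scaled_gamma(ts, values):
    """Fit values(t) ~ c * gamma_pdf(t, shape, scale) by matching moments.

    ts is an increasing grid of times and values the surface on that grid.
    Integrals are approximated with the trapezoid rule.
    """
    def trapz(ys):
        return sum(0.5 * (ys[i] + ys[i + 1]) * (ts[i + 1] - ts[i])
                   for i in range(len(ts) - 1))

    c = trapz(values)
    mean = trapz([t * v for t, v in zip(ts, values)]) / c
    second = trapz([t * t * v for t, v in zip(ts, values)]) / c
    var = second - mean * mean
    return c, mean * mean / var, var / mean


# Check on a surface that really is a scaled gamma: c = 3, shape = 2, scale = 1.5.
ts = [0.01 * i for i in range(4001)]
surface = [3.0 * gamma_pdf(t, 2.0, 1.5) for t in ts]
c_hat, shape_hat, scale_hat = fit_scaled_gamma(ts, surface)
```

Storing three numbers per surface instead of a full grid is what makes the later tree-building steps cheap.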

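Combining surfaces: Suppose we have a partially reconstructed tree. We can approximate the probability of any further step in the tree using the composite likelihood. [Equation and diagram not recoverable from the transcript: a probability Pr(·) for a candidate event, with a time axis from 0 to t.] This assumes that un-coalesced ancestors are independent draws from the stationary distribution.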

An important detail: Actually, don’t use exactly this construction. Use a ‘nearest-neighbour’ construction: each lineage chooses a nearest neighbour; choose which nearest-neighbour event occurs; choose a time for that event. This still uses composite likelihood.

Building the tree: We can use these functions to choose (e.g. maximise or sample) the next event. The gamma approximation leads to an efficient algorithm for estimating the local genealogy that has the same time and memory complexity as neighbour-joining. This means it can be applied to large data sets.

Desirable properties of the algorithm: It can be fully stochastic (unlike NJ, UPGMA, ML). It returns the prior in the absence of data. It returns the truth in the limit of infinite data. It is correct for a single SNP. It is close to the optimal proposal distribution (as defined by Stephens and Donnelly 2000) in the case of no recombination. It uses much of the available information. It is fast: its time complexity in n is the same as for NJ.

Example: mutation rate = recombination rate. The true tree at 0.5. Simulations: mutation rate parameter = 10 (its symbol was lost in extraction), R = 10.

How to evaluate tree accuracy? Specific applications will require different aspects of the estimated trees to be more or less accurate. Nevertheless, a general approach is to compare the representation of bi-partitions in the true tree to estimated ones. Rather than require 100% accuracy in predicting a bi-partition, we can (for every observed bi-partition) find the ‘most similar’ bi-partition in the estimated tree. We should also weight by the branch length associated with each bi-partition.
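A minimal sketch of this score, assuming each bi-partition is coded as a 0/1 membership vector over the samples and that 'most similar' means the highest squared correlation (r²) between vectors. The representation and function names are assumptions for illustration.

```python
def r_squared(a, b):
    """Squared correlation between two 0/1 membership vectors."""
    n = len(a)
    pa = sum(a) / n
    pb = sum(b) / n
    pab = sum(x * y for x, y in zip(a, b)) / n
    denom = pa * (1.0 - pa) * pb * (1.0 - pb)
    if denom == 0.0:
        return 0.0  # a trivial split carries no information
    return (pab - pa * pb) ** 2 / denom


def weighted_max_r2(true_splits, estimated_splits):
    """Branch-length-weighted average of the best r2 for each true bi-partition.

    true_splits: list of (membership_vector, branch_length) from the true tree.
    estimated_splits: list of membership vectors from the estimated tree.
    """
    total_length = sum(length for _, length in true_splits)
    score = sum(length * max(r_squared(split, est) for est in estimated_splits)
                for split, length in true_splits)
    return score / total_length
```

For example, with true bi-partitions `[((1, 1, 0, 0), 2.0), ((1, 0, 0, 0), 1.0)]` and a single estimated bi-partition `(1, 1, 0, 0)`, the first is recovered exactly (r² = 1) and the second only partly (r² = 1/3), giving a weighted score of 7/9. Because r² is unchanged when a vector is complemented, it does not matter which side of a split is coded as 1.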

Simple distance weighting (UPGMA): average weighted max r² = 0.59. Single sample from posterior: average weighted max r² = 0.96. Simulation parameters: n = 100, [symbol lost in extraction] = 20, R = 30 (hotspots).

Open questions: Obtaining useful estimates of uncertainty (power transformations of the composite likelihood function). Using larger subsets of the data (e.g. quartets).

Acknowledgements: Many thanks to Oxford Statistics (Simon Myers, Chris Hallsworth, Adam Auton, Colin Freeman, Peter Donnelly), Lancaster (Paul Fearnhead) and the International HapMap Project.