Exact Computation of Coalescent Likelihood under the Infinite Sites Model. Yufeng Wu, University of Connecticut. DIMACS Workshop on Algorithmics in Human Population-Genomics.

Presentation transcript:

Exact Computation of Coalescent Likelihood under the Infinite Sites Model
Yufeng Wu, University of Connecticut
DIMACS Workshop on Algorithmics in Human Population-Genomics

Coalescent Likelihood
D: a set of binary sequences.
Coalescent genealogy: a history of D made up of coalescent and mutation events.
Coalescent likelihood P(D): the probability of observing D under the coalescent model, given the mutation rate θ.
Assumptions: no recombination; infinitely many sites model of mutation.
[Figure: a genealogy with coalescent and mutation events marked.]
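The two assumptions on this slide (no recombination, infinitely many sites) can be checked on a dataset with the standard four-gamete test: no pair of sites may show all four combinations 00, 01, 10 and 11. The sketch below is only an illustration; the function name and the toy datasets are made up here and are not from the talk.

```python
from itertools import combinations

def fits_infinite_sites(seqs):
    """Four-gamete test on equal-length binary strings.

    Returns True if no pair of sites shows all of 00, 01, 10, 11,
    i.e. the data admit a perfect phylogeny.
    """
    n_sites = len(seqs[0])
    for i, j in combinations(range(n_sites), 2):
        gametes = {(s[i], s[j]) for s in seqs}
        if len(gametes) == 4:
            return False
    return True

# Toy datasets (hypothetical, for illustration only).
D_ok = ["000", "100", "110", "001"]
D_bad = ["00", "01", "10", "11"]
print(fits_infinite_sites(D_ok), fits_infinite_sites(D_bad))
```

Data that fail this test cannot be explained by the model assumed here, so P(D) under that model would be zero.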

Computing Coalescent Likelihood
Computation of P(D): a classic population genetics problem.
Statistical (inexact) approaches:
– Importance sampling (IS): Griffiths and Tavaré (1994), Stephens and Donnelly (2000), Hobolth, Uyenoyama and Wiuf (2008).
– MCMC: Kuhner, Yamato and Felsenstein (1995).
Genetree: IS-based and widely used, but its estimates still have variance, sometimes large.
How feasible is exact computation of P(D)?
– Considered difficult even for medium-sized data (Song, Lyngsø and Hein, 2006).
This talk: exact computation of P(D) is feasible for data significantly larger than previously believed.
– A simple algorithmic trick: dynamic programming.

Ethier-Griffiths Recursion
Build a perfect phylogeny for D.
Ancestral configuration (AC): for each sequence type present at some time, a pair of its multiplicity and its list of mutations.
Transition probability between ACs: depends on the AC and θ.
Genealogy: a path of ACs (from the present to the root).
P(D): the sum of the probabilities of all paths.
EG: faster summation, backwards in time.
[Figure: example ACs, each a list of (multiplicity, mutation list) pairs:
(1, 0), (3, 4 0), (1, 3 2 0), (1, 1 0), (1, 5 1 0)
(1, 0), (1, 0), (1, 3 2 0), (1, 1 0), (1, 5 1 0)
(1, 0), (2, 4 0), (1, 3 2 0), (1, 1 0), (1, 5 1 0)
(1, 0), (1, 4 0), (1, 3 2 0), (1, 1 0), (1, 5 1 0)]

Computing Exact Likelihood
Key idea: forward instead of backwards.
– Create all possible ACs reachable from the current AC (starting from the root) and update their probabilities.
– Intuition of an AC: growing coverage of the phylogeny, starting from the root.
Possible events at the root: three branchings (b1, b2, b3) and three mutations (m1, m2, m4).
Branching: covers a new branch.
A covered branch can mutate.
Mutated branches covered branches (unless all branches are covered)
Each event: a new AC.
[Figure: events b1, m2 and b2 applied starting from the root AC.]

Why Forward?
Bottleneck: memory.
Layer of ACs: the ACs reached by k mutation or branching events from the root AC, k = 1, 2, 3, …
Key: only the current layer needs to be kept, which is memory efficient.
A single forward pass is enough to compute P(D).
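The layer idea can be sketched as a generic forward dynamic program that holds one layer of states at a time. Everything in the example is a stand-in: `successors` and `is_final` are hypothetical callbacks, and the toy state space (pairs of counters with a fixed step probability of 0.5) replaces real ACs and the real θ-dependent transition probabilities.

```python
from collections import defaultdict

def forward_layers(root, successors, is_final):
    """Sum path probabilities from root to final states, one layer at a time.

    successors(state) returns (next_state, probability) pairs; only the
    current layer's dictionary is kept in memory.
    """
    layer = {root: 1.0}
    total = 0.0
    while layer:
        next_layer = defaultdict(float)
        for state, prob in layer.items():
            if is_final(state):
                total += prob
                continue
            for child, step_prob in successors(state):
                # Paths reaching the same state are merged here.
                next_layer[child] += prob * step_prob
        layer = dict(next_layer)
    return total

# Toy stand-in: a state counts two kinds of events, each capped at 2.
def toy_successors(state):
    a, b = state
    moves = []
    if a < 2:
        moves.append(((a + 1, b), 0.5))
    if b < 2:
        moves.append(((a, b + 1), 0.5))
    return moves

print(forward_layers((0, 0), toy_successors, lambda s: s == (2, 2)))
```

In the toy run there are 6 paths of 4 steps each, so the result is 6 × 0.5^4 = 0.375. Because every state in layer k was reached by exactly k events, earlier layers can be discarded once the next one has been built.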

Results on Simulated Data
Data simulated with Hudson's program ms: 20, 30, 40 and 50 haplotypes with θ = 1, 3 and 5.
Each setting: 100 datasets.
How many allow exact computation of P(D) within a reasonable amount of time?
[Figure: two plots against the number of haplotypes: % of feasible data, and average run time (sec.) for feasible data.]

Results on Mitochondrial Data
Mitochondrial data from Ward et al. (1991), previously analyzed by Griffiths and Tavaré (1994) and others.
– 55 sequences and 18 polymorphic sites.
– Believed to fit the infinite sites model.
MLE of θ: 4.8 (Griffiths and Tavaré, 1994).
– Is 4.8 really the MLE?
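Checking an MLE with an exact method amounts to evaluating P(D) over a range of values of the mutation rate θ and locating the maximum. The toy sketch below uses the one case with a simple closed form, a sample of two sequences differing at k sites, where P(k) = (θ/(1+θ))^k · 1/(1+θ). It does not use the Ward et al. data, and the function names and grid are illustrative assumptions.

```python
def pair_likelihood(k, theta):
    """Exact probability that two sequences differ at k sites (infinite sites).

    Going back in time, each event is a mutation with probability
    theta / (1 + theta) and the coalescence with probability 1 / (1 + theta).
    """
    return (theta / (1 + theta)) ** k / (1 + theta)

def grid_mle(k, grid):
    """Return the grid value of theta that maximises the exact likelihood."""
    return max(grid, key=lambda theta: pair_likelihood(k, theta))

grid = [i / 10 for i in range(1, 101)]  # theta = 0.1, 0.2, ..., 10.0
print(grid_mle(3, grid))
```

For this toy case the maximiser can be found analytically (θ = k), so the grid search can be checked by hand. For larger samples no such closed form is available, and each grid point would need a full exact evaluation of P(D).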

Conclusion
IS seems to work well for the mitochondrial data.
– However, IS can still have large variance for some data.
– Thus, exact computation may help when the data is not very large and/or the mutation rate is relatively low.
– It can also help to evaluate different statistical methods.
Paper: in proceedings of ISBRA.
Research supported by the National Science Foundation.