Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1.

Slides:



Advertisements
Similar presentations
Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut DIMACS Workshop on Algorithmics in Human.
Advertisements

Introduction to Haplotype Estimation Stat/Biostat 550.
Efficient Computation of Close Upper and Lower Bounds on the Minimum Number of Recombinations in Biological Sequence Evolution Yun S. Song, Yufeng Wu,
Inferring Local Tree Topologies for SNP Sequences Under Recombination in a Population Yufeng Wu Dept. of Computer Science and Engineering University of.
An Algorithm for Constructing Parsimonious Hybridization Networks with Multiple Phylogenetic Trees Yufeng Wu Dept. of Computer Science & Engineering University.
Improved Algorithms for Inferring the Minimum Mosaic of a Set of Recombinants Yufeng Wu and Dan Gusfield UC Davis CPM 2007.
Population Genetics, Recombination Histories & Global Pedigrees Finding Minimal Recombination Histories Global Pedigrees Finding.
The Estimation Problem How would we select parameters in the limiting case where we had ALL the data? k → l  l’ k→ l’ Intuitively, the actual frequencies.
HMM II: Parameter Estimation. Reminder: Hidden Markov Model Markov Chain transition probabilities: p(S i+1 = t|S i = s) = a st Emission probabilities:
Sampling distributions of alleles under models of neutral evolution.
Preview What does Recombination do to Sequence Histories. Probabilities of such histories. Quantities of interest. Detecting & Reconstructing Recombinations.
Tree Reconstruction.
Forward Genealogical Simulations Assumptions:1) Fixed population size 2) Fixed mating time Step #1:The mating process: For a fixed population size N, there.
1 A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem Zhihong Ding, Vladimir Filkov, Dan Gusfield Department of Computer Science.
Hidden Markov Model 11/28/07. Bayes Rule The posterior distribution Select k with the largest posterior distribution. Minimizes the average misclassification.
Islands in Africa: a study of structure in the source population for modern humans Rosalind Harding Depts of Statistics, Zoology & Anthropology, Oxford.
Inference of Complex Genealogical Histories In Populations and Application in Mapping Complex Traits Yufeng Wu Dept. of Computer Science and Engineering.
. Maximum Likelihood (ML) Parameter Estimation with applications to inferring phylogenetic trees Comput. Genomics, lecture 7a Presentation partially taken.
WABI 2005 Algorithms for Imperfect Phylogeny Haplotyping (IPPH) with a Single Homoplasy or Recombnation Event Yun S. Song, Yufeng Wu and Dan Gusfield University.
Lecture 5: Learning models using EM
Association Mapping of Complex Diseases with Ancestral Recombination Graphs: Models and Efficient Algorithms Yufeng Wu UC Davis RECOMB 2007.
March 2006Vineet Bafna CSE280b: Population Genetics Vineet Bafna/Pavel Pevzner
Haplotyping via Perfect Phylogeny Conceptual Framework and Efficient (almost linear-time) Solutions Dan Gusfield U.C. Davis RECOMB 02, April 2002.
CSB Efficient Computation of Minimum Recombination With Genotypes (Not Haplotypes) Yufeng Wu and Dan Gusfield University of California, Davis.
Continuous Coalescent Model
Dispersal models Continuous populations Isolation-by-distance Discrete populations Stepping-stone Island model.
Realistic evolutionary models Marjolijn Elsinga & Lars Hemel.
Inferring Evolutionary History with Network Models in Population Genomics: Challenges and Progress Yufeng Wu Dept. of Computer Science and Engineering.
Fast Computation of the Exact Hybridization Number of Two Phylogenetic Trees Yufeng Wu and Jiayin Wang Department of Computer Science and Engineering University.
Phylogenetic Estimation using Maximum Likelihood By: Jimin Zhu Xin Gong Xin Gong Sravanti polsani Sravanti polsani Rama sharma Rama sharma Shlomit Klopman.
Monte Carlo methods for estimating population genetic parameters Rasmus Nielsen University of Copenhagen.
Inference of Genealogies for Recombinant SNP Sequences in Populations Yufeng Wu Computer Science and Engineering Department University of Connecticut
Phylogenetic Networks of SNPs with Constrained Recombination D. Gusfield, S. Eddhu, C. Langley.
Building Phylogenies Parsimony 1. Methods Distance-based Parsimony Maximum likelihood.
RECOMB Satellite Workshop, 2007 Algorithms for Association Mapping of Complex Diseases With Ancestral Recombination Graphs Yufeng Wu UC Davis.
Computer vision: models, learning and inference
Phylogeny Estimation: Traditional and Bayesian Approaches Molecular Evolution, 2003
Haplotype Blocks An Overview A. Polanski Department of Statistics Rice University.
Learning Lateral Connections between Hidden Units Geoffrey Hinton University of Toronto in collaboration with Kejie Bao University of Toronto.
Artificial Intelligence in Game Design N-Grams and Decision Tree Learning.
More statistical stuff CS 394C Feb 6, Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.
Getting Parameters from data Comp 790– Coalescence with Mutations1.
Gene tree discordance and multi-species coalescent models Noah Rosenberg December 21, 2007 James Degnan Randa Tao David Bryant Mike DeGiorgio.
1 Population Genetics Basics. 2 Terminology review Allele Locus Diploid SNP.
Estimating Recombination Rates. LRH selection test, and recombination Recall that LRH/EHH tests for selection by looking at frequencies of specific haplotypes.
Population genetics. coalesce 1.To grow together; fuse. 2.To come together so as to form one whole; unite: The rebel units coalesced into one army to.
California Pacific Medical Center
Association mapping for mendelian, and complex disorders January 16Bafna, BfB.
Selecting Genomes for Reconstruction of Ancestral Genomes Louxin Zhang Department of Mathematics National University of Singapore.
Estimating Recombination Rates. Daly et al., 2001 Daly and others were looking at a 500kb region in 5q31 (Crohn disease region) 103 SNPs were genotyped.
Sporadic model building for efficiency enhancement of the hierarchical BOA Genetic Programming and Evolvable Machines (2008) 9: Martin Pelikan, Kumara.
Probabilistic methods for phylogenetic tree reconstruction BMI/CS 576 Colin Dewey Fall 2015.
Automated Speach Recognotion Automated Speach Recognition By: Amichai Painsky.
Probabilistic Approaches to Phylogenies BMI/CS 576 Sushmita Roy Oct 2 nd, 2014.
Fixed Parameters: Population Structure, Mutation, Selection, Recombination,... Reproductive Structure Genealogies of non-sequenced data Genealogies of.
8/3/2007CMSC 341 BTrees1 CMSC 341 B- Trees D. Frey with apologies to Tom Anastasio.
Recombination and Pedigrees Genealogies and Recombination: The ARG Recombination Parsimony The ARG and Data Pedigrees: Models and Data Pedigrees & ARGs.
Yufeng Wu and Dan Gusfield University of California, Davis
An Algorithm for Computing the Gene Tree Probability under the Multispecies Coalescent and its Application in the Inference of Population Tree Yufeng Wu.
L4: Counting Recombination events
Recitation 5 2/4/09 ML in Phylogeny
Estimating Recombination Rates
Statistical Modeling of Ancestral Processes
CSCI 5822 Probabilistic Models of Human and Machine Learning
Recombination, Phylogenies and Parsimony
Trees & Topologies Chapter 3, Part 2
Trees & Topologies Chapter 3, Part 2
CSE 373 Data Structures and Algorithms
The Most General Markov Substitution Model on an Unrooted Tree
Outline Cancer Progression Models
Presentation transcript:

Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA

Coalescent Likelihood D: a set of binary sequences. Coalescent genealogy: history with coalescent and mutation events. Coalescent likelihood P(D): probability of observing D on coalescent model given mutation rate  Assume no recombination Coalescent Mutation

Perfect Phylogeny Infinite many sites model of mutations: one mutation per site in history Perfect phylogeny –Site labels tree branches –Each site appears exactly once. –Sequence: list of mutations from root to leaf. –Unique topologically, except root is unknown and order of mutations on the same branch is not fixed

Genealogy and Perfect Phylogeny Perfect phylogeny: not enough timing information –Exists many coalescent genealogy for a fixed perfect phylogeny. –Each genealogy: different probability (depending on  ) Coalescent likelihood: sum over all compatible genealogy

Computing Coalescent Likelihood Computation of P(D): classic population genetics problem. Statistical (inexact) approaches: –Importance sampling (IS): Griffiths and Tavare (1994), Stephens and Donnelly (2000), Hobolth, Uyenoyama and Wiuf (2008). –MCMC: Kuhner,Yamato and Felsenstein (1995). Genetree: IS-based, widely used but (sometimes large) variance still exists. How feasible of computing exact P(D)? –Considered to be difficult for even medium-sized data (Song, Lyngso and Hein, 2006). This talk: exact computation of P(D) is feasible for data significantly larger than previously believed. –A simple algorithmic trick: dynamic programming 5

Ethier-Griffiths Recursion Build a perfect phylogeny for D. Ancestral configuration (AC): pairs of sequence multiplicity and list of mutations for each sequence type at some time Transition probability between ACs: depends on AC and . Genealogy: path of ACs (from present to root) P(D): sum of probability of all paths. EG: faster summation, backwards in time. (1, 0), (3, 4 0), (1, 3 2 0), (1, 1 0), (1, 5 1 0) (1, 0), (1, 0), (1, 3 2 0), (1, 1 0), (1, 5 1 0) (1, 0), (2, 4 0), (1, 3 2 0), (1, 1 0), (1, 5 1 0) (1, 0), (1, 4 0), (1, 3 2 0), (1, 1 0), (1, 5 1 0) (3, 4 0) 6

Computing Exact Likelihood Key idea: forward instead of backwards –Create all possible ACs reachable from the current AC (start from root). Update probability. –Intuition of AC: growing coverage of the phylogeny, starting from root Possible events at root: three branching (b 1, b 2, b 3 ), three mutations (m 1, m 2, m 4 ). Branching: cover new branch Covered branch can mutate Each event: a new AC b1b1 m2m2 b2b2 Start from root AC 7

Forward Finding of ACs Finding all ACs by forward looking. Maintain a list of active events in each AC. Update in the new ACs. Rule: at a node in phylogeny, mutated branches  covered branches (unless all branches are covered) Mut branch = covered branch 8

Why Forward? Bottleneck: memory Layer of ACs: ACs with k mutation or branching events from root AC, k= 1,2,3… Key: only the current layer needs to be kept. Memory efficient. A single forward pass is enough to compute P(D). 9 Coalescent Mutation

Results on Simulated Data Use Hudson’s program ms: 20, 30, 40 and 50 sequences with  = 1, 3 and 5. Each settings: 100 datasets. How many allow exact computation of P(D) within reasonable amount of time? Number of sequences % of feasible data Number of sequences Ave. run time (sec.) for feasible data 10

Application: MLE of Mutation Rate Given a set of sequences D, what is the maximum likelihood estimate of the mutation rate  ? Issue: Need to compute for many possible  and root of genealogy is not known. Use exact likelihood –Compute P(D |  ) for  on a grid. –MLE of  : maximize P(D |  ). –Full likelihood: sum P(D |  ) over all possible roots. –Faster computation: compute P(D) for multiple  in one pass. Quicker than computing P(D) for one  each time. 11  P(D) MLE(  )

A Mitochondrial Data Mitochondrial data from Ward, et al. (1991). Previously analyzed by Griffiths and Tavare (1994) and others. –55 sequences and 18 polymorphic sites. –Believed to fit the infinite sites model. 12

MLE of Mutation Rate for the Mitochondrial Data MLE of  : 4.8 Griffiths and Tavare (1994) IS methods can have variance –Is 4.8 really the MLE? Compute the exact full likelihood for a grid of  between 4.0 to

Conclusion IS seems to work well for the Mitochondrial data –However, IS can still have large variance for some data. –Thus, exact computation may help when data is not very large and/or relatively low mutation rate. –Can also help to evaluate different statistical methods. Research supported by National Science Foundation. 14