Inferring Evolutionary History with Network Models in Population Genomics: Challenges and Progress Yufeng Wu Dept. of Computer Science and Engineering.

Presentation transcript:

Inferring Evolutionary History with Network Models in Population Genomics: Challenges and Progress. Yufeng Wu, Dept. of Computer Science and Engineering, University of Connecticut, USA. Dagstuhl Seminar, 2010.

Recombination. One of the principal genetic forces shaping sequence variation within species. Two equal-length sequences generate a third, new sequence of the same length in the genealogy. Spatial order is important: different parts of the genome are inherited from different ancestors. [Figure labels: prefix, suffix, breakpoint.]
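
A minimal sketch of the single-crossover picture described on this slide: the recombinant takes a prefix from one parent and the suffix from the other. The function name `recombine` and the 0-based breakpoint convention are assumptions made for illustration.

```python
def recombine(prefix_parent: str, suffix_parent: str, breakpoint: int) -> str:
    """Single-crossover recombination between two equal-length sequences.

    Sites before `breakpoint` come from `prefix_parent`,
    sites from `breakpoint` onward come from `suffix_parent`.
    """
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have equal length")
    return prefix_parent[:breakpoint] + suffix_parent[breakpoint:]


# Hypothetical example: a breakpoint after the second site.
child = recombine("0000", "1111", 2)
```

Here `child` is "0011": the same length as both parents, with each site inherited from exactly one of them.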

Ancestral Recombination Graph (ARG). Network model: beyond the tree model. Assumption: at most one mutation per site. [Figure: example sequences S1 = 00, S2 = 01, S3 = 10, S4 = 10 with mutations marked; a second example S1 = 00, S2 = 01, S3 = 10, S4 = with a recombination event marked.]

Reconstruction of Network-based Evolutionary History. Input: DNA sequences (haplotypes) or phylogenetic trees. Biology: meiotic recombination in populations, or reticulate evolutionary processes such as horizontal gene transfer or hybrid speciation. Different formulations, same objective: reconstruct the network-based evolutionary history (and solve related problems). Key concerns: efficiency and accuracy.

Reconstructing ARGs by Parsimony. Input: a set of binary sequences M. Goal: reconstruct ARGs deriving M. Parsimony formulation (minARG): minimize the number of recombination events. The problem is NP-complete (Wang, et al.). [Figure: Kreitman's data for the Adh locus of D. melanogaster (1983).]

The minARG Problem. Uniform sampling of minARGs by treating each minARG as equally likely (Wu). Estimating the range of minARGs: lower and upper bounds. Structurally constrained ARGs with simplified topology, e.g. galled trees (Wang, et al.; Gusfield, et al.). Heuristic methods, e.g. the program MARGARITA (Durbin, et al.), Song, et al., Parida, et al. Exact minARG by branch and bound (Lyngso, Song and Hein).

minARG for Kreitman's data. Challenge: accurate inference of ARGs. Rmin(M): minimum number of recombinations for M. L(M): lower bound on Rmin(M). U(M): upper bound on Rmin(M). Several lower bounds give L(M) = 7, and U(M) = 7 for Kreitman's data (Song, Wu and Gusfield). Since L(M) ≤ Rmin(M) ≤ U(M), it follows that Rmin(M) = 7.

ARG Induces Local Trees. Local tree: the evolutionary history at a single genomic position. To obtain it, trace backwards in time; at each recombination node, pick the branch that passes alleles to the recombinant at this position. [Figure: data, ARG with mutations and a recombination event, and the local tree near site 3.]
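
The backward trace described on this slide can be sketched as follows. The ARG encoding is an assumption made for illustration: each non-root node maps to a list of `(parent, start, end)` edges, where the parent passes sites in the half-open range `[start, end)`. A recombination node simply has two such edges with complementary ranges. The node names are hypothetical.

```python
def lineage(arg, leaf, site):
    """Trace one leaf backwards in time at a given site.

    arg: dict mapping node -> list of (parent, start, end) edges;
         the root has no entry. Assumes every non-root node has
         exactly one edge covering `site`.
    Returns the list of nodes visited, from the leaf up to the root.
    """
    path = [leaf]
    node = leaf
    while node in arg:
        # At a recombination node, follow the parent that covers this site.
        node = next(p for p, start, end in arg[node] if start <= site < end)
        path.append(node)
    return path


# Hypothetical 4-site ARG: node "r" is a recombinant whose first two
# sites come from "x" and whose last two sites come from "y".
arg = {
    "a": [("x", 0, 4)],
    "b": [("y", 0, 4)],
    "r": [("x", 0, 2), ("y", 2, 4)],
    "x": [("root", 0, 4)],
    "y": [("root", 0, 4)],
}
left = lineage(arg, "r", 1)
right = lineage(arg, "r", 3)
```

In this toy ARG the recombinant "r" attaches under "x" in the local tree at site 1 and under "y" at site 3, which is why the two local trees differ.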

Local Trees Change Across the Genome. Local trees change when moving across recombination breakpoints. Spatial property: nearby local trees tend to be more similar. How good are the inferred ARGs? Compare the inferred local tree topologies with the simulated trees. [Figure: data and the local tree near site 2.]

Inferring Local Trees. Problem: given binary sequences, infer local tree topologies (one tree for each site, ignoring branch lengths). Parsimony-based approaches: Hein (1990, 1993), Song and Hein (2005). Wu (2010): exploit shared topological features in nearby trees. Key: local trees have different topologies due to recombination. Trees or network? Do not reconstruct the full network; local trees are very informative. Challenge: how to improve the accuracy? Accuracy is measured by the Robinson-Foulds distance between each inferred tree and the corresponding simulated tree.
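
A minimal sketch of the Robinson-Foulds comparison mentioned above, under two assumptions that are not from the slides: trees are rooted and written as nested tuples of leaf labels, and the distance is the cluster-based (rooted) variant, i.e. the number of non-trivial clades present in exactly one of the two trees. Leaf labels must not themselves be tuples.

```python
def clusters(tree):
    """Return the set of non-trivial clades of a nested-tuple rooted tree."""
    found = set()

    def visit(node):
        if not isinstance(node, tuple):       # a leaf label
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        found.add(leaves)
        return leaves

    all_leaves = visit(tree)
    found.discard(all_leaves)                 # the root clade is trivial
    return found


def rf_distance(t1, t2):
    """Rooted Robinson-Foulds distance: size of the symmetric difference."""
    return len(clusters(t1) ^ clusters(t2))


# Hypothetical four-leaf trees.
balanced = ((1, 2), (3, 4))
swapped = ((1, 3), (2, 4))
caterpillar = (((1, 2), 3), 4)
```

With these examples, `rf_distance(balanced, swapped)` is 4 (no clade is shared) and `rf_distance(balanced, caterpillar)` is 2 (the clade {1, 2} is shared).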

RENT: REfining Neighboring Trees. Maintain for each SNP site a (possibly non-binary) tree topology, initialized to a tree containing the split induced by the SNP. Gradually refine the trees by adding new splits to them. Splits are found by a set of rules (described later); splits added early may be more reliable. Stop when the trees are binary or enough information is recovered.

A Little Background: Compatibility. Two sites (columns) p and q are incompatible if columns p and q together contain all four ordered pairs (gametes): 00, 01, 10, 11. Otherwise, p and q are compatible. The definition is easily extended to splits. [Figure: matrix M with rows a, b, c, d, e and sites A, B, C.] In the example, sites A and B are compatible, but A and C are incompatible.
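
The compatibility test defined on this slide (often called the four-gamete test) can be sketched directly. The matrix below is a hypothetical example chosen so that sites A and B are compatible while A and C are not; it is not the matrix from the slide, whose entries are not preserved in the transcript.

```python
def incompatible(M, p, q):
    """Return True if columns p and q of M contain all four gametes.

    M is a list of equal-length strings over {'0', '1'}, one row per sequence.
    """
    gametes = {(row[p], row[q]) for row in M}
    return len(gametes) == 4


# Hypothetical data: rows a..e, columns A, B, C (indices 0, 1, 2).
M = [
    "000",  # a
    "001",  # b
    "100",  # c
    "111",  # d
    "110",  # e
]
A, B, C = 0, 1, 2
ab = incompatible(M, A, B)   # gametes seen: 00, 10, 11
ac = incompatible(M, A, C)   # gametes seen: 00, 01, 10, 11
```

Here `ab` is False and `ac` is True, matching the pattern described in the text.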

Fully-Compatible Region: Simple Case. A fully-compatible region is a region of consecutive SNP sites that are pairwise compatible. It may indicate that no topology-altering recombination occurred within the region. Rule: for a site s, add any split from such a region to the tree at s. Compatibility is a very strong property and is unlikely to arise by chance.
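
One simple way to find regions of pairwise-compatible consecutive sites is a greedy left-to-right scan, sketched below. This is an illustrative assumption, not the procedure used by RENT: it partitions the sites into runs, whereas the rule above may consider any fully-compatible region around a site. The data matrix is hypothetical.

```python
def incompatible(M, p, q):
    """Four-gamete test on columns p and q of a binary matrix M."""
    return len({(row[p], row[q]) for row in M}) == 4


def compatible_runs(M):
    """Greedily split sites into runs of pairwise-compatible consecutive sites.

    Returns a list of (first_site, last_site) pairs, both inclusive.
    """
    n_sites = len(M[0])
    runs = []
    start = 0
    for s in range(1, n_sites):
        # Close the current run if site s conflicts with any site in it.
        if any(incompatible(M, t, s) for t in range(start, s)):
            runs.append((start, s - 1))
            start = s
    runs.append((start, n_sites - 1))
    return runs


# Hypothetical data: sites 0 and 1 are compatible, site 2 conflicts with both.
M = ["000", "001", "100", "111", "110"]
runs = compatible_runs(M)
```

For this matrix `runs` is `[(0, 1), (2, 2)]`, so the splits of sites 0 and 1 could be shared between their two trees.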

Split Propagation: More General Rule. Consider three consecutive sites A, B and C, where sites A and B are incompatible. Does site C matter for the tree at site A? The trees at sites A and B are different. Suppose site C is compatible with both A and B. Then site C may indicate a subtree shared by the trees at sites A and B. Rule: a split propagates in both directions until reaching an incompatible tree.

Reticulate Networks. Gene trees: phylogenetic trees from gene sequences, assumed binary and rooted. Different genes may have different topologies. Reticulate evolution is one explanation: hybrid speciation, horizontal gene transfer. Reticulate network: a directed acyclic graph displaying each of the gene trees. Hybridization event: a node with in-degree two or more. [Figure: sequence data for Gene A and Gene B; gene trees T and T' with roots ρ, each displayed in the network by keeping either the two red edges or the two black edges.]

The Minimum Reticulation Problem. Given: a set G of K gene trees. Problem: reconstruct reticulate networks that display each gene tree using Rmin(G) reticulation events, the minimum number possible. NP-complete, even for K=2. Current approaches: exact methods for the K=2 case (see Semple, et al.); imposing topological constraints (e.g. galled networks, see Huson, et al.). Challenge: efficient and accurate reconstruction of reticulate networks for multiple trees. Close lower and upper bounds for an arbitrary number of trees (Wu, 2010). [Figure: trees T1, T2, T3 and network N.]

Performance of PIRN: Optimal Solution. Lower and upper bounds often match for many data sets. [Plot: horizontal axis, number of taxa; vertical axis, % of data sets with LB = UB. K: number of trees; r: level of reticulation.]

Performance of PIRN: Gap of Bounds. The gap between the lower and upper bounds is often small for many data sets. [Plot: horizontal axis, number of taxa; vertical axis, gap between lower and upper bounds. K: number of trees; r: level of reticulation.]

Reticulate Network for Five Poaceae Trees. Genes: rpoC2, phyB, rbcL, ndhF, ITS. Lower bound: 11. Upper bound: 13.

Reticulate Network for Five Poaceae Trees (continued). [Figure: the network shown uses the upper bound of 13.]

Acknowledgement. Research supported by the National Science Foundation and the UConn Research Foundation. More information available at:

Coalescent with Recombination. Coalescent theory defines a probability distribution over genealogies. Likelihood computation for the coalescent with recombination: each ARG has a probability under given parameters, and the likelihood is the sum of the probabilities of all the ARGs. Challenging: too many ARGs (Lyngso, Song and Hein). Importance sampling approach: draw samples (ARGs) with respect to some probability distribution. This works well with no recombination, but not well with recombination.

Coalescent-based ARG Sampling. Uniform sampling of minARGs (Wu, 2007): treat each minARG as equally likely. Algorithm for generating a minARG uniformly at random (exponential time for setup, but polynomial time per sample). Challenge: develop a more general ARG sampling method that can efficiently sample ARGs approximately according to coalescent probabilities, i.e. the probability of ARGs under certain parameters. A related problem: compute the coalescent likelihood with recombination efficiently. Recent work: exact computation of the coalescent likelihood under the infinite sites model with no recombination (Wu, 2009).

The Mosaic Model. M: input sequences. Assumption: the input sequences are descendants of K founder sequences (unknown). Extant sequences: concatenations of exact copies of founder segments (no shift of position). Coloring: assign each position of each sequence to the founder (color) it comes from; the coloring must be consistent. [Figure: example M with K founders and a coloring; 5 breakpoints in total.]

The Minimum Mosaic Problem. Problem: given a set of binary sequences and the number of founders K, find a K-coloring of these sequences that minimizes the number of color changes (recombination breakpoints), and find the K founder sequences (which are not part of the input). [Figure: inferred founders for data from Rastas and Ukkonen: 20 sequences, 40 sites; 55 breakpoints, the minimum number.]
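
The full problem must also find the founders, which is the hard part. As a hedged illustration of the objective only, the sketch below solves the much easier subproblem in which the founders are already given: a simple dynamic program finds the fewest color changes needed to write one sequence as a mosaic of exact founder segments. The function name and the example founders are assumptions made for illustration.

```python
def min_breakpoints(seq, founders):
    """Fewest breakpoints needed to copy `seq` from the given founders.

    seq and every founder are non-empty, equal-length strings over {'0','1'}.
    Returns float('inf') if some site of `seq` matches no founder.
    """
    INF = float("inf")
    # cost[k]: fewest breakpoints for the prefix so far, ending on founder k.
    cost = [0 if f[0] == seq[0] else INF for f in founders]
    for j in range(1, len(seq)):
        best = min(cost)  # cheapest way to reach site j-1 on any founder
        cost = [
            min(cost[k], best + 1) if founders[k][j] == seq[j] else INF
            for k in range(len(founders))
        ]
    return min(cost)


# Hypothetical founders (K = 2) and extant sequences.
founders = ["0000", "1111"]
total = sum(min_breakpoints(s, founders) for s in ["0000", "0011", "0101"])
```

Here the three sequences need 0, 1 and 3 breakpoints respectively, so `total` is 4. Summing this quantity over all input sequences gives the objective that the minimum mosaic problem minimizes over all choices of K founders.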

The Minimum Mosaic Problem. Introduced by Ukkonen (2002). Simple and easy to visualize. Main known results: an exponential-time algorithm that runs in polynomial time for K=2 (Ukkonen 2002); an exact method that works for relatively small K and modest-sized data (Wu and Gusfield, 2007); the Haplovisual program and other extensions by Rastas and Ukkonen (2007); a heuristic algorithm by Roli and Blum (2009); lower bounds for the minimum number of breakpoints needed (Wu, 2010). Challenges: a polynomial-time algorithm for K ≥ 3? Concrete applications in biology?