D. Gusfield, V. Bansal (Recomb 2005) A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters.

Alternative Title: The Continuing Role of Incompatibility Graphs in the Study of Phylogenetic Networks

Genealogical or Phylogenetic Networks: The major biological motivation comes from genetics and attempts to reconstruct the history of recombination in populations. The results also have phylogenetic applications, for example in hybrid speciation and lateral gene transfer.

Reconstructing the Evolution of Binary Bio-Sequences (SNPs): Perfect Phylogeny (tree) model; Phylogenetic Networks (DAG) with recombination (ARG); Blobbed Trees; Incompatibility Graph and its Connected Components; Prior uses of Connected Components; Decomposition Theorem and Proof Sketch; Optimality Conjecture and Progress.

The Perfect Phylogeny Model for binary sequences: [Figure: a tree whose columns are the sites, with the ancestral sequence at the root, site mutations on the edges, and the extant sequences at the leaves.] The tree derives the set M, with one mutation per site.

When can a set of sequences be derived on a perfect phylogeny? Classic NASC (necessary and sufficient condition): Arrange the sequences in a matrix. Then (with no duplicate columns), the sequences can be generated on a unique perfect phylogeny if and only if no two columns (sites) contain all four pairs: 0,0 and 0,1 and 1,0 and 1,1. This is the 4-Gamete Test.
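The 4-gamete test is easy to state in code. The following is a minimal Python sketch, not taken from the slides: the function names `incompatible` and `has_perfect_phylogeny` and the two small example matrices are hypothetical, and the input is assumed to be a list of equal-length 0/1 strings with no duplicate columns.

```python
from itertools import combinations

def incompatible(M, i, j):
    """Return True if columns i and j (0-based) contain all four pairs 00, 01, 10, 11."""
    gametes = {(row[i], row[j]) for row in M}
    return len(gametes) == 4

def has_perfect_phylogeny(M):
    """4-gamete test: M passes iff no pair of columns is incompatible."""
    m = len(M[0])
    return not any(incompatible(M, i, j) for i, j in combinations(range(m), 2))

# Hypothetical example matrices (rows are sequences, columns are sites).
tree_like = ["000", "100", "110", "001"]
conflicting = ["00", "01", "10", "11"]

print(has_perfect_phylogeny(tree_like))    # → True
print(has_perfect_phylogeny(conflicting))  # → False
```

The check examines every pair of columns and every row, so this direct version takes O(nm^2) time for n sequences and m sites.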

A richer model: Real sequence histories often involve recombination. With the added sequence (shown in the slide figure), the pair of sites 4, 5 fails the four-gamete test. The sites 4, 5 are “incompatible”.

Sequence Recombination: A recombination of P and S at recombination point 5. The first 4 sites come from P (Prefix) and the sites from 5 onward come from S (Suffix). This is single-crossover recombination, called “crossing over” in genetics.
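Single-crossover recombination as described here amounts to a prefix/suffix splice. Below is a minimal Python sketch; the function name `recombine` and the example sequences are hypothetical, and sites are numbered from 1 as on the slides.

```python
def recombine(P, S, r):
    """Single-crossover recombination at point r (1-based).

    Sites 1..r-1 are copied from the prefix parent P,
    sites r..m are copied from the suffix parent S.
    """
    if len(P) != len(S):
        raise ValueError("parents must have the same number of sites")
    if not 1 <= r <= len(P):
        raise ValueError("recombination point must be between 1 and m")
    return P[:r - 1] + S[r - 1:]

# Hypothetical parents: the first 4 sites come from P, site 5 comes from S.
print(recombine("01100", "00101", 5))  # → 01101
```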

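Network with Recombination: The previous tree with one recombination event now derives all the sequences. [Figure: the new sequence is created at a recombination node with recombination point 5 and parents labeled P and S.]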

Multiple Crossover Recombination: 4-crossovers; 2-crossovers = “gene conversion”.

Elements of a Phylogenetic Network (single crossover recombination): Directed acyclic graph. Integers from 1 to m written on the edges. Each integer written only once. These represent mutations. A choice of ancestral sequence at the root. Every non-root node is labeled by a sequence obtained from its parent(s) and any edge label on the edge into it. A node with two edges into it is a “recombination node”, with a recombination point r. One parent is P and one is S. The network derives the sequences that label the leaves.

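A Phylogenetic Network: [Figure: a phylogenetic network with recombination parents labeled P and S, deriving the leaf sequences a:00010, b:10010, c:, d:10100, e:, f:01101, g:. The sequences for c, e and g are missing from the transcript.]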

Minimizing Recombinations: Any set M of sequences can be generated by a phylogenetic network with enough recombinations, and one mutation per site. This is not interesting or useful. However, the number of (observable) recombinations is small in realistic sets of sequences. “Observable” depends on n and m relative to the number of recombinations. Problem: Given a set of sequences M, find a phylogenetic network generating M, minimizing the number of recombinations. NP-hard (Wang et al 2000, Semple et al 2004).

Decomposition can help: First we introduce the viewpoint needed.

Minimization is NP-hard: The problem of finding a phylogenetic network that creates a given set of sequences M, and minimizes the number of recombinations, is NP-hard (Wang et al 2000) (Semple 2004). A super-exponential-time method computes the exact minimum (Song and Hein), but works only for a small number of sequences. Wang et al. explored the problem of finding a phylogenetic network where the recombination cycles are required to be node disjoint, if possible. This problem was solved in Gusfield, Eddhu, Langley in O(nm + n^3) time (the root-unknown case was also solved later in the same time bound).

Blobs in Networks: In a phylogenetic network with a recombination node x, if we trace two paths backwards from x, then the paths will eventually meet. The cycle specified by those two paths is called a “recombination cycle”. In a phylogenetic network, a maximal set of (edge) intersecting cycles is called a blob.

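A Phylogenetic Network with one Blob: [Figure: the same network as before, with its recombination cycles forming a single blob, deriving the leaf sequences a:00010, b:10010, c:, d:10100, e:, f:01101, g:. The sequences for c, e and g are missing from the transcript.]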

Blobbed-trees: Contracting each blob to a single node results in a directed, rooted tree, otherwise one of the “blobs” was not maximal. So every phylogenetic network can be viewed as a directed tree of blobs - a blobbed-tree. The blobs are the non-tree-like parts of the network.

Ugly tangled network inside the blob. Every network is a tree of blobs. How do the tree parts and the blobs relate? How can we exploit this relationship?

Incompatible Sites: Recall, a pair of sites (columns) of M that fails the 4-gamete test is said to be incompatible. A site that is not in any such pair is compatible.

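Incompatibility Graph G(M): Each node of G(M) represents a site of M. Two nodes are connected iff the pair of sites are incompatible, i.e., fail the 4-gamete test. [Figure: the matrix M with rows a to g, and its incompatibility graph.] In the example, G(M) has two connected components.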
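The incompatibility graph and its connected components can be computed directly from the matrix. The Python sketch below is illustrative only: the function names are hypothetical, rows a, b, d and f are the sequences that survive in the transcript, and rows c, e and g are hypothetical fill-ins chosen for illustration because they are missing from the transcript.

```python
from itertools import combinations

# Rows a, b, d, f are from the slides; c, e, g are hypothetical fill-ins.
M = {"a": "00010", "b": "10010", "c": "00100", "d": "10100",
     "e": "01100", "f": "01101", "g": "00101"}

def incompatibility_graph(rows):
    """Adjacency sets over 1-based site numbers; edge iff the pair fails the 4-gamete test."""
    m = len(rows[0])
    adj = {s: set() for s in range(1, m + 1)}
    for i, j in combinations(range(m), 2):
        if len({(r[i], r[j]) for r in rows}) == 4:
            adj[i + 1].add(j + 1)
            adj[j + 1].add(i + 1)
    return adj

def connected_components(adj):
    """Connected components by depth-first search, each returned as a sorted list of sites."""
    seen, comps = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(sorted(comp))
    return comps

G = incompatibility_graph(list(M.values()))
print(connected_components(G))  # → [[1, 3, 4], [2, 5]]
```

With these assumed rows the graph has two connected components, both non-trivial (more than one site), which agrees with the statement on the slide.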

The connected components of G(M) are very informative: The number of non-trivial connected components is a lower bound on the number of recombinations needed in any network (Bafna, Bansal; Gusfield, Hickerson). When each blob is a single cycle (galled-tree case), all the incompatible sites in a blob must come from a single connected component C, and that blob must contain all the sites from C. Compatible sites need not be inside any blob. (Gusfield et al.)

Galled-Tree Structure: So when each blob contains only a single cycle, there is a one-one correspondence between the blobs and the non-trivial connected components of the incompatibility graph. This is the central fact used in polynomial-time solutions to the (NP-hard) recombination minimization problem, when a galled-tree for M exists. Motivating Question: To what extent does this clean one-one structure carry over to general phylogenetic networks? How do we exploit the general structure?

The Decomposition Theorem (Recomb 2005): For any set of sequences M, there is a blobbed-tree T(M) that derives M, where each blob contains all and only the sites in one non-trivial connected component of G(M). The compatible sites can always be put on edges outside of any blob. A blobbed-tree with this structure is called fully-decomposed.

General Structure: So, for any set of sequences M, there is a phylogenetic network where there is a one-one correspondence between the blobs and the non-trivial connected components of G(M). Moreover, the tree part of T(M) is unique. And it is easy to find the tree part.

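A fully-decomposed network for the sequences generated by the prior network. [Figure: the fully-decomposed network deriving sequences a to g (the sequence labels are missing from the transcript), with recombination parents labeled p and s, shown next to the incompatibility graph.]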

Proof Ideas: Let C be a connected component of G(M). Define M[C] as the sequences in M restricted to the sites in C.

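[Figure: the matrix M with rows a to g; the incompatibility graph with component C1 (sites 1, 3, 4) and component C2 (sites 2, 5); and the restricted matrices M[C1] and M[C2], labeled B1 and B2.]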

Faux Proof: Pick one site from each connected component C in G(M) to “represent” C. No pair of those sites are incompatible, so by the NASC for a perfect phylogeny, there will be a perfect phylogeny T for the sites. Expand each node to a network generating the sequences in M[C]. Incorrect, because the structure of T can be wrong. We need to use information about all the sites in each C.

Now for each connected component C in G(M), call each distinct sequence in M[C] a supercharacter, and let W be the indicator matrix for the supercharacters. So W indicates which rows of M contain which particular supercharacters. [Figure: the matrices M[C1] and M[C2] with rows a to g, and the resulting matrix W.]
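The construction of M[C] and W can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the names `restrict` and `supercharacter_matrix` are hypothetical, rows c, e and g are hypothetical fill-ins for sequences missing from the transcript, and the two components are taken from the labels that survive from the slide figure.

```python
# Rows a..g in order. Rows a, b, d, f are from the slides;
# rows c, e, g are hypothetical fill-ins for missing sequences.
ROWS = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]
# Connected components of G(M) as 1-based site numbers (assumed from the figure labels).
COMPONENTS = [[1, 3, 4], [2, 5]]

def restrict(row, comp):
    """One row of M[C]: the sequence restricted to the sites in component comp."""
    return "".join(row[s - 1] for s in comp)

def supercharacter_matrix(rows, comps):
    """Return (columns, W): one column per (component, distinct sequence of M[C])."""
    columns = [(tuple(comp), sc)
               for comp in comps
               for sc in sorted({restrict(r, comp) for r in rows})]
    # W[i][k] is 1 iff row i of M, restricted to the component, equals supercharacter k.
    W = [[1 if restrict(r, comp) == sc else 0 for comp, sc in columns]
         for r in rows]
    return columns, W

columns, W = supercharacter_matrix(ROWS, COMPONENTS)
for label, row in zip("abcdefg", W):
    print(label, row)
```

Every row of W contains exactly one 1 per component, because each sequence of M restricted to C equals exactly one supercharacter of C.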

Proof Ideas: Lemma: No pair of supercharacters are incompatible. So by the NASC for a Perfect Phylogeny, there is a unique perfect phylogeny T for W.
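The lemma can be checked on a small instance by applying the 4-gamete test to the columns of W. The sketch below is only a sanity check, not a proof: the matrix W is written out by hand for a hypothetical seven-row example with two components of four supercharacters each, and the function name `incompatible_pairs` is an assumption.

```python
from itertools import combinations

# Hypothetical indicator matrix W: 7 rows (a..g), 8 supercharacter columns.
# Columns 0-3 belong to one component, columns 4-7 to the other.
W = [
    [1, 0, 0, 0, 1, 0, 0, 0],  # a
    [0, 0, 1, 0, 1, 0, 0, 0],  # b
    [0, 1, 0, 0, 1, 0, 0, 0],  # c
    [0, 0, 0, 1, 1, 0, 0, 0],  # d
    [0, 1, 0, 0, 0, 0, 1, 0],  # e
    [0, 1, 0, 0, 0, 0, 0, 1],  # f
    [0, 1, 0, 0, 0, 1, 0, 0],  # g
]

def incompatible_pairs(W):
    """List all column pairs of W that contain all four pairs 00, 01, 10, 11."""
    m = len(W[0])
    return [(i, j) for i, j in combinations(range(m), 2)
            if len({(row[i], row[j]) for row in W}) == 4]

print(incompatible_pairs(W))  # → []
```

An empty list means that this W passes the 4-gamete test, so a perfect phylogeny for its columns exists.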

Proof Ideas: For each connected component C of G(M), all supercharacters that originate from C label edges in T that are incident with one single node v[C] in T. So, if we expand each node v[C] to be a network that generates the supercharacters from C (the sequences in M[C]), and connect each network correctly to the edges in T, the resulting network is a fully-decomposed blobbed-tree that generates M.

Algorithmically, T is easy to find and is the tree resulting from contracting each blob in the fully-decomposed blobbed-tree T(M) for M. T can be constructed from M in O(nm^2) time.

Broader Biological Applications: Our major interest is in recombination, but the proof of the decomposition theorem does not explicitly use recombination. So it holds for whatever biological phenomena caused the incompatibility of sites, for example back or recurrent mutation, gene-conversion, lateral gene transfer, etc.

What is the most tree-like network? Simple definition: The “treeness” of a network is the number of edges in the tree after contracting each blob to a single node. Simple fact: In any phylogenetic network N for M, all sites from a single non-trivial connected component must be together in a single blob of N. Hence, under this simple definition, a fully-decomposed blobbed tree is the most tree-like network for M.

The supercharacters from M play the role in phylogenetic networks that normal binary characters play in perfect phylogeny trees. So supercharacters are the fundamental characters of phylogenetic networks.

Algorithmically: Finding the tree part of the blobbed-tree is easy. Determining the sequences labeling the exterior nodes on any blob is easy. Determining a “good” structure inside a blob B is the problem of generating the sequences of the exterior nodes of B. It is easy to test whether the exterior sequences on B can be generated with only a single (possibly multiple-crossover) recombination. The original galled-tree problem is now just the problem of testing whether one single-crossover recombination is sufficient for each blob.

The main open question: The Decomposition Theorem says there is always a fully-decomposed blobbed-tree for any M, but is there always a fully-decomposed blobbed-tree that minimizes the number of recombinations over all possible phylogenetic networks for M?

We conjecture the answer is yes. If true, then we can decompose the problem of minimizing the total number of recombinations into separate problems on each connected component. We can also find lower bounds on the needed number of recombinations in each component separately, and add those bounds to get a valid overall lower bound for M. Adding per-component bounds in this way is already known to be correct for certain lower bounds (Bafna, Bansal 2004).

Progress on Proving the Conjecture: Definition: If N is a phylogenetic network for M, and a node v in N is labeled with a sequence in M, then v is said to be visible in N. Theorem: If every node in N is visible, then there is a fully-decomposed network for M where the number of recombinations is at most the number in N. Corollary: The conjecture is true for any M where the Haplotype or History lower bound (S. Myers) on the number of recombinations needed to generate M is tight.

Papers and Software wwwcsif.cs.ucdavis.edu/~gusfield/