by d. gusfield v. bansal v. bafna y. song presented by vikas taliwal

a decomposition theory for phylogenetic networks and incompatible characters
by d. gusfield v. bansal v. bafna y. song presented by vikas taliwal april 22, 2008

primary goal
prove that any given set of taxa represented by sequences of binary characters can be derived on a (fully decomposed) phylogenetic network, with recombination but without homoplasy, and give a polynomial-time algorithm to construct such a network
[figure: example network deriving the sequences a: 00010 b: 10010 c: 00100 d: 10100 e: 01100 f: 01101 g: 00101]

in more detail …
:- analyze the structure of incompatibilities among characters of the input sequences
:- construct a tree that encodes the maximum amount of compatibility among characters and ‘localizes’ their incompatibilities in its nodes
:- inflate nodes of the tree into a (fully decomposed) phylogenetic network that derives the input sequences through recombination, but without homoplasy
:- find some conditions under which the network is optimal (i.e., contains the minimum number of recombination nodes)
:- show that the network can be sub-optimal by any given amount; give some sub-optimal examples
:- through simulations, get an idea of how often the network achieves optimality in practice
[figure: running example with sequences a: 00010 b: 10010 c: 00100 d: 10100 e: 01100 f: 01101 g: 00101]

phylogenetic networks under consideration

a phylogenetic network
:- a rooted acyclic directed graph
:- leaves labeled by the given sequences
:- internal nodes have either one incoming edge (ordinary node/edge) or two (recombination node/edges)
:- each ordinary edge is labeled by a character; each character labels precisely one such edge, indicating that it changes along the edge
:- each recombination node is labeled by a sequence of one (single cross-over) or more (multiple cross-over) characters, with one incoming edge labeled p, indicating the sequence of switches between the two parent sequences (starting at the sequence from edge p) needed to derive the child sequence
:- each character evolves without homoplasy along a maximal sub-tree in the network
[figure: example network on sequences a: 00010 b: 10010 c: 00100 d: 10100 e: 01100 f: 01101 g: 00101, with ordinary edges labeled by characters 1-5 and recombination edges marked p]
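The switch labeling at a recombination node can be illustrated with a small sketch. The function name `recombine` and the convention that a cross-over point k means "switch parents just before character k (1-based)" are assumptions made for this illustration, not notation taken from the slides. The parents 10010 and 00100 are sequences b and c of the running example; pairing them this way is an illustrative choice.

```python
def recombine(p_seq, q_seq, crossovers):
    """Derive a child sequence by copying from p_seq and switching to the
    other parent just before each 1-based position listed in `crossovers`.
    (Hypothetical helper; the position convention is an assumption.)"""
    if len(p_seq) != len(q_seq):
        raise ValueError("parent sequences must have equal length")
    parents = (p_seq, q_seq)
    switch_points = set(crossovers)
    current = 0                      # start at the sequence from edge p
    child = []
    for pos in range(1, len(p_seq) + 1):
        if pos in switch_points:
            current = 1 - current    # cross over to the other parent
        child.append(parents[current][pos - 1])
    return "".join(child)


single = recombine("10010", "00100", [3])     # one cross-over
double = recombine("11111", "00000", [2, 4])  # two cross-overs
```

With these inputs `single` evaluates to 10100 and `double` to 10011.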

blobs and blobbed trees
:- each recombination node is in a recombination cycle
:- if two recombination cycles share only a node, adding an edge makes them disjoint; henceforth, any two cycles are either disjoint or share at least an edge
:- a maximal set of pair-wise edge-sharing recombination cycles is a blob; any two blobs are disjoint
:- contracting each blob to a single node yields a blobbed tree
:- a node in a blobbed tree is either a blob or a node external to all recombination cycles in the originating network
:- each internal edge in a blobbed tree is an edge in the originating network external to all recombination cycles; the corresponding edge in the network is called a tree edge
[figure: the example network on leaves a-g with its recombination cycles and blob marked, next to the resulting blobbed tree]

another example
[figure: a second phylogenetic network on leaves a-g with its blob, recombination nodes and a tree edge marked, next to the corresponding blobbed tree]

analyzing sequences

character incompatibility and perfect phylogeny
:- in a set of binary sequences M, a pair of characters i and j is incompatible if the sequences in M contain all four combinations 00, 01, 10 and 11 for the subsequence ij
:- if an ancestral (root) sequence S is specified in addition to the set of leaf sequences M, a pair of characters i and j is incompatible relative to S if it is incompatible in M ∪ {S}
:- (perfect phylogeny theorem) there is a phylogenetic network without recombination nodes (i.e., a phylogenetic tree without homoplasy) that derives M (resp. from root sequence S) iff there are no incompatible pairs of characters in M (resp. relative to S)
:- (moral) character incompatibility can only be explained by (1) homoplasy, (2) recombination, or (3) other reticulation phenomena
:- part of the intent of the paper is to show that recombination by itself is sufficient to explain incompatibility
example: M = a: 00010 b: 10010 c: 00100 d: 10100 e: 01100 f: 01101 g: 00101; incompatible pairs: (1, 3), (1, 4), (2, 5)
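The incompatibility test can be sketched in a few lines. This is a minimal illustration on the example sequences of this slide; the function name `incompatible_pairs` and the choice of 00000 as root sequence are assumptions for the sketch.

```python
from itertools import combinations

# Example sequences from the slide.
M = {"a": "00010", "b": "10010", "c": "00100", "d": "10100",
     "e": "01100", "f": "01101", "g": "00101"}


def incompatible_pairs(sequences):
    """Return the 1-based character pairs (i, j) for which the sequences
    show all four combinations 00, 01, 10 and 11."""
    rows = list(sequences)
    m = len(rows[0])
    pairs = []
    for i, j in combinations(range(m), 2):
        combos = {(row[i], row[j]) for row in rows}
        if len(combos) == 4:
            pairs.append((i + 1, j + 1))
    return pairs


pairs = incompatible_pairs(M.values())
# Incompatibility relative to a root S is the same test on M plus S
# (S = 00000 is an assumption made for this example).
pairs_rel_root = incompatible_pairs(list(M.values()) + ["00000"])
```

On this input both calls return the three pairs listed on the slide, (1, 3), (1, 4) and (2, 5).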

isolating incompatibilities: incompatibility graph
:- the incompatibility graph of M, G(M), contains one node for each character, and an edge between two nodes if the corresponding pair of characters is incompatible
:- connected components of the incompatibility graph isolate or localize incompatibilities
:- perfect phylogeny if there are no edges in the incompatibility graph
:- if an ancestral sequence S is specified, the conflict graph of M relative to S, G_S(M), is the incompatibility graph of M ∪ {S}
example: M = a: 00010 b: 10010 c: 00100 d: 10100 e: 01100 f: 01101 g: 00101; incompatible pairs: (1, 3), (1, 4), (2, 5); connected components: {1,3,4}, {2,5}
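Given the incompatible pairs, the components of the incompatibility graph follow from a plain graph traversal. The sketch below is one way to do it; the function name `components` is an assumption, and the edge list is the one given for the example.

```python
def components(num_chars, edges):
    """Connected components of a graph whose nodes are characters
    1..num_chars and whose edges are incompatible pairs."""
    adj = {c: set() for c in range(1, num_chars + 1)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                      # depth-first traversal
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        comps.append(sorted(comp))
    return comps


comps = components(5, [(1, 3), (1, 4), (2, 5)])
# A component with a single character is trivial (the character is compatible
# with all others); here both components are non-trivial.
nontrivial = [c for c in comps if len(c) > 1]
```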

the fundamental decomposition theorem
Let G(M) be the incompatibility graph of a set of sequences M. Then, there is a phylogenetic network N that derives M such that
(1) every blob in N contains as edge labels all and only the characters in a single non-trivial component of G(M),
(2) every compatible character labels one and only one tree edge of N.
The result holds regardless of the number of cross-overs at recombination nodes.
:- it’s possible for a phylogenetic network to have blobs that contain characters from more than one component of G(M), but it is not possible for characters in one component to separate into two or more blobs; therefore, a network with properties (1) and (2) has the largest possible number of blobs and is called fully decomposed
:- there is a 1-1 correspondence:
   conflict graph: non-trivial components, trivial components
   f.d. phylogenetic network: blobs, tree and terminal edges
   blobbed tree: components, edges
:- an analogous result relative to a specified root sequence S also holds

still need to know…
:- how to construct the promised network, i.e., how to assemble it from blobs and assign (1) input sequences to leaves, (2) characters to edges and recombination nodes within blobs
:- whether the network is optimal, i.e., contains the minimum possible number of recombination nodes
the proof of the theorem is, in fact, constructive and gives a polynomial-time algorithm for finding a network
let’s look at the proof!

constructing networks

dominant sets, states and sequences
take two distinct components C, C' of G(M):
:- if characters i ∈ C, i' ∈ C' bipartition the sequences as (x|y), (x'|y') resp., we can always arrange x ⊇ x', y' ⊇ y; x is the dominant class, and y the dominated class, of i with respect to i'
:- (lemma 1) the same class of i ∈ C is dominant w.r.t. all characters i' of C'; thus, each character in C ∪ C' has well-defined dominant and dominated classes w.r.t. the pair (C, C')
:- the state of i ∈ C in its dominant class is its dominant state w.r.t. (C, C')
:- (theorem 2) in every sequence contained in all dominated classes of C' w.r.t. C, every character in C is in its dominant state w.r.t. (C, C'); the corresponding sub-sequence is the dominant sequence of C w.r.t. (C, C')
:- (sequence segregation) if M(C) is the set of sequences in M restricted to characters in C, each distinct sequence in M(C) is either dominant w.r.t. some C' or dominated w.r.t. all C'
example, with C = {1,3,4} and C' = {2,5}: dominant class of 1: {a,c,e,f,g}; dominated class of 1: {b,d}; dominant state of 1: 0; dominant sequence of C: 010; dominant sequence of C': 00
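The definitions of dominant class, state and sequence can be checked on the running example with a short sketch. The helper names (`classes`, `dominant_state`, `dominant_sequence`) are assumptions for illustration, and the code reads "dominant class of i w.r.t. i'" as the class of i that contains one of the two classes of i'.

```python
M = {"a": "00010", "b": "10010", "c": "00100", "d": "10100",
     "e": "01100", "f": "01101", "g": "00101"}
NAMES = sorted(M)


def classes(i):
    """Bipartition of the taxa induced by 1-based character i,
    as a dict {state: set of taxa in that state}."""
    return {s: frozenset(t for t in NAMES if M[t][i - 1] == s) for s in "01"}


def dominant_state(i, i_prime):
    """State of i whose class contains one of the classes of i_prime."""
    own, other = classes(i), classes(i_prime)
    states = [s for s in "01" if any(other[t] <= own[s] for t in "01")]
    if len(states) != 1:
        raise ValueError("dominant class is not well defined for this pair")
    return states[0]


def dominant_sequence(comp, comp_prime):
    """Dominant states of the characters of comp w.r.t. (comp, comp_prime)."""
    seq = []
    for i in comp:
        states = {dominant_state(i, ip) for ip in comp_prime}
        if len(states) != 1:          # lemma 1 says this should not happen
            raise ValueError("dominant state differs across comp_prime")
        seq.append(states.pop())
    return "".join(seq)


C, C_PRIME = (1, 3, 4), (2, 5)
dom_class_of_1 = sorted(classes(1)[dominant_state(1, 2)])
dom_seq_C = dominant_sequence(C, C_PRIME)
dom_seq_C_prime = dominant_sequence(C_PRIME, C)
```

On the example this reproduces the values on the slide: dominant class {a,c,e,f,g} for character 1, dominant sequence 010 for C and 00 for C'.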

super-characters
:- a sequence in M(C) is a super-character of M associated to C
:- matrix B indicates, for each super-character of M associated to a component, whether a sequence in M contains it (1) or not (0)
:- B is a matrix of sequences derived from M; its characters are super-characters of M
:- (lemma 3) no pair of characters of B is incompatible: if p and q are two characters of B associated to the same component, no sequence contains both, so they are compatible by construction; if they are associated to different components, each of p and q is either dominant or dominated, and in every such case one of the four state combinations is not present as subsequence (p,q) in B (e.g., (0,0) cannot occur when both are dominant, since every sequence carries the dominant sequence of at least one of the two components)
example: super-characters 001, 101, 010, 110 (component {1,3,4}) and 00, 10, 11, 01 (component {2,5}) for sequences a-g
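A small sketch of how B could be built for the running example, together with a check of lemma 3 on it. The function names and the column ordering (first appearance in M) are assumptions for illustration; the components {1,3,4} and {2,5} are those of the example.

```python
from itertools import combinations

M = {"a": "00010", "b": "10010", "c": "00100", "d": "10100",
     "e": "01100", "f": "01101", "g": "00101"}
COMPONENTS = [(1, 3, 4), (2, 5)]


def restrict(seq, comp):
    """Sub-sequence of seq on the 1-based characters in comp."""
    return "".join(seq[i - 1] for i in comp)


def super_character_matrix(seqs, comps):
    """Return (columns, B): columns lists (component, super-character) pairs;
    B maps each taxon to its 0/1 row over those columns."""
    columns = []
    for comp in comps:
        seen = []
        for seq in seqs.values():
            sc = restrict(seq, comp)
            if sc not in seen:
                seen.append(sc)
        columns.extend((comp, sc) for sc in seen)
    B = {name: "".join("1" if restrict(seq, comp) == sc else "0"
                       for comp, sc in columns)
         for name, seq in seqs.items()}
    return columns, B


columns, B = super_character_matrix(M, COMPONENTS)

# Lemma 3 on this example: no pair of columns of B shows all of 00, 01, 10, 11.
rows = list(B.values())
incompatible = [(p, q) for p, q in combinations(range(len(columns)), 2)
                if len({(r[p], r[q]) for r in rows}) == 4]
```

Here `incompatible` comes out empty, so the perfect phylogeny theorem applies to B.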

(blobbed*) tree T
:- by the perfect phylogeny theorem there is a phylogenetic tree, T, deriving the sequences of B without homoplasy
:- each edge of T is labeled by one or more super-characters of M
:- for each component C in G(M) there is a unique node v_C in T such that all edges labeled with super-characters associated to C are incident with v_C; there may be other edges incident with v_C as well
:- any sub-tree of T containing v_C that is obtained by removing an edge labeled with a super-character associated to C has as leaves the sequences of B in which that super-character is in state 0; v_C is said to be on the 0-side of the removed edge
[figure: tree T on leaves a-g with nodes v_{1,3,4} and v_{2,5}, edges labeled by the super-characters 001, 101, 010, 110 and 00, 10, 11, 01]
*strictly true only when the network is efficient - see later

fully decomposed phylogenetic network
inflate T into a fully decomposed phylogenetic network N as follows:
:- select any node of T as root, v_r, and direct all edges of T away from it; let sequence S be the label on v_r - S will define the ancestral sequence of N
:- the incoming edge on node v_C corresponding to a non-trivial component C is labeled by super-character S(C), which is S restricted to the characters in C
:- all super-characters associated to C can be derived from S(C) using at most one mutation per character and recombinations; so, v_C can be inflated into a blob b_v containing one node labeled by each super-character of M(C) and possibly other nodes
:- connect each node in b_v labeled by a super-character of M(C) to the edge incident to v_C and labeled by that super-character
[figure: network N for the example, with root sequence 00000, edges labeled by characters 1-5 and leaves a-g]

notes on the construction
:- (algorithmic complexity) if n is the number of sequences and m the number of characters, the time needed to build T is O(nm^2) and the time needed to build N is O(nm^2 + m^3)
:- (component-wise optimal decomposition - theorem 3) let R_{S(C)}(M(C)) (resp. R^1_{S(C)}(M(C))) be the minimum number of recombination nodes needed to generate the sequences in M(C) in a phylogenetic network with ancestral sequence S(C) when multiple (resp. single) cross-overs are allowed; for any ancestral sequence S there is a phylogenetic network deriving M, containing exactly Σ_{C ∈ G_S(M)} R_{S(C)}(M(C)) (resp. Σ_{C ∈ G_S(M)} R^1_{S(C)}(M(C))) recombination nodes when multiple (resp. single) cross-overs are allowed
:- (uniqueness of blobbed tree - theorem 4) tree T is unique and the same as the blobbed tree for efficient networks; these are networks in which each external node (a node in a blob with an edge to a node outside the blob) has only one edge to a node outside the blob, and no two external nodes have the same label
:- (alternative uses of blobbed tree) tree T depends only on the component structure of the incompatibility graph, and not on recombination; it can, therefore, be used to construct other derivations of M, e.g., a maximum parsimony tree

optimality of fully decomposed networks

summary of optimality results
:- a fully decomposed phylogenetic network is optimal if it has the minimum number of recombination nodes among all phylogenetic networks deriving the same sequences
:- various sufficient conditions for a fully decomposed phylogenetic network to be optimal are derived
:- not all fully decomposed networks are optimal; two examples are given, and it is shown that an arbitrarily large departure from optimality can be realized in some network
:- a simulation study suggests that in practice fully decomposed networks are likely to be optimal or nearly so

some sufficient conditions for optimality
:- (spatial disjoint-ness) given an ordering of the characters of M and an ancestral sequence S, M is spatially disjoint w.r.t. S if the characters in every connected component of G_S(M) form a contiguous interval in the ordered set of sites; in this case an optimal fully decomposed phylogenetic network deriving M from S using single cross-overs exists
:- (component respect) let N be a phylogenetic network deriving M from S using R recombination nodes; if the addition of node labels of N other than those in M ∪ {S} does not create any incompatibilities between characters in different components of G_S(M), N is said to respect the components of G_S(M); in this case there is a fully decomposed phylogenetic network deriving M from S using no more than R recombination nodes and the same type of cross-overs as N
:- (tight haplotype bound) if H(M ∪ {S}) := #rows of M ∪ {S} - #distinct columns of M ∪ {S} - 1 is equal to the optimum number of recombination nodes using single (resp. multiple) cross-overs, then an optimal fully decomposed phylogenetic network deriving M from S using single (resp. multiple) cross-overs exists
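The quantity H in the haplotype-bound condition is straightforward to compute. The sketch below is an illustration only: the function name `haplotype_bound` is hypothetical, it counts distinct rows (an assumption; the slide says "#rows"), and the input reuses the earlier example sequences a-g with an assumed root S = 00000.

```python
def haplotype_bound(sequences):
    """#distinct rows - #distinct columns - 1 for a set of binary sequences.
    Counting *distinct* rows is an assumption of this sketch."""
    rows = sorted(set(sequences))
    m = len(rows[0])
    columns = {tuple(r[j] for r in rows) for j in range(m)}
    return len(rows) - len(columns) - 1


M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]
S = "00000"  # assumed ancestral sequence
h = haplotype_bound(M + [S])
```

For this input there are 8 distinct rows and 5 distinct columns, so `h` is 2. The condition on the slide then asks whether this value equals the optimum number of recombination nodes, which the sketch does not compute.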

deviation from optimality
:- (unbounded deviation - theorem 11) let R^1(M) denote the minimum number of recombination nodes in any phylogenetic network deriving M using single cross-overs, and F^1(M) the corresponding minimum in any fully decomposed one; for any positive integer d, there exists a set of sequences M_d such that F^1(M_d) - R^1(M_d) ≥ d
:- the proof uses the universal example M_d = [M M … M] (d copies), constructed from the set of sequences M used in the following slide: M = 001000 011010 010010 000100 100101 100001

sub-optimal example with single cross-overs
sequences: 001000 011010 010010 000100 100101 100001
[figure: optimal phylogenetic network for these sequences; incompatibility graph on characters 1-6 with two components, C1 and C2]
# recombinations in a fully decomposed network > 3

sub-optimal example with multiple cross-overs
[figure: sequences a to m; optimal phylogenetic network with recombination nodes x, y and z; incompatibility graph on characters 1-10 with two components, C1 and C2]
# recombinations in a fully decomposed network > 3

simulation study of optimality
# data sets for each condition = 10,000
columns: # taxa | # characters | recombination rate | # data sets with # blobs > 1 | # sub-optimal cases
15 20 10 389 30 465 2 527 546 3 372 1 388 482 396

in conclusion …

summary of paper’s results
:- proves that a fully decomposed phylogenetic network deriving arbitrary binary sequence data always exists
:- gives a polynomial-time algorithm for constructing such networks
:- extracts the underlying maximal tree structure, which can be used to construct other ‘explanations’ of the data (e.g., maximum parsimony, maximum compatibility)
:- finds some sufficient conditions (usually obtained in practice) under which a fully decomposed network is optimal
:- proves that deviation from optimality can be arbitrarily large and gives some examples of sub-optimal networks
:- shows via simulations that in practice sub-optimality may be rare, and deviations from optimality small

future directions
:- drop the binary sequence assumption
:- calculate how many fully decomposed networks there are and how exactly they differ
:- find a necessary and sufficient condition for optimality
:- find how to modify a sub-optimal fully decomposed network to get an optimal network
:- find algorithms to convert blobbed trees into maximum parsimony phylogenetic trees
