Perfect Phylogeny. Tutorial #10. © Ilan Gronau. Original slides by Shlomo Moran.
2 Perfect Phylogeny. The underlying model: a character vector is given for every species in S. Each character represents some observable trait and takes values from a finite set. Basic underlying assumption: characters are homoplasy-free.
3 Homoplasy-Free Characters. Homoplasy-free means no reversals and no convergence. Homoplasy-free characters induce a convex coloring of the phylogenetic tree. The Perfect Phylogeny Problem: given character vectors for S, find (if they exist): (a) a phylogenetic tree T over S (S is the leaf set of T); (b) convex character assignments to all vertices of T. This problem is NP-hard in general.
4 Directed Binary Perfect Phylogeny. Directed binary characters: 1 = property exists, 0 = property does not exist. Initially (at the root) no property exists. Input: a binary coloring (C_1, …, C_m) of a set S, given as an n x m binary matrix M. Problem: find a phylogenetic tree T over S (if one exists) such that: 1. For j = 1, …, m, the partial coloring induced by C_j is convex in T. 2. The root has state 0 in all characters. We will present a polynomial-time solution.
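5 Example. Input: an n x m binary matrix (n species as rows, m characters as columns):
     C_1  C_2  C_3  C_4  C_5
A     1    1    0    0    0
B     0    0    1    0    0
C     1    1    0    0    1
D     0    0    1    1    0
E     0    1    0    0    0
Possible output: a tree with an all-zero root (00000), internal vertices (01000), (11000) and (00100), and leaves A (11000), B (00100), C (11001), D (00110), E (01000). C_2 and C_3 originate on the two edges leaving the root.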
6 An Important Observation. A tree is a directed perfect phylogeny for a given 0/1 matrix iff we can map each character to an edge/vertex on which this character was “turned on”. Example: [Figure: the example matrix and its tree over leaves A to E, with the edge on which each of C_1, …, C_5 originates marked, e.g. the origin of C_2.]
7 Laminar Matrices. Definitions: O_j is the set of objects that have character C_j (O_j = {i : M_ij = 1}). A collection of sets {S_1, …, S_k} is laminar if for all i, j, either S_i and S_j are disjoint, or one includes the other. Theorem: a binary matrix M has a perfect phylogenetic tree iff the collection {O_1, …, O_m} is laminar. [Figure: two matrices over species A to E and characters C_1 to C_5, one laminar and one not laminar.]
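The definition of laminarity can be checked directly, pair by pair. The sketch below is only an illustration: the function names `character_sets` and `is_laminar` are hypothetical, the matrix values are assumed to be those of the tutorial's running example, and the non-laminar variant is made up for contrast. This brute-force check takes more than O(mn) time; it is meant to mirror the definition, not the efficient algorithm.

```python
from itertools import combinations

def character_sets(matrix):
    """Return O_j = {i : M[i][j] == 1} for every column j."""
    n_cols = len(matrix[0]) if matrix else 0
    return [{i for i, row in enumerate(matrix) if row[j] == 1}
            for j in range(n_cols)]

def is_laminar(sets):
    """True iff every pair of sets is either disjoint or nested."""
    for a, b in combinations(sets, 2):
        if a & b and not (a <= b or b <= a):
            return False
    return True

# Rows A..E, columns C1..C5 (values assumed from the running example).
M = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 0, 1],  # C
    [0, 0, 1, 1, 0],  # D
    [0, 1, 0, 0, 0],  # E
]
example_is_laminar = is_laminar(character_sets(M))

# Hypothetical variant: giving B the character C1 makes O_1 and O_3
# overlap without nesting, so the collection is no longer laminar.
M_bad = [row[:] for row in M]
M_bad[1][0] = 1
variant_is_laminar = is_laminar(character_sets(M_bad))
```

Here `example_is_laminar` comes out true and `variant_is_laminar` false, since in the variant B belongs to both O_1 and O_3 while neither set contains the other.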
8 Proof of Theorem. Assume M has a perfect phylogeny. Consider the edges labeled C_i and C_j. If there is a root-to-leaf path containing both edges (e.g. C_1 and C_2 in the example tree), then O_i includes O_j or vice versa. Otherwise, O_i and O_j are disjoint (e.g. C_1 and C_3 in the example tree). [Figure: the example tree over leaves A to E with the edges C_1, …, C_5 marked.]
9 Proof of Theorem (cont). Assume that the collection {O_1, …, O_k} is laminar. We prove by induction on the number of characters k that M has a perfect phylogenetic tree. Basis: one character. There are at most two (distinct) objects, one with and one without this character. [Figure: a one-column matrix with A = 1 and B = 0 under C_1, and the corresponding tree: a root with leaves A and B, with C_1 originating on the edge to A.]
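10 Proof of Theorem (cont). Assume that the collection {O_1, …, O_k} is laminar. Induction step: assume correctness for k-1 characters. Consider a matrix with k characters (non-zero columns), and assume WLOG that O_1 is not contained in O_j for any j > 1. Let S_1 be the set of objects i for which M_i1 = 1, and S_2 the remaining objects. Claim: each character belongs to objects in S_1 or in S_2, but not to both. Why is this? By laminarity, each O_j with j > 1 is either disjoint from O_1 = S_1, so it lies inside S_2, or nested with O_1; since O_1 is not contained in O_j, nesting means that O_j is contained in S_1. By induction there are trees T_1 and T_2 for S_1 and S_2; joining them under a common root, with C_1 originating on the edge into T_1, gives a tree for M. Example, with S_1 = {A, C, E} and S_2 = {B, D}:
     C_1  C_2  C_3  C_4  C_5
A     1    1    0    0    0
B     0    0    1    0    0
C     1    1    0    0    1
D     0    0    1    1    0
E     1    0    0    0    0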
11 Efficient Implementation. 1. Sort the columns (characters) according to decreasing binary value, reading each column top to bottom as a binary number. Claim: if the binary value of column i is larger than that of column j, then O_i is not a proper subset of O_j. Proof: in the topmost row where the two columns differ, column i has a 1 and column j has a 0, so the 1's of O_i are not all covered by the 1's of O_j. [Figure: the example matrix before and after sorting; the sorted column order is C_2, C_1, C_3, C_5, C_4.]
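As a rough illustration of step 1, the sketch below reads each column top to bottom as a binary number and sorts the column indices by decreasing value. The helper names `column_value` and `sort_columns` are hypothetical, the matrix values are assumed to be those of the running example, and the built-in comparison sort is used for brevity rather than the radix sort needed for the O(mn) bound.

```python
def column_value(matrix, j):
    """Read column j top to bottom as a binary number (first row = most significant bit)."""
    value = 0
    for row in matrix:
        value = (value << 1) | row[j]
    return value

def sort_columns(matrix):
    """Return column indices ordered by decreasing binary value (ties keep their order)."""
    n_cols = len(matrix[0])
    return sorted(range(n_cols),
                  key=lambda j: column_value(matrix, j),
                  reverse=True)

# Rows A..E, columns C1..C5 (values assumed from the running example).
M = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 0, 1],  # C
    [0, 0, 1, 1, 0],  # D
    [0, 1, 0, 0, 0],  # E
]
order = sort_columns(M)                            # indices into C1..C5
names = ["C%d" % (j + 1) for j in order]           # human-readable order
sorted_M = [[row[j] for j in order] for row in M]  # matrix with permuted columns
```

For this matrix the column values are 20, 21, 10, 2 and 4, so `names` is C2, C1, C3, C5, C4, and the row of E in `sorted_M` becomes 10000.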
12 Efficient Implementation (cont). 2. Make a backward linked list of the 1's in each row: each 1 points to the nearest 1 to its left in the same row. Claim: if the columns are sorted, then the set of columns is laminar iff for each column i, all the links leaving column i point at the same column. (Why is this?) If the matrix is laminar, then these pointers define the inclusion hierarchy. [Figure: the sorted example matrix, columns C_2, C_1, C_3, C_5, C_4, with the backward links drawn in each row.]
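The link test in step 2 might be sketched as follows. This is an illustration under stated assumptions: the columns are taken to be already sorted by decreasing binary value and non-zero, the leftmost 1 of a row is assumed to link to a root marker (`ROOT = -1`, a hypothetical convention), and `backward_links` is a made-up name. The function returns one parent per column, or `None` as soon as two links leaving the same column disagree.

```python
ROOT = -1  # hypothetical marker: "no 1 to the left in this row"

def backward_links(sorted_matrix):
    """Return parent[j] for every column j, or None if the matrix is not laminar.

    Assumes the columns are already sorted by decreasing binary value
    and that every column contains at least one 1.
    """
    n_cols = len(sorted_matrix[0])
    parent = [None] * n_cols
    for row in sorted_matrix:
        prev = ROOT                      # nearest 1 seen so far in this row
        for j, bit in enumerate(row):
            if bit == 1:
                if parent[j] is None:
                    parent[j] = prev     # first link leaving column j
                elif parent[j] != prev:
                    return None          # links from column j disagree
                prev = j
    return parent

# Running example with columns in sorted order C2, C1, C3, C5, C4 (assumed values).
sorted_M = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]
links = backward_links(sorted_M)

# Hypothetical sorted but non-laminar matrix: the two columns overlap without nesting.
bad_links = backward_links([[1, 1], [1, 0], [0, 1]])
```

For the example, `links` is `[-1, 0, -1, 1, 2]`: C2 and C3 hang from the root, C1 lies inside C2, C5 inside C1, and C4 inside C3. The second call returns `None`.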
13 Efficient Implementation (cont). 3. If the matrix is laminar, compute the inclusion hierarchy. 4. Reconstruct the topology of the phylogenetic tree and the ancestral character states. [Figure: the sorted example matrix, the inclusion hierarchy of its columns (C_5 inside C_1 inside C_2; C_4 inside C_3), and the reconstructed tree over leaves A to E with the all-zero root (00000).]
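Steps 3 and 4 could look roughly like the sketch below, which treats each character as the vertex at which it is turned on. Everything here is an assumption made for illustration: the name `build_tree`, the `ROOT = -1` marker, the hard-coded parent pointers (taken as the output of step 2 on the running example), and the choice to hang each species from the vertex of its rightmost 1.

```python
ROOT = -1  # hypothetical marker for the all-zero root

def build_tree(sorted_matrix, parent, species):
    """Return (attach, states) for a laminar matrix with sorted columns.

    attach[s] is the vertex (column index or ROOT) that leaf s hangs from;
    states[v] is the 0/1 state vector of vertex v over the sorted columns.
    """
    n_cols = len(parent)
    states = {ROOT: [0] * n_cols}
    # A parent is always a column further to the left, so its state
    # has been computed by the time the child column is reached.
    for j in range(n_cols):
        state = list(states[parent[j]])
        state[j] = 1                     # character j is turned on here
        states[j] = state
    attach = {}
    for name, row in zip(species, sorted_matrix):
        ones = [j for j, bit in enumerate(row) if bit == 1]
        attach[name] = ones[-1] if ones else ROOT
    return attach, states

# Running example, columns in sorted order C2, C1, C3, C5, C4 (assumed values).
species = ["A", "B", "C", "D", "E"]
sorted_M = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]
parent = [ROOT, 0, ROOT, 1, 2]           # inclusion hierarchy from step 2
attach, states = build_tree(sorted_M, parent, species)

# Sanity check: every leaf's row equals the state of the vertex it hangs from.
consistent = all(states[attach[s]] == row for s, row in zip(species, sorted_M))
```

In this sketch E hangs from the C2 vertex, A from the C1 vertex, C from C5, B from C3 and D from C4. The state vectors are expressed in the sorted column order, so they need to be permuted back to read them as C1 to C5.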
14 Efficient Implementation - Summary. 1. Sort the columns (characters) according to decreasing binary value. 2. Make a backward linked list of the 1's in each row. 3. If the matrix is laminar, compute the inclusion hierarchy. 4. Reconstruct the topology of the phylogenetic tree and the ancestral character states. Complexity: O(mn), using radix (bucket) sort in stage 1.
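One way to get the O(mn) bound for stage 1 is a least-significant-digit radix sort over the rows, sketched below. This is an illustration and not necessarily the variant the slides have in mind: the name `radix_sort_columns` is hypothetical and the matrix values are assumed to be those of the running example. Each of the n passes is a stable two-bucket split of the m columns, so the total work is O(mn).

```python
def radix_sort_columns(matrix):
    """Order column indices by decreasing binary value in O(n * m) time.

    Rows are processed from the last (least significant bit) to the first
    (most significant bit); each pass stably puts 1's before 0's.
    """
    order = list(range(len(matrix[0])))
    for row in reversed(matrix):
        ones = [j for j in order if row[j] == 1]
        zeros = [j for j in order if row[j] == 0]
        order = ones + zeros             # stable split keeps earlier passes' order
    return order

# Rows A..E, columns C1..C5 (values assumed from the running example).
M = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 0, 1],  # C
    [0, 0, 1, 1, 0],  # D
    [0, 1, 0, 0, 0],  # E
]
radix_order = radix_sort_columns(M)
```

For the example this yields the index order `[1, 0, 2, 4, 3]`, i.e. C2, C1, C3, C5, C4, the same order a comparison sort on the column values would give.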