Perfect Phylogeny
Tutorial #11
© Ilan Gronau. Original slides by Shlomo Moran.
2 Phylogenetic Reconstruction
The input: a species-characters matrix (n species, k characters).
- Each character represents some observable trait.
- Each character takes values from a finite set.
The output: a tree with n leaves corresponding to the input species.
3 Perfect Phylogeny
Find a homoplasy-free tree explaining the input vector set (if such a tree exists).
4 Homoplasy-Free Characters
Homoplasy-free means: no reversals and no convergence.
Homoplasy-free characters induce a convex coloring of the phylogenetic tree (the vertices sharing a character state form a connected part of the tree).
The Perfect Phylogeny Problem: Given character vectors for S, find (if they exist):
- a phylogenetic tree T over S (S is the leaf set of T);
- convex character assignments to all vertices of T.
! This problem is NP-hard in general !
5 Directed Binary Perfect Phylogeny
Directed binary characters: 1 = property exists, 0 = property doesn’t exist.
Initially (at the root) all properties do not exist.
Input: k binary colorings C_1 ... C_k of the species set S.
Output (or a notification that such a tree doesn’t exist):
1. A rooted phylogenetic tree T over S.
2. k binary colorings C’_1 ... C’_k of the vertices of T which:
   a. are extensions of C_1 ... C_k;
   b. induce a ‘0’ coloring at the root of T.
We will present a polynomial-time solution.
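6 Example
Input: a matrix of n species (rows A, B, C, D, E) by k characters (columns C_1 ... C_5).
Possible output: a tree with a zero root.
[Figure: input matrix and a possible output tree. Leaves shown: A, E, D, C, B. Leaf vectors shown: (11000), (00100), (01000), (00110), (11001). Other vertex labels shown: (00000) at the zero root, (11000), (01000), (00100). Character labels shown: C_2, C_3.]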
7 An Important Observation
A tree is a directed perfect phylogeny for a given 0/1 matrix iff we can map each character to a vertex at which this character was “turned on”.
[Figure: example matrix (rows A-E, columns C_1 ... C_5) and a tree with leaves A, E, D, C, B whose vertices are labeled C_1 ... C_5.]
8 Laminar Matrices
Definitions:
- O_j is the set of species that have character C_j (O_j = {i : M_ij = 1}).
- A collection of sets {S_1, …, S_k} is laminar if for all i, j, either S_i and S_j are disjoint, or one includes the other.
Theorem: A binary matrix M has a perfect phylogenetic tree iff the collection {O_1, …, O_k} is laminar.
[Figure: two example matrices (rows A-E, columns C_1 ... C_5), one laminar and one not laminar.]
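The laminarity definition can be checked directly by comparing every pair of sets. The sketch below is only an illustration of the definition, not the efficient method of the later slides. The function names are hypothetical, and the row assignment of the example matrix (A=11000, B=00100, C=11001, D=00110, E=01000) is an assumption chosen to agree with the vectors on slide 6; the non-laminar variant is made up for contrast.

```python
from itertools import combinations

def character_sets(matrix):
    """Return [O_1, ..., O_k], where O_j = {i : matrix[i][j] == 1}."""
    n_cols = len(matrix[0])
    return [{i for i, row in enumerate(matrix) if row[j] == 1}
            for j in range(n_cols)]

def is_laminar(sets):
    """True iff every two sets are disjoint or one includes the other."""
    for a, b in combinations(sets, 2):
        if a & b and not (a <= b or b <= a):
            return False
    return True

# Assumed example matrix: rows A..E, columns C_1..C_5.
laminar_matrix = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 0, 1],  # C
    [0, 0, 1, 1, 0],  # D
    [0, 1, 0, 0, 0],  # E
]

# Hypothetical variant: E also has C_3, so O_2 and O_3 overlap only in E.
non_laminar_matrix = [row[:] for row in laminar_matrix]
non_laminar_matrix[4] = [0, 1, 1, 0, 0]

print(is_laminar(character_sets(laminar_matrix)))      # True
print(is_laminar(character_sets(non_laminar_matrix)))  # False
```

This pairwise test costs O(k^2 n) set operations in the worst case, which is why the tutorial goes on to a faster procedure.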
9 Proof of Theorem
Assume M has a perfect phylogeny. Then O_i is exactly the set of leaves below the vertex labeled C_i. Consider the vertices labeled C_i and C_j:
- If C_i is an ancestor of C_j (C_2, C_1 below), then O_i includes O_j.
- If neither of them is an ancestor of the other (C_3, C_1 below), then O_i and O_j are disjoint.
[Figure: tree with leaves A, E, D, C, B and vertices labeled C_1 ... C_5.]
10 Proof of Theorem (cont)
Assume that the collection {O_1, …, O_k} is laminar. We (constructively) prove that M has a perfect phylogenetic tree.
Proof outline:
- Consider the inclusion graph of {O_1, …, O_k}.
- Removing “unnecessary” (transitively implied) edges results in a directed forest.
- Add a root and connect it to all sources, and add edges to the singletons representing the input species: each species hangs from the smallest set that contains it (or from the root if it has no characters).
[Figure: example matrix (rows A-E, columns C_1 ... C_5) and the resulting tree with leaves A, E, D, C, B and vertices labeled C_1 ... C_5.]
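The construction in the outline can be sketched as a parent map: each character vertex points to its smallest proper superset, and each species points to the smallest set containing it. This is a minimal sketch under stated assumptions, not the efficient implementation: it assumes the input is already laminar with distinct non-empty columns, the function name `inclusion_tree` is hypothetical, and the row assignment of the example matrix is an assumption consistent with the vectors on slide 6.

```python
def inclusion_tree(matrix, species):
    """Return a child -> parent map for a laminar 0/1 matrix.

    Character vertices are named "C1".."Ck"; the added root is "root".
    Assumes the column sets are laminar, distinct, and non-empty.
    """
    n_cols = len(matrix[0])
    sets = [frozenset(i for i, row in enumerate(matrix) if row[j] == 1)
            for j in range(n_cols)]

    def smallest(cols):
        # In a laminar family the sets containing a given set form a chain,
        # so the smallest one is the immediate parent.
        j = min(cols, key=lambda c: len(sets[c]))
        return f"C{j + 1}"

    parent = {}
    for j, s in enumerate(sets):
        supersets = [c for c, t in enumerate(sets) if s < t]
        parent[f"C{j + 1}"] = smallest(supersets) if supersets else "root"
    for i, name in enumerate(species):
        containing = [c for c, t in enumerate(sets) if i in t]
        parent[name] = smallest(containing) if containing else "root"
    return parent

# Assumed example matrix: rows A..E, columns C_1..C_5.
matrix = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 0, 1],  # C
    [0, 0, 1, 1, 0],  # D
    [0, 1, 0, 0, 0],  # E
]
tree = inclusion_tree(matrix, ["A", "B", "C", "D", "E"])
for child, par in tree.items():
    print(child, "->", par)
```

In the resulting map C1 hangs under C2 while C3 hangs under the root, matching the ancestor relations used in the proof on slide 9.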
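11 Efficient Implementation
1. Sort the columns (characters) according to decreasing binary value.
Claim: If the binary value of column i is larger than that of column j, then O_i is not a proper subset of O_j.
Proof: If O_i were a proper subset of O_j, column j would have a 1 wherever column i has one, plus at least one more, so its binary value would be larger. Hence C_i > C_j means the 1’s in C_i are not covered by the 1’s in C_j.
Corollary: the parent of C_i is the closest preceding column C_j (in sorted order) such that O_i is a proper subset of O_j.
[Figure: the example matrix (rows A-E) before sorting, with columns C_1 C_2 C_3 C_4 C_5, and after sorting, with columns C_2 C_1 C_3 C_5 C_4.]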
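As a small illustration of step 1, each column can be read top to bottom as a binary number and the columns sorted by that value. This is a sketch, not the O(nk) version: it uses Python's built-in sort on integer keys. The function names are hypothetical, and the row assignment of the example matrix is an assumption chosen so that the result reproduces the sorted order C_2, C_1, C_3, C_5, C_4 shown on the slide.

```python
def column_value(matrix, j):
    """Read column j top to bottom as a binary number (first row = most significant bit)."""
    value = 0
    for row in matrix:
        value = 2 * value + row[j]
    return value

def sort_columns(matrix):
    """Return column indices ordered by decreasing binary value."""
    n_cols = len(matrix[0])
    return sorted(range(n_cols),
                  key=lambda j: column_value(matrix, j),
                  reverse=True)

# Assumed example matrix: rows A..E, columns C_1..C_5.
matrix = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 0, 1],  # C
    [0, 0, 1, 1, 0],  # D
    [0, 1, 0, 0, 0],  # E
]
order = sort_columns(matrix)
print(["C%d" % (j + 1) for j in order])  # ['C2', 'C1', 'C3', 'C5', 'C4']
```

Under this row assignment the column values are C_1 = 20, C_2 = 21, C_3 = 10, C_4 = 2 and C_5 = 4, so every set appears after all of its proper supersets.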
12 Efficient Implementation (cont)
2. Make a backwards linked list of the 1’s in each row: every 1 points at the nearest 1 to its left in the same row (or at the root if there is none).
Claim: If the columns are sorted, then the set of columns is laminar iff for each column i, all the links leaving column i point at the same column. (Why is this?)
If the matrix is laminar, reverse the pointers to get the inclusion tree. Add root and leaves, as stated in slide #10.
[Figure: the sorted example matrix (columns C_2 C_1 C_3 C_5 C_4, rows A-E) with the backward links drawn.]
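The backward-link test can be sketched without explicit linked lists: scan each row in sorted column order, remember the previous column holding a 1, and require that every column always sees the same predecessor. The function name `parents_from_links` and the `ROOT` marker are hypothetical, the sorted order is passed in rather than computed, and the example matrix rows are an assumed assignment consistent with slides 6 and 11.

```python
ROOT = -1  # hypothetical marker meaning "no 1 to the left in this row"

def parents_from_links(matrix, order):
    """Return {column: parent column} if all links agree, else None.

    `order` must list the column indices by decreasing binary value.
    """
    parent = {}
    for row in matrix:
        prev = ROOT
        for j in order:
            if row[j] == 1:
                # The link leaving column j in this row points at `prev`.
                if parent.setdefault(j, prev) != prev:
                    return None  # two links from column j disagree
                prev = j
    return parent

# Assumed example matrix: rows A..E, columns C_1..C_5 (0-based indices 0..4).
matrix = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 0, 1],  # C
    [0, 0, 1, 1, 0],  # D
    [0, 1, 0, 0, 0],  # E
]
order = [1, 0, 2, 4, 3]  # C_2, C_1, C_3, C_5, C_4

print(parents_from_links(matrix, order))

# Hypothetical non-laminar variant: E also has C_3 (the sorted order is unchanged).
bad = [row[:] for row in matrix]
bad[4] = [0, 1, 1, 0, 0]
print(parents_from_links(bad, order))  # None
```

The pass touches each matrix entry once, so it runs in O(nk) time once the columns are sorted.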
13 Efficient Implementation - Summary
1. Sort the columns (characters) according to decreasing binary value.
2. Make a backwards linked list of the 1’s in each row.
3. If the matrix is laminar, reverse the pointers to get the inclusion tree. Add root and leaves, as stated in slide #10.
Complexity: O(nk), using radix (bucket) sort in stage 1.
[Figure: the example matrix (rows A-E) before sorting (columns C_1 C_2 C_3 C_4 C_5) and after sorting (columns C_2 C_1 C_3 C_5 C_4).]
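One way to meet the O(nk) bound in stage 1 is a least-significant-digit radix sort with two buckets per pass: process the rows from the bottom (least significant bit) to the top, each time moving the columns with a 1 ahead of those with a 0 while keeping their relative order. The sketch below assumes the first row is the most significant bit; the function name is hypothetical and the example matrix rows are an assumed assignment consistent with slides 6 and 11.

```python
def radix_sort_columns(matrix):
    """Sort column indices by decreasing binary value in O(n*k) time.

    One stable two-bucket pass per row, least significant row first.
    """
    order = list(range(len(matrix[0])))
    for row in reversed(matrix):
        ones = [j for j in order if row[j] == 1]
        zeros = [j for j in order if row[j] == 0]
        order = ones + zeros  # 1s first gives decreasing order; stability keeps earlier passes
    return order

# Assumed example matrix: rows A..E, columns C_1..C_5.
matrix = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 0, 1],  # C
    [0, 0, 1, 1, 0],  # D
    [0, 1, 0, 0, 0],  # E
]
sorted_cols = radix_sort_columns(matrix)
print(["C%d" % (j + 1) for j in sorted_cols])  # ['C2', 'C1', 'C3', 'C5', 'C4']
```

Each of the n passes does O(k) work, so the sort matches the O(nk) cost of the linking stage.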