Phylogenetic Trees. Lecture 11. Sections 6.1, 6.2 in Setubal et al.; 7.1, 7.1 in Durbin et al. © Shlomo Moran, based on Nir Friedman, Danny Geiger, Ilan Gronau
2 Evolution
Evolution of new organisms is driven by:
• Diversity
  ◦ Different individuals carry different variants of the same basic blueprint
• Mutations
  ◦ The DNA sequence can change through single-base changes, deletion/insertion of DNA segments, etc.
• Selection bias
3 Theory of Evolution
• Basic idea
  ◦ Speciation events lead to the creation of different species (speciation: physical separation into groups where different genetic variants become dominant)
• Any two species share a (possibly distant) common ancestor
• This is described by a rooted tree: the tree of life.
4 Tree of Life
• Any two species share a (possibly distant) common ancestor
• The process of evolution consists of:
  ◦ speciation events
  ◦ mutations along evolutionary branches
[Figure: the tree of life. Source: Alberts et al.]
5 Often only a subtree is studied
Definition: A phylogeny (also called a phylogenetic tree) is a tree that describes the sequence of speciation events that led to the formation of a set of current-day species.
6 Components of Phylogenetic Trees
• Leaves: current-day species (or taxa, the plural of taxon)
• Internal vertices: hypothetical common ancestors
• Edge lengths: “time” from one speciation to the next
• The tree topology: the tree structure, ignoring edge lengths. Usually the goal is to find the topology.
[Figure: example tree with leaves Aardvark, Bison, Chimp, Dog, Elephant]
7 Historical Note
• Until the mid-1950s, phylogenies were constructed by experts based on their opinion (subjective criteria)
• Since then, the focus has been on objective criteria for constructing phylogenetic trees
  ◦ Thousands of articles in the last decades
• Important for many aspects of biology
  ◦ Classification
  ◦ Understanding biological mechanisms
Outline
A. Introduction (this lecture)
1. The phylogenetic reconstruction problem: from sequences to trees
2. Morphological vs. molecular sequences
3. Possible pitfalls
4. Directed and undirected trees
5. The “big” problem, the “small” problem
Outline (cont.)
B. Character-based methods (this + next lectures)
1. Perfect Phylogeny
2. Maximum Parsimony
3. Maximum Likelihood (not studied in this course)
These methods consider the evolution of each character separately. They try to find the tree that gives the “best” evolutionary explanation: the least number of observed mutations (1 & 2), or the most probable tree (3). These optimization problems are typically NP-hard. We’ll discuss ways of solving simplified versions of the problems.
Outline (cont.)
C. Distance-based methods (last 1-2 lectures)
- Run in polynomial time
- Compute distances between all taxon pairs
- Find an edge-weighted tree that best describes the distances
Outline (end)
Distance methods (cont.)
1. Efficient reconstruction (O(n^2) time) from accurate distances
2. Reconstruction from noisy distances: can we reconstruct accurate trees from approximate distances?
   - Worst-case noise model
   - More realistic noise models: inter-species distances derived from probabilistic models of mutations
12 Evolution as a Tree
[Figure: a tree whose vertices are labeled by the sequences AATCCTG, ATAGCTG, AATGGGC, GAACGTA, AAACCGA, ACGGTCA, ACGGATA, ACGGGTA, ACCCGTG, ACCGTTG, TCTGGTA, TCTGGGA, TCCGGAA, AGCCGTG, GGGGATT, AAAGTCA, AAAGGCG, AAACACA, AAAGCTG]
13 Phylogenetic Reconstruction
[Figure: the sequences AATCCTG, ATAGCTG, AATGGGC, GAACGTA, AAACCGA, ACCGTTG, TCTGGGA, TCCGGAA, AGCCGTG, GGGGATT]
14 Phylogenetic Reconstruction
B : AATCCTG, C : ATAGCTG, A : AATGGGC, D : GAACGTA, E : AAACCGA, J : ACCGTTG, G : TCTGGGA, H : TCCGGAA, I : AGCCGTG, F : GGGGATT
Goal: reconstruct the ‘true’ tree as accurately as possible.
[Figure: the input sequences and a reconstructed tree with leaves A to J]
15 What are the sequences?
• “Significant” (e.g., morphological) characters, which distinguish between species
• Molecular characters: DNA (4 letters), proteins (20 letters)
Construct the tree by comparing “homologous” sequences.
16 What are the sequences? Morphological vs. Molecular
• Classical methods: morphological features
  ◦ number of legs, lengths of legs, etc.
• Modern methods: molecular features
  ◦ Gene (DNA) sequences
  ◦ Protein sequences
• Analysis is based on homologous sequences (e.g., globins) in different species
17 Possible pitfall in reconstruction: Misleading selection of sequences
Gene/protein sequences can be homologous for several different reasons:
• Orthologs: sequences diverged after a speciation event
• Paralogs: sequences diverged after a duplication event (next slides)
• Xenologs: sequences diverged after a horizontal transfer (e.g., by a virus)
18 Misleading selection of sequences: Using paralogs instead of orthologs
Consider the evolutionary tree of three taxa, and assume that at some point in the past a gene duplication event occurred.
[Figure: evolutionary tree of three taxa with a gene duplication marked]
19 Paralogs instead of Orthologs
The gene evolution is described by this tree (1, 2, 3 are species; A, B are the copies of the same gene).
[Figure: gene tree with a gene duplication followed by speciation events; leaves 1A, 2A, 3A (copy A) and 3B, 2B, 1B (copy B)]
20 Paralogs instead of Orthologs
If we happen to consider genes 1A, 2B, and 3A of species 1, 2, 3, we get a wrong tree.
In the sequel we assume all given sequences are orthologs, created from a common ancestor by speciation events.
[Figure: the gene tree with leaves 1A, 2A, 3A, 3B, 2B, 1B, the gene duplication, and the speciation events marked S]
21 Rooted vs. Undirected Trees
A natural representation of a phylogeny is a rooted tree.
[Figure: rooted tree whose root is the common ancestor]
22 Types of trees
An unrooted tree represents the same phylogeny without the root node.
Most known tree-reconstruction techniques do not distinguish between different placements of the root.
23 Rooted versus unrooted trees
[Figure: rooted trees labeled Tree a, Tree b, Tree c, and an unrooted tree with positions a, b, c marked]
The unrooted tree represents the three rooted trees.
24 Positioning Roots in Unrooted Trees
• We can estimate the position of the root by introducing an outgroup:
  ◦ a set of species that are definitely distant from all the species of interest
[Figure: tree with leaves Aardvark, Bison, Chimp, Dog, Elephant and outgroup Falcon; the proposed root is marked]
25 Two phylogenetic trees of the same species: do these trees represent the same evolutionary history?
[Figure: two drawings of trees with leaves Aardvark, Bison, Chimp, Dog, Elephant]
26 When are two unrooted phylogenetic trees considered different?
Trees T1 and T2 on the same set of species are considered identical if they represent the same evolutionary history, i.e., they have the same topology.
Formally, this is equivalent to: there is a tree isomorphism h: T1 → T2 such that h(x) = x for each species x.
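The identity test above can be checked mechanically. The sketch below is one possible way, not the lecture's method: it relies on the standard fact that two leaf-labeled trees without degree-2 internal vertices have the same topology exactly when they induce the same set of leaf bipartitions (splits). The adjacency-dict representation, the helper names `make_tree`, `leaf_splits`, `same_topology`, and the three example trees are all assumptions made for illustration.

```python
def make_tree(edges):
    """Build an adjacency dict from a list of undirected edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    return adj

def leaf_splits(adj, leaves):
    """Set of leaf bipartitions obtained by deleting each edge in turn."""
    leaves = frozenset(leaves)
    splits = set()

    def side(start, blocked):
        # Leaves reachable from `start` without crossing the edge to `blocked`.
        seen, stack, found = {blocked, start}, [start], set()
        while stack:
            v = stack.pop()
            if v in leaves:
                found.add(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return frozenset(found)

    for u in adj:
        for v in adj[u]:
            part = side(v, u)
            splits.add(frozenset({part, leaves - part}))
    return splits

def same_topology(adj1, adj2, leaves):
    """True if both trees induce the same splits of the leaf set."""
    return leaf_splits(adj1, leaves) == leaf_splits(adj2, leaves)

leaves = ["Aardvark", "Bison", "Chimp", "Dog", "Elephant"]
# Hypothetical trees: t1 and t2 are the same tree drawn differently,
# t3 swaps Bison and Chimp.
t1 = make_tree([("Aardvark", "u"), ("Bison", "u"), ("u", "v"), ("Chimp", "v"),
                ("v", "w"), ("Dog", "w"), ("Elephant", "w")])
t2 = make_tree([("x", "Dog"), ("x", "Elephant"), ("x", "y"), ("y", "Chimp"),
                ("y", "z"), ("z", "Bison"), ("z", "Aardvark")])
t3 = make_tree([("Aardvark", "u"), ("Chimp", "u"), ("u", "v"), ("Bison", "v"),
                ("v", "w"), ("Dog", "w"), ("Elephant", "w")])
```

Internal vertex names play no role in the comparison, which mirrors the requirement that the isomorphism h only has to fix the species.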
27 The two trees represent the same evolution
[Figure: the two trees on Aardvark, Bison, Chimp, Dog, Elephant; internal vertices u, v, w of the first tree are mapped to h(u), h(v), h(w) in the second]
28 The “Big” reconstruction problem, the “Small” problem
The “big” problem: compute the whole phylogenetic tree from the n input sequences.
The “small” problem: assume the tree topology and the identities of the leaf species are known. Reconstruct the sequences at the internal vertices, and give a score to the resulting phylogeny.
Connection between the problems: in order to solve the big problem, solve the small problem on all possible trees with n leaves, and output the tree(s) with the best “score”. This is impossible in practice for more than a few taxa.
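To see why exhaustive search breaks down, it helps to count the candidate trees. The sketch below assumes the search is restricted to unrooted binary trees with n labeled leaves, for which the count is the well-known product 1·3·5·…·(2n−5); the function name is hypothetical.

```python
def num_unrooted_binary_trees(n):
    """Number of unrooted binary trees on n labeled leaves: (2n-5)!!, n >= 3."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5.
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count

for n in (4, 5, 10, 20):
    print(n, num_unrooted_binary_trees(n))
```

The count grows faster than exponentially in n, so even with a fast solver for the small problem, enumerating all trees is only feasible for a handful of taxa.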
29 Input for the “big” problem
A : CAGGTA
B : CAGACA
C : CGGGTA
D : TGCACT
E : TGCGTA
Our task: find an evolutionary tree with leaves corresponding to the 5 sequences, which best explains the evolution of the strings.
30 Input for the “small” problem
[Figure: tree with leaves Aardvark, Bison, Chimp, Dog, Elephant, labeled by the sequences below]
A : CAGGTA
B : CAGACA
C : CGGGTA
D : TGCACT
E : TGCGTA
The tree and the assignment of strings to the leaves are given, and we need only assign strings to the internal vertices.
31 Character-based methods for constructing phylogenies
In this approach, trees are constructed by comparing the characters of the corresponding sequences. Characters may be morphological (teeth structures) or molecular (nucleotides in homologous DNA sequences).
We will present two methods: “Perfect Phylogeny” and “Maximum Parsimony”.
Basic assumption in these methods: the best tree is one with the minimal number of observed mutations (character changes along the edges, a.k.a. substitutions).
32 Character based methods: Input data
species  C1 C2 C3 C4 … Cm
dog      AACAGGTCTTCGAGGCCC
horse    AACAGGCCTATGAGACCC
frog     AACAGGTCTTTGAGTCCC
human    AACAGGTCTTTGATGACC
pig      AACAGTTCTTCGATGGCC
         *****  **  **   **
Each character (column) is processed independently.
The green character (column 14) will separate the human and pig from frog, horse and dog.
The red character (column 11) will separate the dog and pig from frog, horse and human.
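Processing each character independently amounts to reading the alignment column by column and grouping the species by their state. A minimal sketch using the five sequences from the slide follows; the function name is hypothetical, and the choice of 0-based columns 13 and 10 for the green and red characters is inferred from the separations the slide describes, since the coloring itself is not preserved in this text.

```python
def character_partition(seqs, col):
    """Group species by their state at character (column) `col`, 0-based."""
    groups = {}
    for species, seq in seqs.items():
        groups.setdefault(seq[col], set()).add(species)
    return groups

seqs = {
    "dog":   "AACAGGTCTTCGAGGCCC",
    "horse": "AACAGGCCTATGAGACCC",
    "frog":  "AACAGGTCTTTGAGTCCC",
    "human": "AACAGGTCTTTGATGACC",
    "pig":   "AACAGTTCTTCGATGGCC",
}

green = character_partition(seqs, 13)  # human, pig vs. frog, horse, dog
red = character_partition(seqs, 10)    # dog, pig vs. frog, horse, human
print(green)
print(red)
```

Note that these two characters split the species in incompatible ways (pig sides with human in one and with dog in the other), which is the kind of conflict the following methods have to resolve.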
33 The perfect phylogeny problem
• A character is assumed to be a significant property, which distinguishes between species (e.g., dental structure, number of legs/limbs).
• A character state is a value of the character (e.g., human dental structure).
• Assumption: it is unlikely that a given state will be created twice in the evolution tree. Such characters are called “homoplasy free”, and are detailed next.
34 Homoplasy-free characters 1
Homoplasy-free characters should avoid reversal transitions:
• A species regains a state its direct ancestor has lost.
• Well-known exceptions:
  ◦ Teeth in birds.
  ◦ Legs in snakes.
35 Homoplasy-free characters 2
…and also avoid convergence transitions:
• Two species possess the same state while their least common ancestor possesses a different state.
• Well-known exceptions: the marsupials.
36 The Perfect Phylogeny Problem
Input:
1. A set of species
2. A set of characters
3. For each character, an assignment of states to the species
Problem: Is there a phylogenetic tree T=(V,E) such that the evolution of all characters is “homoplasy free” (no reversal, no convergence)?
First, we define the problem using graph-theoretic terms.
37 Characters = Colorings
A coloring of a tree T=(V,E) is a mapping C: V → [set of colors].
A partial coloring of T is a coloring of a subset of the vertices U ⊆ V, i.e., C: U → [set of colors].
[Figure: a partially colored tree with the colored subset U marked]
38 Characters as Colorings
Each character defines a (partial) coloring of the corresponding phylogenetic tree:
Species ≡ Vertices
States ≡ Colors
39 Convex Colorings (and Characters)
Let T=(V,E) be a partially colored tree, and d be a color. The d-carrier is the minimal subtree of T containing all vertices colored d.
Definition: A (partial/total) coloring of a tree is convex iff the d-carriers of distinct colors d are pairwise disjoint.
40 Convexity ⇔ Homoplasy Freedom
A character is homoplasy free (avoids reversal and convergence transitions)
⇔
The corresponding (partial) coloring is convex
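Convexity of a given (partial) coloring on a given tree can be tested directly from the definition of carriers. The sketch below is an illustration under assumptions, not the lecture's algorithm: the tree is an adjacency dict, the coloring is a dict from colored vertices to colors, each carrier is found by repeatedly pruning uncolored leaves, and the names `carrier`, `is_convex` and the six-vertex example tree are hypothetical.

```python
def carrier(adj, colored):
    """Vertices of the minimal subtree containing every vertex in `colored`."""
    nbrs = {v: set(ws) for v, ws in adj.items()}
    prunable = [v for v in nbrs if len(nbrs[v]) <= 1 and v not in colored]
    while prunable:
        v = prunable.pop()
        if v not in nbrs:  # already removed
            continue
        # Remove the leaf v; its neighbor may become a prunable leaf in turn.
        for w in nbrs.pop(v):
            nbrs[w].discard(v)
            if len(nbrs[w]) <= 1 and w not in colored:
                prunable.append(w)
    return set(nbrs)

def is_convex(adj, coloring):
    """True if the carriers of distinct colors are pairwise disjoint."""
    used = set()
    for d in set(coloring.values()):
        c = carrier(adj, {v for v, col in coloring.items() if col == d})
        if c & used:
            return False
        used |= c
    return True

# Hypothetical tree: species s1, s2 hang off u; s3, s4 hang off v; u-v is an edge.
adj = {
    "s1": ["u"], "s2": ["u"], "s3": ["v"], "s4": ["v"],
    "u": ["s1", "s2", "v"], "v": ["u", "s3", "s4"],
}
good = {"s1": "R", "s2": "R", "s3": "B", "s4": "B"}  # carriers {s1,s2,u}, {s3,s4,v}
bad = {"s1": "R", "s3": "R", "s2": "B", "s4": "B"}   # both carriers contain u and v
print(is_convex(adj, good), is_convex(adj, bad))
```

In the `bad` coloring, any assignment of states to u and v forces some state to arise twice, which is exactly the homoplasy that convexity rules out.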
41 The Perfect Phylogeny Problem (pure graph-theoretic setting)
Input: Partial colorings (C1,…,Ck) of a set of vertices U (in the example: 3 total colorings: left, center, right, each by two colors).
Problem: Is there a tree T=(V,E) such that U ⊆ V and, for i=1,…,k, Ci is a convex (partial) coloring of T?
[Figure: the example vertex set shown under three colorings (left, center, right)]
PP is NP-hard in general. In the tutorial you will see a special case solvable in polynomial time.
42 Maximum Parsimony Perfect Phylogeny is not only hard to compute, but in many cases it doesn’t exist. Next we discuss a more common approach, called “Maximum Parsimony”, which looks for a tree which minimizes the number of mutations.
43 Maximum Parsimony
A character-based method.
Input:
• h sequences (one per species), all of length k.
Goal:
• Find a tree whose leaves are labeled by the input sequences, and an assignment of sequences to internal nodes, such that the total number of substitutions is minimized.
44 Example
Input: four nucleotide sequences, AAG, AAA, GGA, AGA, taken from four species.
By the parsimony principle, we seek a tree whose leaves are labeled by the input sequences, and an assignment of sequences to internal vertices, with the minimum total number of mutations (i.e., letter changes) along the tree edges.
Here is one possible tree + sequence assignment.
[Figure: tree with vertex labels AGA, AAA, GGA, AAG, AAA. Total #substitutions = 4]
45 Example Continued
Here are two other trees + sequence assignments:
[Figure, left: tree with vertex labels AGA, GGA, AAA, AAG, AAA, AGA, AAA. Total #substitutions = 3]
[Figure, right: tree with vertex labels GGA, AAA, AGA, AAG, AAA. Total #substitutions = 4]
The left solution is preferred over the right one.
A solution has two parts: first, select a tree and label its leaves by the input sequences; then, assign sequences to the internal vertices.
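Once a tree and a full assignment of sequences are fixed, the total number of substitutions is just the sum of letter differences along the edges. The sketch below illustrates this on the four input sequences. The tree shapes and internal labels are assumptions chosen to reproduce the totals 3 and 4 quoted on the slide (the original drawings are not preserved here), and the vertex names `root`, `p`, `q`, `s1`…`s4` are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def total_substitutions(edges, seq):
    """Sum of letter changes over all tree edges for a full labeling `seq`."""
    return sum(hamming(seq[u], seq[v]) for u, v in edges)

# Assumed shape: a root with internal children p and q, each with two leaves.
edges = [("root", "p"), ("root", "q"),
         ("p", "s1"), ("p", "s2"), ("q", "s3"), ("q", "s4")]

# Leaves paired as (AGA, GGA) and (AAA, AAG).
left = {"root": "AAA", "p": "AGA", "q": "AAA",
        "s1": "AGA", "s2": "GGA", "s3": "AAA", "s4": "AAG"}
# Leaves paired as (GGA, AAA) and (AGA, AAG).
right = {"root": "AAA", "p": "AAA", "q": "AAA",
         "s1": "GGA", "s2": "AAA", "s3": "AGA", "s4": "AAG"}

print(total_substitutions(edges, left), total_substitutions(edges, right))
```

The two labelings differ both in how the leaves are paired and in the internal sequences, matching the two-part structure of a solution described above.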
46 Example With One Letter Sequences
• Suppose we have five species, such that three have ‘C’ and two have ‘T’ at a specified position
• A minimal tree has only one evolutionary change:
[Figure: tree over the five species with vertices labeled C or T and a single C/T change on one edge]
47 Parsimony score
[Figure, left: tree with vertex labels AGA, GGA, AAA, AAG, AAA, AGA, AAA. Parsimony score = 3]
[Figure, right: tree with vertex labels GGA, AAA, AGA, AAG, AAA. Parsimony score = 4]
The parsimony score of a leaf-labeled tree T is the minimum possible number of mutations over all assignments of sequences to internal vertices of T.
48 Parsimony Based Reconstruction
We have here both the small and big problems:
1. The small problem: find the parsimony score for a given leaf-labeled tree.
2. The big problem: find a tree whose leaves are labeled by the input sequences, with the minimum possible parsimony score.
We will see efficient algorithms for (1); (2) is hard.
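As an illustration of an efficient solution to the small problem, here is a sketch of Fitch's algorithm, one standard method for computing the parsimony score of a leaf-labeled binary tree; the lecture may present a different algorithm. The representation is an assumption: the tree is rooted arbitrarily and written as nested pairs with sequences at the leaves, and the function names are hypothetical.

```python
def first_leaf(tree):
    """Return the leftmost leaf sequence of a nested-tuple tree."""
    while not isinstance(tree, str):
        tree = tree[0]
    return tree

def fitch_column(tree, col):
    """Return (candidate states, minimum changes) for one character."""
    if isinstance(tree, str):
        return {tree[col]}, 0
    left, right = tree
    left_states, left_cost = fitch_column(left, col)
    right_states, right_cost = fitch_column(right, col)
    common = left_states & right_states
    if common:
        # The children can agree on a state: no change needed at this vertex.
        return common, left_cost + right_cost
    # The children disagree: one change is unavoidable here.
    return left_states | right_states, left_cost + right_cost + 1

def parsimony_score(tree):
    """Sum the per-character Fitch counts over all columns."""
    length = len(first_leaf(tree))
    return sum(fitch_column(tree, c)[1] for c in range(length))

tree_a = (("AGA", "GGA"), ("AAA", "AAG"))
tree_b = (("GGA", "AAA"), ("AGA", "AAG"))
one_letter = (("C", "C"), ("C", ("T", "T")))
print(parsimony_score(tree_a), parsimony_score(tree_b), parsimony_score(one_letter))
```

Each character is scored in a single bottom-up pass, so the running time is linear in the number of vertices per character. The big problem remains hard because the number of candidate trees, not the cost of scoring one tree, is the bottleneck.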
49 Example of Input for a Given Tree
[Figure: tree with leaves Aardvark, Bison, Chimp, Dog, Elephant, labeled by the sequences below]
A : CAGGTA
B : CAGACA
C : CGGGTA
D : TGCACT
E : TGCGTA
Given a tree whose leaves are labeled by sequences, we need only assign strings to the internal vertices.