Phylogenetic basis of systematics Linnaeus: Ordering principle is God. Darwin: Ordering principle is shared descent from common ancestors. Today, systematics is explicitly based on phylogeny.
Goals of Phylogenetic Analysis Given a multiple sequence alignment, determine the ancestral relationships among the species. We assume that residues in a column are homologous, and that all columns have the same history. [Figure: tree relating Hu, Ch, Go, Gi, drawn against a time axis]
Types of Phylogenetic Trees: 1. Cladogram: shows the relationships between different organisms; branch lengths are arbitrary. 2. Phylogram: branch lengths are meaningful and represent evolutionary time or the amount of change.
Data Biomolecular sequences: DNA, RNA, amino acid, in a multiple alignment Molecular markers (e.g., SNPs, etc.) Morphology Gene order and content These are “character data”: each character is a function mapping the set of taxa to distinct states (equivalence classes), with evolution modelled as a process that changes the state of a character
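As a minimal sketch of the "character data" idea, each alignment column can be treated as one character: a mapping from taxa to states, which in turn partitions the taxa into equivalence classes. The function names `character` and `state_classes` are hypothetical, and the four toy sequences a, b, c, d are the ones used in the parsimony example later in these slides.

```python
from collections import defaultdict

# Toy alignment: taxon name -> aligned sequence (three columns).
alignment = {"a": "AAG", "b": "GGA", "c": "AAA", "d": "AGA"}

def character(alignment, column):
    """Return one character: a mapping from each taxon to its state in a column."""
    return {taxon: seq[column] for taxon, seq in alignment.items()}

def state_classes(char):
    """Group taxa into equivalence classes that share the same state."""
    classes = defaultdict(set)
    for taxon, state in char.items():
        classes[state].add(taxon)
    return dict(classes)

first_column = character(alignment, 0)
print(first_column)
print(state_classes(first_column))
```

Evolution along a tree then corresponds to changes in the value this mapping takes from an ancestor to its descendants.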
DNA Sequence Evolution [Figure: DNA sequences evolving along a tree, with a time axis of -3 mil yrs, -2 mil yrs, -1 mil yrs, today. Sequences shown: AAGACTT, TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, AGCACTT, AGCGCTT, AGCACAA, TAGACTT, TAGCCCA]
Phylogenetic Analyses Step 1: Gather sequence data, and estimate the multiple alignment of the sequences. Step 2: Reconstruct trees on the data. (This can result in many trees.) Step 3: Apply consensus methods to the set of trees to figure out what is reliable.
Phylogeny Problem Sequences: AGGGCAT, TAGCCCA, TAGACTT, TGCACAA, TGCGCTT. [Figure: the five sequences, labelled U, V, W, X, Y, to be placed on a tree]
Types of Phylogenetic Methods
Character-based: Parsimony, Likelihood. Involve optimizing a criterion based on fit of the residues to the tree.
Distance-based: Neighbor joining (NJ), UPGMA. Involve optimizing a criterion based on fit of a matrix of pairwise distances to the tree.
Parsimony: Select the tree that explains the data with the fewest substitutions.
Likelihood: Select the tree that has the highest probability of producing the observed data.
Distance: Select the tree that best recreates the observed pairwise distances.
http://study.com/academy/lesson/maximum-parsimony-likelihood-methods-in-phylogeny.html
https://www.youtube.com/watch?v=NRRErwFsIcw
Phylogenetic Tree Building Two basic types: Gene/protein tree: represents the evolutionary history of genes/proteins. Species tree: represents the evolutionary history of species, based on characters (like protein sequences). [Figures: rooted, binary tree; unrooted, binary tree] * Can root a tree using an outgroup: a known distant relative.
Root (ancestral species). Nodes (common ancestor). Edges: branch lengths (“distance”) ~ time. Leaves (modern observations). Why is the structure of the tree important? Branching represents speciation into two new species.
Branch lengths (“distance”) ~ time. Root (ancestral species). [Figure: rooted tree with leaves numbered 1 to 8] This tree can also be denoted in text format (Newick notation): (((((3,4),(5,6)),7),(1,2)),8);
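A minimal sketch of reading such a text format back into a tree structure, assuming it is Newick-style: the whole tree is wrapped in one outer pair of parentheses, terminated by a semicolon, and carries no branch lengths. The function name `parse_newick` is hypothetical, and nested tuples are just one convenient representation.

```python
def parse_newick(text):
    """Parse a Newick-style string (no branch lengths) into nested tuples of leaf names."""
    s = text.strip().rstrip(";").replace(" ", "")
    pos = 0

    def peek():
        # Current character, or "" once the input is exhausted.
        return s[pos] if pos < len(s) else ""

    def parse_node():
        nonlocal pos
        if peek() == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while peek() == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if peek() != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
            return tuple(children)
        start = pos
        while peek() and peek() not in ",()":
            pos += 1
        if start == pos:
            raise ValueError(f"expected a leaf name at position {pos}")
        return s[start:pos]

    tree = parse_node()
    if pos != len(s):
        raise ValueError(f"unexpected trailing text at position {pos}")
    return tree

print(parse_newick("(((((3,4),(5,6)),7),(1,2)),8);"))
```

Each pair of parentheses corresponds to one internal node (a common ancestor), and the outermost pair is the root.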
Building phylogenetic trees
Distance-based methods
a. Calculate evolutionary distances between sequences
b. Build a tree based on those distances
Maximum Parsimony (character-based method)
a. Find the simplest tree: the one that explains the data with the fewest substitutions
Maximum Likelihood (probabilistic method based on an explicit model)
a. Find the tree that is most likely, given an evolutionary model
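The two distance-based steps can be sketched as follows. This is an illustration under stated assumptions, not the only way to do it: step a uses the uncorrected proportion of differing sites (p-distance) as the evolutionary distance, and step b uses UPGMA, one of the distance methods named earlier. The labels s1 to s5 are hypothetical names for the five sequences of the "Phylogeny Problem" slide, and the result records topology only, without branch lengths.

```python
from itertools import combinations

def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)

def distance_matrix(seqs):
    """Step a: pairwise distances, keyed by the unordered pair of names."""
    return {frozenset((x, y)): p_distance(seqs[x], seqs[y])
            for x, y in combinations(seqs, 2)}

def upgma(names, dist):
    """Step b: repeatedly merge the two closest clusters (UPGMA).

    Returns the topology as nested tuples of names.
    """
    clusters = {name: 1 for name in names}  # cluster -> number of leaves in it
    d = dict(dist)
    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        size_x, size_y = clusters.pop(x), clusters.pop(y)
        merged = (x, y)
        for other in clusters:
            # Distance to the merged cluster: size-weighted average of its parts.
            d[frozenset((merged, other))] = (
                size_x * d[frozenset((x, other))] + size_y * d[frozenset((y, other))]
            ) / (size_x + size_y)
        clusters[merged] = size_x + size_y
    return next(iter(clusters))

# Hypothetical labels for the five example sequences.
seqs = {"s1": "AGGGCAT", "s2": "TAGCCCA", "s3": "TAGACTT",
        "s4": "TGCACAA", "s5": "TGCGCTT"}
print(upgma(list(seqs), distance_matrix(seqs)))
```

Ties between equally close pairs are broken by iteration order here, so a different input order can give a different but equally supported tree.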
Building phylogenetic trees Distance based methods Maximum Parsimony (character based method) Search all possible trees and find the one requiring the fewest substitutions. Example sequences: a = AAG, b = GGA, c = AAA, d = AGA. [Figure: candidate tree with leaves in the order a, b, c, d]
Building phylogenetic trees Distance based methods Maximum Parsimony (character based method) Search all possible trees and find the one requiring the fewest substitutions. Candidate tree: a (AAG) grouped with c (AAA), and b (GGA) grouped with d (AGA). What are the ancestral sequences at each node? AAA for the ancestor of a and c, AGA for the ancestor of b and d, and AAA or AGA at the root. How many base changes are required for this tree? 3 changes are required. The score of the tree is the number of character changes. MP aims to minimize the score of the tree.
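The score of a given tree can be computed column by column. The sketch below uses Fitch's counting procedure, which is an assumption on my part, since the slides do not name an algorithm. Trees are written as nested 2-tuples of leaf names, and `fitch_score` is a hypothetical function name.

```python
def fitch_score(tree, alignment):
    """Minimum number of substitutions needed on a rooted binary tree.

    tree: nested 2-tuples of leaf names, e.g. (("a", "c"), ("b", "d")).
    alignment: mapping from leaf name to aligned sequence.
    """
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for site in range(n_sites):
        changes = 0

        def state_set(node):
            """Return the candidate ancestral states at this node for one site."""
            nonlocal changes
            if isinstance(node, str):
                return {alignment[node][site]}
            left, right = (state_set(child) for child in node)
            common = left & right
            if common:
                return common      # children agree: no change needed here
            changes += 1           # children disagree: one substitution
            return left | right

        state_set(tree)
        total += changes
    return total

alignment = {"a": "AAG", "b": "GGA", "c": "AAA", "d": "AGA"}
for tree in [(("a", "b"), ("c", "d")),
             (("a", "c"), ("b", "d")),
             (("a", "d"), ("b", "c"))]:
    print(tree, fitch_score(tree, alignment))
```

For these four sequences the grouping of a with c and b with d scores 3, while the other two groupings score 4, so it is the most parsimonious of the three.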
How can you tell if your tree is significant?
Bootstrapping: how dependent is the tree on the dataset
1. Randomly choose n objects (e.g., alignment columns) from your dataset of n, with replacement
2. Rebuild the tree based on the resampled data
3. Repeat 1,000 – 10,000 times
4. How often are the same children joined? If a given node is represented in fewer than x trials, collapse the node for a ‘consensus’ tree
Jackknifing: how dependent is the tree on the dataset
1. Randomly choose k objects from your dataset of n (k < n), without replacement
2. Rebuild the tree based on the subset of the data
3. Repeat 1,000 – 10,000 times
4. How often are the same children joined?
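The resampling and counting steps can be sketched as below, assuming the resampled "objects" are alignment columns and that trees are represented as nested tuples of leaf names. All function names are hypothetical, and the tree-building step itself is left to whichever method is in use.

```python
import random
from collections import Counter

def column_count(alignment):
    return len(next(iter(alignment.values())))

def bootstrap_columns(alignment, rng):
    """Step 1 (bootstrap): draw n columns from n, with replacement."""
    n = column_count(alignment)
    cols = [rng.randrange(n) for _ in range(n)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}

def jackknife_columns(alignment, k, rng):
    """Step 1 (jackknife): draw k columns from n, without replacement."""
    cols = sorted(rng.sample(range(column_count(alignment)), k))
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}

def clades(tree):
    """Set of clades (as frozensets of leaf names) defined by a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found

def clade_support(trees):
    """Step 4: fraction of replicate trees in which each clade appears."""
    counts = Counter()
    for tree in trees:
        counts.update(clades(tree))
    return {clade: count / len(trees) for clade, count in counts.items()}

alignment = {"a": "AAG", "b": "GGA", "c": "AAA", "d": "AGA"}
rng = random.Random(0)  # fixed seed so the sketch is repeatable
print(bootstrap_columns(alignment, rng))
print(jackknife_columns(alignment, 2, rng))
```

In a full analysis, each replicate alignment would be passed to the tree-building method, the resulting trees collected in a list, and `clade_support` used to decide which nodes fall below the chosen threshold x.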
How can you tell if your tree is significant? [Figure: tree with support values 70, 100, 80, 95, 100 labelled at its internal nodes]
Maximum Likelihood tree showing Bayesian Inference/Maximum Parsimony/Maximum Likelihood support value at each node