Phylogenetics
Review: Multiple sequence alignment
ClustalW steps
1. Pairwise alignments
2. UPGMA or Neighbor-Joining tree based on pairwise scores (guide tree)
3. Multiple alignment informed by guide tree
Multiple sequence alignment (Phylip format)
20 372
ThNM012b    TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
ThNM012     TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
ThNM043     TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
ThlanugQH0  TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
ThNM069     TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
ThNM070     TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
ThNM037     TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
ThNM076     TTCCGCCGGG GGGGTNGTCC CNNGGCTCGG TGTGCCCCCG GGGCCCGTGC
ThNM032     TTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TGTGCCCCCG GGGCCCGTGC
ThNM075     CTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TTT--TGCCG GGGCGCGTGC
ThNM007     CTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TTT--TGCCG GGGCGCGTGC
Talthermo   CTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TTT--TGCCG GGGCGCGTGC
ThNM073     CTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TTT--TGCCG GGGCCCGTGC
ThNM002     CTCCGCCGGG GGGGTCGTCC CGGGGCGCGG TTT--TGCCG GGGCCCGTGC
AfumHQ6310  GGCCGCCGGG GAGGC-CTTG CGC------- -----CCCC- GGGCCCGCGC
ThNM0026A   -GCCGCCGGG GAGGC-CTTG CGC------- -----CCCC- GGGCCCGCGC
ThNM025a    -GCCGCCGGG GAGGC-CTTG CGC------- -----CCCC- GGGCCCGCGC
Aspni5      -GCCGCCGGG GGGGCGCCTC TGC------- -----CCCCC GGGCCCGTGC
ThNM001     --CCGCCGGG GGGCGTGTCC CGC------- -----CCCC- GGGCCCGCGC
ThaurantT8  --CCGCCGGG GGGCGTGTCC CGC------- -----CCCC- GGGCCCGCGC
Newick tree format
( Physella_anatina/gb|AY651175.:0.00993,
Physa_heterostropha/gb|AY6511:0.00165,
Physa_acuta/gb|AY651188.1|:0.00598) :0.00687) :0.00137,
Physella_virgata/gb|AY651170.1:0.00474,
Lymnaea_stagnalis/gb|EF489390.:0.07009,
Lymnaea_neotropica/emb|AM49400:0.07367) :0.00980,
Biomphalaria_glabrata/gi|34538:0.08976) :0.09811);
Branch length may or may not reflect distance or time
Three possible rooted trees with three taxa (an unrooted tree with three taxa is uninformative, because only one topology exists)
((A,B),C)   ((A,C),B)   ((B,C),A)
Four possible unrooted trees with four taxa: the three fully resolved (bifurcating) topologies AB|CD, AC|BD, and AD|BC, plus the unresolved star tree
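The number of fully resolved trees grows quickly with the number of taxa. A minimal sketch, assuming the standard double-factorial counts for bifurcating trees with labeled tips: (2n-5)!! unrooted and (2n-3)!! rooted. The function names are illustrative and not from any package.

```python
def double_factorial(k):
    """Product k * (k-2) * (k-4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted_trees(n_taxa):
    """Fully resolved unrooted topologies for n_taxa >= 3 labeled tips."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    return double_factorial(2 * n_taxa - 5)


def count_rooted_trees(n_taxa):
    """Fully resolved rooted topologies for n_taxa >= 2 labeled tips."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa")
    return double_factorial(2 * n_taxa - 3)


if __name__ == "__main__":
    for n in (3, 4, 5, 10, 20):
        print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

For three taxa this gives one unrooted and three rooted trees; for four taxa it gives three resolved unrooted trees. By 20 taxa the count is far too large to search exhaustively.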
Characters and character states
Important terms
- Characters and character states
- Ancestral versus derived (not primitive versus advanced)
Some examples of ancestral and derived characters
Taxa (tips of the tree): invertebrate animals, fish, humans, dogs, birds
Characters marked on the tree: upright posture, loss of body hair, feathers and feathered wings, bony limbs, bony skeleton, nervous system
Note: Ancestral and derived are relative terms. In this tree, a character is ancestral with respect to nodes higher in the tree, but derived with respect to nodes lower in the tree.
The problem of taxonomy not reflecting phylogeny
Polyphyletic groups contain taxa that are not derived from a single common ancestor
The old groups “algae” and “fungi” were polyphyletic
Tree tips: Brown Algae, Oomycete Fungi, Green Plants, Green Algae, True Fungi, Animals
Paraphyletic groups include a common ancestor but not all of its descendants: a lineage nested within the group has been split off as a separate taxonomic group of equal status
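The old “Reptilian” class is paraphyletic
Tree tips: Crocodiles, Birds, Lizards (crocodiles share a more recent common ancestor with birds than with lizards, yet birds were placed in a separate class)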
Estimating phylogeny (phylogenetics)
Distance methods (not “phylogenetic”)
- Examples: UPGMA, Fitch-Margoliash, Neighbor Joining
- Begin with a single measure of similarity or distance for every pair of taxa
Phylogenetic methods (“phylogenetic”)
- Examples: parsimony, maximum likelihood, Bayesian (MrBayes)
- Look at multiple discrete characters and use differences among character states to infer phylogeny
Distance (sometimes also called numerical) methods rely on a single numerical value that expresses the difference or similarity for any given pairwise comparison. With DNA data, a similarity value is usually obtained by dividing the number of matching nucleotide positions by the average length of the two sequences compared.
Species 1: 3510188 CTGATCCGAGGTCAACCTTGGGTT-GTGAAGGTCGTTTTACGGCTGGAAC 3510237
                   |||||||||||||||||||||| | | |||||||||||||||||||||||
Species 2: 562     CTGATCCGAGGTCAACCTTGGGGTCGCGAAGGTCGTTTTACGGCTGGAAC 513
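The calculation described above can be sketched for the aligned pair shown on this slide. This is a minimal illustration, not a particular program's method: the function name is hypothetical, gaps are assumed never to count as matches, and the distance is assumed to be one minus the similarity.

```python
def pairwise_similarity(aligned_a, aligned_b, gap="-"):
    """Return (matches, similarity) for two aligned sequences.

    similarity = matching positions / average ungapped length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # A column is a match when both residues are identical and not a gap.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap)
    # Sequence lengths exclude gap characters.
    len_a = len(aligned_a) - aligned_a.count(gap)
    len_b = len(aligned_b) - aligned_b.count(gap)
    return matches, matches / ((len_a + len_b) / 2)


species_1 = "CTGATCCGAGGTCAACCTTGGGTT-GTGAAGGTCGTTTTACGGCTGGAAC"
species_2 = "CTGATCCGAGGTCAACCTTGGGGTCGCGAAGGTCGTTTTACGGCTGGAAC"

matches, similarity = pairwise_similarity(species_1, species_2)
distance = 1 - similarity  # assumed convention: distance = 1 - similarity
print(matches, round(similarity, 3), round(distance, 3))
```

Here 47 of the aligned positions match and the average length is 49.5, so the similarity is 47 / 49.5, or about 0.949.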
Character #2 mapped onto the three possible trees (states: A = -, B = +, C = -, D = +)
Tree #1: A B | C D
Tree #2: A C | B D
Tree #3: A D | B C
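One way to compare the three trees for this character is to count the minimum number of state changes each requires (parsimony). The sketch below uses Fitch's counting procedure, which is a choice made here and is not named on the slide. It assumes the states A = -, B = +, C = -, D = + and the groupings AB|CD, AC|BD, AD|BC read from the figure. Each unrooted tree is rooted on its internal branch, which does not affect the count.

```python
def fitch(tree, states):
    """Return (possible_states, changes) for a nested-tuple binary tree.

    Leaves are taxon names; internal nodes are (left, right) tuples.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra change needed.
        return shared, left_changes + right_changes
    # Children disagree: one change is required on this part of the tree.
    return left_set | right_set, left_changes + right_changes + 1


character_2 = {"A": "-", "B": "+", "C": "-", "D": "+"}
trees = {
    "Tree #1": (("A", "B"), ("C", "D")),
    "Tree #2": (("A", "C"), ("B", "D")),
    "Tree #3": (("A", "D"), ("B", "C")),
}

scores = {name: fitch(tree, character_2)[1] for name, tree in trees.items()}
for name, changes in scores.items():
    print(name, changes)
```

Under these assumptions Tree #2 needs one change and Trees #1 and #3 need two each, so this character favors Tree #2. A full analysis would sum the counts over all characters.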
Some potential pitfalls with molecular data
Paralogs versus Orthologs
Orthologs - homologous genes that reflect speciation
Paralogs - homologous genes that reflect gene duplication = members of a gene family in a single organism (examples: alpha versus beta hemoglobin; red versus green visual pigment proteins)
It is important to distinguish between these when doing comparative analyses (it’s sometimes hard to tell)
The Problem of Multiple Hits
Among other problems, this causes “long-branch attraction”
Scoring in phylogenetic methods is model dependent
General idea applies to protein amino-acid sequences as well
Can convert to a scoring matrix based on log probabilities
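A minimal sketch of that conversion, assuming a simple model in which transitions (A<->G, C<->T) are more probable than transversions. The three probability values are hypothetical and chosen only for illustration; they are not taken from the slides or from any published matrix.

```python
import math

BASES = "ACGT"
PURINES = {"A", "G"}

# Hypothetical per-site probabilities (each row sums to 1):
P_SAME = 0.90          # base unchanged
P_TRANSITION = 0.06    # purine<->purine or pyrimidine<->pyrimidine
P_TRANSVERSION = 0.02  # each of the two possible transversions


def substitution_probability(x, y):
    """Probability of observing base y where the ancestor had base x."""
    if x == y:
        return P_SAME
    if (x in PURINES) == (y in PURINES):
        return P_TRANSITION
    return P_TRANSVERSION


def log_score_matrix():
    """Score for each pair of bases = natural log of its probability."""
    return {
        (x, y): math.log(substitution_probability(x, y))
        for x in BASES
        for y in BASES
    }


scores = log_score_matrix()
for x in BASES:
    print(x, " ".join(f"{scores[(x, y)]:6.2f}" for y in BASES))
```

Because the scores are logarithms, the probability of a whole set of independent sites becomes a sum of per-site scores. A different model, with different probabilities, yields a different matrix.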
7 1527
Physa_hete  ---------- ---------- ---------- --------AA CATTATATTT
Physa_acut  ---------- ---------- ---------- --------AC CATTATATTT
Physella_a  ---------- ---------- ---------- --------AA CATTATATTT
Physella_v  ---------- ---------- ---------- --------AA CATTATATTT
Lymnaea_st  ---------- ---------- ---------- ---------- -----TTTAT
Lymnaea_ne  ---------- ---------- ---------- GATATTGGTA CTTTATATAT
Biomphalar  TTGCGTTGAC TCTTTTCAAC AAACCATAAA GATATTGGTA CTTTGTACAT

            AATTTTTGGG ATCTGGTGTG GATTGGTCGG TACAGGTTTA AGCTTGTTAA
            AATTTTTGGT GTTTGATGCG GTTTAGTGGG AACAGGTTTA TCCTTATTAA
            AATCTTTGGA ATCTGATGCG GGTTAGTAGG GACTGGATTG TCTTTATTAA
            AATTTTTGGA ATTTGGTGTG GTCTAGTTGG TACTGGATTA TCATTATTGA
etc.
Bootstrap Analysis
The same 7-taxon, 1527-column alignment as above is the input: each bootstrap replicate is built by resampling its columns with replacement.
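The resampling step can be sketched as follows. This is a minimal illustration that uses only the ten columns 41-50 of the first alignment block above. The function name is hypothetical, and the tree-building step that would follow each replicate is omitted.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one pseudo-replicate: columns drawn with replacement.

    alignment maps taxon name -> aligned sequence (all of equal length).
    The replicate has the same taxa and the same number of columns.
    """
    n_columns = len(next(iter(alignment.values())))
    picked = [rng.randrange(n_columns) for _ in range(n_columns)]
    return {
        name: "".join(seq[i] for i in picked)
        for name, seq in alignment.items()
    }


# Columns 41-50 of the 7-taxon alignment shown earlier.
alignment = {
    "Physa_hete": "CATTATATTT",
    "Physa_acut": "CATTATATTT",
    "Physella_a": "CATTATATTT",
    "Physella_v": "CATTATATTT",
    "Lymnaea_st": "-----TTTAT",
    "Lymnaea_ne": "CTTTATATAT",
    "Biomphalar": "CTTTGTACAT",
}

rng = random.Random(0)  # fixed seed so the run is repeatable
replicate = bootstrap_alignment(alignment, rng)
for name, seq in replicate.items():
    print(f"{name}  {seq}")
```

In a full analysis, a tree is estimated from each of many such replicates, and the support for a group is the fraction of replicate trees in which that group appears.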