Methods of molecular phylogeny Peter Norberg (Peter.norberg@gu.se)
Content Introduction to Evolution and taxonomy Phylogenetic analysis Algorithmics Applied phylogenetics Computer Software Practical session
Evolution Charles Darwin “Tree of life” Phylogenetic tree Root = ancestor of all species
Rooted or unrooted trees? Trees show evolutionary relationships The root shows direction
Different representations [Figure: trees of taxa A, B, C and D drawn in several different layouts]
Trees can be based on: Outer appearances (for example the shape of bills) Functionality Complexity A combination of… DNA, RNA, AA, gene order…
Phylogenetic trees based on DNA [Figure: the sequences AATTGGCC, AATAGGCC, AATAGGCA, AGTTGGCG, AATAGGAC, TATTGGCG and AATTGGCG, to be related in a tree]
Phylogenetic trees based on DNA [Figure: tree relating the sequences AATTGGCC, AATAGGCC, AATAGGAC, AATTGGCG, AGTTGGCG, TATTGGCG and AATAGGCA]
Genomic region Same genomic region for all taxa! Not too similar Not too diverged Insertions/deletions
Sequence alignment
Not aligned:
(1) AATGGCAACCGCATTCAGGATTTAA
(2) AATGGTAACCGCAAGGATTTAA
(3) ATGGTAACCGCATTGAGGATTTAA
(4) AATGGTAACCGCATTCAGGAATTA
(5) TGGTAACCGCATTCAGGATTTAA
Aligned (gaps shown as -):
(1) AATGGCAACCGCATTCAGGATTTAA
(2) AATGGTAACCGCAA---GGATTTAA
(3) -ATGGTAACCGCATTGAGGATTTAA
(4) AATGGTAACCGCATTCAGGAATTA-
(5) --TGGTAACCGCATTCAGGATTTAA
Sequence alignment, our example
AATTGGCC
AATAGGCC
AATAGGCA
AGTTGGCG
AATAGGAC
TATTGGCG
AATTGGCG
(All seven sequences have the same length, so the aligned and unaligned versions are identical.)
Phylogenetic principles Similar DNA sequences = closely related Inherited mutations. Simplest “route”! Homoplasy unlikely (not always true).
Homology vs. homoplasy Homology = similarity due to a common ancestor Homoplasy = similarity due to convergent evolution, but independent origins
Algorithms for constructing phylogenetic trees What is an algorithm? Several different phylogenetic algorithms exist. How do they work?
Algorithms for constructing phylogenetic trees Distance matrices (Neighbour Joining, UPGMA) Maximum Parsimony Maximum Likelihood Bayesian inference
Distance matrices
Based on the genetic distance
Genetic distance based on nucleotide substitutions
Typically # of differences / total # of nt

(1) AATTCCGG
(2) AATACCGG
(3) AATTAATG

Number of differences:
     1      2      3
1    0
2    1      0
3    3      4      0

Differences / total # of nt (8):
     1      2      3
1    0
2    0.125  0
3    0.375  0.5    0
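The calculation on this slide can be reproduced with a few lines of Python. This is a minimal sketch that assumes the sequences are already aligned and of equal length; the function names `p_distance` and `distance_matrix` are made up for illustration and do not come from any phylogenetics package.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return differences / len(a)


def distance_matrix(seqs):
    """Symmetric matrix of pairwise p-distances."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = p_distance(seqs[i], seqs[j])
    return matrix


# The three sequences from the slide, in the order 1, 2, 3.
sequences = ["AATTCCGG", "AATACCGG", "AATTAATG"]
matrix = distance_matrix(sequences)
for row in matrix:
    print(" ".join(f"{value:.3f}" for value in row))
```

The lower triangle of the printed matrix holds the values 0.125, 0.375 and 0.5 from the slide.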
Neighbour Joining
Cluster in pairs
Shortest distance first => Similar sequences located closely together in the tree
Fast algorithm!

     1      2      3
1    0
2    0.125  0
3    0.375  0.5    0

[Figure: tree with tips 2, 1 and 3; schematic tree with tips A, B, C and D]
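The "shortest distance first" idea can be sketched as a simple pairwise clustering. Note the assumption: the code below uses plain average-linkage clustering (as in UPGMA), not the full Neighbour Joining criterion, which additionally corrects each pairwise distance by the average distance of each taxon to all others. The function name `cluster_shortest_first` is hypothetical.

```python
def cluster_shortest_first(labels, dist):
    """Repeatedly merge the two closest clusters (average linkage, UPGMA-like).

    labels: list of taxon names; dist: symmetric distance matrix.
    Returns nested tuples describing the order in which clusters were joined.
    """
    # Each cluster is (subtree, indices of the original taxa it contains).
    clusters = [(label, [i]) for i, label in enumerate(labels)]

    def average_distance(members_a, members_b):
        total = sum(dist[i][j] for i in members_a for j in members_b)
        return total / (len(members_a) * len(members_b))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_distance(clusters[a][1], clusters[b][1])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = ((clusters[a][0], clusters[b][0]),
                  clusters[a][1] + clusters[b][1])
        del clusters[b]  # b > a, so delete b first to keep index a valid
        del clusters[a]
        clusters.append(merged)
    return clusters[0][0]


# Distance matrix from the slide (taxa 1, 2, 3).
labels = ["1", "2", "3"]
dist = [
    [0.0, 0.125, 0.375],
    [0.125, 0.0, 0.5],
    [0.375, 0.5, 0.0],
]
tree = cluster_shortest_first(labels, dist)
print(tree)
```

With the matrix from the slide, sequences 1 and 2 (distance 0.125) are paired first and sequence 3 is attached afterwards.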
Maximum Parsimony Utilizes so-called informative sites. Simplest path (fewest mutations) Build all possible trees. Choose the tree that requires the fewest mutations. Relatively fast
Maximum Parsimony, example [Figure: alternative trees for taxa 1, 2, 3 and 4 with the mutation "a" marked on their branches; sequences AATTCC, AAGTCC, AAGTCT]
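Counting the mutations a tree requires can be sketched with Fitch's algorithm, a standard way to score one alignment column on a given tree. The four-taxon alignment below is hypothetical (it is not the example from the slide), and the function names are illustrative only.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, mutation count) for one column.

    tree: a taxon name or a 2-tuple of subtrees.
    states: dict mapping taxon name -> nucleotide in this column.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one mutation is needed on this branch pair.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Total number of mutations the tree requires over all columns."""
    length = len(next(iter(alignment.values())))
    total = 0
    for site in range(length):
        column = {taxon: seq[site] for taxon, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


# Hypothetical aligned sequences for four taxa (assumption, not slide data).
alignment = {"1": "AATTCC", "2": "AATTCC", "3": "AAGTCT", "4": "AAGTCT"}

# The three possible groupings of four taxa.
topologies = [
    (("1", "2"), ("3", "4")),
    (("1", "3"), ("2", "4")),
    (("1", "4"), ("2", "3")),
]
scores = {tree: parsimony_score(tree, alignment) for tree in topologies}
best = min(scores, key=scores.get)
print(best, scores[best])
```

Only columns 3 and 6 vary in this toy alignment, and both support grouping 1 with 2 and 3 with 4, so that tree needs the fewest mutations.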
Maximum Likelihood and Bayesian inference Statistical methods that include an evolutionary model. Combine the likelihoods of all columns (sites). Calculate the likelihood for all possible trees. Good but slow! Bayesian inference is faster.
To test all possible trees Is it possible? => Takes too much time! Analyzing 20 taxa gives ~10^22 different possible trees (10,000,000,000,000,000,000,000). What to do? => Use sophisticated algorithms to limit the search space. These usually produce good results, but not necessarily the best tree.
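The size of the search space can be checked with the standard counting formulas for fully bifurcating trees: (2n-3)!! rooted trees and (2n-5)!! unrooted trees for n taxa. The sketch below assumes these formulas; the function names are illustrative.

```python
def double_factorial(k: int) -> int:
    """Product k * (k - 2) * (k - 4) * ... down to 1 or 2."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees for n >= 2 taxa."""
    return double_factorial(2 * n - 3)


def unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees for n >= 3 taxa."""
    return double_factorial(2 * n - 5)


for n in (4, 10, 20):
    print(n, f"{rooted_trees(n):.2e}", f"{unrooted_trees(n):.2e}")
```

For 20 taxa the rooted count is about 8.2 x 10^21, in line with the ~10^22 quoted on the slide; the unrooted count is smaller by a factor of 37.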
To root an unrooted tree Include an “outgroup” Outgroup = more distantly related (but not too distantly) Place the root where the outgroup connects to the tree
Rooting a tree [Figure: a tree of taxa A, B, C, D, E, F and G shown unrooted and rooted with an outgroup]
Significance Is the tree reliable? Is it the only probable tree? Bootstrap, jackknife etc.
Bootstrap Construct several new sequence sets (e.g. 1000). Each new set is generated by randomly picking columns, with replacement, from the original set. Apply the phylogenetic algorithm to all sets. Make one consensus tree from all trees.
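Bootstrapping
Original alignment:
A: AACTTAACCACGCTATCGATGCAATTATATA
B: AATTTGACTGCGGTACCGATCCAATTATATA
C: AATTTGACTGGGCTACCGATCCAATTATATA
D: AACTTAACCGCGCTACTGATCGAATTATATA
Example of randomly picked columns:
A: CACC
B: TGCT
C: TGCT
D: CAGC
[Figure: three alternative trees for A, B, C and D (tip orders A D B C; A C B D; A B C D) with the values 96, 1 and 3]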
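The column resampling step can be sketched as follows. This is a minimal illustration that only builds the bootstrap sequence sets; building a tree from each set and summarising them in a consensus tree is not shown. The function name `bootstrap_replicate` and the fixed random seed are assumptions made for the example.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Draw columns with replacement to build one new alignment.

    alignment: dict mapping taxon name -> aligned sequence.
    rng: a random.Random instance, so results can be reproduced.
    """
    length = len(next(iter(alignment.values())))
    # The same randomly chosen column indices are used for every taxon.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


# Alignment from the slide.
alignment = {
    "A": "AACTTAACCACGCTATCGATGCAATTATATA",
    "B": "AATTTGACTGCGGTACCGATCCAATTATATA",
    "C": "AATTTGACTGGGCTACCGATCCAATTATATA",
    "D": "AACTTAACCGCGCTACTGATCGAATTATATA",
}

rng = random.Random(42)  # fixed seed only to make the example repeatable
replicates = [bootstrap_replicate(alignment, rng) for _ in range(1000)]
for name, seq in replicates[0].items():
    print(name, seq)
```

Each replicate has the same number of taxa and columns as the original; some original columns appear several times and others not at all.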
Pitfalls? Homoplasy (convergent evolution) - Selection pressure Hypervariable regions Random events Gene duplication Recombination - Different regions have different ancestries
Recombination [Figure: recombination between A and B giving rise to recombinants]
Detection of recombinants [Figure: tree of taxa A, B, C, D, E, F, G, H, I and X]
Detection of recombinants [Figure: H and X compared with taxa A to I]
Phylogenetic networks [Figure: trees and networks of taxa A, B, C, D and R]
Applied phylogenetics Reconstruct evolutionary history Animals, plants, bacteria, viruses, plasmids, … Establish evolutionary mechanisms Functional studies Trace pandemic diseases Forensic medicine
Examples
Practical session
Phylip Software package for phylogenetic analysis Several small (command-line) applications Many different algorithms Widely used by the scientific community
seqboot -> Constructs bootstrap sets
dnapars -> Constructs a maximum parsimony tree
consense -> Constructs a consensus tree
drawtree -> Draws the tree
Herpes Simplex Virus Type 1 & 2 Usually asymptomatic Cause oral and genital lesions, encephalitis, meningitis and keratitis Transmitted via direct contact Lifelong infection in the sensory ganglia HSV-1: 70-80%, HSV-2: 20-30% ~100 nm in diameter. Capsid surrounded by envelope. Different glycoproteins in envelope. Photo by Linda M. Stannard, University of Cape Town.
HSV-1 US7 (Glycoprotein I)
Clinical samples