Introduction to Phylogenetic Systematics


Introduction to Phylogenetic Systematics Mark Fishbein Dept. Biological Sciences Mississippi State University 13 October 2003

Which of these critters are most closely related?

alligator gila monster purple gallinule ? gopher tortoise kingsnake

Phylogeny Branching history of evolutionary lineages New branches arise via speciation Speciation occurs when gene flow is severed between populations Phylogenetic relationships depicted as a tree

© W. S. Judd, et al., Plant Systematics

© W. S. Judd, et al., Plant Systematics

Phylogenetic data: Morphology, Secondary chemistry, Cytology, Allele frequencies, Protein sequences, Restriction sites, DNA sequences. (A brace on the slide marks a subset of these as “Molecular” data.)

Molecular (genetic) data Proteins Serology (immunoassay) Isozymes (electrophoretic variants) Amino acid sequences DNA Structural (translocations, inversions, duplications) Restriction sites DNA sequences Substitutions Insertions/Deletions

What are genes? From Raven et al. (1999), Biology of Plants

Genomes All of the genes within a cell are the genome Genes located in the nucleus are the nuclear genome Other genomes (organellar) Mitochondrion: mitochondrial genome Chloroplast: plastid genome

nucleus chloroplast mitochondrion From Raven et al., 1999, Biology of Plants

Comparison of Genomes
                       Nuclear      Mitochondrial / Plastid
Size                   Large        Small
Number                 Multiple     Single
Shape of chromosomes   Linear       Circular
Ploidy                 Diploid      Haploid
Inheritance            Biparental   Uniparental

Structural rearrangements Inversion Crossing over, duplication, and loss From Freeman and Herron (1998), Evolutionary Analysis

Chemistry of Genes DNA: two antiparallel strands linked together. Each strand is a linear array of units called nucleotides. Each nucleotide consists of a phosphate, a sugar (deoxyribose), and one of four bases: Adenine (“A”), Cytosine (“C”), Guanine (“G”), Thymine (“T”)

From Raven et al. (1999), Biology of Plants

DNA structure Paired strands are linked by bases A must bond with T G must bond with C Each link is composed of a purine and a pyrimidine A & G are purines C & T are pyrimidines
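The pairing rules on this slide can be written down as a small lookup table. The following is a minimal sketch; the function names are illustrative, and the example strand is borrowed from the simple alignment later in the talk.

```python
# Pairing rules from the slide: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}
PURINES = {"A", "G"}  # C and T are the pyrimidines

def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIR[base] for base in strand)

def links_join_purine_to_pyrimidine(strand: str) -> bool:
    """Check that every base pair joins exactly one purine and one pyrimidine."""
    return all((base in PURINES) != (PAIR[base] in PURINES) for base in strand)

print(complement("GTACGTTG"))
print(links_join_purine_to_pyrimidine("GTACGTTG"))
```

The complement is written here in the same left-to-right order as the input; reversing it would give the partner strand read in its own direction.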

DNA function DNA is code for making proteins (and a few other molecules) Proteins are the structures and enzymes that catalyze biochemical reactions that are essential for the function of an organism DNA code is read and converted to protein in two steps Transcription: DNA is copied to messenger RNA Translation: messenger RNA is template for protein

DNA code A gene is a code composed of a string of nucleotide bases (A’s, C’s, G’s, T’s) A protein is composed of a string of amino acids (there are 20) How does the DNA code get translated into protein?

DNA code Each amino acid is coded for by at least one triplet of nucleotide bases in DNA Each triplet is called a codon There are 64 possible codons (4 bases, 3 positions: 4^3 = 64)
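The codon count can be checked by enumerating every triplet. This is a small sketch; `split_into_codons` is an illustrative helper, not something from the talk, and it assumes the reading frame starts at the first base.

```python
from itertools import product

BASES = "ACGT"

# Every ordered triplet of the four bases: 4 * 4 * 4 of them.
CODONS = ["".join(triplet) for triplet in product(BASES, repeat=3)]

def split_into_codons(sequence: str) -> list[str]:
    """Cut a sequence into consecutive triplets, dropping any incomplete tail."""
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]

print(len(CODONS))
print(split_into_codons("GTACGTTG"))
```

With 64 codons and only 20 amino acids, several codons must map to the same amino acid, which is why the slide says "at least one triplet".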

From Raven et al. (1999), Biology of Plants

DNA functional classes Coding Proteins (exons) Ribosomes (RNA) Transfer RNA “Non-coding” Introns Spacers

From Raven et al. (1999), Biology of Plants

Homology in Molecular Systematics Assess orthology Align sequences Homology is often implicit (is this a good thing?)

DNA Sequences and Homology Homology: similarity due to common descent How do we assess homology of DNA sequences? Levels of homology Locus Allele Nucleotide position

From W. P. Maddison (1997), Systematic Biology 46:527

Orthology vs. Paralogy DNA sequences that are at homologous loci are orthologous DNA sequences that are similar due to duplication but are at different loci are paralogous Orthology may be best detected with a phylogenetic analysis of all sequences

From Martin & Burg (2002), Systematic Biology 51:578

Multiple Sequence Alignment Goal: create data matrix in which columns are homologous positions Problem: sequences vary in length Why? Insertions Deletions

Simple Sequence Alignment Taxon 1 GTACGTTG Taxon 2 GTACGTTG Taxon 3 GTACGTTG Taxon 4 GTACATTG Taxon 5 GTACATTG Taxon 6 GTACATTG


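DNA Sequence Data Matrix [table on slide: rows are taxa (T2 to T6 are legible), columns are aligned nucleotide positions, cells are bases (G, T, A, C)]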

Slightly Less Simple Sequence Alignment Taxon 1 AGAGTGAC Taxon 2 AGAGTGAC Taxon 3 AGAGTGAC Taxon 4 AGAGGAC Taxon 5 AGAGGAC Taxon 6 AGAGGAC

Slightly Less Simple Sequence Alignment Taxon 1 AGAGTGAC Taxon 2 AGAGTGAC Taxon 3 AGAGTGAC Taxon 4 AGAG-GAC Taxon 5 AGAG-GAC Taxon 6 AGAG-GAC

Alignment Gaps Gaps are inserted to maximize homology across nucleotide positions Gaps are hypothesized indels Inserting a gap assumes that an indel event is a better explanation of the differences among sequences than nucleotide substitution

Taxon 1 AGAGTGAC Taxon 2 AGAGTGAC Taxon 3 AGAGTGAC Taxon 4 AGAGGAC Without a gap: 3 substitutions, 0 indels. With a gap (Taxon 4 as AGAG-GAC): 0 substitutions, 1 indel.
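The two counts can be reproduced by walking the columns of each candidate alignment. This is a sketch under two assumptions that the slide does not state: a run of consecutive gap characters counts as one indel event, and a trailing base with no partner in the ungapped comparison is ignored.

```python
def score_alignment(row_a: str, row_b: str) -> tuple[int, int]:
    """Return (substitutions, indels) for two aligned rows.

    Columns are compared pairwise; zip() stops at the shorter row,
    so an unmatched trailing base is ignored (an assumption).
    """
    substitutions = 0
    indels = 0
    in_gap = False
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            if not in_gap:       # a new gap run starts one indel event
                indels += 1
            in_gap = True
        else:
            in_gap = False
            if a != b:
                substitutions += 1
    return substitutions, indels

print(score_alignment("AGAGTGAC", "AGAGGAC"))    # Taxon 1 vs ungapped Taxon 4
print(score_alignment("AGAGTGAC", "AGAG-GAC"))   # Taxon 1 vs gapped Taxon 4
```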

Ambiguous Alignment with a Single-Base Indel Taxon 1 GGTCAG Taxon 2 GGCCAA Taxon 3 AGCTAA Taxon 4 AGCAA Taxon 5 AGCAA Taxon 6 AGCAA

Ambiguous Alignment with a Single-Base Indel
          Alignment 1   Alignment 2
Taxon 1   GGTCAG        GGTCAG
Taxon 2   GGCCAA        GGCCAA
Taxon 3   AGCTAA        AGCTAA
Taxon 4   AG-CAA        AGC-AA
Taxon 5   AG-CAA        AGC-AA
Taxon 6   AG-CAA        AGC-AA
Each alignment implies 4 substitutions and 1 indel.

Gap Number and Length All else being equal, is it better to assume fewer longer gaps, or more shorter gaps? In other words, what is more likely: For a new indel to occur? For an existing indel to lengthen? There is no general answer! Alternate alignments are explored algorithmically

Alignment Algorithms Typically built up from pairwise alignments, using assumed gap costs Problem: most algorithms require an initial tree to define alignment order--bias Solution: simultaneous tree estimation and alignment optimization Problems: costly, unjustifiable parameters
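A pairwise alignment under assumed gap costs can be sketched with the classic Needleman-Wunsch dynamic programme. The scoring values below (match +1, mismatch -1, gap -2) are arbitrary assumptions chosen for illustration, and a single linear gap cost is used rather than separate opening and extension penalties.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global pairwise alignment with a linear gap cost.

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best of diagonal, up, or left.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

best, row_a, row_b = needleman_wunsch("AGAGTGAC", "AGAGGAC")
print(best)
print(row_a)
print(row_b)
```

Changing the gap cost relative to the mismatch cost changes which alignment wins, which is the sense in which the result depends on assumed parameters.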

Clustal Alignment Algorithm Creates alignment based on penalties for gap opening (number of gaps) and gap extension (gap length) Multiple alignment built according to guide tree determined by pairwise alignments Order of adding sequences determined by a guide tree

Clustal Alignment Algorithm (1) Distance matrix calculated from pairwise comparisons (2) Dendrogram calculated from distance matrix (3) Alignment calculated for most similar pair of sequences, based on alignment parameters (4) Additional sequences are added according to dendrogram, until all sequences are added
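The first step, a distance matrix from pairwise comparisons, can be sketched with a simple proportion-of-differences distance. This is not Clustal's own distance measure. It assumes the rows are already aligned, skips any column that contains a gap, and uses the first four rows of the second alternative alignment from the ambiguous-alignment slide as example data.

```python
from itertools import combinations

def p_distance(row_a: str, row_b: str) -> float:
    """Proportion of differing bases over the columns where neither row has a gap."""
    compared = 0
    differing = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        differing += a != b
    return differing / compared

rows = {
    "Taxon 1": "GGTCAG",
    "Taxon 2": "GGCCAA",
    "Taxon 3": "AGCTAA",
    "Taxon 4": "AGC-AA",
}

# Upper triangle of the distance matrix, keyed by taxon pair.
distances = {(x, y): p_distance(rows[x], rows[y]) for x, y in combinations(rows, 2)}
closest = min(distances, key=distances.get)

for pair, d in distances.items():
    print(pair, round(d, 3))
print("most similar pair:", closest)
```

The most similar pair is where a progressive aligner would start; the remaining sequences join in the order given by a dendrogram built from the same matrix.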

Tree-Based Alignment Simultaneous tree and alignment estimation using parsimony TreeAlign MALIGN Implement similar gap opening/extension costs These applications are very slow!

Alignment in the Future? Incorporate a more sophisticated understanding of molecular evolution in parameterization For example, what are realistic values of gap costs? Are they universal? Can phylogeny estimation proceed without optimizing alignments? Likelihood based methods can sum over all alignments Will require major contribution of biologists

Methods of tree estimation. Character based: Maximum parsimony (MP), fewest character changes; Maximum likelihood (ML), highest probability of observing data, given a model; Bayesian, similar to ML, but incorporates prior knowledge. Distance based: Minimum distance, shortest summed branch lengths.

Major classes of data: Character-based [table on slide: nucleotide states for Bird, Alligator, Lizard, Snake, Tortoise] and Distance-based [table on slide: pairwise distances among Alligator, Lizard, Snake, Tortoise, Bird, with values from 0.00 to 1.00]

Minimum Distance [figure on slide: pairwise distance matrix among Alligator, Lizard, Snake, Tortoise, Bird, with values from 0.00 to 1.00]

Maximum Parsimony [figure on slide: character matrix for Bird, Alligator, Lizard, Snake, Tortoise, with changes in characters 1 (A), 2 (C), and 4 (G) marked; characters 3 and 5 are slightly more complicated...]

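Parsimony Criterion $L(\tau) = \sum_{k=1}^{B} \sum_{j=1}^{N} w_j \, \mathrm{diff}(x_1, x_2)$, where L = tree length, $\tau$ = topology, k = branch, B = number of branches, j = character, N = number of characters, $w_j$ = weight of character j, and diff(x1, x2) = number of steps in character j along branch k, whose end nodes have states x1 and x2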

Parsimonious Character Reconstruction To evaluate the parsimony of a tree, each character is optimized (then the sum is computed) Several parsimony algorithms have been developed that optimize character reconstructions Algorithms differ in assumptions about permissible transformations between character states
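One of the parsimony algorithms alluded to here is Fitch's, which assumes unordered states (any state may change to any other in one step). The sketch below applies it to the simple sequence alignment from earlier in the talk. Both tree topologies are hypothetical, trees are written as nested pairs, and equal weights are assumed unless supplied.

```python
def fitch(tree, states):
    """Return (possible_root_states, steps) for one character on a rooted binary tree.

    A leaf is a taxon name; an internal node is a (left, right) pair.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = fitch(tree[0], states)
    right_set, right_steps = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:                                   # children agree: no change needed
        return shared, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1

def tree_length(tree, matrix, weights=None):
    """Sum the weighted step counts over all characters (columns)."""
    n_chars = len(next(iter(matrix.values())))
    weights = weights or [1] * n_chars
    return sum(
        w * fitch(tree, {taxon: seq[j] for taxon, seq in matrix.items()})[1]
        for j, w in enumerate(weights)
    )

matrix = {
    "T1": "GTACGTTG", "T2": "GTACGTTG", "T3": "GTACGTTG",
    "T4": "GTACATTG", "T5": "GTACATTG", "T6": "GTACATTG",
}
grouped = ((("T1", "T2"), "T3"), (("T4", "T5"), "T6"))   # hypothetical topology
mixed = ((("T1", "T4"), "T2"), (("T3", "T5"), "T6"))     # hypothetical topology

print(tree_length(grouped, matrix))
print(tree_length(mixed, matrix))
```

The topology that keeps taxa 1 to 3 together needs fewer steps, so it is preferred under the parsimony criterion.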

Likelihood Criterion $L(\tau) = \prod_{j} l_j$, where L = tree likelihood, $\tau$ = topology, j = character (site), $l_j$ = likelihood of site j

Site Likelihoods $l_j$, the probability of the nucleotides of each sequence at a given site j, is the product of probabilities along the branches of the tree, summed over ancestral states. The probability along a branch depends on the probability of a substitution and the branch length.

Substitution Model Many models have been proposed Elements are the rate of substitution of one base for another, per site Rates are instantaneous (probability of change in a short period of time) Rates may be allowed to vary among sites
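As one concrete case, the Jukes-Cantor model (chosen here only because it is the simplest; the talk does not name a specific model) assumes a single rate for every substitution. Under that assumption, the probability that a site shows base y after a branch of length d, given that it started as base x, has a closed form. Branch length is taken to be in expected substitutions per site.

```python
import math

BASES = "ACGT"

def jc69_matrix(d: float) -> dict[str, dict[str, float]]:
    """Jukes-Cantor transition probabilities for a branch of length d
    (expected substitutions per site)."""
    decay = math.exp(-4.0 * d / 3.0)
    same = 0.25 + 0.75 * decay       # probability the base is unchanged
    different = 0.25 - 0.25 * decay  # probability of each specific other base
    return {x: {y: (same if x == y else different) for y in BASES} for x in BASES}

for d in (0.0, 0.1, 1.0, 10.0):
    p = jc69_matrix(d)
    print(d, round(p["A"]["A"], 4), round(p["A"]["G"], 4))
```

As the branch gets longer, every entry approaches 0.25, meaning the starting base no longer predicts the observed one.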

Maximum Likelihood 1 2 3 4 5 Bird A G T Alligator Lizard C Snake Tortoise Tree is selected that maximizes likelihood of observed sequences, given a model of substitution
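The likelihood of one candidate tree can be sketched with Felsenstein's pruning algorithm, which sums over ancestral states at each internal node. Everything specific below is an assumption made for illustration: the Jukes-Cantor model with equal base frequencies, the three-taxon topology, the branch lengths, and the two-site data.

```python
import math

BASES = "ACGT"

def jc_prob(x: str, y: str, d: float) -> float:
    """Jukes-Cantor probability of observing y after branch length d, starting from x."""
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay

def partial_likelihoods(node, site):
    """Likelihood of the data below `node` for each possible state at `node`.

    A leaf is a taxon name; an internal node is a tuple of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return {x: 1.0 if x == site[node] else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, d in node:
        below = partial_likelihoods(child, site)
        for x in BASES:
            # Sum over the unknown state at the lower end of this branch.
            result[x] *= sum(jc_prob(x, y, d) * below[y] for y in BASES)
    return result

def site_likelihood(tree, site) -> float:
    """Average the root partial likelihoods over equal base frequencies."""
    return sum(0.25 * p for p in partial_likelihoods(tree, site).values())

def log_likelihood(tree, matrix) -> float:
    """Sum of log site likelihoods, i.e. the log of their product."""
    n_sites = len(next(iter(matrix.values())))
    return sum(
        math.log(site_likelihood(tree, {t: seq[j] for t, seq in matrix.items()}))
        for j in range(n_sites)
    )

# Hypothetical tree and data.
archosaurs = (("Bird", 0.1), ("Alligator", 0.1))
tree = ((archosaurs, 0.05), ("Lizard", 0.2))
matrix = {"Bird": "AG", "Alligator": "AG", "Lizard": "AT"}

print(log_likelihood(tree, matrix))
```

A maximum likelihood search would repeat this calculation over many topologies and branch lengths and keep the tree with the highest value.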

The Molecular Systematics Revolution Dramatic increase in the size of data sets Characters, taxa Increased confidence in homology assessment? Computational advances Technology Algorithms and software

Large Data Sets Pre-1990: up to ~25 taxa (rarely 100), 1 gene or up to 100 morphological characters 1998: 2538 rbcL sequences of green plants; entire mtDNA sequences (15,000 bp) in animals 2004: ??

The Large Data Set Headache Problem: the large number of sequences, not the size of sequences Application of phylogenetic optimality criteria requires evaluation of all possible trees Algorithms guaranteed to find optimal solutions have limited applicability

The Problem of Finding Optimal Trees There are too many trees to evaluate! The number of possible topologies increases very rapidly with the number of taxa/samples There are $(2m-5)! \,/\, [2^{m-3}(m-3)!]$ unrooted trees, where m = number of taxa

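Taxa   Trees
3      1
4      3
5      15
7      945
9      135,135
(Further rows on the slide, whose taxon counts did not survive, compare the number of trees with the number of stars in the universe and atoms in the universe.)
From Hillis et al. (1996), Applications of Molecular Systematics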
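The count of unrooted trees, (2m-5)! divided by 2^(m-3) (m-3)!, can be evaluated directly and checked against the tabulated values. This is a short sketch; the function name is illustrative.

```python
from math import factorial

def unrooted_trees(m: int) -> int:
    """Number of unrooted bifurcating topologies for m taxa (m >= 3)."""
    if m < 3:
        raise ValueError("the formula applies to three or more taxa")
    return factorial(2 * m - 5) // (2 ** (m - 3) * factorial(m - 3))

for m in (3, 4, 5, 7, 9, 20, 50):
    print(m, unrooted_trees(m))
```

Python integers have arbitrary precision, so the function keeps returning exact counts well past the point where exhaustive evaluation stops being practical.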

Heuristic Methods Starting trees followed by rearrangement Starting trees sample “tree space” Rearrangements search for local optima How to get starting trees? How to rearrange trees? These methods are prone to getting trapped on local optima

Cutting-edge Methods. Ratchet: temporarily “warps” the character space. Annealing algorithms: accept suboptimal trees, with gradual movement towards optima. Genetic algorithms: analogous to evolution by natural selection.
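The idea of accepting suboptimal trees can be sketched with the standard Metropolis acceptance rule used in simulated annealing. The talk does not specify a rule, so this is an assumption: `delta` is how much worse the proposed tree scores (for example, extra parsimony steps), and `temperature` is lowered over the course of the search.

```python
import math
import random

def accept(delta: float, temperature: float, rng: random.Random) -> bool:
    """Metropolis rule: always accept an equal or better tree; accept a worse
    one with probability exp(-delta / temperature)."""
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)

rng = random.Random(0)
for temperature in (10.0, 1.0, 0.1):
    accepted = sum(accept(2.0, temperature, rng) for _ in range(1000))
    print(temperature, accepted)
```

At high temperature most worse trees are accepted, which lets the search leave a local optimum; as the temperature falls the search behaves more and more like plain hill climbing.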

Using Phylogenies Why are birds so different from their closest relatives? What genes are involved in the origin of novel traits (like wings)? Is the rate of molecular evolution in birds especially high?