Introduction to Molecular Evolution

Introduction to Molecular Evolution Mike Thomas October 3, 2002

What we can learn from multiple sequence alignments: An alignment is a hypothesis about the relatedness of a set of genes. This information can be used to reconstruct the evolutionary history of those genes. The history of the genes can provide us with information about the structure, function, and significance of a gene or family of genes. We can also use the reconstructed history to test hypotheses about evolution itself: rates of change, the degree of change, implications of change, etc. We can then pose and test hypotheses about the evolution of phenomena unrelated to the genes: evolution of flight in insects, evolution of humans, evolution of disease.

Assumptions made by phylogenetic methods: The sequences are correct. The sequences are homologous. Each position is homologous. The sampling of taxa or genes is sufficient to resolve the problem of interest. Sequence variation is representative of the broader group of interest. Sequence variation contains sufficient phylogenetic signal (as opposed to noise) to resolve the problem of interest. Each position in the sequence evolved independently.

How do you extract this information from an alignment?

Answer: a tree. A phylogenetic tree is a hierarchical, graphical representation of relationships. [Figures: Haeckel’s Tree of Life, “higher” organisms; Haeckel’s Tree of Life, “lower” organisms.]

Other Ways to Represent Phylogenies Cladogram showing the phylogenetic relationships between four species. Relationships of the same four species represented as a set of nested parentheses. Evolutionary relationships of the same four species with nine synapomorphies (shared, derived characters) plotted on the branches.

Using Phylogeny to Understand Gene Duplication and Loss A gene tree. The gene tree superimposed on a species tree, allowing identification of the duplication and loss events.

Problems with Phylogenetic Inference How do we know what the potential candidate trees are? How do we choose which tree is (most likely) the true tree?

Number of Possible Trees
Number of taxa or genes: number of possible rooted trees
3: 3
4: 15
5: 105
7: 10,395
[Figure: the three possible rooted trees for taxa A, B, and C.]

Recipe for reconstructing a phylogeny: (1) Select an optimality criterion. (2) Select a search strategy. (3) Use the selected search strategy to generate a series of trees, and apply the selected optimality criterion to each tree, always keeping track of the “best” tree examined thus far.

Search strategy: Which is the right tree? When m is the number of taxa, the number of possible rooted, bifurcating trees is $N = \frac{(2m-3)!}{2^{m-2}(m-2)!}$. For 10 taxa, the number of trees is 34,459,425. Many trees can be discarded because they are obviously wrong. Sometimes, there is a general or even specific grouping that can serve as a start for the tree search. There are a number of approaches to tree searches that can be used.
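The tree-count formula is easy to check numerically. The following is a minimal sketch in Python; the function name `count_rooted_trees` is a hypothetical helper chosen for illustration, and it assumes the count refers to rooted, bifurcating trees with labeled tips.

```python
from math import factorial


def count_rooted_trees(m: int) -> int:
    """Number of rooted, bifurcating trees for m labeled taxa.

    Implements (2m-3)! / (2^(m-2) * (m-2)!), which is defined for m >= 2.
    """
    if m < 2:
        raise ValueError("need at least two taxa")
    # Integer division is exact here because the result is always an integer.
    return factorial(2 * m - 3) // (2 ** (m - 2) * factorial(m - 2))


if __name__ == "__main__":
    for taxa in (3, 4, 5, 7, 10):
        print(taxa, count_rooted_trees(taxa))
```

The count grows so quickly that exhaustive evaluation stops being practical after a modest number of taxa, which is why the choice of search strategy matters.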

Search Strategies
Strategy: Type
Stepwise addition: Algorithmic
Star decomposition: Algorithmic
Exhaustive: Exact
Branch & bound: Exact
Branch swapping: Heuristic
Genetic algorithm: Heuristic
Markov Chain Monte Carlo: Heuristic
But we still need to evaluate the trees in order to identify the one most likely to be the true tree.

Choose an optimality criterion to evaluate trees: Commonalities can be found, but how can these be used to evaluate a tree?

General differences between optimality criteria
Criteria compared: Minimum evolution, Maximum Parsimony, Maximum Likelihood.
Model based vs. “model free”.
Can account for many types of sequence substitutions vs. assumes that all substitutions are equal.
Works well with strong or weak sequence similarity vs. works only when sequence similarity is high.
Computationally fast vs. computationally slow.
Well understood statistical properties (easy to test) vs. poorly understood statistical properties (hard to test).
Branch lengths: can accurately estimate branch lengths (important for molecular clocks); cannot estimate branch lengths accurately; can estimate branch lengths with some degree of accuracy.

Maximum Parsimony: The parsimony score is the minimum number of required changes, or steps. Only shared, derived characters are used. The score for each character (site) is called the character score. The character scores added over all sites give the tree length. The tree (out of all examined trees) with the lowest tree length is the most parsimonious tree… and, under this criterion, the most likely to be the true tree.
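One common way to compute the character score on a fixed tree is Fitch's algorithm, which the slides do not name; the sketch below uses it as an assumption. The tree encoding (nested 2-tuples of taxon names), the function names, and the four-taxon alignment are all hypothetical and exist only for illustration.

```python
def fitch_site_score(tree, states):
    """Minimum number of changes for one site on a rooted binary tree.

    tree   -- a taxon name (str) or a 2-tuple of subtrees
    states -- dict mapping taxon name to its character state at this site
    """
    def visit(node):
        if isinstance(node, str):            # leaf: its state set, no changes
            return {states[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        shared = left_set & right_set
        if shared:                           # children agree: no extra change
            return shared, left_cost + right_cost
        # children disagree: one change is required on this part of the tree
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def tree_length(tree, alignment):
    """Sum of the character scores over all sites of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site_score(tree, {taxon: seq[i] for taxon, seq in alignment.items()})
        for i in range(n_sites)
    )


if __name__ == "__main__":
    # Hypothetical alignment of four taxa, four sites each.
    alignment = {"A": "AAGT", "B": "AAGC", "C": "GCGT", "D": "GCTC"}
    tree1 = (("A", "B"), ("C", "D"))
    tree2 = (("A", "C"), ("B", "D"))
    print(tree_length(tree1, alignment), tree_length(tree2, alignment))
```

Here the first tree needs 5 steps and the second needs 6, so the first is the more parsimonious of the two. This sketch scores every site; sites that are not parsimony-informative add the same number of steps to every tree and so do not change the ranking.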

Example: Maximum Parsimony. [Figure: two candidate trees for taxa F, G, H, and X, with morphological characters (number of teeth, toes, ribs, and vertebrae; lobe shape; leg length) mapped onto the branches. One tree has a tree length of 6 steps; the other has a tree length of 12 steps.]

Simple example of parsimony with sequence data

Another example with nucleotide data Alignment of four hypothetical DNA sequences. Most parsimonious rooted cladogram for this alignment. Corresponding unrooted cladogram.

Issues & problems with parsimony: Multiple trees may be the most parsimonious (have the same tree length). A consensus tree can be constructed to visualize the congruity & discontinuity between these. Branch lengths (and, therefore, rates of change) cannot be accurately estimated. No explicit model of change is used, even when one might be well supported. The most parsimonious tree(s) may not be the true tree.

Minimum Evolution (Distance): All data are used, even though some may not be shared, derived characters. The branch lengths represent the distance between a taxon and an ancestor, given an assumed model of evolution. The pairwise distances are calculated for each pair of taxa, given an assumed model of evolution. The tree length is the sum of the branch lengths across a tree. The tree (out of all examined trees) with the lowest tree length is the minimum evolution tree… and, under this criterion, the most likely to be the true tree.

The tree is different than a parsimony tree Hypothetical evolutionary relationships between three DNA sequences, in which the horizontal branch lengths are proportional to the number of character-state changes along the branches. Topology of the parsimonious cladogram that would be constructed from the sequence similarities produced by such an evolutionary history if multiple substitutions had occurred at several sites.

Models of evolution: choosing parameters. Factors that affect phylogenetic inference: relative base frequencies (A, G, T, C); transition/transversion ratio; number of substitutions per site; number of nucleotides (or amino acids) in the sequence; different rates in different parts of the molecule; synonymous/non-synonymous substitution ratio; substitutions that are uninformative or obfuscatory (parallel substitutions, convergent substitutions, back substitutions, coincidental substitutions). In general, the more factors that are accounted for by the model (i.e., more parameters), the larger the error of estimation. It is often best to use fewer parameters by choosing the simpler model.

Some distance models: p-distance. $p = n_d / n$, where $n$ is the number of sites (nucleotides or amino acids) compared, and $n_d$ is the number of differences between the two sequences examined. Very robust when divergence times are recent and the effect of complicating phenomena is minor.
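As a minimal sketch, the p-distance can be computed directly from two aligned sequences. The function name `p_distance` is hypothetical, and the sketch assumes the sequences are already aligned, are the same length, and contain no gaps or ambiguity codes.

```python
def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of aligned sites at which the two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    if not seq1:
        raise ValueError("sequences must not be empty")
    # n_d: count the positions where the two sequences disagree
    n_d = sum(a != b for a, b in zip(seq1, seq2))
    return n_d / len(seq1)


if __name__ == "__main__":
    # One difference in eight sites.
    print(p_distance("ACGTACGT", "ACGTACGA"))
```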

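Some distance models: Jukes-Cantor. Used to estimate the number of substitutions per site. The expected number of substitutions per site is $d = 2rt = 6\alpha t = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$, where $r = 3\alpha$ is the substitution rate per site, $t$ is the time since the two sequences diverged, and $p$ is the proportion of differences between the 2 sequences. Variance can be calculated. The model assumes equal nucleotide frequencies and a single rate for every type of substitution; it makes no allowance for unequal frequencies or differential substitution rates. [Substitution matrix: rows and columns A, T, C, G; every off-diagonal entry is $\alpha$.]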
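The Jukes-Cantor correction can be sketched as a one-line function of the observed proportion of differences. The function name is hypothetical; the input check reflects the fact that the logarithm is only defined for p below 0.75.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Estimated substitutions per site from the observed proportion p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in the range [0, 0.75)")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


if __name__ == "__main__":
    # The corrected distance exceeds p because some sites changed more than once.
    for p in (0.01, 0.1, 0.3, 0.5):
        print(p, round(jukes_cantor_distance(p), 4))
```

For small p the correction is negligible, which is consistent with the p-distance being adequate for recently diverged sequences.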

Some distance models: Kimura two-parameter. Used to estimate the number of substitutions per site. $d = 2rt$, where $r$ is the substitution rate (per site, per year) and $t$ is the time since the two sequences diverged; $r = \alpha + 2\beta$, so $d = 2\alpha t + 4\beta t$, where $\alpha$ is the transition rate and $\beta$ is the transversion rate. Accounts for different transition and transversion rates. Does not account for unequal nucleotide frequencies; the variance is greater than for Jukes-Cantor. [Figure: transitions occur within the pyrimidines (C, T) or within the purines (A, G); transversions occur between a pyrimidine and a purine.] These are treated the same for long divergence times.
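The slide gives the model in terms of rates; in practice the distance is estimated from the observed proportions of transitions (P) and transversions (Q) with the standard Kimura two-parameter estimator, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q). The sketch below applies that estimator, which is not written out in the slides. The function name is hypothetical, and the input is assumed to be aligned, gap-free, uppercase DNA.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        pair = {a, b}
        # A<->G and C<->T are transitions; every other mismatch is a transversion.
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    n = len(seq1)
    P, Q = transitions / n, transversions / n
    if 1 - 2 * P - Q <= 0 or 1 - 2 * Q <= 0:
        raise ValueError("sequences are too divergent for this estimator")
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)


if __name__ == "__main__":
    # One transition (A/G) and one transversion (A/C) in ten sites.
    print(round(k2p_distance("AAAAAAAAAA", "GAAAAAAAAC"), 4))
```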

Other models: Hasegawa, Kishino, Yano (HKY): corrects for unequal nucleotide frequencies and takes transition/transversion bias into account. Unrestricted model: allows different rates between all pairs of nucleotides. General Time Reversible model: allows different rates between all pairs of nucleotides and corrects for unequal nucleotide frequencies. Many other models have been invented to correct for specific problems. The more parameters are introduced, the larger the variance becomes.

Ways to build trees with distance models: ME. Minimum Evolution (ME) trees can be found by exhaustive searches or heuristic searches (starting with a reasonable tree or eliminating unlikely possible trees). For each tree examined, the total tree length is calculated as the sum of branch lengths calculated using a given model. ME, like Maximum Parsimony, may generate a number of equal-scoring ME trees and may not actually result in the true tree.

Ways to build trees with distance models: UPGMA. UPGMA (unweighted pair-group method using arithmetic averages) is generally accurate for molecular evolution when substitution rates are relatively constant, but this can rarely be assumed to be true. Method: (1) Distances for each pair of taxa are computed using the chosen distance method. (2) The pair with the smallest value d is combined into a single, composite taxon. (3) The distances from this composite taxon to all other taxa are computed. (4) The next pair with the smallest d is chosen (including consideration of pairings with the composite taxon), and the process repeats until all taxa are joined.
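The UPGMA steps can be sketched in a few lines. This is an illustrative sketch only: the function name, the nested-tuple tree representation, and the four-taxon distance matrix are hypothetical, and branch lengths (node heights) are omitted to keep it short. The distance from a composite taxon to another taxon is the average over its members, weighted here by cluster size.

```python
from itertools import combinations


def upgma(names, matrix):
    """Return the UPGMA topology as nested 2-tuples of taxon names."""
    sizes = {name: 1 for name in names}            # cluster -> number of taxa in it
    dist = {frozenset((a, b)): matrix[i][j]
            for (i, a), (j, b) in combinations(enumerate(names), 2)}
    while len(sizes) > 1:
        # Choose the pair of current clusters with the smallest distance d.
        a, b = min(combinations(sizes, 2), key=lambda pair: dist[frozenset(pair)])
        merged, new_size = (a, b), sizes[a] + sizes[b]
        # Distance from the composite taxon to every remaining cluster.
        for c in sizes:
            if c not in (a, b):
                dist[frozenset((merged, c))] = (
                    sizes[a] * dist[frozenset((a, c))]
                    + sizes[b] * dist[frozenset((b, c))]
                ) / new_size
        del sizes[a], sizes[b]
        sizes[merged] = new_size
    return next(iter(sizes))


if __name__ == "__main__":
    # Hypothetical distances that fit a constant-rate (clock-like) tree.
    taxa = ["A", "B", "C", "D"]
    d = [[0, 2, 6, 10],
         [2, 0, 6, 10],
         [6, 6, 0, 10],
         [10, 10, 10, 0]]
    print(upgma(taxa, d))
```

With these distances A and B are joined first, then C, then D, which is the grouping expected when rates are constant.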

Ways to build trees with distance models: Neighbor Joining. Neighbor Joining (NJ) is a very robust method that is accurate even when substitution rates are not constant, and generally recovers the ME tree (although this is not always the case). Method: (1) We construct a “star” tree and compute the sum of all branches, $S_0$ (this will be greater than the sum of all branches for the final tree, $S_F$). (2) We then pick a pair of taxa to be “neighbors” (say, taxa 1 & 2) and compute the sum of all branches, $S_{1,2}$. (3) All other pairs of taxa are then placed as neighbors and the sum of all branches is computed for each. (4) The neighbors whose pairing results in the greatest reduction in the sum of all branches are kept. (5) Then, another round of neighbor joining is conducted, including the neighbor pair retained in the first round.
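One round of neighbor selection can be sketched with the Q criterion, Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k). This formulation is not in the slides; it is used here on the assumption, standard in the literature, that the pair minimizing Q is the pair that minimizes the sum of all branches. The function name and the five-taxon distance matrix are hypothetical.

```python
def nj_pick_neighbors(matrix):
    """Return ((i, j), q) for the pair of taxa with the smallest Q value."""
    n = len(matrix)
    if n < 3:
        raise ValueError("need at least three taxa")
    totals = [sum(row) for row in matrix]          # sum of distances from each taxon
    best_pair, best_q = None, None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * matrix[i][j] - totals[i] - totals[j]
            if best_q is None or q < best_q:       # keep the first minimum found
                best_pair, best_q = (i, j), q
    return best_pair, best_q


if __name__ == "__main__":
    # Hypothetical distances between five taxa (indices 0..4).
    d = [[0, 5, 9, 9, 8],
         [5, 0, 10, 10, 9],
         [9, 10, 0, 8, 7],
         [9, 10, 8, 0, 3],
         [8, 9, 7, 3, 0]]
    print(nj_pick_neighbors(d))
```

In this example taxa 0 and 1 are chosen as neighbors even though taxa 3 and 4 have the smallest raw distance, which shows how NJ differs from simply joining the closest pair.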

Example: The evolution of flight in stoneflies. Reconstruction of the order Plecoptera (stoneflies) from 18S rRNA sequence. Kimura 2-parameter distance used. Tree rooted with known outgroup species. Neighbor-Joining tree building method used to construct the first tree; a tree search was conducted to ensure that the NJ tree was also the ME tree. Characters related to flight were then mapped onto the tree. [Figure labels: defined outgroup taxa; scale, in substitutions/site.]

Maximum Likelihood: The site likelihoods represent the probability of the data for one site given an assumed model of evolution. The overall likelihood is the product of the site likelihoods. Trees are evaluated by comparing log-likelihood scores. Likelihood scores are comparable across models as well as trees, so they provide a way of testing the goodness of fit of a model. The tree (out of all examined trees) with the highest likelihood score is the maximum likelihood tree… and, under this criterion, the most likely to be the true tree.
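Because the overall likelihood is a product of many small numbers, it is usually handled as a sum of logarithms. The sketch below shows only that bookkeeping; the function names are hypothetical and the site likelihoods are made-up numbers, since computing real site likelihoods requires a substitution model and a tree traversal that are beyond this example.

```python
import math


def log_likelihood(site_likelihoods):
    """Log of the product of the site likelihoods, computed as a sum of logs."""
    return sum(math.log(value) for value in site_likelihoods)


def best_tree(site_likelihoods_by_tree):
    """Name of the tree with the highest log-likelihood score."""
    return max(site_likelihoods_by_tree,
               key=lambda name: log_likelihood(site_likelihoods_by_tree[name]))


if __name__ == "__main__":
    # Made-up site likelihoods for two candidate trees.
    candidates = {
        "tree1": [0.020, 0.015, 0.030, 0.010],
        "tree2": [0.018, 0.012, 0.025, 0.004],
    }
    for name, values in candidates.items():
        print(name, round(log_likelihood(values), 3))
    print("preferred:", best_tree(candidates))
```

Log-likelihoods are negative, so the preferred tree is the one whose score is closest to zero.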

Examples of phylogenetic reconstructions. Uses of phylogenetic trees. Other research using molecular evolution. Next Thursday: exam 1. All material through next Tuesday (10/8) will be covered by the exam.