Juan Daza, UCF, Fall 2008. Estimating divergence times from molecular data.

Presentation transcript:

Juan Daza, UCF, Fall 2008. Estimating divergence times from molecular data

Reconstructing the evolutionary process: history of life on Earth

Evolutionary process implies TIME. We are interested in determining how, where, why, and WHEN evolution occurs or has occurred. Genetic data; molecular evolution theory; molecular dating.

The general procedure of molecular dating: phylogram to ultrametric tree

The evolution of molecular dating (timeline): hemoglobin example; the term is introduced; neutral theory; statistical properties of clocks; Fitch's test

The evolution of molecular dating (timeline, continued): autocorrelation of rates; local clocks; branch pruning; NPRS; Bayesian; penalized likelihood; uncorrelated rates

The evolution of molecular dating

Amino acids; nucleotides. Pruning branches. Local clocks (PAML, PATHd8 packages). Relaxed clocks: correlated rates (r8s, Multidivtime); uncorrelated rates (BEAST).

Applications: species divergence; explosive radiations; gene evolution; rates estimation; virus epidemiology; historical demography. [Figure: bursts; axes: Time, Log (# lineages)]

The molecular clock hypothesis: the hypothesis of the molecular clock proposes that molecular evolution occurs at rates that persist through time and across lineages. [Figure labels: Constant, Burst] "The discovery of the molecular clock stands out as the most significant result of research in molecular evolution." (Wilson et al., 1977)

Emile Zuckerkandl and Linus Pauling: "…It is possible to evaluate very roughly and tentatively the time that has elapsed since any of the hemoglobin chains present in a given species and controlled by non-allelic genes diverged from a common chain ancestor.... From paleontological evidence it may be estimated that the common ancestor of man and horse lived in the Cretaceous or possibly the Jurassic period, say between 100 and 160 million years ago.... The presence of 18 differences between human and horse α-chains would indicate that each chain had 9 evolutionary effective mutations in 100 to 160 millions of years. This yields a figure of 11 to 18 million years per amino acid substitution in a chain of about 150 amino acids, with a medium [sic] figure of 14.5 million years…" (Zuckerkandl and Pauling, 1962) [Figure labels: Constant, Burst]
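The arithmetic in the quotation can be checked directly. This is a minimal sketch with illustrative variable names; treating 14.5 as the midpoint of the rounded range 11 to 18 is an assumption about how the quoted figure was obtained.

```python
# Check the hemoglobin arithmetic quoted from Zuckerkandl and Pauling (1962).
differences = 18                   # amino acid differences, human vs. horse chains
per_lineage = differences / 2      # each lineage accumulated half of them
age_min, age_max = 100.0, 160.0    # age of the common ancestor, millions of years

myr_per_sub_min = age_min / per_lineage   # about 11.1 Myr per substitution
myr_per_sub_max = age_max / per_lineage   # about 17.8 Myr per substitution

# Assumption: the quoted 14.5 is the midpoint of the rounded range (11 to 18).
midpoint = (round(myr_per_sub_min) + round(myr_per_sub_max)) / 2

print(per_lineage, round(myr_per_sub_min, 1), round(myr_per_sub_max, 1), midpoint)
```

The wide range comes entirely from the uncertainty in the fossil date, which is the same calibration problem discussed later in the talk.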

Emile Zuckerkandl and Linus Pauling [Figure labels: Constant, Burst]

The molecular clock hypothesis. [Figure labels: Constant, Burst] Equation legend: rate = number of substitutions per site per year; number of substitutions per site; divergence time between species i and j; confidence interval.
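Only the legend of this slide's equation survives in the transcript. Under a strict clock, the relation usually written for these quantities is T = K / (2r). The sketch below assumes that form; the function name and the example numbers are hypothetical.

```python
def divergence_time(subs_per_site, rate):
    """Strict-clock divergence time between species i and j.

    subs_per_site: substitutions per site separating the two sequences (K)
    rate: substitutions per site per year along one lineage (r)
    The factor 2 is there because both lineages change after the split.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return subs_per_site / (2.0 * rate)

# Hypothetical values: 10% divergence, 1e-8 substitutions/site/year.
t_years = divergence_time(0.10, 1e-8)
print(f"{t_years:.3g} years")
```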

The molecular clock hypothesis. [Figure label: Constant] Increase in genetic data; quantification of rates; understanding of molecular evolution; framework for hypothesis testing

The molecular clock hypothesis. [Figure label: Constant] Some biological attributes might be responsible for rate variation: differences in generation times; differences in population size; natural selection and its intensity

Log-likelihood ratio test. Null hypothesis: the phylogeny is rooted and the branch lengths are constrained such that all of the tips can be drawn on a single time plane. Alternative hypothesis: each branch length is allowed to vary independently. The test statistic is compared against a chi-square distribution with 3 d.f.
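The test on this slide can be sketched in a few lines. The log-likelihood values below are hypothetical placeholders for output from a likelihood program. The degrees of freedom are taken as (number of taxa - 2), the usual count for the clock test, which matches the slide's 3 d.f. if the example tree has 5 taxa (an assumption). The chi-square tail is computed with the standard library only.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed forms for df = 1 and df = 2, then step up by 2 with the recurrence
    # Q(k + 2) = Q(k) + half**(k/2) * exp(-half) / Gamma(k/2 + 1).
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1
    else:
        q, k = math.exp(-half), 2
    while k < df:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q

def clock_lrt(lnl_clock, lnl_free, n_taxa):
    """Likelihood ratio test of the clock (null) against free branch lengths."""
    stat = 2.0 * (lnl_free - lnl_clock)
    df = n_taxa - 2
    return stat, df, chi2_sf(stat, df)

# Hypothetical log-likelihoods for a 5-taxon data set.
stat, df, p_value = clock_lrt(lnl_clock=-2510.4, lnl_free=-2503.1, n_taxa=5)
print(f"2*dlnL = {stat:.2f}, df = {df}, p = {p_value:.4f}")
```

A small p-value means the clock is rejected for these data, which leads to the question on the next slides.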

What to do if the clock is rejected? Branch lengths measure the amount of evolution: BL = R*T

What to do if the clock is rejected? Phylogram to ultrametric tree: error in topology; error in branch lengths; error in rate optimization; error in calibration

What to do if the clock is rejected? …Go simple: eliminate branches (lineages) that are causing the clock to be rejected

Objective functions need to be developed to reduce dimensionality

Global clock to local clocks: assign specific rates (r1, r2) to specific parts of the tree and calculate divergence times. Packages: PAML, PATHd8

…What if it still doesn't work? We need to find the function that explains the data better. Relaxed clock methods (maximum likelihood and Bayesian inference): uncorrelated relaxed clocks; correlated relaxed clocks

Penalized Likelihood Method (Sanderson, 2002): a likelihood method to generate an ultrametric chronogram from a non-ultrametric tree. Finds the best-fitting model of rate evolution considering both: (1) how well the modeled changes explain the branch lengths; (2) the amount of rate change across the tree (less change = better). Rate correlation.

Penalized Likelihood Method (Sanderson, 2002): a topology with branch lengths is required. Absolute or relative dates can be obtained. The bootstrap method is used for confidence intervals (time consuming!!!). Fossil cross-validation.

Penalized Likelihood Method (Sanderson, 2002): maximizes the likelihood of the sequence data (X) given a combination of average rates (R) and times (T), with a penalty function to discourage rate change. [Equation labels: Likelihood, Penalty function]
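The equation on this slide did not survive in the transcript. The general shape of a penalized likelihood objective (log-likelihood minus a smoothing weight times a roughness penalty) can be sketched as follows. This is a simplified illustration, not the r8s implementation: it assumes a Poisson model for the substitutions on each branch and a squared-difference penalty between each branch and its parent branch, and it ignores any special treatment of the branches at the root. All names and numbers are hypothetical.

```python
import math

def penalized_log_likelihood(subs, rates, durations, parent, smoothing):
    """Log-likelihood minus smoothing * penalty, for one set of rates and times.

    subs[k]      observed substitutions on branch k
    rates[k]     substitutions per unit time on branch k
    durations[k] time spanned by branch k
    parent[k]    index of the parent branch, or None for branches at the root
    """
    log_lik = 0.0
    for x, r, t in zip(subs, rates, durations):
        mean = r * t                                   # expected substitutions
        log_lik += x * math.log(mean) - mean - math.lgamma(x + 1)  # Poisson log pmf
    # Penalty: squared rate change between each branch and its parent branch.
    penalty = sum((rates[k] - rates[p]) ** 2
                  for k, p in enumerate(parent) if p is not None)
    return log_lik - smoothing * penalty

# Hypothetical four-branch example: branches 2 and 3 descend from branch 0.
subs = [10, 12, 5, 6]
durations = [10.0, 20.0, 10.0, 10.0]
parent = [None, None, 0, 0]
clock_rates = [0.6, 0.6, 0.6, 0.6]
free_rates = [1.0, 0.6, 0.5, 0.6]

for lam in (0.0, 100.0):
    print(lam,
          round(penalized_log_likelihood(subs, clock_rates, durations, parent, lam), 3),
          round(penalized_log_likelihood(subs, free_rates, durations, parent, lam), 3))
```

With a large smoothing weight the penalty dominates and clock-like rates are favored; with a weight of zero every branch is free to take its own rate.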

Confidence intervals for Penalized Likelihood (Burbrink and Pyron, 2008). Equation legend: standard error of a bootstrap distribution; number of pseudoreplicates; estimate of time for a single node from a single bootstrap pseudoreplicate; mean date for the same node from all bootstrap pseudoreplicates.
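The legend above matches the usual bootstrap standard error, that is, the sample standard deviation of the node age across pseudoreplicates. The sketch below assumes that form, since the slide's own equation is not in the transcript. The ages are made-up numbers, and the normal-approximation interval at the end is an added assumption, not something stated on the slide.

```python
import math
import statistics

def bootstrap_se(node_ages):
    """Standard error of a node age from bootstrap pseudoreplicates."""
    b = len(node_ages)                       # number of pseudoreplicates
    mean_age = statistics.fmean(node_ages)   # mean date across pseudoreplicates
    return math.sqrt(sum((t - mean_age) ** 2 for t in node_ages) / (b - 1))

# Hypothetical ages (Myr) of one node from five bootstrap pseudoreplicates.
ages = [10.2, 11.0, 9.8, 10.5, 10.5]
se = bootstrap_se(ages)

# Assumption: an approximate 95% interval as mean +/- 1.96 * SE.
mean_age = statistics.fmean(ages)
print(round(se, 3), (round(mean_age - 1.96 * se, 2), round(mean_age + 1.96 * se, 2)))
```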

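Bayesian Inference (Thorne and Kishino, 2000; Drummond et al., 2006): uses Bayes' rule to estimate rates and dates. $p(T, R, v \mid X, C) = \dfrac{p(X \mid B)\, p(R \mid T, v)\, p(T \mid C)\, p(v)}{p(X \mid C)}$, where the left-hand side is the posterior, $p(X \mid B)$ is the likelihood, the remaining numerator terms are the priors, and $p(X \mid C)$ is the marginal probability of the data. [Labels: ages, tree, parameters, constraints]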

Bayesian Inference (Thorne and Kishino, 2000; Drummond et al., 2006): BL = R*T. Example: BL = 0.065 subs/site, with r = 0.1 and t = 0.65. [Figure builds: prior, then posterior]
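The example numbers on these slides (BL = 0.065, r = 0.1, t = 0.65) show why extra information is needed: a branch length fixes only the product of rate and time. A minimal sketch, where every rate other than 0.1 is a hypothetical alternative:

```python
# A branch length only constrains the product rate * time.
branch_length = 0.065            # subs/site, from the slide

# Each candidate rate implies a different time for the same branch length.
times = {rate: branch_length / rate for rate in (0.05, 0.1, 0.2)}

for rate, time in times.items():
    print(f"r = {rate:<4}  t = {time:.3f}  r*t = {rate * time:.3f}")
```

All three pairs explain the branch length equally well, so priors on rates or times (or calibrations) are what separate them.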

Thorne and Kishino, 1998: a topology is required. Branch lengths are estimated using the F84 model. The variance-covariance matrix of the branch lengths is also estimated. Several priors (e.g., time constraints, rates) can be included. MCMC methods are implemented to sample from the posterior. [Figure label: BL=0.065 subs/site]

Drummond et al., 2006: a topology is not required. Phylogeny and dates are estimated simultaneously. More complex models can be applied. Several priors (e.g., time constraints, rates) can be included. Distributions do not need to be normal. MCMC methods are implemented to sample from the posterior. [Figure label: BL=0.065 subs/site]

Coalescent theory and molecular dating. Coalescent: a stochastic process that describes how population genetic processes determine the shape of the genealogy of sampled gene sequences. Coalescent + molecular dating: test hypotheses about historical demography. [Figure labels: E, O; examples: HCV, Bison]

The methods seem to be more realistic, but… Are they more accurate in the real world? How do we know if a method is appropriate?

There are many factors that can affect divergence times (BL = R*T): uncertainty of phylogenetic relationships; rates of evolution are unknown for many organisms; rate heterogeneity (no molecular clock); lack of calibration points (fossils, biogeographic events).

Gene tree vs. species tree: coalescent times; divergence times; time of cladogenetic event = TMRCA

New World Crotalinae

Fossil quality. Calibration error includes several components: fossil misidentified (belongs elsewhere and calibrates a different node); fossil misdated (uncertainty in determining the absolute age of the fossil); non-preservation (the fossil never gives the true origin; impossible to avoid)

Fossil quality: fossil cross-validation (Near et al., 2005). Test the effect of each fossil on the time estimates. We left one fossil and re-estimated the dates of the remaining fossils using r8s. [Figure labels: Consistent, Inconsistent]

Fossil quality. Parameters: average difference between molecular ages and fossil ages; sum of squares of the differences; standard deviation; effect of removing inconsistent fossils. [Figure label: fossil inconsistency]
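The cross-validation parameters listed above can be sketched as follows. The formulas are my reading of the Near et al. (2005) procedure and should be treated as an assumption: each fossil in turn is used as the only calibration, the other fossil nodes are dated molecularly, and the differences from their fossil ages are summarized. The input matrix would come from a dating program such as r8s; here it is filled with made-up numbers.

```python
import math

def fossil_cross_validation(fossil_ages, molecular_ages):
    """Summaries of fossil vs. molecular age differences.

    fossil_ages[i]       fossil age of node i
    molecular_ages[x][i] molecular age of node i when fossil x is the
                         only calibration (hypothetical input)
    """
    n = len(fossil_ages)
    mean_diff, sum_sq = [], []
    for x in range(n):
        diffs = [molecular_ages[x][i] - fossil_ages[i] for i in range(n) if i != x]
        mean_diff.append(sum(diffs) / (n - 1))      # average difference for fossil x
        sum_sq.append(sum(d * d for d in diffs))    # sum of squared differences
    s = math.sqrt(sum(sum_sq) / (n * (n - 1)))      # overall spread across fossils
    return mean_diff, sum_sq, s

# Made-up example with three fossils (ages in Myr).
fossil_ages = [10, 20, 30]
molecular_ages = [
    [10, 21, 29],   # calibrated with fossil 0
    [9, 20, 31],    # calibrated with fossil 1
    [20, 35, 30],   # calibrated with fossil 2
]
mean_diff, sum_sq, s = fossil_cross_validation(fossil_ages, molecular_ages)
worst = sum_sq.index(max(sum_sq))   # candidate "inconsistent" fossil
print(mean_diff, sum_sq, round(s, 3), worst)
```

A fossil with a much larger sum of squares than the others is a candidate for removal, after which the spread can be recomputed to see the effect.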

Fossil quality. [Figure: s vs. number of fossils removed; labels: inconsistency, fossil calibration, Fossil 1, overestimation, underestimation, Best?]

Rate heterogeneity: use of all fossils; different values of λ (the parameter that relaxes the molecular clock in Penalized Likelihood); estimation of divergence times using r8s

Rate heterogeneity. [Figure labels: cross-validation score; substitution rate ratio; Log(λ); clock behavior]

Outgroup choice: 5 different outgroups, chosen by their distance to the ingroup (number of internal branches). Optimization of branch lengths using likelihood and the GTR+Γ+I model. Estimation of divergence times using the Mean Path Length method (PATHd8). [Figure labels: ingroup, outgroup 1, outgroup 2, outgroup 3]

Outgroup choice. [Figure axes: Node, Estimated age]

The saturation problem. [Figure: GTR distance vs. uncorrected distance, for 2nd and 3rd codon positions]
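Saturation shows up as a growing gap between uncorrected and model-corrected distances. The sketch below uses the Jukes-Cantor (JC69) correction as a simple stand-in for the GTR distance plotted on the slide. It is a different and much simpler model, chosen only because it has a closed form. Function names and the toy sequences are hypothetical.

```python
import math

def p_distance(seq_a, seq_b):
    """Uncorrected distance: proportion of sites that differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned and non-empty")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)

def jc69_distance(p):
    """Jukes-Cantor corrected distance for an uncorrected distance p."""
    if p >= 0.75:
        return math.inf      # fully saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

print(p_distance("ACGT", "ACGA"))
for p in (0.05, 0.1, 0.3, 0.5, 0.6):
    d = jc69_distance(p)
    print(f"uncorrected = {p:.2f}  corrected = {d:.3f}  hidden changes = {d - p:.3f}")
```

The hidden changes grow quickly with divergence, which is why fast sites such as 3rd codon positions flatten out on the uncorrected axis.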

The saturation problem: calibration BELOW the target, OVERESTIMATION. [Figure labels: Target, Calibration]

The saturation problem: calibration ABOVE the target, UNDERESTIMATION. [Figure labels: Target, Calibration]

Model priors. Parameters required to derive posterior densities. Phylogenetics: topology, node support. DTE: credibility intervals of dates. Implemented in the Multidistribute package (baseml, estbranches, multidivtime). We tested: expected time (rttm, rttmsd); expected rate (rtrate, rtratesd); bigtime; brownmean; minab; fossils.

Model priors: rttm, rttmsd

Model priors: bigtime, rtrate

Model priors: brownmean, minab

Model priors: fossils (with vs. without)

Locus: mean

Locus: SD (717 bp, 669 bp, 417 bp, 503 bp)

Locus: partitioned vs. unpartitioned (M, SD, Lower, Upper)

MCMC (Date, Upper)

The final result…you hope is the best estimate!!!!

MY final remarks: Hedges is always wrong!! Graur and Martin were wrong!!! OK, to some extent! Time estimation using molecular data is a very useful tool in the advance of evolutionary theory. Divergence time estimation procedures should take into account factors other than violations of molecular clock assumptions in order to avoid spurious results.