Leptothorax gredosi, Leptothorax racovitzae, Camponotus herculeanus [tree figure; node values 0.58-1.00]. Thomas Bayes 1702-1761.

Presentation transcript:

Leptothorax gredosi Leptothorax racovitzae Camponotus herculeanus Thomas Bayes

Bayesian inference. Computational phylogenetics, CSC. Mikko Kolkkala

How to read a tree?

Bayesian inference. Phylogenetic applications only very recently ("Why?" We'll return to that…). Controversial philosophy. Subjective probability concept: degrees of belief measured as probabilities. A learning process: prior and posterior probabilities. Example: spam filters. "Subjective! Quack!"

Notation: p = probability; D = data; Θ = model/hypothesis/parameters; | = conditional probability, read "provided that".

An example. Suppose we have ten identical-looking dice: nine ordinary, and one loaded so that a six appears with probability 1/2. Let's pick one die at random. The probability of it being loaded is (of course) 1/10 (= prior). Next, we roll the die once and get a six. What is the probability that we have picked the loaded die now?

$$p(\text{loaded die} \mid \text{a six}) = \frac{p(\text{a six} \mid \text{loaded die})\, p(\text{loaded die})}{p(\text{a six})} = \frac{1/2 \cdot 1/10}{1/2 \cdot 1/10 + 1/6 \cdot 9/10} = \frac{1}{4} \quad (= \text{posterior})$$
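The dice calculation can be checked with a few lines of Python. This is only a sketch: the function name `posterior` and its argument names are made up for this illustration, and exact fractions are used so that the result is not blurred by rounding.

```python
from fractions import Fraction


def posterior(prior, likelihood, likelihood_alt):
    """Bayes' theorem for a hypothesis H against its alternative.

    prior          -- p(H)
    likelihood     -- p(D | H)
    likelihood_alt -- p(D | not H)
    """
    # p(D): total probability of the data under both hypotheses.
    evidence = likelihood * prior + likelihood_alt * (1 - prior)
    return likelihood * prior / evidence


p_loaded_given_six = posterior(
    prior=Fraction(1, 10),          # one die in ten is loaded
    likelihood=Fraction(1, 2),      # p(a six | loaded die)
    likelihood_alt=Fraction(1, 6),  # p(a six | ordinary die)
)
print(p_loaded_given_six)  # 1/4
```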

An exercise: a reliable test? Test for a rare disease (prevalence 0.1 %). Disease: positive result with probability 0.99. No disease: positive result with probability [...]. What is the probability that an individual who tests positive does not have the disease? Answer: 0.98
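The exercise is the same calculation as the dice example. The false-positive probability did not survive in the slide text, so the value 0.05 used below is an assumption, chosen only because it reproduces the stated answer of 0.98; the function name is also made up for this sketch.

```python
def p_healthy_given_positive(prevalence, sensitivity, false_positive_rate):
    """p(no disease | positive test) by Bayes' theorem."""
    p_positive_and_sick = sensitivity * prevalence
    p_positive_and_healthy = false_positive_rate * (1 - prevalence)
    return p_positive_and_healthy / (p_positive_and_sick + p_positive_and_healthy)


# ASSUMPTION: 0.05 is not from the slide; it is a guess that matches "Answer: 0.98".
ASSUMED_FALSE_POSITIVE_RATE = 0.05

answer = p_healthy_given_positive(
    prevalence=0.001,   # 0.1 %
    sensitivity=0.99,   # p(positive | disease)
    false_positive_rate=ASSUMED_FALSE_POSITIVE_RATE,
)
print(round(answer, 2))  # 0.98
```

Because the disease is rare, almost all positive results come from the large healthy group, even though the test looks reliable.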

$$p(\text{model} \mid \text{data}) = \frac{p(\text{data} \mid \text{model})\, p(\text{model})}{p(\text{data})}$$

"loaded die" → model; "a six" → data

From dice to biology. Data: DNA alignment. Models: nucleotide substitution models, tree shape and branch lengths.

$$p(\text{model} \mid \text{data}) = \frac{p(\text{data} \mid \text{model})\, p(\text{model})}{p(\text{data})}$$

p(model | data) = posterior distribution; p(model) = prior distribution; p(data | model) = likelihood function.

If this Bayesian thing is so excellent, why hasn't it been used in phylogenetic analyses? No one can solve the equations! Numerical solutions are possible, but only with powerful computers.

MCMC = Markov chain Monte Carlo. Parameters: tree topology, branch lengths, probabilities for nucleotide substitutions. "Exploring the tree space". [Figure: probability plotted over parameter space, © Fredrik Ronquist]
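To make the idea concrete, here is a minimal Metropolis sampler in Python. It is a toy sketch, not what a phylogenetics program does: instead of trees it explores a single parameter, the probability θ of rolling a six, given hypothetical data (3 sixes in 10 rolls) and a uniform prior. All names and numbers are assumptions made for this illustration.

```python
import math
import random


def log_posterior(theta, sixes=3, rolls=10):
    """Unnormalised log posterior of theta = p(six) under a uniform prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # outside the parameter space
    return sixes * math.log(theta) + (rolls - sixes) * math.log(1.0 - theta)


def metropolis(n_steps, step_size=0.1, start=0.5, seed=1):
    """Random-walk Metropolis: propose a nearby value, accept or stay put."""
    rng = random.Random(seed)
    theta = start
    current = log_posterior(theta)
    samples = []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step_size)  # symmetric proposal
        candidate = log_posterior(proposal)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


samples = metropolis(20000)
kept = samples[5000:]  # discard the first quarter as burn-in
posterior_mean = sum(kept) / len(kept)
print(round(posterior_mean, 2))
```

For this toy problem the exact posterior is a Beta(4, 8) distribution with mean 1/3, so the sample mean can be compared with a known answer. The normalising constant p(data) is never needed, which is what makes the approach usable when the equations cannot be solved.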

Metropolis-coupled Markov chain Monte Carlo: MCMCMC = (MC)³. "Heated chains": "flattened" parameter landscape.

(MC)³ [figure © John Huelsenbeck]

(MC)³: swap of states [figure © John Huelsenbeck]
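A sketch of the two ingredients of Metropolis coupling, written in Python. The heating formula is one commonly used incremental scheme and the value `temp=0.2` is an arbitrary illustration; both are assumptions, as are the function names. A heated chain samples from the posterior raised to a power β < 1, which flattens the landscape, and two chains exchange states with the usual Metropolis acceptance rule.

```python
import math


def heat(chain_index, temp=0.2):
    """Power beta applied to the posterior of chain i (index 0 = cold chain).

    ASSUMPTION: incremental heating, beta = 1 / (1 + temp * i).
    """
    return 1.0 / (1.0 + temp * chain_index)


def swap_probability(log_post_i, log_post_j, beta_i, beta_j):
    """Probability of accepting a swap of states between chains i and j.

    log_post_i, log_post_j -- log posterior of the current states of i and j
    beta_i, beta_j         -- heats of the two chains
    """
    log_ratio = (beta_i - beta_j) * (log_post_j - log_post_i)
    return math.exp(min(0.0, log_ratio))


# The heated chain has found a better state than the cold chain: always swap.
print(swap_probability(-10.0, -2.0, heat(0), heat(5)))  # 1.0
# The cold chain already holds the better state: swap only rarely.
print(swap_probability(-2.0, -10.0, heat(0), heat(5)))
```

Only the cold chain (β = 1) is sampled; the heated chains act as scouts that can cross valleys in the landscape and hand good states over.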

Probability values are obtained directly. No need for bootstrapping.

Standard models: JC, F81, K80, HKY85, K81, TrN, TIM, TVM, SYM, GTR, etc. Substitution types: 1-6. Nucleotide frequencies: equal / estimated from the data. Invariable sites: no / estimated ("+I"). Evolutionary rate: equal / Γ-distributed ("+G").

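JC (Jukes-Cantor): a single substitution rate a and equal nucleotide frequencies, π_A = π_C = π_G = π_T = 1/4. Rate matrix (rows and columns A, C, G, T):

$$Q_{\mathrm{JC}} = \begin{pmatrix} \cdot & a & a & a \\ a & \cdot & a & a \\ a & a & \cdot & a \\ a & a & a & \cdot \end{pmatrix}$$

GTR: general time-reversible model.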
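For the Jukes-Cantor model the substitution probabilities over a time t have a simple closed form, sketched here in Python. The function name is made up; `a` is the rate of each off-diagonal entry of the rate matrix and `t` the elapsed time.

```python
import math


def jc_probabilities(a, t):
    """Jukes-Cantor probabilities after time t.

    Returns (p_same, p_diff): the probability that a site shows the same
    nucleotide, and the probability that it shows one specific other nucleotide.
    """
    decay = math.exp(-4.0 * a * t)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    return p_same, p_diff


for t in (0.0, 0.1, 1.0, 100.0):
    p_same, p_diff = jc_probabilities(a=1.0, t=t)
    # 3 * p_diff is the chance of seeing any difference at the site.
    print(t, round(p_same, 4), round(3 * p_diff, 4))
```

As t grows, every nucleotide becomes equally likely (probability 1/4), so the chance of observing a difference at a site levels off at 0.75.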

Characters independent? No way. Time reversible: G → C = C → G? RNA genes.

SSR models (site-specific rates): a different evolutionary rate for the 1st, 2nd and 3rd codon positions. Problematic (see Buckley et al., Syst. Biol. 50:67-86). Coding regions.

But how to choose the model? Well, nobody said it would be easy. How many parameters does it take to fit an elephant? 30.

“What do you consider the largest map that would be really useful?" "About six inches to the mile." "Only six inches! […] We actually made a map of the country, on the scale of a mile to the mile!" (Lewis Carroll 1893)

Choosing a model. AIC (Akaike information criterion), AICc (corrected Akaike information criterion, for small samples), BIC (Bayesian information criterion). Programs: Modeltest (bad), FindModel (plop!), MrAic?
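The three criteria are simple functions of the maximised log-likelihood ln L, the number of free parameters k and the sample size n. The Python sketch below uses the standard formulas; the candidate models' log-likelihoods, the parameter counts (substitution-model parameters only) and the alignment length are invented numbers for illustration, not results from any real data set.

```python
import math


def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_lik


def aicc(log_lik, k, n):
    """AIC with the small-sample correction term."""
    return aic(log_lik, k) + 2 * k * (k + 1) / (n - k - 1)


def bic(log_lik, k, n):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * log_lik


# HYPOTHETICAL values: model -> (log-likelihood, number of free parameters)
candidates = {"JC": (-2100.0, 0), "HKY85": (-2050.0, 4), "GTR": (-2048.0, 8)}
n_sites = 500  # hypothetical alignment length, used as the sample size

for name, (log_lik, k) in candidates.items():
    print(name, aic(log_lik, k), round(aicc(log_lik, k, n_sites), 2),
          round(bic(log_lik, k, n_sites), 2))

# The preferred model is the one with the lowest score.
best_by_aic = min(candidates, key=lambda m: aic(*candidates[m]))
print(best_by_aic)  # HKY85
```

In this made-up example GTR fits slightly better than HKY85, but not by enough to pay for its four extra parameters: the elephant problem in numbers.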

Most commonly used program: MrBayes. Future? Alignment and phylogeny co-estimation: BAli-Phy (Redelings & Suchard 2005), Beast (Lunter et al. 2005).
Redelings, B. D. & Suchard, M. A. 2005: Joint Bayesian estimation of alignment and phylogeny. Syst. Biol. 54:
Lunter, G. et al. 2005: Bayesian coestimation of phylogeny and sequence alignment. BMC Bioinformatics 6:83

Travelling salesman: find the shortest route through cities (another NP-complete problem). Cities (N) and routes (N!): 10! = … ; …! = 1.7 x … World record: Sweden, … cities, 84.8 CPU years. How about studying them all? At a rate of … million routes/sec. it would take 5 x 10^84 years. …! = ?
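How quickly N! outruns any computer is easy to see in Python. The rate of one million routes per second below is an assumed figure for illustration (the slide's own numbers are incomplete in this transcript), and the function name is made up.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_to_enumerate(n_cities, routes_per_second):
    """Years needed to look at all N! orderings of N cities."""
    return math.factorial(n_cities) / routes_per_second / SECONDS_PER_YEAR


ASSUMED_RATE = 1e6  # ASSUMPTION: routes examined per second

for n in (10, 20, 50):
    print(n, math.factorial(n), years_to_enumerate(n, ASSUMED_RATE))
```

Ten cities are done in a few seconds; fifty cities would already take vastly longer than the age of the universe, which is why exhaustive search is not an option and sampling methods are used instead.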

Acknowledgements: Fredrik Ronquist, John Huelsenbeck, Wife and Mom

MrBayes. Ronquist, F. & Huelsenbeck, J. 2001: Bioinformatics 17: (2005: v. 3.1.)
-Command-line interface
-UNIX, Macintosh and PC platforms

MrBayes resources: homepage, manual, wiki, FAQ, mailing list (archives)

Running the analysis. All you have to do: type execute filename.nex* at the MrBayes > prompt and press enter.
* Replace filename.nex with your nexus file containing the MrBayes commands (type the full path if the file is not in the same folder as the MrBayes program).

MrBayes: an example nexus file
#nexus
begin data;
  dimensions ntax=6 nchar=20;
  format datatype=dna;
  matrix
  Otus1 aaaaaaaaaaaaaaaaaaaa
  Otus2 aaaaaaaaaaaaaaaaaaaa
  Otus3 aaaaaaaaaaaaaaaaaaaa
  Otus4 cccccccccccccccccccc
  Otus5 gggggggggggggggggggg
  Otus6 tttttttttttttttttttt
  ;
end;
begin mrbayes;
  mcmcp ngen= samplefreq=100;
  mcmc;
end;
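Writing the data block by hand is error-prone because ntax and nchar must match the matrix. One way to avoid mistakes is to generate the block from the sequences, as in this Python sketch. The function name `nexus_data_block` is hypothetical and the script only covers the simple layout of the example above (DNA, one line per taxon); the mrbayes block is left to be added by hand.

```python
def nexus_data_block(seqs):
    """Return a NEXUS data block for aligned DNA sequences.

    seqs -- dict mapping taxon name to sequence; all sequences must
            have the same length (i.e. they are already aligned).
    """
    lengths = {len(seq) for seq in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    nchar = lengths.pop()
    width = max(len(name) for name in seqs)
    lines = [
        "#nexus",
        "begin data;",
        f"  dimensions ntax={len(seqs)} nchar={nchar};",
        "  format datatype=dna;",
        "  matrix",
    ]
    # One taxon per line, names padded so the sequences line up.
    lines += [f"  {name.ljust(width)} {seq}" for name, seq in seqs.items()]
    lines += ["  ;", "end;"]
    return "\n".join(lines) + "\n"


toy = {
    "Otus1": "a" * 20, "Otus2": "a" * 20, "Otus3": "a" * 20,
    "Otus4": "c" * 20, "Otus5": "g" * 20, "Otus6": "t" * 20,
}
nexus_text = nexus_data_block(toy)
print(nexus_text)
```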

A real thing:

MrBayes, after the run. To summarize the parameter values, type: sump burnin= To summarize the trees, type: sumt burnin= (with a proper burnin value).

Burn-in [figure © Fredrik Ronquist]

MrBayes, after the run. Burnin discards the initial values sampled before the analysis reached convergence (burnin=2500 if you have run a million generations, sampled every 100th of them, and want to discard the first 25%). Note: you have to run "enough" generations.
-Check the plot generated by sump; there should be no obvious trends.
-The standard deviation of split frequencies should be less than 0.01.
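The burnin value counts samples, not generations, which is an easy thing to get wrong. The arithmetic from the example above as a small Python helper (the function name is made up for this sketch):

```python
def burnin_samples(ngen, samplefreq, fraction):
    """Number of samples to discard as burn-in.

    ngen       -- generations run
    samplefreq -- a sample was saved every samplefreq generations
    fraction   -- share of the samples to discard (0.25 = first 25 %)
    """
    n_samples = ngen // samplefreq
    return int(n_samples * fraction)


# A million generations, sampled every 100th, first 25 % discarded.
print(burnin_samples(1_000_000, 100, 0.25))  # 2500
```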

MrBayes models. Restriction: can handle only 24 substitution models. Example command: lset nst=6 rates=invgamma. Confused? Try typing: help lset. Priors are set with the command prset; the defaults (try help prset) should work fine for most analyses.
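One way to arrive at 24 models is to combine three numbers of substitution types, two treatments of nucleotide frequencies and four treatments of rate variation. The Python sketch below simply enumerates those combinations. The option spellings are written as I recall them from MrBayes 3.1 and should be treated as assumptions; check them with help lset and help prset before use.

```python
from itertools import product

# ASSUMED option values; verify against "help lset" and "help prset".
nst_values = (1, 2, 6)                                     # substitution types
statefreq_priors = ("fixed(equal)", "dirichlet(1,1,1,1)")  # equal / estimated
rate_settings = ("equal", "gamma", "propinv", "invgamma")  # none, +G, +I, +I+G

commands = [
    f"lset nst={nst} rates={rates}; prset statefreqpr={freqs};"
    for nst, freqs, rates in product(nst_values, statefreq_priors, rate_settings)
]

print(len(commands))  # 24
for command in commands:
    print(command)
```

In this scheme nst=1 with equal frequencies corresponds to JC, nst=2 with equal frequencies to K80 (K2P), and nst=6 with estimated frequencies to GTR.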

Cladistic parsimony: prefer the tree with the fewest evolutionary steps. Only parsimony-informative sites count.
Otus1 aaaaaaaaaaaaaaaaaaaa
Otus2 aaaaaaaaaaaaaaaaaaaa
Otus3 aaaaaaaaaaaaaaaaaaaa
Otus4 cccccccccccccccccccc
Otus5 gggggggggggggggggggg
Otus6 tttttttttttttttttttt
[Figure: tree with tips Otus1, Otus2, Otus3, Otus4, Otus5, Otus6]
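Whether a site is parsimony-informative can be checked mechanically: at least two different states must each occur in at least two taxa. A short Python sketch (function and variable names are made up) applied to the toy alignment:

```python
from collections import Counter


def is_parsimony_informative(column):
    """True if at least two states each occur in at least two taxa."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


alignment = [
    "a" * 20,  # Otus1
    "a" * 20,  # Otus2
    "a" * 20,  # Otus3
    "c" * 20,  # Otus4
    "g" * 20,  # Otus5
    "t" * 20,  # Otus6
]

# zip(*alignment) turns the rows (taxa) into columns (sites).
informative = sum(is_parsimony_informative(col) for col in zip(*alignment))
print(informative)  # 0
```

Every site here has the pattern a, a, a, c, g, t, so only one state is shared by two or more taxa and no site in the toy data set is parsimony-informative.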

Fain and Houde 2004: Evolution 58:

MrBayes exercises:
1. Study the program defaults with the help command (e.g. lset and prset)
2. Run the program with a few arbitrary sequences (e.g. palikka.nex)
-Try the sump and sumt commands with different burnin values
-Study the files made by the program: where is the tree?
3. Run the program with some real data (e.g. your own or birds.txt)
-Align the sequences
-Put them into a nexus file
-Try to find out how to select the JC, K2P and GTR models, with and without gamma-distributed rate variation and with and without correction for invariable sites
-Try the model suggested by FindModel (AIC criterion)