Molecular Evolution: Plan for week Monday 3.11: Basics of Molecular Evolution Lecture 1: 9-10.30 Molecular Basis and Models I (JH) Computer : 11-12.30.


Molecular Evolution: Plan for week
Monday 3.11: Basics of Molecular Evolution
Lecture 1: Molecular Basis and Models I (JH)
Computer: PAUP: Distance/Parsimony/Compatibility (JH/IH)
Lecture 2: Molecular Basis and Models II (JH)
Lecture 3: The Origin of Life (JH/Miklos)
Tuesday 4.11: Tree of Life
Lecture 1: Molecular Evolution of Eukaryote Pathogens (Day/Barry)
Lecture 2: Molecular Evolution of Prokaryote Pathogens (Maiden)
Computer: Analysis of Viral Data (Taylor)
Lecture 3: Molecular Evolution of Virus (E. Holmes)
Wednesday 5.11: Stochastic Models of Evolution & Phylogenies
Computer: PAUP/MrBayes: Likelihood (JH/IH)
Lecture 1: The Evolution of Protein Structures (Deane)
Computer: PAML: Testing Evolutionary Models (JH/Lyngsoe)
Lecture 2: Molecular Evolution & Function/Structure/Selection (Meyer)
Thursday 6.11: More Phylogenies
Computer: Molecular Evolution on the web (JH/Lyngsoe)
Lecture 2: Beyond Phylogenies: Networks & Recombination (Song/JH)
Computer: Beyond Phylogenies (Song)
Lecture 3: Molecular Evolution and the Genomes (JH/Lunter)
Friday 7.11: Results, Advanced Topics and article discussion
Computer: Statistical Alignment (JH/IM)
Lecture: Article Discussion/Presentation by students
The Last Lunch

Two Discussion Articles
1. Timing the ancestor of the HIV-1 pandemic strains. Korber B, Muldoon M, Theiler J, Gao F, Gupta R, Lapedes A, Hahn BH, Wolinsky S, Bhattacharya T. Science, Jun 9; 288(5472).
2. Sequencing and comparison of yeast species to identify genes and regulatory elements. Kellis, M., N. Patterson, M. Endrizzi & E. Lander. Nature, May.

The Data & its growth.
1976/79 The first viral genome - MS2/ΦX174
1995 The first prokaryotic genome - H. influenzae
1996 The first unicellular eukaryotic genome - Yeast
1997 The first multicellular eukaryotic genome - C. elegans
2001 The human genome
2002 The Mouse Genome
Known: >1000 viral genomes, 96 prokaryotic genomes, 16 archaebacterial genomes.
A series of multicellular genomes are coming.
A general increase in data involving higher structures and dynamics of biological systems.

The Nucleotides: Pyrimidines, Purines. Transversions, Transitions.

The Amino Acids/Codons/Genes: $\{\text{nucleotides}\}^3 \to \{\text{amino acids}, \text{stop}\}$

Major Application Areas of Molecular Evolution Phylogenies and Classification Rates of Evolution & The Molecular Clock Dating Functional Constraint – Negative Selection. Positive/Diversifying Selection Structure RNA Structure Gene Finding Homing in on Important Genes Homology Searches Disease Gene Mapping

The Tree (?) of Life
[Figure: Origin of Life, LUCA, then Prokaryotes, Archaea and Eukaryotes (Plants, Fungi, Animals); Viruses ??]

Tree of Life. Science vol.300 June 2003

The Origin of Life
When did life originate? Is the present structure a necessity or is it a random accident? How frequent is life in the Universe?
"+": Self replication easy. Self assembly easy. Many extrasolar planets.
"-": Hard to make proper polymerisation. No convincing scenario. No testability.
Increased Origin Research: In preparation of future NASA expeditions. The rise of nano biology. The ability to simulate larger molecular systems.

Central Principles of Phylogeny Reconstruction: Parsimony, Distance, Likelihood
[Figure: one tree on s1, s2, s3, s4 per principle, with example sequences TTCAGT, TCCAGT, GCCAAT. Parsimony: total weight. Likelihood: L=3.1*10^-7 and parameter estimates.]

From Distance to Phylogenies
What is the relationship of a, b, c, d & e?
[Figure: two trees on a, b, c, d, e: one with a molecular clock, one with no molecular clock]

Enumerating Trees: Unrooted & valency 3
Recursion: $T_n = (2n-5)\,T_{n-1}$
Initialisation: $T_1 = T_2 = T_3 = 1$
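The recursion can be evaluated directly. The sketch below is an illustration only; the function name `count_unrooted_trees` is made up for this example. It relies on the fact that a bifurcating unrooted tree with k-1 leaves has 2k-5 branches, each of which can receive leaf number k.

```python
def count_unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees with n labelled leaves."""
    if n < 1:
        raise ValueError("n must be at least 1")
    count = 1  # T_1 = T_2 = T_3 = 1
    for k in range(4, n + 1):
        count *= 2 * k - 5  # leaf k can attach to any of the 2k-5 branches
    return count

for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```

The count grows faster than exponentially, which is why exhaustive search over trees is only feasible for a handful of sequences.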

Heuristic Searches in Tree Space
Nearest Neighbour Interchange
Subtree regrafting
Subtree rerooting and regrafting
[Figures: rearrangements of the subtrees T1, T2, T3, T4 and of the leaves s1 to s6]

Assignment to internal nodes: The simple way.
[Figure: tree with leaves C A C C A C T G and unknown internal nodes ? ? ? ? ? ?]
What is the cheapest assignment of nucleotides to internal nodes, given some (symmetric) distance function $d(N_1,N_2)$? If there are k leaves, there are k-2 internal nodes and $4^{k-2}$ possible assignments of nucleotides. For k=22, this is more than

5S RNA Alignment & Phylogeny (Hein)
   tatt-ctggtgtcccaggcgtagaggaaccacaccgatccatctcgaacttggtggtgaaactctgccgcggt--aaccaatact-cg-gg-gggggccct-gcggaaaaatagctcgatgccagga--ta
17 t--t-ctggtgtcccaggcgtagaggaaccacaccaatccatcccgaacttggtggtgaaactctgctgcggt--ga-cgatact-tg-gg-gggagcccg-atggaaaaatagctcgatgccagga--t-
 9 t--t-ctggtgtctcaggcgtggaggaaccacaccaatccatcccgaacttggtggtgaaactctattgcggt--ga-cgatactgta-gg-ggaagcccg-atggaaaaatagctcgacgccagga--t-
14 t----ctggtggccatggcgtagaggaaacaccccatcccataccgaactcggcagttaagctctgctgcgcc--ga-tggtact-tg-gg-gggagcccg-ctgggaaaataggacgctgccag-a--t-
 3 t----ctggtgatgatggcggaggggacacacccgttcccataccgaacacggccgttaagccctccagcgcc--aa-tggtact-tgctc-cgcagggag-ccgggagagtaggacgtcgccag-g--c-
11 t----ctggtggcgatggcgaagaggacacacccgttcccataccgaacacggcagttaagctctccagcgcc--ga-tggtact-tg-gg-ggcagtccg-ctgggagagtaggacgctgccag-g--c-
 4 t----ctggtggcgatagcgagaaggtcacacccgttcccataccgaacacggaagttaagcttctcagcgcc--ga-tggtagt-ta-gg-ggctgtccc-ctgtgagagtaggacgctgccag-g--c-
15 g----cctgcggccatagcaccgtgaaagcaccccatcccat-ccgaactcggcagttaagcacggttgcgcccaga-tagtact-tg-ggtgggagaccgcctgggaaacctggatgctgcaag-c--t-
 8 g----cctacggccatcccaccctggtaacgcccgatctcgt-ctgatctcggaagctaagcagggtcgggcctggt-tagtact-tg-gatgggagacctcctgggaataccgggtgctgtagg-ct-t-
12 g----cctacggccataccaccctgaaagcaccccatcccgt-ccgatctgggaagttaagcagggttgagcccagt-tagtact-tg-gatgggagaccgcctgggaatcctgggtgctgtagg-c--t-
 7 g----cttacgaccatatcacgttgaatgcacgccatcccgt-ccgatctggcaagttaagcaacgttgagtccagt-tagtact-tg-gatcggagacggcctgggaatcctggatgttgtaag-c--t-
16 g----cctacggccatagcaccctgaaagcaccccatcccgt-ccgatctgggaagttaagcagggttgcgcccagt-tagtact-tg-ggtgggagaccgcctgggaatcctgggtgctgtagg-c--t-
 1 a----tccacggccataggactctgaaagcactgcatcccgt-ccgatctgcaaagttaaccagagtaccgcccagt-tagtacc-ac-ggtgggggaccacgcgggaatcctgggtgctgt-gg-t--t-
18 a----tccacggccataggactctgaaagcaccgcatcccgt-ccgatctgcgaagttaaacagagtaccgcccagt-tagtacc-ac-ggtgggggaccacatgggaatcctgggtgctgt-gg-t--t-
 2 a----tccacggccataggactgtgaaagcaccgcatcccgt-ctgatctgcgcagttaaacacagtgccgcctagt-tagtacc-at-ggtgggggaccacatgggaatcctgggtgctgt-gg-t--t-
 5 g---tggtgcggtcataccagcgctaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggccagaa-cagtact-gg-gatgggtgacctcccgggaagtcctggtgccgcacc-c--c-
13 g----ggtgcggtcataccagcgttaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggccagcc-tagtact-ag-gatgggtgacctcctgggaagtcctgatgctgcacc-c--t-
 6 g----ggtgcgatcataccagcgttaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggttggag-tagtact-ag-gatgggtgacctcctgggaagtcctaatattgcacc-c-tt
Weights: transitions 2, transversions 5. Total weight 843.

Cost of a history - minimizing over internal states
[Figure: node with candidate states A, C, G, T; one term of the minimisation is $d(C,G) + w_C(\text{left subtree})$]

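Cost of a history - leaves (initialisation).
[Figure: leaves G and A, each with candidate states A, C, G, T; the subtree below a leaf is empty and has cost 0]
Initialisation at a leaf: Cost(N) = 0 if N is the nucleotide at the leaf, otherwise infinity.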

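Fitch-Hartigan-Sankoff Algorithm
Each node stores, for every nucleotide N, the cost of the cheapest tree hanging from this node given there is an N (e.g. a "C") at this node.
[Figure: leaves A, C, T, G; transition cost 2, transversion cost 5; "*" denotes infinity]
Leaf C: (A,C,G,T) = (*,0,*,*)
Leaf T: (A,C,G,T) = (*,*,*,0)
Leaf G: (A,C,G,T) = (*,*,0,*)
Internal node above C and T: (A,C,G,T) = (10,2,10,2)
Internal node above that node and G: (A,C,G,T) = (9,7,7,7)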
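The bottom-up recursion can be sketched in a few lines. This is a minimal illustration, assuming the slide's weights (transition 2, transversion 5) and a hypothetical nested-tuple tree encoding chosen only for this example; it reproduces the two internal cost vectors (10,2,10,2) and (9,7,7,7).

```python
import math

NUCS = "ACGT"
PURINES = {"A", "G"}

def d(x: str, y: str) -> int:
    """Substitution cost: 0 for identity, 2 for a transition, 5 for a transversion."""
    if x == y:
        return 0
    same_class = (x in PURINES) == (y in PURINES)
    return 2 if same_class else 5

def sankoff(tree):
    """Return the cost vector (A, C, G, T) for the subtree `tree`.

    A tree is either a leaf (a single nucleotide string) or a tuple
    (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        # Initialisation: cost 0 for the observed nucleotide, infinity otherwise.
        return [0 if n == tree else math.inf for n in NUCS]
    left, right = (sankoff(sub) for sub in tree)
    return [
        min(d(n, m) + left[i] for i, m in enumerate(NUCS))
        + min(d(n, m) + right[i] for i, m in enumerate(NUCS))
        for n in NUCS
    ]

print(sankoff(("C", "T")))         # [10, 2, 10, 2]
print(sankoff((("C", "T"), "G")))  # [9, 7, 7, 7]
```

The minimum of the vector at the root is the parsimony cost of the tree; the work is linear in the number of nodes rather than exponential in the number of internal assignments.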

The Felsenstein Zone
Felsenstein-Cavender (1979)
[Figure: true tree and reconstructed tree on s1, s2, s3, s4; site patterns (16, only 8 shown)]

Bootstrapping Felsenstein (1985) ATCTGTAGTCT ATCTGTAGTCT ??????????

The Molecular Clock First noted by Zuckerkandl & Pauling (1964) as an empirical fact. How can one detect it? Known Ancestor, a, at Time t s1 s2 a Unknown Ancestors s1 s2 s3 ??

Rootings
Purpose: 1) To give time direction in the phylogeny & the most ancient point. 2) To be able to define concepts such as a monophyletic group.
Methods:
1) Outgroup: Enhance the data set with a sequence from a species definitely distant to all of them. It will be joined at the root of the original data.
2) Midpoint: Find the midpoint of the longest path in the tree.
3) Assume a Molecular Clock.

Rooting the 3 kingdoms
3 billion years ago: no reliable clock, no outgroup.
Given 2 sets of homologous proteins, i.e. MDH & LDH, can the Archaea (A), Prokarya (P) and Eukarya (E) be rooted?
[Figure: unrooted tree on E, P, A with "Root??"; combined LDH/MDH tree in which the LDH subtree on E, P, A and the MDH subtree on E, P, A root each other]

Non-contemporaneous leaves.
[Figure, from Drummond: trees against a time axis. Contemporary sample: no time structure. Serial sample: with time structure.]
RNA viruses like HIV evolve fast enough that you can't ignore the time structure.
(A. Rambaut (2000): Estimating the rate of molecular evolution: incorporating non-contemporaneous sequences into maximum likelihood phylogenies. Bioinformatics)

HIV-1 (env) evolution in nine infected individuals. Shankarappa et al (1999). From Drummond.
[Figure: viral divergence (2% to 10%) against years post seroconversion for patients Pt.1, Pt.2, Pt.3, Pt.5, Pt.6, Pt.7, Pt.8, Pt.9 and Patient #6 from Wolinsky et al. (sequences HIV1U36148, HIV1U36015, HIV1U35980, HIV1U36073, HIV1U35926, HIVU95460); scale bar 10%]

A tree sampled from the posterior distribution of Shankarappa Patient. From Drummond.
[Figure: tree with Lineage A and Lineage B and a 'ladder-like' appearance]
210 sequences collected over a period of 9.5 years.
660 nucleotides from env: C2-V5 region. Only the first 285 (no alignment ambiguities) were used in this analysis.
Effective population size and mutation rate were co-estimated using Bayesian MCMC: N_e = [4000, 6300], Mu = [0.8% - 1%] per site per year.

Models of Amino Acid, Nucleotide & Codon Evolution Amino Acids, Nucleotides & Codons Continuous Time Markov Processes Specific Models Special Issues Context Dependence Rate Variation

The Purpose of Stochastic Models. 1.Molecular Evolution is Stochastic. 2. To estimate evolutionary parameters, not observable directly: i. Real number of events in evolutionary history. ii. Rates of different kinds of events in evolutionary history. iii. Strength of selection against amino acid changing nucleotide substitutions. iv. Estimate importance of different biological factors. 3.Survive a goodness of fit test. 4. Serve these purposes as simply as possible.

Central Problems: History cannot be observed, only end products.
Comment: Even if History could be observed, the underlying process couldn't.
[Figure: a sequence history ACGTC, ACGCC, AGGCC, AGGCT, AGGTT, with further sequences AGGGC, AGTGC]

Principle of Inference: Likelihood
Likelihood function L(): the probability of the data as a function of the parameters: L(θ, D).
LogLikelihood function l(): ln(L(θ, D)).
If the data is a series of independent experiments, L() becomes a product of the likelihoods of each experiment and l() becomes the sum of the logLikelihoods of each experiment.
In likelihood analysis the parameter is not viewed as a random variable.

Likelihood and logLikelihood of Coin Tossing From Edwards (1991) Likelihood
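As a small illustration of the coin tossing case, the sketch below evaluates the logLikelihood on a grid and picks its maximum. The data (7 heads in 10 tosses) and the grid search are assumptions made for this example, and the binomial coefficient is left out because it does not depend on p.

```python
import math

def log_likelihood(p: float, heads: int, tosses: int) -> float:
    """LogLikelihood l(p) of `heads` heads in `tosses` independent tosses."""
    # Each toss is an independent experiment, so the logLikelihoods add up.
    return heads * math.log(p) + (tosses - heads) * math.log(1.0 - p)

heads, tosses = 7, 10  # made-up data
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: log_likelihood(p, heads, tosses))
print(p_hat)  # 0.7
```

The maximum sits at the observed frequency of heads, which is the familiar maximum likelihood estimate for a coin.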

Principle of Inference: Bayesian Analysis
In Bayesian Analysis the parameters are viewed as stochastic variables that have a prior distribution before observing data. The data depend on the parameters, and after observing the data the parameters have a posterior distribution.

Simplifying Assumptions I
Data: s1=TCGGTA, s2=TGGTT
1) Only substitutions.
[Alignments shown: s1 TCGGTA with s2 TGGT-T; s1 TCGGA with s2 TGGTT]
2) Processes in different positions of the molecule are independent, so the probability for the whole alignment will be the product of the probabilities of the individual patterns.
[Figure: biological setup and probability of data, with unknown ancestors a_1, ..., a_5 above the aligned columns of TCGGA and TGGTT]

Simplifying Assumptions II
3) The evolutionary process is the same in all positions.
4) Time reversibility: virtually all models of sequence evolution are time reversible, i.e. $\pi_i P_{i,j}(t) = \pi_j P_{j,i}(t)$, where $\pi_i$ is the stationary probability of state i and $P_{i,j}(t)$ the probability that state i has changed into state j after time t.
This implies that the unknown ancestor a can be summed out: $\sum_a \pi_a P_{a,N_1}(l_1)\,P_{a,N_2}(l_2) = \pi_{N_1} P_{N_1,N_2}(l_1+l_2)$, so two branches of lengths $l_1$ and $l_2$ joined at a behave like a single branch of length $l_1+l_2$ between $N_1$ and $N_2$.
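The consequence of reversibility can be checked numerically. The sketch below is only an illustration: it borrows the Jukes-Cantor transition probabilities (uniform stationary distribution, per-nucleotide change probability ¼(1 - e^{-4a})) as an assumed model, sums over the unknown ancestor weighted by its stationary probability, and compares the result with a single branch of length l1 + l2. The branch lengths and nucleotides are made up.

```python
import math

NUCS = "ACGT"
PI = 0.25  # stationary frequency of every nucleotide under Jukes-Cantor

def p_jc(i: str, j: str, a: float) -> float:
    """Jukes-Cantor probability of i -> j after branch length a = alpha * t."""
    decay = math.exp(-4.0 * a)
    return 0.25 * (1.0 + 3.0 * decay) if i == j else 0.25 * (1.0 - decay)

l1, l2 = 0.3, 0.5
n1, n2 = "C", "G"
# Root the branch at an unknown ancestor and sum over its state ...
rooted = sum(PI * p_jc(anc, n1, l1) * p_jc(anc, n2, l2) for anc in NUCS)
# ... or treat N1 as the start of a single branch of length l1 + l2.
unrooted = PI * p_jc(n1, n2, l1 + l2)
print(rooted, unrooted)
```

Because the two numbers agree, the position of the root along the branch cannot be estimated from the data under a reversible model.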

Simplifying assumptions III
5) The nucleotide at any position evolves following a continuous time Markov Chain: $P_{i,j}(t)$ is a continuous time Markov chain on the state space {A,C,G,T}.
6) The rate matrix, Q, for the continuous time Markov Chain is the same at all times (and often all positions). However, it is possible to let the rate of events, $r_i$, vary from site to site; then the term for passed time, t, will be substituted by $r_i t$.
Q - rate matrix (rows FROM, columns TO):
      A                              C                              G                              T
A     -(q_{A,C}+q_{A,G}+q_{A,T})     q_{A,C}                        q_{A,G}                        q_{A,T}
C     q_{C,A}                        -(q_{C,A}+q_{C,G}+q_{C,T})     q_{C,G}                        q_{C,T}
G     q_{G,A}                        q_{G,C}                        -(q_{G,A}+q_{G,C}+q_{G,T})     q_{G,T}
T     q_{T,A}                        q_{T,C}                        q_{T,G}                        -(q_{T,A}+q_{T,C}+q_{T,G})
[Figure: a lineage going from C to C to A over times t_1 and t_2]

Q and P(t)
What is the probability of going from i (C?) to j (G?) in time t with rate matrix Q?
i. $P(0) = I$.
ii. $P(\varepsilon)$ is close to $I + \varepsilon Q$ for $\varepsilon$ small.
iii. $P'(0) = Q$.
iv. $\lim_{t\to\infty} P(t)$ has the equilibrium frequencies of the 4 nucleotides in each row.
v. Waiting time in state j, $T_j$: $P(T_j > t) = e^{q_{jj} t}$ (recall $q_{jj} < 0$).
vi. $QE = 0$, where $E_{ij} = 1$ (all i,j).
vii. $PE = E$.
viii. If $AB = BA$, then $e^{A+B} = e^A e^B$.
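Several of these properties can be checked numerically by computing P(t) as the matrix exponential of Qt. The sketch below is an illustration under assumptions: the rate matrix is hypothetical, and the exponential is taken from a truncated power series, which is adequate for small matrices with modest rates but is not how production software would do it.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def transition_matrix(q, t, terms=40):
    """P(t) = exp(Q t) from the power series sum_k (Q t)^k / k!."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    term = [[float(i == j) for j in range(n)] for i in range(n)]  # (Qt)^0 = I
    total = [row[:] for row in term]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, qt)]
        total = [[total[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return total

# Hypothetical rate matrix on (A, C, G, T); every row sums to zero (QE = 0).
Q = [[-0.9, 0.2, 0.5, 0.2],
     [0.1, -0.6, 0.1, 0.4],
     [0.5, 0.2, -0.9, 0.2],
     [0.1, 0.4, 0.1, -0.6]]

P = transition_matrix(Q, 0.5)
print([round(sum(row), 12) for row in P])  # row sums of P(t), illustrating PE = E
print(transition_matrix(Q, 0.0)[0])        # first row of P(0) = I
```

Evaluating `transition_matrix(Q, t)` for a very small t and comparing it with I + tQ illustrates properties ii and iii in the same way.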

Jukes-Cantor 69: Total Symmetry
Rate-matrix, R (rows FROM, columns TO):
      A      C      G      T
A    -3α     α      α      α
C     α     -3α     α      α
G     α      α     -3α     α
T     α      α      α     -3α
Transition prob. after time t, with $a = \alpha t$:
$P(\text{equal}) = \tfrac14(1 + 3e^{-4a}) \approx 1 - 3a$
$P(\text{diff.}) = \tfrac34(1 - e^{-4a}) \approx 3a$, i.e. $\tfrac14(1 - e^{-4a}) \approx a$ for each of the three other nucleotides.
Stationary Distribution: (1,1,1,1)/4.
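The quality of the small-a approximations is easy to inspect numerically. The sketch below takes P(diff.) to mean the probability of observing any of the three other nucleotides, ¾(1 - e^{-4a}); the function names and the chosen values of a are assumptions for this example.

```python
import math

def p_equal(a: float) -> float:
    """Probability that the nucleotide is unchanged after a = alpha * t."""
    return 0.25 * (1.0 + 3.0 * math.exp(-4.0 * a))

def p_diff(a: float) -> float:
    """Probability of observing any of the three other nucleotides."""
    return 0.75 * (1.0 - math.exp(-4.0 * a))

for a in (0.001, 0.01, 0.1, 1.0, 10.0):
    print(f"a={a:<6} P(equal)={p_equal(a):.4f} (1-3a={1 - 3 * a:.4f})  "
          f"P(diff)={p_diff(a):.4f} (3a={3 * a:.4f})")
```

For small a the linear approximations are accurate, while for large a the probabilities saturate at the stationary values 1/4 and 3/4, so observed differences underestimate the real number of events.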

Geometric/Exponential Distributions
The Geometric Distribution on {0,1,..}: Geo(p): $P(Z=j) = p^j(1-p)$, $P(Z \ge j) = p^j$, $E(Z) = p/(1-p)$.
The Exponential Distribution on $R_+$: Exp(λ): density $f(t) = \lambda e^{-\lambda t}$, $P(X>t) = e^{-\lambda t}$.
Properties, for X ~ Exp(λ) and Y ~ Exp(μ) independent:
i. $P(X>t_2 \mid X>t_1) = P(X>t_2-t_1)$ for $t_2 > t_1$: Markov (memoryless) process.
ii. $E(X) = 1/\lambda$.
iii. $P(Z>t) \approx P(X>t)$ for small a ($p = e^{-a}$).
iv. $P(X < Y) = \lambda/(\lambda+\mu)$.
v. $\min(X,Y) \sim \mathrm{Exp}(\lambda+\mu)$.

Comparison of Pairs of Nucleotides/Sequences C G All Evolutionary Paths: C G Shortest Path C G Sample Paths according to their probability: CTACGT GTATAT All Evolutionary Paths: Higher Cells ChimpMouse Fish E.coli ATTGTGTATATAT….CAG ATTGCGTATCTAT….CCG

From Q to P for Jukes-Cantor

Kimura 2-parameter model
Q (rows FROM, columns TO), with α the transition rate and β the rate of each transversion:
      A          C          G          T
A    -(α+2β)     β          α          β
C     β         -(α+2β)     β          α
G     α          β         -(α+2β)     β
T     β          α          β         -(α+2β)
P(t) is expressed through $a = \alpha t$ and $b = \beta t$.
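For reference, the transition probabilities of the Kimura 2-parameter model can be written in closed form. The sketch below assumes the common textbook parametrisation, in which α is the rate of the transition and β the rate of each of the two transversions; other presentations scale β differently, so constants may differ from a given slide or paper.

```python
import math

NUCS = "ACGT"
PURINES = {"A", "G"}

def p_k2p(i: str, j: str, a: float, b: float) -> float:
    """K2P probability of i -> j, with a = alpha*t and b = beta*t."""
    slow = math.exp(-4.0 * b)
    fast = math.exp(-2.0 * (a + b))
    if i == j:
        return 0.25 + 0.25 * slow + 0.5 * fast
    if (i in PURINES) == (j in PURINES):  # transition: A<->G or C<->T
        return 0.25 + 0.25 * slow - 0.5 * fast
    return 0.25 - 0.25 * slow             # one specific transversion

a, b = 0.2, 0.05
for i in NUCS:
    print(i, [round(p_k2p(i, j, a, b), 4) for j in NUCS])
```

Setting a equal to b recovers the Jukes-Cantor probabilities, which is a quick way to sanity-check the formulas.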

Felsenstein 81 & Hasegawa, Kishino & Yano 85
Unequal base composition (Felsenstein, 1981): Q_{i,j} = C*π_j for i unequal j.
Transition/transversion & composition bias (Hasegawa, Kishino & Yano, 1985): Q_{i,j} = ( )*C*π_j if i->j is a transition; Q_{i,j} = C*π_j if i->j is a transversion.

Dayhoff's empirical approach (1970)
Take a set of closely related proteins, count all differences and make a symmetric difference matrix, since time direction cannot be observed.
If q_ij = q_ji, then the equilibrium frequencies, π_i, are all the same. After the transformation q_ij --> π_i q_ij / π_j, the equilibrium frequencies will be π_i.

Measuring Selection
[Figure: a history of a two-codon sequence with its translation: Thr Ser ACGTCA; Thr Pro ACGCCA; Thr Ser ACGCCG; Arg Ser AGGCCG; Thr Ser ACTCTG; Ala Ser GCTCTG; Ala Ser GCACTG]
Certain events have functional consequences and will be selected out. The strength and localization of this selection is of great interest.
The selection criteria could in principle be anything, but the selection against amino acid changes is without comparison the most important.

The Genetic Code
i. 3 classes of sites: 4 (3rd), (3rd)
ii. T A (2nd)
Problems: i. Not all sites fit into those categories. ii. A change in one site can change the status of another.

Possible events in the genetic code (remade from Li, 1997)
Substitutions           Number   Percent
Total in all codons
Synonymous
Nonsynonymous
  Missense
  Nonsense              23       4
Possible number of substitutions: 61 (codons) * 3 (positions) * 3 (alternative nucleotides) = 549.
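The counts in such a table can be recomputed by enumerating every single-nucleotide change in the 61 sense codons. The sketch below assumes the standard genetic code, written in the usual compact TCAG order; the variable names are chosen for this example.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons in TCAG order mapped to amino acids, "*" = stop.
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

counts = {"synonymous": 0, "missense": 0, "nonsense": 0}
for codon, aa in CODE.items():
    if aa == "*":
        continue  # only the 61 sense codons are starting points
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = CODE[codon[:pos] + base + codon[pos + 1:]]
            if mutant == aa:
                counts["synonymous"] += 1
            elif mutant == "*":
                counts["nonsense"] += 1
            else:
                counts["missense"] += 1

total = sum(counts.values())
print(total, counts)
for kind, n in counts.items():
    print(f"{kind:<11} {n:4d} {100 * n / total:5.1f}%")
```

Missense and nonsense changes together make up the nonsynonymous class, so roughly three quarters of all possible substitutions change the protein.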

Synonymous (silent) & Non-synonymous (replacement) substitutions
Ser Thr Glu Met Cys Leu Met Gly Thr
TCA ACT GAG ATG TGT TTA ATG GGG ACG
***   *  *    *  *  *         *  **
GGG ACA GGG ATA TAT CTA ATG GGT AGC
Ser Thr Gly Ile Tyr Leu Met Gly Ser
K_s: Number of Silent Events in Common History. K_a: Number of Replacement Events in Common History.
N_s: Silent positions. N_a: Replacement positions.
Rates per position: (K_s/N_s)/2T.
Example: K_s = 100, N_s = 300, T = 10^8 years. Silent rate: (100/300)/(2*10^8) = 1.67*10^-9 /year/pos.
Miyata: use the most silent path for calculations, e.g. between Thr ACG and Ser AGC the path through Thr ACC rather than the path through Arg AGG.

Kimura's 2 parameter model & Li's Model.
Selection on the 3 kinds of sites: (α, β) -> (?, ?)
Non-degenerate sites: (f*α, f*β)
2-2 sites: (α, f*β)
4 sites: (α, β)

alpha-globin from rabbit and mouse.
Ser Thr Glu Met Cys Leu Met Gly Gly
TCA ACT GAG ATG TGT TTA ATG GGG GGA
TCG ACA GGG ATA TAT CTA ATG GGT ATA
Ser Thr Gly Ile Tyr Leu Met Gly Ile
Sites (rates)   Total   Conserved     Transitions   Transversions
(a*f, b*f)      274     246 (.8978)   12 (.0438)    16 (.0584)
(a, b*f)         77      51 (.6623)   21 (.2727)     5 (.0649)
(a, b)           78      47 (.6026)   16 (.2051)    15 (.1923)
$Z(\alpha t,\beta t) = .50\,[1+\exp(-2\beta t) - 2\exp(-t(\alpha+\beta))]$ (transition)
$Y(\alpha t,\beta t) = .25\,[1-\exp(-2\beta t)]$ (transversion)
$X(\alpha t,\beta t) = .25\,[1+\exp(-2\beta t) + 2\exp(-t(\alpha+\beta))]$ (identity)
$L(\text{observations},a,b,f) = C(429;274,77,78)\,\{X(af,bf)^{246}\,Y(af,bf)^{12}\,Z(af,bf)^{16}\}\,\{X(a,bf)^{51}\,Y(a,bf)^{21}\,Z(a,bf)^{5}\}\,\{X(a,b)^{47}\,Y(a,b)^{16}\,Z(a,b)^{15}\}$, where $a = \alpha t$ and $b = \beta t$.
Estimated parameters: a, b, (a + 2*b) and f, giving transition and transversion amounts a*f and b*f, a and b*f, a and b for the three kinds of sites.
Expected number of replacement and synonymous substitutions: Replacement sites: (0.3742/0.6744)*77. K_s = .6644, K_a = .1127.

HIV2 Analysis
Hasegawa, Kishino & Yano Substitution Model. Parameters: α*t, β*t, π_A, π_C, π_G, π_T. Estimated Distance per Site.
Selection Factors:
GAG 0.385 (s.d.)
POL 0.220 (s.d.)
VIF 0.407 (s.d.)
VPR 0.494 (s.d.)
TAT 1.229 (s.d.)
REV 0.596 (s.d.)
VPU 0.902 (s.d.)
ENV 0.889 (s.d.)
NEF 0.928 (s.d.)

Examples of rates (remade from Li, 1997)
Columns: Organism, Gene, Syno/year, Non-Syno/year
RNA Virus: Influenza A, Hemagglutinin; Hepatitis C, E; HIV 1, gag
DNA Virus: Hepatitis B, P; Herpes Simplex, Genome
Nuclear Genes: Mammals, c-mos; Mammals, a-globin; Mammals, histone

Codon based Models (Goldman, Yang + Muse, Gaut)
i. Codons as the basic unit.
ii. A codon based matrix would have (61*61)-61 (= 3660) off-diagonal entries.
Features to model: i. Bias in nucleotide usage. ii. Bias in codon usage. iii. Bias in amino acid usage. iv. Synonymous/non-synonymous distinction. v. Amino acid distance. vi. Transition/transversion bias.
For codon i and codon j differing by one nucleotide:
$q_{i,j} = \alpha\,\pi_j \exp(-d_{i,j}/V)$ if they differ by a transition,
$q_{i,j} = \beta\,\pi_j \exp(-d_{i,j}/V)$ if they differ by a transversion.
$d_{i,j}$ is a physico-chemical difference between amino acid i and amino acid j. V is a factor that reflects the variability of the gene involved.

Rate variation between sites: iid each site
i) The rate at each position is drawn independently from a distribution, typically a Γ (or lognormal) distribution. Γ(α,β) has density proportional to $x^{\alpha-1} e^{-\beta x}/\Gamma(\alpha)$, where β is called the scale parameter and α the form parameter.
Let $L(p_i, \theta, t)$ be the likelihood for observing the i'th pattern, t all time lengths, θ the parameters describing the process and $f(r_i)$ the continuous distribution of rate(s). Then
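The formula that followed on the slide is not preserved here. Under the stated assumptions (independent sites, rate $r$ drawn from $f$ and multiplying all time lengths), the usual form of the full likelihood would be the following; it is given as a sketch of the standard construction, not as the slide's exact expression:

```latex
L \;=\; \prod_{i} \int_{0}^{\infty} L(p_i, \theta, r\,t)\, f(r)\, dr
```

In practice the integral is commonly approximated by a sum over a few discrete rate categories.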

Rate variation between sites: Hidden Markov Chains
1) Different positions in the molecule evolve at different rates, for instance fast, $r_F$, or slow, $r_S$.
2) Neighbouring positions tend to evolve at the same rate.
[Figure: observed columns O_1, ..., O_10, each with a hidden state F or S]
What is the probability of the data? What is the most probable "hidden" configuration? What is the probability of a specific "hidden" state?
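The first question, the probability of the data, is answered by the forward recursion of a hidden Markov chain. The sketch below is an illustration only: the start, switching and per-site probabilities are made-up numbers, and the per-site likelihoods under the fast and slow rate would in practice come from a tree computation. A brute-force sum over all hidden configurations is included to show that the recursion computes the same quantity.

```python
from itertools import product

STATES = ("F", "S")
START = {"F": 0.5, "S": 0.5}
# Neighbouring positions tend to share a rate class: staying is more likely than switching.
TRANS = {"F": {"F": 0.9, "S": 0.1}, "S": {"F": 0.1, "S": 0.9}}
# Hypothetical per-site likelihoods P(O_k | rate class).
EMIT = [{"F": 0.02, "S": 0.10}, {"F": 0.03, "S": 0.08}, {"F": 0.05, "S": 0.01},
        {"F": 0.04, "S": 0.01}, {"F": 0.01, "S": 0.09}]

def forward(emit):
    """Probability of the data, summed over all hidden rate configurations."""
    f = {s: START[s] * emit[0][s] for s in STATES}
    for site in emit[1:]:
        f = {s: site[s] * sum(f[r] * TRANS[r][s] for r in STATES) for s in STATES}
    return sum(f.values())

def brute_force(emit):
    """Same quantity by explicit enumeration of all 2^L hidden paths."""
    total = 0.0
    for path in product(STATES, repeat=len(emit)):
        p = START[path[0]] * emit[0][path[0]]
        for k in range(1, len(emit)):
            p *= TRANS[path[k - 1]][path[k]] * emit[k][path[k]]
        total += p
    return total

print(forward(EMIT), brute_force(EMIT))
```

The forward recursion needs work proportional to the sequence length, whereas the enumeration doubles with every added position.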

Statistical Test of Models (Goldman, 1990)
Data: 3 sequences of length L
ACGTTGCAA...
AGCTTTTGA...
TCGTTTCGA...
A. Likelihood (free multinomial model, 63 free parameters):
$L_1 = p_{AAA}^{\#AAA} \cdots p_{AAC}^{\#AAC} \cdots p_{TTT}^{\#TTT}$, where $p_{N_1N_2N_3} = \#(N_1N_2N_3)/L$.
B. Jukes-Cantor and unknown branch lengths $l_1, l_2, l_3$:
$L_2 = p_{AAA}(l_1',l_2',l_3')^{\#AAA} \cdots p_{TTT}(l_1',l_2',l_3')^{\#TTT}$
Test statistics: I. $(\text{expected}-\text{observed})^2/\text{expected}$, or II. $-2\ln Q = 2(\ln L_1 - \ln L_2)$.
JC69 Jukes-Cantor: 3 parameters => $\chi^2$ with 60 degrees of freedom.
Problems: i. Too few observations per pattern. ii. Many competing hypotheses.
Parametric bootstrap: i. Maximum likelihood to estimate the parameters. ii. Simulate with the estimated model. iii. Make the simulated distribution of -2 lnQ. iv. Where is the real -2 lnQ in this distribution?
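The free multinomial likelihood only needs the counts of the observed site patterns. The sketch below uses just the nine columns printed on the slide, which is an assumption made to keep the example small (the real data continue beyond them), and it also spells out the degrees of freedom of the test.

```python
import math
from collections import Counter

seqs = ["ACGTTGCAA", "AGCTTTTGA", "TCGTTTCGA"]  # the nine columns shown on the slide

patterns = Counter(zip(*seqs))  # site pattern -> number of columns showing it
length = len(seqs[0])
# ln L1 = sum over observed patterns of #pattern * ln(#pattern / L)
ln_l1 = sum(n * math.log(n / length) for n in patterns.values())

free_multinomial = 4 ** len(seqs) - 1  # 63 free pattern probabilities
jukes_cantor = 3                       # three branch lengths
print(ln_l1, free_multinomial - jukes_cantor)  # second value: 60 degrees of freedom
```

With so few columns almost every pattern is seen at most once, which is exactly the "too few observations per pattern" problem that motivates the parametric bootstrap.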

Episodic Evolution
Poisson Process: i. The $T_i$'s are independent and exponentially distributed with the same parameter (λ). ii. Variance and mean are both λ.
Empirical Observations: i. Variance/Mean > 1 (clumpy process) for non-synonymous events.
Possible explanations: i. Selective avalanches. ii. Gene conversions from pseudogenes.

Assignment to internal nodes: The simple way.
[Figure: tree with leaves C A C C A C T G and unknown internal nodes ? ? ? ? ? ?]
If branch lengths and evolutionary process are known, what is the probability of the nucleotides at the leaves?
Cctacggccatacca a ccctgaaagcaccccatcccgt
Cttacgaccatatca c cgttgaatgcacgccatcccgt
Cctacggccatagca c ccctgaaagcaccccatcccgt
Cccacggccatagga c ctctgaaagcactgcatcccgt
Tccacggccatagga a ctctgaaagcaccgcatcccgt
Ttccacggccatagg c actgtgaaagcaccgcatcccg
Tggtgcggtcatacc g agcgctaatgcaccggatccca
Ggtgcggtcatacca t gcgttaatgcaccggatcccat

Probability of leaf observations - summing over internal states
[Figure: node with candidate states A, C, G, T; one term of the sum is $P(C \to G) \cdot P_C(\text{left subtree})$]
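Summing over internal states has the same structure as the parsimony recursion, with minimisation replaced by summation and costs by probabilities. The sketch below is an illustration under assumptions: it uses the Jukes-Cantor transition probabilities with a uniform root distribution, and the nested-tuple tree encoding, branch lengths and helper names are made up for this example.

```python
import math
from itertools import product

NUCS = "ACGT"

def p_jc(i, j, a):
    """Jukes-Cantor probability of i -> j along a branch of length a."""
    decay = math.exp(-4.0 * a)
    return 0.25 * (1.0 + 3.0 * decay) if i == j else 0.25 * (1.0 - decay)

def partial(tree):
    """P_N(subtree) for each N: probability of the leaves below, given N at this node.

    A tree is a leaf nucleotide or a tuple ((left, l_left), (right, l_right)).
    """
    if isinstance(tree, str):
        return {n: float(n == tree) for n in NUCS}
    result = {n: 1.0 for n in NUCS}
    for child, length in tree:
        below = partial(child)
        for n in NUCS:
            # Sum over the state m at the child end of the branch.
            result[n] *= sum(p_jc(n, m, length) * below[m] for m in NUCS)
    return result

def likelihood(tree):
    """Probability of the leaf observations, with a stationary (1/4 each) root."""
    return sum(0.25 * p for p in partial(tree).values())

def make_tree(x, y, z):
    """Hypothetical three-leaf tree ((x:0.1, y:0.2):0.05, z:0.3)."""
    return (((x, 0.1), (y, 0.2)), 0.05), (z, 0.3)

print(likelihood(make_tree("C", "T", "G")))
# Sanity check: the probabilities of all 64 possible leaf patterns add up to one.
print(sum(likelihood(make_tree(*leaves)) for leaves in product(NUCS, repeat=3)))
```

For an alignment, the per-column probabilities are multiplied (or their logarithms added), following the independence assumption between positions.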

Output from Likelihood Method.
[Figure: two trees on s1, s2, s3, s4, s5 with estimates and their -/+ standard errors. Molecular clock tree: axis from "Now" back through duplication times. No molecular clock tree: axis "Amount of Evolution".]
Molecular Clock: n-1 heights estimated. Likelihood: 6.2*
No Molecular Clock: 2n-3 lengths estimated. Likelihood: 7.9*
2[ln(7.9* ) - ln(6.2* )] is $\chi^2$-distributed with (n-2) degrees of freedom.

The generation/year-time clock Langley-Fitch,1973 s1 s3 s2 s1s3 s2 {l 1 = l 2 < l 3 } l2l2 l1l1 l3l3 l3l3 Some rooting techniquee Absolute Time Clock: Generation Time Clock: Absolute Time Clock Generation Time Elephant Mouse 100 Myr variable constant l 1 = l 2

The generation/year-time clock Langley-Fitch,1973 s1s3 s2 Any Tree Generation Time Clock Can the generation time clock be tested? Assume, a data set: 3 species, 2 sequences each s1 s3 s2 s1 s3 s2 s1s3 s2

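The generation/year-time clock (Langley-Fitch, 1973)
[Figure: tree on s1, s2, s3 with branch lengths l_1, l_2, l_3; the same tree for a second gene with lengths c*l_1, c*l_2, c*l_3; clock tree with l_1 = l_2 and l_3]
Degrees of freedom (dg):
k=3 species: no clock dg = 3; clock dg = 2.
k species: no clock dg = 2k-3; clock dg = k-1.
Generation time clock with k=3 and t=2 sequences each: dg = 4. With k species and t sequences each: dg = (2k-3)+(t-1).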

 – globin, cytochrome c, fibrinopeptide A & generation time clock Langley-Fitch,1973 N Fibrinopeptide A phylogeny: Human Gorilla Donkey GibbonMonkey Rabbit Cow Rat Pig Horse GoatLlamaSheep Dog Relative rates  -globin  – globin cytochrome c fibrinopeptide A 0.137

Almost Clocks
I. Smoothing a non-clock tree onto a clock tree (Sanderson).
II. Rate of Evolution of the rate of Evolution (Thorne et al.). The rate of evolution can change at each bifurcation.
III. Relaxed Molecular Clock (Huelsenbeck et al.). At random points in time, the rate changes by multiplying with a random variable (gamma distributed).
Comment: Makes perfect sense. Testing no clock versus perfect clock is choosing between two unrealistic extremes.
(M.J. Sanderson (1997) "A Nonparametric Approach to Estimating Divergence Times in the Absence of Rate Constancy" Mol.Biol.Evol.; J.L. Thorne et al. (1998) "Estimating the Rate of Evolution of the Rate of Evolution" Mol.Biol.Evol. 15(12); J.P. Huelsenbeck et al. (2000) "A compound Poisson Process for Relaxing the Molecular Clock" Genetics)

Summary
Phylogeny: Principles of Phylogenies. Rates of Molecular Evolution and the Molecular Clock. Rooting Phylogenies. The Generation Time Clock. Almost Clocks. Non-Contemporaneous Leaves (Viruses & Ancient DNA).
Stochastic Models: The Purpose of Stochastic Models. The Assumptions of Stochastic Models. The Central Models. Measuring Selection. Variation among sites. Testing Models.

History of Phylogenetic Methods & Stochastic Models
1958: Sokal and Michener publish the UPGMA method for making distance trees with a clock.
Parsimony principle defined, but not advocated, by Edwards and Cavalli-Sforza.
Zuckerkandl and Pauling introduce the notion of a Molecular Clock.
First large molecular phylogenies by Fitch and Margoliash.
Heuristic method used by Dayhoff to make trees and reconstruct ancestral sequences.
Jukes and Cantor propose a simple model for nucleotide evolution.
1970: Neyman analyzes a three sequence stochastic model with Jukes-Cantor substitution.
Fitch, Hartigan & Sankoff independently come up with the same algorithm for reconstructing parsimony ancestral sequences.
Sankoff treats alignment and phylogenies as one general problem: phylogenetic alignment.
Cavender and Felsenstein independently come up with the same evolutionary model where parsimony is inconsistent. Later called the "Felsenstein Zone".
1979: Kimura introduces transition/transversion bias in a nucleotide model in response to the publication of mitochondrial sequences.
1981: Felsenstein: Maximum Likelihood Model & Program DNAML (in the program package PHYLIP). Simple nucleotide model with equilibrium bias.

1981: Parsimony tree problem is shown to be NP-Complete.
1985: Felsenstein introduces bootstrapping as confidence interval on phylogenies.
1985: Hasegawa, Kishino and Yano combine transition/transversion bias with unequal equilibrium frequencies.
Bandelt and Dress introduce split decomposition as a generalization of trees.
Many authors (Sawyer, Hein, Stephens, M. Smith) try to address the problem of recombinations in phylogenies.
Gillespie's book proposes "lumpy" evolution.
Goldman & Yang + Muse & Gaut introduce codon based models.
Thorne et al., Sanderson & Huelsenbeck introduce the Almost Clock.
Rambaut (and others) make methods that can find trees with non-contemporaneous leaves.
Complex Context Dependent Models by Jensen & Pedersen. Dinucleotide and overlapping reading frames.
Major rise in the interest in phylogenetic statistical alignment.
Comparative genomics underlines the functional importance of molecular evolution.

References: Books & Journals
Joseph Felsenstein, "Inferring Phylogenies", 660 pages, Sinauer, 2003. Excellent: focus on methods and conceptual issues.
Masatoshi Nei, Sudhir Kumar, "Molecular Evolution and Phylogenetics", 336 pages, Oxford University Press Inc, USA, 2000.
R.D.M. Page, E. Holmes, "Molecular Evolution: A Phylogenetic Approach", 352 pages, Blackwell Science (UK), 1998.
Dan Graur, Li Wen-Hsiung, "Fundamentals of Molecular Evolution", 439 pages, Sinauer Associates Incorporated, 1999.
Margulis, L. and K.V. Schwartz, "Five Kingdoms", 500 pages, Freeman, 1998. A grand illustrated tour of the tree of life.
Semple, C. and M. Steel, "Phylogenetics", Oxford University Press. Very mathematical.
Journals: Journal of Molecular Evolution; Molecular Biology and Evolution; Molecular Phylogenetics and Evolution; Systematic Biology; J. of Classification.

References: www-pages Tree of Life on the WWW Software Data & Genome Centres

Next Classification of Viruses * Overhead with considerations model  > data. Example : HMM variation in rates, gamma rates. Example: Almost clock Example: Episodic clock Example: Bootstrapping. *