Lecture 3 Molecular Evolution and Phylogeny
Facts on the molecular basis of life: Every life form is genome based. Genomes evolve. There are large numbers of apparently homologous intra-genomic (paralog) and inter-genomic (ortholog) genes. Some genes, especially those related to the functions of transcription and translation, are common to ALL life forms. The closer two organisms seem to be phylogenetically, the more similar their genomes and corresponding genes are.
Central dogma of molecular biology: DNA -> RNA -> Protein
Basic assumptions of molecular evolution: More closely related organisms have more similar genomes. Highly similar genes are homologs (have the same ancestor). A universal ancestor exists for all life forms. Molecular differences in homologous genes (or protein sequences) are positively correlated with evolution time. Phylogenetic relations can be expressed by a dendrogram (a "tree").
The five steps in phylogenetics dancing. Modified from Hillis et al. (1993), Methods in Enzymology 224. [Flowchart: Sequence data -> Align sequences -> Phylogenetic signal? Patterns -> evolutionary processes? -> Choose a method -> Calculate or estimate best-fit tree -> Test phylogenetic reliability. Distance methods: distance calculation (which model?), then LS, ME, NJ. Character-based methods: MP (weighting of sites, changes?), ML and MB (model?). Branch labels: single tree / optimality criterion.]
Why protein phylogenies? For historical reasons - first sequences... Most genes encode proteins... To study protein structure, function and evolution. Comparing DNA and protein based phylogenies can be useful: Different genes - e.g. 18S rRNA versus EF-2 protein. Protein encoding gene - codons versus amino acids.
Proteins were the first molecular sequences to be used for phylogenetic inference. Fitch and Margoliash (1967) Construction of phylogenetic trees. Science 155,
Most of what follows is taken from: Statistical Physics and Biological Information, Institute of Theoretical Physics, University of California at Santa Barbara, 2001 May 7
Understanding trees [Figure: rooted tree drawn against a time axis; labels: Root, 30 Mya, 22 Mya, 7 Mya; two drawings of the tree marked "same as"]
Understanding trees #2
Understanding trees #3
Difference in homologous sequences is a measure of evolution time. [Figure: part of a multiple sequence alignment of mitochondrial small subunit rRNA. Full length is ~ primate species with mouse as outgroup (Primates).] Change the similarity matrix to a distance matrix: d = 1 - S
From the alignment, construct pairwise distances* (*Note: alignment is not the only way to compute distance)
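The conversion d = 1 - S can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the lecture: the function names, the rule of skipping columns that contain a gap, and the three toy sequences are all assumptions made for the example.

```python
from itertools import combinations


def similarity(seq_a, seq_b):
    """Fraction of gap-free aligned columns at which two sequences agree (S)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: columns with a gap are ignored
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return matches / compared


def distance_matrix(alignment):
    """Pairwise distances d = 1 - S for a dict {name: aligned sequence}."""
    names = list(alignment)
    dist = {name: {name: 0.0} for name in names}
    for x, y in combinations(names, 2):
        d = 1.0 - similarity(alignment[x], alignment[y])
        dist[x][y] = d
        dist[y][x] = d
    return dist


# Hypothetical toy alignment, not data from the lecture.
toy_alignment = {
    "taxon_A": "ACGTACGTAC",
    "taxon_B": "ACGTACGTAT",
    "taxon_C": "ACGAACG-AT",
}
matrix = distance_matrix(toy_alignment)
for name in toy_alignment:
    print(name, [round(matrix[name][other], 3) for other in toy_alignment])
```

The resulting matrix is symmetric with zeros on the diagonal, which is the form the clustering methods later in the lecture take as input.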
Models of sequence evolution
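Jukes-Cantor (minimal) Model: All substitution rates = $\alpha$; all base frequencies = 1/4. [Figure: substitutions between bases, e.g. A and C] The fraction of sites that differ between two sequences separated by a total time 2t is $3 P_{ij}(2t)$ with $i \neq j$.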
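Derivation of Jukes-Cantor formula
Let the probability of a site being a given base at time t be $P(t)$, and let $\alpha$ be the substitution rate to each of the other bases.
After an elapsed time $\delta t$, the loss from mutation to the other three bases is $3\alpha\,\delta t\,P(t)$.
The gain from the other bases is $\alpha\,\delta t\,(1 - P(t))$.
Hence $P(t + \delta t) = P(t) - 3\alpha\,\delta t\,P(t) + \alpha\,\delta t\,(1 - P(t))$, so $dP(t)/dt = \alpha - 4\alpha P(t)$.
Write $P(t) = a \exp(-bt) + c$; the solution is $b = 4\alpha$, $c = 1/4$, so $P(t) = a \exp(-4\alpha t) + 1/4$.
If $P(0) = 1$, then $a = 3/4$. If $P(0) = 0$, then $a = -1/4$.
Finally $P_{same}(t) = 1/4 + (3/4) \exp(-4\alpha t)$ and $P_{change}(t) = 1/4 - (1/4) \exp(-4\alpha t)$.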
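A short numerical sketch of the Jukes-Cantor result follows. It assumes a single substitution rate, called `alpha` here, from each base to each of the other three, and it uses the standard distance correction d = -(3/4) ln(1 - 4p/3), where p is the observed fraction of differing sites. The function names and the example values of `alpha` and `t` are illustrative assumptions only.

```python
import math


def jc_prob_same(alpha, t):
    """Probability that a site shows the same base after time t."""
    return 0.25 + 0.75 * math.exp(-4.0 * alpha * t)


def jc_prob_change(alpha, t):
    """Probability that a site shows one specific other base after time t."""
    return 0.25 - 0.25 * math.exp(-4.0 * alpha * t)


def jc_distance(p):
    """Expected substitutions per site, given the observed differing fraction p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75 for the correction to exist")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Assumed example: two lineages that each evolved for time t since their common
# ancestor, so the total separation is 2t.
alpha, t = 0.01, 5.0
observed = 3.0 * jc_prob_change(alpha, 2.0 * t)  # fraction of differing sites
estimated = jc_distance(observed)                # recovers 2 * (3 * alpha * t)
print(observed, estimated)
```

The correction matters because the observed fraction saturates at 3/4 for long times, while the number of substitutions that actually occurred keeps growing linearly with t.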
Hasegawa-Kishino-Yano model: has a more general substitution rate. Transition: A <-> G or C <-> T. Transversion: e.g. A <-> T or C <-> G (any exchange between a purine and a pyrimidine).
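The distinction between the two substitution classes can be made concrete with a small counter. This is a sketch under simple assumptions: sequences are already aligned, columns containing anything other than A, C, G or T are skipped, and the function names and the example pair of sequences are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify_substitution(base_from, base_to):
    """Return 'identical', 'transition' or 'transversion' for two bases."""
    a, b = base_from.upper(), base_to.upper()
    if a not in BASES or b not in BASES:
        raise ValueError("expected one of A, C, G, T")
    if a == b:
        return "identical"
    # A transition stays within purines or within pyrimidines.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_substitutions(seq_a, seq_b):
    """Count transitions and transversions between two aligned sequences."""
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in BASES or b not in BASES:
            continue  # assumption: gaps and ambiguous symbols are skipped
        kind = classify_substitution(a, b)
        if kind != "identical":
            counts[kind] += 1
    return counts


# Hypothetical example pair, not data from the lecture.
example_counts = count_substitutions("AGCTA", "GGTAA")
print(example_counts)
```

Models such as Hasegawa-Kishino-Yano allow these two classes to occur at different rates, which is why counting them separately is a natural first check on an alignment.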
Part of the Jukes-Cantor distance matrix for the primate example (is much larger; for outgroup). The matrix will be used for clustering methods.
Clustering
UPGMA
Neighbor-Joining Method
N-J Method produces an Unrooted, Additive tree
Neighbor-Joining Method: An Example. What is required for the neighbor-joining method? A distance matrix. 0. Distance Matrix
1. First Step: PAM distance 3.3 (Human - Monkey) is the minimum, so we join Human and Monkey into Mon-Hum and calculate the new distances. [Figure: Monkey and Human joined as Mon-Hum; Spinach, Mosquito and Rice still unjoined]
2. Calculation of New Distances: After we have joined two species in a subtree, we have to compute the distance from every other node to the new subtree. We do this with a simple average of distances: Dist[Spinach, MonHum] = (Dist[Spinach, Monkey] + Dist[Spinach, Human])/2 = ( )/2 = [Figure: Spinach connected to the Mon-Hum subtree of Monkey and Human]
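The averaging rule for the new subtree can be written as a one-line function. In this sketch only the Human - Monkey value of 3.3 comes from the slides; the two Spinach distances are hypothetical stand-ins, because the values on the original slide were not recovered. The function name and the `frozenset` keys are also assumptions of the example.

```python
def distance_to_joined(dist, node, left, right):
    """Simple average of the distances from `node` to the two joined members."""
    return (dist[frozenset((node, left))] + dist[frozenset((node, right))]) / 2


# Distances keyed by unordered pair. 3.3 is the value quoted in the slides;
# the Spinach values are HYPOTHETICAL placeholders for illustration only.
dist = {
    frozenset(("Human", "Monkey")): 3.3,
    frozenset(("Spinach", "Monkey")): 92.0,
    frozenset(("Spinach", "Human")): 90.0,
}

spinach_to_monhum = distance_to_joined(dist, "Spinach", "Monkey", "Human")
print(spinach_to_monhum)
```

Keying the table by `frozenset` makes the lookup independent of the order in which the two names are given, which mirrors the symmetry of the distance matrix.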
3. Next Cycle [Figure: Mosquito joined to Mon-Hum as Mos-(Mon-Hum); Spinach and Rice still unjoined]
4. Penultimate Cycle [Figure: Spinach and Rice joined as Spin-Rice, alongside Mos-(Mon-Hum)]
5. Last Joining [Figure: Spin-Rice joined to Mos-(Mon-Hum) as (Spin-Rice)-(Mos-(Mon-Hum))]
The result: Unrooted Neighbor-Joining Tree [Figure: unrooted tree with leaves Human, Monkey, Mosquito, Rice, Spinach]
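The whole worked example (join the closest pair, average the distances, repeat) can be sketched as a short loop. This follows the simplified join-and-average procedure shown in the slides; the full neighbor-joining algorithm of Saitou and Nei also corrects each pairwise distance for how far the two nodes are from all the others before choosing a pair, and that correction is not implemented here. Apart from the Human - Monkey distance of 3.3, every number below is a hypothetical value chosen so that the joining order matches the slides.

```python
def cluster_by_averaging(names, dist):
    """Repeatedly join the closest pair; return the joins as a nested tuple.

    `dist` maps frozenset({a, b}) to the distance between nodes a and b.
    """
    d = dict(dist)          # work on a copy; joined nodes are added as we go
    active = list(names)
    while len(active) > 1:
        # Find the pair of active nodes with the smallest distance.
        candidates = [(a, b) for i, a in enumerate(active) for b in active[i + 1:]]
        left, right = min(candidates, key=lambda pair: d[frozenset(pair)])
        merged = (left, right)
        remaining = [n for n in active if n != left and n != right]
        # Distance from each remaining node to the new subtree: simple average.
        for other in remaining:
            d[frozenset((other, merged))] = (
                d[frozenset((other, left))] + d[frozenset((other, right))]
            ) / 2
        active = remaining + [merged]
    return active[0]


# HYPOTHETICAL distances, except Human - Monkey = 3.3 from the slides.
pairs = {
    ("Human", "Monkey"): 3.3,
    ("Human", "Mosquito"): 30.0, ("Monkey", "Mosquito"): 32.0,
    ("Human", "Spinach"): 90.0, ("Monkey", "Spinach"): 92.0,
    ("Human", "Rice"): 86.0, ("Monkey", "Rice"): 88.0,
    ("Mosquito", "Spinach"): 100.0, ("Mosquito", "Rice"): 96.0,
    ("Spinach", "Rice"): 36.0,
}
dist = {frozenset(pair): value for pair, value in pairs.items()}
taxa = ["Human", "Monkey", "Mosquito", "Spinach", "Rice"]

tree = cluster_by_averaging(taxa, dist)
print(tree)
```

The nested tuple records only the order of joins, not branch lengths, so it corresponds to the topology drawn in the final slide rather than to a fully specified additive tree.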
Bootstrapping
Why are trees not exact?
Pairwise distances usually not tree-like
Searching tree space
Maximum likelihood criterion
Parsimony criterion
Parsimony with molecular data
Parsimony criterion Paul Higgs:
Is the best tree much better than others? L : likelihood at nodes
Use Maximum Likelihood to rank alternate trees [Figure annotations: yes; same topology; NJ tree is 2nd best]
Use Parsimony to rank alternate trees [Figure annotations: different topology; parsimony differentiates weakly]
Quartet puzzling
MCMC: Markov chain Monte Carlo
Topology probabilities according to MCMC
Clade probabilities compared across tree methods. The NJ method is very fast and close to being the best.
Lecture and Book. Lecture by Paul Higgs: online.itp.ucsb.edu/online/infobio01/higgs/ (see online.itp.ucsb.edu/online/infobio01/ for many lectures). Book by Wen-Hsiong Li 李文雄: "Molecular Evolution" (Sinauer Associates, 1997).
Some web sites on Molecular Evolution. CMS Molecular Biology Resource: Phylogeny - Molecular Evolution. The Tree of Life Web Project: tolweb.org/tree/phylogeny.html. Web Resources in Molecular Evolution and Systematics: darwin.eeb.uconn.edu/molecular-evolution.html.
Some web sites on ClustalW. On-line service: clustalw.genome.ad.jp/. Software: ftp-igbmc.u-strasbg.fr/pub/ClustalX/ and ftp-igbmc.u-strasbg.fr/pub/ClustalW/.