PAML: Phylogenetic Analysis by Maximum Likelihood
Ziheng Yang
Department of Biology, University College London
http://abacus.gene.ucl.ac.uk/

Plan
- Overview of PAML: things it can do, and especially things that other programs don’t do.
- An example (detecting amino acids under positive selection).
- The trouble with sliding windows.

PAML programs, currently in ver 3.15
- PAML programs are written in ANSI C.
- Executables are provided for MS Windows and Mac OSX. Source code can be compiled for Unix and other platforms.
- Free for academics (and everybody else).
- Sequential, not parallelized.
- Old-style command-line programs, with no GUI, no menu, no mice.
- Yang’s theorem: Every version of PAML has bugs.

PAML programs

Program    Description
baseml     ML under nucleotide-based models
basemlg    Continuous-gamma, for bases (Yang 1993)
codeml     ML for amino acids & codons
evolver    The world’s best named simulation program
yn00       dN and dS estimation using Y&N2000
chi2       $\chi^2$ critical values and p values
pamp       Parsimony calculations (Yang and Kumar 1996)
mcmctree   Species divergence times, soft bounds, relaxed clocks (Yang & Rannala 2006)

PAML docs & examples
- Documentation is in doc/: pamlDOC.pdf, pamlFAQ.pdf, pamlHistory.txt
- examples/ are provided with README files.
- Apologies for poor support. Bug reports can come to my mailbox. Questions should go to the PAML discussion group: http://www.rannala.org/gsf

Major weaknesses
- Poor tree search
- Poor user interface
Major strength
- Many models implemented in the likelihood framework.

Uses of PAML (i)
Maximum likelihood parameter estimation and likelihood ratio tests of hypotheses (such as the molecular clock or rate variation among sites) under a number of substitution models based on nucleotides, amino acids, and codons.
- Most of the nucleotide-based models are available in PAUP.
- Most of the models are available in MrBayes?

Uses of PAML (ii)
Likelihood (empirical Bayes) reconstruction of ancestral nucleotide, amino acid, or codon sequences. This is the same as parsimony reconstruction except that it accounts for different branch lengths and different rates of change between states.
- Yang, Z., S. Kumar, and M. Nei. 1995. Genetics 141:1641-1650.
- Koshi, J. M., and R. A. Goldstein. 1996. J. Mol. Evol. 42:313-320.
- Pupko, T., I. Pe’er, R. Shamir, and D. Graur. 2000. Mol. Biol. Evol. 17:890-896.

Uses of PAML (iii)
Combined analysis of heterogeneous data sets. MrBayes has implemented more powerful models of this kind (Nylander et al. 2004. Syst. Biol. 53:47-67). These should make the following debates unnecessary:
- combined analysis (total evidence) vs. separate analysis
- supertree vs. supermatrix
References:
- Yang, Z. 1996. Maximum-likelihood models for combined analyses of multiple sequence data. J. Mol. Evol. 42:587-596.
- Pupko, T., D. Huchon, Y. Cao et al. 2002. Combining multiple data sets in a likelihood analysis: which models are the best? Mol. Biol. Evol. 19:2294-2307.

Uses of PAML (iv)
- Likelihood ratio test of the clock and likelihood estimation of species divergences under clock and relaxed-clock models (baseml & codeml).
- Bayesian estimation of species divergence times using soft bounds and relaxed molecular clocks (mcmctree), similar to Jeff Thorne’s multidivtime.
References:
- Rambaut, A., and L. Bromham. 1998. Mol. Biol. Evol. 15:442-448.
- Yoder, A. D., and Z. Yang. 2000. Mol. Biol. Evol. 17:1081-1090.
- Yang, Z., and A. D. Yoder. 2003. Syst. Biol. 52:705-716.
- Yang, Z., and B. Rannala. 2006. Mol. Biol. Evol. 23:212-226.
- Rannala, B., and Z. Yang. In preparation.

Uses of PAML (v): Codon substitution models & detection of selection in protein-coding genes (codeml)
- Branch models to test positive selection on lineages on the tree (Yang 1998. Mol. Biol. Evol. 15:568-573).
- Site models to test positive selection affecting individual sites (Nielsen & Yang. 1998. Genetics 148:929-936; Yang et al. 2000. Genetics 155:431-449).
- Branch-site models to detect positive selection at a few sites on a particular lineage (Yang & Nielsen. 2002. Mol. Biol. Evol. 19:908-917; Yang et al. 2005. Mol. Biol. Evol. 22:1107-1118; Zhang, J., R. Nielsen, and Z. Yang. 2005. Mol. Biol. Evol. 22:2472-2479).

MacCallum, C., and E. Hill. 2006. Being positive about selection. PLoS Biol 4:e87.
PLoS Biol is receiving and rejecting too many manuscripts that use the M&K test and paml/codeml to detect positive selection. Their main criterion right now is that the manuscript should include experimental verification to justify publication in such a high-profile journal.

LRT of amino acid sites under positive selection
$H_0$: there are no sites at which $\omega > 1$
$H_1$: there are such sites
Compare $2\Delta\ell = 2(\ell_1 - \ell_0)$ with a $\chi^2$ distribution (Nielsen & Yang 1998. Genetics 148:929-936; Yang, Nielsen, Goldman & Pedersen 2000. Genetics 155:431-449)

Models M1a & M2a

M1a (neutral)
Site class:      0                1
Proportion:      $p_0$            $p_1$
$\omega$ ratio:  $\omega_0 < 1$   $\omega_1 = 1$

M2a (selection)
Site class:      0                1                2
Proportion:      $p_0$            $p_1$            $p_2$
$\omega$ ratio:  $\omega_0 < 1$   $\omega_1 = 1$   $\omega_2 > 1$

Modified from Nielsen & Yang (1998), where $\omega_0 = 0$ is fixed.
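As a rough illustration of how the two site models might be requested from codeml, the sketch below builds the text of a control file in Python. The option names (`seqfile`, `NSsites`, `CodonFreq`, and so on) and the meaning of the codes are given here as understood from the PAML documentation and should be checked against pamlDOC.pdf for the version in use; the file names and the helper `codeml_site_model_ctl` are hypothetical.

```python
def codeml_site_model_ctl(seqfile, treefile, outfile):
    """Return control-file text asking codeml to fit M1a and M2a.

    Option names and codes are assumptions to verify against pamlDOC.pdf.
    """
    options = [
        ("seqfile", seqfile),    # codon alignment (hypothetical file name)
        ("treefile", treefile),  # tree topology (hypothetical file name)
        ("outfile", outfile),    # main result file
        ("runmode", 0),          # 0: use the user-supplied tree
        ("seqtype", 1),          # 1: codon sequences
        ("CodonFreq", 2),        # 2: F3x4 codon frequencies
        ("model", 0),            # 0: one omega ratio for all branches
        ("NSsites", "1 2"),      # site models M1a and M2a in one run
        ("icode", 0),            # 0: universal genetic code
        ("fix_kappa", 0),        # 0: estimate kappa
        ("kappa", 2),            # initial value of kappa
        ("fix_omega", 0),        # 0: estimate omega
        ("omega", 0.4),          # initial value of omega
    ]
    lines = [f"{name:>10s} = {value}" for name, value in options]
    return "\n".join(lines) + "\n"


ctl_text = codeml_site_model_ctl("mhc.nuc", "mhc.trees", "mlc")
print(ctl_text)
```

The log likelihoods reported for the two models in the result file are what the likelihood ratio test compares.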

Human MHC Class I data: 192 alleles, 270 codons

Model             $\ell$       Parameter estimates
M1a (neutral)     −7,490.99    $p_0 = 0.830$, $\omega_0 = 0.041$; $p_1 = 0.170$, $\omega_1 = 1$
M2a (selection)   −7,231.15    $p_0 = 0.776$, $\omega_0 = 0.058$; $p_1 = 0.140$, $\omega_1 = 1$; $p_2 = 0.084$, $\omega_2 = 5.389$

Likelihood ratio test of positive selection: $2\Delta\ell = 2 \times 259.84 = 519.68$, $P < 0.001$, d.f. = 2
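The arithmetic of this test can be checked with a few lines of Python. This is a minimal sketch using only the two log likelihoods reported on the slide; it relies on the fact that a chi-square variable with 2 d.f. has upper-tail probability exp(-x/2), so no statistics library is needed. The function names are illustrative only.

```python
import math


def lrt_statistic(lnl_null, lnl_alt):
    """Likelihood ratio statistic 2 * (lnL_alt - lnL_null)."""
    return 2.0 * (lnl_alt - lnl_null)


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variable with 2 d.f.

    With 2 d.f. the chi-square distribution is exponential with mean 2,
    so the tail probability is exp(-x / 2).
    """
    if x <= 0:
        return 1.0
    return math.exp(-x / 2.0)


lnl_m1a = -7490.99  # M1a (neutral), from the slide
lnl_m2a = -7231.15  # M2a (selection), from the slide

stat = lrt_statistic(lnl_m1a, lnl_m2a)
p_value = chi2_sf_df2(stat)
print(f"2*delta_lnL = {stat:.2f}, P = {p_value:.3g}")
```

For comparison, the 5% critical value for 2 d.f. is about 5.99, so a statistic above 500 rejects M1a by a very wide margin.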

Posterior probabilities for MHC (M2a)
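For orientation, the posterior probability that a site belongs to a given class can be written with Bayes' rule in the empirical Bayes approach of Nielsen & Yang (1998). The notation below is an assumption made for illustration: $x_h$ denotes the data at site $h$, $f(x_h \mid \omega_k)$ the probability of those data under the $\omega$ ratio of site class $k$, and the parameters are replaced by their maximum likelihood estimates.

```latex
P(\text{class } k \mid x_h)
  = \frac{p_k \, f(x_h \mid \omega_k)}
         {\sum_{j=0}^{2} p_j \, f(x_h \mid \omega_j)},
  \qquad k = 0, 1, 2 .
```

Under M2a, a site with a high posterior probability for class 2 (where $\omega_2 > 1$) is a candidate for positive selection.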

25 sites identified under M2a

There are a few wrong ways to detect selection, one of which is sliding windows.

Sliding window analysis (mouse-rat BRCA1)

Sliding window analysis (fake data)

Two trends in sliding window analysis
- Both dS and dN fluctuate smoothly (because consecutive windows overlap).
- dS fluctuates more than dN (because there are fewer silent than replacement sites).
Sliding windows may be useful for displaying trends that are known to exist, but are misleading if used to detect trends.
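The first trend can be reproduced with made-up data. The sketch below is not the analysis behind the slides; it assumes independent per-codon indicators with the same rate everywhere (so there is no real trend), and the window width of 50 and the rate of 0.2 are arbitrary choices. It then compares the lag-1 autocorrelation of the raw values with that of the overlapping window means.

```python
import random


def window_means(values, width):
    """Means of all overlapping windows of `width` consecutive values (step 1)."""
    total = sum(values[:width])
    means = [total / width]
    for i in range(width, len(values)):
        total += values[i] - values[i - width]  # slide the window by one site
        means.append(total / width)
    return means


def lag1_autocorr(xs):
    """Sample correlation between each value and its right-hand neighbour."""
    n = len(xs)
    mean = sum(xs) / n
    num = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den


rng = random.Random(1)
# Fake data: each codon "changes" independently with probability 0.2.
sites = [1.0 if rng.random() < 0.2 else 0.0 for _ in range(1000)]
smooth = window_means(sites, 50)

raw_r = lag1_autocorr(sites)
win_r = lag1_autocorr(smooth)
print(f"lag-1 autocorrelation: raw sites {raw_r:.2f}, window means {win_r:.2f}")
```

Neighbouring windows share 49 of their 50 sites, so the window means rise and fall smoothly even though the underlying sites are independent; peaks and valleys in such a plot are not evidence of variable selective pressure.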

Orthodox statistical analysis
- formulate a biological hypothesis
- design the experiment & collect data
- test whether the data are compatible with the hypothesis
The more common way of data analysis in biology
- a large amount of data, no a priori hypothesis
- filter and plot data to identify “unexpected” patterns
- test the patterns using statistical tests

Acknowledgment (sliding-window analysis) Karl Schmid
