A short introduction to the theory and practice of phylogenetic inference
  Bill Bruno, Brian Foley, Thomas Leitner
  Theoretical Biology & Biophysics, Los Alamos National Laboratory
  www.t10.lanl.gov
Overview
  - Introduction & Alignments: Brian Foley
  - Distance-based Methods, Models & Tree Search: Bill Bruno
  - Character-based Methods, Bootstrap & Molecular Clock: Thomas Leitner
  - Hands-On Work Time
  - Group Discussion
From Data to Phylogenetic Tree
  [Figures: example input data and example phylogenetic trees. Image sources: www.mrc-lmb.cam.ac.uk/myosin/trees/trees.html and http://www.science.siu.edu/plant-biology/PLB117/GIFs/LandplantTree.gif]
Multiple Sequence Alignments
  - Choose the data set
  - Select an appropriate outgroup
      - Next closest relative to the group(s) under study
      - Still close enough to align well
  - Create the alignment
      - Get sequences in the right format (FASTA, for example)
      - Use a program (CLUSTALW, HMMER, DIALIGN)
      - Hand-edit the alignment (BioEdit, SeAl, MASE, JALview)
      - Remove uncertain columns (gaps, for example)
Pairwise Alignments
  - Typical settings include gap-open and gap-extension penalties
  - The dynamic programming algorithm is fast and efficient for a pair of sequences: it finds the best-scoring alignment in time proportional to the product of the two sequence lengths
  - BLAST (Basic Local Alignment Search Tool) does a poor job with pairs that contain many indels
  - BLAST scores depend on length as well as % identity
  Further reading:
  - http://www.answers.com/topic/sequence-alignment
  - http://www.embl-heidelberg.de/~seqanal/courses/predoc98/dynprog.gif
  - http://acer.gen.tcd.ie/~amclysag/nwswat.html
  - http://www.sbc.su.se/~arne/kurser/swell/pairwise_alignments.html
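The dynamic programming idea on this slide can be sketched in a few lines. This is a minimal illustration of global (Needleman-Wunsch style) alignment scoring, not the method of any particular program. It assumes a single linear gap penalty instead of the separate gap-open and gap-extension penalties mentioned above, and the function name and score values are hypothetical.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of strings a and b.

    Assumption: linear gap cost (every gap position costs `gap`).
    Real aligners usually use affine costs (gap open + gap extension).
    """
    # prev[j] = best score for aligning the current prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(nw_score("ACGT", "AGT"))
```

Only two rows of the score table are kept in memory, which is enough when just the score is needed; recovering the alignment itself would require keeping the full table for traceback.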
Multiple Sequence Alignments
  - NEVER blindly trust a machine-made alignment: always view the entire alignment with an alignment editor (BioEdit, SeAl, MASE, JALview) and adjust or trim questionable regions
  - Consider gaps and IUPAC ambiguity codes (R = purine, etc.) and how the phylogenetic software will treat them; stripping columns with these characters is one option
  - Gene reorganization presents a problem for genome-sized regions
  - Phylogenetic comparison can only be done on the region of overlap of all sequences in the alignment
Multiple Sequence Alignment Software
  See also: http://evolution.genetics.washington.edu/phylip/software.etc1.html
  - ProbCons: http://probcons.stanford.edu/
  - TreeAlign: Methods in Enzymology 183: 625-644
  - ClustalW: Methods in Enzymology 266: 383-402
  - MALIGN: Journal of Heredity 85: 417-418
  - HMMER: http://hmmer.wustl.edu/
  - GeneDoc: http://www.psc.edu/biomed/genedoc
  - GCG Wisconsin Package, TAAR, Ctree, DAMBE, POY, ALIGN, DNASIS, etc.
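Stripping columns that contain gaps or ambiguity codes, one of the options named on this slide, can be sketched as follows. This is an illustrative helper under stated assumptions: the function name is hypothetical, the sequences are made up, and only the unambiguous nucleotides A, C, G and T are treated as acceptable characters.

```python
def strip_columns(seqs, allowed="ACGT"):
    """Drop every alignment column in which any sequence has a character
    outside `allowed` (for example a gap '-' or an IUPAC code such as R)."""
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    if not seqs:
        return []
    keep = [
        i for i in range(len(seqs[0]))
        if all(s[i].upper() in allowed for s in seqs)
    ]
    return ["".join(s[i] for i in keep) for s in seqs]


if __name__ == "__main__":
    # Made-up example: column 3 contains a gap and the ambiguity code R.
    print(strip_columns(["AC-GT", "ACRGT", "ACTGA"]))
```

Whether stripping is the right choice depends on how the tree-building program treats these characters, as the slide notes; this sketch only shows the mechanics.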
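Distance-based methods
  - Alignment + Model -> Pairwise distance matrix -> Tree
  - With more than 3 taxa, the pairwise distances overdetermine the tree: n taxa give n(n-1)/2 distances but an unrooted tree has only 2n-3 branch lengths, so in general no tree fits exactly
  - So, find the best tree. What is "best"?
  - Ideally, the distance through the tree = the pairwise distance, for every pair
  - Optimality conditions: minimum evolution, least squares, Weighbor...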
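A small sketch of the first step (alignment to pairwise distance matrix) and of the three-taxon case, where the three distances determine the three branch lengths exactly. The function names are hypothetical, the uncorrected proportion of differing sites (p-distance) stands in for a model-based distance, and the sequences are taxa 1 to 3 from the Maximum Parsimony slide.

```python
def p_distance(a, b):
    """Proportion of sites at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs):
    """Symmetric matrix of pairwise p-distances."""
    return [[p_distance(a, b) for b in seqs] for a in seqs]


def three_taxon_branches(d12, d13, d23):
    """Branch lengths of the unique 3-taxon (star) tree.

    Solves b1 + b2 = d12, b1 + b3 = d13, b2 + b3 = d23 exactly.
    With 4 or more taxa there are more distances than branches,
    so an optimality criterion such as least squares is needed.
    """
    b1 = (d12 + d13 - d23) / 2
    b2 = (d12 + d23 - d13) / 2
    b3 = (d13 + d23 - d12) / 2
    return b1, b2, b3


if __name__ == "__main__":
    seqs = ["GATCCTAGGC", "GGTCACATGT", "GGTCATATCT"]
    d = distance_matrix(seqs)
    print(d)
    print(three_taxon_branches(d[0][1], d[0][2], d[1][2]))
```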
Substitution models
  - Evolutionary distance = rate × evolutionary time
  - A distance of 1.0 means on average one change per site
  - The estimated distance depends on the model of evolution, except for short distances (when there is never more than 1 change per site, so no homoplasy)
  - ModelTest via web: http://hcv.lanl.gov/content/hcv-db/findmodel/findmodel.html
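Correcting for multiple events
  Time T increases down the rows. d = observed differences from the starting sequence; D = number of changes that actually occurred.

  Sequence        d    D
  AATAG GAATA     0    0
  ACTAG GAATA     1    1
  ACTAG GGATA     2    2
  AATAG GGATA     1    3
  AAAAG GAATA     1    5
  AAAAA GAACA     3    7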
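The gap between observed differences (d) and actual changes (D) is what a substitution model corrects for. The sketch below recomputes the d column of the table by counting differences from the first sequence, then applies the Jukes-Cantor (JC69) formula as one example of a correction. The slide does not name a specific model, so using JC69 here is an assumption, and the function names are hypothetical.

```python
import math


def observed_differences(a, b):
    """Number of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def jc69_distance(p):
    """Jukes-Cantor corrected distance (expected changes per site)
    from the observed proportion p of differing sites.
    Only defined for p < 0.75."""
    if not 0 <= p < 0.75:
        raise ValueError("JC69 correction requires 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# Sequences from the table, in time order (spaces removed).
history = ["AATAGGAATA", "ACTAGGAATA", "ACTAGGGATA",
           "AATAGGGATA", "AAAAGGAATA", "AAAAAGAACA"]
d_column = [observed_differences(history[0], s) for s in history]

if __name__ == "__main__":
    print(d_column)
    for d in d_column:
        p = d / len(history[0])
        print(d, round(jc69_distance(p), 3))
```

With only 10 sites the corrected values are very noisy; the point is only that the corrected distance always exceeds the observed proportion once p is above zero, and that the correction grows as sequences diverge.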
Distance Tree Methods
  - Extremely fast
  - Can be unbiased and robust
  - Weighbor is the most rigorous; FastME can give excellent but biased results
  - Suitable for other problems: UPGMA
  [Figure: the methods UPGMA, NJ, BioNJ, FastME, Fitch-Margoliash and Weighbor placed on two axes, one running from "less robust" to "more reliable" and one from "faster" to "slower"]
Searching for the best tree
  - There are $(2n-5)! / (2^{n-3}(n-3)!)$ unrooted bifurcating trees for n ≥ 3 taxa, and $(2n-3)! / (2^{n-2}(n-2)!)$ rooted ones; the table below counts unrooted trees
  - Thus, for larger datasets not all trees can be tested
  - Exhaustive search
  - Heuristic search
      - Stepwise addition
      - Star decomposition
      - Branch swapping
  - Algorithmic trees
  - Other aspects of tree space
      - Random trees
      - Consensus trees
      - Unresolved trees

  # TAXA    # TREES
  2         1
  4         3
  5         15
  10        2 E6
  22        3 E23
  50        3 E74
  100       2 E182
  10 E6     5 E68667340
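The growth in the number of trees can be checked directly. The sketch below counts unrooted bifurcating trees as the product of the odd numbers up to 2n-5, which is the standard closed form and reproduces the smaller rows of the table on this slide. The function name is hypothetical, and `math.prod` requires Python 3.8 or later.

```python
import math


def num_unrooted_trees(n):
    """Number of unrooted bifurcating trees for n labelled taxa:
    1 * 3 * 5 * ... * (2n - 5), taken as 1 for n = 2 or 3."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return math.prod(range(3, 2 * n - 4, 2))


if __name__ == "__main__":
    for n in (2, 4, 5, 10, 22, 50, 100):
        print(n, f"{num_unrooted_trees(n):.0e}")
```

Python integers have arbitrary precision, so the counts for 50 or 100 taxa are exact; for millions of taxa it would be more practical to work with logarithms.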
Character-based methods
  Use the aligned sequences directly to calculate a tree according to an optimality criterion:
  - Maximum parsimony (DNAPARS, PAUP*, MEGA, etc.)
      - Discriminates among trees using parsimony-informative sites
      - Selects the tree that requires the fewest steps to explain the alignment
  - Maximum likelihood (DNAML, PAUP*, PAML, etc.)
      - Requires an explicit model of character evolution
      - Calculates the likelihood for each state at all sites
      - Selects the tree with the highest overall likelihood (least negative log-likelihood value)
Maximum Parsimony
  Alignment (taxa 1, 2, 3 and outgroup O):

  Taxon   12345 67890
  1       GATCC TAGGC
  2       GGTCA CATGT
  3       GGTCA TATCT
  O       GATAC CAGCA

  [Figure: the three possible unrooted trees (A, B, C) for taxa 1, 2, 3 and O, with character 2 (states A and G) mapped onto a tree]

  Steps required at each site, per tree:

  Tree    12345 67890   Sum
  A       02012 20212   12
  B       02012 10222   12
  C       01011 20122   10

  Maximum parsimony tree: C (10 steps)
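The step counts on this slide can be reproduced with Fitch's small-parsimony algorithm, sketched below for the slide's four sequences. The function names are hypothetical. The three topologies are written as nested tuples, and because the tree drawings did not survive, which topology corresponds to the labels A, B and C is not assumed here; only the resulting lengths (12, 12 and 10) are compared with the table.

```python
def fitch(tree, states):
    """Fitch pass for one site.

    tree   : taxon name (leaf) or a pair (left, right) of subtrees
    states : dict mapping taxon name -> character at this site
    Returns (possible_state_set, number_of_changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = fitch(tree[0], states)
    right_set, right_steps = fitch(tree[1], states)
    common = left_set & right_set
    if common:  # children agree on at least one state: no change needed
        return common, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1


def parsimony_length(tree, alignment):
    """Total number of steps over all sites of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


alignment = {
    "1": "GATCCTAGGC",
    "2": "GGTCACATGT",
    "3": "GGTCATATCT",
    "O": "GATACCAGCA",
}
topologies = {
    "(1,2),(3,O)": (("1", "2"), ("3", "O")),
    "(1,3),(2,O)": (("1", "3"), ("2", "O")),
    "(1,O),(2,3)": (("1", "O"), ("2", "3")),
}
lengths = {name: parsimony_length(t, alignment) for name, t in topologies.items()}

if __name__ == "__main__":
    for name, steps in lengths.items():
        print(name, steps)
```

The parsimony length does not depend on where the tree is rooted, so writing each unrooted topology as a pair of cherries is enough for this four-taxon case.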
Bootstrapping
Non-parametric bootstrap: bootstrap 50% majority-rule consensus tree

/---------------------------------------------------------------------------- p1.136(1)
|
+---------------------------------------------------------------------------- p1.719(2)
|                                                            /--------------- p2.135(3)
|                                                            |
|              /---------------------74----------------------+--------------- p3.105(4)
|              |                                             |
|              |                                             \--------------- p3.529(5)
|              |
|              +------------------------------------------------------------- p5.317(6)
|              +------------------------------------------------------------- p6.6767(7)
\------79------+------------------------------------------------------------- p7.6760(8)
               |                                             /--------------- p8.159(9)
               |                              /------83------+
               |                              |              \--------------- p8.822(10)
               |              /------78-------+
               |              |               |              /--------------- p11.113(12)
               |              |               \------99------+
               \------91------+                              \--------------- p11.9939(13)
                              \---------------------------------------------- p9.256(11)

Bipartitions found in one or more trees and frequency of occurrence (bootstrap support values):

         1111
1234567890123    Freq       %
------------------------------
...........**     992   99.2%
........*****     908   90.8%
........**...     833   83.3%
..***********     786   78.6%
........**.**     776   77.6%
..***........     738   73.8%
..***..*.....     428   42.8%
...**........     412   41.2%
..***.**.....     412   41.2%
..**.........     339   33.9%
.....**......     335   33.5%
..***.*......     303   30.3%
.....*..*****     292   29.2%
.....**.*****     202   20.2%
..........***     183   18.3%
..******.....     175   17.5%
.....*.*.....     164   16.4%
..*.*........     139   13.9%
........*..**     138   13.8%
.****........     124   12.4%
...**..*.....     109   10.9%
..***.*******     107   10.7%
.....********      87    8.7%
.....*.******      80    8.0%
.****..*.....      79    7.9%
..**...*.....      64    6.4%
...*...*.....      63    6.3%
.****.**.....      62    6.2%
..***..******      54    5.4%

100 groups at (relative) frequency less than 5% not shown

Non-parametric bootstrap
  - Calculate a tree under a model using a tree-building method
  - Create pseudo-replicates of the alignment by sampling its columns with replacement
  - Recalculate a tree for each pseudo-replicate
  - Compute a consensus tree of all pseudo-replicate trees
  - Tests the reliability/robustness of the model-method combination
  - Biased (usually conservative)
Parametric bootstrap
  - Tests the evolutionary model and process
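The pseudo-replicate step of the non-parametric bootstrap can be sketched as column resampling with replacement. This is an illustration only: the function names are hypothetical, the input is the four-sequence alignment from the Maximum Parsimony slide, and the tree-building and consensus steps (done by programs such as SEQBOOT, NEIGHBOR and CONSENSE in the hands-on session) are left out.

```python
import random


def bootstrap_replicate(seqs, rng):
    """Build one pseudo-replicate: draw as many columns as the alignment
    has, with replacement, and apply the same draw to every sequence."""
    n_sites = len(seqs[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(s[c] for c in columns) for s in seqs]


def support_percent(count, n_replicates):
    """Bootstrap support: share of replicate trees containing a grouping."""
    return 100 * count / n_replicates


alignment = ["GATCCTAGGC", "GGTCACATGT", "GGTCATATCT", "GATACCAGCA"]

if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the run is repeatable
    for _ in range(3):
        print(bootstrap_replicate(alignment, rng))
    # A grouping found in 992 of 1000 replicate trees, as in the table.
    print(support_percent(992, 1000))
```

Each replicate has the same dimensions as the original alignment, but some columns appear several times and others not at all, which is what lets the spread of the resulting trees reflect how strongly the data support each grouping.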
The molecular clock
  - Assumes ultrametric data/tree
  - Genetic distance-time relationships
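Whether a distance matrix is ultrametric can be tested with the three-point condition: for every triple of taxa, the two largest of the three pairwise distances must be equal. The sketch below is a minimal check with a hypothetical function name and made-up matrices; real data never fit exactly, so the tolerance is an assumption to be tuned.

```python
from itertools import combinations


def is_ultrametric(d, tol=1e-9):
    """True if the symmetric distance matrix d satisfies the
    three-point condition for every triple of taxa."""
    for i, j, k in combinations(range(len(d)), 3):
        small, middle, large = sorted((d[i][j], d[i][k], d[j][k]))
        if large - middle > tol:
            return False
    return True


# Made-up examples: the first matrix is clock-like, the second is not.
clocklike = [[0, 2, 6],
             [2, 0, 6],
             [6, 6, 0]]
not_clocklike = [[0, 2, 6],
                 [2, 0, 7],
                 [6, 7, 0]]

if __name__ == "__main__":
    print(is_ultrametric(clocklike), is_ultrametric(not_clocklike))
```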
The molecular clock
  - Evolutionary model important
  - Rate variation
  [Figure: plot of genetic distance against time]
Hands-on
  - Open file in BioEdit
  - Calculate distance-matrix tree
  - Manually check & correct alignment
  - Calculate distance-matrix tree
      - Calculate matrix with DNADIST
      - Calculate tree with NEIGHBOR
  - Calculate character-based tree
      - DNAPARS or DNAML
  - Calculate bootstrap support
      - Use SEQBOOT, DNADIST, NEIGHBOR, CONSENSE
  - View tree in TreeView
Group discussion
  - Pros & cons
  - Where to spend your time & effort
  - What else is available