Bill Bruno Brian Foley Thomas Leitner Theoretical Biology & Biophysics

Presentation transcript:

A short introduction to the theory and practice of phylogenetic inference
Bill Bruno, Brian Foley, Thomas Leitner
Theoretical Biology & Biophysics, Los Alamos National Laboratory
www.t10.lanl.gov

Overview
  Introduction & Alignments - Brian Foley
  Distance-based Methods, Models & Tree search - Bill Bruno
  Character-based Methods, Bootstrap & Molecular Clock - Thomas Leitner
  Hands-On Work Time
  Group Discussion

From Data: To Phylogenetic Tree:
  [Figures: example data and example phylogenetic trees]
  Image sources:
    www.mrc-lmb.cam.ac.uk/myosin/trees/trees.html
    http://www.science.siu.edu/plant-biology/PLB117/GIFs/LandplantTree.gif

Multiple Sequence Alignments
  Choose the data set
  Select an appropriate outgroup
    Next closest relative to group(s) under study
    Still close enough to align well
  Create the alignment
    Get sequences in the right format (FASTA, for example)
    Use a program (CLUSTALW, HMMER, DIALIGN)
    Hand-edit the alignment (BioEdit, SeAl, MASE, Jalview)
    Remove uncertain columns (gaps, for example)

Pairwise Alignments
  Typical settings include gap open and gap extension penalties
  Dynamic Programming Algorithm is fast and efficient
  BLAST (Basic Local Alignment Search Tool) does a poor job with pairs that contain many indels
  BLAST scores depend on length as well as % identity
  Links:
    http://www.answers.com/topic/sequence-alignment
    http://www.embl-heidelberg.de/~seqanal/courses/predoc98/dynprog.gif
    http://acer.gen.tcd.ie/~amclysag/nwswat.html
    http://www.sbc.su.se/~arne/kurser/swell/pairwise_alignments.html
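The dynamic programming idea on the slide above can be sketched as a global (Needleman-Wunsch style) alignment score. This is a minimal illustration, not the presenters' code: it assumes a single linear gap penalty rather than separate gap open and gap extension penalties, and the function name and the scoring values (match, mismatch, gap) are hypothetical choices.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of strings a and b (linear gap penalty).

    Assumption: one 'gap' value per gapped position; real tools usually
    distinguish gap opening from gap extension.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,               # gap in b
                curr[j - 1] + gap,           # gap in a
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(global_alignment_score("ACGT", "ACT"))
```

The table is filled row by row, so the work grows with the product of the two sequence lengths, which is why the pairwise case is fast while aligning many sequences at once is not.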

Multiple Sequence Alignments
  http://evolution.genetics.washington.edu/phylip/software.etc1.html
  NEVER blindly trust a machine-made alignment: always view the entire alignment with an alignment editor (BioEdit, SeAl, MASE, Jalview) and adjust or trim questionable regions
  Consider gaps and IUPAC ambiguity codes (R = purine, etc.) and how the phylogenetic software will treat them; stripping columns with these characters is one option
  Gene reorganization presents a problem for genome-sized regions
  Phylogenetic comparison can only be done on the region of overlap of all sequences in the alignment

Multiple Sequence Alignment Software
  ProbCons    http://probcons.stanford.edu/
  TreeAlign   Methods in Enzymology 183: 625-644
  ClustalW    Methods in Enzymology 266: 383-402
  MALIGN      Journal of Heredity 85: 417-418
  HMMER       http://hmmer.wustl.edu/
  GeneDoc     http://www.psc.edu/biomed/genedoc
  GCG Wisconsin Package
  TAAR
  Ctree
  DAMBE
  POY
  ALIGN
  DNASIS
  Etc.
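Stripping columns that contain gaps or IUPAC ambiguity codes, mentioned above as one option, can be sketched in a few lines. This is an illustrative sketch only: the function name, the dictionary-based alignment format, and the choice to keep only columns made entirely of A, C, G, T are assumptions, not something the slides prescribe.

```python
UNAMBIGUOUS = set("ACGT")  # assumption: DNA data; anything else is dropped


def strip_uncertain_columns(alignment):
    """Return a copy of the alignment keeping only fully unambiguous columns.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    """
    if not alignment:
        return {}
    names = list(alignment)
    seqs = [alignment[name].upper() for name in names]
    if len({len(seq) for seq in seqs}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    # Keep a column only if every sequence has A, C, G or T at that position.
    keep = [i for i in range(len(seqs[0]))
            if all(seq[i] in UNAMBIGUOUS for seq in seqs)]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in zip(names, seqs)}


if __name__ == "__main__":
    toy = {"a": "ACG-T", "b": "ACRTT", "c": "A-GTT"}
    print(strip_uncertain_columns(toy))
```

Whether to strip, recode, or keep such columns depends on how the downstream phylogenetic program treats them, so this is a starting point rather than a rule.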

Distance based methods
  Alignment + Model -> Pairwise Distance Matrix -> Tree
  With more than 3 taxa, tree distances are overdetermined, so find the best tree
  What is "best"? Ideally, distance through tree = pairwise distances
  Optimality conditions: minimum evolution, least squares, Weighbor...
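Why three taxa are the special case can be shown with a small worked example. For three taxa joined at a single internal node, the three pairwise distances determine the three branch lengths exactly; from four taxa on there are more pairwise distances than branches, so in general the fit is only approximate. The sketch below is illustrative, and the function names are hypothetical.

```python
def three_taxon_branches(d_ab, d_ac, d_bc):
    """Branch lengths (a, b, c) of the star tree joining taxa A, B, C.

    Solves d_ab = a + b, d_ac = a + c, d_bc = b + c exactly.
    """
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c


def distances_and_branches(n):
    """Number of pairwise distances and of branches in an unrooted binary tree."""
    return n * (n - 1) // 2, 2 * n - 3


if __name__ == "__main__":
    print(three_taxon_branches(0.3, 0.5, 0.6))
    for n in (3, 4, 5, 10):
        print(n, distances_and_branches(n))
```

For n = 3 both counts are 3; for n = 4 there are 6 distances but only 5 branches, which is where an optimality criterion such as least squares or minimum evolution comes in.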

Substitution models
  Evolutionary Distance = rate × evolutionary time
  Distance of 1.0 means on average one change per site
  Depends on the model of evolution, except for short distances (when there is never more than 1 change per site, there is no homoplasy)
  http://hcv.lanl.gov/content/hcv-db/findmodel/findmodel.html : ModelTest via web
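How a model turns an observed proportion of differing sites into an evolutionary distance can be illustrated with the Jukes-Cantor correction. The slides do not name a specific model; Jukes-Cantor is used here only because it is the simplest one, and the function name is hypothetical.

```python
import math


def jukes_cantor_distance(p):
    """Expected substitutions per site under Jukes-Cantor.

    p: observed proportion of sites that differ between two sequences.
    Assumption: equal base frequencies and equal substitution rates.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    for p in (0.01, 0.1, 0.3, 0.6):
        print(p, round(jukes_cantor_distance(p), 4))
```

For small p the corrected distance is almost equal to p, which matches the slide's point that the model hardly matters at short distances; as p grows, the correction grows much faster than p.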

Correcting for multiple events
  (time T increases down the table; d = observed differences from the starting sequence, D = actual number of events)

  Sequence      d   D
  AATAG GAATA   0   0
  ACTAG GAATA   1   1
  ACTAG GGATA   2   2
  AATAG GGATA   1   3
  AAAAG GAATA   1   5
  AAAAA GAACA   3   7
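The observed-difference column d in the table above can be reproduced by counting mismatches against the starting sequence. A minimal sketch, with the sequences copied from the slide (spaces removed) and a hypothetical function name; the D column cannot be recomputed this way, because the extra events are exactly the ones that leave no trace in the final sequences.

```python
def observed_differences(ancestor, descendant):
    """Number of sites at which two equal-length sequences differ."""
    if len(ancestor) != len(descendant):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(ancestor, descendant) if x != y)


# Sequences from the slide, in order of increasing time.
SEQUENCES = [
    "AATAGGAATA",
    "ACTAGGAATA",
    "ACTAGGGATA",
    "AATAGGGATA",
    "AAAAGGAATA",
    "AAAAAGAACA",
]

if __name__ == "__main__":
    start = SEQUENCES[0]
    for seq in SEQUENCES:
        print(seq, observed_differences(start, seq))
```

The fourth row shows the effect: a site that changed and then changed back counts as zero observed differences but two real events.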

Distance Tree Methods
  Extremely fast
  Can be unbiased, robust
  Weighbor is most rigorous, but FastME can give excellent, but biased results
  Suitable for other problems: UPGMA
  [Figure: Weighbor, Fitch-Margoliash, BioNJ, FastME and NJ arranged along two axes, "More reliable" to "Less robust" and "Faster" to "Slower"]

Searching for the best tree
  There are $(2n-5)! / (2^{n-3}(n-3)!)$ unrooted bifurcating trees for n taxa (n >= 3); this is the formula that matches the counts in the table below
  Thus, for larger datasets not all trees can be tested
  Exhaustive search
  Heuristic search
    Stepwise addition
    Star decomposition
    Branch swapping
  Algorithmic trees
  Other aspects of tree space
    Random trees
    Consensus trees
    Unresolved trees

  # TAXA    # TREES
  2         1
  4         3
  5         15
  10        2 E6
  22        3 E23
  50        3 E74
  100       2 E182
  10 E6     5 E68667340
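The growth in the number of trees can be checked directly. The small counts in the table (3 trees for 4 taxa, 15 for 5, about 2 E6 for 10) are consistent with the number of unrooted bifurcating trees, which is the product of the odd numbers 3 x 5 x ... x (2n - 5). The sketch below assumes that interpretation; the function name is hypothetical.

```python
def unrooted_tree_count(n):
    """Number of unrooted, fully bifurcating trees on n labeled taxa.

    Computed as the product 3 * 5 * ... * (2n - 5); equals 1 for n <= 3.
    """
    if n < 1:
        raise ValueError("need at least one taxon")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd numbers from 3 up to 2n - 5
        count *= k
    return count


if __name__ == "__main__":
    for n in (2, 4, 5, 10, 22, 50, 100):
        print(n, f"{unrooted_tree_count(n):.1e}")
```

Python integers have arbitrary precision, so the exact count is available even for 100 taxa; the formatted output only rounds it for display.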

Character based methods
  Use the aligned sequences directly to calculate a tree according to an optimality criterion:
  Maximum parsimony (DNAPARS, PAUP*, MEGA, etc.)
    Discriminates using parsimony-informative sites
    Selects the tree that requires the fewest steps to explain the alignment
  Maximum likelihood (DNAML, PAUP*, PAML, etc.)
    Requires an explicit model of character evolution
    Calculates likelihood for each state at all sites
    Selects the tree with the highest overall likelihood (least negative log likelihood value)

Maximum Parsimony
  [Figure: the three possible trees (A, B, C) for taxa 1, 2, 3 and outgroup O, with character 2 (states A/G) mapped onto a tree; the tree with the fewest steps is the maximum parsimony tree]

  Taxon   Alignment
          12345 67890
  1       GATCC TAGGC
  2       GGTCA CATGT
  3       GGTCA TATCT
  O       GATAC CAGCA

  Tree    Steps         Sum
  A       02012 20212   12
  B       02012 10222   12
  C       01011 20122   10
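The step counts on the slide above can be recomputed with Fitch's small-parsimony algorithm. This is a sketch under two assumptions that are not explicit in the transcript: the alignment rows belong to taxa 1, 2, 3 and O in that order, and trees A, B, C group the taxa as (1,2)(3,O), (1,3)(2,O) and (2,3)(1,O) respectively, which is the assignment that reproduces the per-site steps listed. Function and variable names are hypothetical.

```python
ALIGNMENT = {
    "1": "GATCCTAGGC",
    "2": "GGTCACATGT",
    "3": "GGTCATATCT",
    "O": "GATACCAGCA",
}

# Assumed topologies, written as nested pairs (rooting does not change the score).
TREES = {
    "A": (("1", "2"), ("3", "O")),
    "B": (("1", "3"), ("2", "O")),
    "C": (("2", "3"), ("1", "O")),
}


def fitch_steps(tree, states):
    """Return (possible ancestral states, steps) for one site (Fitch's algorithm)."""
    if isinstance(tree, str):  # leaf: its observed state, no steps needed
        return {states[tree]}, 0
    left_set, left_steps = fitch_steps(tree[0], states)
    right_set, right_steps = fitch_steps(tree[1], states)
    shared = left_set & right_set
    if shared:  # children agree on at least one state: no extra step
        return shared, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1


def steps_per_site(tree, alignment):
    """List of parsimony steps for every alignment column on the given tree."""
    n_sites = len(next(iter(alignment.values())))
    result = []
    for i in range(n_sites):
        states = {taxon: seq[i] for taxon, seq in alignment.items()}
        result.append(fitch_steps(tree, states)[1])
    return result


if __name__ == "__main__":
    for name, tree in TREES.items():
        steps = steps_per_site(tree, ALIGNMENT)
        print(name, "".join(map(str, steps)), sum(steps))
```

With only four taxa all three trees can be scored exhaustively; for larger data sets the same scoring function is applied inside a heuristic search.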

Bootstrapping
  Non-parametric bootstrap: bootstrap 50% majority-rule consensus tree

/---------------------------------------------------------------------------- p1.136(1)
|
+---------------------------------------------------------------------------- p1.719(2)
|                                                            /--------------- p2.135(3)
|                                                            |
|              /---------------------74----------------------+--------------- p3.105(4)
|              |                                             |
|              |                                             \--------------- p3.529(5)
|              |
|              +------------------------------------------------------------- p5.317(6)
|              +------------------------------------------------------------- p6.6767(7)
\------79------+------------------------------------------------------------- p7.6760(8)
               |                                             /--------------- p8.159(9)
               |                              /------83------+
               |                              |              \--------------- p8.822(10)
               |              /------78-------+
               |              |               |              /--------------- p11.113(12)
               |              |               \------99------+
               \------91------+                              \--------------- p11.9939(13)
                              \---------------------------------------------- p9.256(11)

  Bipartitions found in one or more trees and frequency of occurrence (bootstrap support values):

  1234567890123   Freq       %
  ------------------------------
  ...........**    992   99.2%
  ........*****    908   90.8%
  ........**...    833   83.3%
  ..***********    786   78.6%
  ........**.**    776   77.6%
  ..***........    738   73.8%
  ..***..*.....    428   42.8%
  ...**........    412   41.2%
  ..***.**.....    412   41.2%
  ..**.........    339   33.9%
  .....**......    335   33.5%
  ..***.*......    303   30.3%
  .....*..*****    292   29.2%
  .....**.*****    202   20.2%
  ..........***    183   18.3%
  ..******.....    175   17.5%
  .....*.*.....    164   16.4%
  ..*.*........    139   13.9%
  ........*..**    138   13.8%
  .****........    124   12.4%
  ...**..*.....    109   10.9%
  ..***.*******    107   10.7%
  .....********     87    8.7%
  .....*.******     80    8.0%
  .****..*.....     79    7.9%
  ..**...*.....     64    6.4%
  ...*...*.....     63    6.3%
  .****.**.....     62    6.2%
  ..***..******     54    5.4%

  100 groups at (relative) frequency less than 5% not shown

  Non-parametric bootstrap
    Calculate a tree under a model using a tree building method
    Create pseudo replicates of the alignment
    Recalculate a tree for each pseudo replicate
    Compute a consensus tree of all pseudo trees
    Tests the reliability/robustness of the model-method
    Biased (usually conservative)
  Parametric bootstrap
    Tests the evolutionary model and process
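The "create pseudo replicates of the alignment" step of the non-parametric bootstrap amounts to sampling alignment columns with replacement until the replicate has as many columns as the original. The sketch below illustrates only that resampling step; tree building and consensus are left to the tools named in the hands-on section. The function name is hypothetical and the toy alignment is the one from the Maximum Parsimony slide.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement (one pseudo replicate).

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    rng: a random.Random instance, passed in so results can be reproduced.
    """
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    # Draw n_sites column indices; the same column may be drawn several times.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(alignment[name][i] for i in columns) for name in names}


if __name__ == "__main__":
    toy = {
        "1": "GATCCTAGGC",
        "2": "GGTCACATGT",
        "3": "GGTCATATCT",
        "O": "GATACCAGCA",
    }
    rng = random.Random(42)  # fixed seed, chosen arbitrarily for repeatability
    for _ in range(3):
        replicate = bootstrap_replicate(toy, rng)
        for name, seq in replicate.items():
            print(name, seq)
        print()
```

Each column is sampled as a unit across all taxa, so the site patterns are preserved and only their frequencies change between replicates; the support value of a group is then the fraction of replicate trees that contain it.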

The molecular clock
  Assumes ultrametric data/tree
  Genetic distance-time relationships

The molecular clock
  Evolutionary model important
  Rate variation
  [Figure: plot of genetic distance against time]

Hands-on
  Open file in BioEdit
  Calculate distance-matrix tree
  Manually check & correct alignment
  Calculate distance-matrix tree
    Calculate matrix with DNADIST
    Calculate tree with NEIGHBOR
  Calculate character-based tree
    DNAPARS or DNAML
  Calculate bootstrap support
    Use SEQBOOT, DNADIST, NEIGHBOR, CONSENSE
  View tree in TreeView

Group discussion
  Pros & Cons
  Where to spend your time & effort
  What else is available