Phylogeny and Genome Biology
Andrew Jackson, Wellcome Trust Sanger Institute

Presentation transcript:

Changes:
Type the program name to start.
Always cd to the phyml directory before starting the program.
Beware: phyml overwrites the tree file with every search.
Ctrl-C quits phyml and restores the terminal.
The P6 figure refers to all searches generally.

Introduction
Phylogeny refers to the ancestry of a biological lineage, but is also synonymous with phylogenetic tree.
Phylogeny is tree-like, or dichotomous.
Phylogeny provides the historical basis of the comparative method.
Genomes are historical entities; their structure and function reflect the past.
There is a need for genomic systematics to establish the identities of genomic phenomena.

Principle of phylogenetics
Inferring relationships is about similarity.
Homology describes similarity due to common inheritance from an ancestor. Homologous characters are the useful kind of similarity.
Homoplasy describes similarity due to independent acquisition of the same, or a superficially similar, character state. Homoplasious characters give a misleading picture of phylogeny.
Increasing distance in a phylogenetic tree reflects a decreasing number of shared, homologous characters (assuming that evolution maximises homology).

Methods
Stages in phylogenetic analysis:
1. Data preparation; multiple alignment of DNA or protein sequences. A similar principle applies to character states.
2. Data scoring; producing genetic distances or character states ('distance' or 'discrete' data).
3. Tree sorting; processes for searching 'tree-space', e.g., hill-climbing or MCMC.
4. Estimation; identifying the most acceptable tree topology and model parameters using a variety of methods ('clustering' or 'optimising' methods).

Phylogenetic methods:
            Clustering                   Optimising
Distance    Neighbour-joining, UPGMA     Minimum evolution
Discrete    -                            Maximum parsimony, Maximum likelihood, Bayesian inference

A phylogenetic method is judged according to its 'accuracy' (i.e., it obtains the true tree) and its 'precision' (i.e., it always produces the same tree).
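
To give a feel for why stage 3 relies on heuristic searches, the sketch below counts the topologies in 'tree-space'. It uses the standard result that n taxa have (2n-5)!! distinct unrooted, fully bifurcating trees and (2n-3)!! rooted ones; the function names are illustrative and not part of any phylogenetics package.

```python
def count_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted, fully bifurcating topologies: (2n-5)!!"""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (the product is 1 when n = 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


def count_rooted_trees(n_taxa: int) -> int:
    """Number of rooted, fully bifurcating topologies: (2n-3)!!"""
    # Rooting an unrooted tree is equivalent to adding one extra taxon.
    return count_unrooted_trees(n_taxa + 1)


for n in (4, 10, 20):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

Ten taxa already give 2,027,025 unrooted topologies, so exhaustive evaluation becomes impractical very quickly.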

Applications to genome biology
Molecular systematics; required to accurately identify members of gene families. Orthology and paralogy are best determined through phylogenetics.
Gene family evolution; gene families evolve according to birth-death processes. Gene duplications and losses can be inferred through comparisons of 'gene' and 'species' trees.
Horizontal gene transfer; the placement of a gene in the 'wrong' position within a phylogeny is used to support HGT. This obviously depends on an accurate view of the organismal phylogeny.

Applications to genome biology (continued)
Recombination; sequences may contain multiple phylogenetic signals ('mixed histories'). Many tests for recombination and gene conversion use phylogenetic profiles to detect breakpoints.
Microarray data analysis; presence or absence data generated by microarray assays can be used to estimate a phylogeny or converted into a distance matrix.
Phylogenomics; gene order, gene content and concatenated sequences can be used to infer phylogeny. Using the theoretical information limit can virtually eliminate sampling error.
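
As a minimal sketch of converting presence or absence data into a distance matrix, the example below scores each pair of samples by the proportion of probes at which their calls differ (a simple Hamming distance). The strain names and probe calls are invented for illustration, and other distance measures could equally be used.

```python
def presence_absence_distances(profiles):
    """Pairwise proportion of probes whose presence/absence calls differ."""
    names = sorted(profiles)
    distances = {}
    for a in names:
        distances[a] = {}
        for b in names:
            calls_a, calls_b = profiles[a], profiles[b]
            if len(calls_a) != len(calls_b):
                raise ValueError("all profiles must cover the same probes")
            mismatches = sum(x != y for x, y in zip(calls_a, calls_b))
            distances[a][b] = mismatches / len(calls_a)
    return distances


# Hypothetical calls for six probes (1 = present, 0 = absent).
profiles = {
    "strainA": [1, 1, 0, 1, 0, 1],
    "strainB": [1, 1, 0, 0, 0, 1],
    "strainC": [0, 1, 1, 0, 1, 1],
}

matrix = presence_absence_distances(profiles)
for name in sorted(matrix):
    print(name, [round(matrix[name][other], 3) for other in sorted(matrix)])
```

The resulting matrix can be passed to any distance-based method, such as neighbour-joining.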

Clustering methods (e.g., Neighbour-joining)
[Figure: pairwise genetic distance matrix for taxa A to I]
Principles: Tree topology and branch lengths are estimated from a genetic distance matrix.
Advantages: A single tree is estimated by minimising genetic distance, in a short time and with little computational expenditure.
Disadvantages:
The method lacks accuracy because there is no attempt to correct for potential bias (homoplasy).
The method lacks precision because the outcome is partly contingent on the tree with which the search process begins.
There is no optimising criterion, so two trees cannot be statistically compared.
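
The following is a minimal sketch of the neighbour-joining procedure, assuming the usual formulation in which the pair minimising Q(i,j) = (n-2)d(i,j) - sum_k d(i,k) - sum_k d(j,k) is joined at each step. The five-taxon distance matrix and the internal node names (node0, node1, ...) are made up for illustration; this is not the implementation used by any particular program.

```python
def neighbour_joining(names, dist):
    """Return a list of (node, neighbour, branch_length) edges."""
    d = {a: dict(dist[a]) for a in names}  # working copy of the matrix
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        total = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Choose the pair with the smallest Q value (first pair wins ties).
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - total[a] - total[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        u = f"node{next_id}"
        next_id += 1
        # Branch lengths from the joined pair to the new internal node.
        len_a = 0.5 * d[a][b] + (total[a] - total[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges.extend([(a, u, len_a), (b, u, len_b)])
        # Distances from the new node to every remaining node.
        d[u] = {}
        for c in nodes:
            if c not in (a, b):
                d[u][c] = d[c][u] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges


# Hypothetical distances between five taxa.
names = ["a", "b", "c", "d", "e"]
rows = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
dist = {a: {b: rows[i][j] for j, b in enumerate(names)} for i, a in enumerate(names)}

for node, neighbour, length in neighbour_joining(names, dist):
    print(f"{node} - {neighbour}: {length}")
```

Note that the procedure builds a single tree directly from the matrix; no alternative topologies are scored along the way.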

Non-Parametric Methods - Maximum parsimony
Principles:
Searches through tree topologies in 'tree-space' using a 'hill-climbing' algorithm.
Applies an optimising criterion: maximum parsimony.
Scores trees on their 'length', i.e., the number of character state changes required to explain the distribution of characters on a given tree topology.
Selects the topology with the fewest character changes overall.
Advantages:
Generally accurate method with few assumptions.
Phylogenetic hypotheses can be statistically tested by comparing the lengths of different trees.
All available information on sites is used in reconstruction.
Tree estimation is relatively fast and undemanding.
Disadvantages:
There are typically several shortest trees, resulting in a potentially ambiguous consensus topology.
There is no explicit model of evolution and so the method is prone to error under certain circumstances, e.g., long-branch attraction (homoplasy), making MP trees positively misleading in many cases.
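
Tree 'length' can be illustrated with Fitch's algorithm, a standard way to count the minimum number of state changes for unordered characters. In this sketch a tree is a nested tuple of taxon names, and the four-taxon alignment is invented for illustration; the tree is written as rooted only for convenience.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one site."""
    if isinstance(tree, str):  # a leaf: its observed state, no changes
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    shared = left_set & right_set
    if shared:  # children agree on at least one state: no extra change
        return shared, left_changes + right_changes
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for site in range(n_sites):
        column = {taxon: seq[site] for taxon, seq in alignment.items()}
        total += fitch(tree, column)[1]
    return total


# Hypothetical alignment of four taxa, four sites.
alignment = {"A": "AAGT", "B": "AAGC", "C": "CAAT", "D": "CTAC"}

candidate_trees = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "D"), ("B", "C")),
]
for tree in candidate_trees:
    print(tree, tree_length(tree, alignment))

best = min(candidate_trees, key=lambda t: tree_length(t, alignment))
print("most parsimonious:", best)
```

With only three candidate topologies the search is exhaustive here; for realistic numbers of taxa the scoring function stays the same but the candidates come from a heuristic search.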

[Figure: example trees, panels labelled NJ and MP]

Maximum likelihood
Principles:
Applies a complex model of DNA or protein sequence evolution that estimates parameters for specific substitutions and other qualities of molecular sequences.
Calculates the 'likelihood' of each individual character in each candidate tree and applies an optimising criterion, maximising the likelihood of the data, given the model.
Locates the most likely tree topology through a hill-climbing algorithm.
Various models accommodate sources of molecular homoplasy that might result in the wrong tree:
'Multiple hits' (substitutional saturation)
Rate convergence
Rate heterogeneity
Base composition bias
Codon usage bias
Secondary structure
Covariance
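
The idea of maximising the likelihood of the data, given the model, can be shown in the simplest possible setting: two sequences and the JC69 model, where the only parameter is the distance d between them. The sketch below assumes equal base frequencies (1/4 each) and uses invented sequences; a real ML tree search optimises topology and many parameters at once.

```python
import math


def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of two aligned sequences separated by distance d."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay  # P(site is unchanged)
    p_diff = 0.25 - 0.25 * decay  # P(change to one specific other base)
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))


def ml_distance_grid(n_sites, n_diff, upper=2.0, n_points=20000):
    """Crude optimiser: evaluate the likelihood on a grid and keep the best."""
    grid = [upper * i / n_points for i in range(1, n_points + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))


def jc69_closed_form(n_sites, n_diff):
    """Analytical ML estimate under JC69: d = -3/4 ln(1 - 4p/3)."""
    p = n_diff / n_sites
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical pair of aligned sequences.
seq_x = "ACGTACGTACGTACGTACGT"
seq_y = "ACGTACGAACGTTCGTACCT"
n_sites = len(seq_x)
n_diff = sum(x != y for x, y in zip(seq_x, seq_y))

d_grid = ml_distance_grid(n_sites, n_diff)
d_exact = jc69_closed_form(n_sites, n_diff)
print("observed differences:", n_diff)
print("grid estimate:", d_grid, "closed form:", d_exact)
print("lnL at optimum:", jc69_log_likelihood(d_grid, n_sites, n_diff))
```

The estimated distance exceeds the observed proportion of differences because the model allows for multiple hits at the same site.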

Maximum likelihood (continued)
Advantages:
Highly accurate because considerable biological realism is introduced through the substitutional model. This allows various forms of homoplasy to be corrected for.
Phylogenetic estimation within the likelihood framework provides a robust statistical context in which to evaluate specific hypotheses.
A single tree is produced that is generally precise.
Disadvantages:
The complexity of the estimation process means that it is slow and computationally demanding.
The hill-climbing algorithm is susceptible to local optima and so is not guaranteed to return the optimal solution. Tree search is an NP-hard problem; a similar objection applies to all methods that depend on heuristic searches.

[Figure: trees estimated under a GTR-based model and under JC69, each labelled with its lnL; model suffixes and lnL values not preserved]

Bayesian inference
Principles:
Frequentist approaches make long-term frequency statements about random data, conditional on a model. "Given the tree and model, how unlikely are the data?"
Bayesian inference estimates the probability of random parameters, conditional on the data. "Given the data, what are the probabilities of the tree and model?"
Uses an MCMC process to search through tree-space.
Selects the tree topology with the highest probability, given the data.
Advantages:
Intuitive.
Potential for any complex model.
Provides both parameter estimates (i.e., the tree) and their probabilities in a single analysis.
Many different hypotheses can be evaluated in a single analysis.
The MCMC algorithm makes integrating over all parameter values fast and accurate.
Coupled MCMC chains are able to break out of local optima.
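
To show what an MCMC process does, here is a toy Metropolis sampler for a single parameter: the JC69 distance between two sequences, with an exponential prior. Everything about the setup (the counts of 200 sites and 30 differences, the prior rate, the step size, the single uncoupled chain) is an assumption chosen for illustration; real Bayesian phylogenetics samples topologies and many parameters jointly.

```python
import math
import random


def log_posterior(d, n_sites, n_diff, prior_rate=1.0):
    """Unnormalised log posterior: JC69 likelihood times exponential prior."""
    if d <= 0.0:
        return float("-inf")
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    if p_diff <= 0.0:  # guards against underflow for extremely small d
        return float("-inf")
    log_lik = ((n_sites - n_diff) * math.log(0.25 * p_same)
               + n_diff * math.log(0.25 * p_diff))
    log_prior = math.log(prior_rate) - prior_rate * d
    return log_lik + log_prior


def metropolis(n_sites, n_diff, n_steps=20000, burn_in=2000, step=0.05, seed=1):
    """Sample the distance d; returns the post-burn-in samples."""
    rng = random.Random(seed)
    d = 0.5  # arbitrary starting value
    lp = log_posterior(d, n_sites, n_diff)
    samples = []
    for i in range(n_steps):
        # Symmetric proposal, reflected at zero to keep d positive.
        proposal = abs(d + rng.gauss(0.0, step))
        lp_new = log_posterior(proposal, n_sites, n_diff)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            d, lp = proposal, lp_new
        if i >= burn_in:
            samples.append(d)
    return samples


samples = metropolis(n_sites=200, n_diff=30)
posterior_mean = sum(samples) / len(samples)
ordered = sorted(samples)
lower = ordered[int(0.025 * len(ordered))]
upper = ordered[int(0.975 * len(ordered))]
print("posterior mean:", posterior_mean)
print("95% credible interval:", lower, upper)
```

The output is a sample from the posterior rather than a single optimum, which is why parameter estimates and their probabilities arrive together.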

Bayesian inference (continued)
Disadvantages:
An evolutionary model must be specified a priori, in the form of prior probabilities ('priors'). Is there sufficient knowledge of these probabilities?
The MCMC must be run long enough for variation in the parameter estimates to smooth out, or reach 'convergence'. The time required is never certain.
Posterior probabilities describe the absolute probability of particular nodes and branch lengths; these can be overestimated.
[Figure: example tree, labelled BI]

Remember…
All trees are wrong.
There are no free lunches.

Rate Heterogeneity
All methods are prone to long-branch attraction artefacts.
In all methods except MP, rate heterogeneity can be accounted for in some way.
The most commonly used approach is gamma correction.
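
One concrete form of gamma correction, offered here as an illustration and not necessarily the one the slides have in mind, is the gamma-corrected JC69 distance d = (3/4) a [(1 - 4p/3)^(-1/a) - 1], where p is the observed proportion of differing sites and a is the gamma shape parameter. The values of p and a below are arbitrary.

```python
import math


def jc69_distance(p):
    """JC69 distance assuming every site evolves at the same rate."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc69_gamma_distance(p, alpha):
    """JC69 distance with gamma-distributed rates across sites (shape alpha)."""
    return 0.75 * alpha * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / alpha) - 1.0)


p = 0.3  # hypothetical observed proportion of differences
print("equal rates:", jc69_distance(p))
for alpha in (0.5, 1.0, 5.0, 1e6):
    print("alpha =", alpha, "->", jc69_gamma_distance(p, alpha))
```

Small shape values (strong rate heterogeneity) inflate the corrected distance, while very large values recover the equal-rates estimate. Long branches are therefore the ones most underestimated when heterogeneity is ignored.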

Further details
Textbooks:
Page & Holmes (1999) Molecular Evolution: A Phylogenetic Approach. Blackwell Science.
Felsenstein (2004) Inferring Phylogenies. Sinauer Associates.
Website:
Felsenstein's Phylogeny program page (links to available software)
Software:
PAUP* (NJ, MP, ML)
PHYLIP (NJ, MP, ML)
MrBayes (BI)
Splitstree (Networks)