Biochemistry and Molecular Genetics Computational Bioscience Program Consortium for Comparative Genomics University of Colorado School of Medicine

Slides:



Advertisements
Similar presentations
Introduction to Molecular Evolution
Advertisements

Computing a tree Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Phylogenetic Tree A Phylogeny (Phylogenetic tree) or Evolutionary tree represents the evolutionary relationships among a set of organisms or groups of.
Bioinformatics Phylogenetic analysis and sequence alignment The concept of evolutionary tree Types of phylogenetic trees Measurements of genetic distances.
Introduction into Phylogenetics Katja Nowick Group Leader “TFome and Transcriptome Evolution” Bioinformatics Group Paul-Flechsig-Institute for Brain Research.
Cladogram Building - 1 ß How complex is this problem anyway ? ß NP-complete:  Time needed to find solution in- creases exponentially with size of problem.
Phylogenetic Trees Lecture 4
1 General Phylogenetics Points that will be covered in this presentation Tree TerminologyTree Terminology General Points About Phylogenetic TreesGeneral.
Phylogenetics - Distance-Based Methods CIS 667 March 11, 2204.
Phylogenetic reconstruction
Maximum Likelihood. Likelihood The likelihood is the probability of the data given the model.
Molecular Evolution Revised 29/12/06
Tree Reconstruction.
“Inferring Phylogenies” Joseph Felsenstein Excellent reference
Biochemistry and Molecular Genetics Computational Bioscience Program Consortium for Comparative Genomics University of Colorado School of Medicine
Lecture 7 – Algorithmic Approaches Justification: Any estimate of a phylogenetic tree has a large variance. Therefore, any tree that we can demonstrate.
BIOE 109 Summer 2009 Lecture 4- Part II Phylogenetic Inference.
. Phylogeny II : Parsimony, ML, SEMPHY. Phylogenetic Tree u Topology: bifurcating Leaves - 1…N Internal nodes N+1…2N-2 leaf branch internal node.
Branch lengths Branch lengths (3 characters): A C A A C C A A C A C C Sum of branch lengths = total number of changes.
. Class 9: Phylogenetic Trees. The Tree of Life D’après Ernst Haeckel, 1891.
Probabilistic methods for phylogenetic trees (Part 2)
Building Phylogenies Parsimony 2.
Building Phylogenies Parsimony 1. Methods Distance-based Parsimony Maximum likelihood.
Phylogenetic Analysis. 2 Phylogenetic Analysis Overview Insight into evolutionary relationships Inferring or estimating these evolutionary relationships.
. Class 9: Phylogenetic Trees. The Tree of Life D’après Ernst Haeckel, 1891.
Phylogenetic trees Sushmita Roy BMI/CS 576
Multiple Sequence Alignments and Phylogeny.  Within a protein sequence, some regions will be more conserved than others. As more conserved,
Maximum parsimony Kai Müller.
Phylogeny Estimation: Traditional and Bayesian Approaches Molecular Evolution, 2003
Phylogenetic Analysis. 2 Introduction Intension –Using powerful algorithms to reconstruct the evolutionary history of all know organisms. Phylogenetic.
Terminology of phylogenetic trees
Molecular phylogenetics
Molecular basis of evolution. Goal – to reconstruct the evolutionary history of all organisms in the form of phylogenetic trees. Classical approach: phylogenetic.
Phylogenetics Alexei Drummond. CS Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) (B)
Chapter 26: Phylogeny and the Tree of Life Objectives 1.Identify how phylogenies show evolutionary relationships. 2.Phylogenies are inferred based homologies.
1 Dan Graur Molecular Phylogenetics Molecular phylogenetic approaches: 1. distance-matrix (based on distance measures) 2. character-state.
Phylogenetic Analysis. General comments on phylogenetics Phylogenetics is the branch of biology that deals with evolutionary relatedness Uses some measure.
Computational Biology, Part D Phylogenetic Trees Ramamoorthi Ravi/Robert F. Murphy Copyright  2000, All rights reserved.
BINF6201/8201 Molecular phylogenetic methods
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
Parsimony-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 Colin Dewey Fall 2010.
 Read Chapter 4.  All living organisms are related to each other having descended from common ancestors.  Understanding the evolutionary relationships.
A brief introduction to phylogenetics
Lecture 2: Principles of Phylogenetics
Introduction to Phylogenetics
Calculating branch lengths from distances. ABC A B C----- a b c.
More statistical stuff CS 394C Feb 6, Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.
GENE 3000 Fall 2013 slides wiki. wiki. wiki.
Chapter 10 Phylogenetic Basics. Similarities and divergence between biological sequences are often represented by phylogenetic trees Phylogenetics is.
Maximum Likelihood Given competing explanations for a particular observation, which explanation should we choose? Maximum likelihood methodologies suggest.
Phylogeny Ch. 7 & 8.
Ayesha M.Khan Spring Phylogenetic Basics 2 One central field in biology is to infer the relation between species. Do they possess a common ancestor?
Measuring genetic change Level 3 Molecular Evolution and Bioinformatics Jim Provan Page and Holmes: Section 5.2.
Bioinf.cs.auckland.ac.nz Juin 2008 Uncorrelated and Autocorrelated relaxed phylogenetics Michaël Defoin-Platel and Alexei Drummond.
Building Phylogenies Maximum Likelihood. Methods Distance-based Parsimony Maximum likelihood.
Maximum Parsimony Phenetic (distance based) methods are fast and often accurate but discard data and are not based on explicit character states at each.
Phylogeny and the Tree of Life
Phylogenetics LLO9 Maximum Likelihood and Its Applications
Evolutionary genomics can now be applied beyond ‘model’ organisms
Distance based phylogenetics
Inferring a phylogeny is an estimation procedure.
Linkage and Linkage Disequilibrium
Methods of molecular phylogeny
Patterns in Evolution I. Phylogenetic
BNFO 602 Phylogenetics Usman Roshan.
CS 581 Tandy Warnow.
Phylogeny and the Tree of Life
Lecture 7 – Algorithmic Approaches
Algorithms for Inferring the Tree of Life
But what if there is a large amount of homoplasy in the data?
Presentation transcript:

Biochemistry and Molecular Genetics Computational Bioscience Program Consortium for Comparative Genomics University of Colorado School of Medicine Phylogenetics BIOL 7711 Computational Bioscience University of Colorado School of Medicine Consortium for Comparative Genomics

Reconstructing Phylogenies All species are related by descent Splitting and divergence Homologous genes are also related by descent The goal of phylogenetic reconstruction is to determine the order and timing of splitting events Phylogenetic analysis includes inferences of substitution (mutation, selection) processes, ancestral states and functions

Why Phylogenetics? Resolve evolutionary history Important for comparative analysis to account for correlations due to relatedness Disease origins, paths of infection Influenza, HIV Origin of genes, systems, functions

Data Types and Issues Morphology Continuous or discrete Polarized? DNA and Protein Sequence Probabilistic Saturation Protein and RNA structure Transposable element insertion The question of homology

Morphology vs. Molecular “In analyzing phenotypic features, we do not know which part and how much of the genetic base is being analyzed, and hence cannot know about independence” Pleiotropy Multifactoriality Epigenetic effects (environment) Homology Underlying genes may change!

Parts of a Tree Branch, Node Edge, vertex Tips (sequence) Root Topology, T i Geneology, G i ( ( A, B ), ( C, D ) ) ; ( (A:0.23, B:0.35):1.28, (C:0.13, D:0.19) :1.52 ) ;

The Phylogeny Problem Data, Trees, Model of evolution Assumptions Data tree reflects species tree Branch lengths reflect time? Homologous data (common ancestor)

The Problem ACBD

ACBD Oldest group (rooting) is the hardest If oldest split is known, use as outgroup to define the rest of the tree Hard to be sure If sure, may not be very informative

Unrooted Tree A C B D Ignore the biggest problem Reversible models are common anyway (root doesn’t matter)

Topology Space A C B D A C B D D A C B D Times branch lengths Need tree search algorithm

Topology Space 11 taxa => 34 million More money than most will make in lifetime taxa Bill Gates’ wealth Federal deficit 24 taxa ~ Number of atoms in a mole 30 taxa ~10 37 # atoms in the solar system? 50 taxa ~10 74 a very big number

Heuristic, Optimal, Posterior NP hard, so need tricks Distance Estimate of amount of change separating two sequences (species) Calculate analytically (limited), or ML Requires a reversible model of evolution Parsimony Minimal number of changes Poorly specified model (but there is one) Easy to calculate ML, Bayes Model based, don’t toss the data

Distance Reconstruction UPGMA Pair closest sequences first Doesn’t account for rate variation (ultrametric) Neighbor Joining (NJ) Closest but least far from everything else Deals with rate variation BIONJ, WEIGHBOR Least squares (inverse square) weighting of neighbor information

Parsimony and Cladism Find the tree that minimizes the number of changes along the tree Can be calculated in polynomial time Exact or Branch and Bounds practical up to about 30 species Not very many these days After that, it is heuristic Slower than distance, faster than ML

Parsimony and Cladism Note: seen as a model of change, it is a complex and awkward hypothesis Poor reconstruction of ancestral states Has ancestral and derived states, polarity Monophyletic, paraphyletic, polyphyletic Highly dependent on time directionality Can an organism start a new group? Consider enzyme function

Paraphyly, Rooted Tree Lysozyme Lactalbumin Human Humpback Whale Human Humpback Whale Corn Snake PenguinDogfish Gene Duplication

Paraphyly, Rooted Tree Birds Penguin Wren Reptiles Crocodile Alligator Corn Snake IguanaSnapping Turtle Major Adaptive Shift Functional Divergence?

Tree Search Exhaustive Branch and bounds Abandon routes that are less likely than a route we know is pretty good Stepwise addition Best addition point is found as taxa are added Star decomposition

Star Decomposition Separate out all sets of two taxon groups, choose the one that shows the greatest score improvement

Hill Climbing and Simulated Annealing Use with any measure of the “goodness” of a solution Accept if Or Where k varies over time, usually larger Use “taboo list” of recently attempted solutions

Branch Swapping, etc Exchange two branches for each other Nearest Neighbor interchange Subtree pruning and regrafting Tree bisection and reconnection These work surprisingly well A B D C E A B E C D

Likelihood Calculation Conditional likelihood for each site, j, for a node, A Likelihood of each state (e.g., nucleotide g) given the two descendent nodes, e.g. B and C and connecting branches b AB and b AC Left node Right node

Likelihood Calculation Infinite stack of turtles, except that tips are data, and terminate stack At root, multiply each state by its prior (the equilibrium frequencies) and sum over all ancestral states If reversible, root can be anywhere, including tips

Trees and HMMs (or Markov Random Fields, Mixture Models) Can combine the two Goldman, Thorne, and Jones Secondary structure Buried versus exposed Transition probabilities between structural types Training set of proteins with known structures Use to predict secondary structure, protein features

Newer Tricks Bayesian Problem isn’t multiplicative with every parameter Augmented data at roots or specified changes

Modeling Evolutionary Processes State frequencies, transitions versus transversions, rates, context dependency Model complexity Takes longer time to calculate with every parameter Check if complex model is justified given the amount of data (and computer time available) Compare ML or BF of models, model testing

1-∑ G T C A TGCA \To From

1-∑ G T C A TGCA \To From Jukes and Cantor, 1969

        1-∑ G T C A TGCA \To From Kimura, 1980

AA AA AA TT TT TT GG GG GG CC CC CC 1-∑ G T C A TGCA \To From Felsenstein, 1981

GT  T GC  C TG  G GC  C TA  A CT  T CG  G AT  T GA  A AG  G CA  A AC  C 1-∑ G T C A TGCA \To From GTR, ‘84, ‘86, ‘90

        1-∑ G T C A TGCA \To From Non-Reversible, Strand Symmetric Bielawski and Gold, 2002

Progression of Complexity Combining models Rate variation Gamma on average rate, Yang 1995 Freqs, relative rate parameters the same Codon models DNA model underneath amino acid model AAs change slower: constraint, selection Physicochemical, PAM & JTT, context Computationally expensive IID => dependent Mixture models

Statistical Desiridata Identifiability A set of trees are identifiable under a model if the distributions of characters are disjoint for different tree topologies (with trivial exceptions, such as 0 branch lengths) Consistency A method is consistent if it will recover the true tree with arbitrarily high probability given enough data (i.e. a long enough sequence)

Dating Times Need to assume (or validate) an approximate molecular clock Underlying rate constant over time Need paleontological data to link branching patterns to geological time Can have continuum from perfect clock to complete freedom of rate on every branch

Gene Tree versus Species Tree Not necessarily the same thing Sampling from coalescent at speciation Recombination Different genes, gene regions sample the coalescent differently Horizontal gene transfer Non-neutral processes, e.g., convergence Do species always bifurcate? Slow decrease in gene flow Inversions prevent recombination Species may form slowly