Cladogram Building - 1: How complex is this problem anyway? NP-complete: time needed to find a solution increases exponentially with the size of the problem.

Presentation transcript:

Cladogram Building - 1
- How complex is this problem anyway?
- NP-complete:
  - For known exact algorithms, the time needed to find the solution increases exponentially with the size of the problem: t = c^n

Computational Complexity
- How do we proceed?
- What about the quality of the solution?
  - Optimality criterion
  - Exact and Exhaustive: Enumeration, Branch and Bound
  - (maybe) Off-Target and Incomplete: Heuristics

Optimality - 1
- Parsimony analysis:
  - comprises a group of related methods, united by the goal of optimizing some evolutionarily significant quantity but differing in their underlying evolutionary assumptions.

Optimality - 2
- How good is the solution:
  - What is its score [relative to alternatives]?
- Relation of score to evolutionary assumptions
  - Fitch and Wagner Parsimony
  - Dollo Parsimony
  - Camin-Sokal Parsimony
  - Generalized Parsimony
  - Constrained Parsimony
  - Group / Component Compatibility
  - Character Compatibility

Exact and Exhaustive
- Enumeration is computationally infeasible if the number of taxa is over, say, 10.
- Branch and Bound is computationally feasible for over 20 taxa (50 may even work).
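To see why enumeration breaks down so quickly, it helps to count the candidate cladograms. The sketch below uses the standard count of unrooted, fully bifurcating trees on n labeled taxa, (2n - 5)!! = 3 x 5 x ... x (2n - 5). The function name is illustrative, not taken from the slides.

```python
def num_unrooted_trees(n_taxa):
    """Number of unrooted, fully bifurcating trees on n_taxa labeled taxa.

    Uses the double factorial (2n - 5)!!, valid for n_taxa >= 3.
    """
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 5 + 1, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 20):
        print(n, num_unrooted_trees(n))
```

For 10 taxa this gives 2,027,025 trees, and for 20 taxa the count is already on the order of 10^20, which is why exhaustive enumeration is only practical for very small data sets.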

(maybe) Off-Target and Incomplete
- Heuristics
  - Step-wise Addition
  - Star Decomposition
  - Branch Swapping

Step-wise Addition - 1
[Figure: taxa A to E are added one at a time to a growing cladogram.]

Step-wise Addition - 2
- Dependent on taxon sequence in data matrix.
- Excessively greedy.
- Susceptible to local optima.
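A minimal sketch of step-wise addition may make these weaknesses concrete. It assumes unordered (Fitch) characters, a data matrix given as a dict of taxon name to character string, and trees written as nested pairs; all names here are hypothetical. Each new taxon is tried on every branch of the current tree and the shortest result is kept, so earlier placements are never revisited.

```python
def fitch_pass(tree, column):
    """Return (state_set, steps) for one unordered character.

    tree: a taxon name (str) or a pair (left, right).
    column: dict mapping taxon name -> state of this character.
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left_set, left_steps = fitch_pass(tree[0], column)
    right_set, right_steps = fitch_pass(tree[1], column)
    common = left_set & right_set
    if common:
        return common, left_steps + right_steps
    # No shared state: take the union and count one extra step.
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, matrix):
    """Total Fitch length of the tree over all characters in the matrix."""
    n_chars = len(next(iter(matrix.values())))
    total = 0
    for i in range(n_chars):
        column = {taxon: row[i] for taxon, row in matrix.items()}
        total += fitch_pass(tree, column)[1]
    return total


def insertions(tree, taxon):
    """Yield every tree obtained by attaching taxon to one branch of tree."""
    yield (tree, taxon)  # attach on the branch above this subtree
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insertions(left, taxon):
            yield (new_left, right)
        for new_right in insertions(right, taxon):
            yield (left, new_right)


def stepwise_addition(matrix, order):
    """Greedy tree building: add taxa in the given order, keeping the best."""
    tree = (order[0], order[1])
    for taxon in order[2:]:
        tree = min(insertions(tree, taxon),
                   key=lambda candidate: tree_length(candidate, matrix))
    return tree, tree_length(tree, matrix)


if __name__ == "__main__":
    matrix = {"A": "1100", "B": "1100", "C": "0011", "D": "0011"}
    print(stepwise_addition(matrix, ["A", "B", "C", "D"]))
```

Because ties are broken by whichever candidate comes first, and that depends on the order of taxa, different addition sequences can end in different trees. This is the dependence on taxon sequence noted above.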

Branch Swapping
- Local rearrangements of parts of cladogram
  - Nearest Neighbor Interchange
  - Subtree Pruning and Regrafting
  - Tree Bisection and Reconnection

Optimality - 3: Kinds of Scores
- Length (number of steps)
- Consistency Index (CI)
- Retention Index (RI)
- Corrected Extra Length (CEL)
- Redundancy Quotient
- AUCC
- HDR
- CCSI
- …
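Two of these scores are simple ratios of step counts. The sketch below assumes the usual definitions, CI = m / s and RI = (g - s) / (g - m), where m is the minimum conceivable number of steps for a character, s the number of steps observed on the cladogram, and g the maximum conceivable number of steps. The function and parameter names are illustrative.

```python
def consistency_index(min_steps, observed_steps):
    """CI = m / s: 1.0 means no homoplasy; lower values mean more extra steps."""
    return min_steps / observed_steps


def retention_index(min_steps, observed_steps, max_steps):
    """RI = (g - s) / (g - m): fraction of potential synapomorphy retained."""
    if max_steps == min_steps:
        raise ValueError("RI is undefined when max_steps equals min_steps")
    return (max_steps - observed_steps) / (max_steps - min_steps)


if __name__ == "__main__":
    # A binary character (m = 1) that needs 2 steps on the tree, with g = 3.
    print(consistency_index(1, 2), retention_index(1, 2, 3))
```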

Fitch & Wagner
- Characters:
  - W: binary, ordered multistate, continuous
  - F: unordered multistate
- Transformation:
  - Free reversibility: root and cladogram length decoupled.
  - Change in any direction equally probable (symmetry).
  - W: intermediate states always involved. Thus 1 -> 3 implies 2 steps.
  - F: Any state can transform into any other. Thus 1 -> 3 implies 1 step.

Wagner: Cladogram length - 1
[Figure: cladogram of taxa A to E with candidate states (e.g. 0,2; 1,3; 1,2) assigned to the internal nodes.]

Wagner: Cladogram length
[Figure: cladograms of taxa A to E with states assigned to the internal nodes.]

Fitch: Cladogram length
[Figure: cladograms of taxa A to E with state sets assigned to the internal nodes.]

Dollo: Multiple origins not allowed
[Figure: two cladograms of taxa A to E.]

Generalized Parsimony
[Figure: step matrices over states a, b, c, d for Wagner, Fitch, and Dollo parsimony (Dollo entries M, 2M, 3M weight gain vs. loss), and over A, C, G, T for transition/transversion weighting.]
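Generalized parsimony treats each of these methods as a choice of step matrix, so one routine can compute cladogram length for all of them. The sketch below assumes a Sankoff-style dynamic program over a rooted tree written as nested pairs. The cost functions encode only the Wagner and Fitch rules stated on the Fitch & Wagner slide, and all names are hypothetical.

```python
def wagner_cost(i, j):
    """Ordered states: passing from i to j goes through every state between."""
    return abs(i - j)


def fitch_cost(i, j):
    """Unordered states: any change costs one step."""
    return 0 if i == j else 1


def cost_table(tree, leaf_state, states, cost):
    """Map each state s to the minimum cost of the subtree if its root has s."""
    if not isinstance(tree, tuple):
        # A leaf is only compatible with its observed state.
        return {s: (0 if s == leaf_state[tree] else float("inf"))
                for s in states}
    child_tables = [cost_table(child, leaf_state, states, cost)
                    for child in tree]
    return {
        s: sum(min(cost(s, t) + table[t] for t in states)
               for table in child_tables)
        for s in states
    }


def parsimony_length(tree, leaf_state, states, cost):
    """Cladogram length for one character under the given step costs."""
    return min(cost_table(tree, leaf_state, states, cost).values())


if __name__ == "__main__":
    states = range(4)
    pair = ("A", "B")
    observed = {"A": 1, "B": 3}
    # A change 1 -> 3 costs 2 steps under Wagner and 1 step under Fitch.
    print(parsimony_length(pair, observed, states, wagner_cost))
    print(parsimony_length(pair, observed, states, fitch_cost))
```

Swapping in a different cost function, for example one that charges more for gains than for losses, or more for transversions than for transitions, changes the method without changing the algorithm.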

Models of Evolutionary Change
- Molecular Data
- Maximum Likelihood:
  - "Given the phylogeny, what is the probability of finding the data as I did?"
  - Substitution Types
  - Substitution Probabilities

Models: Substitution Types
[Figure: hierarchy of substitution models, from GTR through TrN, SYM, HKY, F84, K3ST, F81, and K2P down to JC. The labels give the restriction applied at each step: transversions and 2 transition classes; transversions vs. transitions; single substitution type; equal base frequencies.]

Substitution Types: What do they all mean?
- GTR, e.g., stands for General Time Reversible, meaning that the overall rate of change from base i to base j in a given length of time is the same as the rate of change from base j to base i.
- Each type corresponds to a table of substitution rates for all pairs of the nucleotides A, C, G, and T.

Substitution Rate Table
- Q = μ · R · Π, with rows and columns ordered A, C, G, T:

  \[
  R = \begin{pmatrix} - & a & b & c \\ g & - & d & e \\ h & j & - & f \\ i & k & l & - \end{pmatrix},
  \qquad
  \Pi = \operatorname{diag}(\pi_A, \pi_C, \pi_G, \pi_T)
  \]

- π_A = frequency parameter (likewise π_C, π_G, π_T)
- μ = mean instantaneous substitution rate
- a, …, k, l = relative rate parameters
- All models can be obtained by restricting the parameters in R.

Models: Substitution Rates
- GTR: a = g, b = h, …, e = k, f = l
- TrN: a = c = d = f
- K3ST: π_A = π_C = π_G = π_T = 1/4
- JC: a = b = c = d = e = f = 1, π_A = π_C = π_G = π_T = 1/4

Models: Substitution Probabilities
- P(t) = e^{Qt}
- P is evaluated by decomposing Q into its eigenvalues and eigenvectors.
- We have a P for every branch (of length t) in the cladogram.
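The relation between Q and P(t) can be checked numerically. The sketch below is dependency-free, so it swaps the eigendecomposition named on the slide for a truncated power series with scaling and squaring; a linear-algebra library would normally do this step. The JC rate matrix, scaled to one expected substitution per unit time, is hard-coded as a test case, and all names are illustrative.

```python
import math


def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, terms=20, squarings=10):
    """Approximate e^m by a Taylor series on m / 2^squarings, then square."""
    n = len(m)
    scaled = [[x / 2 ** squarings for x in row] for row in m]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes scaled^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def transition_probabilities(q, t):
    """P(t) = e^{Qt} for rate matrix q and branch length t."""
    return mat_exp([[x * t for x in row] for row in q])


# Jukes-Cantor rate matrix scaled to a mean rate of 1.
JC_Q = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]

if __name__ == "__main__":
    p = transition_probabilities(JC_Q, 0.5)
    closed_form = 0.25 + 0.75 * math.exp(-4.0 * 0.5 / 3.0)
    print(p[0][0], closed_form)
```

For JC the result can be compared against the known closed form, P_ii(t) = 1/4 + 3/4 e^{-4t/3}, which is what the example prints.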

Rate vs Time
- All models:
  - P(i->j) depends on t and the mean substitution rate μ through the product μt.
  - A branch can be long because it represents a long period of time OR because the rate of substitution has been high.
  - Impossible to tell apart, unless there is a perfect molecular clock.

Rate + Time = Branch Length
- If: the mean substitution rate μ is set to 1.
- And: the relative rate parameters a, b, …, f are scaled so that the average rate at equilibrium = 1.
- Then: Branch Length = expected number of substitutions per site.
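The scaling step can be sketched as follows, assuming the off-diagonal entries of Q are the relative rate times the frequency of the target base, and that "average rate at equilibrium" means the sum over bases of π_i times the total rate of leaving base i. The dict-based layout and the function names are assumptions made for illustration.

```python
BASES = "ACGT"


def build_q(rates, freqs):
    """Build a 4x4 rate matrix from relative rates and base frequencies.

    rates: dict keyed by ordered base pairs, e.g. ("A", "C") -> relative rate.
    freqs: dict base -> equilibrium frequency.
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i != j:
                q[i][j] = rates[(x, y)] * freqs[y]
        # Diagonal makes the row sum to zero (it is still 0.0 when summed).
        q[i][i] = -sum(q[i])
    return q


def mean_rate(q, freqs):
    """Expected substitutions per site per unit time at equilibrium."""
    return -sum(freqs[b] * q[i][i] for i, b in enumerate(BASES))


def scale_to_unit_rate(q, freqs):
    """Rescale q so that one unit of time equals one expected substitution."""
    r = mean_rate(q, freqs)
    return [[x / r for x in row] for row in q]


if __name__ == "__main__":
    jc_rates = {(x, y): 1.0 for x in BASES for y in BASES if x != y}
    jc_freqs = {b: 0.25 for b in BASES}
    q = scale_to_unit_rate(build_q(jc_rates, jc_freqs), jc_freqs)
    print(q[0])
```

After scaling, a branch of length 0.1 means 0.1 expected substitutions per site, whatever mix of rate and time produced it.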

Recap.
- Evolution of DNA sequences is modeled by a stochastic process in which each site evolves in time (t) independently of all other sites, according to a Poisson process with rate μ.
- Because the rate μ only occurs in products of the form μt, the absolute value of μ is arbitrary.
- Thus, all times should be considered relative to one another, and not as absolute values.
- Products of the form μt represent expected amounts of change.

Likelihood of a Cladogram - 1
- If: sites in the sequence evolve independently,
- Then: data represent a multinomial sample.
- Thus: an overall goodness-of-fit statistic is applicable (Log Likelihood Ratio Test).

Likelihood of a Cladogram - 2
- Likelihood of cladogram:
  - Likelihoods of occurrence of each state at each node, as a function of cladogram topology and branch lengths.
- Cladogram is given: How good is it?

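Likelihood of a Cladogram - 3
- The conditional likelihood of state i at sequence position j in taxon A, with descendants B and C, is:

  \[
  L(s_{Aj} = i) = \Big[ \sum_{k} P_{ik}(t_{AB}) \, L(s_{Bj} = k) \Big] \cdot \Big[ \sum_{l} P_{il}(t_{AC}) \, L(s_{Cj} = l) \Big]
  \]

  where s_{Aj} is the state at position j in A and t_{AB} is the length of the branch from A to B.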
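The recursion above can be written out directly. The sketch below assumes the JC model with branch lengths in expected substitutions per site, for which the transition probabilities have a simple closed form. The tree layout, nested pairs of (child, branch length), and the function names are hypothetical.

```python
import math

BASES = "ACGT"


def jc_prob(i, j, t):
    """JC probability of going from base i to base j along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def conditional_likelihoods(node, site):
    """Map each base i to L(state at node = i) for one alignment column.

    node: a taxon name (str), or a pair of (child, branch_length) tuples.
    site: dict taxon name -> observed base at this position.
    """
    if isinstance(node, str):
        return {b: (1.0 if site[node] == b else 0.0) for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:
        child_l = conditional_likelihoods(child, site)
        for i in BASES:
            # One bracketed factor of the formula: sum over the child's states.
            result[i] *= sum(jc_prob(i, k, t) * child_l[k] for k in BASES)
    return result


def site_likelihood(root, site):
    """Weight the root's conditional likelihoods by equal base frequencies."""
    cond = conditional_likelihoods(root, site)
    return sum(0.25 * cond[i] for i in BASES)


if __name__ == "__main__":
    root = (("B", 0.1), ("C", 0.1))
    print(site_likelihood(root, {"B": "A", "C": "A"}))
```

Under the independence assumption stated earlier, the likelihood of the whole cladogram is the product of the site likelihoods, or the sum of their logarithms, over all positions.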

Likelihood of a Cladogram - 4
- See figure 10 in SOWH.

Maximum Likelihood
- Pro: Consistency
  - As the number of items of data (n) increases, the probability that the estimator is far from the true value of the parameter (cladogram structure) decreases to zero.
- But:
  - Inferential consistency depends on the model.
  - Only finite amounts of data are considered, thus a 'long-term' property is not necessary.

Maximum Likelihood - 2
- "Anyone who considers this model (Poisson Process Model of DNA substitution) complex should bear in mind that it is the simplest mathematical model of state change with constant probabilities per unit time, and that a particular case (that of a very low rate of change) is used to justify parsimony methods."
- The model does not allow for insertions, deletions, and inversions.

When does ML = Parsimony?
- They estimate different parameters, therefore the estimates cannot match exactly.
- For cladogram structure alone:
  - If the Poisson Process Model (PPM) is correct, and we assume the expected amount of change, μt, to be very small, then the probability structures become the same.
  - For realistic values of μt, the two models do not behave identically.

Extensions of ML
- Rate heterogeneity among sites
- Other data types (other than sequences)
  - gene frequencies
  - restriction sites
- Pairwise Distance Methods
  - immunological data
  - DNA-DNA hybridizations