Mareike Fischer. Revisiting the question: How many characters are needed to reconstruct the true tree? Mareike Fischer and Marta Casanellas, Isaac Newton Institute.

Presentation transcript:

Revisiting the question: How many characters are needed to reconstruct the true tree? Mareike Fischer and Marta Casanellas. Isaac Newton Institute, 20 June 2011.

The Problem. Given: an alignment (e.g. DNA). Wanted: a reconstruction of the ‘true’ tree. Solution: e.g. [illustration not preserved in transcript]. But: Is the alignment long enough for a reliable reconstruction?

Previous Approaches. 1. Churchill, von Haeseler, Navidi (1992): 4-taxon scenario. Observation: The probability of reconstructing the true tree increases with the length of the interior edge. [Plot: reconstruction probability (Rec. Prob.) against interior edge length, annotated "more characters".]

Previous Approaches. 2. Yang (1998): 4-taxon scenario, interior edge ‘fixed’ at 5% of tree length; 5 different tree shapes were investigated. Observations: ‘Farris Zone’: MP better. ‘Felsenstein Zone’: ML better. The optimal length for the interior edge ranges between [value not preserved in transcript] and [value not preserved in transcript]. [Plot: reconstruction probability (Rec. Prob.) against tree length.]

Limitation of previous approaches. The approaches mentioned so far are based on simulations. Still needed: a mathematical analysis of the influence of branch lengths on tree reconstruction.

Previous theoretical results. Steel and Székely (2002): [Tree diagram: 4 taxa, interior edge of length $x$, four pendant edges of length $y$.] Here, the number $k$ of characters needed to reconstruct the true tree grows at rate [expression not preserved in transcript]. But what happens if we fix the ratio ($y := px$), and then take the value of $x$ that minimizes $k$?

Previous theoretical results. M. Fischer and M. Steel (2009). Setting: 4 taxa, pendant edges of length $px$ (with $p > 1$), short interior edge of length $x$, 2-state symmetric model. [Tree diagram: interior edge $x$, pendant edges $px$.] Result: The sequence length needed to reliably reconstruct the tree grows at rate $p^2$.

Limitation of previous approaches. Previous approaches are based on simulations, or employ only 2 states (oops). Still needed: a mathematical analysis of the influence of branch lengths on tree reconstruction for 4 states, so…

Our Approach. Setting: 4 taxa, pendant edges of length $px$ (with $p > 1$), short interior edge of length $x$, Jukes-Cantor model. [Tree diagram: interior edge $x$, pendant edges $px$.]

Most importantly… We kindly apologize for criticizing Miss Parsimony in the past and... remorsefully offer her an assistant position on our current project. [Cartoon: Miss Parsimony.]

Main Result. For reliable MP reconstruction: $k$ grows at least at rate $p^2$. For the optimal value of $x$, $k$ grows at rate $p^2$, so this rate can be achieved for 4 states, too!

Idea of Proof. Set $X_i$ i.i.d., and [definition of $Z_k$ not preserved in transcript]. Then (by CLT) [limiting distribution not preserved in transcript]. Note that the true tree $T_1$ will be favored over $T_2$ if and only if $Z_k > 0$.
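The formulas on this slide did not survive, so the following is only a sketch under stated assumptions: each $X_i$ is taken to be $+1$ if character $i$ favors $T_1$, $-1$ if it favors $T_2$, and $0$ otherwise, and $Z_k$ is taken to be the sum $X_1 + \dots + X_k$. Under that reading, the CLT approximation of $P(Z_k > 0)$ can be compared with a direct simulation. All function names and the example probabilities are hypothetical.

```python
import math
import random


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def clt_success_probability(p_plus, p_minus, k):
    """CLT approximation of P(Z_k > 0), assuming Z_k = X_1 + ... + X_k
    with P(X_i = 1) = p_plus and P(X_i = -1) = p_minus (assumed reading)."""
    mu = p_plus - p_minus                 # mean of a single X_i
    var = p_plus + p_minus - mu ** 2      # variance of a single X_i
    return normal_cdf(k * mu / math.sqrt(k * var))


def simulate_success_probability(p_plus, p_minus, k, trials=2000, seed=0):
    """Monte Carlo estimate of P(Z_k > 0) under the same assumptions."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        z = 0
        for _ in range(k):
            u = rng.random()
            if u < p_plus:
                z += 1
            elif u < p_plus + p_minus:
                z -= 1
        if z > 0:
            wins += 1
    return wins / trials


if __name__ == "__main__":
    # Made-up probabilities, chosen only to illustrate the comparison.
    for k in (100, 500, 2000):
        print(k,
              clt_success_probability(0.06, 0.04, k),
              simulate_success_probability(0.06, 0.04, k))
```

The simulation counts ties ($Z_k = 0$) as failures, so for small $k$ it tends to sit slightly below the normal approximation.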

Idea of Proof. Since the $X_i$ are i.i.d., $\mu_k$ and $\sigma_k$ depend only on $k$ and the probabilities $P(X_1 = 1)$ and $P(X_1 = -1)$. These probabilities can be calculated (e.g. using Felsenstein, Hadamard or Fourier Transform): [formulas not preserved in transcript]. (Here, $\theta = e^{-\frac{4}{3}x}$.) Note that $P(X_1 = 1)$ and $P(X_1 = -1)$ only depend on $x$ and $p$. Then, for fixed $p$, the ratio [expression not preserved in transcript] can be used to find a value of $x$ that minimizes $k$.
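The closed-form probabilities are missing from the transcript, but they can be recomputed numerically. The sketch below is not the authors' derivation: it sums over the states of the two interior nodes by brute force, under the Jukes-Cantor model with a uniform root distribution. It assumes the true tree is $T_1 = 12|34$, that $T_2 = 13|24$, and that $X_1 = \pm 1$ exactly on the parsimony-informative patterns $aabb$ and $abab$. The function names, the grid search, and the use of $\sigma^2/\mu^2$ as a proxy for $k$ are assumptions made for illustration.

```python
import math
from itertools import product

STATES = range(4)


def jc_transition(a, b, length):
    """Jukes-Cantor probability of going from state a to state b along an edge."""
    theta = math.exp(-4.0 * length / 3.0)
    return 0.25 + 0.75 * theta if a == b else 0.25 - 0.25 * theta


def pattern_probability(pattern, x, p):
    """Probability of a leaf pattern on the tree 12|34 with
    interior edge x and pendant edges p*x (uniform root distribution)."""
    total = 0.0
    for u, v in product(STATES, repeat=2):  # states at the two interior nodes
        prob = 0.25 * jc_transition(u, v, x)
        prob *= jc_transition(u, pattern[0], p * x)
        prob *= jc_transition(u, pattern[1], p * x)
        prob *= jc_transition(v, pattern[2], p * x)
        prob *= jc_transition(v, pattern[3], p * x)
        total += prob
    return total


def support_probabilities(x, p):
    """Return (P(X_1 = 1), P(X_1 = -1)) under the assumed definition of X_1."""
    plus = minus = 0.0
    for a, b in product(STATES, repeat=2):
        if a != b:
            plus += pattern_probability((a, a, b, b), x, p)   # favors 12|34
            minus += pattern_probability((a, b, a, b), x, p)  # favors 13|24
    return plus, minus


def relative_k(x, p):
    """sigma^2 / mu^2 for one character; assumed proportional to the k needed."""
    plus, minus = support_probabilities(x, p)
    mu = plus - minus
    return (plus + minus - mu ** 2) / mu ** 2


def minimal_relative_k(p, grid=200):
    """Coarse grid search over x in (0, 1]; returns (best value, best x)."""
    return min((relative_k(i / grid, p), i / grid) for i in range(1, grid + 1))


if __name__ == "__main__":
    for p in (2.0, 4.0, 8.0, 16.0):
        best, x = minimal_relative_k(p)
        print(p, x, best, best / p ** 2)
```

Printing `best / p ** 2` for growing $p$ is one way to check numerically whether the minimized value scales like $p^2$, as the main result states for MP.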

Idea of Proof, 2: the $X_i$ are i.i.d. Since the $X_i$ are i.i.d., we have [formula not preserved in transcript].
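The slide's formula is missing, so what follows is a hedged reconstruction of the standard argument rather than the authors' exact statement. It assumes $Z_k = X_1 + \dots + X_k$ and writes $\mu$ and $\sigma^2$ for the mean and variance of a single $X_1$; both symbols, and the quantile $z_\alpha$ for a target success probability $1 - \alpha$, are introduced here for illustration.

```latex
\begin{align*}
\mu_k &= \mathbb{E}[Z_k] = k\,\mu, &
\sigma_k^2 &= \operatorname{Var}(Z_k) = k\,\sigma^2, \\
P(Z_k > 0) &\approx \Phi\!\left(\frac{\mu_k}{\sigma_k}\right)
            = \Phi\!\left(\sqrt{k}\,\frac{\mu}{\sigma}\right), &
k &\approx z_\alpha^2\,\frac{\sigma^2}{\mu^2}.
\end{align*}
```

Under these assumptions, minimizing $k$ over $x$ for fixed $p$ amounts to minimizing $\sigma^2/\mu^2$, which depends only on $P(X_1 = 1)$ and $P(X_1 = -1)$.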

Summary and Extension. For MP, the number $k$ of characters needed to reliably reconstruct the true tree grows at rate $p^2$. What about other methods? Can they do better (e.g. rate $p$)?

Extension. Other methods cannot do better! (This can be shown using the so-called Hellinger distance.)

The Hellinger Distance. [Formula not preserved in transcript.] $S$: set of site patterns; $p$, $q$: probability distributions on $S$.
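The slide's formula is missing. A common textbook definition is $d_H(p, q) = \sqrt{\sum_{s \in S} \left(\sqrt{p(s)} - \sqrt{q(s)}\right)^2}$; some authors include an extra factor $1/\sqrt{2}$, and the transcript does not show which normalization the talk used. The sketch below uses the unnormalized form, with distributions given as dictionaries from site patterns to probabilities. The function name and the toy distributions are hypothetical.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two distributions on site patterns.

    p and q map patterns to probabilities; patterns missing from one
    dictionary are treated as having probability 0. Uses the
    unnormalized convention (no factor 1/sqrt(2)), which is an assumption.
    """
    patterns = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(s, 0.0)) - math.sqrt(q.get(s, 0.0))) ** 2
        for s in patterns
    )
    return math.sqrt(squared)


if __name__ == "__main__":
    # Toy distributions over three made-up site patterns.
    p = {"AACC": 0.5, "ACAC": 0.3, "ACCA": 0.2}
    q = {"AACC": 0.4, "ACAC": 0.4, "ACCA": 0.2}
    print(hellinger_distance(p, q))
```

With this convention the distance is 0 for identical distributions and $\sqrt{2}$ for distributions with disjoint support.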

Outlook. Questions for future work: What happens when you approach the Felsenstein Zone? What happens in general with different tree shapes or more taxa?

Advantages of mathematics… Questions so far? Else, let’s finally see why boring maths formulas can be less frustrating than biology at times…

Thanks… … to Marta Casanellas, … to the WWTF and the CRM for funding, … to Roger Hargreaves for his terrific cartoons, … to YOU for listening (or at least waking up just in time to read this message).