A powerful low-parameter method for inferring quartets under the General Markov Model
Jeremy Sumner, Barbara Holland, Peter Jarvis

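The general Markov model (GMM)
[Figure: a tree with root distribution π and a transition matrix on each of its five edges, labelled M1 to M5. Inset example: π is a vector of base frequencies over A, C, G, T, and each M is a 4 × 4 matrix with rows and columns indexed by A, C, G, T. The numerical entries are not recoverable.]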

Base composition
In the GMM the transition matrices do not have to be symmetric. As a consequence, base frequencies can differ between taxa. Almost all phylogenetic methods and commonly used models cannot account for drift in base composition across the tree.

The exception: Log-det distances
$d_{xy} = -\ln \det F_{xy}$
x: GCCTACGTCGAAGTCGTAGCTGTGCATGCTAGCGTCTC...
y: GTCTACATCGAAGTCGTATTTGTGCATGCAACAGTCTC...
$F_{xy}$ (rows: base in x; columns: base in y) =
     A  C  G  T
A    6  0  0  0
C    1  8  0  2
G    1  1  8  1
T    1  0  0  9
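As a rough illustration of the slide, the divergence matrix and the log-det distance can be computed as follows. This is a minimal sketch: the function names are hypothetical, and normalising the counts to proportions before taking the determinant is an assumption, since the slide only gives the bare formula.

```python
import math
from itertools import product

BASES = "ACGT"

def divergence_counts(x, y):
    """Count aligned site pairs: rows index the base in x, columns the base in y."""
    counts = {pair: 0 for pair in product(BASES, repeat=2)}
    for a, b in zip(x, y):
        counts[(a, b)] += 1
    return [[counts[(a, b)] for b in BASES] for a in BASES]

def det(m):
    """Determinant by Laplace expansion along the first row (fine for 4 x 4)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, entry in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * entry * det(minor)
    return total

def logdet_distance(x, y):
    """d_xy = -ln det F_xy, with F_xy normalised to proportions (an assumption)."""
    counts = divergence_counts(x, y)
    n = sum(map(sum, counts))
    freqs = [[c / n for c in row] for row in counts]
    return -math.log(det(freqs))

# The 38 bases shown on the slide (the sequences are truncated there).
x = "GCCTACGTCGAAGTCGTAGCTGTGCATGCTAGCGTCTC"
y = "GTCTACATCGAAGTCGTATTTGTGCATGCAACAGTCTC"
F = divergence_counts(x, y)
d = logdet_distance(x, y)
```

Here `F` reproduces the count table on the slide, with rows taken from the first sequence and columns from the second.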

Markov invariants
The log-det is the simplest example of a Markov invariant.
JS and PJ extended the theory of Markov invariants to larger subsets of taxa:
– Tangles (3 taxa)
– Squangles (4 taxa)
– Stangles

Math wizards...

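...and their magical polynomials
[Figure: a squangle polynomial written out term by term, with the coefficient and the indices of one term labelled, ending "and another 66,712 terms".]
e.g. $3\,p_{1}\,p_{18}\,p_{73}\,p_{168}\,p_{255} = 3\,p_{AAAA}\,p_{ACAC}\,p_{CAGA}\,p_{GGCT}\,p_{TTTG}$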
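The example term suggests that the 256 four-taxon site patterns are numbered 1 to 256 in base-4 order, with A, C, G, T as the digits 0 to 3. A small sketch of that mapping follows; the function names are hypothetical, and the ordering is inferred from the five indices on the slide rather than stated there.

```python
BASES = "ACGT"

def index_to_pattern(index, n_taxa=4):
    """Map a 1-based index (1..4**n_taxa) to a site pattern such as 'ACAC'."""
    k = index - 1
    letters = []
    for _ in range(n_taxa):
        k, r = divmod(k, 4)       # peel off the least significant base-4 digit
        letters.append(BASES[r])
    return "".join(reversed(letters))

def pattern_to_index(pattern):
    """Inverse mapping: 'ACAC' -> 18."""
    index = 0
    for base in pattern:
        index = index * 4 + BASES.index(base)
    return index + 1

patterns = [index_to_pattern(i) for i in (1, 18, 73, 168, 255)]
```

With this numbering, each $p$ is the relative frequency of one site pattern in the four-taxon alignment.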

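Squangle table
[Rows correspond to the three possible quartet topologies; the tree drawings are not recoverable.]
            $q_1$   $q_2$   $q_3$
tree 1       0      -u       u
tree 2       v       0      -v
tree 3      -w       w       0
$q_1 + q_2 + q_3 = 0$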

Choosing a quartet
[Figure: the squangle table, with rows (0, -u, u), (v, 0, -v), (-w, w, 0) under columns $q_1$, $q_2$, $q_3$, annotated with u and -u.]

Choosing a quartet
[Figure: the squangle table, with rows (0, -u, u), (v, 0, -v), (-w, w, 0) under columns $q_1$, $q_2$, $q_3$, annotated with u = 0.]

Residual sum of squares
Pick the quartet tree that minimises the residual sum of squares (RSS).
$u = \max\{0, (q_3 - q_2)/2\}$ ($v$, $w$ similar)
The RSS are always of the form $q_1^2 + [(q_3 - q_2)/2 - u]^2$.
If things are in the right order ($q_3 > q_2$) then the second term vanishes; if they aren't, $u$ gets set to 0 and the second term becomes $[(q_3 - q_2)/2]^2$.
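The selection rule can be sketched in a few lines. This assumes the squangle table rows (0, -u, u), (v, 0, -v), (-w, w, 0) for the three topologies, and reads the RSS for the first topology as $q_1^2 + [(q_3 - q_2)/2 - u]^2$ with the other two by symmetry, a reading that reproduces the RSS values quoted in the slides' example. The function names are hypothetical.

```python
def quartet_fits(q1, q2, q3):
    """Return [(edge_estimate, rss), ...] for the three quartet topologies."""
    # Each row pairs the squangle expected to be zero with the difference
    # that estimates the non-negative parameter (u, v or w) for that topology.
    rows = [
        (q1, (q3 - q2) / 2),  # expected (0, -u, u)
        (q2, (q1 - q3) / 2),  # expected (v, 0, -v)
        (q3, (q2 - q1) / 2),  # expected (-w, w, 0)
    ]
    fits = []
    for zero_term, raw in rows:
        est = max(0.0, raw)                      # parameter clipped at zero
        rss = zero_term ** 2 + (raw - est) ** 2  # second term is 0 unless clipped
        fits.append((est, rss))
    return fits

def best_quartet(q1, q2, q3):
    """Index (0, 1 or 2) of the topology with the smallest RSS."""
    fits = quartet_fits(q1, q2, q3)
    return min(range(3), key=lambda i: fits[i][1])

# Squangle values from the Rhea/Hippo/Platypus/Wallaroo example.
fits = quartet_fits(9.14e-07, -7.58e-06, 6.67e-06)
choice = best_quartet(9.14e-07, -7.58e-06, 6.67e-06)
```

On the example values the first topology wins, with the other two having their parameter clipped to 0 and paying the penalty term.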

Weights (I)
Weight each quartet: $w_i = 1/\mathrm{RSS}_i$
A posterior-probability(ish) weighting scheme for the quartets is then $p_i = w_i/(w_1 + w_2 + w_3)$
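A minimal sketch of this weighting, applied to the three RSS values quoted in the slides' example. The function name is hypothetical, and the sketch does not handle a perfect fit (RSS = 0), which would need a special case.

```python
def quartet_weights(rss_values):
    """Normalised inverse-RSS weights p_i = w_i / sum(w), with w_i = 1 / RSS_i."""
    weights = [1.0 / r for r in rss_values]  # raises ZeroDivisionError if an RSS is 0
    total = sum(weights)
    return [w / total for w in weights]

# RSS_1, RSS_2, RSS_3 from the Rhea/Hippo/Platypus/Wallaroo example.
p = quartet_weights([8.36e-13, 6.58e-11, 6.25e-11])
```

Because the weights are inverse RSS values, a topology whose RSS is two orders of magnitude smaller than the others takes almost all of the normalised weight.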

Example
((Rhea,Hippo),Platypus,Wallaroo);
MtDNA genomes [number of sites not recoverable]
$q_1$ = 9.14e-07, $q_2$ = -7.58e-06, $q_3$ = 6.67e-06
$\mathrm{RSS}_1$ = 8.36e-13, $\mathrm{RSS}_2$ = 6.58e-11, $\mathrm{RSS}_3$ = 6.25e-11
u = 7.13e-06, v = 0, w = 0
$p_1$, $p_2$, $p_3$ [values not recoverable]

Weights (II)
The RSS weights give a measure of the relative support for each topology. It would also be useful to have a quartet weight that is related to the length of the middle edge of the quartet.
[Figure: the squangle table, with rows (0, -u, u), (v, 0, -v), (-w, w, 0) under columns $q_1$, $q_2$, $q_3$.]
The most likely suspect is $u = (q_3 - q_2)/2$

[Figure: the squangle table shown twice, with rows (0, -u, u), (v, 0, -v), (-w, w, 0) under columns $q_1$, $q_2$, $q_3$. Plot content not recoverable.]
Felsenstein tree, pendant short edges = 0.01, pendant long edges = 0.1

Basic simulation setup
– Felsenstein zone
– Farris zone
– Jukes-Cantor model: equal base frequencies, all changes equally likely
– 100 data sets for each parameter choice
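A setup of this kind might be sketched as below. The function names, the internal edge length and the number of sites are hypothetical. The pendant lengths 0.01 and 0.1 are the ones quoted for the Felsenstein tree, and the placement of long edges follows the usual definitions: on non-sister taxa in the Felsenstein zone, on sister taxa in the Farris zone.

```python
import math
import random

BASES = "ACGT"

def jc_evolve(seq, t, rng):
    """Evolve a sequence along an edge of length t (expected changes per site)."""
    # Under Jukes-Cantor the end state differs with probability 3/4 (1 - e^{-4t/3}).
    p_change = 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_change:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def simulate_quartet(n_sites, pendant, internal, seed=None):
    """Simulate taxa 1-4 on the tree ((1,2),(3,4)).

    pendant: four pendant edge lengths (taxa 1..4); internal: middle edge length.
    """
    rng = random.Random(seed)
    left = "".join(rng.choice(BASES) for _ in range(n_sites))  # node joining 1, 2
    right = jc_evolve(left, internal, rng)                     # node joining 3, 4
    parents = (left, left, right, right)
    return [jc_evolve(p, t, rng) for p, t in zip(parents, pendant)]

short, long_ = 0.01, 0.1
# Felsenstein zone: long edges on non-sister taxa 1 and 3.
felsenstein = simulate_quartet(1000, (long_, short, long_, short), internal=0.01, seed=1)
# Farris zone: long edges on sister taxa 1 and 2.
farris = simulate_quartet(1000, (long_, long_, short, short), internal=0.01, seed=1)
```

Repeating the call with 100 different seeds per parameter choice would mirror the replication described on the slide.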

Simulations (I) Testing power compared to cNJ

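Simulations (II) Adding base composition drift
Added a GC bias along the long edges
[Table: 4 × 4 substitution matrix over A, C, G, T for the long edges. Diagonal entries are marked *, off-diagonal entries are $p_l$, and some entries are multiplied by the bias factor b (e.g. A→C is $p_l \cdot b$). The full layout is not recoverable.]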

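GC bias on long edges
[Figure: results for SQ and NJ by bias and number of sites. Surviving fragments: #Sites = 200, SQ 71. Other values not recoverable.]
Felsenstein: short edge = 0.005, long edge = 0.075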

Simulations (II) Adding a proportion of invariant sites
[Figure: results by pInv and number of sites. Values not recoverable.]
Felsenstein: short edge = 0.005, long edge = 0.075

Putting it all together
Most people want to build trees on more than 4 taxa. Fortunately there are already several methods for going from quartets to larger trees:
– Q*
– Quartet puzzling
– Any supertree method
Or from quartets to splits graphs:
– QNet

QNet – distance-based weights, mt genomes

QNet – distance-based weights
[Figure panels: 1st, 2nd and 3rd codon positions]

Detecting invariant sites
The residual sum of squares (RSS) scores give an opportunity to detect invariant sites. Remove constant sites in order to:
– Idea 1: minimise the sum of RSS
– Idea 2: minimise the minimum RSS

15,000 sites, of which 5,000 are invariable. The proportion of constant sites among the 10,000 variable sites was 0.58.
[Table: columns "constant sites", "PP", "sum RSS", "min RSS". The numerical entries are not recoverable apart from exponent fragments such as E-11.]

Vagaries of real data
Dealing sensibly with missing or ambiguous data:
– Currently, all sites with question marks, gaps or ambiguities anywhere in the alignment are removed
– It seems better to do this on a per-quartet basis
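The difference between the two filtering strategies can be illustrated with a toy alignment. The taxon names, sequences and function name below are hypothetical, and the sketch treats anything other than an upper-case A, C, G or T as missing or ambiguous.

```python
VALID = set("ACGT")

def clean_columns(alignment, taxa):
    """Indices of columns where every listed taxon has an unambiguous base."""
    n_sites = len(next(iter(alignment.values())))
    return [
        i for i in range(n_sites)
        if all(alignment[t][i] in VALID for t in taxa)
    ]

alignment = {
    "t1": "ACGTAC-T",
    "t2": "ACGTNCGT",
    "t3": "A?GTACGT",
    "t4": "ACGTACGT",
    "t5": "ACG-ACGT",
}

# Whole-alignment filtering: a problem in any taxon removes the column.
global_sites = clean_columns(alignment, alignment.keys())
# Per-quartet filtering: only the four taxa in the quartet matter.
quartet_sites = clean_columns(alignment, ["t1", "t2", "t3", "t4"])
```

In this toy case the quartet keeps one more column than the whole-alignment filter, because the gap in `t5` no longer counts against it.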

Code
– R code
– Python code, creates output that can be understood by QNet

Simulation plans
– Compare to likelihood
– Compare to NJ with log-det distances
– Look at rates across sites instead of just proportions of invariant sites