A powerful low-parameter method for inferring quartets under the General Markov Model
Jeremy Sumner, Barbara Holland, Peter Jarvis
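The general Markov model (GMM)
[Figure: tree with root distribution π and transition matrices M1, M2, M3, M4, M5 on its edges. E.g. π is a vector indexed by the states A, C, G, T, and each M is a 4 × 4 matrix with rows and columns indexed by A, C, G, T.]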
Base composition
In the GMM the mutation transition matrices do not have to be symmetric. As a consequence, base frequencies can differ between taxa.
Almost all phylogenetic methods and commonly used models cannot account for drift in base composition across the tree.
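To make the point about asymmetry concrete, here is a minimal sketch of how base composition moves along one edge, using the relation that the composition after an edge is π M. Both matrices and the function name `propagate` are made-up illustrations, not values from the talk.

```python
# States are ordered A, C, G, T. All numbers below are hypothetical.

def propagate(pi, M):
    """Base composition after one edge: (pi M)_j = sum_i pi_i * M_ij."""
    return [sum(pi[i] * M[i][j] for i in range(4)) for j in range(4)]

pi = [0.25, 0.25, 0.25, 0.25]

# Symmetric transition matrix (rows sum to 1): uniform composition is preserved.
M_sym = [
    [0.91, 0.03, 0.03, 0.03],
    [0.03, 0.91, 0.03, 0.03],
    [0.03, 0.03, 0.91, 0.03],
    [0.03, 0.03, 0.03, 0.91],
]

# Non-symmetric matrix (rows sum to 1) that favours changes into C and G.
M_gc = [
    [0.88, 0.05, 0.05, 0.02],  # from A to A, C, G, T
    [0.02, 0.91, 0.05, 0.02],  # from C
    [0.02, 0.05, 0.91, 0.02],  # from G
    [0.02, 0.05, 0.05, 0.88],  # from T
]

print(propagate(pi, M_sym))  # composition stays uniform
print(propagate(pi, M_gc))   # composition drifts towards C and G
```

Applying a different non-symmetric matrix on each edge gives each taxon its own base composition, which is the situation the GMM allows for.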
The exception: Log-det distances
$d_{xy} = -\ln \det F_{xy}$

x: GCCTACGTCGAAGTCGTAGCTGTGCATGCTAGCGTCTC...
y: GTCTACATCGAAGTCGTATTTGTGCATGCAACAGTCTC...

F_xy =
       A   C   G   T
   A   6   0   0   0
   C   1   8   0   2
   G   1   1   8   1
   T   1   0   0   9
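The divergence matrix on this slide can be rebuilt from the two sequence fragments shown. The sketch below counts the site pairs (rows follow the first sequence, columns the second) and then evaluates -ln det F. Dividing the counts by the number of sites before taking the determinant is an assumption here, since the slide does not say whether F holds counts or proportions; the function names are illustrative.

```python
import math

STATES = "ACGT"

# The two fragments shown on the slide (the slide truncates them with "...").
x = "GCCTACGTCGAAGTCGTAGCTGTGCATGCTAGCGTCTC"
y = "GTCTACATCGAAGTCGTATTTGTGCATGCAACAGTCTC"

def divergence_counts(x, y):
    """4 x 4 table of site-pair counts: rows index x, columns index y."""
    counts = [[0] * 4 for _ in range(4)]
    for a, b in zip(x, y):
        counts[STATES.index(a)][STATES.index(b)] += 1
    return counts

def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in m]
    n = len(m)
    d = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if m[p][c] == 0:
            return 0.0
        if p != c:
            m[c], m[p] = m[p], m[c]
            d = -d
        d *= m[c][c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= f * m[c][k]
    return d

def logdet_distance(x, y):
    """-ln det F, with F as proportions (assumption). Needs det F > 0."""
    counts = divergence_counts(x, y)
    n = sum(sum(row) for row in counts)
    F = [[c / n for c in row] for row in counts]
    return -math.log(det(F))

for row in divergence_counts(x, y):
    print(row)
print(logdet_distance(x, y))
```

Many published log-det variants also rescale by the base frequencies of each sequence; that step is left out because the slide gives only the bare formula.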
Markov invariants
The log det is an example (the simplest) of a Markov invariant.
JS and PJ extended the theory of Markov invariants to larger subsets of taxa
– Tangles (3 taxa)
– Squangles (4 taxa)
– Stangles
Math wizards...
...and their magical polynomials
e.g. (coefficient 3; indices 1, 18, 73, 168, 255)
$3\,p_{1}\,p_{18}\,p_{73}\,p_{168}\,p_{255} = 3\,p_{AAAA}\,p_{ACAC}\,p_{CAGA}\,p_{GGCT}\,p_{TTTG}$
and another 66,712 terms
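The five indices in the example term agree with reading each four-taxon site pattern as a base-4 number with A = 0, C = 1, G = 2, T = 3 and then adding 1. The sketch below encodes that reading; the convention is inferred from the slide's example rather than stated by the authors, and `pattern_index` is an illustrative name.

```python
STATES = "ACGT"

def pattern_index(pattern):
    """1-based index of a site pattern read as a base-4 number (A=0 ... T=3)."""
    idx = 0
    for ch in pattern:
        idx = idx * 4 + STATES.index(ch)
    return idx + 1

for pat in ["AAAA", "ACAC", "CAGA", "GGCT", "TTTG"]:
    print(pat, pattern_index(pat))
```

Under this convention the 256 possible patterns run from AAAA (index 1) to TTTT (index 256).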
Squangle table
Each row gives the expected values of the three squangles for one of the three quartet topologies.

      q1    q2    q3
       0    -u     u
       v     0    -v
      -w     w     0

$q_1 + q_2 + q_3 = 0$
Choosing a quartet
[Squangle table repeated: rows (0, -u, u), (v, 0, -v), (-w, w, 0). Figure annotation: u, -u]
Choosing a quartet
[Squangle table repeated. Figure annotation: u = 0]
Residual sum of squares
Pick the quartet tree that minimises the residual sum of squares (RSS).
$u = \max\{0, (q_3 - q_2)/2\}$ ($v$, $w$ similar)
The RSS are always of the form $q_1^2 + [(q_3 - q_2)/2 - u]^2$.
If things are in the right order ($q_3 > q_2$) then the second term vanishes, but if they aren't then $u$ gets set to 0 and the second term contributes to the RSS.
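A minimal sketch of the selection rule follows. The slide only spells out the formula for u; the expressions used for v and w are read off the squangle table rows (v, 0, -v) and (-w, w, 0) by analogy, so treat them as an assumption. The function names are illustrative.

```python
def fit_quartets(q1, q2, q3):
    """Return [(edge, rss), ...] for the three topologies, in table-row order.

    Table rows (expected squangle values):
      tree 1: (0, -u, u)   tree 2: (v, 0, -v)   tree 3: (-w, w, 0)
    """
    # For each tree: the squangle expected to be zero, and the unconstrained
    # edge estimate from the other two squangles.
    candidates = [
        (q1, (q3 - q2) / 2),  # u
        (q2, (q1 - q3) / 2),  # v (by analogy, assumption)
        (q3, (q2 - q1) / 2),  # w (by analogy, assumption)
    ]
    fits = []
    for q_zero, estimate in candidates:
        edge = max(0.0, estimate)                  # edge parameter cannot be negative
        rss = q_zero ** 2 + (estimate - edge) ** 2  # second term is 0 if estimate >= 0
        fits.append((edge, rss))
    return fits

def best_quartet(q1, q2, q3):
    """Index (0, 1 or 2) of the topology with the smallest RSS."""
    fits = fit_quartets(q1, q2, q3)
    return min(range(3), key=lambda i: fits[i][1])
```

Clamping the edge at zero is what makes the wrong topologies pay a penalty: their estimate is negative, so the full squared estimate lands in the RSS.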
Weights (I)
Weight each quartet: $w_i = 1/\mathrm{RSS}_i$
A posterior probability (ish) weighting scheme for the quartets is then $p_i = w_i/(w_1 + w_2 + w_3)$
Example
((Rhea,Hippo),Platypus,Wallaroo);
MtDNA genomes ([count missing] sites)
$q_1$ = 9.14e-07, $q_2$ = -7.58e-06, $q_3$ = 6.67e-06
$u$ = 7.13e-06, $v$ = 0, $w$ = 0
RSS_1 = 8.36e-13, RSS_2 = 6.58e-11, RSS_3 = 6.25e-11
$p_1$, $p_2$, $p_3$: [values missing]
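The weighting scheme w_i = 1/RSS_i, p_i = w_i/(w_1 + w_2 + w_3) can be applied directly to the three RSS values in this example. This is a sketch only: the function name is illustrative, and it assumes every RSS is strictly positive (an RSS of exactly zero would need special handling that the slides do not describe).

```python
def quartet_weights(rss):
    """Normalised inverse-RSS weights: w_i = 1/RSS_i, p_i = w_i / sum(w)."""
    w = [1.0 / r for r in rss]  # assumes every RSS > 0
    total = sum(w)
    return [wi / total for wi in w]

# RSS values as shown in the example.
rss = [8.36e-13, 6.58e-11, 6.25e-11]
p = quartet_weights(rss)
print(p)
```

Because the first RSS is roughly two orders of magnitude smaller than the other two, nearly all of the weight goes to the first topology.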
Weights (II)
The RSS weights give a measure of the relative support for each topology.
It would also be useful to have a quartet weight that is related to the length of the middle edge of the quartet.
Given the squangle table rows (0, -u, u), (v, 0, -v), (-w, w, 0), the most likely suspect is $u = (q_3 - q_2)/2$.
[Figure: squangle table shown twice]
Felsenstein tree, pendant short edges = 0.01, pendant long edges = 0.1
Basic simulation setup
Felsenstein zone, Farris zone
Jukes-Cantor model: equal base frequencies, all changes equally likely
100 data sets for each parameter choice
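A simulation of this kind could be sketched as follows. This is not the authors' simulation code. It assumes the usual reading of the two zones (Felsenstein zone: the two long pendant edges are on opposite sides of the middle edge; Farris zone: the two long pendant edges are sisters). The internal edge length, sequence length and random seeds are placeholders chosen for illustration, and the pendant lengths 0.01 and 0.1 are only one of the parameter choices mentioned in the talk.

```python
import math
import random

STATES = "ACGT"

def jc_matrix(t):
    """Jukes-Cantor transition matrix for an edge of length t (substitutions/site)."""
    e = math.exp(-4.0 * t / 3.0)
    same = 0.25 + 0.75 * e
    diff = 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def evolve(state, M, rng):
    """Draw the state at the far end of an edge with transition matrix M."""
    return rng.choices(range(4), weights=M[state])[0]

def simulate_quartet(n_sites, pendant, internal, rng):
    """Simulate site patterns on the tree ((1,2),(3,4)).

    pendant: lengths of the edges leading to taxa 1, 2, 3, 4.
    internal: length of the middle edge.
    """
    M_pend = [jc_matrix(t) for t in pendant]
    M_int = jc_matrix(internal)
    sites = []
    for _ in range(n_sites):
        left = rng.randrange(4)             # uniform base at the node above taxa 1, 2
        right = evolve(left, M_int, rng)    # node above taxa 3, 4
        tips = [
            evolve(left, M_pend[0], rng),
            evolve(left, M_pend[1], rng),
            evolve(right, M_pend[2], rng),
            evolve(right, M_pend[3], rng),
        ]
        sites.append("".join(STATES[s] for s in tips))
    return sites

short, long_ = 0.01, 0.1  # pendant edge lengths
internal = 0.01           # placeholder: not specified on the slide
felsenstein = simulate_quartet(1000, [long_, short, long_, short], internal, random.Random(1))
farris = simulate_quartet(1000, [long_, long_, short, short], internal, random.Random(2))
```

Repeating each call 100 times with different seeds would give the 100 data sets per parameter choice mentioned on the slide.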
Simulations (I) Testing power compared to cNJ
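Simulations (II) Adding base composition drift
Added a GC bias along the long edges.
[Transition matrix over A, C, G, T: diagonal entries marked *, off-diagonal entries $p_l$, with biased entries multiplied by $b$. The exact layout is not recoverable.]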
GC bias on long edges
[Results indexed by bias and #Sites, comparing SQ and NJ. Surviving fragment: #Sites = 200, SQ 71.]
Felsenstein: short edge = 0.005, long edge = 0.075
Simulations (II) Adding a proportion of invariant sites
[Results indexed by pInv and #Sites.]
Felsenstein: short edge = 0.005, long edge = 0.075
Putting it all together
Most people want to build trees on more than 4 taxa.
Fortunately there are already several methods for going from quartets to larger trees
– Q*
– Quartet puzzling
– Any supertree method
Or from quartets to splits graphs
– QNet
QNet – distance based weights
[Figure: mt genomes]
QNet – distance based weights
[Figure panels: 1st codon pos, 2nd, 3rd]
Detecting invariant sites
The residual sum of squares (RSS) scores give an opportunity to detect invariant sites.
Remove constant sites in order to
– Idea 1: Minimise sum of RSS
– Idea 2: Minimise minimum RSS
15,000 sites, of which 5,000 are invariable.
The proportion of constant sites out of the 10,000 variable sites was 0.58.
[Table with columns: constant sites, PP, sum RSS, min RSS. Numeric entries are not recoverable.]
Vagaries of real data
Dealing sensibly with missing or ambiguous data.
Currently we remove all sites with question marks, gaps or ambiguities over the whole alignment.
It seems better to do this on a per-quartet basis.
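The difference between the two filtering strategies can be sketched as below. The alignment, the taxon names and the function names are hypothetical; the only rule taken from the slide is that a site is dropped when any relevant taxon has something other than an unambiguous A, C, G or T.

```python
VALID = set("ACGT")

def clean_sites_global(alignment):
    """Sites where every taxon in the alignment has an unambiguous base."""
    names = list(alignment)
    length = len(alignment[names[0]])
    return [i for i in range(length)
            if all(alignment[n][i].upper() in VALID for n in names)]

def clean_sites_for_quartet(alignment, quartet):
    """Sites where the four taxa of this quartet all have an unambiguous base."""
    length = len(alignment[quartet[0]])
    return [i for i in range(length)
            if all(alignment[n][i].upper() in VALID for n in quartet)]

# Hypothetical toy alignment with a gap, an ambiguity code and a question mark.
alignment = {
    "t1": "ACGTAC",
    "t2": "AC-TAC",
    "t3": "ACGTNC",
    "t4": "ACGTAC",
    "t5": "A?GTAC",
}

print(clean_sites_global(alignment))                                   # [0, 3, 5]
print(clean_sites_for_quartet(alignment, ("t1", "t2", "t3", "t4")))    # [0, 1, 3, 5]
print(clean_sites_for_quartet(alignment, ("t1", "t3", "t4", "t5")))    # [0, 2, 3, 5]
```

Per-quartet filtering keeps a site whenever the four taxa in question are clean, so each quartet is scored on more data than the global filter allows.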
Code
R code
Python code, which creates output that can be understood by QNet
Simulation plans
Compare to likelihood
Compare to NJ with log-det distances
Look at rates across sites instead of just proportions of invariant sites