Published by Lorena Sweeting. Modified over 9 years ago.
1
A powerful low-parameter method for inferring quartets under the General Markov Model
Jeremy Sumner, Barbara Holland, Peter Jarvis
2
The general Markov model (GMM)
[Figure: tree labelled with π and the matrices M_1, M_2, M_3, M_4, M_5]
e.g.
π =       A      C      G      T
          0.3    0.2    0.4    0.1

M =       A      C      G      T
     A    0.80   0.10   0.07   0.03
     C    0.05   0.75   0.15   0.05
     G    0.02   0.03   0.92   0.03
     T    0.02   0.05   0.04   0.87
3
Base composition
In the GMM the transition matrices do not have to be symmetric.
As a consequence, base frequencies can differ between taxa.
Almost all phylogenetic methods and commonly used models cannot account for drift in base composition across the tree.
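The effect of a non-symmetric matrix on base composition can be illustrated with the example values from the GMM slide. This is a minimal sketch, assuming that rows of M are the "from" base and columns the "to" base, so that the frequencies after one edge are π multiplied by M. The function name propagate is hypothetical.

```python
BASES = "ACGT"

# Example values copied from the GMM slide.
# Note: the T row as extracted sums to 0.98, not 1.
pi = [0.3, 0.2, 0.4, 0.1]
M = [
    [0.80, 0.10, 0.07, 0.03],
    [0.05, 0.75, 0.15, 0.05],
    [0.02, 0.03, 0.92, 0.03],
    [0.02, 0.05, 0.04, 0.87],
]


def propagate(freqs, matrix):
    """Base frequencies after one edge: the row vector freqs times matrix.

    Assumes matrix[i][j] is the probability of base i changing to base j.
    """
    return [sum(freqs[i] * matrix[i][j] for i in range(4)) for j in range(4)]


child = propagate(pi, M)
for base, before, after in zip(BASES, pi, child):
    print(f"{base}: {before:.3f} -> {after:.3f}")
```

With these values the frequencies shift along the edge (for example, A falls and G rises), which is the kind of drift the slide refers to.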
4
The exception: Log-det distances
d_xy = -ln det F_xy

x: GCCTACGTCGAAGTCGTAGCTGTGCATGCTAGCGTCTC...
y: GTCTACATCGAAGTCGTATTTGTGCATGCAACAGTCTC...

F_xy =    A   C   G   T
     A    6   0   0   0
     C    1   8   0   2
     G    1   1   8   1
     T    1   0   0   9
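The divergence matrix on this slide can be rebuilt by counting aligned base pairs in the two visible sequence fragments, with rows indexed by the base in the first sequence and columns by the base in the second. The sketch below does that and then applies the slide's formula d = -ln det F. Dividing the counts by the number of sites before taking the determinant is an assumption here, and the function names are hypothetical.

```python
import math

BASES = "ACGT"


def divergence_counts(x, y):
    """Count aligned pairs: rows = base in x, columns = base in y."""
    counts = [[0] * 4 for _ in range(4)]
    for a, b in zip(x, y):
        counts[BASES.index(a)][BASES.index(b)] += 1
    return counts


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [[float(v) for v in row] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def logdet_distance(x, y):
    """d = -ln det F, with F normalised to proportions (an assumption)."""
    counts = divergence_counts(x, y)
    total = sum(sum(row) for row in counts)
    freqs = [[v / total for v in row] for row in counts]
    return -math.log(determinant(freqs))


# The 38 visible sites from the slide (the sequences are truncated there).
x = "GCCTACGTCGAAGTCGTAGCTGTGCATGCTAGCGTCTC"
y = "GTCTACATCGAAGTCGTATTTGTGCATGCAACAGTCTC"
counts = divergence_counts(x, y)
d = logdet_distance(x, y)
```

Here counts matches the F_xy matrix shown on the slide.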
5
Markov invariants
The log-det is an example (the simplest) of a Markov invariant.
Jeremy Sumner and Peter Jarvis extended the theory of Markov invariants to larger subsets of taxa:
– Tangles (3 taxa)
– Squangles (4 taxa)
– Stangles
6
Math wizards...
7
...and their magical polynomials
Each term is a coefficient followed by five indices (indices run from 1 to 256):

coefficient   indices
  3           1  18  69  171  256
 -3           1  18  69  172  255
 -3           1  18  69  175  252
  3           1  18  69  176  251
 -3           1  18  69  187  240
  3           1  18  69  188  239
  3           1  18  69  191  236
 -3           1  18  69  192  235
 -1           1  18  71  169  256
  1           1  18  71  172  253
  1           1  18  71  173  252
 -1           1  18  71  176  249
  1           1  18  71  185  240
 -1           1  18  71  188  237
 -1           1  18  71  189  236
  1           1  18  71  192  233
  1           1  18  72  169  255
 -1           1  18  72  171  253
 -1           1  18  72  173  251
  1           1  18  72  175  249
 -1           1  18  72  185  239
  1           1  18  72  187  237
  1           1  18  72  189  235
 -1           1  18  72  191  233
 -3           1  18  73  167  256
  3           1  18  73  168  255
  3           1  18  73  175  248
 -3           1  18  73  176  247
  3           1  18  73  183  240
 -3           1  18  73  184  239
 -3           1  18  73  191  232
  3           1  18  73  192  231
and another 66,712 terms

e.g. 3*p_1*p_18*p_73*p_168*p_255 = 3*p_AAAA*p_ACAC*p_CAGA*p_GGCT*p_TTTG
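The worked term on this slide implies a mapping between the indices 1 to 256 and the four-letter site patterns: the index minus one, written as four base-4 digits with A, C, G, T as 0, 1, 2, 3. A small sketch of that mapping follows. The function names are hypothetical and the rule is inferred from the slide's single example.

```python
BASES = "ACGT"


def index_to_pattern(index):
    """Map an index in 1..256 to a 4-taxon site pattern such as 'ACAC'."""
    k = index - 1
    letters = []
    for _ in range(4):
        k, digit = divmod(k, 4)
        letters.append(BASES[digit])
    # Digits come out least significant first, so reverse them.
    return "".join(reversed(letters))


def pattern_to_index(pattern):
    """Inverse mapping: 'AAAA' -> 1, ..., 'TTTT' -> 256."""
    k = 0
    for base in pattern:
        k = 4 * k + BASES.index(base)
    return k + 1


term = [1, 18, 73, 168, 255]
print([index_to_pattern(i) for i in term])
```

Applied to the indices 1, 18, 73, 168, 255, this gives the patterns AAAA, ACAC, CAGA, GGCT, TTTG from the slide.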
8
Squangle table
Expected squangle values for each quartet tree:

tree      q_1   q_2   q_3
12|34      0    -u     u
13|24      v     0    -v
14|32     -w     w     0

q_1 + q_2 + q_3 = 0
9
Choosing a quartet

q_1   q_2   q_3
 0    -u     u
 v     0    -v
-w     w     0

[Figure labels: u, -u]
10
Choosing a quartet

q_1   q_2   q_3
 0    -u     u
 v     0    -v
-w     w     0

[Figure label: u = 0]
11
Residual sum of squares
Pick the quartet tree that minimises the residual sum of squares (RSS).
u = max{0, (q_3 - q_2)/2} (v, w similar)
The RSS is always of the form q_1^2 + [(q_3 - q_2)/2 - u]^2
If the squangles are in the right order (q_3 > q_2), the second term vanishes; if they are not, u is set to 0 and the second term stays in the RSS as a penalty.
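The fit for all three trees can be sketched as below. The formula for u and the RSS of the first tree are from this slide. The analogous expressions v = max{0, (q_1 - q_3)/2} and w = max{0, (q_2 - q_1)/2} are an assumption, read off the squangle table by cycling the roles of q_1, q_2, q_3. The function name quartet_fit is hypothetical.

```python
def quartet_fit(q1, q2, q3):
    """Return [(edge_parameter, rss), ...] for the three quartet trees.

    Each tuple below is (zero, neg, pos): the squangle expected to be 0,
    the one expected to be -x, and the one expected to be +x for that tree.
    Tree 1 expects (0, -u, u), tree 2 (v, 0, -v), tree 3 (-w, w, 0).
    """
    fits = []
    for zero, neg, pos in ((q1, q2, q3), (q2, q3, q1), (q3, q1, q2)):
        estimate = (pos - neg) / 2
        x = max(0.0, estimate)          # the parameter is not allowed below 0
        rss = zero ** 2 + (estimate - x) ** 2
        fits.append((x, rss))
    return fits


# Rounded squangle values from the example slide.
fits = quartet_fit(9.14e-07, -7.58e-06, 6.67e-06)
best_tree = min(range(3), key=lambda i: fits[i][1]) + 1
for i, (x, rss) in enumerate(fits, start=1):
    print(f"tree {i}: parameter = {x:.3g}, RSS = {rss:.3g}")
```

With the rounded q values from the example slide, this sketch selects the first tree and gives u, v, w and RSS values close to the ones reported there.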
12
Weights (I)
Weight each quartet: w_i = 1/RSS_i
A posterior probability (ish) weighting scheme for the quartets is then p_i = w_i/(w_1 + w_2 + w_3)
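The weighting scheme is a normalised inverse of the RSS values. A minimal sketch follows. The function name is hypothetical, the input values are made up for illustration, and an RSS of exactly zero is not handled.

```python
def quartet_weights(rss_values):
    """p_i = w_i / (w_1 + w_2 + w_3) with w_i = 1 / RSS_i.

    Assumes every RSS value is strictly positive.
    """
    weights = [1.0 / rss for rss in rss_values]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative RSS values (not taken from the slides).
p = quartet_weights([1e-12, 5e-11, 5e-11])
print([round(v, 3) for v in p])
```

The tree with the smallest RSS receives the largest weight, and the three weights sum to 1.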
13
Example
((Rhea,Hippo),Platypus,Wallaroo);
MtDNA genomes, 13856 sites

i    q_i          RSS_i       p_i
1     9.14e-07    8.36e-13    0.978
2    -7.58e-06    6.58e-11    0.011
3     6.67e-06    6.25e-11    0.011

u = 7.13e-06, v = 0, w = 0
14
Weights (II)
The RSS weights give a measure of the relative support for each topology.
It would also be useful to have a quartet weight that is related to the length of the middle edge of the quartet.

q_1   q_2   q_3
 0    -u     u
 v     0    -v
-w     w     0

The most likely suspect is u = (q_3 - q_2)/2
15
[Figure: two panels, each shown with the squangle table (columns q_1, q_2, q_3; rows 0 -u u, v 0 -v, -w w 0)]
Felsenstein tree, pendant short edges = 0.01, pendant long edges = 0.1
16
Basic simulation setup
– Felsenstein zone
– Farris zone
– Jukes-Cantor model: equal base frequencies, all changes equally likely
– 100 data sets for each parameter choice
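A setup of this kind can be sketched in a few lines. The code below is not the authors' simulation code. It assumes edge lengths measured in expected substitutions per site, a quartet tree 12|34, a hypothetical internal edge length, and the usual reading of the two zones: long pendant edges on non-sister taxa (Felsenstein) or on sister taxa (Farris).

```python
import math
import random

BASES = "ACGT"


def jc_matrix(d):
    """Jukes-Cantor transition probabilities for an edge of length d."""
    e = math.exp(-4.0 * d / 3.0)
    same = 0.25 + 0.75 * e
    diff = 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def evolve(state, matrix, rng):
    """Draw the state at the far end of an edge given the near-end state."""
    return rng.choices(range(4), weights=matrix[state])[0]


def simulate_quartet(n_sites, pendant, internal, rng):
    """Simulate an alignment on the tree 12|34.

    pendant: four pendant edge lengths for taxa 1..4
    internal: length of the middle edge (hypothetical parameter)
    """
    mats = [jc_matrix(d) for d in pendant]
    mid = jc_matrix(internal)
    seqs = [[], [], [], []]
    for _ in range(n_sites):
        a = rng.randrange(4)       # state at the node joining taxa 1 and 2
        b = evolve(a, mid, rng)    # state at the node joining taxa 3 and 4
        for k, parent in enumerate((a, a, b, b)):
            seqs[k].append(BASES[evolve(parent, mats[k], rng)])
    return ["".join(s) for s in seqs]


rng = random.Random(1)
# Felsenstein-zone style: long edges on taxa 1 and 3 (non-sisters).
felsenstein = simulate_quartet(200, [0.1, 0.01, 0.1, 0.01], 0.01, rng)
# Farris-zone style: long edges on taxa 1 and 2 (sisters).
farris = simulate_quartet(200, [0.1, 0.1, 0.01, 0.01], 0.01, rng)
```

Because the Jukes-Cantor model is reversible with uniform base frequencies, starting the simulation at an internal node rather than a separate root does not change the distribution of site patterns.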
17
Simulations (I) Testing power compared to cNJ
18
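Simulations (II)
Adding base composition drift
Added a GC bias along the long edges.
Long-edge matrix over A, C, G, T (* marks the diagonal), listing only the entries that survive in the extracted text:
A:  *, p_l*b, p_l
C:  p_l, *, p_l, p_l
G:  p_l, p_l, *, p_l
T:  p_l, *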
19
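GC bias on long edges

           bias = 1    bias = 2    bias = 3    bias = 4    bias = 5
#Sites     SQ   NJ     SQ   NJ     SQ   NJ     SQ   NJ     SQ   NJ
200        71   66     59   49     50   15     39    0     24    0
400        86 68 47 62 8 42 0 35 0          (9 of 10 values survive)
800        93   91     75   53     63    3     60    0     36    0
1600       100 92 56 79 0 59 0 38 0         (9 of 10 values survive)
10000      100 67 95 0 78 0 50 0            (8 of 10 values survive)

Felsenstein: short edge = 0.005, long edge = 0.075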
20
Simulations (II)
Adding a proportion of invariant sites

           pInv = 0    0.1    0.2    0.3    0.4    0.5
#Sites
200        73          64     60     53     48     35
400        84          78     62     57     42     41
800        93          76     70     55     47     34
1600       97          89     72     66     35     22
10000      100 97 63 26 4                   (5 of 6 values survive)

Felsenstein: short edge = 0.005, long edge = 0.075
21
Putting it all together
Most people want to build trees on more than 4 taxa.
Fortunately there are already several methods for going from quartets to larger trees:
– Q*
– Quartet puzzling
– Any supertree method
Or from quartets to splits graphs:
– QNet
22
QNet – distance based weights
[Figure: mt genomes]
23
QNet – distance based weights
[Figure panels: 1st codon position, 2nd, 3rd]
24
Detecting invariant sites
The residual sum of squares (RSS) scores give an opportunity to detect invariant sites.
Remove constant sites in order to:
– Idea 1: Minimise sum of RSS
– Idea 2: Minimise minimum RSS
25
15,000 sites, of which 5,000 are invariable.
The proportion of constant sites among the 10,000 variable sites was 0.58.

constant sites   PP (three trees)      sum RSS     min RSS
0.72             0.25   0.44   0.30    2.22E-09    7.14E-10
0.70             0.28   0.40   0.32    1.74E-09    5.40E-10
0.68             0.31   0.35   0.33    1.45E-09    3.81E-10
0.66             0.37   0.30   0.33    1.20E-09    2.46E-10
0.64             0.45   0.25   0.30    9.90E-10    1.37E-10
0.62             0.59   0.18   0.23    8.31E-10    5.85E-11
0.60             0.80   0.09   0.11    7.40E-10    1.33E-11
0.57             0.98   0.01   ...     7.07E-10    2.06E-14
0.55             0.97   0.02   ...     7.15E-10    1.29E-11
0.52             0.82   0.09   0.08    7.38E-10    4.26E-11
0.50             0.69   0.17   0.14    7.47E-10    7.76E-11
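Both ideas from the previous slide amount to picking the row of this table with the smallest score. The sketch below applies that selection to the values listed on this slide. The variable names are hypothetical, and only the constant-sites, sum RSS and min RSS columns are used.

```python
# (proportion of constant sites, sum RSS, min RSS), copied from the slide.
rows = [
    (0.72, 2.22e-09, 7.14e-10),
    (0.70, 1.74e-09, 5.40e-10),
    (0.68, 1.45e-09, 3.81e-10),
    (0.66, 1.20e-09, 2.46e-10),
    (0.64, 9.90e-10, 1.37e-10),
    (0.62, 8.31e-10, 5.85e-11),
    (0.60, 7.40e-10, 1.33e-11),
    (0.57, 7.07e-10, 2.06e-14),
    (0.55, 7.15e-10, 1.29e-11),
    (0.52, 7.38e-10, 4.26e-11),
    (0.50, 7.47e-10, 7.76e-11),
]

# Idea 1: minimise the sum of RSS. Idea 2: minimise the minimum RSS.
best_by_sum = min(rows, key=lambda r: r[1])[0]
best_by_min = min(rows, key=lambda r: r[2])[0]
print(best_by_sum, best_by_min)
```

On these values both criteria pick 0.57, next to the stated proportion of 0.58. The minimum-RSS column has a much sharper minimum than the sum.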
26
Vagaries of real data
Dealing sensibly with missing or ambiguous data:
– Currently remove all sites with question marks, gaps or ambiguities over the whole alignment
– Seems better to do this on a per-quartet basis
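Per-quartet filtering could look like the sketch below, which keeps a site for a given quartet whenever all four of its taxa have an unambiguous base there. It is an illustration and not the authors' code. The function name and the toy alignment are made up.

```python
VALID = set("ACGT")


def clean_quartet_sites(alignment, quartet):
    """Return the four sequences restricted to sites usable for this quartet.

    alignment: dict mapping taxon name to an aligned sequence string
    quartet: four taxon names
    A site is dropped only if one of these four taxa has a gap, '?' or
    ambiguity code there; other taxa in the alignment are ignored.
    """
    rows = [alignment[name].upper() for name in quartet]
    kept = [col for col in zip(*rows) if all(base in VALID for base in col)]
    if not kept:
        return ["" for _ in quartet]
    return ["".join(seq) for seq in zip(*kept)]


# Toy alignment (hypothetical data).
alignment = {
    "a": "ACGT-",
    "b": "AC?TA",
    "c": "ACGTA",
    "d": "ACGTA",
    "e": "N-GTA",
}
abcd = clean_quartet_sites(alignment, ("a", "b", "c", "d"))
bcde = clean_quartet_sites(alignment, ("b", "c", "d", "e"))
```

In the toy alignment, filtering over all five taxa would leave one site, while the quartet (a, b, c, d) keeps three sites and (b, c, d, e) keeps two.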
27
Code
– R code
– Python code, creates output that can be understood by QNet
28
Simulation plans
– Compare to likelihood
– Compare to NJ with log-det distances
– Look at rates across sites instead of just proportions of invariant sites