Lecture 11 – Increasing Model Complexity

Differences in functional and structural constraints across sites lead to different sites evolving at different rates. In protein-coding genes, 3rd-codon positions evolve fastest, followed by 1st positions, and 2nd-position sites evolve the slowest. Among-site rate variation exacerbates the loss of historical information caused by multiple hits. If there are 10 substitutions and they are distributed randomly across 50 sites, any given site should only rarely receive more than a single substitution. However, if those 10 substitutions are distributed across the 50 sites in a non-random fashion, say concentrated in 1/3 of them, many more will occur at multiply hit sites.
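The effect of concentrating substitutions can be checked with a short calculation. The sketch below assumes each substitution lands independently and uniformly on one of the sites that are free to vary (a simplification), and the function name is illustrative rather than taken from the lecture.

```python
def expected_multiply_hit_sites(n_subs, n_sites):
    """Expected number of sites receiving two or more substitutions when
    n_subs substitutions fall independently and uniformly on n_sites sites."""
    p = 1.0 / n_sites
    p_zero = (1.0 - p) ** n_subs                     # site is never hit
    p_one = n_subs * p * (1.0 - p) ** (n_subs - 1)   # site is hit exactly once
    return n_sites * (1.0 - p_zero - p_one)


spread = expected_multiply_hit_sites(10, 50)        # all 50 sites free to vary
concentrated = expected_multiply_hit_sites(10, 17)  # roughly 1/3 of the 50 sites
print(f"{spread:.2f} vs {concentrated:.2f} multiply hit sites expected")
```

Under these assumptions, concentrating the same 10 substitutions in about a third of the sites more than doubles the expected number of multiply hit sites.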

Among-site rate variation
Discrete methods: assign sites to a series of rate categories (or partitions). Add a relative rate parameter, r, to our models. There are c rate categories, and w_r is the probability that site i belongs to a particular rate category; these weights are binary (0 or 1) if we're assigning sites to rate classes.
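Written out, the likelihood of a site under such a model is a weighted sum over the rate categories. The expression below is a sketch in assumed notation: L_i is the likelihood of site i, D_i the data at that site, τ the tree with its branch lengths, and Q the substitution matrix. The index k runs over the c categories, so w_k and r_k are the weight and relative rate of category k.

```latex
L_i \;=\; \sum_{k=1}^{c} w_k \, \Pr\!\left(D_i \mid \tau,\; r_k Q\right),
\qquad \sum_{k=1}^{c} w_k = 1 .
```

When sites are assigned to classes, exactly one w_k equals 1 for each site and the sum collapses to a single term.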

Site-Specific Rates (or SSR) model
In SSR models, the theoretical limit to the number of rate categories is the number of sites in the alignment, but usually the categories are determined a priori, and often they follow codon structure. In that case, w1, w2, and w3 are fixed to 0 or 1 for each site, and we just use a different relative rate for each class (e.g., codon position). The relative-rate parameters can be assigned or, as is usually done, optimized numerically. Advantage: one can also use a different transformation matrix (Q) for each class. Disadvantage: all sites within a category are assumed to be evolving at a uniform rate.
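A minimal sketch of a codon-position SSR scheme follows. It assumes the alignment starts in frame, uses made-up relative rates, and uses JC69 for every class purely to keep the example short (an SSR model may use a different Q per class). All function names are illustrative.

```python
import math


def codon_position_classes(n_sites):
    """Assign each alignment column (0-based) to codon position 1, 2, or 3,
    assuming the alignment starts in frame."""
    return [(i % 3) + 1 for i in range(n_sites)]


def normalize_rates(raw_rates, classes):
    """Rescale per-class rates so that the mean rate over all sites is 1."""
    mean = sum(raw_rates[c] for c in classes) / len(classes)
    return {c: r / mean for c, r in raw_rates.items()}


def jc69_p_same(branch_length, rate):
    """JC69 probability that a site shows the same state at both ends of a
    branch, with the branch length scaled by the site's relative rate."""
    return 0.25 + 0.75 * math.exp(-4.0 * rate * branch_length / 3.0)


classes = codon_position_classes(9)
rates = normalize_rates({1: 1.0, 2: 0.5, 3: 4.5}, classes)  # made-up raw rates
for pos in (1, 2, 3):
    print(pos, round(rates[pos], 3), round(jc69_p_same(0.1, rates[pos]), 4))
```

Every site in a class shares one rate here, which is exactly the disadvantage noted above.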

Invariable Sites Model
This is based on the observation that, in alignments of conserved genes, there are sites at which all of life seems to share the same state. The model allows two rate categories, and in one of these the relative-rate parameter is zero. We can think about this model in two ways.
Mixture model:
w_invar = p_invar. This is the probability that a site is in the class where r = 0.
w_var = p_var. This is the probability that the site is in the class where r ≠ 0.
w_var = 1 - w_invar.
Sites that are observed to vary have w_invar = 0.
Invariable sites:
p_invar is the proportion of sites across an alignment that are not free to vary.
p_invar ≤ the proportion of sites that are observed to be constant.
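The mixture view can be sketched as a two-term site likelihood. This is an illustration under stated assumptions rather than the lecture's own formula: the likelihood of the site in the variable class is taken as given, and a constant site in the r = 0 class is assumed to contribute the equilibrium frequency of its shared state. The function and argument names are hypothetical.

```python
def site_likelihood_invariable(l_var, p_invar, is_constant, state_freq=0.0):
    """Likelihood of one site under a two-class invariable-sites mixture.

    l_var        likelihood of the site in the variable class (r != 0)
    p_invar      probability that a site is in the r = 0 class
    is_constant  True if every sequence shows the same state at this site
    state_freq   equilibrium frequency of that shared state (constant sites only)
    """
    # A site that is observed to vary cannot belong to the r = 0 class.
    l_invar = state_freq if is_constant else 0.0
    return p_invar * l_invar + (1.0 - p_invar) * l_var


variable_site = site_likelihood_invariable(0.01, 0.2, is_constant=False)
constant_site = site_likelihood_invariable(0.01, 0.2, is_constant=True, state_freq=0.25)
print(variable_site, constant_site)
```

Only constant sites receive any contribution from the invariable class, which is why p_invar cannot exceed the proportion of sites observed to be constant.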

Continuous Methods
There's no biological reason to expect rates to fall into discrete categories, and we can use rate-mixture models to deal with this. Gamma (Γ) distributed rates: the Γ distribution has a shape parameter (α) and a scale parameter (β), with mean = αβ. We set the mean of the Γ-distribution equal to 1 by constraining β = 1/α.
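For reference, the density behind this constraint can be written out. This is the standard gamma density in the shape and scale parameterization, with r denoting the relative rate of a site; the variance line is an added consequence that the text above does not state.

```latex
f(r \mid \alpha, \beta) = \frac{r^{\alpha-1} e^{-r/\beta}}{\Gamma(\alpha)\,\beta^{\alpha}},
\qquad \mathrm{E}[r] = \alpha\beta, \qquad \mathrm{Var}[r] = \alpha\beta^{2}.

\text{With } \beta = 1/\alpha: \qquad
f(r \mid \alpha) = \frac{\alpha^{\alpha}\, r^{\alpha-1} e^{-\alpha r}}{\Gamma(\alpha)},
\qquad \mathrm{E}[r] = 1, \qquad \mathrm{Var}[r] = \frac{1}{\alpha}.
```

Small values of the shape parameter therefore correspond to strong among-site rate variation, and large values to nearly equal rates.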

Discretizing the Γ-distribution
Gamma distribution with shape parameter α = 0.49. Cut-points and category rates for the discrete Γ approximation with ncat (or c) = 4.
[Table: category, lower cut-point, upper cut-point, mean rate; the upper cut-point of the last category is infinity.]
So r1 = …, r2 = …, r3 = …, and r4 = 2.907; note that these sum to 4 (= ncat). Each site has some non-zero probability of belonging to each rate class: with equal-probability categories, w_r = 1/c for each of the c classes.
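The category rates can be reproduced with a short standard-library sketch. It assumes equal-probability categories whose rate is the mean of the Γ density within the category, and it evaluates the regularized incomplete gamma function with a plain power series, which is adequate for the small arguments used here. The function names are illustrative, not those of any phylogenetics package.

```python
import math


def reg_lower_gamma(s, x, tol=1e-14, max_terms=10000):
    """Regularized lower incomplete gamma function P(s, x), by power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / s
    total = term
    for k in range(1, max_terms):
        term *= x / (s + k)
        total += term
        if term < tol * total:
            break
    return total * math.exp(s * math.log(x) - x - math.lgamma(s))


def gamma_quantile(p, alpha):
    """Quantile of a mean-1 gamma (shape alpha, scale 1/alpha), by bisection."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(alpha, alpha * hi) < p:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(alpha, alpha * mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def discrete_gamma_rates(alpha, ncat):
    """Mean rate of each of ncat equal-probability categories."""
    cuts = [0.0] + [gamma_quantile(k / ncat, alpha) for k in range(1, ncat)]
    cuts.append(math.inf)
    rates = []
    for lower, upper in zip(cuts[:-1], cuts[1:]):
        # The integral of r * f(r) over a category is a difference of
        # gamma CDFs with shape alpha + 1 (the overall mean is 1).
        hi_cdf = 1.0 if math.isinf(upper) else reg_lower_gamma(alpha + 1.0, alpha * upper)
        lo_cdf = reg_lower_gamma(alpha + 1.0, alpha * lower)
        rates.append(ncat * (hi_cdf - lo_cdf))
    return rates


rates = discrete_gamma_rates(0.49, 4)
print([round(r, 3) for r in rates], round(sum(rates), 6))
```

The last printed rate can be compared with the r4 quoted above, and the sum is ncat by construction.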

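Discretizing the Γ-distribution
Cut-points and category rates for the discrete Γ with ncat = 8.
[Table: category, lower cut-point, upper cut-point, mean rate; the upper cut-point of the last category is infinity.]
The more coarsely we discretize the Γ (i.e., the fewer categories we use), the shorter the runs, but:
[Table: estimates of α for different values of ncat.]
Poor estimates of α with ncat < 12.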

Properties of Γ models
This is done across the entire data set, so essentially we take the same transformation matrix (Q) for each site and scale it by the mean rate of each category. This has the large advantage of accommodating a high diversity of rates with just a single parameter, α. Some sites can be evolving so slowly that they have a high probability of stasis, yet others (perhaps adjacent) may be free to evolve rapidly. It has the disadvantage that we apply the same transformation matrix uniformly across a data set: one set of base frequencies and one R matrix. We may get a much better fit if we allow very different Q matrices. Furthermore, the single-site likelihoods (SSLs) are calculated ncat times for each site, so the better we approximate a continuous Γ, the longer our run times. It's important to note that this is actually a rather constrained mixture model.
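The run-time cost follows directly from how the mixture is evaluated: each site likelihood is computed once per category and the results are averaged. The sketch below illustrates this for the simplest possible case, a pair of sequences under JC69. The category rates are made-up values that average 1, and the function names are hypothetical.

```python
import math


def jc69_site_likelihood(same_state, t, rate):
    """Likelihood of one site for a pair of sequences under JC69, with the
    distance t scaled by the relative rate of the category."""
    decay = math.exp(-4.0 * rate * t / 3.0)
    p = 0.25 + 0.75 * decay if same_state else 0.25 - 0.25 * decay
    return 0.25 * p  # 1/4 is the equilibrium frequency of the first state


def mixture_site_likelihood(same_state, t, rates, weights):
    """Weighted average of the site likelihood over rate categories:
    one likelihood evaluation per category."""
    return sum(w * jc69_site_likelihood(same_state, t, r)
               for r, w in zip(rates, weights))


rates = [0.1, 0.4, 1.0, 2.5]   # made-up category rates with mean 1
weights = [0.25] * 4           # equal-probability categories
constant_mix = mixture_site_likelihood(True, 0.5, rates, weights)
constant_one = jc69_site_likelihood(True, 0.5, 1.0)
print(constant_mix, constant_one)
```

With rate heterogeneity, a constant site is more probable than under a single rate of 1, because the slow categories carry much of its likelihood.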

I+Γ Models
This is intuitively very appealing when one considers that, at least for some genes, there's a set of sites that are constant across essentially the entire tree of life. Special cases:
p_invar = 0: I+Γ = Γ
α = ∞: I+Γ = I
p_invar = 0 and α = ∞: I+Γ = ER (equal rates)
There are some issues with it that are sometimes not appreciated.

I+Γ Models
First, both the mixed model and the Γ alone expect there to be many constant sites. It can be very difficult to discern the sites that are truly invariable from those potentially variable sites that are evolving slowly enough to have a high probability of stasis.

The GTR+I+Γ family of models
There are 1624 possible special cases of GTR+I+Γ.
There are 10 parameters in the full model:
3 free base frequencies (π)
5 relative rates (R)
2 rate variation parameters (p_invar and α)
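The source does not spell out how 1624 is obtained. One count that reproduces it, offered here as an assumption, is: every way of grouping the six substitution types into classes that share a rate, times two base-frequency options (equal or free), times four rate-variation options (none, I, Γ, I+Γ).

```python
def count_set_partitions(n):
    """Number of ways to group n items into unlabeled classes (a Bell number),
    computed with the Bell triangle."""
    row = [1]
    for _ in range(n - 1):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]


rate_schemes = count_set_partitions(6)   # groupings of AC, AG, AT, CG, CT, GT
frequency_options = 2                    # equal or free base frequencies
rate_variation_options = 4               # none, I, gamma, I+gamma
n_models = rate_schemes * frequency_options * rate_variation_options
print(rate_schemes, n_models)
```

Under this reading, a single shared rate class with equal base frequencies and no rate variation is the simplest member of the family, and the full GTR+I+Γ is the most complex.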

Another look at a GTR+SSR3
Each of the three partitions (subscripts 1, 2, 3) has its own GTR parameters:
πA1 πA2 πA3
πC1 πC2 πC3
πG1 πG2 πG3
πT1 πT2 πT3
r(AC)1 r(AC)2 r(AC)3
r(AG)1 r(AG)2 r(AG)3
r(AT)1 r(AT)2 r(AT)3
r(CG)1 r(CG)2 r(CG)3
r(CT)1 r(CT)2 r(CT)3
r(GT)1 r(GT)2 r(GT)3
This gives 9 free base frequencies and 15 free relative rate parameters.
Constraining the partitions to share parameters yields a single GTR:
πA1 = πA2 = πA3
πC1 = πC2 = πC3
πG1 = πG2 = πG3
πT1 = πT2 = πT3
r(AC)1 = r(AC)2 = r(AC)3
r(AG)1 = r(AG)2 = r(AG)3
r(AT)1 = r(AT)2 = r(AT)3
r(CG)1 = r(CG)2 = r(CG)3
r(CT)1 = r(CT)2 = r(CT)3
r(GT)1 = r(GT)2 = r(GT)3

GTR+CAT in RAxML
Estimate relative rates for each site on a starting tree. Lump sites with similar relative rates into categories. So now, w_r = 0 or 1, and r is set to the value of the highest-lnL site in the category.
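The lumping step can be illustrated with a toy function. This is not RAxML's actual procedure: it assumes per-site rates and per-site lnL values are already available, and it forms categories by sorting sites by rate and cutting the ranking into equal-sized groups, which is a choice made only for this sketch. It also assumes there are at least as many sites as categories.

```python
def lump_sites(site_rates, site_lnls, n_categories):
    """Group sites with similar rates and give each group one shared rate.

    Returns (assignment, category_rates): assignment[i] is the category index
    of site i, and category_rates[k] is the rate of the highest-lnL site in
    category k.
    """
    order = sorted(range(len(site_rates)), key=lambda i: site_rates[i])
    assignment = [0] * len(site_rates)
    category_rates = []
    for k in range(n_categories):
        start = k * len(order) // n_categories
        stop = (k + 1) * len(order) // n_categories
        members = order[start:stop]
        best = max(members, key=lambda i: site_lnls[i])
        category_rates.append(site_rates[best])
        for i in members:
            assignment[i] = k   # each site belongs to exactly one category
    return assignment, category_rates


toy_rates = [0.1, 2.0, 0.2, 1.0, 2.4, 0.9]         # made-up per-site rates
toy_lnls = [-1.0, -5.0, -0.5, -3.0, -4.0, -3.5]    # made-up per-site lnL values
assignment, category_rates = lump_sites(toy_rates, toy_lnls, 3)
print(assignment, category_rates)
```

Because every site is placed in exactly one category, the weights are 0 or 1 and each site likelihood is evaluated only once.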

rRNA Models
Non-independence of sites: paired stem sites do not evolve independently. Sites are partitioned a priori into stem and loop regions, and sites in the loop partition are treated with some variant of the GTR+I+Γ family. Sites in the stem regions are treated using a doublet model. Doublets, rather than nucleotides, are treated as characters, and there are 16 states. So there are 120 reversible substitution types. This is very parameter rich (how many parameters?) and we're forced to use empirical models.
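The counts can be checked with a few lines. The sketch assumes the usual bookkeeping for a general time-reversible model: one relative rate is fixed as the reference, and the state frequencies sum to 1. The function name is illustrative.

```python
def reversible_model_size(n_states):
    """Count substitution types and free parameters of a general
    time-reversible model with n_states character states."""
    substitution_types = n_states * (n_states - 1) // 2
    free_rates = substitution_types - 1   # one rate is fixed as the reference
    free_frequencies = n_states - 1       # frequencies sum to 1
    return substitution_types, free_rates + free_frequencies


for label, n in (("nucleotides", 4), ("doublets", 16)):
    types, free = reversible_model_size(n)
    print(label, types, free)
```

For four nucleotide states this gives the 6 substitution types and 8 free parameters of GTR (5 relative rates plus 3 base frequencies); for 16 doublet states it gives the 120 substitution types quoted above.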

Codon Models
In-frame triplets are used as characters and there are 61 possible character states. Thus the transformation matrix has 3660 rate parameters (or 1830 in the reversible case). Again, empirical matrices can be used. Alternatively, cells of the transformation matrix can be restricted so that there are only, say, two substitution types.
E.g., TTT → TTC: both code for Phe, so this is a silent substitution. Its rate is απC, where α is the rate of silent substitutions and πC is (as before) the frequency of nucleotide C.
TTT → TTA results in an amino acid replacement. Its rate is βπA, where β is the rate of amino acid replacement substitutions.
In this example there are 4 free parameters (3 base frequencies and a rate ratio).
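The two-substitution-type restriction can be sketched as a function that fills one cell of the codon matrix. The sketch assumes the standard genetic code and, as a further assumption that the text does not state, sets the rate of any change involving more than one nucleotide to zero. The names codon_rate, alpha, beta, and pi are illustrative.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
GENETIC_CODE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
SENSE_CODONS = [codon for codon, aa in GENETIC_CODE.items() if aa != "*"]


def codon_rate(source, target, alpha, beta, pi):
    """Off-diagonal rate from one sense codon to another.

    alpha  rate of silent substitutions
    beta   rate of amino acid replacement substitutions
    pi     dict of nucleotide frequencies, e.g. {"A": 0.25, ...}
    """
    if GENETIC_CODE[source] == "*" or GENETIC_CODE[target] == "*":
        raise ValueError("stop codons are not character states")
    changes = [(x, y) for x, y in zip(source, target) if x != y]
    if len(changes) != 1:
        return 0.0  # assumption: only single-nucleotide changes are allowed
    new_base = changes[0][1]
    silent = GENETIC_CODE[source] == GENETIC_CODE[target]
    return (alpha if silent else beta) * pi[new_base]


pi = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}   # made-up frequencies
print(len(SENSE_CODONS))
print(codon_rate("TTT", "TTC", alpha=1.0, beta=0.2, pi=pi))  # silent
print(codon_rate("TTT", "TTA", alpha=1.0, beta=0.2, pi=pi))  # replacement
```

Only the ratio of the two rates and three of the four frequencies are free, which matches the 4 free parameters counted above.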
