Lecture 11 – Increasing Model Complexity


Lecture 11 – Increasing Model Complexity
Differences in functional and structural constraints across sites lead to different sites evolving at different rates. 3rd-codon positions evolve fastest, followed by 1st positions, & 2nd-position sites evolve the slowest. Among-site rate variation exacerbates the loss of historical information caused by multiple hits. If 10 substitutions are distributed randomly across 50 sites, there should only rarely be more than a single substitution per site. However, if those 10 substitutions are distributed across the 50 sites in a non-random fashion, say concentrated in 1/3 of them, many more will occur at multiply hit sites.
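A quick Monte Carlo sketch (in Python; the trial counts are my own, only the 10-substitutions-over-50-sites setup comes from the example above) illustrates the point: the same 10 substitutions produce more multiply-hit sites when they are restricted to roughly a third of the 50 sites.

```python
import random

def multiply_hit_sites(n_subs, sites, rng):
    """Drop n_subs substitutions onto the candidate sites and
    count how many sites are hit more than once."""
    hits = {}
    for _ in range(n_subs):
        s = rng.choice(sites)
        hits[s] = hits.get(s, 0) + 1
    return sum(1 for count in hits.values() if count > 1)

rng = random.Random(42)
trials = 10_000
# Substitutions spread uniformly over all 50 sites:
uniform = sum(multiply_hit_sites(10, range(50), rng) for _ in range(trials)) / trials
# Substitutions concentrated in ~1/3 of the sites (17 of 50):
skewed = sum(multiply_hit_sites(10, range(17), rng) for _ in range(trials)) / trials
```

On average, under one site is multiply hit in the uniform case, versus roughly two when the substitutions are concentrated.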

Among-site rate variation
Discrete methods – assign sites to a series of rate categories (or partitions). Add a relative-rate parameter, r, to our models. There are c rate categories, and w_r is the probability that a site belongs to rate category r; these weights are binary (0 or 1) if we're assigning sites to rate classes outright.

Site-Specific Rates (SSR) models
In the SSR models, the theoretical limit to the number of rate categories is the number of sites in the alignment, but usually the categories are determined a priori and often follow codon structure. In that case, w_1, w_2, and w_3 are fixed to 0 or 1, and we simply use a different relative rate for each class (e.g., codon position). The relative-rate parameters can then be assigned, or they can be optimized numerically, which is what is usually done. Advantage: one can also use a different transformation matrix (Q) for each class. Disadvantage: all sites within a category are assumed to be evolving at a uniform rate.
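As a small sketch of the bookkeeping (the rates below are hypothetical, not from the lecture), SSR relative rates are typically rescaled so that the weighted mean rate is 1, which keeps branch lengths interpretable as expected substitutions per site:

```python
def normalize_relative_rates(rates, weights):
    """Rescale category rates so the weighted mean rate is 1.
    weights[k] is the proportion of sites in category k."""
    mean = sum(r * w for r, w in zip(rates, weights))
    return [r / mean for r in rates]

# Hypothetical codon-position rates (3rd fastest, 2nd slowest),
# with sites split evenly across the three positions:
rates = normalize_relative_rates([1.0, 0.2, 3.0], [1/3, 1/3, 1/3])
```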

Invariable Sites Model
This is based on the observation that alignments of conserved genes contain sites at which all of life seems to share the same state. The model has two rate categories, and in one of them the relative-rate parameter is zero. We can think about this model in two ways.
Mixture model: w_invar = p_invar is the probability that a site is in the class where r = 0, and w_var = 1 - w_invar is the probability that the site is in the class where r ≠ 0. Sites that are observed to vary have w_invar = 0.
Invariable sites: p_invar is the proportion of sites across an alignment that are not free to vary. It must be ≤ the proportion of sites that are observed to be constant.
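The mixture view can be sketched as follows; `lik_given_rate` stands in for the usual pruning-algorithm conditional likelihood, which is treated as a black box here (the toy function is illustrative only):

```python
def site_likelihood_I(p_invar, lik_given_rate, is_constant):
    """Per-site likelihood under the invariable-sites mixture:
    w_invar * L(r = 0) + w_var * L(r != 0), with w_var = 1 - w_invar.
    A site with observed variation has zero likelihood at r = 0."""
    l_zero = lik_given_rate(0.0) if is_constant else 0.0
    return p_invar * l_zero + (1.0 - p_invar) * lik_given_rate(1.0)

# Toy stand-in for a site's conditional likelihood at relative rate r:
toy_lik = lambda r: 0.3 if r == 0.0 else 0.1
constant_site = site_likelihood_I(0.5, toy_lik, is_constant=True)
variable_site = site_likelihood_I(0.5, toy_lik, is_constant=False)
```

Note how the invariable class contributes nothing at a variable site, which is why p_invar is bounded by the observed proportion of constant sites.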

Continuous Methods
There's no biological reason to expect rates to fall into discrete categories, and we can use rate-mixture models to deal with this. Gamma-distributed rates: the gamma distribution has a shape parameter (α) and a scale parameter (β), with mean = αβ. We set the mean of the Γ distribution equal to 1 by constraining β = 1/α.
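A quick simulation check of the mean-1 constraint (stdlib Python; `random.gammavariate` takes shape and scale, so setting the scale β = 1/α gives mean αβ = 1):

```python
import random

alpha = 0.49                     # shape parameter used in the examples below
rng = random.Random(1)
# Gamma(shape=alpha, scale=1/alpha) has mean alpha * (1/alpha) = 1:
draws = [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(100_000)]
mean = sum(draws) / len(draws)
```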

Discretizing the Γ distribution
Gamma distribution with shape parameter α = 0.49. Cut-points and category rates for the discrete-Γ approximation with ncat (or c) = 4:

cat   lower         upper         rate (mean)
 1    0.00000000    0.09804816    0.03191473
 2    0.09804816    0.44841399    0.24666120
 3    0.44841399    1.31969682    0.81435904
 4    1.31969682    infinity      2.90706503

So r1 = 0.0319, r2 = 0.2467, r3 = 0.8144, & r4 = 2.907; note that these sum to 4 (= ncat), so the mean rate is 1. Each site has some non-zero probability of belonging to each rate class; under the discrete Γ, each of the c classes has equal weight (w_r = 1/c).

Discretizing the Γ distribution
Cut-points and category rates for the discrete Γ with ncat = 8:

cat   lower         upper         rate (mean)
 1    0.00000000    0.02338747    0.00768838
 2    0.02338747    0.09804816    0.05614108
 3    0.09804816    0.23352213    0.16013076
 4    0.23352213    0.44841399    0.33319164
 5    0.44841399    0.78071211    0.60229167
 6    0.78071211    1.31969682    1.02642641
 7    1.31969682    2.35886822    1.77009489
 8    2.35886822    infinity      4.04403517

The fewer categories we use, the shorter the run times, but estimates of α are poor with ncat < 12. [Plot of the estimate of α against ncat omitted.]

Properties of Γ models
The rate scaling is applied across the entire data set: we take the same transformation matrix (Q) for every site and scale it by the average rate for each category. This has the large advantage of accommodating a high diversity of rates with just a single parameter, α: some sites can be evolving so slowly as to have a high probability of stasis, while others (perhaps adjacent) may be free to evolve rapidly. It has the disadvantage that we apply the same transformation matrix uniformly across the data set – one set of base frequencies and one R matrix. We might get a much better fit if we allowed very different Q matrices. Furthermore, the site-specific likelihoods (SSLs) for each site are calculated many times (ncat times), so the better we approximate the continuous Γ, the longer our run times. It's important to note that this is actually a rather constrained mixture model.

I+G Models
This is intuitively very appealing when one considers that, at least for some genes, there is a set of sites that are constant across essentially the whole tree of life.
If p_invar = 0, I+G reduces to G.
If α = ∞, I+G reduces to I.
If p_invar = 0 & α = ∞, I+G reduces to equal rates (ER).
There are some issues with this model that are sometimes not appreciated.
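These reductions are easy to see in the per-site mixture form. The sketch below uses a toy conditional-likelihood function (illustrative only, not real pruning output) to check that p_invar = 0 recovers the plain Γ mixture and that equal category rates (the α → ∞ limit) recover the I model:

```python
import math

def site_lik_IG(p_invar, cat_rates, lik_given_rate, is_constant):
    """Per-site I+G mixture with equal-weight discrete-gamma categories."""
    l_zero = lik_given_rate(0.0) if is_constant else 0.0
    l_gamma = sum(lik_given_rate(r) for r in cat_rates) / len(cat_rates)
    return p_invar * l_zero + (1.0 - p_invar) * l_gamma

toy_lik = lambda r: 0.25 * math.exp(-r)       # illustrative stand-in
dg_rates = [0.0319, 0.2467, 0.8144, 2.907]    # the ncat = 4 rates from earlier

pure_gamma = site_lik_IG(0.0, dg_rates, toy_lik, is_constant=True)
alpha_inf = site_lik_IG(0.3, [1.0] * 4, toy_lik, is_constant=True)
```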

I+G Models
First, both the mixture model and the gamma alone expect there to be many constant sites. It can be very difficult to discern the sites that are truly invariable from those potentially variable sites that are evolving slowly enough to have a high probability of stasis.

The GTR+I+G family of models
There are 1624 possible special cases of GTR+I+G. There are 10 parameters in the full model:
3 free base frequencies (π)
5 relative rates (R)
2 rate-variation parameters (p_invar & α)
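One plausible accounting for the 1624 (my assumption; the slide does not spell it out): the six substitution types can be grouped into rate classes in Bell(6) = 203 ways, base frequencies can be equal or estimated (×2), and the rate-variation options are none, +I, +G, or +I+G (×4), giving 203 × 2 × 4 = 1624:

```python
def bell(n):
    """Bell number: the number of ways to partition n items
    (here, the 6 substitution types into rate classes),
    computed via the Bell triangle."""
    row = [1]
    for _ in range(n - 1):
        new = [row[-1]]
        for x in row:
            new.append(new[-1] + x)
        row = new
    return row[-1]

n_submodels = bell(6) * 2 * 4   # rate groupings x frequency options x {-, +I, +G, +I+G}
```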

Another look at GTR+SSR3
The full model has a separate GTR matrix for each of the 3 categories:
Base frequencies: π_A1 π_A2 π_A3, π_C1 π_C2 π_C3, π_G1 π_G2 π_G3, π_T1 π_T2 π_T3 → 9 free base frequencies
Relative rates: r(AC)1 r(AC)2 r(AC)3, r(AG)1 r(AG)2 r(AG)3, r(AT)1 r(AT)2 r(AT)3, r(CG)1 r(CG)2 r(CG)3, r(CT)1 r(CT)2 r(CT)3, r(GT)1 r(GT)2 r(GT)3 → 15 free relative-rate parameters
Constraining the parameters to be equal across categories recovers a single GTR:
π_A1 = π_A2 = π_A3; π_C1 = π_C2 = π_C3; π_G1 = π_G2 = π_G3; π_T1 = π_T2 = π_T3
r(AC)1 = r(AC)2 = r(AC)3; r(AG)1 = r(AG)2 = r(AG)3; r(AT)1 = r(AT)2 = r(AT)3; r(CG)1 = r(CG)2 = r(CG)3; r(CT)1 = r(CT)2 = r(CT)3; r(GT)1 = r(GT)2 = r(GT)3

GTR+CAT in RAxML
Estimate relative rates for each site on a starting tree. Lump sites with similar relative rates into categories. So now w_r = 0 or 1, and r is set to the value of the highest-lnL site in the category.
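A loose sketch of the lumping idea (this is not RAxML's actual procedure; the rank-based binning rule and the toy numbers are made up for illustration):

```python
def lump_into_categories(site_rates, site_lnls, c):
    """Assign sites to c rate categories by rate rank; each category's rate
    is taken from its highest-lnL member, per the heuristic described above."""
    order = sorted(range(len(site_rates)), key=lambda i: site_rates[i])
    size = -(-len(order) // c)               # ceiling division
    per_cat = [[] for _ in range(c)]
    cat_of = {}
    for rank, i in enumerate(order):
        k = min(rank // size, c - 1)
        cat_of[i] = k
        per_cat[k].append(i)
    cat_rate = [site_rates[max(members, key=lambda i: site_lnls[i])]
                for members in per_cat]
    return cat_of, cat_rate

# Toy per-site rate estimates and log-likelihoods (hypothetical):
cat_of, cat_rate = lump_into_categories(
    [0.1, 0.2, 1.0, 1.1, 3.0, 3.2], [-5, -4, -3, -2, -1, -6], c=3)
```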

rRNA Models
These address the non-independence of sites. Sites are partitioned a priori into stem and loop regions; sites in the loop partition are treated with some variant of the GTR+I+G family. Sites in the stem regions are treated using a doublet model: paired doublets are treated as characters rather than single nucleotides, so there are 16 states, and thus 120 reversible substitution types. This is very parameter rich (how many parameters?), and we're forced to use empirical models.
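One way to answer the slide's parameter-count question (an assumption on my part: all frequencies and relative rates free, with one rate fixed as the reference):

```python
def reversible_param_count(n_states):
    """Free parameters of a fully parameterized time-reversible model:
    (n - 1) free frequencies plus n(n-1)/2 - 1 free relative rates
    (one rate is fixed as the reference)."""
    freqs = n_states - 1
    rates = n_states * (n_states - 1) // 2 - 1
    return freqs, rates, freqs + rates

doublet = reversible_param_count(16)   # 16 doublet states, 120 substitution types
```

By this accounting the doublet model has 15 + 119 = 134 free parameters, which is why empirical matrices are attractive.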

Codon Models
In-frame triplets are used as characters, and there are 61 possible character states. Thus the transformation matrix has 3660 rate parameters (or 1830 in the reversible case). Again, empirical matrices can be used. Alternatively, cells of the transformation matrix can be restricted so that there are only, say, two substitution types. E.g., TTT → TTC: both code for Phe, so this is a silent substitution, with rate aπ_C, where a is the rate of silent substitutions and π_C is (as before) the frequency of nucleotide C. TTT → TTA results in an amino acid replacement, with rate bπ_A, where b is the rate of amino acid replacement substitutions. In this example there are 4 free parameters (3 base frequencies and a rate ratio).
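The state counts, and the silent-vs-replacement classification of the TTT → TTC and TTT → TTA examples, can be checked directly from the standard genetic code. The compact amino-acid string below encodes the standard code with codons ordered T, C, A, G at each position; '*' marks stop codons:

```python
BASES = "TCAG"
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AA[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}
SENSE = [codon for codon, aa in CODE.items() if aa != "*"]

def substitution_type(c1, c2):
    """Classify a codon change as 'silent' or 'replacement'."""
    return "silent" if CODE[c1] == CODE[c2] else "replacement"

n_states = len(SENSE)                 # 61 sense codons
n_cells = n_states * (n_states - 1)   # 3660 off-diagonal rate cells
n_reversible = n_cells // 2           # 1830 in the reversible case
```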