Identifying and Modeling Selection Pressure (a review of three papers) Rose Hoberman BioLM seminar Feb 9, 2004.



Today
McClellan and McCracken: Estimating the Influence of Selection on the Variable Amino Acid Sites of the Cytochrome b Protein Functional Domains
Dagan et al.: Ratios of Radical to Conservative Amino Acid Replacement Are Affected by Mutational and Compositional Factors and May Not Be Indicative of Positive Darwinian Selection
Halpern and Bruno: Evolutionary Distances for Protein-Coding Sequences: Modeling Site-Specific Residue Frequencies

Types of Selection
negative (purifying) selection
 –non-synonymous codon changes are selected against
neutral evolution (no selection)
 –non-synonymous changes are fixed or eliminated with the same probability as synonymous ones
positive (diversifying) selection
 –non-synonymous codon changes are selected for

Identifying Regions Under Selective Pressure
dN/dS > 1 is commonly used as evidence of positive selection
synonymous substitutions become saturated more quickly than non-synonymous ones
compare the conservative/radical substitution ratio to its expected distribution under a neutral model

A “conservative” definition
Cluster amino acids according to physicochemical properties
 –Charge
 –Volume
 –Polarity
 –Grantham’s distance
 –...
Within-class = conservative
Across-class = radical
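The within-class/across-class rule can be sketched in a few lines of Python. The three charge classes used here (positive R, H, K; negative D, E; neutral otherwise) and the function name are illustrative assumptions, not necessarily the grouping used in the reviewed papers.

```python
# Hypothetical charge classes; a volume or polarity grouping would plug in
# the same way by passing a different `classes` dictionary.
CHARGE_CLASS = {aa: "positive" for aa in "RHK"}
CHARGE_CLASS.update({aa: "negative" for aa in "DE"})
CHARGE_CLASS.update({aa: "neutral" for aa in "ACFGILMNPQSTVWY"})


def classify_replacement(aa_from, aa_to, classes=CHARGE_CLASS):
    """Return 'conservative' for a within-class change, 'radical' otherwise."""
    if aa_from == aa_to:
        raise ValueError("not a replacement: identical amino acids")
    if classes[aa_from] == classes[aa_to]:
        return "conservative"
    return "radical"
```

Under this assumed grouping, Lys to Arg is conservative and Asp to Lys is radical.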

Assessing Substitution Rates
2 sequences
 –average over all possible pathways between two codons
 –e.g. TTG (Leu) -> ATG (Met) -> AGG (Arg) -> AGA (Arg)
Many sequences
 –Build a phylogenetic tree
 –Infer most likely ancestral sequences
 –Count synonymous and nonsynonymous substitutions
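The pathway averaging for two sequences can be sketched as follows. This is a minimal illustration that assumes the standard genetic code and assumes that pathways passing through a stop codon are skipped; the function name is hypothetical.

```python
from itertools import permutations, product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order ('*' = stop).
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def average_pathway_counts(codon_a, codon_b):
    """Average (synonymous, nonsynonymous) step counts over every ordering
    of the differing positions, skipping pathways through a stop codon."""
    diffs = [i for i in range(3) if codon_a[i] != codon_b[i]]
    total_syn = total_non = n_paths = 0
    for order in permutations(diffs):
        current, syn, non, valid = codon_a, 0, 0, True
        for pos in order:
            nxt = current[:pos] + codon_b[pos] + current[pos + 1:]
            if CODE[nxt] == "*":
                valid = False  # assumed convention: discard this pathway
                break
            if CODE[nxt] == CODE[current]:
                syn += 1
            else:
                non += 1
            current = nxt
        if valid:
            total_syn += syn
            total_non += non
            n_paths += 1
    if n_paths == 0:
        return (0.0, 0.0)
    return (total_syn / n_paths, total_non / n_paths)
```

For the TTG to AGA example, four of the six pathways avoid a stop codon, and `average_pathway_counts("TTG", "AGA")` returns `(0.75, 2.25)`.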

Cytochrome b Gene Evolution
Matrix and transmembrane regions have comparable rates of change
Intermembrane region has a lower rate of change
(McClellan and McCracken)

5 Properties, 4 Groups
Neutral model
 –based only on codon frequencies
Chi-squared test
 –observed vs. expected (given domain amino acid frequencies)
[Table of non-synonymous mutations by group; contents not preserved]
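The observed-vs-expected comparison can be sketched with Pearson's chi-squared statistic. The function names are hypothetical and the example counts mentioned afterwards are made up for illustration; they are not data from the paper.

```python
def expected_counts(observed, neutral_probs):
    """Scale neutral-model category probabilities to the observed total."""
    total = sum(observed)
    return [total * p for p in neutral_probs]


def chi_squared_statistic(observed, expected):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E over categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

With hypothetical counts `[30, 10, 40, 20]` across four groups and a uniform neutral expectation, the statistic is 20.0, which exceeds the 5% critical value of 7.815 for 3 degrees of freedom.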

Question
Do factors unrelated to selection affect the radical/conservative ratio?
 –nucleotide frequencies, e.g. GC content
 –transition/transversion ratio: transitions (A<->G and C<->T) are more common than transversions
 –distances between amino acids (genetic code)
 –codon biases due to tRNA availability, energy usage, or pathogen avoidance
 –amino acid frequencies ??
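For reference, a transition stays within the purines (A, G) or within the pyrimidines (C, T), and a transversion crosses between them. A minimal sketch, with hypothetical function names, that counts both kinds of difference between two aligned sequences:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(a, b):
    """True if a->b stays within purines or within pyrimidines."""
    if a == b:
        raise ValueError("not a substitution: identical nucleotides")
    return {a, b} <= PURINES or {a, b} <= PYRIMIDINES


def ts_tv_ratio(seq1, seq2):
    """Ratio of transition to transversion differences between two aligned,
    gap-free sequences (infinite if no transversions are observed)."""
    ts = tv = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        if is_transition(a, b):
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else float("inf")
```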

An Initial Test
3 proteins: Hemoglobin, Interleukin, Ribosomal protein
Simulated neutral evolution using a substitution matrix built from pseudogenes
Tested for selection pressure; false-positive (FP) rates by property:
 –volume/polarity: 100% FP
 –Grantham: 13-21% FP
 –charge: 0% FP
(Dagan et al.)

Simulation Study
Generate virtual ancestral sequence
 –300 nt long
Set mutational/compositional parameters
Simulate evolution (ROSE software)
 –50 substitutions
Calculate conservative/radical ratio
Each parameter set simulated 50 times
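The overall shape of such a simulation can be sketched in plain Python. This is not the ROSE software and does not reproduce its model: it assumes a random ancestor with a chosen GC content, neutral point substitutions with a fixed transition probability, and a charge-based conservative/radical classification. All names and parameter values are illustrative.

```python
import random
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}


def charge(aa):
    """Assumed charge classes: + (R, H, K), - (D, E), 0 (everything else)."""
    return "+" if aa in "RHK" else "-" if aa in "DE" else "0"


def random_sequence(length, gc_content, rng):
    """Random nucleotide sequence with the given expected GC content."""
    at, gc = (1.0 - gc_content) / 2.0, gc_content / 2.0
    # Weights follow the order of BASES: T, C, A, G.
    return "".join(rng.choices(BASES, weights=[at, gc, at, gc], k=length))


def simulate(ancestor, n_subs, ts_prob, rng):
    """Apply n_subs neutral point substitutions one at a time and tally
    conservative vs. radical amino acid replacements by charge."""
    if len(ancestor) % 3:
        raise ValueError("sequence length must be a multiple of 3")
    seq = list(ancestor)
    conservative = radical = 0
    for _ in range(n_subs):
        pos = rng.randrange(len(seq))
        old = seq[pos]
        if rng.random() < ts_prob:
            new = TRANSITION[old]
        else:
            new = rng.choice([b for b in BASES if b not in (old, TRANSITION[old])])
        start = pos - pos % 3
        aa_old = CODE["".join(seq[start:start + 3])]
        seq[pos] = new
        aa_new = CODE["".join(seq[start:start + 3])]
        if aa_old == aa_new or "*" in (aa_old, aa_new):
            continue  # synonymous or stop-related change: not tallied
        if charge(aa_old) == charge(aa_new):
            conservative += 1
        else:
            radical += 1
    return "".join(seq), conservative, radical
```

Calling `simulate(random_sequence(300, 0.5, rng), 50, 0.67, rng)` with `rng = random.Random(seed)` gives one replicate; repeating it 50 times per parameter set would mirror the design on the slide.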

ANOVA

Conclusion
Many compositional and mutational factors influence the conservative/radical ratio
The ratio is therefore a poor indicator of positive selection

Correlation or Causation?
Many factors are correlated, but direction of causation is undetermined
 –transitions are more likely to cause conservative changes than transversions
 –codon bias can influence nucleotide frequencies
 –purifying selective pressure will reduce the rate of change
This motivates generative models that capture many of these factors

Generative Models of Gene/Protein Evolution
Infer relative distances between sequences
Build a phylogenetic tree
Infer which positions are under positive selective pressure
Find additional homologous proteins
Identify co-varying sites

Modeling Evolutionary Processes
Most models
 –homogeneous, time-reversible Markov models
Simplest models
 –DNA mutation models
 –nucleotide frequencies
 –transition/transversion ratio
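One widely used DNA model with exactly these two ingredients (nucleotide frequencies and a transition/transversion ratio) is HKY85. The sketch below builds an unnormalized HKY85-style rate matrix and checks time-reversibility through detailed balance; the function names and example frequencies are assumptions for illustration, and the slides do not say which model the reviewed papers used.

```python
NUCS = "ACGT"
PURINES = {"A", "G"}


def hky_rate_matrix(freqs, kappa):
    """Unnormalized HKY85-style rate matrix as nested dicts.
    freqs maps each nucleotide to its equilibrium frequency;
    kappa is the transition/transversion rate ratio."""
    q = {a: {} for a in NUCS}
    for a in NUCS:
        for b in NUCS:
            if a == b:
                continue
            same_class = (a in PURINES) == (b in PURINES)  # transition?
            q[a][b] = freqs[b] * (kappa if same_class else 1.0)
        # Diagonal makes each row sum to zero.
        q[a][a] = -sum(q[a].values())
    return q


def is_reversible(q, freqs, tol=1e-12):
    """Detailed balance: pi_a * q_ab == pi_b * q_ba for every pair."""
    return all(
        abs(freqs[a] * q[a][b] - freqs[b] * q[b][a]) < tol
        for a in NUCS
        for b in NUCS
    )
```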

Too Simplistic
positions within codons are not independent
 –codon or amino acid models
parameters not sufficient to explain different rates of change between specific characters
 –empirical substitution matrix (e.g. PAM)
site-specific rates of change
 –use a gamma distribution to model variation in rates
equilibrium frequencies are also site-specific
 –due to functional or structural constraints
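Gamma-distributed rate variation can be sketched by drawing one relative rate per site from a gamma distribution with mean 1. The shape parameter value and the function name below are illustrative assumptions.

```python
import random


def sample_site_rates(n_sites, alpha, rng):
    """Draw one relative substitution rate per site from a gamma
    distribution with shape alpha and scale 1/alpha, so the mean is 1.
    Small alpha gives strong rate heterogeneity across sites."""
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]
```

With a small shape such as `alpha = 0.5`, most sites evolve slowly while a few evolve much faster than average.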

Halpern & Bruno 1998
A codon-based model of evolution
1. site-invariant DNA-based mutation model
2. site-specific amino acid level selection model
[Equation not preserved; its legend defined one term as the probability of mutation and another as the probability of fixation at site i]

Halpern & Bruno 1998
Assumptions
 –most importantly, selective pressures are constant at a given position for all lineages over all times
 –sites are independent
 –Markov process is reversible
Does not model
 –selection at the codon level: codon bias, DNA or RNA structural requirements
 –uncertainty in the MSA

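Calculating fixation rates (Kimura 1962)
[Equation not preserved; its legend labeled the relative fitness of b to a and the population size]

Fixation rates in terms of equilibrium rates and mutation probabilities (Kimura 1962)
[Equation not preserved; its legend labeled the relative fitness of b to a and the population size]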

A Simpler Formulation
p is estimated from nucleotide frequencies and the transition/transversion ratio
π represents the frequency of each codon, and is approximated via amino acid and nucleotide frequencies
model ignores:
 –site-specific nucleic acid selection effects (e.g. from RNA structure)
 –codon bias
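The slide's formulas were images and are not preserved in this transcript. As a hedged sketch, the Python below uses the form in which the Halpern and Bruno result is usually quoted: the relative fixation rate of a mutation from codon a to codon b is ln(x) / (1 - 1/x), with x = (π_b p_ba) / (π_a p_ab). The function names and any numeric inputs are illustrative assumptions.

```python
import math


def relative_fixation_rate(pi_a, pi_b, p_ab, p_ba):
    """Relative fixation rate of a mutation a->b, given equilibrium codon
    frequencies (pi) and mutation probabilities (p). Assumed form:
    ln(x) / (1 - 1/x) with x = (pi_b * p_ba) / (pi_a * p_ab)."""
    x = (pi_b * p_ba) / (pi_a * p_ab)
    if math.isclose(x, 1.0):
        return 1.0  # neutral limit of the expression as x -> 1
    return math.log(x) / (1.0 - 1.0 / x)


def substitution_rate(pi_a, pi_b, p_ab, p_ba):
    """Relative substitution rate a->b: mutation times fixation."""
    return p_ab * relative_fixation_rate(pi_a, pi_b, p_ab, p_ba)
```

A useful sanity check is that this form satisfies detailed balance: `pi_a * substitution_rate(pi_a, pi_b, p_ab, p_ba)` equals `pi_b * substitution_rate(pi_b, pi_a, p_ba, p_ab)`, which is what makes π the equilibrium distribution of a reversible process.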

Model Fallout
Amount of “flux” between two codons depends on their relative fitness
Rates are not explicitly modeled, but...
 –maximum substitution rate will be when all codons are equally fit
 –synonymous codons will have highest flux
 –because of degeneracy of 3rd position changes, they will be most frequent

Parameter Estimation
Ideal
 –estimate parameters simultaneously from a large data set
What they did
 –nucleotide frequencies: from observed frequencies
 –transition/transversion ratio: using existing nucleotide-based methods
 –equilibrium amino acid frequencies: estimate number of times each amino acid was introduced at each position (based on phylogenetic tree but ignores genetic code), then add pseudo-counts
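The pseudo-count step can be sketched as follows. A uniform pseudo-count added to every amino acid is an assumption made for illustration; the slide does not say which pseudo-count scheme the authors used, and the function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def site_frequencies(counts, pseudocount=1.0):
    """Estimate equilibrium amino acid frequencies at one site from the
    number of times each amino acid was introduced there, adding a uniform
    pseudo-count so that unobserved residues keep a nonzero frequency."""
    total = sum(counts.get(aa, 0) for aa in AMINO_ACIDS)
    total += pseudocount * len(AMINO_ACIDS)
    return {aa: (counts.get(aa, 0) + pseudocount) / total for aa in AMINO_ACIDS}
```

For example, a site where Leu was introduced 8 times and Ile 2 times, with a pseudo-count of 0.5, gets a Leu frequency of 0.425 and a frequency of 0.025 for each unobserved residue.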

Evaluation
Their hypothesis:
 –methods that only model differing rates will underestimate more remote divergence times
Test hypothesis on simulated data
 –given an MSA: estimate the tree (multiplied branch lengths by 6.0), estimate amino acid frequencies
 –arbitrarily choose mutational parameters
 –stochastically generate sequences (how many?)

Predicting Distances Between Sequences
A: DNA model (learned?)
B: DNA model with site-rate variation
C: this model with simulation parameters
D: this model with parameters estimated from simulated data
x axis: estimated distances
y axis: true distances (based on simulation)

Conclusions
failing to model selection effects leads to substantial underestimation of longer distances
possible to estimate equilibrium amino acid frequencies from realistic data sets with an accuracy sufficient for estimating distances between highly divergent sequences
model accounts for heterogeneity of rates in a novel and more biologically realistic way
model parameters could in theory be estimated simultaneously using ML or Bayesian estimation