Identifying and Modeling Selection Pressure (a review of three papers)
Rose Hoberman
BioLM seminar, Feb 9, 2004
Today
McClellan and McCracken: Estimating the Influence of Selection on the Variable Amino Acid Sites of the Cytochrome b Protein Functional Domains
Dagan et al.: Ratios of Radical to Conservative Amino Acid Replacement are Affected by Mutational and Compositional Factors and May Not be Indicative of Positive Darwinian Selection
Halpern and Bruno: Evolutionary Distances for Protein-Coding Sequences: Modeling Site-Specific Residue Frequencies
Types of Selection
negative (purifying) selection
–non-synonymous codon changes are selected against
neutral selection
–non-synonymous changes in codons have an equivalent probability of elimination or fixation
positive (diversifying) selection
–non-synonymous codon changes are selected for
Identifying Regions Under Selective Pressure
dN/dS > 1 commonly used as evidence of positive selection
–synonymous substitutions become saturated more quickly than nonsynonymous ones
compare conservative/radical substitution ratio to expected distribution under neutral model
A “conservative” definition
Cluster amino acids according to physicochemical properties
–Charge
–Volume
–Polarity
–Grantham’s distance
–...
Within-class = conservative
Across-class = radical
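A minimal sketch of the within-class / across-class rule, using charge as the property. The three charge classes below (positive: R, H, K; negative: D, E; neutral: all others) are an assumed, commonly used grouping, not one given on the slide, and `classify_replacement` is a hypothetical helper name.

```python
# Assumed charge classes (illustrative; the slides do not list the groupings).
CHARGE_CLASS = {}
for aa in "RHK":
    CHARGE_CLASS[aa] = "positive"
for aa in "DE":
    CHARGE_CLASS[aa] = "negative"
for aa in "ACFGILMNPQSTVWY":
    CHARGE_CLASS[aa] = "neutral"


def classify_replacement(aa_from, aa_to, classes=CHARGE_CLASS):
    """Label an amino acid replacement as conservative or radical."""
    if aa_from == aa_to:
        raise ValueError("not a replacement: both residues are identical")
    same_class = classes[aa_from] == classes[aa_to]
    return "conservative" if same_class else "radical"


print(classify_replacement("K", "R"))  # both positively charged
print(classify_replacement("D", "K"))  # negative to positive
```

Swapping in a different `classes` dictionary (volume, polarity) changes which replacements count as radical, which is why results can differ so much between properties.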
Assessing Substitution Rates
2 sequences
–average over all possible pathways between two codons
–TTG(Leu) - ATG(Met) - AGG(Arg) - AGA(Arg)
Many sequences
–Build a phylogenetic tree
–Infer most likely ancestral sequences
–Count synonymous and nonsynonymous substitutions
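The two-sequence case can be sketched as follows: enumerate every shortest mutational pathway between two codons and average the synonymous and nonsynonymous step counts. This assumes the standard genetic code, equal weight for every pathway, and that pathways passing through a stop codon are discarded; these are common conventions but are not stated on the slide. Function names are hypothetical.

```python
from itertools import permutations, product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons in TCAG order; "*" marks a stop codon.
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def pathways(codon1, codon2):
    """Yield each shortest pathway (list of codons) from codon1 to codon2."""
    differing = [i for i in range(3) if codon1[i] != codon2[i]]
    for order in permutations(differing):
        path = [codon1]
        for pos in order:
            prev = path[-1]
            path.append(prev[:pos] + codon2[pos] + prev[pos + 1:])
        # Assumption: skip pathways whose intermediate codons include a stop.
        if all(CODE[c] != "*" for c in path[1:-1]):
            yield path


def average_counts(codon1, codon2):
    """Return (synonymous, nonsynonymous) steps averaged over pathways."""
    syn = nonsyn = n_paths = 0
    for path in pathways(codon1, codon2):
        n_paths += 1
        for a, b in zip(path, path[1:]):
            if CODE[a] == CODE[b]:
                syn += 1
            else:
                nonsyn += 1
    if n_paths == 0:
        raise ValueError("every pathway passes through a stop codon")
    return syn / n_paths, nonsyn / n_paths


for p in pathways("TTG", "AGA"):
    print(" - ".join(f"{c}({CODE[c]})" for c in p))
print(average_counts("TTG", "AGA"))
```

The pathway from the slide, TTG - ATG - AGG - AGA, is one of the pathways enumerated for this codon pair.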
Cytochrome b Gene Evolution
Matrix and Transmembrane regions have comparable rates of change
Intermembrane region has lower rate of change
(McClellan and McCracken)
5 Properties, 4 Groups
Neutral model
–based only on codon frequencies
Chi-squared test
–observed vs. expected (given domain amino acid frequencies)
[Table: Group vs. Non-Syn Mutations]
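A sketch of the observed-vs-expected comparison as a Pearson chi-squared statistic. The counts and expected proportions below are made-up placeholders, not values from the paper, and `chi_squared_statistic` is a hypothetical helper.

```python
def chi_squared_statistic(observed, expected_proportions):
    """Pearson chi-squared statistic for observed counts vs. expected proportions."""
    total = sum(observed)
    statistic = 0.0
    for obs, proportion in zip(observed, expected_proportions):
        expected = total * proportion
        statistic += (obs - expected) ** 2 / expected
    return statistic


# Hypothetical non-synonymous mutation counts in 4 groups (illustration only).
observed = [30, 20, 25, 25]
# Hypothetical neutral-model expectation: equal proportions in every group.
expected_proportions = [0.25, 0.25, 0.25, 0.25]

stat = chi_squared_statistic(observed, expected_proportions)
# With 4 groups there are 3 degrees of freedom; the 5% critical value is 7.815.
print(stat, "reject neutral model" if stat > 7.815 else "consistent with neutral model")
```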
Question
Do factors unrelated to selection affect the radical/conservative ratio?
–nucleotide frequencies
  e.g. GC content
–transition/transversion ratio
  transitions (A->G and T->C) are more common than transversions
–distances between amino acids
  genetic code
–codon biases
  due to tRNA availability, energy usage, or pathogen avoidance
–amino acid frequencies
  ??
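For the transition/transversion factor, a small sketch that classifies each difference between two aligned sequences. The sequences are invented for illustration and the function names are hypothetical.

```python
PURINES = {"A", "G"}  # the remaining bases, C and T, are pyrimidines


def is_transition(x, y):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return x != y and ((x in PURINES) == (y in PURINES))


def ts_tv_counts(seq1, seq2):
    """Count transitions and transversions between two aligned sequences."""
    transitions = transversions = 0
    for x, y in zip(seq1, seq2):
        if x == y:
            continue
        if is_transition(x, y):
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


print(ts_tv_counts("AACGT", "GATGA"))
```

Each base has one possible transition and two possible transversions, so with equal mutation rates transversions would be expected twice as often; an observed excess of transitions reflects a mutational bias.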
An Initial Test
3 proteins: Hemoglobin, Interleukin, Ribosomal protein
Simulated neutral evolution using substitution matrix built from pseudogenes
Tested for selection pressure
–volume/polarity: 100% FP
–Grantham: 13-21% FP
–charge: 0% FP
(Dagan et al.)
Simulation Study
Generate virtual ancestral sequence
–300 nt long
Set mutational/compositional parameters
Simulate evolution (ROSE software)
–50 substitutions
Calculate conservative/radical ratio
Each parameter set simulated 50 times
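The steps above can be sketched with a toy simulator. This is not ROSE: it is a simplified stand-in that assumes uniform codon usage in the ancestor, a single transition/transversion weight `kappa`, rejection of mutations that create stop codons, and a charge-based definition of radical changes. All names and defaults are hypothetical.

```python
import random
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
CHARGE = {"R": "+", "H": "+", "K": "+", "D": "-", "E": "-"}  # others neutral


def mutate_once(codons, kappa, rng):
    """Apply one point substitution in place; return (old_aa, new_aa) or None."""
    i, pos = rng.randrange(len(codons)), rng.randrange(3)
    old = codons[i]
    choices = [b for b in BASES if b != old[pos]]
    # The transition gets weight kappa; each transversion gets weight 1.
    weights = [kappa if TRANSITION[old[pos]] == b else 1.0 for b in choices]
    new = old[:pos] + rng.choices(choices, weights)[0] + old[pos + 1:]
    if CODE[new] == "*":
        return None  # assumption: nonsense mutations are never fixed
    codons[i] = new
    return CODE[old], CODE[new]


def simulate(n_codons=100, n_subs=50, kappa=2.0, seed=0):
    """Evolve a random 300-nt ancestor by n_subs substitutions; tally types."""
    rng = random.Random(seed)
    sense = [c for c, aa in CODE.items() if aa != "*"]
    codons = [rng.choice(sense) for _ in range(n_codons)]
    counts = {"synonymous": 0, "conservative": 0, "radical": 0}
    accepted = 0
    while accepted < n_subs:
        result = mutate_once(codons, kappa, rng)
        if result is None:
            continue
        accepted += 1
        old_aa, new_aa = result
        if old_aa == new_aa:
            counts["synonymous"] += 1
        elif CHARGE.get(old_aa, "0") == CHARGE.get(new_aa, "0"):
            counts["conservative"] += 1
        else:
            counts["radical"] += 1
    return counts


# One parameter set, simulated 50 times with different seeds.
runs = [simulate(kappa=2.0, seed=s) for s in range(50)]
print(runs[0])
```

Re-running with a different `kappa` and comparing the radical and conservative tallies across the 50 runs gives a rough picture of how a purely mutational parameter can shift the ratio without any selection in the model.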
Conclusion
Many composition and mutation factors influence conservative/radical ratio
Poor indicator of positive selection
Correlation or Causation?
Many factors are correlated, but direction of causation is undetermined
–transitions more likely to cause conservative changes than transversions
–codon bias can influence nucleotide frequencies
–purifying selective pressure will reduce the rate of change
Generative models which model many of these relevant factors
Generative Models of Gene/Protein Evolution
Infer relative distances between sequences
Build a phylogenetic tree
Infer which positions are under positive selective pressure
Find additional homologous proteins
Identify co-varying sites
Modeling Evolutionary Processes
Most models
–homogeneous, time-reversible Markov models
Simplest models
–DNA mutation models
–nucleotide frequencies
–transition/transversion ratio
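As one concrete instance of a simple DNA mutation model parameterized by nucleotide frequencies and a transition/transversion ratio, here is a sketch of an HKY85-style rate matrix. The slide does not name a specific model, so the choice of HKY85 and the parameter values are assumptions; the check at the end illustrates what time-reversibility means (detailed balance).

```python
BASES = "ACGT"
PURINES = {"A", "G"}


def is_transition(x, y):
    return x != y and ((x in PURINES) == (y in PURINES))


def hky_rate_matrix(freqs, kappa):
    """Unnormalized HKY85-style rates: q[x][y] = freq[y] * (kappa if transition)."""
    q = {x: {} for x in BASES}
    for x in BASES:
        for y in BASES:
            if x != y:
                q[x][y] = freqs[y] * (kappa if is_transition(x, y) else 1.0)
        # Diagonal entry makes each row sum to zero.
        q[x][x] = -sum(q[x].values())
    return q


# Hypothetical parameter values for illustration.
FREQS = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
Q = hky_rate_matrix(FREQS, kappa=2.0)

# Time-reversibility: freq[x] * q[x][y] equals freq[y] * q[y][x] for all pairs.
reversible = all(
    abs(FREQS[x] * Q[x][y] - FREQS[y] * Q[y][x]) < 1e-12
    for x in BASES for y in BASES
)
print(reversible)
```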
Too Simplistic
positions within codons are not independent
–codon or amino acid models
parameters not sufficient to explain different rates of change between specific characters
–empirical substitution matrix (e.g. PAM)
site-specific rates of change
–use a gamma distribution to model variation in rates
equilibrium frequencies are also site-specific
–due to functional or structural constraints
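A sketch of gamma-distributed site rates. It assumes the usual convention of a gamma distribution with mean 1 (shape `alpha`, scale `1/alpha`), which the slide does not spell out; the value of `alpha` is a placeholder.

```python
import random


def sample_site_rates(n_sites, alpha, seed=0):
    """Draw one relative rate per site from a gamma distribution with mean 1."""
    rng = random.Random(seed)
    # random.gammavariate(shape, scale) has mean shape * scale.
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


rates = sample_site_rates(10000, alpha=0.5)
mean_rate = sum(rates) / len(rates)
slow_fraction = sum(r < 0.1 for r in rates) / len(rates)
print(mean_rate, slow_fraction)
```

Small `alpha` gives many nearly invariant sites and a few fast ones; large `alpha` approaches equal rates at every site. This captures rate variation only, not site-specific equilibrium frequencies.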
Halpern & Bruno 1998
A codon-based model of evolution
–DNA-based mutation model
–amino acid level selection model
$$R^{i}_{ab} \propto p_{ab}\, f^{i}_{ab}$$
$p_{ab}$ = probability of mutation (codon a to codon b)
$f^{i}_{ab}$ = probability of fixation at site i
Halpern & Bruno 1998
Assumptions
–most importantly, selection pressures are constant at a given position for all lineages over all times
–sites independent
–Markov process is reversible
Does not model
–selection at the codon level
  codon bias
  DNA or RNA structural requirements
–uncertainty in MSA
Calculating fixation rates (Kimura 1962)
$$f^{i}_{ab} \propto \frac{1 - e^{-2s}}{1 - e^{-2Ns}}$$
$s$ = relative fitness of b to a
$N$ = population size
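A numerical sketch of the fixation probability, assuming the haploid form (1 - e^{-2s}) / (1 - e^{-2Ns}), with s the relative fitness of b to a and N the population size. The function name and the example values of s and N are hypothetical.

```python
import math


def fixation_probability(s, n):
    """Fixation probability of a new mutant with selective advantage s
    in a population of size n (assumed haploid form)."""
    if abs(s) < 1e-12:
        return 1.0 / n  # neutral limit of the expression as s -> 0
    # expm1 keeps precision when s is small.
    return math.expm1(-2.0 * s) / math.expm1(-2.0 * n * s)


N = 1000
neutral = fixation_probability(0.0, N)
advantageous = fixation_probability(0.01, N)
deleterious = fixation_probability(-0.01, N)
print(deleterious, neutral, advantageous)
```

A neutral mutation fixes with probability 1/N; advantageous mutations fix more often and deleterious ones far less often, which is how relative fitness enters the substitution rate.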
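Fixation rates in terms of equilibrium frequencies and mutation probabilities
Start from the Kimura (1962) expression in $s$ (relative fitness of b to a) and $N$ (population size)
Eliminate $s$ and $N$ using reversibility: $\pi^{i}_{a}\, p_{ab}\, f^{i}_{ab} = \pi^{i}_{b}\, p_{ba}\, f^{i}_{ba}$
$$f^{i}_{ab} \propto \frac{\ln\!\left(\dfrac{\pi^{i}_{b}\, p_{ba}}{\pi^{i}_{a}\, p_{ab}}\right)}{1 - \dfrac{\pi^{i}_{a}\, p_{ab}}{\pi^{i}_{b}\, p_{ba}}}$$
$\pi^{i}_{a}$ = equilibrium frequency of codon a at site i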
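A sketch of the fixation factor written purely in terms of equilibrium frequencies and mutation probabilities, assuming the form f_ab ∝ ln(x) / (1 - 1/x) with x = (π_b p_ba) / (π_a p_ab). The numeric values are placeholders and the function names are hypothetical; the final check confirms that the resulting rates satisfy detailed balance.

```python
import math


def relative_fixation(pi_a, pi_b, p_ab, p_ba):
    """Relative fixation rate of a -> b from equilibrium frequencies (pi)
    and mutation probabilities (p)."""
    x = (pi_b * p_ba) / (pi_a * p_ab)
    if abs(x - 1.0) < 1e-12:
        return 1.0  # limit as x -> 1: no fitness difference
    return math.log(x) / (1.0 - 1.0 / x)


def substitution_rate(pi_a, pi_b, p_ab, p_ba):
    """Rate a -> b, up to a constant: mutation probability times fixation."""
    return p_ab * relative_fixation(pi_a, pi_b, p_ab, p_ba)


# Placeholder values for two codons a and b at one site.
pi_a, pi_b = 0.1, 0.4
p_ab, p_ba = 0.02, 0.03

flux_ab = pi_a * substitution_rate(pi_a, pi_b, p_ab, p_ba)
flux_ba = pi_b * substitution_rate(pi_b, pi_a, p_ba, p_ab)
print(flux_ab, flux_ba)  # equal under reversibility
```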
A Simpler Formulation
p is estimated from nucleotide frequencies and the transition/transversion ratio
π represents the frequency of each codon, and is approximated via amino acid and nucleotide frequencies
model ignores:
–site-specific nucleic acid selection effects (e.g. from RNA structure)
–codon bias
Model Fallout
Amount of “flux” between two codons depends on their relative fitness
Rates are not explicitly modeled, but...
–maximum substitution rate will be when all codons are equally fit
–synonymous codons will have highest flux
–because of degeneracy of 3rd position changes, they will be most frequent
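A toy check of the first consequence, for just two codons with symmetric mutation probability `p`. It assumes the fixation factor ln(x) / (1 - 1/x) with x = π_b / π_a (the mutation terms cancel when p is symmetric) and defines flux as π_a · p · f_ab. The two-codon setup and all values are illustrative assumptions.

```python
import math


def flux(pi_a, p=0.01):
    """Flux a -> b in a two-codon toy model with pi_b = 1 - pi_a."""
    pi_b = 1.0 - pi_a
    x = pi_b / pi_a  # symmetric mutation: p_ab == p_ba cancels out
    f_ab = 1.0 if abs(x - 1.0) < 1e-12 else math.log(x) / (1.0 - 1.0 / x)
    return pi_a * p * f_ab


# Sweep the equilibrium frequency of codon a from 0.05 to 0.95.
grid = [i / 100 for i in range(5, 100, 5)]
best = max(grid, key=flux)
print(best, flux(best))
```

The flux peaks where the two codons have equal equilibrium frequency, that is, equal fitness; synonymous codons are the natural case where this holds.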
Parameter Estimation
Ideal
–estimate parameters simultaneously from large data set
What they did
–nucleotide frequencies: from observed frequencies
–transition/transversion ratio: using existing nucleotide-based methods
–equilibrium amino acid frequencies:
  estimate number of times each amino acid was introduced at each position (based on phylogenetic tree but ignores genetic code)
  add pseudo-counts
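A sketch of the pseudo-count step for one alignment position. It assumes a uniform pseudo-count added to every amino acid, which may differ from the scheme the authors used; the introduction counts are invented and the function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def site_frequencies(introductions, pseudocount=1.0):
    """Turn per-site introduction counts into smoothed equilibrium frequencies."""
    total = sum(introductions.get(aa, 0) for aa in AMINO_ACIDS)
    total += pseudocount * len(AMINO_ACIDS)
    return {
        aa: (introductions.get(aa, 0) + pseudocount) / total
        for aa in AMINO_ACIDS
    }


# Hypothetical site: Leu introduced 3 times on the tree, Ile once.
freqs = site_frequencies({"L": 3, "I": 1})
print(freqs["L"], freqs["I"], freqs["W"])
```

Without pseudo-counts, any amino acid never observed at a site would get frequency zero, making substitutions to it impossible under the model.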
Evaluation
Their hypothesis:
–methods that only model differing rates will underestimate more remote divergence times
Test hypothesis on simulated data
–given an MSA
  estimate the tree (multiplied branch lengths by 6.0)
  estimate amino acid frequencies
–arbitrarily choose mutational parameters
–stochastically generate sequences (how many?)
Predicting Distances Between Sequences
A: DNA model (learned?)
B: DNA model with site-rate variation
C: this model with simulation parameters
D: this model with parameters estimated from simulated data
x axis: estimated distances
y axis: true distances (based on simulation)
Conclusions
failing to model selection effects leads to substantial underestimation of longer distances
possible to estimate equilibrium amino acid frequencies from realistic data sets with an accuracy sufficient for estimating distances between highly divergent sequences
model accounts for heterogeneity of rates in a novel and more biologically realistic way
model parameters could in theory be estimated simultaneously using ML or Bayesian estimation