1 Prediction of functional/structural sites in a protein using conservation and hyper-variation (ConSeq, ConSurf, Selecton)
2 Empirical findings: variation among genes. “Important” proteins evolve slower than “unimportant” ones.
3 Histone H4 protein
4 Empirical findings: variation among genes. Functional/structural regions evolve slower than nonfunctional/nonstructural regions.
5 Conservation = functional/structural importance/constraints
6 Alignment preproinsulin
Xenopus MALWMQCLP-LVLVLLFSTPNTEALANQHL
Bos     MALWTRLRPLLALLALWPPPPARAFVNQHL
**** : * *.*: *:..* :. *:****
Xenopus CGSHLVEALYLVCGDRGFFYYPKIKRDIEQ
Bos     CGSHLVEALYLVCGERGFFYTPKARREVEG
**************:******** :*::*
Xenopus AQVNGPQDNELDG-MQFQPQEYQKMKRGIV
Bos     PQVG---ALELAGGPGAGGLEGPPQKRGIV
.**. ** * * *****
Xenopus EQCCHSTCSLFQLENYCN
Bos     EQCCASVCSLYQLENYCN
**** *.***:*******
7
8
9 Conservation-based inference. Conserved sites: important for the function or structure; not allowed to mutate; “slow evolving” sites (low rate of evolution). Variable sites: less important (usually); change more easily; “fast evolving” sites (high rate of evolution).
10 Detecting conservation: evolutionary rates. Rate (~speed) = distance / time. Distance (d) = number of substitutions per site. Time = 2 * #years (doubled because the two sequences evolved independently).
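The rate definition on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `substitution_rate` and the example numbers (0.2 substitutions per site, 100 million years since the two lineages diverged) are assumptions made for the example, not values from a real dataset.

```python
def substitution_rate(distance, divergence_years):
    """Rate = distance / (2 * time).

    distance         -- substitutions per site between the two sequences
    divergence_years -- years since the two lineages split; doubled because
                        both lineages accumulated changes independently
    """
    if divergence_years <= 0:
        raise ValueError("divergence_years must be positive")
    return distance / (2.0 * divergence_years)


# Hypothetical pair of sequences: 0.2 substitutions/site, split 100 million years ago.
rate = substitution_rate(0.2, 100e6)
print(f"{rate:.1e} substitutions/site/year")
```

With these assumed inputs the result is on the order of 10^-9 substitutions per site per year.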
11 Mean rate of nucleotide substitution in mammalian genomes. Evolution is a very slow process at the molecular level (“Nothing happens…”): ~10^-9 substitutions/site/year.
12 Rate computation
Human          DMAAHAM
Chimp          DEAAGGC
Cow            DQAAWAP
Fish           DLAACAL
S. cerevisiae  DDGAFAA
S. pombe       DDGALGE
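To get a feel for which columns of the toy alignment above are conserved, one can simply count the distinct residues in each column. This is a crude proxy chosen for illustration, not a site-specific rate estimate: it ignores the phylogenetic tree and the different replacement probabilities between amino acids. The function name is hypothetical.

```python
# Toy alignment from the slide (six species, seven sites).
ALIGNMENT = {
    "Human": "DMAAHAM",
    "Chimp": "DEAAGGC",
    "Cow": "DQAAWAP",
    "Fish": "DLAACAL",
    "S. cerevisiae": "DDGAFAA",
    "S. pombe": "DDGALGE",
}


def distinct_residues_per_site(alignment):
    """Return, for each column, how many different residues it contains."""
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    # zip(*seqs) walks the alignment column by column.
    return [len(set(column)) for column in zip(*seqs)]


counts = distinct_residues_per_site(ALIGNMENT)
for position, n in enumerate(counts, start=1):
    print(f"site {position}: {n} distinct residue(s)")
```

Columns with a single residue (sites 1 and 4 here) are fully conserved in this sample; columns with many residues are the variable ones.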
13 Site-specific rate computation method
14 Using the ConSeq server
15 ConSeq results:
16 Crash course in protein structure
17 Why protein structure? Each protein has a particular 3D structure that determines its function. Protein structure is better conserved than protein sequence and more closely related to function. Analyzing a protein structure is more informative than analyzing its sequence for function inference.
18 PDB: Protein Data Bank. Holds 3D models of biological macromolecules (protein, RNA, DNA, small molecules). All data are available to the public. X-ray crystal structures (84%), NMR models (16%). Submitted by biologists and biochemists from around the world.
19 PDB model. Defines the 3D coordinates (x, y, z) of each of the atoms in one or more molecules (i.e., a complex). There are models of proteins, protein complexes, proteins and DNA, protein segments, etc. The models also include the positions of ligand molecules, solvent molecules, metal ions, etc. PDB code: one digit + 3 digits/letters (e.g., 1a14).
20 The PDB file – text format
21 The PDB file – text format. ATOM: usually protein or DNA. HETATM: usually ligand, ion, water. Fields in each record: atom number, atom identity, residue identity, chain, residue number, the X, Y, Z coordinates of the atom, temperature factor.
22 Viewing structures: Wireframe, Spacefill, Backbone.
23 Conservation in the structure. Protein (hydrophobic) core: structurally constrained, usually conserved. Active site: functionally constrained, usually conserved. Surface loops: tolerant to mutations, usually variable.
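The fields listed on this slide sit at fixed column positions in each ATOM/HETATM line. The sketch below reads them by slicing, following the column layout of the legacy PDB text format. The sample record is made up for illustration (the coordinates and temperature factor are not taken from a real entry), and `parse_atom_record` is a hypothetical helper, not part of any library.

```python
# A made-up ATOM record, assembled field by field so the columns are explicit.
SAMPLE_LINE = (
    "ATOM  "      # cols 1-6   record type
    "    1"       # cols 7-11  atom number
    " "           # col 12     blank
    " N  "        # cols 13-16 atom identity (name)
    " "           # col 17     alternate location
    "MET"         # cols 18-20 residue identity
    " "           # col 21     blank
    "A"           # col 22     chain
    "   1"        # cols 23-26 residue number
    "    "        # cols 27-30 insertion code + blanks
    "  27.340"    # cols 31-38 X
    "  24.430"    # cols 39-46 Y
    "   2.614"    # cols 47-54 Z
    "  1.00"      # cols 55-60 occupancy
    "  9.67"      # cols 61-66 temperature factor
)


def parse_atom_record(line):
    """Parse one ATOM/HETATM line; return None for any other record type."""
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        return None
    return {
        "record": record,
        "atom_number": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "residue_name": line[17:20].strip(),
        "chain": line[21],
        "residue_number": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "temperature_factor": float(line[60:66]),
    }


atom = parse_atom_record(SAMPLE_LINE)
print(atom)
```

For real work a dedicated structure-parsing library is usually a safer choice than hand slicing, since actual files contain many more record types and edge cases.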
24 Same algorithm as ConSeq, but here the results are projected onto the 3D structure of the protein
25 Using the ConSurf server
26 ConSurf example: potassium channel An integral membrane protein with sequence similarity to all known K+ channels, particularly in the pore region PDB code: 1bl8, chain A
27 ConSurf results:
28 ConSurf example: potassium channel. Alignment of homologs found by PSI-BLAST:
29 ConSurf results:
30 ConSurf example: potassium channel. Neighbor-Joining reconstructed phylogenetic tree:
31 ConSurf results:
32 Conservation scores: The scores are standardized: the average score for all residues is zero, and the standard deviation is one. The lowest score represents the most conserved site in the protein; negative values: slowly evolving (= low evolutionary rate), conserved sites. The highest score represents the most variable site in the protein; positive values: rapidly evolving (= fast evolutionary rate), variable sites.
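The standardization described on this slide (mean zero, standard deviation one) is an ordinary z-score transformation. The sketch below applies it to a made-up list of per-site rates. The input values are hypothetical, and the use of the population standard deviation rather than the sample one is an assumption of this example.

```python
from statistics import mean, pstdev


def standardize(rates):
    """Shift and scale raw per-site rates to mean 0 and standard deviation 1."""
    mu = mean(rates)
    sigma = pstdev(rates)  # population standard deviation (an assumption here)
    if sigma == 0:
        raise ValueError("all sites have the same rate; cannot standardize")
    return [(r - mu) / sigma for r in rates]


# Hypothetical raw evolutionary rates for five sites.
raw_rates = [0.2, 0.5, 1.0, 1.8, 3.0]
scores = standardize(raw_rates)
for site, score in enumerate(scores, start=1):
    label = "conserved" if score < 0 else "variable"
    print(f"site {site}: {score:+.2f} ({label})")
```

After the transformation the slowest site gets the most negative score and the fastest site the most positive one, which is how the scores on the slide are read.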
33 ConSurf results: amino-acid conservation scores
34 ConSurf result with First Glance in Jmol:
35 ConSeq/ConSurf user intervention (advanced options)
1. Method of calculating the amino acid conservation scores: Bayesian/Max Likelihood
2. Enter your own MSA file
3. Multiply Align Sequences using: MUSCLE/CLUSTALW
4. Collect the Homologues from: SWISS-PROT/UniProt
5. Max. Number of Homologues (default = 50)
6. No. of PSI-BLAST Iterations (default = 1)
7. PSI-BLAST E-value Cutoff (default = )
8. Model of substitution for proteins: JTT/Dayhoff/mtREV/cpREV/WAG
9. Enter your own PDB file
10. Enter your own TREE file
36 Codon-level selection. ConSeq/ConSurf: compute the evolutionary rate of amino-acid sites → the data are amino acids. But codons encode amino acids: 61 codons vs. 20 amino acids! Aren’t we losing information?
37 Darwin – the theory of natural selection Adaptive evolution: Favorable traits will become more frequent in the population
38 M. Kimura – the neutral theory of molecular evolution. Most of the DNA variation between species is neutral with regard to the phenotype. Selection operates to preserve a trait.
39 Synonymous (silent) and non-synonymous (non-silent) substitutions. Silent. Non-silent …
40 Synonymous vs. non-synonymous substitutions. UUU → UUC (Phe → Phe): synonymous. UUU → CUU (Phe → Leu): non-synonymous. Synonymous substitutions = silent substitutions. Non-synonymous substitutions = non-silent or amino-acid altering substitutions.
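The two examples on this slide can be reproduced by translating each codon and comparing the amino acids. The sketch below builds the standard (nuclear) genetic code from its usual compact string representation. The function names are hypothetical, and other genetic codes (e.g., mitochondrial) would need a different table.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_codon(codon):
    """Translate one DNA or RNA codon; '*' marks a stop codon."""
    return CODON_TABLE[codon.upper().replace("U", "T")]


def classify_substitution(before, after):
    """Label a codon change as synonymous or non-synonymous.

    A change to or from a stop codon is reported as non-synonymous here.
    """
    if translate_codon(before) == translate_codon(after):
        return "synonymous"
    return "non-synonymous"


print(classify_substitution("UUU", "UUC"))  # Phe -> Phe
print(classify_substitution("UUU", "CUU"))  # Phe -> Leu

# 61 sense codons encode 20 amino acids, so many codon changes are silent.
sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]
print(len(sense_codons), len({CODON_TABLE[c] for c in sense_codons}))
```

The redundancy visible in the last line is what makes it possible to separate silent from amino-acid altering changes at the codon level.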
41 Synonymous vs. non-synonymous substitutions. For most proteins, it is observed that the rate of synonymous substitutions is much higher than the non-synonymous rate. This is called purifying selection (= conservation; this is what ConSeq/ConSurf are computing).
42 Synonymous vs. non-synonymous substitutions Structural proteins
43 Saturation of synonymous substitutions Histone H4 between human and wheat: saturation of synonymous substitutions
44 Synonymous vs. non-synonymous substitutions. There are rare cases where the non-synonymous rate is much higher than the synonymous rate. This is called positive selection.
45 Positive selection. The hypothesis: promotes the fitness of the organism. Examples: proteins of the immune system; pathogen proteins evading the host immune system; pathogen proteins that are drug targets; proteins that are products of gene duplication; proteins involved in the reproduction system.
46 Computing synonymous and non-synonymous rates. Codon-based MSA: translate DNA to amino acids, align, then backtrack to the DNA but keep the alignment structure. Phylogenetic tree: 5 replacements in 10 positions between human and chimp is a lot, but between human and cucumber is nothing. Different replacement probabilities between two amino acids: Lys ↔ Arg ≠ Lys ↔ Cys. Positive selection occurs at only a few sites!
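The "backtrack to the DNA" step of a codon-based MSA can be sketched as follows: once the protein sequences are aligned, each residue is replaced by its original codon and each gap by three gap characters. The sequences and the function name `thread_codons` below are made up for illustration. The sketch assumes the protein alignment was produced from exactly these coding sequences, with no stop codon included.

```python
def thread_codons(aligned_protein, dna):
    """Map an aligned protein sequence back onto its coding DNA.

    aligned_protein -- protein sequence with '-' for alignment gaps
    dna             -- the unaligned coding sequence that encodes it
    """
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    residues = aligned_protein.replace("-", "")
    if len(dna) % 3 != 0 or len(codons) != len(residues):
        raise ValueError("DNA length does not match the protein sequence")
    remaining = iter(codons)
    # A gap becomes three gap characters; a residue consumes the next codon.
    return "".join("---" if ch == "-" else next(remaining) for ch in aligned_protein)


# Hypothetical two-sequence protein alignment and the matching coding DNA.
protein_alignment = {"seqA": "MLK", "seqB": "M-K"}
coding_dna = {"seqA": "ATGCTGAAA", "seqB": "ATGAAA"}

for name, aligned in protein_alignment.items():
    print(name, thread_codons(aligned, coding_dna[name]))
```

Because gaps are always inserted in multiples of three, the reading frame is preserved in every row of the resulting codon alignment.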
47 Inferring positive selection. Divide the rate of non-silent substitutions (Ka) by the rate of silent substitutions (Ks).
48 Inferring positive selection. Basic assumptions: selection score (Ka/Ks) > 1 → positive selection; selection score (Ka/Ks) < 1 → purifying selection; selection score (Ka/Ks) ≈ 1 → neutral evolution.
49 Not so fast! We explicitly assume there is positive selection in the data. There is a good chance our model will find a few positively selected sites whatever the case. Is this really indicative of positive selection or plain randomness? So, maybe there’s no positive selection after all.
50 Statistics helps us to compare between hypotheses. H0: there’s no positive selection. H1: there is positive selection. H0: compute the probability (likelihood) of the data using a model that does not account for positive selection. H1: compute the probability (likelihood) of the data using a model that does account for positive selection. Perform a likelihood ratio test (LRT).
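The likelihood ratio test named on this slide compares the two log-likelihoods through the statistic 2 × (lnL1 − lnL0). The sketch below evaluates it against a chi-square distribution with one degree of freedom. The single degree of freedom is an assumption of this example (it fits a pair of nested models that differ by one free parameter), and the log-likelihood values are invented.

```python
import math


def likelihood_ratio_test(lnl_null, lnl_alt):
    """Return (statistic, p_value) for a 1-degree-of-freedom LRT.

    lnl_null -- log-likelihood under H0 (no positive selection)
    lnl_alt  -- log-likelihood under H1 (positive selection allowed)
    """
    statistic = max(2.0 * (lnl_alt - lnl_null), 0.0)
    # For one degree of freedom the chi-square tail probability
    # reduces to erfc(sqrt(statistic / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical log-likelihoods from the two model fits.
statistic, p_value = likelihood_ratio_test(lnl_null=-1000.0, lnl_alt=-995.0)
print(f"LRT statistic = {statistic:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("H0 rejected: the model with positive selection fits significantly better")
else:
    print("H0 not rejected: no significant evidence for positive selection")
```

Only when H0 is rejected is it reasonable to interpret individual sites with Ka/Ks > 1 as candidates for positive selection.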
51
52 Using the Selecton server
53 Input = a coding sequence at the codon level. The user must provide the sequences (no PSI-BLAST option). The sequences’ lengths must be divisible by 3 (ORF) and the sequences must not include any stop codons. The alignment should be a codon alignment (e.g., from RevTrans).
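The input requirements on this slide are easy to check before submitting sequences. The sketch below tests that a sequence length is a multiple of 3 and that no in-frame stop codon is present. It assumes the standard nuclear genetic code (mitochondrial codes use different stop codons), and the function name is hypothetical; it is not part of the server.

```python
# Stop codons of the standard nuclear genetic code (an assumption of this sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_coding_sequence(seq):
    """Return a list of problems found in an unaligned coding sequence."""
    seq = seq.upper().replace("U", "T")
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of 3")
    # Scan complete in-frame codons only.
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        if codon in STOP_CODONS:
            problems.append(f"stop codon {codon} at codon {i // 3 + 1}")
    return problems


# Hypothetical sequences: one valid, one with an internal stop codon.
for name, seq in {"ok": "ATGAAACTG", "bad": "ATGTAAAAA"}.items():
    issues = check_coding_sequence(seq)
    print(name, "passes" if not issues else issues)
```

An empty list means the sequence meets both requirements; anything else should be fixed before building the codon alignment.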
54 Similar to ConSurf. Optional settings: nuclear/mitochondrial, different species. Default run: M8 (H1) and M8a (H0).
55 Selecton Example: HIV Protease The Protease is an essential enzyme for viral infectivity PDB ID: 1hxw
56 Selecton Results:
57 Selecton Results:
58 Selecton results:
59 Selection scores (Ka/Ks): The scores are normalized. Ka/Ks > 1: positively selected site. Ka/Ks < 1: site under purifying selection.
60 Color coding scheme of Selecton. Coloring scheme: used for visualization and based on the continuous Ka/Ks scores. The color grades (1-7): 1 for sites under strong positive selection (yellow); 7 for sites under strong purifying selection (bordeaux).
61