1 HW Clarifications: Identity and Homology. Homology implies shared ancestry. Partial sequence identity does not necessarily imply homology. A high coverage of sequence identity can imply homology.
2 HW Clarifications: Insertions and Deletions
3 Prediction of functional/structural sites in a protein using conservation and hyper-variation (ConSeq, ConSurf, Selecton)
4 Empirical findings of conservation variation among sites: functional/structural sites evolve slower than nonfunctional/nonstructural sites
5 Conservation = functional/structural importance
6 Histone 3 protein
7 Alignment of pre-pro-insulin
Xenopus MALWMQCLP-LVLVLLFSTPNTEALANQHL
Bos     MALWTRLRPLLALLALWPPPPARAFVNQHL
        **** : * *.*: *:..* :. : ****
Xenopus CGSHLVEALYLVCGDRGFFYYPKIKRDIEQ
Bos     CGSHLVEALYLVCGERGFFYTPKARREVEG
        **************:***** ** :*::*
Xenopus AQVNGPQDNELDG-MQFQPQEYQKMKRGIV
Bos     PQVG---ALELAGGPGAGGLEGPPQKRGIV
        .**. ** * * *****
Xenopus EQCCHSTCSLFQLENYCN
Bos     EQCCASVCSLYQLENYCN
        **** *.***:*******
8
9
10 Conservation-based inference. Conserved sites: important for the function or structure; not allowed to mutate; “slow evolving” sites (low rate of evolution). Variable sites: less important (usually); change more easily; “fast evolving” sites (high rate of evolution).
11 Detecting conservation: evolutionary rates. Rate = distance/time. Distance (d) = number of substitutions per site. Time = 2 × number of years since divergence (doubled because the two sequences evolved independently).
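As a worked illustration of the rate definition above, here is a minimal sketch of the arithmetic. The function name and the example numbers (0.2 substitutions per site, 80 million years since divergence) are hypothetical and were chosen only to show the calculation.

```python
def evolutionary_rate(distance, divergence_years):
    """Rate = distance / time, where time = 2 * years since divergence."""
    if divergence_years <= 0:
        raise ValueError("divergence_years must be positive")
    # Both lineages evolved independently after the split,
    # so the elapsed time is counted twice.
    return distance / (2 * divergence_years)


# Hypothetical example values, not taken from the slides.
rate = evolutionary_rate(0.2, 80e6)
print(f"{rate:.3e} substitutions per site per year")
```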
12 Rate computation. Inputs: MSA, phylogeny, evolutionary model. Example MSA:
Human         DMAAHAM
Chimp         DEAAGGC
Cow           DQAAWAP
Fish          DLAACAL
S. cerevisiae DDGAFAA
S. pombe      DDGALGE
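To get a feel for which columns of the example MSA vary, the sketch below scores each column by Shannon entropy. This is a deliberately simplified stand-in: it ignores the phylogeny and the evolutionary model, so it is not the rate computation described on the slide, only a rough per-column variability measure. The function name is hypothetical.

```python
import math
from collections import Counter

# The toy alignment from the slide.
msa = {
    "Human": "DMAAHAM",
    "Chimp": "DEAAGGC",
    "Cow": "DQAAWAP",
    "Fish": "DLAACAL",
    "S. cerevisiae": "DDGAFAA",
    "S. pombe": "DDGALGE",
}


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Transpose the alignment so that each tuple is one column.
columns = list(zip(*msa.values()))
entropies = [column_entropy(col) for col in columns]

for position, (col, h) in enumerate(zip(columns, entropies), start=1):
    print(position, "".join(col), round(h, 3))
```

Columns with entropy 0 are fully conserved in this toy alignment; higher values indicate more variable columns.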
13 Site-specific rate computation tool
14 Locating the active site of Pyruvate kinase Glycolysis pathway
15
16
17
18 Conservation scores. The scores are standardized: the average score of all residues is 0 and the standard deviation is 1. Negative values: slowly evolving (= low evolutionary rate), conserved sites; the most conserved site in the protein has the lowest score. Positive values: rapidly evolving (= fast evolutionary rate), variable sites; the most variable site in the protein has the highest score. Scores are relative to the protein and cannot be compared between different proteins!
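The standardization described above can be sketched as a z-score transform. The raw rates below are hypothetical, and the use of the population standard deviation is an assumption; the slides do not say which variant the tools use.

```python
from statistics import mean, pstdev


def standardize(rates):
    """Rescale rates so that their mean is 0 and standard deviation is 1."""
    mu = mean(rates)
    sigma = pstdev(rates)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("all rates are identical; cannot standardize")
    return [(r - mu) / sigma for r in rates]


# Hypothetical per-site rates for a five-residue protein.
raw_rates = [0.1, 0.4, 1.0, 2.5, 0.2]
scores = standardize(raw_rates)

most_conserved = scores.index(min(scores))  # lowest score
most_variable = scores.index(max(scores))   # highest score
print(scores, most_conserved, most_variable)
```

Because each protein is rescaled by its own mean and standard deviation, a score of -1 in one protein does not correspond to the same absolute rate as -1 in another.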
19
20 SWISS-PROT
21 Combining protein structure. Each protein has a particular 3D structure that determines its function. Protein structure is better conserved than protein sequence and more closely related to function. Analyzing a protein structure is more informative than analyzing its sequence for function inference.
22 Conservation in the structure. Protein core: structurally constrained, usually conserved. Active site: functionally constrained, usually conserved. Surface: tolerant to mutations, usually variable. (Figure labels: core, surface, active site.)
23 Same algorithm as ConSeq, but here the results are projected onto the 3D structure of the protein
24 The structure-function of the potassium channel (figure labels: transmembrane region, cytoplasm)
25
26
27
28
29 ConSeq/ConSurf user intervention (advanced options)
1. Choosing the method for calculating the amino-acid conservation scores (Bayesian/Maximum Likelihood)
2. Entering your own MSA file
3. Performing the MSA using MUSCLE or CLUSTALW
4. Collecting the homologs from SWISS-PROT or UniProt
5. Max. number of homologs (50)
6. No. of PSI-BLAST iterations (1)
7. PSI-BLAST E-value cutoff (0.001)
8. Model of substitution for proteins (JTT/Dayhoff/mtREV/cpREV/WAG)
9. Entering your own PDB file
10. Entering your own TREE file
30 Codon-level selection. ConSeq/ConSurf compute the evolutionary rate of amino-acid sites → the data are amino acids, so they compute only the rate of non-synonymous substitutions. UUU → UUC (Phe → Phe): synonymous. UUU → CUU (Phe → Leu): non-synonymous.
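The two codon examples above can be checked with a short sketch that translates both codons using the standard genetic code and compares the resulting amino acids. The function names are hypothetical, and the standard code is assumed (some genomes, such as mitochondrial ones, use other tables).

```python
from itertools import product

# Standard genetic code in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(codon):
    """Translate one DNA or RNA codon to a one-letter amino acid."""
    return CODON_TABLE[codon.upper().replace("U", "T")]


def classify(before, after):
    """Label a codon change as synonymous or non-synonymous."""
    same = translate(before) == translate(after)
    return "synonymous" if same else "non-synonymous"


print(classify("UUU", "UUC"))  # Phe -> Phe
print(classify("UUU", "CUU"))  # Phe -> Leu
```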
31 Synonymous vs. non-synonymous substitutions. For most proteins, the rate of synonymous substitutions is much higher than the non-synonymous rate. This is called purifying selection (= conservation in ConSeq/ConSurf).
32 Synonymous vs. non-synonymous substitutions. There are rare cases where the non-synonymous rate is much higher than the synonymous rate. This is called positive (Darwinian) selection.
33 Positive selection. The hypothesis: it promotes the fitness of the organism. Examples: pathogen proteins evading the host immune system; proteins of the immune system detecting pathogen proteins; pathogen proteins that are drug targets; proteins that are products of gene duplication; proteins involved in the reproductive system.
34 Computing synonymous and non-synonymous rates. Inputs: codon MSA, phylogeny, evolutionary model.
35 Inferring positive selection. Look at the ratio between the non-synonymous rate (Ka) and the synonymous rate (Ks).
36 Inferring positive selection. Ka/Ks < 1: purifying selection. Ka/Ks > 1: positive selection. Ka/Ks = 1: no selection (neutral).
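The three cases above amount to a simple threshold on the ratio. The sketch below is only that threshold, with a hypothetical function name and made-up rates; on its own it says nothing about whether a ratio above 1 is statistically meaningful.

```python
import math


def selection_regime(ka, ks):
    """Label a Ka/Ks ratio using the three cases from the slide."""
    if ks <= 0:
        raise ValueError("Ks must be positive to form the ratio")
    ratio = ka / ks
    if math.isclose(ratio, 1.0):
        return "no selection (neutral)"
    return "purifying selection" if ratio < 1.0 else "positive selection"


# Hypothetical (Ka, Ks) pairs for illustration.
for ka, ks in [(0.1, 0.5), (0.6, 0.3), (0.2, 0.2)]:
    print(ka, ks, selection_regime(ka, ks))
```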
37 Our evolutionary model assumes there is positive selection in the data. By chance alone we expect our model to find a few sites with Ka/Ks > 1. Is this really indicative of positive selection, or plain randomness? Maybe there’s no positive selection after all? (Inputs: codon MSA, phylogeny, evolutionary model.)
38 Solution: statistically compare between hypotheses. H0: there is no positive selection. H1: there is positive selection. For H0, compute the probability (likelihood) of the data using a model that does not account for positive selection. For H1, compute the probability (likelihood) of the data using a model that does account for positive selection. Perform a statistical test (likelihood ratio test) to accept or reject H0: P-value > 0.05, accept H0; P-value < 0.05, reject H0.
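A likelihood ratio test of this kind can be sketched as follows. The statistic is twice the difference in log-likelihoods, compared against a chi-square distribution. The degrees of freedom depend on how many extra parameters the H1 model has, which the slides do not state, so `df` is left as an input here; the log-likelihood values are hypothetical.

```python
import math


def lrt_p_value(lnl_h0, lnl_h1, df):
    """P-value of a likelihood ratio test under a chi-square approximation.

    Only df = 1 and df = 2 are handled, because their chi-square tail
    probabilities have simple closed forms in the standard library.
    """
    # Test statistic: 2 * (lnL(H1) - lnL(H0)), floored at zero.
    statistic = max(2.0 * (lnl_h1 - lnl_h0), 0.0)
    if df == 1:
        return math.erfc(math.sqrt(statistic / 2.0))
    if df == 2:
        return math.exp(-statistic / 2.0)
    raise ValueError("this sketch only handles df = 1 or df = 2")


# Hypothetical log-likelihoods from the two model fits.
p = lrt_p_value(lnl_h0=-1000.0, lnl_h1=-995.0, df=2)
print("reject H0" if p < 0.05 else "accept H0", p)
```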
39 Note: saturation of synonymous substitutions. Human and wheat are too evolutionarily remote: synonymous substitutions are saturated. Pick closer sequences for positive selection analysis. (Figure labels: Syn., Nonsyn.)
40
41 Selecton input: codon-level sequences! Coding sequences, ORFs only. No stop codons. If an MSA is provided, it must be codon aligned (RevTrans). The user must provide the sequences; there is no PSI-BLAST option.
42 Positive selection in the primate TRIM5α
43 Primate TRIM5α. TRIM5α from humans, rhesus monkeys, and African green monkeys are all unable to restrict retroviruses isolated from their own species, yet are able to restrict retroviruses from the other species. TRIM5α is an important natural barrier to cross-species retrovirus transmission. TRIM5α is in an antagonistic conflict with the retroviral capsid proteins. TRIM5α is under positive selection.
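Before submitting sequences, the input requirements above can be pre-checked with a sketch like the one below. It is not part of Selecton; the function name is hypothetical, the standard genetic code's stop codons are assumed, and every stop codon is flagged, including a terminal one, since the slide says only "no stop codons".

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def check_coding_sequence(seq):
    """Return a list of problems; an empty list means the checks passed."""
    seq = seq.upper().replace("U", "T")
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of 3")
    # Walk the sequence codon by codon, ignoring any trailing partial codon.
    full_length = len(seq) - len(seq) % 3
    for index, start in enumerate(range(0, full_length, 3), start=1):
        codon = seq[start:start + 3]
        if codon in STOP_CODONS:
            problems.append(f"stop codon {codon} at codon {index}")
    return problems


# Hypothetical sequences for illustration.
print(check_coding_sequence("ATGGCCAAA"))
print(check_coding_sequence("ATGTAAGCC"))
```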
44 Positive selection analysis
45 Positive selection analysis in Selecton: H0, H1
46 Comparing H0 and H1 in Selecton
47 Comparing H0 and H1 in Selecton
48
49 Selecton results:
50
51 Results: human/rhesus swaps at sites 332, (SPRY) significantly elevate human resistance to HIV and rhesus resistance to SIV