1 Prediction of functional/structural sites in a protein using conservation and hyper-variation (ConSeq, ConSurf, Selecton)

2 Empirical findings: variation among genes. “Important” proteins evolve slower than “unimportant” ones.

3 Histone H4 protein

4 Empirical findings: variation among genes. Functional/structural regions evolve slower than nonfunctional/nonstructural regions.

5 Conservation = functional/structural importance/constraints

6 Alignment: preproinsulin
Xenopus MALWMQCLP-LVLVLLFSTPNTEALANQHL
Bos     MALWTRLRPLLALLALWPPPPARAFVNQHL
**** : * *.*: *:..* :. *:****

Xenopus CGSHLVEALYLVCGDRGFFYYPKIKRDIEQ
Bos     CGSHLVEALYLVCGERGFFYTPKARREVEG
**************:******** :*::*

Xenopus AQVNGPQDNELDG-MQFQPQEYQKMKRGIV
Bos     PQVG---ALELAGGPGAGGLEGPPQKRGIV
.**. ** * * *****

Xenopus EQCCHSTCSLFQLENYCN
Bos     EQCCASVCSLYQLENYCN
**** *.***:*******

9 Conservation-based inference
Conserved sites:
- Important for the function or structure
- Not allowed to mutate
- “Slow evolving” sites: low rate of evolution
Variable sites:
- Less important (usually)
- Change more easily
- “Fast evolving” sites: high rate of evolution

10 Detecting conservation: evolutionary rates
- Rate (~speed) = distance / time
- Distance (d) = number of substitutions per site
- Time = 2 × number of years (doubled because the two sequences evolved independently since their common ancestor)
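The rate definition above can be written out as a small calculation. This is only a sketch: the function name `substitution_rate` and the example numbers are hypothetical and do not come from the slides.

```python
def substitution_rate(distance, divergence_years):
    """Rate in substitutions per site per year: r = d / (2 * T).

    distance: substitutions per site between the two sequences
    divergence_years: years since the two lineages split
    """
    if divergence_years <= 0:
        raise ValueError("divergence time must be positive")
    # Time is doubled because both lineages accumulated changes independently.
    return distance / (2.0 * divergence_years)


# Hypothetical example: 0.1 substitutions/site, split 50 million years ago.
rate = substitution_rate(0.1, 50e6)
print(f"{rate:.1e} substitutions/site/year")
```

With these illustrative inputs the result is on the order of 10^-9 substitutions/site/year.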

11 Mean Rate of Nucleotide Substitution in Mammalian Genomes. Evolution is a very slow process at the molecular level (“Nothing happens…”): ~10^-9 substitutions/site/year

12 Rate computation
Site           1 2 3 4 5 6 7
Human          D M A A H A M
Chimp          D E A A G G C
Cow            D Q A A W A P
Fish           D L A A C A L
S. cerevisiae  D D G A F A A
S. pombe       D D G A L G E
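To get a feel for which columns of this toy alignment are conserved, one can count the distinct residues per site. This is a deliberately crude proxy and an assumption of this sketch: it ignores the phylogenetic tree and the substitution model, so it is not the site-specific rate method used by ConSeq. The name `column_variability` is hypothetical.

```python
# The toy alignment from the slide, one 7-residue sequence per species.
alignment = {
    "Human": "DMAAHAM",
    "Chimp": "DEAAGGC",
    "Cow": "DQAAWAP",
    "Fish": "DLAACAL",
    "S. cerevisiae": "DDGAFAA",
    "S. pombe": "DDGALGE",
}


def column_variability(sequences):
    """Return the number of distinct residues observed in each column."""
    return [len(set(column)) for column in zip(*sequences)]


counts = column_variability(list(alignment.values()))
for site, n_distinct in enumerate(counts, start=1):
    print(f"site {site}: {n_distinct} distinct residue(s)")
```

Sites 1 and 4 show a single residue in every species (fully conserved), while sites 5 and 7 differ in every species.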

13 http://conseq.tau.ac.il Site-specific rate computation method

14 Using the ConSeq server

15 ConSeq results:

16 Crash course in protein structure

17 Why protein structure?
- Each protein has a particular 3D structure that determines its function
- Protein structure is better conserved than protein sequence and more closely related to function
- Analyzing a protein structure is more informative than analyzing its sequence for function inference

18 PDB: Protein Data Bank (http://www.rcsb.org)
- Holds 3D models of biological macromolecules (protein, RNA, DNA, small molecules)
- All data are available to the public
- X-ray crystal structures (84%), NMR models (16%)
- Submitted by biologists and biochemists from around the world

19 PDB model
- Defines the 3D coordinates (x, y, z) of each of the atoms in one or more molecules (i.e., a complex)
- There are models of proteins, protein complexes, proteins with DNA, protein segments, etc.
- The models also include the positions of ligand molecules, solvent molecules, metal ions, etc.
- PDB code: one digit followed by three alphanumeric characters (e.g., 1a14)

20 The PDB file – text format

21 The PDB file – text format
- ATOM: usually protein or DNA
- HETATM: usually ligand, ion, water
- Each record gives the atom number, atom identity, residue identity, chain, residue number, the X, Y, Z coordinates of the atom, and the temperature factor
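The fields listed above sit at fixed column positions in each ATOM/HETATM record, so they can be read by slicing the line. The sketch below follows the standard PDB fixed-width layout; the function name `parse_atom_record` and the sample record are made up for illustration and are not taken from any real structure.

```python
def parse_atom_record(line):
    """Parse one fixed-width ATOM/HETATM record into a dictionary."""
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        raise ValueError(f"not an ATOM/HETATM record: {record!r}")
    return {
        "record": record,
        "atom_number": int(line[6:11]),        # columns 7-11
        "atom_name": line[12:16].strip(),      # columns 13-16
        "residue_name": line[17:20].strip(),   # columns 18-20
        "chain": line[21],                     # column 22
        "residue_number": int(line[22:26]),    # columns 23-26
        "x": float(line[30:38]),               # columns 31-38
        "y": float(line[38:46]),               # columns 39-46
        "z": float(line[46:54]),               # columns 47-54
        "temperature_factor": float(line[60:66]),  # columns 61-66
    }


# Hypothetical record, assembled with explicit widths so the columns line up.
sample = (
    f"{'ATOM':<6}{1:>5} {' N':<4} {'MET'} {'A'}{1:>4}    "
    f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}{1.00:6.2f}{9.67:6.2f}"
)
atom = parse_atom_record(sample)
print(atom["atom_name"], atom["residue_name"], atom["chain"], atom["x"])
```

For real work a dedicated parser is usually a better choice, since real files contain many other record types and edge cases.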

22 Viewing structures: Wireframe, Spacefill, Backbone

23 Conservation in the structure
- Protein core (hydrophobic core): structurally constrained, usually conserved
- Active site: functionally constrained, usually conserved
- Surface loops: tolerant to mutations, usually variable

24 http://consurf.tau.ac.il Same algorithm as ConSeq, but here the results are projected onto the 3D structure of the protein

25 Using the ConSurf server

26 ConSurf example: potassium channel
- An integral membrane protein with sequence similarity to all known K+ channels, particularly in the pore region
- PDB code: 1bl8, chain A

27 ConSurf results:

28 ConSurf example: potassium channel. Alignment of homologs found by PSI-BLAST:

30 ConSurf example: potassium channel. Neighbor-Joining reconstructed phylogenetic tree:

32 Conservation scores:
- The scores are standardized: the average score for all residues is zero, and the standard deviation is one
- The lowest score represents the most conserved site in the protein. Negative values: slowly evolving (= low evolutionary rate), conserved sites
- The highest score represents the most variable site in the protein. Positive values: rapidly evolving (= high evolutionary rate), variable sites
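The standardization described above is an ordinary z-score transform. The sketch below assumes the population standard deviation is used and that the raw per-site rates are already available; the function name `standardize` and the input numbers are hypothetical.

```python
from statistics import mean, pstdev


def standardize(rates):
    """Rescale per-site rates to mean 0 and standard deviation 1."""
    mu = mean(rates)
    sigma = pstdev(rates)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("all sites have the same rate; cannot standardize")
    return [(r - mu) / sigma for r in rates]


# Hypothetical raw rates for four sites.
raw_rates = [0.2, 1.0, 3.0, 0.4]
scores = standardize(raw_rates)
for site, score in enumerate(scores, start=1):
    label = "conserved" if score < 0 else "variable"
    print(f"site {site}: {score:+.2f} ({label})")
```

After the transform, the most negative score marks the slowest-evolving site and the most positive score marks the fastest-evolving one, matching the reading of the scores given above.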

33 ConSurf results: amino-acid conservation scores

34 ConSurf result with First Glance in Jmol:

35 ConSeq/ConSurf user intervention (advanced options)
1. Method of calculating the amino acid conservation scores: Bayesian/Max Likelihood
2. Enter your own MSA file
3. Multiply Align Sequences using: MUSCLE/CLUSTALW
4. Collect the Homologues from: SWISS-PROT/UniProt
5. Max. Number of Homologues (default = 50)
6. No. of PSI-BLAST Iterations (default = 1)
7. PSI-BLAST E-value Cutoff (default = 0.001)
8. Model of substitution for proteins: JTT/Dayhoff/mtREV/cpREV/WAG
9. Enter your own PDB file
10. Enter your own TREE file

36 Codon-level selection
- ConSeq/ConSurf: compute the evolutionary rate of amino-acid sites → the data are amino acids
- But codons encode amino acids…
- 61 codons vs. 20 amino acids!
- Aren’t we losing information?

37 Darwin – the theory of natural selection. Adaptive evolution: favorable traits will become more frequent in the population.

38 M. Kimura – the neutral theory of molecular evolution. Most of the DNA variation between species is neutral with regard to the phenotype. Selection operates to preserve a trait.

39 Synonymous (silent) and non-synonymous (non-silent) substitutions. Silent, non-silent, …

40 Synonymous vs. non-synonymous substitutions
- UUU → UUC (Phe → Phe): synonymous
- UUU → CUU (Phe → Leu): non-synonymous
- Synonymous substitutions = silent substitutions
- Non-synonymous substitutions = non-silent or amino-acid-altering substitutions
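The two examples above can be checked mechanically by translating both codons and comparing the amino acids. The sketch below builds the standard genetic code from its conventional compact encoding (bases in T, C, A, G order); the function name `classify_substitution` is hypothetical, and other genetic codes, such as mitochondrial ones, would need a different table.

```python
from itertools import product

# Standard genetic code; "*" marks a stop codon. Codons are enumerated
# with bases in T, C, A, G order, matching the amino-acid string below.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon_from, codon_to):
    """Label a codon change as synonymous or non-synonymous."""
    # Accept RNA spelling (U) as well as DNA spelling (T).
    a = codon_from.upper().replace("U", "T")
    b = codon_to.upper().replace("U", "T")
    same_amino_acid = CODON_TABLE[a] == CODON_TABLE[b]
    return "synonymous" if same_amino_acid else "non-synonymous"


print(classify_substitution("UUU", "UUC"))  # Phe -> Phe
print(classify_substitution("UUU", "CUU"))  # Phe -> Leu
```

The same table also confirms the count of 61 sense codons: 64 codons minus 3 stop codons.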

41 Synonymous vs. non-synonymous substitutions. For most proteins, it is observed that the rate of synonymous substitutions is much higher than the non-synonymous rate. This is called purifying selection (= conservation; this is what ConSeq/ConSurf compute).

42 Synonymous vs. non-synonymous substitutions Structural proteins

43 Saturation of synonymous substitutions Histone H4 between human and wheat: saturation of synonymous substitutions

44 Synonymous vs. non-synonymous substitutions. There are rare cases where the non-synonymous rate is much higher than the synonymous rate. This is called positive selection.

45 Positive selection
The hypothesis: promotes the fitness of the organism
Examples:
- Proteins of the immune system
- Pathogen proteins evading the host immune system
- Pathogen proteins that are drug targets
- Proteins that are products of gene duplication
- Proteins involved in the reproduction system

46 Computing synonymous and non-synonymous rates
- Codon-based MSA: translate DNA to amino acids, align, then backtrack to the DNA while keeping the alignment structure
- Phylogenetic tree: 5 replacements in 10 positions between human and chimp is a lot, but between human and cucumber is nothing
- Different replacement probabilities between two amino acids: Lys → Arg ≠ Lys → Cys
- Positive selection occurs at only a few sites!
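The "backtrack to the DNA" step of the codon-based MSA can be sketched as threading each sequence's codons onto its aligned protein, writing three gap characters wherever the protein alignment has a gap. The function name `thread_codons` and the toy sequences are hypothetical; the sketch assumes the coding sequence has no stop codon and does not check that each codon actually translates to the aligned residue.

```python
def thread_codons(aligned_protein, cds):
    """Map an unaligned coding sequence onto a gapped protein alignment row."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(ch != "-" for ch in aligned_protein)
    if len(codons) != n_residues:
        raise ValueError("number of codons does not match number of residues")
    remaining = iter(codons)
    # Each residue consumes one codon; each protein gap becomes a codon gap.
    return "".join(
        "---" if ch == "-" else next(remaining) for ch in aligned_protein
    )


# Hypothetical toy example: protein row "MK-L" with its 3-codon CDS.
print(thread_codons("MK-L", "ATGAAACTG"))
```

Applying this to every row of the protein alignment yields a DNA alignment in which gaps never split a codon.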

47 Inferring positive selection. Divide the rate of non-silent substitutions (Ka) by the rate of silent substitutions (Ks).

48 Inferring positive selection
Basic assumptions:
- Selection score (Ka/Ks) > 1 → positive selection
- Selection score (Ka/Ks) < 1 → purifying selection
- Selection score (Ka/Ks) ≈ 1 → neutral evolution

49 Not so fast!
- We explicitly assume there is positive selection in the data
- There is a good chance our model will find a few positively selected sites whatever the case
- Is this really indicative of positive selection or plain randomness?
So, maybe there’s no positive selection after all

50 Statistics helps us compare hypotheses
- H0: There’s no positive selection
- H1: There is positive selection
- H0: compute the probability (likelihood) of the data using a model that does not account for positive selection
- H1: compute the probability (likelihood) of the data using a model that does account for positive selection
- Perform a likelihood ratio test (LRT)
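The final step above, the likelihood ratio test, can be sketched as follows. The statistic is twice the difference in log-likelihoods, compared here against a chi-square distribution with one degree of freedom. The single degree of freedom is an assumption of this sketch; the appropriate reference distribution depends on the pair of models being compared. The function name and the log-likelihood values are hypothetical.

```python
import math


def likelihood_ratio_test(lnL_null, lnL_alt):
    """Return (statistic, p-value) for an LRT with 1 degree of freedom.

    lnL_null: log-likelihood under H0 (no positive selection)
    lnL_alt:  log-likelihood under H1 (positive selection allowed)
    """
    statistic = max(0.0, 2.0 * (lnL_alt - lnL_null))
    # For a chi-square variable with 1 d.f., P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical log-likelihoods from the two model fits.
stat, p = likelihood_ratio_test(lnL_null=-1000.0, lnL_alt=-995.0)
print(f"LRT statistic = {stat:.2f}, p = {p:.4f}")
if p < 0.05:
    print("H0 rejected: the model with positive selection fits better")
```

Only when H0 is rejected is it reasonable to go on and interpret individual sites with Ka/Ks > 1 as positively selected.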

51 http://selecton.tau.ac.il

52 Using the Selecton server

53 Input = a coding sequence at the codon level
- The user must provide the sequences – no PSI-BLAST option
- The sequences’ lengths must be divisible by 3 (ORF) and must not include any stop codons
- An alignment should be a codon alignment (RevTrans)
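The two input requirements above (length divisible by 3, no stop codons) are easy to check before submitting sequences. The sketch below is a local pre-check only, not part of Selecton itself; the function name `check_coding_sequence` is hypothetical, and the stop codons assume the standard nuclear genetic code.

```python
# Assumption: standard nuclear genetic code; mitochondrial codes differ.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_coding_sequence(seq):
    """Return a list of problems found in an ungapped coding sequence."""
    seq = seq.upper().replace("U", "T")
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of 3")
    # Only scan complete codons; a trailing partial codon is reported above.
    usable = len(seq) - len(seq) % 3
    for start in range(0, usable, 3):
        codon = seq[start:start + 3]
        if codon in STOP_CODONS:
            problems.append(f"stop codon {codon} at codon {start // 3 + 1}")
    return problems


print(check_coding_sequence("ATGAAACTG"))     # clean sequence
print(check_coding_sequence("ATGTAACTGA"))    # internal stop, bad length
```

An empty list means the sequence passes both checks; a terminal stop codon is reported too, since the requirement is that no stop codons are present.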

54 Similar to ConSurf (optional). Nuclear/mitochondrial, different species. Default run: M8 (H1) and M8a (H0).

55 Selecton Example: HIV Protease The Protease is an essential enzyme for viral infectivity PDB ID: 1hxw

56 Selecton Results:

57 Selecton Results:

58 Selecton results:

59 Selection scores (Ka/Ks):
- The scores are normalized
- Ka/Ks > 1: positively selected site
- Ka/Ks < 1: site under purifying selection

60 Color coding scheme of Selecton
- Used for visualization and is based on the continuous Ka/Ks scores
- The color grades (1-7): 1 for sites under strong positive selection (yellow); 7 for sites under strong purifying selection (bordeaux)
