1 Prediction of functional/structural sites in a protein using conservation and hyper-variation (ConSeq, ConSurf, Selecton)

Presentation transcript:

1 Prediction of functional/structural sites in a protein using conservation and hyper-variation (ConSeq, ConSurf, Selecton)

2 Empirical findings: variation among genes. “Important” proteins evolve more slowly than “unimportant” ones.

3 Histone H4 protein

4 Empirical findings: variation within genes. Functional/structural regions evolve more slowly than nonfunctional/nonstructural regions.

5 Conservation = functional/structural importance/constraints

6 Alignment: preproinsulin
Xenopus MALWMQCLP-LVLVLLFSTPNTEALANQHL
Bos     MALWTRLRPLLALLALWPPPPARAFVNQHL
        **** : * *.*: *:..* :. *:****
Xenopus CGSHLVEALYLVCGDRGFFYYPKIKRDIEQ
Bos     CGSHLVEALYLVCGERGFFYTPKARREVEG
        **************:******** :*::*
Xenopus AQVNGPQDNELDG-MQFQPQEYQKMKRGIV
Bos     PQVG---ALELAGGPGAGGLEGPPQKRGIV
        .**. ** * * *****
Xenopus EQCCHSTCSLFQLENYCN
Bos     EQCCASVCSLYQLENYCN
        **** *.***:*******

7

8

9 Conservation-based inference
• Conserved sites: important for the function or structure; not allowed to mutate; “slow evolving” sites with a low rate of evolution.
• Variable sites: less important (usually); change more easily; “fast evolving” sites with a high rate of evolution.

10 Detecting conservation: evolutionary rates
Rate (~speed) = distance / time
Distance (d) = number of substitutions per site
Time = 2 × number of years (doubled because the two sequences evolved independently since their common ancestor)
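
The rate formula on this slide can be sketched in a few lines of Python. The function name and the example numbers (0.1 substitutions per site, 50 million years since divergence) are made up for illustration and do not come from the slides.

```python
def substitution_rate(distance_per_site: float, divergence_years: float) -> float:
    """Rate = distance / (2 * time).

    The time is doubled because both lineages have been accumulating
    substitutions independently since the common ancestor.
    """
    if divergence_years <= 0:
        raise ValueError("divergence time must be positive")
    return distance_per_site / (2.0 * divergence_years)


# Hypothetical numbers, chosen only to show the order of magnitude.
rate = substitution_rate(distance_per_site=0.1, divergence_years=50e6)
print(f"{rate:.1e} substitutions/site/year")
```

With these illustrative inputs the result is on the order of 10^-9 substitutions per site per year.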

11 Mean rate of nucleotide substitution in mammalian genomes: ~10^-9 substitutions/site/year. Evolution is a very slow process at the molecular level (“Nothing happens…”).

12 Rate computation
Human          DMAAHAM
Chimp          DEAAGGC
Cow            DQAAWAP
Fish           DLAACAL
S. cerevisiae  DDGAFAA
S. pombe       DDGALGE
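
To get a feel for why some columns of this toy alignment look "slow" and others "fast", the sketch below counts the distinct residues in each column. This is only a crude proxy chosen for illustration: it ignores the phylogenetic tree and the substitution model, so it is not the rate computation that ConSeq or ConSurf perform. The function name is hypothetical.

```python
# The six sequences from the slide.
alignment = {
    "Human": "DMAAHAM",
    "Chimp": "DEAAGGC",
    "Cow": "DQAAWAP",
    "Fish": "DLAACAL",
    "S. cerevisiae": "DDGAFAA",
    "S. pombe": "DDGALGE",
}


def column_diversity(sequences):
    """Return the number of distinct residues observed in each column."""
    sequences = list(sequences)
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [len(set(column)) for column in zip(*sequences)]


diversity = column_diversity(alignment.values())
print(diversity)  # → [1, 5, 2, 1, 6, 2, 6]
```

Columns 1 and 4 are invariant, while columns 5 and 7 show a different residue in every sequence.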

13 Site-specific rate computation method

14 Using the ConSeq server

15 ConSeq results:

16 Crash course in protein structure

17 Why protein structure?
• Each protein has a particular 3D structure that determines its function.
• Protein structure is better conserved than protein sequence and more closely related to function.
• Analyzing a protein structure is more informative than analyzing its sequence for function inference.

18 PDB: Protein Data Bank
• Holds 3D models of biological macromolecules (protein, RNA, DNA, small molecules)
• All data are available to the public
• X-ray crystal structures (84%), NMR models (16%)
• Submitted by biologists and biochemists from around the world

19 PDB model
• Defines the 3D coordinates (x, y, z) of each of the atoms in one or more molecules (i.e., a complex)
• There are models of proteins, protein complexes, proteins and DNA, protein segments, etc.
• The models also include the positions of ligand molecules, solvent molecules, metal ions, etc.
• PDB code: one digit followed by three digits or letters (e.g., 1a14)
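
The PDB code format described on this slide can be checked with a short regular expression. This is a minimal sketch that follows the slide's description of a four-character code; the function name is hypothetical, and the check only tests the shape of the string, not whether the entry exists.

```python
import re

# One digit followed by three digits or letters, as described on the slide.
PDB_CODE = re.compile(r"[0-9][A-Za-z0-9]{3}")


def looks_like_pdb_code(code: str) -> bool:
    """Return True if the string has the shape of a four-character PDB code."""
    return PDB_CODE.fullmatch(code) is not None


for candidate in ["1a14", "1bl8", "1hxw", "abcd", "1a1"]:
    print(candidate, looks_like_pdb_code(candidate))
```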

20 The PDB file – text format

21 The PDB file – text format
ATOM: usually protein or DNA
HETATM: usually ligand, ion, or water
Fields in each record: atom number, atom identity, residue identity, chain, residue number, the X, Y, Z coordinates of each atom in the structure, and the temperature factor
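
The fields listed on this slide sit at fixed column positions in each ATOM/HETATM line, so they can be read by slicing. The sketch below assumes the standard fixed-column PDB layout; the class name, function name, and the example record are made up for illustration (the example line is assembled field by field so that the columns line up).

```python
from typing import NamedTuple


class AtomRecord(NamedTuple):
    record: str       # "ATOM" or "HETATM"
    serial: int       # atom number
    atom_name: str    # atom identity
    res_name: str     # residue identity
    chain: str
    res_seq: int      # residue number
    x: float
    y: float
    z: float
    b_factor: float   # temperature factor


def parse_atom_line(line: str) -> AtomRecord:
    """Parse one ATOM/HETATM line using fixed column positions."""
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        raise ValueError(f"not a coordinate record: {record!r}")
    return AtomRecord(
        record=record,
        serial=int(line[6:11]),
        atom_name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        b_factor=float(line[60:66]),
    )


# Made-up example record, built with explicit field widths.
sample = (
    f"{'ATOM':<6}{1:>5} {' N':<4} {'MET':>3} {'A'}{1:>4}    "
    f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}{1.00:6.2f}{9.67:6.2f}"
)
print(parse_atom_line(sample))
```

For real work, an established parser is usually a better choice than hand-written slicing; the point here is only to show where each field on the slide lives in the text line.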

22 Viewing structures: wireframe, spacefill, backbone

23 Conservation in the structure
• Protein core (hydrophobic core): structurally constrained, usually conserved
• Active site: functionally constrained, usually conserved
• Surface loops: tolerant to mutations, usually variable

24 Same algorithm as ConSeq, but here the results are projected onto the 3D structure of the protein

25 Using the ConSurf server

26 ConSurf example: potassium channel
• An integral membrane protein with sequence similarity to all known K+ channels, particularly in the pore region
• PDB code: 1bl8, chain A

27 ConSurf results

28 ConSurf example: potassium channel
• Alignment of homologs found by PSI-BLAST:

29 ConSurf results

30 ConSurf example: potassium channel
• Phylogenetic tree reconstructed by neighbor joining:

31 ConSurf results

32 Conservation scores:
• The scores are standardized: the average score over all residues is zero, and the standard deviation is one.
• The lowest score represents the most conserved site in the protein. Negative values: slowly evolving (= low evolutionary rate), conserved sites.
• The highest score represents the most variable site in the protein. Positive values: rapidly evolving (= high evolutionary rate), variable sites.
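
The standardization described on this slide (mean zero, standard deviation one) can be sketched as follows. The per-site rates are made-up numbers, the function name is hypothetical, and the use of the population standard deviation is an assumption; the slides do not say which variant the server uses.

```python
from statistics import mean, pstdev


def standardize(rates):
    """Rescale per-site rates so their mean is 0 and standard deviation is 1."""
    mu = mean(rates)
    sigma = pstdev(rates)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("all rates are identical; scores are undefined")
    return [(r - mu) / sigma for r in rates]


# Hypothetical raw evolutionary rates for four sites.
raw_rates = [0.2, 0.5, 1.0, 2.3]
scores = standardize(raw_rates)
for rate, score in zip(raw_rates, scores):
    label = "conserved" if score < 0 else "variable"
    print(f"rate={rate:.1f}  score={score:+.2f}  {label}")
```

The slowest site receives the most negative score and the fastest site the most positive one, matching the interpretation given on the slide.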

33 ConSurf results: amino-acid conservation scores

34 ConSurf result with First Glance in Jmol:

35 ConSeq/ConSurf user intervention (advanced options)
1. Method of calculating the amino acid conservation scores: Bayesian/Max Likelihood
2. Enter your own MSA file
3. Multiply Align Sequences using: MUSCLE/CLUSTALW
4. Collect the Homologues from: SWISS-PROT/UniProt
5. Max. Number of Homologues (default = 50)
6. No. of PSI-BLAST Iterations (default = 1)
7. PSI-BLAST E-value Cutoff (default = )
8. Model of substitution for proteins: JTT/Dayhoff/mtREV/cpREV/WAG
9. Enter your own PDB file
10. Enter your own TREE file

36 Codon-level selection
• ConSeq/ConSurf compute the evolutionary rate of amino-acid sites → the data are amino acids.
• But codons encode amino acids…
• 61 codons vs. 20 amino acids!
• Aren’t we losing information?

37 Darwin – the theory of natural selection
• Adaptive evolution: favorable traits will become more frequent in the population

38 M. Kimura – the neutral theory of molecular evolution. Most of the DNA variation between species is neutral with regard to the phenotype. Selection operates to preserve a trait.

39 Synonymous (silent) and non-synonymous (non-silent) substitutions: Silent, Non-silent …

40 Synonymous vs. non-synonymous substitutions
UUU → UUC (Phe → Phe): synonymous
UUU → CUU (Phe → Leu): non-synonymous
Synonymous substitutions = silent substitutions
Non-synonymous substitutions = non-silent or amino-acid altering substitutions
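
The two examples on this slide can be reproduced with a small lookup against the standard genetic code. This is a minimal sketch: the function name is hypothetical, codons are accepted in either RNA (U) or DNA (T) spelling, and stop codons are represented by `*`.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon_from: str, codon_to: str) -> str:
    """Label a codon change as synonymous or non-synonymous."""
    aa_from = CODON_TABLE[codon_from.upper().replace("U", "T")]
    aa_to = CODON_TABLE[codon_to.upper().replace("U", "T")]
    return "synonymous" if aa_from == aa_to else "non-synonymous"


print(classify_substitution("UUU", "UUC"))  # Phe → Phe
print(classify_substitution("UUU", "CUU"))  # Phe → Leu

sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]
print(len(sense_codons), "sense codons encode",
      len({CODON_TABLE[c] for c in sense_codons}), "amino acids")
```

The last line also recovers the 61 codons vs. 20 amino acids count, which is why an amino-acid-level analysis discards part of the signal.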

41 Synonymous vs. non-synonymous substitutions. For most proteins, the rate of synonymous substitutions is observed to be much higher than the non-synonymous rate. This is called purifying selection (= conservation; this is what ConSeq and ConSurf compute).

42 Synonymous vs. non-synonymous substitutions Structural proteins

43 Saturation of synonymous substitutions Histone H4 between human and wheat: saturation of synonymous substitutions

44 Synonymous vs. non-synonymous substitutions. There are rare cases where the non-synonymous rate is much higher than the synonymous rate. This is called positive selection.

45 Positive selection
The hypothesis: positive selection promotes the fitness of the organism.
Examples:
• Proteins of the immune system
• Pathogen proteins evading the host immune system
• Pathogen proteins that are drug targets
• Proteins that are products of gene duplication
• Proteins involved in the reproductive system

46 Computing synonymous and non-synonymous rates
• Codon-based MSA: translate the DNA to amino acids, align, then map back to the DNA while keeping the alignment structure.
• Phylogenetic tree: 5 replacements in 10 positions between human and chimp is a lot, but between human and cucumber it is nothing.
• Different replacement probabilities between two amino acids: Lys ↔ Arg ≠ Lys ↔ Cys.
• Positive selection occurs at only a few sites!
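
The "map back to the DNA" step of a codon-based MSA can be sketched as threading each sequence's codons onto its aligned protein row. This is only an illustration of the idea under simple assumptions (the DNA is exactly the coding sequence of the ungapped protein, gaps are `-`); the function name and the tiny example are made up, and this is not the implementation of any particular tool.

```python
def thread_codons(aligned_protein: str, coding_dna: str, gap: str = "-") -> str:
    """Map an aligned protein row back onto its coding DNA.

    Each residue is replaced by its codon and each gap by three gap
    characters, so the protein alignment structure is preserved.
    """
    n_residues = sum(1 for ch in aligned_protein if ch != gap)
    if len(coding_dna) != 3 * n_residues:
        raise ValueError("DNA length must be 3 x the number of residues")
    pieces = []
    position = 0
    for ch in aligned_protein:
        if ch == gap:
            pieces.append(gap * 3)
        else:
            pieces.append(coding_dna[position:position + 3])
            position += 3
    return "".join(pieces)


# Hypothetical two-sequence example: the second protein has a gap.
print(thread_codons("MFK", "ATGTTTAAA"))  # → ATGTTTAAA
print(thread_codons("M-K", "ATGAAG"))     # → ATG---AAG
```

Aligning at the protein level and threading afterwards keeps gaps from splitting codons, which is what makes per-codon comparisons possible.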

47 Inferring positive selection. Divide the rate of non-silent substitutions (Ka) by the rate of silent substitutions (Ks).

48 Inferring positive selection
Basic assumptions:
• Selection score (Ka/Ks) > 1 → positive selection
• Selection score (Ka/Ks) < 1 → purifying selection
• Selection score (Ka/Ks) ≈ 1 → neutral evolution (no selection)
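
The three cases on this slide translate directly into a small helper. This is a naive sketch: the function name is hypothetical, and the tolerance used to decide what counts as "approximately 1" is an arbitrary assumption, not a value from the slides. A raw ratio on its own is not evidence of selection; the slides go on to explain why a statistical test is needed.

```python
def selection_regime(ka: float, ks: float, tolerance: float = 0.1) -> str:
    """Classify a Ka/Ks ratio as positive, purifying, or neutral.

    `tolerance` (an arbitrary illustrative choice) sets how close to 1
    the ratio must be to be called neutral.
    """
    if ks <= 0:
        raise ValueError("Ks must be positive to form the ratio")
    omega = ka / ks
    if abs(omega - 1.0) <= tolerance:
        return "neutral"
    return "positive" if omega > 1.0 else "purifying"


# Made-up rates for illustration.
print(selection_regime(ka=0.02, ks=0.20))  # ratio 0.1
print(selection_regime(ka=0.60, ks=0.20))  # ratio 3.0
print(selection_regime(ka=0.20, ks=0.20))  # ratio 1.0
```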

49 Not so fast! Maybe there’s no positive selection after all.
• We explicitly assume there is positive selection in the data.
• There is a good chance our model will find a few positively selected sites whatever the case.
• Is this really indicative of positive selection, or plain randomness?

50 Statistics helps us to compare between hypotheses
• H0: There’s no positive selection
• H1: There is positive selection
• H0: compute the probability (likelihood) of the data using a model that does not account for positive selection
• H1: compute the probability (likelihood) of the data using a model that does account for positive selection
• Perform a likelihood ratio test (LRT)
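
The likelihood ratio test in the last bullet can be sketched as follows. The statistic is twice the difference in log-likelihood between the two models, compared against a chi-square distribution. The sketch assumes one degree of freedom (the alternative model has one extra free parameter); that choice, the function name, and the log-likelihood values are assumptions made for illustration, not details given on the slide.

```python
import math


def likelihood_ratio_test(lnl_null: float, lnl_alt: float):
    """Return (statistic, p-value) for nested models differing by 1 parameter.

    statistic = 2 * (lnL_alt - lnL_null). For a chi-square distribution
    with 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    """
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical log-likelihoods for the null (H0) and alternative (H1) models.
stat, p = likelihood_ratio_test(lnl_null=-1000.0, lnl_alt=-996.0)
print(f"LRT statistic = {stat:.1f}, p = {p:.4f}")
if p < 0.05:
    print("H0 (no positive selection) is rejected at the 5% level")
```

Only when H0 is rejected is it reasonable to interpret sites with a high Ka/Ks as candidates for positive selection.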

51

52 Using the Selecton server

53 Input = a coding sequence at the codon level
• The user must provide the sequences; there is no PSI-BLAST option.
• The sequence lengths must be divisible by 3 (ORF), and the sequences must not include any stop codons.
• The alignment should be a codon alignment (e.g., produced with RevTrans).
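
The input requirements on this slide can be checked before submitting sequences. The sketch below is a hypothetical pre-flight check, not part of Selecton: the function name is made up, and the stop codon set assumes the standard nuclear genetic code (mitochondrial codes differ).

```python
# Stop codons of the standard nuclear genetic code (an assumption;
# other genetic codes use different stop codons).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_coding_sequence(sequence: str) -> list:
    """Return a list of problems found; an empty list means the input looks fine."""
    seq = sequence.upper().replace("U", "T")
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of 3")
    # Scan complete codons only, in reading frame 1.
    for start in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[start:start + 3]
        if codon in STOP_CODONS:
            problems.append(f"stop codon {codon} at codon {start // 3 + 1}")
    return problems


print(check_coding_sequence("ATGAAATTT"))  # → []
print(check_coding_sequence("ATGTAAAAA"))  # → ['stop codon TAA at codon 2']
```

A terminal stop codon is reported as well, so it would need to be trimmed from each sequence before submission.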

54 Similar to ConSurf. Optional: nuclear/mitochondrial genetic code, for different species. Default run: M8 (H1) and M8a (H0).

55 Selecton Example: HIV Protease The Protease is an essential enzyme for viral infectivity PDB ID: 1hxw

56 Selecton Results:

57 Selecton Results:

58 Selecton results:

59 Selection scores (Ka/Ks):
• The scores are normalized
• Ka/Ks > 1: positively selected site
• Ka/Ks < 1: site under purifying selection

60 Color coding scheme of Selecton:
• Used for visualization and based on the continuous Ka/Ks scores.
• The color grades (1-7): 1 for sites under strong positive selection (yellow); 7 for sites under strong purifying selection (bordeaux).

61