Identification of specificity-determining positions in protein alignments Mikhail Gelfand Research and Training Center “Bioinformatics” Institute for Information.

Identification of specificity-determining positions in protein alignments Mikhail Gelfand Research and Training Center “Bioinformatics” Institute for Information Transmission Problems, RAS ECCB2005, Madrid

Motivation. Large protein families have a general function assigned by homology, but not much functional information. There is much less structural data, and not many structures include substrates, cofactors, etc. Some specificity assignments come from comparative genomics. => Search for specificity-determining positions in alignments: – identification of functional sites; – prediction of specificity; – understanding and eventually re-design of function.

Specificity (of transporters) from comparative genomics – three examples. 1. New specificities in a little-studied family. Figure legend: S-box (rectangle frame), MetJ (circle frame), LYS-element (circles), Tyr-T-box (rectangles), malate/lactate.

2. Misleading homology: the PnuC family of transporters. Figure labels: the RFN elements, the THI elements.

3. A nightmare: the NiCoT family of nickel-cobalt transporters.

SDP (Specificity-Determining Position): an alignment position that is conserved within groups of proteins having the same specificity (specificity groups) but differs between them. An SDP is not equivalent to a functionally important position.

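Measure of specificity: mutual information at position p, summed over specificity groups i and amino acids α: $I_p = \sum_{i} \sum_{\alpha} f_p(\alpha,i) \log \frac{f_p(\alpha,i)}{f_p(\alpha)\, f(i)}$, where $f_p(\alpha,i)$ = count of amino acid α in group i at position p divided by the total number of sequences; $f_p(\alpha)$ = frequency of amino acid α in position p; $f(i)$ = fraction of proteins in group i.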
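The mutual information of one alignment column can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SDPpred implementation: the function name `mutual_information` is hypothetical, the natural logarithm is used, no pseudocounts are applied, and a gap character is simply treated as one more symbol.

```python
import math
from collections import Counter

def mutual_information(column, groups):
    """Mutual information between residues and group labels in one column.

    column: residues at one alignment position, one per sequence.
    groups: specificity-group label of each sequence, in the same order.
    """
    if len(column) != len(groups):
        raise ValueError("column and groups must have the same length")
    n = len(column)
    pair_counts = Counter(zip(column, groups))   # counts for f_p(alpha, i)
    residue_counts = Counter(column)             # counts for f_p(alpha)
    group_counts = Counter(groups)               # counts for f(i)
    mi = 0.0
    for (residue, group), count in pair_counts.items():
        f_joint = count / n
        f_residue = residue_counts[residue] / n
        f_group = group_counts[group] / n
        mi += f_joint * math.log(f_joint / (f_residue * f_group))
    return mi

# Two groups of three sequences each (hypothetical data).
labels = ["AQP", "AQP", "AQP", "GLP", "GLP", "GLP"]
print(mutual_information("AAAGGG", labels))  # column split by group
print(mutual_information("AAAAAA", labels))  # fully conserved column
```

A column that separates the groups perfectly reaches the maximum (log 2 for two equally sized groups), while a fully conserved column scores zero, which is why an SDP is not the same thing as a conserved position.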

Taking into account the structure of the phylogenetic tree: random shuffling and linear regression. Z-score (from random shuffling); linear regression (fitted by minimization) => positions that are more specific than expected given the tree.
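The shuffling step can be illustrated as follows. This is a sketch under assumptions: residues within a column are permuted among sequences while group labels stay fixed, and the Z-score is the observed mutual information minus the shuffled mean, divided by the shuffled standard deviation. The linear-regression correction for the tree is not reproduced here, and the names `shuffling_z_score`, `n_shuffles` and `seed` are hypothetical.

```python
import math
import random
import statistics
from collections import Counter

def mutual_information(column, groups):
    """Mutual information between residues and group labels (natural log)."""
    n = len(column)
    pairs, residues, labels = Counter(zip(column, groups)), Counter(column), Counter(groups)
    return sum(
        (c / n) * math.log((c / n) / ((residues[a] / n) * (labels[g] / n)))
        for (a, g), c in pairs.items()
    )

def shuffling_z_score(column, groups, n_shuffles=1000, seed=0):
    """Z-score of the observed mutual information against shuffled columns."""
    rng = random.Random(seed)          # fixed seed keeps the sketch reproducible
    observed = mutual_information(column, groups)
    shuffled = list(column)
    samples = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)          # break the residue/group association
        samples.append(mutual_information(shuffled, groups))
    mean = statistics.mean(samples)
    sd = statistics.pstdev(samples)
    # A column whose shuffles never vary (e.g. fully conserved) carries no signal.
    return (observed - mean) / sd if sd > 0 else 0.0

labels = ["I"] * 5 + ["II"] * 5        # hypothetical grouping
print(shuffling_z_score("AAAAAGGGGG", labels))
```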

Smoothing: pseudocounts and similarity between amino acid residues. m(a → b) = amino acid substitution matrix; n(a,i) = count of amino acid a at position i.

Automated threshold setting: the Bernoulli estimator. Are 5 SDPs with Z-score > 12 better than 10 SDPs with Z-score > 9?
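One way to answer that question automatically is sketched below. It assumes a common reading of a Bernoulli estimator: rank the Z-scores, and for each k compute the probability that at least k of the n positions would reach the k-th highest Z-score by chance, taking Z-scores as independent standard normal variables; the cutoff is the k that minimizes this probability. The function names are hypothetical, and the direct summation can underflow for very large n or extreme Z-scores.

```python
import math

def gaussian_tail(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def binomial_upper_tail(n, k, q):
    """P(at least k successes in n independent trials with success probability q)."""
    return sum(math.comb(n, i) * q**i * (1.0 - q) ** (n - i) for i in range(k, n + 1))

def bernoulli_cutoff(z_scores):
    """Return (k, probability): the number of top-ranked positions to accept."""
    ranked = sorted(z_scores, reverse=True)
    n = len(ranked)
    best_k, best_p = 0, 1.0
    for k in range(1, n + 1):
        q = gaussian_tail(ranked[k - 1])        # chance of one position reaching Z_k
        p = binomial_upper_tail(n, k, q)        # chance of k or more such positions
        if p < best_p:
            best_k, best_p = k, p
    return best_k, best_p

# Hypothetical Z-scores: three clear outliers above a background.
scores = [12.0, 11.0, 10.0, 1.5, 1.0, 0.5, 0.2, 0.0, -0.1, -0.3,
          -0.5, -0.7, -1.0, -1.2, -1.5, -1.8, -2.0, -2.1, -2.3, -2.5]
k, p = bernoulli_cutoff(scores)
print(k)
```

The probability keeps falling while each added position is still an outlier and rises sharply once background positions are included, so the minimum marks a natural threshold.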

Other similar techniques: Evolutionary trace (Lichtarge et al. 1996, 1997) – needs structure; gradual construction of group-specific consensus. Evolutionary rate shifts (DIVERGE, Gu et al. 2002) – positions with group-specific evolutionary rate. Surface patches of slowly evolving residues (Rate4Site, Pupko et al. 2002) – needs structure. PCA in the sequence space (Casari et al., 1995). Correlated mutations (Pazos and Valencia, 2002). Prediction of functional sub-types (Hannenhalli and Russell, 2000) – relative entropy of HMM profiles for groups.

SDPpred: Web interface. Input: multiple alignment of proteins divided into specificity groups.
=== AQP ===
%sp|Q9L772|AQPZ_BRUME
mlnklsaeffgtfwlvfggcgsa ilaa--afp elgigflgvalafgltvltmayavggisg--ghfnpavslgltv iiilgsts slap qlwlfwvaplvgavigaiiwkgllgrd
%sp|P48838|AQPZ_ECOLI
mfrklaaecfgtfwlvfggcgsa vlaa--gfp elgigfagvalafgltvltmafavghisg--ghfnpavtiglwa lvihgatd kfap qlwffwvvpivggiiggliyrtllekrd
%tr|Q92ZW
mfkklcaeflgtcwlvlggcgsa vlas--afp qvgigllgvsfafgltvltmaytvggisg--ghfnpavslglav iiilgsth rrvp qlwlfwiaplfgaaiagivwksvgeefrpvd
=== GLP ===
%sp|P11244|GLPF_ECOLI
msqt---stlkgqciaeflgtglliffgvgcv aalkvag a-sfgqweisviwglgvamaiyltagvsg--ahlnpavtialwl glilaltd dgn g-vpr -flvplfgpivgaivgafayrkligrhlpcdicvveek--etttpseqkasl
%sp|P44826|GLPF_HAEIN
mdks-----lkancigeflgtalliffgvgcv …
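A grouped alignment of this shape can be read with a short parser. The sketch below assumes the layout visible on the slide: a `=== NAME ===` line opens a specificity group, a line starting with `%` names a sequence, and the following lines hold its aligned residues. The function name `parse_grouped_alignment` is hypothetical, and the real SDPpred input rules may differ in details.

```python
def parse_grouped_alignment(text):
    """Parse a grouped alignment into {group: {sequence_name: aligned_sequence}}."""
    groups = {}
    group = None
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("===") and line.endswith("==="):
            group = line.strip("=").strip()      # "=== AQP ===" -> "AQP"
            groups[group] = {}
            name = None
        elif line.startswith("%"):
            if group is None:
                raise ValueError("sequence name before any group header")
            name = line[1:].strip()
            groups[group][name] = ""
        elif name is not None:
            # Sequence data may wrap over several lines; drop inner whitespace.
            groups[group][name] += "".join(line.split())
        else:
            raise ValueError("sequence data before any sequence name")
    return groups

# Hypothetical miniature input, not real sequences.
example = """
=== AQP ===
%seq1
mlnk--lsa
%seq2
mfrk--laa
=== GLP ===
%seq3
msqtstlkg
"""
alignment = parse_grouped_alignment(example)
print({group: list(members) for group, members in alignment.items()})
```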

SDPpred: Output. Alignment of the family with the SDPs highlighted (Alignment view). Detailed description of each SDP (List of SDPs). Plot of probabilities used by the Bernoulli estimator to set the cutoff (Probability plot view).

Transcription factors from the LacI family (LacI from E. coli). Training set: 459 sequences, average length 338 amino acids, 85 specificity groups – 44 SDPs. 10 residues contact NPF (analog of the effector); 6 residues in the intersubunit contacts; 7 residues contact the operator sequence; 7 residues in the effector contact zone (5 Å < d_min < 10 Å); 5 residues in the intersubunit contact zone (5 Å < d_min < 10 Å); 6 residues in the operator contact zone (5 Å < d_min < 10 Å).

SDP clusters at the subunit contact region. LacI (lactose repressor) from E. coli (1jwl). Figure labels: effector, DNA operator, Cluster I, Cluster II.

Overall statistics (LacI of E. coli): total 348 amino acids, 44 SDPs. Chart legend: contacting residues (distance to the DNA, effector, or the other subunit < 5 Å); contact zone (may be functional); non-contacting residues (distance to the DNA, effector, or the other subunit > 10 Å).
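The three distance classes used in these statistics can be expressed as a small classifier. This is an illustrative sketch: the function names are hypothetical, the distances in the example are invented rather than taken from any structure, and the handling of values exactly at 5 Å or 10 Å is an assumption, since the slides do not specify it.

```python
from collections import Counter

def classify_by_distance(d_min, contact=5.0, zone=10.0):
    """Classify a residue by its minimal distance (in Å) to a ligand or partner."""
    if d_min < contact:
        return "contacting"
    if d_min <= zone:                 # boundary handling is an assumption
        return "contact zone"
    return "non-contacting"

def summarize(distances):
    """Count residues per class; distances maps residue label -> d_min in Å."""
    return Counter(classify_by_distance(d) for d in distances.values())

# Hypothetical distances for four residues.
example = {"res1": 3.2, "res2": 7.5, "res3": 14.0, "res4": 4.9}
print(summarize(example))
```

Comparing the class counts for SDPs with those for all residues of the protein shows whether SDPs are enriched near the DNA, the effector or the other subunit.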

Membrane channels of the MIP family (GlpF from E. coli). Training set: 17 sequences, average length 280 amino acids, 2 specificity groups: aquaporins & glyceroaquaporins – 21 SDPs. 8 residues contact glycerol (substrate) (d_min < 5 Å); 8 residues oriented to the channel; 5 residues in the contacts with other subunits.

GlpF (glycerol facilitator) from E. coli (1fx8). Two SDP clusters at the contact of subunits forming the tetramer: 20Leu, 24Ile, 108Tyr of one subunit, 193Ser of another subunit. Figure labels: Cluster I, Cluster II, Subunit I, substrate (glycerol), Glu43.

Overall statistics (GlpF from E. coli): total 281 amino acids, 21 SDPs. Chart legend: contacting residues (distance to the substrate or another subunit < 5 Å); contact zone (may be functional); non-contacting residues (distance to the substrate or another subunit > 10 Å).

Isocitrate/isopropylmalate dehydrogenases: combinations of specificities towards substrate and cofactor. IDH: catalyzes the oxidation of isocitrate to α-ketoglutarate and CO2 (TCA) using either NAD or NADP as a cofactor, in organisms from prokaryotes to higher eukaryotes. IMDH: catalyzes the oxidative decarboxylation of 3-isopropylmalate into 2-oxo-4-methylvalerate (leucine biosynthesis) in prokaryotes and fungi; the cofactor is NAD. Tree labels: Mitochondria, Archaea, Bacteria, Eukaryota.

Selecting specificity groups. 1. By substrate: all IDHs vs. all IMDHs. 2. By cofactor: all NAD-dependent vs. all NADP-dependent. 3. Four groups: IDH (NAD), IDH (NADP) type I, IDH (NADP) type II, IMDH (NAD).

Predicted SDPs. Panel captions: most SDPs near the substrate; SDPs near the substrate and the cofactor; SDPs near the substrate, the cofactor and the other subunit.

SDPs, the cofactor and the substrate. NADP-dependent IDH from E. coli (1ai2). Figure labels: substrate (isocitrate), cofactor (NADP), nicotinamide nucleotide, adenine nucleotide. 344Lys, 345Tyr, 351Val: cofactor-specific SDPs, known determinants of specificity to cofactor. 100Lys, 104Thr, 105Thr, 107Val, 337Ala, 341Thr: substrate-specific and four-group SDPs, functionally not characterized.

SDPs predicted for different groupings (diagram with three sets: cofactor-specific SDPs, substrate-specific SDPs, four groups). Residues shown: 154Glu, 158Asp, 208Arg, 229His, 231Gly, 233Ile, 287Gln, 300Ala, 305Asn, 308Tyr, 327Asn, 344Lys, 345Tyr, 351Val, 38Gly, 40Asp, 100Lys, 103Leu, 105Thr, 115Asn, 155Asn, 164Glu, 241Phe, 337Ala, 341Thr, 97Val, 98Ala, 104Thr, 107Val, 152Phe, 161Ala, 162Gly, 232Asn, 245Gly, 31Tyr, 323Ala, 36Gly, 45Met. Color code: contacts cofactor; contacts substrate AND cofactor; contacts substrate; contacts substrate AND the other subunit; contacts the other subunit.

Overview. Transcription factors: contacts with the cofactor and the DNA. Transporters: contacts with the substrate. Enzymes: contacts with the substrate and the cofactor. And all: contacts between subunits.

Protein-DNA interactions. Panels: CRP, PurR, IHF, TrpR. Entropy at aligned sites (blue plots) and the number of contacts (red: heavy atoms in a base pair at a distance < cutoff from a protein atom).
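The entropy plotted for aligned sites can be illustrated with a per-column Shannon entropy. This is a minimal sketch, assuming equal-length, gap-free binding sites and entropy measured in bits; the function name `column_entropies` is hypothetical and the example sites are invented.

```python
import math
from collections import Counter

def column_entropies(sites):
    """Shannon entropy (bits) of each column in a set of aligned sites."""
    if len({len(site) for site in sites}) != 1:
        raise ValueError("all sites must have the same length")
    entropies = []
    for column in zip(*sites):                    # iterate over aligned positions
        counts = Counter(column)
        total = len(column)
        entropies.append(
            sum((c / total) * math.log2(total / c) for c in counts.values())
        )
    return entropies

# Hypothetical aligned sites.
sites = ["ACGT", "ACGA", "ACCT", "ACTT"]
print(column_entropies(sites))
```

Low-entropy columns are the conserved base pairs; the slide compares such columns with the number of protein atoms in contact with each base pair.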

The observed correlation does not depend on the distance cutoff

CRP/FNR family of regulators

Correlation between contacting nucleotides and amino acid residues. Families: CooA in Desulfovibrio spp., CRP in Gamma-proteobacteria, HcpR in Desulfovibrio spp., FNR in Gamma-proteobacteria.
DD COOA ALTTEQLSLHMGATRQTVSTLLNNLVR
DV COOA ELTMEQLAGLVGTTRQTASTLLNDMIR
EC CRP KITRQEIGQIVGCSRETVGRILKMLED
YP CRP KXTRQEIGQIVGCSRETVGRILKMLED
VC CRP KITRQEIGQIVGCSRETVGRILKMLEE
DD HCPR DVSKSLLAGVLGTARETLSRALAKLVE
DV HCPR DVTKGLLAGLLGTARETLSRCLSRMVE
EC FNR TMTRGDIGNYLGLTVETISRLLGRFQK
YP FNR TMTRGDIGNYLGLTVETISRLLGRFQK
VC FNR TMTRGDIGNYLGLTVETISRLLGRFQK
Binding-site consensus sequences: TGTCGGCnnGCCGACA, TTGTgAnnnnnnTcACAA, TTGTGAnnnnnnTCACAA, TTGATnnnnATCAA.
Contacting residues: REnnnR. TG: 1st arginine. GA: glutamate and 2nd arginine.

The correlation holds for other factors in the family

Plans and perspectives. Protein-DNA interactions: LacI family of transcriptional regulators (each branch represents a subfamily)

… and their signals: 1605 regulators from 189 genomes, forming 302 groups of orthologs and binding 2518 sites

Plans and perspectives. Experimental verification: a new family of Ni/Co transporters. No structural data. Specificity predicted by comparative genomics. Predicted SDPs form several clusters in the alignment and are located on the same sides of alpha-helices. Mutational analysis.

Terminators of translation in prokaryotes / decoding of stop codons. Specificity of RF1 (UAG, UAA) and RF2 (UGA, UAA). Fragment of the alignment (117 pairs). SDPs are shown by black boxes above the alignment.

“Interesting” positions: invariant, SDPs, variable rate.

SDPs and invariant positions: two decoding sites?

Plans and perspectives. Use of 3D structures, when available. Identification of functional sites as spatial clusters of SDPs and conserved positions. Automated identification of specificity groups based on the analysis of the phylogenetic tree. Protein-DNA interactions. Identification of protein-protein contact surfaces.

Publications
N.J. Oparina, O.V. Kalinina, M.S. Gelfand, L.L. Kisselev (2005) Common and specific amino acid residues in the prokaryotic polypeptide release factors RF1 and RF2: possible functional implications. Nucleic Acids Research 33 (in press).
O.V. Kalinina, A.A. Mironov, M.S. Gelfand, A.B. Rakhmaninova (2004) Automated selection of positions determining functional specificity of proteins by comparative analysis of orthologous groups in protein families. Protein Science 13:
O.V. Kalinina, P.S. Novichkov, A.A. Mironov, M.S. Gelfand, A.B. Rakhmaninova (2004) SDPpred: a tool for prediction of amino acid residues that determine differences in functional specificity of homologous proteins. Nucleic Acids Research 32: W424-W428.
O.V. Kalinina, M.S. Gelfand, A.A. Mironov, A.B. Rakhmaninova (2003) Amino acid residues forming specific contacts between subunits in tetramers of the membrane channel GlpF. Biophysics (Moscow) 48: S141-S145.
L.A. Mirny, M.S. Gelfand (2002) Using orthologous and paralogous proteins to identify specificity determining residues in bacterial transcription factors. Journal of Molecular Biology 321:
L.A. Mirny, M.S. Gelfand (2002) Structural analysis of conserved base-pairs in protein-DNA complexes. Nucleic Acids Research 30:

Acknowledgements. Leonid Mirny (Harvard, MIT), Olga Kalinina, Andrei A. Mironov, Alexandra B. Rakhmaninova, Dmitry Rodionov, Olga Laikova. Funding: Howard Hughes Medical Institute; Ludwig Institute of Cancer Research; Russian Fund of Basic Research; Russian Academy of Sciences, programs “Molecular and Cellular Biology” and “Origin and Evolution of the Biosphere”.