Identification of specificity-determining positions in protein alignments
Mikhail Gelfand
Research and Training Center “Bioinformatics”, Institute for Information Transmission Problems, RAS
ECCB2005, Madrid
Motivation
Large protein families with general function assigned by homology; not much functional information
Much less structural data; not many structures with substrates, cofactors, etc.
Some specificity assignments from comparative genomics
=> Search for specificity-determining positions in alignments
– identification of functional sites
– prediction of specificity
– understanding and eventually re-design of function
Specificity (of transporters) from comparative genomics – three examples.
1. New specificities in a little-studied family
[Figure legend: S-box (rectangle frame); MetJ (circle frame); LYS-element (circles); Tyr-T-box (rectangles); malate/lactate]
2. Misleading homology: the PnuC family of transporters
[Figure labels: the RFN elements; the THI elements]
3. A nightmare. The NiCoT family of nickel-cobalt transporters
SDP (Specificity-Determining Position)
An alignment position that is conserved within groups of proteins having the same specificity (specificity groups) but differs between them
An SDP is not equivalent to a functionally important position
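Measure of specificity: mutual information
$$I_p = \sum_{i} \sum_{\alpha} f_p(\alpha, i) \log \frac{f_p(\alpha, i)}{f_p(\alpha)\, f(i)}$$
f_p(α, i) = count of amino acid α in group i at position p divided by the total number of sequences
f_p(α) = frequency of amino acid α in position p
f(i) = fraction of proteins in group i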
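A minimal Python sketch of the mutual information between the residue at one alignment position and the specificity-group label, using the three frequencies defined on this slide. The function name and the plain-string input are illustrative assumptions; gaps are treated as just another symbol, and no smoothing is applied.

```python
import math
from collections import Counter


def mutual_information(column, groups):
    """Mutual information (in nats) between residues and group labels.

    column: residues at one alignment position, one per sequence.
    groups: specificity-group label of each sequence, in the same order.
    """
    if len(column) != len(groups):
        raise ValueError("column and groups must have the same length")
    total = len(column)
    joint = Counter(zip(column, groups))  # count of residue a in group i
    residues = Counter(column)            # count of residue a in the column
    sizes = Counter(groups)               # number of sequences in group i
    info = 0.0
    for (residue, group), count in joint.items():
        f_joint = count / total
        f_residue = residues[residue] / total
        f_group = sizes[group] / total
        info += f_joint * math.log(f_joint / (f_residue * f_group))
    return info
```

A column that separates two equal-sized groups perfectly gives log 2, and an invariant column gives 0, which illustrates why an SDP is not the same thing as a conserved position.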
Taking into account the structure of the phylogenetic tree: random shuffling and linear regression
[Plot labels: Z-score; min; linear regression]
=> positions that are more specific than expected given the tree
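The shuffling half of this step can be sketched as follows: the residues of one column are permuted among the sequences many times, and the observed mutual information is compared with the shuffled distribution as a Z-score. This is an illustration under stated assumptions, not the published procedure: the linear-regression correction named on the slide is not implemented, and `shuffled_z_score`, `n_shuffles`, and `seed` are hypothetical names.

```python
import math
import random
import statistics
from collections import Counter


def mutual_information(column, groups):
    """Mutual information (nats) between residues and group labels."""
    total = len(column)
    joint = Counter(zip(column, groups))
    residues = Counter(column)
    sizes = Counter(groups)
    return sum(
        (c / total)
        * math.log((c / total) / ((residues[a] / total) * (sizes[g] / total)))
        for (a, g), c in joint.items()
    )


def shuffled_z_score(column, groups, n_shuffles=500, seed=0):
    """Z-score of the observed information against shuffled columns."""
    rng = random.Random(seed)
    observed = mutual_information(column, groups)
    residues = list(column)
    shuffled = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # break the residue/group association
        shuffled.append(mutual_information(residues, groups))
    spread = statistics.pstdev(shuffled)
    if spread == 0.0:
        # Invariant column: shuffling changes nothing, so no signal.
        return 0.0
    return (observed - statistics.fmean(shuffled)) / spread
```

Shuffling keeps the amino acid composition of the column fixed, so the Z-score measures association with the grouping rather than conservation alone.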
Smoothing: pseudocounts and similarity between amino acid residues m(a b) = amino acid substitution matrix n(a,i) = count of amino acid a at position i
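The exact smoothing formula did not survive on this slide, so the sketch below shows only one common way to combine observed counts with substitution-matrix pseudocounts. The functional form, the `weight` parameter, and the requirement that each matrix row sums to 1 are all assumptions made for illustration, and the toy three-letter matrix is invented.

```python
def smoothed_frequencies(counts, substitution, weight=1.0):
    """Blend observed residue counts with similarity-based pseudocounts.

    counts: dict residue -> n(a), observed count at one position.
    substitution: dict b -> dict a -> m(b->a); each row must sum to 1
        (assumed normalisation, not taken from the slide).
    weight: total pseudocount mass added to the position (assumption).
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("position has no observed residues")
    alphabet = list(substitution)
    smoothed = {}
    for a in alphabet:
        # Expected share of residue a given what was actually observed.
        pseudo = sum(
            counts.get(b, 0) / total * substitution[b][a] for b in alphabet
        )
        smoothed[a] = (counts.get(a, 0) + weight * pseudo) / (total + weight)
    return smoothed


# Toy matrix over three residues: L and I are similar, D is not.
TOY_MATRIX = {
    "L": {"L": 0.70, "I": 0.25, "D": 0.05},
    "I": {"L": 0.25, "I": 0.70, "D": 0.05},
    "D": {"L": 0.05, "I": 0.05, "D": 0.90},
}
```

With only leucine observed, the similar residue I receives more pseudocount mass than the dissimilar residue D, which is the intended effect of using a substitution matrix rather than uniform pseudocounts.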
Automated threshold setting: the Bernoulli estimator
Are 5 SDPs with Z-score > 12 better than 10 SDPs with Z-score > 9?
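One way to read the question on this slide: for each candidate number k of top-scoring positions, ask how likely it is to see at least k positions with a Z-score that high by chance, and pick the k that makes this probability smallest. The sketch below assumes that Z-scores are standard normal under the null and that positions are independent; both are modelling assumptions, and the function names are hypothetical.

```python
import math


def normal_tail(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def log_binom_tail(n, k, p):
    """log P(X >= k) for X ~ Binomial(n, p), computed in log space."""
    if k <= 0 or p >= 1.0:
        return 0.0
    if p <= 0.0:
        return -math.inf
    logs = [
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * math.log(p) + (n - i) * math.log1p(-p)
        for i in range(k, n + 1)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))


def bernoulli_cutoff(z_scores):
    """Number of top positions whose joint occurrence is least likely."""
    ranked = sorted(z_scores, reverse=True)
    n = len(ranked)
    best_k, best_logp = 0, math.inf
    for k in range(1, n + 1):
        # Chance of >= k positions scoring at least as high as the k-th.
        logp = log_binom_tail(n, k, normal_tail(ranked[k - 1]))
        if logp < best_logp:
            best_k, best_logp = k, logp
    return best_k
```

Working in log space keeps the comparison meaningful when the tail probabilities are far below floating-point resolution, which is typical for Z-scores around 10.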
Other similar techniques
– Evolutionary trace (Lichtarge et al. 1996, 1997): needs structure; gradual construction of group-specific consensus
– Evolutionary rate shifts (DIVERGE, Gu et al. 2002): positions with group-specific evolutionary rate
– Surface patches of slowly evolving residues (Rate4Site, Pupko et al. 2002): needs structure
– PCA in the sequence space (Casari et al., 1995)
– Correlated mutations (Pazos and Valencia, 2002)
– Prediction of functional sub-types (Hannenhalli and Russell, 2000): relative entropy of HMM profiles for groups
SDPpred: Web interface
Input: multiple alignment of proteins divided into specificity groups
=== AQP ===
%sp|Q9L772|AQPZ_BRUME mlnklsaeffgtfwlvfggcgsa ilaa--afp elgigflgvalafgltvltmayavggisg--ghfnpavslgltv iiilgsts slap qlwlfwvaplvgavigaiiwkgllgrd
%sp|P48838|AQPZ_ECOLI mfrklaaecfgtfwlvfggcgsa vlaa--gfp elgigfagvalafgltvltmafavghisg--ghfnpavtiglwa lvihgatd kfap qlwffwvvpivggiiggliyrtllekrd
%tr|Q92ZW mfkklcaeflgtcwlvlggcgsa vlas--afp qvgigllgvsfafgltvltmaytvggisg--ghfnpavslglav iiilgsth rrvp qlwlfwiaplfgaaiagivwksvgeefrpvd
=== GLP ===
%sp|P11244|GLPF_ECOLI msqt---stlkgqciaeflgtglliffgvgcv aalkvag a-sfgqweisviwglgvamaiyltagvsg--ahlnpavtialwl glilaltd dgn g-vpr -flvplfgpivgaivgafayrkligrhlpcdicvveek--etttpseqkasl
%sp|P44826|GLPF_HAEIN mdks-----lkancigeflgtalliffgvgcv …
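A small parser for the grouped-alignment layout shown in this example. The grammar is inferred from the slide alone and is an assumption: a `=== NAME ===` line opens a specificity group, a line starting with `%` names a sequence, and the residues follow either on the same line or on the lines after it. `parse_grouped_alignment` is a hypothetical helper, not part of SDPpred.

```python
def parse_grouped_alignment(text):
    """Parse '=== GROUP ===' / '%name sequence' text into nested dicts.

    Returns {group name: {sequence name: aligned sequence}}.
    Whitespace inside a sequence is removed; gap characters are kept.
    """
    groups = {}
    group = name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("===") and line.endswith("==="):
            group = line.strip("= ")
            groups[group] = {}
            name = None
        elif line.startswith("%"):
            if group is None:
                raise ValueError("sequence found before any group header")
            parts = line[1:].split(None, 1)
            if not parts:
                raise ValueError("sequence line has no name")
            name = parts[0]
            rest = parts[1] if len(parts) > 1 else ""
            groups[group][name] = "".join(rest.split())
        elif name is not None:
            # Continuation line: more residues of the current sequence.
            groups[group][name] += "".join(line.split())
        else:
            raise ValueError(f"unexpected line: {line!r}")
    return groups
```

The nested dictionary gives both inputs the scoring needs: the aligned sequences and the group label of each one.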
SDPpred: Output
– Alignment of the family with the SDPs highlighted (Alignment view)
– Detailed description of each SDP (List of SDPs)
– Plot of probabilities used by the Bernoulli estimator to set the cutoff (Probability plot view)
Transcription factors from the LacI family
Training set: 459 sequences, average length 338 amino acids, 85 specificity groups – 44 SDPs
LacI from E. coli:
– 10 residues contact NPF (analog of the effector)
– 6 residues in the intersubunit contacts
– 7 residues contact the operator sequence
– 7 residues in the effector contact zone (5 Å < d_min < 10 Å)
– 5 residues in the intersubunit contact zone (5 Å < d_min < 10 Å)
– 6 residues in the operator contact zone (5 Å < d_min < 10 Å)
SDP clusters at the subunit contact region
LacI (lactose repressor) from E. coli (1jwl)
[Figure labels: Effector; DNA operator; Cluster I; Cluster II]
Overall statistics (LacI of E. coli)
Total: 348 amino acids, 44 SDPs
– Contacting residues (distance to the DNA, effector, or the other subunit < 5 Å)
– Contact zone (may be functional)
– Non-contacting residues (distance to the DNA, effector, or the other subunit > 10 Å)
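The three residue categories used in these statistics can be expressed as a threshold rule on the minimum interatomic distance d_min. The sketch below assumes atoms are given as (x, y, z) tuples in Å; how distances of exactly 5 Å or 10 Å are assigned is not stated on the slides, so the boundary handling is an assumption, as are the function names.

```python
import math


def min_distance(residue_atoms, partner_atoms):
    """Smallest distance (Å) between any residue atom and any partner atom."""
    return min(math.dist(a, b) for a in residue_atoms for b in partner_atoms)


def classify_residue(d_min, contact=5.0, zone=10.0):
    """Assign a residue to one of the three categories by d_min (Å)."""
    if d_min < contact:
        return "contacting"
    if d_min <= zone:  # boundary assignment is an assumption
        return "contact zone"
    return "non-contacting"
```

Here the partner would be the DNA, the effector or substrate, or the other subunit, and a residue's category is taken from its closest partner.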
Membrane channels of the MIP family
Training set: 17 sequences, average length 280 amino acids, 2 specificity groups (aquaporins and glyceroaquaporins) – 21 SDPs
GlpF from E. coli:
– 8 residues contact glycerol (substrate) (d_min < 5 Å)
– 8 residues oriented to the channel
– 5 residues in the contacts with other subunits
GlpF (glycerol facilitator) from E. coli (1fx8)
Two SDP clusters at the contact of subunits forming the tetramer: 20Leu, 24Ile, 108Tyr of one subunit, 193Ser of another subunit
[Figure labels: Cluster I; Cluster II; Subunit I; Substrate (glycerol); Glu43]
Overall statistics (GlpF from E. coli)
Total: 281 amino acids, 21 SDPs
– Contacting residues (distance to the substrate or another subunit < 5 Å)
– Contact zone (may be functional)
– Non-contacting residues (distance to the substrate or another subunit > 10 Å)
Isocitrate/isopropylmalate dehydrogenases: combinations of specificities towards substrate and cofactor
IDH: catalyzes the oxidation of isocitrate to α-ketoglutarate and CO2 (TCA cycle), using either NAD or NADP as a cofactor, in organisms from prokaryotes to higher eukaryotes
IMDH: catalyzes the oxidative decarboxylation of 3-isopropylmalate into 2-oxo-4-methylvalerate (leucine biosynthesis) in prokaryotes and fungi; the cofactor is NAD
[Tree labels: Mitochondria; Archaea; Bacteria; Eukaryota]
Selecting specificity groups
1. By substrate: all IDHs vs. all IMDHs
2. By cofactor: all NAD-dependent vs. all NADP-dependent
3. Four groups: IDH (NAD); IDH (NADP) type I; IDH (NADP) type II; IMDH (NAD)
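The three groupings are different partitions of the same sequences, which the sketch below makes explicit: each of the four groups carries a substrate and a cofactor, and the coarser groupings follow by merging. The substrate and cofactor of each group are taken from the slides; the sequence identifiers in the usage note and the `regroup` helper are hypothetical.

```python
from collections import defaultdict

# Four-group label -> (substrate, cofactor), as listed on the slides.
FOUR_GROUPS = {
    "IDH (NAD)": ("isocitrate", "NAD"),
    "IDH (NADP) type I": ("isocitrate", "NADP"),
    "IDH (NADP) type II": ("isocitrate", "NADP"),
    "IMDH (NAD)": ("isopropylmalate", "NAD"),
}


def regroup(assignments, scheme):
    """Partition sequences by 'substrate', 'cofactor', or 'four'.

    assignments: dict sequence id -> four-group label.
    Returns dict group name -> list of sequence ids.
    """
    index = {"substrate": 0, "cofactor": 1}
    if scheme != "four" and scheme not in index:
        raise ValueError(f"unknown scheme: {scheme!r}")
    grouped = defaultdict(list)
    for seq_id, label in assignments.items():
        key = label if scheme == "four" else FOUR_GROUPS[label][index[scheme]]
        grouped[key].append(seq_id)
    return dict(grouped)
```

Running the same SDP scoring on each partition is what would yield substrate-specific, cofactor-specific, and four-group SDPs from a single alignment.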
Predicted SDPs
[Panel captions: most SDPs near the substrate; SDPs near the substrate and the cofactor; SDPs near the substrate, the cofactor and the other subunit]
SDPs, the cofactor and the substrate
NADP-dependent IDH from E. coli (1ai2)
344Lys, 345Tyr, 351Val: cofactor-specific SDPs, known determinants of specificity to cofactor
100Lys, 104Thr, 105Thr, 107Val, 337Ala, 341Thr: substrate-specific and four-group SDPs, functionally not characterized
[Figure labels: Substrate (isocitrate); Cofactor (NADP); Nicotinamide nucleotide; Adenine nucleotide]
SDPs predicted for different groupings
Groupings: cofactor-specific SDPs; substrate-specific SDPs; four groups
SDPs (in slide order): 154Glu, 158Asp, 208Arg, 229His, 231Gly, 233Ile, 287Gln, 300Ala, 305Asn, 308Tyr, 327Asn, 344Lys, 345Tyr, 351Val, 38Gly, 40Asp, 100Lys, 103Leu, 105Thr, 115Asn, 155Asn, 164Glu, 241Phe, 337Ala, 341Thr, 97Val, 98Ala, 104Thr, 107Val, 152Phe, 161Ala, 162Gly, 232Asn, 245Gly, 31Tyr, 323Ala, 36Gly, 45Met
Color code: contacts cofactor; contacts substrate AND cofactor; contacts substrate; contacts substrate AND the other subunit; contacts the other subunit
Overview
Transcription factors: contacts with the cofactor and the DNA
Transporters: contacts with the substrate
Enzymes: contacts with the substrate and the cofactor
And all: contacts between subunits
Protein-DNA interactions
[Panels: CRP; PurR; IHF; TrpR]
Entropy at aligned sites (blue plots) and the number of contacts (red: heavy atoms in a base pair at a distance < cutoff from a protein atom)
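The entropy curve in these plots can be computed per column of the aligned binding sites. A minimal sketch, assuming equal-length sites given as strings and Shannon entropy in bits; the contact counts would come from a structure and are not computed here, and `column_entropies` is an illustrative name.

```python
import math
from collections import Counter


def column_entropies(sites):
    """Shannon entropy (bits) of each column of equal-length aligned sites."""
    if len({len(site) for site in sites}) != 1:
        raise ValueError("sites must be non-empty and of equal length")
    entropies = []
    for column in zip(*sites):
        total = len(column)
        entropies.append(
            sum((c / total) * math.log2(total / c)
                for c in Counter(column).values())
        )
    return entropies
```

Low entropy marks conserved base pairs, which is the quantity the slides compare with the number of protein contacts at each position.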
The observed correlation does not depend on the distance cutoff
CRP/FNR family of regulators
Correlation between contacting nucleotides and amino acid residues
Regulators: CooA in Desulfovibrio spp.; CRP in Gamma-proteobacteria; HcpR in Desulfovibrio spp.; FNR in Gamma-proteobacteria
DD COOA  ALTTEQLSLHMGATRQTVSTLLNNLVR
DV COOA  ELTMEQLAGLVGTTRQTASTLLNDMIR
EC CRP   KITRQEIGQIVGCSRETVGRILKMLED
YP CRP   KXTRQEIGQIVGCSRETVGRILKMLED
VC CRP   KITRQEIGQIVGCSRETVGRILKMLEE
DD HCPR  DVSKSLLAGVLGTARETLSRALAKLVE
DV HCPR  DVTKGLLAGLLGTARETLSRCLSRMVE
EC FNR   TMTRGDIGNYLGLTVETISRLLGRFQK
YP FNR   TMTRGDIGNYLGLTVETISRLLGRFQK
VC FNR   TMTRGDIGNYLGLTVETISRLLGRFQK
DNA motifs (in slide order): TGTCGGCnnGCCGACA; TTGTgAnnnnnnTcACAA; TTGTGAnnnnnnTCACAA; TTGATnnnnATCAA
Contacting residues: REnnnR
TG: 1st arginine
GA: glutamate and 2nd arginine
The correlation holds for other factors in the family
Plans and perspectives. Protein-DNA interactions LacI family of transcriptional regulators (each branch represents a subfamily)
… and their signals 1605 regulators from 189 genomes, forming 302 groups of orthologs and binding 2518 sites
Plans and perspectives. Experimental verification
A new family of Ni/Co transporters
– No structural data
– Specificity predicted by comparative genomics
– Predicted SDPs form several clusters in the alignment and are located on the same sides of alpha-helices
– Mutational analysis
Terminators of translation in prokaryotes / decoding of stop-codons. Specificity of RF1 (UAG, UAA) and RF2 (UGA, UAA) Fragment of the alignment (117 pairs). SDPs are shown by black boxes above the alignment.
“Interesting” positions: invariant, SDPs, variable rate.
SDPs and invariant positions: two decoding sites?
Plans and perspectives
– Use of 3D structures, when available. Identification of functional sites as spatial clusters of SDPs and conserved positions
– Automated identification of specificity groups based on the analysis of the phylogenetic tree
– Protein-DNA interactions
– Identification of protein-protein contact surfaces
Publications
N.J. Oparina, O.V. Kalinina, M.S. Gelfand, L.L. Kisselev (2005) Common and specific amino acid residues in the prokaryotic polypeptide release factors RF1 and RF2: possible functional implications. Nucleic Acids Research 33 (in press).
O.V. Kalinina, A.A. Mironov, M.S. Gelfand, A.B. Rakhmaninova (2004) Automated selection of positions determining functional specificity of proteins by comparative analysis of orthologous groups in protein families. Protein Science 13:
O.V. Kalinina, P.S. Novichkov, A.A. Mironov, M.S. Gelfand, A.B. Rakhmaninova (2004) SDPpred: a tool for prediction of amino acid residues that determine differences in functional specificity of homologous proteins. Nucleic Acids Research 32: W424-W428.
O.V. Kalinina, M.S. Gelfand, A.A. Mironov, A.B. Rakhmaninova (2003) Amino acid residues forming specific contacts between subunits in tetramers of the membrane channel GlpF. Biophysics (Moscow) 48: S141-S145.
L.A. Mirny, M.S. Gelfand (2002) Using orthologous and paralogous proteins to identify specificity determining residues in bacterial transcription factors. Journal of Molecular Biology 321:
L. Mirny, M.S. Gelfand (2002) Structural analysis of conserved base-pairs in protein-DNA complexes. Nucleic Acids Research 30:
Acknowledgements
Leonid Mirny (Harvard, MIT), Olga Kalinina, Andrei A. Mironov, Alexandra B. Rakhmaninova, Dmitry Rodionov, Olga Laikova
Howard Hughes Medical Institute
Ludwig Institute of Cancer Research
Russian Fund of Basic Research
Russian Academy of Sciences, programs “Molecular and Cellular Biology” and “Origin and Evolution of the Biosphere”