Identification of protein-protein binding motifs

Felipe Leal Valentim, Aalt-Jan van Dijk
Plant Research International, Applied Bioinformatics

Protein-protein binding interfaces

[Figure labels: surface; core; core structural residues; ligand-binding site; DNA-binding site]
Properties: exposed on the protein surface; functionally/structurally important residues are more highly conserved.

So, I'll briefly give an overview of what we need to know about protein-protein binding interfaces, which helped us to design our prediction pipeline. When we analyze a pair of interacting proteins, the shared region where they interact is called the interaction interface, and it has been shown to have an important role in determining the interaction specificity and affinity. In addition, bioinformatics studies have delivered some information that may be used to identify the interface:

1 – First, the interface residues controlling the interaction specificity and affinity are located on the surface. This is particularly important information here, because we can restrict the region of the protein where we search for the interface, if we can distinguish the residues exposed on the surface from those buried in the core of the structure.

2 – Second, functionally and structurally important residues are more highly conserved than the rest of the protein. There is evolutionary pressure to maintain the proper folding of proteins and to maintain their function. The residues responsible for maintaining the proper folding are thought to be located in the core of the structure, while the functional residues are exposed on the surface. So it would be easy to locate the binding interface just by searching for conserved regions exposed on the surface, if it weren't for a complicating factor: on the surface, besides the interface, we find other highly conserved functional residues, like DNA-binding sites and small-ligand binding sites. Although we can distinguish functional from non-functional regions just by looking at evolutionary conservation, conservation alone is not enough to distinguish a protein-protein binding site from, for example, a DNA-binding site.

3 – In addition, systematic analyses of structures and sequences of protein complexes have shown that the interface has a slightly distinct amino acid composition and a typical hydrophobic landscape. Although these two are very subtle signals, I have read about some attempts to predict protein binding sites using this information. In my case, I focused on the first two properties.

Changing the specificity of the protein interaction
[van Dijk AD et al., PLoS Comput Biol. 2010] Sequence Motifs in MADS Transcription Factors Responsible for Specificity and Diversification of Protein-Protein Interaction

Protein-protein binding motifs
[Figure label: interface]

So, continuing with this introduction: when we look at the structure of a protein complex, we see the interface as a continuous stretch of amino acids. However, when we locate the same residues in the protein sequences, we find that the interface is composed of scattered short sequences. For instance, this figure shows two interacting proteins, in red and green, where the interface residues are represented as spheres. In this other figure, the sequences of these proteins are shown, and the sequence of this very same interface is highlighted here in red, and here in green. These are exactly what I call protein-protein binding motifs, and they are what we tried to predict in this work.

These binding motifs are usually thought to be overrepresented in pairs of interacting proteins, which means that when we look at protein-protein interaction networks, overall, we find a scenario similar to this example. Here, every interacting partner of the blue protein contains a grey motif, and every interacting partner of the red protein contains a grey motif. There are indeed cases where interacting partners don't share any sequence similarity, but large protein-protein interaction networks should enable us to detect even small signals of sequence overrepresentation.
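The step from scattered interface residues in the structure to short motifs in the sequence can be sketched in Python. This is only an illustration under stated assumptions: the function names, the 1-based numbering, and the `max_gap` and `min_len` parameters are hypothetical and are not taken from the work presented here.

```python
def interface_segments(positions, max_gap=1, min_len=3):
    """Group 1-based interface residue positions into contiguous segments.

    max_gap=1 means only directly adjacent residues are merged (hypothetical
    default); segments shorter than min_len residues are discarded.
    """
    segments = []
    for pos in sorted(set(positions)):
        if segments and pos - segments[-1][1] <= max_gap:
            segments[-1][1] = pos          # extend the current segment
        else:
            segments.append([pos, pos])    # start a new segment
    return [(start, end) for start, end in segments if end - start + 1 >= min_len]


def segment_sequences(sequence, segments):
    """Return the sub-sequence (candidate motif) for each (start, end) segment."""
    return [sequence[start - 1:end] for start, end in segments]


# Toy example: made-up sequence and interface positions.
toy_sequence = "ACDEFGHIKLMNPQRSTVWYACDEFGHIKLMNPQRSTVWYACDEF"
toy_interface = [3, 4, 5, 20, 21, 22, 23, 40]
toy_segments = interface_segments(toy_interface)
toy_motifs = segment_sequences(toy_sequence, toy_segments)
```

In this toy input the isolated position 40 is dropped, so the interface is represented by two short sequence stretches.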

Protein-protein binding motifs
The main message of these introductory slides is that protein binding interfaces are composed of residues that are highly conserved and exposed on the surface; and, moreover, that the interface can be represented by short sequence motifs, which are thought to be overrepresented in pairs of interacting proteins.

Identification of binding interfaces from structures
[Figure labels: Arabidopsis Histidine Kinase 4; Arabidopsis trans-zeatin; Protein 1; Protein 2; binding interface of Protein 1; binding interface of Protein 2; interface; complex 1-2]
[Hubbard SJ, Thornton JM] Naccess, Atomic Solvent Accessible Area Calculations

Structural information available in the PDB

Sequence- and interactome-based pipeline to locate binding sites in Arabidopsis proteins
Sequences -> evolutionary conservation
Sequences -> residue surface accessibility
Interactome -> overrepresented motifs
Result: motifs that are likely to be exposed on the surface, conserved across species, and overrepresented in pairs of interacting proteins.

So, to predict the binding sites, we proposed a pipeline that uses the protein sequences and the protein-protein interaction network. In our pipeline, we calculate three key pieces of information. From the protein sequences, we calculate the evolutionary conservation: for each protein, we align its sequence with the sequences of orthologs found in the proteomes of closely related species, and from this alignment the conservation of each amino acid is calculated. As we know, the conservation score can be used to distinguish functionally and structurally important residues from the rest of the protein.

In addition, the residue surface accessibility score is also calculated from the protein sequences. This score represents how likely the amino acid is to be exposed on the surface of the structure. As we saw, this information can be used to distinguish conserved structural residues in the core of the structure from functional residues located on the surface. Both pieces of information are integrated in an algorithm we have designed to mine the interactome for overrepresented motifs in pairs of interacting proteins. So, as a result, we expect to find motifs that are likely to be exposed on the surface, conserved across species, and overrepresented in pairs of interacting proteins.

Here is a schematic representation of the pipeline that summarizes what I said: we mine the protein-protein interaction network for those motifs that are overrepresented in pairs of interacting proteins, and thus thought to be those composing the interface. Besides being overrepresented, a motif has to have high evolutionary conservation and also a high surface accessibility score, which means that it is likely to be exposed on the surface.
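The three criteria just described can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm used in the pipeline: the fixed motif length `k`, the thresholds `min_cons` and `min_rsa`, the assumption that both scores are scaled to the range 0 to 1, and all function names are hypothetical. The enrichment here is a plain ratio of how often a motif pair occurs in interacting pairs to how often it occurs in all protein pairs.

```python
from itertools import combinations


def qualifying_kmers(seq, conservation, rsa, k=3, min_cons=0.5, min_rsa=0.25):
    """Return the k-mers whose residues are all conserved and surface-exposed.

    conservation and rsa are per-residue score lists, assumed scaled to 0..1.
    """
    kmers = set()
    for i in range(len(seq) - k + 1):
        if all(conservation[j] >= min_cons and rsa[j] >= min_rsa
               for j in range(i, i + k)):
            kmers.add(seq[i:i + k])
    return kmers


def has_motif_pair(kmers_p, kmers_q, motif_a, motif_b):
    """True if one protein carries motif_a and the other carries motif_b."""
    return ((motif_a in kmers_p and motif_b in kmers_q)
            or (motif_b in kmers_p and motif_a in kmers_q))


def pair_enrichment(kmers_by_protein, interactions, motif_a, motif_b):
    """Frequency of the motif pair in interacting pairs relative to all pairs.

    interactions must be a non-empty list of (protein, protein) tuples.
    Returns 0.0 if the motif pair never co-occurs in any protein pair.
    """
    all_pairs = list(combinations(sorted(kmers_by_protein), 2))
    observed = sum(
        has_motif_pair(kmers_by_protein[p], kmers_by_protein[q], motif_a, motif_b)
        for p, q in interactions
    ) / len(interactions)
    background = sum(
        has_motif_pair(kmers_by_protein[p], kmers_by_protein[q], motif_a, motif_b)
        for p, q in all_pairs
    ) / len(all_pairs)
    if background == 0:
        return 0.0
    return observed / background


# Toy example with made-up proteins and motifs.
toy_kmers = {"P1": {"AAA"}, "P2": {"BBB"}, "P3": {"BBB"}, "P4": {"CCC"}}
toy_interactions = [("P1", "P2"), ("P1", "P3")]
toy_enrichment = pair_enrichment(toy_kmers, toy_interactions, "AAA", "BBB")
```

A ratio well above 1 would flag a motif pair as a candidate; a real analysis would also need a proper significance estimate, which this sketch does not attempt.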

Sequence- and interactome-based pipeline to locate binding sites in Arabidopsis proteins
[Figure: example interaction network with proteins SHY2, IAA16, IAA7, IAA18, TPL, IAA1, IAA2, IAA11]

Sequence- and interactome-based pipeline to locate binding sites in Arabidopsis proteins
Input: FASTA sequences (>Protein sequence1, >Protein sequence2, ..., >Protein sequenceN)
Input: interacting list (Protein1-Protein2, Protein2-Protein4, ..., ProteinN-ProteinM)
Find orthologs for each protein sequence: OrthoMCL; best reciprocal BLAST hit
Calculate conservation score: Al2CO -> Conservation Protein 1, Conservation Protein 2, ..., Conservation Protein N
Predict residue surface accessibility (RSA): SABLE -> RSA Protein 1, RSA Protein 2, ..., RSA Protein N

Assessment of the pipeline's performance
Predicted motifs: interface motifs are True Positives (TP); non-interface motifs are False Positives (FP).
Precision = TP/(TP + FP)

Assessment of the pipeline's performance
Coverage: up to 42%, 22% and 42%, respectively, for the human, yeast and Arabidopsis subsets.
Precision: up to 58%, 96% and 100%.

So, the assessment of our pipeline relies on available structural information, because if we have the structure of a complex, we can directly identify the residues that are located in the interaction interface; therefore we can directly calculate whether the predicted motifs are correctly located in the interface, and also how many of the interface residues have been correctly predicted. Here we have a problem, due to the fact that the amount of structural information available for Arabidopsis interacting proteins is extremely low. So, to solve this problem and create a basis for the statistical assessment of the performance of our pipeline, we used the interactomes of two other species for which much more structural information is available: human and yeast.

In these figures, we see the graphical representation of the human, yeast and Arabidopsis interactomes, respectively in A, B and C. Here, the proteins and interactions for which we have structural information are highlighted in black. Below the graphical representation of the interactomes, we see the numbers of proteins and interactions in the protein-protein interaction networks, and also the numbers of proteins and interactions for which we have structural information.

So, to be fair in assessing our pipeline for large-scale predictions of binding motifs, we ran our pipeline on these three structural datasets and calculated the precision and coverage for different parameter settings. Without going into the details of the statistical assessment, we found that we could predict protein-protein binding motifs with coverage up to 42%, 22% and 42%, respectively, for the human, yeast and Arabidopsis subsets. Likewise, we estimate precisions with values up to 58%, 96% and 100%.
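As a minimal sketch of the two measures, the snippet below computes precision and coverage for one protein. It assumes a residue-level comparison, with predictions and the structure-derived interface both given as sets of residue positions; the slides define precision over predicted motifs, so this granularity, the function name, and the toy numbers are assumptions made for illustration.

```python
def precision_and_coverage(predicted, interface):
    """Compare predicted residue positions with known interface positions.

    precision = TP / (TP + FP): fraction of predicted residues in the interface.
    coverage  = TP / |interface|: fraction of interface residues predicted.
    Both are returned as 0.0 when the corresponding set is empty.
    """
    predicted, interface = set(predicted), set(interface)
    tp = len(predicted & interface)   # predicted and in the interface
    fp = len(predicted - interface)   # predicted but outside the interface
    precision = tp / (tp + fp) if predicted else 0.0
    coverage = tp / len(interface) if interface else 0.0
    return precision, coverage


# Toy example: made-up positions, not data from the talk.
toy_precision, toy_coverage = precision_and_coverage(
    predicted={10, 11, 12, 30},
    interface={10, 11, 12, 13, 14, 15},
)
```

Here three of the four predicted residues lie in the interface, and three of the six interface residues are recovered.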

Locating interaction binding sites in Arabidopsis sequences at a large scale – Overview
Predicted motifs: 1498 interactions among 985 proteins; 36% of the proteins in the interactome and ~5.5% of all Arabidopsis proteins.
Validation and bioinformatics analysis

Once we had assessed the performance of our pipeline, I went for the predictions in the complete Arabidopsis interactome data. In this figure we have the graphical representation of the Arabidopsis interactome, as published last year in Science. Here, in black, I highlighted those proteins and interactions to which we could map a predicted motif. Overall, we could predict motifs that explain 1498 interactions among 985 proteins, which represents 36% of the proteins in the published interactome data and about 5.5% of all Arabidopsis proteins. As far as I know, this is the largest-scale prediction ever done, and not only for Arabidopsis proteins. So, next I'll show what I've seen in these predictions so far.

Comparison with single nucleotide polymorphism (SNP) data
[Figure labels: nsSNPs; protein sequence; predicted protein-protein binding sites; functional constraints; intermolecular coevolution]
nsSNPs (protein sequence): 2.2% > nsSNPs (binding sites): 1.6%

Since we can't count on 3D-structure availability to assess our large-scale predictions of binding sites in Arabidopsis proteins, we decided to analyse whether there is evolutionary evidence for the functional importance of the predictions. If our predicted interaction sites are indeed functionally important, we would expect less variability at their positions compared to the rest of the protein sequence. To test this hypothesis, we calculated the percentage of predicted interface residues in which a non-synonymous SNP (nsSNP) is found (1.6%), in comparison with the percentage of all protein residues in which an nsSNP is found (2.2%). The difference between these values suggests that the predicted sites are under stronger evolutionary constraints than the rest of the protein sequence. We have not assessed the significance yet, but we believe that this difference is really significant.

In addition, we statistically observed that those few proteins in which a non-synonymous SNP overlaps a predicted binding site have a tendency to interact with other proteins in which a non-synonymous SNP is also found in a binding site, which is consistent with the intermolecular co-evolution model. This co-evolution model states that most interactions are conserved within a species, and that, in order to maintain the stability of an interaction, mutations in the binding site of one protein can be compensated by mutations in its interacting partners. The results we found here may suggest a tendency for interface residues to evolve coordinately, which would add support for the functional importance of our predicted binding sites.
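One way the nsSNP comparison could be checked for significance is a hypergeometric test, sketched below in Python. This is a suggestion, not something done in the work presented: the talk reports only the two percentages, so the residue counts a real test needs are not available here, and the function names and the choice of test are assumptions.

```python
from math import comb


def snp_fraction(n_with_snp, n_residues):
    """Fraction of residues that carry a non-synonymous SNP."""
    return n_with_snp / n_residues


def hypergeom_lower_tail(k, K, n, N):
    """P(X <= k) for a hypergeometric variable X.

    N: all residues; K: residues carrying an nsSNP;
    n: residues in predicted binding sites;
    k: binding-site residues that carry an nsSNP.
    A small value means binding sites carry fewer nsSNPs than expected
    if nsSNPs were spread at random over the sequences.
    """
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k + 1)) / total


# Tiny made-up counts, only to show the call.
toy_p = hypergeom_lower_tail(k=0, K=5, n=2, N=10)
```

With real counts, `snp_fraction` would reproduce the two percentages on the slide and `hypergeom_lower_tail` would give a one-sided p-value for the depletion.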

Comparison with annotation of amino acid mutagenesis
[Figure labels: protein sequence; protein-protein binding sites; DNA-binding sites; other functionally important sites; amino acid mutagenesis]
Proteins with a predicted motif (n=985); mutagenesis annotation in UniProt (n=38); 16 cases: predicted motif overlaps the mutated amino acid.

So, first we tried to assess the overall relevance of our predictions, and then we looked at each case. To do so, we looked at the annotation of available amino acid mutagenesis data for the proteins for which we have predicted a binding site. Usually, these amino acid mutagenesis experiments involve residues that are located in the active site, which in a certain number of cases corresponds to the protein-protein interaction site. Of the 985 proteins for which we have predicted a binding site, only 38 have annotated amino acid mutagenesis experiments. Out of these 38, for 16 proteins the predicted binding site overlaps a mutated amino acid. This is significantly more than would be expected at random, and again suggests that our predictions are functionally relevant.
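The overlap count described here can be sketched as follows. The data layout is hypothetical: predicted sites as inclusive (start, end) ranges per protein and mutagenesis annotations as lists of mutated positions per protein. The toy entries are made up, and parsing of the actual UniProt annotation is not shown.

```python
def site_overlaps_mutation(sites, mutated_positions):
    """True if any mutated position falls inside any (start, end) site."""
    return any(start <= pos <= end
               for start, end in sites
               for pos in mutated_positions)


def count_overlapping_proteins(predicted_sites, mutagenesis):
    """Return (proteins with an overlap, proteins having both kinds of data)."""
    shared = sorted(predicted_sites.keys() & mutagenesis.keys())
    hits = sum(site_overlaps_mutation(predicted_sites[p], mutagenesis[p])
               for p in shared)
    return hits, len(shared)


# Toy example with made-up proteins and positions.
toy_sites = {"P1": [(10, 15)], "P2": [(40, 44)], "P3": [(5, 9)]}
toy_mutagenesis = {"P1": [12], "P2": [100], "P4": [7]}
toy_hits, toy_annotated = count_overlapping_proteins(toy_sites, toy_mutagenesis)
```

On the real data, the two returned numbers would correspond to the 16 and 38 reported on the slide.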

Some interesting cases

Master's Project Proposal: Cross-species analysis of protein-protein binding motifs

Questions?

Practical assignment – Perl scripting for
