Predicting Active Site Residue Annotations in the Pfam Database

Authors: Jaina Mistry; Alex Bateman; Robert D Finn (authors of this paper and of the Pfam database)
Publication date: 9 August 2007
Published in: BMC Bioinformatics
Presentation by: KEYUR MALAVIYA

TOPICS COVERED
Introduction
Background
Construction and content
Data output and file formats (Utility)
Transfer of experimental data within Pfam alignments
  UniProtKB data
  CSA data
Assessing sensitivity and specificity
Comparison:
  Comparing Pfam to PROSITE
  Comparing Pfam with MEROPS
  PROSITE with MEROPS
Conclusion

Introduction
Goal of this paper: To increase the number of active site annotations
Approach: A strict set of rules is chosen to reduce the rate of false positives. These rules enable the transfer of experimentally determined active site residue data to other sequences within the same Pfam family
Results:
  Only 3% of predicted sequences are false positives
  94% of the predicted active site residues are not found in UniProtKB
  The tool developed for transferring the data can be applied to any alignment with associated experimental active site data and is available for download
  This tool is useful in proteome annotation, comparative genomics, protein evolution and active site characterization

Background: Predicting Active Site Residue Annotations in the Pfam Database

Background: Pfam Database
Pfam is a collection of protein families and domains
Pfam contains multiple protein alignments and profile-HMMs of these families
Function: To view the domain organization of proteins
74% of protein sequences have at least one match to Pfam (sequence coverage is 74%)
5% of Pfam families are enzymatic
Of the sequences in these families, only a small fraction (<0.5%) have had the residues responsible for catalysis determined
The structure and chemical properties of these residues (the active site) determine the chemistry of the enzyme

Background: Active site
Active site: The active site of an enzyme contains the catalytic and binding sites
Binding site: A region on a protein (also DNA or RNA) to which specific other molecules and ions, called ligands, bind
Ligand: Binds to and forms a complex with a biomolecule to serve a biological purpose, i.e. it is an effector molecule binding to a site on a target protein
Enzymes:
  Control the flow of metabolites within a cell
  Catalyze virtually all reactions that make or modify molecules

TO DO: Information about other databases:
NCBI BLAST:
Catalytic Site Atlas (CSA):
UniProtKB:
PROSITE:
SMART and MEROPS:

The problem and the solution
Pfam[1] release 20.0: 8296 protein families
Active site residues experimentally determined: only ~0.4% of sequences in enzymatic Pfam families
Need to overcome the lack of experimental data
HOW? Computationally predict active sites in protein sequences
Two broad categories:
1) computational methods that transfer experimentally characterized active site data by similarity
2) those that predict active site residues ab initio
ab initio methods exploit known properties such as:
  Active sites are usually found buried within a cleft of a protein
  Mutations in them can often increase the stability of an enzyme
  Active site residues are highly conserved

TO DO: ab initio methods:
Geometry data, stability profiles and sequence conservation
Evolutionary trace (ET)
Neural networks [19] and support vector machines [20, 21]
All have a relatively high rate of false positives (FPs)

TO DO: Similarity transfer based methods:

Where we are: Introduction Background Construction and content

Construction and content:
The Pfam database is renowned for having no known false positives in its alignments
In keeping with this: a set of rules allows conservative transfer of active site annotation from one protein to another protein in the same Pfam alignment
To predict active site residues:
  identify sequences with experimentally verified active site residues
  use this information to predict active site residues in other members of that family
The steps of this rule-based methodology are given next. Steps 1 and 2 are already present in Pfam

Logic of the rule based methodology
Step 1: Find a homologous set of proteins and generate a protein alignment

Step 2: Identify the positions of all experimentally verified active sites in the alignment

Seq1 contains 3 experimental active sites (D, E & H)
Seq2 contains 2 experimentally defined active site residues (D & E)
Apply step 3: H in seq2 is predicted to be an active site residue

Active site pattern: D in column 13, E in column 43 and H in column 45.
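The step 3 transfer for a partially annotated sequence can be sketched as follows. This is only an illustration of the idea on the slides, not the Pfam tool itself: the helper names (`make_row`, `extend_partial`), the 50-column toy alignment and the alanine filler residues are all assumptions made for the example, and columns are taken to be 1-based.

```python
def make_row(length, residues):
    """Build a toy aligned row of 'A's with residues at 1-based columns."""
    row = ["A"] * length
    for col, aa in residues.items():
        row[col - 1] = aa
    return "".join(row)


def extend_partial(row, own_sites, reference):
    """Predict extra active sites in a partially annotated sequence.

    own_sites: columns of this sequence's experimental active sites.
    reference: {column: residue} experimental sites of another sequence.
    A column is transferred only if the aligned residue type is identical.
    """
    return sorted(
        col
        for col, aa in reference.items()
        if col not in own_sites and row[col - 1] == aa
    )


# Seq1: experimental D, E and H in columns 13, 43 and 45.
seq1_sites = {13: "D", 43: "E", 45: "H"}
# Seq2: experimental D and E only, but an H aligned in column 45.
seq2_row = make_row(50, {13: "D", 43: "E", 45: "H"})
seq2_sites = {13, 43}

predicted = extend_partial(seq2_row, seq2_sites, seq1_sites)
print(predicted)  # [45]
```

If seq2 carried a different residue in column 45, nothing would be transferred, which reflects the conservative intent of the rules.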

Each unannotated sequence in the alignment is analyzed to see if it contains an exact match to the active site pattern
Seq1 and Seq2 now each contain 3 active sites (D, E & H)
Seq3 contains residues D, E & H in the active site residue columns
Apply step 5: D, E & H in seq3 are predicted to be active site residues
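The exact-match rule described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the 50-column toy rows and the extra sequence `seq4` (which carries N instead of H) are hypothetical and exist only to show a non-match.

```python
def make_row(length, residues):
    """Build a toy aligned row of 'A's with residues at 1-based columns."""
    row = ["A"] * length
    for col, aa in residues.items():
        row[col - 1] = aa
    return "".join(row)


def matches_pattern(row, pattern):
    """True only if every pattern column holds exactly the pattern residue."""
    return all(row[col - 1] == aa for col, aa in pattern.items())


def predict_exact(alignment, pattern, annotated):
    """Return {name: pattern columns} for unannotated rows that match."""
    return {
        name: sorted(pattern)
        for name, row in alignment.items()
        if name not in annotated and matches_pattern(row, pattern)
    }


pattern = {13: "D", 43: "E", 45: "H"}
alignment = {
    "seq1": make_row(50, pattern),
    "seq2": make_row(50, pattern),
    "seq3": make_row(50, pattern),
    "seq4": make_row(50, {13: "D", 43: "E", 45: "N"}),  # H replaced by N
}

predictions = predict_exact(alignment, pattern, annotated={"seq1", "seq2"})
print(predictions)  # {'seq3': [13, 43, 45]}
```

Because the pattern never contains a gap character, a sequence with a gap in any active site column fails the match without special handling.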

TO DO: Logic of the rule based methodology
When there are two distinct experimentally determined active site patterns within a family, each unannotated sequence is compared against each pattern as before.
There are cases where an unannotated sequence matches more than one active site pattern.
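A family with more than one active site pattern can be handled by testing each unannotated sequence against every pattern. The sketch below only reports which patterns match; the slide does not say how a multiple match is resolved, so no tie-breaking rule is assumed here. The pattern names, columns and 10-column toy sequences are invented for illustration.

```python
def matching_patterns(row, patterns):
    """Return the names of every pattern the aligned row matches exactly."""
    return [
        name
        for name, pattern in patterns.items()
        if all(row[col - 1] == aa for col, aa in pattern.items())
    ]


# Two hypothetical patterns in one family (1-based columns).
patterns = {
    "pattern_a": {2: "D", 5: "E", 7: "H"},
    "pattern_b": {2: "D", 9: "S"},
}

# Hypothetical unannotated sequences from the same alignment.
rows = {
    "seq_x": "ADAAEAHAAA",  # matches pattern_a only
    "seq_y": "ADAAEAHASA",  # matches both patterns
    "seq_z": "AGAAEAHASA",  # matches neither (G in column 2)
}

results = {name: matching_patterns(row, patterns) for name, row in rows.items()}
for name, hits in results.items():
    print(name, hits)
```

Returning all matches rather than the first one keeps the ambiguous cases visible, so that whatever resolution rule is chosen can be applied afterwards.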

TO DO: To TEST this rule based methodology:
8296 alignments from Pfam 20.0 and experimentally verified active site residues from two different databases, UniProtKB and CSA, were used
The results were compared to the predictions of each database

TO DO: Data output and file formats

Transfer of UniProtKB experimental data within Pfam alignments
Data used: UniProtKB 8.0 (2735 experimentally determined active site annotations) and the alignments in Pfam 20.0
Pfam predicts 606,110 active site residues
UniProtKB predicts 45,685 active site residues
Overlap of predicted active site residue annotation between Pfam-predicted and UniProtKB
Why could the remaining 23% (10,312 residues) not be predicted?
55% (5601) of these 23% were found in Pfam alignments that did not contain an experimental UniProtKB active site residue at that position

Transfer of UniProtKB experimental data within Pfam alignments
Our predictions are based on transferring known experimental data within a Pfam alignment, so residues without such data in their alignment cannot be predicted
This accounts for 55% (5601) of the 23%: they were found in Pfam alignments that did not contain an experimental UniProtKB active site residue at that position

Transfer of UniProtKB experimental data within Pfam alignments …
TO DO:
A substantial proportion (94%) of our active site predictions are not present in UniProtKB.
This is because UniProtKB only makes predictions for sequences in UniProtKB/Swiss-Prot, whereas we also make predictions for the automatically generated UniProtKB/TrEMBL entries.
Comparing the active site residue predictions for UniProtKB/Swiss-Prot alone, our methodology yields additional active site predictions beyond those made by UniProtKB.
In the reverse comparison of UniProtKB against Pfam, UniProtKB contains only 6% of the active site information contained within Pfam.

Transfer of CSA experimental data within Pfam alignments … TO DO:
CSA predicts 5517 active site annotations
Pfam predicts 3523 active site annotations
Analysis revealed: for 1376 residues (49% of the cases) there were no CSA experimental active sites within the Pfam alignments
