Functional Site rule: tags active site, binding, other residue-specific information. Functional Annotation rule: gives name, EC, other activity-specific information.

The PIRSF Protein Classification System as a Basis for Automated UniProt Protein Annotation
pir.georgetown.edu/pirsf

Swiss Institute of Bioinformatics (SIB)
European Bioinformatics Institute (EMBL-EBI)
Protein Information Resource (PIR)

Functional Site rule: tags active site, binding, other residue-specific information
Functional Annotation rule: gives name, EC, other activity-specific information

UniProt is mainly supported by the National Institutes of Health (NIH) grant 2 U01 HG. Additional support for the EBI's involvement in UniProt comes from the European Commission contract FELICS (021902) and from the NIH grant 5 P41 HG. UniProtKB/Swiss-Prot activities at the SIB are supported by the Swiss Federal Government through the Federal Office of Education and Science. PIR activities are also supported by the NIH grants for NIAID proteomic resource (HHSN C) and grid enablement (NCI-caBIG-ICR), and National Science Foundation grants for protein ontology (ITR) and BioTagger (IIS).

Correct protein annotation relies on both global (whole protein) and local (domain and motif) sequence similarities. We have developed a method by which annotation of site-specific features can be confidently propagated from experimentally characterized proteins to uncharacterized proteins. The method relies upon rules that identify the specific amino acids in a protein chain eligible for tagging with appropriate information. Rules are specific for a particular protein family, and rely upon the identification of active site, binding site, modified or other functionally important residues in a template sequence.

A general approach for functional characterization of an unknown protein is to infer function based on similarity to a "best-hit" protein in sequence databases. This powerful method is nonetheless susceptible to error. In part, these errors can be avoided by using a curated, hierarchical, whole-protein classification database. The advantage is conferred by "strength in numbers." Instead of relying on the (hopefully) accurate annotation of a single (hopefully related) protein (usually, the BLAST best hit), using curated classification databases allows reliance on the collected wisdom of multiple proteins, or at least the assurance that the members are truly related.

The annotation power of protein classification databases is even greater if a single database contains families with progressively greater levels of similarity (that is, hierarchies). Theoretically, one query could be confidently predicted to be a member of a parent family, but not a child family, while a different query might be confidently assigned to both levels. In such cases, the most specific possible annotation could be propagated.

Despite these precautions, use of protein classification databases cannot resolve one particular source of error: asserting that a query protein has a particular enzymatic activity, even though it lacks the specific residues responsible for that activity. This is because all current classification algorithms rely on global similarity to make functional inferences.

Here we describe a more robust method, PIR Site Rules, for inferring function of uncharacterized proteins. The method has the added benefit of being able to explicitly flag the residues important for a given activity (feature [FT] line), and allowing a rule-based annotation of other data fields, such as protein name (definition [DE] line). After a brief summary, the case of ATP- and PPi-dependent phosphofructokinases will be presented.
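The core idea of a site rule described above (propagate annotation only when the query carries every required residue at the positions defined on a template) can be illustrated with a minimal Python sketch. It is not PIR's implementation: the function names, the toy sequences, and the use of a ready-made pairwise alignment in place of a site HMM alignment are all assumptions made for illustration.

```python
def template_to_column(aligned_template):
    """Map 1-based template residue positions to 0-based alignment columns."""
    mapping = {}
    position = 0
    for column, char in enumerate(aligned_template):
        if char != "-":          # gaps do not consume a template position
            position += 1
            mapping[position] = column
    return mapping


def check_site_rule(aligned_template, aligned_query, required):
    """Check a query against a (hypothetical) site rule.

    required maps a template position to the set of residues allowed there.
    Returns (passed, details), where details maps each position to
    (query_residue, matches_rule).
    """
    if len(aligned_template) != len(aligned_query):
        raise ValueError("aligned sequences must have the same length")
    column_of = template_to_column(aligned_template)
    details = {}
    for position, allowed in required.items():
        residue = aligned_query[column_of[position]]
        details[position] = (residue, residue in allowed)
    # Annotation is propagated only if every required residue is present.
    passed = all(ok for _, ok in details.values())
    return passed, details


# Toy rule (made up): cysteine at template position 3, histidine at 4,
# aspartate at 5. The template has one gap column in this alignment.
toy_rule = {3: {"C"}, 4: {"H"}, 5: {"D"}}
passed, details = check_site_rule("MKC-HDG", "MKGAHDG", toy_rule)
print(passed)    # False: the query has G where the rule calls for C
print(details)
```

In this toy case two of the three site residues match, yet nothing would be propagated, because the rule is all-or-nothing. A query gap at a rule position also fails, since "-" is never in the allowed set.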
Summary

Position-Specific Features:
– active sites
– binding sites
– modified amino acids

Current requirements:
– at least one PDB structure
– experimental data on functional sites

Rule Definition:
– Select template and align with PIRSF seed members
– Edit MSA: retain conserved regions covering all site residues
– Build Site HMM from concatenated conserved regions

Algorithm

Match Rule Conditions:
– Membership Check (ensures that the annotation is appropriate)
– Conserved Region Check (site HMM threshold)
– Site Residue Check (all position-specific residues in HMMAlign)

Propagate Information:
– Feature annotation using controlled vocabulary
– Evidence attribution (experimental/computational)
– Attribute sources and strengths of evidence

Figure 1. Annotate carefully. Annotation is propagated only if all the required residues are present. A protein (P29780) fails the alignment test (red oval), since the rule calls for a cysteine at that position (the query has a glycine). The other two residues (green ovals) are a match. Nonetheless, no information will be propagated (red arrow).

The Phosphofructokinase (PFK) Case: ATP versus PPi dependence

Figure 2. Importance of annotation based on site-specific features. Rule definition begins with knowledge about residues important for catalytic activity or binding. PFK is a key regulatory enzyme in the Embden-Meyerhof glycolytic pathway. Classification of PFK proteins revealed that major functional specialization can occur as a result of even a single amino-acid residue change. Two amino acid positions (105 and 125, E. coli numbering) are critical determinants of ATP or PPi utilization (boxes), with one position especially key (arrow). The ability to use ATP depends on the presence of a glycine at the indicated position. Accurate propagation of protein function therefore depends on crafting rules that take advantage of the ability to distinguish between these possibilities, as illustrated in the next figure.

Figure 3. PIR Site Rule (PIRSR) definition. Important residues on a template sequence are indicated, along with the appropriate annotation if a query passes all match conditions. Two rules regulate the annotation of ligand. A match to rule 4 means the query is ATP-dependent, while a match to rule 5 means the query is PPi-dependent.

Figure 4. Global similarity check. A Leifsonia protein was tested against HMMs for protein families. Q6AG22 matches PIRSF000532, a family that contains mostly ATP-PFKs, but also a few PPi-PFKs.

Figure 5. Further confirmation. All the proteins hit by Q6AG22 using BLAST are members of PIRSF (only partial results are shown), hence the protein was added to this family. Note that the best characterized matches are PPi-PFKs, but…

Figure 6. Looks can be deceiving. Iterative clustering using BlastClust makes the initial observation that the query (red arrow) might be a PPi-PFK (blue arrows) less certain.

Figure 7. Motif check. The query protein was tested against each of the rules governing ligand binding. Rule 4 (bottom) stipulates that annotation of ATP dependence requires a G-G combination in key positions. The query passes for only the first position, and thus fails the test. Rule 5 (top) stipulates that annotation of PPi dependence requires a D-K combination in key positions. The query passes for only the second position, and thus fails the test. What to do?

Q9KH71  DALIAIGGEDTLGVASKFSKLGLPMIGVPKTIDKD
query   DAIIAIGGEGTLTAARRLTDAGLRIVGVPKTIDND
P0A796  DALVVIGGDGSYMGAMRLTEMGFPCIGLPGTIDND
        **::.***:.:   * :::. *:  :*:* ***:*

Figure 8. Functional variation within one protein family: binding sites with different specificity drive choice of applicable rule to ensure appropriate annotation. Members of the phosphofructokinase (PFK) family evolved into ATP- or PPi-dependent forms. While propagating the name "Phosphofructokinase" to all members would not be inaccurate, it fails to take full advantage of current knowledge. The residues that contribute to each dependency are known (see Functional Site rules PIRSR and PIRSR ). Therefore, DE line annotation should depend on which site rule "fits," if any. The schematic indicates the tests and results that occur to propagate name information. Members of PIRSF are tested against the relevant site rules (black arrows). A positive result for the ATP-dependency rule (green arrow) means that the entry would be named "ATP-dependent phosphofructokinase," according to the PIR Name Rule (PIRNR) PIRNR , while a positive result for the PPi-dependency rule (blue arrow) means that the entry would be named "Pyrophosphate-dependent phosphofructokinase" (PIRNR ). Note that failure to match either one does not mean the activity is missing. Thus, a fall-back rule can be created (the "zero rule," dotted red arrow) that would propagate simply "Phosphofructokinase" without any qualifier. In the case of the query Q6AG22, having failed both specificity rules, the zero rule would apply.

Another possibility. Most ATP-PFKs have the G-G combination, while most PPi-PFKs have the D-K combination. Hence, only rules for these two possibilities were written. However, recent evidence indicates that the G-K combination also functions as an ATP-PFK. Thus, a new rule can be written to cover this scenario, and accordingly Q6AG22 would be annotated as an ATP-PFK.

Conclusion

Critical to our understanding of biology is accurate and up-to-date information. The process of evolution affords us the ability to make inferences about the nature of the proteins that govern biological processes, since like proteins often perform like (if not exact) functions. Unfortunately, this same process has been far from a smooth transition from state to state. The result is that inferences made about one protein based on similarity to another protein using automated methods are often suspect. This is more than a mere annoyance. The lack of rigorous methods for propagating appropriate information hampers knowledge discovery by either reducing the associations that can be made, or by producing associations that should not be made. However, the recent development of methods for better annotation holds much promise for preventing, and even reversing, the previous trend toward rampant misinformation. The combination of hierarchical, whole-protein classifications and rule-based large-scale annotation pipelines is a significant step in the right direction.

Site Rules Status
– 301 PIR site rules covering 168 PIRSFs have been defined. Site information was imported from the Catalytic site residue dataset and Catalytic Site Atlas.
– 32 PIR site rules covering 19 PIRSF families have been manually curated and submitted to SIB for comments and suggestions.
– The SIB suggestions will be incorporated, and logfiles for annotation propagation to Swiss-Prot entries not already in HAMAP will be submitted to SIB.

Reference
D.A. Natale, C.R. Vinayaka, and C.H. Wu. Large-scale, classification-driven, rule-based functional annotation of proteins. Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics, Bioinformatics Volume, Subramaniam, S. (Ed.), John Wiley & Sons, Ltd.
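The name-propagation logic of the PFK case (specific name if a site rule fits, otherwise the "zero rule") can be sketched in Python using the alignment fragment from Figure 7. This is only an illustration under stated assumptions: the column indices 9 and 29 are inferred from the fragment (the two columns where the three sequences show the G/D and G/K differences, 20 residues apart like positions 105 and 125), and the data structures and function names are hypothetical, not PIR's rule format.

```python
# Alignment fragment as shown in Figure 7.
ALIGNMENT = {
    "Q9KH71": "DALIAIGGEDTLGVASKFSKLGLPMIGVPKTIDKD",
    "query":  "DAIIAIGGEGTLTAARRLTDAGLRIVGVPKTIDND",
    "P0A796": "DALVVIGGDGSYMGAMRLTEMGFPCIGLPGTIDND",
}

# Assumption: 0-based fragment columns holding the two key positions.
KEY_COLUMNS = (9, 29)

# Residue combination required by each specificity rule, with the name
# that would be propagated on a match.
NAME_RULES = [
    (("G", "G"), "ATP-dependent phosphofructokinase"),
    (("D", "K"), "Pyrophosphate-dependent phosphofructokinase"),
]
ZERO_RULE_NAME = "Phosphofructokinase"  # fall-back without qualifier


def key_residues(aligned_sequence):
    """Return the residues found at the two key columns."""
    return tuple(aligned_sequence[column] for column in KEY_COLUMNS)


def propagate_name(aligned_sequence, rules=NAME_RULES):
    """Return the specific name if a rule fits, else the zero-rule name."""
    combination = key_residues(aligned_sequence)
    for required, name in rules:
        if combination == required:
            return name
    return ZERO_RULE_NAME


for accession, sequence in ALIGNMENT.items():
    print(accession, key_residues(sequence), propagate_name(sequence))

# Adding a rule for the G-K combination, as suggested under
# "Another possibility", changes the outcome for the query.
extended_rules = NAME_RULES + [(("G", "K"), "ATP-dependent phosphofructokinase")]
print(propagate_name(ALIGNMENT["query"], extended_rules))
```

With the two original rules the query (G-K) falls through to the zero rule; with the extra G-K rule it receives the ATP-dependent name. Keeping the rules as data rather than code means a new combination can be covered by adding one entry.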

UniProt Non-redundant Reference Cluster (UniRef) Databases

Swiss Institute of Bioinformatics (SIB)
European Bioinformatics Institute (EMBL-EBI)
Protein Information Resource (PIR)

UniProt Non-redundant Reference Cluster (UniRef) databases, UniRef100, UniRef90 and UniRef50, are automatically generated from UniProt Knowledgebase and selected UniParc records. The databases provide complete coverage of sequence space while hiding redundant sequences from view. The non-redundancy allows faster sequence similarity searches by using UniRef90 and UniRef50.

Input sequences:
– UniProtKB Sequences
– UniProtKB Isoform Sequences
– Selected UniParc Sequences from ENSEMBL, RefSeq and PDB databases

Processing steps, in order, leading to a UniRef Release:
– String Comparison: identifying sub-fragments and identical sequences
– CD-HIT computation: clustering UniRef100 representative sequences at 90% level
– CD-HIT computation: clustering UniRef90 representative sequences at 50% level
– Generating data files for distribution

UniRef100: Identical sequences and sub-fragments with 11 or more residues are placed into a single record.

UniRef90: Members of related UniRef100s at 90% level form a UniRef90 cluster. The representative is selected based on the quality of the entry, name, organism and sequence length. Title and identifier are derived from the representative sequence.

UniRef50: Members of related UniRef90s at 50% level form a UniRef50 cluster. The representative is selected based on the quality of the entry, name, organism and sequence length. Title and identifier are derived from the representative sequence.

Size reduction:
– UniRef90: 40% size reduction
– UniRef50: 65% size reduction

FASTA file:

>UniRef90_P00439 Phenylalanine-4-hydroxylase related cluster
MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV
NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPW
FPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRHGQPIPRVEYM
EEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGF
RLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA
QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCLSE
KPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQR
IEVLDNTQQLKILADSINSEIGILCSALQKIK

XML file: the same cluster (Phenylalanine-4-hydroxylase related cluster, same sequence) is also distributed as a UniRef90 XML record.

UniRef Usages
● Speeding up similarity searches
● Reducing bias in homology searches by providing more even sequence space
● Using the clusters for family classification
● Using the clusters to annotate EST and other sequence databases
● Using the clusters to check the consistency of UniProtKB annotations
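The UniRef100 criterion stated above (identical sequences and sub-fragments of 11 or more residues share one record) can be illustrated with a small Python sketch based on plain string comparison. It is a toy under stated assumptions, not the UniRef pipeline: the representative is simply the longest sequence, whereas UniRef selects representatives by entry quality, name, organism and length, and the 90% and 50% CD-HIT clustering steps are not modelled. All identifiers are made up; the first sequence is a prefix of the P00439 example, the rest are invented.

```python
MIN_FRAGMENT_LENGTH = 11  # sub-fragments shorter than this stay separate


def uniref100_like_clusters(sequences):
    """Group identical sequences and sub-fragments (>= 11 residues).

    sequences maps an identifier to its amino-acid sequence.
    Returns a list of {"representative": id, "members": [ids]} records.
    """
    # Process longest sequences first so fragments meet their parent.
    ordered = sorted(sequences.items(), key=lambda item: (-len(item[1]), item[0]))
    clusters = []
    for seq_id, seq in ordered:
        for cluster in clusters:
            representative = sequences[cluster["representative"]]
            identical = seq == representative
            fragment = len(seq) >= MIN_FRAGMENT_LENGTH and seq in representative
            if identical or fragment:
                cluster["members"].append(seq_id)
                break
        else:
            # No existing record fits: start a new one.
            clusters.append({"representative": seq_id, "members": [seq_id]})
    return clusters


toy_sequences = {
    "seqA": "MSTAVLENPGLGRKLSDFGQETSYIEDN",  # prefix of the P00439 example
    "seqB": "MSTAVLENPGLGRKLSDFGQETSYIEDN",  # identical to seqA
    "seqC": "GLGRKLSDFGQE",                  # 12-residue fragment of seqA
    "seqD": "GLGRKLSDF",                     # 9 residues: below the cutoff
    "seqE": "MKKLLPTAAAGLLLLAAQPAMA",        # unrelated, invented
}

for record in uniref100_like_clusters(toy_sequences):
    print(record["representative"], record["members"])
```

Five input sequences collapse into three records here, which is the effect behind the size reductions and faster searches listed above. The pairwise substring scan is quadratic and only suitable for a handful of sequences.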