
Anastasia Nikolskaya Lai-Su Yeh Protein Information Resource Georgetown University Medical Center Washington, DC PIR: a comprehensive resource for functional analysis of protein sequences and families

2 PIR Web Site: NEW web site, soon to become public (currently an old version). PIR and UniProt web sites are interlinked and cross-navigable. PIR-specific features: Text Search, Sequence Search, Classification Database Search

3 iProClass Protein Knowledgebase: Integration of protein family, function, structure. Rich links (executive summary + hypertext links) to > 90 databases. Value-added reports for 1.96 million UniProtKB protein entries

4 Example: Want to find info on chorismate mutases. Specifically: start with Bacillus subtilis P19080 (CHMU_BACSU). Relatedness to other chorismate mutases: homology; domain architecture; is it related to E. coli P07022, a well-studied bifunctional enzyme (P-protein, chorismate mutase/prephenate dehydratase)?

5 iProClass Sequence Report

6 Protein Analysis: I. Text Search (iProClass). What can we find about “chorismate mutase”?

7 Text Search Results (I) UniProt ID

8 Text Search Results (II) Display options: add or remove columns

9 Text Search Results (III) Find chorismate mutase(s) from B. subtilis

10 Determining Protein Homology: Is B. subtilis CM P19080 homologous to E. coli P-protein P07022? To B. subtilis AroA(G) P39912? Which domains, if any, in multidomain chorismate mutases does it correspond to? What kinds of domain architecture exist in chorismate mutases?

11 Retrieve Proteins by UID in Batch Mode ID mapping option: can use various non-UniProt IDs Batch Retrieval

12 Determining Protein Homology: Sequence Search BLAST FASTA SSearch

13 BLAST Search Results: a BLAST query with UniProt sequence P19080 returns PIRSF family members as best hits

14 Pre-compiled Related Sequences: saves time

15 BLAST/SSEARCH Results SSEARCH Alignment BLAST Alignment

16 Determining Protein Homology: Peptide Search

17 Peptide Search Results

18 Family Classification System: One-Stop Platform for Protein Analysis. Protein families reflect evolutionary relationships. Function often follows along the family lines. Therefore, matching a protein sequence to a protein family provides information about the protein (this needs a highly curated and annotated family). It is faster and often more accurate than searching against a protein database. Protein classification facilitates sequence and functional analysis of proteins and is used for accurate automatic annotation (PIRSF is used for UniProt annotation)

19 PIRSF Classification System. PIRSF: reflects evolutionary relationships of full-length proteins. Definitions: Basic unit = Homeomorphic Family. Homologous: inferred by sequence similarity. Homeomorphic: full-length sequence similarity and common domain architecture. Hierarchy: flexible number of levels with varying degrees of sequence conservation. Network Structure: multiple domain parents. Advantages: annotation of both generic biochemical and specific biological functions; accurate propagation of annotation and development of standardized protein nomenclature and ontology

20 PIRSF Classification System A protein may be assigned to only one homeomorphic family, which may have zero or more child nodes and zero or more parent nodes. Each homeomorphic family may have as many domain superfamily parents as its members have domains.
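The membership and parent/child rules on this slide can be sketched as a small data structure. This is only an illustrative Python sketch, not PIR's implementation: the class names, the `kind` labels, and the node and protein names in the usage note are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of the classification network (hypothetical model)."""
    name: str
    kind: str  # e.g. "homeomorphic_family", "subfamily", "domain_superfamily"
    parents: set = field(default_factory=set)
    children: set = field(default_factory=set)


class Classification:
    """Toy network: nodes may have zero or more parents and children."""

    def __init__(self):
        self.nodes = {}
        self.family_of = {}  # protein ID -> name of its homeomorphic family

    def add_node(self, name, kind):
        self.nodes[name] = Node(name, kind)

    def link(self, parent, child):
        # A node may collect several parents, so this is a network, not a tree.
        self.nodes[parent].children.add(child)
        self.nodes[child].parents.add(parent)

    def assign(self, protein, family):
        # Rule from the slide: a protein belongs to only one homeomorphic family.
        if self.nodes[family].kind != "homeomorphic_family":
            raise ValueError(f"{family} is not a homeomorphic family")
        current = self.family_of.get(protein)
        if current is not None and current != family:
            raise ValueError(f"{protein} is already assigned to {current}")
        self.family_of[protein] = family
```

For example, a family of two-domain proteins (hypothetical names) can be linked to two domain superfamily parents with `link("DomainSF_A", "Family_X")` and `link("DomainSF_B", "Family_X")`, while a second `assign` of the same protein to a different family raises `ValueError`.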

21 [Flowchart of the PIRSF classification and curation workflow. Protein sets shown: Unclassified UniProtKB proteins; Uncurated Homeomorphic Clusters; Orphans; Preliminary Homeomorphic Families; Final Families, Subfamilies, Superfamilies; Unassigned Proteins; New Proteins. Steps shown: Automatic Clustering; Computer-assisted Manual Curation; Automatic Procedure; Automatic Placement; Map Domains on Clusters; Merge/Split Clusters; Add/Remove Members; Name, Refs, Abstract, Domain Arch.; Hierarchies (Superfamilies/Subfamilies); Protein Name Rules/Site Rules; Build and Test HMMs]

22

23 Tool: Curator’s Decision Maker

24 Classification Tool: BlastClust. Curator-guided clustering. Single-linkage clustering using BLAST. Retrieve all proteins sharing a common domain. Iterative BlastClust (fixed length coverage)
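Single-linkage clustering, as named on this slide, can be illustrated with a short Python sketch. It is not the BlastClust program itself: the input format (pre-computed pairwise hits with a length-coverage fraction) and the `min_coverage` threshold are assumptions made for the example.

```python
def single_linkage(ids, hits, min_coverage=0.9):
    """Cluster sequence IDs by single linkage.

    ids  -- iterable of sequence identifiers
    hits -- iterable of (id_a, id_b, coverage) tuples, e.g. taken from
            pairwise BLAST results (hypothetical input format)
    Two sequences end up in the same cluster if a chain of qualifying
    hits connects them, even when they do not hit each other directly.
    """
    ids = list(ids)
    parent = {i: i for i in ids}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, coverage in hits:
        if coverage >= min_coverage:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())
```

With hits A-B and B-C above the threshold and C-D below it, A, B and C form one cluster and D stays alone. Chaining through B is what makes the linkage "single", and it is also why such clusters benefit from curator review.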

25 Family Analysis of Homologous Proteins. 1. Fully Curated Protein Family: especially important when the protein of interest is underannotated or misannotated (happens often!). Evidence types: Characterized (validated), Predicted (by computational methods) or Uncharacterized. 2. Preliminary or Uncurated Family: have to do some analysis OR contact PIR and ask to prioritize this family. 3. No Family Classification (iProClass search: PIRSF field is blank): have to do some analysis OR contact PIR and ask to prioritize this family

26 Underannotated Proteins Search iProClass with PIRSF Providing more information

27 PIRSF SCAN (sequence search) UniProt sequence Q8Y5X7 is automatically classified as chorismate mutase of the AroH class PIRSF Returns only matches to fully curated PIRSFs

28 PIRSF Family Report: Curated Protein Family Information. Taxonomic distribution of a PIRSF can be used to infer the evolutionary history of the proteins in the PIRSF. Phylogenetic tree and alignment view allows further sequence analysis

29 PIRSF Family Report (II) Integrated value added information from other databases Mapping to other protein classification databases

30 Chorismate Mutase: Results from iProClass Analysis. CM from B. subtilis P19080 does not bring up B. subtilis AroA(G) or E. coli P-protein (or related proteins) in a BLAST search. It contains a different Pfam domain. Identical conserved motifs are not found. Conclusion: NOT homologous. Use the PIRSF family database for the same analysis: PIRSF reports (abstracts contain most of this info); PIRSF domain architecture (curated or uncurated): Pfam and newly defined domains; structure information (PDB links); hierarchy in DAG (under development)

31 PIRSF Text Search New domain AroA(G)

32 Chorismate Mutase: Convergent Evolution – EC (Non-Orthologous Gene Displacement). Two Distinct Sequence/Structure Types. AroQ Class: SCOP (all α), core: 6 helices, bundle. AroH Class: SCOP (α + β), core: beta-alpha-beta-alpha-beta(2). Two Pfam Domains: PF01817, PF07736 (new Pfam domain). [Structure images: AroQ, AroH]
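The domain comparison behind this conclusion can be sketched in a few lines of Python. The mapping below is an illustrative reading of the slides (AroH-class P19080 carrying PF07736, AroQ-class E. coli P-protein P07022 carrying PF01817); only the chorismate mutase domains are listed, and the helper name is hypothetical.

```python
# Chorismate mutase Pfam domains per protein, as read from the slides.
# Other domains of these proteins are deliberately left out.
cm_domains = {
    "P19080": {"PF07736"},  # B. subtilis CM, AroH class
    "P07022": {"PF01817"},  # E. coli P-protein, AroQ class
}


def shared_domains(protein_a, protein_b, domains=cm_domains):
    """Return the Pfam domains two proteins have in common."""
    return domains[protein_a] & domains[protein_b]


common = shared_domains("P19080", "P07022")
print(sorted(common))  # an empty list means no shared domain
```

An empty intersection agrees with the "NOT homologous" conclusion, but it is supporting evidence only: the slides also rely on BLAST results and on the absence of conserved motifs.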

33 Developing DAG Viewer Before: all chorismate mutase proteins and families hit PF01817 including PIRSF (not homologous to the rest) Subfamily Network structure (in DAG) for PIRSF family classification system reflects PIRSF family hierarchy which is based on evolutionary relationships

34 DAG Viewer (II) After: PFAM created a new domain PF07736 which is found in PIRSF members “Orphans”: no family classification

35 PIR Team. Dr. Cathy Wu, Director. Protein Classification team: Dr. Winona Barker, Dr. Lai-Su Yeh, Dr. Anastasia Nikolskaya, Dr. Darren Natale, Dr. Zhang-Zhi Hu, Dr. Raja Mazumder, Dr. CR Vinayaka, Dr. Sona Vasudevan, Dr. Cecilia Arighi. Informatics team: Dr. Hongzhan Huang, Dr. Peter McGarvey, Baris Suzek, M.S., Sehee Chung, M.S., Dr. Leslie Arminski, Dr. Hsing-Kuo Hua, Yongxing Chen, M.S., Jian Zhang, M.S., Dr. Xin Yuan. Students: Christina Fang, Vincent Hermoso, Natalia Petrova. UniProt is supported by the National Institutes of Health, grant # 1 U01 HG