Protein Bioinformatics – Advances and Challenges
By Sona Vasudevan and Peter McGarvey


1 Protein Bioinformatics – Advances and Challenges, by Sona Vasudevan and Peter McGarvey

2 Outline
- What is Bioinformatics? Past & Present
- About PIR
- PIR resources
- UniProt resources
- PIR's leading role in caBIG, Biodefense and Ontology

3 What is Bioinformatics?
NIH Biomedical Information Science and Technology Initiative (BISTI) Working Definition (2000)
- Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
Computer (Information) + Mouse (Biology) = Bioinformatics

4 “A science which hesitates to forget its founders is lost.” (A. N. Whitehead)

5 Dr. Margaret Oakley Dayhoff (1925 – 1983)
- The origin of the single-letter code for the amino acids
- Evolution of protein databases (Georgetown University)

6 Challenges we are facing today!
- Total number of sequences in NR: ~4,919,302
- Total number of environmental sequences (NCBI): ~6,028,191
- Number of domain families (Pfam): ~8,957
- Number of domain families (SMART): ~665
- Number of structures (PDB): ~43,339
- Number of COGs: ~4,873 (unicellular), ~4,852 (eukaryote)

7 Molecular Biology Databases
- 719 databases in 14 categories
- The DNA sequence database has exceeded 100 gigabases.

8 The birth of the "omes" and the "omic" era in biology

9 Genomics, Proteomics, Unknomics, Functionomics, Metagenomics


11 Protein Information Resource: Integrated Protein Informatics Resource for Proteomics Research (http://pir.georgetown.edu)
- UniProt, Universal Protein Resource: Central Resource of Protein Sequence and Function
- PIRSF, Protein Family Classification System: Protein Classification and Functional Annotation
- iProClass, Integrated Protein Knowledgebase: Data Integration and Functional Associative Analysis

12 UniProt Databases
- UniParc: Comprehensive Sequence Archive with Sequence History
- UniProt: Knowledgebase with Full Classification and Functional Annotation
- UniRef: Non-redundant Reference Databases for Sequence Search
[Diagram: source data (Swiss-Prot, PIR-PSD, TrEMBL, RefSeq, GenBank/EMBL/DDBJ, Ensembl, PDB, Patent Data, Other Data) feed UniParc (Archive). Merging, with Classification, Literature-Based & Automated Annotation, yields UniProt (Knowledgebase). Clustering at 100, 90, 50% Identity yields UniRef100 (NREF), UniRef90 and UniRef50.]
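Illustrative note (not from the slides): the idea of clustering the same sequence set at 100, 90 and 50% identity can be sketched with a toy greedy procedure. Everything here is an assumption made for illustration: the function names `toy_identity` and `greedy_cluster`, the example sequences, and in particular the ungapped position-by-position identity measure, which stands in for a real alignment-based identity. It is not a description of how UniRef is actually built.

```python
from typing import Dict, List


def toy_identity(a: str, b: str) -> float:
    """Fraction of identical positions, ungapped, relative to the longer sequence.

    A stand-in for a real alignment-based identity (assumption for illustration).
    """
    if not a or not b:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


def greedy_cluster(seqs: Dict[str, str], threshold: float) -> Dict[str, List[str]]:
    """Assign each sequence to the first representative it matches at >= threshold.

    Sequences are visited longest first, so each representative is the longest
    member of its cluster. Returns {representative_id: [member_ids]}.
    """
    clusters: Dict[str, List[str]] = {}
    for sid in sorted(seqs, key=lambda s: (-len(seqs[s]), s)):
        for rep in clusters:
            if toy_identity(seqs[sid], seqs[rep]) >= threshold:
                clusters[rep].append(sid)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[sid] = [sid]
    return clusters


# Hypothetical example sequences (made up, not real proteins).
EXAMPLE_SEQS = {
    "P1": "MKTAYIAKQR",
    "P2": "MKTAYIAKQR",  # identical to P1
    "P3": "MKTAYIAKQK",  # differs from P1 at one position
    "P4": "MKTAYLLLLL",  # shares only the first five positions with P1
    "P5": "GGGGGGGGGG",  # unrelated
}

if __name__ == "__main__":
    for level in (1.0, 0.9, 0.5):
        print(level, greedy_cluster(EXAMPLE_SEQS, level))
```

Lowering the threshold can only merge clusters in this toy setup, which mirrors the nesting of the 100, 90 and 50% levels shown in the diagram.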

13 UniProt Knowledgebase
- Objective: Stable, Comprehensive, Fully Classified, Richly and Accurately Annotated
- Information Content
  - Isoform Presentation
  - Nomenclature
  - Family Classification and Domain Identification
  - Functional Annotation
- Approaches
  - Full Classification
  - Automated Annotation
  - Literature-Based Curation
  - Database Cross-References
  - Controlled Vocabularies & Ontologies
  - Evidence Attribution

14 PIRSF Classification System
- PIRSF:
  - Reflects evolutionary relationships of full-length proteins
  - A network structure from superfamilies to subfamilies
- Definitions:
  - Homeomorphic Family (HF): Basic Unit
  - Homologous: Common ancestry, inferred by sequence similarity
  - Homeomorphic: Full-length similarity & common domain architecture
  - Hierarchy: Flexible number of levels with varying degrees of sequence conservation
  - Network Structure: Allows multiple parents
- Advantages:
  - Annotate both general biochemical and specific biological functions
  - Accurate propagation of annotation and development of standardized protein nomenclature and ontology
Credit: A. N. Nikolskaya

15 PIRSF Classification System: Protein Classification and Functional Annotation (http://pir.georgetown.edu/pirsf/)
- Comprehensive Classification of All UniProt Proteins
- Curated Families with Protein Name and Site Rules
- Classification and Visualization Tools
[Figure labels: Taxonomy Distribution and Phylogenetic Pattern; Iterative BlastClust; Tree with Annotation Table, MSA & Phylogenetic tree]

16 Classification Tool: BlastClust
- Curator-guided clustering
- Single-linkage clustering using BLAST
- Retrieve all proteins sharing a common domain
- Iterative BlastClust (fixed length coverage)
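Illustrative note (not from the slides): single-linkage clustering, as named above, puts two proteins in the same cluster whenever a chain of pairwise hits connects them, even if they never hit each other directly. The sketch below shows that idea with a union-find structure. The `Hit` tuple layout, the function name `single_linkage` and the `min_identity` / `min_coverage` parameters are assumptions made for illustration; they are not BlastClust's actual options or input format, and the hit values are made up.

```python
from typing import Dict, Iterable, List, Tuple

# (query_id, subject_id, identity, coverage), both scores as fractions in [0, 1].
# This layout is hypothetical; real BLAST output would need parsing first.
Hit = Tuple[str, str, float, float]


def single_linkage(
    ids: Iterable[str],
    hits: Iterable[Hit],
    min_identity: float,
    min_coverage: float,
) -> List[List[str]]:
    """Cluster ids so that any two linked by a chain of accepted hits share a cluster.

    Assumes every id appearing in `hits` is also listed in `ids`.
    """
    parent: Dict[str, str] = {i: i for i in ids}

    def find(x: str) -> str:
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, identity, coverage in hits:
        if identity >= min_identity and coverage >= min_coverage:
            root_q, root_s = find(query), find(subject)
            if root_q != root_s:
                parent[root_s] = root_q  # merge the two clusters

    clusters: Dict[str, List[str]] = {}
    for i in parent:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


# Made-up hits: A-B and B-C pass, C-D has low coverage, D-E passes.
EXAMPLE_HITS: List[Hit] = [
    ("A", "B", 0.90, 0.95),
    ("B", "C", 0.85, 0.90),
    ("C", "D", 0.90, 0.40),
    ("D", "E", 0.80, 0.90),
]

if __name__ == "__main__":
    print(single_linkage("ABCDE", EXAMPLE_HITS, min_identity=0.8, min_coverage=0.8))
```

With a strict coverage cutoff the low-coverage C-D hit is ignored and two clusters remain; relaxing the cutoff chains all five proteins together, which is the kind of trade-off a curator would steer when clustering iteratively.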

17 PIRSF-Based Protein Annotation
Classification-Driven Rule-Based Annotation: Provides Consistent Annotation and Database Integrity Check
Includes:
- Site Rule (PIRSR): Position-Specific Site Feature (FT)
- Name Rule (PIRNR): transfer name from PIRSF to individual proteins; Protein Name (DE) with Synonym, EC, Misnomer, GO Term
Name Rule Interface:
Rule ID | Rule Condition | Rule Description
PIRNR000881-1 | PIRSF000881 member and vertebrates | Name: S-acyl fatty acid synthase thioesterase; EC: oleoyl-[acyl-carrier-protein] hydrolase (EC 3.1.2.14)
PIRNR000881-2 | PIRSF000881 member and not vertebrates | Name: Type II thioesterase; EC: thiolester hydrolases (EC 3.1.2.-)
PIRNR025624-1 | PIRSF025624 member | Name: ACT domain protein; Misnomer: chorismate mutase
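Illustrative note (not from the slides): the three name rules listed above can be read as condition/description pairs, where the condition combines PIRSF membership with an optional taxonomic restriction. The sketch below encodes just those three rules as data and applies the first one that matches. The class and function names (`Protein`, `NameRule`, `apply_name_rules`), the boolean `is_vertebrate` flag and the example accessions are assumptions for illustration and do not reflect PIR's actual rule engine.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Protein:
    accession: str       # hypothetical identifier
    pirsf_id: str        # PIRSF family the protein belongs to
    is_vertebrate: bool  # simplified stand-in for a taxonomy lookup


@dataclass(frozen=True)
class NameRule:
    rule_id: str
    pirsf_id: str
    # True = vertebrates only, False = non-vertebrates only, None = any organism.
    requires_vertebrate: Optional[bool]
    name: str
    ec: Optional[str] = None
    misnomer: Optional[str] = None


# The three rules shown on the slide, transcribed as data.
NAME_RULES: List[NameRule] = [
    NameRule("PIRNR000881-1", "PIRSF000881", True,
             "S-acyl fatty acid synthase thioesterase", ec="3.1.2.14"),
    NameRule("PIRNR000881-2", "PIRSF000881", False,
             "Type II thioesterase", ec="3.1.2.-"),
    NameRule("PIRNR025624-1", "PIRSF025624", None,
             "ACT domain protein", misnomer="chorismate mutase"),
]


def apply_name_rules(protein: Protein,
                     rules: List[NameRule] = NAME_RULES) -> Optional[NameRule]:
    """Return the first rule whose family and taxonomy condition both hold, else None."""
    for rule in rules:
        if rule.pirsf_id != protein.pirsf_id:
            continue
        if (rule.requires_vertebrate is None
                or rule.requires_vertebrate == protein.is_vertebrate):
            return rule
    return None


if __name__ == "__main__":
    example = Protein("EXAMPLE_1", "PIRSF000881", is_vertebrate=False)
    matched = apply_name_rules(example)
    print(matched.rule_id if matched else "no rule applies")
```

Keeping the misnomer alongside the name lets the same lookup flag entries whose existing name should be corrected, which is the integrity-check role the slide mentions.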

18 Rule-based Annotation of Protein Entries Using PIRSF
[Figure labels: Structure; Binding/active sites; Identification of residues]

19 Methodology
- Defining a Rule
  - Select template structure
  - Align curated PIRSF seed members and structural template
  - Structure-based sequence alignment of seeds
  - Edit MSA retaining conserved regions covering all site residues
  - Build Site HMM from concatenated conserved regions
- Rule Condition
  - Membership Check (PIRSF HMM threshold)
  - Conserved Region Check (site HMM threshold)
  - Site Residue Check (position-specific residue in HMMAlign)
- Rule Propagation
  - Propagate conserved feature annotation to all members that fit the rule
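Illustrative note (not from the slides): the three rule conditions above form a sequence of gates, and a site annotation is propagated only to proteins that pass all of them. The sketch below models that logic on precomputed values. All names (`SiteRule`, `Candidate`, `rule_applies`, `propagate`), the rule ID, the thresholds, scores and residue positions are hypothetical; in practice the scores and aligned residues would come from HMM search and alignment tools, which are not called here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SiteRule:
    rule_id: str                       # hypothetical identifier
    family_threshold: float            # minimum score against the family HMM
    site_threshold: float              # minimum score against the site HMM
    expected_residues: Dict[int, str]  # alignment column -> required residue


@dataclass(frozen=True)
class Candidate:
    accession: str
    family_hmm_score: float            # precomputed, assumed given
    site_hmm_score: float              # precomputed, assumed given
    aligned_residues: Dict[int, str]   # alignment column -> residue in this protein


def rule_applies(rule: SiteRule, cand: Candidate) -> bool:
    """Apply the three checks in order: membership, conserved region, site residues."""
    if cand.family_hmm_score < rule.family_threshold:
        return False  # membership check failed
    if cand.site_hmm_score < rule.site_threshold:
        return False  # conserved region check failed
    # Site residue check: every required column must carry the expected residue.
    return all(cand.aligned_residues.get(col) == res
               for col, res in rule.expected_residues.items())


def propagate(rule: SiteRule, candidates: List[Candidate]) -> List[str]:
    """Return accessions of all candidates that should receive the site annotation."""
    return [c.accession for c in candidates if rule_applies(rule, c)]


# Made-up rule and candidates, for illustration only.
EXAMPLE_RULE = SiteRule("EXAMPLE_RULE", family_threshold=100.0,
                        site_threshold=20.0, expected_residues={12: "S", 45: "H"})
EXAMPLE_CANDIDATES = [
    Candidate("OK_1", 150.0, 30.0, {12: "S", 45: "H"}),
    Candidate("LOW_FAMILY", 50.0, 30.0, {12: "S", 45: "H"}),
    Candidate("LOW_SITE", 150.0, 10.0, {12: "S", 45: "H"}),
    Candidate("WRONG_RESIDUE", 150.0, 30.0, {12: "A", 45: "H"}),
    Candidate("MISSING_COLUMN", 150.0, 30.0, {12: "S"}),
]

if __name__ == "__main__":
    print(propagate(EXAMPLE_RULE, EXAMPLE_CANDIDATES))
```

Ordering the checks from cheapest and broadest to most specific means a protein outside the family is rejected before any residue-level comparison is attempted.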

20 An example of a PIR rule integrated into an SP record
[Figure label: PIR Rule]

21 PIRSF Protein Classification provides a platform for protein annotation
- Improves Annotation Quality
  - Annotation of biological function of whole proteins
  - Annotation of uncharacterized hypothetical proteins (functional predictions helped by newly detected family relationships)
  - Correction of annotation errors
  - Improvement of under- or over-annotated proteins
- Standardization of Protein Names

22 Data Integration
- Data Warehouse
  - Local Copy of Databases in a Unified Database Schema
  - Allows Local Control of Data; Update Problem
- Hypertext Navigation
  - Browsing Model with Hypertext Links
  - Allows Direct Interaction; Easily Lost in Cyberspace
- iProClass Approach
  - Data Warehouse + Hypertext Navigation
  - Rich Links (Links + Executive Summaries)
  - Modular and Open Framework for Adding New Components in Distributed Networking Environment

23 iProClass Database
- ~5,000,000 Protein Sequences
- Rich Links to >80 Databases
- Value-Added Views for UniProt
  - Integrated Protein Family, Function, Structure Information

24 iProClass Views: Sequence Report; Family Report

25 PIR iProClass Searches: Text Search; Peptide Search; BLAST Search; ID Mapping

26
1. Albert Einstein College of Medicine: T. gondii, C. parvum
2. Caprion Pharmaceuticals: B. abortus
3. Harvard Institute of Proteomics: V. cholerae, B. anthracis
4. Myriad Genetics: B. anthracis, Y. pestis, F. tularensis, Vaccinia, Variola
5. Pacific Northwest National Laboratory: S. typhimurium, S. typhi, Vaccinia, Monkeypox
6. Scripps: SARS CoV, Influenza
7. University of Michigan: B. anthracis
[Diagram labels: Scripps, Caprion, Myriad, Harvard, U of Michigan, Albert Einstein, PNNL; Resource Center (SSS, PIR, VBI); Data]

27 [Figure labels: Organism; Research Center; Data Type]

28 Master Protein Directory
- Currently contains 3,733 ORF Clones out of 3,784 Proteins
- 29 Colonization Pathway Proteins

29 Protein Summary Report
- Clone Sequences
- Order Clones from Repositories
- Protein and Reagent Information
- Search for Related Proteins in Catalog by Family Classification or Similarity Searches

30 Mouse proteins detected in B. anthracis and S. typhimurium infected macrophages

31 NCI caBIG Initiative
cancer Biomedical Informatics Grid: Informatics platform to enable sharing of research, data and tools
- Designed and built by an open federation of organizations
- Facilitate connectivity via common standards and unifying architecture
- Open source and open access principles
Domain Workspaces: Clinical Trial Management Systems; Integrative Cancer Research; Imaging; Tissue Banks and Pathology Tools
Cross Cutting Workspaces: Architecture; Vocabularies and Common Data Elements

32 PIR Activities in caBIG™
- Integrative Cancer Research Workspace
  - Developer: Grid-enablement of PIR
  - Adopter: SEED Genome Annotation Tool (completed); GeneConnect Genomic Identifier Mapping Service
- Vocabularies and Common Data Elements
  - Participant


