Biological Sequence Databases


BINF6201/8201 Biological Sequence Databases 09-29-2011

Types of Biological Sequence Databases
Bioinformatics analyses rely on various information in public databases.
Biological sequence databases include:
--- General purpose sequence databases
--- Nucleic acid sequence databases
--- Protein sequence databases
--- Special purpose sequence databases
--- Protein sequence motif and domain databases
--- Protein structure databases
--- Specialized genome databases
--- Organism-specific databases

Nucleotide Sequence Databases
International Nucleotide Sequence Database (INSD):
--- GenBank at NCBI
--- European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database at the European Bioinformatics Institute (EBI)
--- DNA Data Bank of Japan (DDBJ)
Features:
--- All published nucleotide sequences are requested to be deposited in one of these three databases;
--- Data are exchanged among these three databases on a daily basis;
--- Sequences stored in GenBank at NCBI can be downloaded by anonymous ftp at ftp://ftp.ncbi.nih.gov.

Nucleotide Sequence Databases
DNA sequence file formats: FASTA format
>gi|46048570|ref|NM_203332.1| Rattus norvegicus glutamine/glutamic acid-rich protein A (Grpcb), mRNA
CAGTTGGGAGAAAGTCCATAGACTCCTCCAAGATGCTTGTGGTCCTGCTCACAGCAGCCTTGCTGGCTCT
GAGCTCAGCTCAGGGCACGGATGAAGAGGTCAACAATGCTGAGACCAGTGATGTACCAGCAGATTCTGAA
CAACAACCCGTGGACTCGGGTTCAGATCCACCTTCTGCTGATGCAGATGCAGAGAATGTTCAAGAGGGTG
AATCAGCCCCACCAGCAAATGAAGAGCCTCCTGCCACCTCTGGGAGTGAAGAGGAACAGCAGCAGCAGGA
ACCCACACAGGCAGAGAATCAAGAGCCTCCTGCCACCTCTGGGAGTGAAGAGGAACAGCAGCAGCAGGAA
CCCACACAGGCAGAGAATCAAGAGCCTCCTGCCACCTCTGGGAGTGAAGAGGAACAGCAGCAGCAGCAAC
CCACACAGGCAGAGAATCAAGAGCCTCCTGCCACCTCTGGGAGTGAAGAGGAACAGCAGCAGCAGGAATC
CACACAGGCAGAGAATCAAGAGCCTTCTGACTCTGCTGGGGAAGGACAGGAAACTCAACCTGAGGAAGGA
AATGTAGAGTCACCTCCCTCTTCTCCTGAAAACTCACAAGAACAACCACAGCAAACAAATCCAGAGGAGA
AACCGCCTGCTCCTAAGACTCAGGAAGAGCCACAGCACTATAGAGGTCGTCCTCCAAAGAAGATTTTTCC
TTTTTTCATTTACAGAGGAAGACCAGTAGTAGTATTCAGGCTCGAGCCTAGGAATCCATTCGCCAGAAGA
TTTTAGAGAGTACCTGAGAAGATTATGACCTTCAGATGTGTAGGTCAACAAATCACTGTTGATTGTCTAT
AATATTCCAATAAAAATTTTCAGCATGC
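The FASTA layout above (one ">" header line followed by sequence lines) can be read with a few lines of code. The following is a minimal sketch in Python, not part of the original slides; the function names `read_fasta` and `parse_ncbi_header` are hypothetical, and the header parser assumes the old-style NCBI header shape shown in the example (gi|number|database|accession| description).

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are simply concatenated
    if header is not None:
        yield header, "".join(chunks)


def parse_ncbi_header(header: str) -> Dict[str, str]:
    """Split a header of the assumed form 'gi|<number>|<db>|<accession>| text'."""
    ids, _, description = header.partition(" ")
    fields = ids.split("|")
    return {
        "gi": fields[1],
        "db": fields[2],
        "accession": fields[3],
        "description": description.strip(),
    }


# First two sequence lines of the record shown on the slide.
example = [
    ">gi|46048570|ref|NM_203332.1| Rattus norvegicus glutamine/glutamic "
    "acid-rich protein A (Grpcb), mRNA",
    "CAGTTGGGAGAAAGTCCATAGACTCCTCCAAGATGCTTGTGGTCCTGCTCACAGCAGCCTTGCTGGCTCT",
    "GAGCTCAGCTCAGGGCACGGATGAAGAGGTCAACAATGCTGAGACCAGTGATGTACCAGCAGATTCTGAA",
]

records = list(read_fasta(example))
header, sequence = records[0]
info = parse_ncbi_header(header)
print(info["accession"], len(sequence))
```

Headers from other sources do not necessarily follow the gi|...|ref|... shape, so the header parser should be treated as specific to this example.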

Nucleotide Sequence Databases DNA sequence file formats: GenBank format

Protein Sequence Databases
The NCBI Entrez Database
--- Stores translated protein sequences of nucleotide sequences from GenBank/EMBL/DDBJ;
--- Also incorporates protein sequences from SWISS-PROT and the Protein Information Resource (PIR);
--- Each protein record has a unique GenInfo identifier (gi) number;
--- Contains redundant sequences; thus multiple gi's can be associated with the same sequence;
--- The most complete protein sequence database;
--- Annotated by submitters;
--- Sequences are linked to other NCBI databases (PubMed, taxonomy);
--- NR (non-redundant) is based on Entrez and contains a unique set of these sequences;
--- RefSeq contains non-redundant sequences from major organisms for which sufficient data are available, in particular sequenced genomes.

Protein Sequence Databases
PIR-PSD (Protein Information Resource-Protein Sequence Database)
--- The world's first database of classified and functionally annotated protein sequences, which grew out of the Atlas of Protein Sequence and Structure (1965-1978) edited by Margaret Dayhoff;
--- Produced and distributed by the Protein Information Resource in collaboration with MIPS (Munich Information Center for Protein Sequences) and JIPID (Japan International Protein Information Database);
--- PIR-PSD has been the most comprehensive and expert-curated protein sequence database in the public domain for over 20 years;
--- In 2002, PIR joined EBI (European Bioinformatics Institute) and SIB (Swiss Institute of Bioinformatics) to form the UniProt consortium;
--- PIR-PSD sequences and annotations have been integrated into the UniProt Knowledgebase.

Protein Sequence Databases
SWISS-PROT and TrEMBL Databases
--- Swiss-Prot: a curated protein sequence database that strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases; originally created by Amos Bairoch at the Swiss Institute of Bioinformatics.
--- TrEMBL: a computer-annotated supplement of Swiss-Prot that contains all the translations of EMBL nucleotide sequence entries not yet integrated in Swiss-Prot.
--- Since 2002, both SWISS-PROT and TrEMBL have been part of the UniProt databases.

Protein Sequence Databases
UniProt databases (Universal Protein Resource)
--- Created by unifying the information in three well-known protein databases, with funding from NIH:
1. PIR (Protein Information Resource): heir to the oldest protein sequence database, Margaret Dayhoff's Atlas of Protein Sequence and Structure
2. SWISS-PROT: created by Amos Bairoch at the Swiss Institute of Bioinformatics
3. TrEMBL database
--- Most comprehensive catalogue of information on proteins.

Protein Sequence Databases
UniProt database (Universal Protein Resource)
UniProt comprises three components, each optimized for a different purpose:
1. The UniProt Knowledgebase (UniProtKB)
2. The UniProt Reference Clusters (UniRef) databases
3. The UniProt Archive (UniParc)
Nucleic Acids Res. 2007 Jan;35(Database issue):D193-7. Epub 2006 Nov 16.

Protein Sequence Databases
UniProt Knowledgebase (UniProtKB)
In addition to capturing the core data mandatory for each UniProt entry, other available information is also added, including biological ontologies, classifications and cross-references, and clear indications of the quality of annotation in the form of evidence attribution of experimental and computational data.
The UniProtKB consists of two sections:
--- UniProtKB/Swiss-Prot: manually annotated records supported by literature and curator-evaluated computational analysis
--- UniProtKB/TrEMBL: computationally analyzed records that await full manual annotation

Protein Sequence Databases
UniProt Reference Clusters (UniRef) Databases
--- The UniRef databases provide clustered sets of sequences from the UniProt Knowledgebase and selected UniParc records, in order to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences from view;
--- The UniRef100 database combines identical sequences and sub-fragments with 11 or more residues into a single UniRef entry, displaying the sequence of a representative protein, the accession numbers of all the merged UniProt entries, and links to the corresponding UniProt and UniParc records;
--- UniRef90 and UniRef50 are built by clustering UniRef100 sequences with 11 or more residues using a clustering algorithm, so that each cluster is composed of sequences that have at least 90% or 50% sequence identity, respectively, to the representative sequence.
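The UniRef100 merging rule described above (identical sequences and sub-fragments of 11 or more residues collapse into one entry) can be illustrated with a small sketch. This is not the UniRef production procedure: the function name, the accessions, and the choice to leave sequences shorter than 11 residues as their own entries are all assumptions made for illustration.

```python
from typing import Dict, List

MIN_FRAGMENT_LENGTH = 11  # threshold quoted in the text above


def cluster_identical_and_fragments(entries: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each representative accession to the accessions merged into it.

    A sequence joins an existing cluster when it is identical to, or a
    contiguous fragment of, that cluster's representative sequence.
    """
    # Process longest sequences first, so that a fragment is always seen
    # after any longer sequence that contains it.
    ordered = sorted(entries.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    clusters: Dict[str, List[str]] = {}
    for accession, seq in ordered:
        home = None
        if len(seq) >= MIN_FRAGMENT_LENGTH:
            # Substring test covers both identity and sub-fragments.
            home = next((rep for rep in clusters if seq in entries[rep]), None)
        if home is None:
            clusters[accession] = [accession]  # becomes a new representative
        else:
            clusters[home].append(accession)
    return clusters


# Hypothetical accessions and toy sequences.
toy_entries = {
    "A1": "MKTAYIAKQRQISFVKSHFSRQ",
    "A2": "MKTAYIAKQRQISFVKSHFSRQ",  # identical to A1
    "A3": "AKQRQISFVKSH",            # 12-residue fragment of A1
    "A4": "ISFVK",                   # fragment, but shorter than 11 residues
    "A5": "MSLLTEVETPIRNEW",         # unrelated sequence
}
toy_clusters = cluster_identical_and_fragments(toy_entries)
print(toy_clusters)
```

The UniRef90 and UniRef50 levels additionally need a sequence-identity measure between non-identical sequences, which this sketch does not attempt.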

Protein Sequence Databases
UniProt Archive (UniParc)
--- UniParc is a comprehensive non-redundant protein sequence collection;
--- Protein sequences are loaded daily from many different publicly accessible sources to include all known sequences whose host is known;
--- While a protein sequence may exist in multiple databases, and even more than once in a given database (with different identifiers), UniParc stores each unique sequence only once and assigns it a unique UniParc identifier;
--- Cross-references back to the source databases are provided, and include source accession numbers, sequence versions, and status (active or obsolete).
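The store-once idea behind UniParc can be sketched as a small in-memory archive keyed by a digest of the sequence. Everything here is an illustrative assumption rather than UniParc's actual implementation: the class names, the use of SHA-256, and the "ARC" identifier scheme are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ArchiveEntry:
    archive_id: str
    sequence: str
    # Each cross-reference: (source database, accession, version, is_active)
    cross_refs: List[Tuple[str, str, int, bool]] = field(default_factory=list)


class SequenceArchive:
    """Stores each distinct sequence once and records where it was seen."""

    def __init__(self) -> None:
        self._by_digest: Dict[str, ArchiveEntry] = {}

    def load(self, source_db: str, accession: str, version: int,
             sequence: str, active: bool = True) -> str:
        normalized = sequence.upper()
        digest = hashlib.sha256(normalized.encode("ascii")).hexdigest()
        entry = self._by_digest.get(digest)
        if entry is None:
            # Hypothetical identifier scheme: a running counter.
            archive_id = f"ARC{len(self._by_digest) + 1:010d}"
            entry = ArchiveEntry(archive_id, normalized)
            self._by_digest[digest] = entry
        entry.cross_refs.append((source_db, accession, version, active))
        return entry.archive_id

    def entry_for(self, sequence: str) -> ArchiveEntry:
        digest = hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()
        return self._by_digest[digest]

    def __len__(self) -> int:
        return len(self._by_digest)


archive = SequenceArchive()
# Hypothetical source records: the same sequence arrives from two databases.
id_a = archive.load("DatabaseA", "P00001", 1, "MKTAYIAKQR")
id_b = archive.load("DatabaseB", "XP_0001", 2, "mktayiakqr", active=False)
id_c = archive.load("DatabaseA", "P00002", 1, "MSLLTEVETP")
print(id_a, id_b, id_c, len(archive))
```

Keying on a digest keeps lookups cheap for long sequences; a production system would also have to consider how to handle the (very unlikely) case of two different sequences sharing a digest.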

Protein Structure Databases
Protein Data Bank (PDB)
Stores 3-D structures of biological macromolecules determined experimentally (X-ray crystallography and NMR).
http://www.pdb.org/pdb/static.do?p=general_information/pdb_statistics/index.html

Protein Structure Databases
The SCOP (Structural Classification of Proteins) Database
--- Developed by Alexey Murzin and Cyrus Chothia;
--- Hierarchical classification of structural domains of individual PDB entries;
--- SCOP is organized as a tree with the following hierarchy:
Class: according to the arrangement of secondary structure
Fold: proteins have the same major secondary structures in the same arrangement with the same topological connections
Superfamily: low sequence identities, but structures suggest that a common evolutionary origin is probable
Family: cluster of homologous proteins

SCOP Classes
Legend: [protein domains] (folds)
1. All alpha proteins [46456] (226)
2. All beta proteins [48724] (149)
3. Alpha and beta proteins (a/b) [51349] (134): mainly parallel beta sheets (beta-alpha-beta units)
4. Alpha plus beta proteins (a+b) [53931] (286): mainly antiparallel beta sheets (segregated alpha and beta regions)
5. Multi-domain proteins (alpha and beta) [56572] (48): folds consisting of two or more domains belonging to different classes
6. Membrane and cell surface proteins and peptides [56835] (49): does not include proteins in the immune system
7. Small proteins [56992] (79): usually dominated by metal ligand, heme, and/or disulfide bridges
8. Coiled coil proteins [57942] (7): not a true class
9. Low resolution protein structures [58117] (24): not a true class
10. Peptides [58231] (116): peptides and fragments; not a true class
11. Designed proteins [58788] (42): experimental structures of proteins with essentially non-natural sequences

Protein Classification Databases
The CATH Database
--- Developed by Janet Thornton and Christine A. Orengo;
--- Hierarchical classification of structural domains of individual PDB entries;
--- CATH is organized as a tree with the following hierarchy:
Class: according to the arrangement of secondary structure
Architecture: orientation of secondary structure elements
Topology: topological connections between secondary structure elements
Homologous superfamily: cluster of homologous proteins
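The four CATH levels map naturally onto a dotted code with one number per level. The sketch below assumes that dotted notation, which is not shown on the slides, and uses the hypothetical function name `cath_lineage`; the code "3.40.50.300" serves purely as an example of the format.

```python
from typing import Dict

# The four levels named above, from most general to most specific.
LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")


def cath_lineage(code: str) -> Dict[str, str]:
    """Expand a dotted four-number code into one identifier per level."""
    parts = code.split(".")
    if len(parts) != len(LEVELS) or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    # Each level is identified by the code prefix up to and including it.
    return {level: ".".join(parts[: i + 1]) for i, level in enumerate(LEVELS)}


lineage = cath_lineage("3.40.50.300")
for level, identifier in lineage.items():
    print(f"{level}: {identifier}")
```

Because every level is a prefix of the next, two domains share a given level exactly when their codes agree up to that position.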

Protein Sequence Motif Databases
Protein sequence motif: a set of conserved amino acid residues that are important for protein function and are located within a short distance of one another.
The EF-hand motif:
|-helixI-----|------loop1------|--helixII--|
En**nn**-nX**Y-*Z*G*Ix**zn**nn**n
Legend: E = glutamate; n = hydrophobic residue; * = any residue; X = first calcium ligand; Y = second calcium ligand; Z = third calcium ligand; G = glycine; # = fourth calcium ligand, provided by a backbone carbonyl; I = isoleucine (although other aliphatic residues are also found at this position); -X = fifth calcium ligand; -Z = sixth and seventh calcium ligands, provided by a bidentate glutamate or aspartate residue.

Protein Sequence Motif Databases
The PROSITE database
--- Created by Amos Bairoch at the Swiss Institute of Bioinformatics;
--- Consists of a large collection of biologically meaningful signatures that are described as patterns or profiles;
--- Release 20.17 of 24-Jul-2007 contains:
--- 1489 documentation entries
--- 1319 patterns
--- 739 profiles
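A PROSITE pattern can be searched with an ordinary regular-expression engine once its syntax is translated. The sketch below covers only the common pattern elements (single residues, `x` for any residue, `[..]` allowed sets, `{..}` forbidden sets, `(n)` or `(n,m)` repeats, and the `<` and `>` terminus anchors); the function name is hypothetical, the toy sequence is invented, and the N-glycosylation pattern N-{P}-[ST]-{P} is brought in as a familiar example rather than taken from the slides.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into Python regex syntax."""
    pattern = pattern.strip().rstrip(".")  # patterns end with a period
    pieces = []
    for element in pattern.split("-"):     # elements are separated by '-'
        prefix = suffix = repeat = ""
        if element.startswith("<"):        # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Repetition such as x(2) or x(2,4)
        m = re.fullmatch(r"(.+?)\((\d+)(?:,(\d+))?\)", element)
        if m:
            element = m.group(1)
            upper = "," + m.group(3) if m.group(3) else ""
            repeat = "{" + m.group(2) + upper + "}"
        if element in ("x", "X"):
            core = "."                     # any residue
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"  # any residue except these
        else:
            core = element                 # single residue or [ABC] set
        pieces.append(prefix + core + repeat + suffix)
    return "".join(pieces)


glyco_regex = prosite_to_regex("N-{P}-[ST]-{P}.")
toy_sequence = "MKNGSAANPTLLNFTQ"  # invented sequence for illustration
hits = [(m.start(), m.group()) for m in re.finditer(glyco_regex, toy_sequence)]
print(glyco_regex, hits)
```

Note that `re.finditer` reports non-overlapping matches only, so overlapping occurrences of a pattern would need a lookahead-based search instead.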

Protein Sequence Motif Databases
The PRINTS Database
--- PRINTS is a compendium of protein fingerprints;
--- A fingerprint is a group of conserved motifs used to characterize a protein family;
--- Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space;
--- Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs: the database thus provides a useful adjunct to PROSITE.
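The idea of a fingerprint as several non-overlapping motifs found in order along a sequence can be sketched as follows. This is a deliberate simplification and not the PRINTS scoring method: motifs are written here as fixed-length regular expressions, every motif is required to match, and the function name, motifs, and sequence are all invented for illustration.

```python
import re
from typing import List, Optional


def match_fingerprint(sequence: str, motifs: List[str]) -> Optional[List[int]]:
    """Return motif start positions if all motifs occur in order, else None.

    Each motif is searched for after the end of the previous hit, so the
    reported hits never overlap. Assumes fixed-length motifs, for which
    taking the leftmost hit each time is sufficient.
    """
    positions: List[int] = []
    cursor = 0
    for motif in motifs:
        hit = re.compile(motif).search(sequence, cursor)
        if hit is None:
            return None  # one missing motif means no fingerprint match
        positions.append(hit.start())
        cursor = hit.end()
    return positions


toy_fingerprint = ["GDSGGP", "C.W", "GC"]  # invented three-motif fingerprint
toy_protein = "AAGDSGGPLLCCWGSGCAQ"        # invented sequence
found = match_fingerprint(toy_protein, toy_fingerprint)
missing = match_fingerprint(toy_protein, list(reversed(toy_fingerprint)))
print(found, missing)
```

Requiring the motifs in a fixed order is what makes the combined signature more selective than any one motif on its own.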

Protein Domain Databases
Protein domain: a structurally compact, independently folding unit that forms a stable 3-D structure and shows some evolutionary conservation.
Genetically mobile: domain recombination is one of the major forces that create novel proteins in evolution.

Protein Domain Databases Classification of protein domains from multiple sequence alignments

Protein Domain Databases
The Pfam Database
--- Pfam-A is a collection of protein family alignments that were constructed semi-automatically using hidden Markov models (HMMs);
--- Sequences that are not covered by Pfam-A are clustered and aligned automatically, and are released as Pfam-B;
--- Pfam-A has been incorporated into EBI's InterPro and NCBI's CDD;
--- Version 22.0 (June 2007) contains alignments and models for 9318 protein families, based on the Swiss-Prot 51.7 and TrEMBL 34.7 protein sequence databases;
--- The HMMER package provides tools for:
--- scanning a sequence against the Pfam database
--- creating a new HMM from a multiple sequence alignment
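To give a feel for how a model built from a multiple sequence alignment can be used to scan a new sequence, here is a much simpler stand-in for a profile HMM: a position-specific log-odds profile built from an ungapped alignment. It has no insert or delete states, so it is not what HMMER computes; the function names, the uniform background, the pseudocount, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(alignment: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Per-column log-odds scores against a uniform background."""
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(ALPHABET)
    total = len(alignment) + pseudocount * len(ALPHABET)
    profile = []
    for col in range(width):
        counts = Counter(row[col] for row in alignment)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def best_window(sequence: str, profile: List[Dict[str, float]]) -> Tuple[float, int]:
    """Score every ungapped window; return (best score, start position)."""
    width = len(profile)
    scored = [
        (sum(profile[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)


toy_alignment = ["GDSGGP", "GDSGGP", "GDAGGP"]  # invented seed alignment
toy_profile = build_profile(toy_alignment)
best_score, best_start = best_window("MKLGDSGGPLVA", toy_profile)
print(best_start, round(best_score, 2))
```

A real profile HMM extends this picture with states for insertions and deletions, which lets it score sequences whose domain copies differ in length from the seed alignment.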

Protein Domain Databases
The SMART Database
--- SMART (Simple Modular Architecture Research Tool): HMMs based on high-quality, manually derived alignments of protein domain families;
--- Contains more than 500 domain families found in signaling, extracellular and chromatin-associated proteins;
--- Domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues;
--- Has been incorporated into EBI's InterPro and NCBI's CDD.

Protein Domain Databases The ProDom Database http://www.toulouse.inra.fr/prodom.html The ProDom protein domain database consists of an automatic compilation of homologous domains. Current versions of ProDom are built using a novel procedure based on recursive PSI-BLAST searches. The ProDom database has been designed as a tool to help analyze domain arrangements of proteins and protein families. Strong emphasis has been put on the graphical user interface which allows for interactive analysis of protein homology relationships. Has been incorporated into EBI’s InterPro.

Protein Domain Databases The SUPERFAMILY Database A library of profile hidden Markov models that represent all proteins of known structure. The library is based on the SCOP classification of proteins: each model corresponds to a SCOP domain and aims to represent the entire SCOP superfamily that the domain belongs to. SUPERFAMILY has been used to carry out structural assignments to all completely sequenced genomes. Has been incorporated into EBI’s InterPro.

Protein Domain Databases The Gene3D Database Gene3D is a library of hidden Markov models that represent all proteins of known structure. The seed alignments for the models are derived from the proteins found within the homologous superfamily classification level in CATH, which groups together domains that are thought to share a common ancestor. In CATH, similarities at Homologous superfamily-level are identified first by sequence comparisons and subsequently by structure comparisons using SSAP. Gene3D has been used to carry out structural assignments to all completely sequenced genomes. The results and analysis are available from the Gene3D website. Has been incorporated into EBI’s InterPro.

Protein Domain Databases
The InterPro Database (Integrated Resource of Protein Families, Domains and Sites)
InterPro is the unification of the following databases for protein domain and family classification:
--- PROSITE: regular expressions and profiles.
--- PRINTS: fingerprints (groups of aligned, un-weighted motifs).
--- ProDom: uses PSI-BLAST to find homologous sequences, which are clustered in the same ProDom entry.
--- Pfam, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, PANTHER, Gene3D: hidden Markov models (HMMs).
InterProScan program: for scanning the InterPro database.
The whole database can be downloaded.

The number of unique protein folds and families is limited
--- The number of unique protein folds has been saturated since 2008.
--- The number of unique protein families has been saturated since 2008.
http://www.pdb.org/pdb/static.do?p=general_information/pdb_statistics/index.html