Biological databases: an introduction, by Dr. Erik Bongcam-Rudloff, LCB-UU/SLU, ILRI 2007.

Presentation transcript:

Biological databases: an introduction
By Dr. Erik Bongcam-Rudloff, LCB-UU/SLU, ILRI 2007

Biological Databases
- Sequence Databases
- Genome Databases
- Structure Databases

Sequence Databases
- The sequence databases are the oldest type of biological databases, and also the most widely used.

Sequence Databases
- Nucleotide: ATGC
- Protein: MERITSAPLG
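
The two alphabets above can be told apart mechanically. The following is a minimal sketch, not part of the original slides; the function name and the choice to accept U and N as nucleotide letters are assumptions made for illustration.

```python
# Letters treated as nucleotide symbols in this sketch (assumption:
# DNA/RNA bases plus N for an unknown base).
NUCLEOTIDE_LETTERS = set("ACGTUN")


def guess_sequence_type(sequence: str) -> str:
    """Guess whether a sequence is nucleotide or protein from its letters."""
    letters = set(sequence.upper())
    if not letters:
        raise ValueError("empty sequence")
    # A sequence using only nucleotide letters is assumed to be DNA/RNA;
    # anything else is assumed to be protein.
    return "nucleotide" if letters <= NUCLEOTIDE_LETTERS else "protein"


nucleotide_guess = guess_sequence_type("ATGC")
protein_guess = guess_sequence_type("MERITSAPLG")
print(nucleotide_guess, protein_guess)
```

A short peptide that happens to contain only A, C, G and T would be misclassified, so a heuristic like this is only a first guess.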

The nucleotide sequence repositories
- There are three main repositories for nucleotide sequences: EMBL, GenBank, and DDBJ.
- All of these should in theory contain "all" known public DNA or RNA sequences.
- These repositories collaborate, so that any data submitted to one of the databases is redistributed to the others.

- The three databases are the only databases that can issue sequence accession numbers.
- Accession numbers are unique identifiers which permanently identify sequences in the databases.
- These accession numbers are required by many biological journals before manuscripts are accepted.

- It should be noted that during the last decade several commercial companies have engaged in sequencing ESTs and genomes that they have not made public.
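
Because accession numbers are fixed-format identifiers, a quick shape check is easy to write. The sketch below is not from the slides: it assumes the commonly documented nucleotide accession shapes (one letter plus five digits, or two letters plus six or eight digits, with an optional ".version" suffix), and the identifiers used are hypothetical.

```python
import re

# Assumed shapes: 1 letter + 5 digits, or 2 letters + 6 or 8 digits,
# optionally followed by a ".version" suffix.
NUCLEOTIDE_ACCESSION = re.compile(
    r"(?:[A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}\d{8})(?:\.\d+)?"
)


def looks_like_nucleotide_accession(text: str) -> bool:
    """Return True if text has the shape of a nucleotide accession number."""
    return NUCLEOTIDE_ACCESSION.fullmatch(text.strip().upper()) is not None


# Hypothetical identifiers, used only to exercise the pattern.
for candidate in ["X12345", "AB123456.2", "EPO_HUMAN", "12345"]:
    print(candidate, looks_like_nucleotide_accession(candidate))
```

A matching shape does not prove that the record exists. Only a lookup in EMBL, GenBank or DDBJ can confirm that.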

EST databases
- Expressed sequence tags (ESTs) are short sequences from expressed mRNAs.
- The basic idea is to get a handle on the parts of the genome that are expressed as mRNA (often called the transcriptome).
- ESTs are generated by end-sequencing clones from cDNA libraries from different sources.

EST cluster databases
- UniGene
- UniGene is a database at NCBI that contains clusters (UniGene clusters) of sequences that represent unique genes. These clusters are made automatically by partitioning GenBank sequences into a non-redundant set of gene-oriented clusters.

Ideal minimal content of a « sequence » db
- Sequences !!
- Accession number (AC)
- References
- Taxonomic data
- ANNOTATION/CURATION
- Keywords
- Cross-references
- Documentation

Example: Swiss-Prot entry
[Figure: an annotated Swiss-Prot entry. Labelled parts: entry name, accession number, protein name, gene name, taxonomy, references, comments, cross-references, keywords, feature table (sequence description), sequence.]

Sequence database: example
…a SWISS-PROT entry, in FASTA format:
>sp|P01588|EPO_HUMAN ERYTHROPOIETIN PRECURSOR - Homo sapiens(Human).
MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAE
NITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEA
VLRGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPD
AASAAPLRTITADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR
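
To show how such a FASTA record can be read programmatically, here is a minimal sketch that is not part of the original slides. The function names are illustrative, and the header layout (database, accession and entry name separated by "|") is taken from the example above rather than from a formal specification.

```python
FASTA_TEXT = """\
>sp|P01588|EPO_HUMAN ERYTHROPOIETIN PRECURSOR - Homo sapiens(Human).
MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAE
NITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEA
VLRGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPD
AASAAPLRTITADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR
"""


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_swissprot_header(header):
    """Split a 'db|accession|entry_name description' header into its parts."""
    db, accession, rest = header.split("|", 2)
    entry_name, _, description = rest.partition(" ")
    return {
        "db": db,
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
    }


records = parse_fasta(FASTA_TEXT)
header, sequence = records[0]
info = split_swissprot_header(header)
print(info["accession"], info["entry_name"], len(sequence))
```

Joining the sequence lines before any analysis matters, because the line breaks in a FASTA file are only a display convention.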

SWISS-PROT knowledgebase
- Created by Amos Bairoch in 1986
- Collaboration between the SIB (CH) and EBI (UK)
- Annotated (manually), non-redundant, cross-referenced, documented protein sequence database.
- ~122’000 sequences from more than 7’700 different species; 192’000 references (publications); 958’000 cross-references (databases); ~400 Mb of annotations.
- Weekly releases; available from more than 50 servers across the world, the main source being ExPASy

SWISS-PROT: species
- 7’700 different species
- 20 species represent about 42% of all sequences in the database
- 5’000 species are only represented by one to three sequences. In most cases, these are sequences which were obtained in the context of a phylogenetic study

SWISS-PROT (center of the diagram), cross-referenced to:
- Domains, functional sites, protein families: PROSITE, InterPro, Pfam, PRINTS, SMART, Mendel-GFDb
- Nucleotide sequence db: EMBL, GenBank, DDBJ
- 2D and 3D structural dbs: HSSP, PDB
- Organism-specific dbs: DictyDb, EcoGene, FlyBase, HIV, MaizeDB, MGD, SGD, StyGene, SubtiList, TIGR, TubercuList, WormPep, Zebrafish
- Protein-specific dbs: GCRDb, MEROPS, REBASE, TRANSFAC
- 2D-gel protein databases: SWISS-2DPAGE, ECO2DBASE, HSC-2DPAGE, Aarhus and Ghent, MAIZE-2DPAGE
- Human diseases: MIM
- PTM: CarbBank, GlycoSuiteDB

Annotations
- Function(s)
- Post-translational modifications (PTM)
- Domains
- Quaternary structure
- Similarities
- Diseases, mutagenesis
- Conflicts, variants
- Cross-references …

Annotation schema
[Diagram: Amos Bairoch; head annotators 1, 2, … n; annotators; experts; SwissProt.]

Manual annotation

Code  Content                      Occurrence in an entry
ID    Identification               One; starts the entry
AC    Accession number(s)          One or more
DT    Date                         Three times
DE    Description                  One or more
GN    Gene name(s)                 Optional
OS    Organism species             One or more
OG    Organelle                    Optional
OC    Organism classification      One or more
OX    Taxonomy cross-references    One or more
RN    Reference number             One or more
RP    Reference position           One or more
RC    Reference comment(s)         Optional
RX    Reference cross-reference(s) Optional
RA    Reference authors            One or more
RT    Reference title              Optional
RL    Reference location           One or more
CC    Comments or notes            Optional
DR    Database cross-references    Optional
KW    Keywords                     Optional
FT    Feature table data           Optional
SQ    Sequence header              One
      Amino Acid Sequence          One or more
//    Termination line             One; ends the entry
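
The two-letter line codes make an entry easy to split into fields. The sketch below is not part of the original slides and makes two assumptions: each line carries its code in the first two columns with the content starting at column six, and sequence lines start with blanks. The sample entry is heavily abbreviated and illustrative only (its sequence is truncated to the first 60 residues); it is not the real database record.

```python
from collections import defaultdict

# Abbreviated, illustrative entry (not the real record); the sequence
# is truncated to its first 60 residues.
ENTRY = """\
ID   EPO_HUMAN      STANDARD;      PRT;   193 AA.
AC   P01588;
DE   ERYTHROPOIETIN PRECURSOR.
OS   Homo sapiens (Human).
SQ   SEQUENCE   193 AA;
     MGVHECPAWL WLLLSLLSLP LGLPVLGAPP RLICDSRVLE RYLLEAKEAE NITTGCAEHC
//
"""


def parse_entry(text):
    """Group the lines of one entry by line code and collect the sequence."""
    fields = defaultdict(list)
    sequence_chunks = []
    in_sequence = False
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("//"):
            break  # termination line: end of the entry
        code, content = line[:2], line[5:].rstrip()
        if code == "SQ":
            in_sequence = True
            fields[code].append(content)
        elif in_sequence and not code.strip():
            # Sequence lines have a blank code and space-separated blocks.
            sequence_chunks.append(content.replace(" ", ""))
        else:
            fields[code].append(content)
    return dict(fields), "".join(sequence_chunks)


fields, sequence = parse_entry(ENTRY)
print(fields["AC"], fields["OS"], len(sequence))
```

Keeping every code as a list reflects the "one or more" occurrence rules in the table, since most line types can repeat within an entry.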

TrEMBL (Translated EMBL)
- TrEMBL: created in 1996
- Computer-annotated supplement to SWISS-PROT, as it is impossible to cope with the flow of data…
- Well-structured SWISS-PROT-like resource
- Derived from automated EMBL CDS translation (maintained at the EBI (UK))
- TrEMBL is automatically generated and annotated using software tools (not comparable to SWISS-PROT in terms of quality)
- TrEMBL contains everything that is not yet in SWISS-PROT
- Yerk!! But there is no choice, and these software tools are becoming quite good!
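
The core of "automated CDS translation" is a codon-table lookup. The following is a minimal sketch, not TrEMBL's actual pipeline: it assumes the standard genetic code, a CDS that starts in frame, and a short made-up DNA string whose codons were chosen to spell the first five residues of the EPO_HUMAN sequence.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unknown codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Made-up example CDS: ATG GGG GTG CAC GAA TGA
peptide = translate_cds("ATGGGGGTGCACGAATGA")
print(peptide)
```

Real pipelines also have to deal with alternative genetic codes, partial CDS features and joined exons, which this sketch ignores.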

The simplified story of a Sprot entry
[Diagram. Stages shown: cDNAs, genomes, …; EMBLnew; EMBL; CDS; TrEMBLnew; TrEMBL; SWISS-PROT.]
- « Automatic »: redundancy check (merge); InterPro (family attribution); annotation
- « Manual »: redundancy (merge, conflicts); annotation; Sprot tools (macros…); Sprot documentation; Medline; databases (MIM, MGD…); brain storming
- Once in Sprot, the entry is no longer in TrEMBL, but is still in EMBL (archive)

TrEMBL: example
Original TrEMBL entry which has been integrated into the SWISS-PROT EPO_HUMAN entry and is therefore no longer found in TrEMBL.

Some protein motif databases
- PROSITE: regular expressions built from SWISS-PROT
- PRINTS: aligned motif consensus built from OWL
- BLOCKS: PRINTS-like, generated from PROSITE families
- IDENTIFY: fuzzy regular expressions derived from PROSITE
- Pfam: Hidden Markov Models built from SWISS-PROT
- Profiles: weight matrix profiles built from SWISS-PROT
- InterPro: all of the above (almost)

PROSITE: a domain database synchronised with SWISS-PROT

History (the database)
- Founded by Amos Bairoch
- 1988: First release in the PC/Gene software
- 1990: Synchronisation with Swiss-Prot
- 1994: Integration of « profiles »
- 1999: PROSITE joins InterPro
- January 2003: Current release

Database content
Official release
  ~1330  Patterns        PSxxxxx    PATTERN
  ~252   Profiles        PSxxxxx    MATRIX
  4      Rules           PSxxxxx    RULE
  ~1156  Documentations  PDOCxxxxx
Pre-release
  ~150   Profiles        PSxxxxx    MATRIX
  ~100   Documentations  QDOCxxxxx

Prosite (pattern): example
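
As a rough illustration of how a PROSITE pattern is applied to a sequence, the sketch below converts a subset of the PROSITE pattern syntax into a Python regular expression. It is not part of the original slides. It handles only elements separated by "-", x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) repetitions and the "<" and ">" anchors, and the function names are illustrative. The pattern used, N-{P}-[ST]-{P}, is the widely cited N-glycosylation site pattern, applied here to the EPO_HUMAN sequence from the FASTA slide.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a (subset of) PROSITE pattern syntax to a Python regex."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # N-terminal anchor
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # C-terminal anchor
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional repetition.
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."  # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # forbidden residues
        # [ABC] groups and single letters are already valid regex.
        parts.append(prefix + core + ("{" + repeat + "}" if repeat else "") + suffix)
    return "".join(parts)


def find_sites(pattern: str, sequence: str):
    """Return 1-based start positions of all (possibly overlapping) matches."""
    regex = "(?=" + prosite_to_regex(pattern) + ")"
    return [m.start() + 1 for m in re.finditer(regex, sequence)]


EPO_HUMAN = (
    "MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAE"
    "NITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEA"
    "VLRGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPD"
    "AASAAPLRTITADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR"
)

regex = prosite_to_regex("N-{P}-[ST]-{P}")
sites = find_sites("N-{P}-[ST]-{P}", EPO_HUMAN)
print(regex, sites)  # N[^P][ST][^P] [51, 65, 110]
```

A pattern match only flags a candidate site. Short patterns like this one match many sequences by chance, which is one reason PROSITE attaches documentation to each pattern.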

Database content: documentation

Other protein domain/family db
  PROSITE      Patterns / Profiles
  ProDom       Aligned motifs (PSI-BLAST) (Pfam B)
  PRINTS       Aligned motifs
  Pfam         HMM (Hidden Markov Models)
  SMART        HMM
  TIGRfam      HMM
  DOMO         Aligned motifs
  BLOCKS       Aligned motifs (PSI-BLAST)
  CDD (CDART)  PSI-BLAST (PSSM) of Pfam and SMART
[Diagram label alongside the table: InterPro]

InterPro:

InterPro example

InterPro graphic example

Genomic Databases
- Genome databases differ from sequence databases in that the data contained in them are much more diverse.
- The idea behind a genome database is to organize all information on an organism (or as much as possible).
- In many cases they stem from the need for a centralized resource for a particular genome project. But of course they are also important resources for the research community.

Genomic Databases
- Ensembl
- Genome Browser
- NCBI

Structure Databases
- PDB
- SCOP

PDB
- The Protein Data Bank (PDB) was established at Brookhaven National Laboratories (BNL) in 1971 as an archive for biological macromolecular crystal structures.
- The three-dimensional structures in PDB are primarily derived from experimental data obtained by X-ray crystallography and NMR.

SCOP
- The SCOP database groups different protein structures according to their evolutionary relationships. The evolutionary relationships of all known protein structures have been determined by manual inspection and automated methods.
- The goal of SCOP is to provide detailed information about the close relatives of proteins and to provide an evolution-based protein classification resource.

UniProt: United Protein database
- SWISS-PROT + TrEMBL + PIR = UniProt
- Born in October 2002
- NIH pledges cash for global protein database
- The United States is turning to European bioinformatics facilities to help it meet its researchers' future needs for databases of protein sequences.
- European institutions are set to be the main recipients of a $15-million, three-year grant from the US National Institutes of Health (NIH), to set up a global database of information on protein sequence and function known as the United Protein Databases, or UniProt (Nature, 419, 101 (2002))

Some examples of integrated biological database resources are:
- SRS (Sequence Retrieval System)
- Entrez Browser (at NCBI)
- ExPASy (home of SwissProt)
- Ensembl (Open Source based system)
- Human Genome Browser (Jim Kent's creation)

THANKS
- Laurent Falquet, SIB and EMBnet-CH, for slides and information on SwissProt and Prosite