Sequence databases and retrieval systems Guy Perrière [ replaced by Manolo Gouy ] Pôle Bio-Informatique Lyonnais Laboratoire de Biométrie et Biologie Évolutive.


Sequence databases and retrieval systems Guy Perrière [ replaced by Manolo Gouy ] Pôle Bio-Informatique Lyonnais Laboratoire de Biométrie et Biologie Évolutive UMR CNRS n° 5558 Université Claude Bernard – Lyon 1

In the beginning First paper compilation in 1965 (Atlas of Protein Sequences). Development of real databanks at the beginning of the 1980s: Fast access. Analyses that require a lot of data become possible: codon usage, molecular phylogeny.

General databanks Nucleotide sequences: EMBL/GenBank/DDBJ. Protein sequences: Simple translations of coding regions: –GenPept (from GenBank). –TrEMBL (from EMBL). Systems containing additional data: –SWISS-PROT. –PIR.

EMBL Created in 1980 at the European Molecular Biology Laboratory in Heidelberg. Maintained since 1994 at the European Bioinformatics Institute (EBI) near Cambridge. Web server:

GenBank Set up in 1979 at the Los Alamos National Laboratory in New Mexico, US. Maintained since 1992 at the National Center for Biotechnology Information (NCBI) in Bethesda. Web server:

DDBJ Active since 1984 at the National Institute of Genetics (NIG) in Mishima, Japan. Web server:

EMBL / GenBank / DDBJ The International Nucleotide Sequence Database Collaboration: EMBL / GenBank / DDBJ. New sequences are exchanged daily between the three centers, so the three banks have identical content. Data are mainly provided by direct submissions from the authors through the Internet: Web forms.

Data growth log (number of residues)

GenBank/EMBL size (April 2003) 31 × 10^9 nucleotides. 24 × 10^6 sequences. 1.8 million genes (proteins and RNA). 313,000 bibliographic references. 100 gigabytes on disk. Growth of 63 % in 12 months.
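The growth figure on this slide can be turned into a doubling time. The sketch below assumes, purely for illustration, that the 63 % annual growth continues at a constant exponential rate; the function name is hypothetical.

```python
import math

def doubling_time_months(annual_growth: float) -> float:
    """Months needed for the bank to double in size.

    Assumes constant exponential growth (an assumption, not a claim
    made on the slide).
    """
    return 12 * math.log(2) / math.log(1 + annual_growth)

# 63 % growth in 12 months, as quoted for April 2003.
months = doubling_time_months(0.63)
print(f"Doubling time: about {months:.0f} months")
```

Under that assumption the banks double in size roughly every 17 months.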

Taxonomic sampling (April 2003) There are 135,560 species for which at least one sequence is available. Nine species (0.007 %) correspond to 62 % of the total. 77,900 species are represented by only one sequence! The nine most represented species in GenBank/EMBL: Homo sapiens, Mus musculus, Zea mays, Rattus norvegicus, Brassica oleracea, Arabidopsis thaliana, Danio rerio, Drosophila melanogaster, Oryza sativa. Shares shown in the figure: 27.3 %, 20.1 %, 3.0 %, 2.9 %, 2.3 %, 2.0 %, 1.4 %, 0.9 %.

Distribution format The banks are distributed as a set of text files called divisions ( 292 for EMBL). A division contains sequences related to: A taxon (e.g., bacteria, invertebrates, mammals). A class of sequences (EST, HTG, GSS). Within a division, each sequence is called an entry.

Entry structure Information is introduced in structured fields. The format differs in its form between EMBL and GenBank/DDBJ … but not in substance.

ID, AC, SV and DT fields Contain identifiers and the creation and last modification dates for the entries.
ID   BSAMYL standard; DNA; PRO; 2680 BP.
XX
AC   V00101; J01547
XX
SV   V
XX
DT   13-JUL-1983 (Rel. 03, Created)
DT   12-NOV-1996 (Rel. 49, Last updated, Version 11)

DE, KW, OS and OC fields Definition, keywords, taxonomy.
DE   Bacillus subtilis amylase gene.
XX
KW   amyE gene; amylase; amylase-alpha;
KW   regulatory region; signal peptide.
XX
OS   Bacillus subtilis
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Bacillus.
The NCBI maintains a unified taxonomy, largely based on sequence information.
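Because every line of an entry starts with a two-letter code, the fields can be collected with very little code. The following is a minimal sketch, not a full EMBL parser: the function name and the abridged sample entry are illustrative, and it assumes the code is separated from the content by whitespace.

```python
from collections import defaultdict

def group_by_line_code(entry_text: str) -> dict[str, list[str]]:
    """Group the lines of an EMBL-style flat-file entry by two-letter code."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        code, _, content = line.partition(" ")
        if code == "XX" or not code:  # XX lines are empty separators
            continue
        fields[code].append(content.strip())
    return dict(fields)

# Abridged sample built from the lines shown on the slides.
SAMPLE = """\
ID   BSAMYL standard; DNA; PRO; 2680 BP.
XX
AC   V00101; J01547
XX
DE   Bacillus subtilis amylase gene.
XX
KW   amyE gene; amylase; amylase-alpha;
KW   regulatory region; signal peptide.
XX
OS   Bacillus subtilis
"""

fields = group_by_line_code(SAMPLE)
accessions = [a.strip() for a in fields["AC"][0].rstrip(";").split(";")]
keywords = " ".join(fields["KW"])  # rejoin a field that spans two lines
print(accessions)
print(keywords)
```

Fields that span several lines (KW, OC, RT, FT) come back as lists, so the caller decides how to rejoin them.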

RN, RX, RA and RT fields contain bibliographic information.
RN   [1]
RP
RX   MEDLINE;
RA   Yang M., Galizzi, A., Henner, D.J.;
RT   "Nucleotide sequence of the amylase gene from
RT   Bacillus subtilis";
RL   Nucleic Acids Res. 11: (1983).
…

FT field contains the descriptions of functional regions.
     key             location and qualifiers
FT   promoter
FT                   /note="put. promoter sequence P2 [3] (amyR1)"
FT   RBS
FT                   /note="rRNA-binding site rbs-1 [3]"
FT   CDS
FT                   /gene="amyE"
FT                   /db_xref="SWISS-PROT:P00691"
FT                   /product="alpha-amylase precursor"
FT                   /EC_number=" "
FT                   /protein_id="CAA "
FT                   /translation="MFAKRFKTSLLPLFAGFLLLFHLVLAGPAA
FT                   ASAETANKSNELTAPSIKSGTILHAWNWSFNTLKHNMKDIHDAG...

Intron/exon structure
FT   CDS             join( , , )
FT                   /codon_start=1
FT                   /db_xref="SWISS-PROT:P01308"
FT                   /note="precursor"
FT                   /gene="INS"
FT                   /product="insulin"
...
[Figure: Sequence → Subsequence]
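A join(...) location tells a program how to assemble the coding subsequence from the exons of the full sequence. The sketch below illustrates the idea on a toy sequence; the coordinates and the sequence are hypothetical (the slide's own coordinates are not preserved), and complement() or partial locations are not handled.

```python
import re

def parse_join(location: str) -> list[tuple[int, int]]:
    """Parse 'join(a..b,c..d)' or 'a..b' into 1-based inclusive (start, end) pairs."""
    return [(int(s), int(e)) for s, e in re.findall(r"(\d+)\.\.(\d+)", location)]

def extract_cds(sequence: str, location: str) -> str:
    """Concatenate the segments named by a feature-table location."""
    return "".join(sequence[start - 1:end] for start, end in parse_join(location))

# Hypothetical toy gene: exons in upper case, introns in lower case.
toy = "ATGGCCgtaagtTTTAAAccagGGGTGA"
cds = extract_cds(toy, "join(1..6,13..18,23..28)")
print(cds)
```

Feature-table positions are 1-based and inclusive, hence the `start - 1` when slicing a Python string.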

SQ field Contains the sequence itself.
SQ   Sequence 2680 BP; 825 A; 520 C; 642 G; 693 T; 0 other;
     gctcatgccg agaatagaca ccaaagaaga actgtaaaaa cgggtgaagc agcagcgaat 60
     agaatcaatt gcttgcgcct ttgcggtagt ggtgcttacg atgtacgaca gggggattcc 120
     ccatacattc ttcgcttggc tgaaaatgat tcttcttttt atcgtctgcg gcggcgttct 180
     gtttctgctt cggtatgtga ttgtgaagct ggcttacaga agagcggtaa aagaagaaat 240
     (...)
     gatggtttct tttttgttca taaatcagac aaaacttttc tcttgcaaaa gtttgtgaag 2580
     tgttgcacaa tataaatgtg aaatacttca caaacaaaaa gacatcaaag agaaacatac 2640
     cctgcaagga tgctgatatt gtctgcattt gcgccggagc 2680
//
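The SQ header announces the length and base composition, so a reader can check an entry for truncation. This is a minimal sketch with hypothetical helper names; it uses the header and the first sequence line from the slide.

```python
import re
from collections import Counter

def parse_sq_header(line: str) -> dict[str, int]:
    """Read the counts announced on an EMBL 'SQ' header line."""
    counts = {"BP": int(re.search(r"(\d+) BP", line).group(1))}
    for number, base in re.findall(r"(\d+) (A|C|G|T|other);", line):
        counts[base] = int(number)
    return counts

def composition(sequence_lines: list[str]) -> Counter:
    """Count residues, ignoring spaces and the running position numbers."""
    return Counter(ch for line in sequence_lines for ch in line.lower() if ch.isalpha())

header = "SQ   Sequence 2680 BP; 825 A; 520 C; 642 G; 693 T; 0 other;"
announced = parse_sq_header(header)
total = sum(announced[key] for key in ("A", "C", "G", "T", "other"))

first_line = "gctcatgccg agaatagaca ccaaagaaga actgtaaaaa cgggtgaagc agcagcgaat 60"
observed = composition([first_line])
print(total == announced["BP"], sum(observed.values()))
```

On a complete entry, `composition` applied to all sequence lines should reproduce the announced counts.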

Errors in databanks There are a lot of errors in the nucleotide sequence databanks: In annotations: –Inaccuracies, omissions, and even mistakes. –Inconsistencies between entries. In the sequences themselves: –Sequencing errors. –Cloning vectors inserted.

Redundancy Another major problem is redundancy. A lot of entries are partially or entirely duplicated: 20% of vertebrate sequences in GenBank. Duplicated entries are often different in their sequence. [Figure: partial and complete sequence duplications]

Variations in duplicates It is often impossible to decide whether a difference between two duplicates is due to: Polymorphism. Sequencing error. True gene duplication. And what to do when annotations differ or are even contradictory?

Protein sequence databases Sources of data: translation of coding DNA sequences (CDS) from EMBL/GenBank/DDBJ; consultation of publications or patents; a very small number of direct protein sequence submissions by authors. In SWISS-PROT and PIR: additional annotations.

SWISS-PROT Created by Amos Bairoch in 1986 at the Department of Medical Biochemistry in Geneva. Maintained by the Swiss Institute of Bioinformatics (SIB) and funded by GeneBio, and, very recently, by NIH. Web server:

SWISS-PROT characteristics Almost no redundancy. Cross-references with 60 other databanks. High-quality annotations: Systematic control by a team of annotators. Help from a set of > 200 volunteer experts. Embedded in ExPASy, a Web proteomics server.

Annotations Protein function. Post-translational modifications. Structural or functional domains. Secondary and quaternary structures. Similarities with other proteins. Conflicts between positions for CDS. Disease-related mutations

Associated databanks TrEMBL, built using only annotated CDS from the EMBL data library. ENZYME, for the international enzyme nomenclature. PROSITE, for biologically significant sites, patterns and profiles. SWISS-2DPAGE, for two-dimensional polyacrylamide gel electrophoresis maps.

PIR PIR (The Protein Information Resource) was created by Margaret Dayhoff in Aims: To provide exhaustive and non-redundant protein sequence data. To give a classification using taxonomic and similarity data: entries grouped in super-families, families and subfamilies.

Data maintenance Three organizations collect and organize the data entered in PIR: The National Biomedical Research Foundation (NBRF) in the United States. The Martinsried Institute for Protein Sequences (MIPS) in Germany. The Japan International Protein Sequence Information Database (JIPID) in Japan.

Results Coverage is no better than what is obtained with SWISS-PROT+TrEMBL. Still contains redundancy. Less comprehensive annotation. Low number of cross-references. PIR has recently joined forces with EBI and SIB to establish UniProt (United Protein Databases), the central resource for protein sequence and function.

Specialized databanks A lot of specialized databanks have been developed, which are devoted to: Complete genomes. Families of homologous genes. Non-sequence data. These systems are under the responsibility of curators: Data quality and homogeneity control.

Complete genomes There are a large number of databanks devoted to specific organisms. These banks are associated with sequencing or mapping projects. For some model organisms there are often several competing systems.

Examples Organism: available databanks
Bacillus subtilis: NRSub (Non-Redundant B. subtilis), SubtiList
Escherichia coli: Colibri, EcoGene (E. coli Gene Database), ECDC (E. coli Database Collection)
Various prokaryotes: CMR (Comprehensive Microbial Resource), EMGLib (Enhanced Microbial Genomes Library), Micado (Microbial Advanced Database Organization)
Saccharomyces cerevisiae: MYGD (MIPS Yeast Genome Database), SGD (Saccharomyces Genome Database), YPD (Yeast Proteome Database)
Drosophila melanogaster: FlyBase
Plasmodium falciparum: PlasmoDB (P. falciparum Database)
Caenorhabditis elegans: WormBase, WormPD (Worm Protein Database)
Arabidopsis thaliana: TAIR (The Arabidopsis Information Resource)

Gene family databanks Built with automated procedures: Similarity search between sets of proteins (BLASTP, FASTP, Smith-Waterman). Clustering into homologous families using similarity criteria. Include various data: Protein (and sometimes nucleotide) sequences. Multiple sequence alignments and trees. Taxonomy.

ProtFam Developed at MIPS. Built with PIR sequences. Includes four levels of classification: Superfamilies (based on function and similarity criteria). Families (50% similarity). Subfamilies (80% similarity). Entries (≥95% similarity).

ProtFam characteristics Lets users visualize alignments and dendrograms for the families. Integrates Pfam domains. Allows users to classify their own protein sequences. Web server:

ProtoMap Initially developed at the Hebrew University of Jerusalem; now hosted at Cornell University. Built with SWISS-PROT and TrEMBL sequences. Combines 3 sequence similarity measures (BLASTP, FASTA and Smith-Waterman).

ProtoMap characteristics Alignments and trees are visualized with Java applets. Users can submit sequences and classify them. Web server:

Specialized systems HOVERGEN (Homologous Vertebrate Genes Database) : Based on GenBank CDS. HOBACGEN (Homologous Bacterial Genes Database) for prokaryotes and yeast: Based on SWISS-PROT/TrEMBL. HOBACGEN-CG for completely sequenced genomes: Based on SWISS-PROT/TrEMBL.

Other specialized systems COG (Clusters of Orthologous Groups), also for complete genomes: Based on GenBank CDS. NuReBase (Nuclear Receptors Database) for mammalian nuclear receptors: Based on EMBL CDS. RTKdb (Tyrosine Kinase Receptors): Based on EMBL CDS.

Are COGs real orthologs? Reciprocal best BLAST hit. [Figure: tree of glutamate synthase large subunit sequences from Escherichia coli, Bacillus subtilis, Pseudomonas aeruginosa, Vibrio cholerae and Synechocystis sp.]
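The reciprocal best hit criterion named on this slide can be stated in a few lines. This is a generic sketch of the criterion, not the COG construction procedure itself; the gene names and bit scores are hypothetical.

```python
def best_hits(scores: dict[tuple[str, str], float]) -> dict[str, str]:
    """Map each query to its highest-scoring subject."""
    best: dict[str, tuple[str, float]] = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a) -> set[tuple[str, str]]:
    """Pairs (a, b) where b is a's best hit in genome B and a is b's best hit in genome A."""
    ab, ba = best_hits(a_vs_b), best_hits(b_vs_a)
    return {(a, b) for a, b in ab.items() if ba.get(b) == a}

# Hypothetical bit scores between two toy genomes A and B.
a_vs_b = {("a1", "b1"): 950.0, ("a1", "b2"): 400.0,
          ("a2", "b1"): 500.0, ("a2", "b2"): 300.0}
b_vs_a = {("b1", "a1"): 940.0, ("b1", "a2"): 510.0,
          ("b2", "a1"): 410.0, ("b2", "a2"): 290.0}
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy case a2 and b2 get no partner at all, which hints at why reciprocal best hits alone may not recover true orthologs when genes have been duplicated or lost.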

Beyond protein families ProtFam, HOVERGEN, HOBACGEN and COG gather protein sequences that are homologous over their whole length. Patterns, profiles, domains, etc. are covered in Terry Attwood’s lecture.

HOBACGEN Integrates protein and nucleotide sequences as well as multiple alignments and trees. Is based upon a client/server architecture. Client software is distributed as well as the server structure (including all sequences). Web server:

Similarities search SWISS-PROT/TrEMBL sequences → BLASTP2 (BLOSUM62, E ≤ …, SEG filter) → local alignments for sequence pairs.

Segments selection [Figure: HSP segments S1 to S4 (and S1’) between Seq. A and Seq. B, with HSP lengths lgHSP1 and lgHSP2 and lengths ∆lg1, ∆lg2 and ∆lg3]

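Families assembly Two sequences are linked when their HSPs cover ≥ 80% of the length and similarity is ≥ 50%. Simple link inclusion: if A is linked to B and A is linked to C, then A, B and C are grouped together.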
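The simple-link grouping can be sketched with a union-find structure. The thresholds come from the slide; everything else (function name, how coverage is measured, the pairwise values) is an assumption made for illustration.

```python
def build_families(names, pairs, min_coverage=0.80, min_similarity=0.50):
    """Single-linkage grouping: link two sequences when both criteria pass."""
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b, coverage, similarity in pairs:
        if coverage >= min_coverage and similarity >= min_similarity:
            parent[find(a)] = find(b)

    families = {}
    for name in names:
        families.setdefault(find(name), set()).add(name)
    return sorted(families.values(), key=lambda family: sorted(family))

# Hypothetical pairwise results: (seq1, seq2, HSP coverage of length, similarity).
pairs = [("A", "B", 0.92, 0.61), ("A", "C", 0.85, 0.55),
         ("B", "C", 0.40, 0.70), ("C", "D", 0.90, 0.30)]
families = build_families(["A", "B", "C", "D"], pairs)
print(families)
```

B and C end up in the same family although their own pair fails the coverage test, because each is linked to A; that is the "simple link" behaviour.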

Alignments and trees Unaligned family → CLUSTAL W (default parameters) → aligned family → BIONJ (observed divergence) → family tree, rooted by mid-point.
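The distances fed to BIONJ are observed divergences, that is, the fraction of aligned sites that differ. Below is a minimal sketch of that computation; ignoring every column with a gap in either sequence is an assumption, the toy alignment is hypothetical, and the tree-building step itself is not shown.

```python
def observed_divergence(seq1: str, seq2: str) -> float:
    """Fraction of differing sites among columns with no gap in either sequence."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must come from the same alignment")
    compared = [(x, y) for x, y in zip(seq1, seq2) if x != "-" and y != "-"]
    if not compared:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in compared) / len(compared)

def distance_matrix(alignment: dict[str, str]) -> dict[tuple[str, str], float]:
    """Pairwise observed divergences for every pair of aligned sequences."""
    names = sorted(alignment)
    return {(a, b): observed_divergence(alignment[a], alignment[b])
            for i, a in enumerate(names) for b in names[i + 1:]}

# Hypothetical toy alignment of three protein fragments.
aln = {"A": "MFAK-RFKTS", "B": "MFAKQRFKTS", "C": "MYAK-RWKTA"}
d = distance_matrix(aln)
print(d)
```

Observed divergence underestimates the true number of substitutions for distant sequences, since multiple changes at one site are counted once.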

Domains and Families Proteins can be made of very different sets of domains

Site, Motif, Domain
Simple motifs: Patterns (PROSITE).
Complex motifs: Fingerprint, a series of aligned motifs (PRINTS). Ungapped alignment of segments (BLOCKS).
Alignments of whole domains: Profiles (PROSITE). HMM (Pfam).
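A PROSITE pattern is close to a regular expression, which makes the "simple motif" level easy to illustrate. The converter below is a sketch covering the common syntax only (x, [..], {..}, repeat counts, and anchors outside brackets); the function name is hypothetical. The example uses the well-known N-glycosylation site pattern N-{P}-[ST]-{P}.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern such as 'N-{P}-[ST]-{P}' into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("<", "^").replace(">", "$")  # sequence ends
        element = element.replace("x", ".")                    # any residue
        element = element.replace("{", "[^").replace("}", "]")  # excluded residues
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        parts.append(element)
    return "".join(parts)

regex = prosite_to_regex("N-{P}-[ST]-{P}")
hit = re.search(regex, "AANGSAA")
miss = re.search(regex, "AANPSAA")
print(regex, hit.group(), miss)
```

The braces are rewritten before the parentheses so that a repeat count such as x(2,4) is not mistaken for an exclusion set.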

ProDom: defining domain structure [Figure: ProDom domains for the 6PGD family: 6PG1_YEAST, 6PGD_CANAL, 6PGD_SOYBN, 6PG2_BACSU, O32911_MYCLR, P95165_MYCTU, 6PGD_CERCA, Q40311_MEDSA, Y770_MYCTU, Y229_SYNY3]

InterPro InterPro unifies PROSITE, PRINTS, Profile, ProDom, Pfam, SMART, and TIGRFam.

An InterPro entry
Accession: IPR
Name: Bacterial rhodopsin
Type: Family
Dates: 08-OCT-1999 (created), 28-FEB-2000 (last modified)
Signatures: PROSITE PS00327 BACTERIAL_OPSIN_RET; PROSITE PS00950 BACTERIAL_OPSIN_1; PRINTS BACTRLOPSIN; PFAM PF01036 Bac_rhodopsin
Abstract: The bacterial opsins are retinal-binding proteins that provide light-dependent ion transport and sensory functions to a family of halophilic bacteria [1, 2]. They are integral membrane proteins believed to contain seven transmembrane (TM) domains, the last of which contains the attachment point for retinal (a conserved lysine)....
Examples: Q48315 BACH_HALHP: Halorhodopsin; Q53496 BACR_HALSR: Cruxrhodopsin; P15647 BACH_NATPH; P96787 BAC3_HALSD: Archaearhodopsin

Non-sequence data Data: available systems
Gene expression: GXD (Mouse Gene Expression Database), The Stanford Microarray Database
Mapping: GDB (Genome Data Base), EMG (Encyclopedia of Mouse Genome), MGD (Mouse Genome Database), INE (Integrated Rice Genome Explorer)
Protein quantification: SWISS-2DPAGE, PDD (Protein Disease Database), Sub2D (B. subtilis 2D Protein Index)
3D structures: PDB (Protein Data Bank), MMDB (Molecular Modelling Data Base), NRL_3D (Non-Redundant Library of 3D Structures), SCOP (Structural Classification of Proteins)
Polymorphism: ALFRED (Allele Frequency Database)
Molecular interactions: DIP (Database of Interacting Proteins), BIND (Biomolecular Interaction Network Database)

Sequence data retrieval Made mainly through Internet access: With client software (e.g., Entrez, HobacFetch). By remote connections to servers providing online access to the banks (INFOBIOGEN). Using World-Wide Web servers and browsers.

Advantages and limitations Users do not have to cope with the usual database problems: Storing of large amounts of data. Daily updates. Software upgrades. Simplicity of use. Net access is sometimes very slow at peak hours: consider using other servers besides NCBI.

The ACNUC retrieval system Direct access to functional regions described in feature tables (CDS, tRNA, rRNA). Selection of entries using various criteria: Sequence names and accession numbers. Bibliographic criteria. Keywords. Taxonomy. Organelle. Developed at Lyon University

ACNUC : possible accesses Graphical interface distributed along with the databases themselves. Web access at Pôle Bio-Informatique Lyonnais (PBIL):

ACNUC characteristics Can query any bank in PIR, SWISS-PROT, EMBL, or GenBank format. Keywords and species browsing. Complex queries. Links with sequence analysis programs on the Web server (alignment, codon usage).


The Query form

Building queries to the sequence databases


Retrieving sequences Locally save the received sequence data.

Browsing the species trees

HOVERGEN: Families of homologous vertebrate genes

Access to family members Download tree or alignment

SRS Public version developed at EMBL by Etzold and Argos (1993). Presently available on the different Web servers belonging to EMBnet: EBI (England). INFOBIOGEN (France). DKFZ (Germany). …

Characteristics Database index built with the use of ODD (Object Design and Definition). More than 250 databanks have been indexed and are accessible through 35 SRS servers. Allows queries to operate simultaneously on different banks.

Databanks interconnection

Entrez Developed by Schuler et al. (1996) at NCBI. Queries several US-made databases: GenBank, GenPept, NR, MMDB, MEDLINE. Access through client software (Unix, Mac or Windows) or Web server:

Characteristics Introduces the concept of neighbours between sequences, references and structures. Sequence neighbours are established using similarity criteria. No access to multiple alignments. [Figure: linked Entrez databases: Phylogeny (Taxman), Structures (MMDB), Refs. (PubMed), Complete Genomes, Nucl. Seq. (GenBank), Prot. Seq. (GenPept)]

NAR 2003 database issue