Sequencing the World of Possibilities for Energy & Environment MGM workshop. 19 Oct 2010 Information Sources for Genomics Konstantinos Mavrommatis Genome Biology Program
Databases
Databases used for the analysis of biological molecules.
Databases contain information organized in a way that allows users/researchers to retrieve and exploit it.
Why bother?
  Store information.
  Organize data.
  Predict features (genes, functions...).
  Predict the functional role of a feature (annotation).
  Understand relationships (metabolic reconstruction).

Overview
Sequence databases
  Primary (contain “raw” data)
    Nucleotide
    Protein
  Secondary (processed information)
    Genes
    Proteins
Classification databases
  Sequence classification
  Function classification
  Other methods
Other specialized databases

Primary nucleotide databases
EMBL/GenBank/DDBJ
Archive containing all sequences from:
  genome projects
  sequencing centers
  individual scientists
  patent offices
The sequences are exchanged between the three centers on a daily basis.
Database is doubling every 10 months.
Sequences from >140,000 different species.
1400 new species added every month.

Year    Base pairs         Sequences
…       …,575,745,176      40,604,…
…       …,037,734,462      52,016,…
…       …,019,290,705      64,893,…
…       …,874,179,730      80,388,…
…       …,116,431,942      98,868,465

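The "doubling every 10 months" figure can be turned into rough growth estimates. The sketch below assumes simple exponential growth with a constant doubling time, which is an idealization; the function names are illustrative and not part of any database tooling.

```python
import math

DOUBLING_TIME_MONTHS = 10  # figure quoted on the slide


def growth_factor(months, doubling_time=DOUBLING_TIME_MONTHS):
    """Factor by which the archive grows over `months`, assuming constant doubling."""
    return 2 ** (months / doubling_time)


def months_to_grow(factor, doubling_time=DOUBLING_TIME_MONTHS):
    """Months needed for the archive to grow by `factor` under the same assumption."""
    return doubling_time * math.log2(factor)


yearly = growth_factor(12)    # growth over one year, about 2.3x
tenfold = months_to_grow(10)  # time to grow tenfold, about 33 months
print(f"{yearly:.2f}x per year, tenfold in {tenfold:.1f} months")
```

Under this assumption the archive more than doubles each year, so any local copy or analysis built on a snapshot ages quickly.
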
Primary protein sequence databases
Contain coding sequences derived from the translation of nucleotide sequences.
GenBank
  Valid translations (CDS) from nt GenBank entries.
UniProtKB/TrEMBL (1996)
  Automatic CDS translations from EMBL.
  TrEMBL Release 40.3 (26-May-2009) contains 7,916,844 entries.

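Deriving a protein entry from a nucleotide CDS can be illustrated with a minimal translation routine. This is a sketch only: it assumes the standard genetic code, a CDS that is already in frame, and it maps any codon containing an ambiguous base to "X". It does not reproduce the actual GenBank or TrEMBL pipelines, which also handle alternative genetic codes and annotated exceptions.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate an in-frame CDS, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    protein = []
    for i in range(0, len(cds), 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # unknown codon -> X
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate_cds("ATGGCCAAATAA"))  # MAK
```
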
Errors in databases
There are many errors in the primary sequence databases.
In the sequences themselves:
  Sequencing errors.
  Cloning vector sequences.
In the annotations, the free submission of entries results in:
  Inaccuracies, omissions, and even mistakes.
  Inconsistencies between some fields.

Redundancy
Redundancy is a major problem.
Entries are partially or entirely duplicated, e.g. 20% of vertebrate sequences in GenBank.
[Figure: Partial and complete sequence duplications]

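Complete and partial duplication can be illustrated with a naive check on a handful of entries. The sketch below is an assumption-laden toy: it only detects exact matches and exact containment of one sequence in another, ignores reverse complements and near-identical sequences, and uses made-up entry identifiers. Real redundancy removal relies on alignment-based clustering.

```python
def find_redundant(entries):
    """Return (redundant_id, kept_id, kind) for exact or contained duplicates.

    `entries` maps an entry identifier to its sequence.
    """
    # Longest first, so shorter entries are compared against kept longer ones.
    ordered = sorted(entries.items(), key=lambda item: (-len(item[1]), item[0]))
    kept = []
    redundant = []
    for seq_id, seq in ordered:
        seq = seq.upper()
        for kept_id, kept_seq in kept:
            if seq == kept_seq:
                redundant.append((seq_id, kept_id, "complete"))
                break
            if seq in kept_seq:
                redundant.append((seq_id, kept_id, "partial"))
                break
        else:
            kept.append((seq_id, seq))  # no match found, keep as representative
    return redundant


# Hypothetical entries for illustration only.
entries = {
    "A": "ATTGACTATATAGCCG",
    "B": "ATTGACTATATAGCCG",  # complete duplicate of A
    "C": "GACTATAT",          # contained in A
    "D": "CCCCGGGG",          # unrelated
}
print(find_redundant(entries))
```
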
NCBI Derivative Sequence Data
[Figure labels: Labs, GenBank, Curators, Algorithms, UniGene, RefSeq, Genome Assembly]

RefSeq
Curated transcripts and proteins, reviewed by NCBI staff.
Model transcripts and proteins, generated by computer algorithms.
Assembled genomic regions (contigs).
Chromosome records.

Secondary protein databases
UniProt/SWISS-PROT (1986)
  A curated protein sequence database.
  High level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.).
  A minimal level of redundancy.
  High level of integration with other databases.

Classification databases
Groups (families/clusters) of proteins based on:
  Overall sequence similarity.
  Local sequence similarity.
  Presence/absence of specific features (active site, signal peptides...).
  Structural similarity.
  ...
These groups contain proteins with similar properties:
  Specific function, enzymatic activity.
  General function.
  Evolutionary relationship.
  ...

Overall sequence similarity

Clusters of orthologous groups (COGs)
COGs were delineated by comparing protein sequences encoded in 43 complete genomes representing 30 major phylogenetic lineages.
Each cluster has representatives of at least 3 lineages.
A function (specific or broad) has been assigned to each COG.

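The "at least 3 lineages" rule can be expressed as a simple filter over candidate clusters. The sketch below assumes clusters have already been built by sequence comparison; the cluster names, protein identifiers, and lineage assignments are hypothetical and serve only to show the rule.

```python
def filter_clusters(clusters, lineage_of, min_lineages=3):
    """Keep clusters whose members span at least `min_lineages` lineages.

    `clusters` maps a cluster name to its member proteins;
    `lineage_of` maps each protein to its phylogenetic lineage.
    """
    kept = {}
    for cluster_id, proteins in clusters.items():
        lineages = {lineage_of[p] for p in proteins}
        if len(lineages) >= min_lineages:
            kept[cluster_id] = sorted(lineages)
    return kept


# Hypothetical data for illustration only.
clusters = {
    "cluster_1": ["p1", "p2", "p3", "p4"],
    "cluster_2": ["p5", "p6", "p7"],
}
lineage_of = {
    "p1": "Proteobacteria", "p2": "Firmicutes", "p3": "Euryarchaeota",
    "p4": "Proteobacteria", "p5": "Firmicutes", "p6": "Firmicutes",
    "p7": "Proteobacteria",
}
print(filter_clusters(clusters, lineage_of))
```

Here cluster_2 is dropped because its three proteins come from only two lineages.
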
Profiles & Pfam
One method for classifying proteins into groups exploits similarities between regions (domains/profiles), which carry valuable information.
These domains/profiles can be used to detect distant relationships, where only a few residues are conserved.

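How a profile captures a conserved region can be sketched with a position-specific scoring matrix built from a small alignment. This is a deliberately simplified stand-in: it assumes a uniform background, a fixed pseudocount, and no insertions or deletions, whereas Pfam and TIGRfam use profile HMMs. The alignment and query sequence are invented for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Per-column log-odds scores against a uniform background."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(ALPHABET)
    profile = []
    for col in range(length):
        # Gap characters and unknown residues are not counted.
        counts = Counter(row[col] for row in alignment if row[col] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def best_window(profile, sequence):
    """Slide the profile along the sequence; return (best_score, start)."""
    width = len(profile)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(aa, 0.0) for col, aa in zip(profile, window))
        best = max(best, (score, start))
    return best


# Toy alignment: four conserved positions, one variable position.
profile = build_profile(["HEAGH", "HELGH", "HEVGH", "HEIGH"])
score, start = best_window(profile, "MKTAYHEAGHLLPW")
print(start, round(score, 2))
```

The conserved columns dominate the score, so the motif is still found when the variable position differs, which is the property that lets profiles detect distant relatives.
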
Regions similarity

Pfam
HMMs of protein alignments: local (for domains) or global (covering the whole protein).

TIGRfam
Full-length alignments.
Domain alignments.
Equivalogs: families of proteins with specific function.
Superfamilies: families of homologous genes.
HMMs

KEGG orthology

Composite pattern databases
To simplify sequence analysis, the family databases are being integrated to create a unified annotation resource: InterPro.
Release 28.0 (Aug 10) contains 20,837 entries.
Central annotation resource, with pointers to its satellite databases.

* It is up to the user to decide if the annotation is correct *

ENZYME

KEGG
Contains information about biochemical pathways and protein interactions.

Sequencing projects
GOLD
  Information for ongoing and finished (meta)genomic projects.
  Information about the metadata of genomes and metagenomic samples.

Literature search
PubMed

Specialized databases
There is a large number of databases devoted to specific organisms.
For some model organisms there are often concurrent systems.
These databases are associated with sequencing or mapping projects.

Other specialized databases
Signal transduction, regulation, protein-protein interactions
  TRANSFAC (Transcription Factor database)
  BRITE (Biomolecular Relations in Information Transmission and Expression database)
  DIP (Database of Interacting Proteins)
  BIND (Biomolecular Interaction Network database)
  BioCarta
Biochemical pathways
  KLOTHO (Biochemical Compounds Declarative database)
  BRENDA (enzyme information system)
  LIGAND (similar to Enzyme but with more information for substrates)
Gene order and co-occurrence
  STRING
3D structures
  PDB (Protein Data Bank)
  MMDB (Molecular Modelling Data Base)
  NRL_3D (Non-Redundant Library of 3D Structures)
  SCOP (Structural Classification of Proteins)
Polymorphism
  ALFRED (Allele Frequency Database)
Molecular interactions
  DIP (Database of Interacting Proteins)
  BIND (Biomolecular Interaction Network Database)
Gene expression
  GXD (Mouse Gene Expression Database)
  The Stanford Microarray Database
Mapping
  GDB (Genome Data Base)
  EMG (Encyclopedia of Mouse Genome)
  MGD (Mouse Genome Database)
  INE (Integrated Rice Genome Explorer)
Protein quantification
  SWISS-2DPAGE
  PDD (Protein Disease Database)
  Sub2D (B. subtilis 2D Protein Index)

List of databases

Databanks interconnection
Not all databases are updated regularly.
Changes of annotation in one database are not reflected in others.

Concluding remarks
We have main archives (GenBank), curated databases (RefSeq, SwissProt), protein classification databases (COG, Pfam), and many, many more...
They help predict the function, or the network of functions.
Systems that integrate the information from several databases, visualize it, and allow handling of data in an intuitive way are required.

Thank you for your attention.