1 Databases in Bioinformatics (Roald Forsberg)
2 Overview The role of databases in bioinformatics The structure of databases –Relational databases –Database Management Systems –Accessing databases Types of databases –Data types –Integrated databases (Entrez) Nucleotide sequence formats –FASTA format –GenBank format –XML formats
3 Databases in Bioinformatics Bioinformatics – attempted definition: “The application of computational techniques to understand and organise the information associated with biological macromolecules” (adapted from the Oxford English Dictionary) [Figure: Biological experiments, Databases, Computational Biology]
4 Ask your neighbour What would you like to do with a database? Which types of biological information could be stored in a database?
5 Use of databases Homology searching: –Use of knowledge from other, often better-described organisms, such as the model organisms mouse, Drosophila, Fugu, C. elegans etc. –Sequence level – position, annotation –Structural level – proteins, RNA Evolutionary analyses: –Phylogenetics –Population genetics –Molecular evolution of genetic elements –Genome evolution Primer design Microarray design Drug design Many more…
6 General types of databases Primary –Raw, unprocessed data Secondary –Curated – data selected by criteria –e.g. non-redundancy, fold Tertiary –Processed data –e.g. HMM profiles
7 Structure of relational databases [Figure: entries (sequences starting MEQ…) stored in Table 1 and Table 2] Table = genetic element Fields: position = chr. 4; size = 3540 bp; coding = true; known EST = true; known structure = false
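The table-and-field layout on this slide can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names, column names, and the split into two linked tables are assumptions, not taken from the slide; only the field values (chr. 4, 3540 bp, and the true/false flags) come from it.

```python
import sqlite3

# In-memory database; all table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genetic_element (
    id              INTEGER PRIMARY KEY,
    position        TEXT,     -- e.g. 'chr. 4'
    size_bp         INTEGER,  -- length in base pairs
    coding          INTEGER,  -- 1 = true, 0 = false
    known_est       INTEGER,
    known_structure INTEGER
);
CREATE TABLE sequence_entry (
    id         INTEGER PRIMARY KEY,
    element_id INTEGER REFERENCES genetic_element(id),
    residues   TEXT
);
""")

# One row per table, using the field values shown on the slide.
conn.execute("INSERT INTO genetic_element VALUES (1, 'chr. 4', 3540, 1, 1, 0)")
conn.execute("INSERT INTO sequence_entry VALUES (1, 1, 'MEQ')")

# A query that combines both tables through the shared key.
row = conn.execute("""
    SELECT e.position, e.size_bp, s.residues
    FROM genetic_element AS e
    JOIN sequence_entry AS s ON s.element_id = e.id
    WHERE e.coding = 1
""").fetchone()
print(row)
```

Each field becomes a typed column, and the link between the two tables is expressed by a key rather than by duplicating data.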
8 Structure of relational databases [Figure: three layers] Database files (the data and the structure of the data) | Database management system (DBMS software, SQL: Structured Query Language) | Interface (terminal or Web) Input from a terminal, a browser, or scripts is sent as queries to the DBMS; the DBMS queries the data; results from the DBMS are returned as terminal output, browser output, stored results, or result files
9 Database management systems A software package designed to store and manage databases. A computerized record-keeping system Allows operations such as: –Adding new files –Inserting data into existing files –Retrieving data from existing files –Changing data –Deleting data –Removing existing files from the database
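The operations listed on this slide map onto SQL statements. The following is a minimal sketch using SQLite through Python's sqlite3 module, chosen here only because it needs no server; the table name `gene` and its rows are hypothetical, and a "file" on the slide is treated as a table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Adding a new "file" (a table, in relational terms)
cur.execute("CREATE TABLE gene (name TEXT PRIMARY KEY, size_bp INTEGER)")

# Inserting data into the existing table
cur.executemany(
    "INSERT INTO gene (name, size_bp) VALUES (?, ?)",
    [("geneA", 3540), ("geneB", 1200)],
)

# Retrieving data
rows_before = cur.execute(
    "SELECT name, size_bp FROM gene ORDER BY name"
).fetchall()

# Changing data
cur.execute("UPDATE gene SET size_bp = ? WHERE name = ?", (1250, "geneB"))

# Deleting data
cur.execute("DELETE FROM gene WHERE name = ?", ("geneA",))
rows_after = cur.execute(
    "SELECT name, size_bp FROM gene ORDER BY name"
).fetchall()

# Removing the table from the database
cur.execute("DROP TABLE gene")
remaining_tables = cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
conn.close()

print(rows_before, rows_after, remaining_tables)
```

The `?` placeholders let the DBMS handle quoting of values, which is the usual way to pass data into a query from a script.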
10 Accessing a database WEB – graphical user interface (GUI) WEB – automated procedures –Batch search with script (Entrez) –Search robots with updates Local –Buy a big computer and a thick cable –Speed improvement
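A scripted batch search against Entrez might start from something like the sketch below. It assumes NCBI's E-utilities `efetch.fcgi` endpoint with its `db`, `id`, `rettype`, and `retmode` parameters, none of which are named on the slide; the accession identifiers are hypothetical placeholders. The script only builds the request URLs and does not contact the server.

```python
from urllib.parse import urlencode

# Assumed base URL of the NCBI E-utilities (not given in the slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(db, ids, rettype="fasta", retmode="text"):
    """Build an efetch request URL for a list of record identifiers."""
    query = urlencode(
        {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": retmode}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical identifiers, for illustration only.
accessions = ["ACC0001", "ACC0002", "ACC0003", "ACC0004", "ACC0005"]
urls = [efetch_url("nucleotide", batch) for batch in batched(accessions, 2)]
for url in urls:
    print(url)
```

Grouping identifiers into batches keeps the number of requests low, which matters when the same script is run over thousands of records.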
11 Protein sequence databases Protein sequence (primary): SWISS-PROT; PIR-International (www.mips.biochem.mpg.de/proj/protseqdb) Protein sequence (composite): OWL; NRDB Protein sequence (secondary): PROSITE; PRINTS; Pfam
12 Nucleotide sequence databases GenBank (www.ncbi.nlm.nih.gov/Genbank) EMBL (www.ebi.ac.uk/embl) DDBJ (www.ddbj.nig.ac.jp)
13 Types of nucleotide data cDNA –Reverse-transcribed from mRNA Genomic sequences –Sequenced directly from the DNA of various species ESTs (expressed sequence tags) –A short fragment of a gene, sequenced from cDNA derived from mRNA
14 Macromolecular structure databases Protein Data Bank (PDB) Nucleic Acids Database (NDB) PDBsum (www.biochem.ucl.ac.uk/bsm/pdbsum) CATH SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/) FSSP (www.embl-ebi.ac.uk/dali/fssp)
15 Molecular interaction databases General –Biomolecular Interaction Network Database (http://bioinfo.mshri.on.ca/cgi-bin/bind/dataman) –Molecular Interactions Database (MINT) Protein-protein interactions –Database of Interacting Proteins Biochemical pathways –KEGG Metabolic Pathways http://
16 Proteomics databases Yeast Proteome Database SWISS-2DPAGE TMIG-2DPAGE
17 Genome databases Entrez genomes Ensembl genomes HIV Sequence Database FlyBase (http://flybase.bio.indiana.edu/) COGs (www.ncbi.nlm.nih.gov/COG)
18 Integrated databases Increasing the value of information InterPro (www.ebi.ac.uk/interpro) Sequence Retrieval System (SRS) Entrez (www.ncbi.nlm.nih.gov/Entrez)
19 Entrez The (ever) Expanding Entrez System [Figure: linked Entrez databases] PubMed, Journals, Books, Nucleotide, Protein, Genome, Structure, 3D Domains, CDD, Taxonomy, SNP, UniGene, UniSTS, PopSet, ProbeSet, OMIM
20 EBI services
21 The International Sequence Database Collaboration [Figure: three collaborating databases] GenBank (NCBI, NIH; accessed through Entrez) EMBL (EBI; accessed through SRS) DDBJ (CIB, NIG; accessed through getentry) Each database receives its own submissions and updates
22 A closer look at GenBank Maintained by NCBI Accessed through Entrez Synchronized with DDBJ and EMBL
23 Sequence file formats Ideally – a stringent, easy-to-parse, well-specified format to facilitate the dissemination of information Reality – a plethora of ad hoc and badly specified formats Different levels of information –Simple – sequence and name attribute –Advanced – several attributes Some common formats –FASTA –GenBank –PHYLIP (PHYLIP package and others) –Nexus (PAUP package, MacClade and others) –Up and coming: XML
24 FASTA format
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTT
GLLLNGSYSENRTQIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKT
VLPVTIMAGLVFHSQKYNLRLRQAWCHFPSNWKGAWKEVKEEIVNLP
KERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCKMDWFLNYL
NNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLE
TISKKTYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESI
WAAELDRYKLVEITPIGFAPTEVRRYTGGHERQKRVPFVXXXXXXXX
XXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
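Because a FASTA record is just a `>` header line followed by sequence lines, a parser fits in a few lines. The sketch below is one possible way to read such text in Python; the function name `parse_fasta` is an assumption, and the example input is the header and the first two sequence lines of the record on this slide.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Truncated example: header plus the first two sequence lines of the slide.
example = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTT
GLLLNGSYSENRTQIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKT
"""

records = list(parse_fasta(example))
header, sequence = records[0]
fields = header.split("|")  # the header packs several identifiers with '|'
print(fields, len(sequence))
```

The header has no formal structure beyond the leading `>`; splitting on `|` works for this record but is a convention, not part of the format.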
25 GenBank format A verbose but very informative format Contains much information in a carefully specified format Harder to parse than FASTA
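To give a feel for why GenBank is harder to parse than FASTA, here is a deliberately minimal sketch that reads only the top-level keyword lines and the sequence under ORIGIN. The record is an invented toy example, not real data, and the function name `parse_genbank` is an assumption; the sketch ignores continuation lines and the feature table, which a real parser would have to handle.

```python
# Invented toy record, for illustration only (not a real GenBank entry).
TOY_RECORD = """LOCUS       TOY0001                   24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical toy record for illustration only.
ACCESSION   TOY0001
ORIGIN
        1 atggaacaga ttctgaaagc ttaa
//
"""


def parse_genbank(text):
    """Collect top-level keyword values and the sequence of one record."""
    record = {"sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if in_origin:
            # Sequence lines: a position number followed by blocks of bases.
            record["sequence"] += "".join(line.split()[1:])
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line[:1].isalpha():
            # Top-level keywords start in the first column.
            keyword, _, value = line.partition(" ")
            record[keyword.lower()] = value.strip()
    return record


rec = parse_genbank(TOY_RECORD)
print(rec["accession"], rec["definition"], len(rec["sequence"]))
```

Even this reduced version needs state (inside or outside ORIGIN) and column-sensitive rules, which a FASTA reader does not.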
26 eXtensible Markup Language (XML) Markup language for data representation – derived from SGML, a sibling of HTML Stringent, simple language with rigid rules Human-readable and versatile Good parsers exist for multiple platforms The ability to design your own Document Type Definitions (DTDs), which parsers can use to validate a document, permits complex data structures and grammars Examples of use for sequence data: –NCBI GBSeqXML –NCBI TinySeqXML
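With a standard parser, reading sequence data from XML takes little code. The sketch below uses Python's xml.etree.ElementTree on a small document whose element names are modelled on NCBI TinySeqXML; treat those names as assumptions to check against the actual DTD, and note that the record content is invented. This parser checks only that the document is well-formed; it does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Invented toy document; element names modelled on TinySeqXML (assumption).
xml_text = """<TSeqSet>
  <TSeq>
    <TSeq_accver>TOY0001.1</TSeq_accver>
    <TSeq_defline>Hypothetical toy record</TSeq_defline>
    <TSeq_length>12</TSeq_length>
    <TSeq_sequence>ATGGAACAGTAA</TSeq_sequence>
  </TSeq>
</TSeqSet>"""

root = ET.fromstring(xml_text)
records = []
for tseq in root.iter("TSeq"):
    # Each attribute of the record is a named child element.
    records.append({
        "accession": tseq.findtext("TSeq_accver"),
        "definition": tseq.findtext("TSeq_defline"),
        "length": int(tseq.findtext("TSeq_length")),
        "sequence": tseq.findtext("TSeq_sequence"),
    })

for record in records:
    # The declared length can be checked against the sequence itself.
    print(record["accession"], record["length"] == len(record["sequence"]))
```

Every value is reached by name rather than by position in a line, which is what makes XML records easier to extend than flat-file formats.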
27 Links