Biological databases: Collection, storage and maintenance. A biological database is a collection of data that is structured, searchable, periodically updated, and cross-referenced.
Heterogeneous content: complex data types (text-based sequences, blobs, images of cells and tissue, 3-D molecular structures, biochemical pathways, model data, scalar and vector fields); hierarchical data organization; dynamic nature; accessibility; quality.
The first database was of proteins: the Atlas of Protein Sequence and Structure (1965), edited by Margaret Dayhoff. It contained the protein sequences that had been published at that time (the foundation of PIR). The first nucleotide sequence data was yeast tRNA, with 77 bases. The first protein structure database was constructed in 1972 with 10 entries. The first genome database was published in 1995 with the genome of Haemophilus influenzae.
~100 GB
162,886,727 loci, 150,141,354,858 bases, from 162,886,727 sequences as of 15 Feb 2013
Categories of Databases: Data Type (data heterogeneity), Maintainer Status, Technical Design, Data Source, Data Access, and/or other parameters
1. Categories of Databases: Data Type. Taxonomy Database, Genome Database, Sequence Database, Structure Database, Proteomic Database, Microarray Database, Enzyme Database, Disease Database, Pathway Database, Literature Database… and many more
Nucleotide Databases: dbEST, PopSet, dbGSS, Probe, dbSNP, RefSeq, dbSTS, TPA, Nucleotide, Trace Archive, GenBank, UniGene, HomoloGene, UniSTS, MGC
Protein Databases: 3D Domains, PROW, Proteins, RefSeq, Protein Clusters. Structure Databases: Conserved Domains, Structure (MMDB), 3D Domains. Taxonomy Databases: Taxonomy. Genome Databases: Cancer Chromosomes, Genome Project, COGs, Genomes, Gene
Expression Databases: GEO Profiles, SAGE, GEO Datasets. Chemical Databases: PubChem BioAssay, PubChem Compound, PubChem Substance
2. Categories of Databases: Maintainer Status. NCBI (federal government agency of the USA) (http://www.ncbi.nlm.nih.gov/); EBI/EMBL (non-profit academic organization) (http://www.ebi.ac.uk/); SIB (quasi-academic non-profit foundation) (http://www.isb-sib.ch)
http://www.ncbi.nlm.nih.gov/
3. Categories of Databases: Technical Design. Flat file (information stored in text files); XML (Extensible Markup Language; a hierarchical, semi-structured model); Relational model (a highly structured model with tables of rows (tuples or records) and columns (fields), supported by an RDBMS such as Oracle or DB2 and queried with SQL); Object-oriented database management system; ASN.1 (Abstract Syntax Notation One)
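To make the difference between the flat-file and XML designs concrete, here is a minimal sketch in Python. The record layout (a field name, whitespace, a value, and a closing "//" line) and the element names are assumptions chosen for illustration, not the format of any particular database, and the sequence string is a placeholder.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat-file record: one "KEY   value" pair per line,
# with "//" marking the end of the record (layout assumed for illustration).
FLAT_RECORD = """\
ACCESSION   123
ORGANISM    E. coli
NAME        LexA protein
SEQUENCE    ATGCCGG
//
"""


def parse_flat_record(text):
    """Parse one flat-file record into a dict of field name -> value."""
    record = {}
    for line in text.splitlines():
        if line.strip() == "//":  # end-of-record marker
            break
        if not line.strip():      # skip blank lines
            continue
        key, value = line.split(None, 1)  # split at the first run of whitespace
        record[key.lower()] = value.strip()
    return record


def record_to_xml(record):
    """Serialize the same record as a small XML tree (hierarchical model)."""
    root = ET.Element("entry", accession=record["accession"])
    for key, value in record.items():
        if key != "accession":
            ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")


record = parse_flat_record(FLAT_RECORD)
xml_text = record_to_xml(record)
print(record)
print(xml_text)
```

The flat file needs a hand-written parser that knows the layout, whereas the XML form carries its own structure and can be read by any standard XML parser.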
This information is organised in tabular form, as is usually done in a relational DB. The number of columns (fields) in a real DB is much larger than in the table below. An index can be built on these fields, which allows very fast searching of a DB on one or a few fields simultaneously. The information in one DB can be cross-referenced to that in another DB. For instance, the DNA, protein and reference DBs have all been cross-referenced, so that moving between them is readily accomplished.

Accession No   Organism     Reference   Name                      Keywords                     Sequence
123            E. coli      Medline1    LexA protein              SOS regulon, repressor,…     ATGCCGG…
124            H. sapiens   Medline2    glucocorticoid receptor   transcriptional regulator    CCGATAAC
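The table, index and cross-reference described above can be sketched with Python's built-in sqlite3 module. The table names, column names and reference titles here are assumptions made for illustration, and the sequence strings are placeholders rather than real data.

```python
import sqlite3

# In-memory database; schema and names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reference (
    ref_id TEXT PRIMARY KEY,
    title  TEXT
);
CREATE TABLE sequence (
    accession INTEGER PRIMARY KEY,
    organism  TEXT,
    ref_id    TEXT REFERENCES reference(ref_id),  -- cross-reference to the literature table
    name      TEXT,
    keywords  TEXT,
    sequence  TEXT
);
-- Index on one field so that searches by organism avoid a full table scan.
CREATE INDEX idx_sequence_organism ON sequence (organism);
""")

conn.executemany(
    "INSERT INTO reference VALUES (?, ?)",
    [("Medline1", "placeholder title 1"), ("Medline2", "placeholder title 2")],
)
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?, ?, ?, ?, ?)",
    [
        (123, "E. coli", "Medline1", "LexA protein",
         "SOS regulon, repressor", "ATGCCGG"),
        (124, "H. sapiens", "Medline2", "glucocorticoid receptor",
         "transcriptional regulator", "CCGATAAC"),
    ],
)


def find_by_organism(organism):
    """Look up entries by organism and follow the cross-reference via a join."""
    return conn.execute(
        "SELECT s.accession, s.name, r.title "
        "FROM sequence AS s JOIN reference AS r ON r.ref_id = s.ref_id "
        "WHERE s.organism = ? ORDER BY s.accession",
        (organism,),
    ).fetchall()


rows = find_by_organism("E. coli")
print(rows)
```

The join is what "moving between" cross-referenced tables amounts to: the sequence row stores only the reference identifier, and the literature details live in their own table.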
Example of object-oriented DB
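In an object-oriented design, each entry is an object whose attributes can point directly at other objects instead of storing keys into other tables. A minimal sketch in Python, assuming hypothetical class and attribute names (this is an in-memory illustration of the data model, not a database management system):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Organism:
    name: str


@dataclass
class Reference:
    citation: str


@dataclass
class SequenceEntry:
    accession: int
    name: str
    organism: Organism  # direct object reference, not a foreign key
    references: List[Reference] = field(default_factory=list)
    sequence: str = ""

    def length(self) -> int:
        """Behaviour lives with the data: the entry reports its own length."""
        return len(self.sequence)


ecoli = Organism("E. coli")
lexa = SequenceEntry(123, "LexA protein", ecoli, [Reference("Medline1")], "ATGCCGG")
print(lexa.organism.name, lexa.length())
```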
Comparison

Flat File
  Advantages: fast data retrieval; simple structure; easy programming.
  Disadvantages: difficult to process multiple values; adding new data requires reprogramming; slow without the key.

Hierarchical
  Advantages: addition and deletion are easy; fast retrieval through higher-level records; multiple associations with like records.
  Disadvantages: pointers require large computer storage; the pointer path restricts access; each association requires repetitive data.

Relational
  Advantages: easy access; minimal training for users; flexible for unforeseen enquiries; easy modification; physical storage of data can be changed without affecting the relationships.
  Disadvantages: sequential access is slow; prone to logical mistakes; the method of storage impacts processing time; new relations require considerable processing.
Database / Data / Data format / Data type
GenBank, OMIM: DNA/RNA seq, phenotype, genotype; Text file/ASN.1; Text, Numeric; Text file
GDB, AceDB: Genetic map; Relational/MySQL; Object oriented
Medline, NCBI: Literature; Seq, str, literature; ASN.1; Text
PDB, BLAST, ClustalW, KEGG, Microarray: Structure; Seq, Analysis; Metabolic path; Microarray data; Oracle; Fasta; HTML text, binary; RDBMS, Excel; 3D Image; Images, text
4. Categories of Databases: Data Source
Type 1: Primary (from experimental sources): nucleic acid sequence, protein sequence, protein structure. Secondary (derived from existing primary databases): genomic (TIGR human gene index), proteomic (Prosite, CATH).
Type 2: Nucleic acids; Literature (PubMed); Biomacromolecules; Pathways.
DNA Sequence Database: National Center for Biotechnology Information (NCBI), http://www.ncbi.nlm.nih.gov; DNA Data Bank of Japan (DDBJ), http://www.ddbj.nig.ac.jp; European Molecular Biology Laboratory (EMBL), http://www.embl-heidelberg.de
Protein sequence Database
European Bioinformatics Institute, Swiss Institute of Bioinformatics, Georgetown University
International Nucleotide Sequence Database Collaboration (INSDC): exchange data on an hourly basis; mirroring; data backup
Protein structure Database http://www.rcsb.org/pdb/index.html
PDB
Secondary database
http://rebase.neb.com/rebase/rebase.html
5. Categories of Databases: Data Access. Publicly available; available with copyright; browsing but not downloadable; academic but not free; commercial access with payment
6. Categories of Databases: Others. Completeness, Curation (annotation), …
Entrez: databases of different kinds are merged together and become global hubs of knowledge.
1. Nucleotide Sequence Databases
2. RNA Sequence Databases
3. Protein Sequence Databases
4. Structure Databases
5. Genomics Databases (non-human)
6. Metabolic Enzymes and Pathways; Signaling Pathways
7. Human and other Vertebrate Genomes
8. Human Genes and Diseases
9. Microarray Data and other Gene Expression Databases
10. Proteomics Resources
11. Other Molecular Biology Databases
For a detailed list and full coverage see http://nar.oxfordjournals.org/content/41/D1.toc
NCBI resources: Databases, Online analysis tools
Entrez @ http://www.ncbi.nlm.nih.gov/
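Besides the web page, Entrez databases can also be searched programmatically through NCBI's E-utilities. The sketch below only builds an ESearch request URL and does not contact the server. The helper name build_esearch_url and the example search term are assumptions for illustration; db, term and retmax are standard ESearch parameters.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, term, retmax=20):
    """Build an ESearch URL (hypothetical helper; no network access here)."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Example: search the protein database for LexA entries from E. coli.
url = build_esearch_url("protein", "lexA AND Escherichia coli[Organism]", retmax=5)
print(url)
```

Opening the resulting URL in a browser, or fetching it with any HTTP client, returns the identifiers of matching records, which can then be retrieved in a second request.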
Sequence Retrieval System (http://srs.ebi.ac.uk)