Introduction to Databases
INTRODUCTION
DATA
Data are raw, unorganized facts that need to be processed. Example: each student's test score is one piece of data.
INFORMATION
When data are processed, organized, structured or presented in a given context so as to make them useful, the result is called information. Example: the average score of a class or of the entire school is information that can be derived from the given data.
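The step from data to information can be sketched in a few lines of Python. The class names and scores below are made-up values for illustration only; the averages are the "information" derived from the raw scores.

```python
from statistics import mean

# Hypothetical raw data: one test score per student, grouped by class.
scores_by_class = {
    "class_a": [72, 85, 90],
    "class_b": [60, 75],
}

def class_averages(scores):
    """Information per class: the average of that class's raw scores."""
    return {name: mean(values) for name, values in scores.items()}

def school_average(scores):
    """Information for the whole school: the average over every score."""
    all_scores = [s for values in scores.values() for s in values]
    return mean(all_scores)

print(class_averages(scores_by_class))
print(school_average(scores_by_class))
```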
DATABASE
A database is an organized collection of data that can be accessed in various ways. Biological databases serve a critical purpose in collecting and organizing data related to biological systems. They provide computational support and a user-friendly interface that let a researcher analyse biological data meaningfully.
A database is a computerized archive used to store and organize data in such a way that information can be retrieved easily via a variety of search criteria. Databases are composed of computer hardware and software for data management. The chief objective of the development of a database is to organize data in a set of structured records to enable easy retrieval of information. Each record, also called an entry, should contain a number of fields that hold the actual data items, for example, fields for names, phone numbers, addresses, dates.
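A minimal sketch of records, fields and retrieval by search criteria, assuming a tiny in-memory list instead of a real database system. The Record class, the search helper and all sample entries are hypothetical and only mirror the fields named above.

```python
from dataclasses import dataclass

@dataclass
class Record:
    # Each attribute is one field of the record (entry).
    name: str
    phone: str
    address: str
    date: str

def search(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        r for r in records
        if all(getattr(r, field) == value for field, value in criteria.items())
    ]

# Made-up entries for illustration.
records = [
    Record("Ada", "555-0101", "1 Main St", "2024-01-15"),
    Record("Ben", "555-0102", "2 Oak Ave", "2024-01-15"),
    Record("Ada", "555-0103", "3 Elm Rd", "2024-02-01"),
]

# Retrieval via different search criteria.
by_name = search(records, name="Ada")
by_name_and_date = search(records, name="Ada", date="2024-02-01")
```

The same records can be retrieved by any combination of fields, which is the point of storing data as structured entries.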
WHAT ARE BIOLOGICAL DATABASES?
Different classifications of databases
Type of data:
- nucleotide sequences
- protein sequences
- protein sequence patterns or motifs
- macromolecular 3D structure
- gene expression data
- metabolic pathways
Different classifications of databases (continued)
Primary or derived databases:
- Primary databases: experimental results are entered directly into the database
- Secondary databases: results of analysis of primary databases
- Aggregates of many databases:
  - links to other data items
  - combination of data
  - consolidation of data
Different classifications of databases (continued)
Availability:
- Publicly available, no restrictions
- Available, but with copyright
- Accessible, but not downloadable
- Academic, but not freely available
- Proprietary, commercial; possibly free for academics
TYPES OF DATABASES
- Primary databases
- Secondary databases
PRIMARY DATABASES
Contain biomolecular data in its original form.
Experimental results are submitted directly into the database by researchers, and the data are essentially archival in nature.
Once given a database accession number, the data in primary databases are never changed.
Examples: GenBank, EMBL and DDBJ for DNA/RNA sequences; SWISS-PROT and PIR for protein sequences; PDB for molecular structures.
GenBank
http://www.ncbi.nlm.nih.gov/genbank/
Database from NCBI that includes sequences from publicly available resources.
NCBI and Entrez
NCBI (National Center for Biotechnology Information) hosts one of the largest and most comprehensive collections of databases and belongs to the NIH (National Institutes of Health, USA).
Entrez is the search engine of NCBI.
Search for: genes, proteins, genomes, structures, diseases, publications and more.
http://www.ncbi.nlm.nih.gov/
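Entrez can also be queried from a program. The sketch below only builds a search URL and does not contact NCBI. It assumes the E-utilities ESearch endpoint, which is not mentioned on the slide; the helper name build_esearch_url and the example query are illustrative.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities ESearch endpoint for Entrez.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(db, term, retmax=20):
    """Build an Entrez search URL for one database and one query string."""
    params = {"db": db, "term": term, "retmax": retmax}
    return EUTILS_ESEARCH + "?" + urlencode(params)

# Example query (illustrative): protein entries matching a free-text term.
url = build_esearch_url("protein", "carbonic anhydrase human", retmax=5)
print(url)
```

Fetching the URL (for example with urllib.request) would return a list of matching record identifiers; that step is left out so the sketch runs without network access.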
GenBank
An annotated collection of all publicly available nucleotide sequences and their protein translations.
Set up in 1979 at LANL (Los Alamos).
Maintained since 1992 by NCBI (Bethesda).
GenBank file format
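A GenBank flat file is plain text: keywords such as LOCUS, DEFINITION and ACCESSION start each field, the sequence follows ORIGIN, and "//" ends the record. The sketch below reads those parts from a made-up record. The record content is hypothetical, and the parser assumes the keyword occupies the first 12 columns and ignores the FEATURES table entirely.

```python
# Hypothetical record for illustration only (not a real GenBank entry).
SAMPLE = """\
LOCUS       EXAMPLE01                 24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example record for illustration only.
ACCESSION   EXAMPLE01
SOURCE      synthetic construct
  ORGANISM  synthetic construct
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""

def parse_genbank(text):
    """Collect keyword fields and the sequence from one flat-file record."""
    fields, sequence = {}, []
    current, in_origin = None, False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if in_origin:                      # sequence lines: drop numbers and spaces
            sequence.append("".join(ch for ch in line if ch.isalpha()))
            continue
        keyword, value = line[:12].strip(), line[12:].strip()
        if keyword == "ORIGIN":
            in_origin = True
        elif keyword:                      # new keyword or sub-keyword
            current = keyword
            fields[current] = value
        elif current:                      # continuation of the previous field
            fields[current] += " " + value
    fields["SEQUENCE"] = "".join(sequence)
    return fields

record = parse_genbank(SAMPLE)
```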
EMBL
European Molecular Biology Laboratory
http://www.ebi.ac.uk/
Nucleic acid database from EBI (European Bioinformatics Institute)
Produced in collaboration with DDBJ and GenBank
Search engine: SRS (Sequence Retrieval System)
DDBJ
DNA Data Bank of Japan
http://www.ddbj.nig.ac.jp/
Started in 1986 in collaboration with GenBank
Produced and maintained at NIG (National Institute of Genetics)
SWISS-PROT
http://www.ebi.ac.uk/uniprot/
Annotated sequence database established in 1986
Consists of sequence entries built from different line types, each with its own format
Similar format to EMBL
http://us.expasy.org/sprot/sprot-top.html ...
PIR
Protein Information Resource
http://pir.georgetown.edu/
A division of the National Biomedical Research Foundation (NBRF) in the U.S.
One can search for entries or run a sequence similarity search at the PIR site.
TrEMBL
Translated European Molecular Biology Laboratory
http://www.ebi.ac.uk/trembl/
Computer-annotated supplement of SWISS-PROT.
Contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT.
Protein Data Bank (PDB)
Important in solving real problems in molecular biology
Established in 1971 at Brookhaven National Laboratory (BNL)
Sole international repository of macromolecular structure data
Moved to the Research Collaboratory for Structural Bioinformatics (RCSB)
http://www.rcsb.org/
PDB: example
HEADER    LYASE(OXO-ACID)  01-OCT-91  12CA                                 12CA 2
COMPND    CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II)         12CA 3
SOURCE    HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN                         12CA 5
AUTHOR    S.K.NAIR,D.W.CHRISTIANSON                                        12CA 6
REVDAT   1   15-OCT-92 12CA    0                                           12CA 7
JRNL        AUTH   S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE      12CA 8
JRNL        TITL   ALTERING THE MOUTH OF A HYDROPHOBIC POCKET.             12CA 9
JRNL        TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE      12CA 10
JRNL        TITL 3 /II$ MUTANTS AT RESIDUE VAL-121                         12CA 11
JRNL        REF    J.BIOL.CHEM.  V. 266 17320 1991                         12CA 12
JRNL        REFN   ASTM JBCHA3  US ISSN 0021-9258  071                     12CA 13
REMARK   1                                                                 12CA 14
REMARK   3   AUTHORS HENDRICKSON,KONNERT                                   12CA 20
REMARK   3   R VALUE 0.170                                                 12CA 21
REMARK   3   RMSD BOND DISTANCES 0.011 ANGSTROMS                           12CA 22
REMARK   3   RMSD BOND ANGLES 1.3 DEGREES                                  12CA 23
REMARK   4                                                                 12CA 24
REMARK   4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL          12CA 25
REMARK   4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND,       12CA 26
REMARK   4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES.      12CA 27
...
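Each line of a PDB entry begins with a record name (HEADER, COMPND, AUTHOR, REMARK and so on) in its first six columns. A minimal sketch of grouping lines by record name is given below, using a few lines adapted from the entry above. The spacing of the sample lines and the helper name group_by_record are assumptions for illustration; this is not a full PDB parser.

```python
from collections import defaultdict

# A few lines adapted from entry 12CA; column spacing is approximate.
PDB_LINES = [
    "HEADER    LYASE(OXO-ACID)                         01-OCT-91   12CA",
    "COMPND    CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II)",
    "SOURCE    HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN",
    "AUTHOR    S.K.NAIR,D.W.CHRISTIANSON",
    "REMARK   3   R VALUE            0.170",
    "REMARK   3   RMSD BOND ANGLES   1.3 DEGREES",
]

def group_by_record(lines):
    """Map each record name (columns 1-6) to the text of its lines."""
    records = defaultdict(list)
    for line in lines:
        name = line[:6].strip()
        records[name].append(line[6:].strip())
    return dict(records)

grouped = group_by_record(PDB_LINES)
print(grouped["AUTHOR"])
print(len(grouped["REMARK"]))
```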
COMPOSITE DATABASES
Collections of sequences drawn from various primary databases
Make sequence searching highly efficient, because a single search covers multiple resources
Examples: NRDB (Non-Redundant Database), OWL, MIPSX, SWISS-PROT + TrEMBL
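The idea of a non-redundant composite can be sketched as merging entries from several sources and keeping each distinct sequence once. The source names, accessions and sequences below are made up, and real composite databases apply their own, more elaborate redundancy criteria.

```python
def merge_non_redundant(*sources):
    """Merge (source_name, entries) pairs, keeping one entry per sequence.

    Returns a dict mapping each distinct sequence to the list of
    'source:accession' labels under which it was found.
    """
    merged = {}
    for source_name, entries in sources:
        for accession, sequence in entries:
            key = sequence.upper()  # treat case variants as identical
            merged.setdefault(key, []).append(f"{source_name}:{accession}")
    return merged

# Hypothetical primary sources with one shared sequence.
db_a = ("db_a", [("A1", "MKT"), ("A2", "GGH")])
db_b = ("db_b", [("B1", "mkt"), ("B2", "PLV")])

composite = merge_non_redundant(db_a, db_b)
```

A search over the composite touches three sequences instead of four, while the labels still record every source in which each sequence appears.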
SECONDARY DATABASES
Contain data derived from the results of analysing primary data
Manually created or automatically generated
Contain more relevant and useful information, structured to specific requirements
Examples: PROSITE, PRINTS, BLOCKS, Pfam
PROSITE
Database of protein families
Members of a family exhibit characteristic sequence patterns
Patterns can be searched with regular expressions, similar to those used by Unix commands
So a search can cover whole families
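The link between PROSITE patterns and regular expressions can be sketched by translating a pattern into Python regex syntax. The converter below is a simplified assumption: it handles x, [..], {..}, (n) or (n,m), and the < and > anchors, but not every corner of the PROSITE syntax (for example ">" inside square brackets). The pattern N-{P}-[ST]-{P} is the N-glycosylation site pattern; the test sequence is made up.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P}  -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(2) -> x{2}
        element = element.replace("x", ".")                     # x    -> any residue
        element = element.replace("<", "^").replace(">", "$")   # terminal anchors
        parts.append(element)
    return "".join(parts)

regex = prosite_to_regex("N-{P}-[ST]-{P}")
match = re.search(regex, "MKNGSA")  # made-up sequence
print(regex, match.group(0), match.start())
```

Once a family is described by such a pattern, one compiled expression can be run over every sequence in a database to find candidate members.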
BLOCKS Motifs/blocks are created by automatically detecting the most conserved regions of each protein family.
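As a toy illustration of detecting conserved regions, the sketch below scans an already aligned set of sequences and reports runs of columns where every sequence has the same residue. This is not the actual BLOCKS procedure; the alignment, the function name conserved_blocks and the identity-only criterion are assumptions made for illustration.

```python
def conserved_blocks(alignment, min_length=3):
    """Return (start, end) column ranges that are identical in all sequences."""
    length = len(alignment[0])
    # A column is conserved if all sequences share one residue and it is not a gap.
    conserved = [
        len({seq[i] for seq in alignment}) == 1 and alignment[0][i] != "-"
        for i in range(length)
    ]
    blocks, start = [], None
    # The trailing False closes a run that reaches the last column.
    for i, flag in enumerate(conserved + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_length:
                blocks.append((start, i))
            start = None
    return blocks

# Made-up aligned sequences of equal length.
alignment = [
    "MKACDEFGLL",
    "MRACDEFGVL",
    "MHACDEFG-L",
]
blocks = conserved_blocks(alignment)
```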
PRIMARY VS SECONDARY DATABASES