Biological databases: Collection, storage and maintenance

Biological databases: Collection, storage and maintenance. A biological database is a collection of data that is structured, searchable, periodically updated, and cross-referenced.

Biological databases: Collection, storage and maintenance. Heterogeneous content: complex data types (text-based sequences, blobs, images of cells and tissue, 3-D molecular structures, biochemical pathways, model data, scalar and vector fields). Hierarchical data organization. Dynamic nature. Accessibility. Quality.

The first database was of proteins: the Atlas of Protein Sequence and Structure (1965), edited by Margaret Dayhoff. It contained the protein sequences that had been published at that time and became the foundation of PIR. The first nucleotide sequence recorded was a yeast tRNA of 77 bases. A protein structure database with 10 entries was first constructed in 1972. The first complete genome of a free-living organism, that of Haemophilus influenzae, was published in 1995.

~100 GB

162,886,727 loci, 150,141,354,858 bases, from 162,886,727 sequences as of 15 Feb 2013

Categories of Databases: data type (data heterogeneity); maintainer status; technical design; data source; data access; and/or other parameters.

1. Categories of Databases: Data Type. Taxonomy database, genome database, sequence database, structure database, proteomic database, microarray database, enzyme database, disease database, pathway database, literature database, and many more.

Nucleotide Databases: dbEST, PopSet, dbGSS, Probe, dbSNP, RefSeq, dbSTS, TPA, Nucleotide, Trace Archive, GenBank, UniGene, HomoloGene, UniSTS, MGC.

Protein Databases: 3D Domains, PROW, Proteins, RefSeq, Protein Clusters. Structure Databases: Conserved Domains, Structure (MMDB), 3D Domains. Taxonomy Databases: Taxonomy. Genome Databases: Cancer Chromosomes, Genome Project, COGs, Genomes, Gene.

Expression Databases: GEO Profiles, SAGE, GEO Datasets. Chemical Databases: PubChem BioAssay, PubChem Compound, PubChem Substance.

2. Categories of Databases: Maintainer Status. NCBI (federal government agency of the USA) (http://www.ncbi.nlm.nih.gov/); EBI/EMBL (non-profit academic organization) (http://www.ebi.ac.uk/); SIB (quasi-academic non-profit foundation) (http://www.isb-sib.ch)

http://www.ncbi.nlm.nih.gov/

3. Categories of Databases: Technical Design. Flat file (information stored in text files); XML (Extensible Markup Language; a hierarchical, semi-structured model); relational model (a highly structured model with tables made of rows, also called tuples or records, and columns, also called fields; queried with SQL and supported by RDBMSs such as Oracle and DB2); object-oriented database management system; ASN.1 (Abstract Syntax Notation One).
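To make the first three designs concrete, here is a minimal sketch that stores one made-up record as a flat-file entry, as XML, and as a row in a relational table, and reads it back each time. The record, the two-letter field tags (AC, OS, DE, SQ), the XML element names, and the table name `entry` are all assumptions chosen for illustration; they are not the actual formats of any particular database.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical record, used only for illustration.
record = {"accession": "123", "organism": "E. coli",
          "name": "LexA protein", "sequence": "ATGCCGG"}

# 1. Flat file: one tagged line per field, "//" closes the entry.
TAGS = {"accession": "AC", "organism": "OS", "name": "DE", "sequence": "SQ"}

def to_flat(rec):
    lines = [f"{TAGS[key]}   {value}" for key, value in rec.items()]
    return "\n".join(lines + ["//"])

def from_flat(text):
    by_tag = {tag: key for key, tag in TAGS.items()}
    rec = {}
    for line in text.splitlines():
        if line == "//":
            break
        # Two-letter tag, three spaces, then the value.
        rec[by_tag[line[:2]]] = line[5:]
    return rec

# 2. XML: the same fields as a small element tree.
def to_xml(rec):
    entry = ET.Element("entry", accession=rec["accession"])
    for key in ("organism", "name", "sequence"):
        ET.SubElement(entry, key).text = rec[key]
    return ET.tostring(entry, encoding="unicode")

def from_xml(text):
    entry = ET.fromstring(text)
    rec = {"accession": entry.get("accession")}
    for child in entry:
        rec[child.tag] = child.text
    return rec

# 3. Relational: one row in a table with typed, named columns.
COLUMNS = ("accession", "organism", "name", "sequence")

def to_relational(rec):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entry (accession TEXT PRIMARY KEY, "
                 "organism TEXT, name TEXT, sequence TEXT)")
    conn.execute("INSERT INTO entry VALUES "
                 "(:accession, :organism, :name, :sequence)", rec)
    return conn

def from_relational(conn, accession):
    row = conn.execute("SELECT accession, organism, name, sequence "
                       "FROM entry WHERE accession = ?",
                       (accession,)).fetchone()
    return dict(zip(COLUMNS, row))

flat_text = to_flat(record)
xml_text = to_xml(record)
conn = to_relational(record)
```

The flat file needs a hand-written parser, the XML carries its own hierarchy, and the relational table is queried declaratively; that difference is what the design categories are meant to capture.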

In a relational DB this information is organised in tabular form. The number of columns (fields) in a real DB is much larger than in the table below. These fields can be indexed, which allows very fast searches of the DB on one or several fields simultaneously. The information in one DB can also be cross-referenced to that in another DB. For instance, DNA, protein and reference DBs have all been cross-referenced so that moving between them is readily accomplished.

Accession No   Organism     Reference     Name                      Keywords                     Sequence
123            E. coli      Medline1, …   LexA protein              SOS regulon, repressor, …    ATGCCGG…
124            H. sapiens   Medline2, …   glucocorticoid receptor   transcriptional regulator    CCGATAAC
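The indexing and cross-referencing described above can be sketched with Python's built-in `sqlite3` module. The schema, the index name, and the reference titles below are assumptions made for illustration; only the two example rows come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reference (ref_id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE entry (
    accession TEXT PRIMARY KEY,
    organism  TEXT,
    name      TEXT,
    keywords  TEXT,
    sequence  TEXT,
    ref_id    TEXT REFERENCES reference(ref_id)  -- cross-reference
);
-- Index over two fields, so a search on organism and name
-- does not have to scan every row.
CREATE INDEX idx_entry_org_name ON entry (organism, name);
""")

# Placeholder titles: the slide gives only the labels Medline1 / Medline2.
conn.executemany("INSERT INTO reference VALUES (?, ?)", [
    ("Medline1", "placeholder title 1"),
    ("Medline2", "placeholder title 2"),
])
conn.executemany("INSERT INTO entry VALUES (?, ?, ?, ?, ?, ?)", [
    ("123", "E. coli", "LexA protein",
     "SOS regulon; repressor", "ATGCCGG", "Medline1"),
    ("124", "H. sapiens", "glucocorticoid receptor",
     "transcriptional regulator", "CCGATAAC", "Medline2"),
])

QUERY = "SELECT accession FROM entry WHERE organism = ? AND name = ?"

def find(organism, name):
    """Search on two fields at once."""
    return [row[0] for row in conn.execute(QUERY, (organism, name))]

def reference_for(accession):
    """Follow the cross-reference from a sequence entry to its citation."""
    return conn.execute(
        "SELECT r.ref_id, r.title FROM entry e "
        "JOIN reference r ON r.ref_id = e.ref_id "
        "WHERE e.accession = ?", (accession,)).fetchone()

def uses_index(organism, name):
    """Ask SQLite whether its query plan mentions the two-field index."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + QUERY,
                        (organism, name)).fetchall()
    return any("idx_entry_org_name" in row[-1] for row in plan)
```

For example, `find("E. coli", "LexA protein")` looks up the entry through the index, and `reference_for("124")` moves from the sequence table to the reference table in one join.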

Example of object-oriented DB

Comparison of database structures
Flat file. Advantages: fast data retrieval, simple structure, easy programming. Disadvantages: difficult to process multiple values, adding new data requires reprogramming, slow without the key.
Hierarchical. Advantages: easy addition and deletion, fast retrieval through higher-level records, multiple associations with like records. Disadvantages: pointers require large computer storage, the pointer path restricts access, each association requires repetitive data.
Relational. Advantages: easy access, minimal training for users, flexible for unforeseen enquiries, easy modification, physical storage of data can be changed without affecting the relationships. Disadvantages: sequential access is slow, prone to logical mistakes, the method of storage impacts processing time, new relations require considerable processing.

Database Data Data format Data type GenBank OMIM DNA/RNA seq, phenotype, genotype Text file/ASN.1 Text, Numeric Text file GDB AceDB Genetic map Relational/MySQL Object oriented Medline NCBI Literature Seq, str, literature ASN.1 Text PDB BLAST ClustalW KEGG Microarray Structure Seq, Analysis Metabolic path Microarray data Oracle Fasta HTML text, binary RDBMS, Excel 3D Image Images, text

4. Categories of Databases: Data Source. Type 1: primary (from experimental sources): nucleic acid sequence, protein sequence, protein structure; secondary (derived from existing primary databases): genomic (TIGR human gene index), proteomic (Prosite, CATH). Type 2: nucleic acids, literature (PubMed), biomacromolecules, pathways.
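The primary/secondary distinction can be illustrated with a small sketch: a secondary table of motif occurrences is derived entirely from sequences held in a primary store. The accessions and sequences below are made up, and the function name `build_secondary` is hypothetical. The pattern is N-{P}-[ST]-{P}, the N-glycosylation site written in PROSITE-style notation, translated here into a Python regular expression.

```python
import re

# Hypothetical "primary" entries: accession -> protein sequence (made up).
primary = {
    "P0001": "MKNGSAVLNPTQ",
    "P0002": "MAAALLKK",
    "P0003": "GNSTANVSG",
}

# N-{P}-[ST]-{P}: Asn, anything but Pro, Ser or Thr, anything but Pro.
# The lookahead lets overlapping occurrences be reported as well.
MOTIF = re.compile(r"(?=(N[^P][ST][^P]))")

def build_secondary(entries):
    """Derive (accession, 1-based position, matched text) records."""
    hits = []
    for accession, sequence in entries.items():
        for match in MOTIF.finditer(sequence):
            hits.append((accession, match.start() + 1, match.group(1)))
    return hits

secondary = build_secondary(primary)
```

Nothing in `secondary` comes from a new experiment; it is recomputed whenever the primary entries change, which is why secondary databases depend on the curation of the primary ones.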

DNA Sequence Database: National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov); DNA Data Bank of Japan (DDBJ) (http://www.ddbj.nig.ac.jp); European Molecular Biology Laboratory (EMBL) (http://www.embl-heidelberg.de)

Protein sequence Database

European Bioinformatics Institute, Swiss Institute of Bioinformatics, Georgetown University

International Nucleotide Sequence Database Collaboration (INSDC): the member databases exchange data on a daily basis, which provides mirroring and data backup.

Protein structure Database http://www.rcsb.org/pdb/index.html

PDB

Secondary database

http://rebase.neb.com/rebase/rebase.html

5. Categories of Databases: Data Access. Publicly available; available with copyright; browsable but not downloadable; academic but not free; commercial access with payment.

6. Categories of Databases: Others. Completeness, curation (annotation), …

Entrez: databases of different kinds are merged together and become a global hub of knowledge.
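One way to see Entrez acting as a hub is through NCBI's E-utilities web interface, where a search in one database and a link to another are both plain URLs. The sketch below only builds those URLs and sends no request; the helper names and the example UID are assumptions for illustration, and the parameter set shown is deliberately minimal.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(db, term, retmax=20):
    """URL for a text search within one Entrez database."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"

def elink_url(dbfrom, db, uid):
    """URL for following links from a record in one database to another."""
    params = {"dbfrom": dbfrom, "db": db, "id": uid}
    return f"{EUTILS}/elink.fcgi?{urlencode(params)}"

# Search the protein database, then ask for PubMed records linked to a hit.
# "12345" is a placeholder UID, not a real search result.
search = esearch_url("protein", "lexA", retmax=5)
links = elink_url("protein", "pubmed", "12345")
```

The second URL is the cross-database step: the same record identifier is used to move from sequence data to the literature without a separate search.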

1. Nucleotide Sequence Databases
2. RNA sequence databases
3. Protein sequence databases
4. Structure Databases
5. Genomics Databases (non-human)
6. Metabolic Enzymes and Pathways; Signaling Pathways
7. Human and other Vertebrate Genomes
8. Human Genes and Diseases
9. Microarray Data and other Gene Expression Databases
10. Proteomics Resources
11. Other Molecular Biology Databases

For a detailed list and full coverage see http://nar.oxfordjournals.org/content/41/D1.toc

NCBI resources: databases, online analysis tools

Entrez @ http://www.ncbi.nlm.nih.gov/

Sequence Retrieval System (http://srs.ebi.ac.uk)