Databases in Bioinformatics (Roald Forsberg)

Presentation transcript:

1 Databases in Bioinformatics (Roald Forsberg)

2 Overview
The role of databases in bioinformatics
The structure of databases
–Relational databases
–Database Management Systems
–Accessing databases
Types of databases
–Data types
–Integrated databases (Entrez)
Nucleotide sequence formats
–FASTA format
–GenBank format
–XML formats

3 Databases in Bioinformatics
Bioinformatics – attempted definition: “The application of computational techniques to understand and organise the information associated with biological macromolecules” (adapted from the Oxford English Dictionary)
[Figure labels: Biological experiments, Databases, Computational Biology]

4 Ask your neighbour
What would you like to do with a database?
Which types of biological information could be stored in a database?

5 Use of databases
Homology searching:
–Use of knowledge from other, often better-described organisms, such as the model organisms mouse, Drosophila, Fugu, C. elegans, etc.
–Sequence level – position, annotation
–Structural level – proteins, RNA
Evolutionary analyses:
–Phylogenetics
–Population genetics
–Molecular evolution of genetic elements
–Genome evolution
Primer design
Microarray design
Drug design
Many more…

6 General types of databases
Primary
–Raw and non-processed data
Secondary
–Curated – data selected according to criteria
–E.g. non-redundancy, fold
Tertiary
–Processed data
–E.g. HMM profiles
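The step from primary to secondary data can be illustrated with the simplest possible selection criterion, exact non-redundancy. This is only a sketch: the function name `non_redundant` and the toy records are made up for illustration, and real curated databases apply richer criteria than exact string identity.

```python
def non_redundant(records):
    """Keep the first record seen for each distinct sequence (case-insensitive)."""
    seen = set()
    kept = []
    for name, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            kept.append((name, seq))
    return kept


# Toy "primary" entries (hypothetical names and sequences).
primary = [("entry_a", "MEQKL"), ("entry_b", "meqkl"), ("entry_c", "MEQRT")]

# "Secondary" set: entry_b is dropped because it duplicates entry_a.
secondary = non_redundant(primary)
print(secondary)
```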

7 Structure of relational databases
[Figure: entries (sequences beginning MEQ…) stored in Table 1 and Table 2]
Table = genetic element
Field = position = chr. 4
Field = size = 3540 bp
Field = coding = true
Field = known EST = true
Field = known structure = false
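The table and fields on this slide could be sketched as follows with Python's built-in `sqlite3` module. The table name `genetic_element` and the column names are assumptions chosen to mirror the slide, not a schema used by any real database; true/false fields are stored as 1/0.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# One table = one kind of entry; one column = one field.
conn.execute(
    """
    CREATE TABLE genetic_element (
        id INTEGER PRIMARY KEY,
        position TEXT,
        size_bp INTEGER,
        coding INTEGER,
        known_est INTEGER,
        known_structure INTEGER
    )
    """
)

# The entry from the slide: chr. 4, 3540 bp, coding, known EST, no known structure.
conn.execute(
    "INSERT INTO genetic_element "
    "(position, size_bp, coding, known_est, known_structure) "
    "VALUES (?, ?, ?, ?, ?)",
    ("chr. 4", 3540, 1, 1, 0),
)

# Query: all coding elements.
rows = conn.execute(
    "SELECT position, size_bp FROM genetic_element WHERE coding = 1"
).fetchall()
print(rows)
conn.close()
```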

8 Structure of relational databases
[Figure: data flow between database files, the database management system and the web interface]
Database files
–Structure of data
Database management system
–DBMS software
–SQL (Structured Query Language)
–Queries to data
Interface (web)
–Input: terminal input scripts, browser input scripts (queries to DBMS)
–Output: terminal output, browser output, stored results, result files (results from DBMS)

9 Database management systems
A software package designed to store and manage databases
A computerized record-keeping system
Allows operations such as:
–Adding new files
–Inserting data into existing files
–Retrieving data from existing files
–Changing data
–Deleting data
–Removing existing files from the database
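The six operations listed above map roughly onto SQL statements. The sketch below uses Python's built-in `sqlite3` module and treats a "file" as a table, which is an interpretation of the slide rather than something it states; the table name `sequence` and its contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Adding a new "file" (here: a table).
cur.execute("CREATE TABLE sequence (name TEXT PRIMARY KEY, residues TEXT)")

# Inserting data into an existing table.
cur.executemany(
    "INSERT INTO sequence VALUES (?, ?)",
    [("seq1", "ACGT"), ("seq2", "GGCC")],
)

# Retrieving data.
after_insert = cur.execute(
    "SELECT name, residues FROM sequence ORDER BY name"
).fetchall()

# Changing data.
cur.execute("UPDATE sequence SET residues = ? WHERE name = ?", ("ACGTT", "seq1"))

# Deleting data.
cur.execute("DELETE FROM sequence WHERE name = ?", ("seq2",))
after_changes = cur.execute(
    "SELECT name, residues FROM sequence ORDER BY name"
).fetchall()

# Removing the table from the database.
cur.execute("DROP TABLE sequence")
remaining_tables = cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

conn.close()
print(after_insert, after_changes, remaining_tables)
```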

10 Accessing a database
Web – graphical user interface (GUI)
Web – automated procedures
–Batch search with script (Entrez)
–Search robots with updates
Local
–Buy a big computer and a thick cable
–Speed improvement
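A batch search against Entrez is typically scripted through NCBI's E-utilities. The sketch below only builds the `esearch.fcgi` request URLs and does not contact the server; the helper name `esearch_url` and the two search terms are illustrative assumptions, not part of the slides.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for one query (hypothetical helper)."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


# A small batch of example queries (illustrative terms only).
terms = ["BRCA1[Gene Name]", "TP53[Gene Name]"]
urls = [esearch_url("nucleotide", term) for term in terms]

for url in urls:
    print(url)
```

Each URL could then be fetched with `urllib.request.urlopen`, keeping the request rate within NCBI's usage guidelines.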

11 Protein sequence databases
Protein sequence (primary)
–SWISS-PROT
–PIR-International: www.mips.biochem.mpg.de/proj/protseqdb
Protein sequence (composite)
–OWL
–NRDB
Protein sequence (secondary)
–PROSITE
–PRINTS
–Pfam

12 Nucleotide sequence databases
GenBank: www.ncbi.nlm.nih.gov/Genbank
EMBL: www.ebi.ac.uk/embl
DDBJ: www.ddbj.nig.ac.jp

13 Types of nucleotide data
cDNA
–Reverse-transcribed from mRNA
Genomic sequences
–Sequenced directly from the DNA of various species
ESTs (expressed sequence tags)
–Short sequenced fragments of cDNA, each covering only a small portion of a gene
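As a small illustration of what "reverse-transcribed from mRNA" means at the sequence level, the sketch below computes the first-strand cDNA, i.e. the DNA strand complementary to the mRNA, written 5'→3'. The function name and the six-base input are made up; cDNA database entries are commonly reported in the mRNA sense instead, which simply replaces U with T.

```python
# DNA base that pairs with each RNA base.
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """Return the first-strand cDNA (5'->3') for an mRNA given 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(mrna.upper()))


mrna = "AUGGCC"                      # toy mRNA
cdna = first_strand_cdna(mrna)       # complementary strand
sense = mrna.replace("U", "T")       # same sense as the mRNA, as DNA
print(cdna, sense)
```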

14 Macromolecular structure databases
Protein Data Bank (PDB)
Nucleic Acids Database (NDB)
PDBsum: www.biochem.ucl.ac.uk/bsm/pdbsum
CATH
SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/
FSSP: www.embl-ebi.ac.uk/dali/fssp

15 Molecular interaction databases
General
–Biomolecular Interaction Network Database: http://bioinfo.mshri.on.ca/cgi-bin/bind/dataman
–Molecular Interactions Database (MINT)
Protein-protein interactions
–Database of Interacting Proteins
Biochemical pathways
–KEGG Metabolic Pathways: http://

16 Proteomics databases
Yeast Proteome Database
SWISS-2DPAGE
TMIG-2DPAGE

17 Genome databases
Entrez genomes
Ensembl genomes
HIV Sequence Database
FlyBase: http://flybase.bio.indiana.edu/
COGs: www.ncbi.nlm.nih.gov/COG

18 Integrated databases
Increasing the value of information
InterPro: www.ebi.ac.uk/interpro
Sequence retrieval system (SRS)
Entrez: www.ncbi.nlm.nih.gov/Entrez

19 Entrez
The (ever) Expanding Entrez System
[Figure: linked Entrez databases] Journals, UniGene, PubMed, Nucleotide, Protein, SNP, Genome, Books, ProbeSet, OMIM, CDD, Taxonomy, 3D Domains, UniSTS, PopSet, Structure

20 EBI services

21 The International Sequence Database Collaboration
[Figure: the three collaborating databases exchange submissions and updates]
GenBank – NCBI (NIH), accessed through Entrez
EMBL – EBI, accessed through SRS
DDBJ – CIB (NIG), accessed through getentry

22 A closer look at GenBank
Maintained by NCBI
Accessed through Entrez
Synchronized with DDBJ and EMBL

23 Sequence file formats
Ideally – a stringent, easy-to-parse, specified format to facilitate the dissemination of information
Reality – a plethora of coincidental and badly specified formats
Different levels of information
–Simple – sequence and name attribute
–Advanced – several attributes
Some common formats
–FASTA
–GenBank
–PHYLIP (PHYLIP package and others)
–Nexus (PAUP package, MacClade and others)
–Up and coming: XML

24 FASTA format
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTT
GLLLNGSYSENRTQIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKT
VLPVTIMAGLVFHSQKYNLRLRQAWCHFPSNWKGAWKEVKEEIVNLP
KERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCKMDWFLNYL
NNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLE
TISKKTYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESI
WAAELDRYKLVEITPIGFAPTEVRRYTGGHERQKRVPFVXXXXXXXX
XXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
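Because a FASTA record is just a ">" header line followed by sequence lines, a reader can be written in a few lines. The following is a minimal sketch, not a full parser: the function name `parse_fasta` and the two toy records are made up, and edge cases such as comment lines are ignored.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input (hypothetical records).
example = """>seq1 toy protein
MEQK
LRT
>seq2 toy DNA
ACGT
"""

records = parse_fasta(example)
print(records)
```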

25 GenBank format
A verbose but very informative format
Contains much information in a carefully specified format
Harder to parse than FASTA
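To give a feel for why GenBank flat files take more effort than FASTA, here is a deliberately minimal sketch that reads only the top-level keyword lines and the ORIGIN sequence block. The record is a made-up toy (accession `TOY0001` does not exist), and feature tables and continuation lines are simply skipped; an established library is the safer choice for real files.

```python
def parse_genbank(text):
    """Extract top-level keyword values and the sequence from one record."""
    record = {"sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if in_origin:
            # Sequence lines carry position numbers and spaces; keep letters only.
            record["sequence"] += "".join(ch for ch in line if ch.isalpha())
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line[:1].isalpha():
            # Keyword in the first column, value after the first space.
            keyword, _, value = line.partition(" ")
            record[keyword.lower()] = value.strip()
        # Indented lines (continuations, features) are ignored in this sketch.
    return record


# Made-up toy record for illustration only.
toy_record = """LOCUS       TOY0001                   12 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Made-up example record.
ACCESSION   TOY0001
ORIGIN
        1 acgtacgtac gt
//
"""

record = parse_genbank(toy_record)
print(record["accession"], record["sequence"])
```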

26 eXtensible Markup Language (XML)
Markup language for data representation – derived from SGML, a sibling of HTML
Stringent, simple language with rigid rules
Human-readable and versatile
Good parsers exist for multiple platforms
The ability to design your own Document Type Definitions (DTDs), which parsers can use to validate a document, permits complex data structures and grammars
Examples of use for sequence data:
–NCBI GBSeqXML
–NCBI TinySeqXML
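Reading sequence data from XML with a standard parser might look like the sketch below, which uses Python's built-in `xml.etree.ElementTree`. The element names are modeled on NCBI's TinySeq XML and should be treated as an assumption to check against the actual DTD; the document itself is a made-up toy, and no DTD validation is performed.

```python
import xml.etree.ElementTree as ET

# Toy document with TinySeq-style element names (assumed, not validated).
toy_xml = """<TSeqSet>
  <TSeq>
    <TSeq_seqtype value="nucleotide"/>
    <TSeq_defline>Made-up example sequence</TSeq_defline>
    <TSeq_length>8</TSeq_length>
    <TSeq_sequence>ACGTACGT</TSeq_sequence>
  </TSeq>
</TSeqSet>"""

root = ET.fromstring(toy_xml)

# Collect one dictionary per sequence element.
sequences = []
for tseq in root.iter("TSeq"):
    sequences.append(
        {
            "type": tseq.find("TSeq_seqtype").get("value"),
            "defline": tseq.findtext("TSeq_defline"),
            "length": int(tseq.findtext("TSeq_length")),
            "sequence": tseq.findtext("TSeq_sequence"),
        }
    )

print(sequences)
```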

27 Links