Introduction to Bioinformatics databases: Nucleic Acid Databases Neha Jain.


What is a Database? General: A database is any collection of related data. It is a computerized archive used to store and organize data in such a way that information can be retrieved easily. A database is a collection of interrelated data stored together without harmful or unnecessary redundancy (duplicate data) to serve multiple applications. Retrieving data is called firing a query.

DATABASE SYSTEM: A database system is an integrated collection of related files, along with the details of their definition, interpretation, manipulation and maintenance. A database system is built around the data, and it is run by software called a DBMS (Database Management System). A database system protects the data from unauthorized access. A database management system (DBMS) is a collection of programs that enables users to create and maintain a database.

What Does a DBMS Do? Database management systems provide several functions in addition to simple file management: allow concurrency; control security; maintain data integrity; provide for backup and recovery; control redundancy; allow data independence; provide a non-procedural query language; perform automatic query optimization. What is a relational database? A database that treats all of its data as a collection of relations.
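A minimal sketch of the relational idea, using Python's built-in sqlite3 module as the DBMS. The table name, column names and the three DEMO rows are made up for illustration and do not come from any real database.

```python
import sqlite3

# In-memory database; table, columns and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequence ("
    " accession TEXT PRIMARY KEY,"  # integrity: at most one row per accession
    " organism  TEXT NOT NULL,"
    " length    INTEGER NOT NULL)"
)
rows = [
    ("DEMO0001", "Homo sapiens", 1200),
    ("DEMO0002", "Mus musculus", 950),
    ("DEMO0003", "Homo sapiens", 300),
]
conn.executemany("INSERT INTO sequence VALUES (?, ?, ?)", rows)
conn.commit()

# Non-procedural query: we state WHAT we want, not HOW to find it.
# The DBMS decides how to execute it (query optimization).
human = conn.execute(
    "SELECT accession FROM sequence WHERE organism = ? ORDER BY length DESC",
    ("Homo sapiens",),
).fetchall()
print(human)
```

The relation here is the `sequence` table; the PRIMARY KEY and NOT NULL constraints are small examples of the integrity control listed above.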

Biological databases: why? The need for storing and communicating large datasets has grown. To make biological data available to scientists. To make biological data available in computer-readable form.

Different classifications of databases. Type of data: nucleotide sequences; protein sequences; protein sequence patterns or motifs; macromolecular 3D structure; gene expression data; metabolic pathways.

Different classifications of databases (cont.). Primary or derived databases. Primary databases: experimental results entered directly into the database. Secondary databases: results of analysis of primary databases. Aggregates of many databases: links to other data items; combination of data; consolidation of data.

Different classifications of databases (cont.). Availability: publicly available, no restrictions; available, but with copyright; accessible, but not downloadable; academic, but not freely available; proprietary, commercial, possibly free for academics.

NCBI and Entrez: NCBI hosts one of the largest and most comprehensive collections of databases and belongs to the NIH, the National Institutes of Health (USA). Entrez is the search engine of NCBI. Search for: genes, proteins, genomes, structures, diseases, publications and more.
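Entrez can also be queried programmatically through NCBI's E-utilities web service. The sketch below only builds an ESearch request URL and does not send it; the helper name `esearch_url` and the example search term are assumptions chosen for illustration.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch request URL.

    db     -- Entrez database name, e.g. "nucleotide", "protein", "pubmed"
    term   -- query text, using the same syntax as the Entrez search box
    retmax -- maximum number of record IDs to return
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)


# Illustrative query: the gene name is an arbitrary example.
url = esearch_url("nucleotide", "BRCA1[Gene Name] AND Homo sapiens[Organism]")
print(url)
```

Opening the printed URL in a browser (or fetching it with `urllib.request.urlopen`) should return a list of matching record IDs; NCBI asks users to keep request rates modest.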

Primary Databases: These databases contain the raw sequence data produced and submitted by researchers worldwide. Nucleic acid: EMBL, GenBank, DDBJ (DNA Data Bank of Japan). Protein: PIR, MIPS, SWISS-PROT, TrEMBL, NRL-3D.

Nucleotide sequence databases: EMBL, GenBank, and DDBJ are the three primary nucleotide sequence databases. Together they constitute the International Nucleotide Sequence Database Collaboration.

GenBank: An annotated collection of all publicly available nucleotide and protein sequences. Set up in 1979 at LANL (Los Alamos). Maintained since 1992 by NCBI (Bethesda).

GenBank file format
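The figure for this slide is not part of the transcript. As a rough illustration of the flat-file layout (keyword lines starting in the first column, indented continuation lines, a sequence block after ORIGIN, and a closing `//`), here is a simplified parser sketch. The record is entirely made up, and the parser ignores the FEATURES table and nested sub-keywords, so it is not a full GenBank reader; a library such as Biopython is the usual choice for real files.

```python
# Made-up record for illustration; not a real GenBank entry.
RECORD = """\
LOCUS       DEMO0001                  24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Made-up example record for illustration
            only.
ACCESSION   DEMO0001
ORIGIN
        1 atggccctga ctgaagcttg ataa
//
"""


def parse_genbank(text):
    """Return (header_fields, sequence) for one simplified flat-file record."""
    fields, seq, key, in_origin = {}, [], None, False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if in_origin:
            # Sequence lines mix position numbers, spaces and bases.
            seq.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line[:1].strip():             # keyword starts in the first column
            parts = line.split(None, 1)
            key = parts[0]
            fields[key] = parts[1].strip() if len(parts) > 1 else ""
        elif key is not None and line.strip():
            # Indented line: continuation of the previous keyword's value.
            fields[key] += " " + line.strip()
    return fields, "".join(seq).upper()


fields, sequence = parse_genbank(RECORD)
print(fields["ACCESSION"], len(sequence))
```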

EMBL Nucleotide Sequence Database: An annotated collection of all publicly available nucleotide and protein sequences. Created in 1980 at the European Molecular Biology Laboratory in Heidelberg. Maintained since 1994 by the EBI (Cambridge).

DDBJ (DNA Data Bank of Japan): An annotated collection of all publicly available nucleotide and protein sequences. Started in 1984 at the National Institute of Genetics (NIG) in Mishima. Still maintained at this institute by a team led by Takashi Gojobori.

Databases related to Genomics Contain information on genes, gene location (mapping), gene nomenclature and links to sequence databases; Exist for most organisms important for life science research; Examples: OMIM, GDB (human), MGD (mouse), FlyBase (Drosophila), SGD (yeast), MaizeDB (maize), SubtiList (B.subtilis), etc.

Other NCBI nucleic acid DBs. EST database: A collection of expressed sequence tags, or short, single-pass sequence reads from mRNA (cDNA). HomoloGene: A gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs. HTG database: A collection of high-throughput genome sequences from large-scale genome sequencing centers, including unfinished and finished sequences. SNPs database: A central repository for both single-base nucleotide substitutions and short deletion and insertion polymorphisms. RefSeq: A database of non-redundant reference sequence standards, including genomic DNA contigs, mRNAs, and proteins for known genes. Multiple collaborations, both within NCBI and with external groups, support data-gathering efforts.

Nucleic acid structure databases. NDB: Nucleic acid-containing structures. NTDB: Thermodynamic data for nucleic acids. RNABase: RNA-containing structures from PDB and NDB. SCOR: Structural classification of RNA (RNA motifs by structure, function and tertiary interactions).

Protein Sequence Databases: One of the first biological sequence databases was probably the book "Atlas of Protein Sequences and Structures" by Margaret Dayhoff and colleagues. It contained the protein sequences determined at the time, and new editions of the book followed. It became the foundation of the PIR (Protein Information Resource) database.

Protein Databases. SWISS-PROT: Annotated sequence database. TrEMBL: Database of translated EMBL nucleotide sequences. InterPro: Integrated resource for protein families, domains and functional sites. CluSTr: Offers an automatic classification of SWISS-PROT and TrEMBL. IPI: A non-redundant human proteome set constructed from SWISS-PROT, TrEMBL, Ensembl and RefSeq. GOA: Provides assignments of gene products to the Gene Ontology (GO) resource. Proteome Analysis: Statistical and comparative analysis of the predicted proteomes of fully sequenced organisms. Protein Profiles: Tables of SWISS-PROT and TrEMBL entries and alignments for the protein families of the Protein Profile. IntEnz: The Integrated relational Enzyme database (IntEnz) will contain enzyme data approved by the Nomenclature Committee.

Swiss-Prot: A protein sequence database which strives to provide a high level of annotation: the function of a protein; domain structure; post-translational modifications; variants. One entry for each protein. Complete, curated, non-redundant and cross-referenced with 34 other databases.

UniProt: The Universal Protein Resource (UniProt) is the world's most comprehensive catalog of information on proteins. It is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR. It offers BLAST searches, sequence alignment, sequence retrieval by identifier, and ID mapping from other databases such as GenBank, EMBL and DDBJ.

TrEMBL (Translation of EMBL): Created in 1996 as a computer-annotated supplement to SWISS-PROT, because it is impossible to keep up with the flow of data by manual annotation alone. Contains translations of all coding sequences (CDS) in EMBL. Has 2 main sections: 1. SP-TrEMBL: contains entries that will eventually be incorporated into SWISS-PROT, but that have not yet been manually annotated. 2. REM-TrEMBL: contains sequences that are not destined to be included in SWISS-PROT; these include immunoglobulins and T-cell receptors, synthetic and patented sequences, and codon translations that do not encode real proteins. TrEMBL contains everything that is not yet in SWISS-PROT.
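To make "translation of a coding sequence" concrete, here is a small sketch that translates a CDS with the standard genetic code. It is not TrEMBL's actual pipeline, which the slides do not describe; the function name and the input sequence are made up for illustration.

```python
# Standard genetic code, written in the conventional TCAG ordering:
# the amino acid for codon b1 b2 b3 sits at index 16*i + 4*j + k.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")    # accept RNA input as well
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[pos:pos + 3], "X")  # X = unknown codon
        if aa == "*":                      # stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)


# Made-up CDS: ATG GCC CTG ACT GAA GCT TGA TAA
print(translate("ATGGCCCTGACTGAAGCTTGATAA"))  # prints MALTEA
```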

Structure Databases. MSD: The Macromolecular Structure Database, a relational database representation of clean Protein Data Bank (PDB) data. 3DSeq: 3D sequence alignment server; annotation of the alignments between sequence databases and the PDB. FSSP: Based on exhaustive all-against-all 3D structure comparison of protein structures currently in the Protein Data Bank (PDB). DALI: Fold classification based on structure-structure assignments. 3Dee: Database of protein domain definitions wherein the domains have been clustered on sequence and structural similarity. NDB: Nucleic Acid Structure Database.

Protein Data Bank (PDB): Important in solving real problems in molecular biology. Established in 1971 at Brookhaven National Laboratory (BNL). Sole international repository of macromolecular structure data. Later moved to the Research Collaboratory for Structural Bioinformatics (RCSB).

PDB: example
HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2
COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3
COMPND 2 (E.C ) MUTANT WITH VAL 121 REPLACED BY ALA (/V121A) 12CA 4
SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5
AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6
REVDAT 1 15-OCT-92 12CA 0 12CA 7
JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8
JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9
JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10
JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL CA 11
JRNL REF J.BIOL.CHEM. V CA 12
JRNL REFN ASTM JBCHA3 US ISSN CA 13
REMARK 1 12CA 14
REMARK 2 12CA 15
REMARK 2 RESOLUTION. 2.4 ANGSTROMS. 12CA 16
REMARK 3 12CA 17
REMARK 3 REFINEMENT. 12CA 18
REMARK 3 PROGRAM PROLSQ 12CA 19
REMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20
REMARK 3 R VALUE CA 21
REMARK 3 RMSD BOND DISTANCES ANGSTROMS 12CA 22
REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23
REMARK 4 12CA 24
REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25
REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26
REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27
………

PDB (cont.)
SHEET 3 S10 PHE 66 PHE O ASN 67 N LEU 60 12CA 68
SHEET 4 S10 TYR 88 TRP O PHE 93 N VAL 68 12CA 69
SHEET 5 S10 ALA 116 ASN O HIS 119 N HIS 94 12CA 70
SHEET 6 S10 LEU 141 VAL O LEU 144 N LEU CA 71
SHEET 7 S10 VAL 207 LEU O ILE 210 N GLY CA 72
SHEET 8 S10 TYR 191 GLY O TRP 192 N VAL CA 73
SHEET 9 S10 LYS 257 ALA O LYS 257 N THR CA 74
SHEET 10 S10 LYS 39 TYR 40 1 O LYS 39 N ALA CA 75
TURN 1 T1 GLN 28 VAL 31 TYPE VIB (CIS-PRO 30) 12CA 76
TURN 2 T2 GLY 81 LEU 84 TYPE II(PRIME) (GLY 82) 12CA 77
TURN 3 T3 ALA 134 GLN 137 TYPE I (GLN 136) 12CA 78
TURN 4 T4 GLN 137 GLY 140 TYPE I (ASP 139) 12CA 79
TURN 5 T5 THR 200 LEU 203 TYPE VIA (CIS-PRO 202) 12CA 80
TURN 6 T6 GLY 233 GLU 236 TYPE II (GLY 235) 12CA 81
CRYST P CA 82
ORIGX CA 83
ORIGX CA 84
ORIGX CA 85
SCALE CA 86
SCALE CA 87
SCALE CA 88
ATOM 1 N TRP CA 89
ATOM 2 CA TRP CA 90
ATOM 3 C TRP CA 91
ATOM 4 O TRP CA 92
ATOM 5 CB TRP CA 93
ATOM 6 CG TRP CA 94
ATOM 7 CD1 TRP CA 95
ATOM 8 CD2 TRP CA 96
ATOM 9 NE1 TRP CA 97
ATOM 10 CE2 TRP CA 98
ATOM 11 CE3 TRP CA 99
ATOM 12 CZ2 TRP CA 100
ATOM 13 CZ3 TRP CA 101
ATOM 14 CH2 TRP CA 102
…….
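In the PDB flat-file format each line is a record whose type (HEADER, REMARK, SHEET, ATOM and so on) occupies the first six columns. The sketch below reads the record type that way and pulls the resolution out of a REMARK 2 line. The sample lines are adapted from the 12CA entry shown above, but their column spacing is approximate, and the helper names are assumptions for illustration.

```python
import re
from collections import Counter

# A few lines adapted from the example entry; spacing is approximate.
LINES = [
    "HEADER    LYASE(OXO-ACID)                         01-OCT-91   12CA",
    "AUTHOR    S.K.NAIR,D.W.CHRISTIANSON",
    "REMARK   2 RESOLUTION. 2.4  ANGSTROMS.",
    "REMARK   3   PROGRAM     PROLSQ",
    "TURN      1 T1  GLN    28  VAL    31     TYPE VIB (CIS-PRO 30)",
]


def record_type(line):
    """The record name lives in columns 1-6 of every PDB line."""
    return line[:6].strip()


def resolution(lines):
    """Return the resolution in angstroms from a REMARK line, or None."""
    for line in lines:
        if record_type(line) == "REMARK":
            m = re.search(r"RESOLUTION\.\s+([0-9.]+)\s+ANGSTROMS", line)
            if m:
                return float(m.group(1))
    return None


counts = Counter(record_type(line) for line in LINES)
print(counts["REMARK"], resolution(LINES))
```

Fixed-column slicing rather than whitespace splitting matters for coordinate records, where neighbouring fields can touch without a space between them.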

Databases related to Proteomics: Contain information obtained by 2D-PAGE: master images of the gels and descriptions of identified proteins. Examples: SWISS-2DPAGE (two-dimensional polyacrylamide gel electrophoresis database), ECO2DBASE, Maize-2DPAGE, Sub2D, Cyano2DBase, etc. Format: composed of image and text files. Most 2D-PAGE databases are “federated” and use SWISS-PROT as a master index. Mass Spectrometry (MS) database.

Munich Information Center for Protein Sequences (MIPS): A research centre hosted at the Institute for Bioinformatics (IBI) at Neuherberg, Germany. Its focus is the systematic analysis of genome information, including the development and application of bioinformatics methods in genome annotation, gene expression analysis and proteomics. MIPS supports and maintains a set of generic databases as well as the systematic comparative analysis of microbial, fungal, and plant genomes.

The Institute for Genomic Research (TIGR): Maintained by The Center for the Advancement of Genomics (TCAG). Its database is TDB, which provides a substantial suite of databases containing DNA and protein sequence, gene expression, cellular role, protein family information, and taxonomic data for microbes, plants and humans.

HOVERGEN (Homologous Vertebrate Genes Database): HOVERGEN is a database of homologous vertebrate genes. It allows one to select sets of homologous genes among vertebrate species, and to visualize multiple alignments and phylogenetic trees. Thus HOVERGEN is particularly useful for comparative sequence analysis, phylogeny and molecular evolution studies. Divided into 2 parts: 1. HOVERGEN contains the protein sequences. 2. HOVERGENDNA contains the associated nucleotide sequences. The database contains all vertebrate protein sequences from the UniProt Knowledgebase (Swiss-Prot and TrEMBL).

The Arabidopsis Information Resource (TAIR): TAIR maintains a database of genetic and molecular biology data for the model higher plant Arabidopsis thaliana. Data available from TAIR include the complete genome sequence along with gene structure, gene product information, metabolism, gene expression, DNA and seed stocks, genome maps, genetic and physical markers, publications, and information about the Arabidopsis research community. It is kept up to date, with updates every 2 weeks.

PlasmoDB: a functional genomic database for malaria parasites. PlasmoDB is a functional genomic database for Plasmodium spp. that provides a resource for data analysis and visualization on a gene-by-gene or genome-wide scale. The latest release, PlasmoDB 5.5, contains numerous new data types from several broad categories: annotated genomes, evidence of transcription, proteomics evidence, protein function evidence, population biology and evolution.

ECDC (European Centre for Disease Prevention and Control): The European Centre for Disease Prevention and Control (ECDC) is an EU agency aimed at strengthening Europe's defences against infectious diseases. ECDC publishes scientific and technical reports on various issues related to communicable disease prevention and control, including comprehensive reports from key technical and scientific meetings.

Other Databases. KEGG (Kyoto Encyclopedia of Genes and Genomes): for pathways. GeneCards: A database of human genes, their products and their involvement in diseases. It is a secondary database that links to many other databases: an all-in-one database of human genes (a project by the Weizmann Institute) that attempts to integrate as many databases, publications and as much available knowledge as possible. There are also many databases available for microarray, SAGE, EST and SNP data.

FASTA Format: a popular and commonly used format.
>Seq1
ALVLRARLATGPATGCTRTARARLATGALVLRARLATGPARARLATGPATGCTRTARA
RLATGALVLRARRLATGPATGCTRRLATGPATGCTRRARLATGPATGCTRTARARLAT
GALVLRAR
>Seq2
TGCTRTARARLATGALVLRARLATGPARARALVLRARLATGPATGCTRTARATGALVL
RARLATGPARARALVLRARLATG
>Seq3
……..
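A FASTA file is simple enough to read by hand: a line starting with `>` opens a new record, and every following line up to the next `>` belongs to that record's sequence. The sketch below parses the two complete records from the slide; the function name `parse_fasta` is an assumption, and real projects often use a library parser instead.

```python
FASTA_TEXT = """\
> Seq1
ALVLRARLATGPATGCTRTARARLATGALVLRARLATGPARARLATGPATGCTRTARA
RLATGALVLRARRLATGPATGCTRRLATGPATGCTRRARLATGPATGCTRTARARLAT
GALVLRAR
>Seq2
TGCTRTARARLATGALVLRARLATGPARARALVLRARLATGPATGCTRTARATGALVL
RARLATGPARARALVLRARLATG
"""


def parse_fasta(text):
    """Return {header: sequence}, joining wrapped sequence lines."""
    records, header, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                       # skip blank lines
        if line.startswith(">"):
            if header is not None:         # close the previous record
                records[header] = "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:                 # close the last record
        records[header] = "".join(chunks)
    return records


records = parse_fasta(FASTA_TEXT)
for name, seq in records.items():
    print(name, len(seq))
```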

Identifiers and Accession numbers. Identifier: a string of letters and digits that is generally “understandable”. Example: TPIS_CHICK (triose phosphate isomerase from chicken, Gallus gallus) in Swiss-Prot. The identifier can change (at the curator's discretion). Accession code: a string of letters and digits that uniquely identifies an entry in its database. The accession number for TPIS_CHICK in Swiss-Prot is P00940. An accession number should not change!
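The difference between the two can be checked mechanically. The sketch below splits a Swiss-Prot style identifier into its protein and species parts and tests whether a string has the shape of a UniProt accession number. The regular expression follows the accession format described in UniProt's help pages but is reproduced here from memory, so treat it as an assumption; the function names are made up for illustration.

```python
import re

# Assumed shape of a UniProt accession (6 or 10 characters).
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def looks_like_accession(text):
    """True if the whole string has the shape of a UniProt accession."""
    return ACCESSION_RE.fullmatch(text) is not None


def split_entry_name(name):
    """Split an identifier such as TPIS_CHICK into (protein, species)."""
    protein, _, species = name.partition("_")
    return protein, species


print(looks_like_accession("P00940"))      # accession number
print(looks_like_accession("TPIS_CHICK"))  # identifier, not an accession
print(split_entry_name("TPIS_CHICK"))
```

Because identifiers can change while accession numbers should not, scripts and cross-references are safer when they store the accession.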

Google Scholar

Exercise: Retrieve all publications in which the first author is Mayrose I and the last author is Pupko T.

The MOST important of all: 1. Google (or any search engine)

And always remember: 2. RTM (Read the manual)!

Help! Read the Help section. Read the FAQ section. Google the question!