Biological databases Nicky Mulder:



What is a database?
An organized body of related information. A data collection that is:
- Structured (computer readable)
- Searchable
- Updatable
- Cross-linked
- Publicly available

Biological Databases
- Make data available to the public
- So much data is available that it needs ordering
- Turn data into computer-readable form
- Provide the ability to retrieve data from various sources
- Can be primary (archival) or secondary (curated) databases
- The most commonly used are sequence databases

Biological systems
- Taxonomic data
- Literature
- Protein folding and 3D structure
- Small molecules
- Pathways and networks
- Protein families and domains
- Whole genome data
- Sequence data
- Ontologies (GO)

Sequence databases
- Used for retrieving a known gene/protein sequence
- Useful for finding information on a gene/protein
- Can find out how many genes are available for a given organism
- Can compare your sequence to the others in the database
- Can submit your sequence to store with the rest
- Main databases: nucleotide and protein sequence DBs

Requirements for a good sequence database
- It must be complete with minimal redundancy
- It must contain as much up-to-date information (annotation) as possible on each sequence
- All the information items must be retrievable by computer programs in a consistent manner
- It must be highly interoperable with other databases

Nucleotide sequence databases
- EMBL, DDBJ, GenBank
- Data submitted by the sequence owner
- Submitter must provide certain information, and the CDS if applicable
- No additional annotation added
- Entries never merged, so there is some redundancy
Figure labels: Promoter, Exons, CDS (coding sequence)

Example EMBL entry 1: general info
ID   AB standard; genomic DNA; MAM; 6116 BP.
AC   AB083336;
XX
SV   AB
DT   06-JAN-2005 (Rel. 82, Created)
DT   06-JAN-2005 (Rel. 82, Last updated, Version 1)
DE   Sus scrofa p27Kip1 gene for p27Kip1, p27Kip1R, complete cds, alternative
DE   splicing.
OS   Sus scrofa (pig)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Cetartiodactyla; Suina; Suidae; Sus.
RN   [1]
RP
RA   Hirano K., Shintani Y., Hirano M., Kanaide H.;
RT   ;
RL   Submitted (08-APR-2002) to the EMBL/GenBank/DDBJ databases.
RL   Katsuya Hirano, Graduate School of Medical Sciences, Kyushu University,
RL   Division of Molecular Cardiology, Research Institute of Angiocardiology;
RL   Maidashi, Higashi-ku, Fukuoka, Fukuoka, , Japan
RL   Tel: ,
RL   Fax: )
RN   [2]
RA   Shintani Y., Hirano K., Hirano M., Nishimura J., Nakano H., Kanaide H.;
RT   "Cloning and Charaterization of full sequence of porcine p27Kip1 gene and
RT   expression of splice isoform p27Kip1R";
RL   Unpublished.
Slide labels: References; Description of gene; Accession number
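The two-letter line codes (ID, AC, DE, OS, and so on) are what make an entry like this computer readable. As a rough illustration, the Python sketch below groups the lines of a flat-file entry by their code. The function name `group_by_line_code` and the shortened sample text are assumptions made for this example; it is a minimal sketch, not the parser used by EMBL.

```python
from collections import defaultdict


def group_by_line_code(entry_text):
    """Group the lines of an EMBL-style flat-file entry by two-letter line code."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        code = line[:2]
        # Skip blank lines, indented sequence lines and 'XX' spacer lines.
        if not code.strip() or code == "XX":
            continue
        fields[code].append(line[2:].strip())
    return dict(fields)


# Shortened sample taken from the example entry (illustration only).
SAMPLE = """\
AC   AB083336;
XX
DE   Sus scrofa p27Kip1 gene for p27Kip1, p27Kip1R, complete cds, alternative
DE   splicing.
OS   Sus scrofa (pig)
"""

fields = group_by_line_code(SAMPLE)
accession = fields["AC"][0].rstrip(";")
description = " ".join(fields["DE"])
print(accession)
print(description)
```

Multi-line fields such as DE are kept as a list of lines and joined only when needed, so the original line breaks are not lost.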

Example EMBL entry 2: features on the sequence - CDS
FH   Key             Location/Qualifiers
FT   source
FT                   /db_xref="taxon:9823"
FT                   /mol_type="genomic DNA"
FT                   /organism="Sus scrofa"
FT                   /cell_type="liver"
FT                   /clone_lib="lambda Fix II porcine genomic DNA"
FT   exon
FT                   /evidence=NOT_EXPERIMENTAL
FT                   /note="The residue 2591 corresponds to the transcription
FT                   initiation site determined in human gene"
FT   CDS             join( , , )
FT                   /codon_start=1
FT                   /gene="p27Kip1"
FT                   /product="p27Kip1R"
FT                   /protein_id="BAD "
FT                   /translation="MSNVRVSNGSPSLERMDARQAEYPKPSACRNLFGPVNHEELTRDL
FT                   EKHCRDMEEASQRKWNFDFQNHKPLEGKYEWQEVEKGSLPEFYYRPPRPPKGACKVPAQ
FT                   EGQGVSGTRQAVPLIGSQANSEDTHLVDQKTDAPDSQTGLAEQCTGIRKRPATDDSSPP
FT                   SVSLKIGMYQLNYSSVW"
Slide labels: Corresponding protein sequence; Feature type and location; Feature name and information

Example EMBL entry 3: features on the sequence - introns and exons
FT   intron
FT                   /cons_splice=(5'site:NO,3'site:NO)
FT   exon
FT                   /number=2
FT   intron
FT                   /cons_splice=(5'site:NO,3'site:NO)
FT   exon
FT                   /note="ending at a putative poly A site following a polyA
FT                   signal"
FT                   /number=3
FT   polyA_signal
XX
SQ   Sequence 6116 BP; 1583 A; 1392 C; 1438 G; 1703 T; 0 other;
     gcggccgcga gctcaattaa ccctcactaa agggagtcga ctcgatctcg aagccctttt        60
     cttgttttta ttgagggaga gcttgggttc agaatacatt acaaatgcag catctattcc       120
     agtctactta tagaaagacg tcctcctggg cttcccccct aagccccctg cctcccctag       180
     aacagcacag acttctaggt taagggtgag ctaaccactg ctcaccccca gctaaggcac       240
     ccaggctcag gggctccccg cctcccccgc tgagcgagcg gtgggggccc ccccgggaga       300
     gagcccagct gggggccgag cgcccagcgg cgagcccagc tgcccgcccc tacccgctcg       360
     gcgagcgagg ggaaaataag atcgccctcg gcgaggagag ggaggtcggg gctccggagc       420
Slide label: DNA sequence
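The SQ line summarises the sequence that follows it: the total length and the count of each base. A program can use it as a consistency check. The sketch below is one possible way to read that line and to clean a sequence line in Python. The function names `parse_sq_line` and `strip_position` are made up for this illustration and are not part of any EMBL tool.

```python
def parse_sq_line(sq_line):
    """Return (declared_length, base_counts) from an EMBL 'SQ' summary line."""
    parts = [part.strip() for part in sq_line.split(";") if part.strip()]
    # The first part looks like 'SQ   Sequence 6116 BP'; the length precedes 'BP'.
    declared_length = int(parts[0].split()[-2])
    counts = {}
    for part in parts[1:]:
        number, name = part.split()
        counts[name] = int(number)
    return declared_length, counts


def strip_position(sequence_line):
    """Drop the spaces and the trailing position number from a sequence line."""
    tokens = sequence_line.split()
    if tokens and tokens[-1].isdigit():
        tokens = tokens[:-1]
    return "".join(tokens)


sq = "SQ   Sequence 6116 BP; 1583 A; 1392 C; 1438 G; 1703 T; 0 other;"
length, counts = parse_sq_line(sq)
print(length == sum(counts.values()))

first_line = "gcggccgcga gctcaattaa ccctcactaa agggagtcga ctcgatctcg aagccctttt 60"
print(len(strip_position(first_line)))
```

For the entry shown here, the per-base counts add up to the declared 6116 BP, and the first sequence line holds 60 bases once the spacing and position number are removed.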

Summary of information in EMBL entries
- Describes the sequence type, e.g. genomic DNA, RNA, EST
- Provides the taxonomy of the organism from which the sequence came
- Provides information on submitters and references
- Describes features on the sequence that are important for function, replication, recombination, structure, etc.
- Shows whether the DNA encodes a protein (CDS) and provides the protein sequence
- Provides the actual nucleotide sequence
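The protein sequence attached to a CDS feature is a conceptual translation of the coding nucleotides. The following Python sketch shows the idea using the standard genetic code. The `translate` function and the short toy CDS are assumptions for illustration; real entries may specify a different translation table or a `codon_start` offset, which this sketch ignores.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code: codons enumerated in T, C, A, G order at each position.
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[pos:pos + 3], "X")  # 'X' = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy CDS invented for illustration; it is not taken from the entry itself.
print(translate("atgtcaaacgtgcgctga"))
```

The toy input was chosen so that it translates to MSNVR, the first five residues of the /translation qualifier in the example entry.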

Protein sequences
Figure labels: DNA, RNA, Protein, S S, Ac
- Protein cleavage
- Protein modification
- Transported to organelle or membrane
- Folded into secondary or tertiary structure
- Performs a specific function
- All this information needs to be captured in a database

Protein Sequence Databases
- UniProt:
  - Swiss-Prot: manually curated; distinguishes between experimental and computationally derived annotation
  - TrEMBL: automatic translation of EMBL; no manual curation, some automatic annotation
- GenPept: GenBank translations
- RefSeq: non-redundant sequences for certain organisms
- IPI (International Protein Index): combination of many protein sequence databases

Example of a Swiss-Prot entry 1
Slide labels: References; General information

Example of a Swiss-Prot entry 2
Slide labels: Cross-references; Functional information

Example of a Swiss-Prot entry 3
Slide labels: Keywords; Features; Sequence

Swiss-Prot annotation is mainly found in:
- Description (DE) lines: protein name/function
- Comment (CC) lines: e.g. function, subcellular location, pathway, cofactor, disease, etc.
- Feature table (FT): features on the sequence, e.g. domain, active site, modifications, variations, etc.
- Keyword (KW) lines: a set of a few hundred controlled vocabulary terms

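Other parts of UniProt
- UniParc: archive of all sequences
- UniProt: Swiss-Prot + TrEMBL
- UniProt NREF100: sequences that are 100% identical are merged
- UniProt NREF90: sequences that are at least 90% identical are merged
- UniProt NREF50: sequences that are at least 50% identical are merged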
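To make the idea of merging at the 100% level concrete, here is a small Python sketch that groups entries whose sequences are exactly identical. It illustrates only the simplest case. The 90% and 50% levels need a sequence-similarity clustering step that is not shown, and the function name, identifiers and toy sequences are all hypothetical rather than a description of how UniProt builds these sets.

```python
from collections import defaultdict


def merge_identical(records):
    """Group (identifier, sequence) pairs whose sequences are exactly identical."""
    clusters = defaultdict(list)
    for identifier, sequence in records:
        # Normalise case so 'MSNV' and 'msnv' count as the same sequence.
        clusters[sequence.upper()].append(identifier)
    return dict(clusters)


# Hypothetical identifiers and toy sequences, for illustration only.
records = [
    ("entry_a", "MSNVRVSNGS"),
    ("entry_b", "MSNVRVSNGS"),
    ("entry_c", "MSNVRVSNGA"),
]
merged = merge_identical(records)
print(len(records), "entries ->", len(merged), "non-redundant sequences")
```

Using the sequence itself as the dictionary key keeps every original identifier, so nothing is lost when redundant entries are collapsed.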

Submitting sequences to EMBL or UniProt
- WEB-IN: web-based tool for submitting DNA sequences to the EMBL database
- Protein sequences are submitted when the peptides have been directly sequenced; submit these through SPIN

Sequence formats
- Not MS Word, but text!
- Most include an ID/name/annotation of some sort
- FASTA, e.g.
>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgc
caatatgcgctctttgtccgcgcccaggagctacacaccttcgaggtga
ccggccaggaaacggtcgccagatcaaggctcatgtagcctcactgg
- Others are specific to programs, e.g. GCG, abi, clustal, etc.
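Because FASTA is plain text with a single '>' header line per sequence, it is easy to read with a few lines of code. The Python sketch below is a minimal reader written for illustration. The function name `parse_fasta` is an assumption, and established libraries handle many more edge cases than this does.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


SAMPLE = """\
>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgc
caatatgcgctctttgtccgcgcccaggagctacacaccttcgaggtga
"""

records = parse_fasta(SAMPLE)
header, sequence = records[0]
seq_id = header.split()[0]  # the first word of the header is the ID
print(seq_id, len(sequence))
```

The sequence lines are joined into one string, since line breaks inside a FASTA record carry no meaning.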

Literature database: PubMed/Medline
- Source of medical and scientific literature
- PubMed has articles published after 1965
- Can search by many different means, e.g. author, title, date, journal, etc., or by keywords for each
- Can save queries and results
- Can usually retrieve abstracts and full papers
- PubMed has a list of tags to search specific fields, e.g. [AU], [TI], [DP], etc.

Search fields in PubMed
Title Words [TI]                 MeSH Terms [MH]
Title/Abstract Words [TIAB]      Language [LA]
Text Words [TW]                  Journal Title [TA]
Substance Name [NM]              Issue [IP]
Subset [SB]                      Filter [FILTER]
Secondary Source ID [SI]         Entrez Date [EDAT]
Subheadings [SH]                 EC/RN Number [RN]
Publication Type [PT]            Author Name [AU]
Publication Date [DP]            All Fields [ALL]
Personal Name as Subject [PS]    Affiliation [AD]
Page Number [PG]                 Unique Identifiers [UID]
MeSH Major Topic [MAJR]          MeSH Date [MHDA]
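Field tags are appended to a search term in square brackets, and tagged terms can be combined with AND. The Python sketch below assembles such a query and, as an assumption beyond the slides, shows how it could be placed in a URL for NCBI's E-utilities ESearch service. The search terms are made up, the helper names are hypothetical, and the URL is only built here, not sent.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(terms):
    """Join (value, tag) pairs into a PubMed query such as 'x[AU] AND y[TI]'."""
    return " AND ".join(f"{value}[{tag}]" for value, tag in terms)


def esearch_url(query, retmax=20):
    """Build (but do not send) an E-utilities ESearch URL for a PubMed query."""
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Made-up search: an author, a title word and a publication year.
query = build_query([("Mulder N", "AU"), ("database", "TI"), ("2005", "DP")])
url = esearch_url(query)
print(query)
print(url)
```

The same query string can be typed directly into the PubMed search box; the URL form is only needed when searching from a program.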

Taxonomy Databases
- Most used is NCBI’s taxonomy database: omy
- Provides entries for all known organisms
- Provides the taxonomic lineage and translation table for each organism
- Provides sequence entries for each organism
- The UniProt-specific taxonomy database is Newt:

Example taxonomy entry

Where to find the databases
- Table of addresses for major databases and tools
- Nucleic Acids Research Database issue, January each year
- Nucleic Acids Research Software issue (new)
- Amos’s list of tools: