Primary vs. Secondary Databases
Primary databases are repositories of “raw” data and are also referred to as archival databases. Archiving is one of the most important functions of a database: to store the data reliably and make it accessible. Most protein sequences are predicted (i.e. annotated) from nucleotide sequence and are therefore not curated.
Secondary databases are repositories of “curated” data. Curated databases require human review of some kind, in addition to some experimental verification of the biological meaning of the sequence data.

International Nucleotide Sequence Database Collaboration
The three members are the European Molecular Biology Laboratory, EMBL (UK); GenBank (NCBI, USA); and the DNA Data Bank of Japan, DDBJ. All three organizations share 100% of their data. See Figure 1.1 in your text. One consequence of data sharing is that the file formats must correspond.

Flatfiles
In biological databases, a flat file is a text file that usually contains one (sequence) record. Flat files are the indivisible unit of all sequence databases, but the data in them can be displayed in a variety of formats. One of the most common formats for sequence records is called FASTA.

A closer look at Flatfiles
The first line is called the header. Its fields, from left to right:
Name identifier: a unique identifier for each sequence. This is also known as the primary accession number.
Length of the mRNA.
Molecule type: in this case, the sequence was submitted as an mRNA sequence. The “N” means nucleotide and the “M” means mRNA. See Box 1.2.
Linear, i.e. not a circular molecule like a plasmid.
Taxonomic code: not very useful anymore.
Date when last updated.
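As a rough illustration of the header fields listed on this slide, the sketch below splits a header line on whitespace. The example line, its locus name `EXAMPLE01`, and the `LocusLine` and `parse_locus` names are invented for illustration. Real flat files lay the header out in fixed columns, and older records may omit some fields, so treat this as a simplification, not a full parser.

```python
from typing import NamedTuple


class LocusLine(NamedTuple):
    """Fields of a flat-file header line (hypothetical helper type)."""
    name: str       # name identifier
    length: int     # sequence length in base pairs
    molecule: str   # type of molecule, e.g. mRNA
    topology: str   # linear or circular
    division: str   # taxonomic (division) code
    date: str       # date when last updated


def parse_locus(line: str) -> LocusLine:
    """Split a header line on whitespace; assumes all eight tokens are present."""
    fields = line.split()
    if len(fields) != 8 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError(f"unexpected header layout: {line!r}")
    _, name, length, _, molecule, topology, division, date = fields
    return LocusLine(name, int(length), molecule, topology, division, date)


# Invented example line, not taken from any real database entry.
EXAMPLE = "LOCUS       EXAMPLE01   1234 bp    mRNA    linear   INV 01-JAN-2000"
locus = parse_locus(EXAMPLE)
print(locus.name, locus.length, locus.molecule, locus.topology)
```

Splitting on whitespace keeps the sketch short. A column-based parser would be more robust for records that leave a field blank.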

Flatfiles continued
The second line is called the Definition Line, the goal of which is to summarize the essential biological information encoded by the entry.
1. Genus and species.
2. Gene name. Note: gene nomenclature can be confusing. In this case, the gene is named after a fruit fly mutant.
3. Basic description of structure and function.
4. Type of molecule from which the sequence was derived. In reality, this would have been derived from a cDNA corresponding to an mRNA harvested from an embryonic cell.

Accession: the most important entry. It is the primary database reference to the sequence. If you use this sequence in a publication, cite the accession number to refer readers to the database entry you used or created.
Version: very similar to the accession number, but if the sequence is updated, either because it was wrong or because it was incomplete, the number after the decimal point indicates the version.
GI: the GenBank-specific “geneinfo” identifier.
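The relationship between the accession number and the version can be sketched in a few lines. The identifier `AB123456.2` is made up for illustration, and `split_version` is a hypothetical helper name, not part of any NCBI tool.

```python
from typing import Optional, Tuple


def split_version(identifier: str) -> Tuple[str, Optional[int]]:
    """Split 'ACCESSION.N' into the accession and its integer version.

    Returns (accession, None) when no version suffix is present.
    """
    accession, dot, version = identifier.strip().partition(".")
    if not dot:
        return accession, None
    if not version.isdigit():
        raise ValueError(f"version suffix is not a number: {identifier!r}")
    return accession, int(version)


# Made-up identifier: the accession stays fixed, the suffix counts updates.
print(split_version("AB123456.2"))
print(split_version("AB123456"))
```

Citing the full identifier, including the suffix, points readers to the exact version of the sequence that was used.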

Source organism
This is fairly self-explanatory. The difference between SOURCE and ORGANISM is that the latter is hyperlinked, so one can follow the link and investigate further.

All GenBank entries must be associated with a citation
In essence, this ensures that the means by which the sequences were acquired have been peer reviewed, if not the sequence itself. This is what lends scientific credibility to the quality of these databases.
This is an EMBL accession number, which means that the sequence was not originally submitted through the GenBank portal.

The only feature common to all three primary databases is the source feature
All sequences must come from somewhere, so the minimum data (organism and type of molecule) are entered here, with a link to the Taxonomy Browser. The list of acceptable database cross-references (i.e. db_xref) to external links is strictly controlled. In this case the link is still within the “Entrez” webspace, but others are possible.

All annotated nucleotide entries contain a “virtual” translation into amino acid sequence
In this case, the translation is derived directly from an mRNA sequence, so there is a good chance it is correct. If the translation comes from computationally derived genomic sequence, however, it should be validated against a curated database.
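A minimal sketch of what a “virtual” translation involves, using the standard genetic code. It assumes that the coding region starts at the first ATG and ends at the first in-frame stop codon. That is a simplification: database entries record the coding region coordinates explicitly. The function name `translate` and the input sequence are illustrative only.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first ATG to the first in-frame stop codon."""
    seq = mrna.upper().replace("U", "T")
    start = seq.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":  # stop codon ends the translation
            break
        protein.append(amino_acid)
    return "".join(protein)


# Illustrative input: two leading bases, then ATG GCC TAA.
print(translate("GGATGGCCTAAGG"))
```

Because the result depends entirely on the nucleotide sequence and the chosen reading frame, any error in the underlying sequence carries straight through to the protein.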

And then, finally, the sequence data

So, flatfiles are informative, but what if you want to work with the sequence? The sequence data in the flatfile can be displayed or downloaded in a variety of different ways. A FASTA file is a very common format.

The simplest possible FASTA file
>sequence
AGTCCGATCGATCGTAGCTACGTACGTACGTAGCTA
GCTACGTACGTACGATCGATGATCGATCGATCGATC
GATCGATCGATCGATCGATCGATCGATCGATCG
This FASTA sequence file has all of the necessary elements for a database entry, but it is not very informative. For example, we don’t know what database it is from, what organism it has come from, or what molecule it encodes, if any.
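To show how little a FASTA record requires, here is a small sketch that writes one from a header string and a raw sequence. The header text and the 60-character line width are assumptions chosen for illustration. The format itself only needs the “>” line followed by the sequence lines.

```python
def to_fasta(header: str, sequence: str, width: int = 60) -> str:
    """Return one FASTA record: a '>' header line, then wrapped sequence lines."""
    cleaned = "".join(sequence.split()).upper()  # drop whitespace and line breaks
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


# Hypothetical, more informative header: identifier, then a short description.
record = to_fasta(
    "EXAMPLE01.1 hypothetical example gene, mRNA",
    "AGTCCGATCGATCGTAGCTACGTACGTACGTAGCTA GCTACGTACGTACGATCGATGATCGATCGATCGATC",
    width=36,
)
print(record)
```

Putting an identifier and a short description on the header line addresses the problem noted on the slide, since the record then says where it came from.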

FASTA format
The chevron symbol “>” is important because it denotes the beginning of a new sequence. This is particularly important if you are using a file that contains multiple sequences for a query search, for example.
>A sequence
CAGCTGACAGATCGTACGATCGATGCGCACGAAGCACTACTAGCTAGGT
>Another sequence
CGCTAGCTCGCGATCGTATCAACGCGCGCGCGCGCGCATACTCACGCGC
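The role of the “>” symbol as a record separator can be sketched with a small reader. The function name `read_fasta` is an illustrative choice, and the input is the two-sequence example from this slide.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; each '>' line starts a new record."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)  # emit the final record


TEXT = """>A sequence
CAGCTGACAGATCGTACGATCGATGCGCACGAAGCACTACTAGCTAGGT
>Another sequence
CGCTAGCTCGCGATCGTATCAACGCGCGCGCGCGCGCATACTCACGCGC
"""

records = list(read_fasta(TEXT.splitlines()))
for name, seq in records:
    print(name, len(seq))
```

Without the chevron lines, the reader would have no way to tell where one sequence ends and the next begins.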

Protein sequence databases
Read Chapter One in the book, from “Protein Sequence Databases” to the end of the chapter.
With the exception of the Protein Data Bank, which is a primary database composed of experimentally determined protein structures, all other protein databases are considered to be either secondary or mixed primary and secondary databases, because they rely upon conceptual (virtual) translation of nucleotide data.
GenPept is a secondary database, searchable through the “Protein” portal in Entrez. Caveat: errors in the nucleotide sequence can be propagated.
UniParc is a mixed primary and secondary database, and therefore attempts to be a comprehensive repository of amino acid sequences.
The Protein Data Bank is a curated primary database of protein structure determinations made using either X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy.

Entrez Webspace
This book will be your best friend. It is a comprehensive online documentation volume that attempts to fill the gap between a straightforward search in PubMed or BLAST and more advanced tasks. The Entrez webspace uses the concept of neighboring, which describes logical (i.e. natural) relationships between entries in one database and those in another.