Biological databases.


The Progress

[Figure: number of DNA base pairs (billions) in GenBank over time. Milestones marked: first 2 bacterial genomes complete; first eukaryote complete (yeast); first metazoan complete (roundworm); 122+ bacterial genomes; 17 eukaryotic genomes complete or near completion, including Homo sapiens, mouse and fruit fly. Official “15 year” Human Genome Project: 1990-2003. Data from NCBI and TIGR (www.ncbi.nlm.nih.gov and www.tigr.org).]

In the last 5 years or so the amount of data has grown exponentially, including the growth of online databases and resources. This slide shows the growth of the total number of base pairs and the total number of genomes completed since the beginning of the Human Genome Project. As you can see, the sheer size and growth of this resource has been impressive, and daunting, in the last 5 years.

Few scientists are aware of, or make full use of, all the open-source and public resources available to them through the internet. The annual Nucleic Acids Research Database issue listing contained 548 databases this year! And, as the quote below mentions, only half of those who use the databases are familiar with their tools. This Wellcome Trust study also made it clear that many people become users of a database after being told about it by colleagues.

“Despite the large amount of publicity surrounding the Human Genome Project, a recent survey conducted on behalf of the Wellcome Trust indicates that only half of biomedical researchers using genome databases are familiar with the tools that can be used to actually access the data.” From “The Molecular Biology Database Collection: 2003 update” by Andreas D. Baxevanis, in the Jan 1, 2003 NAR database issue.

International Nucleotide Sequence Database Collaboration

- EMBL (European Molecular Biology Laboratory): http://www.ebi.ac.uk
- DDBJ (Japan)
- GenBank (NCBI): http://www.ncbi.nlm.nih.gov

Linked resources shown on the slide: PubMed, Nucleotides, Proteins, Genomes, Taxonomy, Structure, Domains.

NCBI - GenBank

GenBank: all publicly available nucleotide sequences and their protein translations.

Data sources:
- Direct submission from scientists
- Literature
- Genome sequencing

DNA database divisions (examples):
- Organism division (Human, Bacteria, etc.)
- Molecule division (DNA, RNA, protein)
- Sequence division (Genome, ESTs, STSs)

Sequence databases

A database is a collection of data that is organized so that its content can easily be accessed, managed and updated.

An optimal database should be comprehensive, well annotated, easy to search, easy to retrieve data from, and provide cross-references.

The GenBank database: as of April 2004, there are over 38,989,342,565 bases in GenBank.

- Problem 1: huge databases, leading to redundancy and inadequate sequences.
- Problem 2: submission by users, leading to redundancy; only the submitter can change an entry, so entries are not always up to date and annotation is often partial.

GenBank

Help: http://www.ncbi.nlm.nih.gov/entrez/query/static/help/helpdoc.html

Instructions:
- In the Nucleotide database, search human[Organism] AND dUTPase[Title] without limits.
- Add limits on ESTs. (EST: mRNA origin. STS: markers. TPA: third party. GSS: sequences that are genomic in origin, unlike the mRNA origin of ESTs. Explain all limits!)
- Show how to do it in Preview/Index.
- Look at the complete CDS and then at exon 3: redundancy. Analyze both the complete CDS and exon 3.
- Show FASTA format, plus send to file, to text, to clipboard.
- Look at the Protein database.

Unique Identifiers at NCBI

- Accession numbers apply to a complete sequence record.
- Sequence identification numbers apply to the individual sequences within a record:
  - GI number: assigned consecutively by NCBI to each sequence it processes.
  - Version number: the accession number followed by a dot and a version number.

The format of accession numbers varies, depending upon the source database:
- GenBank/EMBL/DDBJ: one letter followed by five digits, e.g. U12345, or two letters followed by six digits, e.g. AY123456.
- Swiss-Prot: all are six characters, [O,P,Q][0-9][A-Z,0-9][A-Z,0-9][A-Z,0-9][0-9], e.g. P12345 and Q9JJS7.
- RefSeq: two letters, an underscore, and six digits, e.g. NM_000492 (mRNA), NT_ (contig), NC_ (chromosome), NG_ (genomic region).

If a sequence changes in any way, it receives a new GI number, and the version number is incremented by one.
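The accession formats above can be checked mechanically. The following is a minimal sketch in Python that only encodes the patterns listed on this slide; the function names are made up for illustration, and the databases have since introduced longer accession formats that these patterns do not cover.

```python
import re

# Patterns transcribed from the formats listed on the slide (illustrative only).
# Order matters: a Swiss-Prot ID such as P12345 also fits the GenBank
# "one letter + five digits" pattern, so the more specific pattern is tried first.
PATTERNS = [
    ("RefSeq", re.compile(r"^[A-Z]{2}_\d{6}$")),
    ("Swiss-Prot", re.compile(r"^[OPQ][0-9][A-Z0-9]{3}[0-9]$")),
    ("GenBank/EMBL/DDBJ", re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")),
]


def split_version(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version or None)."""
    base, dot, version = identifier.partition(".")
    return base, (int(version) if dot and version.isdigit() else None)


def classify_accession(identifier):
    """Return the name of the first listed format the accession matches."""
    base, _ = split_version(identifier.strip())
    for name, pattern in PATTERNS:
        if pattern.match(base):
            return name
    return "unknown"


if __name__ == "__main__":
    for example in ["U12345", "AY123456", "P12345", "Q9JJS7", "NM_000492.3"]:
        print(example, "->", classify_accession(example), split_version(example))
```

Because some formats overlap, a classifier like this can only guess the source database; the record itself remains the authoritative answer.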

Data Formats

Many data formats are used for sequences: FASTA format, GenBank format, EMBL format…

GenBank format See http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

FASTA format

Example:

    >my_sequence_name
    BTYKLJGJFKHVHFMGHF
    KHGJFJFVKHGJHLNLNLJ
    KJGKGKGKHLJH

- Easy to parse
- Least informative
- Default input format for sequence analysis software (e.g., BLAST, ClustalW)
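To illustrate why FASTA is considered easy to parse, here is a minimal sketch of a reader in Python. The function name parse_fasta is an assumption made for this example; production work would normally rely on an established library rather than a hand-written parser.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {sequence name: sequence}.

    A header line starts with '>'; the first word after it is used as the name.
    All following lines, up to the next header, are joined into one sequence.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            name = header.split()[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


if __name__ == "__main__":
    example = """>my_sequence_name
BTYKLJGJFKHVHFMGHF
KHGJFJFVKHGJHLNLNLJ
KJGKGKGKHLJH
"""
    for seq_name, sequence in parse_fasta(example).items():
        print(seq_name, len(sequence), sequence)
```

The parser keeps only the name and the residues, which mirrors the slide's point that FASTA is the least informative of the formats: everything else about the record has to come from elsewhere.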

Swiss-Prot and TrEMBL

Swiss-Prot (http://www.ebi.ac.uk/swissprot/) is a curated protein sequence database that provides:
- a high level of annotation
- a minimal level of redundancy
- a high level of integration with other databases (cross-references)

Core data: sequence, taxonomy and bibliographic reference.
Annotation data: function, domain structure, post-translational modifications, protein variants, etc.

TrEMBL is a computer-annotated supplement of Swiss-Prot that contains all the translations of EMBL nucleotide sequence entries not yet integrated in Swiss-Prot.

ExPASy Proteomics Server http://www.expasy.org/

Swiss-Prot file format entry

Flat-file original Swiss-Prot format

More resources

Some good places for refreshing your biochemistry (address: description):
- www.glycosuite.com: The glycan structure database.
- lipid.bio.m.u-tokyo.ac.jp: The ultimate lipid database.
- chem.sis.nlm.nih.gov/chemidplus/: ChemIDplus, identifying molecules by drawing them up!

The main resources for biochemical pathways and enzymes (address: description):
- www.expasy.ch/cgi-bin/search-biochem-index: Find which metabolic pathway a molecule belongs to.
- www.genome.ad.jp/kegg/: The famous Kyoto Encyclopedia of Genes and Genomes (KEGG). E.C. (Enzyme Commission) numbers or gene names are the best starting points for this resource.
- brenda.bc.uni-koeln.de: The comprehensive enzyme information system BRENDA.
- www.chem.qmul.ac.uk/iubmb: The official site for enzyme nomenclature of the International Union of Biochemistry and Molecular Biology (IUBMB).
- www.ecocyc.org: The Encyclopedia of E. coli Genes and Metabolism. It is progressively extending to other bacteria.

Search sequence databases

Two search methods:
- Text-based searching: searches textual information contained in the header sections of database entries.
- Sequence search: searches sequence information with sequence queries (next week!).

Text based searching - Search for query words in specific fields. Choose your database and add limits. Examples: Entrez, SRS.

NCBI – Entrez (http://www.ncbi.nih.gov/Entrez/)

Entrez is the search tool for NCBI databases. The search starts by choosing the relevant group of databases (Nucleotide, Protein, etc.). Use field qualifiers, logical operators, and the “limits” form.

Boolean operators: AND, OR, NOT.
- Group terms together by using ().
- Examples: cytochrome AND human; cytochrome AND (human OR mouse)
- Always use upper case for operators.
- If you don’t use any operator, the query words are searched together!

Field qualifiers: search in a specific field (author, organism, journal, …).
- Example: homo sapiens[organism] AND kinase AND nature[journal]

Refining a search step by step:
1. Cytochrome b
2. Cytochrome b AND human
3. Cytochrome b AND human[organism]
4. Cytochrome b AND human[organism], plus limits
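The query rules above (upper-case operators, parentheses for grouping, field qualifiers in square brackets) can be sketched as a small query builder in Python. The helper names are hypothetical, and the use of NCBI's E-utilities ESearch endpoint is an assumption added for illustration; the slides themselves only describe the Entrez web form. The sketch builds a URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities ESearch endpoint (not mentioned in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

ALLOWED_OPERATORS = {"AND", "OR", "NOT"}


def field(term, qualifier=None):
    """Attach a field qualifier, e.g. field('human', 'Organism') -> 'human[Organism]'."""
    return f"{term}[{qualifier}]" if qualifier else term


def combine(operator, *terms):
    """Join terms with an upper-case Boolean operator and wrap them in parentheses."""
    operator = operator.upper()
    if operator not in ALLOWED_OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    return "(" + f" {operator} ".join(terms) + ")"


def esearch_url(database, query):
    """Build (but do not send) a search URL for the given database and query."""
    return ESEARCH_URL + "?" + urlencode({"db": database, "term": query})


if __name__ == "__main__":
    query = combine("AND", field("human", "Organism"), field("dUTPase", "Title"))
    print(query)
    print(esearch_url("nucleotide", query))
    print(combine("AND", "cytochrome", combine("OR", "human", "mouse")))
```

Wrapping every combination in parentheses is a deliberate choice here: it makes nested queries such as cytochrome AND (human OR mouse) unambiguous without relying on operator precedence.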

Entrez Protein Database (http://www.ncbi.nlm.nih.gov/entrez/query)

Includes SwissProt, PIR, PRF, PDB, and translations from annotated coding regions in GenBank and RefSeq.

Entrez Nucleotides database (http://www.ncbi.nlm.nih)

Includes GenBank, RefSeq, and PDB. As of April 2004, there are over 38,989,342,565 bases.

SRS (http://srs.ebi.ac.uk/)

1. Choose a library.
2. Fill in the query form.
3. Get results.

Gene-centric Databases

Repository-type database:
- Many pieces of sequence related to a given sequence.
- Examples: GenBank, SwissProt.

Gene-centric database:
- All the sequence information relevant to a given gene is made accessible at once: get the whole story at once!
- Provides easy access when the query is related to a gene or function.
- Examples: Gene, UniGene, RefSeq.
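The difference between the two organizations can be pictured with a toy example in Python. Everything in this sketch is hypothetical: the record IDs are placeholders rather than real accession numbers, and the function name is invented. It only shows the idea of regrouping repository-style entries so that all records for one gene are returned together.

```python
from collections import defaultdict

# Hypothetical repository-style records: one entry per submitted sequence.
# The IDs are placeholders, not real accession numbers.
REPOSITORY_RECORDS = [
    {"id": "record_1", "gene": "DUT", "kind": "complete CDS"},
    {"id": "record_2", "gene": "DUT", "kind": "exon 3"},
    {"id": "record_3", "gene": "CYTB", "kind": "mRNA"},
]


def gene_centric_view(records):
    """Group repository records by gene so one lookup returns every related entry."""
    view = defaultdict(list)
    for record in records:
        view[record["gene"]].append(record)
    return dict(view)


if __name__ == "__main__":
    for gene, entries in gene_centric_view(REPOSITORY_RECORDS).items():
        print(gene, [entry["id"] for entry in entries])
```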

Gene (http://www.ncbi.nih.gov/entrez/query.fcgi?db=gene)

- Gene provides a unified query environment for genes.
- Query on names, symbols, accessions, publications, GO terms, chromosome numbers, E.C. numbers, and many other attributes associated with genes and the products they encode.
- Unique identifiers are assigned to genes with known map positions.
- Supplies key connections between map, sequence, expression, structure, function, citation, and homology data.
- Provides identifiers to UniGene, RefSeq, relevant GenBank entries, OMIM and SNPs.
- Can be considered the successor to LocusLink.

RefSeq (http://www.ncbi.nlm.nih.gov/projects/RefSeq/)

- Non-redundancy
- Distinct accession series
- Updates to reflect current knowledge of sequence data and biology
- Ongoing curation by NCBI staff and collaborators, with reviewed records indicated
- Data validation and format consistency

ESTs division

Uses:
- Gene prediction
- Expression level (only clues)
- Alternative splicing

Problems:
- Redundant database
- Mistakes (single-pass reads)
- Incomplete coverage of genes: only for model eukaryotic organisms; rare tissues; low copy number of genes

UniGene (http://www.ncbi.nlm.nih.gov/UniGene)

An automatic partitioning of GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and its map location. The focus is on mRNA and EST information.

Organism specific databases

Wouldn’t it be great if…

[Figure: the genome backbone (sequence and base position number) with annotation tracks aligned to it: chromosome band, known genes, predicted genes, evolutionary conservation, SNPs, STS sites, gap locations, repeated regions, microarray/expression data, and more. The tracks link out to more data.]

A great deal of information has come to us from the formal, organized Human Genome Project. But other data has come from individual laboratories doing traditional benchwork; some has come from the literature; and some has come from new large-scale technologies that have arisen in the last few years, such as microarray data for gene expression detection. So there are tremendous amounts of data around, and lots of places to try to find it.

The UCSC Genome Browser is great because it organizes a lot of this material in one place. It uses the backbone of the genome (the official backbone sequence of the Human Genome Project is nicknamed the Golden Path) and combines this Golden Path information with all kinds of other useful and important biological information, such as chromosome banding patterns, known genes, predictions, expression data, comparative genomics, SNPs, and so on. All of this data is lined up in one place so you can quickly find new information about your regions of interest. Better still, all the data links out to other databases, web sites and literature, so you can go as deep as you want into any topic that you care about.

As this diagram shows, the data is organized along the genomic sequence backbone. All of the other information that is available is referred to as “Annotation Tracks”. Later we’ll see that you can get to these regions of interest, and then link out to other great collections of data as well.

Solution: Genome Browsers, or “Map Viewers”

Introduce self; introduce the section. Our goal here is to cover the basics of getting the genome browser software to work for you; we want to introduce you to searching the UCSC Genome Browser via text or sequences to get the information that you want. The materials used in this slide presentation were developed by Warren Lathe and Mary Mangan, from OpenHelix, LLC, under contract from the UCSC Genome Bioinformatics group.

NCBI Map Viewer http://www.ncbi.nlm.nih.gov/Genomes/

Ensembl (http://www.ensembl.org/) Ensembl example: http://www.ensembl.org/Docs/linked_docs/human_eg_19_34.pdf

UCSC Home page (genome.ucsc.edu)

[Slide labels: navigate; General information; Specific information: new features, current status, etc.]

Okay, so let’s move on to what the web site actually looks like, and how to get your searches accomplished! When you first arrive at the UCSC genome bioinformatics site, this is what you will see. First, there is a section that contains general information about the site. Second, there is specific information about news: new features, changes, and the current state of the data that is available. This information is worth a quick check when you visit the site, in case there have been changes to the data since the last time you visited. But the real substance of the site, the tools, is accessible in a couple of ways from this page. There are navigation bars at the top and the sides which will permit you to access all of the really cool stuff that is going on here.

This page gives you access to the several types of tools that are available from this site:
- Tools: Browser, Blat, Tables, Downloads, FAQ, User Guide. Access them from either navigation option, top or side.
- Mirrors: other locations, just in case one isn’t available to you, or one might have faster access from your location.
- Archives: older data; sometimes you might need to troll through older data to re-examine something you found before.
- Credits: these are the people who bring you this browser, very important.
- Cite Us: please cite the resource properly in your talks and papers; this helps them get grant $$!
- Jobs: anyone looking?
- Links: a great collection of links to other tools that might be of use or interest to you.
- Contact us: mailing list, error reports, mirroring information.

To actually get in and start searching the database, there are several options. You can search by text: gene name, gene symbol, keywords, ID, etc. To do this we will use the Genome Browser link. You can also search by sequence, if you have a specific sequence of interest, using the BLAT search tool. We will start with text searching from the Genome Browser gateway, the link at the top for Genome Browser. The link that says Genomes will get you to the same location, but for our purposes we will click the link that says Genome Browser.

UCSC material developed by W.C. Lathe and M. Mangan, info@openhelix.com