NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015.



Entrez Nucleotides

Entrez Nucleotides (GenBank): A database of nucleotide sequences (ATGC). It actually contains data from several databases: GenBank, EMBL, DDBJ, and RefSeq. It is hard to search because many submitting scientists send in redundant or poorly annotated information.

Nucleotide Data Domain: As of December 15, 2014, over 184,938,063,614 bases in over 179,295,769 sequence records, including some complete genomes and chromosomes.

So Why So Hard to Search? No controlled vocabulary: you lose the power of MeSH and must OR synonyms together, so you often miss the records you want. Archival: the quality of annotations depends on the submitter (especially in the features field), and there is little to no quality control, including spelling errors, so again you often miss the records you want. Redundant: there are many records for the same gene, partial records, and so on, so you often pull up records you don’t want.

GenBank Sample Record: Before searching, we will look at a GenBank sample record. Note that the “Features” field provides useful biological information and may be searched.

Click any link in the sample record to access the definition of the field and search tips. The “Definition” field acts as the record title; search it with [titl]. The accession number is a unique identifier, assigned by NCBI and required by journals and grants. The record also links to the PubMed citation/abstract.

The “Features” field provides the most biological information; search it with [fkey]. The numbers indicate the location on the nucleotide sequence.

…3158

GenBank Identifiers: Accession Number - U49845 [accn]: a unique identifier that does not change; the letter prefix no longer has significance. Version - U: if there is any change to the sequence, version U is created. GenInfo Identifier (GI number) [uid]: runs parallel to the accession.version system; a change in the sequence changes the number.
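The accession.version scheme described above can be illustrated with a small Python sketch. The regular expression is a simplified assumption rather than NCBI's official grammar, the function name `split_accession` is hypothetical, and the version suffix `.1` in the usage is only an illustrative value.

```python
import re

# Simplified, assumed pattern: letters, an optional underscore, digits,
# and an optional ".version" suffix. Not an official NCBI grammar.
ACCESSION_VERSION = re.compile(r"^([A-Z]+_?\d+)(?:\.(\d+))?$")


def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version or None)."""
    match = ACCESSION_VERSION.match(identifier.strip().upper())
    if match is None:
        raise ValueError(f"not an accession-like identifier: {identifier!r}")
    accession, version = match.groups()
    # The accession stays stable; only the version number changes
    # when the sequence is revised.
    return accession, int(version) if version is not None else None


# Illustrative usage: the ".1" suffix is an example value.
print(split_accession("U49845.1"))
print(split_accession("U49845"))
```

Keeping the accession and version separate makes it easy to tell whether two identifiers refer to the same record at different revisions.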

Searching “Nucleotides”: The database is difficult to search because of redundant records and archival data with poor or missing annotation. The best searches are done using commands; you need a class to learn them all. Practice search: search for sequences for human presenilin 1. Is there anything odd about some of the retrieved results?

Search for HUMAN presenilin 1, but end up with rat, mouse, etc. Choose “nucleotide” from the dropdown, then click “search”.

Searching “Nucleotides”: We retrieved the non-human and PSEN2 (rather than PSEN1) records because the computer looked for the terms “human” and “presenilin 1” ANYWHERE in the record (click on the Details tab to see how the computer parsed your search). Use complex Boolean searching to clean this up: term [field] AND term [field].

Searching “Nucleotides”: How do you get rid of non-human sequences? Search human [orgn] (this works for any taxon). How do you get rid of non-presenilin 1 sequences? Another trick: search PSEN1 [gene]. Note that you may miss relevant sequences but should not pick up irrelevant ones. The sequences that you miss are the ones that have not been annotated with the current official gene symbol in the “gene” field, so DO NOT use this method if you need to find every sequence for a particular gene. The combined search: human [orgn] AND PSEN1 [gene].
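The fielded search above can also be assembled in code. The following Python sketch builds the same `term [field] AND term [field]` query and wraps it in a URL for NCBI's E-utilities `esearch` endpoint. Using E-utilities is an assumption, since the slides only show the web interface; the helper names are hypothetical, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed here; the slides use the web form).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded(term, field):
    """Attach an Entrez field tag to a term, e.g. human[orgn]."""
    return f"{term}[{field}]"


def and_query(*clauses):
    """Combine clauses with Boolean AND."""
    return " AND ".join(clauses)


def esearch_url(db, term, retmax=20):
    """Build (but do not fetch) an esearch URL for the given database."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


query = and_query(fielded("human", "orgn"), fielded("PSEN1", "gene"))
url = esearch_url("nucleotide", query)
print(query)
print(url)
```

Building the query from tagged clauses keeps each restriction (organism, gene) explicit, which mirrors how the Details tab shows the parsed search.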

Use these filters to choose the molecule type or to confine retrieval to RefSeq records. This is the search that was completed using fields (orgn, gene) and filters.

How Can I Find the “Best” Sequences? RefSeq is a non-redundant, curated subset of the sequence data domains. It contains one record for each gene or splice variant from each organism represented. Records can be thought of as “review articles” for sequences: the “best” (usually longest) sequence is used as the seed, and value-added annotations are provided by experts. Easy: a tab now exists to limit retrieval to just RefSeq.

Click on the RefSeq link to retrieve only the “best” sequences (highly annotated, complete, nonredundant). The typical RefSeq accession number format is 2 letters, an underscore, and then numbers.
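The accession format just described (2 letters, an underscore, then numbers) translates directly into a pattern check. This Python sketch is only a format test under that description, with an optional version suffix added as an assumption; it does not confirm that a record exists, and the accession strings are illustrative.

```python
import re

# Two uppercase letters, an underscore, digits, and (assumed) an
# optional ".version" suffix.
REFSEQ_PATTERN = re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$")


def looks_like_refseq(accession):
    """Return True if the string has the typical RefSeq accession shape."""
    return bool(REFSEQ_PATTERN.match(accession.strip()))


# Illustrative strings only.
for candidate in ["NM_000021", "NM_000021.4", "U49845"]:
    print(candidate, looks_like_refseq(candidate))
```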

Viewing Formats: The “Default” view is the standard GenBank record. Researchers often use the “FASTA” format for analysis. Change the record format at the “Display” pull-down menu.
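To show why FASTA is convenient for analysis, here is a minimal Python sketch of a FASTA reader: each record is a header line starting with `>` followed by one or more sequence lines. The sample text and its sequences are made up for illustration, and a production pipeline would more likely use an established parser.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping header line -> sequence."""
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Made-up example records, not real database entries.
SAMPLE = """>example_1 made-up sequence
ATGCATGC
GGCC
>example_2 made-up sequence
TTAA
"""

records = parse_fasta(SAMPLE)
for name, sequence in records.items():
    print(name, len(sequence))
```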

Entrez Proteins

Entrez Proteins contains data from several databases (SwissProt, PIR, PRF, PDB) and translations from annotated coding regions in GenBank and RefSeq. It is a redundant, archival data domain of publicly available protein sequences.

Searching Entrez Proteins: Searched like Entrez Nucleotides. The “Filters” choices differ and include molecular weight and sequence length filters.
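The sequence length filter can also be written as part of the query text. The sketch below assumes Entrez's range syntax (`min:max`) with the sequence length field tag `[SLEN]`; the helper names and the length bounds are hypothetical, chosen only for illustration.

```python
def length_range(min_len, max_len):
    """Build a sequence-length range clause, e.g. 400:500[SLEN]."""
    if min_len > max_len:
        raise ValueError("min_len must not exceed max_len")
    return f"{min_len}:{max_len}[SLEN]"


def protein_query(organism, gene, min_len, max_len):
    """Combine organism, gene, and length restrictions with AND."""
    clauses = [
        f"{organism}[orgn]",
        f"{gene}[gene]",
        length_range(min_len, max_len),
    ]
    return " AND ".join(clauses)


# The bounds 400 and 500 are arbitrary illustrative values.
protein_search = protein_query("human", "PSEN1", 400, 500)
print(protein_search)
```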

Entrez Gene

Gene: Pulls together information (sequences, structures, literature, gene models, pathways, etc.) for genes. It is the best place to start for “gene-centered” info, with one record per gene per organism. Search by names, symbols, accessions, publications, GO terms, chromosome numbers, E.C. numbers, etc.

Search using the gene symbol. You could have searched under any of these aliases (unlike GenBank, where you would have to try them all).

Official gene symbol as determined by the HUGO Gene Nomenclature Committee (HGNC)

Summary of protein, function, and disease-causing mutations, from the RefSeq record. Links to PubMed records that provide evidence of function; any researcher can add these.

Links to OMIM records of phenotype/disease. Gene Ontology terms form a controlled vocabulary with three components: biological process, molecular function, and cellular component. Links to homology maps. Links to protein interactions.

Pathway info may be available from the Kyoto Encyclopedia of Genes and Genomes. Sequence and domain links. Links to GeneReviews, a clinical resource.

Taxonomy Browser

Search the Taxonomy Browser: How many genera from the family Iguanidae are represented by sequence data? How many nucleotide and protein sequences are available for the family?
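Record counts like the ones asked for above could also be read programmatically. This Python sketch assumes the E-utilities `esearch` response format, in which the total number of hits appears in a `<Count>` element, and parses a canned response. The XML and the number in it are made up for illustration; they are not the answer to the exercise.

```python
import xml.etree.ElementTree as ET

# Made-up response in the assumed esearch shape; the count is NOT real data.
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>1234</Count>
  <RetMax>0</RetMax>
  <RetStart>0</RetStart>
</eSearchResult>"""


def parse_count(xml_text):
    """Extract the total hit count from an esearch-style XML response."""
    root = ET.fromstring(xml_text)
    count_text = root.findtext("Count")
    if count_text is None:
        raise ValueError("response has no <Count> element")
    return int(count_text)


# A query such as Iguanidae[orgn] would be sent once per database
# (nucleotide, protein) and each response parsed the same way.
hit_count = parse_count(SAMPLE_RESPONSE)
print(hit_count)
```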

Entrez Searching Summary

To Find Everything(?), Broaden the Search: OR together synonyms. OR together related terms (gene name, gene symbol, protein name, alternate spellings, disorder). Don’t specify a field; search the entire record. Truncation: use * at the end of a word root. Click “Related Records”. Try using the Taxonomy Browser to pick up all taxa in a particular group.
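The broadening tactics above (OR-ing synonyms and truncating word roots) can be sketched in Python as follows. The helper names are hypothetical, the term list is illustrative rather than a vetted set of aliases, and quoting multi-word terms as phrases is an assumption about how the search should treat them.

```python
def truncate(root):
    """Add the * truncation symbol to a word root."""
    return root + "*"


def or_group(terms):
    """OR terms together; multi-word terms are quoted as phrases."""
    parts = []
    for term in terms:
        parts.append(f'"{term}"' if " " in term else term)
    return "(" + " OR ".join(parts) + ")"


# No field tags are attached, so the whole record is searched.
broad_query = or_group(["presenilin 1", "PSEN1", truncate("presenil")])
print(broad_query)
```

Leaving off field tags trades precision for recall, which is the point when the goal is to find everything.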

For Fewer/Best Records, Narrow the Search: Search particular fields. In PubMed: MeSH Browser, subheadings, major MeSH. In Nucleotide: features, title, gene, properties, organism. Use “Filters”. Search only the RefSeq database.
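Narrowing can likewise be expressed as extra AND clauses. This sketch assumes the Entrez property terms `srcdb_refseq[prop]` (RefSeq records only) and `biomol_mrna[prop]` (mRNA molecule type) as text equivalents of the web filters; the function name `narrow` is hypothetical.

```python
def narrow(query, refseq_only=False, molecule=None):
    """Append property restrictions to an existing Entrez query."""
    clauses = [f"({query})"]
    if refseq_only:
        # Assumed property term limiting results to RefSeq records.
        clauses.append("srcdb_refseq[prop]")
    if molecule is not None:
        # Assumed property term for molecule type, e.g. biomol_mrna.
        clauses.append(f"biomol_{molecule}[prop]")
    return " AND ".join(clauses)


narrowed = narrow("human[orgn] AND PSEN1[gene]", refseq_only=True, molecule="mrna")
print(narrowed)
```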

Will Entrez Find Every Sequence Record? No! Entrez relies on the annotation of records, so you are searching solely on “terminology”. Some records are not annotated, and some are poorly or incorrectly annotated. To find all useful sequences, you need to search on the sequence itself, using the related sequence link or BLAST.

Entrez “Related Records”: These vary depending on the data domain. PubMed related articles are based on a “word weight” algorithm (MeSH, title, and abstract words) and are listed in order by weight (highest weight first). Nucleotide and protein related sequences are based on a basic BLAST search and are listed in order by best BLAST score.