Retrieving Information: Using Entrez

Retrieving Information: Using Entrez B.F. Francis Ouellette francis@bioinformatics.ubc.ca Lecture 2.2

Before we get started
Curated vs. non-curated databases
Intellectual value added by the curation process
Updating a record: a case study

Updates: the four "W"s
Who updates? Submitters, journals, "3rd party"
What to update? Gene names, citations, new product, sequencing errors
Where? update@ncbi.nlm.nih.gov
Why update?

Example

From francis Wed Mar 3 22:32:19 1999
To: ddbjupdt@ddbj.nig.ac.jp
Subject: D25291 mito
Dear colleagues,
it appears that DDBJ record D25291 is contaminated with mitochondrial sequences from nucleotide 673 to 1803, as it is identical to mouse mitochondrial sequence (EMBL V00711) for more than 1100 nucleotides. I would recommend deleting that segment of the record, or removing the record altogether, as it leads to unfortunate misinterpretation of the data when using GenBank or DDBJ. The protein sequence (which is erroneous, as it is all of mitochondrial origin) should definitely be removed as well.
….
LOCUS MUSNGH 1803 bp mRNA ROD 29-AUG-1997
DEFINITION Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15 cell TA20 mRNA, complete cds.
ACCESSION D25291

Sequence updated (fields that changed: length, date, DEFINITION, version, GI)
Before:
LOCUS MUSNGH 1803 bp mRNA ROD 29-AUG-1997
DEFINITION Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15 cell TA20 mRNA, complete cds.
ACCESSION D25291
NID g1850791
VERSION D25291.1 GI:1850791
After:
LOCUS MUSNGH 619 bp mRNA ROD 12-MAR-1999
DEFINITION Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15 mRNA.
ACCESSION D25291
VERSION D25291.2 GI:4520413
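The comparison of the two versions of D25291 can also be done mechanically. The following is a minimal sketch, not part of the original lecture: it parses the two header fragments quoted on the slide and reports which fields differ. The names `parse_header` and `changed_fields` are hypothetical, and the parser assumes the simplified, whitespace-separated, one-line-per-field layout shown here rather than the full GenBank flat-file format.

```python
# Header fragments copied from the slide (each field kept on one line).
OLD_HEADER = """\
LOCUS MUSNGH 1803 bp mRNA ROD 29-AUG-1997
DEFINITION Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15 cell TA20 mRNA, complete cds.
ACCESSION D25291
VERSION D25291.1 GI:1850791
"""

NEW_HEADER = """\
LOCUS MUSNGH 619 bp mRNA ROD 12-MAR-1999
DEFINITION Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15 mRNA.
ACCESSION D25291
VERSION D25291.2 GI:4520413
"""


def parse_header(text):
    """Pull length, date, definition, accession, version and GI out of a header fragment."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        keyword = parts[0]
        if keyword == "LOCUS":
            # The sequence length is the token just before "bp"; the date is last.
            fields["length"] = int(parts[parts.index("bp") - 1])
            fields["date"] = parts[-1]
        elif keyword == "DEFINITION":
            fields["definition"] = " ".join(parts[1:])
        elif keyword == "ACCESSION":
            fields["accession"] = parts[1]
        elif keyword == "VERSION":
            fields["version"] = parts[1]
            for token in parts[2:]:
                if token.startswith("GI:"):
                    fields["gi"] = token[len("GI:"):]
    return fields


def changed_fields(old, new):
    """Return {field: (old_value, new_value)} for every field that differs."""
    names = sorted(set(old) | set(new))
    return {n: (old.get(n), new.get(n)) for n in names if old.get(n) != new.get(n)}


changes = changed_fields(parse_header(OLD_HEADER), parse_header(NEW_HEADER))
for name, (before, after) in changes.items():
    print(f"{name}: {before} -> {after}")
```

The accession stays the same across the update while the version suffix and GI change, which is why the accession does not appear in the reported differences.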

Another guiding principle when using GenBank
Data in GenBank is only as good as what you put in: applying and ensuring this (in an active, day-to-day fashion) will only make everybody’s work that much easier.

Nature 2001 Jan 25; 409:452

Check sequence revision history (search in Nucleotide)

Check the history for D25291

Back to today’s lecture: Using Entrez …

Retrieving information: how it works
Servers have the records you want.
You need to understand the data they have, and how it is organized.
There are often many ways to get to an answer. The route to get there is not always obvious, but you need to think of alternatives and traps.
Use some query language; each system has its own.
Retrieve data in a specified format.
Save it in a way that will be useful to you.
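The last three steps (query, retrieve in a specified format, save) can be sketched in code. The lecture itself works through the Entrez web interface; the example below instead assumes NCBI's E-utilities `efetch` endpoint as the programmatic route, which is a choice made here for illustration. The helper names `efetch_url` and `fetch_and_save` are hypothetical. Only the URL is built when the block runs; the download function is defined but not called, so no network request is made.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities (assumed here as the programmatic
# counterpart of the Entrez web pages shown in the lecture).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(db, record_id, rettype, retmode="text"):
    """Build an efetch URL asking for one record in a specified format."""
    params = {"db": db, "id": record_id, "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def fetch_and_save(url, path):
    """Download the record and save it locally (performs a network request)."""
    with urlopen(url) as response, open(path, "wb") as handle:
        handle.write(response.read())


# Ask for the D25291 record from the nucleotide database in FASTA format.
url = efetch_url("nuccore", "D25291", "fasta")
print(url)

# To actually retrieve and save the record, one would call:
# fetch_and_save(url, "D25291.fasta")
```

Keeping URL construction separate from the download makes it easy to inspect the query before sending it, which matches the advice above to understand how the data is organized before retrieving it.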

What you may be looking for:
You did a BLAST search and need more information about some of the proteins it found similarities to.
You heard on CBC about a disease gene that was recently discovered, and you want to know more about it.
You want to build a dataset for local BLAST searches.
A colleague wants you to do an alignment of all sequences from a given protein family.

What you are looking for:
PubMed paper from author X
Sequence from gene X in organism Y
All information about organelle W in model organism Y
All information about disease X in human
Orthologs of that disease gene in other model organisms

Central Dogma in Biology
DNA → RNA → protein
The central dogma in biology: DNA makes DNA and also makes RNA, which makes proteins.

Central Dogma: NCBI version
DNA, RNA, protein: write a paper about it

Entrez: Pathway to Discovery (1993)
Databases: MEDLINE abstracts, nucleotide sequences, protein sequences
Links: term frequency statistics; literature citations in sequence databases; nucleotide sequence similarity; amino acid sequence similarity; coding region features

Type in your last name and find a paper from one of your teammates
Related Articles

Hard link: DNA to protein
L12345

From Fig. 1 of "The Entrez Search and Retrieval System", Jim Ostell, Chapter 14, The NCBI Handbook, 2003

Ctrl-F

Getting started in Entrez

A query: all the yeast papers by the instructor of this lecture
First challenge: how do you spell his name again? What are his initials?
Do I need to know the Greek name for yeast? Which yeast?

“ouellette bf” [au] AND yeast
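The same query can be sent programmatically. The sketch below is an illustration added to the transcript, not something shown in the lecture: it assumes the E-utilities `esearch` endpoint and only builds the request URL, without sending it. The helper name `esearch_url` is hypothetical. Note that the query string needs URL encoding because of the quotes, brackets and spaces.

```python
from urllib.parse import urlencode

# Assumed programmatic counterpart of the PubMed search box.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build an esearch URL for a query term in the given Entrez database."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"


# Author field tag [au] restricts the first part of the query to author names;
# "yeast" is left as free text.
term = '"ouellette bf"[au] AND yeast'
url = esearch_url("pubmed", term)
print(url)
```

Quoting the author name keeps the surname and initials together as one phrase, and the explicit `AND` makes the default Boolean operator visible.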

MeSH: Medical Subject Headings

A query
Word <free text>: too many hits
More words (the Boolean ‘AND’ is the default)
Limit query to a specified field
Limit query in time
Do Boolean operations on queries: #1 AND #2, #3 NOT #5, #7 OR #8

hieter p [au]

Limit in time: 1993-01-01 to 1993-12-31
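A time limit can also be written directly into the query term. The sketch below is an added illustration: it assumes PubMed's publication-date field tag `[dp]` with a `start:end` range, and the helper name `limit_by_publication_date` is hypothetical. It combines the author query from the previous slide with the 1993 date range.

```python
from datetime import date


def limit_by_publication_date(term, start, end):
    """Wrap a query term and restrict it to a publication-date range."""
    # PubMed date ranges are written as YYYY/MM/DD:YYYY/MM/DD followed by the tag.
    return f"({term}) AND {start:%Y/%m/%d}:{end:%Y/%m/%d}[dp]"


query = limit_by_publication_date("hieter p[au]", date(1993, 1, 1), date(1993, 12, 31))
print(query)
```

Wrapping the original term in parentheses keeps the date restriction from binding to only part of a longer Boolean query.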

No abstract
With abstract
Full text on-line
Full text in PubMed Central

boguski m [au]: 99 hits
boguski ms [au]: 80 hits

#24 NOT #23: 19 hits
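Boolean operations on numbered queries behave like set operations on the record IDs each query returned. The sketch below is an added illustration using stand-in integers, not real PubMed IDs. It assumes that every record matching "boguski ms" also matches the broader "boguski m" search, which is consistent with the counts on the slides (99 - 80 = 19).

```python
# Stand-in record IDs (NOT real PMIDs), sized to match the hit counts above.
boguski_m = set(range(1, 100))   # 99 records for "boguski m [au]"
boguski_ms = set(range(1, 81))   # 80 records for "boguski ms [au]" (assumed subset)

# NOT is set difference: records in the broad search but not the narrow one.
m_not_ms = boguski_m - boguski_ms

# AND is intersection, OR is union.
m_and_ms = boguski_m & boguski_ms
m_or_ms = boguski_m | boguski_ms

print(len(m_not_ms), len(m_and_ms), len(m_or_ms))
```

The 19 records left after the NOT are the ones by authors whose initials start with M but are not exactly MS, which is the kind of trap to watch for when searching by author name.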

Other types of links in Entrez
The next slides explore other kinds of things linked to Entrez records.

“hieter p” [au] cdc16p

“Books”

(2)

Link to Genome View of Chromosome I

RefSeq
RefSeq represents the NCBI-curated “reference sequences” for all ‘worked’ genomes.
Historically, these were referred to as “GenBank-Gold”.
RefSeq records are genomic, mRNA, or protein sequences.
Not all sequences are in RefSeq.
All RefSeq sequences are assembled or taken from records in GenBank.

Some of the features of RefSeq:
non-redundancy
explicitly linked nucleotide and protein sequences
updates to reflect current knowledge of sequence data and biology
data validation and format consistency
distinct accession series
ongoing curation by NCBI staff and collaborators, with review status indicated on each record

Accession number space (letters + digits)
GenBank: 1+5 (L12345, U00001); 2+6 (AF000001, AC000003); 4+2+6 (WGS). All have accession.version.
Protein: 1+5 (SwissProt/UniProt); 3+5 (GenPept)
RefSeq: N*_12345
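The letter-plus-digit patterns above translate naturally into regular expressions. The sketch below is an added illustration covering only the formats listed on this slide; accession series introduced later may not match. The names `split_version` and `classify` are hypothetical. Because a 1+5 identifier can be either a GenBank nucleotide record or a SwissProt protein, the label for that pattern says so rather than guessing.

```python
import re

# Patterns for the accession formats named on the slide (checked in order).
PATTERNS = [
    ("RefSeq", re.compile(r"[A-Z]{2}_[A-Z]{0,4}\d+")),
    ("1+5 (GenBank nucleotide or SwissProt protein)", re.compile(r"[A-Z]\d{5}")),
    ("2+6 (GenBank nucleotide)", re.compile(r"[A-Z]{2}\d{6}")),
    ("3+5 (GenPept protein)", re.compile(r"[A-Z]{3}\d{5}")),
    ("4+2+6 (GenBank WGS)", re.compile(r"[A-Z]{4}\d{2}\d{6}")),
]


def split_version(identifier):
    """Split 'accession.version' into (accession, version or None)."""
    accession, _, version = identifier.partition(".")
    return accession, int(version) if version else None


def classify(identifier):
    """Return the label of the first pattern the accession matches."""
    accession, _ = split_version(identifier)
    for label, pattern in PATTERNS:
        if pattern.fullmatch(accession):
            return label
    return "unrecognized"


for example in ["L12345", "AF000001", "D25291.2", "NM_123456", "AAA12345"]:
    print(example, "->", classify(example))
```

Stripping the version suffix before matching reflects the point that all of these accessions share the accession.version form.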

RefSeq Accession Number Space
NC_123456: Genomic. Complete genomic molecules including genomes, chromosomes, organelles, plasmids.
NG_123456: Genomic. Incomplete genomic region; supplied to support the NCBI Genome Annotation pipeline.
NM_123456: mRNA.
NR_123456: RNA. Non-coding transcripts including structural RNAs, transcribed pseudogenes, and others.
NP_123456: Protein.
NP_12345678: Protein. Planned expansion of accession series.

Automated Assemblies
NT_123456: Genomic. Intermediate genomic assemblies of BAC sequence data.
NW_123456: Genomic. Intermediate genomic assemblies of Whole Genome Shotgun sequence data.

Model RefSeq records
XM_123456: mRNA. Model mRNA provided by the Genome Annotation process; sequence corresponds to the genomic contig.
XR_123456: RNA. Model non-coding transcripts provided by the Genome Annotation process; sequence corresponds to the genomic contig.
XP_123456: Protein. Model proteins provided by the Genome Annotation process; sequence corresponds to the genomic contig.

WGS special case
NZ_ABCD12345678: Genomic. A collection of whole genome shotgun sequence data for a project. Accessions are not tracked between releases. The first four characters following the underscore (e.g. 'ABCD') identify a genome project.
ZP_12345678: Protein. Proteins annotated on NZ_ accessions (often via computational methods).
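The RefSeq prefixes from the last four slides can be collected into a small lookup table. The sketch below is an added illustration whose entries are taken only from those slides; the name `describe_refseq` and the short category wording are choices made here. It returns `None` for anything that is not a listed RefSeq prefix.

```python
# Prefix -> (molecule type, category), as listed on the slides above.
REFSEQ_PREFIXES = {
    "NC": ("genomic", "complete genomic molecule"),
    "NG": ("genomic", "incomplete genomic region"),
    "NM": ("mRNA", "curated"),
    "NR": ("RNA", "non-coding transcript"),
    "NP": ("protein", "curated"),
    "NT": ("genomic", "automated assembly of BAC data"),
    "NW": ("genomic", "automated assembly of WGS data"),
    "XM": ("mRNA", "model from genome annotation"),
    "XR": ("RNA", "model from genome annotation"),
    "XP": ("protein", "model from genome annotation"),
    "NZ": ("genomic", "WGS project collection"),
    "ZP": ("protein", "annotated on NZ_ accessions"),
}


def describe_refseq(accession):
    """Look up the molecule type and category from a RefSeq accession prefix."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix)


for example in ["NM_123456", "XP_123456", "NZ_ABCD12345678", "L12345"]:
    print(example, "->", describe_refseq(example))
```

The underscore is what distinguishes RefSeq accessions from GenBank ones, so checking for it first avoids misreading a two-letter GenBank prefix as a RefSeq prefix.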

Download all the data: Entrez and RefSeq

LocusLink

Things to watch out for:

In-Lab Exercise
Questions for this exercise are in the binder.
Work with your teammates.
Feel free to explore the information space.
In PubMed, look up your favorite author, people you went to school with, or people at an institution you would like to work at some day!
This lab is NOT marked.