A Field Guide to GenBank and NCBI Molecular Biology Resources


Slightly modified from:
  Peter Cooper, ftp://ftp.ncbi.nih.gov/pub/cooper/FieldGuide/
  Eric Sayers, ftp://ftp.ncbi.nih.gov/pub/sayers/Field_Guide/U_Penn/

NCBI Resources
  About NCBI
  NCBI Sequence Databases
    Primary Database: GenBank
    Derivative Databases: RefSeq
  Entrez Databases and Text Searching
  BLAST Services
  Genomic Resources

The National Center for Biotechnology Information (NCBI)
Created as a part of NLM in 1988
  Establish public databases
  Perform research in computational biology
  Develop software tools for sequence analysis
  Disseminate biomedical information
Tools:
  BLAST (1990)
  Entrez (1992)
  GenBank (1992)
  Free MEDLINE (PubMed, 1997)
  Human genome (2001)

NCBI Home Page
  http://www.ncbi.nlm.nih.gov
  To learn more, visit the “Site Map” and “About NCBI” web pages

About NCBI

Some NCBI Statistics….

[Figure: users per day, 1997 to 2001, with Christmas Day marked on the plot]

Molecular Databases
Primary Databases
  Original submissions by experimentalists
  Database staff organize but don’t add additional information
  Example: GenBank
Derivative Databases
  Human curated: compilation and correction of data
    Example: SWISS-PROT, NCBI RefSeq mRNA
  Computationally derived
    Example: UniGene
  Combinations
    Example: NCBI Genome Assembly

What is GenBank? NCBI’s Primary Sequence Database
  Nucleotide-only sequence database
  GenBank data
    Direct submissions of individual records (BankIt, Sequin)
    Batch submissions via email (EST, GSS, STS)
    ftp accounts established for sequencing centers
  Data shared among three collaborating databases:
    GenBank
    DNA Database of Japan (DDBJ)
    European Molecular Biology Laboratory Database (EMBL)

The International Nucleotide Sequence Database Collaboration
[Diagram: GenBank (NCBI, NIH), EMBL (EBI) and DDBJ (CIB, NIG) each receive submissions and updates and exchange data with one another. Tools shown: Entrez, Sequin, BankIt and ftp for GenBank; SRS for EMBL; getentry for DDBJ.]

GenBank: NCBI’s Primary Sequence Database
  Full release every two months
  Incremental and cumulative updates daily
  Available only through the internet: ftp://ftp.ncbi.nih.gov/genbank/
Release 133, December 2002
  22,318,883 records
  28,507,990,166 nucleotides
  110,000+ species
  >90 gigabytes of data
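The release figures on this slide allow a quick back-of-the-envelope check of how long a typical GenBank record was at the time. The sketch below simply divides the nucleotide count by the record count; the function name is made up for illustration, and the result is only a mean, which says nothing about the very uneven length distribution across divisions.

```python
# Figures quoted on the slide for GenBank Release 133 (December 2002).
RECORDS = 22_318_883
NUCLEOTIDES = 28_507_990_166


def mean_record_length(nucleotides: int, records: int) -> float:
    """Return the average number of nucleotides per record."""
    if records <= 0:
        raise ValueError("records must be positive")
    return nucleotides / records


mean_length = mean_record_length(NUCLEOTIDES, RECORDS)
print(f"Mean record length: {mean_length:.0f} nt")
```

The mean comes out a little under 1,300 nucleotides, which fits a database dominated by short single-pass reads.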

Entrez Nucleotide (23,464,770 records)
  GenBank 71%
  DDBJ 19%
  EMBL 9%
  RefSeq 1%

Primary vs. Derivative Databases
[Diagram: sequencing centers and labs submit sequences to GenBank; curators (RefSeq), algorithms (UniGene) and genome assembly draw on GenBank records to build the derivative databases.]

Traditional GenBank Divisions
Direct submissions (Sequin and BankIt): accurate, well characterized
  BCT  Bacterial and archaeal
  INV  Invertebrate
  MAM  Mammalian (excluding ROD and PRI)
  PHG  Phage
  PLN  Plant and fungal
  PRI  Primate
  ROD  Rodent
  SYN  Synthetic (cloning vectors)
  VRL  Viral
  VRT  Other vertebrate

A Traditional GenBank Record
Fields labeled on the record:
  Locus field
  Molecule type
  Modification date
  GenBank division
  Definition line
  Accession number
  Version
  GI (GenInfo)
  Keywords
  Taxonomy

A Traditional GenBank Record

Bulk Sequence Divisions of GenBank
Batch submissions (email and ftp): inaccurate, poorly characterized
  EST  Expressed Sequence Tag
  STS  Sequence Tagged Site
  GSS  Genome Survey Sequence
  HTG  High Throughput Genomic
  HTC  High Throughput cDNA
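The three-letter division codes from the two division slides lend themselves to a simple lookup table. The following is a minimal sketch that holds only the codes listed on those slides; the dictionary and function names are hypothetical, and the table is not a complete list of GenBank divisions.

```python
# Division codes exactly as listed on the slides (not an exhaustive list).
TRADITIONAL_DIVISIONS = {
    "BCT": "Bacterial and archaeal",
    "INV": "Invertebrate",
    "MAM": "Mammalian (excluding ROD and PRI)",
    "PHG": "Phage",
    "PLN": "Plant and fungal",
    "PRI": "Primate",
    "ROD": "Rodent",
    "SYN": "Synthetic (cloning vectors)",
    "VRL": "Viral",
    "VRT": "Other vertebrate",
}
BULK_DIVISIONS = {
    "EST": "Expressed Sequence Tag",
    "STS": "Sequence Tagged Site",
    "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genomic",
    "HTC": "High Throughput cDNA",
}


def describe_division(code: str) -> tuple[str, str]:
    """Return (category, description) for a division code from the slides."""
    key = code.strip().upper()
    if key in TRADITIONAL_DIVISIONS:
        return ("traditional", TRADITIONAL_DIVISIONS[key])
    if key in BULK_DIVISIONS:
        return ("bulk", BULK_DIVISIONS[key])
    raise KeyError(f"division code not listed on the slides: {code!r}")


print(describe_division("est"))
print(describe_division("ROD"))
```

Knowing whether a record sits in a traditional or a bulk division is a quick first hint about how well characterized it is likely to be.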

Organization of GenBank (23,087,196 records)
  11 traditional divisions: 8%
  1 patent division (PAT): 4%
  5 bulk divisions:
    EST 67%
    GSS 19%
    STS, HTG, HTC 2%

What is UniGene?
A gene-oriented view of sequence entries
  MegaBlast-based automated sequence clustering
  Nonredundant set of gene-oriented clusters
  Each cluster represents a unique gene
  Provides information on tissue-specific expression and map locations
  Includes well-characterized genes and novel ESTs
  Useful for gene discovery and selection of mapping reagents

Organisms Represented in UniGene

Genome Sequencing
[Diagram: whole BAC insert (or genome) -> shredding -> cloning and isolating -> sequencing (GSS division or trace archive) -> assembly -> draft sequence (HTG division)]

Working Draft Sequence
[Figure: draft sequence with gaps marked]

HTG Division: High Throughput Genome
  Phase 1 (HTG)  Acc = AC109609.1
  Phase 2 (HTG)  Acc = AC109609.6
  Phase 3 (ROD)  Acc = AC109609.10
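The three identifiers on this slide share the accession AC109609 and differ only in the version suffix, which rises as the record is updated. A minimal sketch of splitting and comparing them is shown below; the class and function names are made up for illustration. The version is converted to an integer because a plain string comparison would rank ".10" below ".6".

```python
from typing import NamedTuple


class VersionedAccession(NamedTuple):
    accession: str
    version: int


def parse_accession(text: str) -> VersionedAccession:
    """Split an 'ACCESSION.VERSION' identifier into its two parts."""
    accession, sep, version = text.strip().rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {text!r}")
    return VersionedAccession(accession, int(version))


# The three identifiers shown on the slide.
records = [parse_accession(a) for a in ("AC109609.1", "AC109609.6", "AC109609.10")]

# All three refer to the same record; only the version differs.
same_record = len({r.accession for r in records}) == 1
latest = max(records, key=lambda r: r.version)
print(same_record, latest)
```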

HTG Division: High Throughput Genome

NCBI’s Third Party Annotation (TPA) Database (new)
  NCBI now accepts the submission of new annotations of existing GenBank sequences
  Facilitates the annotation of genomes by experts

A Sample TPA record

RefSeq: NCBI’s Derivative Sequence Database
  Curated transcripts and proteins
    Reviewed: human, mouse, rat, fruit fly, zebrafish, Arabidopsis
  Human model transcripts and proteins
  Assembled genomic regions (contigs)
    Draft human genome
    Mouse genome
  Chromosome records
    Microbial
    Viral
    Organelle

The RefSeq Accession Numbers
mRNAs and Proteins
  NM_123456  Curated mRNA
  NP_123456  Curated Protein
  NR_123456  Curated non-coding RNA
  XM_123456  Predicted Transcript (human, mouse)
  XP_123456  Predicted Protein (human, mouse)
  XR_123456  Predicted non-coding RNA
Gene Records
  NG_123456  Reference Genomic Sequence (human)
Assemblies
  NT_123456  Contig (mouse and human)
  NW_123456  Supercontig (mouse)
  NC_123456  Chromosome (microbial, viral, Arabidopsis)
  NR_123456  Interim Identifier for Microbial Chromosomes
Organisms shown: human, mouse, rat, fruit fly, zebrafish, Arabidopsis
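Because the RefSeq prefix encodes the record type, an accession can be classified from its first three characters. The sketch below uses only the prefixes listed on this slide; the table and function names are hypothetical. The slide lists NR_ twice (curated non-coding RNA and interim microbial chromosome identifier), so the table keeps the first meaning and a comment flags the ambiguity.

```python
import re
from typing import Optional

# Prefix meanings as given on the slide. NR_ also appears there as an
# "Interim Identifier for Microbial Chromosomes"; only the first meaning
# is kept here, so treat NR_ results with care.
REFSEQ_PREFIXES = {
    "NM_": "Curated mRNA",
    "NP_": "Curated Protein",
    "NR_": "Curated non-coding RNA",
    "XM_": "Predicted Transcript",
    "XP_": "Predicted Protein",
    "XR_": "Predicted non-coding RNA",
    "NG_": "Reference Genomic Sequence",
    "NT_": "Contig",
    "NW_": "Supercontig",
    "NC_": "Chromosome",
}

# Two capital letters, an underscore, digits, and an optional ".version".
_REFSEQ_PATTERN = re.compile(r"([A-Z]{2}_)\d+(?:\.\d+)?")


def classify_refseq(accession: str) -> Optional[str]:
    """Return the record type for a RefSeq-style accession, else None."""
    match = _REFSEQ_PATTERN.fullmatch(accession.strip())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


print(classify_refseq("NM_123456"))
print(classify_refseq("U18238.1"))
```

A GenBank accession such as U18238.1 has no two-letter prefix with an underscore, so it falls through to None, which gives a cheap way to tell primary records from RefSeq records.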

Curated RefSeq Records: NM_, NP_

Entrez: Linking and Neighboring

The Entrez Databases

The (ever) Expanding Entrez System
Databases linked through Entrez: Journals, UniGene, Books, SNP, PubMed, PubMed Central, UniSTS, Nucleotide, PopSet, Protein, ProbeSet, Genome, Structure, Taxonomy, CDD, 3D Domains, OMIM

Entrez Nucleotides
Query: glucose 6 phosphate dehydrogenase

Document Summaries: glucose 6 phosphate dehydrogenase[All Fields] = 748 hits
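The same fielded query can also be sent programmatically. The sketch below only builds a request URL for NCBI's E-utilities `esearch` endpoint, which the slides themselves do not mention; the database name `nuccore`, the `retmax` value and the helper name are assumptions chosen for illustration. No request is sent, and a live search today would return a very different hit count from the 748 shown on the slide.

```python
from urllib.parse import urlencode

# E-utilities search endpoint (not part of the original slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term: str, db: str = "nuccore", retmax: int = 20) -> str:
    """Build an esearch URL for an Entrez query string; nothing is sent."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"


url = esearch_url("glucose 6 phosphate dehydrogenase[All Fields]")
print(url)
```

Building the URL with `urlencode` takes care of escaping the spaces and square brackets in the query term.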

Entrez Nucleotides: Limits
Query: glucose 6 phosphate dehydrogenase
Search fields available under Limits:
  Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume

Entrez Nucleotides: Preview/Index

Adding Terms: Preview/Index
Fields available for added terms:
  Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length . . .
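Adding terms through Preview/Index amounts to attaching a field name in square brackets to each term and joining the clauses with AND. A minimal sketch of that composition follows. The helper names are hypothetical, and the specific values `Viridiplantae` and `biomol mrna` are assumptions chosen to illustrate a plant mRNA search; the slides do not show the exact terms used.

```python
def fielded(term: str, field: str) -> str:
    """Attach an Entrez field name to a search term, e.g. 'x[Organism]'."""
    term, field = term.strip(), field.strip()
    if not term or not field:
        raise ValueError("term and field must be non-empty")
    return f"{term}[{field}]"


def all_of(*clauses: str) -> str:
    """Join clauses so that every one of them must match."""
    if not clauses:
        raise ValueError("at least one clause is required")
    return " AND ".join(clauses)


query = all_of(
    fielded("glucose 6 phosphate dehydrogenase", "All Fields"),
    fielded("Viridiplantae", "Organism"),   # assumed value for "plants"
    fielded("biomol mrna", "Properties"),   # assumed value for "mRNA"
)
print(query)
```

Each added clause narrows the result set, which is how a broad hit list can be reduced to a focused one.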

Plant G6PD mRNAs

Display: Formats, Links, and Neighbors
Formats: Summary, Brief, ASN.1, FASTA, XML, GenBank, GI list
Links and neighbors: LinkOut, Nucleotide Neighbors, Genome Links, ProbeSet Links, OMIM Links, PopSet Links, Protein Links, PubMed Links, SNP Links, Structure Links, Taxonomy Links, UniSTS Links

FASTA definition line
>gi|603218|gb|U18238.1|MSU18238 Medicago sativa glucose-6-phosphate dehyd
CCACCAGATATAATTAAGTAGATCAGAGTAGAAGAAGATGGGAACAAATGAATGGCATGTAGAAAGAAGA
GATAGCATAGGTACTGAATCTCCTGTAGCAAGAGAGGTACTTGAAACTGGCACACTCTCTATTGTTGTGC
TTGGTGCTTCTGGTGATCTTGCCAAGAAGAAGACTTTTCCTGCACTTTTTCACTTATATAAACAGGAATT
GTTGCCACCTGATGAAGTTCACATTTTTGGCTATGCAAGGTCAAAGATCTCCGATGATGAATTGAGAAAC
AAATTGCGTAGCTATCTTGTTCCAGAGAAAGGTGCTTCTCCTAAACAGTTAGATGATGTATCAAAGTTTT
TACAATTGGTTAAATATGTAAGTGGCCCTTATGATTCTGAAGATGGATTTCGCTTGTTGGATAAAGAGAT
TTCAGAGCATGAATATTTGAAAAATAGTAAAGAGGGTTCATCTCGGAGGCTTTTCTATCTTGCACTTCCT
CCTTCAGTGTATCCATCCGTTTGCAAGATGATCAAAACTTGTTGCATGAATAAATCTGATCTTGGTGGAT
GGACACGCGTTGTTGTTGAGAAACCCTTTGGTAGGGATCTAGAATCTGCAGAAGAACTCAGTACTCAGAT
TGGAGAGTTATTTGAAGAACCACAGATTTATCGTATTGATCACTATTTAGGAAAGGAACTAGTGCAAAAC
ATGTTAGTACTTCGTTTTGCAAATCGGTTCTTCTTGCCTCTGTGGAACCACAACCACATTGACAATGTGC
AGATAGTATTTAGAGAGGATTTTGGAACTGATGGTCGTGGTGGATATTTTGACCAATATGGAATTATCCG
AGATATCATTCCAAACCATCTGTTGCAGGTTCTTTGCTTGATTGCTATGGAAAAACCCGTTTCTCTCAAG
CCTGAGCACATTCGAGATGAGAAAGTGAAGGTTCTTGAATCAGTACTCCCTATTAGAGATGATGAAGTTG
TTCTTGGACAATATGAAGGCTATACAGATGACCCAACTGTACCGGACGATTCAAACACCCCGACTTTTGC
AACTACTATTCTGCGGATACACAATGAAAGATGGGAAGGTGTTCCTTTCATTGTGAAAGCAGGGAAGGCC
CTAAATTCTAGGAAGGCAGAGATTCGGGTTCAATTCAAGGATGTTCCTGGTGACATTTTCAGGAGTAAAA
AGCAAGGGAGAAACGAGTTTGTTATCCGCCTACAACCTTCAGAAGCTATTTACATGAAGCTTACGGTCAA
GCAACCTGGACTGGAAATGTCTGCAGTTCAAAGTGAACTAGACTTGTCATATGGGCAACGATATCAAGGG
ATAACCATTCCAGAGGCTTATGAGCGTCTAATTCTCGACACAATTAGAGGTGATCAACAACATTTTGTTC
GCAGAGACGAATTAAAGGCATCATGGCAAATATTCACACCACTTTTACACAAAATTGATAGAGGGGAGTT
GAAGCCGGTTCCTTACAACCCGGGAAGTAGAGGTCCTGCAGAAGCAGATGAGTTATTAGAAAAAGCTGGA
TATGTTCAAACACCCGGTTATATATGGATTCCTCCTACCTTATAGAGTGACCAAATTTCATAATAAAACA
AGGATTAGGATTATCAGGAGCTTATAAATAAGTCTTCAATAAGCTTGTGAAATTTTCGTTATAATCTCTC
TCATTTTGGGGTGTATATCAAGCATTTAAGCGCGTGTTTGACACAGTTTGTGTAATAGATTTGGCTCTGA
ATGAAAATAAACGGGAATTGTTTCTTTTTGTTTTA

Parts of the definition line >gi|603218|gb|U18238.1|MSU18238
  >          start of the definition line
  gi|603218  gi number
  gb         database identifier
  U18238.1   accession number
  MSU18238   locus name
Database identifiers
  gb   GenBank
  emb  EMBL
  dbj  DDBJ
  sp   SWISS-PROT
  pdb  Protein Databank
  pir  PIR
  prf  PRF
  ref  RefSeq
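The pipe-delimited layout of the definition line makes it easy to pull out the gi number, source database, accession and locus name. The following is a minimal sketch for definition lines of exactly the form shown on this slide; the class and function names are made up for illustration, and other identifier layouts (for example PDB entries with chain fields) are not handled.

```python
from typing import NamedTuple

# Database identifier tags as listed on the slide.
DB_TAGS = {
    "gb": "GenBank", "emb": "EMBL", "dbj": "DDBJ", "sp": "SWISS-PROT",
    "pdb": "Protein Databank", "pir": "PIR", "prf": "PRF", "ref": "RefSeq",
}


class Defline(NamedTuple):
    gi: int
    database: str
    accession: str
    locus: str
    description: str


def parse_defline(line: str) -> Defline:
    """Parse a '>gi|number|db|accession|locus description' line."""
    if not line.startswith(">"):
        raise ValueError("a FASTA definition line must start with '>'")
    identifiers, _, description = line[1:].strip().partition(" ")
    fields = identifiers.split("|")
    if len(fields) != 5 or fields[0] != "gi" or not fields[1].isdigit():
        raise ValueError(f"unexpected identifier layout: {identifiers!r}")
    _, gi, tag, accession, locus = fields
    # Unknown tags are passed through unchanged rather than rejected.
    return Defline(int(gi), DB_TAGS.get(tag, tag), accession, locus, description)


record = parse_defline(
    ">gi|603218|gb|U18238.1|MSU18238 Medicago sativa glucose-6-phosphate dehyd"
)
print(record)
```

The description is kept exactly as it appears on the slide, including its truncated final word.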

Entrez Genome

Organism Pages

The Map Viewer: a common platform for integrated display

The Map Viewer

Entrez PubMed

Online Books

Entrez Specialized Databases
  Taxonomy: searchable taxonomic tree with nodes for all species that have records in an Entrez database
  OMIM (Online Mendelian Inheritance in Man): a database of human genes and genetic disorders
  ProbeSet: expression data (GEO) and microarray datasets

Entrez Taxonomy

Entrez OMIM

Entrez ProbeSet

Trace Archive

Entrez Structure

Structure Summary
  Cn3D viewer
  Related Structures
  Conserved Domains

Cn3D: Displaying Structures

Structural Alignment