NCBI Molecular Biology Resources

Presentation transcript:

NCBI Molecular Biology Resources A Field Guide August 2-3, 2005 University of Massachusetts

NCBI Resources
The NCBI Entrez System
NCBI Sequence Databases
  Primary data: GenBank
  Derivative data: RefSeq, Gene, Genome
  Beyond RefSeq: UniGene, Trace Archive
NCBI Genomic Resources
** Intermission **
BLAST
Protein Structure and Function
Sequence polymorphisms and phenotypes

The National Institutes of Health Bethesda, MD

The National Center for Biotechnology Information Created as a part of NLM in 1988 Establish public databases Perform research in computational biology Develop software tools for sequence analysis Disseminate biomedical information

Web Access Text: Entrez; Sequence: BLAST; Structure: VAST

NCBI Web Traffic Users per day (chart annotations: Christmas and New Year’s Day)

The NCBI ftp site 30,000 files per day 620 Gigabytes per day

What does NCBI do?
NCBI accepts submissions of primary data.
NCBI develops tools to analyze these data.
NCBI uses these tools to create derivative databases based on the primary data.
NCBI provides free search, link, and retrieval of these data, primarily through the Entrez system.

Types of Databases
Primary Databases
  Original submissions by experimentalists
  Content controlled by the submitter
  Examples: GenBank, SNP, GEO, PubChem Substance
Derivative Databases
  Built from primary data
  Content controlled by third party (NCBI)
  Examples: RefSeq, TPA, RefSNP, UniGene, Protein, Structure, Conserved Domain, PubChem Compound
Primary databases serve as repositories of sequences submitted by experimentalists (for example, GenBank). Derivative databases are sources of edited or curated sequences (for example, RefSeq, the reference sequences, and UniGene, gene-oriented clusters of sequences).

Primary vs. Derivative Databases (diagram): sequencing centers and labs submit to GenBank (divisions shown: EST, STS, GSS, HTG, INV, VRT, PHG, VRL, PRI, ROD, PLN, MAM, BCT), which is updated ONLY by submitters. Algorithms, curators, and the RefSeq pipelines (Annotation, LocusLink, Genomes) produce UniGene, UniSTS, and RefSeq, which are updated continually by NCBI.

What is Entrez? A system of 29 linked databases A text search engine A tool for finding biologically linked data A retrieval engine A virtual workspace for manipulating large datasets

The Entrez System: Text Searches

Entrez Databases
Each record is assigned a UID: a unique integer identifier for internal tracking (the GI number for Nucleotide).
Each record is given a Document Summary (DocSum): a summary of the record’s content.
Each record is assigned links to biologically related UIDs.
Each record is indexed by data fields: [author], [title], [organism], and many others.

Entrez Taxonomy The backbone of NCBI [organism]

An Entrez Database - Nucleotide
GenBank: Primary Data (97.9%)
  original submissions by experimentalists
  submitters retain editorial control of records
  archival in nature
RefSeq: Derivative Data (2.1%)
  curated by NCBI staff
  NCBI retains editorial control of records
  record content is updated continually

Entrez Nucleotide
Primary Data
  DDBJ / EMBL / GenBank     56,865,268
Derivative Data
  RefSeq                     1,226,084
  PDB                            5,973
  Third Party Annotation         4,650
Total                       58,101,975
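
The 97.9% and 2.1% shares quoted for GenBank and RefSeq can be checked against the record counts above. A minimal Python sketch of that arithmetic; the counts come from the slide, while the variable and function names are illustrative only:

```python
# Record counts as listed on the Entrez Nucleotide slide.
counts = {
    "DDBJ/EMBL/GenBank": 56_865_268,
    "RefSeq": 1_226_084,
    "PDB": 5_973,
    "Third Party Annotation": 4_650,
}

# Sum of all sources; this should match the "Total" row on the slide.
total = sum(counts.values())


def share(name: str) -> float:
    """Return the percentage of all Nucleotide records that come from one source."""
    return 100.0 * counts[name] / total


if __name__ == "__main__":
    print(f"total = {total:,}")
    for name in counts:
        print(f"{name:<24} {share(name):6.2f}%")
```

The computed total agrees with the slide, and the two largest shares round to the percentages quoted on the previous slide.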

What is GenBank?
NCBI’s Primary Sequence Database
  Nucleotide-only sequence database
  Archival in nature
  Each record is assigned a stable accession number
GenBank Data
  Direct submissions (traditional records)
  Batch submissions (EST, GSS, STS)
  ftp accounts (genome data)
Three collaborating databases
  GenBank
  DNA Database of Japan (DDBJ)
  European Molecular Biology Laboratory (EMBL) Database

The International Sequence Database Collaboration (diagram): GenBank (NCBI, NIH), EMBL (EBI), and DDBJ (CIB, NIG) each receive submissions and updates. Tools shown: Entrez, Sequin, BankIt, and ftp (NCBI); SRS (EBI); getentry (DDBJ).

GenBank Releases
ftp://ftp.ncbi.nih.gov/genbank/
Release 148, June 2005
  45,236,251 records
  49,398,852,122 nucleotides
  >140,000 species
  172 gigabytes, 785 files
Full release every two months
Incremental and cumulative updates daily
Available only through the internet
GenBank is treated like a software product, with full releases about every two months. It was originally distributed on CDs but eventually became much too large to fit, so an FTP site was set up to provide access to continually updated files.

The Growth of GenBank
Release 148: 45.2 million records, 49.4 billion nucleotides
Average doubling time ≈ 14 months*
* Doubling time is currently less than 1 year and still shrinking.
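
To make the doubling-time figure concrete, here is a small Python sketch that projects database size under a fixed doubling time. It assumes steady exponential growth, which is a simplification (the slide itself notes the doubling time is changing); the function names are hypothetical.

```python
import math


def projected_size(current: float, months_ahead: float,
                   doubling_months: float = 14.0) -> float:
    """Project a database size, assuming steady exponential growth."""
    return current * 2.0 ** (months_ahead / doubling_months)


def months_to_grow(factor: float, doubling_months: float = 14.0) -> float:
    """Months needed to grow by `factor` at a fixed doubling time."""
    return doubling_months * math.log2(factor)


if __name__ == "__main__":
    release_148_nt = 49_398_852_122  # nucleotides in Release 148 (from the slide)
    print(f"after 14 months: {projected_size(release_148_nt, 14):,.0f} nucleotides")
    print(f"months to grow 10-fold: {months_to_grow(10):.1f}")
```

With the default 14-month doubling time, one doubling period doubles the nucleotide count and a tenfold increase takes a little under four years.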

GenBank Divisions
Traditional
  PRI (28) Primate
  ROD (14) Rodent
  PLN (13) Plant and Fungal
  BCT (10) Bacterial/Archaeal
  INV (7) Invertebrate
  VRT (7) Other Vertebrate
  VRL (4) Viral
  MAM (2) Mammalian
  PHG (1) Phage
  SYN (1) Synthetic
  UNA (1) Unannotated
  Direct submissions (Sequin/BankIt)
  Accurate (~1 error per 10,000 bp)
  Well characterized
  Organized by taxonomy
Bulk
  EST (349) Expressed Sequence Tag
  GSS (120) Genome Survey Sequence
  HTG (62) High Throughput Genomic
  HTC (6) High Throughput cDNA
  STS (5) Sequence Tagged Site
  From sequencing projects
  Batch submissions (ftp/email)
  Inaccurate
  Poorly characterized
  Organized by sequence type

A Traditional GenBank Record: The Flatfile Format (Header, Feature Table, Sequence)
LOCUS       AY182241     1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3  (bases 1 to 1931)
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
                     LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
                     ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
                     EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
                     KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
     1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
     1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
     1921 aaaaaaaaaa a
//
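
The header of the record above is regular enough to split mechanically. The following Python sketch pulls the fields out of the LOCUS line; it assumes the eight whitespace-separated tokens seen in this record, and `parse_locus` is a hypothetical helper rather than part of any NCBI tool. Production parsers handle many more layout variations.

```python
def parse_locus(line: str) -> dict:
    """Split a GenBank LOCUS line into named fields.

    Assumes the layout of the example record:
    LOCUS <name> <length> bp <molecule> <topology> <division> <date>
    """
    tokens = line.split()
    if len(tokens) != 8 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError(f"unexpected LOCUS layout: {line!r}")
    return {
        "name": tokens[1],
        "length": int(tokens[2]),   # sequence length in base pairs
        "molecule": tokens[4],      # e.g. mRNA or DNA
        "topology": tokens[5],      # linear or circular
        "division": tokens[6],      # GenBank division code, e.g. PLN
        "date": tokens[7],          # date of last modification
    }


if __name__ == "__main__":
    locus = parse_locus("LOCUS AY182241 1931 bp mRNA linear PLN 04-MAY-2004")
    for field, value in locus.items():
        print(f"{field:<9} {value}")
```

The division code recovered here (PLN) is the same code used in the GenBank Divisions list.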

An Example Record: M17755
Indexing for Nucleotide, UID 4680720
Field                 Indexed Terms
[primary accession]   M17755
[title]               Homo sapiens thyroid peroxidase (TPO) mRNA…
[organism]            Homo sapiens
[sequence length]     3060
[modification date]   1999/04/26
[properties]          biomol mrna; gbdiv pri; srcdb genbank

M17755: Feature Table (annotated record): TPO [gene name]; thyroiditis [text word]; thyroid peroxidase [protein name]; callouts mark the CDS position in bp and the protein accession
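
Field indexing of this kind can be pictured as an inverted index from (field, term) pairs to UIDs. The Python sketch below is a toy model, not a description of how Entrez is implemented. The first record uses the indexed terms listed for M17755; the second uses values read off the AY182241 flat file shown earlier, and its [properties] terms are my inference from its LOCUS line.

```python
from collections import defaultdict

# UID -> field -> indexed terms. Only a few fields are modelled.
records = {
    4680720: {  # M17755, terms as listed on the slide
        "primary accession": ["M17755"],
        "organism": ["Homo sapiens"],
        "properties": ["biomol mrna", "gbdiv pri", "srcdb genbank"],
    },
    32265057: {  # AY182241; properties inferred from its LOCUS line
        "primary accession": ["AY182241"],
        "organism": ["Malus x domestica"],
        "properties": ["biomol mrna", "gbdiv pln", "srcdb genbank"],
    },
}


def build_index(recs: dict) -> dict:
    """Map each (field, lower-cased term) pair to the set of UIDs containing it."""
    index = defaultdict(set)
    for uid, fields in recs.items():
        for field, terms in fields.items():
            for term in terms:
                index[(field, term.lower())].add(uid)
    return index


index = build_index(records)


def lookup(term: str, field: str) -> set:
    """Return the UIDs indexed under term[field]; empty set if none."""
    return set(index.get((field, term.lower()), set()))


if __name__ == "__main__":
    # AND corresponds to set intersection of the two posting lists.
    print(lookup("biomol mrna", "properties") & lookup("Homo sapiens", "organism"))
```

In this model, the Boolean operators AND, OR, and NOT map onto set intersection, union, and difference.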

Sequence: 99.99% Accurate The sequence itself is not indexed… Use BLAST for that!

Entrez Protein
GenPept (DDBJ, EMBL, GenBank)   4,444,405
RefSeq                          1,753,167
PIR                               222,395
Swiss-Prot                        189,005
PDB                                68,621
PRF                                12,079
Third Party Annotation              4,219
Total                           6,693,891

Protein Sources and Links
PIR: no mRNA!
RefSeq → NM_000547
SWISS-PROT: no mRNA!
GenPept → M17755

Sequence Revisions
First seen at NCBI, not first seen at GenBank!
Version and GI change only if the sequence changes
The accession number always retrieves the most recent version
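
The accession/version rule can be expressed in a few lines. This Python sketch splits an identifier such as AY182241.2 into its accession and version and tests whether one identifier is a sequence update of another; both helper names are hypothetical.

```python
def split_accession_version(identifier: str):
    """Return (accession, version); version is None when no '.N' suffix is present."""
    text = identifier.strip()
    accession, dot, version = text.rpartition(".")
    if dot and version.isdigit():
        return accession, int(version)
    return text, None


def is_sequence_update(old: str, new: str) -> bool:
    """True when `new` has the same accession as `old` and a higher version."""
    old_acc, old_ver = split_accession_version(old)
    new_acc, new_ver = split_accession_version(new)
    if old_ver is None or new_ver is None:
        return False
    return old_acc == new_acc and new_ver > old_ver


if __name__ == "__main__":
    print(split_accession_version("AY182241.2"))
    print(split_accession_version("M17755"))
    print(is_sequence_update("AY182241.1", "AY182241.2"))
```

A bare accession (no version suffix) corresponds to asking for the most recent version, which is why the helper returns None for the version in that case.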

Update without a Sequence Change June 15, 1989! GenBank came to NCBI in 1992!

Update with a Sequence Change

GenBank File Formats: ASN.1 (the raw data), flat file, XML (4 flavors), FASTA

NCBI Toolbox
Toolbox Sources
ftp> open ftp.ncbi.nih.gov
.
ftp> cd toolbox
ftp> cd ncbi_tools
ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools
Excerpt from asn2ff.c:
/************************************************************************
*
* asn2ff.c
* convert an ASN.1 entry to flat file format, using the FFPrintArray.
**************************************************************************/
#include <accentr.h>
#include "asn2ff.h"
#include "asn2ffp.h"
#include "ffprint.h"
#include <subutil.h>
#include <objall.h>
#include <objcode.h>
#include <lsqfetch.h>
#include <explore.h>
#ifdef ENABLE_ID1
#include <accid1.h>
#endif
FILE *fpl;
Args myargs[] = {
  {"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},
  {"Input is a Seq-entry","F", NULL ,NULL ,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL},
  {"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL},
  {"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL},
  {"Show Sequence?","T", NULL ,NULL ,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL},

Text Searches in Entrez
Unfielded query: term1 term2
If no [limit] is specified…
  Organism? → [organism]
  Journal? → [journal]
  User compounds? → search as phrase
  Author? → [author]
  else → [All Fields]
Fielded query: term1[limit] OP term2[limit] OP …
  where limit = Entrez indexing field (organism, author, …)
  and OP = AND, OR, NOT
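
The fielded form `term1[limit] OP term2[limit]` is easy to assemble programmatically. A minimal Python sketch follows; `fielded` and `combine` are illustrative helpers, not part of any NCBI library.

```python
VALID_OPERATORS = {"AND", "OR", "NOT"}


def fielded(term: str, field: str) -> str:
    """Attach an Entrez field limit to a term, e.g. human[orgn]."""
    return f"{term}[{field}]"


def combine(operator: str, *parts: str) -> str:
    """Join already-formatted query parts with one Boolean operator."""
    op = operator.upper()
    if op not in VALID_OPERATORS:
        raise ValueError(f"operator must be one of {sorted(VALID_OPERATORS)}")
    return f" {op} ".join(parts)


if __name__ == "__main__":
    query = combine(
        "AND",
        fielded("thyroid peroxidase", "title"),
        fielded("human", "orgn"),
    )
    print(query)
```

The example reproduces one of the queries used later in the deck, `thyroid peroxidase[title] AND human[orgn]`.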

Entrez Tabs
Limits: Provides a simple form for applying commonly used Entrez limits
Preview/Index: Allows access to the full indexing of each Entrez database and aids in constructing complex queries
History: Provides access to previous searches in the current Entrez database
Clipboard: A temporary storage area for selected records
Details: Displays the detailed parsing of the current Entrez query, and lists errors and terms without matches

Programming Entrez: E-Utilities
http://www.ncbi.nih.gov/entrez/query/static/eutils_help.html
ESearch: Entrez query → UID list or History
ESummary: UID list or History → Document summaries
EFetch: UID list or History → Formatted data
ELink: UID list or History → UID list or History
EPost: UID list → History
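
As a rough illustration of the ESearch and EFetch steps, the Python sketch below only builds request URLs and does not contact NCBI. The base URL and the database name `nuccore` are assumptions taken from the current public E-utilities service rather than from the slide, and the helper functions are hypothetical; `db`, `term`, `usehistory`, `id`, `rettype`, and `retmode` are standard E-utilities parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: current public E-utilities endpoint (not shown on the slide).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(db: str, term: str, use_history: bool = False) -> str:
    """Build an ESearch URL: Entrez query in, UID list (or History) out."""
    params = {"db": db, "term": term}
    if use_history:
        params["usehistory"] = "y"
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


def efetch_url(db: str, uids, rettype: str, retmode: str = "text") -> str:
    """Build an EFetch URL: UID list in, formatted records out."""
    params = {
        "db": db,
        "id": ",".join(str(uid) for uid in uids),
        "rettype": rettype,
        "retmode": retmode,
    }
    return EUTILS_BASE + "efetch.fcgi?" + urlencode(params)


if __name__ == "__main__":
    url = esearch_url("nuccore", "thyroid peroxidase[title] AND human[orgn]",
                      use_history=True)
    print(url)
    # Decoding the query string recovers the original Entrez term.
    print(parse_qs(urlsplit(url).query)["term"][0])
    # UID 4680720 is the GI listed for M17755; "gb" requests the flat file.
    print(efetch_url("nuccore", [4680720], rettype="gb"))
```

Keeping URL construction separate from the network call makes it easy to inspect a request before sending it, and to respect NCBI's usage guidelines when the requests are eventually issued.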

Finding Primary Sequences
Search Entrez Nucleotide: 97.9% GenBank (primary data), 2.1% RefSeq (curated data)
Possible queries we’ve seen so far…
  M17755 [primary accession]
  TPO [gene name]
  thyroid peroxidase [title]
  thyroiditis [text word]
  Homo sapiens [organism]
  thyroid peroxidase [protein name]
  3060 [sequence length]
  1999/04/26 [modification date]
  biomol mrna [properties]
  gbdiv pri [properties]
  srcdb genbank [properties]

A Starting Query
Find nucleotide records for human thyroid peroxidase
Query: human thyroid peroxidase → 309 records
  Parsed as: (("Homo sapiens"[Organism] OR human[All Fields]) AND thyroid peroxidase[All Fields])
Field limit! Query: human[organism] AND thyroid peroxidase → 298 records
  Parsed as: ("Homo sapiens"[Organism] AND thyroid peroxidase[All Fields])
11 records aren’t human sequences!!

Limit by Title and Database
Entrez Nucleotide
  GenBank: srcdb ddbj/embl/genbank[properties]
  RefSeq: srcdb refseq[properties]
#1: thyroid peroxidase AND human[orgn]  298
#2: thyroid peroxidase[title] AND human[orgn]  169
#3: #2 AND srcdb refseq[properties]  5
#4: #2 AND srcdb ddbj/embl/genbank[properties]  164 (primary data)

Limit by GenBank Division
  EST Division: gbdiv est[prop]
  Primate Division: gbdiv pri[prop]
#1: thyroid peroxidase AND human[orgn]  298
#2: thyroid peroxidase[title] AND human[orgn]  169
#3: #2 AND srcdb refseq[properties]  5
#4: #2 AND srcdb ddbj/embl/genbank[properties]  164
#5: #4 AND gbdiv est[prop]  20
#6: #4 AND gbdiv pri[prop]  144 (traditional GenBank records)

Limit by Biomolecule Type
  Genomic DNA: biomol genomic[prop]
  cDNA: biomol mrna[prop]
#1: thyroid peroxidase AND human[orgn]  298
#2: thyroid peroxidase[title] AND human[orgn]  169
#3: #2 AND srcdb refseq[properties]  5
#4: #2 AND srcdb ddbj/embl/genbank[properties]  164
#5: #4 AND gbdiv est[prop]  20
#6: #4 AND gbdiv pri[prop]  144
#7: #6 AND biomol genomic[prop]  26 (genomic DNA)
#8: #6 AND biomol mrna[prop]  118 (mRNA / cDNA)

Limit by Protein Name
thyroid peroxidase[protein name] AND human[orgn] AND gbdiv pri[prop] AND biomol mrna[prop]
118 records with [title] → 4 records with [protein name]
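
Each history entry that mentions `#N` is shorthand for the full query behind step N. The Python sketch below expands those references recursively and checks that the record counts partition as expected. It treats steps 5 and 6 as refinements of #4, which is my reading of the counts (20 + 144 = 164); the `expand` helper is hypothetical.

```python
import re

# Search history: step number -> (query text, record count from the slides).
history = {
    1: ("thyroid peroxidase AND human[orgn]", 298),
    2: ("thyroid peroxidase[title] AND human[orgn]", 169),
    3: ("#2 AND srcdb refseq[properties]", 5),
    4: ("#2 AND srcdb ddbj/embl/genbank[properties]", 164),
    5: ("#4 AND gbdiv est[prop]", 20),
    6: ("#4 AND gbdiv pri[prop]", 144),
    7: ("#6 AND biomol genomic[prop]", 26),
    8: ("#6 AND biomol mrna[prop]", 118),
}


def expand(step: int) -> str:
    """Replace every #N reference with the parenthesised query of step N."""
    def substitute(match: re.Match) -> str:
        return "(" + expand(int(match.group(1))) + ")"

    query, _count = history[step]
    return re.sub(r"#(\d+)", substitute, query)


def count(step: int) -> int:
    """Return the record count reported for a history step."""
    return history[step][1]


if __name__ == "__main__":
    print(expand(8))
    # Each pair of sibling limits splits its parent set without overlap.
    print(count(3) + count(4) == count(2))
    print(count(5) + count(6) == count(4))
    print(count(7) + count(8) == count(6))
```

Expanding step 8 yields a single self-contained query with no history references, which is the form needed when a search is run outside the original session.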

Entrez Document Summaries Links menu Click the accession to view the record Links to other Entrez databases computed for M17755

Entrez Links for GI 4680720 Gene annotation based on M17755 Full text online articles about M17755 All polymorphisms in the TPO gene DNA/RNA sequences similar to M17755 Graphical view of TPO gene annotation Human phenotypes involving TPO Microarray datasets for M17755 Protein translation of M17755 Literature abstracts about M17755 Sequence polymorphisms in M17755 Source organism of M17755 STS markers in the TPO gene TPO links beyond NCBI

Viewing M17755

GenBank Sequences for Human TPO Which one is the best sequence???

RefSeq: NCBI’s Derivative Sequence Database
RefSeq Benefits
  Non-redundant
  Explicitly linked nucleotide and protein sequences
  Updated to reflect current sequence data and biology
  Validated by hand
  Format consistency
  Distinct accession series
  Stewardship by NCBI staff and collaborators
ftp://ftp.ncbi.nih.gov/refseq/release

RefSeq: NCBI’s Derivative Sequence Database
Accession series (Nucleotide → Protein)
Curated transcripts and proteins
  NM_123456 → NP_123456
  NR_123456 (non-coding RNA)
Model transcripts and proteins
  XM_123456 → XP_123456
  XR_123456 (non-coding RNA)
Assembled Genomic Regions (contigs)
  NT_123456 (BAC clones)
  NW_123456 (WGS)
Other Genomic Sequence
  NG_123456 (complex regions, pseudogenes)
  NZ_ABCD12345678 (WGS) → ZP_123456
Chromosome records in Entrez Genome
  NC_123456 (chromosome; microbial or organelle genome)

Creating NM Records NMs must have cDNA support Genome annotation Longest mRNA
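
Because each RefSeq series has a distinct prefix, the kind of record can be read off the accession alone. A small Python sketch follows; the table mirrors the series listed above, and `refseq_category` is a hypothetical helper name.

```python
# Two-letter RefSeq prefixes and the record types listed on the slide.
REFSEQ_PREFIXES = {
    "NM": "curated transcript (mRNA)",
    "NP": "curated protein",
    "NR": "curated non-coding RNA",
    "XM": "model transcript (mRNA)",
    "XP": "model protein",
    "XR": "model non-coding RNA",
    "NT": "assembled genomic contig (BAC clones)",
    "NW": "assembled genomic contig (WGS)",
    "NG": "other genomic sequence (complex regions, pseudogenes)",
    "NZ": "other genomic sequence (WGS)",
    "ZP": "protein annotated on an NZ_ record",
    "NC": "chromosome; microbial or organelle genome",
}


def refseq_category(accession: str):
    """Return the RefSeq record type for an accession, or None if it is not RefSeq-style."""
    prefix, underscore, _rest = accession.strip().partition("_")
    if not underscore:
        return None  # no underscore: e.g. a GenBank accession such as M17755
    return REFSEQ_PREFIXES.get(prefix.upper())


if __name__ == "__main__":
    for acc in ("NM_000547", "XM_496543", "NC_000002", "M17755"):
        print(f"{acc:<10} {refseq_category(acc)}")
```

The underscore is what separates RefSeq accessions from GenBank accessions, so a record such as M17755 is reported as not being a RefSeq.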

NM/NP Records in Entrez (Nucleotide and Protein)
NM_000547: variant 1
  COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The reference sequence was derived from M17755.2 and AW874082.1. On Feb 25, 2003 this sequence version replaced gi:21361188.
  Callout: EST that completes 3’ end
NM_175719: variant 2
  COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The reference sequence was derived from J02970.1, AW874082.1 and M17755.2.

Annotating the Gene (diagram): Genomic DNA (NC, NT, NW) is scanned to give model mRNA (XM, XR) and model protein (XP); curated mRNA (NM, NR) and curated protein (NP) are also shown. RefSeq records are checked against GenBank sequences.

Entrez Gene and RefSeq (diagram labels: Gene, GenBank, RefSeq, Nucleotide)
Entrez Gene is the central depository for information about a gene available at NCBI, and often provides links to sites beyond NCBI.
Entrez Gene includes records for organisms that have NCBI Reference Sequences (RefSeqs).
Entrez Gene records contain RefSeq mRNAs, proteins, and genomic DNA (if known) for a gene locus, plus links to other Entrez databases.
NCBI RefSeqs are based on primary sequence data in GenBank.

Entrez Gene: RefSeq Annotations

NM/NP Records in Entrez Gene

Entrez Gene RefSeq Graphics NM NP

What about LOC440844? Entrez Gene

BLAST Results for XM_496543 Is there any GenBank support for this mRNA? srcdb ddbj/embl/genbank[prop] AND biomol mrna[prop] no full-length hit

The Perils of the XM
XM records are models based only on genomic sequence, and are subject to revision or removal with each new build of that genome.
BLAST the XM against the RefSeq database to look for a replacement:
Query= gi|20850420|ref|XM_124429.1| Mus musculus expressed sequence AA553001 (AA553001), mRNA
gi|19527087|ref|NM_133873.1| Mus musculus DNA segment, Chr 4, Wayne State University 114, expressed (D4Wsu114e), mRNA
Length=1898
Score = 3701.55 bits (1867), Expect = 0
Identities = 1870/1871 (99%), Gaps = 0/1871 (0%)
Strand=Plus/Plus
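
When screening many XM records this way, the identity line of each hit can be read programmatically. The Python sketch below extracts the counts from an "Identities = a/b" line like the one above; it is a plain regular-expression helper with a hypothetical name, not a full BLAST report parser.

```python
import re

# Matches the identical/aligned counts in a BLAST "Identities = a/b (..%)" line.
IDENTITIES = re.compile(r"Identities\s*=\s*(\d+)/(\d+)")


def identity_counts(blast_text: str):
    """Return (identical positions, aligned positions) from a BLAST hit summary."""
    match = IDENTITIES.search(blast_text)
    if match is None:
        raise ValueError("no Identities line found")
    identical, aligned = (int(group) for group in match.groups())
    return identical, aligned


def percent_identity(blast_text: str) -> float:
    """Return the exact percent identity rather than BLAST's whole-number figure."""
    identical, aligned = identity_counts(blast_text)
    return 100.0 * identical / aligned


if __name__ == "__main__":
    hit = ("Score = 3701.55 bits (1867), Expect = 0 "
           "Identities = 1870/1871 (99%), Gaps = 0/1871 (0%)")
    print(identity_counts(hit))
    print(f"{percent_identity(hit):.2f}%")
```

For this hit a single mismatch over 1871 aligned bases separates the model from the curated NM record.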

Eukaryotic NM/XM Records Bos taurus: 37541 Oryza sativa (japonica cultivar-group): 36836 Danio rerio: 30577 Homo sapiens: 29261 Arabidopsis thaliana: 28953 Mus musculus: 27033 Rattus norvegicus: 23975 Pan troglodytes: 21810 Caenorhabditis elegans: 21124 Drosophila melanogaster: 19412 Aspergillus nidulans FGSC A4: 18951 Gallus gallus: 18120 Canis familiaris: 16891 Anopheles gambiae str. PEST: 15328 Plasmodium chabaudi: 14747 Candida albicans SC5314: 13672 Dictyostelium discoideum: 13570 Ustilago maydis 521: 13044 Plasmodium berghei: 11778 Gibberella zeae PH-1: 11640 Magnaporthe grisea 70-15: 11109 Neurospora crassa: 10079 Aspergillus fumigatus Af293: 9923 Entamoeba histolytica HM-1:IMSS: 9772 Cryptococcus neoformans var. neoformans JEC21: 6594 Giardia lamblia ATCC 50803: 6569 Yarrowia lipolytica CLIB99: 6521 Debaryomyces hansenii CBS767: 6318 Apis mellifera: 6292 Kluyveromyces lactis NRRL Y-1140: 5327 Candida glabrata CBS138: 5181 Schizosaccharomyces pombe 972h-: 5035 Eremothecium gossypii: 4718 Theileria parva: 4079 Xenopus tropicalis: 4069 Cryptosporidium hominis: 3886 Cryptosporidium parvum: 3396 Sus scrofa: 938 Trypanosoma brucei: 599 Ovis aries: 253 Strongylocentrotus purpuratus: 215 Felis catus: 162 Plasmodium yoelii yoelii: 105 Takifugu rubripes: 7 Ciona intestinalis: 3 Trypanosoma cruzi: 3

Genome Annotation in Entrez Nucleotide (diagram): GenBank components (clones, WGS) → NT/NW contigs → NC genome assembly, annotated with NM/XM master mRNAs

Genome Annotation Links curated mRNA genomic contig on human chromosome 2 containing NM_000547 human chromosome 2 the 21 contigs of the chromosome 2 assembly

Getting the Annotation Details Genomic sequence ACCESSION NC_000002 REGION: 1396242..1525502

Getting the Annotation Details ACCESSION NC_000002 REGION: 1396242..1525502 exon-intron structure These flat files contain all annotations in the gene and the full, explicit sequence
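
The REGION line uses the same `start..stop` notation as feature locations in a flat file, with both ends included. A short Python sketch for reading such a span and computing its length; the helper names are illustrative, and compound locations such as `join(...)` or `complement(...)` are deliberately not handled.

```python
def parse_region(region: str):
    """Parse a simple 'start..stop' span into 1-based inclusive integers."""
    start_text, dots, stop_text = region.strip().partition("..")
    if not dots:
        raise ValueError(f"not a start..stop span: {region!r}")
    start, stop = int(start_text), int(stop_text)
    if start > stop:
        raise ValueError("start must not exceed stop")
    return start, stop


def region_length(region: str) -> int:
    """Number of bases covered; both endpoints are part of the span."""
    start, stop = parse_region(region)
    return stop - start + 1


if __name__ == "__main__":
    # Region of NC_000002 shown on the slide.
    print(region_length("1396242..1525502"))
    # CDS of AY182241 from the flat-file example: a whole number of codons.
    cds = region_length("54..1784")
    print(cds, cds % 3 == 0)
```

The `+ 1` matters: because coordinates are inclusive, the AY182241 coding region 54..1784 spans 1731 bases, which is exactly 577 codons (576 residues plus the stop).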

Searching Entrez Gene
Gene symbol: human thyroid peroxidase (TPO)
  tpo[sym] AND human[organism]
Protein name: topoisomerase genes from Archaea
  topoisomerase[gene/protein name] AND archaea[organism]
Chromosome and Links: genes on human chromosome 2 with OMIM links
  2[chromosome] AND gene omim[filter] AND human[organism]
RefSeq status and variants: Reviewed RefSeqs with transcript variants
  srcdb refseq reviewed[prop] AND has transcript variants[prop]
Disease and Gene Ontology: Membrane proteins linked to cancer
  integral to plasma membrane[gene ontology] AND cancer[dis]

Gene Links in Entrez Microarray datasets for TPO Gene homologs for TPO DNA and RNA sequences for TPO Phenotypes involving TPO Protein sequences for TPO Literature abstracts about TPO Sequence polymorphisms in TPO Species whose genome has this TPO gene STS markers in the TPO gene ESTs aligned to the TPO gene

Third Party Annotation (TPA) Database NCBI now accepts the submission of new annotations of existing GenBank sequences. Submissions must be published in a peer-reviewed journal. Facilitates the annotation of sequences by experts. Examples of sequences appropriate for TPA are: Annotation of features on gene and/or mRNA sequences Assembled “full length” genes and/or mRNAs What should not be submitted to TPA? Synthetic constructs (such as cloning vectors) that use well-characterized, publicly available genes, promoters, or terminators Updates or changes to existing sequence data Sequence annotations without experimental evidence

Beyond RefSeq If your organism does not have RefSeqs… UniGene : gene-based clusters of cDNAs and ESTs WGS sequences in Entrez Nucleotide (wgs[prop]) Trace Archive

What is UniGene?
A gene-oriented view of sequence entries
  MegaBlast-based automated sequence clustering
  New! Now informed by genome hits
Nonredundant set of gene-oriented clusters
  Each cluster a unique gene
  Information on tissue types and map locations
  Includes known genes and uncharacterized ESTs
  Useful for gene discovery and selection of mapping reagents
Clusters of ESTs are built by automated sequence similarity. Each cluster represents a gene.

Organisms in UniGene Top Ten 1. Human 2. Rice 3. Mouse 4. Cow 5. Wheat 6. Zebrafish 7. Pig 8. Chicken 9. Frog (X. laevis) 10. Frog (X. tropicalis)

Finding UniGene Clusters by link by Entrez search

UniGene Cluster for TPO

Entrez GEO and Entrez GEO Datasets
GPL: Platform descriptions
GSM: Raw/processed spot intensities from a single slide/chip
GSE: Grouping of slide/chip data, “a single experiment”
GDS: Grouping of experiments
Legend: Submitted by Experimentalists; Submitted by Manufacturer*; Curated by NCBI

Linking to GEO

GEO Datasets

Whole Genome Shotgun Projects
300+ projects: Viruses, Bacteria, Environmental sequences, Archaea
73 Eukaryotes featuring:
  Cow, Chicken, Rat, Mouse, Dog, Chimpanzee, Human
  Pufferfish (2), Zebrafish
  Honeybee, Anopheles, Fruit Flies (4), Silkworm
  Nematode (C. briggsae)
  Yeasts (9), Aspergillus (3)
  Rice
WGS is a preliminary route to a whole genome. WGS sequences go into the traditional GenBank divisions.

Trace Archive

Short-tailed opossum traces

Viewing Simple Genomes All are RefSeq NC records in Entrez Genome Full chromosomal sequences are provided Genes are annotated The annotation can be shown graphically and linked to sequence records

mutL

Viewing Complex Genomes: NCBI Map Viewer
Map Viewer Home Page
  Shows all supported organisms
  Provides links to genomic BLAST
Genome Overview Page
  Provides links to individual chromosomes
  Shows hits on a genome graphically
Chromosome Viewing Page
  Allows interactive views of annotation details
  Provides numerous maps unique to each genome

Map Viewer Home Page

Genome Overview Page Search the maps Genomic BLAST Species-specific help!

Chromosome Viewing Page Map Summary Add or remove maps Master Map with exploded content Genes UniGene Contigs Zooming Controls Ideogram

Map Summary TPO’s contig!

Map Content
Map content varies greatly by species!
Sequence Maps: Core assembly; Annotation evidence; Clones & Markers; Polymorphisms; Links & Features
Genetic Maps: Cytogenetic maps; Linkage maps; Radiation hybrid maps
Maps shown: Assembly, Contig, Component, Transcript, Gene

View the Assembly near TPO

Assembly of Chr. 2 NT_033000 1255072 1563756

Assembly of Chromosome 2

Zooming

View of TPO Links to Entrez Nucleotide Links to Entrez Gene Links to Tools and Data Gap in assembly

Map Content
Map content varies greatly by species!
Sequence Maps: Core assembly; Annotation evidence; Clones & Markers; Polymorphisms; Links & Features
Genetic Maps: Cytogenetic maps; Linkage maps; Radiation hybrid maps
Maps shown: Ab initio (model), GenBank DNA, EST, UniGene, Gene

Annotation Evidence GenBank records not used in assembly UniGene Clusters Ab initio models Aligned ESTs

Entrez Homologene Homologs by protein BLAST