NCBI FieldGuide NCBI Molecular Biology Resources January 12, 2007 A Field Guide Part 1.



NCBI Resources
- The NCBI Entrez System
- NCBI Sequence Databases
  - Primary data: GenBank
  - Derivative data: RefSeq, Gene
- Protein Structure and Function
- Sequence polymorphisms and phenotypes
** Intermission **
- NCBI Genomic Resources
- BLAST

The National Center for Biotechnology Information
Created in 1988 as a part of the National Library of Medicine at NIH (Bethesda, MD)
- national resource for molecular biology information (biological information direct from organisms)
- gather data both nationally and internationally
- develop new information technologies to aid in the understanding of fundamental molecular and genetic processes that control health and disease

Data sources: traditional literature and data obtained from the direct study of organisms
The information landscape in biological and medical research has grown far beyond literature to include a wide variety of databases generated by research fields such as molecular biology and genomics.
Figure 1 from Geer RC. Broad issues to consider for library involvement in bioinformatics. J Med Libr Assoc Jul; 94(3):286–98, E-152–5.
NCBI:
- accepts submissions of bibliographic records and primary research data (for example, the nucleotide sequence for the colon cancer gene MLH1)
- organizes the information into databases, maintains them, and makes them available to the world
- develops software to retrieve and analyze the data
- conducts basic research to make new biological discoveries using the databases and software tools

What does NCBI do?
- NCBI accepts submissions of primary data
- NCBI develops tools to analyze these data
- NCBI uses these tools to create derivative databases based on the primary data
- NCBI provides free search, link, and retrieval of these data, primarily through the Entrez system

Web Access
[Diagram labels: query types Text, Sequence, Protein Structure, Small Mol. Structure; services Entrez, BLAST, VAST, PubChem]

The NCBI ftp site
- 30,000 files per day
- 620 Gigabytes per day

Help for Programmers
NCBI Toolbox: in-house source code useful for incorporating NCBI-like functionality into your own programs.
- Three main parts: Data Model, Data Encoding and Programming Libraries
- Examples: BLAST, Cn3D, Sequin, data format conversion scripts
E-Utilities: guidelines for Entrez “URL calls” used to access data. Designed for use in scripts.
- Examples: ESearch, EPost, ESummary, EFetch and ELink
- Caution: overuse may result in blocked IPs!
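
The "URL call" idea and the overuse caution can be sketched in Python. This is a minimal illustration, not NCBI code: the base URL is the public E-utilities endpoint, while `eutils_url`, `Throttle`, and the 0.34-second default interval (roughly three requests per second) are assumptions made for the example. The `[primary accession]` field name is taken from the indexing slide later in this deck.

```python
import time
from urllib.parse import urlencode

# Public E-utilities endpoint; every utility is a .fcgi script beneath it.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def eutils_url(utility, **params):
    """Build the URL for one E-utility call, e.g. utility='esearch'."""
    return f"{EUTILS_BASE}/{utility}.fcgi?{urlencode(params)}"


class Throttle:
    """Enforce a minimum delay between successive requests (hypothetical helper)."""

    def __init__(self, min_interval=0.34, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # assumed polite spacing, in seconds
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Example: an ESearch call for one accession in the Nucleotide database.
url = eutils_url("esearch", db="nucleotide",
                 term="M17755[primary accession]", retmax=5)
```

A script would call `Throttle.wait()` before each request so that a loop over many queries cannot flood the server.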

Global Entrez Search Page
All[Filter]

What is Entrez?
- A system of 31 linked databases
- A text search engine
- A tool for finding biologically linked data
- A retrieval engine
- A virtual workspace for manipulating large datasets

Entrez Databases
- Each record is assigned a UID
  - unique integer identifier for internal tracking
  - GI number for Nucleotide
- Each record is given a Document Summary
  - a summary of the record’s content (DocSum)
- Each record is assigned links to biologically related UIDs
- Each record is indexed by data fields
  - [author], [title], [organism], and many others

Linking in Entrez
Follow links to related data in the same database or in others!
Hard Links: curated links based on biology
- nucleotide -> taxonomy (based on organism identifier)
- protein -> domain relatives (based on domain assignment)
- domains -> pubmed (based on supporting literature)
- pcsubstance -> structures/mmdb (based on source information)
Soft Links: pre-computed analyses
- nucleotide -> related sequences (BLAST neighbors)
- protein -> conserved domains (CDD/RPS-BLAST search)
- pccompound -> pccompound (structure-based neighboring)

Entrez: Database Integration
[Diagram labels: PubMed abstracts (neighbors: related articles by word weight); Nucleotide sequences (neighbors: related sequences by BLAST); Protein sequences (neighbors: related sequences by BLAST; BLink, Domains); 3-D Structure (neighbors: related structures by VAST); Genomes; Taxonomy (phylogeny); hard links between databases]

Links: Database Integration at NCBI
[Link matrix among Gene, Nucleotide, Protein, Structure, CDD, SNP, Taxonomy, PubMed and HomoloGene; cells listed row by row]
- From Gene: mRNAs; genome, All CDS products, Protein Function, SNPs; indels, Source organism, Literature
- From Nucleotide: Gene locus, BLASTn, CDS product, 3D DNA, 3D RNA, SNPs; indels, Source organism, Literature
- From Protein: Gene locus, cDNA transcript, BLASTp, 3D proteins, Function, SNPs; indels, Source organism, Literature
- From Structure: DNA sequence, Protein sequence, VAST, Protein Function, SNP BLASTp, Source organism, Literature
- From CDD: Gene loci, Proteins with CD, 3D templates, CDART, Broadest taxon, Literature
- From SNP: Gene locus, DNA sequence, Protein sequence, 3D template, Source organism, Literature
- From Taxonomy: Genes for taxon, Seqs for taxon, Structs for taxon, CD spans Taxon, SNPs for taxon, Common Tree
- From PubMed: Gene loci in article, Sequence in article, Structure in article, CDs in article, SNPs in article, Related articles

Types of Databases
Primary Databases
- Original submissions by experimentalists
- Content controlled by the submitter
- Examples: GenBank, dbSNP, GEO, PubChem Substance and PubChem Bioassays
Derivative Databases
- Built from primary data
- Content controlled by third party (NCBI)
- Examples: RefSeq, RefSNP, GEO Datasets, PubChem Compound

An Entrez Database - Nucleotide
GenBank: Primary Data (98.2%)
- original submissions by experimentalists
- submitters retain editorial control of records
- archival in nature
RefSeq: Derivative Data (1.8%)
- curated by NCBI staff
- NCBI retains editorial control of records
- record content is updated continually

Literature Databases

NM_000249: PubMed Books

Books Link

A part of the NCBI Bookshelf
- Part 1. The Databases
- Part 2. Data Flow and Processing
- Part 3. Querying and Linking the Data
- Part 4. User Support

PubMed Central PubMed Central is a digital archive of life sciences journal literature. Integrated into the Entrez retrieval system, PMC provides free and unrestricted access to the full text of over 160 life sciences journals, with more to come.

NCBI Journal Database
Detailed journal information

OMIM
- A catalogue of genes involved with human disease processes
- Detailed clinical and reference information
- Curated and maintained by Johns Hopkins
- Links to PubMed and sequence databases

Primary vs. Derivative Databases
[Diagram labels: Sequencing Centers and Labs submit to GenBank, updated ONLY by submitters (divisions EST, STS, GSS, HTG, PRI, ROD, PLN, MAM, BCT, INV, VRT, PHG, VRL); UniGene, UniSTS, RefSeq: Gene and Genomes Pipelines, and RefSeq: Annotation Pipeline (Curators, Algorithms) are updated continually by NCBI]

What is GenBank? NCBI’s Primary Sequence Database
- Nucleotide-only sequence database
- Archival in nature
- Each record is assigned a stable accession number
GenBank Data
- Direct submissions (traditional records)
- Batch submissions (EST, GSS, STS)
- ftp accounts (genome data)
Three collaborating databases
- GenBank
- DNA Database of Japan (DDBJ)
- European Molecular Biology Laboratory (EMBL) Database

The International Sequence Database Collaboration
[Diagram labels: GenBank (NCBI, NIH; retrieval through Entrez; submissions and updates via Sequin, BankIt, ftp); DDBJ (NIG, CIB; retrieval through getentry; submissions and updates); EMBL (EBI; retrieval through SRS; submissions and updates)]

GenBank Releases
- full release every two months
- incremental and cumulative updates daily
- available only through the internet: ftp://ftp.ncbi.nih.gov/genbank/
Release 156 (non-WGS), October: >150,000 species, 245 Gigabytes, 1032 files

The Growth of GenBank
Release 152: Non-WGS 59.8 billion bases; WGS 63.2 billion bases

GenBank Divisions
Traditional: direct submissions (Sequin/BankIt); accurate (~1 error per 10,000 bp); well characterized; organized by taxonomy
- PRI Primate
- ROD Rodent
- PLN Plant and Fungal
- BCT Bacterial/Archaeal
- VRT Other Vertebrate
- INV Invertebrate
- VRL Viral
- MAM Mammalian
- PHG Phage
- SYN Synthetic
- UNA Unannotated
Bulk: from sequencing projects; batch submissions (ftp/ ); inaccurate; poorly characterized; organized by sequence type
- EST Expressed Sequence Tag
- GSS Genome Survey Sequence
- HTG High Throughput Genomic
- PAT Patent sequences
- STS Sequence Tagged Site
- HTC High Throughput cDNA
- CON Constructed entries
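
The division table lends itself to a small lookup. The sketch below is illustrative only: the codes and names come from the slide, while `describe_division` and `locus_division` are hypothetical helpers, and reading the division as the token just before the date on the LOCUS line is an assumption based on the example record in this deck.

```python
# Division codes as listed on the slide, split into the two groups it describes.
TRADITIONAL = {
    "PRI": "Primate", "ROD": "Rodent", "PLN": "Plant and Fungal",
    "BCT": "Bacterial/Archaeal", "VRT": "Other Vertebrate",
    "INV": "Invertebrate", "VRL": "Viral", "MAM": "Mammalian",
    "PHG": "Phage", "SYN": "Synthetic", "UNA": "Unannotated",
}
BULK = {
    "EST": "Expressed Sequence Tag", "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genomic", "PAT": "Patent sequences",
    "STS": "Sequence Tagged Site", "HTC": "High Throughput cDNA",
    "CON": "Constructed entries",
}


def describe_division(code):
    """Return (group, name) for a three-letter GenBank division code."""
    code = code.upper()
    if code in TRADITIONAL:
        return ("traditional", TRADITIONAL[code])
    if code in BULK:
        return ("bulk", BULK[code])
    raise KeyError(f"unknown GenBank division: {code}")


def locus_division(locus_line):
    """Return the division code, assumed to be the token just before the date."""
    return locus_line.split()[-2]


division = locus_division("LOCUS AY bp mRNA linear PLN 04-MAY-2004")
group, name = describe_division(division)
```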

Entrez Nucleotide Subsets
CoreNucleotide, EST, GSS, TOTAL

A Traditional GenBank Record: The Flatfile Format

Header

LOCUS       AY bp mRNA linear PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY
VERSION     AY GI:
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1 (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, (2004)
REFERENCE   2 (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3 (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:

Feature Table

FEATURES             Location/Qualifiers
     source          /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            /gene="AFS1"
     CDS             /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO "
                     /db_xref="GI: "
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
                     LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
                     ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
                     EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
                     KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"

Sequence

ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
          ...
     1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
     1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
     1921 aaaaaaaaaa a
//
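
The three-part layout of a flatfile (header, feature table, sequence) can be recovered with a few lines of Python. This is a rough sketch rather than a full GenBank parser: `split_flatfile` and `sequence_string` are hypothetical helpers, and `TOY_RECORD` is an invented miniature record (its accession `XX000000` is not real) used only to exercise the code.

```python
def split_flatfile(record):
    """Split one flatfile record into header, feature table and sequence lines.

    The header runs up to the FEATURES line, the feature table up to the
    ORIGIN line, and the sequence up to the terminating '//' line.
    """
    header, features, sequence = [], [], []
    current = header
    for line in record.splitlines():
        if line.startswith("FEATURES"):
            current = features
        elif line.startswith("ORIGIN"):
            current = sequence
        elif line.startswith("//"):
            break
        current.append(line)
    return {"header": header, "features": features, "sequence": sequence}


def sequence_string(sequence_lines):
    """Join ORIGIN lines into one string, dropping position numbers and spaces."""
    parts = []
    for line in sequence_lines[1:]:      # skip the ORIGIN line itself
        parts.extend(line.split()[1:])   # skip the leading position number
    return "".join(parts)


# Invented toy record for illustration; not a real GenBank entry.
TOY_RECORD = """\
LOCUS       XX000000     20 bp    mRNA    linear   PLN 01-JAN-2000
DEFINITION  Hypothetical toy record.
FEATURES             Location/Qualifiers
     source          1..20
ORIGIN
        1 ttcttgtatc ccaaacatct
//
"""

parts = split_flatfile(TOY_RECORD)
dna = sequence_string(parts["sequence"])
```

For real work, an established parser is a safer choice than hand-rolled line splitting, since qualifiers and continuation lines have more cases than this sketch handles.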

An Example Record: M17755
Indexing for Nucleotide UID

Field                  Indexed Terms
[primary accession]    M17755
[title]                Homo sapiens thyroid peroxidase (TPO) mRNA…
[organism]             Homo sapiens
[sequence length]      3060
[modification date]    1999/04/26
[properties]           biomol mrna; gbdiv pri; srcdb genbank

M17755: Feature Table
- CDS position in bp
- TPO [gene name]
- thyroiditis [text word]
- thyroid peroxidase [protein name]
- protein accession

Sequence: 99.99% Accurate
The sequence itself is not indexed… Use BLAST for that!
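
The 99.99% figure is the same statement as the "~1 error per 10,000 bp" quoted for traditional GenBank divisions. A short calculation shows what that means for a record the size of M17755 (3060 bp, from the indexing slide): roughly 0.3 expected errors. The helper names are hypothetical, and treating errors as independent per base is a simplifying assumption of this sketch, not a claim from the slides.

```python
def expected_errors(length_bp, accuracy=0.9999):
    """Expected number of erroneous bases in a sequence of the given length."""
    return length_bp * (1.0 - accuracy)


def prob_error_free(length_bp, accuracy=0.9999):
    """Chance that every base is correct, assuming independent per-base errors."""
    return accuracy ** length_bp


errors_m17755 = expected_errors(3060)       # about 0.3 bases
clean_m17755 = prob_error_free(3060)        # a bit under three in four
```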

Entrez Protein
Sources: GenPept (DDBJ, EMBL, GenBank); RefSeq; Swiss Prot; PDB; PIR; PRF; Third Party Annotation 4969; Total

Protein Sources and Links
[Screenshot labels: PIR, RefSeq, SWISS-PROT, GenPept; -> NM_; -> M17755; no mRNA!]

Sequence Revisions
- Version and GI change only if the sequence changes
- The accession number always retrieves the most recent version
- First seen at NCBI, not first seen at GenBank!
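
The accession.version convention behind these rules is easy to handle in code. The sketch below is illustrative: `split_accession` and `latest_versions` are hypothetical helpers, and the version numbers in the example list are made up for the demonstration rather than taken from the actual records.

```python
def split_accession(accver):
    """Split 'ACCESSION.VERSION' into (accession, version or None)."""
    accession, dot, version = accver.partition(".")
    return accession, int(version) if dot else None


def latest_versions(accvers):
    """Map each accession to the highest version seen in the input."""
    latest = {}
    for accver in accvers:
        accession, version = split_accession(accver)
        if version is not None and version > latest.get(accession, 0):
            latest[accession] = version
    return latest


# Version numbers here are invented for illustration.
seen = ["M17755.1", "M17755.2", "NM_000547.3", "NM_000547"]
newest = latest_versions(seen)
```

A bare accession (no ".version" suffix) is skipped by `latest_versions`, mirroring the rule that the accession alone always points at whatever the most recent version is.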

Update without a Sequence Change
June 15, 1989! GenBank came to NCBI in 1992!

Update with a Sequence Change

GenBank File Formats
- ASN.1 (the raw data)
- XML
- FASTA
- flat file

NCBI Toolbox

/************************************************************************
*
* asn2ff.c
* convert an ASN.1 entry to flat file format, using the FFPrintArray.
*
**************************************************************************/
#include
#include "asn2ff.h"
#include "asn2ffp.h"
#include "ffprint.h"
#include
#ifdef ENABLE_ID1
#include
#endif

FILE *fpl;

Args myargs[] = {
    {"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},
    {"Input is a Seq-entry","F", NULL,NULL,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL},
    {"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL},
    {"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL},
    {"Show Sequence?","T", NULL,NULL,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL},

Toolbox Sources
ftp> open ftp.ncbi.nih.gov
ftp> cd toolbox
ftp> cd ncbi_tools
ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools

Text Queries in Entrez
term1[limit] OP term2[limit] OP …
where
- limit = Entrez indexing field (organism, author, …)
- OP = Boolean operator = AND, OR, NOT
Complex queries: ((A[limit1] OR B[limit2]) AND C[limit3]) NOT D[limit4]
Ranges: 1:200[MW]
Wildcards: cancer[title] vs. cancer*[title]
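
The query grammar on this slide can be composed programmatically. The following is a small sketch under stated assumptions: `term`, `combine`, and `value_range` are hypothetical helpers that only build query strings in the syntax shown above; they do not validate field names against any Entrez database.

```python
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def term(text, field=None):
    """Return 'text[field]', or just 'text' when no field limit is given."""
    return f"{text}[{field}]" if field else text


def combine(op, *parts):
    """Join sub-queries with one Boolean operator and wrap them in parentheses."""
    if op not in BOOLEAN_OPERATORS:
        raise ValueError(f"unsupported operator: {op}")
    return "(" + f" {op} ".join(parts) + ")"


def value_range(low, high, field):
    """Return a range query such as '1:200[MW]'."""
    return f"{low}:{high}[{field}]"


# The complex query from the slide, built from its parts.
complex_query = combine(
    "NOT",
    combine("AND",
            combine("OR", term("A", "limit1"), term("B", "limit2")),
            term("C", "limit3")),
    term("D", "limit4"),
)
```

Always parenthesizing inside `combine` produces one extra outer pair compared with the slide, which is harmless and avoids relying on operator precedence.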

Entrez Tabs
- Limits: provides a simple form for applying commonly used Entrez limits
- Preview/Index: allows access to the full indexing of each Entrez database and aids in constructing complex queries
- History: provides access to previous searches in the current Entrez database
- Clipboard: a temporary storage area for selected records
- Details: displays the detailed parsing of the current Entrez query, and lists errors and terms without matches

Programming Entrez: E-Utilities
- ESearch: Entrez query -> UID list or History
- EPost: UID list -> History
- ESummary: UID list or History -> Document summaries
- EFetch: UID list or History -> Formatted data
- ELink: UID list or History -> linked UID list
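
One common chain through this diagram is ESearch with History followed by EFetch. The sketch below only builds the URLs and parses a response; it makes no network calls. The parameter names (`usehistory`, `WebEnv`, `query_key`, `rettype`, `retmode`, `retstart`, `retmax`) are E-utilities parameters, while the function names and `SAMPLE_RESPONSE` (with its placeholder WebEnv value) are invented for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term):
    """ESearch URL that asks the server to keep the result set on the History."""
    return f"{BASE}/esearch.fcgi?" + urlencode(
        {"db": db, "term": term, "usehistory": "y"})


def parse_history(esearch_xml):
    """Extract (WebEnv, QueryKey, Count) from an ESearch XML response."""
    root = ET.fromstring(esearch_xml)
    return (root.findtext("WebEnv"),
            root.findtext("QueryKey"),
            int(root.findtext("Count")))


def efetch_url(db, webenv, query_key, rettype="gb", retmode="text",
               retstart=0, retmax=20):
    """EFetch URL that retrieves one page of records from a History result."""
    return f"{BASE}/efetch.fcgi?" + urlencode({
        "db": db, "query_key": query_key, "WebEnv": webenv,
        "rettype": rettype, "retmode": retmode,
        "retstart": retstart, "retmax": retmax,
    })


# Invented response for illustration; a real WebEnv is a long opaque string.
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>262</Count>
  <QueryKey>1</QueryKey>
  <WebEnv>WEBENV_PLACEHOLDER</WebEnv>
</eSearchResult>"""

search = esearch_url("nucleotide", "thyroid peroxidase AND human[orgn]")
webenv, query_key, count = parse_history(SAMPLE_RESPONSE)
fetch = efetch_url("nucleotide", webenv, query_key)
```

Passing the History handle instead of a UID list keeps the URLs short and lets a script page through large result sets with `retstart` and `retmax`.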

Finding Primary Sequences
Search Entrez CoreNucleotide
- 94.8% GenBank (primary data)
- 5.2% RefSeq (curated data)
Possible queries we’ve seen so far…
- M17755 [primary accession]
- thyroid peroxidase [title]
- Homo sapiens [organism]
- 3060 [sequence length]
- 1999/04/26 [modification date]
- biomol mrna [properties]
- gbdiv pri [properties]
- srcdb genbank [properties]
- TPO [gene name]
- thyroiditis [text word]
- thyroid peroxidase [protein name]

A Starting Query
Find nucleotide records for human thyroid peroxidase
- human thyroid peroxidase
  parsed as: (("Homo sapiens"[Organism] OR human[All Fields]) AND thyroid peroxidase[All Fields])
  276 records
- human[organism] AND thyroid peroxidase (field limit!)
  parsed as: ("Homo sapiens"[Organism] AND thyroid peroxidase[All Fields])
  262 records
14 records aren’t human sequences!!

Limit by Title and Database
#1: thyroid peroxidase AND human[orgn]              262
#2: thyroid peroxidase[title] AND human[orgn]       55
#3: #2 AND srcdb refseq[properties]                 5
#4: #2 AND srcdb ddbj/embl/genbank[properties]      50
Entrez Nucleotide subsets:
- GenBank (primary data): srcdb ddbj/embl/genbank[properties]
- RefSeq: srcdb refseq[properties]

Limit by Biomolecule Type
- Genomic DNA: biomol genomic[prop]
- mRNA / cDNA: biomol mrna[prop]
#1: thyroid peroxidase AND human[orgn]              262
#2: thyroid peroxidase[title] AND human[orgn]       55
#3: #2 AND srcdb refseq[properties]                 5
#4: #2 AND srcdb ddbj/embl/genbank[properties]      50
#5: #4 AND biomol genomic[prop]                     26 (genomic DNA)
#6: #4 AND biomol mrna[prop]                        24 (mRNA / cDNA)
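
The "#n" references in the search history are shorthand for earlier queries. A short sketch shows how such a reference chain unrolls into a single self-contained query; `expand` and the `history` dictionary are hypothetical names for this example, and the entries are the six history lines from the slide.

```python
import re

# Search history from the slide, keyed by its "#n" number.
history = {
    1: "thyroid peroxidase AND human[orgn]",
    2: "thyroid peroxidase[title] AND human[orgn]",
    3: "#2 AND srcdb refseq[properties]",
    4: "#2 AND srcdb ddbj/embl/genbank[properties]",
    5: "#4 AND biomol genomic[prop]",
    6: "#4 AND biomol mrna[prop]",
}


def expand(history, number):
    """Replace every '#n' in query 'number' with the parenthesized query n.

    Assumes each entry only refers to earlier entries, so recursion terminates.
    """
    def substitute(match):
        return "(" + expand(history, int(match.group(1))) + ")"

    return re.sub(r"#(\d+)", substitute, history[number])


full_query_6 = expand(history, 6)
```

The expanded form of #6 is what a script would send if it cannot rely on a server-side history, for example when rebuilding a search in a fresh session.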

Limit by Protein Name
thyroid peroxidase[protein name] AND human[orgn] AND gbdiv pri[prop] AND biomol mrna[prop]
24 records with [title] -> 5 records with [protein name]

Entrez Document Summaries
- Click the accession to view the record
- Links menu: links to other Entrez databases computed for M17755

Viewing M17755

GenBank Sequences for Human TPO
Which one is the best sequence???

RefSeq: NCBI’s Derivative Sequence Database
RefSeq Benefits
- Non-redundant
- Explicitly linked nucleotide and protein sequences
- Updated to reflect current sequence data and biology
- Validated by hand
- Format consistency
- Distinct accession series
- Stewardship by NCBI staff and collaborators
ftp://ftp.ncbi.nih.gov/refseq/release

RefSeq: NCBI’s Derivative Sequence Database
Accession series (Nucleotide -> Protein)
Curated transcripts and proteins
- NM_ -> NP_
- NR_ (non-coding RNA)
Model transcripts and proteins
- XM_ -> XP_
- XR_ (non-coding RNA)
Assembled Genomic Regions (contigs)
- NT_ (BAC clones)
- NW_ (WGS)
Other Genomic Sequence
- NG_ (complex regions, pseudogenes)
- NZ_ABCD (WGS) -> ZP_
Chromosome records in Entrez Genome
- NC_ (chromosome; microbial or organelle genome)
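
Because each RefSeq series has a distinct prefix, the kind of record can be read straight off the accession. The lookup below is a sketch: the prefixes and descriptions are copied from the slide, while `REFSEQ_PREFIXES` and `classify_refseq` are hypothetical names.

```python
# Prefix -> (status or class, molecule or description), as listed on the slide.
REFSEQ_PREFIXES = {
    "NM_": ("curated", "mRNA"),
    "NP_": ("curated", "protein"),
    "NR_": ("curated", "non-coding RNA"),
    "XM_": ("model", "mRNA"),
    "XP_": ("model", "protein"),
    "XR_": ("model", "non-coding RNA"),
    "NT_": ("assembled genomic region", "contig (BAC clones)"),
    "NW_": ("assembled genomic region", "contig (WGS)"),
    "NG_": ("other genomic sequence", "complex regions, pseudogenes"),
    "NZ_": ("other genomic sequence", "WGS"),
    "ZP_": ("other genomic sequence", "protein from NZ_ (WGS) records"),
    "NC_": ("chromosome record", "chromosome; microbial or organelle genome"),
}


def classify_refseq(accession):
    """Return the slide's description for a RefSeq accession, or None."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


curated = classify_refseq("NM_000547")   # RefSeq accession used later in the deck
archival = classify_refseq("M17755")     # GenBank accession, so no RefSeq prefix
```

A `None` result is a quick way to tell an archival GenBank accession apart from a RefSeq one before deciding which record to trust as the reference.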

NM/NP Records in Entrez (Nucleotide, Protein)
NM_000547: variant 1
COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The reference sequence was derived from M and AW On Feb 25, 2003 this sequence version replaced gi:
NM_175719: variant 2
COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The reference sequence was derived from J , AW and M
EST that completes 3’ end

Annotating the Gene
[Diagram labels: GenBank Sequences; RefSeq; Genomic DNA (NC, NT, NW); Model mRNA (XM), (XR); Model protein (XP); Curated mRNA (NM), (NR); Curated Protein (NP); Scanning....]

The Perils of the XM
XM records are models based only on genomic sequence, and are subject to revision or removal with each new build of that genome.
BLAST the XM against the RefSeq database to look for a replacement:
Query= gi| |ref|XM_ | Mus musculus expressed sequence AA (AA553001), mRNA
gi| |ref|NM_ | Mus musculus DNA segment, Chr 4, Wayne State University 114, expressed (D4Wsu114e), mRNA
Length=1898
Score = bits (1867), Expect = 0
Identities = 1870/1871 (99%), Gaps = 0/1871 (0%)
Strand=Plus/Plus

Entrez Gene and RefSeq
- Entrez Gene is the central depository for information about a gene available at NCBI, and often provides links to sites beyond NCBI
- Entrez Gene includes records for organisms that have NCBI Reference Sequences (RefSeqs)
- Entrez Gene records contain RefSeq mRNAs, proteins, and genomic DNA (if known) for a gene locus, plus links to other Entrez databases
- NCBI RefSeqs are based on primary sequence data in GenBank
[Diagram labels: GenBank, RefSeq, Gene, Nucleotide]

Entrez Gene: RefSeq Annotations

NM/NP Records in Entrez Gene

Entrez Gene RefSeq Graphics (NM, NP)

Getting the Annotation Details
Genomic sequence ACCESSION NC_ REGION:

Genome Annotation in Entrez Nucleotide
[Diagram labels: GenBank Components (clones, WGS); NT/NW Contigs; NC Assembly; Components; Genome Components; NM/XM Master mRNA]

Genome Annotation Links
[Screenshot labels: curated mRNA; genomic contig on chromosome 2 transcribing NM_; human chromosome 2; the 18 contigs of the chromosome 2 assembly]

Searching Entrez Gene
- RefSeq status and variants: Reviewed RefSeqs with transcript variants
  srcdb refseq reviewed[prop] AND has transcript variants[prop]
- Gene symbol: human thyroid peroxidase (TPO)
  tpo [sym] AND human [organism]
- Disease and Gene Ontology: membrane proteins linked to cancer
  integral to plasma membrane[gene ontology] AND cancer [dis]
- Chromosome and Links: genes on human chromosome 2 with OMIM links
  2 [chromosome] AND gene omim [filter] AND human [organism]
- Protein name: topoisomerase genes from Archaea
  topoisomerase[gene/protein name] AND archaea [organism]

Third Party Annotation (TPA) Database
NCBI now accepts the submission of new annotations of existing GenBank sequences.
- Submissions must be published in a peer-reviewed journal.
- Facilitates the annotation of sequences by experts.
Examples of sequences appropriate for TPA are:
- Annotation of features on gene and/or mRNA sequences
- Assembled “full length” genes and/or mRNAs
What should not be submitted to TPA?
- Synthetic constructs (such as cloning vectors) that use well-characterized, publicly available genes, promoters, or terminators
- Updates or changes to existing sequence data
- Sequence annotations without experimental evidence

Linking Protein Sequence, Structure, and Function
- Protein: sequence -> sequence
- Conserved Domains (CDD): sequence -> function (pfam, smart); sequence -> structure + function (cd)
- Structure (MMDB): sequence -> structure
- VAST: structure -> structure

Entrez Structure
MMDB: Molecular Modeling Data Base
- Derived from experimentally determined PDB records
- Adds value to PDB records by:
  - Adding explicit chemical bonding information
  - Validating and indexing the sequences
  - Annotating 3D domains and secondary structure
  - Adding links to CDD, Taxonomy, PubMed
  - Converting PDB data to ASN.1
- Structure neighbors determined by Vector Alignment Search Tool (VAST)

Structure Summary Page
[Screenshot labels: Conserved Domains; VAST Neighbors for chain C (domain 0); VAST Neighbors for domain 2; Cn3D]

Related Structures

VAST: Structure Neighbors
Vector Alignment Search Tool
- For each 3D domain, locate SSEs (secondary structure elements), and represent them as individual vectors (example shown: Human IL-4)
- VAST uses 3D Domains only! Whole polypeptides are assigned 3D domain 0 (zero).

VAST Neighbors
[Screenshot labels: 1D2V, 1Q4G; 3D domains!; Cn3D]

Submitting a PDB File to VAST
Redesigned interface! This is the best way to convert PDB into MMDB format!

Structure + Function
- VAST finds proteins that have similar 3D folds
- CD-Search finds proteins that have similar sequences and similar functions
- Curated CDs = VAST + CD-Search: proteins that have similar 3D folds, similar sequences and similar functions

Protein Links: Domains
Click on a colored bar to align your sequence to the CD

CDD Record: heme peroxidases
[Screenshot labels: aligned query; red = high conservation; blue = low conservation]

Curated CD Record - EGF
[Screenshot labels: Annotated features; Launch Cn3D; phylogenetic tree of aligned sequences; Launch CDTree (new); Cn3D]

Entrez PubChem
- PC Substance: primary database of chemical samples
- PC Compound: derived database of known chemicals from PC Substance records
- PC BioAssay: primary database of bioactivity screens of samples in PC Substance

Links from Structure
[Screenshot labels: N-acetylglucosamine, heme, mannose, fucose]

Sequence Polymorphisms
SNP (General Polymorphisms)
- Primary database of submitted SNPs
- Curated database of reference SNPs
- Contains more than just SNPs: true SNPs, MNP (multiple nucleotide), insertions, deletions, microsatellites, mixed, no variation (constant)
OMIM (Human Phenotypes)
- Clinical literature database
- Curated at Johns Hopkins Univ
- Links human genes and genetic disorders to human disease
- Lists allelic variants that have clinical consequences
Variations in SNP are not necessarily in OMIM, and vice versa!

Linking to SNP
Entrez Gene - TPO
Links to SNP are also available from Nucleotide and Protein

Entrez SNP
- primary data: ss#
- SNP UID: rs#

Find Non-synonymous SNPs
#7 AND coding nonsynon[Function Class]

Non-synonymous TPO SNPs
- Link to Map Viewer
- View all SNPs in locus
- Link to related 3D structures

GeneView in dbSNP

Links to OMIM
Entrez Gene - TPO

OMIM Record

Explore a Disease SNP
799

Curated CD Record
[Screenshot labels: Launch Cn3D; phylogenetic tree of aligned sequences; Launch CDTree; Cn3D; E799]