NCBI FieldGuide A Minimal Guide to NCBI Nucleotide Resources.

1 NCBI FieldGuide A Minimal Guide to NCBI Nucleotide Resources

2 NCBI FieldGuide Types of Databases
Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
Examples: GenBank, SNP, GEO
Derivative Databases
– Built from primary data
– Content controlled by third party (NCBI)
Examples: RefSeq, TPA, RefSNP, UniGene, GEO Datasets, NCBI Protein, Structure, Conserved Domain

3 NCBI FieldGuide Accessing the Data: Entrez all[filter]

4 NCBI FieldGuide International Sequence Database Collaboration
[Diagram: three partner databases exchange submissions and updates]
GenBank (NCBI, NIH), searched through Entrez
EMBL (EBI), searched through SRS
DDBJ (CIB, NIG), searched through getentry

5 NCBI FieldGuide GenBank: NCBI’s Primary Sequence Database
ftp://ftp.ncbi.nih.gov/genbank/
ftp://genbank.sdsc.edu/pub
ftp://bio-mirror.net/biomirror/genbank
Release 142, June 2004
35,532,003 records
40,325,321,348 nucleotides
>140,000 species
153 gigabytes
634 files
Full release every two months
Incremental and cumulative updates daily
Available only through the internet
Release notes: gbrel.txt

6 NCBI FieldGuide A GenBank Record
LOCUS       NM_000588    924 bp    mRNA    linear    PRI 07-APR-2003
DEFINITION  Homo sapiens interleukin 3 (colony-stimulating factor, multiple) (IL3), mRNA.
ACCESSION   NM_000588
VERSION     NM_000588.3  GI:28416914
KEYWORDS    .
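
The header of a GenBank flat file is a set of keyword lines, so it can be read with a few lines of code. The following is a minimal sketch, assuming one keyword per line with indented continuation lines; the line wrapping of the DEFINITION field and the function name `parse_header` are assumptions made for illustration.

```python
# Header lines from the example record, laid out one keyword per line.
# The wrapping of DEFINITION onto a second line is an assumed layout.
RECORD_HEADER = """\
LOCUS       NM_000588                924 bp    mRNA    linear   PRI 07-APR-2003
DEFINITION  Homo sapiens interleukin 3 (colony-stimulating factor, multiple)
            (IL3), mRNA.
ACCESSION   NM_000588
VERSION     NM_000588.3  GI:28416914
"""

def parse_header(text):
    """Map each keyword to its value, joining indented continuation lines."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] != " ":
            # A keyword starts in the first column; the rest is its value.
            current, _, value = line.partition(" ")
            fields[current] = value.strip()
        elif current is not None:
            # Indented lines continue the previous keyword's value.
            fields[current] += " " + line.strip()
    return fields

header = parse_header(RECORD_HEADER)
# VERSION holds accession.version followed by the GI number.
accession, version = header["VERSION"].split()[0].split(".")
locus = header["LOCUS"].split()
```

Here `accession` and `version` separate the stable accession from its revision number, and `locus[1]` and `locus[5]` hold the sequence length and the GenBank division.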

7 NCBI FieldGuide GenBank Record: Feature Table
GenPept identifiers:
/protein_id="NP_000579.2"
/db_xref="GI:28416915"
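
Feature-table qualifiers follow a `/name="value"` pattern, which a regular expression can pick out. This is a rough sketch that uses the two qualifiers shown on the slide; `parse_qualifiers` is a hypothetical helper, and it keeps only the last value when a qualifier name repeats.

```python
import re

# Qualifier lines as they might appear in a feature table (values from the slide).
QUALIFIERS = '''\
                     /protein_id="NP_000579.2"
                     /db_xref="GI:28416915"
'''

QUALIFIER_RE = re.compile(r'/(\w+)="([^"]*)"')

def parse_qualifiers(text):
    """Return qualifier name -> value; a repeated name keeps its last value."""
    return dict(QUALIFIER_RE.findall(text))

quals = parse_qualifiers(QUALIFIERS)
# The protein identifier is accession.version, like the nucleotide VERSION line.
protein_accession, protein_version = quals["protein_id"].split(".")
gi_number = quals["db_xref"].split(":")[1]
```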

8 NCBI FieldGuide GenBank Record, Cont’d

9 NCBI FieldGuide Sequence Revision History

10 NCBI FieldGuide NM_000588 Sequence Revision History: choose records

11 NCBI FieldGuide Display and Save Options

12 NCBI FieldGuide FASTA format (NCBI)
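
A FASTA record is a header line beginning with `>` followed by one or more lines of sequence. The sketch below shows one way to read such text; the function names and the two short records are made up for illustration and are not taken from the slides.

```python
def read_fasta(text):
    """Yield (identifier, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield _split(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _split(header, chunks)

def _split(header, chunks):
    # The identifier is everything up to the first space of the header line.
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)

# Hypothetical two-record input; the sequences are invented.
EXAMPLE = """\
>seq1 first made-up record
ACGTACGT
ACGT
>seq2 second made-up record
TTGACA
"""
records = list(read_fasta(EXAMPLE))
```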

13 NCBI FieldGuide Abstract Syntax Notation: ASN.1
[Diagram labels: ASN.1, GenBank, GenPept, FASTA Nucleotide, FASTA Protein]

14 NCBI FieldGuide Bulk Divisions
Expressed Sequence Tag
– 1st pass single read cDNA
Genome Survey Sequence
– 1st pass single read gDNA
High Throughput Genomic
– incomplete sequences of genomic clones
Sequence Tagged Site
– PCR-based mapping reagents
Batch submissions (email and ftp)
Inaccurate
Poorly characterized

15 NCBI FieldGuide NCBI’s Derivative Sequence Databases

16 NCBI FieldGuide Primary vs. Derivative Databases
[Diagram]
GenBank: updated ONLY by submitters (sequencing centers, labs)
– Divisions: EST, STS, GSS, HTG, PRI, ROD, PLN, MAM, BCT, INV, VRT, PHG, VRL
Derivative databases: updated continually by NCBI (algorithms, curators)
– UniGene
– UniSTS
– RefSeq: LocusLink and Genomes Pipelines
– RefSeq: Annotation Pipeline

17 NCBI FieldGuide Why Make Reference Sequences?
Entrez Protein query: topoisomerase II alpha[title] AND human[organism]
[Figure legend: AAC77388; splice variant; Δ = 5 aa; P11388; RefSeq protein]

18 NCBI FieldGuide RefSeq Benefits
non-redundant, best representative
updates to reflect current sequence data and biology
distinct, stable accession series
genomes, transcripts, proteins

19 NCBI FieldGuide Reference Sequence: RefSeq
Accession          Sequence Type
NM_123456789       mRNA
NP_123456789       protein, from NM_
NR_123456          non-coding RNA
XM_123456          predicted mRNA
XP_123456          predicted protein
XR_123456          predicted non-coding RNA
ZP_12345678        predicted from NZ_
NC_123456          genomic, e.g., chromosomes
NG_123455          genomic, incomplete region
NT_123456          genomic, BAC assembly
NW_123456          genomic, WGS assembly
NZ_ABCD12345678    genomic, WGS collection
Key: blue = curated
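
Because the prefix before the underscore encodes the sequence type, an accession can be classified with a simple lookup. The following is a small sketch of that idea; the dictionary copies the prefixes listed on this slide, and `refseq_type` is a hypothetical helper name.

```python
# Prefix -> sequence type, copied from the accession list on this slide.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "protein, from NM_",
    "NR": "non-coding RNA",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
    "XR": "predicted non-coding RNA",
    "ZP": "predicted from NZ_",
    "NC": "genomic, e.g., chromosomes",
    "NG": "genomic, incomplete region",
    "NT": "genomic, BAC assembly",
    "NW": "genomic, WGS assembly",
    "NZ": "genomic, WGS collection",
}

def refseq_type(accession):
    """Return the sequence type for a RefSeq accession, or None otherwise."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        # Accessions without an underscore (e.g. AAC77388) are not RefSeq.
        return None
    return REFSEQ_PREFIXES.get(prefix.upper())
```

For example, `refseq_type("NM_000588.3")` gives the mRNA type, while a GenBank protein accession such as `AAC77388` gives `None`.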

20 NCBI FieldGuide RefSeq Status Codes
REVIEWED: by NCBI staff or by a collaborator. Some RefSeq records may incorporate expanded sequence and annotation information including additional publications and features.
VALIDATED: in an initial review to provide the preferred sequence standard; not yet subjected to final review at which time additional functional information may be provided.
PROVISIONAL: the record has not yet been subject to individual review and is thought to be well supported and to represent a valid transcript and protein.
PREDICTED: may represent an ab initio prediction or may be partially supported by other transcript data; the protein is predicted.
INFERRED: by genome sequence analysis.
MODEL: provided via automated processing and not subjected to individual review or revision between builds.

21 NCBI FieldGuide Third Party Annotation (TPA) Database
Annotations of existing GenBank sequences
Allows for community annotation of genomes
Direct submissions
– BankIt
– Sequin

22 NCBI FieldGuide Other Databases at the NCBI
dbSNP: nucleotide polymorphisms
GEO (Gene Expression Omnibus): microarray and other expression data
GEO DataSets: curated reports of GEO data; collections of biologically and mathematically comparable GEO Samples
Structure: imported structures (PDB); Cn3D viewer, NCBI curation
CDD (conserved domain database): protein families (COGs and KOGs); single domains (PFAM, SMART, CD)

23 NCBI FieldGuide NCBI’s SNP Database
Primary and derivative (RefSNP)
Single nucleotide polymorphisms
Repeat polymorphisms
Insertion-deletion polymorphisms
Covers 24 species
Over 11 million refSNPs (rsXXXXXXX)

24 NCBI FieldGuide RefSNP
Non-redundant
Computational analysis: BLAST hits to genome, mRNA, protein

25 NCBI FieldGuide Using Entrez An integrated database search and retrieval system

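26 Entrez: Database Integration
[Diagram: linked Entrez databases and the neighbor relationships between them]
Databases: PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure, Genomes, Taxonomy
Neighbor links: Word weight, BLAST, VAST, Phylogeny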

27 NCBI FieldGuide Home Page: Global Entrez Portal hfe

28 NCBI FieldGuide Global Entrez Search: HFE

29 NCBI FieldGuide Entrez Nucleotide: HFE
218 records
Not HFE [Title]

30 NCBI FieldGuide Smarter Query
hfe[title] AND human[orgn]
39 records
Curated HFE splice variants (11 total)
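
The slides run this query through the Entrez web page. One way to send the same query string from a script is NCBI's E-utilities `esearch` endpoint, which the slides do not cover. The sketch below only builds the request URL and does not contact the server; the helper name `esearch_url`, the database name, and the `retmax` default are assumptions for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base address of the E-utilities search service (not mentioned in the slides).
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(term, db="nucleotide", retmax=20):
    """Build an esearch URL; urlencode escapes spaces and square brackets."""
    # The db value is an assumption; check the current list of database names.
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH_BASE + "?" + urlencode(params)

url = esearch_url("hfe[title] AND human[orgn]")
# Decoding the query string recovers the original term unchanged.
decoded = parse_qs(urlsplit(url).query)
```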

31 NCBI FieldGuide hfe[title] AND human[orgn] (cont’d)
Primary data

32 NCBI FieldGuide Finding Primary Sequences
Entrez Nucleotide
99+% GenBank (primary data)
– srcdb ddbj/embl/genbank[properties] = 39,849,856 records
<1% RefSeq (curated data)
– srcdb refseq[properties] = 304,945 records
Useful search terms in [Properties]:
– srcdb: source database (e.g., srcdb genbank[prop])
– gbdiv: GenBank division (e.g., gbdiv est[prop])
– biomol: biomolecule type (e.g., biomol mrna[prop])
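
The property terms above all share the shape `key value[prop]`, so they are easy to generate and combine. Here is a minimal sketch with two hypothetical helpers, `prop` and `all_of`; it only assembles query strings and assumes AND is the only operator needed.

```python
def prop(key, value):
    """Format one property term, e.g. prop("gbdiv", "est") -> 'gbdiv est[prop]'."""
    return f"{key} {value}[prop]"

def all_of(*terms):
    """Join terms with AND, the Boolean operator used on these slides."""
    return " AND ".join(terms)

# Restrict the HFE title search to curated RefSeq records.
refseq_only = all_of("hfe[title]", "human[orgn]", prop("srcdb", "refseq"))

# Combine a division filter with a molecule-type filter.
est_mrna = all_of(prop("gbdiv", "est"), prop("biomol", "mrna"))
```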

33 NCBI FieldGuide Database Queries
#1  hfe                                      116
#2  hfe[title] AND human[orgn]               42
#3  #2 AND srcdb refseq[prop]                11
#4  #2 AND srcdb ddbj/embl/genbank[prop]     31
#5  #2 AND gbdiv pri[prop]                   29
#6  #2 AND gbdiv est[prop]                   2
Primate division: gbdiv pri[prop]
EST division: gbdiv est[prop]
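
Each `#n` in the history stands for an earlier query, so a later entry can be expanded into a single stand-alone query by substitution. The following sketch illustrates that expansion; `expand` is a hypothetical helper, and the last query is numbered 6 here so that every history key is unique.

```python
import re

def expand(history, number):
    """Replace each '#n' in a history entry with the query it refers to."""
    def substitute(match):
        # Parentheses keep the referenced query grouped as one unit.
        return "(" + expand(history, int(match.group(1))) + ")"
    return re.sub(r"#(\d+)", substitute, history[number])

# Query history from the slide, keyed by query number.
history = {
    1: "hfe",
    2: "hfe[title] AND human[orgn]",
    3: "#2 AND srcdb refseq[prop]",
    4: "#2 AND srcdb ddbj/embl/genbank[prop]",
    5: "#2 AND gbdiv pri[prop]",
    6: "#2 AND gbdiv est[prop]",
}

full_refseq_query = expand(history, 3)
```

With this history, `full_refseq_query` is `(hfe[title] AND human[orgn]) AND srcdb refseq[prop]`.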

34 NCBI FieldGuide Molecule Queries
#1  hfe                              116
#2  hfe[title] AND human[orgn]       42
#3  #2 AND biomol mrna[prop]         29
#4  #2 AND biomol genomic[prop]      13
Genomic DNA: biomol genomic[prop]
cDNA: biomol mrna[prop]

35 NCBI FieldGuide More Queries…
RefSeq status, variants: reviewed RefSeqs with transcript variants
– srcdb refseq reviewed[prop] AND has transcript variants[prop]
Gene symbol: human hemochromatosis (HFE)
– hfe[sym] AND human[organism]
Disease and Gene Ontology: membrane proteins linked to cancer
– integral to plasma membrane[gene ontology] AND cancer[dis]
Chromosome, Links: genes on human chromosome 2 with OMIM links
– 2[chromosome] AND gene omim[filter] AND human[organism]
Protein name: topoisomerase genes from Archaea
– topoisomerase[gene/protein name] AND archaea[organism]
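
Every term in these queries is a value followed by a field tag in square brackets. The sketch below splits a query into (value, field) pairs to make that structure explicit; `split_query` is a hypothetical helper that handles only AND-joined terms, not OR, NOT, or parentheses.

```python
import re

# A term is some text followed by a bracketed field tag at the end.
TERM_RE = re.compile(r"^(?P<value>.+?)\[(?P<field>[^\]]+)\]$")

def split_query(query):
    """Split an AND-only query into (value, field) pairs; field is None if untagged."""
    pairs = []
    for term in query.split(" AND "):
        term = term.strip()
        match = TERM_RE.match(term)
        if match:
            pairs.append((match.group("value"), match.group("field")))
        else:
            pairs.append((term, None))
    return pairs

chromosome_query = split_query(
    "2[chromosome] AND gene omim[filter] AND human[organism]"
)
```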

36 NCBI FieldGuide Other Entrez Databases
UniSTS: markers on the Genethon map of human chromosome 12
– Genethon[Map Name] AND human[organism] AND 12[chromosome]
UniGene: rat clusters that have at least one mRNA
– rat[organism] NOT 0[mrna count]
Structure: structures of bacterial kinases with resolutions below 2 Å
– bacteria[organism] AND kinase AND 000.00:002.00[resolution]
SNP: uniquely mapped microsatellites on human chr2
– microsat[SNP Class] AND 1[Map Weight] AND 2[Chromosome] AND human[orgn]
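
The Structure query uses a range term of the form `low:high[field]`, written with zero-padded numbers. The following sketch reproduces that formatting; `range_term` is a hypothetical helper, and whether the padding is strictly required by Entrez is an assumption drawn only from the slide's example.

```python
def range_term(low, high, field, width=6, decimals=2):
    """Format low:high[field] with zero-padded, fixed-width numbers."""
    # width=6, decimals=2 matches the 000.00 pattern shown on the slide.
    fmt = f"{{:0{width}.{decimals}f}}"
    return f"{fmt.format(float(low))}:{fmt.format(float(high))}[{field}]"

resolution = range_term(0, 2, "resolution")
structure_query = " AND ".join(["bacteria[organism]", "kinase", resolution])
```

This rebuilds the Structure query from the slide; changing the upper bound, for example `range_term(0, 1.5, "resolution")`, narrows the search to higher-resolution structures.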

37 NCBI FieldGuide Search by Sequence

38 NCBI FieldGuide Related Sequences Most similar Least similar

39 NCBI FieldGuide Search by Sequence: protein

40 NCBI FieldGuide BLink (BLAST Link)

41 NCBI FieldGuide BLink Output

42 NCBI FieldGuide BLink → Multiple sequence alignment

