NCBI Molecular Biology Resources February 2007 Part 1.

1 NCBI Molecular Biology Resources February 2007 Part 1

2 The National Center for Biotechnology Information
Created in 1988 as a part of the National Library of Medicine at NIH
–Establish public databases
–Research in computational biology
–Develop software tools for sequence analysis
–Disseminate biomedical information
Bethesda, MD

3 Web Access: www.ncbi.nlm.nih.gov

4 NCBI Databases and Services
GenBank: largest sequence database
Free public access to biomedical literature
–PubMed: free Medline
–PubMed Central: full text online access
Entrez: integrated molecular and literature databases
BLAST: highest volume sequence search service
VAST: structure similarity searches
Software and Databases

5 Types of Databases
Primary Databases
–Original submissions by experimentalists
–Content controlled by the submitter
Examples: GenBank, SNP, GEO
Derivative Databases
–Built from primary data
–Content controlled by third party (NCBI)
Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain

6 Entrez Nucleotides
Primary
  GenBank / EMBL / DDBJ      86,011,283
Derivative
  RefSeq                      1,512,656
  Third Party Annotation          5,254
  PDB                             7,261
Total                        87,536,454

7 What is GenBank?
NCBI’s Primary Sequence Database
Nucleotide only sequence database
Archival in nature
–Historical
–Reflective of submitter point of view (subjective)
–Redundant
GenBank Data
–Direct submissions (traditional records)
–Batch submissions (EST, GSS, STS)
–ftp accounts (genome data)
Three collaborating databases
–GenBank
–DNA Database of Japan (DDBJ)
–European Molecular Biology Laboratory (EMBL) Database

8 International Sequence Database Collaboration
[Diagram. Three databases exchange data: GenBank (NCBI, NIH; retrieval via Entrez), EMBL (EBI; retrieval via SRS), and DDBJ (NIG, CIB; retrieval via getentry). Each receives its own Submissions and Updates.]

9 GenBank: NCBI’s Primary Sequence Database
ftp://ftp.ncbi.nih.gov/genbank/
Release 157, December 2006
83,434,665 Records
150,630,667,561 Total Bases
254 Gigabytes (non-WGS)
1072 files (non-WGS)
full release every two months
incremental updates daily
available only via ftp

10 The Growth of GenBank Non-WGS: 69.0 billion bases WGS: 81.6 billion bases Release 157 Doubling time 12-14 months
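The figures on this slide can be checked with a little arithmetic. The following is a minimal sketch, assuming growth at a constant doubling time (an idealization; the slide only states a range of 12-14 months). The function name `growth_factor` and the constants are illustrative, not part of any NCBI tool.

```python
# Illustrative arithmetic for the growth slide; names are hypothetical.

NON_WGS_BILLION = 69.0   # non-WGS bases in Release 157, from the slide
WGS_BILLION = 81.6       # WGS bases in Release 157, from the slide

# The two parts add up to about 150.6 billion bases, consistent with the
# 150,630,667,561 total bases reported for Release 157.
TOTAL_BILLION = NON_WGS_BILLION + WGS_BILLION


def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """Multiplicative growth after `elapsed_months`, assuming the size
    doubles every `doubling_months` (constant doubling time is an assumption)."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2.0 ** (elapsed_months / doubling_months)


if __name__ == "__main__":
    for doubling in (12, 14):
        factor = growth_factor(24, doubling)
        print(f"doubling every {doubling} months: x{factor:.2f} after 24 months")
```

Under this model, a 12-month doubling time means a fourfold increase over two years; a 14-month doubling time gives a somewhat smaller factor.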

11 Organization of GenBank: Traditional Divisions
Records are divided into 18 Divisions: 12 Traditional, 6 Bulk
Traditional Divisions:
–Direct Submissions (Sequin and BankIt)
–Accurate
–Well characterized
PRI  Primate
PLN  Plant and Fungal
BCT  Bacterial and Archaeal
INV  Invertebrate
ROD  Rodent
VRL  Viral
VRT  Other Vertebrate
MAM  Mammalian
PHG  Phage
SYN  Synthetic (cloning vectors)
ENV  Environmental Samples
UNA  Unannotated
Entrez query: gbdiv_xxx[Properties]

12 Organization of GenBank: Bulk Divisions
Records are divided into 18 Divisions: 12 Traditional, 6 Bulk
Bulk Divisions:
–Batch Submission (Email and FTP)
–Inaccurate
–Poorly characterized
EST  Expressed Sequence Tag
GSS  Genome Survey Sequence
HTG  High Throughput Genomic
STS  Sequence Tagged Site
HTC  High Throughput cDNA
PAT  Patent
Entrez query: gbdiv_xxx[Properties]
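The `gbdiv_xxx[Properties]` pattern can be filled in mechanically from the division codes listed on these two slides. The sketch below is only an illustration: the helper names `division_kind` and `division_query` are hypothetical, and the lowercase form of the term is assumed from examples such as `gbdiv_est` elsewhere in this deck.

```python
# Division codes as listed on the slides (12 traditional, 6 bulk).
TRADITIONAL = {
    "PRI": "Primate", "PLN": "Plant and Fungal", "BCT": "Bacterial and Archaeal",
    "INV": "Invertebrate", "ROD": "Rodent", "VRL": "Viral",
    "VRT": "Other Vertebrate", "MAM": "Mammalian", "PHG": "Phage",
    "SYN": "Synthetic (cloning vectors)", "ENV": "Environmental Samples",
    "UNA": "Unannotated",
}
BULK = {
    "EST": "Expressed Sequence Tag", "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genomic", "STS": "Sequence Tagged Site",
    "HTC": "High Throughput cDNA", "PAT": "Patent",
}


def division_kind(code: str) -> str:
    """Return 'traditional' or 'bulk' for a three-letter division code."""
    code = code.upper()
    if code in TRADITIONAL:
        return "traditional"
    if code in BULK:
        return "bulk"
    raise ValueError(f"unknown GenBank division: {code!r}")


def division_query(code: str) -> str:
    """Build the Entrez term for a division, e.g. gbdiv_est[Properties]."""
    division_kind(code)  # validates the code, raises ValueError if unknown
    return f"gbdiv_{code.lower()}[Properties]"


if __name__ == "__main__":
    for code in ("PLN", "EST"):
        print(code, division_kind(code), division_query(code))
```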

13 A Traditional GenBank Record
The Flatfile Format: Header, Feature Table, Sequence

LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
//

14 Traditional GenBank Record
ACCESSION U07418
VERSION U07418.1 GI:466461
Accession: Stable, Reportable, Universal
Version: Tracks changes in sequence
GI number: NCBI internal use
well annotated
the sequence is the data
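The three identifiers on this slide all sit on the VERSION line of a flatfile record. As a minimal sketch (not NCBI code), the line can be split into accession, version number, and GI with a regular expression. The names `VersionInfo` and `parse_version_line` are hypothetical, and the pattern assumes the `ACCESSION.VERSION GI:NUMBER` layout shown in this deck.

```python
import re
from typing import NamedTuple


class VersionInfo(NamedTuple):
    accession: str  # stable, reportable identifier
    version: int    # incremented when the sequence changes
    gi: int         # NCBI internal identifier


# Matches lines such as "VERSION     AY182241.2  GI:32265057".
_VERSION_RE = re.compile(r"^VERSION\s+([A-Z_]+\d+)\.(\d+)\s+GI:(\d+)$")


def parse_version_line(line: str) -> VersionInfo:
    """Parse a flatfile VERSION line into its three parts."""
    match = _VERSION_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a VERSION line: {line!r}")
    accession, version, gi = match.groups()
    return VersionInfo(accession, int(version), int(gi))


if __name__ == "__main__":
    print(parse_version_line("VERSION U07418.1 GI:466461"))
    print(parse_version_line("VERSION     AY182241.2  GI:32265057"))
```

In the apple record, version 2 of AY182241 carries a different GI from the version it replaced, which is how a sequence update shows up in these fields.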

15 Bulk Divisions
Expressed Sequence Tag
–1st pass single read cDNA
Genome Survey Sequence
–1st pass single read gDNA
High Throughput Genomic
–incomplete sequences of genomic clones
Sequence Tagged Site
–PCR-based mapping reagents
Batch Submission and htg (email and ftp)
Inaccurate
Poorly Characterized

16 GenBank Bulk Sequence: EST poorly characterized

17 ESTs in Entrez
Total               41 million records
Human               7.9 million
Mouse               4.7 million
Cow                 1.3 million
Rice                1.2 million
Zebrafish           1.2 million
Maize               1.2 million
Xenopus tropicalis  1.0 million
Rat                 0.9 million
Wheat               0.9 million
Chicken             0.6 million
Barley              0.4 million

18 HTG Division: Opossum Draft Sequences
Unfinished sequences of BACs
Gaps and unordered pieces
Finished sequences move to traditional GenBank division

19 Whole Genome Shotgun Projects
ftp://ftp.ncbi.nih.gov/genbank/wgs/
>450 Projects
>400 Taxa
–302 bacteria
–128 eukaryotes
    47 fungi
    53 animals
    3 flowering plants

20 Mammalian WGS Duck-billed platypus Nine-banded armadillo Northern tree shrew Domestic rabbit Guinea pig Mouse Rat Thirteen-lined ground squirrel Small-eared galago Human Chimpanzee Rhesus macaque Tenrec African elephant Cat Dog European hedgehog Eurasian shrew Cow Little brown bat Gray short-tailed opossum

21 Derivative Databases

22 Entrez Protein: Derivative Database
Data Source                           Sequences
GenPept                               6,749,369
RefSeq                                3,261,525
Third Party Annotation                    5,079
Swiss Prot                              243,887
PIR                                      30,236
PRF                                      12,079
PDB                                      89,953
PAT Division                            669,035
Total                                10,392,118
BLAST nr total (no patents or env)    4,180,857

23 GenPept: GenBank CDS translations
FEATURES             Location/Qualifiers
     source          1..2484
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="3"
                     /map="3p22-p23"
     gene            1..2484
                     /gene="MLH1"
     CDS             22..2292
                     /gene="MLH1"
                     /note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession
                     Number P14242), S. cerevisiae MLH1 (GenBank Accession
                     Number U07187), E. coli MUTL (Swiss-Prot Accession Number
                     P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession
                     Number P14161) and Streptococcus pneumoniae (Swiss-Prot
                     Accession Number P14160)"
                     /codon_start=1
                     /product="DNA mismatch repair protein homolog"
                     /protein_id="AAC50285.1"
                     /db_xref="GI:463989"
                     /translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS
                     TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE
                     ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA
                     TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS

>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

24 Redundant Proteins
Sources shown: GenPept, NCBI RefSeq, Swiss-Prot, PRF
>gi|741682|prf||2007430A DNA mismatch repair protei...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi|4557757|ref|NP_000240.1| MutL protein homolog 1...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi|13905126|gb|AAH06850.1| MutL protein homolog 1...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi|1079787|gb|AAA82079.1| DNA mismatch repair prot...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
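The redundancy on this slide becomes visible once the deflines are parsed and records are grouped by identical sequence. The sketch below assumes the `>gi|NUMBER|DB|ACCESSION|NAME` defline layout used in this deck. The function names are hypothetical, and the sequences in the usage example are short placeholders, not real protein data.

```python
from collections import defaultdict


def parse_defline(defline: str) -> dict:
    """Split a '>gi|...' FASTA defline into gi, source database, accession, title."""
    if not defline.startswith(">gi|"):
        raise ValueError(f"unsupported defline: {defline!r}")
    identifier, _, title = defline[1:].partition(" ")
    fields = identifier.split("|")
    # fields[0] is "gi", fields[1] the GI number, fields[2] the database tag.
    # Some sources (e.g. prf) leave the first accession slot empty,
    # so take the first non-empty field after the database tag.
    accession = next((f for f in fields[3:] if f), "")
    return {"gi": int(fields[1]), "db": fields[2],
            "accession": accession, "title": title}


def group_identical(records):
    """Map each distinct sequence to the 'db|accession' labels that carry it."""
    groups = defaultdict(list)
    for defline, sequence in records:
        info = parse_defline(defline)
        groups[sequence].append(f"{info['db']}|{info['accession']}")
    return dict(groups)


if __name__ == "__main__":
    # Placeholder sequences for illustration only.
    records = [
        (">gi|741682|prf||2007430A DNA mismatch repair protei...", "MSFVAG"),
        (">gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair...", "MSFVAG"),
        (">gi|4557757|ref|NP_000240.1| MutL protein homolog 1...", "MSFVAG"),
    ]
    for seq, labels in group_identical(records).items():
        print(seq, labels)
```

Grouping on the full sequence string only catches exact duplicates; near-identical entries would need an alignment-based comparison instead.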

25 Protein Sequences from Structures
>gi|5542073|pdb|1B63|A Chain A, Mutl Complexed With Adpnp
SHMPIQVLPPQLANQIAAGEVVERPASVVKELVENSLDAGATRIDIDIERGGAKLIRIRDNGCGIKKDEL
ALALARHATSKIASLDDLEAIISLGFRGEALASISSVSRLTLTSRTAEQQEAWQAYAEGRDMNVTVKPAA
HPVGTTLEVLDLFYNTPARRKFLRTEKTEFNHIDEIIRRIALARFDVTINLSHNGKIVRQYRAVPEGGQK
ERRLGAICGTAFLEQALAIEWQHGDLTLRGWVADPNHTTPALAEIQYCYVNGRMMRDRLINHAIRQACED
KLGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQ

26 Primary vs. Derivative Sequence Databases
[Diagram of sequence fragments flowing between databases. Labels: Sequencing Centers, Labs, GenBank, Algorithms, UniGene, Curators, RefSeq, Genome Assembly. Captions: "Updated ONLY by submitters" and "Updated continually by NCBI".]

27 RefSeq: NCBI’s Derivative Sequence Database
Curated transcripts and proteins
–reviewed
–human, mouse, rat, fruit fly, zebrafish, arabidopsis, microbial genomes (proteins), and more
Model transcripts and proteins
Assembled Genomic Regions (contigs)
–human genome
–mouse genome
–rat genome
Chromosome records
–human genome
–microbial
–organelle
ftp://ftp.ncbi.nih.gov/refseq/release/
srcdb_refseq[Properties]
–chicken
–honeybee
–sea urchin

28 RefSeq Benefits
non-redundancy
explicitly linked nucleotide and protein sequences
updates to reflect current sequence data and biology
data validation
format consistency
distinct accession series
stewardship by NCBI staff and collaborators

29 Selected RefSeq Accession Numbers
mRNAs and Proteins
  NM_123456  Curated mRNA
  NP_123456  Curated Protein
  NR_123456  Curated non-coding RNA
  XM_123456  Predicted mRNA
  XP_123456  Predicted Protein
  XR_123456  Predicted non-coding RNA
Gene Records
  NG_123456  Reference Genomic Sequence
Chromosome
  NC_123455  Microbial replicons, organelle genomes, human chromosomes
Assemblies
  NT_123456  Contig
  NW_123456  WGS Supercontig
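Because each RefSeq record type has its own two-letter prefix, an accession can be classified from its prefix alone. The following is a minimal sketch built only from the prefixes on this slide; the names `REFSEQ_PREFIXES`, `classify_refseq`, and `is_curated` are illustrative, and prefixes not listed here are simply reported as unknown.

```python
from typing import Optional

# Prefix meanings as given on the slide.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference genomic sequence",
    "NC": "Chromosome (microbial replicons, organelle genomes, human chromosomes)",
    "NT": "Contig",
    "NW": "WGS supercontig",
}


def classify_refseq(accession: str) -> Optional[str]:
    """Return the record type for a RefSeq accession, or None if the
    accession does not start with one of the prefixes listed above."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore:
        return None  # GenBank-style accessions have no underscore
    return REFSEQ_PREFIXES.get(prefix.upper())


def is_curated(accession: str) -> bool:
    """True for the curated transcript and protein series (NM, NP, NR)."""
    return accession.partition("_")[0].upper() in {"NM", "NP", "NR"}


if __name__ == "__main__":
    for acc in ("NM_000249", "NP_000240.1", "XM_123456", "AY182241"):
        print(acc, "->", classify_refseq(acc))
```

The underscore check is what separates RefSeq accessions from primary GenBank accessions such as AY182241, which reflects the "distinct accession series" benefit of RefSeq.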

30 GenBank to RefSeq

31 RefSeq: Genome Annotation
[Diagram. GenBank sequences are scanned against RefSeq genomic DNA (NC, NT, NW) to produce: Model mRNA (XM, XR), Curated mRNA (NM, NR), Model protein (XP), Curated protein (NP).]

32 Mouse Assembly
[Figure labels: RefSeq Contig, BAC, WGS, Other GenBank, RefSeq Transcript, UniGene Transcript]

33 Expressed Sequences UniGene GEO

34 What is UniGene?
A gene-oriented view of sequence entries
MegaBlast based automated sequence clustering
Now informed by genome hits (New!)
Nonredundant set of gene oriented clusters
Each cluster a unique gene
Information on tissue types and map locations
Includes known genes and uncharacterized ESTs
Useful for gene discovery and selection of mapping reagents

35 EST hits: Human mRNA Albumin mRNA 5’ EST hits 3’ EST hits

36 UniGene Chordates Invertebrates Plants Fungi et al.

37 Xenopus laevis MLH1 Cluster Uncharacterized ESTs

38 UniGene: Expressed Sequences

39 Expression Data

40 Other NCBI Databases
Structure: imported structures (PDB)
–Cn3D viewer, NCBI curation
CDD: conserved domain database
–Protein families (COGs and KOGs)
–Single domains (PFAM, SMART, CD)
dbSNP: nucleotide polymorphism
Gene: gene records
–Unifies LocusLink and Microbial Genomes

41 NCBI Structures and Domains

42 MMDB: Molecular Modeling Data Base
Derived from experimentally determined PDB records
Value added to PDB records including:
–Addition of explicit chemical graph information
–Validation (secondary structure elements)
–Inclusion of Taxonomy, Citation
–Conversion to ASN.1 data description language
Structure neighbors determined by Vector Alignment Search Tool (VAST)

43 Cn3D 4.1: Bacillus thuringiensis Toxin

44 VAST: Structure Neighbors
Vector Alignment Search Tool
For each protein chain, locate SSEs (secondary structure elements), and represent them as individual vectors.
Align the vectors.
[Figure: SSE vectors numbered 1 to 6 for Human IL-4; IL-4 and Leptin aligned.]

45 Protein Domains
Structural Domain
–Discrete independently folding unit of a protein
Conserved Domain (sequence-based)
–Protein region with recognizable position specific pattern of sequence conservation
Sequence-based domains often roughly correspond to structural domains
Domains often have distinct, identifiable functions

46 NCBI’s Conserved Domain Database
PSI-BLAST-based score matrices
Searchable with RPS-BLAST
Sources
–SMART
–PFAM
–COGs
–NCBI curated domains: structure informed alignments

47 Src Domains

48 Structure vs Conserved Domain SH2 SH3 TyrKC SH2 Conserved phosphotyrosine binding residues

49 NCBI Molecular Biology Resources Using Entrez

50 WWW Access Entrez & BLAST

51 Entrez: Database Integration
[Diagram. Databases: PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure, Genomes, Taxonomy. Link labels: Word weight, BLAST, VAST, Phylogeny, Hard Link, Neighbors (Related Sequences), Neighbors (Related Structures), BLink, Domains.]

52 Database Searching with Entrez
–Using limits and field restriction to find human MutL homolog
–Linking and neighboring with MutL
–Mapping SNPs onto structure and the genome

53 Global NCBI (Entrez) Search Human hereditary nonpolyposis colon cancer

54 Global Entrez Search Results

55 Nucleotide Sequences
Nucleotide database now three parts
–EST: expressed sequence tags
–GSS: genome survey sequences
–CoreNucleotide: everything else

56 Advanced Search Options Tabs

57 More Precise Nucleotides Search
nonpolyposis[All Fields] AND colon cancer[Title] AND human[Organism] AND biomol_mrna[Properties] AND srcdb_refseq[Properties]
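A query like this one is just a list of field-restricted terms joined with AND. As a small sketch of that composition (plain string building, with no call to any NCBI service), the hypothetical helpers `field_term` and `and_query` below reproduce the query from the slide.

```python
from typing import Iterable


def field_term(value: str, field: str) -> str:
    """Attach an Entrez field tag to a search value, e.g. human[Organism]."""
    if not value.strip() or not field.strip():
        raise ValueError("value and field must be non-empty")
    return f"{value.strip()}[{field.strip()}]"


def and_query(terms: Iterable[str]) -> str:
    """Join already-tagged terms with the boolean AND operator."""
    return " AND ".join(terms)


# The query shown on the slide, rebuilt term by term.
MUTL_QUERY = and_query([
    field_term("nonpolyposis", "All Fields"),
    field_term("colon cancer", "Title"),
    field_term("human", "Organism"),
    field_term("biomol_mrna", "Properties"),
    field_term("srcdb_refseq", "Properties"),
])

if __name__ == "__main__":
    print(MUTL_QUERY)
```

Each added term narrows the result set: the last two restrict the hits to mRNA molecules and to RefSeq records respectively.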

58 Useful Field Restrictions
[Title]: Definition line in GenBank / GenPept format, shown in Summary format
  glyceraldehyde 3 phosphate dehydrogenase[Title]
[Organism]: NCBI’s taxonomy. Organizing system for molecular databases
  mouse[organism]; green plants[organism]; Streptomyces coelicolor[organism]
[Properties]: molecule type, location, database source
  biomol_mrna[properties]; biomol_genomic[properties]; gene_in_mitochondrion[properties]; srcdb_pdb[properties]
[Filter]: subsets of data, Entrez links
  all[filter]; nucleotide_mapview[filter]; nucleotide_omim[filter]

59 Organism Field: NCBI’s Taxonomy

60 Useful Properties Field Terms
Molecule type
  biomol_mrna
  biomol_genomic
GenBank division
  gbdiv_est
  gbdiv_htg
  gbdiv_xxx
Gene location
  gene_in_mitochondrion
  gene_in_chloroplast
  gene_in_genomic
Source Database
  srcdb_refseq
  srcdb_pdb
  srcdb_swiss_prot

61 Human MutL RefSeq GenBank Records

62 NM_000249: Links

63 Literature Links OMIM

64 OMIM: Human Disease Genes Conserved Domain

65 Sequence Links Finding Homologs and Structures

66 Protein Link BLAST Link Conserved Domains

67 Related Proteins: Homologs and Redundancy Redundant Sequences Bacterial Homologs

68 BLink: BLAST Link top 200 only Redundant GIs

69 BLink: non-redundant relatives zebrafish homolog BLAST

70 Related Proteins: Structure Links

71 Structures

72 Short Cut: Related Structures

73 E. coli MutL Structure Cn3D viewer Conserved Domains 3D Domain Neighbors Structure Neighbors Pubchem compound

74 MLH1 Domain Structure: CDD ATPase Domain Mismatch Repair Domain

75 MLH1: ATPase Domain

76 Mapping Polymorphisms onto Structure

77 Entrez SNP "coding nonsynon"[Function Class]

78 GeneView: Variations Human MLH1 ATPase domain

79 Related Structures

80 Mapping Variation Onto Structure Conserved Asn Asn Ile Ile – Val

81 Genome Resources

82 Higher Genome Resources

83 Microbial Genomes

84 NM_000249: Genome Links

85 The Map Viewer Genome BLAST Previous Builds Available

86 Map Viewer: Human MLH1 Customizable NCBI Assembly EST Hits Gene Annotations Models Transcripts Download data and sequences

87 Maps and Options

88 Mapped Variations

89 Synteny: Mammalian Genomes

90 Homologene
No longer UniGene based
Protein similarities first
Guided by taxonomic tree
Includes orthologs and paralogs
[Figure: gene duplication of an early globin gene into an A-chain gene and a B-chain gene; branches frog A, chick A, mouse A and mouse B, chick B, frog B; paralogs and orthologs labeled.]

91 Homologene Cluster

92 Rice Homolog

93 The Gene Database
Gene Centered Information
Unifies LocusLink and microbial Genomes
2.4 million records for 3,822 taxa
Human        38,603    Sea Urchin        30,603
Chimpanzee   31,502    Mosquito          13,763
Mouse        60,746    Fruit Fly         21,116
Rat          38,117    C. elegans        20,935
Dog          20,154    Fungi            168,802
Cow          23,677    Green Plants      76,847
Chicken      18,469    Archaea           74,627
Zebrafish    38,594    Bacteria       1,361,390

94 Genes MLH1: One Stop Shopping

95 Genes MLH1: One Stop Shopping (cont.)

96 Genes: Display Options and Links

