NCBI Molecular Biology Resources

NCBI Molecular Biology Resources

The International Sequence Database Collaboration
NIH SRS getentry Entrez Submissions Updates Sequin BankIt WebIn SAKURA NCBI GenBank EMBL DDBJ EBI CIB NIG EMBL

Primary vs. Derivative Databases
C GA ATT GA C GA C ATT GA UniGene C Algorithms TATAGCCG Sequencing Centers ACGTGC ATTGACTA ACGTGC CGTGA TTGACA UniSTS EST GenBank Updated continually by NCBI STS Updated ONLY by submitters RefSeq: Annotation Pipeline GSS HTG INV VRT PHG VRL PRI ROD PLN MAM BCT ACGTGC RefSeq: LocusLink and Genomes Pipelines Curators TATAGCCG AGCTCCGATA CCGATGACAA Labs

Types of Databases Primary Databases Derivative Databases
Original submissions by experimentalists Remember biologyʼs Central Dogma: DNA → RNA → protein. Primary refers to one dimensional ʻsymbolʼ information written in sequential order necessary to specify a particular biological molecular entity, be it polypeptide or polynucleotide. Content controlled by the submitter Examples: GenBank, SNP, GEO, PubChem Substance Derivative Databases Built from primary data Content controlled by third party (NCBI) Examples: Refseq, TPA, RefSNP, UniGene, Protein, Structure, Conserved Domain, PubChem Compound Primary databases serve as a repository of experimentalist sequences (GenBank). Derivative databases are sources of edited/curated sequences (RefSeq…reference sequences, UniGene...genes compared to genetic loci on genomes)

What is Entrez? Entrez Global Query is an integrated search and retrieval system for databases of National Center for Biotechnology Information (NCBI). It provides access to all NCBI databases simultaneously with a single query string and user interface. Support boolean operators and search term tags to limit parts of the search statement to particular fields. This returns a unified results page, that shows the number of hits for the search in each of the databases, which are also links to actual search results for that particular database. A text search / retrieval engine NOT A DATABASE. A tool for finding biologically linked data. A virtual workspace for manipulating large datasets.

Entrez Databases Each record is assigned a UID
unique integer identifier for internal tracking GI number for Nucleotide Each record is given a Document Summary a summary of the record’s content (DocSum) Each record is assigned links to biologically related UIDs Each record is indexed by data fields [author], [title], [organism], and many others

The Entrez System Entrez Journals UniGene Books SNP PubMed UniSTS
Central UniSTS Nucleotide PopSet Protein Entrez ProbeSet DNA/RNA overview Genome Structure Taxonomy CDD 3D Domains OMIM

The Entrez System: Text Searches

Entrez Taxonomy The backbone of NCBI [organism]

An Entrez Database - Nucleotide
GenBank: Primary Data (97.9%) original submissions by experimentalists submitters retain editorial control of records archival in nature RefSeq: Derivative Data (2.1%) curated by NCBI staff NCBI retains editorial control of records record content is updated continually

What is GenBank? NCBI’s Primary Sequence Database
Nucleotide only sequence database Archival in nature Each record is assigned a stable accession number GenBank Data Direct submissions (traditional records ) Batch submissions (EST, GSS, STS) ftp accounts (genome data) Three collaborating databases GenBank DNA Database of Japan (DDBJ) European Molecular Biology Laboratory (EMBL) Database

A Traditional GenBank Record
LOCUS AY bp mRNA linear PLN 04-MAY-2004 DEFINITION Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA, complete cds. ACCESSION AY182241 VERSION AY GI: KEYWORDS . SOURCE Malus x domestica (cultivated apple) ORGANISM Malus x domestica Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus. REFERENCE 1 (bases 1 to 1931) AUTHORS Pechous,S.W. and Whitaker,B.D. TITLE Cloning and functional expression of an (E,E)-alpha-farnesene synthase cDNA from peel tissue of apple fruit JOURNAL Planta 219, (2004) REFERENCE 2 (bases 1 to 1931) TITLE Direct Submission JOURNAL Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab, USDA-ARS, Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD 20705, USA REFERENCE 3 (bases 1 to 1931) JOURNAL Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab, REMARK Sequence update by submitter COMMENT On Jun 26, 2003 this sequence version replaced gi: FEATURES Location/Qualifiers source /organism="Malus x domestica" /mol_type="mRNA" /cultivar="'Law Rome'" /db_xref="taxon:3750" /tissue_type="peel" gene /gene="AFS1" CDS /note="terpene synthase" /codon_start=1 /product="(E,E)-alpha-farnesene synthase" /protein_id="AAO " /db_xref="GI: " /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI LSLLFQPLVN" ORIGIN 1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat 61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg 121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt 181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga 241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt 1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt 1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa 1921 aaaaaaaaaa a // Header The Flatfile Format Feature Table Sequence

An Example Record – M17755 Field Indexed Terms
Indexing for Nucleotide UID Field Indexed Terms [primary accession] M17755 [title] Homo sapiens thyroid peroxidase (TPO) mRNA… [organism] Homo sapiens [sequence length] 3060 [modification date] 1999/04/26 [properties] biomol mrna gbdiv pri srcdb genbank

M17755: Feature Table TPO [gene name] CDS position in bp thyroiditis
[text word] thyroid peroxidase [protein name] protein accession

Sequence: 99.99% Accurate The sequence itself is not indexed…
Use BLAST for that!

RefSeq: NCBI’s Derivative Sequence Database
RefSeq database is a collection of taxonomically diverse, non-redundant and richly annotated sequences representing naturally occurring molecules of DNA, RNA, and protein. Non-redundant nucleotide and protein sequences from plasmids, organelles, viruses, archaea, bacteria, and eukaryotes. Updated to reflect current sequence data and biology Each RefSeq is constructed wholly from sequence data submitted to the International Nucleotide Sequence Database Collaboration (INSDC). Similar to a review article, a RefSeq integrates information across multiple sources at a given time hence provides a foundation for uniting sequence data with genetic and functional information. They are generated to provide reference standards for multiple purposes ranging from genome annotation to reporting locations of sequence variation in medical records.

The common Refseq accession prefix
Molecular type NC_ Complete genomic molecule (chromosome; microbial or organelle genome) NT_ Genomic contig NM_ Curated mRNA XM_ mRNA (Computed) NP_ Curated Protein XP_ Protein (Computed) NR_ Curated RNA XR_ RNA(Computed)

Entrez Gene and RefSeq Gene GenBank RefSeq Nucleotide Entrez Gene is the central depository for information about a gene available at NCBI, and often provides links to sites beyond NCBI Entrez Gene includes records for organisms that have NCBI Reference Sequences (RefSeqs) Entrez Gene records contain RefSeq mRNAs, proteins, and genomic DNA (if known) for a gene locus, plus links to other Entrez databases

Entrez Gene: RefSeq Annotations

Beyond RefSeq If your organism does not have RefSeqs…
UniGene : gene-based clusters of cDNAs and ESTs WGS sequences in Entrez Nucleotide (wgs[prop]) Trace Archive

What is UniGene? A gene-oriented view of sequence entries
MegaBlast based automated sequence clustering Now informed by genome hits New! Nonredundant set of gene oriented clusters Each cluster a unique gene Information on tissue types and map locations Includes known genes and uncharacterized ESTs Useful for gene discovery and selection of mapping reagents Clusters of ESTs based on automatic similarity. Each cluster represents a gene.

Organisms in UniGene Top Ten 1. Human 2. Rice 3. Mouse 4. Cow 5. Wheat
6. Zebrafish 7. Pig 8. Chicken 9. Frog (X. laevis) 10. Frog (X. tropicalis)

Finding UniGene Clusters
by link by Entrez search

UniGene Cluster for TPO

Entrez Protein GenPept (DDBJ, EMBL, GenBank) 4,444,405
RefSeq ,753,167 PIR ,395 Swiss Prot ,005 PDB ,621 PRF ,079 Third Party Annotation ,219 Total ,693,891 The Protein database is a collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt, PIR, PRF, and PDB.

Protein Sources and Links
PIR no mRNA! RefSeq  NM_000537 SWISS-PROT no mRNA! GenPept  M17755

PubMed is the NCBI gateway to MEDLINE.
MEDLINE contains bibliographic citations and author abstracts from over 4,000 journals published world-wide. It has 12 million records dating back to 1966. In order to impose uniformity and consistency to the indexing of biomedical literature MeSH vocabulary is used for indexing journal articles for MEDLINE. MeSH is the acronym for "Medical Subject Headings." MeSH is the list of the vocabulary terms used for subject analysis of biomedical literature at NLM.

Taxonomy Browser is… browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses) taxonomy information such as genetic codes molecular data on extinct organisms

Structure site includes…
Molecular Modelling Database (MMDB) biopolymer structures obtained from the Protein Data Bank (PDB) Cn3D (a 3D-structure viewer) vector alignment search tool (VAST)

OMIM is… Online Mendelian Inheritance in Man
catalog of human genes and genetic disorders edited by Dr. Victor McKusick, others at JHU

General Protein Databases
SWISS-PROT Manually curated high-quality annotations, less data GenPept/TREMBL Translated coding sequences from GenBank/EMBL Few annotations, more up to date PIR Phylogenetic-based annotations All 3 now combining efforts to form UniProt ( Unlike the main nucleotide sequence databases which all contain the same sequence data, the main public protein sequence databases all have their different “personality”. SWISS-PROT, which comes, not so surprisingly, from Switzerland, is a manually curated database. As such it has very high quality, consistent annotations, which make it very suitable to keyword searching. However, since the validation and annotation processes take time, SWISS-PROT does not contain as much data as other protein databases. SWISS-PROT has recently become private and as such is available freely only to academics. GenPept and TrEMBL are databases generated automatically by translating coding sequences (CDS features) from Genbank and EMBL respectively. TrEMBL sequences eventually get annotated and transferred into SWISS-PROT. The quality of annotations in these databases is not as high as that of SWISS-PROT but these databases are a lot more up to date than SWISS-PROT PIR is also annotated but its annotations are different from those of SWISS-PROT. The format is often less convenient for text searching but entries contain information about the record’s superfamily which is often hard to find in other databases.

Swissprot format

Non-redundant Databases
Sequence data only: cannot be browsed, can only be searched using a sequence Combine sequences from more than one database Examples: NR Nucleic (genbank+EMBL+DDBJ+PDB DNA) NR Protein (SWISS-PROT+TrEMBL+GenPept+PDB protein) The existence of several sequence databases which may not necessarily contain the same sequence information can present a problem when attempting to search as many sequences as possible: either one searches every database and gets mostly the same entries repeated across the many databases (not to mention the extra work of repeating the same search with different databases) or one searches only a subset of available databases and risks missing out on some valuable entry. This concern is now addressed by creating non-redundant databases which combine several databases together and remove the duplicate entries. So for example, the non-redundant nucleotide database maintained at NCBI contains Genbank plus the sequences in EMBL which are not in Genbank plus the sequences in DDBJ which are not Genbank or EMBL plus the sequences in PDB Nucleic which are not in any of the other 3 databases. Non-redundant databases built from a collection of databases usually contain only sequence data since the annotation format is not consistent across databases. Note also that in the process of removing duplicate sequences, some very similar but not perfectly identical sequences (including variants, mutations etc) can be removed from the non-redundant database.

Protein domain databases
Pfam ( Collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families SMART (a Simple Modular Architecture Research Tool) Identification and annotation of genetically mobile domains and the analysis of domain architectures ( CDD ( Combines SMART and Pfam databases Easier and quicker search

Sequence Motif Databases
Scan Prosite ( and PRINTS ( Store conserved motifs occurring in nucleic acid or protein sequences Motifs can be stored as consensus sequences, alignments, or using statistical representations such as residue frequency tables A sequence motif is a pattern of conservation in a group of sequences which reflects a common function or structure in these sequences. Motifs can be represented in several ways in a computer, and the type of analysis that can be performed with the motif depends on the representation. For example, the most common representation for a motif is a consensus sequence which shows which residues are allowed or not allowed at each position of the sequence. This type of representation is very common in the literature but has a number of drawbacks in that it provides only a qualitative description of the motif. As a result consensus sequences are either too specific and do not represent all the motifs they are meant to be represent, or they are not specific enough and match false positives. Other representations of motifs include multiple sequence alignments of the sequences, and numerical representations such as profiles and weight matrices which assign a “weight” to each possible residue in the motif. Hidden Markov models (HMMs) are statistical representations of motifs where each possible residue at each position is assigned a probability which can be dependent not only on the position but also on its neighborhood.

NCBI Molecular Biology Resources

Similar presentations

Presentation on theme: "NCBI Molecular Biology Resources"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

NCBI Molecular Biology Resources

Similar presentations

Presentation on theme: "NCBI Molecular Biology Resources"— Presentation transcript:

Similar presentations

About project

Feedback