Databases Vasileios Hatzivassiloglou University of Texas at Dallas
Databases: massive, independently developed databases; sponsored by national institutes of biology, bioinformatics, and health in the U.S., Europe, and Japan; allow for search; allow for entry of information by researchers, subject to curation; cross-linked.
GenBank: developed and maintained by the U.S. National Center for Biotechnology Information (NCBI), part of the National Library of Medicine (NLM), which is part of the National Institutes of Health (NIH); repository of gene information; provides DNA and literature search and comparison tools.
GenBank statistics: dynamically updated; a new version is released as a flat file every two months; the current version (15 August 2008) holds 92.7 million sequences (76.1/61.1 million one/two years ago) and 95.0 billion base pairs (79.5/65.3 billion one/two years ago).
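The year-over-year figures on this slide imply a growth rate that is easy to work out. A minimal sketch, using only the numbers quoted above; the helper names `percent_growth` and `yearly_growth` are illustrative and not part of any GenBank tool:

```python
def percent_growth(current: float, previous: float) -> float:
    """Percentage increase from `previous` to `current`."""
    return (current / previous - 1.0) * 100.0


def yearly_growth(series: list[float]) -> list[float]:
    """Growth rates between consecutive entries of a newest-first series."""
    return [percent_growth(newer, older) for newer, older in zip(series, series[1:])]


# Figures from the slide, newest first: August 2008, one year ago, two years ago.
sequences_millions = [92.7, 76.1, 61.1]
base_pairs_billions = [95.0, 79.5, 65.3]

sequence_growth = yearly_growth(sequences_millions)
base_pair_growth = yearly_growth(base_pairs_billions)

for label, rates in [("sequences", sequence_growth), ("base pairs", base_pair_growth)]:
    print(f"{label}: " + ", ".join(f"{rate:.1f}%" for rate in rates))
```

On these figures both measures grew by roughly 20 to 25 percent per year over the two years shown.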
GenBank growth
Sample GenBank record
LOCUS       SCU bp DNA PLN 21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U GI:
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1 (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), (1994)
  MEDLINE
  PUBMED
REFERENCE   2 (bases 1 to 5028)
  AUTHORS   Roemer,T.,...
FEATURES    (location, CDS, 5′ UTR, 3′ UTR, promoter, alternative splicing,...)
BASE COUNT  1510 a 1074 c 835 g 1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
      121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa...
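Because the record is a flat file with a keyword column followed by values, it can be read with a few lines of code. The sketch below is a simplified reader, not NCBI's parser. It assumes the usual flat-file layout (keywords in the first 12 columns, indented continuation lines, sequence after ORIGIN, `//` as terminator), and `SAMPLE` is an abbreviated excerpt built only from fields that appear in full on the slide. The function name `parse_record` is hypothetical.

```python
import re

# Abbreviated excerpt of the record above; only the first 180 bases are included.
SAMPLE = """\
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
SOURCE      Saccharomyces cerevisiae (baker's yeast)
BASE COUNT  1510 a 1074 c 835 g 1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
      121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa
//
"""


def parse_record(text):
    """Return (fields, sequence) for one record.

    Simplification: a repeated keyword (e.g. REFERENCE) overwrites the
    earlier value, which is acceptable for this excerpt.
    """
    fields = {}
    sequence_parts = []
    current = None
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if not line.strip():
            continue
        if in_origin:
            # Sequence lines: drop the position number and the spaces.
            sequence_parts.append(re.sub(r"[^a-z]", "", line.lower()))
            continue
        keyword, value = line[:12].strip(), line[12:].strip()
        if keyword == "ORIGIN":
            in_origin = True
        elif keyword:
            current = keyword
            fields[current] = value
        elif current is not None:
            fields[current] += " " + value  # continuation line
    return fields, "".join(sequence_parts)


fields, sequence = parse_record(SAMPLE)

# "1510 a 1074 c 835 g 1609 t" -> {"a": 1510, "c": 1074, "g": 835, "t": 1609}
tokens = fields["BASE COUNT"].split()
declared = {tokens[i + 1]: int(tokens[i]) for i in range(0, len(tokens), 2)}

print(fields["ACCESSION"], len(sequence), sum(declared.values()))
```

The declared base counts sum to 5028, which agrees with the "bases 1 to 5028" range in the REFERENCE lines. The excerpt itself carries only the first three sequence lines.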
Genome Map Viewer Online at
SwissProt: maintained by the Swiss Institute of Bioinformatics; central repository of proteomic data; part of the Swiss ExPASy (Expert Protein Analysis System); currently (Release 56.1, 2 Sept 2008) 397,500 sequences (283,400 one year ago) and 143 million amino acids (104 million one year ago).
SwissProt Growth
TrEMBL: the non-curated counterpart to SwissProt; computer-annotated protein sequences awaiting curation; current release 39.2 (2 Sept 2008) holds 6.2 million proteins (4.7 million a year ago) and 2 billion amino acids (1.5 billion a year ago); SwissProt has entries for only 6.5% of TrEMBL (6% a year ago).
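The coverage percentage can be checked against the entry counts quoted on the SwissProt and TrEMBL slides. A small sketch, assuming coverage means SwissProt entries divided by TrEMBL entries; the function name `coverage_percent` is illustrative:

```python
def coverage_percent(curated: float, uncurated: float) -> float:
    """Curated entries as a percentage of the non-curated collection."""
    return curated / uncurated * 100.0


# Entry counts as quoted on the slides (TrEMBL counts are rounded to 0.1 million).
swissprot_now, trembl_now = 397_500, 6.2e6
swissprot_prev, trembl_prev = 283_400, 4.7e6

coverage_now = coverage_percent(swissprot_now, trembl_now)
coverage_prev = coverage_percent(swissprot_prev, trembl_prev)

print(f"now: {coverage_now:.1f}%, a year ago: {coverage_prev:.1f}%")
```

With the rounded counts, the result lands close to the 6.5% and 6% stated on the slide.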
SwissProt record
Entry information
– Entry name: IL3_HUMAN
– Primary accession number: P08700
– Secondary accession numbers: None
– Entered in Swiss-Prot in Release 06, January 1988
– Sequence was last modified in Release 12, October 1989 (revision 2)
– Annotations were last modified in Incremental Release, July 22, 2008 (revision 95)
Name and origin of the protein
– Protein name: Interleukin-3 [Precursor]
– Synonyms: IL-3; Multipotential colony-stimulating factor; Hematopoietic growth factor; P-cell stimulating factor; Mast-cell growth factor; MCGF
– Gene name: IL3
– From: Homo sapiens (Human) [TaxID: 9606]
– Taxonomy: Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Catarrhini; Hominidae; Homo.
References
– [1] NUCLEOTIDE SEQUENCE. DOI= / (87)90254-X; PubMed= [NCBI, ExPASy, EBI, Israel, Japan] Dorssers L., Burger H., Bot F., Delwel R., Geurts van Kessel A.H.M., Loewenberg B., Wagemaker G.; "Characterization of a human multilineage-colony-stimulating factor cDNA clone identified by a conserved noncoding sequence in mouse interleukin-3."; Gene 55: (1987).
Cross-references (links to databases for sequence, gene expression, 3D structure, interactions,...)
Features (functional and structural components of the protein)
Sequence information (152 amino acids)
Cross-References and Features: entry for Human Interleukin-3; feature viewer for the same protein – bin/ft_viewer.pl?P08700
The Gene Ontology (GO): hierarchical classification of genes, cross-linked across species; classification of related terminology; searchable via AmiGO.
GO ontology fragment
Browsing with AmiGO: starting point for browsing and search at bin/amigo/go.cgi?search_constraint=terms&action=replace_tree
PubMed: interface to MEDLINE, NLM’s searchable index of publications in the biomedical field; more than 15 million records since the 1950s; can retrieve abstracts and citation details, but not full text; organized via the MeSH thesaurus.
MeSH: controlled vocabulary of 24,700 subject headings, or descriptors; each is linked to synonymous entry terms (151,000 of those); articles in Medline are indexed using subject headings, and each article gets 1-2 major and about 10 other MeSH terms; during search, entry terms are mapped to descriptors and related terms are added to the query (query expansion).
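The query expansion step can be sketched in a few lines. This is a toy illustration, not PubMed's actual term mapping: the two dictionaries are small hand-made samples rather than the real MeSH files, the "related" terms are chosen only for illustration, and the function name `expand_query` is hypothetical.

```python
# Hypothetical miniature vocabulary: entry term (lowercase) -> descriptor.
ENTRY_TERM_TO_DESCRIPTOR = {
    "heart attack": "Myocardial Infarction",
    "myocardial infarct": "Myocardial Infarction",
    "cancer": "Neoplasms",
    "tumor": "Neoplasms",
}

# Hypothetical related terms for each descriptor (illustrative only).
RELATED_TERMS = {
    "Myocardial Infarction": ["Shock, Cardiogenic"],
    "Neoplasms": ["Carcinoma", "Sarcoma"],
}


def expand_query(user_term: str) -> list[str]:
    """Map an entry term to its descriptor and append the related terms.

    Unknown terms are returned unchanged, as a single-element list.
    """
    descriptor = ENTRY_TERM_TO_DESCRIPTOR.get(user_term.strip().lower())
    if descriptor is None:
        return [user_term]
    return [user_term, descriptor] + RELATED_TERMS.get(descriptor, [])


def to_boolean_query(terms: list[str]) -> str:
    """Join the expanded terms into a single OR query string."""
    return " OR ".join(f'"{term}"' for term in terms)


expanded = expand_query("Heart Attack")
print(to_boolean_query(expanded))
```

A user who types an entry term thus also retrieves articles indexed under the descriptor and its related terms, even when the typed wording never appears in the article record.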
Medline record example
Disambiguating proteins, genes, and RNA in text: a machine learning approach.
Author(s): Hatzivassiloglou V; Duboué PA; Rzhetsky A
Author's Address: Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, New York, NY 10027, USA.
Source: Bioinformatics. [Bioinformatics] 2001; Vol. 17 Suppl 1, pp. S
Publication Type: Evaluation Studies; Journal Article
Language: English
Journal Information: Country of Publication: England NLM ID: ISSN:
Subsets: MEDLINE
MeSH Terms: Artificial Intelligence*; Genes*; Proteins*; RNA*; Algorithms; Bayes Theorem; Comparative Study; Computational Biology; Data Collection; Natural Language Processing; Research Support, Non-U.S. Gov't
Abstract: We present an automated system for assigning protein, gene, or mRNA class labels to biological terms in free text. Three machine learning algorithms and several extended ways for defining contextual features for disambiguation are examined, and a fully unsupervised manner for obtaining training examples is proposed. We train and evaluate our system over a collection of 9 million words of molecular biology journal articles, obtaining accuracy rates up to 85%.
CAS Registry Number: 0 (Proteins) (RNA)
Entry Date(s): Date Created: Date Completed: Latest Revision:
Update Code:
PMID:
Database: MEDLINE
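In the MeSH Terms field of this record, a trailing asterisk marks a major heading. The sketch below separates major from other headings, assuming the field has been normalized so that terms are separated by semicolons; the function name `split_mesh_terms` is illustrative.

```python
# MeSH Terms field of the record above, with terms separated by semicolons.
MESH_FIELD = (
    "Artificial Intelligence*; Genes*; Proteins*; RNA*; Algorithms; "
    "Bayes Theorem; Comparative Study; Computational Biology; "
    "Data Collection; Natural Language Processing; "
    "Research Support, Non-U.S. Gov't"
)


def split_mesh_terms(field: str) -> tuple[list[str], list[str]]:
    """Split a MeSH Terms field into (major, other) headings.

    A trailing asterisk marks a major heading and is removed from the name.
    """
    major, other = [], []
    for raw in field.split(";"):
        term = raw.strip()
        if not term:
            continue
        if term.endswith("*"):
            major.append(term[:-1].strip())
        else:
            other.append(term)
    return major, other


major_terms, other_terms = split_mesh_terms(MESH_FIELD)
print("major:", major_terms)
print("other:", other_terms)
```

Splitting on semicolons rather than commas keeps headings such as "Research Support, Non-U.S. Gov't" intact.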