Databases Vasileios Hatzivassiloglou University of Texas at Dallas.



Databases
- Massive, independently developed databases
- Sponsored by national institutes of biology/bioinformatics/health in the U.S., Europe, and Japan
- Allow for search
- Allow for entry of information by researchers, subject to curation
- Cross-linked

GenBank
- Developed and maintained by the U.S. National Center for Biotechnology Information (NCBI), part of the National Library of Medicine (NLM), which is part of the National Institutes of Health (NIH)
- Repository of gene information
- Provides DNA and literature search, and comparison tools

GenBank statistics
- Dynamically updated
- A new version is released as a flat file every two months
- Current version (15 August 2008):
  - 92.7 million sequences (76.1 million one year ago, 61.1 million two years ago)
  - 95.0 billion base pairs (79.5 billion one year ago, 65.3 billion two years ago)
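The figures on this slide imply growth of roughly a fifth to a quarter per year. A minimal sketch of that arithmetic follows; the function and variable names are illustrative, and the only inputs are the numbers quoted on the slide.

```python
# Figures quoted on the slide (15 August 2008), listed newest first:
# now, one year ago, two years ago.
SEQUENCES_MILLIONS = [92.7, 76.1, 61.1]
BASE_PAIRS_BILLIONS = [95.0, 79.5, 65.3]


def yearly_growth(values_newest_first):
    """Return the percentage growth for each consecutive pair of years."""
    return [
        (newer / older - 1.0) * 100.0
        for newer, older in zip(values_newest_first, values_newest_first[1:])
    ]


if __name__ == "__main__":
    for label, series in [("sequences", SEQUENCES_MILLIONS),
                          ("base pairs", BASE_PAIRS_BILLIONS)]:
        rates = ", ".join(f"{rate:.1f}%" for rate in yearly_growth(series))
        print(f"{label}: {rates}")
```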

GenBank growth

Sample GenBank record
LOCUS       SCU bp DNA PLN 21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U GI:
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1 (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for
            DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), (1994)
  MEDLINE
  PUBMED
REFERENCE   2 (bases 1 to 5028)
  AUTHORS   Roemer,T.,...
FEATURES    (location, CDS, 5′ UTR, 3′ UTR, promoter, alternative splicing,...)
BASE COUNT  1510 a 1074 c 835 g 1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
      121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa...
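The flat-file layout of the record above can be read with a few lines of code. The sketch below is an illustration, not a full GenBank parser: it assumes top-level keywords start in the first column and occupy a 12-character field, folds indented sub-keyword lines (such as ORGANISM) into their parent field, and uses a shortened copy of the sample record. The function names are hypothetical.

```python
import re

# Shortened copy of the sample record; only the first two sequence lines are kept.
SAMPLE = """\
LOCUS       SCU bp DNA PLN 21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds.
ACCESSION   U49845
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
BASE COUNT  1510 a 1074 c 835 g 1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
//
"""


def parse_flat_file(text):
    """Split a GenBank-style record into top-level fields and the sequence."""
    fields = {}
    sequence = []
    current = None
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):      # record terminator
            break
        if not line.strip():           # skip blank lines
            continue
        if in_origin:
            # Sequence lines: a position number followed by blocks of bases.
            sequence.append(re.sub(r"[^a-z]", "", line.lower()))
            continue
        if line[:1].strip():           # top-level keyword starts in column 1
            keyword = line[:12].strip()
            if keyword == "ORIGIN":
                in_origin = True
                continue
            current = keyword
            fields[current] = line[12:].strip()
        elif current is not None:
            # Indented line: continuation or sub-keyword, folded into the parent.
            fields[current] += " " + line.strip()
    return fields, "".join(sequence)


def base_count_total(base_count_value):
    """Sum the per-base counts in a BASE COUNT value such as '1510 a 1074 c ...'."""
    return sum(int(token) for token in base_count_value.split() if token.isdigit())


if __name__ == "__main__":
    fields, seq = parse_flat_file(SAMPLE)
    print(fields["ACCESSION"], len(seq), base_count_total(fields["BASE COUNT"]))
```

The BASE COUNT line doubles as a consistency check: its four counts sum to 5028, the sequence length given in the REFERENCE lines ("bases 1 to 5028").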

Genome Map Viewer Online at

SwissProt
- Maintained by the Swiss Institute of Bioinformatics
- Central repository of proteomic data
- Part of the Swiss ExPASy (Expert Protein Analysis System)
- Currently (Release 56.1, 2 Sept 2008):
  - 397,500 sequences (283,400 one year ago)
  - 143 million amino acids (104 million one year ago)

SwissProt Growth

TrEMBL
- The non-curated counterpart to SwissProt
- Computer-annotated protein sequences awaiting curation
- Current release 39.2 (2 Sept 2008):
  - 6.2 million proteins (4.7 million a year ago)
  - 2 billion amino acids (1.5 billion a year ago)
- SwissProt has entries for only 6.5% of TrEMBL (6% a year ago)
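The curated share quoted on this slide can be approximated from the entry counts on the SwissProt and TrEMBL slides. The sketch below uses only those rounded counts, so the results land close to, but not exactly on, the slide's 6.5% and 6%; the names are illustrative.

```python
# Entry counts as quoted on the slides (rounded there).
SWISSPROT_ENTRIES = {"Sept 2008": 397_500, "one year earlier": 283_400}
TREMBL_ENTRIES = {"Sept 2008": 6_200_000, "one year earlier": 4_700_000}


def curated_share(curated, uncurated):
    """Curated entries as a percentage of the non-curated entry count."""
    return 100.0 * curated / uncurated


if __name__ == "__main__":
    for period in SWISSPROT_ENTRIES:
        share = curated_share(SWISSPROT_ENTRIES[period], TREMBL_ENTRIES[period])
        print(f"{period}: {share:.1f}%")
```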

SwissProt record
Entry information
- Entry name: IL3_HUMAN
- Primary accession number: P08700
- Secondary accession numbers: None
- Entered in Swiss-Prot in: Release 06, January 1988
- Sequence was last modified in: Release 12, October 1989 (revision 2)
- Annotations were last modified in: Incremental Release, July 22, 2008 (revision 95)
Name and origin of the protein
- Protein name: Interleukin-3 [Precursor]
- Synonyms: IL-3; Multipotential colony-stimulating factor; Hematopoietic growth factor; P-cell stimulating factor; Mast-cell growth factor; MCGF
- Gene name: IL3
- From: Homo sapiens (Human) [TaxID: 9606]
- Taxonomy: Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Catarrhini; Hominidae; Homo.
References
- [1] NUCLEOTIDE SEQUENCE. DOI= / (87)90254-X; PubMed= [NCBI, ExPASy, EBI, Israel, Japan]
  Dorssers L., Burger H., Bot F., Delwel R., Geurts van Kessel A.H.M., Loewenberg B., Wagemaker G.; "Characterization of a human multilineage-colony-stimulating factor cDNA clone identified by a conserved noncoding sequence in mouse interleukin-3."; Gene 55: (1987).
Cross-references (links to databases for sequence, gene expression, 3D structure, interactions,...)
Features (functional and structural components of the protein)
Sequence information (152 amino acids)
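Two fields of this record have a simple internal structure that is easy to take apart: the entry name (a protein mnemonic and a species mnemonic joined by an underscore, as in IL3_HUMAN) and the semicolon-separated taxonomy lineage. The helpers below are a small sketch with hypothetical names, not part of any Swiss-Prot tooling.

```python
def split_entry_name(entry_name):
    """Split an entry name such as 'IL3_HUMAN' into (protein, species) mnemonics."""
    protein, _, species = entry_name.partition("_")
    if not protein or not species:
        raise ValueError(f"not of the form PROTEIN_SPECIES: {entry_name!r}")
    return protein, species


def parse_lineage(taxonomy):
    """Turn 'Eukaryota; Metazoa; ...; Homo.' into an ordered list of taxa."""
    return [taxon.strip() for taxon in taxonomy.rstrip(".").split(";") if taxon.strip()]


if __name__ == "__main__":
    lineage = parse_lineage(
        "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; "
        "Mammalia; Eutheria; Euarchontoglires; Primates; Catarrhini; "
        "Hominidae; Homo."
    )
    print(split_entry_name("IL3_HUMAN"), lineage[0], lineage[-1])
```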

Cross-References and Features
- Entry for Human Interleukin-3
- Feature viewer for the same protein: http:// bin/ft_viewer.pl?P08700

The Gene Ontology (GO)
- Hierarchical classification of genes, cross-linked across species
- Classification of related terminology
- Searchable via AmiGO
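A hierarchical classification like this is typically stored as a graph in which each term points to its more general parents; in GO a term may have more than one parent. The sketch below collects all ancestors of a term in such a structure. The term names are made up for illustration and are not copied from GO.

```python
# Toy hierarchy: each term maps to its direct parents.
# These labels are invented for the example, not real GO terms or IDs.
TOY_PARENTS = {
    "process": [],
    "cell signaling": ["process"],
    "growth": ["process"],
    "growth factor signaling": ["cell signaling", "growth"],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


if __name__ == "__main__":
    print(sorted(ancestors("growth factor signaling", TOY_PARENTS)))
```

Annotating a gene with a specific term implicitly annotates it with all of that term's ancestors, which is why a search on a general term can also return genes annotated only with its descendants.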

GO ontology fragment

Browsing with AmiGO
- AmiGO starting point for browsing and search: http:// bin/amigo/go.cgi?search_constraint=terms&action=replace_tree

PubMed
- Interface to MEDLINE, NLM’s searchable index of publications in the biomedical field
- More than 15 million records, dating back to the 1950s
- Can retrieve abstracts and citation details, but not full text
- Organized via the MeSH metathesaurus
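PubMed can also be queried programmatically. The slides do not cover this, so the following is an assumption-laden sketch: it builds a search URL for NCBI's E-utilities ESearch endpoint using the `db`, `term`, `retmax`, and `retmode` parameters, and it only constructs the URL without sending a request. The function name and the example query are illustrative.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (not mentioned in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, max_results=20):
    """Build an ESearch URL that would return matching PubMed IDs as JSON."""
    params = {
        "db": "pubmed",         # search the PubMed database
        "term": term,           # the query, e.g. keywords joined by AND/OR
        "retmax": max_results,  # cap on the number of IDs returned
        "retmode": "json",
    }
    return f"{ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    print(pubmed_search_url("interleukin-3 AND human"))
```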

MeSH
- Controlled vocabulary of 24,700 subject headings or descriptors
- Each linked to synonymous entry terms (151,000 of those)
- Articles in MEDLINE are indexed using subject headings; each gets 1-2 major and about 10 other MeSH terms
- During search, entry terms are mapped to descriptors and related terms are added to the query (query expansion)
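The query expansion step described above can be sketched in a few lines: each user term is mapped to its descriptor if it is a known entry term, and the descriptor's related terms are then added. The two lookup tables here are tiny illustrative stand-ins, not extracts from MeSH, and the function name is hypothetical.

```python
# Illustrative stand-ins for the MeSH tables; not real MeSH data.
ENTRY_TO_DESCRIPTOR = {
    "heart attack": "Myocardial Infarction",
    "cancer": "Neoplasms",
    "tumor": "Neoplasms",
}
RELATED_TERMS = {
    "Myocardial Infarction": ["Myocardial Ischemia"],
}


def expand_query(user_terms):
    """Map entry terms to descriptors, then append each descriptor's related terms."""
    expanded = []
    for term in user_terms:
        descriptor = ENTRY_TO_DESCRIPTOR.get(term.lower(), term)
        for candidate in [descriptor, *RELATED_TERMS.get(descriptor, [])]:
            if candidate not in expanded:  # keep order, drop duplicates
                expanded.append(candidate)
    return expanded


if __name__ == "__main__":
    print(expand_query(["Heart attack", "tumor", "cancer"]))
```

Because synonyms collapse onto one descriptor, "tumor" and "cancer" contribute a single search term here, which is the point of indexing with a controlled vocabulary.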

Medline record example
Disambiguating proteins, genes, and RNA in text: a machine learning approach.
Author(s): Hatzivassiloglou V; Duboué PA; Rzhetsky A
Author's Address: Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, New York, NY 10027, USA.
Source: Bioinformatics. [Bioinformatics] 2001; Vol. 17 Suppl 1, pp. S
Publication Type: Evaluation Studies; Journal Article
Language: English
Journal Information: Country of Publication: England NLM ID: ISSN: Subsets: MEDLINE
MeSH Terms: Artificial Intelligence*; Genes*; Proteins*; RNA*; Algorithms; Bayes Theorem; Comparative Study; Computational Biology; Data Collection; Natural Language Processing; Research Support, Non-U.S. Gov't
Abstract: We present an automated system for assigning protein, gene, or mRNA class labels to biological terms in free text. Three machine learning algorithms and several extended ways for defining contextual features for disambiguation are examined, and a fully unsupervised manner for obtaining training examples is proposed. We train and evaluate our system over a collection of 9 million words of molecular biology journal articles, obtaining accuracy rates up to 85%.
CAS Registry Number: 0 (Proteins) (RNA)
Entry Date(s): Date Created: Date Completed: Latest Revision: Update Code:
PMID:
Database: MEDLINE
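In this record the MeSH terms marked with an asterisk are the major topics of the article. A short sketch of separating major from other terms follows; it assumes the terms are already available as a list (splitting on commas would break "Research Support, Non-U.S. Gov't"), and the function name is hypothetical.

```python
# MeSH terms exactly as listed in the record above.
RECORD_MESH_TERMS = [
    "Artificial Intelligence*", "Genes*", "Proteins*", "RNA*",
    "Algorithms", "Bayes Theorem", "Comparative Study",
    "Computational Biology", "Data Collection",
    "Natural Language Processing", "Research Support, Non-U.S. Gov't",
]


def split_mesh_terms(terms):
    """Return (major, other), where major terms carry a trailing asterisk."""
    major, other = [], []
    for term in terms:
        cleaned = term.strip()
        if cleaned.endswith("*"):
            major.append(cleaned.rstrip("*").strip())
        else:
            other.append(cleaned)
    return major, other


if __name__ == "__main__":
    major, other = split_mesh_terms(RECORD_MESH_TERMS)
    print(len(major), "major terms;", len(other), "other terms")
```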