Linking Diseases and Genes through Informatics Knowledge Bases and Ontologies Joyce A. Mitchell, Ph.D. National Library of Medicine University of Missouri.

Presentation transcript:

Linking Diseases and Genes through Informatics Knowledge Bases and Ontologies Joyce A. Mitchell, Ph.D. National Library of Medicine University of Missouri

2 Research Collaborators Olivier Bodenreider, M.D., Ph.D. Alexa T. McCray, Ph.D. Allen C. Browne

3 Research Goals
Investigating methods of connecting disease and genomic information. Overall goals are to:
– Overcome difficulties traversing multiple information resources
– Examine coverage of Unified Medical Language System® (UMLS®), Gene Ontology™ (GO), LocusLink-OMIM
– Develop methods to use ontologies more effectively
– Present data in an understandable manner

4 Background – UMLS
Developed and maintained by NLM
Purpose: facilitate retrieval & integration of information from multiple biomedical sources
Interrelates 60 biomedical terminologies
– MeSH, SNOMED, Read Codes, ICD, etc.
– No vocabulary focused on molecular biology
1.5 million English terms; 800,000 concepts; 134 Semantic Types; 54 Semantic Relationships

5 Background – Gene Ontology
Developed and maintained by the GO Consortium
Purpose:
– Promotes cross-species methodologies for functional comparisons
– Allows annotation of molecular information on genes, gene products
– “an essential start to creating a shared language of biology” **
Focused on
– molecular function (5626 terms)
– biological processes (4677 terms)
– cellular components (1077 terms)
Two semantic relations (is-a and part-of)
**Genome Research 2001; 11:

6 Background – LocusLink
Curated, gene-centered resource of the National Center for Biotechnology Information (NLM)
– Gene names, gene product names, gene product functions, and reference sequences (DNA, RNA, protein)
– Associates phenotype (diseases) to the genotype via Online Mendelian Inheritance in Man (OMIM)
– Online links to major bioinformatics knowledge bases and the literature

7 Specific Questions
This study looked at coverage in UMLS of
1. genes associated with human diseases
2. diseases associated with the genes
3. 11,380 Gene Ontology terms
4. 38,832 genes/gene products in GO database (141,071 names)
5. Associations of genes and their functions in UMLS
6. Representation of gene function in GO compared to the UMLS

8 Methods
LocusLink query:
– human genes whose sequence is known and associated with disease (1244 loci)
LocusLink data:
– Genes/gene products (official names, synonyms, symbols)
– Phenotypes (diseases) (1702 diseases)
GO data:
– all concepts (ontology terms), excluding obsolete terms (11,380 terms)
– Gene products from all species (134,646 unique names, 38,832 genes)

9 Methods
LocusLink and GO terms mapped to UMLS concepts
– normalization used
– mappings constrained by semantic type
LocusLink loci studied for relationships in UMLS
– Gene/GP – phenotype
– Gene/GP – molecular function
– Gene/GP – biological process
– Gene/GP – cellular component
For specific genes, annotations in GO were compared to the representation in UMLS
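The mapping step above can be illustrated with a minimal sketch. The slides do not describe the normalization in detail, so the `normalize` function here is a simplified stand-in (lowercase, strip punctuation, sort words), and the concept table, the `C-EXAMPLE-*` identifiers, and the semantic type labels are made up for illustration only.

```python
import re

def normalize(term: str) -> str:
    """Simplified stand-in for normalization: lowercase, drop punctuation, sort words."""
    words = re.sub(r"[^a-z0-9 ]+", " ", term.lower()).split()
    return " ".join(sorted(words))

# Hypothetical concept table: normalized string -> list of (concept id, semantic type).
CONCEPTS = {}

def add_concept(name: str, concept_id: str, semantic_type: str) -> None:
    CONCEPTS.setdefault(normalize(name), []).append((concept_id, semantic_type))

# Illustrative entries only; identifiers are not real UMLS CUIs.
add_concept("Cystic Fibrosis", "C-EXAMPLE-1", "Disease or Syndrome")
add_concept("Mitotic Spindle", "C-EXAMPLE-2", "Cell Component")

def map_term(term: str, allowed_types: set) -> list:
    """Return concept ids whose normalized name matches and whose type is allowed."""
    candidates = CONCEPTS.get(normalize(term), [])
    return [cid for cid, stype in candidates if stype in allowed_types]

# A phenotype name matches despite word order and punctuation differences.
print(map_term("Fibrosis, Cystic", {"Disease or Syndrome"}))
# The same string is rejected when the semantic type constraint does not fit.
print(map_term("Cystic Fibrosis", {"Cell Component"}))
```

Constraining by semantic type keeps a gene symbol from being matched to an unrelated concept that happens to share the same string.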

10 Results - 1
For 1244 genes from LocusLink – 18% found in the UMLS
Category              Found   Count
Official gene name    20%     244/1244
Official gene symbol  16%     200/1244
Alias symbol          15%     394/2669
Gene product          18%     266/1460
Preferred product     18%     266/1460
Alias protein         24%     339/1425

11 Results - 2
For 1702 phenotypes (diseases) corresponding to 1244 genes – 34% found in the UMLS (575/1702)
Most frequent single gene diseases covered
– Huntington Disease
– Cystic Fibrosis
– Marfan Syndrome
– Phenylketonuria
– Achondroplasia

12 Results - 3
GO terms found in MeSH:    2764 terms
GO terms found in SNOMED:  1366 terms
GO terms found overall:    27% (3062/11,380)
Category            Found   Count
Molecular function  44%     2435/5626
Biological process   5%     256/4677
Cellular component  35%     370/1077

13 Results - 4
For 134,646 unique gene names in GO database
Name type   Found   Count
Full name   11%     4392/38,832
Symbol       2%     1167/60,381
Synonym      6%     1964/35,433

14 Results - 5
LocusLink – UMLS Relationship Categories found overall: 72%
Genes & gene products and:
Category      Found   Count
Phenotype     64%     754/1182
M. Function   85%     1192/1409
B. Process    61%     762/1240
C. Component  76%     841/1107
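As a quick arithmetic check, pooling the four category counts on this slide reproduces the stated overall figure. This is a sketch under the assumption that the overall 72% is the pooled ratio of found to total relationships; the slides do not say how it was computed.

```python
# Counts copied from slide 14: category -> (relationships found, total examined).
counts = {
    "Phenotype": (754, 1182),
    "M. Function": (1192, 1409),
    "B. Process": (762, 1240),
    "C. Component": (841, 1107),
}

def percent(found: int, total: int) -> float:
    """Coverage as a percentage."""
    return 100.0 * found / total

for category, (found, total) in counts.items():
    print(f"{category:<13} {percent(found, total):5.1f}%  ({found}/{total})")

# Pooled ratio across all four categories (assumed definition of "overall").
found_all = sum(found for found, _ in counts.values())
total_all = sum(total for _, total in counts.values())
print(f"Overall       {percent(found_all, total_all):5.1f}%  ({found_all}/{total_all})")
```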

15 Results - 5
Type of Relationship
Associative     613
Co-occurrence  3353
Hierarchical   1168
G/GP and Phenotype, M. Function, B. Process, C. Component by Assoc / Co-oc / Hier (cell values not preserved in the transcript)

16 Results - 6 Representation of gene function in GO compared to the UMLS

17 Neurofibromin 2 – merlin in GO

20 Discussion

21 Best & Worst Mappings
Best mapping categories
– Molecular function (GO)  44%
– Cellular component (GO)  35%
– Phenotype (LL)           34%
Worst mapping categories
– Gene synonym (GO)         6%
– Biological process (GO)   5%
– Gene symbol (GO)          2%

22 Only 34% of diseases?
In OMIM-LL, diseases are subdivided by genetic causes, but not in UMLS
E.g. Limb Girdle Muscular Dystrophy (LGMD)
– LGMD is represented in UMLS
– It is a SNOMED term
– In MeSH it is an entry term for muscular dystrophies
– MeSH notes for MD: A general term for a group of inherited disorders which are characterized by progressive degeneration of skeletal muscles (ed, 2000)

23 Limb Girdle Muscular Dystrophy – genetic types
LGMD type  Gene Name
1A         Myotilin
1B         Lamin A/C
1C         Caveolin-3
1D         Unknown
2A         Calpain-3
2B         Dysferlin
2C         Sarcoglycan-gamma
2D         Sarcoglycan-alpha
2E         Sarcoglycan-beta
2F         Sarcoglycan-delta
2G         Telethonin
2H         TRIM32
2I         Fukutin-related protein

24 Only 5% of Biological Processes?
Only 256 of the biological processes mapped to terms in UMLS.
In GO, processes are elaborated & organism specific
Example:
UMLS
– Mitotic spindle
GO
– Mitotic spindle assembly
– Mitotic spindle assembly (sensu Saccharomyces)
– Mitotic spindle assembly (sensu Fungi)
– Mitotic spindle checkpoint
– Mitotic spindle elongation
– Mitotic spindle orientation
– Mitotic spindle positioning
– Mitotic spindle positioning and orientation
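The mitotic spindle example can be made concrete with a short sketch: a whole-string match between the single UMLS term and the GO process terms finds nothing, although every GO term begins with the UMLS term. The prefix comparison is shown only to illustrate the granularity gap; it is an assumption of this sketch, not a method described in the slides.

```python
UMLS_TERM = "mitotic spindle"

# GO biological process terms listed on slide 24.
GO_TERMS = [
    "Mitotic spindle assembly",
    "Mitotic spindle assembly (sensu Saccharomyces)",
    "Mitotic spindle assembly (sensu Fungi)",
    "Mitotic spindle checkpoint",
    "Mitotic spindle elongation",
    "Mitotic spindle orientation",
    "Mitotic spindle positioning",
    "Mitotic spindle positioning and orientation",
]

# Whole-string comparison: none of the more specific GO terms equals the UMLS term.
exact = [t for t in GO_TERMS if t.lower() == UMLS_TERM]

# Prefix comparison (illustrative only): every GO term elaborates the UMLS term.
elaborations = [t for t in GO_TERMS if t.lower().startswith(UMLS_TERM + " ")]

print(f"exact matches: {len(exact)}")
print(f"GO terms elaborating '{UMLS_TERM}': {len(elaborations)}")
```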

25 Why so few gene names and synonyms mapped?
Official gene names have metadata and comments.
– dystrophin (muscular dystrophy, Duchenne and Becker types), includes DXS143, DXS164, DXS206, DXS230, DXS239, DXS268, DXS269, DXS270, DXS272
No single source has all names and synonyms
GO synonym field contains the IPI number for well-known genes, which does not match UMLS (a useful cross reference but not a synonym)
Symbols are short acronyms and match poorly
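To see why such names match poorly, the dystrophin example can be passed through a small cleanup heuristic that drops the parenthetical metadata and the trailing "includes ..." comment. The `strip_annotations` helper is hypothetical and was not part of the study; it is a sketch only, since parentheses can also be a legitimate part of a gene name.

```python
import re

def strip_annotations(name: str) -> str:
    """Heuristic cleanup: remove a trailing 'includes ...' comment and parentheticals."""
    name = re.sub(r",?\s*includes\b.*$", "", name)  # drop trailing "includes ..." list
    name = re.sub(r"\s*\([^)]*\)", "", name)         # drop "(...)" metadata
    return name.strip(" ,")

official_name = (
    "dystrophin (muscular dystrophy, Duchenne and Becker types), "
    "includes DXS143, DXS164, DXS206"
)

# The full official name is unlikely to equal any vocabulary term as a string,
# while the cleaned core name is a plausible candidate for matching.
print(strip_annotations(official_name))
```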

26 Summary 1
UMLS needs improvement in the molecular biology domain but has considerable content:
– 27% of GO concepts map
– 34% of single gene diseases map
– Existing UMLS terms come primarily from MeSH and SNOMED
Overall, positive mapping for 13,000 terms

27 Summary continued
– If the terms are in UMLS, it is possible to find a relationship between genes and phenotypes and gene function much of the time.
– UMLS does better with the human genes (20%+) than with genes from all organisms (11%).
– UMLS and GO representations complement each other.