Lexical and Statistical Approaches to Acquiring Ontological Relations Formal Methods for Casual Ontology? Olivier Bodenreider Lister Hill National Center.

Slides:



Advertisements
Similar presentations
CILC2011 A framework for structured knowledge extraction and representation from natural language via deep sentence analysis Stefania Costantini Niva Florio.
Advertisements

FP7 meeting - Gent - Carlos Rodríguez - April 18 WP4: Conceptual Mining from Text for Knowledge Engineering State of the Art WP Coordinators: Alfonso Valencia.
Semantic indexing in PubMed CERN Workshop on Innovations in Scholarly Communication (OAI8) CERN Workshop on Innovations in Scholarly Communication (OAI8)
Prof. Carolina Ruiz Computer Science Department Bioinformatics and Computational Biology Program WPI WELCOME TO BCB4003/CS4803 BCB503/CS583 BIOLOGICAL.
1 Natural Language Processing Group HUGs Geneva Start.
Codifying Semantic Information in Medical Questions Using Lexical Sources Paul E. Pancoast Arthur B. Smith Chi-Ren Shyu.
Summary Issues and Suggestions Workshop on The Future of the UMLS Semantic Network NLM, April 8, 2005 Olivier Bodenreider Lister Hill National Center for.
Who am I Gianluca Correndo PhD student (end of PhD) Work in the group of medical informatics (Paolo Terenziani) PhD thesis on contextualization techniques.
Machine Learning and the Semantic Web
Research topics Semantic Web - Spring 2007 Computer Engineering Department Sharif University of Technology.
Ontology Notes are from:
Applications Chapter 9, Cimiano Ontology Learning Textbook Presented by Aaron Stewart.
Erasmus University Rotterdam Frederik HogenboomEconometric Institute School of Economics Flavius Frasincar.
April 22, Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Doerre, Peter Gerstl, Roland Seiffert IBM Germany, August 1999 Presenter:
FMA: a domain reference ontology Comments on Cornelius Rosse’s talk Anita Burgun WG6 meeting, Rome 29 Apr- 2 May 2005.
1 CIS607, Fall 2006 Semantic Information Integration Instructor: Dejing Dou Week 10 (Nov. 29)
UML CASE Tool. ABSTRACT Domain analysis enables identifying families of applications and capturing their terminology in order to assist and guide system.
Annotating Documents for the Semantic Web Using Data-Extraction Ontologies Dissertation Proposal Yihong Ding.
Enhancing Ontologies Through Annotations Training Course Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland.
CSE 730 Information Retrieval of Biomedical Data The use of medical lexicon in biomedical IR.
Enriching Bio-ontologies with Non-hierarchical Relations Workshop on bio-ontologies October 28, 2005 Olivier Bodenreider Lister Hill National Center for.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Drew DeHaas.
Ontology Learning and Population from Text: Algorithms, Evaluation and Applications Chapters Presented by Sole.
Unified Medical Language System® (UMLS®) NLM Presentation Theater MLA 2007 National Library of Medicine National Institutes of Health U.S. Dept. of Health.
Automated Classification of Medical Questions Using Semantic Parsing Techniques Paul E. Pancoast, MD Arthur B. Smith, MS Chi-Ren Shyu, PhD University of.
Citation Biomedical Informatics Data ➜ Information ➜ Knowledge BMI Biomedical Named Entity Recognition Ramakanth Kavuluru NLP Seminar – 8/21/2012.
Ontology Alignment/Matching Prafulla Palwe. Agenda ► Introduction  Being serious about the semantic web  Living with heterogeneity  Heterogeneity problem.
Logics for Data and Knowledge Representation Semantic Matching.
1 Statistical NLP: Lecture 10 Lexical Acquisition.
Unified Medical Language System® (UMLS®) NLM Presentation Theater MLA 2005 May 16 & 17, 2005 Rachel Kleinsorge.
Automatic Lexical Annotation Applied to the SCARLET Ontology Matcher Laura Po and Sonia Bergamaschi DII, University of Modena and Reggio Emilia, Italy.
Mining the Semantic Web: Requirements for Machine Learning Fabio Ciravegna, Sam Chapman Presented by Steve Hookway 10/20/05.
Learning Object Metadata Mining Masoud Makrehchi Supervisor: Prof. Mohamed Kamel.
Linking Diseases and Genes through Informatics Knowledge Bases and Ontologies Joyce A. Mitchell, Ph.D. National Library of Medicine University of Missouri.
Session II: Scientific Publishing and Semantic Web W3C Semantic Web for Life Sciences Workshop October 27, 2004 Moderator: Alan R. Aronson.
Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Experiences in visualizing and navigating biomedical.
Scott Duvall, Brett South, Stéphane Meystre A Hands-on Introduction to Natural Language Processing in Healthcare Annotation as a Central Task for Development.
PAUL ALEXANDRU CHIRITA STEFANIA COSTACHE SIEGFRIED HANDSCHUH WOLFGANG NEJDL 1* L3S RESEARCH CENTER 2* NATIONAL UNIVERSITY OF IRELAND PROCEEDINGS OF THE.
Nancy Lawler U.S. Department of Defense ISO/IEC Part 2: Classification Schemes Metadata Registries — Part 2: Classification Schemes The revision.
1 Enriching and Designing Metaschemas for the UMLS Semantic Network Department of Computer Science New Jersey Institute of Technology Yehoshua Perl James.
Survey of Medical Informatics CS 493 – Fall 2004 September 27, 2004.
Intelligent Database Systems Lab Presenter : WU, MIN-CONG Authors : Jorge Villalon and Rafael A. Calvo 2011, EST Concept Maps as Cognitive Visualizations.
Flexible Text Mining using Interactive Information Extraction David Milward
Lexical Tools Briefing The Lexical Systems Group NLMNLM. LHNCBC. CGSBLHNCBCCGSB June, 2006.
Definition of a taxonomy “System for naming and organizing things into groups that share similar characteristics” Taxonomy Architectures Applications.
Relevance Detection Approach to Gene Annotation Aid to automatic annotation of databases Annotation flow –Extraction of molecular function of a gene from.
UMLS Unified Medical Language System. What is UMLS? A Unified knowledge representation system Project of NLM Large scale Distributed First launched in.
Knowledge-Based Semantic Interpretation for Summarizing Biomedical Text Thomas C. Rindflesch, Ph.D. Marcelo Fiszman, M.D., Ph.D. Halil Kilicoglu, M.S.
Noun-Phrase Analysis in Unrestricted Text for Information Retrieval David A. Evans, Chengxiang Zhai Laboratory for Computational Linguistics, CMU 34 th.
Consistency between Metathesaurus and Semantic Network Workshop on The Future of the UMLS Semantic Network NLM, April 8, 2005 Olivier Bodenreider Lister.
1 Statistical NLP: Lecture 7 Collocations. 2 Introduction 4 Collocations are characterized by limited compositionality. 4 Large overlap between the concepts.
Sharing Ontologies in the Biomedical Domain Alexa T. McCray National Library of Medicine National Institutes of Health Department of Health & Human Services.
Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts Ramin Homayouni, Kevin Heinrich, Lai Wei, and Michael W. Berry University of Tennessee.
Enabling complex queries to drug information sources through functional composition Olivier Bodenreider Lister Hill National Center for Biomedical Communications.
Metadata Common Vocabulary a journey from a glossary to an ontology of statistical metadata, and back Sérgio Bacelar
Collocations and Terminology Vasileios Hatzivassiloglou University of Texas at Dallas.
1 Masters Thesis Presentation By Debotosh Dey AUTOMATIC CONSTRUCTION OF HASHTAGS HIERARCHIES UNIVERSITAT ROVIRA I VIRGILI Tarragona, June 2015 Supervised.
1 Semantic Relations for Interpreting DNA Microarray Data and for Novel Hypotheses Generation Dimitar Hristovski, 1 PhD, Andrej Kastrin, 2 Borut Peterlin,
The UMLS Semantic Network Alexa T. McCray Center for Clinical Computing Beth Israel Deaconess Medical Center Harvard Medical School
Automatic Assignment of Biomedical Categories: Toward a Generic Approach Patrick Ruch University Hospitals of Geneva, Medical Informatics Service, Geneva.
Automatically Identifying Candidate Treatments from Existing Medical Literature Catherine Blake Information & Computer Science University.
Oncologic Pathology in Biomedical Terminologies Challenges for Data Integration Olivier Bodenreider National Library of Medicine Bethesda, Maryland -
Of 24 lecture 11: ontology – mediation, merging & aligning.
Informatics for Scientific Data Bio-informatics and Medical Informatics Week 9 Lecture notes INF 380E: Perspectives on Information.
Oncology in SNOMED CT NCI Workshop The Role of Ontology in Big Cancer Data Session 3: Cancer big data and the Ontology of Disease Bethesda, Maryland May.
Semantic Graph Mining for Biomedical Network Analysis: A Case Study in Traditional Chinese Medicine Tong Yu HCLS
UNIFIED MEDICAL LANGUAGE SYSTEMS (UMLS)
The UMLS and the Semantic Web
Development of the Amphibian Anatomical Ontology
Kenneth Baclawski et. al. PSB /11/7 Sa-Im Shin
Presentation transcript:

Lexical and Statistical Approaches to Acquiring Ontological Relations Formal Methods for Casual Ontology? Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Ontology and Biomedical Informatics Rome, Italy – May 1, 2005

2 Introduction u Biomedical ontologies l Precisely defined (e.g., formal ontology) l Limited size l Built manually u Large amounts of knowledge l Not represented explicitly by symbolic relations l But expressed implicitly n By lexico-syntactic relations (i.e., embedded in terms) n By statistical relations (e.g., co-occurrence) l Can be extracted automatically

3 Formal vs. casual ontology u Formal ontology l Provides a framework for building sound ontologies l Too labor-intensive for building large ontologies u Casual ontology l Usually unsuitable for reasoning l Tools for automatic acquisition available

4 General framework u Ontology learning l l [Maedche & Staab, Velardi] l l ECAI, IJCAI u Term variation[Jacquemin] u Terminology / KnowledgeTKE, TIA u Knowledge acquisition/captureK-CAP u Information extraction

5 Sources of knowledge for casual ontology u Long tradition of terminology building l Over 100 terminologies available in electronic format u Large corpora available (e.g., MEDLINE) l Entity recognition tools available n E.g., MetaMap (UMLS-based) n Several for gene/protein names l Information extraction methods u Large annotation databases available l MEDLINE citations indexed with MeSH l Model organism databases annotated with GO

6 Formal methods for casual ontology u Lexico-syntactic methods l Lexico-syntactic patterns l Nominal modification l Prepositional phrases l Reified relations l Semantic interpretation u Statistical methods l Clustering l Statistical analysis of co-occurrence data l Association rule mining

Lexico-syntactic methods

8 Synonymy u Source: terminology u Lexical similarity l Lexical variant generation program (UMLS) l norm u Limitations l Clinical synonymy vs. Synonymy l Molecular biology [McCray & al., SCAMC, 1994]

9 Normalization Hodgkin’s diseases, NOS Hodgkin diseases, NOS Remove genitive Hodgkin diseases, Remove stop words hodgkin diseases, Lowercase hodgkin diseases Strip punctuation hodgkin disease UninflectSort words disease hodgkin

10 Normalization Example Hodgkin Disease HODGKINS DISEASE Hodgkin's Disease Disease, Hodgkin's Hodgkin's, disease HODGKIN'S DISEASE Hodgkin's disease Hodgkins Disease Hodgkin's disease NOS Hodgkin's disease, NOS Disease, Hodgkins Diseases, Hodgkins Hodgkins Diseases Hodgkins disease hodgkin's disease Disease, Hodgkin normalize disease hodgkin

11 Taxonomic relations Lexico-syntactic patterns u Source: text corpus u Example of patterns l Lamivudin is a nucleoside analogue with potent antiviral properties. l The treatment of schizophrenia with old typical antipsychotic drugs such as haloperidol can be problematic. [Hearst, COLING, 1992] [Fiszman & al., AMIA, 2003]

12 Taxonomic relations Nominal modification u Source: text corpus / terminology u Example of modifiers l Adjective n Tuberculous Addison’s disease n Acute hepatitis l Noun (noun-noun compounds) n Prostate cancer n Carbon monoxide poisoning [Jacquemin, ACL, 1999] [Bodenreider & al., TIA, 2001] Terminology: constrained environment (increased reliability)

13 Reified relations u Source: terminology u Example: reification of part of u Augmented relations from reified part-of relations l Reified: l Reified: l Augmented: l Augmented: [Zhang & al., ISWC/Sem. Int., 2003]

14 Prepositional attachment u Source: text corpus / terminology u Example: of l Lobe of lung  part of Lung l Bone of femur  part of Femur u Restrictions l Validity of preposition-to-relation correspondence may be limited to a subdomain (e.g., anatomy) l Not applicable to complex terms n Groove for arch of aorta  NOT part of Aorta [Zhang & al., ISWC/Sem. Int., 2003]

15 Semantic interpretation u Source: text corpus / terminology u Correspondence between l Linguistic phenomena l Semantic relations u Semantic constraints provided by ontologies [Navigli & al., TKE, 2002] [Romacker, AIME, 2001] [Rindflesch & al., JBI, 2003]

16 Semantic interpretation Hemofiltration in digoxin overdose Disease or Syndrome Therapeutic or Preventive Procedure Digoxin overdoseHemofiltration Mapping to UMLS Syntactic analysis indigoxin overdose noun phrase Hemofiltration preposition treats Semantic rules Semantic network relationships AntibiotictreatsDisease or Syndrome Medical DevicetreatsInjury or Poisoning Pharmacologic SubstancetreatsCongenital Abnormality Ther. or Prev. ProceduretreatsDisease or Syndrome […] Select matching rule treats Semantic interpretation

17 Compositional features of terms u Lexical items u Terms within a vocabulary l Clinical vocabularies l Gene Ontology u Terms across vocabularies l SNOMED / LOINC l GO / ChEBI u Lexicon / Terms l Semantic lexicon [Baud & al., AMIA, 1998] [Ogren & al., PSB, 2004] [Mungall, CFG, 2004] [Johnson, JAMIA, 1999] [Verspoor, CFG, 2005] [Burgun, SMBM, 2005] [Dolin, JAMIA, 1998] [McDonald & al., AMIA, 1999]

Statistical methods

19 Taxonomic relations Clustering u Source: text corpus u Principle: similarity between words reflected in their contexts l Co-occurring words (+ frequencies) l Hierarchical clustering algorithms n Similarity measure (cosine, Kullback Leibler) u Can be refined using classification techniques (e.g., k nearest neighbors) [Faure & al., LREC, 1998] [Maedche & al., HoO, 2004]

20 Associative relations u Source: text corpus / annotation databases u Principle: dependence relations l Associations between terms u Several methods l Vector space model l Co-occuring terms l Association rule mining u Limitations: no semantics [Bodenreider & al., PSB, 2005]

21 Similarity in the vector space model Annotation database GO terms Genes g1g1 g2g2 …gngn t1t1 t2t2 … tntn GO terms Genes g1g1 g2g2 … gngn t1t1 t2t2 …tntn 

22 Similarity in the vector space model GO terms t1t1 t2t2 … tntn t1t1 t2t2 …tntn Similarity matrix Sim(t i,t j ) = t i · t j  GO terms Genes g1g1 g2g2 …gngn t1t1 t2t2 … tntn tjtj titi 

23 Analysis of co-occurring GO terms  Annotation database GO terms Genes g1g1 g2g2 … gngn t1t1 t2t2 …tntn g2g2 t2t2 t7t7 t9t9 … t3t3 t7t7 t9t9 g5g5 t5t5 t 2 -t 7 1 t 2 -t 9 1 t 7 -t 9 2 … t5t5t5t51 t7t7t7t72 t9t9t9t92 …

24 Analysis of co-occurring GO terms u Statistical analysis: test independence l Likelihood ratio test (G 2 ) l Chi-square test (Pearson’s χ 2 ) u Example from GOA (22,720 annotations) l C [BP]Freq. = 588 l C [MF]Freq. = 53 presentabsentTotalpresent absent721,58322,132 total5322,12522,720 GO: immune response GO: chemokineactivity Co-oc. = 46 G 2 = p < 0.000

25 Association rule mining  Annotation database GO terms Genes g1g1 g2g2 … gngn t1t1 t2t2 …tntn g2g2 t2t2 t7t7 t9t9 transaction apriori Rules: t 1 => t 2 Confidence: >.9 Support:.05

26 Example of associations (GO) u Vector space model l l MF:ice binding l l BP:response to freezing u u Co-occurring terms l l MF:chromatin binding l l CC:nuclear chromatin u u Association rule mining l l MF:carboxypeptidase A activity l l BP:peptolysis and peptidolysis

Discussion and Conclusions

28 Combine methods u Affordable relations l Computer-intensive, not labor-intensive u Methods must be combined l Cross-validation l Redundancy as a surrogate for reliability l Relations identified specifically by one approach n False positives n Specific strength of a particular method u Requires (some) manual curation l Biologists must be involved

29 Limited overlap among approaches u Lexical vs. non-lexical u Among non-lexical VSM COC ARM [Bodenreider & al., PSB, 2005]

30 Reusing thesauri u First approximation for taxonomic relations l No need for creating taxonomies from scratch in biomedicine u Beware of purpose-dependent relations l Addison’s disease isa Autoimmune disorder u Relations used to create hierarchies vs. hierarchical relations u Requires (some) manual curation [Wroe & al., PSB, 2003] [Hahn & al., PSB, 2003]

31 Formal vs. Casual u Formal ontology l Provides a framework for building sound ontologies l Too labor-intensive for building large ontologies u Casual ontology l Usually unsuitable for reasoning l Tools for automatic acquisition available What is not useful Formal ontology = righteousFormal ontology = righteous Casual ontology = sloppy Casual ontology = sloppy

32 Formal and Casual u Formal ontology l Provides a framework which can be used as a reference l Help us think clearly (?) about n Concepts n Relations (e.g., isa: is a kind of / is an instance of) u Casual ontology l Supported by “cheap” (but formal) methods l Extracted from large amounts of data l Helps populating the framework from formal ontology

33 Combining formal and casual u Formal ontology l Provides a framework for building sound ontologies l Too labor-intensive for building large ontologies l Can benefit from loosely defined ontologies u Casual ontology l Usually unsuitable for reasoning l Tools for automatic acquisition available l Can benefit from formal ontology n Organization n Validation

34 Casual ontology as a bridge u Casual ontology l Speaks the language of biologists n Extracted from text or terminologies l Passes (part of) the rigorous framework of formal ontology on to biologists u Casual ontologist l Not a sloppy ontologist n Uses the formal methods of casual ontology l Mediator between formal ontology and biology

Medical Ontology Research Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA