Lawrence Hunter & K. Bretonnel Cohen Center for Computational Pharmacology UCHSC School of Medicine Using.

Lawrence Hunter & K. Bretonnel Cohen Center for Computational Pharmacology UCHSC School of Medicine Using ontologies for text processing

Overview
Thesis: ontologies (or even more elaborate knowledge bases) are required to solve the lexical ambiguity problem
Describe the lexical ambiguity problem and its central importance in natural language processing
Demonstrate how GO, combined with Direct Memory Access Parsing, provides a simple solution to some instances of this problem
Argue that no alternative is likely to work as well

Lexical Ambiguity
A word (character string) means different things in different contexts
– How can a program disambiguate (tell which is meant)?
A widespread problem even in “simple” bioNLP
– DNA vs. mRNA vs. protein [Hatzivassiloglou et al. 2001]
– Gene symbol vs. non-gene acronym [Pustejovsky et al. 2001], [Chang et al. 2002], [Liu and Friedman 2003], [Schwartz and Hearst 2003]
– Gene/product vs. any other noun [Tanabe and Wilbur 2002]

An inescapable problem
Many gene symbols map to more than one gene!
Examined a set of 94,914 symbols
– official + aliases / Homo + Mus / LocusLink + NetAffx
7,042 symbols had exact matches to ≥ 2 genes
– The problem gets worse with reasonable inexact matches, e.g. roman vs. arabic numerals, alternative hyphenation
These matches involved 9,723 genes
– Fewer than 14,084 (2 × 7,042), since some genes had multiple ambiguous matches
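The count described on this slide can be sketched as follows. This is a minimal illustration, not the authors' procedure: the symbol/gene pairs are invented toy data, and `normalize` is a hypothetical inexact-match key that only covers the two cases the slide names (roman vs. arabic numerals, alternative hyphenation).

```python
import re
from collections import defaultdict

# Hypothetical mapping for small roman numerals; an assumption, not from the slides.
ROMAN = {"i": "1", "ii": "2", "iii": "3", "iv": "4", "v": "5"}

def normalize(symbol):
    """Inexact-match key: lowercase, drop hyphens/spaces, roman -> arabic numerals."""
    tokens = re.split(r"[-\s]+", symbol.strip().lower())
    return "".join(ROMAN.get(t, t) for t in tokens)

def ambiguous_symbols(pairs, key=lambda s: s):
    """Return {symbol key: set of gene ids} for keys that map to 2 or more genes."""
    genes_by_symbol = defaultdict(set)
    for symbol, gene_id in pairs:
        genes_by_symbol[key(symbol)].add(gene_id)
    return {s: g for s, g in genes_by_symbol.items() if len(g) >= 2}

# Toy data: symbols and gene ids are made up for illustration.
pairs = [
    ("ABC1", "g1"),
    ("ABC1", "g2"),
    ("ABC-I", "g3"),
    ("XYZ", "g4"),
    ("xyz", "g4"),
]

exact = ambiguous_symbols(pairs)
inexact = ambiguous_symbols(pairs, key=normalize)
```

With exact matching only `ABC1` is ambiguous; with the normalized key, `ABC-I` joins it and the same symbol now covers three genes, which is the sense in which inexact matching makes the problem worse.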

A particular example
“Hunk” can be a
– Cell type: human natural killer
– Gene: hormonally upregulated Neu-associated kinase
– Medical abbreviation: radiographic/orthopedic joint classification system
– Non-technical English: a large lump, piece, or portion
All occur in Medline documents (e.g. “hunk of metal” in an article on ambulance design)

How do ontologies help?
The idea that knowledge is relevant to understanding words in context is controversial only among linguists, but…
The Direct Memory Access Parsing (DMAP) technique [Martin, 1991] [Fitzgerald, 2000] demonstrates the power of a knowledge-based method for disambiguation
GO and similar efforts make DMAP (or other knowledge-based methods) practical today

What is DMAP?
Conceptual parser
– Maps from text to conceptual representations organized in packaging and abstraction hierarchies (like GO)
– In contrast to: pure syntactic parsers, pattern matching and machine learning systems
Conceptual representations include lexical patterns that specify how to recognize the concept in text
– Patterns consist of text literals and/or references to other concepts
– Organized around concepts, not words; no independent lexicon
Recognition creates expectations for related concepts

A real example
ID: cell-type-HUNK
  IS-A: cell-type
  lex: human natural killer HUNK RESULTS
ID: gene
  IS-A: gene
  lex: hormonally upregulated Neu-associated kinase HUNK hormonally upregulated neu tumor-associated kinase
ID: GO
  lex: transcription expression
ID: gene-expression
  slots: expressed-item: gene mechanism: expression
  lex: (gene) (expression)
“…Hunk expression is restricted to subsets of cells…” [Gardner et al. 2000]

DMAP output with and without context
(parse '(Hunk))
e-gene begin: 1 end: 1
e-cell-type-HUNK begin: 1 end: 1
(parse '(Hunk expression))
c-gene-expression-1 begin: 1 end: 2
  expressed-item: e-gene begin: 1 end: 1
  mechanism: GO: begin: 2 end: 2
Hunk alone: ambiguous
Hunk expression: not ambiguous
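The behaviour shown above (ambiguous alone, unambiguous in context) can be imitated with a very small concept-pattern matcher. This is a sketch under stated assumptions, not the DMAP implementation of Martin or Fitzgerald: the concept id `gene-HUNK` is hypothetical (the gene frame's id is not legible in the slide), the frame table is hand-written from the slide's lexical patterns, and the matcher only reports concepts whose pattern covers the whole input.

```python
# Concept frames: "isa" is the abstraction parent, "lex" is a list of patterns.
# A pattern item in parentheses, e.g. "(gene)", refers to a concept; anything
# else is a text literal. Ids and frame contents are illustrative assumptions.
CONCEPTS = {
    "cell-type-HUNK": {"isa": "cell-type",
                       "lex": [["hunk"], ["human", "natural", "killer"]]},
    "gene-HUNK": {"isa": "gene",
                  "lex": [["hunk"],
                          ["hormonally", "upregulated", "neu-associated", "kinase"]]},
    "expression": {"isa": "go-process",
                   "lex": [["expression"], ["transcription"]]},
    "gene-expression": {"isa": None,
                        "lex": [["(gene)", "(expression)"]]},
}

def isa(concept, ancestor):
    """True if `concept` equals `ancestor` or has it as an abstraction."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = CONCEPTS.get(concept, {}).get("isa")
    return False

def ends(concept, tokens, start):
    """Token positions where `concept` can end when recognized from `start`."""
    found = set()
    for pattern in CONCEPTS[concept]["lex"]:
        positions = {start}
        for item in pattern:
            nxt = set()
            for pos in positions:
                if item.startswith("("):
                    target = item[1:-1]
                    # A concept reference matches any specialization of it.
                    for c in CONCEPTS:
                        if isa(c, target):
                            nxt |= ends(c, tokens, pos)
                elif pos < len(tokens) and tokens[pos].lower() == item:
                    nxt.add(pos + 1)
            positions = nxt
        found |= positions
    return found

def parse(tokens):
    """Concepts whose lexical pattern spans the entire token list."""
    return sorted(c for c in CONCEPTS if len(tokens) in ends(c, tokens, 0))

alone = parse(["Hunk"])
in_context = parse(["Hunk", "expression"])
```

`alone` contains both the cell-type and the gene reading, while `in_context` contains only `gene-expression`, because its pattern requires a gene in the first position. The disambiguation comes from the concept frames themselves, with no separate lexicon.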

DMAP can handle much more complex constructions
“Hunk is expressed in mouse epithelial cells during cell proliferation.”
c-localized-gene-expression
  expressed-item: e-gene
  mechanism: GO:
  where: c-epithelial-cell
  taxon: ncbi_10090
  when: GO:
But this uses our enriched knowledge base, not just GO

Even just DMAP/GO is a big win
Recall: 7,042 ambiguous symbols for 9,723 genes
Straightforward to disambiguate symbols that map to 2 or more genes when:
– Each ambiguous gene referent has GO annotations, and
– There is no overlap between the annotations for the genes
3,333 of the symbols (for 4,715 of the genes) have this feature: nearly half the problem is solved!
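The two conditions on this slide amount to a set-disjointness test, and 3,333 of 7,042 symbols is about 47%. A minimal sketch follows; the gene ids and annotation labels are invented placeholders rather than real GO terms, and `resolve` is a hypothetical helper showing how a recognized context term would then pick one referent.

```python
from itertools import combinations

def disambiguable(candidate_genes, annotations):
    """True if every candidate gene has annotations and no two candidates share one."""
    sets = [annotations.get(g, set()) for g in candidate_genes]
    if any(not s for s in sets):
        return False
    return all(a.isdisjoint(b) for a, b in combinations(sets, 2))

def resolve(candidate_genes, annotations, context_terms):
    """Return the single candidate whose annotations overlap the context, else None."""
    hits = [g for g in candidate_genes if annotations.get(g, set()) & context_terms]
    return hits[0] if len(hits) == 1 else None

# Toy annotations: placeholder labels, not real GO identifiers.
annotations = {
    "geneA": {"process-1", "process-2"},
    "geneB": {"process-3"},
    "geneC": {"process-2"},
}

ok = disambiguable(["geneA", "geneB"], annotations)        # disjoint annotations
overlap = disambiguable(["geneA", "geneC"], annotations)   # share process-2
missing = disambiguable(["geneA", "geneD"], annotations)   # geneD has no annotations
choice = resolve(["geneA", "geneB"], annotations, {"process-3"})

fraction_solved = 3333 / 7042  # symbols with this property, from the slide's counts
```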

Compare the alternatives
Statistical or machine learning approaches
– Must avoid being fooled by the word “cells” in the example
– Scalability: need statistics for many covariates of every ambiguous word; doesn’t exploit the abstraction hierarchy
Full syntactic parse doesn’t disambiguate at all!
Cascaded FSTs, pattern-matching, etc.
– Where is the source of knowledge for these?
– Much DMAP lexical information can be taken directly from GO (and LocusLink, etc.)

Acknowledgments
Philip V. Ogren
Daniel J. McGoldrick
Christoffer S. Crosby
Jens Eberlein
George K. Acquaah-Mensah
I/NET’s CM / CMP software
Support from Wyeth Genetics Institute, NIAAA

Biognosticopoea representation of the hunk gene

Attachment ambiguity
– These findings suggest that FAK functions in the regulation of cell migration and cell proliferation. (Gilmore and Romer 1996:1209)
– What does FAK do?
ALMOST RIGHT:
  FAK functions in the regulation of cell migration
  FAK functions in cell proliferation
RIGHT:
  FAK functions in the regulation of cell migration
  FAK functions in the regulation of cell proliferation

Attachment ambiguity
GO isA go-process lex: cell migration
GO isA go-process lex: cell proliferation
GO isA go-process lex: regulation of cell proliferation
  regulation of ((go-process) and)* cell proliferation
GO lex: regulation of cell migration
  regulation of ((go-process) and)* cell migration

Attachment ambiguity
(parse '(These findings suggest that FAK functions in the regulation of cell migration and cell proliferation))
GO:30334 begin: 9 end: 12
GO: begin: 9 end: 15
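The lexical pattern `regulation of ((go-process) and)* cell proliferation` can be approximated with an ordinary regular expression to see why both spans are found. This is a sketch only: DMAP matches concept references rather than strings, and the `GO_PROCESSES` list here is a hand-written stand-in for the go-process concepts in the knowledge base.

```python
import re

# Stand-in for the phrases that lexicalize go-process concepts (an assumption).
GO_PROCESSES = ["cell migration", "cell proliferation"]

def regulation_pattern(target):
    """Regex for: regulation of ((go-process) and)* <target>."""
    alt = "|".join(re.escape(p) for p in GO_PROCESSES)
    return re.compile(rf"regulation of (?:(?:{alt}) and )*{re.escape(target)}")

def token_span(sentence, match):
    """1-based (begin, end) token positions of a regex match, as in the parse output."""
    begin = len(sentence[:match.start()].split()) + 1
    end = begin + len(match.group().split()) - 1
    return begin, end

sentence = ("These findings suggest that FAK functions in the regulation of "
            "cell migration and cell proliferation")

spans = {}
for target in GO_PROCESSES:
    m = regulation_pattern(target).search(sentence)
    if m:
        spans[target] = token_span(sentence, m)
```

The optional `(go-process) and` prefix is what lets "regulation of" distribute over the coordination, so the second conjunct is read as regulation of cell proliferation (tokens 9 to 15) instead of bare cell proliferation.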

What do we have so far?
Gene Ontology
UMLS
MeSH
…

What more do we need?
Family
Location
– Macroanatomical
– Subcellular localization
Structure
Function
– Disease associations
– Protein/protein interactions
– …..

Where can we get it?
GO definitions
UMLS definitions
MeSH notes
Biomedical literature

If you don’t like DMAP….
full syntactic parse first
cascaded FSTs
“a little syntax, a little semantics”
machine learning
pattern-matching
All can benefit from an ontology/KB