MetaCoDe A GATE PLUGIN FOR TAGGING MEDICAL CORPORA IN FRENCH WITH CONTROLED TERMINOLOGIES Thierry Delbecque Pierre.

Slides:



Advertisements
Similar presentations
COI Architecture? Web Enabling Standard Patient-Model Searches in Disparate EMR Systems By Dan CorwinDan Corwin November 2007.
Advertisements

Samsung Smart TV is a web-based application running on an application engine installed on digital TVs connected to the Internet.
An Introduction to GATE
Search in Source Code Based on Identifying Popular Fragments Eduard Kuric and Mária Bieliková Faculty of Informatics and Information.
© Copyright 2012 STI INNSBRUCK Apache Lucene Ioan Toma based on slides from Aaron Bannert
Mini Project Seminar on Pizza Ordering Application for Android
U. S. National Library of Medicine NLM Indexing Initiative Tools for NLP: MetaMap and the Medical Text Indexer Natural Language Processing: State of the.
Who am I Gianluca Correndo PhD student (end of PhD) Work in the group of medical informatics (Paolo Terenziani) PhD thesis on contextualization techniques.
A Flexible Workbench for Document Analysis and Text Mining NLDB’2004, Salford, June Gulla, Brasethvik and Kaada A Flexible Workbench for Document.
A System for A Semi-Automatic Ontology Annotation Kiril Simov, Petya Osenova, Alexander Simov, Anelia Tincheva, Borislav Kirilov BulTreeBank Group LML,
Development of a Pediatric Text-Corpus for Part-of-Speech Tagging John Pestian 1, Lukasz Itert 1,2, and Włodzisław Duch 2,3 1 BioMedical Informatics, 3333.
Detecting Economic Events Using a Semantics-Based Pipeline 22nd International Conference on Database and Expert Systems Applications (DEXA 2011) September.
CSE 730 Information Retrieval of Biomedical Data The use of medical lexicon in biomedical IR.
Outline Chapter 1 Hardware, Software, Programming, Web surfing, … Chapter Goals –Describe the layers of a computer system –Describe the concept.
1 CS 426 Senior Projects Chapter 1: What is UML? Chapter 2: What is UP? [Arlow and Neustadt, 2002] January 26, 2006.
1.3 Executing Programs. How is Computer Code Transformed into an Executable? Interpreters Compilers Hybrid systems.
Erasmus University Rotterdam Introduction Nowadays, emerging news on economic events such as acquisitions has a substantial impact on the financial markets.
CIG Conference Norwich September 2006 AUTINDEX 1 AUTINDEX: Automatic Indexing and Classification of Texts Catherine Pease & Paul Schmidt IAI, Saarbrücken.
Survey of Semantic Annotation Platforms
Computing on the Cloud Jason Detchevery March 4 th 2009.
Session II: Scientific Publishing and Semantic Web W3C Semantic Web for Life Sciences Workshop October 27, 2004 Moderator: Alan R. Aronson.
PLATFORM INDEPENDENT SOFTWARE DEVELOPMENT MONITORING Mária Bieliková, Karol Rástočný, Eduard Kuric, et. al.
Information Extraction From Medical Records by Alexander Barsky.
Analysis of DOM Structures for Site-Level Template Extraction (PSI 2015) Joint work done in colaboration with Julián Alarte, Josep Silva, Salvador Tamarit.
FIIT STU Bratislava Classification and automatic concept map creation in eLearning environment Karol Furdík 1, Ján Paralič 1, Pavel Smrž.
1 st June 2006 St. George’s University of LondonSlide 1 Using UMLS to map from a Library to a Clinical Classification: Improving the Functionality of a.
Survey of Medical Informatics CS 493 – Fall 2004 September 27, 2004.
Object Management Group (OMG) Specifies open standards for every aspect of distributed computing Multiplatform Model Driven Architecture (MDA)
Combining terminology resources and statistical methods for entity recognition: an evaluation Angus Roberts, Robert Gaizauskas, Mark Hepple, Yikun Guo.
Indexing UMLS concepts with Apache Lucene Julien Thibault University of Utah Department of Biomedical Informatics.
Model Driven Development An introduction. Overview Using Models Using Models in Software Feasibility of MDA MDA Technologies The Unified Modeling Language.
MDA and Security October 12, 2006 FAU Secure Systems Group Patrick Morrison.
Introduction to GATE Developer Ian Roberts. University of Sheffield NLP Overview The GATE component model (CREOLE) Documents, annotations and corpora.
SIMO SIMulation and Optimization ”New generation forest planning system” Antti Mäkinen & Jussi Rasinmäki Dept. of Forest Resource Management.
QCAdesigner – CUDA HPPS project
Introduction to Information Retrieval Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
MedKAT Medical Knowledge Analysis Tool December 2009.
Measuring How Good Your Search Engine Is. *. Information System Evaluation l Before 1993 evaluations were done using a few small, well-known corpora of.
 Programming - the process of creating computer programs.
U. S. National Library of Medicine The Current State of MetaMap and MMTx UMLS Webcast Alan (Lan) R. Aronson Lister Hill Center/NLM/NIH
The UMLS Semantic Network Alexa T. McCray Center for Clinical Computing Beth Israel Deaconess Medical Center Harvard Medical School
Acquisition of Categorized Named Entities for Web Search Marius Pasca Google Inc. from Conference on Information and Knowledge Management (CIKM) ’04.
8 December 1997Industry Day Applications of SuperTagging Raman Chandrasekar.
Automatically Identifying Candidate Treatments from Existing Medical Literature Catherine Blake Information & Computer Science University.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Automatic Document Indexing in Large Medical Collections.
TCSS 342 Autumn 2004 Version TCSS 342 Data Structures & Algorithms Autumn 2004 Ed Hong.
The University of Illinois System in the CoNLL-2013 Shared Task Alla RozovskayaKai-Wei ChangMark SammonsDan Roth Cognitive Computation Group University.
EMC Graduation Competition for Turkey, Middle East and Africa.
University of Maryland Baltimore County
Language Identification and Part-of-Speech Tagging
UNIFIED MEDICAL LANGUAGE SYSTEMS (UMLS)
PRINCIPLES OF COMPILER DESIGN
Done By: Ashlee Lizarraga Ricky Usher Jacinto Roches Eli Gomez
SysML 2.0 Formalism: Requirement Benefits, Use Cases, and Potential Language Architectures Formalism WG December 6, 2016.
CS 153: Concepts of Compiler Design August 29 Class Meeting
Kenneth Baclawski et. al. PSB /11/7 Sa-Im Shin
Information Retrieval and Web Search
Wei Wei, PhD, Zhanglong Ji, PhD, Lucila Ohno-Machado, MD, PhD
Information Retrieval and Web Search
Using UMLS CUIs for WSD in the Biomedical Domain
Compiler 薛智文 TH 6 7 8, DTH Spring.
Chapter 2: The Linux System Part 1
Lesson Objectives Aims Key Words Compiler, interpreter, assembler
Overview of big data tools
Compiler 薛智文 TH 6 7 8, DTH Spring.
Computational Linguistics: New Vistas
Introduction to Information Retrieval
CS246: Information Retrieval
Compiler 薛智文 M 2 3 4, DTH Spring.
Extracting Why Text Segment from Web Based on Grammar-gram
Presentation transcript:

MetaCoDe A GATE PLUGIN FOR TAGGING MEDICAL CORPORA IN FRENCH WITH CONTROLED TERMINOLOGIES Thierry Delbecque Pierre Zweigenbaum

Motivations: To provide a UMLS mapping tool that can process large corpora in French: ● Operating at an high enough speed ● Able to operate 'on the fly', for example for processing stream of data from Internet ● Using limited computing resources ● Saving resources for others algorithms such as Machine Learning ● Built in an open platform and easily integrated in bigger applications ● Scalable and easily maintainable

Principles: ● To implement the software as a GATE java plugin, in order to ensure inter- operability of the tagger with other software components (through the java API, or through the GATE standard representation of annotated texts) ● To code linguistic preprocessing (tokenization, noun phrases extractions, etc) with GATE processing resources (JAPE Grammars,...) ● To code the core extraction algorithm in C++, to speedup the computation time ● In memory loading of strictly necessary knowledge resources as hash tables: no file or data base accesses during the tagging process ● To rely only on the UMLS Metathesaurus strings to handle variant generation (no specific variant generation algorithm

Initial Corpus POS extraction (Tree Tagger, Gazeeter, JAPE) NP Extraction (JAPE) Core Tagger (C++ native library) Processing Resources Annotations UMLS Data Bases Annotations MetaCoDe Annotations Language resources Architecture Visualization resources MetaCoDe Browser

Extracted In-Memory Resources Ad hoc resources can be extracted according to the processed corpora and loaded in memory: ● WDSUI : mapping from individual words to SUI strings; ● CUISUI : mapping from each SUI to its corresponding CUI; ● SUILENGTH : lengths (token number) of each SUI; ● CUISTY : semantic types of each CUI. Specific vocabularies can be selected.

Mapping Engine Algorithm Step 1: Noun Phrase Extraction (GATE PR). Based on POS extraction and JAPE grammar => each noun phrase is latter represented as a bag of words {w 1,..., w n } Step 2 : For each noun phrase, selection of candidates SUI's based on the bag of words => C 1 = {sui 1,..., sui p } Step 3 : Build a lattice on C1; pruning of the lattice to keep the SUI's with wider coverage of the noun phrase without being too specific => C 2 Last Step : for each element of C 2, retrieves its corresponding CUI and semantic types from UMLS data base

Mapping Engine Algorithm SUIs selection for a given Noun Phrase Noun Phrase Eg “occult nodal metastase” SUI1 (S = “nodal”) SUI2 (S = “metastasis” SUI (S = “metastasis to the kidney”) SUI (a) : Build the lattice of all SUIs That contains at least one word of the NP (b) Pruning. Keeps only those SUI S such that |S| <= |S ∩ NP| + tolerance => C 1 = {SUI1, SUI2}

Evaluation Motivation : to evaluate the price to pay for not computing variants generation as is currently done in Metamap; Principle : to run MetaMap on an English corpus, and to use MetaMap as the gold standard; compare the output of both tools on a sample of 30 independent abstracts issued from Medline;

Evaluation: metrics MATCH : a common decision has been issued by both tools; AMBIGUOUS : given a Noun Phrase, MetaCoDe produced issued incorrect decisions along with correct ones; BAD : given a Noun Phrase or a part of a Noun Phrase, MetaMap took a decision, according to which MetaCoDe produced only bad output; MISS: no decision was output by MetaCoDe when MetaMap has been producing a mapping.

Evaluation Results ● P = | MATCH | / ( | MATCH | + | BAD | ); (pseudo-precision) ● R = | MATCH | / ( | MATCH | + | BAD | + | MISS | ); (pseudo-recall) ● F = 2 x P x R / (P + R)

Running MetaCoDe Processing Resources Chain

MetaCoDe Browser Tagged Text Interactive Tags browser Corpora Annotations Set SQL Connection to UMLS Data Base UMLS info on Selected concept

Conclusions ● The good precision rate is due to the rather conservative algorithm of MetaCoDe ● Relying to the Metathesaurus only for variants generation leads to 0.76 “recall” rate but could be improved by allowing some verbal forms to be tagged as well ● The core tagging process is fast enough to process big corpora or to add extra algorithms for specific applications*, though GATE platform slows down the process significantly ● Young project, still a lot to be done on JAPE grammars and on the core tagger (not industrial today) ● The code is open source and available under GPL license * 26 minutes spent for processing 7260 abstracts from MEDLINE ( words), when called outside of GATE on a Pentium 4 biprocessor, 3Ghz, 1.5Go, Linux.

Bibliography Rindflesch TC and Aronson AR Semantic Processing in Information Retrieval. Proc Annu Symp Comput Appl Med Care, Charles Safran (ed.), 1993 Bodenreider 0, Nelson SJ, Hole WT and Chang FH Beyond synonymy: exploiting the UMLS semantics in mapping vocabularies. Proc AMIA Symp Aronson AR, Mork JG, Gay CW, Humphreys SM andRogers WJ The NLM Indexing Initiative's Medical text Indexer. MedInfo Humphreys BL, Lindberg DA, Schoolman HM and Barnett GO The Unified Medical Language System: An informatic research collaboration. J Am Med Inform Assoc, 1998;5(1) Aronson AR Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp Lindberg DA, Humphreys BL and McCray AT The Unified Medical Language System. Methods Inf med, 1993;32(2) Delbecque T and Zweigenbaum P MetaCoDe: a lightweight UMLS mapping tool. Actes 11th Conference on Artificial Intelligence in Medicine Europe, Heidelberg, 2007