MedKAT Medical Knowledge Analysis Tool December 2009



Overview
✤ MedKAT and MedKAT/p
✤ Developed at IBM, donated to OHNLP with Apache license V2.0
✤ Goal:
    ✤ Identification of concepts and their attributes based on a standard or proprietary terminology/ontology
✤ “/p” adaptation to pathology reports: relation extraction
✤ UIMA-based, modular, generic, expandable
✤ Terminology agnostic: able to plug in any terminology
✤ Easy adaptation to specific corpus and conventions
✤ Integration into institutional system
✤ Ongoing commitment to research and development

Core Components
✤ Document structure
✤ Syntactic tools (tokenization ... shallow parsing)
✤ Negation
✤ Concept identification
✤ Relationship extraction


Document Structure
✤ Plain text or XML (e.g., CDA)
✤ Processes specific document section types (e.g., diagnosis)
✤ Detection of enumerated subsections (e.g., lists)
✤ Detection of formatting (e.g., bullets)
✤ Detection of relations between sections (e.g., coreference between corresponding lists appearing in different document sections)
✤ Making implicit conventions explicit (e.g., meaning of title)

Document Structure Annotators

Document Structure: Multiple document sections

Document Structure: Corresponding document subsections

Document Structure: The document structure must be known in order to compute concept coreference during relation extraction
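
To illustrate how corresponding subsections in different sections might be linked, here is a minimal Python sketch. It is not MedKAT's implementation: the heading and list-item conventions, the function names, and the sample report are all assumptions made for this example.

```python
import re

# Assumed conventions (not from the slides): section headings are upper-case
# lines ending in a colon, and list items start with a label such as "A." or "1.".
HEADING = re.compile(r"^([A-Z][A-Z ]+):\s*$")
ITEM = re.compile(r"^\s*(\d+|[A-Z])[.)]\s+(.*)$")


def parse_sections(report):
    """Return {section name: {item label: item text}} for a plain-text report."""
    sections = {}
    current = None
    for line in report.splitlines():
        heading = HEADING.match(line)
        if heading:
            current = heading.group(1).strip().lower()
            sections[current] = {}
            continue
        item = ITEM.match(line)
        if item and current is not None:
            sections[current][item.group(1)] = item.group(2).strip()
    return sections


def link_subsections(sections, first, second):
    """Pair the items that carry the same label in two sections."""
    shared = sections[first].keys() & sections[second].keys()
    return {
        label: (sections[first][label], sections[second][label])
        for label in sorted(shared)
    }


# Hypothetical report text, invented for this sketch.
REPORT = """GROSS DESCRIPTION:
A. Colon, segment.
B. Lymph node.
DIAGNOSIS:
A. Adenocarcinoma.
B. Negative for tumor.
"""

sections = parse_sections(REPORT)
links = link_subsections(sections, "gross description", "diagnosis")
print(links["A"])
```

Linking items by shared label is one simple way to make the implicit list convention explicit, so that a diagnosis can later be related to the specimen it refers to.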


Syntactic Structure Annotators

Tokenization
Basic building block for subsequent annotators. The text “poorly-differentiated/undifferentiated” could be tokenized as 1, 3, or 5 tokens, depending on how the hyphen and the slash are handled.
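
The three tokenizations can be reproduced with a short Python sketch. The `tokenize` helper and its `split_chars` parameter are hypothetical and only illustrate the choice; they are not the MedKAT tokenizer. The sketch assumes that separators are kept as tokens of their own.

```python
import re

TEXT = "poorly-differentiated/undifferentiated"


def tokenize(text, split_chars=""):
    """Split text at each character in split_chars, keeping separators as tokens."""
    if not split_chars:
        return [text]
    pattern = "([" + re.escape(split_chars) + "])"
    return [tok for tok in re.split(pattern, text) if tok]


one = tokenize(TEXT)          # the whole string as a single token
three = tokenize(TEXT, "/")   # split at the slash only
five = tokenize(TEXT, "-/")   # split at both the hyphen and the slash

for tokens in (one, three, five):
    print(len(tokens), tokens)
```

Whichever choice is made, the same choice has to be applied to the lexicon, otherwise dictionary lookup will miss entries.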

Part of Speech Tagger
✤ OpenNLP POS tagger with standard models
✤ Domain adaptation:
    ✤ Entries from lexicon are pre-tagged
    ✤ Rule-based overwriting of tags for specific cases

Shallow Parser

Merging NP Types
The shallow parser defines three types of noun phrase:
1. NP
2. NPP
3. NPList

The NPMerger module creates NPCombined annotations to cover all types of noun phrases.
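
As a rough illustration of the merging step, the following Python sketch creates one NPCombined annotation per distinct span covered by an NP, NPP, or NPList annotation. The `Annotation` class and the `merge_np_types` function are hypothetical stand-ins for UIMA annotations, and the deduplication of identical spans is an assumption about how the merge might behave.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    type: str
    begin: int
    end: int


NP_TYPES = {"NP", "NPP", "NPList"}


def merge_np_types(annotations):
    """Create one NPCombined annotation per distinct noun-phrase span."""
    spans = sorted({(a.begin, a.end) for a in annotations if a.type in NP_TYPES})
    return [Annotation("NPCombined", begin, end) for begin, end in spans]


# Hypothetical shallow-parser output, as character offsets.
parsed = [
    Annotation("NP", 0, 5),
    Annotation("VP", 6, 10),
    Annotation("NPP", 11, 30),
    Annotation("NPList", 31, 60),
    Annotation("NP", 31, 60),  # same span as the NPList above
]

combined = merge_np_types(parsed)
print(combined)
```

Downstream annotators can then work with a single noun-phrase type instead of checking three.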


Negation Annotators

Negation
✤ Keyword and syntactic analysis driven
✤ Set of keywords configurable via dictionary
✤ Type of syntactic phrase used to determine context is configurable
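
A minimal Python sketch of keyword-driven negation with a configurable context type follows. It is an assumption-laden simplification, not MedKAT's algorithm: a concept counts as negated when it lies in the same phrase as any keyword, and the `Span` class, the helper names, the keyword set, and the sample text are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    type: str
    begin: int
    end: int


def span_of(text, type_, substring):
    """Build a Span for the first occurrence of substring in text."""
    begin = text.index(substring)
    return Span(type_, begin, begin + len(substring))


def negated_concepts(text, concepts, phrases, keywords, phrase_type="Sentence"):
    """Return the concepts that share a phrase of phrase_type with a negation keyword."""
    negated = []
    for phrase in phrases:
        if phrase.type != phrase_type:
            continue
        words = text[phrase.begin:phrase.end].lower().split()
        if not any(word.strip(".,;:") in keywords for word in words):
            continue
        negated.extend(
            c for c in concepts if phrase.begin <= c.begin and c.end <= phrase.end
        )
    return negated


# Hypothetical input, invented for this sketch.
TEXT = "No evidence of carcinoma. Adenoma is present."
phrases = [
    span_of(TEXT, "Sentence", "No evidence of carcinoma."),
    span_of(TEXT, "Sentence", "Adenoma is present."),
]
concepts = [span_of(TEXT, "Concept", "carcinoma"), span_of(TEXT, "Concept", "Adenoma")]

negated = negated_concepts(TEXT, concepts, phrases, keywords={"no", "not", "without"})
print([TEXT[c.begin:c.end] for c in negated])
```

Passing a different `phrase_type` (for example a noun-phrase or clause type) narrows or widens the negation context, which mirrors the configurable phrase type mentioned above.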


Concept Identification Annotators

Concept Identification
✤ Lexicon entries can be added, changed, deleted
✤ Lexicon entry attributes can be added, changed, deleted
✤ Search parameters can be modified
✤ Post-processing filters
✤ Tokenization of text and lexicon should be the same

Lexicon Entries
✤ A sample lexicon entry. The variant elements define all of the synonyms that can be matched during lookup. Attributes associated with the “token” element apply to all variants, but can be overridden within individual variants (e.g., the “POS” attribute in some of these variant entries).
<token canonical="colon, nos" CodeType="ICDO" CodeValue="C18.9" SemClass="Site" POS="NN">
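
The inheritance and override of attributes can be sketched in Python as below. The `token` element is taken from the sample entry; the two `variant` children, their `base` attribute, and the "JJ" override are assumptions added for illustration, since the variant lines of the sample entry are not part of this transcript.

```python
import xml.etree.ElementTree as ET

# The <variant> children below are hypothetical examples.
ENTRY = """
<token canonical="colon, nos" CodeType="ICDO" CodeValue="C18.9" SemClass="Site" POS="NN">
    <variant base="colon"/>
    <variant base="colonic" POS="JJ"/>
</token>
"""


def expand_variants(entry_xml):
    """Return one attribute dict per variant, with token attributes as defaults."""
    token = ET.fromstring(entry_xml.strip())
    inherited = dict(token.attrib)
    # Attributes on a variant take precedence over those on the token.
    return [{**inherited, **variant.attrib} for variant in token.findall("variant")]


rows = expand_variants(ENTRY)
for row in rows:
    print(row["base"], row["POS"], row["CodeValue"])
```

Here the first variant inherits POS="NN" from the token element, while the second overrides it and still inherits the code and semantic class.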

Concept Identification Configuration
✤ Configured to find all matched entries, not just the longest match, even if overlapping
✤ Case-insensitive
✤ Token-order-independent matching, e.g., A B C = C A B
✤ Subsequent filtering used to remove unnecessary over-generated results
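
One way to picture this configuration is the Python sketch below: lexicon variants are indexed by their sorted, lowercased tokens, and every contiguous token window in the text is looked up, so all matches are reported even when they overlap. Restricting the search to contiguous windows, and the function names and sample lexicon, are assumptions of this sketch rather than details of MedKAT.

```python
def build_index(lexicon):
    """Map an order-independent, lowercased token key to the canonical terms it matches."""
    index = {}
    for canonical, variants in lexicon.items():
        for variant in variants:
            key = tuple(sorted(variant.lower().split()))
            index.setdefault(key, set()).add(canonical)
    return index


def find_all_matches(tokens, index):
    """Return (start, end, canonical) for every matching token window, overlaps included."""
    max_len = max((len(key) for key in index), default=0)
    matches = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            key = tuple(sorted(tok.lower() for tok in tokens[start:end]))
            for canonical in sorted(index.get(key, ())):
                matches.append((start, end, canonical))
    return matches


# Hypothetical lexicon: canonical form -> list of variants.
LEXICON = {
    "sigmoid colon": ["sigmoid colon"],
    "colon, nos": ["colon"],
}

index = build_index(LEXICON)
matches = find_all_matches(["Colon", "sigmoid", "biopsy"], index)
print(matches)
```

Because the shorter match is not suppressed by the longer one, the result is deliberately over-generated and relies on later filtering.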

Concept Filters
✤ Remove:
    ✤ any duplicates over a single span
    ✤ generic terms like “tumor” if part of a longer term
    ✤ terms that contain other terms that were previously marked, such as a modifier
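
The first two filters could look roughly like the following Python sketch. It treats a duplicate as the same term over the same span and uses a hypothetical `GENERIC_TERMS` set; both are assumptions, and the third filter (terms containing previously marked terms) is not covered here.

```python
GENERIC_TERMS = {"tumor"}  # assumed list of generic terms, for illustration only


def filter_concepts(concepts):
    """Filter a list of (begin, end, term) tuples.

    1. Drop exact duplicates (same span, same term), keeping the first.
    2. Drop generic terms whose span lies inside a strictly longer concept span.
    """
    unique = list(dict.fromkeys(concepts))
    kept = []
    for begin, end, term in unique:
        inside_longer = any(
            b <= begin and end <= e and (e - b) > (end - begin)
            for b, e, _ in unique
        )
        if term.lower() in GENERIC_TERMS and inside_longer:
            continue
        kept.append((begin, end, term))
    return kept


# Hypothetical over-generated matches, as character offsets.
raw = [
    (0, 15, "carcinoid tumor"),
    (10, 15, "tumor"),
    (10, 15, "tumor"),
    (20, 25, "tumor"),
]

filtered = filter_concepts(raw)
print(filtered)
```

A generic term that stands alone, like the last entry, is kept because no longer concept covers it.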


Relationship Extraction Annotators

Relationship Extraction
✤ Find coreferences of both anatomical sites and histological diagnoses across document sections
✤ Discover relationships between named entities and build knowledge model:
    ✤ Tumors (primary, metastatic)
    ✤ Gross description
    ✤ Lymph nodes

Knowledge Model
✤ Benefits
✤ Summarization
✤ Comparison
✤ Change detection
✤ Temporal progression of disease
✤ Validation
✤ Manual annotation of pathology reports and clinical notes

The MedKAT/p Pipeline

MedKAT/p Annotator Pipeline

MedKAT/p Pipeline
✤ The full processing pipeline brings together all of the MedKAT components
✤ Used a manually annotated gold standard corpus of 302 documents: 201 documents for training, 101 for testing
✤ UIMA CAS can be output as a database load file, XML, or other format using a UIMA CAS Consumer module

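Concept Extraction Results
Columns: Training Instances, Test Instances, F-Score
Rows: Anatomical Site, Histology, Size, Date, Grade
[Table values not preserved in the transcript, apart from a truncated “1,” in the Anatomical Site row]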

Model Extraction Results
Columns: Training Instances, Test Instances, F-Score
Rows: Gross Description, Lymph Nodes, Primary Tumor, Metastatic Tumor
[Table values not preserved in the transcript]

Summary
✤ MedKAT and MedKAT/p were developed at IBM, donated to OHNLP with Apache license V2.0
✤ Apache UIMA-based solution for a flexible, expandable system
✤ Concepts are identified, with their associated attributes, based on a standard or proprietary terminology/ontology
✤ The “/p” version has additional components for processing pathology reports