BioNLP, Information Extraction from Radiology Reports


BioNLP, Information Extraction from Radiology Reports Emilia Apostolova College of Computing and Digital Media DePaul University

BioNLP – conferences and shared tasks
Pacific Symposium on Biocomputing
Intelligent Systems for Molecular Biology
Association for Computational Linguistics
North American Chapter of the Association for Computational Linguistics
BioNLP
BioCreative
TREC Genomics
ImageCLEF

Information Extraction (in BioMedicine)
The NLP Pipeline
Lexical Analysis – tokenization, morphological analysis, linguistic lexicons.
Syntactic Analysis – Part of Speech Tagging, Chunking, Parsing.
Semantic Analysis – Lexical Semantic Interpretation, Semantic Interpretation of Utterances.

NLP Pipeline Frameworks
GATE - General Architecture for Text Engineering.
Apache UIMA - Unstructured Information Management Architecture.
GeneWays - a system for automatically extracting, analyzing, visualizing and integrating molecular pathway data from the research literature.
PASTA - Protein Structures and Information Extraction from Biological Texts.

Lexical Analysis - Tokenization
Segmenting text into linguistic tokens – words and sentences.
Abbreviations - The Study was conducted within the U.S.
Apostrophes - IL-10's cytokine synthesis inhibitory activity
Hyphenation - co-operate, cooperate
Multiple formats - 464,285.23 and 464285.23
Sentence boundary detection - :, ;, -
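The tokenization cases on this slide can be illustrated with a small regular-expression tokenizer. This is only a sketch: the pattern `TOKEN_RE` and the function `tokenize` are hypothetical names introduced here, and the rules cover just the slide's examples rather than a full biomedical tokenizer.

```python
import re

# Hypothetical token pattern; alternatives are tried from top to bottom.
TOKEN_RE = re.compile(r"""
    (?:[A-Za-z]\.){2,}              # dotted abbreviations such as U.S.
  | \d{1,3}(?:,\d{3})+(?:\.\d+)?    # numbers with thousands separators
  | \d+(?:\.\d+)?                   # plain integers and decimals
  | \w+(?:-\w+)*(?:'s)?             # words, hyphenated words, possessives
  | [^\w\s]                         # any other single punctuation character
""", re.VERBOSE)


def tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    return TOKEN_RE.findall(text)


print(tokenize("The Study was conducted within the U.S."))
print(tokenize("IL-10's cytokine synthesis inhibitory activity"))
print(tokenize("464,285.23 and 464285.23"))
```

Note that the final period of "U.S." is absorbed into the abbreviation token, so a sentence splitter built on top of this would still have to decide whether that period also ends the sentence.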

Lexical Analysis – Morphological analysis
Link surface variants of a lexical element to its canonical base form, e.g. inflections (activat-es, activat-ed, activat-ing) and derivations (activation).
Porter stemmer – lexicon-free approach. Finds the longest match of a word against a list of English derivational and inflectional suffixes.
Two-level morphology – a finite-state approach that applies a series of parallel transducers to input tokens (fly -> flies).
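The longest-suffix-match idea can be sketched in a few lines. This is not the Porter algorithm, which applies several ordered rule phases with conditions on the stem; it only illustrates lexicon-free suffix stripping. The suffix list, the `min_stem` threshold, and the function name are assumptions chosen to fit the slide's "activat-" examples.

```python
# Hypothetical, deliberately tiny suffix list (not Porter's rule set).
SUFFIXES = ["ions", "ion", "ing", "es", "ed", "s"]


def strip_suffix(word, suffixes=SUFFIXES, min_stem=3):
    """Remove the longest matching suffix, keeping at least min_stem letters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for form in ["activates", "activated", "activating", "activation", "activations"]:
    print(form, "->", strip_suffix(form))
```

Trying the longest suffix first matters: for "activations" the suffix "ions" is removed rather than just "s", so all five forms map to the same stem.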

Syntactic Level – Part of Speech Tagging
activation – POS noun, singular
activate – POS verb, present non-3rd person singular
active – POS adjective
report – noun or verb?

Syntactic Level - Parsing A natural language parser is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as "phrases") and which words are the subject or object of a verb. The Stanford Dependency Parser - a Java implementation of probabilistic natural language parsers, trained on the Penn Treebank.

Semantic Level – Lexical Interpretation
Selectional Restrictions:
transitive verbs: inhibit [something], transcribe [something]
semantic restrictions: inhibit [Process], transcribe [Nucleic Acid]
Syntactically admissible, but semantically invalid: to inhibit amino acids, to transcribe cell growth
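Selectional restrictions can be sketched as a lookup that compares the semantic type a verb expects for its object with the type of the actual object. The verb restrictions below come from the slide; the toy type lexicon (including the types assigned to "amino acids", "cytokine synthesis", and "DNA") and all names are hypothetical. A real system would draw the types from a resource such as the UMLS.

```python
# Semantic type each verb requires of its direct object (from the slide).
OBJECT_RESTRICTION = {"inhibit": "Process", "transcribe": "Nucleic Acid"}

# Hypothetical toy lexicon mapping phrases to semantic types.
SEMANTIC_TYPE = {
    "cell growth": "Process",
    "cytokine synthesis": "Process",
    "amino acids": "Amino Acid",
    "DNA": "Nucleic Acid",
}


def is_semantically_valid(verb, obj):
    """True only if the object's type matches the verb's restriction."""
    expected = OBJECT_RESTRICTION.get(verb)
    actual = SEMANTIC_TYPE.get(obj)
    return expected is not None and expected == actual


print(is_semantically_valid("inhibit", "cell growth"))
print(is_semantically_valid("inhibit", "amino acids"))
print(is_semantically_valid("transcribe", "cell growth"))
```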

Discourse Level - Pragmatics
Discourse referents: what entities does a given message refer to?
What background knowledge is needed to understand a given message?
How do the beliefs of speaker and hearer interact in the interpretation of a message?
What is a relevant answer to a given question?
Summarization, Translation, Dialog Systems, Natural Language Generation.

Lexical resources for (Bio)NLP
Princeton WordNet
NLM UMLS lexicon and metathesaurus
The Open Biomedical Ontologies

Text and Image Integration

Automatic Image Annotation

Automatic Image Annotation
Where? Woman (Population Group), Right breast (Body Part, Organ, or Organ Component)
How? Mammography (Diagnostic Procedure)
What? Calcification (Pathologic Function), Lesion (Finding), Carcinoma, Papillary (Neoplastic Process)

IE from Clinical Texts – Radiology and Pathology Reports Northwestern University Medical School Department of Radiology Imaging Informatics

Radiology Reports

Sample Radiology Report
Patient Name: XXXXXXX, XXXXX
Medical Record Number: XXXXXXXXXX
DOB: XXXX.XX.XX
Sex: F
Accession Number: XXXXXXXX
Study Requested: DIG MAMMOGRAM SCREENING (3300000)
Scheduled Date and Time: XXXX.XX.XX 13:02:00.0000
Requesting Physician: XXXXXXX,
Reason for Exam: V76.12
----------------------------Radiological Report---------------------------------
Comparison is made to previous exams dated XX/XX/XX.
CLINICAL HISTORY: Seventy-two year old woman for screening exam. Patient has a family history of breast cancer, sister age sixty years old. Patient has a history of a previous left breast benign biopsy.
TECHNIQUE: Mammograms were obtained using digital technique.
FINDINGS: There is dense fibroglandular tissue bilaterally. No dominant masses or clustered microcalcifications suggestive of malignancy are seen.
1. NO SPECIFIC FEATURES OF MALIGNANCY SEEN EITHER BREAST.
2. NO SIGNIFICANT CHANGE WHEN COMPARED WITH PRIOR STUDIES.
3. ANNUAL SCREENING MAMMOGRAM IS RECOMMENDED.
CODE (1): NEGATIVE
Attending Radiologist: XXXXXXX, MD
Date Signed off: XXXXXX,
Transc. by: NS

NLP for Clinical Texts
Document retrieval – case finding.
Subject recruitment – identify patients that can benefit from a study.
Surveillance – monitoring disease outbreaks.
Discovery of disease-drug associations.
Discovery of disease-finding associations.

IE from Radiology Reports
Automatic Section Segmentation
Demographics
History
Comparison
Technique
Findings
Impression
Recommendation
Sign off

Dataset
215,000 free-text radiology reports, selected randomly from 3 million reports over a period of 9 years and representing 24 different types of diagnostic procedures.

Method – Training Set
Hand-crafted rules for automatic extraction of a training set.
Common boundary patterns, e.g. section Findings: the text between a known Findings header and the next known section header. Header patterns:
^(finding|observation|discussion)s?:
^(\W*)(finding|observation|discussion)s?(\W*)$
3,000 automatically segmented “high-confidence” radiology reports, containing all 8 sections of interest.
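The boundary-pattern rule for the Findings section could look roughly like the sketch below. The two Findings patterns are taken from the slide with the spacing around the alternatives removed. Everything else is an assumption made for illustration: the case-insensitive and multi-line flags, the list of other headers in `NEXT_HEADER`, and the function name `extract_findings`.

```python
import re

FLAGS = re.IGNORECASE | re.MULTILINE

# Header followed by a colon at the start of a line, e.g. "FINDINGS: ...".
FINDINGS_INLINE = re.compile(r"^(finding|observation|discussion)s?:", FLAGS)
# Header alone on its line, possibly wrapped in punctuation, e.g. "-- Findings --".
FINDINGS_STANDALONE = re.compile(r"^(\W*)(finding|observation|discussion)s?(\W*)$", FLAGS)
# Hypothetical set of other known section headers that end the Findings text.
NEXT_HEADER = re.compile(
    r"^(impression|recommendation|technique|comparison|clinical history)s?:", FLAGS
)


def extract_findings(report):
    """Return the text between a Findings header and the next known header."""
    start = FINDINGS_INLINE.search(report) or FINDINGS_STANDALONE.search(report)
    if start is None:
        return None  # no Findings header found: not a high-confidence report
    end = NEXT_HEADER.search(report, start.end())
    stop = end.start() if end else len(report)
    return report[start.end():stop].strip()


report = (
    "TECHNIQUE: Mammograms were obtained using digital technique.\n"
    "FINDINGS: There is dense fibroglandular tissue bilaterally.\n"
    "IMPRESSION: No significant change.\n"
)
print(extract_findings(report))
```

Reports in which a rule like this fires for every section of interest would be the candidates for the automatically segmented training set.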

Method
Classification task - each sentence from a radiology report is assigned to one of 8 predefined report sections.

Sentence features used for training a classifier.
Sentence Orthography: Possible orthographic types are All Capitals, Mixed Case, or presence of a Header pattern, such as a phrase at the beginning of a line followed by a colon.
Previous Sentence Boundary: Formatting boundary separating the current and previous text sentences. Possible values are white space containing new lines, white space without new lines, non-alphabetic characters, or the beginning of the file.
Following Sentence Boundary: Formatting boundary separating the current and next text sentences. Possible values are white space containing new lines, white space without new lines, non-alphabetic characters, or the end of the file.
Cosine Vector Distance: Distance from the current sentence to each of the eight sections' word vectors.
Exact Header Match: This feature specifies if the sentence contains a header identified as belonging to one of the sections in the training data.
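Two of these features, Sentence Orthography and Cosine Vector Distance, can be sketched as follows. The slides do not say how the word vectors were built, so plain bag-of-words counts are assumed here; the header pattern `HEADER_RE`, the label strings, and all function names are likewise hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical header pattern: a phrase at the start of a line, then a colon.
HEADER_RE = re.compile(r"^\s*[A-Za-z][A-Za-z ]*:")


def orthography(sentence):
    """Classify a sentence as header, all_capitals, or mixed_case."""
    if HEADER_RE.match(sentence):
        return "header"
    letters = [c for c in sentence if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        return "all_capitals"
    return "mixed_case"


def bag_of_words(text):
    """Lower-cased word counts, used here as a simple word vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 if norm == 0 else 1.0 - dot / norm


# Toy section vectors; in practice there would be one per report section.
sections = {
    "technique": bag_of_words("Mammograms were obtained using digital technique"),
    "findings": bag_of_words("dense fibroglandular tissue no dominant masses"),
}
sentence = "There is dense fibroglandular tissue bilaterally."
features = {name: cosine_distance(bag_of_words(sentence), vec)
            for name, vec in sections.items()}
print(orthography(sentence), features)
```

The resulting distances, one per section, would be passed to the classifier together with the orthography label and the boundary features.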

Work in Progress
Identify named entities within sections using a controlled vocabulary – findings, diseases, observations, anatomical organs, imaging modalities.
Negation Discovery.
Identify relationships between named entities of interest, for example what observations are associated with a diagnosis.
Use radiology report text to support automatic annotation of medical images.

Q/A