NLP in Public Health: What May Work

Presentation transcript:

NLP in Public Health: What May Work
Ninad Mishra, Dave Cummo, Jim Arnzen

Monkeys and Bananas
We gave the monkeys the bananas because they were hungry.
We gave the monkeys the bananas because they were over-ripe.
These two sentences have the same surface grammatical structure. However, the pronoun "they" refers to the monkeys in the first sentence and to the bananas in the second, and it is impossible to tell which without knowledge of the properties of monkeys and bananas.
Source: Unknown

NLP Applications
- Question answering (information retrieval)
- Text/document classification
- Text mining: find something that was not known before
- Information extraction: extract useful information from public health grey literature
Source: CNLP

Classification with NLP
- Statistical methods
  - Bayesian
  - Artificial neural networks
  - Support vector machines
- Rule-based classification system
  - Keyword approach
  - Negation strategies

Naïve Bayesian Classification
- Statistical classification method
- Uses supervised learning
- Requires a human-classified training set
- Assumes that, within a given class, the occurrence of each individual word is independent of every other word

Naïve Bayesian Classification Calculations: Spam Filtering Trial
- Training corpus (SME-classified): spam and non-spam
- Word frequency statistics stored for spam and for non-spam
- Testing corpus (unclassified): spam and non-spam
- Output: spam identified, valid non-spam identified
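The training and testing flow on this slide can be illustrated with a minimal sketch. It is not the Classifier4j package or the authors' code; the toy messages, the whitespace tokenizer, and the add-one smoothing are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a toy example.
    return text.lower().split()

def train(labeled_docs):
    """Store word frequency statistics per class from human-classified documents."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of training documents
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the label with the highest log prior plus summed log likelihoods."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Independence assumption: each word contributes its own factor.
            # Add-one smoothing (an assumption here) avoids log(0) for unseen words.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training corpus, classified by hand.
training = [
    ("cheap viagra online pharma", "spam"),
    ("online pharma discount viagra", "spam"),
    ("reviewed your presentation today", "non-spam"),
    ("jai called to say hello", "non-spam"),
]
model = train(training)
print(classify("viagra pharma", model))
print(classify("your presentation", model))
```

Working in log space keeps the product of many small word probabilities from underflowing, which matters once messages are longer than a few words.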

Preliminary Results (using the Classifier4j package, a modified Naïve Bayesian classifier)
Precision: .864
Recall: .810
F-Measure: .836

Preliminary Results (using true Naïve Bayesian classification)
Precision: .973
Recall: .993
F-Measure: .983

Precision and Recall
- Precision is the fraction of retrieved documents that are relevant.
- Recall is the fraction of relevant documents that are retrieved.
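As a worked illustration of these definitions, the sketch below computes precision, recall, and the balanced F-measure (the harmonic mean of the two). The function names and the example counts are hypothetical; the last two lines check the F-measure formula against the precision and recall values reported on the Preliminary Results slides.

```python
def precision(true_pos, false_pos):
    # Fraction of retrieved documents that are relevant.
    return true_pos / (true_pos + false_pos)

def recall(true_pos, false_neg):
    # Fraction of relevant documents that are retrieved.
    return true_pos / (true_pos + false_neg)

def f_measure(p, r):
    # Balanced F-measure: harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

# Hypothetical counts: 8 relevant documents retrieved, 2 irrelevant ones
# retrieved, and 4 relevant ones missed.
p = precision(true_pos=8, false_pos=2)
r = recall(true_pos=8, false_neg=4)
print(p, r, f_measure(p, r))

# Precision and recall as reported on the Preliminary Results slides.
print(round(f_measure(0.864, 0.810), 3))
print(round(f_measure(0.973, 0.993), 3))
```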

Confusion Matrix

NLP Experience @CDC
- 'RiskBot' POCs
- MSM online text/profile assessment project
- i2b2 (Informatics for Integrating Biology & the Bedside) medical discharge summary classification challenge
- MySpace data elements for public health message tailoring

i2b2
- NIH-funded National Center for Biomedical Computing based at Partners HealthCare System
- i2b2 issues 'challenges' to correctly classify health records by condition and co-morbidity, and invites institutions and teams to compete
- Results shown on the next few slides are derived from training set data

Statistical Analysis-Textual Judgment

Disease P-Micro P-Macro R-Micro R-Macro F-Micro F-Macro
CAD                   0.3911    0.4065    0.4522  0.2946
CHF                   0.5153    0.4915    0.3811  0.411
Depression            0.6154    0.5619    0.6234  0.5261
Diabetes              0.4105    0.3894    0.4943  0.2865
GERD                  0.2268    0.3095    0.4719  0.1302
Gout                  0.2349    0.3621    0.3318  0.1915
Hypertension          0.4883    0.4025    0.5652  0.3566
Hypertriglyceridemia  0.5789    0.5093    0.5946  0.3975
OA                    0.6181    0.5644    0.6178  0.5349
Obesity               0.2863    0.3942    0.5708  0.2092
OSA                   0.1525    0.3145    0.5075  0.1232
PVD                   0.7599    0.5958    0.6429  0.6056
Venous Insufficiency  0.739     0.513     0.5884  0.4671
System Textual        0.462846  0.447277  0.5263  0.348769
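The column headers distinguish micro-averaged from macro-averaged scores. A minimal sketch of the difference for precision follows, using hypothetical per-class counts rather than the challenge data; the function name and the class labels are assumptions for illustration.

```python
def micro_macro_precision(per_class):
    """per_class maps a class label to (true positives, false positives)."""
    total_tp = sum(tp for tp, _ in per_class.values())
    total_fp = sum(fp for _, fp in per_class.values())
    # Micro-average: pool the counts over all classes, then compute one score.
    micro = total_tp / (total_tp + total_fp)
    # Macro-average: compute a score per class, then take the unweighted mean.
    macro = sum(tp / (tp + fp) for tp, fp in per_class.values()) / len(per_class)
    return micro, macro

# Hypothetical counts: one large, well-classified class and one small,
# poorly classified class.
counts = {"Y": (90, 10), "N": (5, 5)}
micro, macro = micro_macro_precision(counts)
print(micro, macro)
```

Because micro-averaging pools counts, a large class dominates the score, while macro-averaging gives every class equal weight. A system can therefore score well on a micro-average and noticeably worse on the corresponding macro-average when it handles rare classes poorly.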

Rule-Based NLP System
- Looks for keywords associated with each morbidity
- Tries to identify the assertion type in which each keyword appears
- Assertion types: positive, negative, questionable
- Text preprocessing: primarily removes text not relevant to the patient's current condition
- Documents classified using a scoring algorithm
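The steps on this slide can be sketched roughly as follows. This is not the authors' system: the keyword list, the cue words, the sentence splitter, the weights, and the scoring rule are all hypothetical and chosen only to make the pipeline concrete.

```python
import re

# All lexicons and weights below are illustrative assumptions.
KEYWORDS = {"CAD": ["cad", "coronary artery disease"]}
NEGATION_CUES = ["no", "denies", "without"]
QUESTION_CUES = ["possible", "probable", "questionable"]
IRRELEVANT_CUES = ["family history", "father", "mother"]
WEIGHTS = {"positive": 1, "questionable": 0, "negative": -1}

def sentences(text):
    # Naive sentence splitter on periods, semicolons, and newlines.
    return [s.strip().lower() for s in re.split(r"[.;\n]", text) if s.strip()]

def preprocess(text):
    # Drop sentences that do not describe the patient's current condition.
    return [s for s in sentences(text)
            if not any(cue in s for cue in IRRELEVANT_CUES)]

def assertion_type(sentence, match_start):
    # Look for cue words that appear before the keyword in the same sentence.
    before = sentence[:match_start].split()
    if any(cue in before for cue in NEGATION_CUES):
        return "negative"
    if any(cue in before for cue in QUESTION_CUES):
        return "questionable"
    return "positive"

def classify(text, disease):
    assertions = []
    for sentence in preprocess(text):
        for keyword in KEYWORDS[disease]:
            match = re.search(r"\b" + re.escape(keyword) + r"\b", sentence)
            if match:
                assertions.append(assertion_type(sentence, match.start()))
                break
    if not assertions:
        return "indeterminate"
    # Hypothetical scoring rule: sum the weights and read off the sign.
    score = sum(WEIGHTS[a] for a in assertions)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "questionable"

print(classify("Father was diagnosed with CAD.", "CAD"))
print(classify("No family history of CAD. Patient has CAD.", "CAD"))
print(classify("Patient denies CAD.", "CAD"))
```

Restricting the cue search to the sentence that contains the keyword is one simple way to approximate negation context; a production system would need a richer notion of scope.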

Classification: Rule-Based System
Corpus: i2b2 Obesity Challenge

Positive weighted
Disease P-Micro P-Macro R-Micro R-Macro F-Micro F-Macro
Obesity         0.9767  0.7426  0.4924
Depression      0.9835  0.9758  0.9824  0.979
Hypertri.       0.9959  0.9979  0.9167  0.9535
Gallstones      0.9712  0.6004  0.6064  0.6033
OSA             0.9904  0.9448  0.655   0.671
Asthma          0.9973  0.9947  0.9992  0.9969
CAD             0.9404  0.7715  0.8017  0.7857
PVD             0.9877  0.978   0.9764  0.9772
Gout            0.9918  0.8767  0.8217  0.8452
Diabetes        0.9752  0.8636  0.8155  0.8369
CHF             0.9417  0.7698  0.8746  0.8038
Ven. Insuf.     0.9863  0.8504  0.9467  0.8923
GERD            0.9696  0.6233  0.8848  0.67
OA              0.9876  0.9664  0.9891  0.9773
Hyperchol.      0.9821  0.9906  0.9349  0.959
Hyperten.       0.9575  0.9143  0.85    0.8796
System Textual  0.7391  0.7969  0.765

Questionable weighted (Q-WEIGHTED)
Disease P-Micro P-Macro R-Micro R-Macro F-Micro F-Macro
Obesity               0.9712  0.4926  0.4891  0.4907
Depression            0.9835  0.9758  0.9824  0.979
Hypertriglyceridemia  0.9959  0.9979  0.9167  0.9535
Gallstones            0.9671  0.5462  0.5995  0.5655
OSA                   0.989   0.8997  0.7104  0.6753
Asthma                0.9973  0.9947  0.9992  0.9969
CAD                   0.9334  0.7283  0.8337  0.769
PVD                   0.9863  0.9778  0.9715  0.9746
Gout                  0.9918  0.8767  0.8217  0.8452
Diabetes              0.9738  0.8279  0.8502  0.8363
CHF                   0.9361  0.7419  0.8411  0.7698
Venous Insufficiency  0.8504  0.9467  0.8923
GERD                  0.9696  0.6233  0.8848  0.67
OA                    0.9849  0.9698  0.9804  0.9749
Hypercholesterolemia  0.9807  0.8656  0.9341  0.8752
Hypertension          0.9479  0.8532  0.8456  0.8486
System Textual        0.9747  0.6954  0.8171  0.7404

Positive Weighted
Table columns: Assertion Types in Tie (Y, Q, N); Tie Breaker Judgment *

What May Account for the Difference?
- Paucity of the data
- Pre-processing
- Negation context and assertions
- Inherent properties of a medical text
- Degree of differentiation
- Locality of information

Highly Differentiated Text
Email A: "Reviewed your presentation", "Jai called to say.."
Email B: "online pharma", "viagra"

Text with Low Differentiation
Medical Discharge Summary A: "CAD"
Medical Discharge Summary B: "CAD"

Text with Low Differentiation
Medical Discharge Summary A: "…father was diagnosed with CAD" (indeterminate)
Medical Discharge Summary B: "…patient has CAD" (positive for CAD)

Text with Low Differentiation
Medical Discharge Summary A: "…no family history of CAD" (indeterminate)
Medical Discharge Summary B: "…patient has CAD" (positive for CAD)

NLP in Public Health: Major Problems
- Complexity of the underlying domains: medicine, microbiology, social sciences, etc.
- Complexity of language
- Creating lexicons and rules requires linguistic and computational expertise
- Idealized language v. language in use
- Ambiguity
- Lack of large, rich lexicons suited to the application domain: they take a large effort to build and to adapt to specific domains

Possible Solutions
- Focus on building limited, domain-specific systems
- Combine rule-based and statistical approaches as needed
- Focus initially on limited, "simple" tasks
- Focus on real language use under realistic conditions
- Make progress by building working systems and evaluating them rigorously

Examples of NLP in Public Health
- Decision Augmentation Service (CDC)
- Public health search engine (CNLP)
- Validation of clinical data submitted to BioSense using NLP (Mayo Clinic)
- Information retrieval, text mining and knowledge synthesis for human genome epidemiology (National Office of Public Health Genomics)

NLP in Public Health
- Pneumonia detection from free-text radiological reports (Mayo Clinic)
- Public health situational awareness tools for information fusion (BIC)
- Chief-complaint-based syndromic surveillance (BioSense)
- Vaccine adverse event detection (CDC)

Questions and Answers