NLP in Public Health: What May Work
  Ninad Mishra, Dave Cummo, Jim Arnzen
Monkeys and Bananas
  We gave the monkeys the bananas because they were hungry.
  We gave the monkeys the bananas because they were over-ripe.
  Both sentences have the same surface grammatical structure. However, the pronoun "they" refers to the monkeys in the first sentence and to the bananas in the second, and it is impossible to tell which without knowledge of the properties of monkeys and bananas.
  Source: Unknown
NLP Applications
  Question answering (information retrieval)
  Text/document classification
  Text mining: find something that was not known before
  Information extraction: extract useful information from public health grey literature
  Source: CNLP
Classification with NLP
  Statistical methods
    Bayesian
    Artificial neural networks
    Support vector machines
  Rule-based classification system
    Keyword approach
    Negation strategies
Naïve Bayesian Classification
  Statistical classification method
  Uses supervised learning
  Requires a human-classified training set
  Assumes that the occurrence of each individual word is independent of every other word, given the document's class
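The points above can be sketched in a few lines of Python. This is a minimal illustration and not the classifier used in the trial: it assumes whitespace tokenization and add-one (Laplace) smoothing, and the four training sentences are invented for the example.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """Collect word frequency statistics from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of training documents
    vocab = set()
    for text, label in labelled_docs:
        words = text.lower().split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the label with the highest log posterior score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Independence assumption: per-word log likelihoods are simply added.
            # Add-one smoothing (an assumption of this sketch) avoids log(0).
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical human-classified training set, invented for illustration.
TRAINING = [
    ("cheap viagra online pharma", "spam"),
    ("buy viagra cheap", "spam"),
    ("reviewed your presentation slides", "non-spam"),
    ("meeting about the presentation tomorrow", "non-spam"),
]
model = train(TRAINING)
print(classify("online pharma viagra", model))
print(classify("your presentation meeting", model))
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once documents are longer than a few words.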
Naïve Bayesian Classification: Spam Filtering Trial
  [Diagram: an SME-classified training corpus of spam and non-spam feeds the Naïve Bayesian classification calculations, which store word frequency statistics for spam and non-spam. An unclassified testing corpus of spam and non-spam is then separated into identified spam and identified valid non-spam.]
Preliminary Results
  Using Classifier4j package (modified Naïve Bayesian)
  Precision: 0.864
  Recall: 0.810
  F-measure: 0.836
Preliminary Results
  Using true Naïve Bayesian classification
  Precision: 0.973
  Recall: 0.993
  F-measure: 0.983
Precision and Recall
  Precision is the fraction of retrieved documents that are relevant.
  Recall is the fraction of relevant documents that are retrieved.
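As a small worked example, the definitions above can be written in terms of true positives (relevant and retrieved), false positives (retrieved but not relevant) and false negatives (relevant but missed). The F-measure is taken here to be the balanced harmonic mean of precision and recall, which is an assumption; it does reproduce the F-measure values reported in the preliminary results. The function names and the counts are hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_f(tp, fp, fn):
    """Compute precision, recall and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0  # retrieved that are relevant
    recall = tp / (tp + fn) if tp + fn else 0.0     # relevant that are retrieved
    return precision, recall, f_measure(precision, recall)

# Hypothetical counts: 8 spam caught, 2 valid emails flagged, 2 spam missed.
print(precision_recall_f(tp=8, fp=2, fn=2))

# Check against the reported preliminary results.
print(round(f_measure(0.864, 0.810), 3))  # 0.836
print(round(f_measure(0.973, 0.993), 3))  # 0.983
```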
Confusion Matrix
NLP Experience @CDC
  ‘RiskBot’ POCs
  MSM online text/profile assessment project
  i2b2 (Informatics for Integrating Biology & the Bedside) medical discharge summary classification challenge
  MySpace data elements for public health message tailoring
i2b2
  NIH-funded National Center for Biomedical Computing based at Partners HealthCare System
  i2b2 issues ‘challenges’ to correctly classify health records based on conditions and co-morbidities, and invites various institutions/teams to compete
  Results shown on the next few slides are derived from training set data
Statistical Analysis-Textual Judgment
  Disease P-Micro P-Macro R-Micro R-Macro F-Micro F-Macro
  CAD 0.3911 0.4065 0.4522 0.2946
  CHF 0.5153 0.4915 0.3811 0.411
  Depression 0.6154 0.5619 0.6234 0.5261
  Diabetes 0.4105 0.3894 0.4943 0.2865
  GERD 0.2268 0.3095 0.4719 0.1302
  Gout 0.2349 0.3621 0.3318 0.1915
  Hypertension 0.4883 0.4025 0.5652 0.3566
  Hypertriglyceridemia 0.5789 0.5093 0.5946 0.3975
  OA 0.6181 0.5644 0.6178 0.5349
  Obesity 0.2863 0.3942 0.5708 0.2092
  OSA 0.1525 0.3145 0.5075 0.1232
  PVD 0.7599 0.5958 0.6429 0.6056
  Venous Insufficiency 0.739 0.513 0.5884 0.4671
  System Textual 0.462846 0.447277 0.5263 0.348769
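The column headers above distinguish micro- and macro-averaged precision (P), recall (R) and F-measure (F). The sketch below shows how these averages are commonly defined: micro-averaging pools the counts of all classes before computing the metric, while macro-averaging computes the metric per class and then takes the unweighted mean. The slides do not state which convention was used for macro F; the mean of per-class F-measures is assumed here. The class labels and counts are invented for illustration.

```python
def prf(tp, fp, fn):
    """Precision, recall and F-measure from true/false positives and false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def micro_macro(per_class):
    """per_class maps a class label to its (tp, fp, fn) counts."""
    counts = list(per_class.values())
    # Micro: sum the counts over all classes, then compute the metrics once.
    micro = prf(sum(c[0] for c in counts),
                sum(c[1] for c in counts),
                sum(c[2] for c in counts))
    # Macro: compute the metrics per class, then average them with equal weight.
    scores = [prf(*c) for c in counts]
    macro = tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))
    return micro, macro

# Hypothetical counts: a frequent class "Y" and a rare, poorly handled class "Q".
COUNTS = {"Y": (8, 2, 2), "Q": (1, 1, 3)}
micro, macro = micro_macro(COUNTS)
print("micro P, R, F:", micro)
print("macro P, R, F:", macro)
```

Because macro-averaging weights every class equally, a rare class that is classified badly pulls the macro scores down much more than the micro scores.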
Rule Based NLP System
  Looks for keywords associated with each morbidity
  Tries to identify the assertion type in which each keyword appears
    Assertion types: positive, negative, questionable
  Text preprocessing
    Primarily removes text not relevant to patient’s current condition
  Documents classified using scoring algorithm
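A rough sketch of the steps listed above, written in Python. It is an illustration under stated assumptions and not the system evaluated in these slides: the keyword lexicon, the cue word lists, the weights, the sentence splitting, and the "indeterminate" label for documents with no usable mention are all hypothetical choices made for the example.

```python
import re

# Hypothetical lexicon and cue lists, invented for this sketch.
KEYWORDS = {"CAD": ["cad", "coronary artery disease"]}
NEGATION_CUES = ["no", "denies", "without", "negative for"]
QUESTION_CUES = ["possible", "probable", "suspected", "rule out"]
IRRELEVANT_CUES = ["family history", "father", "mother"]
# Hypothetical weights: here a positive assertion counts double.
WEIGHTS = {"positive": 2, "negative": 1, "questionable": 1}

def has_cue(text, cues):
    """True if any cue occurs in text as a whole word or phrase."""
    return any(re.search(r"\b" + re.escape(cue) + r"\b", text) for cue in cues)

def classify(document, disease):
    votes = {"positive": 0, "negative": 0, "questionable": 0}
    for sentence in re.split(r"[.;\n]+", document.lower()):
        # Preprocessing: skip text not about the patient's current condition.
        if has_cue(sentence, IRRELEVANT_CUES):
            continue
        for keyword in KEYWORDS[disease]:
            match = re.search(r"\b" + re.escape(keyword) + r"\b", sentence)
            if not match:
                continue
            # Assertion type is decided from cues that precede the keyword.
            context = sentence[:match.start()]
            if has_cue(context, NEGATION_CUES):
                kind = "negative"
            elif has_cue(context, QUESTION_CUES):
                kind = "questionable"
            else:
                kind = "positive"
            votes[kind] += WEIGHTS[kind]
            break  # count each sentence once
    if not any(votes.values()):
        return "indeterminate"
    # Scoring: highest weighted vote wins; ties fall to the first key listed.
    return max(votes, key=votes.get)

print(classify("Patient has CAD.", "CAD"))
print(classify("Father was diagnosed with CAD.", "CAD"))
print(classify("Patient denies chest pain. No evidence of CAD.", "CAD"))
```

Looking for cues only in the text before the keyword is a simplification; real clinical text also places negation and uncertainty after the term, so a production system would need a wider context window.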
Classification: Rule Based System
  Corpus: i2b2 Obesity Challenge
Positive weighted
  Disease P-Micro P-Macro R-Micro R-Macro F-Micro F-Macro
  Obesity 0.9767 0.7426 0.4924
  Depression 0.9835 0.9758 0.9824 0.979
  Hypertriglyceridemia 0.9959 0.9979 0.9167 0.9535
  Gallstones 0.9712 0.6004 0.6064 0.6033
  OSA 0.9904 0.9448 0.655 0.671
  Asthma 0.9973 0.9947 0.9992 0.9969
  CAD 0.9404 0.7715 0.8017 0.7857
  PVD 0.9877 0.978 0.9764 0.9772
  Gout 0.9918 0.8767 0.8217 0.8452
  Diabetes 0.9752 0.8636 0.8155 0.8369
  CHF 0.9417 0.7698 0.8746 0.8038
  Venous Insufficiency 0.9863 0.8504 0.9467 0.8923
  GERD 0.9696 0.6233 0.8848 0.67
  OA 0.9876 0.9664 0.9891 0.9773
  Hypercholesterolemia 0.9821 0.9906 0.9349 0.959
  Hypertension 0.9575 0.9143 0.85 0.8796
  System Textual 0.7391 0.7969 0.765
Questionable weighted
  Disease P-Micro P-Macro R-Micro R-Macro F-Micro F-Macro
  Obesity 0.9712 0.4926 0.4891 0.4907
  Depression 0.9835 0.9758 0.9824 0.979
  Hypertriglyceridemia 0.9959 0.9979 0.9167 0.9535
  Gallstones 0.9671 0.5462 0.5995 0.5655
  OSA 0.989 0.8997 0.7104 0.6753
  Asthma 0.9973 0.9947 0.9992 0.9969
  CAD 0.9334 0.7283 0.8337 0.769
  PVD 0.9863 0.9778 0.9715 0.9746
  Gout 0.9918 0.8767 0.8217 0.8452
  Diabetes 0.9738 0.8279 0.8502 0.8363
  CHF 0.9361 0.7419 0.8411 0.7698
  Venous Insufficiency 0.8504 0.9467 0.8923
  GERD 0.9696 0.6233 0.8848 0.67
  OA 0.9849 0.9698 0.9804 0.9749
  Hypercholesterolemia 0.9807 0.8656 0.9341 0.8752
  Hypertension 0.9479 0.8532 0.8456 0.8486
  System Textual 0.9747 0.6954 0.8171 0.7404
Positive Weighted
  [Table: Tie Breaker Judgment for each combination of Assertion Types in Tie (Y, Q, N)]
What May Account for the Difference?
  Paucity of the data
  Pre-processing
  Negation context and assertions
  Inherent properties of a medical text
    Degree of differentiation
    Locality of information
Highly Differentiated Text
  Email A: "Reviewed your presentation", "Jai called to say.."
  Email B: "online pharma", "viagra"
Text with Low Differentiation
  Medical Discharge Summary A: CAD
  Medical Discharge Summary B: CAD
Text with Low Differentiation
  Medical Discharge Summary A: "…father was diagnosed with CAD" (indeterminate)
  Medical Discharge Summary B: "…patient has CAD" (positive for CAD)
Text with Low Differentiation
  Medical Discharge Summary A: "…no family history of CAD" (indeterminate)
  Medical Discharge Summary B: "…patient has CAD" (positive for CAD)
NLP in Public Health: Major Problems
  Complexity of the underlying domains: medicine, microbiology, social sciences, etc.
  Complexity of language
  Creating lexicons and rules requires linguistic and computational expertise
  Idealized language vs. language in use
  Ambiguity
  Lack of large, rich lexicons suited to the application domain: they take a large effort to build and to adapt to specific domains
Possible Solutions
  Focus on making limited, domain-specific systems
  Combine rule-based and statistical approaches as needed
  Focus initially on limited, “simple” tasks
  Focus on real language use under realistic conditions
  Make progress by building working systems and evaluating them rigorously
Examples of NLP in Public Health
  Decision Augmentation Service (CDC)
  Public health search engine (CNLP)
  Validation of clinical data submitted to BioSense using NLP (Mayo Clinic)
  Information retrieval, text mining and knowledge synthesis for human genome epidemiology (National Office of Public Health Genomics)
NLP in Public Health
  Pneumonia detection from free-text radiological reports (Mayo Clinic)
  Public health situational awareness tools for information fusion (BIC)
  Chief-complaint-based syndromic surveillance (BioSense)
  Vaccine adverse event detection (CDC)
Questions and Answers