1
NLP in Public Health: What May Work
Ninad Mishra, Dave Cummo, Jim Arnzen
2
Monkeys and Bananas
We gave the monkeys the bananas because they were hungry.
We gave the monkeys the bananas because they were over-ripe.
The two sentences have the same surface grammatical structure. However, the pronoun "they" refers to the monkeys in one sentence and to the bananas in the other, and it is impossible to tell which without knowledge of the properties of monkeys and bananas.
Source: Unknown
3
NLP Applications
Question answering (information retrieval)
Text/document classification
Text mining
  Find something that was not known before
Information extraction
  Extract useful information from public health grey literature
Source: CNLP
4
Classification with NLP
Statistical Methods
  Bayesian
  Artificial Neural Networks
  Support Vector Machines
Rule Based Classification System
  Keyword approach
  Negation strategies
5
Naïve Bayesian Classification
Statistical classification method
Uses supervised learning
Requires a human-classified training set
Assumes that, given the class, the occurrence of each individual word is independent of every other word
6
Naïve Bayesian Classification
Spam Filtering Trial
Training corpus (SME-classified): spam, non-spam
Calculations: word frequency statistics stored for spam and non-spam
Testing corpus (unclassified): spam and non-spam
Output: spam identified, valid non-spam identified
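The training and testing flow of the spam filtering trial can be sketched as a small multinomial Naïve Bayes classifier. This is a minimal illustration, not the code used in the trial: the whitespace tokenizer, the add-one (Laplace) smoothing, the function names `train` and `classify`, and the four toy training messages are all assumptions made for the example.

```python
import math
from collections import Counter


def train(labeled_docs):
    """Store word frequency statistics per class from (text, label) pairs."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of training documents
    vocab = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def classify(model, text):
    """Return the label with the highest log posterior under the naive assumption."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # class prior
        for word in text.lower().split():
            if word not in model["vocab"]:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (word, class) pairs from zeroing the score.
            score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, hypothetical training corpus standing in for the SME-classified messages.
TRAINING = [
    ("cheap viagra from online pharma", "spam"),
    ("online pharma discount viagra", "spam"),
    ("reviewed your presentation slides", "non-spam"),
    ("jai called about the presentation", "non-spam"),
]
model = train(TRAINING)
print(classify(model, "viagra discount"))
print(classify(model, "presentation reviewed"))
```

Summing log probabilities rather than multiplying raw probabilities avoids numeric underflow on long messages.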
7
Preliminary Results
Using Classifier4j Package (Modified Naïve Bayesian)
Precision: .864
Recall: .810
F-Measure: .836
8
Preliminary Results
Using True Naïve Bayesian Classification
Precision: .973
Recall: .993
F-Measure: .983
9
Precision and Recall
Precision is the fraction of retrieved documents that are relevant.
Recall is the fraction of relevant documents that are retrieved.
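The two definitions, together with the F-Measure reported in the preliminary results, can be written as short functions. This sketch assumes the F-Measure is the balanced harmonic mean of precision and recall (F1), which is consistent with the reported figures; the function names and the example counts are illustrative.

```python
def precision(true_pos, false_pos):
    """Fraction of retrieved documents that are relevant."""
    retrieved = true_pos + false_pos
    return true_pos / retrieved if retrieved else 0.0


def recall(true_pos, false_neg):
    """Fraction of relevant documents that are retrieved."""
    relevant = true_pos + false_neg
    return true_pos / relevant if relevant else 0.0


def f_measure(p, r):
    """Balanced harmonic mean of precision and recall (F1)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 8 spam caught, 2 valid messages flagged, 2 spam missed.
print(precision(8, 2), recall(8, 2))

# Recomputing F from the reported precision and recall pairs.
print(round(f_measure(0.864, 0.810), 3))
print(round(f_measure(0.973, 0.993), 3))
```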
10
Confusion Matrix
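As a minimal sketch of how a two-class confusion matrix is tallied, the function below compares SME labels with classifier output. The label strings, the choice of "spam" as the positive class, and the five example messages are assumptions made for illustration.

```python
from collections import Counter


def confusion_matrix(gold, predicted, positive="spam"):
    """Count true/false positives and negatives for one positive class."""
    cells = Counter()
    for g, p in zip(gold, predicted):
        if p == positive:
            cells["TP" if g == positive else "FP"] += 1
        else:
            cells["FN" if g == positive else "TN"] += 1
    return {name: cells[name] for name in ("TP", "FP", "FN", "TN")}


# Hypothetical SME labels and classifier output for five messages.
gold = ["spam", "spam", "non-spam", "non-spam", "spam"]
predicted = ["spam", "non-spam", "non-spam", "spam", "spam"]
print(confusion_matrix(gold, predicted))
```

Precision uses the TP and FP cells, and recall uses the TP and FN cells.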
11
NLP Experience @CDC ‘RiskBot’ POCs
MSM online text/profile assessment project
i2b2 (Informatics for Integrating Biology & the Bedside) medical discharge summary classification challenge
MySpace data elements for public health message tailoring
12
i2b2
NIH-funded National Center for Biomedical Computing based at Partners HealthCare System
i2b2 issues ‘challenges’ to correctly classify health records based on conditions and co-morbidities, and invites various institutions/teams to compete
Results shown on the next few slides are derived from training set data
13
Statistical Analysis-Textual Judgment
14
Columns: Disease, P-Micro, P-Macro, R-Micro, R-Macro, F-Micro, F-Macro
CAD                   0.3911  0.4065  0.4522  0.2946
CHF                   0.5153  0.4915  0.3811  0.411
Depression            0.6154  0.5619  0.6234  0.5261
Diabetes              0.4105  0.3894  0.4943  0.2865
GERD                  0.2268  0.3095  0.4719  0.1302
Gout                  0.2349  0.3621  0.3318  0.1915
Hypertension          0.4883  0.4025  0.5652  0.3566
Hypertriglyceridemia  0.5789  0.5093  0.5946  0.3975
OA                    0.6181  0.5644  0.6178  0.5349
Obesity               0.2863  0.3942  0.5708  0.2092
OSA                   0.1525  0.3145  0.5075  0.1232
PVD                   0.7599  0.5958  0.6429  0.6056
Venous Insufficiency  0.739   0.513   0.5884  0.4671
System Textual        0.5263
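The column names distinguish micro-averaged from macro-averaged scores. As a hedged sketch of the difference, shown here for precision only: micro-averaging pools the counts of all classes before dividing, while macro-averaging computes the score per class and then takes the unweighted mean. The class labels Y, N and Q and the per-class counts below are hypothetical, not taken from the challenge data.

```python
def micro_macro_precision(per_class):
    """per_class maps a class label to its (true_pos, false_pos) counts."""
    total_tp = sum(tp for tp, _ in per_class.values())
    total_fp = sum(fp for _, fp in per_class.values())
    # Micro: pool all decisions, so frequent classes dominate.
    micro = total_tp / (total_tp + total_fp) if (total_tp + total_fp) else 0.0
    # Macro: average per-class precision, so every class counts equally.
    per_class_p = [
        tp / (tp + fp) if (tp + fp) else 0.0 for tp, fp in per_class.values()
    ]
    macro = sum(per_class_p) / len(per_class_p)
    return micro, macro


# Hypothetical counts: a large well-classified class and a small poorly classified one.
counts = {"Y": (90, 10), "N": (8, 2), "Q": (1, 9)}
micro, macro = micro_macro_precision(counts)
print(round(micro, 3), round(macro, 3))
```

With counts like these, a rare class that is classified poorly lowers the macro score far more than the micro score.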
15
Rule Based NLP System
Looks for keywords associated with each morbidity
Tries to identify the assertion type in which each keyword appears
  Assertion types: positive, negative, questionable
Text preprocessing
  Primarily removes text not relevant to patient’s current condition
Documents classified using scoring algorithm
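The steps above (preprocessing, keyword lookup, assertion typing, scoring) can be sketched as follows. This is an illustration under stated assumptions, not the presenters' system: the keyword lexicon, the cue lists, the rule that only text before the keyword determines the assertion type, the vote-counting score with a configurable tie-break order, and the "indeterminate" result for documents with no usable mention are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical lexicon and cue lists, for illustration only.
KEYWORDS = {"CAD": ["coronary artery disease", "cad"]}
NEGATION_CUES = ["no", "denies", "without", "negative for"]
QUESTION_CUES = ["possible", "probable", "questionable", "rule out"]
NOT_PATIENT_CUES = ["family history", "father", "mother"]


def has_cue(cues, text):
    """True if any cue occurs in text as a whole word or phrase."""
    return any(re.search(r"\b" + re.escape(c) + r"\b", text) for c in cues)


def preprocess(text):
    """Split into sentences and drop those that are not about the patient."""
    sentences = re.split(r"(?<=[.!?])\s+", text.lower())
    return [s for s in sentences if not has_cue(NOT_PATIENT_CUES, s)]


def assertion_type(prefix):
    """Assertion type inferred from the words that precede the keyword."""
    if has_cue(NEGATION_CUES, prefix):
        return "negative"
    if has_cue(QUESTION_CUES, prefix):
        return "questionable"
    return "positive"


def classify(text, disease, tie_break=("positive", "questionable", "negative")):
    """Tally assertion types per sentence; ties go to the earliest tie_break entry."""
    tally = Counter()
    for sentence in preprocess(text):
        for keyword in KEYWORDS[disease]:
            match = re.search(r"\b" + re.escape(keyword) + r"\b", sentence)
            if match:
                tally[assertion_type(sentence[: match.start()])] += 1
                break  # count each sentence once
    if not tally:
        return "indeterminate"
    top = max(tally.values())
    return next(t for t in tie_break if tally[t] == top)


print(classify("Patient has CAD.", "CAD"))
print(classify("Father was diagnosed with CAD.", "CAD"))
print(classify("No evidence of CAD.", "CAD"))
```

Changing the order of `tie_break` is one simple way to weight a system toward positive or questionable judgments when the evidence is evenly split.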
16
Classification Rule Based system
Corpus: i2b2 Obesity Challenge
17
Positive weighted
Columns: Disease, P-Micro, P-Macro, R-Micro, R-Macro, F-Micro, F-Macro
Obesity         0.9767  0.7426  0.4924
Depression      0.9835  0.9758  0.9824  0.979
Hypertri.       0.9959  0.9979  0.9167  0.9535
Gallstones      0.9712  0.6004  0.6064  0.6033
OSA             0.9904  0.9448  0.655   0.671
Asthma          0.9973  0.9947  0.9992  0.9969
CAD             0.9404  0.7715  0.8017  0.7857
PVD             0.9877  0.978   0.9764  0.9772
Gout            0.9918  0.8767  0.8217  0.8452
Diabetes        0.9752  0.8636  0.8155  0.8369
CHF             0.9417  0.7698  0.8746  0.8038
Ven. Insuf.     0.9863  0.8504  0.9467  0.8923
GERD            0.9696  0.6233  0.8848  0.67
OA              0.9876  0.9664  0.9891  0.9773
Hyperchol.      0.9821  0.9906  0.9349  0.959
Hyperten.       0.9575  0.9143  0.85    0.8796
System Textual  0.7391  0.7969  0.765
18
Questionable weighted
Columns: Disease, P-Micro, P-Macro, R-Micro, R-Macro, F-Micro, F-Macro
Obesity               0.9712  0.4926  0.4891  0.4907
Depression            0.9835  0.9758  0.9824  0.979
Hypertriglyceridemia  0.9959  0.9979  0.9167  0.9535
Gallstones            0.9671  0.5462  0.5995  0.5655
OSA                   0.989   0.8997  0.7104  0.6753
Asthma                0.9973  0.9947  0.9992  0.9969
CAD                   0.9334  0.7283  0.8337  0.769
PVD                   0.9863  0.9778  0.9715  0.9746
Gout                  0.9918  0.8767  0.8217  0.8452
Diabetes              0.9738  0.8279  0.8502  0.8363
CHF                   0.9361  0.7419  0.8411  0.7698
Venous Insufficiency  0.8504  0.9467  0.8923
GERD                  0.9696  0.6233  0.8848  0.67
OA                    0.9849  0.9698  0.9804  0.9749
Hypercholesterolemia  0.9807  0.8656  0.9341  0.8752
Hypertension          0.9479  0.8532  0.8456  0.8486
System Textual        0.9747  0.6954  0.8171  0.7404
19
Positive Weighted
Table: Assertion Types in Tie (Y, Q, N) and Tie Breaker Judgment
20
What May Account for the Difference ?
Paucity of the data
Pre-processing
Negation context and assertions
Inherent properties of a medical text
  Degree of differentiation
  Locality of information
21
Highly Differentiated Text
Figure: two text samples, A and B. Legible phrases: "Reviewed your presentation", "online pharma", "Jai called to say..", "viagra"
22
Text with Low Differentiation
Figure: Medical Discharge Summary A and Medical Discharge Summary B. Legible term in each: CAD
23
Text with Low Differentiation
Figure: Medical Discharge Summary A and Medical Discharge Summary B
Summary A: "…father was diagnosed with CAD" (indeterminate)
Summary B: "…patient has CAD" (positive for CAD)
24
Text with Low Differentiation
Figure: Medical Discharge Summary A and Medical Discharge Summary B
Summary A: "…no family history of CAD" (indeterminate)
Summary B: "…patient has CAD" (positive for CAD)
25
NLP in Public Health Major Problems
Complexity of the underlying domains: medicine, microbiology, social sciences, etc.
Complexity of language
Creating lexicons and rules requires linguistic and computational expertise
Idealized language vs. language in use
Ambiguity
Lack of large and rich lexicons suited to the application domain: they take a large effort to build and to adapt to specific domains
26
Possible Solutions
Focus on making limited, domain-specific systems
Combine rule-based and statistical approaches as needed
Focus initially on limited, “simple” tasks
Focus on real language use under realistic conditions
Make progress by building working systems and evaluating them rigorously
27
Examples of NLP in Public Health
Decision Augmentation Service (CDC)
Public health search engine (CNLP)
Validation of clinical data submitted to BioSense using NLP (Mayo Clinic)
Information retrieval, text mining and knowledge synthesis for human genome epidemiology (National Office of Public Health Genomics)
28
NLP in Public Health
Pneumonia detection from free text radiological reports (Mayo Clinic)
Public health situational awareness tools for information fusion (BIC)
Chief complaint based syndromic surveillance (BioSense)
Vaccine adverse event detection (CDC)
29
Questions and Answers