Published by Adam Dean. Modified over 9 years ago.
1
Information Extraction for Clinical Data Mining: A Mammography Case Study
H. Nassif, R. Woods, E. Burnside, M. Ayvaci, J. Shavlik and D. Page
University of Wisconsin – Madison, USA
2
The American Cancer Society, Cancer Facts & Figures 2009.
3
[Diagram labels: Impression (free text); Mammogram; Radiologist; Structured Database; Predictive Model; Benign; Malignant]
4
Task Formulation
Given:
- Free text radiology report
- Standard lexicon (BI-RADS)
Do:
- Extract lexicon concepts from text
- Populate a structured database
Why:
- Automate information extraction
- Manual extraction is labor intensive
- Consistency checks
5
BI-RADS Lexicon Concepts
6
Example
“In the right breast, an approximately 1.0 cm mass is identified in the right upper slightly inner breast. This mass is noncalcified and partially obscured and lobulated in appearance.”
Concepts:
            Lobular Shape   Oval Shape   Obscured Margin   …
Report 1    0               1            0                 …
Report 2    1               0            1                 …
…           …               …            …                 …
8
Syntax Analyzer
- Tokenize sentences
- Discard punctuation
- Keep stop words
- Stem words
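The four steps on this slide can be sketched as below. This is a minimal illustration and not the authors' implementation: the slides do not name a stemmer, so the suffix list and the `stem` and `analyze` helpers are hypothetical stand-ins for whatever stemmer the real system used.

```python
import re

# Hypothetical, deliberately crude suffix list (an assumption for illustration);
# a production system would more likely use an established stemmer.
SUFFIXES = ("ations", "ation", "ings", "ing", "ed", "es", "s")
MIN_STEM_LENGTH = 4  # avoid mangling short words such as "this" or "mass"


def stem(word):
    """Strip the first matching suffix that leaves a long enough stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


def analyze(report):
    """Return one list of stemmed tokens per sentence of the report."""
    # Split after ., ! or ? only when whitespace follows, so "1.0 cm" stays intact.
    sentences = re.split(r"(?<=[.!?])\s+", report.strip())
    result = []
    for sentence in sentences:
        # Keeping only letters and digits (plus decimals) discards punctuation.
        # Stop words such as "no" and "is" are kept on purpose.
        tokens = re.findall(r"[a-z0-9]+(?:\.[0-9]+)?", sentence.lower())
        if tokens:
            result.append([stem(token) for token in tokens])
    return result


if __name__ == "__main__":
    print(analyze("This mass is noncalcified and partially obscured. No skin thickening."))
```

Keeping stop words matters for the later stages: words such as “no”, “not” and “without” act as negation triggers, so dropping them here would lose information.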
10
Information from Lexicon
Lexicon specifies synonyms:
- e.g. Equal density, Isodense
Lexicon allows for ambiguous wording (text -> concept):
- indistinct margin
- indistinct calcification -> amorphous calcification
- indistinct image -> not a concept
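One way to picture the lexicon information is as two lookup tables, sketched below. The table names and the `canonical` and `resolve` helpers are hypothetical, and the mapping for “indistinct margin” is an assumption, since the slide text does not preserve a concept for that row.

```python
# Synonyms named on the slide collapse to one canonical concept name.
SYNONYMS = {
    "isodense": "equal density",
    "equal density": "equal density",
}

# The same modifier resolves to different concepts depending on the noun it
# describes. None means "not a concept".
AMBIGUOUS = {
    ("indistinct", "margin"): "indistinct margin",  # assumption, not on the slide
    ("indistinct", "calcification"): "amorphous calcification",
    ("indistinct", "image"): None,
}


def canonical(term):
    """Map a synonym to its canonical concept name, or None if unknown."""
    return SYNONYMS.get(term.lower())


def resolve(modifier, noun):
    """Resolve an ambiguous modifier using the noun it is attached to."""
    return AMBIGUOUS.get((modifier.lower(), noun.lower()))


if __name__ == "__main__":
    print(canonical("Isodense"))
    print(resolve("indistinct", "calcification"))
    print(resolve("indistinct", "image"))
```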
12
Experts
Provide domain specific information:
- Synonyms: Oval, Ovoid
- Acronyms, abbreviations
- Domain idiosyncrasies
Interact with and modify semantic rules
14
Concept Finder
- Context Free Grammar rules
- Extract concepts from text
Rule formation:
- Initial rules based on lexicon
- Rules refined by experts
15
Rule Generation Example 1
Aim: Regional Distribution Concept
- Lexicon specifies the word “regional”
- Initial rule: presence of the word “regional”
- Run on training set, experts see results
- Many false positives: “regional medical center”, “regional hospital”
- Rule refined by experts: “regional.* !(medical|hospital)”
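The refined rule is written in the authors' own notation. One possible reading of it as a Python regular expression, using a negative lookahead, is sketched here; the pattern and the `has_regional_distribution` helper are an interpretation, not the rule as implemented in the system.

```python
import re

# "regional" as a whole word, unless the next word is "medical" or "hospital".
REGIONAL_RULE = re.compile(
    r"\bregional\b(?!\s+(?:medical|hospital)\b)",
    re.IGNORECASE,
)


def has_regional_distribution(sentence):
    """Return True if the sentence triggers the refined 'regional' rule."""
    return REGIONAL_RULE.search(sentence) is not None


if __name__ == "__main__":
    for text in (
        "Calcifications in a regional distribution",
        "Prior films from the regional medical center",
        "Referred by the regional hospital",
    ):
        print(text, "->", has_regional_distribution(text))
```

The initial rule corresponds to the same pattern without the lookahead, which is why it fires on “regional medical center” and “regional hospital”.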
16
Rule Generation Example 2
Aim: Skin Thickening Concept
- Lexicon specifies “skin thickening”
- Try “skin” and “thickening” in same sentence:
  - “skin retraction and thickening”
  - “thickening of the overlying skin”
  - “A BB placed on the skin overlying a palpable focal area of thickening in the upper outer right breast”
- Experts suggest “skin” and “thickening” in close proximity
17
Scope
Scope: distance between two words
- Start with a large scope: assess number of true and false positives
- Move to smaller scopes: assess number of false negatives
- Check precision and recall estimates
- Experts decide on the best distance
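A proximity rule of this kind can be sketched as a word-distance check. The helpers `min_distance` and `within_scope` are hypothetical, distance is counted in tokens (an assumption, since the slides do not define the unit), and the scope values in the sweep are illustrative only; the slides do not report the distance the experts chose.

```python
import re


def min_distance(sentence, word_a, word_b):
    """Smallest token distance between word_a and word_b, or None if either is absent."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    positions_a = [i for i, token in enumerate(tokens) if token == word_a]
    positions_b = [i for i, token in enumerate(tokens) if token == word_b]
    if not positions_a or not positions_b:
        return None
    return min(abs(i - j) for i in positions_a for j in positions_b)


def within_scope(sentence, word_a, word_b, scope):
    """True if both words occur no more than `scope` tokens apart."""
    distance = min_distance(sentence, word_a, word_b)
    return distance is not None and distance <= scope


SENTENCES = [
    "skin retraction and thickening",
    "thickening of the overlying skin",
    "A BB placed on the skin overlying a palpable focal area of thickening "
    "in the upper outer right breast",
]

if __name__ == "__main__":
    # Sweep from a large scope to smaller ones and count how many sentences fire.
    for scope in (10, 6, 4, 2):
        hits = sum(within_scope(s, "skin", "thickening", scope) for s in SENTENCES)
        print(f"scope={scope}: {hits} of {len(SENTENCES)} sentences match")
```

On these three sentences the word distances are 3, 4 and 7 tokens, so a large scope accepts all of them while a scope of 4 drops only the “BB placed on the skin” sentence. Shrinking the scope further starts to lose the first two, which is the trade-off between false positives and false negatives that the experts assess.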
19
Negation Detector
Negation triggers (Mutalik 01, Gindl 08):
- “not”, if not preceded by “where”
- “no”
- “without”
Precedes or appears within the subsentence
Establish negation scope:
- “without evidence of suspicious cluster of microcalcifications”
20
Negation Deactivation
“there is no change in the rounded density”
Negation-deactivation triggers:
- Change
- All
- Correlation
- Differ
- Other
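The interplay of negation triggers and negation-deactivation triggers might look like the sketch below. The `is_negated` helper is hypothetical, and it assumes that a deactivation trigger only cancels a negation when it immediately follows the negation word; the slides do not specify how far apart the two may be.

```python
import re

NEGATION_TRIGGERS = {"no", "not", "without"}
DEACTIVATION_TRIGGERS = {"change", "all", "correlation", "differ", "other"}


def is_negated(subsentence):
    """Return True if the concepts in this subsentence should be treated as negated."""
    tokens = re.findall(r"[a-z]+", subsentence.lower())
    for i, token in enumerate(tokens):
        if token not in NEGATION_TRIGGERS:
            continue
        # "not" does not count as a trigger when preceded by "where".
        if token == "not" and i > 0 and tokens[i - 1] == "where":
            continue
        # Assumption: a deactivation trigger directly after the negation word
        # ("no change", "not all") means the concept itself is not negated.
        following = tokens[i + 1] if i + 1 < len(tokens) else ""
        if following in DEACTIVATION_TRIGGERS or following.rstrip("s") in DEACTIVATION_TRIGGERS:
            continue
        return True
    return False


if __name__ == "__main__":
    for text in (
        "without evidence of suspicious cluster of microcalcifications",
        "there is no change in the rounded density",
    ):
        print(text, "->", is_negated(text))
```

With this check, “no change in the rounded density” still yields the rounded density concept, because the negation applies to “change” and not to the finding.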
21
Multiple Latent Concepts
Mammography reports:
- Radiology concepts
- Ultrasound concepts
- MRI concepts…
“round hypoechoic mass”
- Concept should not be extracted
Provide an ultrasound lexicon:
- Algorithm handles multiple latent concepts
22
Experiment
- Training set: 146,198 reports, unlabeled
- Testing set: 100 reports, labeled by radiologist
- Algorithm output differs from the radiologist's labels on 43 concept occurrences
  - Algorithm is correct on 28 of them

            Lobular Shape   Oval Shape   Obscured Margin   …
Report 1    0               1            0                 …
Report 2    1               0            1                 …
…           …               …            …                 …
23
Contingency Table on Test Set
Automated v/s Manual Feature Extraction

                                  Actual
                                  Concept present   Concept absent
Predicted    Concept present      211 (198)         5 (5)
             Concept absent       10 (23)           4074 (4074)
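Precision and recall follow directly from these counts. The sketch below assumes, following the order in the slide title, that the unparenthesized counts belong to the automated method and the parenthesized counts to the manual one; the `precision_recall` helper is illustrative.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Counts read from the contingency table (assumed: automated first, manual in parentheses).
automated = precision_recall(tp=211, fp=5, fn=10)
manual = precision_recall(tp=198, fp=5, fn=23)

if __name__ == "__main__":
    print("automated: precision=%.3f recall=%.3f" % automated)  # 0.977, 0.955
    print("manual:    precision=%.3f recall=%.3f" % manual)     # 0.975, 0.896
```

Under this reading the two methods have almost identical precision, and the difference lies in recall: the automated method misses 10 concept occurrences where the manual one misses 23.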
24
Statistics
Ground truth?
- Features that both methods agree on
- Experts re-label diverging cases
Probabilistic interpretation of contingency table (Goutte 05)
Computational method is statistically superior to the manual method (p=0.024)
25
Conclusion
Automated extraction that matches experts
Novel contributions:
- Negation-deactivation triggers
- Handling multiple latent concepts
Improves our current breast cancer classifier (work in progress)