Knowledge Discovery and Data Mining to Assist Natural Language Understanding (Adam Wilcox, M.A., and George Hripcsak, M.D., Department of Medical Informatics, Columbia University, New York, NY, 1998). Presented by Chaveevan Pechsiri
Outline
  Objective
  Methodologies
  Results
  Discussion
  Suggestion
Objective
  Generate queries and rules that interpret the output from the MedLEE processor at Columbia-Presbyterian Medical Center
  Techniques:
    NLP
    Data mining: classification using C5.0
  Chest radiograph reports + clinic encounters
Methodologies
  NLP
    Findings with modifiers
  Generate a vector report
    Flattening = finding + modifier
    Coding = flattening + modifier value
  Classification
    The decision tree learner C5.0 (a descendant of ID3)
NLP
  Word and phrase recognition
  Standard term generation
  Classify terms into semantic categories
  Parse sequences of semantic categories into structures
  Pipeline: narrative report -> MedLEE processor (clinical dictionary, grammar rules dictionary) -> findings with modifiers
  Example terms: congestive heart failure, heart failure, CHF; left pleural effusion …… new pleural effusion
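The standard term generation step can be pictured as a dictionary lookup that maps synonymous phrases to one standard term. The following is a minimal sketch, not MedLEE's actual dictionary or matching logic: the table `STANDARD_TERMS` and the function `standardize` are hypothetical names, and the choice of "congestive heart failure" as the standard form is an assumption. Only the three synonymous phrases come from the slide.

```python
# Hypothetical synonym table. The three phrases are from the slide;
# treating "congestive heart failure" as the standard term is an assumption.
STANDARD_TERMS = {
    "congestive heart failure": "congestive heart failure",
    "heart failure": "congestive heart failure",
    "chf": "congestive heart failure",
}


def standardize(phrase: str) -> str:
    """Return the standard term for a phrase, or the cleaned phrase if unknown."""
    # Lowercase and collapse runs of whitespace before the lookup.
    key = " ".join(phrase.lower().split())
    return STANDARD_TERMS.get(key, key)


if __name__ == "__main__":
    for text in ["CHF", "Heart  failure", "pleural effusion"]:
        print(text, "->", standardize(text))
```

A real clinical dictionary would also need multi-word phrase matching inside running text, which this lookup does not attempt.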
NLP
  Narrative report:
    “Probable mild pulmonary vascular congestion with new left pleural effusion, question mild congestive changes”
  MedLEE processor output (3 findings with modifiers):
    Pulmonary vascular congestion
      certainty: high
      degree: low
    Pleural effusion
      region: left
      status: new
    Congestive change
      certainty: moderate
      degree: low
Coding finding-modifier pairs
  Processor output:
    Pulmonary vascular congestion
      certainty: high
      degree: low
    Pleural effusion
      region: left
      status: new
    Congestive change
      certainty: moderate
      degree: low
  Finding vector report:
    pulmonary vascular congestion = present
    pulmonary vascular congestion: certainty = high
    pulmonary vascular congestion: degree = low
    pleural effusion = present
    pleural effusion: region = left
    pleural effusion: status = new
    congestive change = present
    congestive change: certainty = moderate
    congestive change: degree = low
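The flattening and coding steps above can be sketched as a small function that turns each finding into a "present" attribute plus one attribute per modifier. This is an illustrative sketch only: the function name `code_findings` and the dictionary-based input format are assumptions, not the representation used by MedLEE or by the authors.

```python
def code_findings(findings):
    """Flatten (finding, modifiers) pairs into attribute -> value entries."""
    vector = {}
    for finding, modifiers in findings:
        # The finding itself becomes an attribute with the value "present".
        vector[finding] = "present"
        # Each finding-modifier pair becomes its own attribute.
        for modifier, value in modifiers.items():
            vector[f"{finding}: {modifier}"] = value
    return vector


# Processor output from the slide, in a hypothetical input format.
findings = [
    ("pulmonary vascular congestion", {"certainty": "high", "degree": "low"}),
    ("pleural effusion", {"region": "left", "status": "new"}),
    ("congestive change", {"certainty": "moderate", "degree": "low"}),
]

vector = code_findings(findings)

if __name__ == "__main__":
    for attribute, value in vector.items():
        print(f"{attribute} = {value}")
```

Attributes that are absent from a report would need a default value (for example "absent") before the vectors are passed to a classifier; that step is not shown on the slide and is left out here.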
Diagnosing Hypothyroidism (C5.0 decision table)

Attribute                    Assay 1     Assay 2             Assay 3
age
sex                          F           M                   M
on thyroxine                 t           f                   f
query on thyroxine           f           f                   f
on antithyroid medication    f           f                   f
sick                         f           f                   f
pregnant                     t           N/A                 N/A
thyroid surgery              f           f                   f
I131 treatment               f           f                   f
query hypothyroid            f           f                   t
query hyperthyroid           t           f                   f
lithium                      f           f                   f
tumor                        f           f                   f
goitre                       f           f                   f
hypopituitary                f           f                   f
psych                        f           f                   f
TSH
T
TT4
T4U
FTI
referral source              other       SVI                 other
diagnosis                    negative    primary hypothyr    compensated hypothyr
C5.0 If-then rules

Rule 1: (31, lift 42.7)
    thyroid surgery = f
    TSH > 6
    TT4 <= 37
    -> class primary [0.970]

Rule 2: (63/6, lift 39.3)
    TSH > 6
    FTI <= 65
    -> class primary [0.892]

Rule 3: (270/116, lift 10.3)
    TSH > 6
    -> class compensated [0.570]

Rule 4: (2225/2, lift 1.1)
    TSH <= 6
    -> class negative [0.999]

Rule 5: (296, lift 1.1)
    on thyroxine = t
    FTI > 65
    -> class negative [0.997]
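In each rule header, (n/m) gives the number of training cases the rule covers and how many of them it misclassifies; the bracketed value is the rule's confidence, which matches the Laplace ratio (n - m + 1) / (n + 2) for all five rules. The sketch below reproduces those confidences and applies the rules to a case. The names `laplace_confidence`, `RULES` and `classify` are hypothetical, and the voting scheme (sum the confidences of matching rules per class, fall back to an assumed default class of "negative") is a simplification rather than a description of C5.0 internals.

```python
from collections import defaultdict


def laplace_confidence(n, m=0):
    """Laplace ratio for a rule covering n cases, m of them incorrectly."""
    return (n - m + 1) / (n + 2)


# (condition, class, n, m) transcribed from the five rules on the slide.
RULES = [
    (lambda c: c["thyroid surgery"] == "f" and c["TSH"] > 6 and c["TT4"] <= 37,
     "primary", 31, 0),
    (lambda c: c["TSH"] > 6 and c["FTI"] <= 65, "primary", 63, 6),
    (lambda c: c["TSH"] > 6, "compensated", 270, 116),
    (lambda c: c["TSH"] <= 6, "negative", 2225, 2),
    (lambda c: c["on thyroxine"] == "t" and c["FTI"] > 65, "negative", 296, 0),
]


def classify(case, default="negative"):
    """Pick the class with the largest total confidence among matching rules."""
    votes = defaultdict(float)
    for condition, label, n, m in RULES:
        if condition(case):
            votes[label] += laplace_confidence(n, m)
    if not votes:
        return default  # assumed default class; not stated on the slide
    return max(votes, key=votes.get)


if __name__ == "__main__":
    # Hypothetical case, not taken from the slide's decision table.
    case = {"thyroid surgery": "f", "on thyroxine": "f",
            "TSH": 10.0, "TT4": 30.0, "FTI": 60.0}
    print(classify(case))
    print([round(laplace_confidence(n, m), 3) for _, _, n, m in RULES])
```

Rule 3 illustrates why the confidence matters: it covers 270 cases but gets 116 of them wrong, so it is easily outvoted when a more specific rule also fires.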
Error Measurement
  TP = True Positive
  FN = False Negative
  TN = True Negative
  FP = False Positive
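The four counts above are the usual inputs to sensitivity and specificity, the two measures the later discussion refers to. The slide does not preserve its formulas, so the sketch below assumes the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The function names and the example counts are hypothetical and are not results from the study.

```python
def sensitivity(tp, fn):
    """Fraction of truly positive cases that the classifier detects."""
    if tp + fn == 0:
        raise ValueError("no positive cases")
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of truly negative cases that the classifier rejects."""
    if tn + fp == 0:
        raise ValueError("no negative cases")
    return tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    tp, fn, tn, fp = 8, 2, 85, 5
    print(f"sensitivity = {sensitivity(tp, fn):.3f}")
    print(f"specificity = {specificity(tn, fp):.3f}")
```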
Results
Discussion
  The automated method did not reach the performance level of the physicians
  High noise in the training set
  The training set was too small to train the system properly to detect positive findings
  The training set labeled with ICD-9 codes was not accurate enough to create rules
  Ambiguities caused C5.0 errors or a lack of strong specificity
Suggestion
  A larger training set is needed to generate a sensitive classifier
  An ontology should be incorporated into the clinical dictionary
  The ICD-9 coding needs to be modified
  The discovered knowledge should be generalized knowledge
  Try other classifiers: Bayesian belief networks, backpropagation neural networks, the sequential covering algorithm