Classifying Supplement Use Status in Clinical Notes
Yadan Fan, BS1; Lu He, BS2; Serguei V.S. Pakhomov, PhD1,3; Genevieve B. Melton, MD, PhD1,4; Rui Zhang, PhD1,4
1Institute for Health Informatics, 2Department of Computer Science, 3College of Pharmacy, 4Department of Surgery, University of Minnesota, Minneapolis, MN
Introduction
Approximately 68% of Americans take dietary supplements1
Adverse reactions2: reported from in-vivo or in-vitro studies, case reports, and post-market surveillance, but under-reported
Electronic Health Records (EHRs): reliable patient information; supplement term coverage
A great amount of information about supplement use is embedded in clinical notes
1. www.crnusa.org/CRNconsumersurvey/2015. 2. Geller AI, et al. N Engl J Med. 2015;373:1531-1540. Zhang R, et al. AMIA Annu Symp Proc. 2015.
Objective
To automatically classify the use status of dietary supplements by applying text mining methods
25 supplements: Alfalfa, Ginkgo, Bilberry, Dandelion, Kava, Echinacea, Ginseng, Biotin, Flax seed, Lecithin, Fish oil, Melatonin, Black cohosh, Folic acid, Milk thistle, Garlic, St. John's Wort, Coenzyme Q10, Glucosamine, Saw palmetto, Ginger, Vitamin E, Cranberry, Glutamine, Turmeric
Method Overview
1300 sentences retrieved from the clinical data repository and annotated to build the gold standard (Continuing, Discontinued, Started, Unclassified)
Training set (1000 sentences, ~77%): 10 supplements, 100 randomly selected sentences each (alfalfa, echinacea, fish oil, garlic, ginger, ginkgo, ginseng, melatonin, St. John's Wort, Vitamin E); after preprocessing, used to select the model by comparing 7 feature sets with 5 classification algorithms
Test set (300 sentences, ~23%): 15 supplements, 20 sentences each (bilberry, biotin, black cohosh, coenzyme Q10, cranberry, dandelion, flax seed, folic acid, glucosamine, glutamine, kava, lecithin, milk thistle, saw palmetto, and turmeric); after preprocessing, used to evaluate the selected model
Data Collection
Notes retrieved by keyword searching with lexical variations, e.g., "ginkgo", "gingko", "ginko", "ginkoba"
Data sets: Training set, used to compare 7 feature sets with 5 classification algorithms; Test set, used to evaluate the optimal model selected on the training data
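The keyword search with lexical variations can be sketched as a simple pattern match; a minimal sketch, assuming a regex over sentence text (the actual retrieval system is not described in the slides):

```python
import re

# Match any of the ginkgo lexical variants named on the slide,
# case-insensitively, as whole words.
GINKGO_PATTERN = re.compile(r"\b(ginkgo|gingko|ginko|ginkoba)\b", re.IGNORECASE)

def mentions_ginkgo(sentence):
    """Return True if the sentence contains any ginkgo lexical variant."""
    return GINKGO_PATTERN.search(sentence) is not None

# Hypothetical example sentences, not taken from the study corpus.
notes = [
    "Pt started taking Ginkoba for memory.",
    "Patient denies herbal supplement use.",
]
hits = [s for s in notes if mentions_ginkgo(s)]
```

In practice one such pattern would be built per supplement, with the variant list curated by hand as in the slide.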
Development of Gold Standard
Annotation guideline adapted, with minor changes, from a previous study* investigating drug use status
Applied to 20 randomly selected sentences; disagreements resolved by discussion
*Pakhomov SV, et al. Proceedings of the AMIA Symposium. American Medical Informatics Association, 2002.
Annotation Guideline
Continuing (C): Patient continues on current supplements. Examples: "She continued on herbal supplements including echinacea." "Increase the dose of garlic."
Discontinued (D): Discontinuation of supplements. Examples: "Stopped taking her garlic two weeks ago." "Pt will hold taking ginseng."
Started (S): Initiation of new supplements or restarting of supplements. Examples: "Start ginkgo to help memory." "Begin melatonin 10mg 1 hour before bedtime."
Unclassified (U): Does not offer enough information about the use status, such as recommendation, education, or negation. Examples: "Advised over-the-counter melatonin." "Denies using St. John's Wort."
Development of Gold Standard
Inter-annotator agreement measured on 100 randomly selected sentences
Dataset equally split and annotated by two reviewers
Cohen's Kappa score: 0.93; percentage agreement: 95%
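The agreement statistics above can be computed directly from the two annotators' label sequences; a minimal sketch, where the toy label lists are invented for illustration and do not reproduce the study's 0.93 kappa:

```python
from collections import Counter

def cohens_kappa(ann1, ann2):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(ann1) == len(ann2)
    n = len(ann1)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(ann1, ann2)) / n
    # Chance agreement: from each annotator's marginal label frequencies.
    c1, c2 = Counter(ann1), Counter(ann2)
    expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical annotations over the four use-status classes.
a = ["C", "C", "D", "S", "U", "C", "D", "S", "U", "C"]
b = ["C", "C", "D", "S", "U", "C", "D", "U", "U", "C"]
kappa = cohens_kappa(a, b)
```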
Gold Standard
Feature Set Type 0 – raw unigrams Bag-of-words representation method
Feature Set
Type 0 – raw unigrams
Type 1 – normalized unigrams
Normalized with the Lexical Variant Generation (LVG) tool
E.g., "takes", "taken", "taking", "took" → "take"
Feature Set Type 0 – raw unigrams Type 1 – normalized unigrams Type 2 – normalized unigrams + bigrams
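The first three feature set types are bag-of-words variants; a minimal stdlib sketch, where the LVG normalization step of Types 1–2 is omitted (a stemmer or lemmatizer would be a rough stand-in) and the example sentence is from the annotation guideline:

```python
from collections import Counter

def ngram_features(sentence, use_bigrams=False):
    """Count unigram (and optionally bigram) features for one sentence."""
    tokens = sentence.lower().split()
    feats = Counter(tokens)               # Type 0: raw unigram counts
    if use_bigrams:                       # Type 2 additionally counts bigrams
        feats.update(zip(tokens, tokens[1:]))
    return feats

f0 = ngram_features("Stopped taking her garlic two weeks ago")
f2 = ngram_features("Stopped taking her garlic two weeks ago", use_bigrams=True)
```

Each sentence then becomes a sparse vector over the union of all observed features, which is what the classifiers consume.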
Feature Set
Type 0 – raw unigrams
Type 1 – normalized unigrams
Type 2 – normalized unigrams + bigrams
Type 3 – indicator words only
Semantic cues:
She has increased alfalfa tablets: Continuing
Stopped taking her garlic two weeks ago: Discontinued
Pt started taking ginkgo biloba: Started
Melatonin is recommended for sleep aid: Unclassified
A list of indicator words: Pakhomov SV, et al. Proceedings of the AMIA Symposium. American Medical Informatics Association, 2002.
Indicator Keywords
start: start, starts, started, starting
restart: restart, restarts, restarted, restarting
resume: resume, resumes, resumed, resuming
initiate: initiate, initiates, initiated, initiating
increase: increase, increases, increased, increasing
decrease: decrease, decreases, decreased, decreasing
reduce: reduce, reduces, reduced, reducing
lower: lower, lowers, lowered, lowering
take: take, takes, took, taking, taken
consume: consume, consumes, consumed, consuming
stop: stop, stops, stopped, stopping
hold: hold, holds, held, holding
advise: advise, advises, advised, advising
avoid: avoid, avoids, avoided, avoiding
deny: deny, denies, denied, denying
decline: decline, declines, declined, declining
refuse: refuse, refuses, refused, refusing
neg: no, not, never
Feature Set
Type 0 – raw unigrams
Type 1 – normalized unigrams
Type 2 – normalized unigrams + bigrams
Type 3 – indicator words only
Type 4 – normalized unigrams + indicators with distance
Indicator is close to the supplement mention: "He continues on Coumadin and also has recently started ginseng as he is concerned about the fatigue he will have during chemotherapy" → Started
The optimal window size is 4
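The distance constraint of Type 4 can be sketched as a token-window check around the supplement mention; a minimal sketch with the window size of 4 from the slide and a small hypothetical subset of the indicator lexicon:

```python
# Hypothetical subset of the full indicator lexicon from the keyword slide.
INDICATORS = {"start", "started", "starting", "stop", "stopped",
              "hold", "held", "take", "taking", "denies", "increased"}

def indicators_near_supplement(tokens, supplement, window=4):
    """Return the indicator words within `window` tokens of the supplement."""
    tokens = [t.lower() for t in tokens]
    if supplement not in tokens:
        return set()
    i = tokens.index(supplement)
    nearby = tokens[max(0, i - window): i + window + 1]
    return INDICATORS.intersection(nearby)

sent = "he has recently started ginseng as he is concerned".split()
found = indicators_near_supplement(sent, "ginseng")
```

Indicators outside the window are ignored, which keeps cues about other drugs in the same sentence (e.g., Coumadin in the slide's example) from being attributed to the supplement.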
Feature Set
Type 0 – raw unigrams
Type 1 – normalized unigrams
Type 2 – normalized unigrams + bigrams
Type 3 – indicator words only
Type 4 – normalized unigrams + indicators with distance
Type 5 – normalized unigrams + bigrams + indicators with distance
Feature Set Type 0 – raw unigrams Type 1 – normalized unigrams Type 2 – normalized unigrams + bigrams Type 3 – indicator words only Type 4 – normalized unigrams + indicators with distance Type 5 – normalized unigrams + bigrams + indicator with distance Type 6 – nouns + verbs + adverbs Verbs hold more information (indicators) Stanford parser Nouns (NN/NNS/NNP/NNPS) Verbs (VB/VBG/VBP/VBZ/VBD/VBN) Some adverbs (RB): “no”, “not”, “never”
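The Type 6 filtering step keeps only the listed parts of speech; a minimal sketch, assuming POS-tagged (token, tag) pairs are already available (the slides use the Stanford parser for tagging, which is not reproduced here), with a made-up example sentence:

```python
# Penn Treebank tags named on the slide: noun and verb tags, plus the
# adverb tag RB restricted to the negation words no/not/never.
KEEP_TAGS = {"NN", "NNS", "NNP", "NNPS",
             "VB", "VBG", "VBP", "VBZ", "VBD", "VBN"}
KEEP_ADVERBS = {"no", "not", "never"}

def filter_pos(tagged_sentence):
    """Keep noun/verb tokens plus the negation adverbs no/not/never."""
    return [tok for tok, tag in tagged_sentence
            if tag in KEEP_TAGS or (tag == "RB" and tok.lower() in KEEP_ADVERBS)]

tagged = [("Pt", "NNP"), ("will", "MD"), ("not", "RB"),
          ("hold", "VB"), ("taking", "VBG"), ("ginseng", "NN")]
kept = filter_pos(tagged)
```

The surviving tokens then feed the same bag-of-words vectorization as the other feature types.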
Training and Evaluation Algorithms Support Vector Machine (SVM) Maximum Entropy Naive Bayes Decision Tree Random Forest Evaluation 10-fold cross validation Precision, Recall, and F-measure
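The training and evaluation setup can be sketched with scikit-learn; a minimal sketch comparing two of the five listed classifiers with 10-fold cross-validation, where the sentences and labels are toy placeholders rather than the study data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus: repeated hypothetical sentences, one per use-status class.
sentences = ["start ginkgo to help memory", "stopped taking her garlic",
             "continued on fish oil", "denies using st johns wort"] * 10
labels = ["S", "D", "C", "U"] * 10

for name, clf in [("SVM", LinearSVC()), ("Naive Bayes", MultinomialNB())]:
    # Unigram + bigram counts feed the classifier, as in the Type 2 features.
    pipe = make_pipeline(CountVectorizer(ngram_range=(1, 2)), clf)
    scores = cross_val_score(pipe, sentences, labels, cv=10, scoring="f1_macro")
    print(name, scores.mean())
```

Per-class precision, recall, and F-measure (as reported in the results tables) would come from `sklearn.metrics.classification_report` on held-out predictions.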
Training Data Performance
Classifier columns: SVM, Maximum Entropy, Naïve Bayes, Decision Tree, Random Forest (each reporting *P, *R, *F)
Type 0: 0.771 0.751 0.748 0.778 0.762 0.760 0.757 0.726 0.721 0.738 0.718 0.717 0.789 0.763 0.753
Type 1: 0.799 0.772 0.735 0.734 0.659 0.639 0.596 0.792 0.759 0.743 0.791 0.767 0.756
Type 2: 0.839 0.838 0.813 0.794 0.786 0.635 0.579 0.497 0.804 0.790 0.785 0.747
Type 3: 0.784 0.783 0.750 0.729 0.711 0.788 0.818 0.815 0.812
Type 4: 0.798 0.793 0.761 0.678 0.612 0.541 0.745 0.816
Type 5: 0.845 0.844 0.823 0.806 0.800 0.653 0.584 0.499 0.810 0.808
Type 6: 0.829 0.828 0.749 0.681 0.647 0.613 0.787
*P: precision, R: recall, F: F-measure
Test Data Performance: SVM with Type 5 features
Continuing (C): precision 0.869, recall 0.946, F-measure 0.906
Discontinued (D): precision 0.933, recall 0.894, F-measure 0.913
Started (S): precision 0.896, recall 0.932, F-measure 0.914
Unclassified (U): precision 0.786, recall 0.657, F-measure 0.715
Discussion
SVM outperformed the other algorithms
Classifier performance was affected by lexical normalization, bigrams, indicator words, and distance (window size)
Best model: SVM with normalized unigrams + bigrams + indicator words with distance
Limitations & Future Work
Small corpus: incorporate more data to enlarge the dataset
Abbreviations, acronyms, and typos: incorporate existing abbreviation disambiguation methods
Conclusions
The model trained on 10 supplements performed well on test data from 15 different supplements
Applying text mining methods to clinical notes can extract supplement use information
Knowledge of supplement use among patients can be further applied in clinical research
Acknowledgements Advisor: Rui Zhang, PhD NIH/National Center for Complementary and Integrative Health (NCCIH) grant (R01AT009457) (Zhang) University of Minnesota Grant-In-Aid award (Zhang) Agency for Healthcare Research & Quality grant (R01HS022085) (Melton) National Center for Advancing Translational Sciences of the National Institutes of Health (UL1TR000114) (Blazar)
Thank you! fanxx421@umn.edu