Classifying Supplement Use Status in Clinical Notes


Classifying Supplement Use Status in Clinical Notes
Yadan Fan, BS (1); Lu He, BS (2); Serguei V.S. Pakhomov, PhD (1,3); Genevieve B. Melton, MD, PhD (1,4); Rui Zhang, PhD (1,4)
(1) Institute for Health Informatics; (2) Department of Computer Science; (3) College of Pharmacy; (4) Department of Surgery
University of Minnesota, Minneapolis, MN

Introduction
Approximately 68% of Americans take dietary supplements (1)
Adverse reactions (2)
  Known from in-vivo or in-vitro studies, case reports, and post-market surveillance
  Under-reported
Electronic Health Records (EHRs)
  Reliable patient information
  Supplement term coverage
  A great amount of information about supplement use is embedded in clinical notes
1. www.crnusa.org/CRNconsumersurvey/2015.
2. Geller, Andrew I., et al. N Engl J Med 373 (2015): 1531-1540.
Zhang, Rui, et al. AMIA Annual Symposium Proceedings. Vol. 2015.

Objective
To automatically classify the use status of dietary supplements by applying text mining methods
25 supplements: Alfalfa, Ginkgo, Bilberry, Dandelion, Kava, Echinacea, Ginseng, Biotin, Flax seed, Lecithin, Fish oil, Melatonin, Black cohosh, Folic acid, Milk thistle, Garlic, St. John’s Wort, Coenzyme Q10, Glucosamine, Saw palmetto, Ginger, Vitamin E, Cranberry, Glutamine, Turmeric

Method Overview
Clinical data repository notes: 1300 sentences
Annotation: gold standard with four classes (Continuing, Discontinued, Started, Unclassified)
Training set (1000, ~77%): 10 supplements, 100 for each supplement
  Random selection of 10 supplements: alfalfa, echinacea, fish oil, garlic, ginger, ginkgo, ginseng, melatonin, St. John’s Wort, Vitamin E
  Preprocessing
  7 feature sets with 5 classification algorithms
  Select the model
Test set (300, ~23%): 15 supplements, 20 for each supplement
  15 supplements: bilberry, biotin, black cohosh, coenzyme Q10, cranberry, dandelion, flax seed, folic acid, glucosamine, glutamine, kava, lecithin, milk thistle, saw palmetto, and turmeric
  Preprocessing
  Evaluate the model
Performance evaluation

Data Collection
Notes retrieval
  Keyword searching
  Lexical variations: “ginkgo”, “gingko”, “ginko”, “ginkoba”
Data sets
  Training set: compare 7 feature sets with 5 classification algorithms
  Test set: evaluate the optimal model selected on the training data
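The keyword search with lexical variations can be illustrated with a minimal sketch. The ginkgo variants come from the slide; the dictionary layout, the function name `find_supplement_mentions`, and the whole-word matching rule are assumptions made for illustration, not the authors' retrieval code.

```python
import re

# Variant spellings per supplement. The ginkgo variants are from the slide;
# the dictionary layout itself is an illustrative assumption.
VARIANTS = {
    "ginkgo": ["ginkgo", "gingko", "ginko", "ginkoba"],
}

def build_pattern(variants):
    """Compile a case-insensitive, whole-word pattern for a list of variants."""
    # Longest variants first so "ginkoba" is tried before "ginko".
    ordered = sorted(variants, key=len, reverse=True)
    body = "|".join(re.escape(v) for v in ordered)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)

PATTERNS = {name: build_pattern(v) for name, v in VARIANTS.items()}

def find_supplement_mentions(sentence):
    """Return canonical names of the supplements mentioned in a sentence."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(sentence)]
```

Matching on whole words keeps a short variant such as "ginko" from firing inside an unrelated longer token; a real retrieval step would need a variant list for each of the 25 supplements.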

Development of Gold Standard
Annotation guideline
  Adapted from a previous study* investigating drug use status
  Minor changes
  Applied to 20 randomly selected sentences
  Disagreements resolved by discussion
*Pakhomov, Serguei V. et al. Proceedings of the AMIA Symposium. American Medical Informatics Association, 2002.

Annotation Guideline
Use status, definition, and examples:
Continuing (C): Patients continue on current supplements. Examples: “She continued on herbal supplements including echinacea.” “Increase the dose of garlic.”
Discontinued (D): Discontinuation of supplements. Examples: “Stopped taking her garlic two weeks ago.” “Pt will hold taking ginseng.”
Started (S): Initiation of new supplements or restarting supplements. Examples: “Start ginkgo to help memory.” “Begin melatonin 10mg 1 hour before bedtime.”
Unclassified (U): Does not offer enough information about the use status, such as recommendation, education, or negation. Examples: “Advised over-the-counter melatonin.” “Denies using st johns wort.”

Development of Gold Standard
Annotation guideline
Inter-annotator agreement
  100 randomly selected sentences
  Cohen’s Kappa score: 0.93
  Percentage agreement: 95%
The dataset was split equally between two reviewers and annotated
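For readers unfamiliar with the two agreement measures, here is a small sketch of how percentage agreement and Cohen's kappa are computed from two annotators' labels. The label sequences are made up for illustration and are not the study's annotations; the function names are likewise assumptions.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which the two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both annotators pick the same label
    # if each labelled independently with their own label frequencies.
    p_expected = sum(count_a[l] * count_b[l] for l in set(a) | set(b)) / (n * n)
    # Undefined (division by zero) when p_expected == 1, i.e. one label only.
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical labels (C, D, S, U) from two reviewers on eight sentences.
reviewer_1 = ["C", "C", "D", "S", "U", "U", "C", "D"]
reviewer_2 = ["C", "C", "D", "S", "U", "C", "C", "D"]
agreement = percent_agreement(reviewer_1, reviewer_2)
kappa = cohens_kappa(reviewer_1, reviewer_2)
```

Kappa is lower than raw agreement because part of the raw agreement would be expected by chance, which is why the slide reports both figures.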

Gold Standard

Feature Set
Type 0 – raw unigrams
  Bag-of-words representation method

Feature Set
Type 1 – normalized unigrams
  Lexical variation generation (LVG) tool
  E.g.: “takes”, “taken”, “taking”, “took” are all normalized to “take”

Feature Set
Type 2 – normalized unigrams + bigrams
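A rough sketch of normalized unigram and bigram counts for one sentence follows. The `NORMALIZE` lookup is only a stand-in for the LVG tool and covers just the slide's "take" example; the tokenizer and the `a_b` bigram naming are assumptions as well.

```python
import re
from collections import Counter

# Stand-in for lexical normalization. The slides use the LVG tool; this tiny
# lookup covers only the slide's own example and is an assumption.
NORMALIZE = {"takes": "take", "taken": "take", "taking": "take", "took": "take"}

def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def normalize(tokens):
    """Map each token to its base form when one is known."""
    return [NORMALIZE.get(t, t) for t in tokens]

def unigram_bigram_counts(tokens):
    """Bag-of-words counts over unigrams plus adjacent-token bigrams."""
    counts = Counter(tokens)
    counts.update(f"{a}_{b}" for a, b in zip(tokens, tokens[1:]))
    return counts

tokens = normalize(tokenize("Stopped taking her garlic two weeks ago."))
features = unigram_bigram_counts(tokens)
```

Dropping the `counts.update(...)` line gives unigram-only counts, and skipping `normalize` gives raw unigrams.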

Feature Set
Type 3 – indicator words only
  Semantic cues:
    She has increased alfalfa tablets: Continuing
    Stopped taking her garlic two weeks ago: Discontinued
    Pt started taking ginkgo biloba: Started
    Melatonin is recommended for sleep aid: Unclassified
  A list of indicator words
Pakhomov, Serguei V. et al. Proceedings of the AMIA Symposium. American Medical Informatics Association, 2002.

Indicator Keywords
start: start, starts, started, starting
restart: restart, restarts, restarted, restarting
resume: resume, resumed, resumes, resuming
initiate: initiate, initiates, initiated, initiating
increase: increase, increases, increased, increasing
decrease: decrease, decreases, decreased, decreasing
reduce: reduce, reduces, reduced, reducing
lower: lower, lowers, lowered, lowering
take: take, takes, took, taking, taken
consume: consume, consumes, consumed, consuming
stop: stop, stops, stopped, stopping
hold: hold, holds, held, holding
advise: advise, advises, advised, advising
avoid: avoid, avoids, avoided, avoiding
deny: deny, denies, denied, denying
decline: decline, declines, declined, declining
refuse: refuse, refuses, refused, refusing
neg: no, not, never
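The keyword table can be turned into a lookup from each surface form to its indicator, which is one plausible way to build indicator-only features. The variants below are copied from the slide; the data layout and the function name `indicator_features` are assumptions, not the authors' implementation.

```python
# Indicator name -> surface forms, copied from the keyword table.
INDICATORS = {
    "start": "start starts started starting",
    "restart": "restart restarts restarted restarting",
    "resume": "resume resumed resumes resuming",
    "initiate": "initiate initiates initiated initiating",
    "increase": "increase increases increased increasing",
    "decrease": "decrease decreases decreased decreasing",
    "reduce": "reduce reduces reduced reducing",
    "lower": "lower lowers lowered lowering",
    "take": "take takes took taking taken",
    "consume": "consume consumes consumed consuming",
    "stop": "stop stops stopped stopping",
    "hold": "hold holds held holding",
    "advise": "advise advises advised advising",
    "avoid": "avoid avoids avoided avoiding",
    "deny": "deny denies denied denying",
    "decline": "decline declines declined declining",
    "refuse": "refuse refuses refused refusing",
    "neg": "no not never",
}

# Invert the table: surface form -> indicator name.
VARIANT_TO_INDICATOR = {
    variant: name for name, forms in INDICATORS.items() for variant in forms.split()
}

def indicator_features(tokens):
    """Keep only tokens that are indicator words, mapped to their indicator."""
    return [VARIANT_TO_INDICATOR[t] for t in tokens if t in VARIANT_TO_INDICATOR]
```

For example, the tokens of "stopped taking her garlic two weeks ago" reduce to the indicators `stop` and `take`.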

Feature Set
Type 4 – normalized unigrams + indicators with distance
  Indicator is close to supplement mention
  Example: He continues on Coumadin and also has recently started ginseng as he is concerned about the fatigue he will have during chemotherapy: Started (S)
  The optimal window size is 4
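One way to read "indicators with distance" is to keep an indicator only when it falls within a fixed number of tokens of the supplement mention. The sketch below assumes a symmetric token window and whitespace tokenization, neither of which the slides specify. Note that "continue" is not in the slide's keyword table; it is added to the small map here purely to reproduce the slide's example.

```python
# Small illustrative map. "continues" is a hypothetical addition so that the
# slide's example sentence contains a distant indicator.
INDICATOR_MAP = {"started": "start", "continues": "continue"}

def indicators_in_window(tokens, supplement, indicator_map, window=4):
    """Return indicators within `window` tokens of any supplement mention."""
    mentions = [i for i, tok in enumerate(tokens) if tok == supplement]
    found = []
    for i, tok in enumerate(tokens):
        if tok in indicator_map and any(abs(i - m) <= window for m in mentions):
            found.append(indicator_map[tok])
    return found

sentence = ("He continues on Coumadin and also has recently started ginseng "
            "as he is concerned about the fatigue he will have during chemotherapy")
tokens = sentence.lower().split()
near = indicators_in_window(tokens, "ginseng", INDICATOR_MAP, window=4)
wide = indicators_in_window(tokens, "ginseng", INDICATOR_MAP, window=8)
```

With a window of 4, only "started" (one token from "ginseng") survives, while "continues" (eight tokens away, and describing Coumadin) is dropped; widening the window to 8 lets the misleading indicator back in.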

Feature Set
Type 5 – normalized unigrams + bigrams + indicators with distance

Feature Set
Type 6 – nouns + verbs + adverbs
  Verbs hold more information (indicators)
  Stanford parser
  Nouns (NN/NNS/NNP/NNPS)
  Verbs (VB/VBG/VBP/VBZ/VBD/VBN)
  Some adverbs (RB): “no”, “not”, “never”
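The part-of-speech filter can be sketched on already-tagged input. The tag sets are those listed on the slide; the example tags are hand-assigned rather than produced by the Stanford parser, and keeping the three negation words regardless of their tag is an assumption (taggers often label "no" as a determiner rather than an adverb).

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
VERB_TAGS = {"VB", "VBG", "VBP", "VBZ", "VBD", "VBN"}
KEPT_ADVERBS = {"no", "not", "never"}

def select_pos_features(tagged_tokens):
    """Keep nouns, verbs, and the negation words from (word, tag) pairs."""
    kept = []
    for word, tag in tagged_tokens:
        word = word.lower()
        if tag in NOUN_TAGS or tag in VERB_TAGS or word in KEPT_ADVERBS:
            kept.append(word)
    return kept

# Hand-tagged example sentence (tags are illustrative, not parser output).
tagged = [("She", "PRP"), ("has", "VBZ"), ("not", "RB"),
          ("taken", "VBN"), ("melatonin", "NN"), ("recently", "RB")]
selected = select_pos_features(tagged)
```

The pronoun and the ordinary adverb "recently" are dropped, while the negation "not" is kept because it changes the use status.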

Feature Set
Type 0 – raw unigrams
Type 1 – normalized unigrams
Type 2 – normalized unigrams + bigrams
Type 3 – indicator words only
Type 4 – normalized unigrams + indicators with distance
Type 5 – normalized unigrams + bigrams + indicators with distance
Type 6 – nouns + verbs + adverbs

Training and Evaluation
Algorithms
  Support Vector Machine (SVM)
  Maximum Entropy
  Naive Bayes
  Decision Tree
  Random Forest
Evaluation
  10-fold cross validation
  Precision, Recall, and F-measure
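The evaluation protocol can be sketched without any particular classifier library. Below, `k_fold_indices` produces train/test index splits and `precision_recall_f` scores one class; the function names, the shuffling seed, and the toy labels are assumptions, and the slides do not say how folds were formed or which toolkit was used.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def precision_recall_f(gold, predicted, label):
    """Precision, recall, and F-measure for one class label."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Toy gold and predicted labels, for illustration only.
gold = ["C", "C", "D", "S", "U", "C"]
pred = ["C", "D", "D", "S", "C", "C"]
p, r, f = precision_recall_f(gold, pred, "C")
```

With 1000 training sentences and k = 10, each fold holds out 100 sentences and trains on the remaining 900; per-class scores from each fold would then be averaged.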

Training Data Performance
Classifiers: SVM, Maximum Entropy, Naïve Bayes, Decision Tree, Random Forest (P, R, F for each)
Type 0: 0.771 0.751 0.748 0.778 0.762 0.760 0.757 0.726 0.721 0.738 0.718 0.717 0.789 0.763 0.753
Type 1: 0.799 0.772 0.735 0.734 0.659 0.639 0.596 0.792 0.759 0.743 0.791 0.767 0.756
Type 2: 0.839 0.838 0.813 0.794 0.786 0.635 0.579 0.497 0.804 0.790 0.785 0.747
Type 3: 0.784 0.783 0.750 0.729 0.711 0.788 0.818 0.815 0.812
Type 4: 0.798 0.793 0.761 0.678 0.612 0.541 0.745 0.816
Type 5: 0.845 0.844 0.823 0.806 0.800 0.653 0.584 0.499 0.810 0.808
Type 6: 0.829 0.828 0.749 0.681 0.647 0.613 0.787
P: precision, R: recall, F: F-measure


Test Data Performance
SVM with Type 5
Use Status         Precision   Recall   F-measure
Continuing (C)     0.869       0.946    0.906
Discontinued (D)   0.933       0.894    0.913
Started (S)        0.896       0.932    0.914
Unclassified (U)   0.786       0.657    0.715
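As a consistency check, the reported F-measures can be recomputed from the reported precision and recall using F = 2PR / (P + R). The numbers below are the test-set values from the slide; the variable names are illustrative. Because the published P and R are already rounded to three decimals, a recomputed F can differ from the reported one in the last digit.

```python
# (precision, recall, reported F-measure) per use status, from the slide.
REPORTED = {
    "Continuing": (0.869, 0.946, 0.906),
    "Discontinued": (0.933, 0.894, 0.913),
    "Started": (0.896, 0.932, 0.914),
    "Unclassified": (0.786, 0.657, 0.715),
}

def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Absolute difference between recomputed and reported F for each class.
differences = {
    status: abs(f_measure(p, r) - f_reported)
    for status, (p, r, f_reported) in REPORTED.items()
}
```

All four differences come out below 0.001, so the table is internally consistent. The Unclassified class has both the lowest recall and the lowest F-measure.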

Discussion
SVM outperformed other algorithms
Performance of classifiers
  Lexical normalizations
  Bigrams
  Indicator words
  Distance (window size)
Best model
  SVM
  Normalized unigrams + bigrams + indicator words with distance

Limitation & Future Work
Small corpus
  Incorporating more data to enlarge the dataset
Abbreviations, acronyms and typos
  Incorporating existing abbreviation disambiguation methods

Conclusions
The model trained on 10 supplements performed well on the test data covering 15 other supplements
Applying text mining methods to clinical notes can extract supplement use information
Knowledge of supplement use among patients can be further applied in clinical research

Acknowledgements
Advisor: Rui Zhang, PhD
NIH/National Center for Complementary and Integrative Health (NCCIH) grant (R01AT009457) (Zhang)
University of Minnesota Grant-In-Aid award (Zhang)
Agency for Healthcare Research & Quality grant (R01HS022085) (Melton)
National Center for Advancing Translational Sciences of the National Institutes of Health (UL1TR000114) (Blazar)

Thank you! fanxx421@umn.edu