Medical Document Categorization Using a Priori Knowledge L. Itert 1,2, W. Duch 2,3, J. Pestian 1 1 Department of Biomedical Informatics, Children’s Hospital.

Presentation transcript:

Medical Document Categorization Using a Priori Knowledge
L. Itert (1,2), W. Duch (2,3), J. Pestian (1)
1 Department of Biomedical Informatics, Children’s Hospital Research Foundation, Cincinnati, OH, USA
2 Department of Informatics, Nicolaus Copernicus University, Torun, Poland
3 School of Computer Engineering, Nanyang Technological University, Singapore
ICANN 2005, Warsaw, Sept. 2005

Outline
- Goals & questions
- Medical data
- Data preparation
- Model of similarity
- Computational experiments and results

Goals & Questions
- What are the key clinical descriptors for a given disease?
- In what sense are the records describing patients with the same disease similar?
- Can we capture an expert's intuition in evaluating document similarity and diversity?
- Include a priori knowledge in document categorization, which is especially important for rare diseases.
- Use the UMLS ontology and NLM lexical tools.

Example of a clinical discharge summary

Jane is a 13yo WF who presented with CF bronchopneumonia. She has noticed increasing cough, greenish sputum production, and fatique since prior to 12/8/03. She had 2 febrile epsiodes, but denied any nausea, vomiting, diarrhea, or change in appetite. Upon admission she had no history of diabetic or liver complications. Her FEV1 was 73% 12/8 and she was treated with 2 z-paks, and on 12/29 FEV1 was 72% at which time she was started on Cipro. She noted no clinical improvement and was admitted for a 2 week IV treatment of Tobramycin and Meropenem.

Unified Medical Language System (UMLS)
- Semantic types linked by a semantic relation: "Virus" causes "Disease or Syndrome"
- Other relations: "interacts with", "contains", "consists of", "result of", "related to", …
- Other types: "Body location or region", "Injury or Poisoning", "Diagnostic procedure", …

UMLS example (keyword: "virus")
Metathesaurus:
- Concept: Virus, CUI: C , Semantic Type: Virus
- Definition (1 of 3): "Group of minute infectious agents characterized by a lack of independent metabolism and by the ability to replicate only within living host cells; have capsid, may have DNA or RNA (not both)". (CRISP Thesaurus)
- Synonyms: Virus, Vira, Viridae
Semantic Network:
- "Virus" causes "Disease or Syndrome"

Data
Table header: Disease name | Clinical Data | Reference Data | Size [bytes] | No. of records | Average size [bytes]
Rows (numeric values not preserved in this transcript): Pneumonia, Asthma, Epilepsy, Anemia, UTI, JRA, Cystic fibrosis, Cerebral palsy, Otitis media, Gastroenteritis
JRA: Juvenile Rheumatoid Arthritis
UTI: Urinary tract infection

Data processing/preparation
- Reference texts → MMTx → UMLS concepts (feature prototypes) → Filtering (focus on 26 semantic types) → Features (UMLS concept IDs)
- Clinical documents → MMTx → UMLS concepts → Filtering using the existing feature space → Final data
MMTx discovers UMLS concepts in text.
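The two-stage filtering above can be illustrated with a minimal sketch. It assumes the concept extractor returns (concept ID, semantic type) pairs; the identifiers `CUI_A` to `CUI_D`, the three allowed types, and the function names are hypothetical placeholders, not real UMLS data or the MMTx API.

```python
from collections import Counter

# Hypothetical extractor output: (concept ID, semantic type) pairs per text.
# The IDs are made-up placeholders, not real UMLS identifiers.
reference_concepts = [
    ("CUI_A", "Disease or Syndrome"),
    ("CUI_B", "Virus"),
    ("CUI_C", "Geographic Area"),
]
clinical_concepts = [
    ("CUI_A", "Disease or Syndrome"),
    ("CUI_A", "Disease or Syndrome"),
    ("CUI_D", "Diagnostic Procedure"),
    ("CUI_B", "Virus"),
]

# Assumed small subset standing in for the 26 semantic types kept by the filter.
ALLOWED_TYPES = {"Disease or Syndrome", "Virus", "Diagnostic Procedure"}


def build_feature_space(concepts, allowed_types):
    """Keep concept IDs whose semantic type is allowed; return them sorted."""
    return sorted({cui for cui, sem_type in concepts if sem_type in allowed_types})


def to_tf_vector(concepts, feature_space):
    """Count each feature's occurrences; concepts outside the space are dropped."""
    counts = Counter(cui for cui, _ in concepts)
    return [counts.get(cui, 0) for cui in feature_space]


# Features come from the reference texts only.
feature_space = build_feature_space(reference_concepts, ALLOWED_TYPES)
# Clinical documents are then projected onto that existing space.
document_vector = to_tf_vector(clinical_concepts, feature_space)
```

In this toy case `CUI_C` is removed by the semantic-type filter, and `CUI_D` is dropped from the clinical document because it never occurred in the reference texts.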

Semantic types used
Values indicate the actual numbers of concepts found in:
I: clinical texts
II: reference texts

Data - statistics
- 10 classes
- 4534 vectors
- 807 features (out of 1097 found in reference texts)
Baseline:
- Majority: 19.1% (asthma class)
- Content based: 34.6% (frequency of class name in text)
Remarks:
- Very sparse vectors
- Feature values represent term frequency (tf), i.e. the number of occurrences of a particular concept in text
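As a rough illustration of the majority baseline and of vector sparsity, the sketch below uses a tiny made-up label list and tf matrix; the numbers are illustrative only and are not the study's data, and the function names are hypothetical.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def sparsity(vectors):
    """Fraction of zero entries across all tf vectors."""
    total = sum(len(v) for v in vectors)
    zeros = sum(1 for v in vectors for x in v if x == 0)
    return zeros / total


# Toy data, invented for this example.
labels = ["asthma", "asthma", "pneumonia", "epilepsy", "asthma"]
vectors = [
    [2, 0, 0, 1],
    [0, 0, 3, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]

best_label, baseline_accuracy = majority_baseline(labels)
zero_fraction = sparsity(vectors)
```

Any classifier worth reporting should beat `baseline_accuracy`, which is why the slide lists the majority rate before the results.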

Model of similarity I
Intuitions:
- The initial distance between document $D$ and the reference vector $R_k$ should be proportional to $d_{0k} = \|D - R_k\| \propto 1/p(C_k) - 1$.
- If a term $i$ appears in $R_k$ with frequency $R_{ik} > 0$ but does not appear in $D$, the distance $d(D, R_k)$ should increase by $\Delta_{ik} = a_1 R_{ik}$.
- If a term $i$ does not appear in $R_k$ but has non-zero frequency $D_i$, the distance $d(D, R_k)$ should increase by $\Delta_{ik} = a_2 D_i$.
- If a term $i$ appears with frequency $R_{ik} > D_i > 0$ in both vectors, the distance $d(D, R_k)$ should decrease by $\Delta_{ik} = -a_3 D_i$.
- If a term $i$ appears with frequency $0 < R_{ik} \le D_i$ in both vectors, the distance $d(D, R_k)$ should decrease by $\Delta_{ik} = -a_4 R_{ik}$.
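The four rules above can be sketched in code. This is a minimal illustration, assuming that the per-term corrections are simply added to the initial distance and that the proportionality constant in the initial distance is 1; both are assumptions, as are the function names and the parameter values used in the example.

```python
def initial_distance(class_prior):
    """d_0k, taken here as exactly 1/p(C_k) - 1 (constant of 1 is assumed)."""
    return 1.0 / class_prior - 1.0


def term_correction(d_i, r_ik, a):
    """Per-term correction for document frequency d_i and reference frequency r_ik."""
    a1, a2, a3, a4 = a
    if r_ik > 0 and d_i == 0:
        return a1 * r_ik      # term expected by the reference but missing
    if r_ik == 0 and d_i > 0:
        return a2 * d_i       # term present but not in the reference
    if r_ik > d_i > 0:
        return -a3 * d_i      # shared term, less frequent in the document
    if 0 < r_ik <= d_i:
        return -a4 * r_ik     # shared term, at least as frequent in the document
    return 0.0                # term absent from both vectors


def distance(document, reference, class_prior, a):
    """Initial distance plus the sum of per-term corrections (additive form assumed)."""
    corrections = sum(
        term_correction(d_i, r_ik, a) for d_i, r_ik in zip(document, reference)
    )
    return initial_distance(class_prior) + corrections


# Toy vectors covering all five cases; the parameters are illustrative only.
document = [0, 2, 1, 3, 0]
reference = [4, 0, 2, 1, 0]
example_distance = distance(document, reference, class_prior=0.25,
                            a=(0.1, 0.2, 0.3, 0.4))
```

Because each correction is linear in one of the four parameters, the distance as a whole is linear in them, which is what makes a linear programming fit possible.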

Model of Similarity II
Given the document $D$, a reference vector $R_k$ and probabilities $p(i|C_k)$, the probability that the class of $D$ is $C_k$ should be proportional to:
[equation not preserved in this transcript]
where $\Delta_{ik}$ depends on the adaptive parameters $a_1, \dots, a_4$, which may be specific to each class.
A linear programming technique can be used to estimate $a_i$ by maximizing the similarity between documents and reference vectors:
[equation not preserved in this transcript]
where $k$ indicates the correct class, with the constraints:
[equation not preserved in this transcript]

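Results
10-fold crossvalidation accuracies in % for different feature weightings. M0: tf frequencies; M1: binary data;
Columns (feature weightings): M0 | M1 | M2 | M3 | M4 | M5
Rows (methods): kNN; SSV; MLP (300 neur.); SVM (C opt.); Ref. vectors
Surviving row (listed after the SVM (C opt.) label): 59.3 (1.0) | 60.4 (0.1) | 60.9 (0.1) | 60.5 (0.1) | 59.8 (0.01) | 60.0 (0.01)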

Conclusions
- Medical texts contain a large number of rare, specific concepts.
- Vector representation using standard tf x idf weighting leads to poor results.
- A priori knowledge was introduced using a single reference vector per class (this certainly needs improvement).
- Expert intuitions were formalized in a model that measures the similarity of texts with only 4 parameters per class.
- Linear programming was used to optimize the parameters.
- Results are quite encouraging.
- Finding the best set of reference vectors and similarity measures for medical documents is an interesting challenge.