HIKM’2006AMTEx Automatic Document Indexing in Large Medical Collections Angelos Hliaoutakis, Kalliopi Zervanou, Euripides G.M. Petrakis Technical University.

Presentation transcript:

Automatic Document Indexing in Large Medical Collections (AMTEx), HIKM’2006
Angelos Hliaoutakis, Kalliopi Zervanou, Euripides G.M. Petrakis, Technical University of Crete, Chania, Greece
Evangelos E. Milios, Dalhousie University, Halifax, Canada

Overview
- The need for automatic assignment of index terms in large medical collections
- MMTx (by the US NLM)
- The AMTEx approach to medical document indexing
- AMTEx resources: MeSH and the C/NC-value method
- Experiments and evaluation
- Discussion and future research

Motivation and Objectives
- MeSH is a taxonomy of medical terms and a subset of the UMLS Metathesaurus
- MEDLINE is indexed by MeSH terms (assigned by experts)
- Other medical texts, e.g. consumer medical literature, need to be associated with MEDLINE
- Hence the need for automatic assignment of MeSH terms to any medical text

MMTx (MetaMap Transfer)
Maps arbitrary text to UMLS Metathesaurus concepts:
- Parsing to extract noun phrases (syntactic analysis, linguistic filter)
- Variant generation (uses the SPECIALIST Lexicon)
- Candidate retrieval (mapping process to Metathesaurus concepts)
- Candidate evaluation (criteria: centrality, variation, coverage, cohesiveness)

MMTx Example
- Parsing: shallow syntactic analysis of the input text; linguistic filtering isolates noun phrases
- Variant generation: e.g. “obstructive sleep apnea” has variants: obstructive sleep apnea, sleep apnea, sleep, apnea, osa, …
- Candidate retrieval: candidate Metathesaurus concepts for the variant “osa”: osa [osa antigen], osa [osa gene product], osa [osa protein], osa [obstructive sleep apnea]
- Candidate evaluation (candidate and score):
    Obstructive Sleep Apnea   1000
    Sleep Apnea                901
    Apnea                      827
    …
    Sleeping                   793
    Sleepy                     755

MMTx limitations
- MMTx focuses on UMLS rather than MeSH, but MEDLINE indexing is based on MeSH
- Exhaustive variant generation: the initial phrase is iteratively expanded into all possible UMLS variants, which leads to:
  - term overgeneration
  - term concept diffusion
  - unrelated terms added to the final candidate list

The AMTEx method
- New method for automatic indexing of medical documents
- Main idea: initial term extraction based on a hybrid linguistic/statistical approach, the C/NC-value method
- Extracts general single-word and multi-word terms
- Extracted terms are validated against MeSH

AMTEx Outline
- INPUT: Document Collection
- C/NC-value: Multi-word Term Extraction & Term Ranking
- MeSH Term Validation
- Single-word Term Extraction: non-MeSH multi-word terms are broken down and validated against MeSH
- Variant Generation
- Term Expansion (MeSH)
- OUTPUT: MeSH Term Lists
- Resource used throughout: MeSH Thesaurus

MeSH: Medical Subject Headings
The NLM thesaurus of medical and biological terms:
- Organized in IS-A hierarchies
  - more than 15 taxonomies and more than 22,000 terms
  - a term may appear in multiple taxonomies
- No PART-OF relationships
- Terms are organized into synonym sets called entry terms, including stemmed term forms

Fragment of the MeSH IS-A Hierarchy
Root
  Nervous system diseases
    Neurologic manifestations
      Pain
        Headache
        Neuralgia
    Cranial nerve diseases
      Facial neuralgia

The C/NC-value method
- Hybrid (linguistic/statistical) term extraction method
- Domain independent
- Specifically designed for the identification of multi-word and nested terms:
  - compound and multi-word terms are very common in the biomedical domain
  - multi-word terms are often used in indexing

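C-value
A phrase may be a term if it often appears alone or within other candidate terms:

$$
\text{C-value}(\alpha) =
\begin{cases}
\log_2 |\alpha| \cdot f(\alpha), & \text{if } \alpha \text{ is not nested} \\
\log_2 |\alpha| \left( f(\alpha) - \dfrac{1}{P(T_\alpha)} \sum_{b \in T_\alpha} f(b) \right), & \text{otherwise}
\end{cases}
$$

- α: candidate term (|α| is its length in words)
- f(α): frequency of α
- T_α: set of candidate terms containing α
- P(T_α): number of such terms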
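A minimal Python sketch of the C-value score, assuming the usual definition: the log2 of the term length times its frequency, with the mean frequency of the longer candidates that contain it subtracted when the term is nested. The function names and the toy frequencies are hypothetical, chosen only for illustration.

```python
import math

def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous sub-sequence of the strictly longer term."""
    n, m = len(longer), len(shorter)
    return n > m and any(longer[i:i + m] == shorter for i in range(n - m + 1))

def c_value(term, freq):
    """C-value of `term`; `freq` maps each candidate (tuple of words) to its frequency."""
    nesting = [t for t in freq if contains(t, term)]  # T_alpha
    weight = math.log2(len(term))                     # log2 |alpha|
    if not nesting:                                   # alpha is not nested
        return weight * freq[term]
    mean_nested = sum(freq[t] for t in nesting) / len(nesting)
    return weight * (freq[term] - mean_nested)

# Toy frequencies (made up for the example)
freq = {
    ("sleep", "apnea"): 10,
    ("obstructive", "sleep", "apnea"): 6,
}
print(c_value(("sleep", "apnea"), freq))                 # 4.0
print(c_value(("obstructive", "sleep", "apnea"), freq))
```

Here "sleep apnea" is penalised for the 6 occurrences that belong to the longer candidate, while "obstructive sleep apnea" is not nested and keeps its full frequency.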

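NC-value
A phrase is more likely to be a term if it often appears in a specific word context:

$$
\text{weight}(w) = \frac{t(w)}{n}, \qquad
\text{NC-value}(\alpha) = 0.8 \cdot \text{C-value}(\alpha) + 0.2 \sum_{w \in C_\alpha} f_\alpha(w) \cdot \text{weight}(w)
$$

- w: context word (C_α is the set of context words of α)
- t(w): number of terms w appears with
- n: number of all terms
- f_α(w): frequency of w as a context word of α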
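A short Python sketch of how the NC-value might be computed from these quantities. It assumes the commonly published combination, 0.8 times the C-value plus 0.2 times the context factor, with each context word weighted by t(w)/n. The function name and the sample numbers are hypothetical.

```python
def nc_value(c_val, context_freq, terms_with_word, n_terms):
    """Combine a C-value with context information.

    c_val           -- C-value of the candidate term
    context_freq    -- f_alpha(w): frequency of each context word w of the term
    terms_with_word -- t(w): number of terms each context word appears with
    n_terms         -- n: total number of terms considered
    """
    context_factor = sum(
        f * (terms_with_word[w] / n_terms)   # f_alpha(w) * weight(w)
        for w, f in context_freq.items()
    )
    # 0.8 / 0.2 weighting is an assumption taken from the published method
    return 0.8 * c_val + 0.2 * context_factor

# Toy values (made up for the example)
score = nc_value(
    c_val=4.0,
    context_freq={"patients": 3, "severe": 1},
    terms_with_word={"patients": 5, "severe": 2},
    n_terms=10,
)
print(round(score, 2))  # 3.54
```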

AMTEx step 1: C/NC-value Multi-word Term Extraction & Ranking
- Part-of-speech tagging
- Linguistic filtering (N: noun, A: adjective, P: preposition), using one of:
  - N+ N
  - (A|N)+ N
  - ( (A|N)+ | ( (A|N)* (N P)? ) (A|N)* ) N
- Candidate term ranking based on C/NC-value
- Keep terms with NC-value > T1
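The linguistic filter in this step can be sketched as a regular expression over part-of-speech tags. The Python below assumes the input is already tagged with one-letter tags (N, A, P, and anything else for other word classes); the tag set, the `max_len` limit and the function name are illustrative assumptions, not part of the slides.

```python
import re

# Third (most permissive) filter: ((A|N)+ | ((A|N)* (N P)?) (A|N)*) N
FILTER = re.compile(r"(?:(?:A|N)+|(?:A|N)*(?:NP)?(?:A|N)*)N")

def candidate_terms(tagged, max_len=5):
    """Return multi-word spans whose tag sequence matches the filter.

    tagged -- list of (word, tag) pairs, where tag is a single character
    """
    words = [w for w, _ in tagged]
    tags = "".join(t for _, t in tagged)
    found = []
    for i in range(len(words)):
        # spans of at least two words, at most max_len words
        for j in range(i + 2, min(i + max_len, len(words)) + 1):
            if FILTER.fullmatch(tags[i:j]):
                found.append(" ".join(words[i:j]))
    return found

sentence = [("severe", "A"), ("sleep", "N"), ("apnea", "N"), ("occurs", "V")]
print(candidate_terms(sentence))
# ['severe sleep', 'severe sleep apnea', 'sleep apnea']
```

The candidates produced this way would then be ranked by C/NC-value and cut at the threshold T1.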

AMTEx step 2: MeSH Term Validation
- Candidate terms are validated against the MeSH Thesaurus (simple string matching)
- Only candidate terms matching MeSH are kept
- Multi-word candidates not matching MeSH may still contain (shorter) MeSH terms

AMTEx step 3: Single-word Term Extraction
For multi-word terms not matching MeSH:
- Multi-word terms are split into single-word terms
- Single-word terms are matched against MeSH
- Matched MeSH terms are added to the term list
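Steps 2 and 3 can be sketched together in Python, assuming case-insensitive exact string matching and whitespace splitting. The tiny MeSH sample and the function name are hypothetical; a real run would load the full thesaurus, including entry terms.

```python
def validate_against_mesh(candidates, mesh_terms):
    """Keep candidates found in MeSH; split the rest and keep matching single words."""
    mesh = {t.lower() for t in mesh_terms}
    kept, seen = [], set()

    def keep(term):
        if term in mesh and term not in seen:
            seen.add(term)
            kept.append(term)

    for cand in candidates:
        cand = cand.lower()
        if cand in mesh:              # step 2: whole candidate matches MeSH
            keep(cand)
        else:                         # step 3: fall back to single words
            for word in cand.split():
                keep(word)
    return kept

mesh_sample = ["Pain", "Knee", "Data Collection"]   # illustrative subset only
cands = ["data collection", "chronic knee pain", "baseline survey"]
print(validate_against_mesh(cands, mesh_sample))
# ['data collection', 'knee', 'pain']
```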

AMTEx step 4: Term Variant Generation
Variants are added to the list of terms:
- Inflectional variants of the extracted terms, identified during term extraction (C/NC-value)
- Stemmed term forms available in MeSH

AMTEx step 5: Term Expansion
- Each term in the list is expanded with neighbouring terms in the MeSH hierarchy
- The expansion may include terms more than one level higher or lower than the original term, depending on the similarity threshold T
- Semantic similarity metric by Li et al.: Y. Li, Z. A. Bandar, and D. McLean. An Approach for Measuring Semantic Similarity between Words Using Multiple Information Sources. IEEE Trans. on Knowledge and Data Engineering, 15(4):871–882, July/Aug
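A rough Python sketch of threshold-based expansion. It assumes the Li et al. measure in its usual form, sim = exp(-αl) · tanh(βh), where l is the path length between two terms and h is the depth of their lowest common ancestor, with α = 0.2 and β = 0.6. The single-parent toy hierarchy, the depth convention (root at depth 0) and the function names are simplifying assumptions; real MeSH terms can sit in several taxonomies.

```python
import math

# Toy single-parent fragment (child -> parent), for illustration only
PARENT = {
    "nervous system diseases": None,
    "neurologic manifestations": "nervous system diseases",
    "pain": "neurologic manifestations",
    "headache": "pain",
    "neuralgia": "pain",
}

def path_to_root(term):
    path = [term]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path

def li_similarity(a, b, alpha=0.2, beta=0.6):
    pa, pb = path_to_root(a), path_to_root(b)
    common = next(t for t in pa if t in pb)      # lowest common ancestor
    l = pa.index(common) + pb.index(common)      # shortest path length
    h = len(path_to_root(common)) - 1            # depth of the common ancestor
    return math.exp(-alpha * l) * math.tanh(beta * h)

def expand(term, threshold):
    """Terms in the hierarchy whose similarity to `term` reaches the threshold."""
    return sorted(t for t in PARENT
                  if t != term and li_similarity(term, t) >= threshold)

print(expand("pain", 0.5))  # ['headache', 'neuralgia']
print(expand("pain", 0.4))  # also reaches 'neurologic manifestations'
```

Lowering the threshold from 0.5 to 0.4 pulls in a more distant term, which is the behaviour the threshold T controls in this step.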

Example
Input: full-text article

MEDLINE index terms: “Aged”, “Data Collection”, “Humans”, “Knee”, “Middle Aged”, “Osteoarthritis, Knee/complications”, “Osteoarthritis, Knee/diagnosis”, “Pain/classification”, “Pain/etiology”, “Prospective Studies”, “Research Support, Non-U.S. Gov’t”

MMTx terms: “osteoarthritis knee”, “retention”, “peat”, “rheumatology”, “acetylcholine”, “lysine acetate”, “potassium acetate”, “questionnaires”, “target population”, “population”, “selection bias”, “creativeness”, “reproduction”, “cohort studies”, “europe”, “couples”, “naloxone”, “sample size”, “arthritis”, “data collection”, “mail”, “health status”, “respondents”, “ontario”, “universities”, “dna”, “baseline survey”, “medical records”, “informatics”, “general practitioners”, “gender”, “beliefs”, “logistic regression”, “female”, “marital status”, “employment status”, “comprehension”, “surveys”, “age distribution”, “manual”, “occupations”, “manuals”, “persons”, “females”, “minor”, “minority groups”, “incentives”, “business”, “ability”, “comparative study”, “odds ratio”, “biomedical research”, “pubmed”, “copyright”, “coding”, “longitudinal studies”, “immunoelectrophoresis”, “skin diseases”, “government”, “norepinephrine”, “social sciences”, “survey methods”, “tyrosine”, “new zealand”, “azauridine”, “gold”, “nonrespondents”, “cycloheximide”, “rheum”, “jordan”, “cadmium”, “radiopharmaceuticals”, “community”, “disease progression”, “history”

AMTEx terms: “health surveys”, “pain”, “review publication type”, “data collection”, “osteoarthritis knee”, “knee”, “science”, “health services needs and demand”, “population”, “research”, “questionnaires”, “informatics”, “health”

Evaluation
- Measures: precision and recall
- Dataset: 61 full MEDLINE documents (not abstracts), from the PMC database of NCBI PubMed; the documents are paired with their respective MeSH index terms, manually assigned by experts
- Ground truth: the set of MeSH index terms of each document
- Benchmark method: MMTx, compared against our AMTEx
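For reference, per-document precision and recall against the expert-assigned terms could be computed as below. This is a generic sketch: the slides do not say how scores were aggregated over the 61 documents or how subheadings were handled, and the term sets here are toy values.

```python
def precision_recall(predicted, gold):
    """Set-based precision and recall of predicted index terms against gold terms."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Toy example (not the actual evaluation data)
gold_terms = {"knee", "pain", "humans", "aged"}
system_terms = {"knee", "pain", "health", "science", "research"}
print(precision_recall(system_terms, gold_terms))  # (0.4, 0.5)
```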

Multi-Word Terms only

Method             Precision   Recall
MMTx               0.013       0.015
AMTEx (T = 0.5)    0.186       0.108
AMTEx (T = 0.6)    0.218       0.090
AMTEx (T = 0.7)    0.236       0.072
AMTEx (T = 0.8)    0.236       0.072
AMTEx (T = 0.9)    0.236       0.070

T: term expansion threshold; a lower T means further expansion.

Contribution of Single-Word Terms

Method                            Precision   Recall
MMTx                              0.013       0.015
AMTEx                             0.236       0.070
AMTEx & single-word MeSH terms    0.120       0.228

Conclusions: AMTEx
- Designed for indexing and retrieval of MEDLINE documents
- Focuses on multi-word term extraction using valid linguistic and statistical criteria
- Based on MeSH, similarly to human indexing
- Selectively expands into term variants and synonyms
- Outperforms the current benchmark method, MMTx, in both precision and recall

Future Work
- Better ranking of terms, using semantic similarity
- Learning of the thresholds T1 and T
- Word sense disambiguation, to detect the correct sense for expansion rather than the most common sense
- Handling shorter documents