Automatic Document Indexing in Large Medical Collections (Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology)


1 Automatic Document Indexing in Large Medical Collections
Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology
Advisor: Dr. Hsu
Presenter: Shu-Ya Li
Authors: Angelos Hliaoutakis, Kalliopi Zervanou, Euripides G.M. Petrakis, Evangelos E. Milios
HIKM, 2006

2 Outline
Intelligent Database Systems Lab, N.Y.U.S.T. I. M.
- Motivation
- Objective
- Current Approach: MMTx
- Method: AMTEx
  - C/NC-value method
  - Use of MeSH Thesaurus as lexical resource
- Experiments
- Conclusion
- Personal Opinions

3 Motivation
- MMTx, the U.S. NLM approach, maps biomedical documents to UMLS term concepts.
- Limitations of MMTx in term extraction:
  1) term over-generation
  2) term concept diffusion
  3) unrelated terms added to the final candidate list
- MMTx focuses on UMLS rather than MeSH, but MEDLINE indexing is based on MeSH.
- Goal: to improve the efficiency of automatic indexing of medical documents.

4 Objective
We propose a new method, AMTEx:
1) Improve the efficiency of automatic term extraction by using the C/NC-value method.
2) Index and retrieve MEDLINE documents by extracting document terms and mapping them to the MeSH Thesaurus.

5 Current Approach: MMTx
MMTx maps arbitrary text to UMLS Metathesaurus concepts:
- Parsing (syntactic analysis, linguistic filter)
- Variant Generation (uses the SPECIALIST Lexicon)
- Candidate Retrieval (mapping process to Metathesaurus concepts)
- Candidate Evaluation (criteria: centrality, variation, coverage, cohesiveness)

6 MMTx Example
- Parsing
  - Shallow syntactic analysis of the input text
  - Linguistic filtering: isolates noun phrases
  - e.g. the term “ocular complications” is analysed as: (figure not included in this transcript)
- Variant Generation
  - e.g. “obstructive sleep apnea” has variants: obstructive sleep apnea, sleep apnea, sleep, apnea, osa, …
- Candidate Retrieval
  - Candidate Metathesaurus concepts for the variant “osa”:
    osa [osa antigen]
    osa [osa gene product]
    osa [osa protein]
    osa [obstructive sleep apnea]
- Candidate Evaluation (candidate and score)
    Obstructive Sleep Apnea  1000
    Sleep Apnea  901
    Apnea  827
    …
    Sleeping  793
    Sleepy  755
The limitations of MMTx in term extraction:
1. term over-generation
2. term concept diffusion
3. unrelated terms added to the final candidate list

7 Method: AMTEx
Input: document d, MeSH Ontology
Resource: MeSH Thesaurus
- C/NC-value multi-word term extraction & term ranking
- Term mapping
- Single-word term extraction
- Term variant generation
- Term expansion
Output: MeSH term lists

8 Step 1 & 2: C/NC-value Multi-word Term Extraction & Ranking
- Part-of-speech tagging
- Linguistic filtering
- Term extraction: C-value
- Term ranking: NC-value
- Keep terms up to threshold T1
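
The slide names the C-value measure without giving its formula. The sketch below follows the formulation commonly given in the C/NC-value literature: a candidate's frequency is reduced by the average frequency of the longer candidates that contain it, then weighted by log2 of its length in words. The function names and the toy frequencies are illustrative assumptions, and the NC-value re-ranking by context words is not shown.

```python
import math


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous word sequence inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def c_value(freq):
    """Score multi-word candidates.

    `freq` maps a candidate term (a tuple of words) to its frequency.
    Simplified sketch: frequencies are taken as given, with no POS tagging
    or linguistic filtering, which the slides list as earlier sub-steps.
    """
    scores = {}
    for a, f_a in freq.items():
        # Frequencies of longer candidates in which `a` is nested.
        nests = [f_b for b, f_b in freq.items()
                 if len(b) > len(a) and contains(b, a)]
        adjusted = f_a - sum(nests) / len(nests) if nests else f_a
        # log2(1) == 0, so single-word candidates always score 0 here.
        scores[a] = math.log2(len(a)) * adjusted
    return scores


# Toy frequencies (assumed, not taken from the experiments).
scores = c_value({
    ("sleep", "apnea"): 10,
    ("obstructive", "sleep", "apnea"): 4,
})
```

With these toy counts, "sleep apnea" scores log2(2) * (10 - 4) = 6.0, so its nested occurrences inside "obstructive sleep apnea" are discounted rather than counted twice.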

9 Step 3: Term Mapping
- Candidate terms are mapped to terms of the MeSH Thesaurus (simple string matching).
- Only candidate terms matching MeSH are retained.
- Multi-word candidates not matching MeSH may contain (shorter) MeSH terms.
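
A minimal sketch of the mapping step as simple string matching. Case-insensitive comparison is an assumption (the slide only says "simple string matching"), and the MeSH set below is a small toy sample, not the real thesaurus.

```python
def map_to_mesh(candidates, mesh_terms):
    """Split candidates into those that match a MeSH term exactly and the rest."""
    mesh = {t.lower() for t in mesh_terms}  # assumption: ignore case
    matched, unmatched = [], []
    for term in candidates:
        (matched if term.lower() in mesh else unmatched).append(term)
    return matched, unmatched


# Toy MeSH sample (hypothetical subset).
toy_mesh = {"Sleep Apnea", "Apnea", "Sleep"}
matched, unmatched = map_to_mesh(["sleep apnea", "severe sleep apnea"], toy_mesh)
```

The unmatched list is kept rather than discarded, because those multi-word candidates may still contain shorter MeSH terms.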

10 Step 4: Single-word Term Extraction
For multi-word terms not matching MeSH:
- Multi-word terms are split into single-word terms.
- Single-word terms are validated against MeSH.
- Matched MeSH terms are added to the term list.
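
The three bullets above can be sketched as follows. Whitespace tokenisation, lower-casing, and the toy MeSH sample are assumptions made for illustration.

```python
def single_word_terms(unmatched_terms, mesh_terms):
    """Split unmatched multi-word terms and keep the words that are MeSH terms."""
    mesh = {t.lower() for t in mesh_terms}  # assumption: ignore case
    found = []
    for term in unmatched_terms:
        for word in term.lower().split():
            # Validate each word against MeSH; skip duplicates.
            if word in mesh and word not in found:
                found.append(word)
    return found


# Toy MeSH sample (hypothetical subset).
extra_terms = single_word_terms(["severe sleep apnea"], {"Sleep", "Apnea"})
```

Here "severe" is dropped because it is not in the toy MeSH set, while "sleep" and "apnea" are added to the term list.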

11 Step 5: Term Variant Generation
- Inflectional variants of the extracted terms are identified during term extraction (C/NC-value).
- Stemmed term forms are also available in MeSH and are added to the list of terms.

12 Step 6: Term Expansion
- Each term in the list is expanded with neighbor terms in MeSH.
- The expansion may include terms more than one level higher or lower than the original term, depending on T2.
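
A rough sketch of expansion over a hierarchy. The slides do not define how T2 is applied, so treating it as a maximum number of levels (`max_levels`) is an assumption made only for illustration; the hierarchy below is a toy example, not the actual MeSH tree.

```python
def _walk(start, edges, max_levels):
    """Collect nodes reachable from `start` in at most `max_levels` steps along `edges`."""
    found, frontier = set(), {start}
    for _ in range(max_levels):
        frontier = {n for node in frontier for n in edges.get(node, ())}
        frontier -= found | {start}
        found |= frontier
    return found


def expand_term(term, parents, max_levels):
    """Return terms up to `max_levels` above or below `term`.

    `parents` maps a term to the list of its broader (parent) terms.
    """
    children = {}
    for child, broader in parents.items():
        for p in broader:
            children.setdefault(p, set()).add(child)
    return _walk(term, parents, max_levels) | _walk(term, children, max_levels)


# Toy hierarchy (hypothetical): child -> parents.
toy_parents = {
    "obstructive sleep apnea": ["sleep apnea syndromes"],
    "central sleep apnea": ["sleep apnea syndromes"],
    "sleep apnea syndromes": ["apnea"],
}
one_level = expand_term("sleep apnea syndromes", toy_parents, 1)
two_levels = expand_term("obstructive sleep apnea", toy_parents, 2)
```

Ancestors and descendants are walked separately, so sibling terms are not pulled in; that choice reflects the "higher or lower" wording and is also an assumption.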

13 Experiments
Precision and recall measures
- Dataset
  - 61 full MEDLINE documents, from the PMC database of NCBI PubMed
  - MEDLINE documents are paired with their respective MeSH index terms, manually assigned by experts
- Ground truth
  - the set of MeSH document index terms
- Benchmark method
  - MMTx, compared against AMTEx
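
For reference, precision and recall against the expert-assigned index terms can be computed per document as in the sketch below. The term lists are made-up placeholders, not data from the experiments, and exact string equality between extracted and ground-truth terms is an assumption.

```python
def precision_recall(extracted, ground_truth):
    """Precision and recall of extracted terms against ground-truth index terms."""
    extracted, ground_truth = set(extracted), set(ground_truth)
    hits = len(extracted & ground_truth)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Placeholder terms (hypothetical): 2 of 4 extracted terms are correct,
# and 2 of 3 ground-truth terms are found.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
```

Precision penalises over-generation of terms, while recall penalises missed index terms, which is why both are reported.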

14 Experiments

15 Conclusion
AMTEx:
- is designed for indexing and retrieval of MEDLINE documents
- focuses on multi-word term extraction using valid linguistic & statistical criteria
- is based on MeSH, similarly to human indexing
- selectively expands to term variants & synonyms
- outperforms the current benchmark method, MMTx, reaching better precision & recall

16 Personal Opinions
- Advantage
- Drawback
  - …
- Application
  - …

