A Knowledge-based Approach to Retrieve Scenario Specific Free-text in a Medical Digital Library Wesley W. Chu Computer Science Dept, UCLA

Presentation transcript:

A Knowledge-based Approach to Retrieve Scenario Specific Free-text in a Medical Digital Library Wesley W. Chu Computer Science Dept, UCLA

2 NIH Program Project Grant A 5-year, $10M joint interdisciplinary project between Medical School & CS faculty Project 1-- teleradiology infrastructure Project 2-- neuroradiology workstation Project 3-- multimedia information architecture Project 4-- natural language processing for medical reports Project 5-- medical digital library

3 Project 5 Personnel Graduate students: Victor Z. Liu, Wenlei Mao, Qinghua Zou Consultants: Hooshang Kangarloo, M.D., Denise Aberle, M.D. Project leader: Wesley W. Chu

4 Data in a Medical Digital Library Structured data (patient lab data, demographic data,…)--CoBase Images (X rays, MRI, CT scans)--KMeD Free-text Patient reports Teaching files Literature News articles

5 System Overview Patient reports Medical literature Medical Digital Library (MDL) Teaching materials Query results Ad-hoc query Patient report for content correlation News Articles

6 A Sample Patient Report … Tissue Source: LUNG (FINE NEEDLE ASPIRATION) (LEFT LOWER LOBE) … FINAL DIAGNOSIS: - LUNG NODULE, LEFT LOWER LOBE (FINE NEEDLE ASPIRATION): - LUNG CANCER, SMALL CELL, STAGE II. …

7 Treatment- related articles ??? How to treat the disease Diagnosis- related articles ??? How to diagnose the disease Scenario Specific Retrieval … Tissue Source: LUNG (FINE NEEDLE ASPIRATION) (LEFT LOWER LOBE) … FINAL DIAGNOSIS: - LUNG NODULE, LEFT LOWER LOBE (FINE NEEDLE ASPIRATION): - LUNG CANCER, SMALL CELL, STAGE II. …

8 Challenge I: Indexing Extracting domain-specific key concepts in the free text for indexing Free-text: Lung cancer, small cell, stage II Concept terms in knowledge source: stage II small cell lung cancer Conventional methods use NLP Not scalable Cannot adapt to various forms of word permutation

9 Challenge II: Terms used in the query are too general Expanding the general terms in the query to specific terms that are used in the document Query: lung cancer, diagnosis options Document: … the effectiveness of chest x-ray and bronchography on patients with lung cancer … ? √ Query: lung cancer, chest x-ray, bronchography, …

10 Challenge III: Mismatching between terms used in query and documents Example Query: … lung cancer, … Document 3: anti-cancer drug combinations… ? ? ? Document 1: … lung carcinoma … Document 2: … lung neoplasm …

11 Challenge I: Indexing Challenge II: Terms in the query are too general Challenge III: Mismatch between terms in the query and the documents

12 IndexFinder: Extracting domain- specific key concepts Technique Permute words from text to generate concept candidates. Use knowledge base to select the valid candidates. Problem Valid candidates may be irrelevant to specific domain indexing.

13 Eliminating irrelevant concepts Syntactic filter: Limit permutation of words within a sentence. Semantic filter: Use the semantic type (e.g. body part, disease, treatment, diagnosis) to filter out irrelevant concepts Use ISA relationship to filter out general concepts and yield specific concepts.
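
The two semantic filters above can be sketched in a few lines of Python. This is a minimal illustration, not the IndexFinder implementation: the `SEMANTIC_TYPE` and `ISA` tables are hypothetical stand-ins for lookups against a knowledge base such as UMLS, and the function names are invented for this example.

```python
# Hypothetical lookup tables standing in for a knowledge base.
SEMANTIC_TYPE = {
    "lung cancer": "Disease or Syndrome",
    "small cell lung cancer": "Disease or Syndrome",
    "fine needle aspiration": "Diagnostic Procedure",
    "left": "Spatial Concept",
}
ISA = {"small cell lung cancer": "lung cancer"}  # child -> parent (assumed)


def semantic_filter(candidates, wanted_types):
    """Keep only candidates whose semantic type is of interest."""
    return [c for c in candidates if SEMANTIC_TYPE.get(c) in wanted_types]


def isa_filter(candidates):
    """Drop a general concept when a more specific descendant is also present."""
    present = set(candidates)
    general = set()
    for concept in candidates:
        parent = ISA.get(concept)
        while parent is not None:  # walk up the (acyclic) ISA chain
            if parent in present:
                general.add(parent)
            parent = ISA.get(parent)
    return [c for c in candidates if c not in general]


candidates = ["lung cancer", "small cell lung cancer", "fine needle aspiration", "left"]
typed = semantic_filter(candidates, {"Disease or Syndrome", "Diagnostic Procedure"})
specific = isa_filter(typed)
print(specific)
```

Here "left" is dropped by its semantic type, and "lung cancer" is dropped because the more specific "small cell lung cancer" is also a candidate.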

14 IndexFinder Performance Two orders of magnitude faster than conventional approaches No NLP Knowledge base (UMLS) and index files reside in main memory Time complexity is linear in the number of distinct words in the text Preliminary Evaluation IndexFinder generates 4% more concepts than conventional approaches (using a single noun phrase) All concepts are relevant

15 Challenge I: Indexing Challenge II: Terms in the query are too general Challenge III: Mismatch between terms in the query and the documents

16 Query Expansion (QE) Queries in the following form benefit from expansion: + e.g. lung cancer e.g. diagnosis options + e.g. lung cancer e.g. chest x-ray, bronchography expansion

17 Traditional QE Appends all terms that statistically co-occur with the key terms in the query Not semantically focused Original Query: lung cancer, diagnosis options expansion Expanded Query: lung cancer, radiotherapy, chemotherapy, antineoplastic agents, survival rate

18 Knowledge-based QE Knowledge source (UMLS, by the NLM) diagnoses Concept Disease or Syndrome Diagnostic Procedure Sign or Symptom Pharmacologic Substance lung cancer chest x-ray Semantic Type Key concept Specific supporting concepts A class of concepts that belong to a Semantic Type Body Parts Injury or Poisoning Semantic Network Metathesaurus diagnoses
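
As a rough sketch of the idea on this slide, the expansion can be restricted to supporting concepts whose semantic type matches the scenario. The tables below are hypothetical and hand-written for illustration; a real system would derive them from the UMLS Metathesaurus and Semantic Network, and the function name `expand_query` is an assumption.

```python
# Hypothetical mapping from a scenario word to the semantic types it implies.
SCENARIO_TYPES = {
    "diagnosis": {"Diagnostic Procedure"},
    "treatment": {"Therapeutic or Preventive Procedure", "Pharmacologic Substance"},
}

# Hypothetical related concepts for a key concept, each tagged with a semantic type.
RELATED = {
    "lung cancer": [
        ("chest x-ray", "Diagnostic Procedure"),
        ("bronchography", "Diagnostic Procedure"),
        ("radiotherapy", "Therapeutic or Preventive Procedure"),
        ("chemotherapy", "Therapeutic or Preventive Procedure"),
        ("cisplatin", "Pharmacologic Substance"),
        ("survival rate", "Quantitative Concept"),
    ],
}


def expand_query(key_concept, scenario):
    """Append only the related concepts whose semantic type fits the scenario."""
    wanted = SCENARIO_TYPES[scenario]
    extra = [term for term, stype in RELATED.get(key_concept, []) if stype in wanted]
    return [key_concept] + extra


print(expand_query("lung cancer", "diagnosis"))
print(expand_query("lung cancer", "treatment"))
```

Unlike purely statistical expansion, "survival rate" is never appended here because its semantic type matches neither scenario.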

19 Challenge I: Indexing Challenge II: Terms in the query are too general Challenge III: Mismatch between terms in the query and the documents

20 Document: … lung carcinoma …Document: … lung neoplasm …Document: … anti-cancer drug combinations … Phrase-based Vector Space Model (VSM) Query: … lung cancer, … ? Knowledge-source lung cancer = lung carcinoma … √ lung neoplasm … parent_of √ anti-cancer drug combinations missing!!! Query: … lung cancer, … √ ??

21 Phrase-based VSM Examples Query Document [(C ); “lung” “cancer”] … [(C ); “anti” “cancer” “drug” “combin”] … Query: “lung cancer …” Phrases: [(C ); “lung” “cancer”]… Document: “anti-cancer drug combinations …” Phrases: [(C ); “anti” “cancer” “drug” “combin”]…

22 Retrieval Effectiveness Comparison (Corpus: OHSUMED, KB: UMLS) 16% 100 queries vs. 5% 50 queries

23 System Overview Patient reports Medical literature Medical Digital Library (MDL) Teaching materials Query results Ad-hoc query Patient report for content correlation News Articles

24 Application: Query Answering via Templates Sample templates: “, treatment,” “, diagnosis ” Query Expansion … Template: “, treatment” lung cancer radiotherapy chemotherapy cisplatin relevant documents IndexFinder lung cancer, treatment Phrase-based VSM

25 Applications (cont’d) Scenario-specific content correlation Query Templates Scenario Selection e.g. treatment, diagnosis, etc. Patient Report Query Expansion … relevant documents Phrase-based VSM IndexFinder

26 Conclusion A knowledge-based (UMLS) approach provides scenario-specific medical free-text retrieval. IndexFinder: uses word permutation together with syntactic and semantic filtering to extract domain-specific key concepts from free text for indexing. Knowledge-based query expansion: transforms general terms in the query into the scenario-specific terms used in the documents, giving the query a higher probability of matching the relevant documents. Phrase-based indexing: transforms document indexing into a phrase paradigm (a concept and its word stems) to improve retrieval effectiveness.

27 Acknowledgement This research is supported in part by NIC/NIH Grant#

28 Indexing of free text The problem: Extract key terms from free text. Represent in standard concept terms (e.g. UMLS concepts). Clinical text: Prostate, right (biopsy) - fibromuscular and glandular hyperplasia. Concepts >> Concept types: C :biopsy prostate >> T060: Diagnostic Procedure; C :prostate hyperplasia >> T046: Pathologic Function; C :right >> T080: Qualitative Concept; C :hyperplasia fibromuscular >> T046: Pathologic Function; C :hyperplasia glandular >> T046: Pathologic Function

29 Extracting domain-specific key concepts Conventional approach Use NLP to discover noun phrases. Map each noun phrase into concepts. Problems A concept that is contained in a noun phrase will not be discovered. Difficult to scale to large text.

30 Generate concept candidates from free text Sort the concept terms (phrases) in the knowledge base (UMLS) by their length and assign each phrase a unique ID. Create an inverted index for the word(s) used in the phrases; each word has a list of phrase IDs. To generate a concept candidate: Remove replicated words. Based on the list of phrase IDs of each word, aggregate the occurrence of each phrase ID. The phrases with ID occurrences that are equal to their phrase lengths are the concept candidates.
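
The candidate-generation steps above can be sketched as follows. This is a minimal illustration under stated assumptions: `KB_PHRASES` is a tiny hypothetical stand-in for the UMLS phrase table, the tokenizer is a naive whitespace split, and the function names are invented here.

```python
from collections import defaultdict

# Hypothetical miniature knowledge base of concept phrases.
KB_PHRASES = [
    "small cell lung cancer",
    "lung cancer",
    "fine needle aspiration",
    "lung nodule",
]


def build_index(kb_phrases):
    """Sort phrases by length, assign IDs, and build word -> phrase-ID lists."""
    ordered = sorted(kb_phrases, key=lambda p: len(p.split()))
    phrases = dict(enumerate(ordered))          # phrase ID -> phrase
    lengths = {}                                # phrase ID -> number of distinct words
    inverted = defaultdict(set)                 # word -> set of phrase IDs
    for pid, phrase in phrases.items():
        words = set(phrase.lower().split())
        lengths[pid] = len(words)
        for word in words:
            inverted[word].add(pid)
    return phrases, lengths, inverted


def concept_candidates(text, phrases, lengths, inverted):
    """Return phrases whose every word occurs in the text, in any order."""
    words = set(text.lower().replace(",", " ").split())  # removes replicated words
    hits = defaultdict(int)                               # phrase ID -> matched words
    for word in words:
        for pid in inverted.get(word, ()):
            hits[pid] += 1
    return [phrases[pid] for pid in sorted(hits) if hits[pid] == lengths[pid]]


phrases, lengths, inverted = build_index(KB_PHRASES)
found = concept_candidates("Lung cancer, small cell, stage II", phrases, lengths, inverted)
print(found)
```

Because matching counts words rather than comparing word order, "small cell lung cancer" is found even though the report writes it as "Lung cancer, small cell". The work per text is one index lookup per distinct word.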

31 Demo Test Texts Technically successful left lower lobe nodule biopsy. Preliminary localization CT images again demonstrate a left lower lobe nodule adjacent to the posterior segmental bronchus. CT scans obtained during biopsy demonstrate the coaxial cannula adjacent to the proximal aspect of the nodule. Surrounding pulmonary parenchymal hemorrhage as a result of the biopsy is also noted. There may be a tiny left apical air collection in the pleural space lateral to the apical bulla. Formal cytologic evaluation of the withdrawn specimen is pending at this time, although abnormal appearing "spindle" cells were identified during on-site cytopathologic evaluation of specimen adequacy.


33 Performance Comparison Corpus: OHSUMED, 41 queries

34 Traditional QE Statistical-based Any terms that statistically co-occur with the original query terms are appended Not semantically focused May expand terms irrelevant to the “treatment” of “lung cancer” e.g. “survival,” “survival rate,” …
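
To make the drawback concrete, here is a minimal sketch of statistical expansion by co-occurrence counting. The document collection is a hypothetical toy example and `cooccurrence_expand` is an invented name; real systems use more elaborate statistics, so treat this only as an illustration of the principle.

```python
from collections import Counter


def cooccurrence_expand(query_terms, documents, top_k=2):
    """Append the top_k terms that most often co-occur with any query term."""
    query = set(query_terms)
    counts = Counter()
    for doc in documents:
        terms = set(doc)
        if terms & query:                 # document mentions a query term
            counts.update(terms - query)  # count every other term in it
    # Sort by descending count, then alphabetically, for a deterministic result.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in ranked[:top_k]]


# Hypothetical collection; each document is a list of index terms.
documents = [
    ["lung cancer", "chemotherapy", "survival rate"],
    ["lung cancer", "survival rate", "radiotherapy"],
    ["lung cancer", "chest x-ray", "survival rate", "chemotherapy"],
    ["asthma", "chest x-ray"],
]
expanded = cooccurrence_expand(["lung cancer", "diagnosis options"], documents)
print(expanded)
```

In this toy collection the query asks about diagnosis, yet the appended terms are "survival rate" and "chemotherapy", simply because they co-occur with "lung cancer" most often.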

35 Document Retrieval Find free-text documents to answer queries like: “Hyperthermia, leukocytosis, increased intracranial pressure, and central herniation.” “Cerebral edema secondary to infection, diagnosis and treatment.”

36 Vector Space Model (VSM) Words as terms [Figure: document vector d and query vector q drawn on axes labeled Leukocytosis and Hyperthermia]
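
A minimal sketch of the vector space model with words as terms: the query and the document become term-count vectors, and their similarity is the cosine of the angle between them. Raw term counts are an assumption made for brevity; weighting schemes such as tf-idf are common but are not shown here.

```python
import math
from collections import Counter


def cosine(query_terms, doc_terms):
    """Cosine of the angle between two term-count vectors."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)   # Counter returns 0 for absent terms
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


query = ["hyperthermia", "leukocytosis"]
print(cosine(query, ["hyperthermia", "leukocytosis"]))
print(cosine(query, ["hyperthermia"]))
print(cosine(query, ["fever"]))
```

The last call shows the limitation discussed on the following slides: a document that says "fever" gets no credit for a query that says "hyperthermia".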

37 Stem-based VSM Morphological variants bear similar content E.g., “edema” and “edemas” Use stemmer to extract stems Lovins stemmer and Porter stemmer Query: “Hyperthermia, leukocytosis, increased intracranial pressure”… Stems: “hypertherm”, “leukocytos”, “increas”, “intracran”, “pressur”… Baseline of comparison
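
The stemming step can be illustrated with a toy suffix stripper. Note the assumption: this is neither the Lovins nor the Porter stemmer named on the slide; the suffix list is hand-picked only so that the example query yields the stems shown above, and `toy_stem` is an invented name.

```python
# Toy suffix list, longest first. NOT the Lovins or Porter algorithm.
SUFFIXES = ["ial", "ia", "is", "ed", "e", "s"]


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def stem_terms(text):
    """Split text on non-letters and stem every word."""
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text)
    return [toy_stem(word) for word in cleaned.split()]


stems = stem_terms("Hyperthermia, leukocytosis, increased intracranial pressure")
print(stems)
print(toy_stem("edema"), toy_stem("edemas"))
```

Morphological variants such as "edema" and "edemas" collapse to one term, so they match each other in the vector space model.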

38 Shortcomings of Stem-based VSM Inability to capture multi-word concepts 1. “Increased intracranial pressure” Inability to utilize the relations between concepts: 2. Synonyms: “hyperthermia” and “fever” 3. IS-A relation: “hyperthermia” and “body temperature elevation”

39 Concept-based VSM Uses concepts in knowledge base (KB) as terms KB: Metathesaurus in UMLS Captures multi-word concepts Captures synonyms Query: “Hyperthermia, leukocytosis, increased intracranial pressure”… CUIs: (C ), (C ), (C )…

40 Shortcomings of Concept-based VSM Concepts may be related: E.g. “hyperthermia” and “body temperature elevation” are not identical but related concepts Need to quantify conceptual relations Knowledge bases are often incomplete, which reduces the retrieval effectiveness

41 Shortcomings of Concept-based VSM (cont’d) Concepts may be related: The conceptual similarity measure, $s(c_i, c_j)$, quantifies relations between concepts. Knowledge bases are often incomplete, which reduces the retrieval effectiveness.

42 Incompleteness of the Knowledge Bases Missing concepts in KB, e.g., “Infiltrative small bowel process” (), (C ), () Missing links between related concepts, e.g., (cerebral edema) (cerebral lesion) In general, concept-based VSM cannot outperform stem-based VSM

43 To Compare Retrieval Effectiveness The test set: OHSUMED 106 queries, 14K documents Expert relevance judgment: R or N Retrieval effectiveness: Recall – the percentage of relevant documents retrieved so far Precision – the percentage of retrieved documents that are relevant
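
The two measures defined above reduce to simple set arithmetic. In this short sketch the document IDs are hypothetical and the function name is invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved list against a relevant set."""
    retrieved = list(retrieved)
    relevant = set(relevant)
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved so far, 3 judged relevant (R).
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], {"d1", "d3", "d7"})
print(precision, recall)
```

Calling the function on growing prefixes of a ranked result list gives the precision at each recall level, which is the usual way such comparisons are plotted.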

44 Evaluation of Phrase-based Document Similarity Due to the conceptual similarity $s(c_i, c_j)$ between concepts in $p_q$ and $p_d$ Due to the stem overlap in $p_q$ and $p_d$
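
One way to picture the two contributions named on this slide is the sketch below, where a phrase is a pair of a concept identifier and its word stems. Several things here are assumptions rather than the authors' formula: taking the larger of the two contributions, measuring stem overlap as a set cosine, the similarity value 0.8, and the made-up concept identifiers.

```python
import math

# Hypothetical conceptual similarities between distinct concepts.
CONCEPT_SIM = {frozenset(["c_lung_cancer", "c_lung_neoplasm"]): 0.8}


def concept_sim(ci, cj):
    """s(c_i, c_j): 1 for identical concepts, a table lookup otherwise."""
    if ci == cj:
        return 1.0
    return CONCEPT_SIM.get(frozenset([ci, cj]), 0.0)


def stem_overlap(stems_q, stems_d):
    """Set cosine between the stems of two phrases."""
    sq, sd = set(stems_q), set(stems_d)
    if not sq or not sd:
        return 0.0
    return len(sq & sd) / math.sqrt(len(sq) * len(sd))


def phrase_sim(p_q, p_d):
    """Assumed combination: the larger of the two contributions."""
    (cq, sq), (cd, sd) = p_q, p_d
    return max(concept_sim(cq, cd), stem_overlap(sq, sd))


query = ("c_lung_cancer", ["lung", "cancer"])
neoplasm = ("c_lung_neoplasm", ["lung", "neoplasm"])
drugs = ("c_anticancer_combo", ["anti", "cancer", "drug", "combin"])
print(phrase_sim(query, neoplasm))
print(phrase_sim(query, drugs))
```

The second call illustrates why stems are kept alongside concepts: when the knowledge base has no link between two concepts, the shared stem "cancer" still yields a non-zero similarity.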