A Knowledge-based Medical Digital Library

Presentation transcript:

A Knowledge-based Medical Digital Library
Wesley W. Chu
Computer Science Dept, UCLA
wwc@cs.ucla.edu
9/16/2018
SEDE06

Data in a Medical Digital Library
Structured data (patient lab data, demographic data, …) -- CoBase
Images (X rays, MRI, CT scans) -- KMeD
Free-text -- KMeX
  Patient reports
  Teaching files
  Literature
  News articles

System Overview
[Diagram: a query enters the Medical Digital Library, which returns relevant information drawn from three kinds of data]
  free-text data (e.g., medical literature, news articles, etc.)
  image data (e.g., X-ray images, CT images, etc.)
  structured data (e.g., lab results, patient demographic data)

Benefits of a Knowledge-based Medical Digital Library
Content-based information retrieval
Transforms patient records into a sea of information sources
Provides scenario-specific information for patient care, medical research, and education

Characteristics of Medical Queries
Multimedia
Temporal
Evolutionary
Spatial
Imprecise

CoBase: Cooperative Database
www.cobase.cs.ucla.edu
Use knowledge base to:
  Derive approximate answers
  Answer conceptual queries
  Provide associative query answers

KB: Type Abstraction Hierarchy (TAH)
Uses a clustering technique to group similar:
  Attribute values
  Image features
  Spatial relationships among objects
Provides multi-level knowledge (conceptual) representation

Data mining for KB -- TAH
Clustering data of an attribute:
  Value: difference between the exact value and the returned approximate value
  Frequency: probability of occurrence for each value
Can be extended to multiple attributes
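
The clustering criterion above can be sketched in code. This is a minimal illustration, not the CoBase implementation: it assumes the cost of a cluster is the frequency-weighted average difference between an exact value and the value returned in its place (called `relaxation_error` here), and it builds the hierarchy by repeatedly choosing the binary split with the lowest cost. All function names and the sample ages are hypothetical.

```python
from collections import Counter

def relaxation_error(values, freq):
    """Expected |x - y| when value x in the cluster is answered by value y,
    weighting both by their probability of occurrence."""
    total = sum(freq[v] for v in values)
    return sum(
        (freq[x] / total) * (freq[y] / total) * abs(x - y)
        for x in values
        for y in values
    )

def best_binary_split(values, freq):
    """Cut the sorted values into two clusters with the lowest weighted error."""
    values = sorted(values)
    total = sum(freq[v] for v in values)
    best = None
    for cut in range(1, len(values)):
        parts = (values[:cut], values[cut:])
        cost = sum(
            (sum(freq[v] for v in part) / total) * relaxation_error(part, freq)
            for part in parts
        )
        if best is None or cost < best[0]:
            best = (cost, parts[0], parts[1])
    return best[1], best[2]

def build_tah(values, freq):
    """Recursively split until every leaf holds a single value."""
    values = sorted(values)
    node = {"range": (values[0], values[-1]), "children": []}
    if len(values) > 1:
        left, right = best_binary_split(values, freq)
        node["children"] = [build_tah(left, freq), build_tah(right, freq)]
    return node

# Made-up attribute values for illustration only.
ages = [9, 10, 11, 12, 15, 16, 17, 30, 45, 60]
freq = Counter(ages)
tah = build_tah(set(ages), freq)
```

Each node's `range` plays the role of a conceptual value (for example, an age band) that covers the exact values beneath it.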

Type Abstraction Hierarchies for Medical Domain
Tumor (location, size)
  Class X [loc1 loc3] [s1 s3]
    X1 [loc1 s1]
    X2 [loc2 s2]
    X3 [loc3 s3]
  Class Y [locY sY]
Age
  Preteens
    9
    10
    11
    12
  Teen
  Adult
Ethnic Group
  Asian
    Korean
    Chinese
    Japanese
    Filipino
  African
  European

Generalization and Specialization in TAH
[Diagram: three query levels, Specific Query, Conceptual Query, and More Conceptual Query, connected by Generalization and Specialization]

Query Relaxation
[Flowchart with the labels: Query, Database, Answers, Yes, Display, No, Relax Attribute, TAHs, Query Modification]

Cooperative Querying for Medical Applications
Query: Find the treatment used for the tumor similar-to (loc, size) X1 on 12 year-old Korean males.
Relaxed Query: Find the treatment used for the tumor Class X on preteen Asians.
Association: The success rate, side effects, and cost of the treatment.
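
The relaxation in this example can be sketched as a loop that moves one query condition at a time up its TAH until the database returns answers. This is an illustration under stated assumptions, not CoBase code: the TAHs are reduced to child-to-parent maps taken from the example hierarchies, and the patient records, treatment labels, relaxation order, and function names are all made up.

```python
# Hypothetical TAHs as child -> parent maps, following the example hierarchies.
TAH_PARENT = {
    "tumor": {"X1": "Class X", "X2": "Class X", "X3": "Class X"},
    "age": {"9": "preteen", "10": "preteen", "11": "preteen", "12": "preteen"},
    "ethnic_group": {"Korean": "Asian", "Chinese": "Asian",
                     "Japanese": "Asian", "Filipino": "Asian"},
}

def leaves_under(attribute, concept):
    """Exact values covered by a concept (the concept itself if it is a leaf)."""
    covered = {v for v, p in TAH_PARENT[attribute].items() if p == concept}
    return covered or {concept}

def run_query(records, conditions):
    """Return records whose values fall under every query condition."""
    return [
        r for r in records
        if all(r[a] in leaves_under(a, c) for a, c in conditions.items())
    ]

def relax(conditions, relax_order):
    """Generalize the first attribute in relax_order that can still move up."""
    for attribute in relax_order:
        parent = TAH_PARENT[attribute].get(conditions[attribute])
        if parent is not None:
            return {**conditions, attribute: parent}
    return None  # nothing left to relax

def cooperative_query(records, conditions, relax_order):
    """Relax the query step by step until it returns at least one answer."""
    while conditions is not None:
        answers = run_query(records, conditions)
        if answers:
            return conditions, answers
        conditions = relax(conditions, relax_order)
    return None, []

# Made-up records for illustration only.
records = [
    {"tumor": "X2", "age": "10", "ethnic_group": "Japanese", "treatment": "treatment A"},
    {"tumor": "X1", "age": "11", "ethnic_group": "Chinese", "treatment": "treatment B"},
]
query = {"tumor": "X1", "age": "12", "ethnic_group": "Korean"}
relaxed, answers = cooperative_query(records, query, ["age", "ethnic_group", "tumor"])
```

With these records the loop stops once age and ethnic group have been generalized; the tumor condition stays exact because an answer is found before it needs relaxing.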

System Overview
[Diagram: a query enters the Medical Digital Library, which returns relevant information drawn from three kinds of data]
  free-text data (e.g., medical literature, news articles, etc.)
  image data (e.g., X-ray images, CT images, etc.)
  structured data (e.g., lab test results, patient demographic data)

KMeD: Retrieval of images by content
www.kmed.cs.ucla.edu
PI: Wesley Chu, Ph.D, Computer Science Department
Co-PIs: A. Cardenas, Ph.D, Computer Science Department; Ricky Taira, Ph.D, School of Medicine
Consultants: Denies Aberle, M.D.; C.M. Breant, Ph.D
Graduate students: Alex Bui, Christina Chu, John Dionisio, T. Plattner, D. Johnson, C. Hsu, T. Ieong

KMeD Objectives
Matching images based on features
Processing of queries based on spatial relationships among objects
Answering of imprecise queries
Visual query interface

KMeD: Retrieval of Images by Features & Content
Features: size, shape, texture, density, histology
Spatial Relations: angle of coverage, shortest distance, overlapping ratio, contact ratio, relative direction
Evolution of Object Growth: fusion, fission

Knowledge-Based Image Model
[Diagram with three levels: Knowledge Level, Schema Level, Representation Level (features and content)]
Objects: Brain, Tumor, Lateral Ventricle
TAH labels: SR(t,b), Tumor Size, SR(t,l)
Legend: SR: Spatial Relation; b: Brain; t: Tumor; l: Lateral Ventricle

Knowledge-Based Query Processing
[Flow diagram, stages in order]
  Queries
  Query Analysis and Feature Selection
  Knowledge-Based Content Matching via TAHs
  Query Relaxation
  Query Answers

User Model
To customize users' interests and preferences, needs, and goals, e.g., query conditions, relaxation control, etc.
User type
Default Parameter Values
Feature and Content Matching Policies
  Complete Match
  Partial Match

User Model (cont.)
Relaxation Control Policies
  Relaxation Order
  Unrelaxable Object
  Preference List
Measure for Ranking
Triggering conditions

Visual Query Language and Interface
Point-click-drag interface
Objects may be represented by icons
Spatial relationships among objects are represented graphically

Visual Query Example
Retrieve brain tumor cases where a tumor is located in the region indicated in the picture

A KB Medical Digital Library
www.cobase.cs.ucla.edu
[Diagram: a query enters the Medical Digital Library, which returns relevant information drawn from three kinds of data]
  free-text data (e.g., medical literature, news articles, etc.)
  image data (e.g., X-ray images, CT images, etc.)
  structured data (e.g., lab test results, patient demographic data)

KMeX
www.cobase.cs.ucla.edu
Project leader: Wesley W. Chu
Consultants: Hooshang Kangaloo, M.D.; Denies Aberle, M.D.
Graduate students: Victor Z. Liu, Wenlei Mao, Qinghua Zou

A Sample Patient Report
…
Tissue Source: LUNG (FINE NEEDLE ASPIRATION) (LEFT LOWER LOBE)
FINAL DIAGNOSIS:
- LUNG NODULE, LEFT LOWER LOBE (FINE NEEDLE ASPIRATION):
- LUNG CANCER, SMALL CELL, STAGE II.
…

Scenario-Specific Queries
Queries that mention one or more scenarios, e.g.,
  keratoconus treatment
  lung cancer diagnosis and complications
A scenario (e.g., treatment): a repeating healthcare situation
>60% of medical queries are scenario specific [HMW90, HPH96, EOE99, EOG00, WMH01]

Scenario Specific Retrieval
…
Tissue Source: LUNG (FINE NEEDLE ASPIRATION) (LEFT LOWER LOBE)
FINAL DIAGNOSIS:
- LUNG NODULE, LEFT LOWER LOBE (FINE NEEDLE ASPIRATION):
- LUNG CANCER, SMALL CELL, STAGE II.
…
[Diagram: the report is connected by "???" to two groups of articles]
  Diagnosis-related articles: how to diagnose the disease
  Treatment-related articles: how to treat the disease

Challenge I: Indexing
Challenge II: Terms in the query are too general
Challenge III: Mismatch between terms in the query and the documents

IndexFinder
http://fargo.homedns.org/umls/demo.aspx
Extract key information from clinical free texts
  Search relevant reports
  Search similar patients
Medical KB (UMLS) provides standard medical concepts
IndexFinder extracts UMLS concepts from clinical texts
[Diagram labels: Clinical Texts, Extract key info., Standard terms]
Notes: Clinical texts are important information sources; they include clinical notes, surgical notes, discharge summaries, radiology reports, etc. In many situations, a doctor needs to search the relevant reports of a patient or find a similar patient. To improve the quality of free-text search, we need to extract key information from free text and represent it in standard terms. So what is the key information in a free-text report? Fortunately, the Unified Medical Language System (UMLS) provides the answer. UMLS, a collection of more than 100 biomedical sources, defines the key medical concepts. Therefore, we are very interested in extracting UMLS concepts from clinical texts.

Previous Approaches
[Diagram: Free text, then NLP Parser (parse tree of "lambs will eat oats" with nodes ip, dp, i1, i0, vp, v0), then Noun phrases ("lambs", "oats"), then UMLS Mapping, then UMLS Concepts]
Notes: The conventional approach to extracting concepts from free text works like this: start from free text and use natural language processing to get a parse tree. From the tree, get a list of noun phrases. Then map each noun phrase against UMLS to get concepts.

Problems of Previous Approaches
Concepts cannot be discovered if they are not in a single noun phrase.
  E.g., in “second, third, and fourth ribs”, “second rib” cannot be discovered.
Difficult to scale to large text computing.
  Natural language processing requires significant computing resources.
Notes: The previous approaches have two main problems. First, concepts cannot be discovered if they are not in a single noun phrase. For example, in the text “second, third, and fourth ribs”, the concept “second rib” cannot be discovered because the words “second” and “ribs” are in different noun phrases. Second, it is difficult to scale to large text computing because natural language processing requires significant computing resources.

Our Approach: IndexFinder (Zou et al. 03)
Previous: free text -> UMLS
Our approach: UMLS -> free text
[Diagram, previous approach: Free text, NLP Parser, Noun phrases, UMLS Mapping, Concepts]
[Diagram, IndexFinder: Index phase (offline): UMLS (2GB), Indexing, Index Data (~80MB). Search phase (real time): Free text, Extracting, Filtering, concepts]
Suppose UMLS contains only “lung cancer”: we would discard all words in the text except “lung” and “cancer”.
Notes: We proposed a new technique called IndexFinder. The previous approach goes from free text to UMLS, as shown in this graph: from free text, to natural language processing, to noun phrases, and then maps each individual noun phrase against UMLS to get concepts. Can we do better? Suppose UMLS contains only a single concept, “lung cancer”. What would we do? Do we need all of these processes? We would discard all words in the free text except the two words “lung” and “cancer”. Our approach goes from UMLS to free text. First, the offline index phase indexes the relevant part of UMLS into compact index data that can be loaded into main memory to answer queries without using any database. Second, the real-time search phase first extracts concept candidates and then applies filters.

Knowledge-based approach
Using the compact index data without using any database system.
Permuting words in a sentence to generate UMLS concept candidates.
Using filters to eliminate irrelevant concepts.
Notes: IndexFinder is a knowledge-based approach. It uses the compact index data to answer queries directly without using any database system, permutes words in a sentence to generate UMLS concept candidates, and uses filters to eliminate irrelevant concepts.
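
The "UMLS to free text" direction can be sketched with an in-memory inverted index. This is an illustrative simplification, not IndexFinder's actual data structures: instead of explicitly permuting words, it treats a concept as a candidate when every one of its words occurs somewhere in the sentence, which has the same effect for this example. The three-concept table, its ids, and the crude `normalize` function are assumptions.

```python
from collections import defaultdict

# Hypothetical miniature concept table standing in for the indexed UMLS subset.
CONCEPTS = {
    "c1": "lung cancer",
    "c2": "second rib",
    "c3": "third rib",
}

def normalize(word):
    """Crude normalization: lowercase, strip punctuation, drop a plural 's'."""
    word = word.lower().strip(".,;:()")
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word

def build_index(concepts):
    """Index phase (offline): map each word to the concepts that contain it."""
    word_to_concepts = defaultdict(set)
    concept_size = {}
    for cid, name in concepts.items():
        words = {normalize(w) for w in name.split()}
        concept_size[cid] = len(words)
        for w in words:
            word_to_concepts[w].add(cid)
    return word_to_concepts, concept_size

def find_candidates(sentence, word_to_concepts, concept_size):
    """Search phase: keep concepts whose words all occur in the sentence."""
    hits = defaultdict(int)
    for w in {normalize(w) for w in sentence.split()}:
        for cid in word_to_concepts.get(w, ()):
            hits[cid] += 1
    return sorted(cid for cid, n in hits.items() if n == concept_size[cid])

index, sizes = build_index(CONCEPTS)
found = find_candidates("second, third, and fourth ribs", index, sizes)
```

Because matching works on the set of words in the sentence rather than on noun phrases, "second rib" is found even though "second" and "ribs" are not adjacent.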

Eliminate irrelevant concepts
Syntactic filter: limit the number of word combinations within a sentence.
Semantic filter:
  Using semantic types (e.g., body part, disease, treatment, diagnosis)
  Using the ISA relationship to filter out general terms and keep the specific ones
Notes: After generating concept candidates, we use filters to eliminate irrelevant concepts. We proposed two kinds of filters. The first is the syntactic filter, which limits word combinations within a sentence. The second is the semantic filter. We can filter concepts by semantic types such as body part, disease, treatment, etc. We can also use the ISA relationship to remove the general concepts and keep the more specific ones.
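
The two semantic filters can be sketched as follows. This is a minimal illustration assuming the candidates, their semantic types, and the ISA links are given as plain dictionaries; the type labels and ISA links below are invented for the example and are not taken from UMLS.

```python
def semantic_type_filter(candidates, semantic_type, allowed_types):
    """Keep only candidates whose semantic type is of interest."""
    return [c for c in candidates if semantic_type.get(c) in allowed_types]

def isa_filter(candidates, isa_parent):
    """Drop a candidate when a more specific candidate (a descendant) is present."""
    def ancestors(concept):
        seen = set()
        while concept in isa_parent and isa_parent[concept] not in seen:
            concept = isa_parent[concept]
            seen.add(concept)
        return seen

    general = set()
    for c in candidates:
        general |= ancestors(c)
    return [c for c in candidates if c not in general]

# Illustrative data only: types and ISA links are assumptions.
candidates = ["mass", "lung mass", "left lung mass", "small"]
isa_parent = {"left lung mass": "lung mass", "lung mass": "mass"}
semantic_type = {
    "mass": "finding",
    "lung mass": "finding",
    "left lung mass": "finding",
    "small": "quantitative concept",
}

specific = isa_filter(candidates, isa_parent)
kept = semantic_type_filter(specific, semantic_type, {"finding"})
```

Applying the ISA filter first and the type filter second is an arbitrary choice here; the two filters are independent and could run in either order.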

Comparison of IndexFinder with MetaMap
Input: A small mass was found in the left hilum of the lung.
[Figure: MetaMap output and IndexFinder output for the input sentence]
Notes: MetaMap is a well-known natural language processing approach to extracting UMLS concepts. We compared IndexFinder with MetaMap, and here is an example. For the input text “A small mass was found in the left hilum of the lung.”, MetaMap found four concepts: Mass, Small, Left hilum, and Lung (shown in blue). IndexFinder returned a ranked list of four concepts. The top three in the list (lung left hilum, left lung mass, and a mass) cannot be discovered by MetaMap.

Topic Directory
Using indexing for document retrieval cannot provide:
  Standard vocabulary
  Cross reference among topics
  Scenario-specific search
Topic directory resolves these shortcomings by dynamically clustering documents into knowledge-based topics based on user-specified scenarios

The Mismatch Problem
Scenario concepts are too general to match specialized ones in relevant docs
Query: keratoconus, treatment
Expanded Query: keratoconus, treatment, contact lens, epikeratoplasty, epikeratophakia …
Document 1: … The use of contact lens after keratoconic epikeratoplasty…
Document 2: … Epikeratophakia for aphakia, keratoconus, and myopia …

Basic Idea
Start from pairs of frequently co-occurring concepts [Qiu03, Jing94, Xu96]
Apply knowledge structures to filter out pairs that are “irrelevant” to a given scenario, e.g., treatment

Sample Co-Occurring Pairs
Concepts most frequently co-occurring with keratoconus:
  griffonia
  contact lens
  acute hydrops
  central cornea
  corneal
  penetrating keratoplasty
  epikeratoplasty

UMLS: The Knowledge Source
Three major components:
The MetaThesaurus
  > 800K medical concepts, <ID, multiple string forms>
  E.g., <“C0022578,” {“Keratoconus,” “Cornea conical”}>
  Used for detecting concepts from free text
The Semantic Network
  ~100 semantic types, ~50 relations among types
  E.g., “Disease or Syndrome,” containing 44,000 concepts
  Used for deriving scenario-specific relationships
The SPECIALIST Lexicon

Structure of The Knowledge Source
[Diagram: in the Semantic Network, the type Pharmacological Substance is linked by “treats” to the type Disease or Syndrome. In the Meta-Thesaurus, the concepts shown include keratoconus, acute hydrops, insulin, and lactase.]

Fragment of The Semantic Network for Each Scenario
E.g., the treatment scenario:
  Therapeutic or Preventive Procedure -- treats -- Disease or Syndrome
  Medical Device -- treats -- Disease or Syndrome
  Pharmacological Substance -- treats -- Disease or Syndrome

Filtering
[Diagram: the concepts co-occurring with keratoconus (griffonia, contact lens, acute hydrops, central cornea, corneal, penetrating keratoplasty, epikeratoplasty) are checked against the treatment fragment of the Semantic Network, whose types are Therapeutic or Preventive Procedure, Pharmacological Substance, Medical Device, and Disease or Syndrome, linked by “treats”.]

Knowledge-Based Query Expansion
Original query: <ckey, {cs}>
  ckey, a key concept, e.g., keratoconus
  {cs}, a set of scenario concepts, e.g., treatment
{ce}, concepts having scenario-specific relationships with ckey
  ckey -- cs -- ce, e.g., keratoconus -- treats -- contact lens
Expanded query: <ckey, {cs}, {ce}>
  E.g., keratoconus, treatment, contact lens, epikeratoplasty, epikeratophakia…
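
The expansion step can be sketched as a filter over co-occurring concepts. This is an illustration under stated assumptions, not the KMeX implementation: the scenario fragment is written as (source type, relation, target type) triples, a co-occurring concept is kept when its semantic type can stand in a scenario relation to the type of the key concept, and the semantic-type assignments below are assumed for the example rather than looked up in UMLS.

```python
# Treatment-scenario fragment as (source type, relation, target type) triples.
TREATMENT_FRAGMENT = {
    ("Therapeutic or Preventive Procedure", "treats", "Disease or Syndrome"),
    ("Medical Device", "treats", "Disease or Syndrome"),
    ("Pharmacological Substance", "treats", "Disease or Syndrome"),
}

# Assumed semantic types for illustration; not verified against UMLS.
SEMANTIC_TYPE = {
    "keratoconus": "Disease or Syndrome",
    "contact lens": "Medical Device",
    "penetrating keratoplasty": "Therapeutic or Preventive Procedure",
    "epikeratoplasty": "Therapeutic or Preventive Procedure",
    "acute hydrops": "Disease or Syndrome",
    "central cornea": "Body Part, Organ, or Organ Component",
}

def expand_query(c_key, scenario_fragment, cooccurring, semantic_type):
    """Return the co-occurring concepts whose type relates to c_key's type."""
    key_type = semantic_type[c_key]
    allowed = {src for src, _rel, tgt in scenario_fragment if tgt == key_type}
    return [c for c in cooccurring if semantic_type.get(c) in allowed]

cooccurring = [
    "contact lens",
    "acute hydrops",
    "central cornea",
    "penetrating keratoplasty",
    "epikeratoplasty",
]
c_e = expand_query("keratoconus", TREATMENT_FRAGMENT, cooccurring, SEMANTIC_TYPE)
expanded_query = ["keratoconus", "treatment"] + c_e
```

Concepts with no known type are dropped by `semantic_type.get`, which errs on the side of a smaller expansion.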

Need for Weight Adjustments
Weight adjustments needed to compensate for the filtering
[Table comparing statistical expansion and knowledge-based expansion; column headers c’e, v’e, ce, ve. The two columns were interleaved in extraction; concept and weight pairs in the extracted order:]
  fuchs dystrophy 0.289
  penetrating keratoplasty 0.247
  epikeratoplasty 0.230
  epikeratophakia 0.119
  corneal ectasia 0.168
  keratoplasty 0.103
  acute hydrops 0.165
  contact lens 0.101
  keratometry 0.133
  thermokeratoplasty 0.092
  corneal topography 0.132
  button 0.067
  corneal 0.130
  secondary lens implant 0.057
  aphakic corneal edema 0.122
  fittings adapters 0.048
  esthesiometer 0.043

The OHSUMED Testbed
A testbed: a benchmark query set, a corpus, relevance judgments for each query
OHSUMED [HBL94]
  57 scenario-specific queries, e.g.,
    keratoconus treatment
    thrombocytosis treatment and diagnosis
    diagnostic and therapeutic work up of breast mass
  348K MEDLINE articles (title + abstract), 1988-1992
Notes: How do we identify the scenario concepts in each query? The dataset is not the same as the one in the metasearching problem.

Comparison Under Different Expansion Sizes
[Figure: retrieval performance at different expansion sizes; s: expansion size; metric: avgp]
Notes: Why can appending co-occurring terms be helpful in the first place? Why would we consider the difference between the top two curves a significant improvement?

Summary of Query Expansion
The knowledge-based approach selects more scenario-specific terms than the statistical approach and achieves better performance
Different “quality” of knowledge structure for different scenarios yields different performance improvements

Challenge I: Indexing
Challenge II: Terms in the query are too general
Challenge III: Mismatch between terms used in the query and the documents causes problems in ranking of results

Challenge III: Mismatching between terms used in query and documents
Example
Query: … lung cancer, …
Document 1: … lung carcinoma …
Document 2: … lung neoplasm …
Document 3: anti-cancer drug combinations…
[Diagram: each query-to-document match is marked “?”]

Ranking query results
Traditional approach
  Word stem based Vector Space Model (VSM)
  Concept based VSM
New approach
  Phrase based (word + concept) VSM

Phrase-based Vector Space Model (VSM)
[Diagram: the query “… lung cancer, …” is matched against documents containing “lung carcinoma”, “lung neoplasm”, and “anti-cancer drug combinations”; match marks shown are “?” and “√”. Knowledge-source labels: “lung cancer = lung carcinoma”, “parent_of”, and “missing!!!”.]

Phrase-based VSM Examples
Query: “lung cancer …”
  Phrases: [(C0242379); “lung” “cancer”] …
Document: “anti-cancer drug combinations …”
  Phrases: [(C0003393); “anti” “cancer” “drug” “combin”] …
Notes: To resolve these two problems, we use a “phrase” to keep track of both the concept and its word stems. For example, our query becomes (bullet 1). Here we use a pair of brackets to enclose a phrase. Just as in the concept-based VSM, we first detect concepts (C0015967 etc.), but this time we keep all the stems together with the CUIs. “Infiltrative small bowel process” becomes (bullet 2). Notice how the stems provide useful information for the “unknown” concepts. As for “cerebral edema” and “cerebral lesion”, although there is no relation between their CUIs, the shared stem “cerebr” shows that they are actually related.
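
How a phrase combines concept and stem evidence can be sketched as follows. This is a simplified illustration, not the authors' exact weighting formula: a phrase is a pair (CUI, stems), phrase similarity is taken as the larger of a concept-level similarity and a binary-weight cosine over stems, and a document score averages the best match for each query phrase. The identifier `CUI_LUNG_NEOPLASM` and the 0.8 relatedness value are placeholders invented for the example.

```python
import math

def stem_cosine(stems_a, stems_b):
    """Cosine similarity between two sets of stems with binary weights."""
    a, b = set(stems_a), set(stems_b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def phrase_similarity(p, q, concept_sim):
    """Use whichever is stronger: concept-level or stem-level evidence."""
    (cui_p, stems_p), (cui_q, stems_q) = p, q
    if cui_p is not None and cui_p == cui_q:
        conceptual = 1.0
    else:
        conceptual = concept_sim.get(frozenset((cui_p, cui_q)), 0.0)
    return max(conceptual, stem_cosine(stems_p, stems_q))

def score(query_phrases, doc_phrases, concept_sim):
    """Average, over query phrases, of the best-matching document phrase."""
    if not query_phrases or not doc_phrases:
        return 0.0
    best = [
        max(phrase_similarity(q, d, concept_sim) for d in doc_phrases)
        for q in query_phrases
    ]
    return sum(best) / len(best)

query = [("C0242379", ("lung", "cancer"))]
doc_carcinoma = [("C0242379", ("lung", "carcinoma"))]        # same concept as the query
doc_neoplasm = [("CUI_LUNG_NEOPLASM", ("lung", "neoplasm"))]  # placeholder CUI
doc_drugs = [("C0003393", ("anti", "cancer", "drug", "combin"))]

# Assumed relatedness between the query concept and its parent concept.
concept_sim = {frozenset(("C0242379", "CUI_LUNG_NEOPLASM")): 0.8}

s_carcinoma = score(query, doc_carcinoma, concept_sim)
s_neoplasm = score(query, doc_neoplasm, concept_sim)
s_drugs = score(query, doc_drugs, concept_sim)
```

In this sketch the synonym document matches through the shared concept, the parent-concept document through the knowledge-source relation, and the drug document only through the shared stem "cancer", so it ranks last.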

Retrieval Effectiveness Comparison (Corpus: OHSUMED, KB: UMLS)
[Figure: precision at the 11 recall points for each model; annotation: 16% (100 queries) vs. 5% (50 queries)]
Notes: The baseline for comparison is the stem-based VSM, as we said before. Here, we plot the precision values at the 11 recall points. If we use concepts as terms and treat different concepts as unrelated, we arrive at the (Concepts Unrelated) line. The result is significantly worse than the baseline (28%). Taking the concept interrelationships into consideration (Concepts), we achieve a significant improvement over (Concepts Unrelated). The average effectiveness is similar to that of the baseline. On the other hand, if we consider the contribution of both the stems and the concept in a phrase, but treat different concepts as unrelated (Phrases, Concepts Unrelated), we also achieve a significant improvement over the (Concepts Unrelated) line. The improvement over the baseline is not significant. Considering both the stem contribution and the concept interrelationships (Phrases), we achieve a 16% improvement over the baseline. Since a 5% improvement in average precision over 50 queries is considered significant in information retrieval, the 16% improvement shown here warrants a paradigm change from stem-based VSM to phrase-based VSM.
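
For readers unfamiliar with the metric, the precision values at the 11 recall points can be computed as in the sketch below. It uses the standard interpolated definition (precision at recall level r is the highest precision observed at any recall of at least r); whether the slides used exactly this variant is an assumption, and the ranking in the example is made up.

```python
def eleven_point_precision(ranked_relevance, total_relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 for one query."""
    points = []  # (recall, precision) measured at each relevant document
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    levels = [i / 10 for i in range(11)]
    return [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]

def relative_improvement(new, baseline):
    """Relative change of an average precision value over a baseline."""
    return (new - baseline) / baseline

# Made-up ranking: relevant documents at ranks 1 and 3, two relevant in total.
curve = eleven_point_precision([True, False, True, False], total_relevant=2)
average = sum(curve) / len(curve)
```

Averaging such curves over all queries gives one line of the plot; `relative_improvement` applied to two averages gives percentages of the kind quoted in the notes.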

Experimental Results
Knowledge-based query expansion (KQE) is superior to statistical query expansion.
The knowledge-based phrase vector space model (PVSM) is superior to the stem-based vector space model (SVSM).
KQE + PVSM can yield 15-20% improvements in precision/recall over SVSM.

KMeX Demo
[Diagram: an ad-hoc query, or a patient report for content correlation, is submitted to the Medical Digital Library (free text documents), which returns query results]
Query results:
  Patient reports
  Medical literature
  Teaching materials
  News articles

Query Answering via Templates
Sample templates: “<disease>, treatment,” “<disease>, diagnosis”
[Diagram: IndexFinder extracts “lung cancer”; the template “<disease>, treatment” gives the query “lung cancer, treatment”; Query Expansion adds terms such as radiotherapy, chemotherapy, …, cisplatin; the Phrase-based VSM returns relevant documents]

Future Applications
Patients: search for relevant literature and specialists regarding the treatment of their specific disease.
Healthcare providers: identify other individuals with similar demography and disease, and discover the success rates and side effects of the treatment methods used.
Medical researchers: study the characteristics of new diseases and the effectiveness of treatment methods for those diseases.

Acknowledgments
This research was supported by:
  DARPA F30602-94-C-0207
  NSF grant # IIS-0097438
  NIC/NIH Grant #4442511-33780