Layered MorphoSaurus Lexicon Extension

Problem
Confusing and arbitrary synonym classes for non-medical concepts
High ambiguity of general (non-terminological) language
Maintenance cost not justified by search engine performance
Risk of precision loss due to general-language terms

Solution 1 (radical)
Abandon the present synonym class architecture; treat only stem variations as synonyms
Example:
  – remove: {derm, haut}, {hyper, high}
  – maintain: {diagnos, diagnost}, {bruch, bruech}
Expected outcome:
  – Monolingual IR: Precision +, Recall -
  – Cross-language IR: seriously hampered
Compensation strategy: multiword thesaurus
  – maps MID sequences: [m1 m2 m3] -> [m7 m8 m9]
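The MID sequence mapping of the multiword thesaurus can be illustrated with a small sketch. The slides only state that sequences such as [m1 m2 m3] are mapped to [m7 m8 m9]; the function name, the dictionary representation, and the longest-match-first strategy below are assumptions made for illustration.

```python
def apply_multiword_thesaurus(mids, thesaurus):
    """Rewrite MID sequences using a multiword thesaurus.

    `thesaurus` maps tuples of MIDs to replacement tuples.
    Longest-match-first is an assumption, not taken from the slides.
    """
    max_len = max((len(key) for key in thesaurus), default=0)
    out = []
    i = 0
    while i < len(mids):
        # Try the longest candidate sequence starting at position i first.
        for n in range(min(max_len, len(mids) - i), 0, -1):
            key = tuple(mids[i:i + n])
            if key in thesaurus:
                out.extend(thesaurus[key])
                i += n
                break
        else:
            # No thesaurus entry starts here: keep the MID unchanged.
            out.append(mids[i])
            i += 1
    return out


thesaurus = {("m1", "m2", "m3"): ("m7", "m8", "m9")}
rewritten = apply_multiword_thesaurus(["m0", "m1", "m2", "m3", "m4"], thesaurus)
```

Here `rewritten` holds the input sequence with the inner three MIDs replaced, while MIDs without a thesaurus entry pass through untouched.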

Solution 2 (semiradical)
No alteration of lexicon structure
Customization of thesaurus export:
  – Option 1: as is (e.g. for cross-language retrieval)
  – Option 2: automatically generate new eq classes on the fly:
      ignore “has-sense”
      crack existing eq classes
Example: MID1 = {a, b, c, d, e, f, g} is split into MID1’ = {a}; MID1’’ = {b}; MID1’’’ = {c, d, e}; MID1’’’’ = {f, g}, where d and e are variants of the stem c, and g is a variant of the stem f
Criterion for stem variants:
  – lexemes are in the same eq class (before splitting)
  – lexemes have a Levenshtein edit distance below a threshold
Advantage:
  – choice between Full and Lite version maintained
  – completely automated generation of Lite out of Full
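A minimal sketch of how an existing eq class might be cracked into stem-variant groups using the Levenshtein criterion. The slides give neither the threshold nor the grouping procedure; the threshold of 2, the choice of the shortest lexeme as the stem, and the function names are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def split_eq_class(lexemes, threshold=2):
    """Split one eq class into groups of stem variants.

    Assumption: lexemes are visited shortest first, and the first member
    of each group acts as its stem. A lexeme joins the first group whose
    stem is at an edit distance below `threshold`; otherwise it starts
    a new group.
    """
    groups = []
    for lex in sorted(lexemes, key=lambda s: (len(s), s)):
        for group in groups:
            if levenshtein(group[0], lex) < threshold:
                group.append(lex)
                break
        else:
            groups.append([lex])
    return groups


# Lexemes taken from the Solution 1 example, merged into one class here
# purely for illustration.
lite = split_eq_class({"derm", "haut", "diagnos", "diagnost"})
```

With these inputs, `derm` and `haut` end up in separate singleton groups, while `diagnos` and `diagnost` stay together, which mirrors the remove/maintain behaviour described for Solution 1.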

Layering of the lexicon
Hypothesis: MIDs play different roles in a domain-specific IR context
So far we have two layers:
  – “for indexing”
  – “not for indexing”

[Diagram: the lexicon divided into two layers, “for indexing” and “not for indexing”]

[Diagram: four layers: Core (strictly domain-specific, e.g. medicine), General, Modifiers, Stop]

MID Characterization
S = Stop
  – Irrelevant for document indexing and retrieval
  – Personal pronouns, auxiliary verbs, some prefixes, most derivation suffixes
M = Modifier
  – Meaningful and discriminative in local context only
  – Depend on other words
  – Never constitute a user query on their own; very low idf
  – Negation particles, many adjectives, quantifiers, graduation, modality
G = General
  – General-language terms that cannot be assigned to any specific domain terminology
  – Most verbs and nouns found in a normal lexicon
C = Core
  – Domain-specific terms
  – Domain queries should contain at least one C term
  – Generally nouns that can only be found in a domain-specific lexicon

How to classify MIDs (or subwords?) by layers
S: already done (“not for indexing”)
C candidates: MIDs obtained by indexing UMLS (a subset)
G and M candidates: MIDs obtained by indexing WordNet
Separation of M: manually check frequent, non-medical MIDs extracted from a non-medical corpus
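The classification steps above can be sketched as a lookup over MID sets. The slides do not say how to resolve an MID that appears in both the UMLS and the WordNet index; giving Core precedence is an assumption, as are all names and the sample MIDs used below.

```python
from enum import Enum


class Layer(Enum):
    STOP = "S"
    MODIFIER = "M"
    GENERAL = "G"
    CORE = "C"


def classify_mid(mid, stop, umls, wordnet, modifiers):
    """Assign an MID to a layer.

    stop      -- MIDs already marked "not for indexing"
    umls      -- MIDs produced by indexing the UMLS subset (C candidates)
    wordnet   -- MIDs produced by indexing WordNet (G and M candidates)
    modifiers -- manually confirmed modifier MIDs (the separated M set)
    The order of the checks (S, then C, then M, then G) is an assumption.
    """
    if mid in stop:
        return Layer.STOP
    if mid in umls:
        return Layer.CORE
    if mid in modifiers:
        return Layer.MODIFIER
    if mid in wordnet:
        return Layer.GENERAL
    return None  # not covered by any resource


# Hypothetical sample data for illustration only.
stop = {"#the"}
umls = {"#potassium", "#blood"}
wordnet = {"#blood", "#highgrade", "#house"}
modifiers = {"#highgrade"}

layers = {
    mid: classify_mid(mid, stop, umls, wordnet, modifiers)
    for mid in ["#the", "#potassium", "#blood", "#highgrade", "#house"]
}
```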

Differentiated treatment in IR context
M: completely ignore outside local context
  – Hyperkalemia -> #highgrade[M] #potassium[C] #blood[C]
  – retrieve document with “elevated potassium … blood” -> #highgrade[M] #potassium[C] … #blood[C]
  – ignore document with “moderate hypernatremia but normal potassium … blood” -> #moderate[M] #highgrade[M] #sodium[C] #blood[C] #normal[M] #potassium[C] … #blood[C]: #potassium[C] is outside the scope of #highgrade[M]
G: similar treatment, broader scope (window); if outside scope: downranking but not excluding
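The scope rule for M terms can be made concrete with a small sketch. The slides do not define the scope precisely; here it is assumed that a modifier applies to the next C term in the query, and that in a document its scope covers the following `window` tokens (2 by default). The downranking of G terms is not modelled, and all function names are hypothetical.

```python
def parse(tagged):
    """Parse '#highgrade[M] #potassium[C]' into [('highgrade', 'M'), ...]."""
    tokens = []
    for item in tagged.split():
        name, _, layer = item.lstrip("#").partition("[")
        tokens.append((name, layer.rstrip("]")))
    return tokens


def modifier_pairs(query):
    """Pair each M token with the next C token in the query (assumption)."""
    pairs, pending = [], []
    for name, layer in query:
        if layer == "M":
            pending.append(name)
        elif layer == "C":
            pairs.extend((mod, name) for mod in pending)
            pending = []
    return pairs


def matches(query, doc, window=2):
    """True if all C terms occur and every modifier has its C term in scope."""
    names = [name for name, _ in doc]
    if not all(name in names for name, layer in query if layer == "C"):
        return False
    for mod, core in modifier_pairs(query):
        in_scope = any(
            names[i] == mod and core in names[i + 1:i + 1 + window]
            for i in range(len(names))
        )
        if not in_scope:
            return False
    return True


query = parse("#highgrade[M] #potassium[C] #blood[C]")
doc_elevated = parse("#highgrade[M] #potassium[C] #blood[C]")
doc_normal = parse(
    "#moderate[M] #highgrade[M] #sodium[C] #blood[C] "
    "#normal[M] #potassium[C] #blood[C]"
)
```

Under these assumptions `matches(query, doc_elevated)` succeeds, while `matches(query, doc_normal)` fails because #potassium lies more than two tokens after #highgrade.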

Differentiated lexicon redesign by layer
Layer M: allow big and unspecific classes
Layer G: apply Solution 1 or 2
Layer C: continue fine-grained lexicon modelling including semantic relations
Many more ways to adjust the retrieval system to requirements:
  – whether to apply Solution 1 or 2
  – on which thesaurus layers
  – whether or not to apply phrase search or a near operator when dealing with “M” classes