Translating Dialects in Search: Mapping between Specialized Languages of Discourse and Documentary Languages Vivien Petras UC Berkeley School of Information.

Presentation transcript:

Translating Dialects in Search: Mapping between Specialized Languages of Discourse and Documentary Languages Vivien Petras UC Berkeley School of Information

Overcoming the Language Problem in Search: How can someone searching for violins be made aware that there are also fiddles (and vice versa)?

Outline: The Language Problem in Information Retrieval; Dialects & Contexts; The Search Term Recommender; 4 Research Questions; Exploratory Web Interface

Information Retrieval: “how to obtain the right information for the right user at the right time” (Chu, 2003) → Decision Process under Uncertainty

Decision Process under Uncertainty: Searching for the needle in the haystack. Which needle in which haystack? How to express the needle and the haystack? → Language Problem in Information Retrieval

Language Mapping (diagram). Labels: Searcher, Author, Concept Space (twice), Question, Text, Search Statement, Document, Match! Mappings shown: between searcher and IR system; between author and IR system; between search statement and document.

Information Retrieval: IR = Language Mapping Exercise (diagram labels: Searcher, Concept Space, Question, Search Statement, Document, Match!). A search statement needs to describe both the searcher’s question (information need) and the documents that are relevant to that question.

The Language Problem: In Linguistics: unlimited semiosis. In Information Science: inter-indexer inconsistency (20-60%).

Dialects and Contexts: How to alleviate language ambiguity? Ludwig Wittgenstein: language games, language regions → Language is disambiguated within contexts and specialized dialects.

Dialects and Contexts: How to alleviate language ambiguity for search term selection? Support search term selection: within the dialect of a specialized community; in context; using the language of documents (for term matching).

Search Term Recommender (diagram). Labels: Search Statement, Specialty, “Did you mean…”, Specialty Term, Information Collection.

Search Term Recommender

The Search Term Recommender Methodology: Divide the information collection by specialty. Compute the association between specialty terms and documentary terms (subject metadata). Recommend highly associated terms.
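
The three methodology steps can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the slides do not name the association measure, so the sketch assumes a simple conditional frequency (how often a descriptor is assigned when a free-text term occurs). The function names `train_recommender` and `recommend` and the toy documents are hypothetical.

```python
from collections import Counter, defaultdict


def train_recommender(documents):
    """Count co-occurrences of free-text terms and subject descriptors.

    `documents` is a list of (text_terms, descriptors) pairs drawn from ONE
    specialty partition of the collection.
    """
    cooccurrence = defaultdict(Counter)  # term -> descriptor -> joint count
    term_counts = Counter()              # term -> number of documents with it
    for text_terms, descriptors in documents:
        for term in set(text_terms):
            term_counts[term] += 1
            for descriptor in set(descriptors):
                cooccurrence[term][descriptor] += 1
    return cooccurrence, term_counts


def recommend(query_terms, cooccurrence, term_counts, top_n=3):
    """Rank descriptors by summed conditional frequency (assumed measure)."""
    scores = Counter()
    for term in query_terms:
        if term not in term_counts:
            continue  # unseen term: nothing to recommend from
        for descriptor, joint in cooccurrence[term].items():
            scores[descriptor] += joint / term_counts[term]
    # Sort by score (descending), then alphabetically for stable ties.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [descriptor for descriptor, _ in ranked[:top_n]]


# Toy "Physics" partition (illustrative data, not from the test collections).
physics_docs = [
    (["protostars", "clusters", "cloud"],
     ["Interstellar molecular clouds", "Pre-main-sequence stars"]),
    (["galaxy", "clusters"], ["Clusters of galaxies"]),
    (["cloud", "cores"], ["Interstellar molecular clouds"]),
]
cooc, counts = train_recommender(physics_docs)
suggestions = recommend(["cloud"], cooc, counts)
print(suggestions)
```

Training one such model per specialty, plus one on the whole collection, gives the specialty and general recommenders that the later slides compare.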

The Search Term Recommender: Applications: term selection support (query expansion & reformulation); automatic classification; terminology mapping.

The Search Term Recommender - Questions: 1. How can specialties & specialty dialects be identified in an information collection? 2. Do specialty dialects really differ? 3. Is performance improved when focusing on specialty dialects? 4. How specific should specialties be? Tested on 2 bibliographic collections: Inspec and Medline (Ohsumed collection).

Test collection: Inspec. Subjects: Physics, Electrical and Electronic Engineering, Computers and Control. Document fields: author, title, source, publication year, abstract, Inspec thesaurus descriptors, Inspec classification codes. Number of documents: 427,340. Descriptors per document: 6.99.

Test collection: Medline (Ohsumed collection). Subjects: Biomedicine and Health. Document fields: author, title, source, publication year, publication type, abstract, MeSH headings. Number of documents: 168,463. MeSH headings per document: 3.11.

The Search Term Recommender System - Questions: 1. How can specialties be identified in an information collection? 2. Do specialty dialects really differ? 3. Is performance improved when focusing on specialty dialects? 4. How specific should specialties be? Tested on 2 bibliographic collections: Inspec and Medline (Ohsumed collection).

Determine specialty documents in the collection: domain terminology; publication source; bibliometric analysis; social network analysis; subject-specific classification.

Identification of Specialties in an Information Collection: Inspec test collection: by top-level categories in the Inspec classification; 3 specialties: Physics, Electrical & Electronic Engineering, Computers & Control. Ohsumed test collection: by journals grouped by subject; 33 specialties.
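
Splitting a collection by top-level classification category might look like the sketch below. It assumes that the first character of an Inspec classification code identifies the section (A, B, C), that a record carrying codes from two sections belongs to both specialties, and that records are dictionaries with `id` and `classification_codes` keys. The record layout and the sample codes are hypothetical, and the slides do not say how multi-section documents were handled.

```python
from collections import defaultdict

# Assumed mapping from top-level Inspec section letter to specialty name.
INSPEC_SECTIONS = {
    "A": "Physics",
    "B": "Electrical & Electronic Engineering",
    "C": "Computers & Control",
}


def split_by_specialty(records):
    """Group record ids by the top-level section of their classification codes."""
    partitions = defaultdict(list)
    for record in records:
        # Collect each section letter once per record to avoid duplicates.
        sections = {code[0] for code in record["classification_codes"] if code}
        for section in sections:
            specialty = INSPEC_SECTIONS.get(section)
            if specialty is not None:
                partitions[specialty].append(record["id"])
    return dict(partitions)


# Hypothetical records; the codes are for illustration only.
records = [
    {"id": "doc1", "classification_codes": ["A9840", "A9710"]},
    {"id": "doc2", "classification_codes": ["C6130", "B1265"]},
]
partitions = split_by_specialty(records)
print(partitions)
```

For Ohsumed, the same function shape would apply with a journal-to-subject lookup table in place of the section letters.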

Differences in Language: differences in specialty dialects (specialty term overlap); differences in documentary languages (subject metadata term overlap); differences in search term recommender suggestions (term suggestion overlap).
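
One way to put a number on such overlap is sketched below. The slides do not define how overlap was computed, so this assumes a simple definition: the fraction of all distinct terms that occur in more than one specialty vocabulary. The function name `term_overlap` and the toy vocabularies are hypothetical.

```python
from collections import Counter


def term_overlap(vocabularies):
    """Fraction of distinct terms that appear in more than one vocabulary.

    `vocabularies` maps a specialty name to an iterable of its terms.
    """
    specialty_count = Counter()
    for terms in vocabularies.values():
        specialty_count.update(set(terms))  # count each term once per specialty
    if not specialty_count:
        return 0.0
    shared = sum(1 for n in specialty_count.values() if n > 1)
    return shared / len(specialty_count)


# Toy vocabularies (illustrative only).
vocabularies = {
    "Physics": {"clouds", "stars", "clusters"},
    "Computers & Control": {"clusters", "search problems"},
}
overlap = term_overlap(vocabularies)
print(overlap)  # 1 shared term out of 4 distinct terms
```

The same function can be applied to free-text terms, subject metadata terms, or recommender suggestions, which are the three comparisons listed on the slide.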

Inspec Dialects (specialty term overlap): Terms analyzed: 60,601. Subject metadata term overlap: 87%. Suggested term overlap: 30%.

Ohsumed Dialects (specialty term overlap): Terms analyzed: 11,663. Subject metadata term overlap: 32%. Suggested term overlap: 30%.

Automatic classification: Comparison of specialty vs. general term suggestions.

Automatic Classification: Evaluation
Title: “A search for clusters of protostars in Orion cloud cores”
Originally assigned terms: 1. Infrared sources (astronomical) 2. Interstellar molecular clouds 3. Pre-main-sequence stars 4. Star associations
Specialty Search Term Recommender: 1. Clouds 2. Clusters of galaxies 3. Interstellar molecular clouds 4. Star clusters 5. Pre-main-sequence stars
General Search Term Recommender: 1. Search problems 2. Clouds 3. Atomic clusters 4. Clusters of galaxies 5. Interstellar molecular clouds
Recall (hit rate): specialty 2/4 = 0.5; general 1/4 = 0.25
Precision (accuracy): specialty 2/5 = 0.4; general 1/5 = 0.2
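
The recall and precision figures on this slide can be reproduced with a short check. The term lists are copied from the slide; the function name `recall_precision` is a hypothetical helper, and exact string matching between suggested and assigned terms is assumed.

```python
def recall_precision(assigned, suggested):
    """Return (recall, precision) of suggested terms against assigned terms."""
    assigned_set = set(assigned)
    hits = sum(1 for term in suggested if term in assigned_set)
    recall = hits / len(assigned)      # share of assigned terms recovered
    precision = hits / len(suggested)  # share of suggestions that were correct
    return recall, precision


assigned = [
    "Infrared sources (astronomical)",
    "Interstellar molecular clouds",
    "Pre-main-sequence stars",
    "Star associations",
]
specialty_suggestions = [
    "Clouds",
    "Clusters of galaxies",
    "Interstellar molecular clouds",
    "Star clusters",
    "Pre-main-sequence stars",
]
general_suggestions = [
    "Search problems",
    "Clouds",
    "Atomic clusters",
    "Clusters of galaxies",
    "Interstellar molecular clouds",
]

specialty_scores = recall_precision(assigned, specialty_suggestions)
general_scores = recall_precision(assigned, general_suggestions)
print(specialty_scores)  # (0.5, 0.4)
print(general_scores)    # (0.25, 0.2)
```

The specialty recommender matches two of the four assigned terms, the general recommender only one, which is where the slide's 0.5 vs. 0.25 recall and 0.4 vs. 0.2 precision come from.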

Performance of the STR: Inspec. Test documents: 42,735. Specialties: 3. First 3 suggested: recall 13.6%, precision 11.2%.

Performance of the STR: Ohsumed. Test documents: 18,733. Specialties: 33. First 3 suggested: recall 26%, precision 25.6%.

Specificity of Specialties: language differences; collection sizes for training.

Specificity of Specialties - Inspec: Identifying subspecialties by classification hierarchy, e.g. Computers & Control -- Computer Hardware -- Circuits & Devices. Test documents: 2,425. Specialties: 3.

Specificity of Specialties - Ohsumed: Identifying subspecialties by journal within subject, e.g. Orthopedics -- Clinical Orthopaedics & Related Research journal. Test documents: 745. Specialties: 3.

Exploratory Web Interfaces: Inspec; Ohsumed.

Summary
1. How can specialties be identified in an information collection?
 – Inspec: subject-specific classification
 – Ohsumed: journal specialty area
2. Do specialty dialects really differ?
 – Inspec specialties: term overlap 50%, suggestions overlap 30%
 – Ohsumed specialties: term overlap 30%, suggestions overlap 30%
3. Is performance improved when focusing on specialty dialects?
 – Inspec specialties: 10% improvement over general STR
 – Ohsumed specialties: 25% improvement over general STR
4. How specific should specialties be?
 – Depends on language differences & collection size

Overcoming the Language Problem in Search Search Term Recommender: See also: FIDDLES 50% Discount! Thank you!