Published by Kenneth Parks. Modified over 9 years ago.
1
Translating Dialects in Search: Mapping between Specialized Languages of Discourse and Documentary Languages
Vivien Petras
UC Berkeley School of Information
2
Overcoming the Language Problem in Search
How can someone searching for violins be made aware that there are also fiddles (and vice versa)?
3
Outline
– The Language Problem in Information Retrieval
– Dialects & Contexts
– The Search Term Recommender
– 4 Research Questions
– Exploratory Web Interface
4
Information Retrieval
“how to obtain the right information for the right user at the right time” (Chu, 2003)
Decision Process under Uncertainty
5
Searching the Needle in the Haystack
– Which Needle in which Haystack
– How to express the Needle and the Haystack
Language Problem in Information Retrieval
Decision Process under Uncertainty
6
Language Mapping
[Diagram labels: Searcher, Concept Space, Question, Search Statement; Author, Concept Space, Text, Document; Match!]
– Mapping between searcher and IR system
– Mapping between author and IR system
– Mapping between search statement and document
7
IR = Language Mapping Exercise
[Diagram labels: Searcher, Concept Space, Question, Search Statement, Document, Match!, Information Retrieval]
A search statement needs to describe the:
– searcher’s question (information need)
– documents that are relevant to a searcher’s question
8
The Language Problem
– In Linguistics: unlimited semiosis
– In Information Science: inter-indexer inconsistency (20-60%)
9
Dialects and Contexts
How to alleviate language ambiguity?
Ludwig Wittgenstein:
– Language games
– Language regions
Language is disambiguated within contexts and specialized dialects.
10
Dialects and Contexts
How to alleviate language ambiguity for search term selection?
Support search term selection:
– Within the dialect of a specialized community
– In context
– Using the language of documents (for term matching)
11
Search Term Recommender
[Diagram labels: Search Statement, Specialty, Did you mean…, Specialty Term, Information Collection]
12
Search Term Recommender
13
The Search Term Recommender Methodology
Divide information collection by specialty
Association between
– specialty terms
– documentary terms (subject metadata)
Recommend highly associated terms
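The three methodology steps can be sketched in a few lines of Python. This is a minimal illustration, not the system described in the slides: the slides do not name the association measure, so the sketch assumes a simple conditional co-occurrence frequency, and the function names (`train_recommender`, `recommend`) and the toy documents are hypothetical.

```python
from collections import Counter, defaultdict

def train_recommender(documents):
    """Count co-occurrences of text terms and subject descriptors.

    `documents` is an iterable of (text_terms, descriptors) pairs.
    """
    pair_counts = defaultdict(Counter)  # text term -> descriptor -> co-occurrence count
    term_counts = Counter()             # text term -> number of documents containing it
    for text_terms, descriptors in documents:
        for term in set(text_terms):
            term_counts[term] += 1
            for descriptor in set(descriptors):
                pair_counts[term][descriptor] += 1
    return pair_counts, term_counts

def recommend(query_terms, model, top_n=3):
    """Rank descriptors by summed conditional frequency (an assumed measure)."""
    pair_counts, term_counts = model
    scores = Counter()
    for term in query_terms:
        if term not in term_counts:
            continue  # unseen term: nothing to recommend from it
        for descriptor, n in pair_counts[term].items():
            scores[descriptor] += n / term_counts[term]
    return [descriptor for descriptor, _ in scores.most_common(top_n)]

# Hypothetical toy collection, already divided by specialty.
collection = {
    "Physics": [
        (["protostars", "clusters", "orion"], ["Star clusters", "Pre-main-sequence stars"]),
        (["clusters", "galaxies"], ["Clusters of galaxies"]),
        (["protostars", "infrared"], ["Pre-main-sequence stars"]),
    ],
}

# One recommender per specialty, trained only on that specialty's documents.
models = {name: train_recommender(docs) for name, docs in collection.items()}
suggestions = recommend(["protostars"], models["Physics"])
print(suggestions)
```

A general recommender, as used for comparison later in the talk, would be the same model trained on all documents at once instead of on one specialty.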
14
The Search Term Recommender: Applications
– Term selection support (query expansion & reformulation)
– Automatic classification
– Terminology mapping
15
The Search Term Recommender - Questions
1. How can specialties & specialty dialects be identified in an information collection?
2. Do specialty dialects really differ?
3. Is performance improved when focusing on specialty dialects?
4. How specific should specialties be?
Tested on 2 bibliographic collections:
– Inspec
– Medline (Ohsumed collection)
16
Test collection: Inspec
Physics, Electrical and Electronic Engineering, Computers and Control
Document: author, title, source, publication year, abstract, Inspec thesaurus descriptors, Inspec classification codes
Number of documents: 427,340
Descriptors / Document: 6.99
17
Test collection: Medline Ohsumed Collection
Biomedicine and Health
Document: author, title, source, publication year, publication type, abstract, Mesh Headings
Number of documents: 168,463
Mesh Headings / Document: 3.11
18
The Search Term Recommender System - Questions
1. How can specialties be identified in an information collection?
2. Do specialty dialects really differ?
3. Is performance improved when focusing on specialty dialects?
4. How specific should specialties be?
Tested on 2 bibliographic collections:
– Inspec
– Medline (Ohsumed collection)
19
Determine specialty documents in the collection:
– Domain terminology
– Publication source
– Bibliometric analysis
– Social network analysis
– Subject-specific classification
20
Identification of Specialties in an Information Collection
Inspec test collection by top-level categories in the Inspec classification
– 3 specialties: Physics, Electrical & Electronic Engineering, Computers & Control
Ohsumed test collection by journals grouped by subject
– 33 specialties
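Splitting a collection by top-level classification category might look like the sketch below. The letter-to-specialty mapping, the record layout, and the example codes are assumptions made for illustration; the slides only state that the top-level Inspec categories were used.

```python
from collections import defaultdict

# Assumed mapping from the first character of a classification code to a
# top-level specialty; illustrative only, not taken from the slides.
TOP_LEVEL = {
    "A": "Physics",
    "B": "Electrical & Electronic Engineering",
    "C": "Computers & Control",
}

def split_by_specialty(records):
    """Group document ids by the specialties implied by their classification codes."""
    groups = defaultdict(set)
    for doc_id, codes in records:
        for code in codes:
            specialty = TOP_LEVEL.get(code[:1])
            if specialty is not None:
                groups[specialty].add(doc_id)
    return groups

# Hypothetical records: (document id, classification codes).
records = [
    ("doc1", ["A9710", "A9840"]),
    ("doc2", ["B1265", "C5120"]),
    ("doc3", ["C6130"]),
]
groups = split_by_specialty(records)
for specialty, doc_ids in sorted(groups.items()):
    print(specialty, sorted(doc_ids))
```

In this sketch a document with codes from several top-level categories lands in several specialties. For a journal-based split such as the Ohsumed one, a journal-to-subject lookup table would take the place of `TOP_LEVEL`.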
21
The Search Term Recommender System - Questions
1. How can specialties be identified in an information collection?
2. Do specialty dialects really differ?
3. Is performance improved when focusing on specialty dialects?
4. How specific should specialties be?
Tested on 2 bibliographic collections:
– Inspec
– Medline (Ohsumed collection)
22
Differences in Language
– Differences in specialty dialects (specialty term overlap)
– Differences in documentary languages (subject metadata term overlap)
– Differences in search term recommender suggestions (term suggestion overlap)
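One way to put a number on "overlap" between specialties is sketched below. The slides do not say how overlap was computed, so the definition used here (the share of distinct terms that occur in more than one specialty) is an assumption, as are the function name `overlap_share` and the toy vocabularies.

```python
from collections import Counter

def overlap_share(term_sets):
    """Share of distinct terms that occur in more than one specialty."""
    counts = Counter(term for terms in term_sets.values() for term in set(terms))
    shared = sum(1 for n in counts.values() if n > 1)
    return shared / len(counts) if counts else 0.0

# Hypothetical vocabularies for two specialties.
vocabularies = {
    "Physics": {"clusters", "clouds", "plasma"},
    "Computers & Control": {"clusters", "networks", "search"},
}
share = overlap_share(vocabularies)
print(f"{share:.0%} of terms are shared")
```

The same function could be applied to specialty terms, to subject metadata terms, or to the sets of suggested terms, which are the three comparisons listed on this slide.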
23
Inspec Dialects (specialty term overlap)
Terms analyzed: 60,601
Subject metadata term overlap: 87%
Suggested term overlap: 30%
24
Ohsumed Dialects (specialty term overlap)
Terms analyzed: 11,663
Subject metadata term overlap: 32%
Suggested term overlap: 30%
25
The Search Term Recommender System - Questions
1. How can specialties be identified in an information collection?
2. Do specialty dialects really differ?
3. Is performance improved when focusing on specialty dialects?
4. How specific should specialties be?
Tested on 2 bibliographic collections:
– Inspec
– Medline (Ohsumed collection)
26
Comparison: specialty vs. general term suggestions
– Automatic classification
27
Automatic Classification
Title: “A search for clusters of protostars in Orion cloud cores”
Originally assigned terms:
1. Infrared sources (astronomical)
2. Interstellar molecular clouds
3. Pre-main-sequence stars
4. Star associations
Specialty Search Term Recommender:
1. Clouds
2. Clusters of galaxies
3. Interstellar molecular clouds
4. Star clusters
5. Pre-main-sequence stars
General Search Term Recommender:
1. Search problems
2. Clouds
3. Atomic clusters
4. Clusters of galaxies
5. Interstellar molecular clouds
Evaluation
– Recall (hit rate): specialty 2/4 = 0.5, general 1/4 = 0.25
– Precision (accuracy): specialty 2/5 = 0.4, general 1/5 = 0.2
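The recall and precision figures on this slide can be reproduced with a short calculation. The sketch below uses the term lists from the slide; the function name `recall_precision` is hypothetical, and exact string matching between suggested and assigned terms is assumed.

```python
def recall_precision(assigned, suggested):
    """Return (recall, precision) of suggested terms against assigned terms."""
    hits = len(set(assigned) & set(suggested))
    return hits / len(assigned), hits / len(suggested)

assigned = [
    "Infrared sources (astronomical)",
    "Interstellar molecular clouds",
    "Pre-main-sequence stars",
    "Star associations",
]
specialty = [
    "Clouds",
    "Clusters of galaxies",
    "Interstellar molecular clouds",
    "Star clusters",
    "Pre-main-sequence stars",
]
general = [
    "Search problems",
    "Clouds",
    "Atomic clusters",
    "Clusters of galaxies",
    "Interstellar molecular clouds",
]

specialty_scores = recall_precision(assigned, specialty)
general_scores = recall_precision(assigned, general)
print(specialty_scores)  # (0.5, 0.4)
print(general_scores)    # (0.25, 0.2)
```

Recall divides the number of matching terms by the number of originally assigned terms, while precision divides it by the number of suggested terms, which is why the two values differ for the same number of hits.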
28
Performance of the STR: Inspec
Test documents: 42,735
Specialties: 3
First 3 suggested:
– Recall: 13.6%
– Precision: 11.2%
29
Performance of the STR: Ohsumed
Test documents: 18,733
Specialties: 33
First 3 suggested:
– Recall: 26%
– Precision: 25.6%
30
The Search Term Recommender System - Questions
1. How can specialties be identified in an information collection?
2. Do specialty dialects really differ?
3. Is performance improved when focusing on specialty dialects?
4. How specific should specialties be?
Tested on 2 bibliographic collections:
– Inspec
– Medline (Ohsumed collection)
31
Specificity of Specialties
– Language differences
– Collection sizes for training
32
Specificity of Specialties - Inspec
Identifying subspecialties by classification hierarchy
– e.g. Computers & Control -- Computer Hardware -- Circuits & Devices
Test documents: 2425
Specialties: 3
33
Specificity of Specialties - Ohsumed
Identifying subspecialties by journal within subject
– e.g. Orthopedics -- Clinical Orthopaedics & Related Research journal
Test documents: 745
Specialties: 3
34
Exploratory Web Interfaces
Inspec: http://metadata.sims.berkeley.edu/str/inspec/inspec.html
Ohsumed: http://metadata.sims.berkeley.edu/str/ohsumed/ohsumed.html
35
Summary
1. How can specialties be identified in an information collection?
– Inspec: subject-specific classification
– Ohsumed: journal specialty area
2. Do specialty dialects really differ?
– Inspec specialties: term overlap 50%, suggestions overlap 30%
– Ohsumed specialties: term overlap 30%, suggestions overlap 30%
3. Is performance improved when focusing on specialty dialects?
– Inspec specialties: 10% improvement over general STR
– Ohsumed specialties: 25% improvement over general STR
4. How specific should specialties be?
– Depends on language differences & collection size
36
Overcoming the Language Problem in Search
Search Term Recommender:
See also: FIDDLES 50% Discount!
Thank you!
vivienp@sims.berkeley.edu