Translating Dialects in Search: Mapping between Specialized Languages of Discourse and Documentary Languages Vivien Petras UC Berkeley School of Information.


1 Translating Dialects in Search: Mapping between Specialized Languages of Discourse and Documentary Languages Vivien Petras UC Berkeley School of Information

2 Overcoming the Language Problem in Search How can someone searching for violins be made aware that there are also fiddles (and vice versa)?

3 Outline
– The Language Problem in Information Retrieval
– Dialects & Contexts
– The Search Term Recommender
– 4 Research Questions
– Exploratory Web Interface

4 Information Retrieval
“how to obtain the right information for the right user at the right time” (Chu, 2003)
→ Decision Process under Uncertainty

5 Decision Process under Uncertainty
– Searching for the Needle in the Haystack
– Which Needle in which Haystack
– How to express the Needle and the Haystack
→ Language Problem in Information Retrieval

6 Language Mapping
[Diagram: the Searcher's Concept Space leads to a Question and a Search Statement; the Author's Concept Space leads to a Text and a Document; Search Statement and Document meet at "Match!"]
– Mapping between searcher and IR system
– Mapping between author and IR system
– Mapping between search statement and document

7 IR = Language Mapping Exercise
[Diagram labels: Searcher, Concept Space, Question, Search Statement, Document, Match!, Information Retrieval]
A search statement needs to describe:
– the searcher’s question (information need)
– the documents that are relevant to the searcher’s question

8 The Language Problem
– In Linguistics: → unlimited semiosis
– In Information Science: → inter-indexer inconsistency (20-60%)

9 Dialects and Contexts
How to alleviate language ambiguity?
Ludwig Wittgenstein:
– Language games
– Language regions
→ Language is disambiguated within contexts and specialized dialects.

10 Dialects and Contexts
How to alleviate language ambiguity for search term selection?
Support search term selection:
– Within the dialect of a specialized community
– In context
– Using the language of documents (for term matching)

11 Search Term Recommender
[Diagram labels: Search Statement; Specialty; Did you mean…; Specialty Term; Information Collection]

12 Search Term Recommender

13 The Search Term Recommender Methodology
– Divide information collection by specialty
– Association between
  – specialty terms
  – documentary terms (subject metadata)
– Recommend highly associated terms
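
The three steps on this slide can be sketched roughly as follows. The slides do not say which association measure the recommender uses, so this sketch assumes a plain conditional probability P(descriptor | term) estimated from co-occurrence counts inside a single specialty. The function names and the toy records are hypothetical and only illustrate the idea.

```python
from collections import Counter, defaultdict


def train_recommender(records):
    """Count co-occurrences of free-text terms and subject descriptors.

    records: iterable of (text_terms, descriptors) pairs drawn from ONE
    specialty partition of the collection (assumed input format).
    """
    term_counts = Counter()
    pair_counts = defaultdict(Counter)
    for text_terms, descriptors in records:
        for term in set(text_terms):
            term_counts[term] += 1
            for descriptor in set(descriptors):
                pair_counts[term][descriptor] += 1
    return term_counts, pair_counts


def recommend(query_terms, model, k=3):
    """Return the k descriptors most strongly associated with the query."""
    term_counts, pair_counts = model
    scores = Counter()
    for term in query_terms:
        for descriptor, n in pair_counts.get(term, {}).items():
            # Assumed measure: P(descriptor | term), summed over query terms.
            scores[descriptor] += n / term_counts[term]
    # Sort by score (descending), then alphabetically so ties are stable.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [descriptor for descriptor, _ in ranked[:k]]


# Toy physics partition (invented records, not taken from Inspec).
physics_records = [
    (["protostars", "clusters", "orion"],
     ["Pre-main-sequence stars", "Star clusters"]),
    (["clusters", "galaxies"],
     ["Clusters of galaxies"]),
    (["protostars", "infrared"],
     ["Pre-main-sequence stars", "Infrared sources (astronomical)"]),
]
physics_model = train_recommender(physics_records)
print(recommend(["protostars"], physics_model))
```

Training one such model per specialty, plus one on the whole collection, would give the "specialty" and "general" recommenders that the later slides compare.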

14 The Search Term Recommender: Applications
– Term selection support (query expansion & reformulation)
– Automatic classification
– Terminology mapping

15 The Search Term Recommender - Questions
1. How can specialties & specialty dialects be identified in an information collection?
2. Do specialty dialects really differ?
3. Is performance improved when focusing on specialty dialects?
4. How specific should specialties be?
→ Tested on 2 bibliographic collections:
– Inspec
– Medline (Ohsumed collection)

16 Test collection: Inspec
Physics, Electrical and Electronic Engineering, Computers and Control
Document: author, title, source, publication year, abstract, Inspec thesaurus descriptors, Inspec classification codes
Number of documents: 427,340
Descriptors / Document: 6.99

17 Test collection: Medline Ohsumed Collection
Biomedicine and Health
Document: author, title, source, publication year, publication type, abstract, Mesh Headings
Number of documents: 168,463
Mesh Headings / Document: 3.11

18 The Search Term Recommender System - Questions
1. How can specialties be identified in an information collection?
2. Do specialty dialects really differ?
3. Is performance improved when focusing on specialty dialects?
4. How specific should specialties be?
→ Tested on 2 bibliographic collections:
– Inspec
– Medline (Ohsumed collection)

19 Determine specialty documents in the collection:
– Domain terminology
– Publication source
– Bibliometric analysis
– Social network analysis
– Subject-specific classification

20 Identification of Specialties in an Information Collection
Inspec test collection: by top-level categories in the Inspec classification
– 3 specialties: Physics, Electrical & Electronic Engineering, Computers & Control
Ohsumed test collection: by journals grouped by subject
– 33 specialties
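
A minimal sketch of the Inspec split described on this slide, assuming each record carries a list of classification codes whose first letter names the top-level category. The letter-to-specialty mapping, the record layout, and the example codes are assumptions made for illustration, not details taken from the slides.

```python
from collections import defaultdict

# Assumed mapping from the leading letter of a classification code
# to the three specialties named on the slide.
TOP_LEVEL = {
    "A": "Physics",
    "B": "Electrical & Electronic Engineering",
    "C": "Computers & Control",
}


def split_by_specialty(records):
    """Group document ids by the top-level category of their codes.

    A document with codes from several top-level categories is placed
    in every matching specialty.
    """
    partitions = defaultdict(list)
    for record in records:
        specialties = {
            TOP_LEVEL[code[0]]
            for code in record["codes"]
            if code and code[0] in TOP_LEVEL
        }
        for specialty in specialties:
            partitions[specialty].append(record["id"])
    return dict(partitions)


# Hypothetical records; ids and codes are invented.
records = [
    {"id": "d1", "codes": ["A9710", "A9840"]},
    {"id": "d2", "codes": ["C6130", "B1265"]},
    {"id": "d3", "codes": ["C4240"]},
]
partitions = split_by_specialty(records)
for specialty in sorted(partitions):
    print(specialty, partitions[specialty])
```

For Ohsumed, the same grouping would key on a journal-to-subject table instead of a code prefix.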

21 The Search Term Recommender System - Questions
1. How can specialties be identified in an information collection?
2. Do specialty dialects really differ?
3. Is performance improved when focusing on specialty dialects?
4. How specific should specialties be?
→ Tested on 2 bibliographic collections:
– Inspec
– Medline (Ohsumed collection)

22 Differences in Language
– Differences in specialty dialects (specialty term overlap)
– Differences in documentary languages (subject metadata term overlap)
– Differences in search term recommender suggestions (term suggestion overlap)

23 Inspec Dialects (specialty term overlap)
Terms analyzed: 60,601
Subject metadata term overlap: 87%
Suggested term overlap: 30%

24 Ohsumed Dialects (specialty term overlap)
Terms analyzed: 11,663
Subject metadata term overlap: 32%
Suggested term overlap: 30%
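
The slides report overlap percentages without defining the measure. One plausible reading, used here purely as an assumption, is the Jaccard ratio: the number of terms two specialties share divided by the number of distinct terms they use together. The vocabularies below are invented.

```python
def term_overlap(terms_a, terms_b):
    """Jaccard overlap between two vocabularies (assumed measure)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Toy vocabularies for two specialties (hypothetical terms).
physics_terms = {"clusters", "clouds", "stars", "field"}
computing_terms = {"clusters", "clouds", "search", "field", "network"}

overlap = term_overlap(physics_terms, computing_terms)
print(f"{overlap:.0%}")  # 3 shared terms out of 6 distinct terms
```

The same function applies to all three comparisons on slide 22: specialty terms, subject metadata terms, and recommender suggestions.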

25 The Search Term Recommender System - Questions
1. How can specialties be identified in an information collection?
2. Do specialty dialects really differ?
3. Is performance improved when focusing on specialty dialects?
4. How specific should specialties be?
→ Tested on 2 bibliographic collections:
– Inspec
– Medline (Ohsumed collection)

26 Automatic classification
Comparison: specialty vs. general term suggestions

27 Automatic Classification
Title: “A search for clusters of protostars in Orion cloud cores”
Originally assigned terms:
1. Infrared sources (astronomical)
2. Interstellar molecular clouds
3. Pre-main-sequence stars
4. Star associations
Specialty Search Term Recommender:
1. Clouds
2. Clusters of galaxies
3. Interstellar molecular clouds
4. Star clusters
5. Pre-main-sequence stars
General Search Term Recommender:
1. Search problems
2. Clouds
3. Atomic clusters
4. Clusters of galaxies
5. Interstellar molecular clouds
Evaluation:
Recall (hit rate): specialty 2/4 = 0.5; general 1/4 = 0.25
Precision (accuracy): specialty 2/5 = 0.4; general 1/5 = 0.2
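
The evaluation on this slide can be reproduced with a few lines of code. The term lists are copied from the slide; the function name is hypothetical. Recall is the share of originally assigned terms that the recommender suggested, and precision is the share of suggested terms that were originally assigned.

```python
def recall_precision(assigned, suggested):
    """Return (recall, precision) of suggested terms against assigned terms."""
    assigned_set = set(assigned)
    hits = sum(1 for term in suggested if term in assigned_set)
    return hits / len(assigned), hits / len(suggested)


assigned = [
    "Infrared sources (astronomical)",
    "Interstellar molecular clouds",
    "Pre-main-sequence stars",
    "Star associations",
]
specialty_suggestions = [
    "Clouds",
    "Clusters of galaxies",
    "Interstellar molecular clouds",
    "Star clusters",
    "Pre-main-sequence stars",
]
general_suggestions = [
    "Search problems",
    "Clouds",
    "Atomic clusters",
    "Clusters of galaxies",
    "Interstellar molecular clouds",
]

specialty_scores = recall_precision(assigned, specialty_suggestions)
general_scores = recall_precision(assigned, general_suggestions)
print(specialty_scores)  # (0.5, 0.4)
print(general_scores)    # (0.25, 0.2)
```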

28 Performance of the STR: Inspec
Test Documents: 42,735
Specialties: 3
First 3 suggested: Recall: 13.6%, Precision: 11.2%

29 Performance of the STR: Ohsumed
Test Documents: 18,733
Specialties: 33
First 3 suggested: Recall: 26%, Precision: 25.6%
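
Figures such as "first 3 suggested" recall and precision are typically averages over all test documents at a fixed cutoff. The sketch below assumes a simple per-document average (macro-average) at cutoff k = 3; the slides do not state how the averages were computed, and the test data here is invented.

```python
def evaluate_at_k(test_docs, k=3):
    """Average recall and precision over documents, using the top k suggestions.

    test_docs: list of (assigned_terms, ranked_suggestions) pairs
    (assumed input format).
    """
    recalls, precisions = [], []
    for assigned, ranked in test_docs:
        top = ranked[:k]
        if not assigned or not top:
            continue  # skip documents that cannot be scored
        hits = len(set(assigned) & set(top))
        recalls.append(hits / len(assigned))
        precisions.append(hits / len(top))
    if not recalls:
        return 0.0, 0.0
    return sum(recalls) / len(recalls), sum(precisions) / len(precisions)


# Two hypothetical test documents with placeholder term names.
test_docs = [
    (["a", "b", "c", "d"], ["a", "x", "b", "y"]),  # 2 hits in the top 3
    (["e", "f"], ["z", "e", "q"]),                  # 1 hit in the top 3
]
mean_recall, mean_precision = evaluate_at_k(test_docs, k=3)
print(round(mean_recall, 3), round(mean_precision, 3))
```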

30 The Search Term Recommender System - Questions
1. How can specialties be identified in an information collection?
2. Do specialty dialects really differ?
3. Is performance improved when focusing on specialty dialects?
4. How specific should specialties be?
→ Tested on 2 bibliographic collections:
– Inspec
– Medline (Ohsumed collection)

31 Specificity of Specialties
– Language differences
– Collection sizes for training

32 Specificity of Specialties - Inspec
Identifying subspecialties by classification hierarchy
– e.g. Computers & Control -- Computer Hardware -- Circuits & Devices
Test documents: 2425
Specialties: 3

33 Specificity of Specialties - Ohsumed
Identifying subspecialties by journal within subject
– e.g. Orthopedics -- Clinical Orthopaedics & Related Research journal
Test documents: 745
Specialties: 3

34 Exploratory Web Interfaces
Inspec: http://metadata.sims.berkeley.edu/str/inspec/inspec.html
Ohsumed: http://metadata.sims.berkeley.edu/str/ohsumed/ohsumed.html

35 Summary
1. How can specialties be identified in an information collection?
– Inspec: subject-specific classification
– Ohsumed: journal specialty area
2. Do specialty dialects really differ?
– Inspec specialties: term overlap 50%, suggestions overlap 30%
– Ohsumed specialties: term overlap 30%, suggestions overlap 30%
3. Is performance improved when focusing on specialty dialects?
– Inspec specialties: 10% improvement over general STR
– Ohsumed specialties: 25% improvement over general STR
4. How specific should specialties be?
– Depends: on language differences & collection size

36 Overcoming the Language Problem in Search
Search Term Recommender: See also: FIDDLES
50% Discount!
Thank you!
vivienp@sims.berkeley.edu

