Published by Connor Clark. Modified over 10 years ago.
1
WP 10 Multilingual Access
Philipp Daumke, Stefan Schulz
2
Multilingual Access - Rationale
English as First Language; English as Second Language; No English Language Skills; English as a Foreign Language
< 70 % of the world's scientists read in English
80 % of the world's electronically stored information is in English
90 % of the articles in Medline are in English (2000)
Sources:
–The British Council, 2005
–Fung ICH: Open access for the non-English-speaking world: overcoming the language barrier. Emerging Themes in Epidemiology, 2008
3
Non-native speakers (English as Second Language, English as a Foreign Language)
Broad range of command of English
Reading skills > writing skills
Reduced active vocabulary
Difficulty in formulating precise queries
4
Cross-language document retrieval example
German query: Korrelation von Hypertonie und Läsion der Weißen Substanz…
English equivalent: Correlation of high blood pressure and lesion of the white substance
7
BootStrep WP 10 - Multilingual access
Objectives:
–To provide a multilingual search interface to the BootStrep Biolexicon / Bioontology
–We do NOT propose to deliver a multilingual extension of the BootStrep biolexicon
Query languages: French, German, English, (Italian)
Output language: English
Method: Subword-based semantic indexing
Resources:
–MorphoSaurus multilingual subword lexicon & thesaurus
–MorphoSaurus Semantic Indexer
8
Technique: Morphosemantic Indexing
Subword-based, multilingual semantic indexing for document retrieval
Subwords are atomic, conceptual or linguistic units:
–Stems: stomach, gastr, diaphys
–Prefixes: anti-, bi-, hyper-
–Suffixes: -ary, -ion, -itis
–Infixes: -o-, -s-
Equivalence classes contain synonymous subwords and their translations:
–#derma = { derm, cutis, skin, haut, kutis, pele, cutis, piel, … }
–#inflamm = { inflamm, -itic, -itis, -phlog, entzuend, -itis, -itisch, inflam, flog, inflam, flog, … }
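The equivalence classes above can be pictured as a mapping from a class identifier to its member subwords, inverted for lookup. The following is only a minimal sketch: the dictionary layout and the names `EQUIVALENCE_CLASSES`, `SUBWORD_TO_CLASS` and `class_of` are assumptions made for illustration and do not describe the actual MorphoSaurus data format. Only the subwords shown on the slide are included.

```python
# Toy excerpt of the two equivalence classes shown on the slide.
# The dict layout is an assumption for illustration, not the MorphoSaurus format.
EQUIVALENCE_CLASSES = {
    "#derma": {"derm", "cutis", "skin", "haut", "kutis", "pele", "piel"},
    "#inflamm": {"inflamm", "-itic", "-itis", "-phlog", "entzuend",
                 "-itisch", "inflam", "flog"},
}

# Invert the mapping so that each subword points to its class identifier.
SUBWORD_TO_CLASS = {
    subword: class_id
    for class_id, subwords in EQUIVALENCE_CLASSES.items()
    for subword in subwords
}


def class_of(subword):
    """Return the equivalence class of a subword, or None if it is unknown."""
    return SUBWORD_TO_CLASS.get(subword)
```

With this layout, `class_of("haut")` and `class_of("skin")` both yield `#derma`, which is what makes the representation language-independent.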
9
Segmentation:
–Myo | kard | itis
–Herz | muskel | entzünd | ung
–Inflamm | ation of the heart muscle
Subword Thesaurus Structure (Eq Class: subwords)
–MUSCLE: muscle, myo, muskel, muscul
–HEART: herz, heart, card, corazon, card
–INFLAMM: inflamm, -itis, inflam, entzünd
Indexation:
–#muscle #heart #inflamm
–#heart #muscle #inflamm
–#inflamm #heart #muscle
Thesaurus: ~21,000 equivalence classes (MIDs)
Lexicon entries:
–English: ~23,000
–German: ~24,000
–Portuguese: ~15,000
–Spanish: ~11,000
–French: ~8,000
–Swedish: ~10,000
–Italian: ~4,000
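The segmentation and indexation steps on this slide can be sketched with a greedy longest-match segmenter over a toy lexicon. This is an illustration under stated assumptions, not the MorphoSaurus algorithm: the greedy strategy, the names `LEXICON`, `segment` and `index_text`, and the convention that `None` marks subwords without a class (suffixes, function words) are all hypothetical.

```python
# Toy lexicon: subword -> equivalence class, or None for subwords that carry
# no class of their own (suffixes, function words). Hypothetical layout.
LEXICON = {
    "myo": "#muscle", "muskel": "#muscle", "muscle": "#muscle",
    "kard": "#heart", "herz": "#heart", "heart": "#heart",
    "itis": "#inflamm", "entzünd": "#inflamm", "inflamm": "#inflamm",
    "ung": None, "ation": None, "of": None, "the": None,
}


def segment(word):
    """Split a word into lexicon subwords, longest match first, left to right."""
    word = word.lower()
    subwords, pos = [], 0
    while pos < len(word):
        for end in range(len(word), pos, -1):
            if word[pos:end] in LEXICON:
                subwords.append(word[pos:end])
                pos = end
                break
        else:
            pos += 1  # no lexicon entry starts here: skip one character
    return subwords


def index_text(text):
    """Map a text to the sequence of equivalence classes of its subwords."""
    return [
        LEXICON[subword]
        for word in text.split()
        for subword in segment(word)
        if LEXICON[subword] is not None
    ]


print(index_text("Myokarditis"))
print(index_text("Herzmuskelentzündung"))
print(index_text("Inflammation of the heart muscle"))
```

All three inputs reduce to the same three classes, differing only in order, which mirrors the three indexation lines on the slide.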
10
Indexing Pipeline
14
Subword-based document transformation
Morphosemantic indexer
15
Subword-Based Search
German query: Korrelation von Hypertonie und Läsion der Weißen Substanz…
Subword representation: #correl #hyper #tens #lesion #whit #matter
16
Subword-based query transformation
German query: Korrelation von Hypertonie und Läsion der Weißen Substanz…
Subword representation: #correl #hyper #tens #lesion #whit #matter
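Once query and documents are both expressed as equivalence-class identifiers, retrieval no longer depends on the query language. A minimal sketch of that matching step follows, assuming a simple overlap count as the score; the scoring function, the document contents and the names `QUERY_IDS`, `DOCS` and `rank` are hypothetical and not taken from the presentation. Only the query identifiers come from the slide.

```python
# Class identifiers of the German example query, as given on the slide.
QUERY_IDS = ["#correl", "#hyper", "#tens", "#lesion", "#whit", "#matter"]

# Hypothetical, already-indexed documents (invented for illustration).
DOCS = {
    "doc_a": ["#hyper", "#tens", "#whit", "#matter", "#lesion", "#age"],
    "doc_b": ["#heart", "#muscle", "#inflamm"],
    "doc_c": ["#correl", "#hyper", "#tens", "#stroke"],
}


def rank(query_ids, docs):
    """Rank documents by the number of class identifiers shared with the query."""
    query = set(query_ids)
    scored = [(len(query & set(ids)), doc_id) for doc_id, ids in docs.items()]
    hits = [(score, doc_id) for score, doc_id in scored if score > 0]
    # Highest overlap first; ties are broken alphabetically by document id.
    hits.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in hits]


print(rank(QUERY_IDS, DOCS))
```

A production search engine would more plausibly use a weighted ranking function; the overlap count is chosen here only to keep the sketch short.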
17
Adapting Morphosemantic Indexing to BootStrep
BootStrep terminology mostly disjoint from existing clinical terminology
Enhancement of data resources (e.g. for acronym resolution, multi-term equivalences)
BootStrep terms for multilingual access
–Gene Ontology, InterPro, IntAct, Gene Regulation Ontology, Species
Medline subcorpus (about E. coli gene regulation)
18
Ongoing/Completed Tasks
Manual training of MorphoSaurus lexica by means of the BootStrep corpora (en, de, fr)
Multilingual Terminology Browser
–2268 GO terms + translations
–6925 InterPro terms + translations
–2082 IntAct terms + translations
–URL: http://www.medinf.uni-freiburg.de/demo/BootStrepBrowser/
Multilingual Search Engine:
–Document collection: BootStrep-Medline subset
–Languages: English, German, French
–Query modes: Author, Title, Title + keywords, All
19
Terminology Browser Search Results Further Information Navigation
20
Terminology Browser
21
Multilingual Search Engine
22
To do: Tools and Resources
BootStrep-Browser
–Integration of Species
–Integration of the Gene Regulation Ontology
Multilingual Search Engine
–Multilingual treatment of acronyms
–Inclusion of species synonym list
–Dealing with mixed queries (German-English, English-French)
–Integration with the fact store
Continue lexicon population
–Italian terms?
23
To do: Evaluation
Creation of a gold standard
–Typical English queries
–Find all relevant documents in the E. coli subset
CLIR experiments
–Translate queries to French and German
–Compare mean average precision
Reuse of already existing routines on standard benchmarks (OHSUMED, ImageCLEF)
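Mean average precision, the comparison measure named above, can be computed from ranked result lists and gold-standard relevance sets. The sketch below uses the standard definition; the function names and the toy document identifiers in the usage note are assumptions for illustration and are not part of the project's evaluation routines.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked result list against a set of relevant ids."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / position  # precision at this relevant hit
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For example, a ranking `["d1", "d2", "d3"]` with relevant set `{"d1", "d3"}` has average precision (1/1 + 2/3) / 2. Running the same gold-standard queries in English, French and German and comparing the resulting means would give the cross-language comparison the slide describes.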
24
ImageCLEFMed Benchmark
Baseline: monolingual
–Stemmed English queries
–Stemmed English texts
Query translation
–Google translator
–Multilingual dictionary compiled from UMLS
Morphosemantic Indexing
–Interlingual representation of user queries and documents
Morphosemantic Indexing
–incorporating disambiguation module
[Chart: Top 20 Average Precision as Percent of Baseline, by query language: English, German, Portuguese, Spanish, French, Swedish, Average]