WP 10 Multilingual Access Philipp Daumke, Stefan Schulz
Multilingual Access - Rationale
Figure labels: English as First Language; English as Second Language; English as a Foreign Language; No English Language Skills
–< 70 % of the world's scientists read in English
–80 % of the world's electronically stored information is in English
–90 % of the articles in Medline are in English (2000)
Sources: The British Council, 2005; Fung ICH: Open access for the non-English-speaking world: overcoming the language barrier. Emerging Themes in Epidemiology, 2008
Non-native speakers (English as Second Language, English as a Foreign Language)
–Broad range of command of English
–Reading skills > writing skills
–Reduced active vocabulary
–Difficulty in formulating precise queries
Cross-language document retrieval example
German query: Korrelation von Hypertonie und Läsion der Weißen Substanz…
English: Correlation of high blood pressure and lesion of the white substance
BootStrep WP 10 - Multilingual access
Objectives:
–To provide a multilingual search interface to the BootStrep Biolexicon / Bioontology
–We do NOT propose to deliver a multilingual extension of the BootStrep biolexicon
Query languages: French, German, English, (Italian)
Output language: English
Method: Subword-based semantic indexing
Resources:
–MorphoSaurus multilingual subword lexicon & thesaurus
–MorphoSaurus Semantic Indexer
Technique: Morphosemantic Indexing
Subword-based, multilingual semantic indexing for document retrieval
Subwords are atomic, conceptual or linguistic units:
–Stems: stomach, gastr, diaphys
–Prefixes: anti-, bi-, hyper-
–Suffixes: -ary, -ion, -itis
–Infixes: -o-, -s-
Equivalence classes contain synonymous subwords and their translations:
–#derma = { derm, cutis, skin, haut, kutis, pele, cutis, piel, … }
–#inflamm = { inflamm, -itic, -itis, -phlog, entzuend, -itis, -itisch, inflam, flog, inflam, flog, … }
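The equivalence classes above can be pictured as a lookup from subword to class identifier. The following is a minimal sketch only: the dictionary layout and the names `THESAURUS`, `build_subword_index` and `same_concept` are assumptions made for illustration and do not reflect the actual MorphoSaurus data format. Affix hyphens are dropped and the two classes are truncated to the subwords listed on the slide.

```python
# Toy thesaurus: class identifier -> subwords from several languages.
# Contents are copied from the slide; the structure is an assumption.
THESAURUS = {
    "#derma": ["derm", "cutis", "skin", "haut", "kutis", "pele", "piel"],
    "#inflamm": ["inflamm", "itic", "itis", "phlog", "entzuend",
                 "itisch", "inflam", "flog"],
}


def build_subword_index(thesaurus):
    """Invert the thesaurus so each subword points to its class identifier."""
    index = {}
    for class_id, subwords in thesaurus.items():
        for subword in subwords:
            index[subword] = class_id
    return index


SUBWORD_TO_CLASS = build_subword_index(THESAURUS)


def same_concept(subword_a, subword_b):
    """Return True if both subwords are known and share an equivalence class."""
    class_a = SUBWORD_TO_CLASS.get(subword_a)
    class_b = SUBWORD_TO_CLASS.get(subword_b)
    return class_a is not None and class_a == class_b


if __name__ == "__main__":
    print(same_concept("skin", "haut"))   # English and German subword
    print(same_concept("skin", "itis"))   # different classes
```

Because synonyms and translations collapse onto one identifier, a German and an English subword can be compared without translating either one.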
Subword Thesaurus Structure
Segmentation:
–Myo | kard | itis
–Herz | muskel | entzünd | ung
–Inflamm | ation of the heart muscle
Equivalence classes and their subwords:
–MUSCLE: muscle, myo, muskel, muscul
–INFLAMM: inflamm, -itis, inflam, entzünd
–HEART: herz, heart, card, corazon, card
Indexation:
–#muscle #heart #inflamm
–#heart #muscle #inflamm
–#inflamm #heart #muscle
Thesaurus: ~ equivalence classes (MIDs)
Lexicon entries:
–English: ~
–German: ~
–Portuguese: ~
–Spanish: ~
–French: ~
–Swedish: ~
–Italian: ~ 4.000
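The segmentation and indexation steps on this slide can be sketched as follows. This is an illustration under stated assumptions, not the MorphoSaurus implementation: greedy longest-match segmentation is an assumed strategy, and the toy `LEXICON`, the `STOPWORDS` set and the function names are hypothetical. Subwords mapped to `None` (such as the suffixes "ung" and "ation") are treated as carrying no meaning of their own.

```python
# Toy lexicon: subword -> equivalence class identifier (None = no content).
LEXICON = {
    "myo": "#muscle", "muskel": "#muscle", "muscle": "#muscle",
    "kard": "#heart", "herz": "#heart", "heart": "#heart",
    "itis": "#inflamm", "entzünd": "#inflamm", "inflamm": "#inflamm",
    "ung": None, "ation": None,
}
STOPWORDS = {"of", "the"}


def segment(word):
    """Split a word into lexicon subwords using greedy longest match."""
    word = word.lower()
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in LEXICON:
                pieces.append(word[start:end])
                start = end
                break
        else:
            start += 1  # no subword starts here: skip one character
    return pieces


def index_text(text):
    """Map a text to the sequence of class identifiers of its subwords."""
    identifiers = []
    for token in text.split():
        if token.lower() in STOPWORDS:
            continue
        for piece in segment(token):
            class_id = LEXICON[piece]
            if class_id is not None:
                identifiers.append(class_id)
    return identifiers


if __name__ == "__main__":
    for text in ("Myokarditis", "Herzmuskelentzündung",
                 "Inflammation of the heart muscle"):
        print(text, "->", " ".join(index_text(text)))
```

All three inputs yield the same set of identifiers in a different order, which is the property that lets German and English expressions be matched against each other.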
Indexing Pipeline
Subword-based document transformation Morphosemantic indexer
Subword-Based Search
Query: Korrelation von Hypertonie und Läsion der Weißen Substanz…
Index terms: #correl #hyper #tens #lesion #whit #matter
Subword-based query transformation
Query: Korrelation von Hypertonie und Läsion der Weißen Substanz…
Index terms: #correl #hyper #tens #lesion #whit #matter
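Once the query and the documents are both expressed as class identifiers, retrieval no longer depends on the query language. The sketch below illustrates that idea. The query identifiers are taken from the slide; the document collection, the document names and the overlap-based score are assumptions chosen for brevity and are not the ranking function of the actual search engine.

```python
# Identifiers produced for the German query (from the slide).
QUERY_IDS = ["#correl", "#hyper", "#tens", "#lesion", "#whit", "#matter"]

# Hypothetical English documents, already transformed to identifiers.
DOCUMENTS = {
    "doc_a": ["#correl", "#hyper", "#tens", "#lesion", "#whit", "#matter", "#brain"],
    "doc_b": ["#hyper", "#tens", "#therap"],
    "doc_c": ["#gene", "#regul"],
}


def overlap_score(query_ids, doc_ids):
    """Fraction of distinct query identifiers that occur in the document."""
    query_set = set(query_ids)
    if not query_set:
        return 0.0
    return len(query_set & set(doc_ids)) / len(query_set)


def rank(query_ids, documents):
    """Return the ids of matching documents, best score first."""
    scored = []
    for doc_id, doc_ids in documents.items():
        score = overlap_score(query_ids, doc_ids)
        if score > 0:
            scored.append((score, doc_id))
    # Sort by descending score, then by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


if __name__ == "__main__":
    print(rank(QUERY_IDS, DOCUMENTS))
```

A production system would typically weight identifiers (for example by frequency) rather than count plain overlap; the simple score is used here only to keep the example short.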
Adapting Morphosemantic Indexing to BootStrep
BootStrep terminology is mostly disjoint from existing clinical terminology
Enhancement of data resources (e.g. for acronym resolution, multi-term equivalences)
BootStrep terms for multilingual access
–Gene Ontology, InterPro, IntAct, Gene Regulation Ontology, Species
Medline subcorpus (about E. coli gene regulation)
Ongoing/Completed Tasks
Manual training of MorphoSaurus lexica by means of the BootStrep corpora (en, de, fr)
Multilingual Terminology Browser
–2268 GO terms + translations
–6925 InterPro terms + translations
–2082 IntAct terms + translations
–URL:
Multilingual Search Engine:
–Document collection: BootStrep-Medline subset
–Languages: English, German, French
–Query modes: Author, Title, Title + keywords, All
Terminology Browser (screenshot panels: Search Results, Further Information, Navigation)
Terminology Browser
Multilingual Search Engine
To do: Tools and Resources
BootStrep-Browser
–Integration of Species
–Integration of the Gene Regulation Ontology
Multilingual Search Engine
–Multilingual treatment of acronyms
–Inclusion of species synonym list
–Dealing with mixed queries (German-English, English-French)
–Integration with the fact store
Continue lexicon population
–Italian terms?
To do: Evaluation
Creation of a gold standard
–Typical English queries
–Find all relevant documents in the E. coli subset
CLIR experiments
–Translate queries to French and German
–Compare mean average precision
Reuse of existing routines on standard benchmarks (OHSUMED, ImageCLEF)
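Mean average precision, the measure proposed for comparing the CLIR runs, can be computed as sketched below. This follows the standard definition (precision at each rank where a relevant document appears, averaged over all relevant documents, then averaged over queries). The function names and the toy rankings are illustrative assumptions, not project data.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked result list.

    ranking:  document ids in the order returned by the system
    relevant: set of document ids judged relevant in the gold standard
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of the average precision over (ranking, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    # Toy example with two queries (made-up document ids).
    runs = [
        (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
        (["a", "b", "c"], {"b"}),
    ]
    print(round(mean_average_precision(runs), 4))
```

Running the same English gold-standard queries in their French and German translations and comparing the resulting values would show how much retrieval quality is lost by crossing the language boundary.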
ImageCLEFMed Benchmark
Baseline: monolingual
–Stemmed English queries
–Stemmed English texts
Query translation
–Google translator
–Multilingual dictionary compiled from UMLS
Morphosemantic Indexing
–Interlingual representation of user queries and documents
Morphosemantic Indexing
–incorporating disambiguation module
[Chart: Top 20 Average Precision as percent of baseline, by query language: English, German, Portuguese, Spanish, French, Swedish, Average]