Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,

Presentation transcript:

Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital, Medical Informatics Department, Germany Jena University, Language & Information Engineering (JULIE), Germany

Multilingual Text Retrieval
„Korrelation von Hypertonie und Läsion der Weißen Substanz“ → Search Engine → „Correlation of high blood pressure and lesion of the white substance“

Linguistic Phenomena
Morphological processes:
– Inflection: leukocyte <> leukocytes, appendix <> appendices
– Derivation: leukocyte <> leukocytic
– Composition: leuk|em|ia, para|sympath|ectomy, Magen|schleim|haut|entzünd|ung
Synonymy:
– ascorbic acid <> vitamin C, hemorrhage <> bleeding
Spelling variants:
– oesophagus <> esophagus
– Karzinom <> Carcinom <> Carzinom (carcinoma)

Subword Approach
Subwords are atomic, conceptual or linguistic units:
– Stems: stomach, gastr, diaphys
– Prefixes: anti-, bi-, hyper-
– Suffixes: -ary, -ion, -itis
– Infixes: -o-, -s-
Equivalence classes contain synonymous subwords and their translations in a thesaurus:
#female = { woman, women, female, frau, weib, mulher }

Morphosaurus
Subword-Lexicon:
– Organizes subwords in several languages (English, German, Portuguese)
Subword-Thesaurus:
– Groups synonymous subwords (within and between languages)
Subword-Segmenter:
– Extraction of subwords and assignment of equivalence classes
Equivalence classes are referenced by a Morphosaurus Identifier (MID)

Example
Original:
– High TSH values suggest the diagnosis of primary hypothyroidism...
– Erhöhte TSH-Werte erlauben die Diagnose einer primären Hypothyreose...
Orthographic Normalization (Orthographic Rules):
– high tsh values suggest the diagnosis of primary hypothyroidism...
– erhoehte tsh-werte erlauben die diagnose einer primaeren hypothyreose...
Segmenter (Subword Lexicon):
– high tsh value s suggest the diagnos is of primar y hypo thyroid ism
– er hoeh te tsh wert e erlaub en die diagnos e einer primaer en hypo thyre ose
Semantic Normalization (Subword Thesaurus), yielding the Interlingua:
– #up tsh #value #suggest #diagnost #primar #small #thyre
– #up tsh #value #permit #diagnost #primar #small #thyre
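The three processing stages in the example can be sketched in a few lines of Python. This is only a toy illustration: the lexicon and thesaurus below contain just the entries needed for two words of the slide example, and the greedy longest-match strategy in `segment` is an assumption, not the documented behaviour of the Morphosaurus segmenter.

```python
# Toy resources (hypothetical; only entries needed for the slide example).
ORTHO_RULES = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
LEXICON = {"primar", "y", "hypo", "thyroid", "ism",   # English subwords
           "primaer", "en", "thyre", "ose"}           # German subwords
THESAURUS = {"primar": "#primar", "primaer": "#primar",
             "hypo": "#small",
             "thyroid": "#thyre", "thyre": "#thyre"}


def normalize(text):
    """Orthographic normalization: lowercase and rewrite umlauts."""
    text = text.lower()
    for src, dst in ORTHO_RULES.items():
        text = text.replace(src, dst)
    return text


def segment(word, lexicon):
    """Greedy longest-match segmentation (assumed strategy).

    Returns the list of subwords, or None if the word cannot be covered.
    """
    parts, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in lexicon:
                parts.append(word[i:j])
                i = j
                break
        else:
            return None
    return parts


def to_mids(text):
    """Semantic normalization: map each meaningful subword to its MID."""
    mids = []
    for word in normalize(text).split():
        for subword in segment(word, LEXICON) or []:
            if subword in THESAURUS:   # affixes such as "y" carry no MID here
                mids.append(THESAURUS[subword])
    return mids


english = to_mids("primary hypothyroidism")
german = to_mids("primären Hypothyreose")
```

Both calls yield the same MID sequence, which is the property the interlingua relies on. A greedy matcher can fail where a backtracking segmenter would succeed, so a real segmenter would need a more careful search.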

Morphosaurus Search
„Korrelation von Hypertonie und Läsion der Weißen Substanz“ → „#correl #hyper #tens #lesion #whit #matter“ → Search Engine
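Searching on the interlingual layer can be illustrated with a small inverted index over MIDs. The sketch below is an assumption-laden toy: the document collection, the document identifiers and the overlap-count ranking are all hypothetical and stand in for whatever the actual search engine does.

```python
from collections import defaultdict


def build_index(docs):
    """Map each MID to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, mids in docs.items():
        for mid in mids:
            index[mid].add(doc_id)
    return index


def search(index, query_mids):
    """Rank documents by the number of distinct query MIDs they share."""
    scores = defaultdict(int)
    for mid in set(query_mids):
        for doc_id in index.get(mid, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical, already MID-indexed English documents.
DOCS = {
    "en-1": "#correl #hyper #tens #lesion #whit #matter".split(),
    "en-2": "#up tsh #value #suggest #diagnost".split(),
    "en-3": "#lesion #derma".split(),
}
INDEX = build_index(DOCS)

# MID representation of the German query from the slide.
QUERY = "#correl #hyper #tens #lesion #whit #matter".split()
hits = search(INDEX, QUERY)
```

Because query and documents are both reduced to MIDs before matching, the query language no longer matters at retrieval time.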

Automatic Language Acquisition
Automatic acquisition of Spanish and Swedish subword lexicons
Step 1: Generation of cognate seed lexicons:
– Automatic generation of cognate subword candidates
  – Spanish cognates from Portuguese
  – Swedish cognates from English and German
– Selection of subword candidates
– Semantic mapping (linkage to equivalence classes)
– Validation of semantic mappings
Step 2: Use cognate lexicons as a seed for iteratively learning non-cognates

Step 1: Cognate Acquisition
Resources for cognate acquisition for Spanish (from Portuguese) and Swedish (from English and German):
– Portuguese (~14,000 stems), German and English (~22,000 stems each) subword lexicons
– Medical corpora for Portuguese, German, English, Spanish and Swedish acquired from the Web
– Word frequency lists generated from these corpora
– Manually created list of Spanish and Swedish affixes

Generating Cognate Candidates
List of Portuguese subwords (14,004 stems): ... estomag mulher ...
Application of 44 string substitution rules:

Rule (Port. » Span.)   Portuguese example   Spanish example   English equivalent
qua » cua              quadr                cuadr             frame
eia » ena              veia                 vena              vein
ss » s                 fracass              fracas            fail
lh » j                 mulher               mujer             woman
l » ll                 lev                  llev              take
i » y                  ensai                ensay             trial
f » h                  formig               hormig            ant
...

Candidates generated for mulher: mulher, muller, mujer, mulhier, mullier, mujier
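Candidate generation by string substitution can be sketched as follows. Only the seven rules visible on the slide are included (the full set of 44 is not given), and applying each rule optionally to all of its occurrences is an assumption about how the rules combine, so the output differs from the candidate list shown on the slide.

```python
# The seven Portuguese » Spanish rules listed on the slide.
RULES = [
    ("qua", "cua"),
    ("eia", "ena"),
    ("ss", "s"),
    ("lh", "j"),
    ("l", "ll"),
    ("i", "y"),
    ("f", "h"),
]


def cognate_candidates(stem, rules=RULES):
    """Return the stem plus every variant reachable by optionally
    applying each rule (assumed combination strategy)."""
    candidates = {stem}
    for src, dst in rules:
        # Keep the unchanged forms and add the rewritten ones.
        candidates |= {c.replace(src, dst) for c in candidates}
    return candidates


mulher_variants = cognate_candidates("mulher")
quadr_variants = cognate_candidates("quadr")
formig_variants = cognate_candidates("formig")
```

The generator deliberately over-produces: most variants are not Spanish words at all, which is why a separate selection step against corpus frequencies follows.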

Selecting Cognate Candidates
Candidates for Portuguese mulher: mulher, muller, mujer, mulhier, mullier, mujier
Word frequency lists derived from unrelated corpora (m ~ n):
– Portuguese (size = n): ... mulher 45/n ...
– Spanish (size = m): ... mulher 10/m, muller 23/m, mujer 50/m ...
Comparison between word frequency lists:
– Elimination of non-matching subwords
– Choose the cognate alternative with the most similar corpus frequency
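The selection step can be sketched as a comparison of relative frequencies. The counts below are the ones shown on the slide; the corpus sizes are hypothetical placeholders for n and m, and keying the frequency lists directly by stem is a simplifying assumption.

```python
def select_cognate(source_stem, candidates, src_freq, tgt_freq, src_size, tgt_size):
    """Pick the attested candidate whose relative frequency in the target
    corpus is closest to that of the source stem in the source corpus."""
    src_rel = src_freq[source_stem] / src_size
    # Eliminate candidates that never occur in the target corpus.
    attested = sorted(c for c in candidates if c in tgt_freq)
    if not attested:
        return None
    return min(attested, key=lambda c: abs(tgt_freq[c] / tgt_size - src_rel))


# Counts from the slide; sizes are hypothetical, with m ~ n.
PT_FREQ = {"mulher": 45}
ES_FREQ = {"mulher": 10, "muller": 23, "mujer": 50}
N = M = 100_000

CANDIDATES = ["mulher", "muller", "mujer", "mulhier", "mullier", "mujier"]
best = select_cognate("mulher", CANDIDATES, PT_FREQ, ES_FREQ, N, M)
```

With these numbers the Spanish count of 50 is closest to the Portuguese count of 45, so mujer is chosen over the two other attested forms.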

Semantic Mapping
mulher » mujer
#female = { woman, women, female, frau, weib, mulher, mujer }

Language Pair        Source Lexicon   Selected Cognates
Portuguese-Spanish   14,004           8,644
German-Swedish       21,705           6,086 (both Swedish pairs combined)
English-Swedish      21,501

Cognate Validation
Use parallel corpora to identify false friends:
- Portuguese crianc (child) vs. Spanish crianz (breed)
- Portuguese crianc (child) vs. Spanish nin (child)
- Portuguese criac (breed) vs. Spanish crianz (breed)
UMLS Metathesaurus:
- Contains over 2M medical terms and phrases, aligned in various languages
- English has the broadest coverage
- English-Spanish: 60,526 alignments
- English-Swedish: 10,953 alignments
English-Spanish examples:
- „Cell Growth“ <> „Crecimiento Celular“
- „Heart transplantation, with or without recipient cardiectomy“ <> „Transplante cardiaco, con o sin cardiectomia en el receptor.“

Cognate Validation
Use generated cognate seed lexicons to process the UMLS alignments with the Morphosaurus system
Whenever a MID co-occurs on both sides of an alignment unit, the lexicon entry that led to that particular MID is taken to be valid
Candidates that are never confirmed by this procedure are discarded

„Abdominal wall procedure“:
|abdomin|al|   #abdom
|wall|         #wall
|proced|ure|   #operat
„Cirugia de la pared abdominal“:
|cirug|ia|
|pared|        #wall (port. pared)
|abdomin|al|   #abdom (port. abdomin)

Language Pair     Hypotheses   Valid
English-Spanish   8,644        3,230 (37%)
English-Swedish   6,086        1,565 (26%)
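The validation rule (keep a candidate only if its MID also appears on the other side of some alignment) can be sketched like this. The data layout is an assumption: each alignment is given as the set of source-side MIDs plus the target-side (stem, MID) pairs produced with the candidate lexicon, and the `#child` and `#breed` identifiers are hypothetical.

```python
def validate(alignments):
    """Return the target stems whose MID co-occurs on the source side
    of at least one alignment unit."""
    valid = set()
    for source_mids, target_entries in alignments:
        for stem, mid in target_entries:
            if mid in source_mids:
                valid.add(stem)
    return valid


ALIGNMENTS = [
    # "Abdominal wall procedure" / "Cirugia de la pared abdominal"
    ({"#abdom", "#wall", "#operat"},
     [("pared", "#wall"), ("abdomin", "#abdom")]),
    # Hypothetical alignment exposing the false friend crianz:
    # the source is about breeding, the candidate was mapped to #child.
    ({"#breed"}, [("crianz", "#child")]),
]

CANDIDATES = {"pared", "abdomin", "crianz"}
validated = validate(ALIGNMENTS)
discarded = CANDIDATES - validated
```

Candidates that never receive such confirmation end up in `discarded`, which mirrors the drop from 8,644 hypotheses to 3,230 valid entries reported for Spanish.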

Step 2: Bootstrapping
Bootstrapping dictionaries using validated cognate seed lexicons and parallel corpora for acquiring non-cognates

Bootstrapping Algorithm
For every alignment in the UMLS do
    If there is exactly one invalid segmentation in target language
        If there is exactly one more MID in source language
            Take supernumerary MID and invalid segmentation from target
            Restore invalid segmentation and strip off potential affixes
            Add new stem into target lexicon. Link it to source MID.
Repeat all until quiescence

Example 1: „Abdominal wall procedure“ / „Cirugia de la pared abdominal“
Source: |abdomin|al| #abdom, |wall| #wall, |proced|ure| #operat
Target: |cirug|ia| (no MID), |pared| #wall, |abdomin|al| #abdom
New stem: cirug, linked to #operat
#operat = { proced, surgery, operat, prozess, operier, proced, process, metod, cirug }

Example 2: „Skin operations“ / „Cirugia de piel“
Source: |skin| #derma, |operat|ions| #operat
Target: |cirug|ia| #operat, |piel| (no MID)
New stem: piel, linked to #derma
#derma = { derm, cutis, skin, haut, kutis, pele, cutis, piel }

Example 3: „Skin abnormalities“ / „Malformacion de la piel“
Source: |skin| #derma, |abnorm|alities| #anomal
Target: |malformation| (no MID), restored and stripped to |malform|ation|, |piel| #derma
New stem: malform, linked to #anomal
#anomal = { abnorm, anomal, abnorm, anomal, abnorm, anomal, malform }
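The bootstrapping loop can be sketched in Python as below. Several parts are assumptions made for the sake of a runnable example: the stop-word and suffix lists are hypothetical, a word counts as validly segmented when a known stem plus an optional known suffix covers it, and "exactly one more MID in source language" is read as a multiset difference of size one.

```python
from collections import Counter

STOPWORDS = {"de", "la"}            # hypothetical function-word list
SUFFIXES = ["acion", "ia", "al"]    # hypothetical affix list, longest first


def resolve(word, lexicon):
    """Return the MID of a word if a known stem plus an optional
    known suffix covers it, else None."""
    for stem, mid in lexicon.items():
        if word.startswith(stem) and word[len(stem):] in ("", *SUFFIXES):
            return mid
    return None


def strip_suffix(word):
    """Strip the first (longest) matching suffix, if any."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def bootstrap(alignments, seed_lexicon):
    lexicon = dict(seed_lexicon)
    changed = True
    while changed:                              # repeat until quiescence
        changed = False
        for source_mids, target_words in alignments:
            target_mids, unknown = [], []
            for word in target_words:
                if word in STOPWORDS:
                    continue
                mid = resolve(word, lexicon)
                if mid is None:
                    unknown.append(word)
                else:
                    target_mids.append(mid)
            # MIDs present on the source side but not yet on the target side.
            extra = list((Counter(source_mids) - Counter(target_mids)).elements())
            if len(unknown) == 1 and len(extra) == 1:
                lexicon[strip_suffix(unknown[0])] = extra[0]
                changed = True
    return lexicon


SEED = {"pared": "#wall", "abdomin": "#abdom"}
ALIGNMENTS = [
    (["#abdom", "#wall", "#operat"], ["cirugia", "de", "la", "pared", "abdominal"]),
    (["#derma", "#operat"], ["cirugia", "de", "piel"]),
    (["#derma", "#anomal"], ["malformacion", "de", "la", "piel"]),
]
learned = bootstrap(ALIGNMENTS, SEED)
```

Each newly learned stem unlocks further alignments: cirug makes the second alignment solvable, and piel in turn makes the third one solvable, which is the chaining the three slide examples walk through.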

Bootstrapping Results Total: 7,154 Spanish and 4,148 Swedish entries acquired

Evaluation
Process the English-Spanish and English-Swedish UMLS alignments with the Morphosaurus system
Additionally process Spanish-Swedish UMLS alignments
Measures:
– Coverage: At least one MID co-occurs on both sides
– Consistency:
  A: Number of MIDs co-occurring on both sides
  N, M: Number of MIDs occurring on only one side
– Identical Indexes
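The three measures can be sketched over alignment pairs represented as MID sets. Coverage and identical indexes follow the definitions above directly. For consistency the transcript gives only the quantities A, N and M, so the ratio A / (A + N + M) used here is an assumption and may not be the formula the authors used; the sample pairs are hypothetical as well.

```python
def coverage(pairs):
    """Share of alignments where at least one MID occurs on both sides."""
    return sum(1 for src, tgt in pairs if src & tgt) / len(pairs)


def consistency(src, tgt):
    """ASSUMED formula: A / (A + N + M), with A = shared MIDs and
    N, M = MIDs occurring on only one side."""
    a = len(src & tgt)
    n = len(src - tgt)
    m = len(tgt - src)
    total = a + n + m
    return a / total if total else 0.0


def identical_indexes(pairs):
    """Share of alignments whose two sides yield the same MID set."""
    return sum(1 for src, tgt in pairs if src == tgt) / len(pairs)


# Hypothetical MID sets for three alignment units.
PAIRS = [
    ({"#abdom", "#wall"}, {"#abdom", "#wall"}),
    ({"#derma", "#operat"}, {"#operat"}),
    ({"#anomal"}, {"#thyre"}),
]
cov = coverage(PAIRS)
ident = identical_indexes(PAIRS)
cons = [consistency(src, tgt) for src, tgt in PAIRS]
```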

Results

Conclusion
Cross-language document retrieval is based on matching search and document terms on a language-independent, interlingual layer.
A significant number of English, German, and Portuguese subwords can be mapped to Spanish and Swedish cognates using simple string substitution rules.
These cognate seed lexicons are further enlarged with non-cognate subword translations, acquired by bootstrapping over parallel corpora.
The methodology proved useful in a standardized CLIR experimental setting.
Generality of the approach:
– Requires large, aligned corpora
– Eurodicautom (12 languages, 5M entries), Eurovoc (13 languages), OECD, UNESCO, AGROVOC, etc.

Morphosaurus Search

Evaluation
OHSUMED-Corpus (Hersh et al., 1994):
– Subset of MEDLINE
– ~233,000 English documents
– 106 English user queries, additionally translated to German, Portuguese, Spanish and Swedish by medical experts
– Query-document pairs have been manually judged for relevance
Search Engine: Lucene

Evaluation
Baseline: monolingual text retrieval
– (stemmed) English user queries
– (stemmed) English texts
Query translation (QTR)
– Google translator
– Multilingual dictionary compiled from UMLS
Morphosaurus Indexing (MSI)
– Interlingual representation of both user queries and documents

Evaluation Results
[Figure: results for the Top 200, as a percentage of the baseline, in four panels: German (n = 22,385), Portuguese (n = 14,862), Spanish (n = 7,154), Swedish (n = 4,148). Values legible in the transcript, in order: 60%, 78%, 52%, 69%, 40%, 27%, 2% of baseline.]

Semantic Mapping
mulher » mujer
#female = { woman, women, female, frau, weib, mulher, mujer }

Language Pair                           Source Lexicon   Selected Cognates   Linked MIDs
Portuguese-Spanish                      14,004           8,644               6,036
German-Swedish                          21,705           4,249               3,308
English-Swedish                         21,501           4,140               3,208
Combined Swedish Evidence (set union)                    6,086               4,157