Multilingual Access to Biomedical Documents Stefan Schulz, Philipp Daumke Institute of Medical Biometry and Medical Informatics University Medical Center.

Slides:



Advertisements
Similar presentations
WP 10 Multilingual Access Philipp Daumke, Stefan Schulz.
Advertisements

Complex queries in the PATENTSCOPE search system Cyberspace September 2013 Sandrine Ammann Marketing & Communications Officer.
Automatic Mapping of Clinical Documentation to SNOMED CT Holger Stenzhorn Saarland University Hospital, Homburg, Germany Edson Pacheco Percy Nohama Stefan.
UCLA : GSE&IS : Department of Information StudiesJF : 276lec1.ppt : 5/2/2015 : 1 I N F S I N F O R M A T I O N R E T R I E V A L S Y S T E M S Week.
Resources to Answer Questions eModule 2 LSI Curriculum, Year 1 Content Authors Stephanie Schulte, MLIS, Assistant Professor, Health Sciences Library Carol.
Sheffield at ImageCLEF 2003 Paul Clough and Mark Sanderson.
CSE3201/CSE4500 Information Retrieval Systems Introduction to Information Retrieval.
Overview of Collaborative Information Retrieval (CIR) at FIRE 2012 Debasis Ganguly, Johannes Leveling, Gareth Jones School of Computing, CNGL, Dublin City.
HISA ltd. Biography proforma MEDINFO Lygon Street, Brunswick East 3057 Australia Presenter Name: Stefan Schulz Country:1. Germany, 2. Brazil Qualification(s):
The Challenges of Multilingual Search Paul Clough The Information School University of Sheffield ISKO UK conference 8-9 July 2013.
Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus HISA ltd. Biography proforma MEDINFO Lygon Street, Brunswick East.
Semantic Annotation for Multilingual Search Shibamouli Lahiri
Alex Wade, Microsoft Research And acknowledging Walter L. Warnick, Ph.D. Director, Office of Scientific and Technical Information U.S. Department of Energy.
Automatic Lexicon Acquisition for a Medical Cross-Language Information Retrieval System Kornél Markó, Stefan Schulz, Udo Hahn Freiburg University Hospital,
Advance Information Retrieval Topics Hassan Bashiri.
ITCS 6010 Natural Language Understanding. Natural Language Processing What is it? Studies the problems inherent in the processing and manipulation of.
1 Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang, Assistant Professor Dept. of Computer Science & Information Engineering National Central.
CSE 730 Information Retrieval of Biomedical Data The use of medical lexicon in biomedical IR.
Cross-Language Retrieval INST 734 Module 11 Doug Oard.
 Official Site: facility.org/research/evaluation/clef-ip-10http:// facility.org/research/evaluation/clef-ip-10.
Search is not only about the Web An Overview on Printed Documents Search and Patent Search Walid Magdy Centre for Next Generation Localisation School of.
Evaluating the Contribution of EuroWordNet and Word Sense Disambiguation to Cross-Language Information Retrieval Paul Clough 1 and Mark Stevenson 2 Department.
The use of machine translation tools for cross-lingual text-mining Blaz Fortuna Jozef Stefan Institute, Ljubljana John Shawe-Taylor Southampton University.
August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.
1 Intra- and interdisciplinary cross- concordances for information retrieval Philipp Mayr GESIS – Leibniz Institute for the Social Sciences, Bonn, Germany.
1 The Domain-Specific Track at CLEF 2008 Vivien Petras & Stefan Baerisch GESIS Social Science Information Centre, Bonn, Germany Aarhus, Denmark, September.
FishBase Summary Page about Salmo salar in the standard Language of FishBase (English) ENBI-WP-11: Multilingual Access to European Biodiversity Sites through.
PATENTSCOPE Patent Search Strategies and Techniques Andrew Czajkowski Head, Innovation and Technology Support Section Centurion September 11, 2014.
Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large- scale Data Collections Xuan-Hieu PhanLe-Minh NguyenSusumu Horiguchi GSIS,
ICS-FORTH January 11, Thesaurus Mapping Martin Doerr Foundation for Research and Technology - Hellas Institute of Computer Science Bath, UK, January.
Jennie Ning Zheng Linda Melchor Ferhat Omur. Contents Introduction WordNet Application – WordNet Data Structure - WordNet FrameNet Application – FrameNet.
The PATENTSCOPE search system: CLIR February 2013 Sandrine Ammann Marketing & Communications Officer.
The CLEF 2003 cross language image retrieval task Paul Clough and Mark Sanderson University of Sheffield
Information Retrieval and Web Search Cross Language Information Retrieval Instructor: Rada Mihalcea Class web page:
Image Retrieval in Radiology: The ImageCLEF Challenge Charles E. Kahn, Jr. Medical College of Wisconsin Jayashree Kalpathy-Cramer Oregon Health & Science.
Cross-Language Evaluation Forum (CLEF) IST Expected Kick-off Date: August 2001 Carol Peters IEI-CNR, Pisa, Italy Carol Peters: blabla Carol.
MIRACLE Multilingual Information RetrievAl for the CLEF campaign DAEDALUS – Data, Decisions and Language, S.A. Universidad Carlos III de.
IIIT Hyderabad’s CLIR experiments for FIRE-2008 Sethuramalingam S & Vasudeva Varma IIIT Hyderabad, India 1.
A School of Information Science, Federal University of Minas Gerais, Brazil b Medical University of Graz, Austria, c University Medical Center Freiburg,
The UNESCO Thesaurus Meeting for Managers of UNESCO Documentation Networks Meron Ewketu UNESCO Library June
Stefan Schulz, Kornél Markó, Philipp Daumke, Udo Hahn, Susanne Hanser, Percy Nohama, Roosewelt Leite de Andrade, Edson Pacheco, Martin Romacker Semantic.
WorldWideScience.org: An International Knowledge-Sharing Model Brian A. Hitson Office of Scientific & Technical Information U.S. Department of Energy.
Translating Dialects in Search: Mapping between Specialized Languages of Discourse and Documentary Languages Vivien Petras UC Berkeley School of Information.
Clarity Cross-Lingual Document Retrieval, Categorisation and Navigation Based on Distributed Services
A Repetition Based Measure for Verification of Text Collections and for Text Categorization Dmitry V.Khmelev Department of Mathematics, University of Toronto.
Layered MorphoSaurus Lexicon Extension. Problem Confuse and arbitrary synonym classes of non-medical concepts High ambiguity of general (non- terminological)
Customization in the PATENTSCOPE search system Cyberworld November 2013 Sandrine Ammann Marketing & communications officer.
11/23/00UNU/IAS/UNL Centre1 The Universal Networking Language United Nations University Institute of Advanced Studies United Networking Language ® UNU/IAS.
Acceso a la información mediante exploración de sintagmas Anselmo Peñas, Julio Gonzalo y Felisa Verdejo Dpto. Lenguajes y Sistemas Informáticos UNED III.
Combining Text and Image Queries at ImageCLEF2005: A Corpus-Based Relevance-Feedback Approach Yih-Cheng Chang Department of Computer Science and Information.
Multilingual Information Retrieval using GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of Kaohsiung.
CS798: Information Retrieval Charlie Clarke Information retrieval is concerned with representing, searching, and manipulating.
Multilingual Search Shibamouli Lahiri
Detection of underspecifications in SNOMED CT concept definitions using language processing 1 Federal Technical University of Paraná (UTFPR), Curitiba,
Cross Language Information Exploitation of Arabic Dr. Elizabeth D. Liddy Center for Natural Language Processing School of Information Studies Syracuse.
A Multilingual Hierarchy Mapping Method Based on GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of.
1 The Domain-Specific Track at CLEF 2007 Vivien Petras, Stefan Baerisch & Max Stempfhuber GESIS Social Science Information Centre, Bonn, Germany Budapest,
Analysis of Experiments on Hybridization of different approaches in mono and cross-language information retrieval DAEDALUS – Data, Decisions and Language,
Manual Query Modification and Automated Translation to Improve Cross-language Medical Image Retrieval Jeffery R. Jensen William R. Hersh Department of.
Large-Scale Evaluation of a Medical Cross- Language Information Retrieval System Kornél Markó 1,2, Philipp Daumke 1,2, Stefan Schulz 2, Rüdiger Klar 2,
Assessing SNOMED CT for Large Scale eHealth Deployments in the EU Workpackage 2- Building new Evidence Daniel Karlsson, Linköping University Stefan Schulz,
UNIFIED MEDICAL LANGUAGE SYSTEMS (UMLS)
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
Stefan Schulz Medical Informatics Research Group
Multilingual Biomedical Dictionary
What is “Biomedical Informatics”?
Morphoogle - A Multilingual Interface to a Web Search Engine
CSE 635 Multimedia Information Retrieval
Presentation transcript:

Multilingual Access to Biomedical Documents Stefan Schulz, Philipp Daumke Institute of Medical Biometry and Medical Informatics University Medical Center Freiburg, Germany Averbis GmbH, Freiburg, Germany

Cross-language document retrieval in life sciences and health care The technique of morphosemantic document indexing Evaluation of morphosemantic indexing Cross-language document retrieval in practice

Cross-language document retrieval in life sciences and health care The technique of morphosemantic document indexing Evaluation of morphosemantic indexing Cross-language document retrieval in practice

Biomedical Documents Design Accessibility Bridging the gap between user groups and information sources Electronic Patient Records Textbook Information Product Information Experimental Reports Scientific Publications Websites Laypersons’ language Experts’ language Subdomain-related jargon Language in the lab / in scientific publication Language of the health record Writer’s / Reader’s native / 2 nd / foreign language Health Professionals Researchers Sales / Marketing Consumers User Interface Heterogeneous User Groups Language Variability Heterogeneous Document Types Biomedical Document Retrieval

The Status of English Language English as First Language English as Second Language No English Language Skills English as a Foreign Language < 70 % of the world's scientists read in English 80 % of the world's electronically stored information is in English 90 % English articles in Medline (2000) Sources: The British Council, 2005 Fung ICH: Open access for the non-English-speaking world: overcoming the language barrier. Emerging Themes in Epidemiology, 2008

Non-native speakers Broad range of command of English Reading skills > writing skills Reduced active vocabulary Difficulty in formulating precise queries English as Second Language English as a Foreign Language

Cross language information retrieval (CLIR) deals with retrieving information written in a language different from the language of the user's query Benefit for multilingual users –Avoid multiple queries –Formulate a query in their preferred language Monolingual users take advantage –if their passive knowledge is sufficient to understand documents in a foreign language –If (automatic) translation can be performed –If image captions are used to search for images Cross-language IR

Mixed document collections (in different languages) Countries with more than one official language (e.g. Switzerland, Canada, Belgium, Spain…) Document handsearching, e.g. Freiburg Cochrane Collaboration project (since 1995) –Identification of 21,620 controlled clinical trials –83% not listed in MEDLINE as „controlled trial“ –30% not indexed in MEDLINE –30% not in English language CLIR Beyond Native -> English

Example Korrelation von Hypertonie und Läsion der Weißen Substanz…

Korrelation von Hypertonie und Läsion der Weißen Substanz… “Correlation of high blood pressure and lesion of the white substance” Example

Korrelation von Hypertonie und Läsion der Weißen Substanz… “Correlation of high blood pressure and lesion of the white substance” Example

Korrelation von Hypertonie und Läsion der Weißen Substanz… “Correlation of high blood pressure and lesion of the white substance” Example

Cross-language document retrieval in life sciences and health care The technique of morphosemantic document indexing Evaluation of morphosemantic indexing Cross-language document retrieval in practice

Hypotheses The true, significant elements of language are... either words, significant parts of words, or word groupings. [Sapir 1921] Linguistic variations make (medical) Information Retrieval difficult Levels of linguistic variations Morphology Syntax Lexico-Semantics

Morphosemantic Indexing Subword-based, multilingual semantic indexing for document retrieval Subwords are atomic, conceptual or linguistic units: –Stems: stomach, gastr, diaphys –Prefixes: anti-, bi-, hyper- –Suffixes: -ary, -ion, -itis –Infixes: -o-, -s- Equivalence classes contain synonymous subwords and their translations: –#derma = { derm, cutis, skin, haut, kutis, pele, cutis, piel, … } –#inflamm = { inflamm, -itic, -itis, -phlog, entzuend, -itis, -itisch, inflam, flog, inflam, flog,... }

Segmentation: Myo | kard | itis Herz | muskel | entzünd |ung Inflamm |ation of the heart muscle muscle myo muskel muscul inflamm -itis inflam entzünd Eq Class subword herz heart card corazon card INFLAMM MUSCLE HEART Subword Thesaurus Structure Indexation: #muscle #heart #inflamm #heart #muscle #inflamm #inflamm #heart #muscle Thesaurus: ~ equivalence classes (MIDs) Lexicon entries: –English:~ –German:~ –Portuguese:~ –Spanish:~ –French:~ –Swedish:~ –Italian:~ 4.000

Indexing Pipeline

Subword-based document transformation Morphosemantic indexer

Subword-Based Search Korrelation von Hypertonie und Läsion der Weißen Substanz… #correl #hyper #tens #lesion #whit #matter

Subword-based query transformation Korrelation von Hypertonie und Läsion der Weißen Substanz… #correl #hyper #tens #lesion #whit #matter

Cross-language document retrieval in life sciences and health care The technique of morphosemantic document indexing Evaluation of morphosemantic indexing Cross-language document retrieval in practice

Evaluation Gold standards: OHSUMED, ImageCLEFMed –OHSUMED-Corpus (Hersh et al., 1994) Subset of MEDLINE ~233,000 English documents 106 English user queries –ImageCLEFMed Corpus (Clough et al., 2005) Multilingual Image Retrieval Task 2006 ~ Medical Images and captions 30 queries Query-document pairs had been manually judged for relevance Non-English queries were obtained by translation to German, Portuguese, Spanish and Swedish by domain experts Search Engine: Lucene –

Evaluation Baseline: monolingual text retrieval –(stemmed) English user queries –(stemmed) English texts Query translation (QTR) –Google translator –Multilingual dictionary compiled from UMLS Morphosemantic Indexing (MSI) –Interlingual representation of both user queries and documents –MSI-D incorporating disambiguation module

OHSUMED Benchmark Baseline: monolingual –Stemmed English queries –Stemmed English texts Query translation –Google translator –Multilingual dictionary compiled from UMLS Morphosemantic Indexing –Interlingual representation of user queries and documents Morphosemantic Indexing –incorporating disambiguation module English German Portuguese Spanish French Swedish Average Percent of Baseline Mean Average Precision

ImageCLEFMed Benchmark Baseline: monolingual –Stemmed English queries –Stemmed English texts Query translation –Google translator –Multilingual dictionary compiled from UMLS Morphosemantic Indexing –Interlingual representation of user queries and documents Morphosemantic Indexing –incorporating disambiguation module English German Portuguese Spanish French Swedish Average Percent of Baseline Top 20 Average Precision

Summary Cross-Language Document Retrieval –Based on morphological and semantic normalization of both user queries and documents –Matching of search/document terms on a language-independent, interlingual layer Language-independent indexing –reaches more than 92% of an English-English baseline on heterogeneous document collections, on average –outperforms query translation significantly –is independent from particular search engine architectures Incorporates six languages: –German, English, Portuguese, Spanish, French, Swedish In use in commercial systems

Cross-language document retrieval in life sciences and health care The technique of morphosemantic document indexing Evaluation of morphosemantic indexing Cross-language document retrieval in practice

Zentralbibliothek für Medizin Largest European medical library ~20 M Database entries 60,000 Queries / Month Subword-based cross-language retrieval

Contact Prof. Dr. med. Stefan Schulz Institute of Medical Biometry and Medical Informatics University Medical Center Freiburg, Germany Dr. med. Philipp Daumke Averbis GmbH Freiburg, Germany