WP 10 Multilingual Access Philipp Daumke, Stefan Schulz.


Multilingual Access - Rationale
(Figure labels: English as First Language, English as Second Language, English as a Foreign Language, No English Language Skills)
< 70 % of the world's scientists read in English
80 % of the world's electronically stored information is in English
90 % English articles in Medline (2000)
Sources:
  –The British Council, 2005
  –Fung ICH: Open access for the non-English-speaking world: overcoming the language barrier. Emerging Themes in Epidemiology, 2008

Non-native speakers
Broad range of command of English
Reading skills > writing skills
Reduced active vocabulary
Difficulty in formulating precise queries
(Figure labels: English as Second Language, English as a Foreign Language)

Cross-language document retrieval example
German query: Korrelation von Hypertonie und Läsion der Weißen Substanz…
English: Correlation of high blood pressure and lesion of the white substance


BootStrep WP 10 - Multilingual access
Objectives:
  –To provide a multilingual search interface to the BootStrep Biolexicon / Bioontology
  –We do NOT propose to deliver a multilingual extension of the BootStrep biolexicon
Query languages: French, German, English, (Italian)
Output language: English
Method: Subword-based semantic indexing
Resources:
  –MorphoSaurus multilingual subword lexicon & thesaurus
  –MorphoSaurus Semantic Indexer

Technique: Morphosemantic Indexing
Subword-based, multilingual semantic indexing for document retrieval
Subwords are atomic, conceptual or linguistic units:
  –Stems: stomach, gastr, diaphys
  –Prefixes: anti-, bi-, hyper-
  –Suffixes: -ary, -ion, -itis
  –Infixes: -o-, -s-
Equivalence classes contain synonymous subwords and their translations:
  –#derma = { derm, cutis, skin, haut, kutis, pele, cutis, piel, … }
  –#inflamm = { inflamm, -itic, -itis, -phlog, entzuend, -itis, -itisch, inflam, flog, inflam, flog, ... }
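The equivalence classes above can be read as a lookup table from subword to class identifier. The following is a minimal sketch under that reading, not the MorphoSaurus implementation: the dictionary holds only the subwords listed on the slide, and the names `EQUIVALENCE_CLASSES`, `build_subword_index` and `class_of` are hypothetical.

```python
# Hypothetical miniature thesaurus; only the subwords shown on the slide.
EQUIVALENCE_CLASSES = {
    "#derma": {"derm", "cutis", "skin", "haut", "kutis", "pele", "piel"},
    "#inflamm": {"inflamm", "itic", "itis", "phlog", "entzuend",
                 "itisch", "inflam", "flog"},
}


def build_subword_index(classes):
    """Invert the thesaurus: map each subword to its class identifier."""
    index = {}
    for class_id, subwords in classes.items():
        for subword in subwords:
            index[subword] = class_id
    return index


SUBWORD_TO_CLASS = build_subword_index(EQUIVALENCE_CLASSES)


def class_of(subword):
    """Return the class identifier of a subword, or None if it is unknown.

    Affix hyphens ("-itis") and case are ignored for the lookup.
    """
    return SUBWORD_TO_CLASS.get(subword.lower().strip("-"))
```

With this table, `class_of("haut")` and `class_of("skin")` resolve to the same identifier, which is what lets a German query term match an English document term.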

Subword Thesaurus Structure
Segmentation:
  –Myo | kard | itis
  –Herz | muskel | entzünd | ung
  –Inflamm | ation of the heart muscle
Equivalence classes and their subwords:
  –MUSCLE: muscle, myo, muskel, muscul
  –INFLAMM: inflamm, -itis, inflam, entzünd
  –HEART: herz, heart, card, corazon, card
Indexation:
  –#muscle #heart #inflamm
  –#heart #muscle #inflamm
  –#inflamm #heart #muscle
Thesaurus: ~ equivalence classes (MIDs)
Lexicon entries:
  –English: ~
  –German: ~
  –Portuguese: ~
  –Spanish: ~
  –French: ~
  –Swedish: ~
  –Italian: ~ 4.000
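The segmentation and indexation steps shown here can be sketched as follows. The slide does not say how the segmenter chooses its cut points, so greedy longest-match against the lexicon is an assumption made for illustration; `LEXICON`, `segment` and `index_text` are hypothetical names, and the lexicon contains only the subwords from this example.

```python
# Hypothetical toy lexicon: subword -> equivalence class.
# None marks subwords that carry no indexable meaning (e.g. suffixes).
LEXICON = {
    "myo": "#muscle", "muskel": "#muscle", "muscle": "#muscle",
    "kard": "#heart", "card": "#heart", "herz": "#heart", "heart": "#heart",
    "itis": "#inflamm", "entzünd": "#inflamm", "inflamm": "#inflamm",
    "ung": None, "ation": None,
}


def segment(word):
    """Split a word into known subwords by greedy longest match (assumed).

    Characters that start no known subword are skipped one at a time.
    """
    word = word.lower()
    pieces = []
    start = 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in LEXICON:
                pieces.append(word[start:end])
                start = end
                break
        else:
            start += 1  # no subword starts here; move on
    return pieces


def index_text(text):
    """Replace every meaningful subword in the text by its class identifier."""
    class_ids = []
    for token in text.split():
        for piece in segment(token):
            class_id = LEXICON[piece]
            if class_id is not None:
                class_ids.append(class_id)
    return class_ids
```

Under these assumptions, "Myokarditis", "Herzmuskelentzündung" and "Inflammation of the heart muscle" all yield the same three identifiers, only in a different order, which matches the three index lines on the slide.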

Indexing Pipeline

Subword-based document transformation Morphosemantic indexer

Subword-Based Search
Korrelation von Hypertonie und Läsion der Weißen Substanz…
#correl #hyper #tens #lesion #whit #matter

Subword-based query transformation
Korrelation von Hypertonie und Läsion der Weißen Substanz…
#correl #hyper #tens #lesion #whit #matter
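Once queries and documents are both expressed as class identifiers, retrieval reduces to comparing identifier sets. The sketch below illustrates that idea only: the word-level table `WORD_TO_CLASSES` stands in for the real subword segmenter, the English entries and the overlap score are assumptions, and none of the names come from the MorphoSaurus software.

```python
# Hypothetical word-level table standing in for subword segmentation.
WORD_TO_CLASSES = {
    "korrelation": ["#correl"], "correlation": ["#correl"],
    "hypertonie": ["#hyper", "#tens"], "hypertension": ["#hyper", "#tens"],
    "läsion": ["#lesion"], "lesions": ["#lesion"],
    "weißen": ["#whit"], "white": ["#whit"],
    "substanz": ["#matter"], "matter": ["#matter"],
}


def to_interlingua(text):
    """Map a text in any covered language to a list of class identifiers."""
    class_ids = []
    for token in text.lower().split():
        token = token.strip(".,;:…")
        class_ids.extend(WORD_TO_CLASSES.get(token, []))
    return class_ids


def overlap_score(query_ids, doc_ids):
    """Fraction of distinct query identifiers that also occur in the document."""
    query_set = set(query_ids)
    if not query_set:
        return 0.0
    return len(query_set & set(doc_ids)) / len(query_set)


def rank(query, documents):
    """Return (score, document) pairs, best match first."""
    query_ids = to_interlingua(query)
    scored = [(overlap_score(query_ids, to_interlingua(doc)), doc)
              for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

For the German query on this slide, an English document such as "Hypertension and white matter lesions" would share five of the six query identifiers, although the two texts have no surface word in common.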

Adapting Morphosemantic Indexing to BootStrep
BootStrep terminology mostly disjoint from existing clinical terminology
Enhancement of data resources (e.g. for acronym resolution, multi-term equivalences)
BootStrep terms for multilingual access
  –Gene Ontology, InterPro, IntAct, Gene Regulation Ontology, Species
Medline subcorpus (about E. coli gene regulation)

Ongoing/Completed Tasks
Manual training of MorphoSaurus lexica by means of the BootStrep corpora (en, de, fr)
Multilingual Terminology Browser
  –2268 GO terms + translations
  –6925 InterPro terms + translations
  –2082 IntAct terms + translations
  –URL:
Multilingual Search Engine:
  –Document collection: BootStrep-Medline subset
  –Languages: English, German, French
  –Query modes: Author, Title, Title + keywords, All

Terminology Browser
(Screenshot regions: Search Results, Further Information, Navigation)

Terminology Browser

Multilingual Search Engine

To do: Tools and Resources
BootStrep-Browser
  –Integration of Species
  –Integration of the Gene Regulation Ontology
Multilingual Search Engine
  –Multilingual treatment of acronyms
  –Inclusion of species synonym list
  –Dealing with mixed queries (German-English, English-French)
  –Integration with the fact store
Continue lexicon population
  –Italian terms?

To do: Evaluation
Creation of a gold standard
  –Typical English queries
  –Find all relevant documents in the E. coli subset
CLIR experiments
  –Translate queries to French and German
  –Compare mean average precision
Reuse of already existing routines on standard benchmarks (OHSUMED, ImageCLEF)
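The comparison of mean average precision planned here can be sketched with the standard definition of the measure. This is an illustration, not the project's evaluation routine: function names and the document identifiers in the usage note are hypothetical, and expressing a run as a percentage of the monolingual baseline is one possible way to compare runs.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average of the precision values at each rank where a relevant document appears.

    The sum is divided by the total number of relevant documents, so
    relevant documents that are never retrieved lower the score.
    """
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranking, relevant set) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(ranking, relevant)
               for ranking, relevant in runs) / len(runs)


def percent_of_baseline(score, baseline_score):
    """Express a run's score relative to the monolingual baseline score."""
    return 100.0 * score / baseline_score
```

For example, a ranking `["d1", "d2", "d3"]` judged against the relevant set `{"d1", "d3"}` gives precision 1/1 at rank 1 and 2/3 at rank 3, so its average precision is (1 + 2/3) / 2. Running the same gold-standard queries in English, French and German and comparing the resulting means shows how much retrieval quality is lost by crossing the language barrier.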

ImageCLEFMed Benchmark
Baseline: monolingual
  –Stemmed English queries
  –Stemmed English texts
Query translation
  –Google translator
  –Multilingual dictionary compiled from UMLS
Morphosemantic Indexing
  –Interlingual representation of user queries and documents
Morphosemantic Indexing
  –incorporating disambiguation module
(Chart: categories English, German, Portuguese, Spanish, French, Swedish, Average; axis labels Percent of Baseline, Top 20 Average Precision)