SemanticMining WP20 meeting Freiburg, March 29 – 30, 2004.


Agenda

March 29
12: :30 Lunch
13:30 Welcome, discussion of agenda
14: :35 Linköping presentation
14: :10 Brighton presentation
15: :45 Göteborg presentation
15: :30 Coffee break
16: :05 Stockholm presentation
17: :40 Geneva presentation
17: :25 Paris presentation
18: :00 Freiburg presentation
20:00 Dinner

March 30
9: :30 Discussion of the description of WP
:30 – 10:45 Coffee break
11:00-12:45 Workplan for WP20: discussion and elaboration of deliverables
13:00-14:00 Lunch

Multi-lingual Medical Dictionary: Description of Work (I)

The lack of a large-scale multi-lingual medical dictionary hampers the integration of European research activities in the medical field and, more seriously, the development of multi-lingual information retrieval services. A language technology that is promising for this problem is corpus-based machine translation. The aim of this project is to develop techniques and systems for lexical data generation from parallel corpora, and to develop and apply methods for the evaluation of machine translation systems. Parallel corpora exist, e.g., as translations from English into other European languages of the official WHO classifications and some other terminology systems. Several of the NoE partners have extensive experience in multilingual lexical resources and computational lexicography, while others have an interest in applying such tools, e.g., for semi-automated translation, semi-automated coding and indexing, and advanced systems for information retrieval.

Multi-lingual Medical Dictionary: Description of Work (II)

Tasks
20.1 Facilitating short study visits of members of each other's groups
20.2 Sharing and exchange of methods, materials and collaboration on work in progress
20.3 Proposal for a common data structure for a multi-lingual medical dictionary
20.4 Generation of multi-lingual medical lexicon in English, German, French, Portuguese, Italian, Spanish, Swedish in a range of entries per language

Deliverables
D20.1 Report Multi-lingual Medical Dictionary m11
D20.2 Report Multi-lingual Medical Dictionary m17

Topics for Discussion

Lexeme features (morphology, syntax, semantics)
Application context (IR, NLG, …)
Linguistic framework (grammar theory)
Languages covered
Domain (sublanguages, general language)
Size of the lexicon
Implementation framework (sources, exchange templates)
Interfaces to terminological resources (UMLS, WordNet)
Methods for lexical acquisition (manual, semi-automatic)

MorphoSaurus: Subword Lexicon & Thesaurus

Freiburg University Hospital, Department of Medical Informatics
Freiburg University, Computational Linguistics Lab

Motivation – Intra- and Crosslingual Indexing for Information Retrieval

Requirements:
Elimination of inflectional and derivational variation: {nucleus, nuclei}, {diagnosis, diagnoses, diagnostic}, {foot, feet}, {Lymphozyten, lymphozytär}
Decomposition of compound terms: procto|sigmoid|o|scop|ie, para|sympath|ectomy, Rechts|herz|insuffizienz, psic|o|s|somát|ic|o
Resolution of synonyms and spelling variants: {oesophagus, esophagus}, {leuko, leuco}, {cutis, skin}, {hemorrhage, bleeding}, {ascorbic, Vitamin C}, {ancylostoma, hookworm}
Mapping of interlingual synonyms: {blood, blut, sangue}, {liver, hepat..., fígado}, {kidney, nephr.., nefr.., nier.., ren, rim}

What is a subword?

An atomic linguistic sense unit:
Morphemes: nephr, anti, thyr, scler, hepat, cardi
Morpheme aggregates: diaphys, ascorb, anabol, diagnost
Words: amyloid, bone, fever, liver
Exceptionally, noun groups: vitamin c, …

Taming the growth rates of lexical resources at a sublinear level

Subword Delimitation Criteria

Semantic (compositionality): Hyper | cholesterol | emia
Lexical (enabling synonym matching): schleimhaut = mucosa (schleim | haut)
Data-driven (avoiding ambiguities and false segmentation), e.g. relationship, schwangerschaft (relation|ship, schwanger|schaft)
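The slides do not say how the segmenter itself works. As a minimal sketch only, the decomposition of a word into lexicon subwords can be illustrated with a dynamic-programming search that prefers the segmentation with the fewest parts. The function name `segment`, the toy `LEXICON`, and the fewest-parts preference are assumptions made for illustration, not the MorphoSaurus implementation.

```python
def segment(word, lexicon):
    """Split `word` into lexicon entries, preferring the fewest parts.

    Returns a list of subwords, or None if no full segmentation exists.
    Hypothetical helper for illustration only.
    """
    word = word.lower()
    n = len(word)
    best = [None] * (n + 1)  # best[i] holds the best segmentation of word[:i]
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if best[start] is not None and piece in lexicon:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


# Toy lexicon built from the examples on the slide (assumed entries).
LEXICON = {"hyper", "cholesterol", "emia", "schleim", "haut", "mucosa"}

print(segment("Hypercholesterolemia", LEXICON))  # ['hyper', 'cholesterol', 'emia']
print(segment("Schleimhaut", LEXICON))           # ['schleim', 'haut']
print(segment("xyz", LEXICON))                   # None
```

Preferring fewer parts means that a word listed as a whole in the lexicon would win over any split of it into shorter entries.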

The MorphoSaurus system

Extracts semantically relevant subwords from medical texts in different languages
Transforms IR-relevant content to concept-like semantic identifiers (MID = MorphoSaurus identifiers)

Example:

Original
High TSH values suggest the diagnosis of primary hypothyroidism...
Erhöhte TSH-Werte erlauben die Diagnose einer primären Hypothyreose...

Orthographic Normalization (Orthographic Rules)
high tsh values suggest the diagnosis of primary hypothyroidism...
erhoehte tsh-werte erlauben die diagnose einer primaeren hypothyreose...

Morphosyntactic Parser (Lexicon)
high tsh value s suggest the diagnos is of primar y hypo thyroid ism
er hoeh te tsh wert e erlaub en die diagnos e einer primaer en hypo thyre ose

Semantic Normalization (Thesaurus): MID-Representation
upiiij tsh valueiiqrij suggestiipzzr diagnostiiiryz primariiiyiy smalliiiqqi thyreiiprzw
upiiij tsh valueiiqrij permitiji diagnostiiiryz primariiiyiy smalliiiqqi thyreiiprzw
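The orthographic and semantic normalization steps of this example can be sketched as follows. This is an illustration under stated assumptions: the transliteration table covers only the characters visible in the example, the subword-to-MID table is read off the slide by aligning the parser output with the MID line, and the function names are hypothetical. The real rule set, lexicon and thesaurus are far larger.

```python
# Assumed subset of the orthographic rules: lowercase, transliterate umlauts and sharp s.
UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def orthographic_normalize(text):
    """Lowercase the text and transliterate German special characters."""
    return text.lower().translate(UMLAUTS)


# Toy thesaurus (subword -> MID), inferred from the slide example.
MID = {
    "high": "upiiij", "hoeh": "upiiij",
    "tsh": "tsh",
    "value": "valueiiqrij", "wert": "valueiiqrij",
    "suggest": "suggestiipzzr",
    "erlaub": "permitiji",
    "diagnos": "diagnostiiiryz",
    "primar": "primariiiyiy", "primaer": "primariiiyiy",
    "hypo": "smalliiiqqi",
    "thyroid": "thyreiiprzw", "thyre": "thyreiiprzw",
}


def semantic_normalize(subwords):
    """Map subwords to MIDs; subwords without an entry (affixes, stop words) are dropped."""
    return [MID[s] for s in subwords if s in MID]


# Parser output copied from the slide, one subword per token.
english = semantic_normalize(
    "high tsh value s suggest the diagnos is of primar y hypo thyroid ism".split()
)
german = semantic_normalize(
    "er hoeh te tsh wert e erlaub en die diagnos e einer primaer en hypo thyre ose".split()
)
shared = set(english) & set(german)

print(orthographic_normalize("Erhöhte TSH-Werte"))
print(" ".join(english))
print(" ".join(german))
print(len(shared), "of", len(english), "MIDs shared")
```

With this toy table the two sentences differ in a single identifier (suggest vs. erlaub), which is what makes the MID representation usable as a language-independent index.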

MorphoSaurus Thesaurus Features

Only two semantic relations:
Syntagmatical expansion: nephrotomiiqwjja = nephriikwjza + tomyiiqjqqa (to avoid known mis-segmentations, e.g. nephr + oto + mie)
Ambiguous readings: seitiiyqyqa = lateraliijwira OR pagerijjrja

Transforms IR-relevant content to concept-like semantic identifiers (MID = MorphoSaurus identifiers)
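One way to picture the two relations is as rewrite tables applied to a MID sequence: an expansion replaces one identifier by a sequence of identifiers, and an ambiguity replaces one identifier by a set of alternative readings. The sketch below uses only the two entries shown on the slide. The table names, the `resolve` function and the "list of alternative sets" output format are assumptions for illustration; the slides do not describe how the system stores or applies these relations.

```python
# Syntagmatical expansion: one MID stands for a sequence of MIDs (entry from the slide).
EXPANSION = {"nephrotomiiqwjja": ["nephriikwjza", "tomyiiqjqqa"]}

# Ambiguous readings: one MID stands for several alternative MIDs (entry from the slide).
AMBIGUOUS = {"seitiiyqyqa": ["lateraliijwira", "pagerijjrja"]}


def resolve(mids):
    """Return one set of alternative MIDs per output position (hypothetical format)."""
    slots = []
    for mid in mids:
        if mid in EXPANSION:
            # Each part of the expansion becomes its own position.
            slots.extend({part} for part in EXPANSION[mid])
        elif mid in AMBIGUOUS:
            # All readings share a single position.
            slots.append(set(AMBIGUOUS[mid]))
        else:
            slots.append({mid})
    return slots


print(resolve(["nephrotomiiqwjja", "seitiiyqyqa", "tsh"]))
```

Keeping the alternatives of an ambiguous identifier in one position lets a search engine treat them as an OR at that position, which matches the "OR" notation on the slide.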

Morphoedit Lexicon Editor

State of the Project

Domain: clinical language and, partly, lay expressions
Validated entries: 21,397 English, 22,053 German, 15,029 Portuguese
Automatically generated entries: 8,992 Spanish subwords from Portuguese subwords

CLIR Experiments (OHSUMED)

Manual translation of 106 English queries to German and Portuguese by medical experts
Baseline (QTR): machine translation / bilingual dictionaries
  Google Translator to re-translate German/Portuguese queries to English
  Additional search in a bilingual lexeme dictionary derived from the UMLS Metathesaurus
  Stemmed by the Porter stemming algorithm / stop word elimination
MorphoSaurus (MSI): normalization of queries/documents
Boolean search engine: frequency and adjacency measure
Results German: QTR: 68%, MSI: 93%
Results Portuguese: QTR: 54%, MSI: 62%
(RIAO04)

Multilingual MeSH Mapping

Morpho-semantic normalization of 35,000 English, manually MeSH-annotated Medline abstracts
Statistical learning of indexing patterns
Using indexing patterns for mapping of normalized English/German/Portuguese texts
Results (agreement with gold standard human indexers):
English: 33% (68%)
German: 30% (62%)
Portuguese: 27% (56%)
(RIAO04)