© Paul Buitelaar, February 2002 Corpus Annotation Day at DI Multi-Layer Annotation for Cross- Lingual Information Retrieval in the Medical Domain Paul.


© Paul Buitelaar, February 2002, Corpus Annotation Day at DI
Multi-Layer Annotation for Cross-Lingual Information Retrieval in the Medical Domain
Paul Buitelaar
DFKI-Language Technology, Saarbrücken, Germany

Overview
– MuchMore Objectives
– Semantic Annotation
  – Semantic Resources, Term/Relation Tagging
– Corpus Annotation
  – Part-of-Speech, Morphology, Chunks
  – Grammatical Functions
– Annotation Format (DTD), Examples, Demo

MuchMore Objectives
Evaluation
– Systematic Comparison of CLIR Methods on a Realistic Scenario in the Medical Domain
  – Establishing a Baseline with Corpus-Based Methods
  – Comparison with Concept-Based Methods
Concept-Based CLIR
– Effective Use of Medical and General Semantic Resources by Developing Methods for Tuning and Extension

Semantic Resources
General
– WordNet (EN), GermaNet (DE), EuroWordNet (“linked”)
Medical Domain
– UMLS: Unified Medical Language System
  – Medical MetaThesaurus (MeSH, ICD, …)
    » English, German, Spanish, …
    » Concepts
    » 9 Relations (Broader, Narrower, …)
  – Semantic Network
    » 134 Semantic Types
    » 54 Semantic Relations

UMLS
C |ENG|P|L |PF|S |HIV|0|
C |ENG|S|L |PF|S |HTLV-III|0|
C |ENG|S|L |VS|S |Human Immunodeficiency Virus|0|
C |ENG|S|L |VWS|S |Virus, Human Immunodeficiency|0|
C |FIN|P|L |PF|S |HIV|3|
C |FRE|P|L |PF|S |HIV|3|
C |FRE|S|L |PF|S |VIRUS IMMUNODEFICIENCE HUMAINE|3|
C |GER|P|L |PF|S |HIV|3|
C |GER|S|L |PF|S |Humanes T-Zell-lymphotropes Virus Typ III|3|
Concept Names (MRCON): 1,734,706 (English 1,462,202; German 66,381; plus other languages)
– Each CUI (Concept Unique Identifier) is mapped to one of 134 semantic types (TUI)
  Clozapine : C -> Pharmacologic Substance : T121
– Semantic Types are organized in a Network through 54 Relations
  T121|T154|T047
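The pipe-delimited rows on this slide can be read with a few lines of Python. The sketch below assumes the column order of the older MRCON file (CUI, LAT, TS, LUI, STT, SUI, STR, LRL). Because the identifiers are not legible in the rows above, the sample data uses placeholder values such as `C0000000`, which are purely hypothetical, and the helper names are illustrative only.

```python
from collections import defaultdict
from typing import NamedTuple


class ConceptName(NamedTuple):
    cui: str   # concept unique identifier
    lat: str   # language of the name, e.g. ENG, GER
    ts: str    # term status; "P" marks the preferred term
    lui: str   # term (lexical) identifier
    stt: str   # string type, e.g. PF = preferred form
    sui: str   # string identifier
    name: str  # the concept name itself (STR column)
    lrl: str   # restriction level


def parse_mrcon_line(line: str) -> ConceptName:
    """Split one MRCON-style row; every row ends with a trailing pipe."""
    fields = line.strip().split("|")
    if fields and fields[-1] == "":
        fields.pop()  # drop the empty field produced by the trailing pipe
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
    return ConceptName(*(f.strip() for f in fields))


def names_by_language(rows):
    """Group concept names by their language code."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row.lat].append(row.name)
    return dict(grouped)


def preferred_name(rows, lat):
    """Return the preferred form of the preferred term for one language."""
    for row in rows:
        if row.lat == lat and row.ts == "P" and row.stt == "PF":
            return row.name
    return None


# Placeholder identifiers (hypothetical); only names and flags come from the slide.
SAMPLE = [
    "C0000000|ENG|P|L0000000|PF|S0000000|HIV|0|",
    "C0000000|ENG|S|L0000001|PF|S0000001|HTLV-III|0|",
    "C0000000|GER|S|L0000002|PF|S0000002|Humanes T-Zell-lymphotropes Virus Typ III|3|",
]
ROWS = [parse_mrcon_line(line) for line in SAMPLE]
```

Grouping by the LAT column is what makes the resource usable for cross-lingual lookup: names in different languages that share a CUI denote the same concept.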

Term / Relation Tagging
– Annotate Terms (of length 1-4 tokens) with Preferred Term, CUI and TUI
  <term id="13" tokenid="14, 15, 16" preferred="Intensive Care Unit" cui="C " tui="T073"/>
– Annotate All Possible Semantic Relations between Identified Terms within a Sentence
  <term id="2" tokenid="2" preferred="Heparinoid" cui="C " tui="T121"/>
  <term id="5" tokenid="6" preferred="Thrombin" cui="C " tui="T126"/>
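The two annotation steps on this slide could be sketched as follows. This is not the MuchMore implementation: the greedy longest-match strategy, the function names and the toy lexicon are assumptions made for illustration. The TUIs are taken from the slide examples, CUIs are left out because they are not legible here, and the relation table holds only the single triple `T121|T154|T047` shown on the UMLS slide.

```python
MAX_TERM_LEN = 4  # the slide limits terms to 1-4 tokens

# Toy lexicon (hypothetical): lower-cased surface form -> annotation attributes.
LEXICON = {
    "intensive care unit": {"preferred": "Intensive Care Unit", "tui": "T073"},
    "heparinoid": {"preferred": "Heparinoid", "tui": "T121"},
    "thrombin": {"preferred": "Thrombin", "tui": "T126"},
}


def tag_terms(tokens, lexicon):
    """Greedy longest-match lookup of 1-4 token spans; token ids are 1-based."""
    terms = []
    i = 0
    while i < len(tokens):
        matched = 0
        for n in range(min(MAX_TERM_LEN, len(tokens) - i), 0, -1):
            key = " ".join(tokens[i:i + n]).lower()
            if key in lexicon:
                terms.append({
                    "id": len(terms) + 1,
                    "tokenid": list(range(i + 1, i + n + 1)),
                    **lexicon[key],
                })
                matched = n
                break
        i += matched if matched else 1
    return terms


def build_network(triples):
    """Turn 'TUI1|RELATION|TUI2' rows into a lookup keyed by the TUI pair."""
    network = {}
    for row in triples:
        source, relation, target = row.split("|")
        network.setdefault((source, target), []).append(relation)
    return network


def tag_relations(terms, network):
    """List every relation the network allows between two terms of a sentence."""
    relations = []
    for a in terms:
        for b in terms:
            if a is b:
                continue
            for relation in network.get((a["tui"], b["tui"]), ()):
                relations.append((a["id"], relation, b["id"]))
    return relations


NETWORK = build_network(["T121|T154|T047"])
```

Because every licensed relation between every ordered pair of terms is emitted, the output over-generates by design, which matches the slide's wording "all possible semantic relations".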

Corpus Annotation
Morpho/Syntactic Processing
– TnT: Tokenization, Segmentation, PoS-tagging
– Mmorph: Lemmatization (German compound analysis)
– Chunkie: Phrase Recognition
– under development: Grammatical Function Tagging
Parallel Corpus
– ~9000 English and German Medical Abstracts from 41 Journals (obtained through Springer LINK WebSite)
– ~1 M Tokens for each Language
– Manual Clean-Up

Tokenization, POS Tagging
Tokenization
– Hyphenated Compounds, e.g.: side-effects, short-term, follow-up
– Abbreviations, e.g.: aquos., emulsific., Ungt.
TnT PoS-Tagger (Brants, 2000)
– Retrain on an annotated domain-specific corpus
– Update underlying lexicon
  » Specialist Medical Lexicon: UMLS (English), ZInfo (German)
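A minimal sketch of the two tokenization issues named on this slide, assuming a regular-expression tokenizer and a small abbreviation list. The list contains only the three abbreviations from the slide, and the approach is an illustration, not the tokenizer actually used in the project.

```python
import re

# Abbreviations from the slide, lower-cased for lookup; a real list would be larger.
ABBREVIATIONS = {"aquos.", "emulsific.", "ungt."}

# A word, optionally continued by hyphenated parts, or one punctuation character.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def tokenize(text):
    """Keep hyphenated compounds whole and re-attach the period of known abbreviations."""
    raw = TOKEN_RE.findall(text)
    tokens = []
    i = 0
    while i < len(raw):
        token = raw[i]
        followed_by_period = i + 1 < len(raw) and raw[i + 1] == "."
        if followed_by_period and (token + ".").lower() in ABBREVIATIONS:
            tokens.append(token + ".")
            i += 2  # consume the word and its period together
        else:
            tokens.append(token)
            i += 1
    return tokens
```

For example, `tokenize("Ungt. emulsific. aquos. for short-term follow-up.")` keeps `short-term` and `follow-up` as single tokens and splits off only the sentence-final period.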

Morphology, Phrase Recognition
Mmorph
– Dumped Full-Form Lexicon (domain independent)
– Decomposition: Problematic for German, e.g.
  – Schleimhautoedem > Schleimhaut+Oe+Dem
  » German Medical Specialist Lexicon
Chunkie
– HMM-based Partial Parser (Skut and Brants, 2000)
– Recognition of internal structure of simple as well as complex NPs, PPs and APs
– Retraining needed on Annotated Medical Corpora
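The decomposition problem can be reproduced with a toy lexicon-driven splitter. The sketch below is not Mmorph: it assumes a recursive split over a set of known word forms and a "fewest parts wins" heuristic, and both lexicons are hypothetical. It shows how a domain-independent lexicon yields Schleimhaut+Oe+Dem, while adding the medical entry `oedem` yields the intended analysis.

```python
from functools import lru_cache


def decompose(word, lexicon):
    """Split a compound into lexicon entries, preferring the fewest parts."""
    word = word.lower()

    @lru_cache(maxsize=None)
    def best(start):
        if start == len(word):
            return ()  # nothing left to split
        candidates = []
        for end in range(start + 1, len(word) + 1):
            part = word[start:end]
            if part in lexicon:
                rest = best(end)
                if rest is not None:
                    candidates.append((part,) + rest)
        if not candidates:
            return None  # no analysis covers the remainder
        return min(candidates, key=len)

    result = best(0)
    return list(result) if result is not None else None


# Hypothetical lexicons for illustration only.
GENERAL_LEXICON = frozenset({"schleimhaut", "oe", "dem"})
MEDICAL_LEXICON = GENERAL_LEXICON | {"oedem"}
```

With `GENERAL_LEXICON` the call `decompose("Schleimhautoedem", GENERAL_LEXICON)` gives three parts; with `MEDICAL_LEXICON` it gives `schleimhaut` + `oedem`, which is the motivation for a German medical specialist lexicon.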

Grammatical Function Tagging
Untersucht wurden 30 Patienten, die sich einer elektiven aortokoronaren Bypassoperation unterziehen mussten.
(English: "Thirty patients who had to undergo elective aortocoronary bypass surgery were examined.")
"Untersucht" PAS.SUBJ:SUBJ "Patienten"
"unterziehen" ACT.SUBJ*OBJ*IOBJ:SUBJ "Patienten"
"unterziehen" ACT.SUBJ*OBJ*IOBJ:OBJ "sich"
"unterziehen" ACT.SUBJ*OBJ*IOBJ:IOBJ "Bypassoperation"
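The tagger output shown here can be turned into structured records with a small parser. The reading of the label is an assumption: `PAS`/`ACT` is taken as the voice of the verb, the `*`-separated list as its frame, and the part after the colon as the function filled by the quoted dependent. The class and function names are hypothetical.

```python
import re
from typing import NamedTuple, Tuple


class GramRel(NamedTuple):
    head: str               # the verb
    voice: str              # PAS or ACT (assumed: passive / active)
    frame: Tuple[str, ...]  # functions the verb can take, e.g. SUBJ, OBJ, IOBJ
    function: str           # function realized by the dependent
    dependent: str


LINE_RE = re.compile(
    r'"(?P<head>[^"]+)"\s+'
    r'(?P<voice>\w+)\.(?P<frame>[\w*]+):(?P<function>\w+)\s+'
    r'"(?P<dependent>[^"]+)"'
)


def parse_gramrel(line):
    """Parse one line such as: "unterziehen" ACT.SUBJ*OBJ*IOBJ:OBJ "sich"."""
    # Typographic quotes are normalized so that either quote style is accepted.
    normalized = line.replace("\u201d", '"').replace("\u201c", '"')
    match = LINE_RE.search(normalized)
    if match is None:
        raise ValueError(f"not a grammatical relation line: {line!r}")
    return GramRel(
        head=match["head"],
        voice=match["voice"],
        frame=tuple(match["frame"].split("*")),
        function=match["function"],
        dependent=match["dependent"],
    )
```

A record of this shape maps naturally onto one relation element per line, with the head and dependent replaced by token identifiers.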

XML Annotation Format (DTD)
document
  keywords
    keyword
  sentence
    ewnterms
      ewnterm
    terms
      term
    semrels
      semrel
    text
      token
    gramrels
      gramrel
    chunks
      chunk
  title
    ewnterms
      ewnterm
    terms
      term
    semrels
      semrel
    text
      token
    gramrels
      gramrel
    chunks
      chunk
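The element hierarchy can be written down as a parent-to-children table and used to check the nesting of an annotated document. This is a sketch only: the element names come from the slide, while the nesting is read off the diagram and the `id` attributes and helper names are assumptions. It checks nesting, not the cardinalities or attributes a real DTD would also constrain.

```python
import xml.etree.ElementTree as ET

# Annotation layers shared by <sentence> and <title>, as read from the diagram.
LAYERS = {"ewnterms", "terms", "semrels", "text", "gramrels", "chunks"}

ALLOWED_CHILDREN = {
    "document": {"keywords", "sentence", "title"},
    "keywords": {"keyword"},
    "sentence": LAYERS,
    "title": LAYERS,
    "ewnterms": {"ewnterm"},
    "terms": {"term"},
    "semrels": {"semrel"},
    "text": {"token"},
    "gramrels": {"gramrel"},
    "chunks": {"chunk"},
}


def nesting_errors(element, path=""):
    """Return a message for every child that its parent does not allow."""
    here = f"{path}/{element.tag}"
    allowed = ALLOWED_CHILDREN.get(element.tag, set())
    errors = []
    for child in element:
        if child.tag not in allowed:
            errors.append(f"{here}: unexpected <{child.tag}>")
        errors.extend(nesting_errors(child, here))
    return errors


def build_sentence(tokens):
    """Build document > sentence > text > token; attribute names are hypothetical."""
    document = ET.Element("document")
    sentence = ET.SubElement(document, "sentence", id="s1")
    text = ET.SubElement(sentence, "text")
    for number, form in enumerate(tokens, start=1):
        token = ET.SubElement(text, "token", id=f"w{number}")
        token.text = form
    return document
```

Keeping each layer in its own container element lets new layers be added to a sentence without touching the token layer they refer to.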

XML Annotation (Example)
A 34-year-old HIV-infected African woman developed fever and weight loss on her trunk and arms .
</document>

Demo ...