Published by Philip Parker. Modified over 9 years ago.
1
© Paul Buitelaar, February 2002 Corpus Annotation Day at DI
Multi-Layer Annotation for Cross-Lingual Information Retrieval in the Medical Domain
Paul Buitelaar
DFKI-Language Technology
Saarbrücken, Germany
2
Overview
MuchMore Objectives
Semantic Annotation
  – Semantic Resources, Term/Relation Tagging
Corpus Annotation
  – Part-of-Speech, Morphology, Chunks
  – Grammatical Functions
Annotation Format (DTD), Examples, Demo
3
MuchMore Objectives
Evaluation
  – Systematic Comparison of CLIR (Cross-Lingual Information Retrieval) Methods on a Realistic Scenario in the Medical Domain
  – Establishing a Baseline with Corpus-Based Methods
  – Comparison with Concept-Based Methods
Concept-Based CLIR
  – Effective Use of Medical and General Semantic Resources by Developing Methods for Tuning and Extension
4
Semantic Resources
General
  – WordNet (EN), GermaNet (DE), EuroWordNet (“linked”)
Medical Domain
  – UMLS: Unified Medical Language System
    » Medical MetaThesaurus (MeSH, ICD, …)
      English, German, Spanish, …
      730,000 Concepts
      9 Relations (Broader, Narrower, …)
    » Semantic Network
      134 Semantic Types
      54 Semantic Relations
5
UMLS
Concept Names (MRCON): 1,734,706
  – ENGLISH 1,462,202
  – GERMAN 66,381
  – other languages
Example entries for one concept:
  C0019682|ENG|P|L0019682|PF|S0048631|HIV|0|
  C0019682|ENG|S|L0020103|PF|S0049688|HTLV-III|0|
  C0019682|ENG|S|L0020128|VS|S0049756|Human Immunodeficiency Virus|0|
  C0019682|ENG|S|L0020128|VWS|S0098727|Virus, Human Immunodeficiency|0|
  C0019682|FIN|P|L1523437|PF|S1819346|HIV|3|
  C0019682|FRE|P|L0168651|PF|S0233132|HIV|3|
  C0019682|FRE|S|L0206547|PF|S0277133|VIRUS IMMUNODEFICIENCE HUMAINE|3|
  C0019682|GER|P|L0413854|PF|S0538136|HIV|3|
  C0019682|GER|S|L1261793|PF|S1503739|Humanes T-Zell-lymphotropes Virus Typ III|3|
Each CUI (Concept Unique Identifier) is mapped to one of 134 semantic types (TUI)
  – Clozapine : C0009079
  – Pharmacologic Substance : T121
Semantic Types are organized in a Network through 54 Relations
  – T121|T154|T047
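A minimal sketch of how the pipe-delimited concept-name rows above could be read. The column names in `MRCON_FIELDS` are an assumption based on the usual MRCON layout (concept id, language, term status, lexical id, string type, string id, string, restriction level); they are not stated on the slide, and the helper names are hypothetical.

```python
from collections import defaultdict

# Assumed column names for an MRCON row; not given on the slide.
MRCON_FIELDS = ("cui", "lat", "ts", "lui", "stt", "sui", "str", "lrl")

def parse_mrcon(lines):
    """Yield one dict per non-empty pipe-delimited row."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Each row ends with a trailing '|', so keep only the first 8 values.
        values = line.split("|")[:len(MRCON_FIELDS)]
        yield dict(zip(MRCON_FIELDS, values))

def names_by_language(records):
    """Group concept names by their language code."""
    names = defaultdict(list)
    for rec in records:
        names[rec["lat"]].append(rec["str"])
    return dict(names)

def preferred_name(records, lat):
    """Return the first name flagged 'P' for the given language, if any."""
    for rec in records:
        if rec["lat"] == lat and rec["ts"] == "P":
            return rec["str"]
    return None

# Rows copied from the slide.
SAMPLE_ROWS = [
    "C0019682|ENG|P|L0019682|PF|S0048631|HIV|0|",
    "C0019682|ENG|S|L0020128|VS|S0049756|Human Immunodeficiency Virus|0|",
    "C0019682|FRE|P|L0168651|PF|S0233132|HIV|3|",
    "C0019682|FRE|S|L0206547|PF|S0277133|VIRUS IMMUNODEFICIENCE HUMAINE|3|",
    "C0019682|GER|P|L0413854|PF|S0538136|HIV|3|",
    "C0019682|GER|S|L1261793|PF|S1503739|Humanes T-Zell-lymphotropes Virus Typ III|3|",
]

if __name__ == "__main__":
    records = list(parse_mrcon(SAMPLE_ROWS))
    print(preferred_name(records, "GER"))
    print(names_by_language(records)["FRE"])
```

Because every row shares the CUI C0019682, grouping by language is what makes the concept usable as a cross-lingual pivot: an English and a German term that map to the same CUI can be treated as equivalent for retrieval.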
6
Term / Relation Tagging
Annotate Terms (of length 1-4 tokens) with Preferred Term, CUI and TUI
  <term id="13" tokenid="14, 15, 16" preferred="Intensive Care Unit" cui="C0021708" tui="T073"/>
Annotate All Possible Semantic Relations between Identified Terms within a Sentence
  <term id="2" tokenid="2" preferred="Heparinoid" cui="C0019142" tui="T121"/>
  <term id="5" tokenid="6" preferred="Thrombin" cui="C0040018" tui="T126"/>
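The term-tagging step can be illustrated with a small lookup over token n-grams of length 1 to 4. This is only a sketch under stated assumptions: the function name `tag_terms`, the in-memory `lexicon` dictionary, and the example sentence are hypothetical, and the actual tagger may match terms differently (for instance on lemmas rather than surface forms). The CUI and TUI values are the ones shown on the slide.

```python
def tag_terms(tokens, lexicon, max_len=4):
    """Return every 1..max_len token span whose lowercased text is in the lexicon.

    Token ids are 1-based, mirroring the tokenid attribute on the slide.
    """
    terms = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(tokens):
                break
            key = " ".join(tokens[start:end]).lower()
            if key in lexicon:
                preferred, cui, tui = lexicon[key]
                terms.append({
                    "id": len(terms) + 1,
                    "tokenid": list(range(start + 1, end + 1)),
                    "preferred": preferred,
                    "cui": cui,
                    "tui": tui,
                })
    return terms

# Toy lexicon built from the three terms shown on the slide.
LEXICON = {
    "intensive care unit": ("Intensive Care Unit", "C0021708", "T073"),
    "heparinoid": ("Heparinoid", "C0019142", "T121"),
    "thrombin": ("Thrombin", "C0040018", "T126"),
}

if __name__ == "__main__":
    # Hypothetical sentence, used only to exercise the lookup.
    sentence = "The Heparinoid was given with Thrombin in the Intensive Care Unit".split()
    for term in tag_terms(sentence, LEXICON):
        ids = ", ".join(str(i) for i in term["tokenid"])
        print(f'<term id="{term["id"]}" tokenid="{ids}" '
              f'preferred="{term["preferred"]}" cui="{term["cui"]}" tui="{term["tui"]}"/>')
```

The sketch reports every matching span rather than only the longest one, so overlapping terms would all be annotated; whether the original system filters overlaps is not stated on the slide.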
7
Corpus Annotation
Morpho/Syntactic Processing
  – TnT: Tokenization, Segmentation, PoS-tagging
  – Mmorph: Lemmatization (German compound analysis)
  – Chunkie: Phrase Recognition
  – under development: Grammatical Function Tagging
Parallel Corpus
  – ~ 9000 English and German Medical Abstracts from 41 Journals (obtained through Springer LINK WebSite)
  – ~ 1 M Tokens for each Language
  – Manual Clean-Up
8
Tokenization, POS Tagging
Tokenization
  – Hyphenated Compounds, e.g.: side-effects, short-term, follow-up
  – Abbreviations, e.g.: aquos., emulsific., Ungt.
TnT PoS-Tagger (Brants, 2000)
  – Retrain on an annotated domain-specific corpus
  – Update underlying lexicon
    » Specialist Medical Lexicon UMLS (English), ZInfo (German)
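The two tokenization issues named on this slide can be sketched with a regular expression plus an abbreviation list. This is an illustrative assumption, not the tokenizer used in the project: the `ABBREVIATIONS` set only contains the three examples from the slide, and the function name `tokenize` is hypothetical.

```python
import re

# Abbreviations whose trailing period belongs to the token (examples from the slide).
ABBREVIATIONS = {"aquos.", "emulsific.", "ungt."}

# A word, optionally continued by hyphenated parts and an optional trailing period,
# or any single non-space punctuation character.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*\.?|[^\w\s]")

def tokenize(text):
    """Split text into tokens, keeping hyphenated compounds and known abbreviations whole."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tok = match.group(0)
        if len(tok) > 1 and tok.endswith(".") and tok.lower() not in ABBREVIATIONS:
            # Not a known abbreviation: the period is sentence punctuation.
            tokens.extend([tok[:-1], "."])
        else:
            tokens.append(tok)
    return tokens

if __name__ == "__main__":
    print(tokenize("Ungt. emulsific. aquos. had no side-effects at follow-up."))
```

A fixed abbreviation list is the simplest possible design; it will split unknown abbreviations and decimal numbers, which is one reason a domain-specific lexicon matters for this step.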
9
Morphology, Phrase Recognition
Mmorph
  – Dumped Full-Form Lexicon (domain independent)
  – Decomposition: Problematic for German, e.g.
    Schleimhautoedem > Schleimhaut+Oe+Dem
    » German Medical Specialist Lexicon
Chunkie
  – HMM-based Partial Parser (Skut and Brants, 2000)
  – Recognition of internal structure of simple as well as complex NPs, PPs and APs
  – Retraining needed on Annotated Medical Corpora
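The decomposition problem can be reproduced with a toy lexicon-driven splitter. This sketch is not Mmorph's algorithm; it assumes a simple strategy (cover the whole word with lexicon entries, prefer the fewest parts), and both word lists are hypothetical. It shows how a general lexicon yields the wrong split from the slide, while adding a single medical entry corrects it.

```python
from functools import lru_cache

def split_compound(word, lexicon):
    """Return the segmentation of word into lexicon entries with the fewest parts,
    or None if the word cannot be fully covered."""
    word = word.lower()

    @lru_cache(maxsize=None)
    def best(start):
        if start == len(word):
            return ()
        candidates = []
        for end in range(start + 1, len(word) + 1):
            part = word[start:end]
            if part in lexicon:
                rest = best(end)
                if rest is not None:
                    candidates.append((part,) + rest)
        return min(candidates, key=len) if candidates else None

    return best(0)

# Hypothetical domain-independent entries.
GENERAL_LEXICON = frozenset({"schleim", "haut", "schleimhaut", "oe", "dem"})
# The same lexicon extended with one medical entry.
MEDICAL_LEXICON = GENERAL_LEXICON | {"oedem"}

if __name__ == "__main__":
    print(split_compound("Schleimhautoedem", GENERAL_LEXICON))
    print(split_compound("Schleimhautoedem", MEDICAL_LEXICON))
```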
10
Grammatical Function Tagging
Untersucht wurden 30 Patienten, die sich einer elektiven aortokoronaren Bypassoperation unterziehen mussten.
(English: 30 patients who had to undergo elective aortocoronary bypass surgery were examined.)
  "Untersucht" PAS.SUBJ:SUBJ "Patienten"
  "unterziehen" ACT.SUBJ*OBJ*IOBJ:SUBJ "Patienten"
  "unterziehen" ACT.SUBJ*OBJ*IOBJ:OBJ "sich"
  "unterziehen" ACT.SUBJ*OBJ*IOBJ:IOBJ "Bypassoperation"
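The relation lines on this slide follow a regular pattern that can be read into structured records. The sketch below assumes one reading of the notation, which the slide does not spell out: the prefix (`PAS`, `ACT`) marks voice, the `*`-separated list is the verb's frame, and the label after the colon is the function filled by the quoted dependent. The field names are hypothetical.

```python
import re

# head, voice, frame, function, dependent: an assumed reading of the slide notation.
RELATION_RE = re.compile(
    r'"(?P<head>[^"]+)"\s+'
    r'(?P<voice>[A-Z]+)\.(?P<frame>[A-Z*]+):(?P<function>[A-Z]+)\s+'
    r'"(?P<dependent>[^"]+)"'
)

def parse_relation(line):
    """Parse one grammatical-function line into a dict; raise ValueError if malformed."""
    # Normalize typographic quotes so both quote styles are accepted.
    normalized = line.strip().replace("\u201c", '"').replace("\u201d", '"')
    match = RELATION_RE.fullmatch(normalized)
    if match is None:
        raise ValueError(f"unrecognized relation line: {line!r}")
    relation = match.groupdict()
    relation["frame"] = relation["frame"].split("*")
    return relation

if __name__ == "__main__":
    lines = [
        '"Untersucht" PAS.SUBJ:SUBJ "Patienten"',
        '"unterziehen" ACT.SUBJ*OBJ*IOBJ:SUBJ "Patienten"',
        '"unterziehen" ACT.SUBJ*OBJ*IOBJ:OBJ "sich"',
        '"unterziehen" ACT.SUBJ*OBJ*IOBJ:IOBJ "Bypassoperation"',
    ]
    for line in lines:
        print(parse_relation(line))
```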
11
XML Annotation Format (DTD)
document
  keywords
    keyword
  sentence
    ewnterms (ewnterm)
    terms (term)
    semrels (semrel)
    text (token)
    gramrels (gramrel)
    chunks (chunk)
  title
    ewnterms (ewnterm)
    terms (term)
    semrels (semrel)
    text (token)
    gramrels (gramrel)
    chunks (chunk)
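A rough sketch of how one sentence could be assembled in this layered format with Python's standard `xml.etree.ElementTree`. Only the element names and the `term` attributes come from the slides; the `id` attribute on `token`, the function name `build_sentence`, and the example tokens are assumptions made for illustration, and the remaining layers (`ewnterms`, `semrels`, `gramrels`, `chunks`) are left out.

```python
import xml.etree.ElementTree as ET

def build_sentence(parent, tokens, terms):
    """Append a <sentence> with a <terms> layer and a <text> layer to parent."""
    sentence = ET.SubElement(parent, "sentence")

    term_layer = ET.SubElement(sentence, "terms")
    for number, term in enumerate(terms, start=1):
        ET.SubElement(
            term_layer, "term",
            id=str(number),
            tokenid=", ".join(str(i) for i in term["tokenid"]),
            preferred=term["preferred"],
            cui=term["cui"],
            tui=term["tui"],
        )

    text_layer = ET.SubElement(sentence, "text")
    for number, word in enumerate(tokens, start=1):
        # The id attribute on <token> is assumed; term annotations point at it.
        token = ET.SubElement(text_layer, "token", id=str(number))
        token.text = word
    return sentence

if __name__ == "__main__":
    document = ET.Element("document")
    # Hypothetical token sequence; the term values are those shown on an earlier slide.
    tokens = ["Patients", "in", "the", "Intensive", "Care", "Unit"]
    terms = [{"tokenid": [4, 5, 6], "preferred": "Intensive Care Unit",
              "cui": "C0021708", "tui": "T073"}]
    build_sentence(document, tokens, terms)
    print(ET.tostring(document, encoding="unicode"))
```

Keeping each annotation layer in its own container and linking back to tokens by id is what lets the layers be produced by separate tools and added independently.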
12
XML Annotation (Example)
A 34-year-old HIV-infected African woman developed fever and weight loss on her trunk and arms.
</document>
13
Demo...