© Paul Buitelaar, February 2002
Corpus Annotation Day at DI

Multi-Layer Annotation for Cross-Lingual Information Retrieval in the Medical Domain

Paul Buitelaar
DFKI-Language Technology
Saarbrücken, Germany

Overview
  MuchMore Objectives
  Semantic Annotation
    Semantic Resources, Term/Relation Tagging
  Corpus Annotation
    Part-of-Speech, Morphology, Chunks
    Grammatical Functions
  Annotation Format (DTD), Examples, Demo

MuchMore Objectives
  Evaluation
    Systematic Comparison of CLIR Methods on a Realistic Scenario in the Medical Domain
    Establishing a Baseline with Corpus-Based Methods
    Comparison with Concept-Based Methods
  Concept-Based CLIR
    Effective Use of Medical and General Semantic Resources by Developing Methods for Tuning and Extension

Semantic Resources
  General
    WordNet (EN), GermaNet (DE), EuroWordNet (“linked”)
  Medical Domain
    UMLS: Unified Medical Language System
    Medical MetaThesaurus (MeSH, ICD, …)
      English, German, Spanish, …
      Concepts
      9 Relations (Broader, Narrower, …)
    Semantic Network
      134 Semantic Types
      54 Semantic Relations

UMLS
  C |ENG|P|L |PF|S |HIV|0|
  C |ENG|S|L |PF|S |HTLV-III|0|
  C |ENG|S|L |VS|S |Human Immunodeficiency Virus|0|
  C |ENG|S|L |VWS|S |Virus, Human Immunodeficiency|0|
  C |FIN|P|L |PF|S |HIV|3|
  C |FRE|P|L |PF|S |HIV|3|
  C |FRE|S|L |PF|S |VIRUS IMMUNODEFICIENCE HUMAINE|3|
  C |GER|P|L |PF|S |HIV|3|
  C |GER|S|L |PF|S |Humanes T-Zell-lymphotropes Virus Typ III|3|
  Concept Names (MRCON): 1,734,706
    ENGLISH 1,462,202
    GERMAN 66,381
    other languages
  Each CUI (Concept Unique Identifier) is mapped to one of 134 semantic types (TUI)
    Clozapine : C
    Pharmacologic Substance : T121
  Semantic Types are organized in a Network through 54 Relations
    T121|T154|T047
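
The pipe-delimited rows above can be read with a few lines of Python. This is a minimal sketch, assuming the eight MRCON columns are CUI, LAT, TS, LUI, STT, SUI, STR and LRL; the class and function names (`ConceptName`, `parse_mrcon_line`, `preferred_name`) are hypothetical. The numeric parts of the identifiers are elided on the slide, so the sample rows keep the bare `C`, `L` and `S` prefixes.

```python
from collections import defaultdict
from typing import NamedTuple


class ConceptName(NamedTuple):
    """One MRCON row (column names assumed, see lead-in)."""
    cui: str   # concept identifier (digits elided on the slide)
    lat: str   # language of the term, e.g. ENG, GER
    ts: str    # term status: P = preferred, S = synonym
    lui: str   # lexical identifier (elided)
    stt: str   # string type, e.g. PF = preferred form
    sui: str   # string identifier (elided)
    name: str  # the term string itself
    lrl: str   # restriction level


def parse_mrcon_line(line: str) -> ConceptName:
    fields = line.strip().split("|")
    if fields and fields[-1] == "":
        fields.pop()  # each row ends with a trailing pipe
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
    return ConceptName(*(f.strip() for f in fields))


def names_by_language(rows):
    """Group term strings by language code, keeping row order."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row.lat].append(row.name)
    return dict(grouped)


def preferred_name(rows, lang):
    """Return the preferred form for a language, or None if absent."""
    for row in rows:
        if row.lat == lang and row.ts == "P" and row.stt == "PF":
            return row.name
    return None


SAMPLE = """\
C |ENG|P|L |PF|S |HIV|0|
C |ENG|S|L |PF|S |HTLV-III|0|
C |FRE|P|L |PF|S |HIV|3|
C |GER|P|L |PF|S |HIV|3|
C |GER|S|L |PF|S |Humanes T-Zell-lymphotropes Virus Typ III|3|
"""

rows = [parse_mrcon_line(line) for line in SAMPLE.splitlines()]
print(names_by_language(rows)["GER"])
print(preferred_name(rows, "ENG"))
```

Grouping by language is what makes the resource usable for cross-lingual lookup: a German term and an English term that share a CUI name the same concept.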

Term / Relation Tagging
  Annotate Terms (of length 1-4 tokens) with Preferred Term, CUI and TUI
    <term id="13" tokenid="14, 15, 16" preferred="Intensive Care Unit" cui="C " tui="T073"/>
  Annotate All Possible Semantic Relations between Identified Terms within a Sentence
    <term id="2" tokenid="2" preferred="Heparinoid" cui="C " tui="T121"/>
    <term id="5" tokenid="6" preferred="Thrombin" cui="C " tui="T126"/>
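
The two tagging steps can be sketched as follows. This is an illustration under stated assumptions, not the MuchMore tagger: it uses a toy lexicon, greedy longest-match lookup over spans of 1-4 tokens, and a relation table holding only the triple T121|T154|T047 from the UMLS slide. The names `LEXICON`, `RELATIONS`, `tag_terms` and `tag_relations` are hypothetical, the CUIs are placeholders, and the T047 assignment for "HIV infection" is assumed for the sake of the example.

```python
from itertools import permutations

MAX_TERM_LENGTH = 4  # terms span 1-4 tokens

# Toy lexicon: lower-cased token tuple -> (preferred term, CUI, TUI).
# CUIs are placeholders; the slide elides the real identifiers.
LEXICON = {
    ("intensive", "care", "unit"): ("Intensive Care Unit", "C?", "T073"),
    ("heparinoid",): ("Heparinoid", "C?", "T121"),
    ("thrombin",): ("Thrombin", "C?", "T126"),
    ("hiv", "infection"): ("HIV Infection", "C?", "T047"),  # TUI assumed
}

# (source TUI, target TUI) -> relation TUI, from the triple T121|T154|T047.
RELATIONS = {("T121", "T047"): "T154"}


def tag_terms(tokens):
    """Greedy longest-match lookup; token ids are 1-based."""
    terms = []
    i = 0
    while i < len(tokens):
        longest = min(MAX_TERM_LENGTH, len(tokens) - i)
        for n in range(longest, 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in LEXICON:
                preferred, cui, tui = LEXICON[key]
                terms.append({
                    "id": len(terms) + 1,
                    "tokenid": list(range(i + 1, i + n + 1)),
                    "preferred": preferred,
                    "cui": cui,
                    "tui": tui,
                })
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return terms


def tag_relations(terms):
    """Check every ordered pair of terms against the relation table."""
    found = []
    for a, b in permutations(terms, 2):
        rel = RELATIONS.get((a["tui"], b["tui"]))
        if rel is not None:
            found.append((a["id"], rel, b["id"]))
    return found


sentence = "Heparinoid was given for HIV infection in the intensive care unit"
terms = tag_terms(sentence.split())
for term in terms:
    print(term)
print(tag_relations(terms))
```

Checking ordered pairs matters because semantic relations are directed: the table licenses a relation from T121 to T047, not the reverse.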

Corpus Annotation
  Morpho/Syntactic Processing
    – TnT: Tokenization, Segmentation, PoS-tagging
    – Mmorph: Lemmatization (German compound analysis)
    – Chunkie: Phrase Recognition
    – under development: Grammatical Function Tagging
  Parallel Corpus
    – ~ 9000 English and German Medical Abstracts from 41 Journals (obtained through Springer LINK WebSite)
    – ~ 1 M Tokens for each Language
    – Manual Clean-Up

Tokenization, POS Tagging
  Tokenization
    Hyphenated Compounds, e.g.: side-effects, short-term, follow-up
    Abbreviations, e.g.: aquos., emulsific., Ungt.
  TnT PoS-Tagger (Brants, 2000)
    Retrain on an annotated domain-specific corpus
    Update underlying lexicon
      » Specialist Medical Lexicon: UMLS (English), ZInfo (German)
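
The two tokenization issues named above can be illustrated with a small rule-based tokenizer. This is a sketch only and not the tokenizer used with TnT: it assumes a hand-made abbreviation list (`ABBREVIATIONS`, hypothetical) and a regular expression that keeps hyphen-joined words together.

```python
import re

# Domain abbreviations whose final period belongs to the token
# (list taken from the slide's examples; a real list would be larger).
ABBREVIATIONS = {"aquos.", "emulsific.", "ungt."}

# A word with optional hyphen-joined parts, or one punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def tokenize(text):
    tokens = []
    for chunk in text.split():
        if chunk.lower() in ABBREVIATIONS:
            tokens.append(chunk)  # keep the abbreviation with its period
        else:
            tokens.extend(TOKEN_RE.findall(chunk))
    return tokens


print(tokenize("Ungt. emulsific. aquos. showed no side-effects at follow-up."))
```

Without the abbreviation list, the period after "Ungt" would be split off and could be mistaken for a sentence boundary during segmentation.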

Morphology, Phrase Recognition
  Mmorph
    Dumped Full-Form Lexicon (domain independent)
    Decomposition: Problematic for German, e.g.
      – Schleimhautoedem > Schleimhaut+Oe+Dem
        » German Medical Specialist Lexicon
  Chunkie
    HMM-based Partial Parser (Skut and Brants, 2000)
    Recognition of internal structure of simple as well as complex NPs, PPs and APs
    Retraining needed on Annotated Medical Corpora
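
The decomposition problem can be reproduced with a simple lexicon-driven splitter. This is a minimal sketch, not Mmorph's algorithm: it enumerates segmentations over a toy lexicon and prefers the one with the fewest parts. The lexicon contents and the function name `split_compound` are assumptions for illustration.

```python
from functools import lru_cache


def split_compound(word, lexicon):
    """Return the segmentation with the fewest lexicon parts, or None."""
    word = word.lower()
    entries = frozenset(entry.lower() for entry in lexicon)

    @lru_cache(maxsize=None)
    def best(start):
        if start == len(word):
            return ()  # reached the end: nothing left to split
        candidates = []
        for end in range(start + 1, len(word) + 1):
            part = word[start:end]
            if part in entries:
                rest = best(end)
                if rest is not None:
                    candidates.append((part,) + rest)
        return min(candidates, key=len) if candidates else None

    result = best(0)
    return list(result) if result is not None else None


# Toy domain-independent lexicon: "Oedem" is missing.
GENERAL = {"Schleimhaut", "Schleim", "Haut", "Oe", "Dem"}
# Adding a (toy) medical lexicon entry repairs the analysis.
MEDICAL = GENERAL | {"Oedem"}

print(split_compound("Schleimhautoedem", GENERAL))
print(split_compound("Schleimhautoedem", MEDICAL))
```

With the general lexicon alone the best available analysis is the wrong three-part split from the slide; once the medical entry is present, the two-part analysis wins because it has fewer parts.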

Grammatical Function Tagging
  Untersucht wurden 30 Patienten, die sich einer elektiven aortokoronaren Bypassoperation unterziehen mussten.
  (English: 30 patients who had to undergo elective aortocoronary bypass surgery were examined.)
    "Untersucht"   PAS.SUBJ:SUBJ            "Patienten"
    "unterziehen"  ACT.SUBJ*OBJ*IOBJ:SUBJ   "Patienten"
    "unterziehen"  ACT.SUBJ*OBJ*IOBJ:OBJ    "sich"
    "unterziehen"  ACT.SUBJ*OBJ*IOBJ:IOBJ   "Bypassoperation"

XML Annotation Format (DTD)
  document
    keywords
      keyword
    sentence
      ewnterms   > ewnterm
      terms      > term
      semrels    > semrel
      text       > token
      gramrels   > gramrel
      chunks     > chunk
    title
      ewnterms   > ewnterm
      terms      > term
      semrels    > semrel
      text       > token
      gramrels   > gramrel
      chunks     > chunk
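
A document following this element hierarchy can be assembled with Python's standard `xml.etree.ElementTree`. This is a sketch under assumptions: the element names come from the DTD slide and the `term` attributes from the Term / Relation Tagging slide, while the `id` attributes on `sentence` and `token`, the child order, and the function name `build_document` are guesses made for illustration. The CUI is left as a bare `C` because its digits are elided on the slides.

```python
import xml.etree.ElementTree as ET


def build_document(tokens, terms):
    """Build document > sentence > (terms, text) for one sentence."""
    document = ET.Element("document")
    sentence = ET.SubElement(document, "sentence", id="1")  # id is assumed

    terms_el = ET.SubElement(sentence, "terms")
    for term in terms:
        ET.SubElement(terms_el, "term", {
            "id": str(term["id"]),
            "tokenid": ", ".join(str(i) for i in term["tokenid"]),
            "preferred": term["preferred"],
            "cui": term["cui"],
            "tui": term["tui"],
        })

    text = ET.SubElement(sentence, "text")
    for index, token in enumerate(tokens, start=1):
        token_el = ET.SubElement(text, "token", id=str(index))  # id is assumed
        token_el.text = token

    return document


tokens = ["Admission", "to", "the", "intensive", "care", "unit"]
terms = [{
    "id": 1,
    "tokenid": [4, 5, 6],
    "preferred": "Intensive Care Unit",
    "cui": "C",      # digits elided on the slide
    "tui": "T073",
}]

doc = build_document(tokens, terms)
print(ET.tostring(doc, encoding="unicode"))
```

Keeping tokens in one layer and letting the other layers point at them through `tokenid` is what allows several annotation layers to coexist without nesting conflicts.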

XML Annotation (Example)
  A 34-year-old HIV-infected African woman developed fever and weight loss on her trunk and arms.
  </document>

Demo...