
1 © Paul Buitelaar, February 2002 Corpus Annotation Day at DI
Multi-Layer Annotation for Cross-Lingual Information Retrieval in the Medical Domain
Paul Buitelaar
DFKI-Language Technology, Saarbrücken, Germany

2 Overview
- MuchMore Objectives
- Semantic Annotation
  - Semantic Resources, Term/Relation Tagging
- Corpus Annotation
  - Part-of-Speech, Morphology, Chunks
  - Grammatical Functions
- Annotation Format (DTD), Examples, Demo

3 MuchMore Objectives
- Evaluation: Systematic Comparison of CLIR Methods on a Realistic Scenario in the Medical Domain
  - Establishing a Baseline with Corpus-Based Methods
  - Comparison with Concept-Based Methods
- Concept-Based CLIR: Effective Use of Medical and General Semantic Resources by Developing Methods for Tuning and Extension

4 Semantic Resources
- General
  - WordNet (EN), GermaNet (DE), EuroWordNet ("linked")
- Medical Domain
  - UMLS: Unified Medical Language System
    - Medical MetaThesaurus (MeSH, ICD, …)
      - English, German, Spanish, …
      - 730,000 Concepts
      - 9 Relations (Broader, Narrower, …)
    - Semantic Network
      - 134 Semantic Types
      - 54 Semantic Relations

5 UMLS
C0019682|ENG|P|L0019682|PF|S0048631|HIV|0|
C0019682|ENG|S|L0020103|PF|S0049688|HTLV-III|0|
C0019682|ENG|S|L0020128|VS|S0049756|Human Immunodeficiency Virus|0|
C0019682|ENG|S|L0020128|VWS|S0098727|Virus, Human Immunodeficiency|0|
C0019682|FIN|P|L1523437|PF|S1819346|HIV|3|
C0019682|FRE|P|L0168651|PF|S0233132|HIV|3|
C0019682|FRE|S|L0206547|PF|S0277133|VIRUS IMMUNODEFICIENCE HUMAINE|3|
C0019682|GER|P|L0413854|PF|S0538136|HIV|3|
C0019682|GER|S|L1261793|PF|S1503739|Humanes T-Zell-lymphotropes Virus Typ III|3|
Concept Names (MRCON): 1,734,706
- ENGLISH: 1,462,202
- GERMAN: 66,381
- other languages
Each CUI (Concept Unique Identifier) is mapped to one of 134 semantic types (TUI)
- Clozapine : C0009079 -> Pharmacologic Substance : T121
Semantic Types are organized in a Network through 54 Relations
- T121|T154|T047
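The pipe-delimited concept-name rows on this slide can be read with a few lines of Python. This is a minimal sketch, not part of the original presentation: the column names (CUI, LAT, TS, LUI, STT, SUI, STR, LRL) are an assumption based on the legacy MRCON layout and should be checked against the UMLS documentation, and `ConceptName`, `parse_mrcon_line` and `names_by_language` are hypothetical helper names.

```python
from collections import defaultdict
from typing import NamedTuple


class ConceptName(NamedTuple):
    """One MRCON row; field meanings are assumed, not taken from the slides."""
    cui: str   # Concept Unique Identifier
    lat: str   # language of the string (ENG, GER, ...)
    ts: str    # term status (P or S in the rows above)
    lui: str   # term identifier
    stt: str   # string type (PF, VS, VWS in the rows above)
    sui: str   # string identifier
    text: str  # the concept name itself
    lrl: str   # last numeric column (assumed to be a restriction level)


def parse_mrcon_line(line: str) -> ConceptName:
    """Split one row on '|'; the trailing '|' yields an empty last field."""
    fields = line.rstrip("\n").split("|")
    return ConceptName(*fields[:8])


def names_by_language(lines):
    """Group concept names by (CUI, language), skipping blank lines."""
    grouped = defaultdict(list)
    for line in lines:
        if line.strip():
            row = parse_mrcon_line(line)
            grouped[(row.cui, row.lat)].append(row.text)
    return dict(grouped)


# Three of the rows shown on the slide.
ROWS = [
    "C0019682|ENG|P|L0019682|PF|S0048631|HIV|0|",
    "C0019682|GER|P|L0413854|PF|S0538136|HIV|3|",
    "C0019682|GER|S|L1261793|PF|S1503739|Humanes T-Zell-lymphotropes Virus Typ III|3|",
]

NAMES = names_by_language(ROWS)
print(NAMES[("C0019682", "GER")])
```

Grouping by (CUI, language) is what makes the resource usable for cross-lingual lookup: the same concept identifier gives access to its English and German names.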

6 Term / Relation Tagging
- Annotate Terms (of length 1-4 tokens) with Preferred Term, CUI and TUI
  <term id="13" tokenid="14, 15, 16" preferred="Intensive Care Unit" cui="C0021708" tui="T073"/>
- Annotate All Possible Semantic Relations between Identified Terms within a Sentence
  <term id="2" tokenid="2" preferred="Heparinoid" cui="C0019142" tui="T121"/>
  <term id="5" tokenid="6" preferred="Thrombin" cui="C0040018" tui="T126"/>
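The two annotation steps on this slide can be sketched as a dictionary lookup over token n-grams followed by a pairwise relation check. This is only an illustration under stated assumptions: the tiny `LEXICON` reuses the three terms from the slide, the sentence is made up, the relation name in `SEMANTIC_RELATIONS` is a placeholder rather than an actual UMLS relation, and the real MuchMore tagger may match terms differently (for example via lemmas).

```python
from itertools import permutations

MAX_TERM_LENGTH = 4  # terms of length 1-4 tokens, as stated on the slide

# Toy lexicon: lowercased term string -> (preferred term, CUI, TUI).
LEXICON = {
    "intensive care unit": ("Intensive Care Unit", "C0021708", "T073"),
    "heparinoid": ("Heparinoid", "C0019142", "T121"),
    "thrombin": ("Thrombin", "C0040018", "T126"),
}

# Hypothetical relation table: (TUI of first term, TUI of second term) -> name.
SEMANTIC_RELATIONS = {("T121", "T126"): "hypothetical_relation"}


def tag_terms(tokens):
    """Annotate every 1-4 token span found in the lexicon (token ids are 1-based)."""
    terms = []
    for start in range(len(tokens)):
        for length in range(1, MAX_TERM_LENGTH + 1):
            end = start + length
            if end > len(tokens):
                break
            key = " ".join(tokens[start:end]).lower()
            if key in LEXICON:
                preferred, cui, tui = LEXICON[key]
                terms.append({
                    "id": len(terms) + 1,
                    "tokenid": list(range(start + 1, end + 1)),
                    "preferred": preferred,
                    "cui": cui,
                    "tui": tui,
                })
    return terms


def tag_relations(terms):
    """Return all relations licensed by the TUI pair of two terms in a sentence."""
    relations = []
    for first, second in permutations(terms, 2):
        name = SEMANTIC_RELATIONS.get((first["tui"], second["tui"]))
        if name is not None:
            relations.append((first["id"], name, second["id"]))
    return relations


# Made-up sentence for illustration only.
SENTENCE = "The heparinoid inhibited thrombin in the intensive care unit".split()
TERMS = tag_terms(SENTENCE)
RELATIONS = tag_relations(TERMS)
print(TERMS)
print(RELATIONS)
```

Because every ordered pair of terms is checked, the sketch over-generates in the same way the slide describes: it proposes all possible relations and leaves disambiguation to a later step.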

7 Corpus Annotation
- Morpho/Syntactic Processing
  - TnT: Tokenization, Segmentation, PoS-tagging
  - Mmorph: Lemmatization (German compound analysis)
  - Chunkie: Phrase Recognition
  - under development: Grammatical Function Tagging
- Parallel Corpus
  - ~9000 English and German Medical Abstracts from 41 Journals (obtained through Springer LINK WebSite)
  - ~1 M Tokens for each Language
  - Manual Clean-Up

8 Tokenization, POS Tagging
- Tokenization
  - Hyphenated Compounds, e.g.: side-effects, short-term, follow-up
  - Abbreviations, e.g.: aquos., emulsific., Ungt.
- TnT PoS-Tagger (Brants, 2000)
  - Retrain on an annotated domain-specific corpus
  - Update underlying lexicon
    » Specialist Medical Lexicon: UMLS (English), ZInfo (German)
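The two tokenization issues named on this slide (hyphenated compounds and abbreviations with a trailing period) can be illustrated with a small regular-expression tokenizer. This is a rough sketch, not the tokenizer used in the project: the abbreviation list contains only the three examples from the slide, and the sample sentence is invented.

```python
import re

# Abbreviations from the slide, lowercased; a real list would be much longer.
ABBREVIATIONS = {"aquos.", "emulsific.", "ungt."}

# A word, optionally with internal hyphens and a trailing period,
# or any single non-space, non-word character (punctuation).
TOKEN_RE = re.compile(r"\w+(?:-\w+)*\.?|[^\w\s]")


def tokenize(text):
    """Keep hyphenated compounds whole; split a final period unless the
    token is a known abbreviation."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        token = match.group()
        is_abbreviation = token.lower() in ABBREVIATIONS
        if token.endswith(".") and len(token) > 1 and not is_abbreviation:
            tokens.extend([token[:-1], "."])
        else:
            tokens.append(token)
    return tokens


# Invented sample sentence.
SAMPLE = "Ungt. emulsific. aquos. had no side-effects at follow-up."
print(tokenize(SAMPLE))
```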

9 Morphology, Phrase Recognition
- Mmorph
  - Dumped Full-Form Lexicon (domain independent)
  - Decomposition: Problematic for German, e.g.
    - Schleimhautoedem > Schleimhaut+Oe+Dem
      » German Medical Specialist Lexicon
- Chunkie
  - HMM-based Partial Parser (Skut and Brants, 2000)
  - Recognition of internal structure of simple as well as complex NPs, PPs and APs
  - Retraining needed on Annotated Medical Corpora
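The decomposition problem on this slide can be reproduced with a toy lexicon-driven splitter. The sketch below is not Mmorph; it is a hypothetical fewest-parts segmenter, and both lexicons are invented for illustration. It shows why a domain-independent lexicon yields Schleimhaut+Oe+Dem and how adding a medical entry ("oedem") repairs the analysis.

```python
def segmentations(word, lexicon):
    """Enumerate every way to split `word` into a sequence of lexicon entries."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in lexicon:
            for rest in segmentations(word[end:], lexicon):
                results.append([head] + rest)
    return results


def decompose(word, lexicon):
    """Pick the segmentation with the fewest parts; fall back to the whole word."""
    candidates = segmentations(word.lower(), lexicon)
    if not candidates:
        return [word.lower()]
    return min(candidates, key=len)


# Invented toy lexicons: a general one, and the same one plus a medical entry.
GENERAL_LEXICON = {"schleim", "haut", "schleimhaut", "oe", "dem"}
MEDICAL_LEXICON = GENERAL_LEXICON | {"oedem"}

print(decompose("Schleimhautoedem", GENERAL_LEXICON))
print(decompose("Schleimhautoedem", MEDICAL_LEXICON))
```

The fewest-parts heuristic is one simple choice among several; the point is only that the quality of the split depends on whether the lexicon knows the domain term.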

10 Grammatical Function Tagging
Untersucht wurden 30 Patienten, die sich einer elektiven aortokoronaren Bypassoperation unterziehen mussten.
(English: "30 patients who had to undergo an elective aortocoronary bypass operation were examined.")
"Untersucht" PAS.SUBJ:SUBJ "Patienten"
"unterziehen" ACT.SUBJ*OBJ*IOBJ:SUBJ "Patienten"
"unterziehen" ACT.SUBJ*OBJ*IOBJ:OBJ "sich"
"unterziehen" ACT.SUBJ*OBJ*IOBJ:IOBJ "Bypassoperation"
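Each grammatical-function line on this slide appears to follow the pattern: head word, then a label of the form VOICE.FRAME:FUNCTION, then the dependent word. A small parser makes that structure explicit. The reading of the label (PAS/ACT as voice, the `*`-separated list as the subcategorization frame) is an interpretation of the slide, not a documented format, and the lines are assumed to use straight double quotes.

```python
import re
from typing import NamedTuple, Tuple


class GramRel(NamedTuple):
    head: str                # governing word, e.g. "unterziehen"
    voice: str               # PAS or ACT (assumed to mean passive / active)
    frame: Tuple[str, ...]   # functions in the frame, e.g. (SUBJ, OBJ, IOBJ)
    function: str            # function filled by the dependent
    dependent: str           # dependent word, e.g. "Patienten"


LINE_RE = re.compile(
    r'"(?P<head>[^"]+)"\s+'
    r'(?P<voice>[A-Z]+)\.(?P<frame>[A-Z*]+):(?P<function>[A-Z]+)\s+'
    r'"(?P<dependent>[^"]+)"'
)


def parse_gramrel(line):
    """Parse one annotation line; raise ValueError if it does not fit the pattern."""
    match = LINE_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized grammatical-function line: {line!r}")
    return GramRel(
        head=match["head"],
        voice=match["voice"],
        frame=tuple(match["frame"].split("*")),
        function=match["function"],
        dependent=match["dependent"],
    )


LINES = [
    '"Untersucht" PAS.SUBJ:SUBJ "Patienten"',
    '"unterziehen" ACT.SUBJ*OBJ*IOBJ:SUBJ "Patienten"',
    '"unterziehen" ACT.SUBJ*OBJ*IOBJ:OBJ "sich"',
    '"unterziehen" ACT.SUBJ*OBJ*IOBJ:IOBJ "Bypassoperation"',
]

RELATIONS = [parse_gramrel(line) for line in LINES]
for relation in RELATIONS:
    print(relation)
```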

11 XML Annotation Format (DTD)
document
  keywords
    keyword
  sentence
    ewnterms > ewnterm
    terms > term
    semrels > semrel
    text > token
    gramrels > gramrel
    chunks > chunk
  title
    ewnterms > ewnterm
    terms > term
    semrels > semrel
    text > token
    gramrels > gramrel
    chunks > chunk

12 XML Annotation (Example)
[XML markup not preserved in this transcript; only the token text of the annotated document survives]
A 34-year-old HIV-infected African woman developed fever and weight loss on her trunk and arms .
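Since the markup of this example did not survive, the following sketch shows how the token layer of such a sentence could be built with Python's standard `xml.etree.ElementTree`. The element names (document, sentence, text, token, and the container elements for the other layers) are taken from the annotation format slide; the attribute name `id`, its values, and the nesting shown here are assumptions, not the project's actual DTD.

```python
import xml.etree.ElementTree as ET

# Token text of the example sentence as it survives in the transcript.
TOKENS = [
    "A", "34-year-old", "HIV-infected", "African", "woman", "developed",
    "fever", "and", "weight", "loss", "on", "her", "trunk", "and", "arms", ".",
]

# Annotation layers per sentence, in the order listed on the format slide.
LAYERS = ["ewnterms", "terms", "semrels", "text", "gramrels", "chunks"]


def build_document(tokens):
    """Build a one-sentence document with a filled token layer and
    empty containers for the remaining annotation layers."""
    document = ET.Element("document")
    sentence = ET.SubElement(document, "sentence", id="1")  # 'id' is assumed
    for layer in LAYERS:
        container = ET.SubElement(sentence, layer)
        if layer == "text":
            for index, word in enumerate(tokens, start=1):
                token = ET.SubElement(container, "token", id=str(index))
                token.text = word
    return document


DOCUMENT = build_document(TOKENS)
print(ET.tostring(DOCUMENT, encoding="unicode"))
```

Keeping each layer in its own container means term, relation and chunk annotations can refer to tokens by id without touching the token layer itself, which is the point of a multi-layer format.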

13 Demo...

