1
April 19th, 2002, MuchMore Project Review
MUCHMORE: Multilingual Concept Hierarchies for Medical Information Organization and Retrieval
2
Project Overview
- Application: Addressing a Real-Life Medical Scenario for Cross-Lingual Information Retrieval
- Research & Development: Developing Novel, Hybrid (Corpus-/Concept-Based) Methods for Handling this Scenario
- Evaluation: Evaluating the Technical Performance of (Combinations of) Existing and Novel Methods
3
User Perspective (ZInfo)
Vision: BAIK Model
MuchMore: Provide Relevant Medical Information
- … for a Specific Patient Problem
- … Automatically, from the Web
- … Independent of Language
4
User Perspective (ZInfo)
User Requirements
- Automatic Query Generation (and Expansion), Identifying the Exact Problem of the Patient
- Retrieval and Relevance Ranking of Evidence-Based Medical Literature, Language Independent
- Summarization and Filtering of Results According to a User Profile
5
User Perspective (ZInfo)
User Evaluation
- Use for Medical Cases
- Part of Postgraduate Course in Medical Informatics
Evaluate Usefulness
- Query Generation
- Relevance for Decisions in Diagnostics and Treatment
Problematic Issues
- Different medical profiles, schools, experience, speciality
- What is relevant for one user may mean little or nothing to another
- Evidence-based medicine criteria exist only for a small fraction of medicine
6
MuchMore Prototype
- Overview of Prototype Functionality
- Relation between Functionality and User Requirements
- Issues Addressed by Research and Development within MuchMore
7
R&D in MuchMore
Semantic Annotation Based CLIR
- Corpus Annotation (DFKI, ZInfo): PoS, Morphology, Phrases, Grammatical Functions; Term and Relation Tagging
- Term Extraction (XRCE, EIT, CMU, CSLI): Bilingual Lexicon Extraction, Extension of Semantic Resources
- Relation Extraction (DFKI, CSLI): Grammatical Function Tagging; Extracting Semantic Relation Indicators; Extracting Novel Semantic Relations
- Sense Disambiguation (CSLI, DFKI): Tuning and Extension of Semantic Resources; Combining Sense Disambiguation Methods
- Semantic Indexing/Retrieval (EIT, DFKI)
8
R&D in MuchMore
Additional Approaches in CLIR
Corpus Based CLIR
- Bilingual Lexicon Extraction (XRCE, EIT, CMU, CSLI)
- Pseudo Relevance Feedback: PRF (CMU)
- Generalized Vector Space Model: GVSM (CMU)
Summarization (CMU)
- Query, Genre Specific
Text Classification Based CLIR (CMU)
- Hierarchical/Flat kNN with MeSH
9
Corpus Annotation
Annotation Evaluation
Corpus
- ~9000 English and German Medical Abstracts from 41 Journals, Springer LINK Web Site, ~1M Tokens for each Language
PoS
- Lexicon Update, Remaining Error Rate ~1.5% (EN)
- Example: "Histologically, we found a subepidermal blister formation and a predominantly neutrophilic infiltrate." pos=VB > pos_correct=NN
Term and Relation Tagging
- Evaluation of 8 DE/EN Parallel Abstracts, Relevant for a Query
Morphology

| Test set | German Nouns | MMorph | Recall | Incorrect | Error-Rate |
|---|---|---|---|---|---|
| test-dvlp | 889 | 617 | 69.40% | 42 | 6.81% |
| test-final | 989 | 683 | 69.06% | 79 | 11.57% |

- Incorrect, e.g.: Chorionzottenbiopsie > Chor + Ion + Zotte + Biopsie
10
Term Extraction
XRCE (Aims and Resources)
Aim
- Bilingual Lexicon Extraction: From Comparable Corpora at Word Level; From Parallel Corpora at Word and Term (Multi-Word) Level
- Bilingual Extension of Semantic Resource (MeSH)
Examples (German > English)
- verbesserter transabdomineller Techniken > improved transabdominal techniques
- Prognose des Frühcarcinoms > prognosis of early gastric cancer
- Verletzungen des Gehirns > intracranial injuries
- Lebensqualitaet > quality of life
Resources
- Optimal Combination of Existing Resources (Corpus, General Dictionary, Thesaurus: MeSH)
- Corpus Specific German Decompounding (Improves Recall by 25% at Equal Precision)
11
Term Extraction
XRCE (Results of Best Method)
Optimal Combination of Resources (Retaining only 10 best Translations for each Candidate)
- 1. word-to-word, comparable corpora: F1 = 0.84
- 2.a word-to-word, parallel corpora: F1 = 0.98
- 2.b term-to-term, parallel corpora: F1 = 0.85
- Evaluating Separately with Individual Resources (F1): Corpus: 0.62; MeSH: 0.51; General Dictionary: 0.56
- 3. MeSH Extension: 1453 new multi-word terms added (synonyms or new term entries), extracted from the Springer corpus
12
Term Extraction
EIT (Similarity Thesauri)
Method
- Extract Most Frequent Terms (Single Word) by Comparison of Term Frequencies in a General Corpus (German: SDA, English: LA Times) vs. Medical Corpus
Results
- Single Word Terms (Springer Abstracts): German-English: 104,904 / English-German: 49,454
- Multiword Terms (Phrase Lexicon Generated from ICD10): German Phrases: 354 / English Phrases: 665
- Bilingual Phrasal Entries Generated: German-English: 225 / English-German: 246
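The frequency comparison named in the EIT method can be illustrated with a minimal sketch. The slide does not say how the two frequency lists are compared, so the relative-frequency ratio, the add-one smoothing, and the names `domain_terms`, `min_ratio`, and `min_count` are all assumptions made for illustration.

```python
from collections import Counter

def domain_terms(medical_tokens, general_tokens, min_ratio=2.0, min_count=2):
    """Rank words that are relatively more frequent in the medical corpus.

    Assumption: a simple ratio of relative frequencies with add-one
    smoothing on the general corpus; the actual EIT criterion is not
    given on the slide.
    """
    med = Counter(medical_tokens)
    gen = Counter(general_tokens)
    med_total = sum(med.values())
    gen_total = sum(gen.values())
    vocab_size = len(set(med) | set(gen))

    scored = []
    for word, count in med.items():
        if count < min_count:
            continue  # skip words that are rare in the medical corpus
        med_rel = count / med_total
        gen_rel = (gen[word] + 1) / (gen_total + vocab_size)  # smoothed
        ratio = med_rel / gen_rel
        if ratio >= min_ratio:
            scored.append((word, ratio))
    # Highest ratio first; ties broken alphabetically for determinism.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

# Toy corpora (invented for the example).
medical = "the patient had a tumor the tumor was benign the patient recovered".split()
general = "the weather was fine the game was long the patient crowd waited".split()
terms = domain_terms(medical, general)
```

On the toy data, the function word "the" is filtered out because it is about as frequent in both corpora, while "tumor" ranks first.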
13
Term Extraction
CMU (EBT Bilingual Lexicon)
Method
- For each word in one language, accumulate counts of the number of times the translations of the sentences containing that word include each word of the other language.
- These co-occurrence counts may be restricted using word-alignment techniques.
- Apply a variable threshold to filter out uncommon co-occurrences which are unlikely to be translations.
- The result is a lexicon listing candidate translations and their relative frequencies.
Results
- ~99,000 Bilingual Term Pairs (PubMed Parallel Abstracts) (Estimated Error Rate: < 10%)
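The counting and filtering steps of the CMU method can be sketched roughly as follows. This is an illustration under assumptions: whitespace tokenization, a fixed relative threshold in place of the slide's "variable threshold", and no word-alignment restriction. The function name `extract_lexicon` and the toy sentence pairs are hypothetical.

```python
from collections import Counter, defaultdict

def extract_lexicon(sentence_pairs, threshold=0.6):
    """Build a candidate translation lexicon from aligned sentence pairs.

    For each source word, count how often each target word appears in the
    translations of the sentences containing it, then keep only target
    words whose relative frequency reaches `threshold` (an assumed, fixed
    cut-off).
    """
    cooc = defaultdict(Counter)   # source word -> target word -> count
    src_sentences = Counter()     # source word -> number of sentences with it

    for src, tgt in sentence_pairs:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        for s in src_words:
            src_sentences[s] += 1
            for t in tgt_words:
                cooc[s][t] += 1

    lexicon = {}
    for s, counts in cooc.items():
        candidates = {}
        for t, c in counts.items():
            rel_freq = c / src_sentences[s]
            if rel_freq >= threshold:
                candidates[t] = rel_freq
        lexicon[s] = candidates
    return lexicon

# Toy parallel data (invented for the example).
pairs = [
    ("die Niere", "the kidney"),
    ("die Leber", "the liver"),
    ("Niere und Leber", "kidney and liver"),
]
lexicon = extract_lexicon(pairs, threshold=1.0)
```

With so little data, a word seen in only one sentence (here "und") keeps every word of that sentence as a candidate, which is the kind of noise the alignment restriction and thresholding on the slide are meant to reduce.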
14
Term Extraction
CSLI (Infomap System)
- Represent English and German Words as Vectors that are Produced by Recording the Number of Co-Occurrences of the Word in Question with each of a Set of Content-Bearing Words.
- Use (Cosine) Similarity Measure on these Rows to Find "Nearest Neighbours".
[Figure: co-occurrence matrix with 1,000 (English) content-bearing words as columns; rows for English words (e.g. ligament, knee joint) and German words (e.g. Kreuzband, Kniegelenk)]

| Term (EN) | SIM | Term (DE) | SIM |
|---|---|---|---|
| bone | 1.00 | knochen | 0.82 |
| cancellous | 0.70 | knochens | 0.71 |
| osteoinductive | 0.67 | knochenneubildung | 0.67 |
| demineralized | 0.65 | spongiosa | 0.64 |
| trabeculae | 0.64 | knochenresorption | 0.60 |
| formation | 0.60 | allogenen | 0.60 |
| periosteum | 0.56 | knöcherne | 0.59 |
| … | … | … | … |
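A minimal sketch of the vector construction and cosine lookup described on this slide, not the Infomap implementation itself. It assumes a parallel corpus in which a German word "co-occurs" with the English content-bearing words of the aligned English document. The function names, the `(language, word)` keys, and the toy documents are hypothetical.

```python
import math
from collections import defaultdict

def build_vectors(parallel_docs, content_words):
    """Map (language, word) to a vector of co-occurrence counts.

    Assumption: every word of an aligned document pair is counted against
    the English content-bearing words found in the English side.
    """
    index = {w: i for i, w in enumerate(content_words)}
    vectors = defaultdict(lambda: [0] * len(content_words))
    for en_text, de_text in parallel_docs:
        en_tokens = en_text.lower().split()
        de_tokens = de_text.lower().split()
        context = [index[t] for t in en_tokens if t in index]
        keys = {("en", t) for t in en_tokens} | {("de", t) for t in de_tokens}
        for key in keys:
            for i in context:
                vectors[key][i] += 1
    return dict(vectors)

def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbours(key, vectors, lang, k=3):
    """Return the k words of `lang` most similar to `key`."""
    target = vectors[key]
    scored = [
        (cosine(target, vec), word)
        for (l, word), vec in vectors.items()
        if l == lang and (l, word) != key
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(word, sim) for sim, word in scored[:k]]

# Toy parallel documents (invented for the example).
docs = [
    ("knee ligament injury", "kreuzband verletzung kniegelenk"),
    ("ligament rupture knee", "kreuzband ruptur"),
    ("bone fracture", "knochen fraktur"),
    ("bone formation", "knochen neubildung"),
]
vectors = build_vectors(docs, content_words=["knee", "ligament", "bone"])
neighbours = nearest_neighbours(("en", "bone"), vectors, lang="de", k=3)
```

On realistic data the vectors would have on the order of 1,000 dimensions, as on the slide, and a sparse matrix library would be a more practical choice than Python lists.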
15
WSD: Terms, Senses
Semantic Resource Extension and Tuning
Tuning (CSLI, DFKI)
- Aligning Clusters with Senses
  C0043210|GER|P|L1254343|PF|S1496289|Frauen|3|
  C0043210|ENG|P|L1189496|PF|S1423265|Human adult females|0|
Extension (DFKI)
- Morphological Analysis (Decomposition): Entzündungsgewebe (infection tissue) HYPONYM Gewebe, Körpergewebe (body tissue); Gewebe, Stoff, Textilstoff (textile)
- Semantic Similarity (Co-Occurrence Patterns): Karzinom (carcinoma), Metastase (metastasis) SYNONYM Geschwulst, Tumor, ....
16
WSD: Algorithm
Combination of Methods (Task, Domain, General)
Bilingual Sense Selection (CSLI)
- 1 Sense in L1 vs. >1 Sense in L2
- English: blood vessel (C0005847) vs. vessel (polysaccharide) (C0148346)
- German: Blutgefaesse = blood vessel (C0005847)
Collocations and Senses (CSLI)
- For an ambiguous single word term that is part of several unambiguous multiword terms, choose the sense of the most frequent multiword term.
- single word term: abortion
  1) a natural process C0000786 (T047)
  2) a medical procedure C0000811 (T061)
- multiword term: recurrent abortion C0000809 (T047) => sense 1; induced abortion C0000811 (T061) => sense 2
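The collocation heuristic can be sketched as below, using the concept identifiers and semantic types from the slide. Two points are assumptions: linking a multiword term to a sense of the single word through its shared semantic type (T047 or T061), as the slide's "=> sense 1 / sense 2" arrows suggest, and the plain substring counting. The function name and the sample text are hypothetical.

```python
from collections import Counter

# Senses of the ambiguous single word term, keyed by semantic type (from the slide).
ABORTION_SENSES = {"T047": "C0000786", "T061": "C0000811"}

# Unambiguous multiword terms and their semantic types (from the slide).
MULTIWORD_TYPES = {"recurrent abortion": "T047", "induced abortion": "T061"}

def choose_sense(word, senses_by_type, multiword_types, corpus_text):
    """Pick the sense indicated by the most frequent multiword term.

    Returns None when no multiword term containing `word` occurs in the text.
    """
    text = corpus_text.lower()
    counts = Counter()
    for term in multiword_types:
        if word in term.split():
            counts[term] = text.count(term)  # naive substring count
    if not counts:
        return None
    best_term, best_count = counts.most_common(1)[0]
    if best_count == 0:
        return None
    return senses_by_type[multiword_types[best_term]]

# Invented sample text for the example.
sample = ("Induced abortion was performed. Recurrent abortion is rare. "
          "Induced abortion rates fell.")
sense = choose_sense("abortion", ABORTION_SENSES, MULTIWORD_TYPES, sample)
```

Here "induced abortion" occurs twice and "recurrent abortion" once, so the medical-procedure sense (C0000811) is selected for the whole text.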
17
WSD: Algorithm
Combination of Methods (Task, Domain, General)
Domain Specific Senses (DFKI)
- Concept Relevance in Domain Corpus
- Mineral
  0.030774033: Mineralstoff, Eisen, Ferrum, Fluor, Kalzium, Magnesium
  4.9409806E-5: Allanit, Alumogel, ..., Axionit, Beryll, ... Wurtzit, Zirkon
Instance-Based Learning (DFKI)
- Unsupervised Context Models (n-grams)
- Training (Learn Class Models): He drank / He drank / He drank / He drank
- Application (Apply Class Models): He drank / He drank
18
WSD: Evaluation
Lexical Sample Evaluation Corpora (Medical)
- Ambiguous: MeSH EN: 847 (2.5), DE: 780 (2.1); EWN EN: 6300 (2.8), DE: 4059 (1.5)
- Evaluation (Nouns): GermaNet (40), English MeSH (59), German MeSH (28)
Examples
- Band (tape, strap, ligament)
- Fall (drop, case, instance)
- Gefäss (jar, vessel)
- Operation (operation, surgery)
- Prüfung (survey, tryout, checkup)
- Verletzung (injury, trauma)
- Wahl (ballot, choice, option)
- Lage (site, status, position, layer)
- Gewicht (weight, importance)
- ……
19
Relation Extraction
Grammatical Function Tagging (DFKI)
Robust, Shallow Grammatical Function Tagger
- EM Model (Trained on Frankfurter Rundschau: 35M Tokens; Adaptation on Medical Corpora Under Development)
- 1.5M 'Types': Verb, Voice, Function, Nom-Head-Argument, e.g. abarbeiten ACT SUBJ Politiker
- Use of PoS Information; Use of Chunk Information Planned
- Tags for SUBJ, OBJ, IOBJ, ACT/PAS
- German Available, English under Development
Example
- Untersucht wurden 30 Patienten, die sich einer elektiven aortokoronaren Bypassoperation unterziehen mussten. (English: 30 patients who had to undergo elective aortocoronary bypass surgery were examined.)
20
Relation Extraction
Semantic Relation Indicators (DFKI, CSLI); Novel Semantic Relations (DFKI, CSLI)
- Cluster 1: T047/T060 (Diagnoses), T060/T101 (Affects), T060/T169, ...
- Cluster 2: T101/T169, T101/T184, T101/T048, ...
- Cluster 3: T047/T121 (Treats, Causes), T061/T121 (Uses), T121/T184 (Treats), ...
Verbs: differentiate, conclude, discriminate, diagnose, illustrate, suffer, demonstrate, progress, develop, die, reduce, treat, follow, diagnose, cure
Semantic Types
- T047: Disease
- T048: Mental Dysfunction
- T060: Diagnostic Procedure
- T101: Patient
- T121: Pharm. Substance
- T169: Funct. Concept (Syndrom)
- T184: Sign or Symptom
21
Summarization (CMU)
Extractive Summarization
Maximal Marginal Relevance (MMR)
- Find passages most relevant to query
- Maximize information novelty (minimize passage redundancy)
- Assemble extracted passages for summary

$$\underset{d_i \in C}{\operatorname{Argmax}_k} \left[ \lambda\, S(Q, d_i) - (1 - \lambda) \max_{d_j \in R} S(d_i, d_j) \right]$$

- Q = query, d = document, S = similarity function
- λ = tradeoff factor between relevance & novelty
- k = number of passages to include in summary
Applications
- Re-ranking retrieved documents from IR Engine
- Ranking passages from a document for inclusion in summaries
- Ranking passages from topically-related document cluster for cluster summary
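The MMR criterion can be read as a greedy selection loop, sketched below. The slide does not define C and R. The sketch assumes C is the set of candidate passages and R the passages already selected, which is the usual reading of MMR. The Jaccard word-overlap similarity and the toy passages are stand-ins chosen only to make the example self-contained.

```python
def jaccard(a, b):
    """Word-overlap similarity (an assumed stand-in for S)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

def mmr_select(query, candidates, similarity, k, lam=0.7):
    """Greedily pick k passages, trading relevance against redundancy.

    score(d) = lam * S(query, d) - (1 - lam) * max over selected s of S(d, s)
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(d):
            # Redundancy is 0 while nothing has been selected yet.
            redundancy = max((similarity(d, s) for s in selected), default=0.0)
            return lam * similarity(query, d) - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy passages (invented for the example).
query = "heart attack treatment"
passages = [
    "heart attack treatment options",
    "heart attack treatment choices",
    "treatment of stroke",
]
relevance_only = mmr_select(query, passages, jaccard, k=2, lam=1.0)
novelty_heavy = mmr_select(query, passages, jaccard, k=2, lam=0.3)
```

With λ = 1.0 the two near-duplicate passages are chosen. With λ = 0.3 the second pick switches to the less relevant but non-redundant passage, which is the trade-off λ is meant to control.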
22
Summarization (CMU)
- MMR applies to English and German
  - Genre-based specialization (e.g. include conclusions for scientific articles)
  - Linguistic specialization possible
- Summarization should apply when retrieving FULL articles: query-driven summaries instead of generic abstracts

| Task | Query-Relevant (focused) | Query-Free (generic) |
|---|---|---|
| INDICATIVE, for Filtering (Do I read further?) | To filter search engine results | Short abstracts |
| CONTENTFUL, for reading in lieu of full doc. | To solve problems for busy professionals | Executive summaries |

MuchMore Application: INDICATIVE and QUERY-RELEVANT
23
Technical Evaluation
Test Data
- Test Collection: Springer Abstracts (German and English)
- Query Set: 25 of 126 Selected by ZInfo
Relevance Assessments
- Assumption: Documents Retrieved by all Runs for one Query (Intersection) are Relevant
- Pool Size: 500 Documents Based on 18 Runs Done by CMU, CSLI and EIT
- German (ZInfo): 959 Relevant Documents
- English (CMU): 500 Relevant Documents (1 judge); 964 Relevant Documents (3 judges)
24
Technical Evaluation
Methods Evaluated
Corpus Based
- Similarity Thesaurus (EIT)
- Example-based Translation (CMU)
- Pseudo Relevance Feedback (CMU)
- Generalized Vector Space Model (CMU)
Hybrid
- Classification (CMU): Hierarchical: kNN, Rocchio; Flat: kNN, Rocchio-style Classifier
- Semantic Annotation + Extraction (DFKI, XRCE): UMLS / XRCE Terms & Semantic Relations; EuroWordNet Terms
- Semantic Annotation + Similarity Thesaurus
25
Technical Evaluation
TREC-Style Performance Measurements
Overall Performance
- 11-point Average Precision (Interpolated)
Performance in the High-Precision Area
- Assumption: User Wants to Get Most Relevant Documents Top-Ranked within the Result List
- Average Interpolated Precision at Recall of 0.1
- Exact Precision after 10 Retrieved Documents
- Applied to Experiments Evaluating Semantic Annotations
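For readers unfamiliar with these TREC-style measures, the sketch below computes them for a single query. It follows the standard definitions (interpolated precision at a recall level is the highest precision at any rank where recall is at least that level). It is not the evaluation code used in the project, and the function names and toy ranking are hypothetical.

```python
def precision_at_k(ranked, relevant, k=10):
    """Exact precision after k retrieved documents."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def interpolated_precision(ranked, relevant, recall_level):
    """Highest precision at any rank whose recall is >= recall_level."""
    best = 0.0
    hits = 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            # Precision can only peak at ranks holding a relevant document.
            if hits / len(relevant) >= recall_level:
                best = max(best, hits / rank)
    return best

def eleven_point_average_precision(ranked, relevant):
    """Mean interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    levels = [i / 10 for i in range(11)]
    return sum(interpolated_precision(ranked, relevant, r) for r in levels) / 11

# Toy ranking (invented): relevant documents at ranks 1 and 3.
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3"}
avg11 = eleven_point_average_precision(ranked, relevant)
p_at_recall_01 = interpolated_precision(ranked, relevant, 0.1)
p_at_10 = precision_at_k(ranked, relevant, k=10)
```

In a full evaluation each measure would be averaged over all queries in the query set.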
26
Technical Evaluation
Results: Corpus Based Methods
Data Sets
- EIT: The Springer Parallel Corpus, i.e. 9640 Documents for English, and 9640 Documents for German
- CMU: Half of the Corpus, i.e. a Test Set with 4820 Documents in each.

| System | Eng-Eng | Ger-Ger | Ger-Eng | Eng-Ger |
|---|---|---|---|---|
| Monolingual EIT: lnu.ltn | 0.1914 | 0.1848 | N/A | N/A |
| Crosslingual EIT: SimThes & lnu.ltn | N/A | N/A | 0.1258 | 0.1109 |
| Monolingual PRF | 0.6782 | 0.5078 | N/A | N/A |
| Crosslingual PRF | N/A | N/A | 0.5487 | 0.5758 |
| EBT: chi-squared | N/A | N/A | 0.5232 | 0.5396 |
| Crosslingual GVSM | (first evaluation to be completed in July, 2002) | | | |
27
Technical Evaluation
Results: Hybrid Methods
Categorization (Preliminary Results)
- Reuters-21578: 10,000+ documents, 90 categories
- Reuters Corpus Volume 1, TREC-10 version (RCV1): 783,484 documents, 84 categories
- Reuters Koller & Sahami subsets (ICML'98): 138 to 939 documents, 6-11 categories in a set
- OHSUMED: 233,445 documents, 14,321 categories

| System | Data Set | Macro-avg F1 | Micro-avg F1 |
|---|---|---|---|
| kNN | Reuters 21578 | .60 | .86 |
| Rocchio | Reuters 21578 | .59 | .85 |
| kNN | RCV1.TREC-10 | (F0.5 = .44) | (F0.5 = .55) |
| Rocchio | RCV1.TREC-10 | (F0.5 = .39) | (F0.5 = .49) |
| kNN | R-KS Subsets (3) | .85, .81, .97 | .89, .80, .94 |
| HkNN | R-KS Subsets (3) | .85, .80, .98 | .86, .82, .99 |
| Rocchio | R-KS Subsets (3) | .80, .75, .96 | .82, .83, .96 |
| HRocchio | R-KS Subsets (3) | .83, .81, .98 | .78, .84, .99 |
| kNN | OHSUMED | .26 | .48 |
28
Technical Evaluation
Results: Hybrid Methods
Semantic Annotation + Extraction
- Data Set: Full Springer Corpus
- Weighting Scheme: Coordination Level Matching (CLM). 1st Pass: Documents Preferred Containing Matching Terms or Semantic Relations. 2nd Pass: All Features Using lnu.ltn
- Rel. Assessments: German

| System | 11pt AvPrec (SemA-v3) | 11pt AvPrec (SemA-v4) | Prec at Recall of 0.1 (SemA-v3) | Prec at Recall of 0.1 (SemA-v4) | Prec at 10 Docs Retr (SemA-v3) | Prec at 10 Docs Retr (SemA-v4) |
|---|---|---|---|---|---|---|
| EN2DE: Morph & EWN | - | 0.0005 | - | 0.0017 | - | 0.0040 |
| EN2DE: Morph & UMLS | - | 0.0933 | - | 0.2898 | - | 0.1840 |
| EN2DE: Morph & UMLS & XRCE | - | 0.1486 | - | 0.4258 | - | 0.3360 |
| DE2EN: Morph & EWN | - | 0.0479 | - | 0.1240 | - | 0.0960 |
| DE2EN: Morph & UMLS | 0.1507 | 0.1392 | 0.3895 | 0.3963 | 0.2520 | 0.2920 |
29
Technical Evaluation
Results: Hybrid Methods
Semantic Annotation + Similarity Thesaurus
- Data Set: Full Springer Corpus
- Weighting Scheme: Coordination Level Matching (CLM)
- Rel. Assessments: German

| System | 11pt AvPrec | Prec at Recall of 0.1 | Prec at 10 Docs Retr |
|---|---|---|---|
| EN2DE: transl. Morphology & EWN | 0.0276 | 0.1353 | 0.1000 |
| EN2DE: transl. Morphology & UMLS | 0.1487 | 0.4126 | 0.3320 |
| EN2DE: transl. Morphology & UMLS & XRCE | 0.1706 | 0.4495 | 0.3600 |
| DE2EN: transl. Morphology & EWN | 0.1101 | 0.3165 | 0.2000 |
| DE2EN: transl. Morphology & UMLS | 0.1413 | 0.4038 | 0.2680 |
30
Technical Evaluation
Summary of the Results
Assumption: CLIR achieves up to 75% of Monolingual Baseline (11pt Average Precision)
Corpus-based Methods (Compared to Monolingual PRF)
- German – English: PRF: 81%, EBT: 77%, EIT: 66%
- English – German: PRF: 113%, EBT: 106%, EIT: 60%
Hybrid Methods (Compared to Monolingual EIT)
- German – English: 73% (UMLS Terms & SemRels)
- English – German: 50% (UMLS Terms & SemRels)
- English – German: 80% (UMLS Terms & SemRels & XRCE Terms)
- German – English: 74% (SimThes & UMLS Terms & SemRels)
- English – German: 80% (SimThes & UMLS Terms & SemRels)
- English – German: 92% (SimThes & UMLS Terms & SemRels & XRCE Terms)
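The corpus-based percentages can be recomputed from the 11pt average precision values on the corpus-based results slide. The slide does not state which monolingual run serves as the denominator. The figures appear to be reproduced when each crosslingual score is divided by the monolingual score for the target language, and when the EIT runs are compared with monolingual EIT rather than PRF. Both choices are inferences encoded in the sketch below, and the names are hypothetical.

```python
# 11pt average precision values copied from the corpus-based results slide.
MONOLINGUAL = {
    "PRF": {"en": 0.6782, "de": 0.5078},
    "EIT": {"en": 0.1914, "de": 0.1848},
}
CROSSLINGUAL = {
    "PRF": {"de-en": 0.5487, "en-de": 0.5758},
    "EBT": {"de-en": 0.5232, "en-de": 0.5396},
    "EIT": {"de-en": 0.1258, "en-de": 0.1109},
}
# Assumed baseline per system (inferred from the numbers, not stated on the slide).
BASELINE_FOR = {"PRF": "PRF", "EBT": "PRF", "EIT": "EIT"}

def percent_of_baseline(system, direction):
    """Crosslingual score as a rounded percentage of the monolingual
    baseline for the target (document) language."""
    target_lang = direction.split("-")[1]
    mono = MONOLINGUAL[BASELINE_FOR[system]][target_lang]
    return round(100 * CROSSLINGUAL[system][direction] / mono)

summary = {
    (system, direction): percent_of_baseline(system, direction)
    for system in CROSSLINGUAL
    for direction in ("de-en", "en-de")
}
```

For example, German to English PRF gives 0.5487 / 0.6782 ≈ 0.81, i.e. 81%, and English to German PRF gives 0.5758 / 0.5078 ≈ 1.13, i.e. 113%.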
31
Management
Deviations from the Work Plan
Corpus Collection
- Comparable Medical Document Corpora are Very Difficult to Obtain; Anonymization Must be Validated by Hospital CIO
- Work with "Shuffled" Parallel Corpus
- Radiology Reports (~600,000) Available in German, to be Obtained for English
Corpus Annotation
- More Efforts on Improving PoS Tagging and Morphological Analysis (English and German Medical Specialist Lexicon)
Relation Extraction
- More Efforts on Grammatical Function Tagging as Preprocessing for Semantic Relation Tagging and Extraction
32
Management
Future Prospects and Activities
R&D Topics
- Ontology Development: Combining Axes in AGK-Thesaurus (ZInfo) with Cluster Methods (CSLI, DFKI)
- Semantic Web: Semantic Annotation of Medical Documents with Metadata (UMLS in Protégé)
Related Projects and Workshops
- Project Proposal IKAR/OS on KM & Visualization in Life Sciences
- OntoWeb SIG on LT in Ontology Development and Use
- MuchMore Workshop with Invited Experts in Medical Information Access, CLIR and Semantic Annotation (September 2002)
- ZInfo/MuchMore Workshop on Electronic Patient Records (Spring 2003)