Published by Adele Richard. Modified over 9 years ago.
1
A System for Finding Biological Entities that Satisfy Certain Conditions from Texts
Wei Zhou, Clement Yu (University of Illinois at Chicago)
Weiyi Meng (SUNY at Binghamton)
2
Outline
  Problem statement
  Techniques and methods
  Experimental results
  Discussion and conclusion
CIKM 2008 By Clement Yu from UIC
3
Problem statement
Given a complex biological question, output the relevant passages (or excerpts) in which the answer can be found.
4
An Example
A sample question: What [GENES] are involved in insect segmentation?
  Target: GENES
  Qualification concepts: 1) insect 2) segmentation
A sample relevant passage: "In all insect species examined, neural expression of hb is conserved, suggesting that a neural function is ancestral. However, as the expression of the eve and ftz genes during segmentation is not conserved between grasshopper and Drosophila, and these genes lie below gap genes such as hb in the Drosophila segmentation hierarchy, it was unclear whether the role of hb in AP patterning would be conserved in more basal insects."
  [hb, ftz, and eve are targets found in the passage]
5
Techniques and methods
  Identify concepts in queries and texts
  Use of domain knowledge
  Related concepts (query expansion)
  Gene symbol disambiguation
  Conceptual IR models
6
Identify concepts in queries and texts
In queries: PubMed automatic term mapping.
In texts: window size. A concept is identified when all of its component words appear within a window of a certain size.
  An example: "...Women who are postmenopausal and who have never used hormone replacement therapy have a higher risk of colon, but not rectal, cancer than do women who..." [Query concept: colon cancer]
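The window-based matching described on this slide can be sketched as follows. This is only an illustration: the tokenizer, the function names, and the default window of 5 tokens are assumptions, since the slide does not state the window size or the tokenization actually used.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def concept_in_window(text, concept, window=5):
    """Return True if every word of `concept` occurs inside some run of
    at most `window` consecutive tokens of `text`.

    The default window of 5 is an illustrative assumption."""
    tokens = tokenize(text)
    needed = set(tokenize(concept))
    if not needed:
        return False
    for start in range(len(tokens)):
        if needed.issubset(tokens[start:start + window]):
            return True
    return False


passage = ("have a higher risk of colon, but not rectal, "
           "cancer than do women who")
print(concept_in_window(passage, "colon cancer", window=5))
print(concept_in_window(passage, "colon cancer", window=3))
```

In the slide's example, "colon" and "cancer" are separated by three other words, so the concept is found with a five-token window but missed with a three-token one.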
7
Use of domain knowledge
Gene/protein species control (rule-based): if a query asks for genes/proteins related to a specific species, then genes/proteins related to other species are considered irrelevant.
  Example query: What [GENES] are involved in axon guidance in C. elegans?
  A passage that is irrelevant because it concerns a different species: "We describe DPTP52F, which is probably the last remaining RPTP encoded in the Drosophila genome. Ptp52F mutations cause specific CNS and motor axon guidance phenotypes, and exhibit genetic interactions with mutations in the other Rptp genes."
  [Ptp52F is not a relevant target because the passage is about Drosophila, not C. elegans]
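A minimal sketch of such a species-control rule is shown below. The species dictionary, the substring matching, and the decision to keep passages that mention no species at all are assumptions made for illustration; the slide does not say how species are detected in the passage.

```python
# Hypothetical species dictionary: canonical name -> surface forms.
SPECIES_NAMES = {
    "c. elegans": ["c.elegans", "c. elegans", "caenorhabditis elegans"],
    "drosophila": ["drosophila", "d. melanogaster"],
}


def species_mentioned(text):
    """Return the set of canonical species whose surface forms occur in the text.

    Plain substring matching is a simplification for this sketch."""
    lowered = text.lower()
    return {
        species
        for species, names in SPECIES_NAMES.items()
        if any(name in lowered for name in names)
    }


def passes_species_control(query_species, passage):
    """Keep a passage unless it only mentions species other than the queried one.

    Passages with no detected species are kept (an assumption)."""
    found = species_mentioned(passage)
    if not found:
        return True
    return query_species in found


passage = ("We describe DPTP52F, which is probably the last remaining RPTP "
           "encoded in the Drosophila genome.")
print(passes_species_control("c. elegans", passage))
```

For the slide's example, the passage mentions only Drosophila, so it is rejected for a query about C. elegans.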
8
Use of domain knowledge
Compilation of instances from thesauruses: retrieve concepts from UMLS and genes from Entrez Gene, and map them to the TREC entity types.
  An example:
  [Target type]: TUMOR TYPES
  [Dictionary]: UMLS Metathesaurus
  [Instances]: Lung Cancer; T-cell lymphoma; Pheochromocytoma
9
Related concepts
  Synonyms
  Hyponyms (one level only)
  Hypernyms (one level only)
  Lexical variants
  Related abbreviations
10
Related concepts: lexical variants
Type 1: Automatically generate lexical variants using manually created heuristics.
  e.g., PLA2 → PLA 2, PLAII, and PLA II
  Note: PLA2 = Phospholipase A2
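One way such a heuristic could be written is sketched below. The slide gives only the PLA2 example, so the specific rule here (a letter prefix followed by a single trailing digit, with an optional space and an Arabic-to-Roman substitution) and the name `lexical_variants` are assumptions, not the authors' actual heuristics.

```python
import re

# Roman numerals for single digits; "0" has no Roman form and is left as-is.
ROMAN = {"1": "I", "2": "II", "3": "III", "4": "IV", "5": "V",
         "6": "VI", "7": "VII", "8": "VIII", "9": "IX"}


def lexical_variants(symbol):
    """Generate spacing and numeral variants for symbols like 'PLA2'.

    Symbols that do not match the assumed letters+digit pattern are
    returned unchanged."""
    match = re.fullmatch(r"([A-Za-z]+)[ -]?(\d)", symbol)
    if match is None:
        return [symbol]
    head, digit = match.groups()
    numerals = [digit]
    if digit in ROMAN:
        numerals.append(ROMAN[digit])
    # Combine each numeral form with and without a separating space.
    return [head + sep + num for num in numerals for sep in ("", " ")]


print(lexical_variants("PLA2"))  # ['PLA2', 'PLA 2', 'PLAII', 'PLA II']
```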
11
Related concepts: lexical variants
Type 2: Retrieve additional lexical variants from a term database of MEDLINE.
  e.g., PLA2 → PL-A2
  Note: PLA2 = Phospholipase A2
12
Related concepts – Lexical variants
Type 3: Retrieve additional lexical variants by recognizing equivalent long-forms of an abbreviation. Type 3 has six subtypes:
  Type 3.1: Identical after stemming
    Example: APC: "antigen presenting cell" ≈ "antigen presented cell"
  Type 3.2: Different by a small edit distance
    Example: HPV: "Human papillomavirus" ≈ "Human papillomaviral"
  Type 3.3: Identical after normalization
    Example: NFkb: "Nuclear factor kappa beta" ≈ "Nuclear factor kb"
  Type 3.4: Different ordering
    Example: Abeta: "amyloid beta protein" ≈ "beta amyloid protein"
  Type 3.5: Extra words
    Example: ACD: "cerebral amyloid angiopathies" ≈ "cerebral beta amyloid angiopathies"
  Type 3.6: Internal abbreviations
    Example: APC: "ag presenting cell" ≈ "antigen presenting cell"
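Three of these subtypes (3.2, 3.4, and 3.5) can be checked with plain string operations, as sketched below. The thresholds (`max_edits=2`, `max_extra=1`), the order in which the checks run, and all function names are assumptions; stemming, normalization, and internal abbreviations (3.1, 3.3, 3.6) need extra resources and are left out.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def same_words_any_order(a, b):
    """Type 3.4: the two long-forms use the same words in a different order."""
    return sorted(a.lower().split()) == sorted(b.lower().split())


def differs_by_extra_words(a, b, max_extra=1):
    """Type 3.5: the shorter long-form is a word subsequence of the longer one."""
    short, long_ = sorted((a.lower().split(), b.lower().split()), key=len)
    if not 0 < len(long_) - len(short) <= max_extra:
        return False
    remaining = iter(long_)
    return all(word in remaining for word in short)


def match_long_forms(a, b, max_edits=2):
    """Return the name of the first rule that relates the two long-forms, else None."""
    if a.lower() == b.lower():
        return "identical"
    if same_words_any_order(a, b):
        return "different ordering"
    if differs_by_extra_words(a, b):
        return "extra words"
    if edit_distance(a.lower(), b.lower()) <= max_edits:
        return "small edit distance"
    return None


print(match_long_forms("amyloid beta protein", "beta amyloid protein"))
print(match_long_forms("Human papillomavirus", "Human papillomaviral"))
```

A fixed edit-distance threshold is a blunt instrument: a real system would probably scale it with the length of the long-form to avoid merging short, unrelated terms.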
13
Related concepts: related abbreviations
Abbreviations whose definitions (or long-forms) contain the query concept.
For example, some related abbreviations for the concept "lung cancer" are:
  SCLC (small cell lung cancer)
  LCSS (lung cancer symptom scale)
  NSCLC (non-small cell lung cancer)
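Finding related abbreviations amounts to checking whether the query concept occurs as a contiguous word sequence inside each long-form. The sketch below assumes a small in-memory abbreviation dictionary; the dictionary contents beyond the slide's three examples and the function name are hypothetical.

```python
# Hypothetical abbreviation dictionary: short form -> long-form.
ABBREVIATIONS = {
    "SCLC": "small cell lung cancer",
    "LCSS": "lung cancer symptom scale",
    "NSCLC": "non-small cell lung cancer",
    "HPV": "human papillomavirus",
}


def related_abbreviations(concept, abbreviations):
    """Return abbreviations whose long-form contains `concept` as a
    contiguous sequence of words (case-insensitive)."""
    concept_words = concept.lower().split()
    n = len(concept_words)
    if n == 0:
        return []
    related = []
    for abbr, long_form in abbreviations.items():
        words = long_form.lower().split()
        if any(words[i:i + n] == concept_words
               for i in range(len(words) - n + 1)):
            related.append(abbr)
    return related


print(related_abbreviations("lung cancer", ABBREVIATIONS))
# ['SCLC', 'LCSS', 'NSCLC']
```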
14
Gene symbol disambiguation
Three simple rules are defined to disambiguate gene symbols from:
Abbreviations with non-gene meanings (Rules 1 & 2)
  Example: "Here, utilizing non-obese diabetic (NOD) mice deficient for CD154 (CD154-KO/NOD), we have identified a mandatory role of CD4 T cells as the functional source of CD154 in the initiation of T1DM."
  [NOD is a gene symbol, but it has a non-gene meaning here because it has the non-gene definition "non-obese diabetic"]
Common English words (Rule 3)
  Example: "The Kit gene, which codes for the KIT ligand (KITL) receptor or stem cell factor, was one of the genes identified in this study."
  ["Kit" is a common English word, but it has a gene meaning here because of the adjacent word "gene"]
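The two example cases can be approximated with the sketch below. It is a loose interpretation, not the authors' rules: the slide does not spell out Rules 1 and 2 separately, so the check here simply looks for a parenthesized definition whose preceding words spell out the symbol's initials. The cue-word list and the common-word list are hypothetical.

```python
import re

GENE_CUES = {"gene", "genes", "protein", "proteins"}  # assumed cue words
COMMON_WORDS = {"kit", "cat", "was", "for"}           # assumed word list


def defined_as_non_gene(symbol, text):
    """True if the text defines `symbol` in parentheses and the preceding
    words spell its initials without any gene cue word,
    e.g. 'non-obese diabetic (NOD)'."""
    match = re.search(r"\(\s*" + re.escape(symbol) + r"\s*\)", text)
    if match is None:
        return False
    preceding = re.findall(r"[A-Za-z0-9]+", text[:match.start()].lower())
    long_form = preceding[-len(symbol):]
    initials = "".join(word[0] for word in long_form)
    return initials == symbol.lower() and not GENE_CUES & set(long_form)


def adjacent_to_gene_cue(symbol, text):
    """True if some occurrence of `symbol` is directly next to a gene cue word."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    for i, token in enumerate(tokens):
        if token == symbol.lower():
            neighbours = tokens[max(i - 1, 0):i] + tokens[i + 1:i + 2]
            if GENE_CUES & set(neighbours):
                return True
    return False


def is_gene_mention(symbol, text):
    """Decide whether `symbol` should be read as a gene in `text`."""
    if defined_as_non_gene(symbol, text):
        return False
    if symbol.lower() in COMMON_WORDS:
        return adjacent_to_gene_cue(symbol, text)
    return True


nod_text = "Here, utilizing non-obese diabetic (NOD) mice deficient for CD154"
kit_text = "The Kit gene, which codes for the KIT ligand (KITL) receptor"
print(is_gene_mention("NOD", nod_text))
print(is_gene_mention("Kit", kit_text))
```

On the slide's two examples, NOD is rejected because of its local definition, and Kit is accepted because it sits next to the word "gene".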
15
Conceptual IR Models
  Model 1: Differentiate target instances
  Model 2: Equally weight target instances
16
Conceptual IR Models – Model 1
17
Conceptual IR Models – Model 2
18
Experimental results
  Data sets and evaluation metrics
  Impact of different techniques and methods
  Comparison with best reported results
19
Data sets and evaluation metrics
Query collection: 36 questions collected from biologists in 2007.
Document collection: 162,259 Highwire full-text documents in HTML format.
Performance metrics:
  Passage MAP
  Aspect MAP
  Document MAP
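As a reminder of what MAP measures, the sketch below computes the standard document-level mean average precision over ranked lists. The document IDs and function names are made up for illustration, and the passage and aspect variants used in the evaluation are computed differently and are not reproduced here.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant IDs."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, 1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of the per-query average precisions.

    `runs` is a list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy data with made-up document IDs.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d2", "d1"], {"d1"}),                    # AP = 1/2
]
print(round(mean_average_precision(runs), 4))  # 0.6667
```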
20
Impact of different techniques and methods
21
Impact of different techniques and methods
22
Comparison with best reported results
Our results improve significantly on the best reported results in passage retrieval (by 22% for automatic runs and 16.7% for non-automatic runs).
23
Summary
  Studied five different levels of related concepts for query expansion and examined their impact on retrieval effectiveness.
  Achieved significant improvement over the best reported results.
  Compared the retrieval effectiveness of two conceptual IR models.
  Evaluated a simple method for gene symbol disambiguation.
24
Conclusions
1. Incorporating domain-specific knowledge through query expansion using multiple semantic relations significantly improved retrieval effectiveness.
25
Conclusions
2. The biggest improvement comes from the lexical variants. This result also indicates that biologists are likely to use different variants of the same concept according to their own writing preferences, and that these variants might not be collected in the existing biomedical thesauruses.
26
Future work
  Improve the quality of target instances retrieved from different resources
  Improve the gene symbol disambiguation method
  Handle pronouns
  Run more evaluations on other gold standards
27
Questions?
Thanks