A System for Finding Biological Entities that Satisfy Certain Conditions from Texts. Wei Zhou, Clement Yu (University of Illinois at Chicago), Weiyi Meng.

Presentation transcript:

A System for Finding Biological Entities that Satisfy Certain Conditions from Texts
Wei Zhou, Clement Yu (University of Illinois at Chicago)
Weiyi Meng (SUNY at Binghamton)

Outline
- Problem statement
- Techniques and methods
- Experimental results
- Discussion and conclusion
CIKM 2008 By Clement Yu from UIC

Problem statement
Given a complex biological question, output relevant passages (or excerpts) where the answer can be found.

An Example
A sample question: What [GENES] are involved in insect segmentation?
Target: GENES
Qualification concepts: 1) insect 2) segmentation
A sample relevant passage: "In all insect species examined, neural expression of hb is conserved, suggesting that a neural function is ancestral. However, as the expression of the eve and ftz genes during segmentation is not conserved between grasshopper and Drosophila, and these genes lie below gap genes such as hb in the Drosophila segmentation hierarchy, it was unclear whether the role of hb in AP patterning would be conserved in more basal insects."
[hb, ftz, and eve are targets found in the passage]

Techniques and methods
- Identify concepts in queries and texts
- Use of domain knowledge
- Related concepts (query expansion)
- Gene symbol disambiguation
- Conceptual IR models

Identify concepts in queries and texts
In queries: PubMed automatic term mapping.
In texts: a concept is matched when all of its component words appear within a certain window size.
An example: "...Women who are postmenopausal and who have never used hormone replacement therapy have a higher risk of colon, but not rectal, cancer than do women who..." [Query concept: colon cancer]
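The window rule for texts can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name `concept_in_window`, the tokenizer, and the default window of 10 tokens are assumptions, since the slide does not state the window size that was used.

```python
import re


def concept_in_window(text, concept, window=10):
    """Return True if every word of `concept` occurs inside some run of
    `window` consecutive tokens of `text` (window size is an assumption)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    needed = set(concept.lower().split())
    for start in range(len(tokens)):
        # Slide a fixed-size window over the token sequence.
        span = tokens[start:start + window]
        if needed.issubset(span):
            return True
    return False
```

With the slide's example sentence, "colon" and "cancer" are five tokens apart inclusive, so the concept "colon cancer" is found for any window of five or more tokens and missed for a smaller one.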

Use of domain knowledge
Gene/protein species control (rule-based): if a query asks for genes/proteins related to a specific species, then genes/proteins related to other species are considered irrelevant.
Example query: What [GENES] are involved in axon guidance in C.elegans?
A passage that is irrelevant because of a different species: "We describe DPTP52F, which is probably the last remaining RPTP encoded in the Drosophila genome. Ptp52F mutations cause specific CNS and motor axon guidance phenotypes, and exhibit genetic interactions with mutations in the other Rptp genes."
[Ptp52F is not a relevant target because the passage is about Drosophila, not C.elegans]
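A rough sketch of the species-control rule follows. The alias table `SPECIES_ALIASES` and both function names are hypothetical, and the choice to keep passages that name no species at all is an assumption, because the slide only says that genes from other species are treated as irrelevant.

```python
# Hypothetical alias table; a real system would draw on a species lexicon.
SPECIES_ALIASES = {
    "c. elegans": ["c.elegans", "c. elegans", "caenorhabditis elegans"],
    "drosophila": ["drosophila", "d. melanogaster"],
}


def species_mentioned(text):
    """Return the set of known species whose aliases occur in `text`."""
    lowered = text.lower()
    return {
        name
        for name, aliases in SPECIES_ALIASES.items()
        if any(alias in lowered for alias in aliases)
    }


def passes_species_control(query, passage):
    """Reject a passage only when the query names a species and the passage
    names other species but none of the requested ones."""
    wanted = species_mentioned(query)
    if not wanted:
        return True  # The query is not species-specific.
    found = species_mentioned(passage)
    # Assumption: a passage that names no species is kept.
    return not found or bool(found & wanted)
```

Applied to the example above, the Drosophila passage is rejected for the C.elegans query.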

Use of domain knowledge
Compilation of instances from thesauruses: retrieve concepts from UMLS and genes from Entrez Gene, and map them to the TREC entity types.
An example:
[Target type]: TUMOR TYPES
[Dictionary]: UMLS Metathesaurus
[Instances]: Lung Cancer; T-cell lymphoma; Pheochromocytoma

Related concepts
- Synonyms
- Hyponyms (one-level only)
- Hypernyms (one-level only)
- Lexical variants
- Related abbreviations
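As a sketch of how the thesaurus-based relations could feed query expansion, the snippet below collects synonyms and one level of hyponyms and hypernyms for a concept. The `THESAURUS` dictionary is toy data invented for illustration (it is not taken from UMLS), and `expand_concept` is a hypothetical helper name. Lexical variants and related abbreviations are not covered here.

```python
# Toy thesaurus for illustration only; entries are hypothetical.
THESAURUS = {
    "lung cancer": {
        "synonyms": ["lung carcinoma"],
        "hyponyms": ["small cell lung cancer"],
        "hypernyms": ["cancer"],
    },
}

RELATIONS = ("synonyms", "hyponyms", "hypernyms")


def expand_concept(concept, thesaurus, relations=RELATIONS):
    """Return the concept followed by its one-level related terms,
    without duplicates and in relation order."""
    entry = thesaurus.get(concept.lower(), {})
    expanded = [concept]
    for relation in relations:
        for term in entry.get(relation, []):
            if term not in expanded:
                expanded.append(term)
    return expanded
```

Passing a narrower `relations` tuple makes it easy to measure the effect of each relation type separately.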

Related concepts: lexical variants
Type 1: Automatically generate lexical variants using manually created heuristics, e.g., PLA2 → PLA 2, PLAII, and PLA II
Note: PLA2 stands for Phospholipase A2.
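One heuristic of this kind can be sketched as follows. The slide does not list the actual heuristics, so the rules here (insert a space before a trailing digit, swap the digit for a Roman numeral) are assumptions chosen only to reproduce the PLA2 example. The function name `lexical_variants` is hypothetical.

```python
import re

# Arabic-to-Roman mapping for single trailing digits (assumed range 1 to 5).
ROMAN = {"1": "I", "2": "II", "3": "III", "4": "IV", "5": "V"}


def lexical_variants(symbol):
    """Generate spacing and Roman-numeral variants of a symbol that ends
    in a single digit, excluding the input form itself."""
    match = re.fullmatch(r"([A-Za-z]+)[ -]?(\d)", symbol)
    if not match:
        return []
    stem, digit = match.groups()
    variants = [f"{stem}{digit}", f"{stem} {digit}"]
    roman = ROMAN.get(digit)
    if roman:
        variants += [f"{stem}{roman}", f"{stem} {roman}"]
    return [v for v in variants if v != symbol]
```

For the input `PLA2` this yields `PLA 2`, `PLAII`, and `PLA II`, matching the example on the slide.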

Related concepts: lexical variants
Type 2: Retrieve additional lexical variants from a term database of MEDLINE, e.g., PLA2 → PL-A2
Note: PLA2 stands for Phospholipase A2.

Related concepts: lexical variants
Type 3: Retrieve additional lexical variants by recognizing equivalent long-forms of an abbreviation.
6 subtypes of Type 3:
Type 3.1: Identical after stemming. Example: APC: "antigen presenting cell" ≈ "antigen presented cell"
Type 3.2: Different by a small edit distance. Example: HPV: "Human papillomavirus" ≈ "Human papillomaviral"
Type 3.3: Identical after normalization. Example: NFkb: "Nuclear factor kappa beta" ≈ "Nuclear factor kb"
Type 3.4: Different ordering. Example: Abeta: "amyloid beta protein" ≈ "beta amyloid protein"
Type 3.5: Extra words. Example: ACD: "cerebral amyloid angiopathies" ≈ "cerebral beta amyloid angiopathies"
Type 3.6: Internal abbreviations. Example: APC: "ag presenting cell" ≈ "antigen presenting cell"
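Three of the six subtypes (different ordering, extra words, small edit distance) can be sketched with the standard library alone. The thresholds `max_edits=2` and `max_extra_words=1` and the function names are assumptions. Stemming, normalization, and internal abbreviations (Types 3.1, 3.3, 3.6) are deliberately left out of this sketch.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def equivalent_long_forms(a, b, max_edits=2, max_extra_words=1):
    """Return a label naming why two long-forms look equivalent, or None."""
    ta, tb = a.lower().split(), b.lower().split()
    if sorted(ta) == sorted(tb):
        return "identical" if ta == tb else "different ordering"
    short, long_ = sorted((set(ta), set(tb)), key=len)
    if short < long_ and len(long_ - short) <= max_extra_words:
        return "extra words"
    if edit_distance(a.lower(), b.lower()) <= max_edits:
        return "small edit distance"
    return None
```

The checks run from the strictest to the loosest, so a pair that differs only in word order is never reported as an edit-distance match.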

Related concepts: related abbreviations
Abbreviations whose definitions (or long-forms) contain the query concept. For example, some related abbreviations for the concept "lung cancer" are:
- SCLC (small cell lung cancer)
- LCSS (lung cancer symptom scale)
- NSCLC (non-small cell lung cancer)
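The lookup of related abbreviations can be sketched as a phrase search over abbreviation long-forms. The dictionary `ABBREVIATIONS` below reuses the examples from the slides, while the function name `related_abbreviations` and the whitespace tokenization are assumptions.

```python
# Abbreviation dictionary built from the examples in the slides.
ABBREVIATIONS = {
    "SCLC": "small cell lung cancer",
    "LCSS": "lung cancer symptom scale",
    "NSCLC": "non-small cell lung cancer",
    "APC": "antigen presenting cell",
}


def related_abbreviations(concept, abbreviations):
    """Return abbreviations whose long-form contains `concept` as a
    contiguous word sequence, sorted alphabetically."""
    needle = concept.lower().split()
    if not needle:
        return []
    related = []
    for abbr, long_form in abbreviations.items():
        words = long_form.lower().split()
        last_start = len(words) - len(needle)
        if any(words[i:i + len(needle)] == needle
               for i in range(last_start + 1)):
            related.append(abbr)
    return sorted(related)
```

Matching whole words rather than raw substrings avoids accidental hits such as "cell" inside "cellular".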

Gene symbol disambiguation
3 simple rules are defined to disambiguate gene symbols from:
- Abbreviations of non-gene meanings (Rules 1 & 2)
Example: "Here, utilizing non-obese diabetic (NOD) mice deficient for CD154 (CD154-KO/NOD), we have identified a mandatory role of CD4 T cells as the functional source of CD154 in the initiation of T1DM."
[NOD is a gene symbol, but it has a non-gene meaning here because it has a non-gene definition "non-obese diabetic"]
- Common English words (Rule 3)
Example: "The Kit gene, which codes for the KIT ligand (KITL) receptor or stem cell factor, was one of the genes identified in this study."
["Kit" is a common English word, but it has a gene meaning here because of the adjacent word "gene"]
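The slide gives the intent of the rules but not their exact wording, so the following is only a guess at their shape. The word list `COMMON_WORDS`, the cue list `GENE_CUES`, and both function names are hypothetical, and the two abbreviation rules are collapsed into a single check for a parenthesized definition whose initials spell the symbol.

```python
import re

COMMON_WORDS = {"kit", "cat", "not"}                  # hypothetical word list
GENE_CUES = {"gene", "genes", "protein", "proteins"}  # hypothetical cue words


def defined_as_abbreviation(symbol, text):
    """True if `text` contains '<long form> (SYMBOL)' where the initials of
    the preceding words spell the symbol and no gene cue word is present."""
    match = re.search(r"\(\s*" + re.escape(symbol) + r"\s*\)", text)
    if not match:
        return False
    words = re.findall(r"[A-Za-z0-9]+", text[:match.start()])
    candidate = words[-len(symbol):]
    initials = "".join(w[0] for w in candidate)
    return (initials.lower() == symbol.lower()
            and not GENE_CUES & {w.lower() for w in candidate})


def is_gene_mention(symbol, text):
    """Apply the abbreviation check first, then the common-word check."""
    if defined_as_abbreviation(symbol, text):
        return False
    if symbol.lower() in COMMON_WORDS:
        tokens = re.findall(r"[A-Za-z0-9]+", text)
        for i, token in enumerate(tokens):
            if token == symbol:
                neighbours = tokens[max(i - 1, 0):i] + tokens[i + 1:i + 2]
                if GENE_CUES & {n.lower() for n in neighbours}:
                    return True
        return False
    return True
```

On the two examples above, NOD is rejected because "non-obese diabetic" defines it, and Kit is accepted because it sits next to the word "gene".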

Conceptual IR Models
- Model 1: differentiate target instances
- Model 2: equally weight target instances

Conceptual IR Models – Model 1

Conceptual IR Models – Model 2

Experimental results
- Data sets and evaluation metrics
- Impact of different techniques and methods
- Comparison with best reported results

Data sets and evaluation metrics
Query collection: 36 questions collected from biologists.
Document collection: 162,259 Highwire full-text documents in HTML format.
Performance metrics:
- Passage MAP
- Aspect MAP
- Document MAP
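For orientation, document-level MAP can be computed as below. This is the standard mean average precision over ranked document lists. Passage MAP and aspect MAP follow the same averaging idea but use their own relevance units, which this sketch does not model. The function names are illustrative, and the sketch assumes each ranked list has no duplicate ids.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k taken at each rank k that holds a relevant
    document, divided by the total number of relevant documents."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """`runs` is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Relevant documents that are never retrieved still count in the denominator, so a system cannot raise its score by returning only a few safe results.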

Impact of different techniques and methods

Comparison with best reported results
The improvement of our result over the best reported results is significant (22% for automatic and 16.7% for non-automatic in passage retrieval).

Summary
- Studied five different levels of related concepts for query expansion and examined their impacts on retrieval effectiveness.
- Achieved significant improvement over the best reported results.
- Compared two conceptual IR models in retrieval effectiveness.
- Evaluated a simple method for gene symbol disambiguation.

Conclusions 1: Incorporating domain-specific knowledge through query expansion using multiple semantic relations significantly improved the retrieval effectiveness.

Conclusions 2: The biggest improvement comes from the lexical variants. This result also indicates that biologists are likely to use different variants of the same concept according to their own writing preferences, and that these variants might not be collected in the existing biomedical thesauruses.

Future work
- Improve the quality of target instances retrieved from different resources
- Improve the gene symbol disambiguation method
- Handle pronouns
- More evaluations on other gold standards

Questions? Thanks.