Automatically Generating Gene Summaries from Biomedical Literature (To appear in Proceedings of PSB 2006) X. LING, J. JIANG, X. HE, Q. Z. MEI, C. X. ZHAI,

Presentation transcript:

Automatically Generating Gene Summaries from Biomedical Literature (To appear in Proceedings of PSB 2006) X. LING, J. JIANG, X. HE, Q. Z. MEI, C. X. ZHAI, B. SCHATZ Department of Computer Science Institute for Genomic Biology University of Illinois at Urbana-Champaign

Outline Introduction System Demo Conclusion and Future Work

Motivation Finding all the information we know about a gene from the literature is a critical task in biology research. Reading all the relevant articles about a gene is time-consuming. A summary of what we know about a gene would help biologists access already-discovered knowledge.

An Ideal Gene Summary (figure: example summary with passages labeled by aspect: GP, EL, SI, GI, MP, WFPI)

Problem with Current Situation? Manually generated Labor-intensive Hard to keep up to date with the rapidly growing literature How can we generate such summaries automatically?

Our solution Structured summary on 6 aspects 1. Gene products (GP) 2. Expression location (EL) 3. Sequence information (SI) 4. Wild-type function and phenotypic information (WFPI) 5. Mutant phenotype (MP) 6. Genetic interaction (GI) 2-stage summarization –Retrieve relevant articles by keyword match –Extract the most informative and relevant sentences for the 6 aspects.

System Overview: 2-stage

Demo FlyBase BeeSpace Gene Summarizer

Summary example (Abl)

Summary example (Camo|Sod)

Conclusion and future work Developed a system using IR and IE techniques to automatically summarize information about genes from PubMed abstracts Dependency on the high-quality training data in FlyBase –Incorporate more training data from other model organism databases and resources such as GeneRIF in Entrez Gene –A mixture of data from different resources will reduce the domain bias and help to build a general tool for gene summarization. –Cross-species application: summarize bee genes using training data from other organisms, e.g., fly, mouse? Automatic hypothesis generation: treat the summaries as a knowledge base about genes and derive relationships (interactions) between genes.

Thanks

Related work Mostly on IE: using NLP to identify relevant phrases and relations in text, such as protein-protein interactions (Ref.[1],[2]) Genomics Track in TREC (Text REtrieval Conference) 2003: extracting the GeneRIF statement from the MEDLINE article News summarization (Ref. [3])

Keyword Retrieval Module Dictionary-based keyword retrieval: retrieve all documents containing any synonym of the target gene. –Input: gene name –Output: relevant documents 1. Gene SynSet construction 2. Keyword retrieval

KR module

Gene SynSet Construction Gene SynSet: a set of synonyms of the target gene Variation in gene name spelling –gene cAMP dependent protein kinase 2: PKA C2, Pka C2, Pka-C2,… –normalized to “pka c 2” Enforce the exact match of the token sequence
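The normalization and exact token-sequence match described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `normalize_gene_name` and `contains_token_sequence` are hypothetical, and the specific rules (lowercase, treat punctuation as spaces, split at letter/digit boundaries) are assumptions chosen to reproduce the "pka c 2" example.

```python
import re

def normalize_gene_name(name):
    """Return the normalized token list for a gene name or a piece of text.

    Assumed rules: lowercase, replace non-alphanumeric characters with
    spaces, and split wherever a letter meets a digit.
    """
    lowered = name.lower()
    # Anything that is not a letter or digit becomes a space ("pka-c2" -> "pka c2").
    cleaned = re.sub(r"[^a-z0-9]+", " ", lowered)
    # Insert a space at each letter/digit boundary ("c2" -> "c 2").
    split = re.sub(r"(?<=[a-z])(?=[0-9])|(?<=[0-9])(?=[a-z])", " ", cleaned)
    return split.split()

def contains_token_sequence(text, synonym_tokens):
    """True if the normalized text contains the synonym as a contiguous token run."""
    n = len(synonym_tokens)
    if n == 0:
        return False
    tokens = normalize_gene_name(text)
    return any(tokens[i:i + n] == synonym_tokens
               for i in range(len(tokens) - n + 1))

# All three spelling variants from the slide map to the same token sequence.
variants = ["PKA C2", "Pka C2", "Pka-C2"]
synset = {" ".join(normalize_gene_name(v)) for v in variants}
```

Matching on whole token sequences rather than substrings means that, under these assumed rules, "Pka-C25" would not be retrieved for the synonym "pka c 2".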

Information Extraction Module Takes a set of documents returned from the KR module, and extracts sentences that contain useful factual information about the target gene. –Input: relevant documents –Output: gene summary 1. Training data generation 2. Sentence extraction

IE module

Training Data Generation Construct a training data set consisting of “typical” sentences describing the six categories, using three resources –the Summary pages ( 0017) –the Attributed data pages ( 0017&content=ref-data) –the references

Sentence Extraction To extract sentences related to each category for the target gene, we consider 3 aspects of information –Relevance to each specified category –Relevance to its source document –Sentence location in its source abstract

Scoring strategies Category relevance score (S_c): –Vector space model: V_c for each category, V_s for each sentence, S_c = cos(V_c, V_s) Document relevance score (S_d): –V_d for each document, S_d = cos(V_d, V_s) Location score (S_l): –S_l = 1 for the last sentence of an abstract, 0 otherwise. Sentence ranking: S = 0.5 S_c + 0.3 S_d + 0.2 S_l
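The three scores and their weighted combination can be illustrated with a short sketch. It assumes term vectors are stored as sparse dictionaries mapping terms to weights; that representation and the function names `cosine` and `sentence_score` are illustrative choices, not taken from the slides. The weights 0.5, 0.3 and 0.2 and the last-sentence rule are those given above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term vectors (dict: term -> weight)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

def sentence_score(v_s, v_c, v_d, is_last_sentence):
    """Combined score S = 0.5*S_c + 0.3*S_d + 0.2*S_l for one sentence and one category."""
    s_c = cosine(v_c, v_s)                   # category relevance
    s_d = cosine(v_d, v_s)                   # relevance to the source document
    s_l = 1.0 if is_last_sentence else 0.0   # location score
    return 0.5 * s_c + 0.3 * s_d + 0.2 * s_l
```

Because each component lies in [0, 1] when term weights are non-negative, the combined score also lies in [0, 1], so scores are comparable across sentences within a category.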

Summary generation Keep only 2 top-ranked categories for each sentence. Generate a paragraph-long summary by combining the top sentence of each category Pick top sentences with score >threshold as the category-based summary, similar to the “attribute data” report in FlyBase
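A minimal sketch of these three steps, assuming each sentence already carries one combined score per category. The function name `build_summaries`, the input layout, and the choice to skip a sentence that has already been placed in the paragraph are assumptions made for illustration; the slides do not give a threshold value, so it is left as a parameter.

```python
CATEGORIES = ["GP", "EL", "SI", "WFPI", "MP", "GI"]

def build_summaries(scored, threshold):
    """Build the paragraph summary and the category-based summary.

    scored: list of (sentence, {category: score}) pairs.
    Returns (paragraph, by_category).
    """
    per_category = {c: [] for c in CATEGORIES}
    for sentence, scores in scored:
        # Keep only the 2 top-ranked categories for this sentence.
        top_two = sorted(scores, key=scores.get, reverse=True)[:2]
        for c in top_two:
            per_category[c].append((scores[c], sentence))

    paragraph = []
    by_category = {}
    for c in CATEGORIES:
        ranked = sorted(per_category[c], key=lambda pair: pair[0], reverse=True)
        # Paragraph summary: the top sentence of each category
        # (assumption: a sentence is not repeated if it tops two categories).
        if ranked and ranked[0][1] not in paragraph:
            paragraph.append(ranked[0][1])
        # Category-based summary: every sentence scoring above the threshold.
        by_category[c] = [s for score, s in ranked if score > threshold]
    return " ".join(paragraph), by_category
```

A category with no surviving sentences simply yields an empty list, which mirrors the case raised in the Discussion slide where little has been published on an aspect of a gene.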

Experiments PubMed abstracts on “Drosophila” Implementation on top of the Lemur Toolkit 10 genes randomly selected from FlyBase for evaluation

Evaluation 3 experiments conducted on the sentences containing the target gene, and top-k precisions are calculated. –Baseline run (BL): randomly select k sentences –CatRel: use Category Relevance Score to rank sentences and select the top-k –Comb: combine three scores to rank sentences Ask two annotators with domain knowledge to judge the relevance for each category Criterion: A sentence is considered to be relevant to a category if and only if it contains information on this aspect, regardless of its extra information, if any.
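Top-k precision for one category can be computed as in the sketch below. The function name `precision_at_k` and the use of a set of annotator-approved sentences are illustrative assumptions; the slides do not say how the two annotators' judgments were merged.

```python
def precision_at_k(ranked_sentences, relevant, k):
    """Fraction of the top-k ranked sentences judged relevant to the category.

    ranked_sentences: sentences ordered by a run (BL, CatRel, or Comb).
    relevant: set of sentences the annotators judged relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked_sentences[:k]
    # Standard precision@k: divide by k even if fewer than k sentences exist.
    return sum(1 for s in top if s in relevant) / k
```

For the baseline run, `ranked_sentences` would be a random ordering of the sentences that contain the target gene.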

Precision of the top-k sentences

Discussion Improvements over the baseline are most pronounced for the EL, SI, MP, and GI categories. –These four categories are more specific and thus easier to detect than the other two, GP and WFPI. Problem of predefined categories –Not all genes fit into this framework. E.g., the gene Amy-d, an enzyme involved in carbohydrate metabolism, is not typically studied by genetic means, hence the low precision for MP and GI. –Not a major problem: low precision in some cases is probably caused by the fact that there is little research on this aspect.

Conclusion and future work Proposed a novel problem in biomedical text mining: automatic structured gene summarization Developed a system using IR and IE techniques to automatically summarize information about genes from PubMed abstracts Dependency on the high-quality training data in FlyBase –Incorporate more training data from other model organism databases and resources such as GeneRIF in Entrez Gene –A mixture of data from different resources will reduce the domain bias and help to build a general tool for gene summarization.

References 1. L. Hirschman, J. C. Park, J. Tsujii, L. Wong, C. H. Wu (2002) Accomplishments and challenges in literature data mining for biology. Bioinformatics 18(12). 2. H. Shatkay, R. Feldman (2003) Mining the Biomedical Literature in the Genomic Era: An Overview. JCB 10(6). 3. D. Marcu (2003) Automatic Abstracting. Encyclopedia of Library and Information Science.

Vector Space Model Term vector: reflects the use of different words w_{i,j}: weight of term t_i in vector j