Citation-based Extraction of Core Contents from Biomedical Articles

Presentation transcript:

Citation-based Extraction of Core Contents from Biomedical Articles
Rey-Long Liu, Dept. of Medical Informatics, Tzu Chi University, Taiwan

Outline
- Background
- Problem definition
- The proposed technique: CoreCE
- Empirical evaluation
- Conclusion

Background

Core Contents of Biomedical Articles
Core contents of a scholarly article a are the textual contents about:
- Research goal of a
- Research background of a
- Research conclusion of a

Why Extraction of the Core Contents?
- Indexing of the articles
- Mining & analysis of highly related evidence
- Keyword-based search of the articles (search engines often work by keyword input)
- But the extraction is challenging: the core content of an article a may be expressed in different ways and scattered in a

Figure: example articles for <erythropoietin, anemia>. One group is selected by biomedical experts for the pair, and these articles are highly related to each other. Another group is recommended by PubMed but is not highly related to <erythropoietin, anemia>.

Problem Definition

Goal & Contribution
- Goal: given a scholarly article a, extract the core content of a
- Contribution: developing a technique CoreCE (Core Content Extractor) that extracts the core content based on how the article cites references → citation-based extraction

Related Work
- Extraction of citation links
  - In-link citations (how article a is cited by others)
  - Out-link citations (how article a cites others)
  → Cannot support keyword-based retrieval
- Extraction of textual contents
  - Certain important parts (e.g., titles and abstracts)
  - Certain terms with higher weights (e.g., TF-IDF weight)
  → But the core content of an article a may be expressed in different ways and scattered in a

The Proposed Technique: CoreCE

Basic Definitions

Interesting Ideas of CoreCE
- Core content of article a is extracted from
  - Title and abstract of a, AND
  - Titles of the references cited by a
- Term frequency of a term t is amplified if t appears in citation passages of the references cited by a
- The core content is represented by plain text
  - Applicable to keyword-based indexing & retrieval
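
The ideas above can be sketched in Python as follows. This is a minimal illustration, not the CoreCE implementation: the slides do not give the amplification formula, so the multiplicative `boost` factor, the simple tokenizer, and the function names `core_content_tf` and `to_plain_text` are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def core_content_tf(title, abstract, reference_titles, citation_passages, boost=2.0):
    """Return term frequencies for the core content of one article.

    The core content is taken from the title, the abstract, and the titles
    of the cited references. `boost` is a hypothetical amplification factor
    applied to terms that also appear in citation passages.
    """
    core_text = " ".join([title, abstract, *reference_titles])
    tf = Counter(tokenize(core_text))

    # Collect every term that occurs in any citation passage.
    passage_terms = set()
    for passage in citation_passages:
        passage_terms.update(tokenize(passage))

    # Amplify the frequency of terms that occur in citation passages.
    return {t: (f * boost if t in passage_terms else f) for t, f in tf.items()}


def to_plain_text(weights):
    """Render weighted terms as plain text by repeating each term.

    Repetition is one assumed way to keep the output usable by a
    keyword-based indexer that only sees raw text.
    """
    return " ".join(" ".join([t] * max(1, round(w))) for t, w in weights.items())
```

Because the result is plain text, it could be passed unchanged to a keyword-based indexer; the repetition count then stands in for the amplified term frequency.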

Empirical Evaluation

The data
Two sets of articles:
- Highly related biomedical articles: for each gene-disease pair <g,d>, collect the biomedical articles that biomedical experts selected to annotate the pair (noted by DisGeNET)
- Near-miss biomedical articles (non-highly related articles): for each gene-disease pair <g,d>, collect articles using two queries: "g NOT d" and "d NOT g"

Data statistics
- 53 gene-disease pairs
- 9,876 articles, including 53 targets + 9,823 candidates
- 435,786 out-link references

The Systems to Be Evaluated
(1) Title Only
(2) Abstract Only
(3) Title+Abstract
(4) Title+Abstract+ReferenceTitles
(5) Whole Article (including the main body)
(6) CoreCE
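
The two near-miss queries can be built mechanically from a gene-disease pair. The helper below is a small sketch; the function name `near_miss_queries` is hypothetical, and any quoting or field tags that a real search interface might need for multi-word terms are deliberately left out.

```python
def near_miss_queries(gene, disease):
    """Build the two near-miss queries for a gene-disease pair <g, d>.

    Each query retrieves articles that mention one member of the pair
    but not the other, following the "g NOT d" and "d NOT g" patterns.
    """
    return [f"{gene} NOT {disease}", f"{disease} NOT {gene}"]


if __name__ == "__main__":
    for query in near_miss_queries("erythropoietin", "anemia"):
        print(query)
```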

The Underlying Inter-Article Similarity Measure One of the state-of-the-art measures:

Evaluation Criterion: MAP (Mean Average Precision)
- If a system ranks the articles that are highly related to the target article r higher, the average precision (AvgP) for the gene-disease pair will be higher
- MAP is the average of the AvgP values over all gene-disease pairs
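
For concreteness, MAP can be computed as in the sketch below. It assumes the standard definition of average precision (the mean of the precision values at the ranks where highly related articles appear, divided by the number of highly related articles); the slides do not spell out the formula, and the function names are chosen for this example only.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision (AvgP) of one ranked list.

    ranked_ids: article identifiers in ranked order, best first.
    relevant_ids: identifiers of the highly related articles.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, article in enumerate(ranked_ids, start=1):
        if article in relevant:
            hits += 1
            # Precision measured at the rank of this highly related article.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings):
    """MAP over a list of (ranked_ids, relevant_ids) tuples, one per gene-disease pair."""
    if not rankings:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in rankings) / len(rankings)
```

A ranking that places every highly related article ahead of all near-miss articles scores an AvgP of 1.0, so MAP rewards systems that do this consistently across gene-disease pairs.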

Average P@X
- If the articles that are highly related to r are ranked within the top-X positions, P@X for the gene-disease pair will be higher
- Average P@X is the average of the P@X values over all gene-disease pairs
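
A minimal sketch of P@X and its average follows, assuming the usual definition (the fraction of the top-X ranked articles that are highly related). The function names are illustrative and do not come from the slides.

```python
def precision_at_x(ranked_ids, relevant_ids, x):
    """P@X: fraction of the top-X ranked articles that are highly related."""
    if x <= 0:
        raise ValueError("x must be a positive integer")
    relevant = set(relevant_ids)
    top_x = ranked_ids[:x]
    # Divide by x (not len(top_x)) so that short rankings are not rewarded.
    return sum(1 for article in top_x if article in relevant) / x


def average_precision_at_x(rankings, x):
    """Average P@X over a list of (ranked_ids, relevant_ids) tuples, one per gene-disease pair."""
    if not rankings:
        return 0.0
    return sum(precision_at_x(r, rel, x) for r, rel in rankings) / len(rankings)
```

With X = 1, the measure reduces to whether the top-ranked article is highly related, which corresponds to the top-1 setting reported in the results.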

Result
With the core contents extracted by CoreCE, the system performs significantly better in ranking highly related articles.

CoreCE helps to rank highly related articles at top positions (top-1 and top-3) for a higher percentage of the tests.

CoreCE performs better when the size is set to 5; however, the performance differences are not statistically significant.

Conclusion

- Core content of a scholarly article a is
  - The fundamental basis for the indexing, retrieval, and analysis of scientific literature, BUT
  - Scattered in a and expressed with different terms
- We develop CoreCE, which
  - Extracts the core content based on titles and citation passages of the references cited by a
- The idea of CoreCE can be
  - Incorporated as a front-end processor for search engines to properly index scholarly articles