Rey-Long Liu Dept. of Medical Informatics Tzu Chi University Taiwan

Presentation transcript:

Identification of Biomedical Articles with Highly Related Core Contents. Rey-Long Liu, Dept. of Medical Informatics, Tzu Chi University, Taiwan

Outline: Background; Problem definition; The proposed technique: CCSE (Core Content Similarity Estimation); Empirical evaluation; Conclusion

Background

Core Contents of Biomedical Articles: The core contents of a scholarly article a are the textual contents about the research goal of a, the research background of a, and the research conclusion of a.

Similarity Estimation for the Core Contents. Goal: retrieval of highly related articles, for mining and analysis of highly related evidence; this is a typical goal of existing search engines. Challenge: recognition of the core contents. The core content of an article a may be briefly expressed in the title and scattered in the abstract.

Selected by biomedical experts for <erythropoietin, anemia>: they are highly related to each other. Recommended by PubMed, but NOT highly related to <erythropoietin, anemia>.

Problem Definition

Goal: Developing a technique, CCSE (Core Content Similarity Estimation). Given: the titles and abstracts of two articles a1 and a2. Output: the core content similarity between a1 and a2.

Contributions: (1) CCSE works on titles and abstracts only, which are publicly available. (2) CCSE improves inter-article similarity estimation by considering the core contents of the articles.

Related Work: Inter-article similarity based on citation links. Example I: out-link citations (bibliographic coupling, BC). Example II: in-link citations (co-citation, CC). Weakness: citation links are often not available on the Internet, and many articles have no in-link citations at all.
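The two link-based measures can be illustrated with a minimal sketch. It uses raw shared-link counts, which is the textbook form of both measures; the slides do not say which normalization (if any) the evaluated BC baseline uses, and the toy citation data and function names below are hypothetical.

```python
from typing import Dict, Set

def bibliographic_coupling(refs: Dict[str, Set[str]], a1: str, a2: str) -> int:
    """Count the references cited by both a1 and a2 (shared out-links)."""
    return len(refs.get(a1, set()) & refs.get(a2, set()))

def co_citation(refs: Dict[str, Set[str]], a1: str, a2: str) -> int:
    """Count the articles that cite both a1 and a2 (shared in-links)."""
    return sum(1 for cited in refs.values() if a1 in cited and a2 in cited)

# Hypothetical toy data: article -> set of articles it cites.
refs = {
    "a1": {"r1", "r2", "r3"},
    "a2": {"r2", "r3", "r4"},
    "c1": {"a1", "a2"},
    "c2": {"a1"},
}

print(bibliographic_coupling(refs, "a1", "a2"))  # shared references: r2, r3
print(co_citation(refs, "a1", "a2"))             # only c1 cites both
```

Both functions need the citation graph, which is exactly the data the weakness above says is often unavailable.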

Related Work (cont.): Inter-article similarity based on textual contents. These approaches work on the publicly available parts (titles and abstracts) and consider the weights of terms (e.g., TFIDF weight). Weakness: they do not consider core content similarity, because the core contents are difficult to recognize.
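A text-based similarity of this kind can be sketched as TF-IDF weighting followed by cosine similarity. This is a generic illustration, not the specific method of any cited system: whitespace tokenization, the idf form log(N/df), and the toy documents are all assumptions made here.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical title-like texts.
docs = [
    "erythropoietin treats anemia in patients",
    "erythropoietin and anemia in kidney disease",
    "gene expression in yeast",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Every term is weighted the same way wherever it occurs, so the measure cannot tell a goal term from a background term; that is the weakness noted above.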

The Proposed Technique: CCSE

Main Ideas: How are the goal terms of a1 related to the goal of a2? How are the background and conclusion terms of a1 related to the background and conclusion of a2, respectively? (Diagram labels: Title of a1, Title of a2, Abstract of a1, Abstract of a2.)

Main Ideas (cont.): Three kinds of relatedness of a term t to the core content of an article a: Rgoal (relatedness to the goal), Rback (relatedness to the background), and Rconc (relatedness to the conclusion). Rgoal, Rback, and Rconc are estimated based on the positions of t in the title and the abstract of a.

Step 1/2: Estimation of Rgoal, Rback, and Rconc based on positions of the term:
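The formulas for this step are not preserved in the transcript, so the following is only a hypothetical sketch of what a position-based estimate could look like. It is not CCSE's actual definition. The assumptions made here are that a term in the title counts as fully goal-related, that terms near the start of the abstract are background-related, and that terms near the end are conclusion-related. The function names and example texts are invented for illustration.

```python
def positions(term, tokens):
    """Indices at which term occurs in a token list."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def relatedness(term, title, abstract):
    """Hypothetical position-based scores in [0, 1]; NOT the CCSE formulas."""
    term = term.lower()
    title_toks = title.lower().split()
    abs_toks = abstract.lower().split()
    # Assumption: presence in the title signals the research goal.
    r_goal = 1.0 if term in title_toks else 0.0
    occ = positions(term, abs_toks)
    if not occ:
        return {"goal": r_goal, "back": 0.0, "conc": 0.0}
    last = max(len(abs_toks) - 1, 1)
    rel = [i / last for i in occ]  # 0.0 = first token, 1.0 = last token
    # Assumption: early occurrences suggest background, late ones conclusion.
    return {"goal": r_goal, "back": max(1.0 - p for p in rel), "conc": max(rel)}

title = "erythropoietin for anemia"
abstract = "anemia is common in kidney disease we found erythropoietin effective"
print(relatedness("anemia", title, abstract))
print(relatedness("effective", title, abstract))
```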

Step 2/2: Estimating inter-article similarity between two articles a1 and a2. Any mismatch between the core contents will significantly reduce the inter-article similarity. The similarity between a1 and a2 is based on the goal match, background match, and conclusion match between a1 and a2.

Min: Matchgoal(a1, a2) is based on how the terms in the title of a1 are related to the goals of a1 and a2

Min: Matchback(a1, a2) is based on how the terms in the abstract of a1 are related to the backgrounds of a1 and a2

Min: Matchconc(a1, a2) is based on how the terms in the abstract of a1 are related to the conclusions of a1 and a2
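The exact match formulas are not preserved in the transcript. As a hedged sketch of the idea only, the code below takes the minimum of a term's relatedness in the two articles, normalizes by the maximum (a weighted Jaccard form chosen here, not taken from the slides), and multiplies the three matches so that a mismatch in any one part pulls the overall similarity down sharply. The per-term relatedness dictionaries are hypothetical inputs.

```python
KINDS = ("goal", "back", "conc")

def match(rel1, rel2, kind):
    """Hypothetical Min-based match for one part of the core content."""
    terms = set(rel1) | set(rel2)
    lo = sum(min(rel1.get(t, {}).get(kind, 0.0),
                 rel2.get(t, {}).get(kind, 0.0)) for t in terms)
    hi = sum(max(rel1.get(t, {}).get(kind, 0.0),
                 rel2.get(t, {}).get(kind, 0.0)) for t in terms)
    return lo / hi if hi else 0.0

def similarity(rel1, rel2):
    """Assumed combination: the product of the three matches."""
    result = 1.0
    for kind in KINDS:
        result *= match(rel1, rel2, kind)
    return result

# Hypothetical relatedness values: term -> {"goal", "back", "conc"} scores.
a1 = {"anemia": {"goal": 1.0, "back": 1.0, "conc": 0.0},
      "effective": {"goal": 0.0, "back": 0.0, "conc": 1.0}}
a3 = {"anemia": {"goal": 1.0, "back": 1.0, "conc": 0.0},
      "harmful": {"goal": 0.0, "back": 0.0, "conc": 1.0}}

print(similarity(a1, a1))  # identical core contents
print(similarity(a1, a3))  # same goal and background, different conclusion
```

In this toy case a1 and a3 agree on goal and background but not on conclusion, and the product drops to zero, which mirrors the slide's point that any mismatch significantly reduces the similarity.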

Interesting Features of CCSE: Inter-article similarity is composed of three parts: goal similarity, background similarity, and conclusion similarity. These similarities are estimated based on the positions of the terms appearing in the title and the abstract of each article. Any mismatch between the core contents will significantly reduce the inter-article similarity.

Empirical Evaluation

The data: two sets of articles. Highly related biomedical articles: for each gene-disease pair <g,d>, collect the biomedical articles that biomedical experts selected to annotate the pair (as noted by DisGeNET). Near-miss biomedical articles (non-highly related articles): for each gene-disease pair <g,d>, collect articles using two queries: “g NOT d” and “d NOT g”.

Data statistics: 53 gene-disease pairs; 9,875 articles, including 53 targets + 9,822 candidates; 435,786 out-link references

The Baseline Systems: (1) Link-based inter-article similarity: bibliographic coupling (BC). (2) Text-based inter-article similarity: BM25 (one of the best in the biomedical domain). (3) Biomedical search engine: PubMed (popular and one of the best in the biomedical domain).
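For reference, a minimal sketch of BM25 scoring follows. The slides do not give the variant or parameter settings of the BM25 baseline, so the non-negative idf form and the common defaults k1 = 1.2 and b = 0.75 are assumptions, as are the whitespace tokenization and the toy texts. Using one article's text as the query against the other articles is likewise an assumed way to turn BM25 into an inter-article similarity.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in a non-empty list against the query (Okapi BM25)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for q in query.lower().split():
            if tf[q] == 0:
                continue
            idf = math.log((n - df[q] + 0.5) / (df[q] + 0.5) + 1.0)
            length_norm = 1.0 - b + b * len(toks) / avgdl
            score += idf * tf[q] * (k1 + 1.0) / (tf[q] + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical candidate texts and query.
docs = [
    "erythropoietin treats anemia",
    "gene expression in yeast",
    "anemia in kidney disease",
]
scores = bm25_scores("erythropoietin anemia", docs)
print(scores)
```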

Evaluation Criteria: MAP (Mean Average Precision). If a system ranks the articles that are highly related to the target article r higher, the average precision (AvgP) for the gene-disease pair will be higher. MAP is simply the average of the AvgP values over all gene-disease pairs.
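MAP as described above can be computed with a short sketch. It uses the standard definition, in which average precision divides by the total number of relevant (highly related) articles; the slides do not spell out this detail, and the rankings below are hypothetical.

```python
def average_precision(ranked, relevant):
    """AvgP for one ranking: mean of the precision at each relevant hit."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)

def mean_average_precision(runs):
    """MAP: average of AvgP over all (ranking, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical runs, one per gene-disease pair.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AvgP = (1/1 + 2/3) / 2
    (["d2", "d1"], {"d1"}),                    # AvgP = 1/2
]
print(mean_average_precision(runs))
```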

Average P@X: If the articles that are highly related to r are ranked in the top X positions, P@X for the gene-disease pair will be higher. Average P@X is simply the average of the P@X values over all gene-disease pairs.
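P@X and its average can be sketched in the same way. The rankings are hypothetical, and dividing by X even when fewer than X articles are returned is the usual convention rather than something the slides specify.

```python
def precision_at(ranked, relevant, x):
    """P@X: fraction of the top-X ranked articles that are relevant."""
    relevant = set(relevant)
    return sum(1 for doc in ranked[:x] if doc in relevant) / x

def average_precision_at(runs, x):
    """Average P@X over all (ranking, relevant set) pairs."""
    return sum(precision_at(r, rel, x) for r, rel in runs) / len(runs)

# Hypothetical runs, one per gene-disease pair.
runs = [
    (["d1", "d3", "d2"], {"d1", "d3"}),  # P@2 = 2/2
    (["d2", "d1", "d4"], {"d1"}),        # P@2 = 1/2
]
print(average_precision_at(runs, 2))
```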

Result: CCSE performs significantly better than BC and BM25.

CCSE performs better than PubMed in all evaluation criteria.

Conclusion

Our motivation: Core contents of scholarly articles are essential for retrieval of highly related scientific evidence, BUT the core contents are scattered in the titles and abstracts of articles. We develop CCSE, which estimates inter-article similarity based on the similarities in the goals, backgrounds, and conclusions of two articles. The idea of CCSE can be incorporated into search engines to properly retrieve highly related scholarly articles.