Rey-Long Liu Dept. of Medical Informatics Tzu Chi University Taiwan

1 Identification of Biomedical Articles with Highly Related Core Contents
Rey-Long Liu, Dept. of Medical Informatics, Tzu Chi University, Taiwan

2 Outline
Background
Problem definition
The proposed technique: CCSE (Core Content Similarity Estimation)
Empirical evaluation
Conclusion

3 Background

4 Core Contents of Biomedical Articles
The core contents of a scholarly article a are the textual contents about:
  Research goal of a
  Research background of a
  Research conclusion of a

5 Similarity Estimation for the Core Contents
Goal: retrieval of highly related articles
  Mining and analysis of highly related evidence
  A typical goal of existing search engines
Challenge: recognition of the core contents
  The core content of an article a may be briefly expressed in the title and scattered in the abstract

6 Selected by biomedical experts for <erythropoietin, anemia>: they are highly related to each other
Recommended by PubMed, but NOT highly related to <erythropoietin, anemia>

7 Problem Definition

8 Goal
Developing a technique, CCSE (Core Content Similarity Estimation)
  Given: titles and abstracts of two articles a1 and a2
  Output: core content similarity between a1 and a2

9 Contributions
CCSE works on titles and abstracts only, which are publicly available
CCSE improves inter-article similarity estimation by considering the core contents of the articles

10 Related Work
Inter-article similarity based on citation links
  Example I: out-link citations (by bibliographic coupling, BC)
  Example II: in-link citations (by co-citation, CC)
  Weakness: citation links are often not available on the Internet (many articles have no in-link citations at all)
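
As a rough illustration of the two link-based measures above, the sketch below counts shared references (bibliographic coupling) and shared citing articles (co-citation) on a toy citation map. The article IDs and the `cites` dictionary are made up for illustration, and raw counts are used here, whereas practical systems often normalize them.

```python
# Toy citation data: article -> set of articles it cites (out-links).
# All IDs are hypothetical.
cites = {
    "a1": {"r1", "r2", "r3"},
    "a2": {"r2", "r3", "r4"},
    "a3": {"a1", "a2"},
    "a4": {"a1"},
}


def bibliographic_coupling(x, y, cites):
    """Number of references that x and y both cite (shared out-links)."""
    return len(cites.get(x, set()) & cites.get(y, set()))


def co_citation(x, y, cites):
    """Number of articles that cite both x and y (shared in-links)."""
    return sum(1 for refs in cites.values() if x in refs and y in refs)


bc = bibliographic_coupling("a1", "a2", cites)  # shared references: r2, r3
cc = co_citation("a1", "a2", cites)             # only a3 cites both
print(bc, cc)
```

Both measures return 0 for any pair without citation data, which is the weakness noted on the slide.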

11 Related Work (cont.)
Inter-article similarity based on textual contents
  Working on publicly available parts: titles and abstracts
  Considering weights of terms (e.g., TFIDF weight)
  Weakness: core content similarity is not considered (due to the difficulty of recognizing the core contents)
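
To make the text-based approach concrete, here is a minimal sketch of TF-IDF weighting with cosine similarity over tokenized titles and abstracts. The smoothed IDF variant, the whitespace tokenization, and the three toy documents are assumptions for illustration, not details from the slides.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one TF-IDF vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption; many variants exist).
        vectors.append({t: tf[t] * math.log(1 + n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical title+abstract texts, tokenized by whitespace.
docs = [
    "erythropoietin treatment of anemia in patients".split(),
    "anemia response to erythropoietin therapy".split(),
    "gene expression in yeast".split(),
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
print(sim_related > sim_unrelated)
```

Every term is weighted the same way wherever it appears, so this kind of score cannot tell a goal term from an incidental one.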

12 The Proposed Technique: CCSE

13 Main Ideas
Title of a1 vs. title of a2: how are the goal terms of a1 related to the goal of a2?
Abstract of a1 vs. abstract of a2: how are the background and conclusion terms of a1 related to the background and conclusion of a2, respectively?

14 Main Ideas (cont.)
Three kinds of relatedness of a term t to the core content of an article a:
  Rgoal: relatedness to goal
  Rback: relatedness to background
  Rconc: relatedness to conclusion
Rgoal, Rback, and Rconc are estimated based on the positions of t in the title and the abstract of a

15 Step 1/2: Estimation of Rgoal, Rback, and Rconc based on positions of the term:

16 Step 2/2: Estimating inter-article similarity between two articles a1 and a2
Any mismatch between the core contents will significantly reduce the inter-article similarity
Similarity between a1 and a2 is based on goal match, background match, and conclusion match between a1 and a2

17 Min: Matchgoal(a1, a2) is based on how terms in the title of a1 are related to the goals of a1 and a2

18 Min: Matchback(a1, a2) is based on how terms in the abstract of a1 are related to the backgrounds of a1 and a2

19 Min: Matchconc(a1, a2) is based on how terms in the abstract of a1 are related to the conclusions of a1 and a2

20 Interesting Features of CCSE
Inter-article similarity is composed of three parts:
  Goal similarity
  Background similarity
  Conclusion similarity
These similarities are estimated based on the positions of the terms appearing in the title and the abstract of the article
Any mismatch between the core contents will significantly reduce the inter-article similarity
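
The exact CCSE formulas are not reproduced in this transcript, so the following is only a loose sketch of the overall shape described above, not the author's method. Every scoring function is a placeholder assumption: `r_goal`, `r_back`, and `r_conc` are simple position heuristics, each match averages a minimum over the terms of a1, and the three matches are multiplied so that a mismatch in any one part pulls the whole score down.

```python
def r_goal(term, title, abstract):
    # Placeholder assumption: title terms are the most goal-related.
    if term in title:
        return 1.0
    return 0.5 if term in abstract else 0.0


def r_back(term, abstract):
    # Placeholder assumption: the earlier a term first appears in the
    # abstract, the more background-related it is.
    if term not in abstract:
        return 0.0
    return 1.0 - abstract.index(term) / len(abstract)


def r_conc(term, abstract):
    # Placeholder assumption: the later a term last appears in the
    # abstract, the more conclusion-related it is.
    if term not in abstract:
        return 0.0
    last = len(abstract) - 1 - abstract[::-1].index(term)
    return (last + 1) / len(abstract)


def mean_min(terms, score1, score2):
    """Average, over distinct terms, of the smaller of two relatedness scores."""
    terms = set(terms)
    if not terms:
        return 0.0
    return sum(min(score1(t), score2(t)) for t in terms) / len(terms)


def ccse_like_similarity(a1, a2):
    """a1 and a2 are (title_tokens, abstract_tokens) pairs."""
    t1, ab1 = a1
    t2, ab2 = a2
    goal = mean_min(t1, lambda t: r_goal(t, t1, ab1), lambda t: r_goal(t, t2, ab2))
    back = mean_min(ab1, lambda t: r_back(t, ab1), lambda t: r_back(t, ab2))
    conc = mean_min(ab1, lambda t: r_conc(t, ab1), lambda t: r_conc(t, ab2))
    # Product (an assumption): a low score in any part lowers the total.
    return goal * back * conc


# Hypothetical tokenized articles.
a1 = ("erythropoietin therapy anemia".split(),
      "anemia is common erythropoietin improved hemoglobin".split())
a2 = ("erythropoietin response anemia".split(),
      "anemia patients received erythropoietin hemoglobin improved".split())
a3 = ("yeast gene expression".split(),
      "yeast cells express genes".split())

sim_12 = ccse_like_similarity(a1, a2)
sim_13 = ccse_like_similarity(a1, a3)
print(sim_12 > sim_13)
```

Because the matches are computed over the terms of a1, this sketch is not symmetric in its two arguments.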

21 Empirical Evaluation

22 The data
Two sets of articles
Highly related biomedical articles: for each gene-disease pair <g,d>, collect the biomedical articles that biomedical experts selected to annotate the pair (noted by DisGeNET)
Near-miss biomedical articles (non-highly related articles): for each gene-disease pair <g,d>, collect articles using two queries: “g NOT d” and “d NOT g”
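
The two near-miss queries follow a fixed pattern, so they can be generated per pair as below. The helper name `near_miss_queries` is hypothetical, and the example pair is the one used earlier in the slides; how the queries were actually submitted is not stated.

```python
def near_miss_queries(gene, disease):
    """Build the two near-miss query strings for one gene-disease pair."""
    return [f"{gene} NOT {disease}", f"{disease} NOT {gene}"]


queries = near_miss_queries("erythropoietin", "anemia")
for q in queries:
    print(q)
```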

23 Data statistics
53 gene-disease pairs
9,875 articles, including 53 targets + 9,822 candidates
435,786 out-link references

24 The Baseline Systems
(1) Link-based inter-article similarity: bibliographic coupling (BC)
(2) Text-based inter-article similarity: BM25 (one of the best in the biomedical domain)
(3) Biomedical search engine: PubMed (popular and one of the best in the biomedical domain)
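
For readers unfamiliar with the BM25 baseline, here is a small sketch of the standard Okapi BM25 scoring function. The slides do not say how BM25 was configured, so the parameters `k1=1.5` and `b=0.75`, the non-negative IDF variant, and the use of one article's terms as the query are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in docs against the query terms."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in set(query):
            if t not in tf:
                continue
            # Non-negative IDF variant (an assumption).
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical candidate articles, tokenized by whitespace.
docs = [
    "erythropoietin therapy for anemia".split(),
    "iron deficiency anemia in children".split(),
    "yeast gene expression".split(),
]
scores = bm25_scores(["erythropoietin", "anemia"], docs)
print(scores)
```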

25 Evaluation Criteria
MAP (Mean Average Precision)
  If a system ranks the articles that are highly related to r higher, the average precision (AvgP) for the gene-disease pair will be higher
  MAP is simply the average of the AvgP values over all gene-disease pairs
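
The MAP computation described above can be sketched as follows, using the usual definition of average precision. The rankings and relevance sets are toy data standing in for two gene-disease pairs.

```python
def average_precision(ranked, relevant):
    """AvgP for one ranking: mean of precision at each relevant hit."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP: average of AvgP over all (ranking, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy data: one (ranking, highly related set) per gene-disease pair.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AvgP = (1/1 + 2/3) / 2
    (["x1", "x2"], {"x2"}),                    # AvgP = 1/2
]
map_score = mean_average_precision(runs)
print(map_score)
```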

26 Average
If the articles that are highly related to r are ranked within the top-X positions, the value for the gene-disease pair will be higher
The average is simply the average of these values over all gene-disease pairs
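
The name of this measure is not preserved on the slide. Assuming it is a precision-at-top-X style measure (an assumption, not confirmed by the source), the computation would look roughly like this on toy rankings.

```python
def precision_at_x(ranked, relevant, x):
    """Fraction of the top-x ranked articles that are highly related."""
    top = ranked[:x]
    return sum(1 for doc in top if doc in relevant) / x


def average_precision_at_x(runs, x):
    """Average of the top-x precision over all (ranking, relevant set) pairs."""
    return sum(precision_at_x(r, rel, x) for r, rel in runs) / len(runs)


# Toy data: one (ranking, highly related set) per gene-disease pair.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["x1", "x2", "x3", "x4"], {"x2"}),
]
avg_p_at_2 = average_precision_at_x(runs, 2)
print(avg_p_at_2)
```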

27 Result CCSE performs significantly better than BC and BM25

28 CCSE performs better than PubMed in all evaluation criteria:

29 Conclusion

30 Our motivation: core contents of scholarly articles are essential for retrieval of highly related scientific evidence, but the core contents are scattered in the titles and abstracts of articles
We develop CCSE, which estimates inter-article similarity based on the similarities in goals, backgrounds, and conclusions of two articles
The idea of CCSE can be incorporated into search engines to properly retrieve highly related scholarly articles

