Citation-based Extraction of Core Contents from Biomedical Articles

1 Citation-based Extraction of Core Contents from Biomedical Articles
Rey-Long Liu Dept. of Medical Informatics Tzu Chi University Taiwan

2 Outline
Background
Problem definition
The proposed technique: CoreCE
Empirical evaluation
Conclusion

3 Background

4 Core Contents of Biomedical Articles
Core contents of a scholarly article a are the textual contents about
  Research goal of a
  Research background of a
  Research conclusion of a

5 Why Extraction of the Core Contents?
Indexing of the articles
Mining & analysis of highly related evidence
Keyword-based search of the articles
  Search engines often work by keyword input
But the extraction is challenging
  Core content of an article a may be expressed in different ways and scattered in a

6 Selected by biomedical experts for <erythropoietin, anemia>
They are highly related to each other
Recommended by PubMed, but not highly related to <erythropoietin, anemia>

7 Problem Definition

8 Goal & Contribution
Goal
  Given a scholarly article a, extract the core content of a
Contribution
  Developing a technique CoreCE (Core Content Extractor) that extracts the core content based on how the article cites references → citation-based extraction

9 Related Work
Extraction of citation links
  In-link citations (how article a is cited by others)
  Out-link citations (how article a cites others)
  → Cannot support keyword-based retrieval
Extraction of textual contents
  Certain important parts (e.g., titles and abstracts)
  Certain terms with higher weights (e.g., TF-IDF weight)
  → But core content of an article a may be expressed in different ways and scattered in a

10 The Proposed Technique: CoreCE

11

12 Basic Definitions

13 Interesting Ideas of CoreCE
Core content of article a is extracted from
  Title and abstract of a, AND
  Titles of the references cited by a
Term frequency of a term t is amplified if t appears in citation passages of the references cited by a
The core content is represented by plain text
  Applicable to keyword-based indexing & retrieval
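
The ideas on this slide can be sketched in a few lines. This is only an illustration, not the CoreCE formula, which the slides do not give. It assumes simple alphanumeric tokenization and a multiplicative amplification factor `boost`; the function names, parameter names, and the value of `boost` are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def core_term_weights(title, abstract, reference_titles, citation_passages, boost=2.0):
    """Count terms in the title, abstract, and reference titles of an article,
    then amplify the count of every term that also occurs in a citation passage.

    `boost` is a hypothetical amplification factor, not a value from the slides.
    """
    counts = Counter(tokenize(title))
    counts.update(tokenize(abstract))
    for ref_title in reference_titles:
        counts.update(tokenize(ref_title))

    # Terms seen in the passages where the article cites its references.
    cited_terms = set()
    for passage in citation_passages:
        cited_terms.update(tokenize(passage))

    return {
        term: count * boost if term in cited_terms else float(count)
        for term, count in counts.items()
    }
```

Because the result is just a term-to-weight mapping over plain text, it could feed an ordinary keyword index, which is the point the slide makes about applicability to keyword-based indexing and retrieval.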

14 Empirical Evaluation

15 The data
Two sets of articles
Highly related biomedical articles:
  For each gene-disease pair <g,d>, collect the biomedical articles that biomedical experts selected to annotate the pair (noted by DisGeNET)
Near-miss biomedical articles (non-highly related articles):
  For each gene-disease pair <g,d>, collect articles using two queries: “g NOT d” and “d NOT g”
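
The two near-miss queries can be generated mechanically for each pair. A small sketch follows; the function name is hypothetical, and it assumes the gene and disease names can be used as query terms as-is (multi-word names may need quoting, depending on the search engine).

```python
def near_miss_queries(gene, disease):
    """Build the two near-miss queries "g NOT d" and "d NOT g" for a pair <g, d>."""
    return [f"{gene} NOT {disease}", f"{disease} NOT {gene}"]


# Example with the pair used earlier in the slides.
queries = near_miss_queries("erythropoietin", "anemia")
# queries == ["erythropoietin NOT anemia", "anemia NOT erythropoietin"]
```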

16 Data statistics
53 gene-disease pairs
9,876 articles, including 53 targets + 9,823 candidates
435,786 out-link references

17 The Systems to Be Evaluated
(1) Title Only
(2) Abstract Only
(3) Title+Abstract
(4) Title+Abstract+ReferenceTitles
(5) Whole Article (including the main body)
(6) CoreCE

18 The Underlying Inter-Article Similarity Measure
One of the state-of-the-art measures:

19 Evaluation Criterion
MAP (Mean Average Precision)
  If a system ranks the articles that are highly related to r higher, the average precision (AvgP) for the gene-disease pair will be higher
  MAP is simply the average of the AvgP values over all gene-disease pairs
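
A minimal sketch of the criterion, assuming the standard definition of average precision (the slides do not spell out the formula). Function names are hypothetical, and article identifiers in each ranking are assumed to be unique.

```python
def average_precision(ranked_ids, relevant_ids):
    """AvgP for one gene-disease pair: mean of the precision values taken at
    the rank of each highly related article in the ranked list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, article_id in enumerate(ranked_ids, start=1):
        if article_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings):
    """MAP: average of AvgP over all pairs.

    `rankings` is a list of (ranked_ids, relevant_ids) tuples, one per pair.
    """
    if not rankings:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in rankings) / len(rankings)
```

For a ranking `["a", "b", "c", "d"]` with highly related articles `{"a", "c"}`, AvgP is (1/1 + 2/3) / 2, so moving "c" up to rank 2 would raise it to 1.0.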

20 Average
If the articles that are highly related to r are ranked within the top-X positions, the value for the gene-disease pair will be higher
The average is simply the average of these values over all gene-disease pairs

21 Result With the core contents extracted by CoreCE, the system performs significantly better in ranking highly related articles

22 CoreCE helps to rank highly related articles at top positions (top-1 and top-3) for a higher percentage of the tests

23 CoreCE performs better when the size is set to 5; however, the performance differences are not statistically significant

24 Conclusion

25 Core content of a scholarly article a is
  The fundamental basis for the indexing, retrieval, and analysis of scientific literature, BUT
  Scattered in a and expressed with different terms
We develop CoreCE that
  Extracts the core content based on titles and citation passages of the references cited by a
The idea of CoreCE can be
  Incorporated as a front-end processor for search engines to properly index scholarly articles

