PLOS Facilitating Text & Data Mining The Role the Publisher Can Play

Presentation on theme: "PLOS Facilitating Text & Data Mining The Role the Publisher Can Play"— Presentation transcript:

1 PLOS Facilitating Text & Data Mining The Role the Publisher Can Play
Rosemary Dickin, Editorial Manager, PLOS Computational Biology and PLOS Genetics
July 2017

2 Outline
PLOS Policies
How we support TDM

3 Background & Policies

4 PLOS Mission
Founded in 2001, PLOS is a non-profit publisher of seven journals and an advocacy organization with a mission to accelerate progress in science and medicine by leading a transformation in research communication.

5 We publish a lot of content: over 25k articles each year
Over 190k articles in total up to 2016
We want people to read and reuse our content

6 PLOS Core Principles
PLOS and its authors choose to make scientific and medical research articles openly available for the advancement of science and the greater public good.

7 PLOS Supports Text & Data Mining
We believe that TDM is an important research methodology that must be supported by the keepers of the scholarly literature, funders, academic institutions – all those involved in the research endeavour.

8 PLOS Supports Text & Data Mining
By making all of our published content open access, PLOS is facilitating TDM. We hope to offer better options for accessing that content to TDM researchers moving forward. PLOS participates in industry efforts to further facilitate TDM and encourages all publishers to open their content stores to enable TDM with minimal barriers or obstacles.

9 PLOS is a Signatory to The Hague Declaration
PLOS is a signatory to and original participant in The Hague Declaration, which aims to foster agreement about how to best enable access to facts, data and ideas for knowledge discovery in the Digital Age. The declaration calls for intellectual property reform, policies to enable and reward TDM, and the development of technology and tools to allow TDM. Source: CC-BY

10 Text Mining Collection
PLOS also has a collection of 38 articles containing research, opinion and education relating to TDM from across the PLOS journals

11 Data Availability Policy
“PLOS journals require authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. When submitting a manuscript online, authors must provide a Data Availability Statement describing compliance with PLOS's policy. If the article is accepted for publication, the data availability statement will be published as part of the final article.”
Since 2014, all PLOS journals have required authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. This increases both the amount of data shared and the ease of locating data connected to an article. We believe that requiring data is the first step to ensuring greater reproducibility of research, something that TDM can also benefit from.

12 Facilitating Text & Data Mining

13 Facilitating Text & Data Mining
We provide unrestricted access to all of our articles and supplemental data in several different formats. We encourage TDM researchers to understand and use JATS XML because it provides data about the article as well as the article text in one standardized structured text file. We direct researchers to data for PLOS journals as well as other open access journals.
PLOS is sometimes contacted by researchers looking for assistance in TDM. This information is also all on our website. The preferred method of access depends on the use case.
Info:
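
As a rough illustration of why a single structured file is convenient for TDM, the sketch below pulls the DOI, title, and body paragraphs out of one JATS-style document using only the Python standard library. The sample XML is hand-written and heavily simplified, the DOI in it is made up, and `summarize_article` is a hypothetical helper name; real JATS files contain many more elements than are shown here.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written JATS-like fragment (illustrative only, not a real article).
SAMPLE_JATS = """<article>
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.1371/journal.example.0000001</article-id>
      <title-group>
        <article-title>An Example Article</article-title>
      </title-group>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Introduction</title>
      <p>First paragraph.</p>
      <p>Second paragraph with <italic>markup</italic> inside.</p>
    </sec>
  </body>
</article>"""


def summarize_article(xml_text):
    """Return the DOI, title, and body paragraphs from one JATS XML string."""
    root = ET.fromstring(xml_text)
    # Metadata and text live in the same file, so one parse serves both.
    doi_el = root.find(".//article-id[@pub-id-type='doi']")
    title_el = root.find(".//article-title")
    # itertext() flattens inline markup such as <italic> into plain text.
    paragraphs = ["".join(p.itertext()).strip() for p in root.findall("./body//p")]
    return {
        "doi": doi_el.text.strip() if doi_el is not None and doi_el.text else None,
        "title": "".join(title_el.itertext()).strip() if title_el is not None else None,
        "paragraphs": paragraphs,
    }


if __name__ == "__main__":
    print(summarize_article(SAMPLE_JATS))
```

Because the element names come from the shared tag set rather than from one publisher's page layout, the same function should in principle work on JATS files from other open access journals, although individual publishers may use optional elements differently.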

14 PLOS Search API
Every PLOS article is indexed by DOI in our Solr search API. The search API can be used to download PLOS article metadata, to identify a subset of articles of interest, or to get the DOI of every published PLOS article.
Info:
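
A minimal sketch of how a Solr-style search API might be used to collect DOIs is shown below. The endpoint URL, the assumption that the `id` field holds the DOI, and the helper names `build_search_url` and `extract_dois` are all assumptions rather than details stated in the talk; `q`, `fl`, `rows`, `start`, and `wt` are standard Solr query parameters. The sample response is hand-written with made-up DOIs, so nothing here touches the network.

```python
import json
from urllib.parse import urlencode

# Assumed base URL for the PLOS Solr search API; check the PLOS API docs before use.
SEARCH_ENDPOINT = "http://api.plos.org/search"


def build_search_url(query, fields=("id",), rows=100, start=0):
    """Build a Solr-style query URL that asks for JSON output."""
    params = {
        "q": query,               # Solr query string, e.g. 'title:mining'
        "fl": ",".join(fields),   # fields to return for each document
        "rows": rows,             # page size
        "start": start,           # offset, for paging through large result sets
        "wt": "json",             # response format
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def extract_dois(response_text):
    """Collect the 'id' field (assumed to be the DOI) from a Solr JSON response."""
    payload = json.loads(response_text)
    return [doc["id"] for doc in payload["response"]["docs"] if "id" in doc]


# Hand-written sample in the usual Solr response layout, with made-up DOIs.
SAMPLE_RESPONSE = json.dumps({
    "response": {
        "numFound": 2,
        "start": 0,
        "docs": [
            {"id": "10.1371/journal.example.0000001"},
            {"id": "10.1371/journal.example.0000002"},
        ],
    }
})

if __name__ == "__main__":
    print(build_search_url("title:mining"))
    print(extract_dois(SAMPLE_RESPONSE))
```

To list every article, a caller would increase `start` by `rows` on each request until it reaches `numFound`, pausing between requests to respect whatever rate limits the service publishes.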

15 Bulk Downloads
Text and data miners generally want a copy of the entire corpus and write specialized software to process the data. Bulk downloading is the most efficient method for obtaining a copy of the entire corpus. We encourage them to go via PubMed Central (PMC), which has made this extremely easy by packaging the Open Access Subset of research articles from multiple journals into single files and making them available via the PMC OA Bulk Download FTP site.
Info:
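
Once a bulk package has been downloaded, processing it can be as simple as walking the archive and handing each article file to a parser. The sketch below assumes the package is a gzip-compressed tar archive whose article files end in `.nxml` or `.xml`; that layout, and the helper names `iter_article_xml` and `build_demo_archive`, are assumptions for illustration. A tiny in-memory archive stands in for a real download.

```python
import io
import tarfile


def iter_article_xml(fileobj, suffixes=(".nxml", ".xml")):
    """Yield (member_name, raw_bytes) for each XML file in a .tar.gz archive."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive:
            # Skip directories and non-article files such as images.
            if member.isfile() and member.name.lower().endswith(suffixes):
                extracted = archive.extractfile(member)
                if extracted is not None:
                    yield member.name, extracted.read()


def build_demo_archive():
    """Create a tiny in-memory .tar.gz so the sketch runs without any download."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, data in [
            ("Demo_Journal/article1.nxml", b"<article>one</article>"),
            ("Demo_Journal/figure1.jpg", b"not xml"),
            ("Demo_Journal/article2.xml", b"<article>two</article>"),
        ]:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name, xml_bytes in iter_article_xml(build_demo_archive()):
        print(name, len(xml_bytes))
```

For a real package, the in-memory buffer would be replaced by a file opened in binary mode, for example `open(path, "rb")`, where `path` points at the downloaded archive.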

16 Open Access & TDM
Writing specialized software takes time and effort. Writing software to download data from literally hundreds or thousands of journals is a huge barrier for TDM. Open Access (OA) journals remove this barrier in several important ways.
First, OA article text and metadata are provided in a single XML file format: the Journal Article Tag Suite (JATS). Writing software to process JATS XML requires a larger upfront investment, but the reward is the ability to process articles from multiple journals in addition to PLOS.
Second, OA articles are freely available to download and use for TDM as part of our CC-BY license standard.
Third, individual publisher APIs change frequently or do not exist. OA publishers syndicate articles to PMC, which provides this data as an ongoing service that is updated on a regular basis.
Closed access publishers often do not make their text available for TDM or only do so under certain restrictions.
Info:

17 Open Access & TDM This slide is a few years old, and is taken from a talk on OA by a researcher called Ross Mounce, but I’ve included it because it demonstrates some of the possibilities of OA – and I like the idea of having all of PLOS on a single USB stick. Credit: Ross Mounce, “Open Access for Early Career Researchers”, University of Bath Open Access Week session; 23rd October CC-BY 4.0.

18 PLOS API (Non-Bulk Downloads)
PLOS also provides three ways to access data about PLOS articles or the articles themselves. These methods are not as useful for bulk downloads but do provide anyone with a specific interest in PLOS articles and data a way to access them.
JATS XML: The Journal Article Tag Suite (JATS) is the standard used to archive scientific articles. JATS XML is the most convenient format for TDM because the data is structured. Article text and metadata can be accessed in a single file and in a standard way. Downloading individual article XML from the PLOS website is simple if the DOI of the article is known.
Article PDF: Each PLOS article is also available as a PDF. Article PDFs have limited utility for TDM but are useful for printing or reading the article offline.
HTML Article Page: Article HTML is the primary method used to view PLOS articles online. Scraping the article HTML is a technique used by search engines to index articles and can be used for TDM. It is generally less useful for TDM because the article pages change over time, the data is not structured, and metadata is not easily identified.
Info:
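
To make the "simple if the DOI is known" point concrete, here is a hedged sketch that turns a PLOS DOI into a download URL for the article XML. Everything specific in it is an assumption that is not stated in the talk: the `journals.plos.org/.../article/file` URL pattern, the `type=manuscript` value for XML, the journal-code-to-path mapping, and the helper name `article_file_url`. Verify each against the PLOS website before relying on it.

```python
from urllib.parse import urlencode

# Assumed mapping from the journal code inside a PLOS DOI to the site path.
JOURNAL_PATHS = {
    "pone": "plosone",
    "pbio": "plosbiology",
    "pcbi": "ploscompbiol",
    "pgen": "plosgenetics",
    "pmed": "plosmedicine",
    "ppat": "plospathogens",
    "pntd": "plosntds",
}


def article_file_url(doi, file_type="manuscript"):
    """Build a download URL for one article (assumed pattern; 'manuscript' = XML)."""
    prefix, _, suffix = doi.partition("/")
    parts = suffix.split(".")
    # Expect DOIs shaped like 10.1371/journal.<code>.<number>.
    if prefix != "10.1371" or len(parts) != 3 or parts[0] != "journal":
        raise ValueError(f"not a recognised PLOS article DOI: {doi!r}")
    journal = JOURNAL_PATHS.get(parts[1])
    if journal is None:
        raise ValueError(f"unknown journal code: {parts[1]!r}")
    query = urlencode({"id": doi, "type": file_type})
    return f"https://journals.plos.org/{journal}/article/file?{query}"


if __name__ == "__main__":
    print(article_file_url("10.1371/journal.pone.0000001"))
```

The function only builds the URL; fetching it is left to the caller so that request pacing and error handling can follow whatever usage guidance the site gives.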

19 Post-talk update:

20 In conclusion…
PLOS’ mission is to encourage & enable reuse of our content. We aim to make TDM easier through our technology and licensing. We’re open to suggestions.

21 Questions & Comments?
