Published by Mark Burke. Modified over 6 years ago.
1
Building on the shoulders of Giants: the Scholarly Web
Tim Brody
Intelligence, Agents, Multimedia Group
University of Southampton
28/11/2018
2
This Talk
The “Research Literature”
The Open Access Literature
Why Open Access?
Open Archives Initiative
Citebase Search
Open Access effect on research
Summary & URLs

This talk gives my impression of how the research world could be improved, and describes the tools I have produced to help do that. What I am concerned with is the research literature: reports written by researchers that are published as part of the public record. While the majority of the research literature is accessible only by paying access charges (to cover the cost of printing, distribution, etc.), the Internet gives us a new possibility: Open Access literature. Very little of the current research literature is Open Access. By Open Access I mean that it is free at the point at which the user views the full-text. But it is the authors who need to be convinced of the benefits in order to get Open Access to the research literature.

The Open Archives Initiative provides a technical framework for the way we would like to see Open Access literature implemented, scaling down the role of the IPR holders to simply providing the information, while others develop the services and add-ons (of course the IPR holder may also provide services, but they shouldn’t use the IPR monopoly to gain a service monopoly). Citebase Search is my own OAI service provider, which provides a citation database add-on to e-print archives. Using the citation database in Citebase we can look at the effect of Open Access e-print archives on how research is reported (and perhaps how research is done). Lastly, I’ll give some URLs for more information on what we do.
3
The Research Literature
The grey literature
  Technical reports
  Monographs
  Presentations
Royalty literature
  Books
Refereed journal corpus

What is the Research Literature?

The research literature is the way that researchers communicate ideas and results with one another. How this communication is done depends a lot on the subject. For example, Physicists make use of a lot of technical reports (hence the success of the LANL tech report server). This informal communication, or ‘grey’ literature, also includes monographs and presentations, all of which are widely used means of communicating ideas and results. Much of the grey literature is already Open Access.

While the Sciences tend to rely on reports, the Humanities produce significant numbers of books. It is unlikely that books, where the author derives royalty payments, will ever be Open Access, as it isn’t in the author’s interest. However, where an author’s academic assessment may rely on their research output, we still need the ability to assess the impact of royalty literature (perhaps by providing Open Access to the book’s references).

Of primary interest to researchers, certainly within the Sciences, is the peer-reviewed literature. This is mainly the journal literature, but also, particularly in my own field of Computer Science, conference proceedings. It is this literature that we are most interested in converting to an Open Access model.
4
The Refereed Journal Literature
Written without the expectation of royalties
Akin to “Advertising” for the authors and their work
Reviewed for free by peers
Est. 20,000 Peer-reviewed Journals
B.L. archives 60,000 serials
Est. 2,000,000 Articles Annually

What makes the refereed literature different to other published material is that the authors do not expect to be paid for the reports that they produce. In fact, it is quite the opposite. While authors of novels and artists are concerned about theft, it is in the interest of research authors to have their work as widely read and copied as possible (and hence to be more recognised, have greater impact and so on). In this way the refereed journal literature is more akin to advertising for the author than to other forms of publication. But what distinguishes the refereed literature from advertisements is that it is reviewed and certified by peer experts, and without that certification (and the reputation of the journal it is published in) the report is unlikely to be read and built on by other researchers.

The journal literature is a huge volume of work, and one that researchers have only partial access to (of an estimated 20,000 journals, most Universities have access to only 4 or 5 thousand).
5
Impact cycle begins: Research is done
Researchers write pre-refereeing “Pre-Print”
Submitted to Journal
Pre-Print reviewed by Peer Experts (“Peer-Review”)
12-18 Months
Pre-Print revised by article’s Authors
Refereed “Post-Print” Accepted, Certified, Published by Journal
Researchers can access the Post-Print if their university has a subscription to the Journal
New impact cycles: New research builds on existing research

Although I’m sure everyone is familiar with the ping-pong of journal publication, it probably helps to define the system that I’m talking about, and how Open Access may help to improve it.
6
Open Access Literature
Research Archives (“self-archiving”)
  250,000 arXiv.org
  500,000 citeseer
  1,000s in institutional & other repositories
Open Access Journals
  BioMed Central
Time-delayed access
  PubMed Central
  500,000 HighWire Press
Personal Web pages

So how much of the journal literature is Open Access? The most high-profile Open Access repositories cited by self-archiving/open access proponents are arXiv.org and Citeseer. As these are collections built from “self-archived” (i.e. author-contributed) reports, it is difficult to know whether an author-contributed report has been published in a peer-reviewed journal; certainly my experience of Citeseer is that it contains a lot of informal technical reports. Even so, these archives provide a hint of how Open Access literature could work.

Open Access publication falls into two branches: either upfront charges, of which BioMed Central’s e-journals are a good example, or time-delayed access. (It’s worth noting that virtually no money is made charging for access to literature 6 months after it is released, hence many publishers are happy to release their articles after a time delay as a way to reduce the pressure to provide Open Access.)

A large and unknown amount of literature exists in informal, personal archives. But without some structure behind personal archives it is difficult for users and services alike to tell the difference between technical reports and peer-reviewed literature. Regardless of how access may be provided, the literature currently available free at the point of access is a fraction of the literature produced annually.
7
“Skywriting”: All research, accessible to all potential users, anywhere, anytime
Impact cycle begins: Research is done
Researchers write pre-refereeing “Pre-Print”
Pre-Print self-archived to University’s Eprint Website
Submitted to Journal
Pre-Print reviewed by Peer Experts (“Peer-Review”)
12-18 Months
Pre-Print revised by article’s Authors
Refereed “Post-Print” Accepted, Certified, Published by Journal
Post-Print self-archived to University’s Eprint Website
Researchers can access the Post-Print if their university has a subscription to the Journal
New impact cycles: New research builds on existing research
New impact cycles: Self-archived research impact is greater (and faster) because access is maximized (and accelerated)

Now, adding on one method of Open Access, author self-archiving, we can see how it fits into the existing publication cycle.
8
Why Open Access? Maximise research impact through maximised access
Efficiency
  ADS Est. to provide $250 million benefit to astronomy
Continuous and comprehensive assessment
Periphery benefits
  Publicly funded research publicly accessible
  3rd World Access (and even some 1st World!)
  Easier to identify plagiarism (do a Google search!)

So how do we convince the stakeholders that Open Access is a good idea? If authors publish papers for impact (and hence to improve their reputation, get grants, tenure and so on), then that impact must be limited by the amount of access there is to their work. So to maximise impact authors need to maximise access, both by publishing in the high impact journals (which certifies the work as being high quality) and by allowing as many eyeballs as possible to see their work.

Speed and access are important components of communication. The slower the communication and the less access there is, the more difficult it is for ideas to spread and develop. A recent paper estimated that the Astrophysics Data System, a system that joins up all astronomy publications and data sources, provided a 250 million dollar benefit to worldwide astronomy.

There are also strong arguments for Open Access for other stakeholders. Currently there is the anomaly that taxpayers fund public research and yet cannot access the results of that research (the peer-reviewed literature). Open Access would allow researchers in the 3rd World to have equal access to the research literature.
9
Separate Content from Services
On the web: use a full-text search engine
Research literature: A&I, publisher, library, aggregator, journal contents, society …
Create the Scholarly Web:

Many of the new e-journals and corporate publishers only provide access to their literature through their own services. That means if you want access to the literature, you must also pay for search engines, Web editors, and so on. I think it would be a much more honest, and competitive, system if the content were separated from the services. While the bodies that serve research papers would charge for the peer-review service and for maintaining the quality of writing and formatting, the add-on services (search engines, navigation etc.) could be created by anyone and would compete on the basis of quality rather than content.

Given this “Scholarly Web”, all services would have access to all the literature. Doing a comprehensive literature search would be as simple as searching the Web with Google, rather than having to search many sources because every service has access to a different subset of the research literature. Coupled with citation links, the world could have a comprehensive “Google” for the research literature.
10
OAI Protocol for Metadata Harvesting
Service requests metadata records by
  All records
  Created/updated between given dates
  Subset
Repository returns metadata records
  Metadata record is XML
  To be OAI compliant must support at least Dublin Core

The OAI-PMH is a Web protocol for transferring metadata between repositories (for example e-print archives) and services. A service harvests records by requesting them, with the option of getting only records that are new or updated since a certain date, or only a subset (if the repository exposes sets). Each record returned is a combination of a header (with a repository-unique identifier and a datestamp) and a metadata record in XML. Any metadata that can be XML encoded can be exported, although to be OAI compliant the repository must support at least Dublin Core (alongside any other formats).
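As a rough illustration of the request and response described above, the sketch below builds a ListRecords request URL and reads the header and Dublin Core title out of a response. The verb, argument names and XML namespaces are those of OAI-PMH 2.0; the base URL, the set name, the helper function names and the sample record are made up for illustration, and a real harvester would also need to handle resumption tokens and error responses.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, until_date=None, set_spec=None):
    """Build a ListRecords request, optionally limited by date range or set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Extract identifier, datestamp and Dublin Core title from each record."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(OAI_NS + "record"):
        header = record.find(OAI_NS + "header")
        title = record.find(".//" + DC_NS + "title")
        records.append({
            "identifier": header.findtext(OAI_NS + "identifier"),
            "datestamp": header.findtext(OAI_NS + "datestamp"),
            "title": title.text if title is not None else None,
        })
    return records


# Hypothetical repository address and hand-written sample response.
URL = list_records_url("https://repository.example.org/oai",
                       from_date="2002-01-01", set_spec="physics")

SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:1</identifier>
        <datestamp>2002-06-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example e-print</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

RECORDS = parse_records(SAMPLE_RESPONSE)
```

The sample is parsed from a string so that the sketch runs without network access; fetching `URL` from a live repository is left out deliberately.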
11
OAI-PMH
Separates content (repositories) from services
Open protocol
Cheap to implement, flexible in use
Scalable?
  4 million records from OCLC
  HTTP-like caching techniques
(OAI-PMH can be used in closed systems)
12
Citebase Search: OAI Service
1000 users per-day (“visits”)
7000 hits
260,000 full-text records
6 million references
  Of which 1.3 million linked to full-text
3 million Web download hits (uk.arXiv.org)

Some general information on Citebase. Users are predominantly following links from arXiv.org abstract pages to get the references and citations-to for the article they are looking at. The Web download logs are used to provide a “reading” measure for arXiv articles, and to compare that read impact with citation impact scores.
13
Citebase Search

This is an overview of what Citebase does. Metadata is harvested using the OAI-PMH (the “discovery” phase). The metadata harvested via OAI is the title, abstract, authors, and citation. For each article found, the full-text is retrieved and its reference list is extracted and stored. The references are then linked by looking up the equivalent citation in the metadata store. Users can then search and navigate the citation database through a Web interface. Citebase also provides an OAI interface that allows the metadata with linked references to be harvested.
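The linking step, looking up each parsed reference against the citations in the metadata store, might be sketched as follows. This is not Citebase’s actual code: the record fields, the normalisation rule (lower-cased journal letters plus volume and first page) and the function names are assumptions chosen to keep the example small.

```python
import re


def citation_key(journal, volume, page):
    """Normalise a citation so small formatting differences still match."""
    journal_key = re.sub(r"[^a-z]", "", journal.lower())
    return (journal_key, str(volume).strip(), str(page).strip())


def build_index(metadata_store):
    """Map each harvested record's citation key to its identifier."""
    index = {}
    for record in metadata_store:
        key = citation_key(record["journal"], record["volume"], record["page"])
        index[key] = record["identifier"]
    return index


def link_references(references, index):
    """Pair each parsed reference with the matching identifier, or None."""
    links = []
    for ref in references:
        key = citation_key(ref["journal"], ref["volume"], ref["page"])
        links.append((ref, index.get(key)))
    return links


# Hypothetical harvested metadata and parsed reference list.
METADATA_STORE = [
    {"identifier": "oai:example:1", "journal": "Phys. Rev. D",
     "volume": "65", "page": "123456"},
]
REFERENCES = [
    {"journal": "Phys Rev D", "volume": 65, "page": "123456"},
    {"journal": "Nucl. Phys. B", "volume": 500, "page": "1"},
]

LINKS = link_references(REFERENCES, build_index(METADATA_STORE))
```

Here the first reference resolves to a record in the store despite its different punctuation, while the second stays unlinked because no matching record has been harvested.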
14
Citebase Search

This is the user interface for Citebase, which has the familiar meta-field search but also the ability to rank results by a number of criteria, including citation impact.
15
Citebase Search

The abstract page shows the usual title/authors/abstract and some analysis of the current article. The graph shows, over time, when the paper has been cited and when it has been downloaded.
16
Citebase Search: Navigation by Citation Links
[Diagram labels: Past, Current Article, Future, Article with reference list, Reference link, Related, Co-cited]

Following the abstract are links to related pages by citations. These links can go backwards in time using the reference list, forwards in time to the articles that have cited the current article, and sideways by either related or co-cited articles. Related papers are papers that have a similar reference list, often where an author has used the same references more than once! Co-cited papers are two papers that have been cited alongside each other by the same article, analogous to author co-citation. However, co-cited papers can only be found for articles that have been cited, hence this can’t be used for new articles.
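The three kinds of link can be computed from nothing more than each article’s reference list. The sketch below is one minimal way to do it, with made-up paper identifiers; it counts shared references for “related” and shared citing articles for “co-cited”, which is an assumption about the scoring rather than a description of how Citebase ranks them.

```python
from collections import Counter


def cited_by(refs):
    """Invert the reference lists: for each paper, who cites it (forwards in time)."""
    inverse = {}
    for citing, cited_list in refs.items():
        for cited in cited_list:
            inverse.setdefault(cited, set()).add(citing)
    return inverse


def co_cited(refs, paper):
    """Papers appearing in the same reference lists as `paper`, with counts."""
    counts = Counter()
    for cited_list in refs.values():
        if paper in cited_list:
            counts.update(c for c in cited_list if c != paper)
    return counts


def related(refs, paper):
    """Papers whose reference lists overlap with that of `paper`, with counts."""
    own = set(refs.get(paper, ()))
    counts = Counter()
    for other, cited_list in refs.items():
        shared = len(own & set(cited_list))
        if other != paper and shared:
            counts[other] = shared
    return counts


# Hypothetical citation data: citing paper -> set of cited papers.
REFS = {
    "A": {"X", "Y"},
    "B": {"X", "Y", "Z"},
    "C": {"Z"},
}
```

With this data, `co_cited(REFS, "X")` finds Y twice and Z once, `related(REFS, "A")` finds B through two shared references, and `co_cited(REFS, "A")` is empty because A has not yet been cited, which is the limitation for new articles noted above.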
17
Citebase Search

This is the reference list, as parsed from the full-text. The “eprint” link takes the user to the Citebase abstract page of the cited article; the “journal” links are bespoke links for the American Physical Society journals.
18
Citebase Search

These are the articles that have cited the current article. Following these links will take the user towards newer papers.
19
Citebase Search “Co-cited”

And these are the co-cited articles. The development version of Citebase also includes Related articles.
20
The Effect of Open Access
Based on arXiv.org
  Oldest and most comprehensive online archive
Correlation of Citation Impact with Web Impact (downloads)
Effect of Open Access on citation behaviour

As well as being used to navigate the literature, citation links can be used to find patterns in the research literature.
21
This graph divides papers into 3 sets by citation impact: bottom quartile, middle 50%, and top quartile. This is a fairly arbitrary division, as the distribution of citations is an exponential decay; that is, decreasing numbers of papers get ever increasing numbers of citations. The hits to papers in each subset are then plotted against the time delay between the paper being uploaded and being downloaded. High impact papers receive more downloads, and over a longer period of time. That they receive more hits over time is a reflection of users following citations to articles, the delay of which is determined by the research cycle.
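The grouping behind a graph like this can be sketched in a few lines: rank papers by citation count, cut the ranking into bottom quartile, middle half and top quartile, then count download hits per group against the delay since upload. The data, field layout and function names below are invented for illustration, and ties in citation count are broken arbitrarily by identifier.

```python
from collections import Counter


def split_by_impact(citations):
    """Split paper ids into bottom quartile, middle 50% and top quartile."""
    ranked = sorted(citations, key=lambda p: (citations[p], p))
    q = len(ranked) // 4
    return {
        "bottom": ranked[:q],
        "middle": ranked[q:len(ranked) - q],
        "top": ranked[len(ranked) - q:],
    }


def hits_by_delay(group, downloads):
    """Count hits per delay (months since upload) for the papers in one group."""
    members = set(group)
    return Counter(delay for paper, delay in downloads if paper in members)


# Hypothetical citation counts and (paper, months since upload) download events.
CITATIONS = {"p1": 0, "p2": 1, "p3": 2, "p4": 3,
             "p5": 5, "p6": 8, "p7": 20, "p8": 90}
DOWNLOADS = [("p8", 0), ("p8", 6), ("p1", 0), ("p7", 6)]

GROUPS = split_by_impact(CITATIONS)
TOP_HITS = hits_by_delay(GROUPS["top"], DOWNLOADS)
```

Plotting one such counter per group against delay would give curves of the kind described, although the real analysis works from arXiv download logs rather than a hand-written list.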
22
Citation Latency

The duration of the research article cycle can be estimated by looking at the delay between articles appearing and then being cited. We refer to this as “Citation Latency”. Within arXiv.org this is measured using the repository-generated datestamps. Of course authors will read articles outside of arXiv.org, leading to different real-world latencies.
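Measured this way, the latency of a single citation is simply the difference between two deposit datestamps. A minimal sketch, assuming a mapping from article identifier to deposit date and a list of (citing, cited) pairs; the identifiers and dates are invented, and skipping citations with a missing datestamp or a negative delay is a choice made here for the example, not something stated in the talk.

```python
from datetime import date


def citation_latencies(datestamps, citations):
    """Days between the cited article's deposit and the citing article's deposit."""
    latencies = []
    for citing, cited in citations:
        if citing not in datestamps or cited not in datestamps:
            continue  # one end of the link is outside the archive
        delta = (datestamps[citing] - datestamps[cited]).days
        if delta >= 0:  # assumption: ignore citations that predate the deposit
            latencies.append(delta)
    return latencies


# Hypothetical deposit datestamps and citation links.
DATESTAMPS = {
    "a": date(2001, 1, 1),
    "b": date(2001, 3, 1),
    "c": date(2002, 1, 1),
}
CITATIONS = [("b", "a"), ("c", "a"), ("c", "b"), ("c", "not-in-archive")]

LATENCIES = citation_latencies(DATESTAMPS, CITATIONS)
```

The unmatched link in the example stands in for the caveat above: citations to and from articles outside the archive cannot be measured this way.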
23
We can look at how the citation latency has changed over the lifetime of arXiv.org by separating the papers into subsets by the year of the cited article, so articles deposited in 2002 will have at most 19 months of citations. If we compare an early year, 1994 (the 4th line up from the bottom), with a recent year, 2001 (the 4th line from the top), you can see that the ramp up to the highest rate of citations (the “peak” citation rate) falls from a period of 12 months to only 2-3 months. This suggests that increased use of arXiv.org has reduced the duration of the research cycle between an article being posted and then cited. As this graph is un-normalised, the lines get higher as the number of articles in arXiv increases.
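The per-year curves and their peaks could be derived from citation latencies along the following lines. The input format, one (deposit year of the cited article, latency in months) pair per citation, and the function names are assumptions, and the numbers are made up to mirror the shape of the comparison rather than taken from the arXiv data.

```python
from collections import Counter, defaultdict


def monthly_citation_counts(citations):
    """Group citations by the cited article's year, counting per latency month."""
    by_year = defaultdict(Counter)
    for year, months in citations:
        by_year[year][months] += 1
    return by_year


def peak_month(counts):
    """Latency (in months) with the most citations; earliest month wins ties."""
    return max(sorted(counts), key=lambda month: counts[month])


# Hypothetical (cited article's deposit year, latency in months) pairs.
SAMPLE = [
    (1994, 12), (1994, 12), (1994, 3),
    (2001, 2), (2001, 2), (2001, 10),
]

BY_YEAR = monthly_citation_counts(SAMPLE)
PEAKS = {year: peak_month(counts) for year, counts in BY_YEAR.items()}
```

Because the counts are raw rather than divided by the number of articles deposited each year, curves built this way are un-normalised in the same sense as the graph.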
24
Conclusions
High impact papers are read more (and this can be measured online)
Web downloads may be a pre-indicator of impact
Faster access leads to reduced Citation Latencies
Hence faster research cycles, higher impact, and more productivity
25
Summary
The Web makes Open Access research literature possible, and hence more effective scholarship
Services compete without holding the literature hostage
OAI allows repositories to concentrate on getting and storing the literature
Citebase Search provides citation navigation for OAI archive(s)
Or anyone else who wants to provide a service
26
The Last Slide
Tim Brody tdb01r@ecs.soton.ac.uk
(papers & presentations)
Citebase Search
EPrints.org
(advocacy, answers & software)
Prof. Stevan Harnad

I am a doctoral student in the Intelligence, Agents, Multimedia Group at the University of Southampton, working with digital library systems: Citebase Search, E-Prints UK, TARDIS & OAI.