BnF experiences with harvesting content beyond paywalls

Presentation transcript:

BnF experiences with harvesting content beyond paywalls. Géraldine Camile (geraldine.camile@bnf.fr). NAS Workshop, Vienna, 27 April 2017.

The subscription news sites

Harvest of subscription news sites: since 2012, the BnF has been crawling subscription news sites.

Focus on the PDF versions. Focus on regional newspapers, to ensure collection continuity now that microfilming projects for local editions have been stopped.

The harvesting workflow

The main difficulties: The architecture of news websites may change very quickly, which requires high reactivity and dedicated time from technical staff. Collections that were not harvested are difficult to recover, because press collections disappear very rapidly from the publisher's website. There is no relation between the date of the archives and the date of the editions. Some websites are technically impossible to harvest with crawling robots. In addition, the PDF format is disappearing from the websites.

FTP harvest: The PDF format is still used by publishers to print the editions. Publishers make their files available to the BnF on an FTP server. The BnF crawls the files with Heritrix daily. Publishers choose the rhythm of file clean-up.
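Because publishers decide when files are cleaned up, the same file may still be on the server on several consecutive days. The following is a minimal sketch, not the BnF's actual set-up, of how a daily seed list for the crawler could be built while skipping files captured earlier. The function names, the host `ftp.example.org` and the idea of keeping a set of already harvested URLs are all assumptions made for illustration.

```python
from ftplib import FTP
from typing import Iterable, List, Set


def new_seed_urls(host: str, paths: Iterable[str],
                  already_harvested: Set[str]) -> List[str]:
    """Return ftp:// URLs of PDF files not captured by an earlier daily crawl.

    `already_harvested` is a hypothetical record of URLs kept between runs.
    """
    seeds = []
    for path in sorted(paths):
        # Only the PDF editions are of interest here.
        if not path.upper().endswith(".PDF"):
            continue
        url = f"ftp://{host}/{path.lstrip('/')}"
        if url not in already_harvested:
            seeds.append(url)
    return seeds


def list_remote_files(host: str, user: str, password: str,
                      directory: str) -> List[str]:
    """List the file names in one directory of an FTP server (needs network)."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        return ftp.nlst(directory)


if __name__ == "__main__":
    # Offline example with a hand-written listing (hypothetical paths).
    listing = ["170424/A.PDF", "170424/readme.txt", "170423/B.pdf"]
    seen = {"ftp://ftp.example.org/170423/B.pdf"}
    print(new_seed_urls("ftp.example.org", listing, seen))
```

The resulting list could then be handed to the crawler as its seeds for the day; how seeds are actually passed to Heritrix at the BnF is not described in the slides.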

Benefits and drawbacks. The benefits: all the titles of a press group are harvested together (instead of title by title). The BnF's engineers need less time to set up the harvests. The system is stable. It is easier for publishers to offer this service than to adapt their websites. The drawbacks: specific indexing and access development is needed to view the files in OpenWayback. The filenames are sometimes not meaningful for the title or the edition (e.g. ftp://transfert-presse-sp.e-i.net/170424/LDL_74BDE_20170424.PDF). The link between the context and the publication is not straightforward.
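The specific indexing mentioned above could start from the filename itself. The sketch below parses the example URL from the slide into three fields. Reading them as a title code, an edition code and an edition date (YYYYMMDD) is an assumption based on that single example; the pattern, the names `EditionFile` and `parse_edition_url`, and the field meanings are hypothetical and would need checking against each publisher's naming scheme.

```python
import re
from datetime import date, datetime
from typing import NamedTuple, Optional
from urllib.parse import urlparse

# Assumed layout: <title code>_<edition code>_<YYYYMMDD>.PDF
FILENAME_PATTERN = re.compile(
    r"^(?P<title>[A-Z0-9]+)_(?P<edition>[A-Z0-9]+)_(?P<day>\d{8})\.pdf$",
    re.IGNORECASE,
)


class EditionFile(NamedTuple):
    title_code: str
    edition_code: str
    edition_date: date


def parse_edition_url(url: str) -> Optional[EditionFile]:
    """Extract the assumed fields from a file URL, or None if it does not fit."""
    filename = urlparse(url).path.rsplit("/", 1)[-1]
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    try:
        day = datetime.strptime(match["day"], "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a real calendar date.
        return None
    return EditionFile(match["title"].upper(), match["edition"].upper(), day)


if __name__ == "__main__":
    url = "ftp://transfert-presse-sp.e-i.net/170424/LDL_74BDE_20170424.PDF"
    print(parse_edition_url(url))
```

Taking the date from the filename rather than from the capture time is one way to work around the gap between the date of the archive and the date of the edition. Files whose names do not fit the pattern return `None` and would need a separate mapping.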

Merging of the harvest templates: from 9 harvest templates to 2, one for HTTP or HTML authentication and one for FTP. It is also possible to create "sheets".

Towards a deposit? The BnF is currently evaluating the possibility of putting a deposit workflow in place.

ebooks

The BnF approach: deposit versus harvest. Harvest of parts of websites accessible upon payment: this is in use for online newspapers, and the entry, preservation and access workflows are already in place. Why choose FTP direct deposit? It maintains close relationships with publishers. In most cases, ebooks are not directly downloadable from the website. It allows each document to be catalogued. Cooperation with publishers and distributors: they provide metadata files in ONIX format, which we can reuse. Easy-to-manage ebooks: no DRM, no closed-source formats, PDF and EPUB (versions 2 and 3) only.
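As an illustration of how publisher-supplied ONIX metadata might be reused for cataloguing, here is a minimal sketch that pulls an identifier and a title out of an ONIX message. It assumes ONIX 3.0 reference tag names and a file without an XML namespace; the slides do not say which ONIX release the publishers deliver. The sample record, the ISBN and the function name `extract_records` are invented for the example.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Invented, heavily simplified sample (assumed ONIX 3.0 reference tags).
SAMPLE_ONIX = """<ONIXMessage release="3.0">
  <Product>
    <RecordReference>example-0001</RecordReference>
    <ProductIdentifier>
      <ProductIDType>15</ProductIDType>
      <IDValue>9780000000002</IDValue>
    </ProductIdentifier>
    <DescriptiveDetail>
      <TitleDetail>
        <TitleType>01</TitleType>
        <TitleElement>
          <TitleElementLevel>01</TitleElementLevel>
          <TitleText>Example title</TitleText>
        </TitleElement>
      </TitleDetail>
    </DescriptiveDetail>
  </Product>
</ONIXMessage>"""


def extract_records(onix_xml: str) -> List[Dict[str, str]]:
    """Return one small dict per <Product> with reference, ISBN-13 and title."""
    root = ET.fromstring(onix_xml)
    records = []
    for product in root.iter("Product"):
        isbn13 = ""
        for identifier in product.findall("ProductIdentifier"):
            # In the ONIX product identifier code list, 15 means ISBN-13.
            if identifier.findtext("ProductIDType") == "15":
                isbn13 = identifier.findtext("IDValue", default="")
        title = product.findtext(
            "DescriptiveDetail/TitleDetail/TitleElement/TitleText", default=""
        )
        records.append({
            "reference": product.findtext("RecordReference", default=""),
            "isbn13": isbn13,
            "title": title,
        })
    return records


if __name__ == "__main__":
    for record in extract_records(SAMPLE_ONIX):
        print(record)
```

Real deliveries are likely to carry a namespace and many more fields (contributors, publisher, product form), so this only shows the general shape of the reuse.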

[Diagram: ebook deposit workflow. Stages: Reception (FTP platform, extranet for publishers); Sorting and first checks; Metadata waiting area; Preparation for preservation; Creation of the document in the BnF information system; Preservation (SPAR: negotiated legal deposit track); Description (automatic but rich bibliographic record, then complete bibliographic record); Diffusion (diffusion zone, Gallica intra Muros, EPUB and PDF readers in a secured environment, virtual trolley); Bibliographical products. Other labels: Nouveautés Editeurs, Catalogue, Livnum_A Channel, Livnum_B Channel, ADCat-02.]