BnF experiences with harvesting content beyond paywalls


1 BnF experiences with harvesting content beyond paywalls
BnF - DLWEB - Umbra & Heritrix 3
Géraldine Camile
NAS Workshop, Vienna, 27 April 2017

2 The subscription news sites

3 Harvest of subscription news sites
The BnF has been crawling subscription news sites since 2012

4 Focus on the PDF versions
Focus on regional newspapers to ensure collection continuity, as microfilming projects for local editions have been stopped

5 The harvesting workflow

6 The main difficulties
- News website architecture can change very quickly
  - Requires fast reactions and dedicated time from technical staff
- Collections that were not harvested are difficult to recover
  - Press collections disappear very rapidly from the publisher's website
  - No relation between the date of the archives and the date of the editions
- Some websites are technically NOT possible to harvest with crawling robots
- And the PDF format is disappearing from the websites

7 FTP harvest
- The PDF format is still used by the publishers to print the editions
- Publishers make their files available to the BnF on an FTP server
- The BnF crawls the files daily with Heritrix
- Publishers choose the rhythm of file clean-up
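Because publishers decide when files are removed from the FTP server, a daily crawl may meet the same file on several consecutive days. The following is a minimal sketch of one way to pick out only the files not yet collected. It is an illustration, not the BnF's Heritrix configuration: the function name `select_new_files`, the sample paths, and the idea of keeping a set of already harvested paths are all assumptions.

```python
from typing import Iterable, List, Set


def select_new_files(listing: Iterable[str], already_harvested: Set[str]) -> List[str]:
    """Return the PDF paths in `listing` that were not harvested before, sorted."""
    return sorted(
        path
        for path in set(listing)
        if path.lower().endswith(".pdf") and path not in already_harvested
    )


# Hypothetical listing of a publisher's FTP area on one day.
listing = [
    "/170424/edition_a.pdf",
    "/170424/edition_b.PDF",
    "/170424/readme.txt",
    "/170423/edition_a.pdf",
]
# Hypothetical record of what an earlier crawl already collected.
already_harvested = {"/170423/edition_a.pdf"}

new_files = select_new_files(listing, already_harvested)
print(new_files)
```

Comparing on the full path rather than the bare filename keeps two editions with the same filename in different dated directories apart.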

8 Benefits/inconveniences
The benefits:
- All the titles of a group are harvested together (instead of title by title)
- Less instruction time for the BnF's engineers
- Stability of the system
- It is easier for the publishers to offer this service than to adapt their websites
The inconveniences:
- Specific indexing and access developments are needed to view the files in OpenWayback
- The filenames are sometimes not meaningful for the title or the edition (e.g. ftp://transfert-presse-sp.e-i.net/170424/LDL_74BDE_ PDF)
- The link between the context and the publication is not straightforward
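When filenames say little about the title or edition, some descriptive information may still be recoverable from the URL. The sketch below assumes that a directory name such as `170424` encodes a date as YYMMDD, which the slides do not state, and it uses a purely hypothetical lookup table from filename prefix to newspaper title. The host `ftp.example.org` and the filename `EXMPL_01.PDF` are invented for illustration.

```python
from datetime import date, datetime
from typing import Optional, Tuple
from urllib.parse import urlparse

# Hypothetical lookup table: filename prefix -> newspaper title.
TITLE_BY_PREFIX = {"EXMPL": "Example Regional Daily"}


def describe(url: str) -> Tuple[Optional[date], Optional[str]]:
    """Guess (edition date, title) from an FTP URL; None where nothing is known."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    edition_date = None
    if len(segments) >= 2:
        try:
            # Assumption: the parent directory is a date written as YYMMDD.
            edition_date = datetime.strptime(segments[-2], "%y%m%d").date()
        except ValueError:
            edition_date = None
    # Assumption: the part of the filename before the first "_" identifies the title.
    prefix = segments[-1].split("_", 1)[0] if segments else ""
    return edition_date, TITLE_BY_PREFIX.get(prefix)


print(describe("ftp://ftp.example.org/170424/EXMPL_01.PDF"))
```

Anything the table or the date pattern cannot resolve comes back as `None`, so unrecognised files can be routed to manual review instead of being described incorrectly.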

9

10 Merging of the harvest templates (1/2)
From 9 harvest templates to 2:
- By HTTP or HTML authentication
- By FTP

11 Merging of the harvest templates
From 9 harvest templates to 2:
- By HTTP or HTML authentication
- By FTP
It is possible to create "sheets"

12 Towards a deposit?
The BnF is currently evaluating whether to put a deposit workflow in place

13 ebooks

14 The BnF approach: deposit versus harvest
- Harvest of parts of websites accessible upon payment
  - In use for online newspapers
  - Entry, preservation and access workflows already in place
- Why choose FTP direct deposit?
  - Maintains close relationships with publishers
  - In most cases, ebooks are not directly downloadable from the website
  - Allows each document to be catalogued
- Cooperation with publishers and distributors
  - They provide metadata files in ONIX format
  - Which we can reuse!
- Easy-to-manage ebooks
  - No DRM, no closed-source formats
  - PDF and EPUB (versions 2 and 3)
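As an illustration of how publisher-supplied ONIX metadata could be reused, here is a minimal sketch that pulls a record reference, an ISBN-13 and a title out of a simplified ONIX 3.0-style message. The sample record is invented (the ISBN is a placeholder, not a real identifier), the function name `extract_records` is hypothetical, and the sample omits the XML namespace that real ONIX feeds normally declare. It is not a description of the BnF's own processing.

```python
import xml.etree.ElementTree as ET

# Invented, simplified sample using ONIX 3.0 reference tag names, without namespace.
SAMPLE = """<ONIXMessage release="3.0">
  <Product>
    <RecordReference>example-0001</RecordReference>
    <ProductIdentifier>
      <ProductIDType>15</ProductIDType>
      <IDValue>9780000000002</IDValue>
    </ProductIdentifier>
    <DescriptiveDetail>
      <TitleDetail>
        <TitleType>01</TitleType>
        <TitleElement>
          <TitleElementLevel>01</TitleElementLevel>
          <TitleText>An Example Ebook</TitleText>
        </TitleElement>
      </TitleDetail>
    </DescriptiveDetail>
  </Product>
</ONIXMessage>"""


def extract_records(onix_xml: str) -> list:
    """Return one dict per <Product> with reference, ISBN-13 and title."""
    records = []
    for product in ET.fromstring(onix_xml).iter("Product"):
        isbn13 = None
        for ident in product.findall("ProductIdentifier"):
            # In the ONIX code lists, ProductIDType 15 denotes an ISBN-13.
            if ident.findtext("ProductIDType") == "15":
                isbn13 = ident.findtext("IDValue")
        records.append({
            "reference": product.findtext("RecordReference"),
            "isbn13": isbn13,
            "title": product.findtext(
                "DescriptiveDetail/TitleDetail/TitleElement/TitleText"
            ),
        })
    return records


records = extract_records(SAMPLE)
print(records)
```

A real feed would need namespace-aware lookups and handling for the short-tag variant of ONIX; both are left out here to keep the sketch short.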

15 Bibliographical products
[Workflow diagram of the ebook deposit chain. Labels recoverable from the slide: Reception (FTP platform, Extranet for publishers); Sorting and first checks; Metadata; Waiting area; Creation of the document in the BnF information system; Preparation for preservation; Preservation; SPAR: negotiated legal deposit track; Description; Automatic but rich bibliographic record; Complete bibliographic record; Catalogue; Diffusion; Diffusion zone; Gallica intra muros; EPUB and PDF readers, secured environment; Virtual trolley; Bibliographical products; Nouveautés Editeurs; Livnum_A Channel; Livnum_B Channel; ADCat-02]
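The slides do not say what the "Sorting and first checks" stage verifies. As one plausible example of such a check, the sketch below tells PDF and EPUB files apart from their content rather than their filename: a PDF begins with the bytes `%PDF-`, and an EPUB is a ZIP archive containing a `mimetype` entry whose content is `application/epub+zip`. The function names are hypothetical, and the sketch does not detect DRM or distinguish EPUB 2 from EPUB 3.

```python
import io
import zipfile


def detect_format(data: bytes) -> str:
    """Return "pdf", "epub" or "unknown" based on the file content."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            if archive.read("mimetype").strip() == b"application/epub+zip":
                return "epub"
    except (zipfile.BadZipFile, KeyError):
        # Not a ZIP archive, or a ZIP archive without a "mimetype" entry.
        pass
    return "unknown"


def make_sample_epub() -> bytes:
    """Build a tiny in-memory archive with only the EPUB mimetype entry (test data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
    return buffer.getvalue()


print(detect_format(b"%PDF-1.7\n"), detect_format(make_sample_epub()), detect_format(b"hello"))
```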

