Published by Kevon Cleek. Modified over 9 years ago.
1
Harvesting digital newspapers at the Bibliothèque nationale de France
Géraldine Camile, Bibliothèque nationale de France
Tallinn
20-minute presentation
Definitions:
Metrics: we talk about concepts and definitions
Statistics: we talk about numbers and results
2
Summary
Context and objectives of the “subscription-based press project”
Harvesting news websites with robots
Results and lessons learnt
The future of the project – and its alternatives

I will present the approach adopted at the BnF to collect online newspapers using web harvesting technologies. This project is called the subscription-based press project. First, I will present the context and objectives of the project. Then there will be a more technical description of the workflow. I will finish with the results, the lessons learnt, and the next steps of the project.
3
Context and objectives of the “subscription-based press project”
4
Collecting digital news at the BnF
Harvesting of news websites since 2010
Use of crawlers
100 news websites harvested every day
Only freely accessible content
Using robots to collect digital equivalents of newspapers
“Subscription-based” press project
Obtain passwords from publishers and crawl protected content
Focus on the PDF versions to ensure collection continuity
As microfilming budgets for local editions of regional newspapers are decreasing

What is the context? At the BnF, we have harvested news websites since 2010. We use harvesting robots, called crawlers, to collect 100 news websites every day. Until last year, this covered only the freely accessible part of each website. However, many parts of these websites, often the most interesting ones, are accessible only upon payment. As the law on internet legal deposit allows the BnF to ask for passwords, we decided to ask press publishers for passwords to collect the protected content: this is the idea behind the “subscription-based” press project. We also decided to focus on the PDF versions of the local editions of regional newspapers. Their paper versions are no longer collected by the BnF, as they were microfilmed. Microfilming budgets were decreasing, so we needed a replacement solution.
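To make the "focus on the PDF versions" idea concrete, here is a minimal sketch of how a harvesting robot might pick out PDF editions from a page it has already fetched. It is not the BnF's crawler, which the presentation does not describe. The class and function names, the example.org URLs, and the sample page are all hypothetical, and logging in with the publisher's password is left out entirely.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkCollector(HTMLParser):
    """Collects absolute URLs of <a> links whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Test the path only, so query strings do not hide the extension.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.pdf_urls.append(absolute)


def find_pdf_links(html, base_url):
    """Return the PDF links found in an HTML page, in document order."""
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.pdf_urls


# Hypothetical page from a subscriber area (invented for illustration).
sample_page = """
<a href="/editions/2013-05-02/rennes.pdf">Rennes</a>
<a href="/editions/2013-05-02/nantes.PDF?session=abc">Nantes</a>
<a href="/articles/today.html">Today</a>
"""
links = find_pdf_links(sample_page, "https://news.example.org/subscriber/")
for link in links:
    print(link)
```

Only the two PDF links are kept; the HTML article is ignored. A real crawl would also need to authenticate and to respect each publisher's site layout, which this sketch does not attempt.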
5
The subscription-based press project
Various actors within the Library
Law, Economy and Politics department
Legal deposit department: printed periodicals service
Legal deposit department: digital legal deposit service
IT department
Different skills and approaches for printed and digital periodicals
Calendar
A one-year experiment
Started end 2012; assessment end 2013
Now in production mode

This project brought together various actors within the library … This combination was a way to bring together skills and approaches for printed and digital periodicals. It was a one-year experiment that started…
6
Harvesting news websites with robots
7
The harvesting workflow
Steps shown in the diagram: Selection; Contact with publisher; Technical instruction; Web harvest; Description on access UI; Quality assurance; Cataloguing; Preservation
Roles shown in the diagram: Curators; Engineers; Cataloguers; Library assistants

So how does it work? First comes the selection of the news title. Then contact with the publisher, which may take a lot of time … There is sampling for quality assurance. Who does what? What are the professional profiles involved in this activity?
8
Cataloguing…
Type: digital document
Format
Link to the archives
Local editions
Link with the printed edition record

We designed a system to catalogue web archives within the General Catalogue.

Harvesting digital newspapers at the BnF – Clément Oury – IFLA WLIC conference, August 20th 2014
9
And access in the archives…
The title is accessible through the web archives.
If you click on a specific date, you then select your local edition, and here you have the document.
The result for the newspaper's collection
The choice of the local edition
An example of a local edition
10
A guided tour of the news collection
11
Long term preservation in SPAR, BnF’s digital repository
12
Results and lessons learnt
13
The collections
22 titles, 192 local editions

Title                             Local editions  Start of harvest
Ouest-France                      53              July 19, 2012
Le Républicain lorrain            8               December 12, 2012
Le Progrès                        18              April 16, 2013
Midi libre                        14              May 2, 2013
L’Indépendant                     3
Centre Presse                     1
La Tribune                                        May 22, 2013
Mediapart                                         July 16, 2013
La Montagne                                       October 10, 2013
Le Populaire du Centre
La République du Centre           2
Le Berry Républicain
L’Écho Républicain
Le Journal du Centre
Le Dauphiné libéré                20              April 7, 2014
Les Dernières Nouvelles d'Alsace
L'Est Républicain                 10
L'Alsace
Le Journal de Saône-et-Loire      7
Le Bien Public                    4
Vosges Matin

The collections: twenty-two titles, representing one hundred ninety-two local editions.
14
Map of the daily regional newspapers
Harvested titles
Vosges Matin
La Liberté de l’Est

Good coverage of French territory. Where there’s a can, there’s a collected title.
(n° 1, oct./nov. 2012, p )
15
Main achievements
The collections!
Technical experimentation with harvesting protected content
Creation of links between the General Catalogue and web archives
Raising awareness among wider library staff about collecting digital publications
Even library assistants are now managing digital documents

Apart, of course, from the collections themselves, there are technical improvements: now that we know how to collect protected content from websites, we could use this experience for other content. There is also the creation of links. It was also a way to raise awareness about collecting digital publications. Now even library assistants are managing digital documents.
16
The dark side of the crawl
News websites’ architecture may change very quickly
Requires high reactivity and dedicated time from technical staff
Difficulty recovering non-harvested collections
Press collections disappear very rapidly from the publisher’s website
Some websites are technically NOT possible to harvest with crawling robots

But there is bad news.
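Because issues disappear quickly from publishers' websites, a missed daily harvest is hard to recover, so spotting gaps early matters. The sketch below shows one possible way to list the days on which no harvest was recorded for a title. The function name and the sample dates are hypothetical, and it assumes one issue per calendar day, which will not hold for titles that skip Sundays or holidays.

```python
from datetime import date, timedelta


def missing_harvest_dates(harvested, start, end):
    """Return every date from start to end (inclusive) with no recorded harvest."""
    harvested_set = set(harvested)
    missing = []
    current = start
    while current <= end:
        if current not in harvested_set:
            missing.append(current)
        current += timedelta(days=1)
    return missing


# Hypothetical harvest log for one title (dates invented for illustration).
harvested = [date(2013, 4, 16), date(2013, 4, 17), date(2013, 4, 19)]
gaps = missing_harvest_dates(harvested, date(2013, 4, 16), date(2013, 4, 20))
for day in gaps:
    print(day.isoformat())
```

A report like this could be run daily so that technical staff can react while the missing issue is still online.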
17
The future of the project – and its alternatives
18
The next steps of the project
Extend the harvest to new titles
Improve access to collections
A dedicated interface?
Full-text index of the press corpus?
Promote the service to:
Librarians at reference desks
Researchers and other users
Open remote access
From researchers’ desktops
From regional libraries entitled to receive access to web legal deposit collections

Strasbourg and Nancy already have access to the BnF’s archives.
19
Success and alternatives
Identify alternative ways of collection
Deposit from publishers through FTP?
Deposit from press aggregators?
Build upon the experience of the ebook deposit workflow
A successful project… which needs to be complemented

For the websites for which web harvesting is not feasible, we could set up a legal deposit workflow like the one for ebooks, with publishers or press aggregators. So in our opinion…