Harvesting digital newspapers at the Bibliothèque nationale de France

Harvesting digital newspapers at the Bibliothèque nationale de France
Géraldine Camile, Bibliothèque nationale de France
Tallinn, 2015-01-30
20-minute presentation. Definitions: with metrics, we talk about concepts and definitions; with statistics, we talk about numbers and results.

Summary
Context and objectives of the “subscription-based press project”
Harvesting news websites with robots
Results and lessons learnt
The future of the project – and its alternatives
I will present the approach adopted at the BnF to collect online newspapers using web harvesting technologies. This project is called the subscription-based press project. I will first present the context and objectives of the project, then give a more technical description of the workflow, and finish with the results, the lessons learnt, and the next steps of the project.

Context and objectives of the “subscription-based press project”

Collecting digital news at the BnF
Harvesting of news websites since 2010: crawlers collect 100 news websites every day, but only freely accessible content.
Using robots to collect digital equivalents of newspapers: the “subscription-based” press project. Obtain passwords from publishers and crawl protected content.
Focus on the PDF versions to ensure collection continuity, as microfilming budgets for local editions of regional newspapers are decreasing.
What is the context? At the BnF, we have harvested news websites since 2010. We use harvesting robots, called crawlers, to collect 100 news websites every day. Until last year, this covered only the freely accessible part of each website. However, many parts of these websites, often the most interesting ones, are accessible only upon payment. As the law on internet legal deposit allows the BnF to ask for passwords, we decided to ask press publishers for passwords in order to collect the protected content: this is the idea behind the “subscription-based” press project. We also decided to focus on the PDF versions of the local editions of regional newspapers. Their paper versions are no longer collected by the BnF, as they were microfilmed. Microfilming budgets were decreasing, so we needed a replacement solution.

The subscription-based press project
Various actors within the Library: Law, Economy and Politics department; Legal deposit department (printed periodicals service and digital legal deposit service); IT department.
Different skills and approaches for printed and digital periodicals.
Calendar: a one-year experiment, started at the end of 2012 and assessed at the end of 2013; now in production mode.
This project brought together various actors within the library … This combination was a way to associate skills and approaches for printed and digital periodicals. It was a one-year experiment that started…

Harvesting news websites with robots
Source: http://mimas.ac.uk/data-harvesting-brings-cost-savings-with-jusp/

The harvesting workflow
Diagram labels, steps: contact with publisher, technical instruction, selection, web harvest, description on access UI, quality assurance, cataloguing, preservation.
Diagram labels, profiles: curators, engineers, cataloguers, library assistants.
So how does it work? First comes the selection of news titles, then contact with the publisher, which may take a lot of time … There is sampling for quality assurance. Who does what? What professional profiles are involved in this activity?

Cataloguing…
Record fields shown: type (digital document), format, link to the archives, local editions, link with the printed edition record.
We designed a system to catalogue web archives within the General Catalogue.
Harvesting digital newspapers at the BnF – Clément Oury – IFLA WLIC conference, August 20th 2014

And access in the archives…
The title is accessible through the web archives. If you click on a specific date, you then select your local edition, and here you have the document.
Captions: the result for the collection of the newspaper; the choice of the local edition; an example of a local edition.

A guided tour of the news collection

Long term preservation in SPAR, BnF’s digital repository

Results and lessons learnt

The collections
22 titles, 192 local editions. For each title: number of local editions, start of harvest.
Ouest-France: 53 local editions, from July 19, 2012
Le Républicain lorrain: 8 local editions, from December 12, 2012
Le Progrès: 18 local editions, from April 16, 2013
Midi libre: 14 local editions, from May 2, 2013
L’Indépendant: 3 local editions
Centre Presse: 1 local edition
La Tribune: from May 22, 2013
Mediapart: from July 16, 2013
La Montagne: from October 10, 2013
Le Populaire du Centre
La République du Centre: 2 local editions
Le Berry Républicain
L’Écho Républicain
Le Journal du Centre
Le Dauphiné libéré: 20 local editions, from April 7, 2014
Les Dernières Nouvelles d'Alsace
L'Est Républicain: 10 local editions
L'Alsace
Le Journal de Saône-et-Loire: 7 local editions
Le Bien Public: 4 local editions
Vosges Matin
The collections: twenty-two titles, representing one hundred ninety-two local editions.

Map of the daily regional newspapers
Harvested titles
Vosges Matin, La Liberté de l’Est
A good coverage of French territory: when there’s a can, there’s a collected title.
(no. 1, Oct./Nov. 2012, p. 60-61)

Main achievements
The collections!
Technical experimentation with harvesting protected content.
Creation of links between the General Catalogue and web archives.
Raising awareness among wider library staff about collecting digital publications: even library assistants are now managing digital documents.
Apart from the collections themselves, of course, there are technical improvements: now that we know how to collect protected content from websites, we could reuse this experience for other content. We also created links. The project was also a way to raise awareness about collecting digital publications: even library assistants are now managing digital documents.

The dark side of the crawl
News websites’ architecture may change very quickly, which requires high reactivity and dedicated time from technical staff.
Non-harvested collections are difficult to recover: press collections disappear very rapidly from the publisher’s website.
Some websites are technically NOT possible to harvest with crawling robots.
But there is bad news too.
Source: http://www.motifake.com/bot-fail-bad-bots-robots-banned-blocked-demotivational-posters-65020.html
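Because issues disappear quickly from publishers' websites, gaps in a harvested title need to be spotted early. The following is only a minimal sketch of such a check, not the BnF's actual tooling: the function name `missing_issue_dates` is hypothetical, and it assumes that one issue is expected for every calendar day and that the harvest dates of an edition are available as a simple list.

```python
from datetime import date, timedelta


def missing_issue_dates(harvested, start, end):
    """Return, in order, the dates in [start, end] with no harvested issue.

    Assumption: one issue is expected per calendar day. `harvested` is a
    hypothetical list of the dates on which an issue was actually collected.
    """
    harvested = set(harvested)
    span = (end - start).days
    return [
        start + timedelta(days=offset)
        for offset in range(span + 1)
        if start + timedelta(days=offset) not in harvested
    ]


if __name__ == "__main__":
    collected = [date(2014, 4, 7), date(2014, 4, 8), date(2014, 4, 10)]
    for gap in missing_issue_dates(collected, date(2014, 4, 7), date(2014, 4, 11)):
        print("No issue harvested for", gap.isoformat())
```

Run daily, a check of this kind would flag a missing issue while it may still be available on the publisher's website. A real title would also need its own publication calendar, for example days on which no issue appears.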

The future of the project – and its alternatives

The next steps of the project
Extend the harvest to new titles.
Improve access to collections: a dedicated interface? A full-text index of the press corpus?
Promote the service to librarians at reference desks, researchers and other users.
Open remote access: from researchers’ desktops, and from regional libraries entitled to receive access to web legal deposit collections.
Strasbourg and Nancy already have access to the BnF’s archives.

Success and alternatives
Identify alternative ways of collection: deposit from publishers through FTP? Deposit from press aggregators?
Build upon the experience of the ebook deposit workflow.
A successful project… which needs to be complemented.
For the websites for which web harvesting is not feasible, we could set up a legal deposit workflow such as the one for ebooks, with publishers or press aggregators. So in our opinion…