Use cases for BnF broad crawls
Annick Lorthios
Step by step, the first in-house broad crawl
The 2010 broad crawl was performed in-house at the BnF (instead of by the Internet Archive)
Workflow: planning, crawl design, crawl monitoring, quality control, lessons learnt
Goals:
Number of harvested URLs: 530 to 800 million
Compressed weight: 20 to 30 TB
Duration: 5 to 10 weeks
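A quick back-of-envelope calculation, using only the goal figures above, shows the sustained download rate such a crawl implies. The helper below is purely illustrative:

```python
# Back-of-envelope throughput check for the 2010 broad crawl goals
# (figures taken from the slide; the calculation itself is illustrative).

SECONDS_PER_WEEK = 7 * 24 * 3600

def required_rate(urls: float, weeks: float) -> float:
    """Average URLs per second needed to harvest `urls` within `weeks`."""
    return urls / (weeks * SECONDS_PER_WEEK)

# Worst case: high end of the URL goal in the shortest allowed duration.
print(f"{required_rate(800e6, 5):.0f} URLs/s")   # ~265 URLs/s
# Best case: low end of the URL goal over the longest allowed duration.
print(f"{required_rate(530e6, 10):.0f} URLs/s")  # ~88 URLs/s
```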
Planning
Goal: avoid technical, content and legal risks
Start of the project to use NAS (NetarchiveSuite) as our main tool
Contract (SLA) between the IT team and the digital curators
Responses to producers' requests and complaints
Crawl design
Constitution of the seed list (see the sketch below)
Use of a dedicated pre-load database
Choice of general settings
Configuration of the architecture
Tests of NAS
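Constituting the seed list typically involves normalising and deduplicating curator-supplied URLs before they are loaded into the pre-load database. Below is a minimal sketch of such a clean-up step; the file names (seeds_raw.txt, seeds_clean.txt) and the normalisation rules are assumptions, not the BnF's actual procedure.

```python
# Minimal sketch of a seed-list clean-up pass before loading.
from urllib.parse import urlsplit

def normalise(seed: str) -> str | None:
    """Lower-case the host, add a default scheme, drop the fragment."""
    seed = seed.strip()
    if not seed or seed.startswith("#"):
        return None            # skip blank lines and comments
    if "://" not in seed:
        seed = "http://" + seed
    parts = urlsplit(seed)
    return (f"{parts.scheme}://{parts.netloc.lower()}{parts.path or '/'}"
            + (f"?{parts.query}" if parts.query else ""))

with open("seeds_raw.txt", encoding="utf-8") as f:
    seeds = {s for s in (normalise(line) for line in f) if s}

with open("seeds_clean.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sorted(seeds)) + "\n")

print(f"{len(seeds)} unique seeds written")
```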
Crawl monitoring
Objective: make it possible to finish the crawl within the defined time
Consultation of the NAS monitoring interface
Intervention in the Heritrix monitoring interface to check certain domains (see the sketch below)
Overview of the Web media
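Checking certain domains usually means spotting queues where little has been downloaded relative to what is still queued. The sketch below illustrates the idea against a hypothetical per-domain CSV export (columns: domain, downloaded, queued); it is not an actual NAS or Heritrix report format.

```python
# Illustrative monitoring helper: flag domains whose download counts look
# suspiciously low, so a curator knows where to look in the Heritrix console.
import csv

STALL_RATIO = 0.01  # flag domains where almost nothing has been fetched yet

def stalled_domains(path: str) -> list[str]:
    flagged = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            downloaded = int(row["downloaded"])
            queued = int(row["queued"])
            if queued and downloaded / (downloaded + queued) < STALL_RATIO:
                flagged.append(row["domain"])
    return flagged

if __name__ == "__main__":
    for domain in stalled_domains("domain_status.csv"):
        print("check in the Heritrix console:", domain)
```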
Quality control
Use of an external tool for statistics and metrics
15 % of the 2010 broad crawl collection is video and audio files (see the sketch below)
Use of the Wayback Machine to browse the visual result (on samples!)
Number of harvested URLs: 832 million
Compressed weight: 23.6 TB
Duration: 12 weeks
ARC files: 240,000
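A statistic such as the 15 % share of video and audio can be derived from the crawl logs. The sketch below computes the share of bytes per top-level MIME type, assuming the usual Heritrix crawl.log layout in which the document size is the third whitespace-separated field and the MIME type the seventh.

```python
# Sketch of a quality-control statistic: share of bytes per top-level
# MIME type, computed from a Heritrix crawl.log.
from collections import Counter

def mime_share(log_path: str) -> dict[str, float]:
    bytes_by_type: Counter[str] = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            fields = line.split()
            if len(fields) < 7 or not fields[2].isdigit():
                continue  # skip failed fetches and malformed lines
            top_level = fields[6].split("/")[0].lower()  # e.g. "video"
            bytes_by_type[top_level] += int(fields[2])
    total = sum(bytes_by_type.values()) or 1
    return {t: n / total for t, n in bytes_by_type.most_common()}

if __name__ == "__main__":
    for mime, share in mime_share("crawl.log").items():
        print(f"{mime:12s} {share:6.1%}")
```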
Lessons learnt
NAS is kept for the organisation of future crawls
NAS led the teams to invent new forms of working relationship between the IT team and the digital curators
NAS is good for configuring harvest definitions, but we must be careful not to create too many seed lists
Selective crawls at the BnF
Peter Stirling
Current organisation of selective crawls
Selective crawls complement broad crawls (sites outside .fr, sites to be collected in depth…)
Three main kinds:
Thematic (BnF departments)
Project (elections, blogs…)
Cross-departmental (news, large sites)
Nomination of sites by a network of curators at the BnF and by external partners
Currently handled directly with Heritrix; the workflow is being transferred to NAS
A community of librarians
Network of correspondents: c. 80 librarians across the different thematic departments of the BnF
One coordinator per department; 1 to 15 correspondents, depending on the size of the department
Capitalise on subject knowledge
Engage librarians in Web archiving and make it business as usual at the BnF
In practice: meetings with coordinators to define collection policy, training sessions, workshops…
Tools
The previous curator tool (GDLWeb) allowed curators to nominate sites
For election crawls, a dedicated nomination tool allows remote access and classification of the nominated sites
Curators define seed, depth, frequency and budget (see the sketch below)
Validation by the web archiving team, then transfer to the IT team for crawling with Heritrix
A new curator tool working directly with NAS will be developed in the first half of 2011
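As an illustration of the seed/depth/frequency/budget parameters curators provide, here is a hypothetical sketch of a nomination record and the kind of validation pass the web archiving team might apply; the field names and limits are assumptions, not the actual GDLWeb or BnF rules.

```python
# Hypothetical nomination record and a simple validation pass.
from dataclasses import dataclass

@dataclass
class Nomination:
    seed: str
    depth: int          # link hops from the seed
    frequency: str      # e.g. "annual", "biannual", "daily"
    budget: int         # maximum number of URLs to harvest

def validate(nom: Nomination) -> list[str]:
    """Return a list of problems; an empty list means the nomination passes."""
    problems = []
    if not nom.seed.startswith(("http://", "https://")):
        problems.append("seed is not an absolute URL")
    if not 0 < nom.depth <= 20:
        problems.append("depth outside the accepted range")
    if nom.frequency not in {"annual", "biannual", "daily"}:
        problems.append("unknown crawl frequency")
    if nom.budget <= 0:
        problems.append("budget must be positive")
    return problems

print(validate(Nomination("http://example.fr/", depth=3,
                          frequency="annual", budget=10_000)))  # []
```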
Size and frequency of selective crawls
Thematic crawls are generally performed once a year; other frequencies to be put in place with NAS
Project crawls can be more frequent (twice a year, multiple crawls during elections…)
c. 20,000 seed URLs across all selective crawls
Ranges from c. 50 seeds (Maps Department) to almost 6,000 (Elections 2007)
Estimate for 2010: 185 million harvested URLs, 12 TB
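The slide's own figures give a rough sense of scale per seed and per URL (decimal terabytes assumed):

```python
# Rough averages derived from the 2010 estimate for selective crawls.
seeds = 20_000
urls = 185e6
terabytes = 12

print(f"average URLs per seed: {urls / seeds:,.0f}")                      # ~9,250
print(f"average size per URL : {terabytes * 1e12 / urls / 1e3:,.0f} KB")  # ~65 KB
```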
Cross-departmental selections
Tests currently under way on crawls of news sites and large sites, to be launched in October
Sites of interest to the whole library, with specific technical needs (daily crawls, in-depth crawls)
c. 80 news sites, 10 large sites (up to several million URLs)
Developments to NAS: monitoring of jobs
Monitoring a job with NAS
History of the job
Graphs showing progress
List of the longest queues (also available as a CSV file; see the sketch below)
Access to the Heritrix console
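The longest-queues CSV lends itself to small helper scripts. The sketch below assumes hypothetical column names (queue, remaining); it is not the actual NAS export schema.

```python
# Sketch of a helper around a "longest queues" CSV export.
import csv

def longest_queues(path: str, top: int = 10) -> list[tuple[str, int]]:
    """Return the `top` queues with the most URIs still waiting."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = [(r["queue"], int(r["remaining"])) for r in csv.DictReader(f)]
    return sorted(rows, key=lambda r: r[1], reverse=True)[:top]

for queue, remaining in longest_queues("longest_queues.csv"):
    print(f"{queue:40s} {remaining:>10,d} URIs still queued")
```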
Tests on news/large sites
Positive:
Use of NAS to launch daily crawls automatically
Monitoring made easier
Access to information on jobs within the NAS interface
Negative:
Queues by domain, not by host (see the sketch below)
Budget management
Only general or domain-specific filters; not possible to filter differently per Harvest Definition
Still a need for external tools (quality control…)
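The domain-versus-host point is easiest to see with two URLs from the same registered domain: grouped by host they fall into separate queues, grouped by domain they share one queue (and therefore one budget). The sketch below uses a naive two-label heuristic to extract the domain; real code would rely on the public suffix list.

```python
# Illustration of queueing by host versus by registered domain.
from urllib.parse import urlsplit

def host_key(url: str) -> str:
    return urlsplit(url).hostname or ""

def domain_key(url: str) -> str:
    # Naive heuristic: keep the last two labels of the host name.
    labels = host_key(url).split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host_key(url)

urls = ["http://www.lemonde.fr/", "http://blog.lemonde.fr/economie/"]
print({host_key(u) for u in urls})    # two host queues
print({domain_key(u) for u in urls})  # a single domain queue
```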
Transfer of the existing workflow to NAS
Use of other frequencies (a demand expressed by curators): continuous crawling
How do we keep our current organisation while adapting it to NAS? Use of Harvest Definitions…
Further NAS developments: budgets, filters, quality control…
Development of the curator tool:
Management of selections by curators
Validation by the web archiving team
Interaction with NAS