Bibliothèque nationale de France
BnF update: production and development priorities in 2015
Tallinn
Production
  Infrastructure: end of the old “Petabox” architecture
  Harvesting
    Focused crawls, notably the electoral crawl in December 2015
    Annual domain crawl, September-November
    Budget of 120 TB in 2015, against 100 TB in 2014
  Preservation
    Maintaining a sufficient ingest ratio (ingesting should run faster than harvesting!)
    Accepting the WARC format in the current ingest channel
    Starting new ingest channels (BnF crawls with HTTrack, crawls performed by IA for the BnF)
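The rule that ingesting should run faster than harvesting can be checked with simple arithmetic. The sketch below is illustrative only: the function names and the monthly figures are hypothetical and are not BnF data; only the 120 TB yearly budget comes from the slide.

```python
def ingest_ratio(ingested_tb, harvested_tb):
    """Return ingested volume divided by harvested volume for one period."""
    if harvested_tb <= 0:
        raise ValueError("harvested_tb must be positive")
    return ingested_tb / harvested_tb


def backlog_after(periods, initial_backlog_tb=0.0):
    """Track the un-ingested backlog (in TB) across (harvested, ingested) periods."""
    backlog = initial_backlog_tb
    history = []
    for harvested_tb, ingested_tb in periods:
        # Ingest cannot consume more than what is waiting, so clamp at zero.
        backlog = max(0.0, backlog + harvested_tb - ingested_tb)
        history.append(backlog)
    return history


# Hypothetical figures: 120 TB harvested evenly over 12 months (10 TB/month),
# an assumed ingest capacity of 12 TB/month, and an assumed 6 TB starting backlog.
monthly = [(10.0, 12.0)] * 12
print(ingest_ratio(12.0, 10.0))     # a ratio above 1 means ingest keeps up
print(backlog_after(monthly, 6.0))  # the backlog shrinks month by month
```

A ratio below 1 over a sustained period would mean the backlog grows without bound, which is the situation the slide warns against.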
Legal deposit of ebooks
  Two working groups
    Bringing together representatives of the BnF, the SNE (main publishers' union) and the Ministry of Culture
  Legal side: adapting the decree on internet legal deposit
  Technical side
    Working with e-distributors to get data and metadata
    Agreeing on the formats for data (EPUB, PDF) and metadata (ONIX)
    Setting up an internal workflow
      Entry via FTP deposit
      Re-use of already existing applications: SPAR, Gallica…
      Important issue: automation of cataloguing
    Real-scale tests to start in April-May
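To illustrate what automating cataloguing from ONIX metadata could involve, here is a minimal sketch that turns a simplified ONIX-style record into a catalogue stub. It is not the BnF workflow: the sample record, the function name `catalogue_stubs` and the output fields are hypothetical, and real ONIX feeds typically declare an XML namespace and carry many more elements than shown here.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily simplified ONIX-style message (no namespace, one product).
SAMPLE_ONIX = """\
<ONIXMessage>
  <Product>
    <RecordReference>example-0001</RecordReference>
    <ProductIdentifier>
      <ProductIDType>15</ProductIDType>
      <IDValue>9780000000002</IDValue>
    </ProductIdentifier>
    <DescriptiveDetail>
      <TitleDetail>
        <TitleElement>
          <TitleText>Example title</TitleText>
        </TitleElement>
      </TitleDetail>
    </DescriptiveDetail>
  </Product>
</ONIXMessage>
"""


def catalogue_stubs(onix_xml):
    """Build one minimal catalogue stub (a dict) per Product element."""
    root = ET.fromstring(onix_xml)
    stubs = []
    for product in root.iter("Product"):
        isbn13 = None
        for ident in product.findall("ProductIdentifier"):
            # In the ONIX code lists, identifier type 15 denotes an ISBN-13.
            if ident.findtext("ProductIDType") == "15":
                isbn13 = ident.findtext("IDValue")
        stubs.append({
            "reference": product.findtext("RecordReference"),
            "isbn13": isbn13,
            "title": product.findtext(
                "DescriptiveDetail/TitleDetail/TitleElement/TitleText"
            ),
        })
    return stubs


for stub in catalogue_stubs(SAMPLE_ONIX):
    print(stub)
```

Fields that are missing from a record come back as `None`, which makes incomplete deposits easy to flag for manual cataloguing.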
Development: access projects
Data mining
  Project supported by a specific research fund
  Cooperation with a research library (specialized in 20th-century history) and a technology university (Telecom ParisTech)
  Goal: studying the use of digitized documents on the WWI web
  On the BnF side
    Extracting metadata in “WAT” files
    Providing a server where a researcher can “play” with the data…
    …while respecting legal deposit regulations
  On the Telecom ParisTech side
    Recruiting a researcher who will use the BnF services to perform the study
    The researcher arrived in January 2015, for 6 to 9 months
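As an illustration of the kind of metadata a researcher could work with, the sketch below reads one WAT-style JSON payload and pulls out a few fields. The key names follow the commonly documented WAT envelope layout as far as I know it, and should be treated as assumptions; the sample payload and the `summarize` helper are hypothetical and do not describe the BnF server.

```python
import json

# Made-up payload shaped like a WAT metadata record (assumed layout).
SAMPLE_WAT_PAYLOAD = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {
            "WARC-Target-URI": "http://example.org/",
            "WARC-Date": "2014-11-11T10:00:00Z",
        },
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "Headers": {"Content-Type": "text/html"},
                "HTML-Metadata": {
                    "Links": [
                        {"path": "A@/href", "url": "http://example.org/page"},
                        {"path": "IMG@/src", "url": "http://example.org/a.png"},
                    ]
                },
            }
        },
    }
})


def summarize(payload_json):
    """Extract URI, capture date, content type and outlinks from one payload."""
    envelope = json.loads(payload_json).get("Envelope", {})
    header = envelope.get("WARC-Header-Metadata", {})
    http = envelope.get("Payload-Metadata", {}).get("HTTP-Response-Metadata", {})
    links = http.get("HTML-Metadata", {}).get("Links", [])
    return {
        "uri": header.get("WARC-Target-URI"),
        "date": header.get("WARC-Date"),
        "content_type": http.get("Headers", {}).get("Content-Type"),
        # Keep only link entries that actually carry a URL.
        "outlinks": [link["url"] for link in links if "url" in link],
    }


print(summarize(SAMPLE_WAT_PAYLOAD))
```

Working on such metadata rather than on the archived content itself is one way a study could stay within legal deposit constraints, although the actual access rules are those set by the BnF.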
Development: access projects
Data extraction
  A follow-up to the data mining project
  Being able to extract specific data from legacy W/ARC files…
  …according to filters: domain names, MIME types, dates, etc.
  Interested in the use of WARC tools and JWAT
Full-text indexing
  On specific corpora: news, government websites
  Probably using SOLR and tools developed by the BL and other IIPC partners
  Should start in June
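The filters named on this slide (domain names, MIME types, dates) can be sketched independently of any W/ARC reader. The example below is a hypothetical illustration in plain Python, not WARC tools or JWAT: `RecordInfo`, `in_domain` and `select` are invented names, and the records are made-up stand-ins for header fields that a real reader would supply.

```python
from dataclasses import dataclass
from datetime import datetime
from urllib.parse import urlsplit


@dataclass(frozen=True)
class RecordInfo:
    """Header fields of one archived record (hypothetical structure)."""
    url: str
    mime: str
    date: datetime


def in_domain(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


def select(records, domains=None, mimes=None, start=None, end=None):
    """Yield records matching every filter that is set (None means no filter)."""
    for rec in records:
        if domains is not None and not any(in_domain(rec.url, d) for d in domains):
            continue
        if mimes is not None and rec.mime not in mimes:
            continue
        if start is not None and rec.date < start:
            continue
        if end is not None and rec.date >= end:  # end bound is exclusive
            continue
        yield rec


RECORDS = [
    RecordInfo("http://www.example.org/a.pdf", "application/pdf", datetime(2014, 5, 2)),
    RecordInfo("http://www.example.org/index.html", "text/html", datetime(2014, 5, 2)),
    RecordInfo("http://other.example.net/b.pdf", "application/pdf", datetime(2014, 6, 1)),
    RecordInfo("http://example.org/old.pdf", "application/pdf", datetime(2010, 1, 1)),
]

picked = list(select(
    RECORDS,
    domains={"example.org"},
    mimes={"application/pdf"},
    start=datetime(2014, 1, 1),
    end=datetime(2015, 1, 1),
))
print([rec.url for rec in picked])
```

Writing the filter as a generator keeps memory use flat, which matters when the input is a stream of records from large archive files.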
Remote access
  Remote access: a legal possibility
    Regulation of the Ministry of Culture: remote access to the web archive should be offered to the 26 regional libraries…
    …corresponding to the 26 French regions, including overseas regions
  Technically: use of a “virtual browser”
    Use of the VMware “View” solution
  A progressive deployment
    Already available in 2 libraries, 2 new openings by March
    Goal: 8 libraries end 2015, 15 end
  In parallel, the BnF offers libraries the option of maintaining an “ongoing collection” to harvest their regional web
Development: harvesting projects
Investigating the adoption of Heritrix 3
  Identifying the benefits, the shortcomings, the opportunities and the risks
  In close relationship with… you
  More information to come!
Crawl of FTP platforms with Heritrix
  Issue: the BnF is not able to get the paper versions of local editions of the main regional newspapers
  So it tries to get the online PDF versions
  Currently we crawl the websites… but we would like to investigate “FTP deposit by robot” with Heritrix
  It’s just a teaser for the end of the workshop…