Annick Le Follic, Bibliothèque nationale de France. Tallinn, 2015-01-29


1 Annick Le Follic, Bibliothèque nationale de France. Tallinn, 2015-01-29

2 BnF needs
Objectives:
- Characterize BnF web collections
- Manage the activity of the digital legal deposit team
- Describe BnF web data to be preserved
Two main kinds of metrics: harvesting and preservation
From experimentation…
- Scripts and Heritrix reports from Internet Archive engineers
- A dedicated application developed by BnF engineers

3 International environment
… to the standardization of the metrics
- Definition of concepts and standards with an ISO working group
- Dedicated statistics have to be included in the Library general performance statistics
Experience sharing within the IIPC
- Many libraries have changed from ARC to WARC
- BnF developed a specific tool (NAS_qual) for its first internal broad crawl in 2010

4 Benefits of ISO report for BnF
- Adoption of strict definitions of terms
- Main metrics chosen for collection development
- At a more refined level, collection characterisation

Main metrics for collection development:
Statistic | Purpose | Example
Number of targets | Objectives of the collection | 8,000 targets
Total number of URLs | Quantity of information in web archive | 14 billion URLs
Total compressed size stored | Overall size of web archive | 200 terabytes
Number of container files | Number of conservation units in archive | 18,000 WARC files

Collection characterisation:
Statistic | Purpose | Example
Distribution by top level domain | Geographic distribution | 70 % of collection in .fr TLD
Distribution by format types | Document type characterisation | 60 % of collection in text/html

5 BnF method
A table lists and characterizes all possible metrics:
- A code and a name for each one
- The source report
- The calculation method
- The needed tool (scripts, NAS_qual…)
Main difficulties:
- Difference of scale between broad and selective crawls
- Compressed or uncompressed size
- Collected or processed URLs

6
 | Production | Preservation
Sources | NetarchiveSuite (and Heritrix) | SPAR
Statistics tools | NAS_qual | SPAR indicators
Exploitation | Excel files (both columns)

7 Description of top level domain
Purpose: to characterize a collection in terms of geographic distribution (e.g. France)
Metrics description:
- Name: Number of TLD
- Description: number of unique top level domains harvested
Metrics elaboration:
- Data source: hosts-report.txt (N files)
- Calculation: extract the TLD from each host name, then count the distinct TLDs with at least one URL harvested. Be careful: the same TLD can occur several times across several host reports, so it must be counted only once.
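
The calculation above could be sketched as follows. This is a minimal illustration, not BnF's NAS_qual code. It assumes each data line of a hosts-report.txt starts with the URL count, followed by a byte count and the host name (`[#urls] [#bytes] [host] ...`); the function name `tld_counts` and the sample lines are hypothetical.

```python
from collections import Counter
from typing import Iterable


def tld_counts(report_lines: Iterable[str]) -> Counter:
    """Sum harvested URL counts per TLD over the lines of one or more host reports."""
    counts: Counter = Counter()
    for line in report_lines:
        fields = line.split()
        # Skip blank lines, the header line, and rows that are too short.
        # Assumed column layout: [#urls] [#bytes] [host] ...
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        urls = int(fields[0])
        host = fields[2].lower().split(":")[0].rstrip(".")  # drop port / trailing dot
        if urls == 0 or "." not in host:
            continue  # keep only hosts with at least one harvested URL
        tld = host.rsplit(".", 1)[1]
        if tld.isdigit():
            continue  # bare IPv4 address: no TLD to extract
        counts[tld] += urls  # same TLD in several reports is merged, not recounted
    return counts


# Hypothetical lines from two host reports (two harvest jobs).
job_a = ["[#urls] [#bytes] [host]", "120 50000 www.example.fr", "30 9000 example.com"]
job_b = ["[#urls] [#bytes] [host]", "80 20000 blog.example.fr", "0 0 unreached.example.org"]

counts = tld_counts(job_a + job_b)
number_of_tld = len(counts)  # distinct TLDs with at least one URL
```

Merging all reports into one counter before taking `len` is what avoids counting a TLD once per report; the per-TLD sums also give the distribution used on the next slide.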

8 Statistics on top level domains
Starting with a seed list of .fr domains, we can see that the French scope also includes a large share of .com domains, as well as European domains.

TLD | Number of URLs | %
fr | 1,050,488,163 | 43.3 %
com | 952,199,484 | 39.3 %
net | 105,871,664 | 4.4 %
org | 104,451,350 | 4.3 %
eu | 29,396,613 | 1.2 %

9 Description of MIME types
Purpose: get a distribution by content type comparable to other documents in a library; help preservation tasks
Metrics description:
- Name: Number of MIME types
- Description: number of unique MIME types harvested
Metrics elaboration:
- Data source: mimetype-report.txt (N files)
- Calculation: count the distinct MIME types. Be careful: the same MIME type can occur several times across several MIME type reports, so it must be counted only once.
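
A minimal sketch of this count, under the assumption that each data line of a mimetype-report.txt holds a URL count, a byte count, and the MIME type (`[#urls] [#bytes] [mime-types]`). The function names and sample lines are hypothetical, and the grouping by top-level category (text, image, …) is one possible way to produce a distribution by category.

```python
from collections import Counter
from typing import Iterable


def mime_counts(report_lines: Iterable[str]) -> Counter:
    """Sum URL counts per MIME type over the lines of one or more MIME type reports."""
    counts: Counter = Counter()
    for line in report_lines:
        fields = line.split()
        # Assumed column layout: [#urls] [#bytes] [mime-types]
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # header, blank, or malformed line
        urls = int(fields[0])
        mime = fields[2].lower().split(";")[0]  # drop parameters such as charset
        if urls > 0:
            counts[mime] += urls  # repeated types across reports are merged
    return counts


def by_category(counts: Counter) -> Counter:
    """Group per-type counts by the part before the slash (text, image, ...)."""
    categories: Counter = Counter()
    for mime, urls in counts.items():
        categories[mime.split("/", 1)[0]] += urls
    return categories


# Hypothetical lines from two MIME type reports.
report_a = ["[#urls] [#bytes] [mime-types]", "100 5000 text/html", "40 90000 image/jpeg"]
report_b = ["[#urls] [#bytes] [mime-types]", "10 100 text/html", "5 50 text/css"]

per_type = mime_counts(report_a + report_b)
number_of_mime_types = len(per_type)
per_category = by_category(per_type)
```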

10 Statistics on MIME types
We can note that about 4 million audio and video files (2.1 million video plus 1.8 million audio) will need special attention to be preserved.

MIME type (by category) | Number of URLs | %
text | 1,34[?],647,190 | 55.7 %
image | 947,101,138 | 39.3 %
application | 114,668,770 | 4.8 %
video | 2,120,719 | 0.1 %
audio | 1,837,207 | 0.1 %

11 Description of WARC files volume
Purpose: manage the storage space; help preservation tasks
Metrics description:
- Name: WARC files volume (compressed, including WARC metadata)
- Description: size in bytes (and in GB) of all the conservation units produced / data harvested
Metrics elaboration:
- Data source: manifest of storage servers
- Calculation: add up the sizes of all the WARC files of one or several harvests (configurable).
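
The slide does not describe the manifest format, so the following is only a sketch under a stated assumption: each manifest line is taken to be a file name followed by its size in bytes. The function `warc_volume`, the file names, and the use of a file-name prefix to select harvests are all hypothetical.

```python
from typing import Iterable, Sequence

# Container extensions counted as conservation units (ARC included as an assumption).
CONTAINER_SUFFIXES = (".warc", ".warc.gz", ".arc", ".arc.gz")


def warc_volume(manifest_lines: Iterable[str], harvest_prefixes: Sequence[str] = ()) -> int:
    """Return the total size in bytes of container files, optionally limited to some harvests."""
    total = 0
    for line in manifest_lines:
        fields = line.split()
        # Assumed line layout: <file name> <size in bytes>
        if len(fields) != 2 or not fields[1].isdigit():
            continue
        name, size = fields[0], int(fields[1])
        if not name.endswith(CONTAINER_SUFFIXES):
            continue  # ignore anything that is not an ARC/WARC container
        if harvest_prefixes and not name.startswith(tuple(harvest_prefixes)):
            continue  # harvest not selected
        total += size
    return total


# Hypothetical manifest lines.
manifest = [
    "harvest2014-0001.warc.gz 1000",
    "harvest2014-0002.warc.gz 2500",
    "harvest2013-0001.arc.gz 700",
    "README.txt 10",
]

all_harvests = warc_volume(manifest)
only_2014 = warc_volume(manifest, ["harvest2014"])
```

Passing no prefix sums every container in the manifest; passing one or several prefixes mirrors the "one or several harvests (configurable)" option.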

12 Statistics on WARC files volume
BnF counts ARC and WARC files in a similar way.
- 99.35 TiB in 2014
- 567.38 TiB for the entire BnF web collections
Open question: BnF has not yet decided whether to convert bytes to GB or GiB, and to TB or TiB.
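
The choice of unit matters at this scale. As a small illustration (the function names are hypothetical), the same byte count reads about 10 % higher in decimal terabytes (10^12 bytes) than in binary tebibytes (2^40 bytes):

```python
def to_tb(n_bytes: float) -> float:
    """Decimal terabytes: 1 TB = 10**12 bytes."""
    return n_bytes / 10**12


def to_tib(n_bytes: float) -> float:
    """Binary tebibytes: 1 TiB = 2**40 bytes."""
    return n_bytes / 2**40


# The total quoted on the slide, converted back to bytes.
total_bytes = 567.38 * 2**40

total_in_tib = to_tib(total_bytes)  # 567.38
total_in_tb = to_tb(total_bytes)    # about 623.84
```

Whichever unit is chosen, stating it explicitly next to each figure keeps totals comparable between institutions.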

13 Communication to users
- Comments by the digital legal deposit team: describe the web archive
- Discussion with the IT team: define annual storage volume; define number of crawlers
- Content librarians network: cooperate on selective crawls
- BnF managers and readers: disseminate figures in the annual report, on the BnF website, and in the legal deposit observatory
Metrics used (in slide order): 15 metrics, 5 metrics, 4 metrics, 2 metrics or more

14 Conclusion
Some usage limitations:
- Unused metrics
- Bugs and errors
- Lack of analysis
Some changes in the environment:
- Adaptation to Heritrix 3
- Options for new tools
- Evolution of standards
Even so, we are able to compare our production and our collections with those of other institutions.

