BnF - DLWEB - Umbra & Heritrix 3 NetarchiveSuite 5.3 Colin Rosenthal csr@kb.dk Sara Aubry sara.aubry@bnf.fr
Overview Heritrix 3 Integration in NetarchiveSuite Feedback on BnF migration from NAS 4
Heritrix 3 Integration in NetarchiveSuite https://sbforge.org/display/NAS/Heritrix+3+Integration+in+NAS
Feedback on BnF migration from NAS 4
Background
- Heritrix 1 has been in use at BnF since 2006
- 9-month project, started in July 2016
- Tackled a variety of activities:
  - many different kinds of tests
  - data and metadata analysis
  - template and crawler traps migration
  - software evolutions
  - organisation changes
- Get started by reading the release notes
Many different kinds of tests
- Appropriation: get a sense of what’s new/gone/different
  => H3 is a much more technical tool
- Collections
  - Are focused crawls working the same way/working better? => faster
  - Is the content quality the same/improved? => less noise
  - Do we crawl specific content better? => https
- Tools: are there new features to prepare, monitor and QA crawls?
- Infrastructure: can we still run applications on our virtual server environment? => yes
- Templates + crawler traps: can we still use our knowledge base? => yes but…
Data and metadata analysis - 1
- WARC revisit records when using deduplication
- Need to restart deduplication indices, impact on storage
    WARC/1.0
    WARC-Type: revisit
    WARC-Target-URI: https://static.lexpress.fr/min/images/logos/svg/lexpress.svg
    WARC-Date: 2017-04-24T08:02:16Z
    WARC-IP-Address: 54.230.79.70
    WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
    WARC-Truncated: length
    WARC-Payload-Digest: 5GXMYH6VZWVSZRURVYXHWYJKNWUE65BR
    WARC-Refers-To-Date: 2017-03-26T08:09:34Z
    WARC-Refers-To-Target-URI: https://static.lexpress.fr/min/images/logos/svg/lexpress.svg
    WARC-Record-ID: <urn:uuid:4294c205-0d27-439c-9599-ea8508306ad8>
    Content-Type: application/http; msgtype=response
    Content-Length: 695

    HTTP/1.1 200 OK
    Content-Type: image/svg+xml
    Connection: close
    Server: nginx
    Date: Wed, 19 Apr 2017 11:01:37 GMT
    Last-Modified: Wed, 05 Apr 2017 09:11:45 GMT
    X-Backend: static1
    X-CacheL2N: express.web.cache-back-02 HIT 6 (440306/31536000.000)
    Cache-Control: public, max-age=31556926
    X-CacheL2: express.web.cache-back-02 MISS (0/31536000.000)
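Downstream tools (ingest workflow, OpenWayback) need to recognise these revisit records. The following is a minimal sketch in Python, not the BnF tooling: it parses the WARC header block of a single record held as text and checks for the deduplication profile shown above. The function names `parse_warc_headers` and `is_dedup_revisit` are hypothetical, and the sample is an abridged copy of the record on this slide.

```python
def parse_warc_headers(record_text):
    """Parse the WARC header block (everything up to the first blank line)."""
    lines = record_text.splitlines()
    if not lines or not lines[0].startswith("WARC/"):
        raise ValueError("not a WARC record")
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the WARC header block
        # Split on the first colon only, so URIs and dates stay intact.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


def is_dedup_revisit(headers):
    """True for a revisit record written because the payload digest was unchanged."""
    return (headers.get("WARC-Type") == "revisit"
            and headers.get("WARC-Profile", "").endswith("identical-payload-digest"))


# Abridged copy of the record shown on the slide.
SAMPLE = """\
WARC/1.0
WARC-Type: revisit
WARC-Target-URI: https://static.lexpress.fr/min/images/logos/svg/lexpress.svg
WARC-Date: 2017-04-24T08:02:16Z
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Payload-Digest: 5GXMYH6VZWVSZRURVYXHWYJKNWUE65BR
WARC-Refers-To-Date: 2017-03-26T08:09:34Z
Content-Length: 695

HTTP/1.1 200 OK
Content-Type: image/svg+xml
"""

headers = parse_warc_headers(SAMPLE)
if is_dedup_revisit(headers):
    # The original capture is identified by WARC-Refers-To-Date (and -Target-URI).
    print("revisit of capture from", headers["WARC-Refers-To-Date"])
```

A replay or ingest tool would then resolve the payload from the capture named in WARC-Refers-To-Date and WARC-Refers-To-Target-URI; how each tool does that is outside this sketch.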
Data and metadata analysis - 2
- Configuration Log Report (CLR) files are still there and similar to H1
- Significant changes from order.xml => crawler-beans.cxml
Template and crawler traps migration - 1
- Change from order.xml (XML objects/parameters structure) to crawler-beans.cxml (beans/properties structure)
- Started with DK default template => BnF default.cxml with all beans we needed
- Then migrated our main templates (domain, host…) using the reference document that lists all differences
- Ended with the specific templates (pressepayante, ftp, websocial)
- Gone from > 20 to only 7 templates
- simpleOverrides
- A bean for metadata, seeds, scope, processors, …
Template and crawler traps migration - 2
- Property names in H3 are similar to parameter names in H1. Ex: delayFactor <= delay-factor
- H3 templates contain 11 NAS placeholders (starting with %{):
  - Ex: crawler traps: %{CRAWLERTRAPS_PLACEHOLDER}
  - Ex: WARC writing: %{ARCHIVER_PROCESSOR_BEAN_PLACEHOLDER} and %{ARCHIVER_BEAN_REFERENCE_PLACEHOLDER}
- Preparing the new templates is the most time-consuming task
- Review and fix all global and domain-specific crawler traps:
  - (?i)^.*citer&page.*$ => ampersand will fail a job
  - opened but not closed brackets…
- H3 parses and validates the job configuration before starting the job
- Other placeholders:
  - %{DEDUPLICATION_INDEX_LOCATION_PLACEHOLDER}
  - %{FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER}
  - %{QUOTA_ENFORCER_GROUP_MAX_FETCH_SUCCES_PLACEHOLDER} and %{QUOTA_ENFORCER_MAX_BYTES_PLACEHOLDER}
  - %{MAX_TIME_SECONDS_PLACEHOLDER}
  - %{MAX_HOPS}
  - %{EXTRACT_JAVASCRIPT}
  - %{HONOR_ROBOTS_DOT_TXT}
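Since H3 validates the job configuration before starting, crawler traps can be screened ahead of time. The sketch below is an illustration under stated assumptions, not the procedure used at BnF: it uses Python's `re` module as a stand-in for the Java regex engine that Heritrix runs on (the two dialects differ in places), and it assumes a raw `&` is a problem because the trap ends up inside the XML template. The helper name `check_trap` is hypothetical.

```python
import re
from xml.sax.saxutils import escape


def check_trap(pattern):
    """Return a list of problems found in one crawler-trap regex."""
    problems = []
    try:
        # Catches unbalanced parentheses and brackets, among other syntax errors.
        re.compile(pattern)
    except re.error as exc:
        problems.append(f"invalid regex: {exc}")
    # A bare '&' that is not already one of the five predefined XML entities.
    if re.search(r"&(?!(?:amp|lt|gt|quot|apos);)", pattern):
        problems.append("unescaped '&' (not valid inside an XML template)")
    return problems


traps = [
    r"(?i)^.*citer&page.*$",   # example from the slide: raw ampersand
    r"^.*(calendar.*$",        # hypothetical example: opened but not closed parenthesis
    r"^.*/agenda/\d{4}/.*$",   # hypothetical example: nothing to report
]

for trap in traps:
    for problem in check_trap(trap):
        print(f"{trap!r}: {problem}")

# One possible fix for the ampersand case: XML-escape the pattern.
print(escape(r"(?i)^.*citer&page.*$"))
```

Whether a trap should be stored escaped or escaped at template-generation time depends on how NAS fills %{CRAWLERTRAPS_PLACEHOLDER}, which the slides do not specify.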
Software evolutions: NAS (besides H3 remote access) 3 new fields: max-hops, robots.txt, extract javascript in configuration
Software evolutions: others
- Other tools that are plugged into NAS, WARC data files, WARC metadata files
- We had to update:
  - nas-qual (compiles production statistics from Heritrix reports)
  - preservation ingest workflow (WARC revisit records, changes in CLR)
  - OpenWayback (WARC revisit records)
Review of installation
- Update of NAS database (2 new tables, eav_attribute and eav_type_attribute, + isActive column on ordertemplates)
- Java 8 (1.8.0_40 at BnF)
- Changes in main deployment settings:
  - jar libraries reorganisation in deployGlobal section
  - new heritrix3 sub-section in harvester section
  - new heritrix3 libraries in deployMachine section
  - minor differences in database connection
Questions ?
Spare slides