BnF - DLWEB - Umbra & Heritrix 3


NetarchiveSuite 5.3
Colin Rosenthal, csr@kb.dk
Sara Aubry, sara.aubry@bnf.fr

Overview
- Heritrix 3 Integration in NetarchiveSuite
- Feedback on BnF migration from NAS 4

Heritrix 3 Integration in NetarchiveSuite
https://sbforge.org/display/NAS/Heritrix+3+Integration+in+NAS

Feedback on BnF migration from NAS 4

Background
- Heritrix 1 has been in use at BnF since 2006
- A 9-month project, started in July 2016
- Tackled a variety of activities:
  - many different kinds of tests
  - data and metadata analysis
  - template and crawler trap migration
  - software evolutions
  - organisational changes
- Get started by reading the release notes

Many different kinds of tests
- Appropriation: get a sense of what’s new/gone/different
  - H3 is a much more technical tool
- Collections
  - Are focused crawls working the same way/working better? Faster
  - Is the content quality the same/improved? Less noise
  - Do we crawl specific content better? https
- Tools
  - Are there new features to prepare, monitor and QA crawls?
- Infrastructure
  - Can we still run applications on our virtual server environment? Yes
- Templates + crawler traps
  - Can we still use our knowledge base? Yes, but…

Data and metadata analysis - 1
- WARC revisit records are written when deduplication is used
- Need to restart the deduplication indices; impact on storage

Example revisit record:

    WARC/1.0
    WARC-Type: revisit
    WARC-Target-URI: https://static.lexpress.fr/min/images/logos/svg/lexpress.svg
    WARC-Date: 2017-04-24T08:02:16Z
    WARC-IP-Address: 54.230.79.70
    WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
    WARC-Truncated: length
    WARC-Payload-Digest: 5GXMYH6VZWVSZRURVYXHWYJKNWUE65BR
    WARC-Refers-To-Date: 2017-03-26T08:09:34Z
    WARC-Refers-To-Target-URI: https://static.lexpress.fr/min/images/logos/svg/lexpress.svg
    WARC-Record-ID: <urn:uuid:4294c205-0d27-439c-9599-ea8508306ad8>
    Content-Type: application/http; msgtype=response
    Content-Length: 695

    HTTP/1.1 200 OK
    Content-Type: image/svg+xml
    Connection: close
    Server: nginx
    Date: Wed, 19 Apr 2017 11:01:37 GMT
    Last-Modified: Wed, 05 Apr 2017 09:11:45 GMT
    X-Backend: static1
    X-CacheL2N: express.web.cache-back-02 HIT 6 (440306/31536000.000)
    Cache-Control: public, max-age=31556926
    X-CacheL2: express.web.cache-back-02 MISS (0/31536000.000)
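Downstream tools have to recognise these revisit records. As a rough illustration only (not the code used by NAS, OpenWayback or the BnF ingest workflow), the WARC header block of the record above could be read like this in Python. The function names `parse_warc_headers` and `is_identical_payload_revisit` are hypothetical, and the sample is shortened to a few of the header lines shown on the slide.

```python
# Minimal sketch: parse one WARC header block and detect a revisit record.
# Function names are hypothetical; only the standard library is used.

SAMPLE = """\
WARC/1.0
WARC-Type: revisit
WARC-Target-URI: https://static.lexpress.fr/min/images/logos/svg/lexpress.svg
WARC-Date: 2017-04-24T08:02:16Z
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Payload-Digest: 5GXMYH6VZWVSZRURVYXHWYJKNWUE65BR
WARC-Refers-To-Date: 2017-03-26T08:09:34Z
WARC-Refers-To-Target-URI: https://static.lexpress.fr/min/images/logos/svg/lexpress.svg
Content-Length: 695
"""


def parse_warc_headers(block: str) -> dict:
    """Return the named fields of one WARC header block as a dict."""
    lines = block.strip().splitlines()
    if not lines or not lines[0].startswith("WARC/"):
        raise ValueError("not a WARC record header")
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the WARC header block
        # Split on the first colon only, so URLs in the value stay intact.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


def is_identical_payload_revisit(headers: dict) -> bool:
    """True for a revisit record that uses the identical-payload-digest profile."""
    return (
        headers.get("WARC-Type") == "revisit"
        and headers.get("WARC-Profile", "").endswith("/revisit/identical-payload-digest")
    )


headers = parse_warc_headers(SAMPLE)
if is_identical_payload_revisit(headers):
    # The original capture is identified by the WARC-Refers-To-* fields.
    print(headers["WARC-Refers-To-Target-URI"], headers["WARC-Refers-To-Date"])
```

A replay or ingest tool would then use the WARC-Refers-To-Target-URI and WARC-Refers-To-Date values to locate the earlier capture that holds the actual payload.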

Data and metadata analysis - 2
- Configuration Log Report (CLR) files are still there and similar to H1
- Significant changes from order.xml => crawler-beans.cxml

Template and crawler traps migration - 1
- Change from order.xml (XML objects/parameters structure) to crawler-beans.cxml (beans/properties structure)
- Started with the DK default template => BnF default.cxml with all the beans we needed
- Then migrated our main templates (domain, host…) using the reference document that lists all differences
- Ended with the specific templates (pressepayante, ftp, websocial)
- Gone from > 20 to only 7 templates
- simpleOverrides
- A bean for metadata, seeds, scope, processors, …

Template and crawler traps migration - 2
- Property names in H3 are similar to parameter names in H1
  - Ex: delayFactor <= delay-factor
- H3 templates contain 11 NAS placeholders (they start with %{):
  - Ex: crawler traps: %{CRAWLERTRAPS_PLACEHOLDER}
  - Ex: WARC writing: %{ARCHIVER_PROCESSOR_BEAN_PLACEHOLDER} and %{ARCHIVER_BEAN_REFERENCE_PLACEHOLDER}
- Preparing the new templates is the most time-consuming task
- Review and fix all global and domain-specific crawler traps:
  - (?i)^.*citer&page.*$ => the ampersand will fail a job
  - brackets opened but not closed…
- H3 parses and validates the job configuration before starting the job
- Other placeholders:
  - %{DEDUPLICATION_INDEX_LOCATION_PLACEHOLDER}
  - %{FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER}
  - %{QUOTA_ENFORCER_GROUP_MAX_FETCH_SUCCES_PLACEHOLDER} and %{QUOTA_ENFORCER_MAX_BYTES_PLACEHOLDER}
  - %{MAX_TIME_SECONDS_PLACEHOLDER}
  - %{MAX_HOPS}
  - %{EXTRACT_JAVASCRIPT}
  - %{HONOR_ROBOTS_DOT_TXT}
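Since H3 validates the job configuration before starting, a crawler trap list can be pre-checked for the two problems named above. The Python sketch below is one possible approach, not a BnF or NAS tool. It rests on two assumptions: that the ampersand fails the job because it ends up unescaped inside the XML template, and that Python's `re` module is a close enough stand-in for the Java regex engine Heritrix uses to catch unbalanced brackets. The names `check_trap` and `xml_safe` are hypothetical.

```python
# Minimal sketch: pre-check crawler trap regexes before putting them in a
# crawler-beans.cxml template. Assumptions are stated in the text above.
import re
from xml.sax.saxutils import escape

# An ampersand that is not already part of a predefined XML entity.
RAW_AMPERSAND = re.compile(r"&(?!(?:amp|lt|gt|quot|apos);)")


def check_trap(pattern: str) -> list:
    """Return a list of human-readable problems found in one trap regex."""
    problems = []
    try:
        # Catches brackets or parentheses that are opened but never closed.
        re.compile(pattern)
    except re.error as exc:
        problems.append(f"regex does not compile: {exc}")
    if RAW_AMPERSAND.search(pattern):
        problems.append("raw '&' found: write '&amp;' inside the XML template")
    return problems


def xml_safe(pattern: str) -> str:
    """Escape &, < and > so the pattern can be embedded in XML.

    Only apply this to patterns that are not escaped yet, otherwise
    '&amp;' would be escaped a second time.
    """
    return escape(pattern)


traps = ["(?i)^.*citer&page.*$", "^.*(calendar.*$"]
for trap in traps:
    for problem in check_trap(trap):
        print(f"{trap}: {problem}")
```

Running such a check over the global and domain-specific trap lists before a harvest reports the offending patterns up front, rather than at job start.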

Software evolutions: NAS (besides H3 remote access)
- 3 new fields in configuration: max-hops, robots.txt, extract javascript

Software evolutions: others
- Other tools that are plugged into NAS, WARC data files and WARC metadata files
- We had to update:
  - nas-qual (compiles production statistics from Heritrix reports)
  - preservation ingest workflow (WARC revisit records, changes in CLR)
  - OpenWayback (WARC revisit records)

Review of installation
- Update of the NAS database (2 new tables, eav_attribute and eav_type_attribute, plus an isActive column on ordertemplates)
- Java 8 (1.8.0_40 at BnF)
- Changes in the main deployment settings:
  - reorganisation of jar libraries in the deployGlobal section
  - new heritrix3 sub-section in the harvester section
  - new heritrix3 libraries in the deployMachine section
  - minor differences in the database connection

Questions?

Spare slides
