Archive-It WARC usage compared with NAS, and 3 questions. By Tue Hejlskov Larsen, netarchive.dk, January 2015
AIT WARC usage compared with NAS, January 2015
AIT = Archive-It, using Heritrix 3.3.* + Umbra
NAS = NetarchiveSuite 4.4, using Heritrix/
WARC records in focus: info, metadata, request, revisit (revisit is not supported by NAS)
WARC-Info (all common fields and values are omitted)
AIT:
  isPartOf: (our owner ID at Archive-It)
  description: recurrence=NONE, maxDuration=259200, maxDocumentCount=null, isTestCrawl=false, isPatchCrawl=false, oneTimeSubtype=ONE_TIME, seedCount=5, accountId=871, accountType=SUBSCRIBER, organizationName="The Royal Library - Denmark", collectionId=4897, collectionName="div_javascript", collectionPublic=false
  robots: obey (this is not correct; we are actually ignoring robots)
  http-header-user-agent: Mozilla/5.0 (compatible; archive.org_bot; Archive-It; +
NAS:
  operator: Admin
  isPartOf: forsider (the name of the order XML)
  description: Special template harvesting only seeds
  robots: ignore
  http-header-user-agent: Mozilla/5.0 (compatible; heritrix/
  http-header-from:
NAS WARC-Info, continued (not in AIT)
  #added by NetarchiveSuite Version: status RELEASE
  harvestInfo.version: 0.4
  harvestInfo.jobId:
  harvestInfo.priority: HIGHPRIORITY
  harvestInfo.harvestNum: 56
  harvestInfo.origHarvestDefinitionID: 122
  harvestInfo.maxBytesPerDomain:
  harvestInfo.maxObjectsPerDomain: -1
  harvestInfo.orderXMLName: forsider
  harvestInfo.origHarvestDefinitionName: Engagnshøstninger
  harvestInfo.scheduleName: Enkelt_gang
  harvestInfo.harvestFilenamePrefix:
  harvestInfo.jobSubmitDate: T11:30:24
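The warcinfo payloads listed on these two slides are plain "name: value" lines (content type application/warc-fields). As a minimal sketch, assuming the payload has already been extracted from the WARC file and decoded to text, the NAS-specific harvestInfo.* fields could be separated from the common fields as follows. The function names are hypothetical, the sample repeats only values shown on the slides, and fields whose values are missing on the slide are left empty.

```python
def parse_warc_fields(payload: str) -> dict:
    """Parse 'name: value' lines into a dict (later duplicates overwrite earlier ones)."""
    fields = {}
    for raw in payload.splitlines():
        line = raw.strip()
        # Skip blank lines and the '#added by NetarchiveSuite' marker line.
        # Treating '#' lines as comments is an assumption of this sketch.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


def harvest_info(fields: dict) -> dict:
    """Return only the harvestInfo.* fields, with the prefix removed."""
    prefix = "harvestInfo."
    return {k[len(prefix):]: v for k, v in fields.items() if k.startswith(prefix)}


# Sample built from values on the slides; empty values are elided there too.
SAMPLE = """\
operator: Admin
isPartOf: forsider
robots: ignore
#added by NetarchiveSuite
harvestInfo.version: 0.4
harvestInfo.jobId:
harvestInfo.priority: HIGHPRIORITY
harvestInfo.harvestNum: 56
harvestInfo.maxObjectsPerDomain: -1
harvestInfo.orderXMLName: forsider
harvestInfo.scheduleName: Enkelt_gang
"""

if __name__ == "__main__":
    info = harvest_info(parse_warc_fields(SAMPLE))
    for key, value in sorted(info.items()):
        print(f"{key} = {value!r}")
```

A split like this makes it easy to compare which fields only one of the two systems writes, which is the background for question 1.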
Usage of request and metadata records
Both record types seem to be used by Umbra.
Usage:
  Request: documentation of the HTTP request
  Metadata: documentation of the links extracted from the actual Target-URI, in Heritrix format
How to activate in the NAS settings: false true …..
Request record (not activated by default in DK NAS)
The content part represents the HTTP request header.
  WARC-Type: request
  WARC-Target-URI:
  WARC-Date: T12:35:43Z
  WARC-Concurrent-To:
  WARC-Record-ID:
  Content-Type: application/http; msgtype=request
  Content-Length: 1006

  GET /ajax/pagelet/generic.php/PhotoViewerInitPagelet?ajaxpipe=….
  Connection: Close
  Referer:
  Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
  Accept-Language: *
  X-DevTools-Emulate-Network-Conditions-Client-Id: 7E480C0F-86C9-DFE2-ACB1-ADADDA7F813A
  User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/ (KHTML, like Gecko) Ubuntu Chromium/ Chrome/ Safari/
  Host:
  Cookie: datr=FzxOVIKQ4RcCgDMX0qmEwTn4; reg_fb_gate=https%3A%2F%2Fwww.facebook.com%2Fsocialdemokraterne; reg_fb_ref=https%3A%2F%2Fdevelopers.facebook.com%2F%3Fref%3Dpf
Metadata record (not activated by default in DK NAS)
The content part is defined by Heritrix:
  WARC-Type: metadata
  WARC-Target-URI:
  WARC-Date: T12:35:43Z
  WARC-Concurrent-To:
  WARC-Record-ID:
  Content-Type: application/warc-fields
  Content-Length:

  via:
  hopsFromSeed: I
  sourceTag:
  fetchTimeMs: 423
  charsetForLinkExtraction: UTF-8
  outlink: whois:
  outlink: whois:facebook.com
  outlink:
  outlink:
  outlink:
  …..
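A metadata payload repeats the outlink field, so reading it into a plain dictionary would keep only the last link. A minimal sketch that keeps every pair in order, assuming the payload is already decoded text. The function names are hypothetical, and the second outlink line in the sample (including its extra hop and context columns) is an illustrative assumption rather than something taken from the slide.

```python
def parse_field_pairs(payload: str) -> list:
    """Parse 'name: value' lines into an ordered list of (name, value) pairs."""
    pairs = []
    for raw in payload.splitlines():
        # Split on the first colon only, so values like 'whois:facebook.com' survive.
        name, sep, value = raw.strip().partition(":")
        if sep and name:
            pairs.append((name.strip(), value.strip()))
    return pairs


def outlink_uris(pairs: list) -> list:
    """Return the URI (first whitespace-separated token) of each non-empty outlink."""
    uris = []
    for name, value in pairs:
        if name == "outlink" and value:
            uris.append(value.split()[0])
    return uris


SAMPLE = """\
hopsFromSeed: I
fetchTimeMs: 423
charsetForLinkExtraction: UTF-8
outlink: whois:facebook.com
outlink: http://example.org/style.css E link/@href
outlink:
"""

if __name__ == "__main__":
    for uri in outlink_uris(parse_field_pairs(SAMPLE)):
        print(uri)
```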
Revisit record is not supported in NAS
NAS deduplications are only written to the NAS crawl logs, not to the WARC files.
NAS deduplicates only non-HTML/text content.
NAS has a different indexing workflow.
Revisit record (not supported in DK NAS)
The content part contains the HTTP header for the revisited Target-URI:
  WARC-Type: revisit
  WARC-Target-URI:
  WARC-Date: T16:21:14Z
  WARC-Payload-Digest: sha1:L2WKNE6KZDY5XAUHGDRG4BW2KTMKKM5I
  WARC-IP-Address:
  WARC-Profile:
  WARC-Truncated: length
  WARC-Refers-To-Target-URI:
  WARC-Refers-To-Date: T16:21:13Z
  WARC-Refers-To:
  WARC-Record-ID:
  Content-Type: application/http; msgtype=response
  Content-Length: 790

  HTTP/ OK
  X-Seen-By: us-central1-a-mediaserver-pool-2439c-9xx2.c.wixpop-gce.internal-dispatcher_dsp
  Expires: Tue, 14 Oct :21:14 GMT
  Date: Tue, 07 Oct :21:14 GMT
  …….
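To make the header list above concrete, here is a minimal sketch that assembles a revisit record for an already known duplicate. The profile URI is the identical-payload-digest profile from the WARC 1.0 specification; the slide does not show which value AIT wrote, so using it here is an assumption. The function name, its parameters, the dates and the example.org values are all hypothetical; only the digest is copied from the slide.

```python
import uuid

# Revisit profile defined in the WARC 1.0 specification (assumed, not shown on the slide).
IDENTICAL_PAYLOAD_DIGEST = (
    "http://netpreserve.org/warc/1.0/revisit/identical-payload-digest"
)


def build_revisit_record(target_uri, warc_date, payload_digest,
                         refers_to_uri, refers_to_date, http_headers: bytes) -> bytes:
    """Serialize a revisit record whose block holds only the HTTP response headers."""
    headers = [
        ("WARC-Type", "revisit"),
        ("WARC-Target-URI", target_uri),
        ("WARC-Date", warc_date),
        ("WARC-Payload-Digest", payload_digest),
        ("WARC-Profile", IDENTICAL_PAYLOAD_DIGEST),
        # Mirrors the slide: the body is omitted, so the record is marked truncated.
        ("WARC-Truncated", "length"),
        ("WARC-Refers-To-Target-URI", refers_to_uri),
        ("WARC-Refers-To-Date", refers_to_date),
        # WARC-Refers-To is left out: the original record ID may not be known
        # when only a crawl log entry is available.
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_headers))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{n}: {v}\r\n" for n, v in headers) + "\r\n"
    # A WARC record ends with two CRLF pairs after the content block.
    return head.encode("utf-8") + http_headers + b"\r\n\r\n"


if __name__ == "__main__":
    block = b"HTTP/1.1 200 OK\r\nContent-Type: image/png\r\n\r\n"
    record = build_revisit_record(
        "http://example.org/logo.png", "2014-10-07T16:21:14Z",
        "sha1:L2WKNE6KZDY5XAUHGDRG4BW2KTMKKM5I",
        "http://example.org/logo.png", "2014-10-07T16:21:13Z", block,
    )
    print(record.decode("utf-8"))
```

Everything the sketch needs apart from the HTTP header block (URI, date, digest, and a reference to the earlier capture) is the kind of information a deduplication entry in a crawl log carries, which is why question 3 asks whether NAS could write such records as well.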
Revisits are also written in the crawl log (AIT)
  T12:35:39.661Z a.akamaihd.net/robots.txt IP a.akamaihd.net/rsrc.php/v2/yG/r/cIVQ7_V6Fiv.css text/plain # sha1:ZSQCAWXRTTXDI73H6FCOXAX6ERE2RUPB duplicate:digest {"warcFileOffset":95661,"warcFilename":"ARCHIVEIT NONE wbgrp-svc110.us.archive.org-6444.warc.gz"}
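The crawl log entry above ends with a duplicate:digest annotation and a JSON object naming the WARC file and offset. A minimal sketch of pulling those pieces out of one line, assuming whitespace-separated columns followed by a trailing JSON object. The function name is hypothetical, and the sample lines are illustrative: only the digest and the offset are taken from the slide, while the URLs, dates and file name are made up.

```python
import json


def parse_duplicate_entry(line: str):
    """Return digest and WARC location for a duplicate entry, or None otherwise."""
    # Everything from the first ' {' onward is treated as the JSON extra info.
    head, brace, tail = line.strip().partition(" {")
    extra = json.loads("{" + tail) if brace else {}
    tokens = head.split()
    # The annotations column may hold several comma-separated values.
    annotations = set()
    for token in tokens:
        annotations.update(token.split(","))
    if "duplicate:digest" not in annotations:
        return None
    digest = next((t for t in tokens if t.startswith("sha1:")), None)
    return {
        "digest": digest,
        "warc_filename": extra.get("warcFilename"),
        "warc_offset": extra.get("warcFileOffset"),
    }


DUPLICATE_LINE = (
    "2014-10-07T12:35:39.661Z 200 1234 http://example.org/a.css E "
    "http://example.org/ text/css #001 20141007123539500+120 "
    "sha1:ZSQCAWXRTTXDI73H6FCOXAX6ERE2RUPB - duplicate:digest "
    '{"warcFileOffset":95661,"warcFilename":"EXAMPLE-00000.warc.gz"}'
)
FRESH_LINE = (
    "2014-10-07T12:35:40.000Z 200 512 http://example.org/ - - text/html "
    "#002 20141007123539900+80 sha1:AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA - -"
)

if __name__ == "__main__":
    print(parse_duplicate_entry(DUPLICATE_LINE))
    print(parse_duplicate_entry(FRESH_LINE))
```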
Questions
1. Should we change our format and what NAS writes in the WARC-Info header?
2. Should NAS activate request and metadata records by default?
3. Should we also write revisit records corresponding to the NAS crawl log entries (and is that possible according to the WARC specification)?