Published by Kathleen Chase. Modified over 9 years ago.
1
Archive-It WARC usage compared with NAS, and 3 questions
By Tue Hejlskov Larsen, netarchive.dk, January 2015
2
AIT WARC usage compared with NAS, January 2015

AIT = Archive-It, using Heritrix 3.3.* + Umbra
NAS = NetarchiveSuite 4.4, using Heritrix 1.14.4

WARC records in focus:
Info
Metadata
Request
Revisit (not supported by NAS)
3
WARC-Info (all fields and values common to both are omitted)

AIT:
isPartOf: 4897-20141007162106207 (our owner ID at Archive-It)
description: recurrence=NONE, maxDuration=259200, maxDocumentCount=null, isTestCrawl=false, isPatchCrawl=false, oneTimeSubtype=ONE_TIME, seedCount=5, accountId=871, accountType=SUBSCRIBER, organizationName="The Royal Library - Denmark", collectionId=4897, collectionName="div_javascript", collectionPublic=false
robots: obey (this is not correct; we are actually ignoring robots)
http-header-user-agent: Mozilla/5.0 (compatible; archive.org_bot; Archive-It; +http://archive-it.org/files/site-owners.html

NAS:
operator: Admin
isPartOf: forsider (the name of the order XML)
description: Special template harvesting only seeds
robots: ignore
http-header-user-agent: Mozilla/5.0 (compatible; heritrix/1.14.4 +http://netarkivet.dk/webcrawler/)
http-header-from: info@netarkivet.dk

The NAS WARC-Info continues on the next page.
4
NAS WARC-Info, continued (not in AIT)

#added by NetarchiveSuite Version: 4.2.0 status RELEASE
harvestInfo.version: 0.4
harvestInfo.jobId: 209126
harvestInfo.priority: HIGHPRIORITY
harvestInfo.harvestNum: 56
harvestInfo.origHarvestDefinitionID: 122
harvestInfo.maxBytesPerDomain: 14000000000
harvestInfo.maxObjectsPerDomain: -1
harvestInfo.orderXMLName: forsider
harvestInfo.origHarvestDefinitionName: Engagnshøstninger
harvestInfo.scheduleName: Enkelt_gang
harvestInfo.harvestFilenamePrefix: 209126-122
harvestInfo.jobSubmitDate: 2014-06-12T11:30:24
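To compare which WARC-Info fields each crawler writes, the payloads can be parsed into dictionaries and their key sets compared. This is a minimal sketch, not part of NAS or Archive-It: the function name `parse_warc_fields` is hypothetical, the sample payloads are abbreviated from the two slides above, and treating lines that start with `#` as comments is an assumption.

```python
def parse_warc_fields(payload):
    """Parse 'name: value' lines of a warcinfo payload into a dict."""
    fields = {}
    for line in payload.splitlines():
        line = line.strip()
        # Assumption: lines starting with '#' are comments, not fields.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


# Abbreviated samples taken from the slides.
AIT_INFO = """\
isPartOf: 4897-20141007162106207
robots: obey
"""

NAS_INFO = """\
operator: Admin
isPartOf: forsider
robots: ignore
http-header-from: info@netarkivet.dk
#added by NetarchiveSuite Version: 4.2.0 status RELEASE
harvestInfo.jobId: 209126
harvestInfo.orderXMLName: forsider
"""

ait = parse_warc_fields(AIT_INFO)
nas = parse_warc_fields(NAS_INFO)

# Field names that NAS writes but the (abbreviated) AIT sample lacks.
only_in_nas = sorted(set(nas) - set(ait))
print(only_in_nas)
```

Comparing key sets rather than values keeps the focus on format differences, which is what the first question at the end of the deck is about.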
5
Usage of request and metadata records

Seems to be used by Umbra

Usage:
Request: documentation of the HTTP request
Metadata: documentation of links extracted from the actual Target-URI, using Heritrix format

How to activate in NAS settings: false true …..
6
Request record (not activated by default in DK NAS)

The content part represents the HTTP request header.

WARC-Type: request
WARC-Target-URI: https://www.facebook.com/ajax/pagelet/generic.php/....
WARC-Date: 2014-10-27T12:35:43Z
WARC-Concurrent-To:
WARC-Record-ID:
Content-Type: application/http; msgtype=request
Content-Length: 1006

GET /ajax/pagelet/generic.php/PhotoViewerInitPagelet?ajaxpipe=….
Connection: Close
Referer: https://www.facebook.com/socialdemokraterne
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Language: *
X-DevTools-Emulate-Network-Conditions-Client-Id: 7E480C0F-86C9-DFE2-ACB1-ADADDA7F813A
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36
Host: www.facebook.com
Cookie: datr=FzxOVIKQ4RcCgDMX0qmEwTn4; reg_fb_gate=https%3A%2F%2Fwww.facebook.com%2Fsocialdemokraterne; reg_fb_ref=https%3A%2F%2Fdevelopers.facebook.com%2F%3Fref%3Dpf
7
Metadata record (not activated by default in DK NAS)

The content part is defined by Heritrix.

WARC-Type: metadata
WARC-Target-URI: https://www.facebook.com/ajax/pagelet/....
WARC-Date: 2014-10-27T12:35:43Z
WARC-Concurrent-To:
WARC-Record-ID:
Content-Type: application/warc-fields
Content-Length: 12403

via: https://www.facebook.com/socialdemokraterne
hopsFromSeed: I
sourceTag: https://www.facebook.com/socialdemokraterne
fetchTimeMs: 423
charsetForLinkExtraction: UTF-8
outlink: whois:173.252.74.22
outlink: whois:facebook.com
outlink: https://www.facebook.com/favicon.ico
outlink: http://facebook.com/
outlink: https://fbcdn-profile-a.akamaihd.net/hprofile-ak-xaf1/v/t1.0 …..
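Because the metadata payload repeats the `outlink` field once per extracted link, a plain dictionary would keep only the last one. The sketch below collects outlinks into a list and the remaining fields into a dictionary. The function name `parse_metadata_payload` is hypothetical, and taking only the first whitespace-separated token of each `outlink` value as the URI is an assumption based on the sample shown on the slide.

```python
def parse_metadata_payload(payload):
    """Split a Heritrix-style metadata payload into single fields and outlinks."""
    fields = {}
    outlinks = []
    for line in payload.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue
        name, value = name.strip(), value.strip()
        if name == "outlink":
            # Assumption: the URI is the first token of the outlink value.
            if value:
                outlinks.append(value.split()[0])
        else:
            fields[name] = value
    return fields, outlinks


# Sample lines copied from the slide (truncated outlinks left out).
SAMPLE = """\
via: https://www.facebook.com/socialdemokraterne
hopsFromSeed: I
fetchTimeMs: 423
charsetForLinkExtraction: UTF-8
outlink: whois:facebook.com
outlink: https://www.facebook.com/favicon.ico
outlink: http://facebook.com/
"""

fields, outlinks = parse_metadata_payload(SAMPLE)
print(fields["fetchTimeMs"], len(outlinks))
```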
8
Revisit record is not supported in NAS

NAS deduplications are written only in the NAS crawl logs, not in the WARC files.
NAS deduplicates only non-HTML/text content.
NAS has a different indexing workflow.
9
Revisit record (not supported in DK NAS)

The content part contains the HTTP header for the revisited Target-URI.

WARC-Type: revisit
WARC-Target-URI: http://media.wix.com/robots.txt
WARC-Date: 2014-10-07T16:21:14Z
WARC-Payload-Digest: sha1:L2WKNE6KZDY5XAUHGDRG4BW2KTMKKM5I
WARC-IP-Address: 107.178.250.202
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Truncated: length
WARC-Refers-To-Target-URI: http://static.wixstatic.com/robots.txt
WARC-Refers-To-Date: 2014-10-07T16:21:13Z
WARC-Refers-To:
WARC-Record-ID:
Content-Type: application/http; msgtype=response
Content-Length: 790

HTTP/1.0 200 OK
X-Seen-By: us-central1-a-mediaserver-pool-2439c-9xx2.c.wixpop-gce.internal-dispatcher_dsp
Expires: Tue, 14 Oct 2014 16:21:14 GMT
Date: Tue, 07 Oct 2014 16:21:14 GMT
…….
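A reader of such a record has to follow the `WARC-Refers-To-*` headers back to the capture that holds the payload. The sketch below shows that lookup on the header block from the slide. The function names `parse_warc_headers` and `revisit_origin` are hypothetical, and headers whose values are missing on the slide are left out of the sample.

```python
IDENTICAL_PAYLOAD = (
    "http://netpreserve.org/warc/1.0/revisit/identical-payload-digest"
)


def parse_warc_headers(block):
    """Parse 'Name: value' WARC header lines into a dict."""
    headers = {}
    for line in block.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    return headers


def revisit_origin(headers):
    """Return (URI, date) of the original capture, or None if not applicable."""
    if headers.get("WARC-Type") != "revisit":
        return None
    if headers.get("WARC-Profile") != IDENTICAL_PAYLOAD:
        return None
    return (
        headers.get("WARC-Refers-To-Target-URI"),
        headers.get("WARC-Refers-To-Date"),
    )


# Header lines copied from the slide.
SAMPLE = """\
WARC-Type: revisit
WARC-Target-URI: http://media.wix.com/robots.txt
WARC-Date: 2014-10-07T16:21:14Z
WARC-Payload-Digest: sha1:L2WKNE6KZDY5XAUHGDRG4BW2KTMKKM5I
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Refers-To-Target-URI: http://static.wixstatic.com/robots.txt
WARC-Refers-To-Date: 2014-10-07T16:21:13Z
"""

headers = parse_warc_headers(SAMPLE)
origin = revisit_origin(headers)
print(origin)
```

In this sample the revisit of media.wix.com points at a capture of a different host, static.wixstatic.com, because the match is made on the payload digest rather than on the URI.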
10
Writing revisits in the crawl log, too

2014-10-27T12:35:39.661Z 200 3114 https://fbstatic-a.akamaihd.net/robots.txt IP https://fbstatic-a.akamaihd.net/rsrc.php/v2/yG/r/cIVQ7_V6Fiv.css text/plain #029 20141027123539416+209 sha1:ZSQCAWXRTTXDI73H6FCOXAX6ERE2RUPB https://www.facebook.com/socialdemokraterne duplicate:digest {"warcFileOffset":95661,"warcFilename":"ARCHIVEIT-4897-NONE-8162-20141027123532971-00000-wbgrp-svc110.us.archive.org-6444.warc.gz"}
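The crawl log entry above carries what is needed to find a deduplicated capture: the `duplicate:digest` annotation plus a JSON block with the WARC file name and offset. The sketch below pulls those out of the sample line. The field names in `FIELD_NAMES` are labels assumed from the column order of the sample, the function name `parse_crawl_log_line` is hypothetical, and the URLs and file name are written without the hyphen breaks that line wrapping introduced on the slide.

```python
import json

# Assumed labels for the whitespace-separated columns of the sample line.
FIELD_NAMES = [
    "timestamp", "status", "size", "uri", "hop_path", "via", "mime",
    "thread", "fetch_time", "digest", "source_tag", "annotations",
]


def parse_crawl_log_line(line):
    """Split one crawl log line into named fields plus the trailing JSON."""
    parts = line.split(None, len(FIELD_NAMES))
    entry = dict(zip(FIELD_NAMES, parts))
    extra = parts[len(FIELD_NAMES)].strip() if len(parts) > len(FIELD_NAMES) else ""
    entry["extra"] = json.loads(extra) if extra.startswith("{") else {}
    annotations = entry.get("annotations", "").split(",")
    entry["is_duplicate"] = "duplicate:digest" in annotations
    return entry


SAMPLE = (
    "2014-10-27T12:35:39.661Z 200 3114 "
    "https://fbstatic-a.akamaihd.net/robots.txt IP "
    "https://fbstatic-a.akamaihd.net/rsrc.php/v2/yG/r/cIVQ7_V6Fiv.css "
    "text/plain #029 20141027123539416+209 "
    "sha1:ZSQCAWXRTTXDI73H6FCOXAX6ERE2RUPB "
    "https://www.facebook.com/socialdemokraterne duplicate:digest "
    '{"warcFileOffset":95661,"warcFilename":"ARCHIVEIT-4897-NONE-8162-'
    '20141027123532971-00000-wbgrp-svc110.us.archive.org-6444.warc.gz"}'
)

entry = parse_crawl_log_line(SAMPLE)
print(entry["is_duplicate"], entry["extra"]["warcFileOffset"])
```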
11
Questions

1. Should we change our format and what NAS writes in the WARC-Info header?
2. Should NAS by default activate request and metadata records?
3. Should we also write revisit records corresponding to NAS crawl log entries (and is it possible according to the WARC specification)?