Archive-it WARC usage - compared with NAS – and 3 Questions. By Tue Hejlskov Larsen, netarchive.dk January 2015.


1 Archive-it WARC usage - compared with NAS – and 3 Questions. By Tue Hejlskov Larsen, netarchive.dk January 2015

2 AIT WARC usage - compared with NAS, January 2015
- AIT = Archive-It using Heritrix 3.3.* + Umbra
- NAS = NetarchiveSuite 4.4 using Heritrix/1.14.4
- WARC records in focus: info, metadata, request, revisit (revisit is not supported by NAS)

3 WARC-Info - all common fields and values are omitted
AIT:
- isPartOf: 4897-20141007162106207 (our owner ID at Archive-It)
- description: recurrence=NONE, maxDuration=259200, maxDocumentCount=null, isTestCrawl=false, isPatchCrawl=false, oneTimeSubtype=ONE_TIME, seedCount=5, accountId=871, accountType=SUBSCRIBER, organizationName="The Royal Library - Denmark", collectionId=4897, collectionName="div_javascript", collectionPublic=false
- robots: obey (this is not correct; we are actually ignoring robots)
- http-header-user-agent: Mozilla/5.0 (compatible; archive.org_bot; Archive-It; +http://archive-it.org/files/site-owners.html
NAS:
- operator: Admin
- isPartOf: forsider (the name of the order.xml)
- description: Special template harvesting only seeds
- robots: ignore
- http-header-user-agent: Mozilla/5.0 (compatible; heritrix/1.14.4 +http://netarkivet.dk/webcrawler/)
- http-header-from: info@netarkivet.dk
NAS WARC-Info continues on the next page

4 NAS continued WARC-Info (Not in AIT)
- #added by NetarchiveSuite Version: 4.2.0 status RELEASE
- harvestInfo.version: 0.4
- harvestInfo.jobId: 209126
- harvestInfo.priority: HIGHPRIORITY
- harvestInfo.harvestNum: 56
- harvestInfo.origHarvestDefinitionID: 122
- harvestInfo.maxBytesPerDomain: 14000000000
- harvestInfo.maxObjectsPerDomain: -1
- harvestInfo.orderXMLName: forsider
- harvestInfo.origHarvestDefinitionName: Engagnshøstninger
- harvestInfo.scheduleName: Enkelt_gang
- harvestInfo.harvestFilenamePrefix: 209126-122
- harvestInfo.jobSubmitDate: 2014-06-12T11:30:24
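The WARC-Info comparison on slides 3 and 4 can be reproduced mechanically. The following is a minimal sketch, assuming the warcinfo payloads are plain `name: value` lines (the application/warc-fields layout) and that lines starting with `#` are comments, as the NAS example suggests. The function name `parse_warc_fields` and the abbreviated sample payloads are illustrative, not part of either tool.

```python
def parse_warc_fields(payload: str) -> dict:
    """Parse 'name: value' lines into a dict, skipping blanks and '#' comments.

    Assumption: each field name occurs once, which holds for the warcinfo
    examples on the slides. Repeated names would overwrite each other here.
    """
    fields = {}
    for raw_line in payload.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so URLs and timestamps stay intact.
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


# Abbreviated payloads taken from slides 3 and 4.
AIT_INFO = """\
isPartOf: 4897-20141007162106207
robots: obey
"""

NAS_INFO = """\
operator: Admin
isPartOf: forsider
robots: ignore
http-header-from: info@netarkivet.dk
#added by NetarchiveSuite Version: 4.2.0 status RELEASE
harvestInfo.version: 0.4
harvestInfo.jobId: 209126
"""

ait = parse_warc_fields(AIT_INFO)
nas = parse_warc_fields(NAS_INFO)

# Fields that only NAS writes, and shared fields whose values differ.
only_in_nas = sorted(set(nas) - set(ait))
differing = {
    name: (ait[name], nas[name])
    for name in ait.keys() & nas.keys()
    if ait[name] != nas[name]
}
```

With these samples, `only_in_nas` lists the NAS-specific fields (including the `harvestInfo.*` block) and `differing` shows the `robots` and `isPartOf` mismatch.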

5 Usage of request and metadata records
- Seems to be used by Umbra
Usage
- Request: documentation of the HTTP request
- Metadata: documentation of links extracted from the actual Target-URI, using Heritrix format
How to activate in NAS settings:
  false
  true
  …..

6 Request record (Not activated by default in DK NAS)
Content part represents the HTTP request header
WARC-Type: request
WARC-Target-URI: https://www.facebook.com/ajax/pagelet/generic.php/....
WARC-Date: 2014-10-27T12:35:43Z
WARC-Concurrent-To:
WARC-Record-ID:
Content-Type: application/http; msgtype=request
Content-Length: 1006
GET /ajax/pagelet/generic.php/PhotoViewerInitPagelet?ajaxpipe=….
Connection: Close
Referer: https://www.facebook.com/socialdemokraterne
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Language: *
X-DevTools-Emulate-Network-Conditions-Client-Id: 7E480C0F-86C9-DFE2-ACB1-ADADDA7F813A
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36
Host: www.facebook.com
Cookie: datr=FzxOVIKQ4RcCgDMX0qmEwTn4; reg_fb_gate=https%3A%2F%2Fwww.facebook.com%2Fsocialdemokraterne; reg_fb_ref=https%3A%2F%2Fdevelopers.facebook.com%2F%3Fref%3Dpf

7 Metadata record (Not activated by default in DK NAS)
Content part defined by Heritrix:
WARC-Type: metadata
WARC-Target-URI: https://www.facebook.com/ajax/pagelet/....
WARC-Date: 2014-10-27T12:35:43Z
WARC-Concurrent-To:
WARC-Record-ID:
Content-Type: application/warc-fields
Content-Length: 12403
via: https://www.facebook.com/socialdemokraterne
hopsFromSeed: I
sourceTag: https://www.facebook.com/socialdemokraterne
fetchTimeMs: 423
charsetForLinkExtraction: UTF-8
outlink: whois:173.252.74.22
outlink: whois:facebook.com
outlink: https://www.facebook.com/favicon.ico
outlink: http://facebook.com/
outlink: https://fbcdn-profile-a.akamaihd.net/hprofile-ak-xaf1/v/t1.0.....https://fbcdn-profile-a.akamaihd.net/hprofile-ak-xaf1/v/t1.0 …..
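To make use of the extracted links documented in a metadata record, the repeated `outlink:` fields have to be collected from the content part. Below is a minimal sketch, assuming the content part is available as text. The helper name `extract_outlinks` is hypothetical, and taking only the first whitespace-separated token of each value is a defensive assumption in case a crawler appends hop or context information after the URI (the slide shows the URI alone).

```python
def extract_outlinks(payload: str) -> list:
    """Collect the target of every 'outlink:' field, in document order."""
    links = []
    for raw_line in payload.splitlines():
        name, sep, value = raw_line.partition(":")
        if sep and name.strip().lower() == "outlink":
            tokens = value.split()
            if tokens:
                # Assumption: the URI is the first token of the field value.
                links.append(tokens[0])
    return links


# Content part abbreviated from slide 7 (the truncated last outlink is left out).
METADATA = """\
via: https://www.facebook.com/socialdemokraterne
hopsFromSeed: I
fetchTimeMs: 423
charsetForLinkExtraction: UTF-8
outlink: whois:173.252.74.22
outlink: whois:facebook.com
outlink: https://www.facebook.com/favicon.ico
outlink: http://facebook.com/
"""

outlinks = extract_outlinks(METADATA)

# Keep only web resources; the whois: entries are crawler lookups, not pages.
web_links = [uri for uri in outlinks if uri.startswith(("http://", "https://"))]
```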

8 Revisit record is not supported in NAS
- NAS deduplication results are only written to the NAS crawl logs, not to the WARC files.
- NAS deduplicates only non-html/txt content.
- NAS has a different indexing workflow.
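The rule that only non-html/txt content is deduplicated can be pictured as a MIME-type filter. This is only a sketch under an explicit assumption: "html/txt" is read here as any MIME type under `text/`. The slides do not state how NAS actually configures the filter, and `is_dedup_candidate` is a hypothetical name.

```python
def is_dedup_candidate(mime_type: str) -> bool:
    """Return True if a response of this MIME type would be deduplicated.

    Assumption: 'non html/txt' means every type outside text/*.
    """
    # Drop parameters such as '; charset=UTF-8' before comparing.
    base_type = mime_type.split(";", 1)[0].strip().lower()
    return not base_type.startswith("text/")


examples = {
    mime: is_dedup_candidate(mime)
    for mime in ("text/html; charset=UTF-8", "text/plain", "image/png", "application/pdf")
}
```

Under this reading, the `text/plain` robots.txt files that AIT records as revisits on the following slides would not be deduplicated by NAS at all.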

9 Revisit record (Not supported in DK NAS)
Content part contains the HTTP header for the revisit Target-URI:
WARC-Type: revisit
WARC-Target-URI: http://media.wix.com/robots.txt
WARC-Date: 2014-10-07T16:21:14Z
WARC-Payload-Digest: sha1:L2WKNE6KZDY5XAUHGDRG4BW2KTMKKM5I
WARC-IP-Address: 107.178.250.202
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Truncated: length
WARC-Refers-To-Target-URI: http://static.wixstatic.com/robots.txt
WARC-Refers-To-Date: 2014-10-07T16:21:13Z
WARC-Refers-To:
WARC-Record-ID:
Content-Type: application/http; msgtype=response
Content-Length: 790
HTTP/1.0 200 OK
X-Seen-By: us-central1-a-mediaserver-pool-2439c-9xx2.c.wixpop-gce.internal-dispatcher_dsp
Expires: Tue, 14 Oct 2014 16:21:14 GMT
Date: Tue, 07 Oct 2014 16:21:14 GMT
…….
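An indexing workflow that meets such a record needs to recognise it as a revisit and find the capture that holds the payload. The sketch below shows one way to do that from the WARC header fields alone, assuming the headers are available as text. The function names are illustrative, and the `WARC-Refers-To` and `WARC-Record-ID` values are left out because they are blank on the slide.

```python
IDENTICAL_PAYLOAD = "http://netpreserve.org/warc/1.0/revisit/identical-payload-digest"


def parse_headers(block: str) -> dict:
    """Parse WARC named fields ('Name: value' per line) into a dict."""
    headers = {}
    for raw_line in block.splitlines():
        # Split on the first colon so URIs and timestamps keep their colons.
        name, sep, value = raw_line.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    return headers


def revisit_target(headers: dict):
    """Return (uri, date) of the referred-to capture, or None if this is
    not an identical-payload-digest revisit record."""
    if headers.get("WARC-Type") != "revisit":
        return None
    if headers.get("WARC-Profile") != IDENTICAL_PAYLOAD:
        return None
    return (
        headers.get("WARC-Refers-To-Target-URI"),
        headers.get("WARC-Refers-To-Date"),
    )


# Header fields copied from slide 9.
REVISIT = """\
WARC-Type: revisit
WARC-Target-URI: http://media.wix.com/robots.txt
WARC-Date: 2014-10-07T16:21:14Z
WARC-Payload-Digest: sha1:L2WKNE6KZDY5XAUHGDRG4BW2KTMKKM5I
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Refers-To-Target-URI: http://static.wixstatic.com/robots.txt
WARC-Refers-To-Date: 2014-10-07T16:21:13Z
"""

headers = parse_headers(REVISIT)
target = revisit_target(headers)
```

In this example the revisit of `media.wix.com/robots.txt` points to a capture of a different URI, `static.wixstatic.com/robots.txt`, taken one second earlier, so a lookup by digest rather than by URI is what connects the two.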

10 Writing revisits in the crawllog, too
2014-10-27T12:35:39.661Z 200 3114 https://fbstatic-a.akamaihd.net/robots.txt IP https://fbstatic-a.akamaihd.net/rsrc.php/v2/yG/r/cIVQ7_V6Fiv.css text/plain #029 20141027123539416+209 sha1:ZSQCAWXRTTXDI73H6FCOXAX6ERE2RUPB https://www.facebook.com/socialdemokraterne duplicate:digest {"warcFileOffset":95661,"warcFilename":"ARCHIVEIT-4897-NONE-8162-20141027123532971-00000-wbgrp-svc110.us.archive.org-6444.warc.gz"}
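A crawl log entry like this one carries enough information to locate the duplicate in the WARC file. The following is a minimal sketch of reading it, assuming the whitespace-separated column order seen in the example (timestamp, status, size, URI, hop path, via, MIME type, thread, fetch time, digest, source tag, annotations, then a JSON object). The field names are my own labels, and the sample line assumes that the "- " gaps in the slide text are line-wrap artifacts and joins them.

```python
import json

# Labels for the twelve leading columns; the names are illustrative.
CRAWL_LOG_FIELDS = (
    "timestamp", "status", "size", "uri", "hop_path", "via",
    "mime_type", "thread", "fetch_time", "digest", "source_tag", "annotations",
)


def parse_crawl_log_line(line: str) -> dict:
    """Split one crawl log line into labelled fields plus the trailing JSON."""
    # Limit the split so the trailing JSON stays in one piece even if it
    # contains spaces.
    parts = line.strip().split(None, len(CRAWL_LOG_FIELDS))
    entry = dict(zip(CRAWL_LOG_FIELDS, parts))
    entry["status"] = int(entry["status"])
    entry["annotations"] = entry["annotations"].split(",")
    has_extra = len(parts) > len(CRAWL_LOG_FIELDS)
    entry["extra"] = json.loads(parts[-1]) if has_extra else {}
    return entry


LINE = (
    "2014-10-27T12:35:39.661Z 200 3114 "
    "https://fbstatic-a.akamaihd.net/robots.txt IP "
    "https://fbstatic-a.akamaihd.net/rsrc.php/v2/yG/r/cIVQ7_V6Fiv.css "
    "text/plain #029 20141027123539416+209 "
    "sha1:ZSQCAWXRTTXDI73H6FCOXAX6ERE2RUPB "
    "https://www.facebook.com/socialdemokraterne duplicate:digest "
    '{"warcFileOffset":95661,"warcFilename":"ARCHIVEIT-4897-NONE-8162-'
    '20141027123532971-00000-wbgrp-svc110.us.archive.org-6444.warc.gz"}'
)

entry = parse_crawl_log_line(LINE)
is_duplicate = "duplicate:digest" in entry["annotations"]
warc_offset = entry["extra"].get("warcFileOffset")
```

The `duplicate:digest` annotation marks the entry as a deduplicated fetch, and `warcFileOffset` with `warcFilename` give the position of the corresponding record.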

11 Questions
1. Should we change our format and what NAS writes in the WARC-Info header?
2. Should NAS by default activate request and metadata records?
3. Should we also write revisit records corresponding to NAS crawllog entries (and is it possible according to the WARC specification)?

