Standardizing the Recording of Arbitrary Duplicates in WARC Files
IIPC - Harvesting Working Group
2014 General Assembly - Paris
Kristinn Sigurðsson
The problem
The existing WARC specification only allows one record to reference another via its WARC-Record-ID, which the citing record gives in its WARC-Refers-To header
No one has an index of these IDs
Existing deduplication records typically do not have these references
Simple revisit records
Existing revisit records rely on the fact that the URI will always be the same, as will the content digest
For replay, it is assumed that the original is the newest non-revisit record for that URL
This works OK for URI-based deduplication
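The replay assumption above can be sketched as a small lookup. This is only an illustration, not how any particular replay tool implements it: the `IndexEntry` type and the `resolve_simple_revisit` function are hypothetical names, the index is a plain in-memory list, and the sketch additionally assumes that only captures dated at or before the revisit are candidates and that all WARC-Date values share one UTC timestamp format, so they can be compared as strings.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class IndexEntry:
    """Hypothetical index row; field names are illustrative only."""
    url: str        # WARC-Target-URI
    date: str       # WARC-Date, e.g. "2014-05-19T10:00:00Z"
    warc_type: str  # "response", "revisit", ...


def resolve_simple_revisit(revisit: IndexEntry,
                           index: Iterable[IndexEntry]) -> Optional[IndexEntry]:
    """Return the newest non-revisit record for the revisit's URL.

    Assumption (not stated on the slide): only captures dated at or
    before the revisit itself are considered.
    """
    candidates = [
        entry for entry in index
        if entry.warc_type != "revisit"
        and entry.url == revisit.url
        and entry.date <= revisit.date
    ]
    # Same-format UTC timestamps sort correctly as plain strings.
    return max(candidates, key=lambda entry: entry.date, default=None)
```

Because the lookup is keyed on the revisit's own URL, it cannot find an original that was captured under a different URL, which is the case the proposal addresses.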
The proposal
For WARC ‘revisit’ records with WARC-Profile set to ‘identical-payload-digest’, the following fields should be viewed as strongly recommended:
WARC-Refers-To-Target-URI
▫ This value should be equal to the WARC-Target-URI in the WARC record that the current record is considered a duplicate of.
WARC-Refers-To-Date
▫ This value should be equal to the WARC-Date in the WARC record that the current record is considered a duplicate of.
Additionally, the use of fields specifying the actual WARC file name and offsets where the record can be found should be discouraged, as it is potentially very brittle.
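As a rough illustration of the proposal, the sketch below builds the named headers for a revisit record and then uses the two recommended fields to look up the original. The helper names `revisit_headers` and `find_original`, the dictionary-based index keyed by (URI, date), and the opaque record handle it returns are all assumptions made for this example. The profile URI is the WARC 1.0 form of ‘identical-payload-digest’; check it against the specification version in use.

```python
from typing import Dict, Optional, Tuple

# WARC 1.0 profile URI for 'identical-payload-digest' (verify for your spec version).
REVISIT_PROFILE = "http://netpreserve.org/warc/1.0/revisit/identical-payload-digest"


def revisit_headers(target_uri: str, date: str, payload_digest: str,
                    original_uri: str, original_date: str) -> Dict[str, str]:
    """Build named headers for a revisit record (hypothetical helper)."""
    return {
        "WARC-Type": "revisit",
        "WARC-Target-URI": target_uri,
        "WARC-Date": date,
        "WARC-Profile": REVISIT_PROFILE,
        "WARC-Payload-Digest": payload_digest,
        # The two strongly recommended fields copy the original's URI and date.
        "WARC-Refers-To-Target-URI": original_uri,
        "WARC-Refers-To-Date": original_date,
    }


def find_original(headers: Dict[str, str],
                  index: Dict[Tuple[str, str], str]) -> Optional[str]:
    """Look up the original by (URI, date); the index layout is an assumption."""
    uri = headers.get("WARC-Refers-To-Target-URI")
    date = headers.get("WARC-Refers-To-Date")
    if uri is None or date is None:
        return None  # fall back to URI-based resolution elsewhere
    return index.get((uri, date))
```

Keying the lookup on URI and date means a replay tool can reuse the index it already keeps for ordinary captures, with no need for an index of WARC-Record-IDs or for file names and offsets stored in the record.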
Links
The proposal:
▫ https://docs.google.com/document/d/1QyQBA7Ykgxie75V8Jziz_O7hbhwf7PF6_u9O6w6zgp0/edit?usp=sharing
OpenWayback prescription:
▫ https://github.com/iipc/openwayback/wiki/How-OpenWayback-handles-revisit-records-in-WARC-files
Heritrix
▫ Duplicate handling in Heritrix: TQLMHaaY5pfrovpleKfeigYH7XtLzZb8F7ZsM/edit?usp=sharing
▫ A way forward: cI840/edit?usp=sharing