Standardizing the Recording of Arbitrary Duplicates in WARC Files IIPC - Harvesting Working Group 2014 General Assembly - Paris Kristinn Sigurðsson.


1 Standardizing the Recording of Arbitrary Duplicates in WARC Files IIPC - Harvesting Working Group 2014 General Assembly - Paris Kristinn Sigurðsson

2 The problem
▫ The existing WARC specification only allows a record to reference another record via its WARC-Record-ID, which the citing record carries in its WARC-Refers-To header.
▫ No one has an index of these IDs.
▫ Existing deduplication records typically do not have these references.

3 Simple revisit records
▫ Existing revisit records rely on the fact that the URI will always be the same, as will the content digest.
▫ For replay, it is assumed that the original is the newest non-revisit record for that URL.
▫ This works OK for URI-based deduplication.
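The replay assumption on this slide can be sketched as a lookup over a capture index. This is a minimal illustration, not how any particular replay tool does it: the `IndexEntry` type, its field names, and the `resolve_simple_revisit` helper are all hypothetical, and the index is assumed to be a plain in-memory list with WARC-Date values in the same ISO 8601 UTC format, so that string comparison orders them chronologically.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class IndexEntry:
    """One line of a hypothetical capture index (names are assumptions)."""
    uri: str           # WARC-Target-URI
    date: str          # WARC-Date, e.g. "2014-05-19T10:00:00Z"
    record_type: str   # WARC-Type, e.g. "response" or "revisit"
    payload_digest: str


def resolve_simple_revisit(index: Iterable[IndexEntry],
                           revisit: IndexEntry) -> Optional[IndexEntry]:
    """Return the newest non-revisit entry with the same URI, or None."""
    candidates = [
        entry for entry in index
        if entry.uri == revisit.uri and entry.record_type != "revisit"
    ]
    # Timestamps share one fixed-width UTC format, so comparing the
    # strings is equivalent to comparing the instants.
    return max(candidates, key=lambda entry: entry.date, default=None)
```

The lookup is keyed on the URI alone, which is why it only covers URI-based deduplication: a duplicate found under a different URI has no candidate to resolve to.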

4 The proposal
For WARC ‘revisit’ records with WARC-Profile set to ‘identical-payload-digest’, the following fields should be viewed as strongly recommended:
▫ WARC-Refers-To-Target-URI: This value should be equal to the WARC-Target-URI in the WARC record that the current record is considered a duplicate of.
▫ WARC-Refers-To-Date: This value should be equal to the WARC-Date in the WARC record that the current record is considered a duplicate of.
Additionally, the use of fields specifying the actual WARC file name and offsets where the record can be found should be discouraged, as it is potentially very brittle.
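As a rough sketch of what the proposal implies, the following builds the named header fields for a revisit record and then resolves the original through a URI-and-date index. The function names, the dictionary representation of headers, and the index keyed by `(URI, date)` are assumptions made for illustration. The profile value is the short name used on the slide; the WARC specification expresses profiles as full URIs.

```python
from typing import Dict, Optional, Tuple

# Hypothetical index: (WARC-Target-URI, WARC-Date) -> some record handle.
CaptureIndex = Dict[Tuple[str, str], str]


def build_revisit_headers(target_uri: str, date: str, payload_digest: str,
                          original_uri: str, original_date: str) -> Dict[str, str]:
    """Assemble revisit headers including the two recommended fields."""
    return {
        "WARC-Type": "revisit",
        "WARC-Target-URI": target_uri,
        "WARC-Date": date,
        "WARC-Payload-Digest": payload_digest,
        # Short profile name as written on the slide (see note above).
        "WARC-Profile": "identical-payload-digest",
        # Copied from the record this capture is a duplicate of.
        "WARC-Refers-To-Target-URI": original_uri,
        "WARC-Refers-To-Date": original_date,
    }


def find_original(index: CaptureIndex, headers: Dict[str, str]) -> Optional[str]:
    """Look up the original by the URI and date the revisit refers to."""
    key = (headers["WARC-Refers-To-Target-URI"], headers["WARC-Refers-To-Date"])
    return index.get(key)
```

Because the lookup key is a URI and a date, which replay indexes are normally organised by, the original can be found even when it was captured under a different URI, and nothing depends on a WARC file name or byte offset.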

5 Links
The proposal:
▫ https://docs.google.com/document/d/1QyQBA7Ykgxie75V8Jziz_O7hbhwf7PF6_u9O6w6zgp0/edit?usp=sharing
OpenWayback prescription:
▫ https://github.com/iipc/openwayback/wiki/How-OpenWayback-handles-revisit-records-in-WARC-files
Heritrix:
▫ Duplicate handling in Heritrix: https://docs.google.com/document/d/1vXcBK-TQLMHaaY5pfrovpleKfeigYH7XtLzZb8F7ZsM/edit?usp=sharing
▫ A way forward: https://docs.google.com/document/d/11F4zWiokFBcCuhmgye5lizQhdY5xUnoWAyZgd-cI840/edit?usp=sharing

