Published by Abel Davis. Modified over 9 years ago.

WARC standard revision workshop
Clément Oury
IIPC General Assembly open workshops
Stanford, April 28th, 2015
IIPC General Assembly – Stanford – April 28th, 2015

Summary of the presentation
Current status of the WARC standard
The revision process
Identify, discuss and prioritize revision needs
Set up an organization and agenda for further work

The WARC format
A container format designed to store any kind of digital content
– Along with relevant metadata
– Extension of the ARC format designed in 1996
WARC improvements
– Assigns a unique identifier to each record
– New record types:
    To describe the harvesting process: warcinfo, request, response, metadata records
    To store information on deduplication: revisit records
    To store segmented files: continuation records
    To record outputs of a file format migration: conversion records
    To record non-web material: resource records
– New named fields for each record
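
The record structure listed above (a unique identifier, a record type, and named fields wrapped around the stored content) can be sketched in a few lines of Python. This is a minimal illustration, not a reference writer: the helper name `build_warc_record`, the sample payload and the `file:///example/hello.txt` URI are assumptions made for the example, and only a handful of named fields are emitted.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(record_type, payload, target_uri=None, extra_fields=None):
    """Serialize one WARC/1.0-style record (hypothetical helper for illustration).

    Layout: version line, named fields, empty line, content block, two CRLFs.
    """
    fields = [
        ("WARC-Type", record_type),
        # Each record gets its own unique identifier, written as a URN in angle brackets.
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
    ]
    if target_uri is not None:
        fields.append(("WARC-Target-URI", target_uri))
    fields.extend(extra_fields or [])
    # Content-Length counts the bytes of the content block only.
    fields.append(("Content-Length", str(len(payload))))

    header = "WARC/1.0\r\n" + "".join(f"{name}: {value}\r\n" for name, value in fields) + "\r\n"
    return header.encode("utf-8") + payload + b"\r\n\r\n"


# A "resource" record holding non-web material (sample values are made up).
record = build_warc_record(
    "resource",
    b"hello, archive",
    target_uri="file:///example/hello.txt",
    extra_fields=[("Content-Type", "text/plain")],
)
print(record.decode("utf-8"))
```

The same helper could emit any of the record types named on the slide by changing `record_type`; what differs between types is mostly which named fields are expected.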

Usage of WARC format
Widely adopted by the web archiving community
– Most institutions have switched from ARC to WARC format
– Harvesting: Heritrix, Wget, WARCcreate
– Data management/preservation: JWAT, Jhove2
– Indexing and access: SOLR, Open Wayback
But also adopted beyond the web archiving community
– To store e-periodicals and e-books: LOCKSS project
– To store all files ingested in a long-term repository: Danish Bit Repository
Some usage issues are discussed in the WARC implementation guidelines

The WARC standard
Published as “ISO 28500” on May 15th, 2009
– Standardization process had started in 2006
– Mainly carried out by IIPC members under the ISO umbrella
ISO group: TC 46 / SC 4 / WG 12
– TC 46: Information and documentation
– SC 4: Technical interoperability
– WG 12: WARC file format
ISO standards are generally reviewed after 5 years
– ISO members voted in 2014 in favor of the revision

The revision process
A maximum period of 36 months
A two-step approach
– IIPC draft / IIPC WG
– ISO validated standard / ISO WG
Proposed agenda in 2015
– WARC revision workshop: now!
– June: presentation of the revision process during the TC 46 meeting
– May-September: first IIPC draft
– October (?): ISO WG meeting

The revision process – why?
Amend or improve the current standard, on several topics
– clarify potential ambiguities or inconsistencies in the standard;
– offer better solutions to record some information, e.g. by adding new named fields or even new record types;
– take into account some needs not identified when the original standard was designed (e.g. use of WARC for documents other than web archives);
– perform minor editorial revisions.
Afterwards, no change possible until the next revision!

Revision needs – active discussions
Clarification
– Is it allowed to add new named fields?
    New record types are allowed…
    But nothing is indicated on new named fields
Two new named fields for deduplication
– WARC-Refers-To-Target-URI
– WARC-Refers-To-Date
A proposal to record screenshots?
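
To make the deduplication proposal concrete, here is a small sketch of how a harvester might decide between a response record and a revisit record, and where the two proposed fields would point. The function names, the in-memory `seen` index and the sample URIs and dates are all assumptions for illustration; a real revisit record carries further fields (for instance WARC-Profile) that are left out here.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """SHA-1 of the payload, Base32-encoded, in the 'sha1:...' style crawlers commonly write."""
    return "sha1:" + base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


def dedup_fields(seen, uri, date, payload):
    """Return named fields for a capture (hypothetical helper).

    `seen` maps payload digest -> (URI, date) of the first capture with that payload.
    """
    digest = payload_digest(payload)
    if digest not in seen:
        # First time this payload is seen: store it in full as a response record.
        seen[digest] = (uri, date)
        return [
            ("WARC-Type", "response"),
            ("WARC-Target-URI", uri),
            ("WARC-Date", date),
            ("WARC-Payload-Digest", digest),
        ]
    # Same payload seen before: write a revisit record that points back to
    # the original capture by its URI and its date.
    original_uri, original_date = seen[digest]
    return [
        ("WARC-Type", "revisit"),
        ("WARC-Target-URI", uri),
        ("WARC-Date", date),
        ("WARC-Payload-Digest", digest),
        ("WARC-Refers-To-Target-URI", original_uri),
        ("WARC-Refers-To-Date", original_date),
    ]


seen = {}
body = b"<html>unchanged page</html>"
first = dict(dedup_fields(seen, "http://example.org/a", "2015-04-01T00:00:00Z", body))
second = dict(dedup_fields(seen, "http://example.org/b", "2015-04-28T00:00:00Z", body))
print(first["WARC-Type"], second["WARC-Type"])
print(second["WARC-Refers-To-Target-URI"], second["WARC-Refers-To-Date"])
```

Pointing back by URI and date lets an access tool locate the original capture through a URI/date index without having to resolve a record identifier.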

Revision needs – WARC for data mining
WAT: Web Archive Transformation
– Specified by Internet Archive to store metadata extracted from WARC files
– Metadata (HTML headers, HTML metadata, links…) recorded in metadata records with a JSON structure
WET: WARC Encapsulated Text
– Designed by Common Crawl
– Contains only text content extracted from WARC files
Official recommendation as informative appendix?
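
As a rough illustration of the WAT idea (metadata extracted from a captured page and stored as JSON), the sketch below pulls the title and outgoing links out of an HTML payload using only the Python standard library. The JSON keys (`target-uri`, `title`, `links`) are invented for this example and are not the actual WAT layout; the resulting JSON string is the kind of content that could form the block of a metadata record.

```python
import json
from html.parser import HTMLParser


class LinkAndTitleParser(HTMLParser):
    """Collects the <title> text and every <a href="..."> value."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(target_uri, html):
    """Return a JSON string describing the page (hypothetical layout, not the WAT schema)."""
    parser = LinkAndTitleParser()
    parser.feed(html)
    parser.close()
    return json.dumps(
        {"target-uri": target_uri, "title": parser.title.strip(), "links": parser.links},
        indent=2,
    )


sample_html = (
    "<html><head><title>Home</title></head>"
    '<body><a href="/about">About</a><a href="http://example.org/x">X</a></body></html>'
)
metadata_json = extract_metadata("http://example.org/", sample_html)
print(metadata_json)
```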

Revision needs – open questions
Is the WARC format suited for non-web material?
Is the WARC format suited for server-side archiving?
How to improve the use of unique IDs?

Next steps
Set up a working group: who’s in?
– Should we share the work?
What tools?
– Using IIPC GitHub?
Agenda?
– Phone calls?