WARC standard revision workshop. Clément Oury, IIPC General Assembly open workshops, Stanford, April 28th, 2015.

Presentation transcript:

1 WARC standard revision workshop
Clément Oury
IIPC General Assembly open workshops, Stanford, April 28th, 2015
IIPC General Assembly – Stanford – April 28th, 2015

2 Summary of the presentation
• Current status of the WARC standard
• The revision process
• Identify, discuss and prioritize revision needs
• Set up an organization and agenda for further work

4 The WARC format
• A container format designed to store any kind of digital content
  - Along with relevant metadata
  - Extension of the ARC format designed in 1996
• WARC improvements
  - Assigns a unique identifier to each record
  - New record types:
      To describe the harvesting process: warcinfo, request, response, metadata records
      To store information on deduplication: revisit records
      To store segmented files: continuation records
      To record outputs of a file format migration: conversion records
      To record non-web material: resource records
  - New named fields for each record
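
A minimal sketch of how one record in such a container might be serialized, assuming the WARC/1.0 layout of a version line, named fields, a blank line, the content block, and two trailing CRLFs. The helper name `build_warc_record` and the example URI and payload are hypothetical, and only a few of the named fields are shown.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(record_type: str, target_uri: str, payload: bytes,
                      content_type: str = "application/octet-stream") -> bytes:
    """Serialize one record: version line, named fields, blank line, block."""
    # Each record gets its own unique identifier (here a random UUID URN).
    record_id = f"<urn:uuid:{uuid.uuid4()}>"
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", record_id),
        ("WARC-Date", warc_date),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # The content block is followed by two CRLFs that close the record.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


# Hypothetical non-web document stored as a "resource" record.
record = build_warc_record("resource", "file:///reports/annual.pdf",
                           b"%PDF-1.4 example bytes")
print(record.decode("utf-8", errors="replace"))
```

The same helper could emit the other record types by changing `record_type`; real writers add further named fields (digests, concurrent-record links) that are left out here.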

5 Usage of the WARC format
• Widely adopted by the web archiving community
  - Most institutions have switched from the ARC to the WARC format
  - Harvesting: Heritrix, Wget, WARCreate
  - Data management/preservation: JWAT, JHOVE2
  - Indexing and access: Solr, OpenWayback
• But also adopted beyond the web archiving community
  - To store e-periodicals and e-books: LOCKSS project
  - To store all files ingested in a long-term repository: Danish Bit Repository
• Some usage issues are discussed in the WARC implementation guidelines

6 The WARC standard
• Published as “ISO ” on May 15th, 2009
  - The standardization process started in 2006
  - Mainly carried out by IIPC members under the ISO umbrella
• ISO group: TC 46 / SC 4 / WG 12
  - TC 46: Information and documentation
  - SC 4: Technical interoperability
  - WG 12: WARC file format
• ISO standards are generally reviewed after 5 years
  - ISO members voted in 2014 in favor of the revision

8 The revision process
• A maximum period of 36 months
• A two-step approach
  - IIPC draft / IIPC WG
  - ISO validated standard / ISO WG
• Proposed agenda in 2015
  - WARC revision workshop: now!
  - June: presentation of the revision process during the TC 46 meeting
  - May-September: first IIPC draft
  - October (?): ISO WG meeting

9 The revision process: why?
• Amend or improve the current standard on several topics
  - Clarify potential ambiguities or inconsistencies in the standard
  - Offer better solutions to record some information, e.g. by adding new named fields or even new record types
  - Take into account needs not identified when the original standard was designed (e.g. use of WARC for documents other than web archives)
  - Perform minor editorial revisions
• Afterwards, no change is possible until the next revision!

12 Revision needs: active discussions
• Clarification
  - Is it allowed to add new named fields?
      New record types are allowed, but nothing is said about new named fields
• Two new named fields for deduplication
  - WARC-Refers-To-Target-URI
  - WARC-Refers-To-Date
• A proposal to record screenshots?
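
A minimal sketch of how the two proposed fields could appear in a revisit record's header, assuming a harvester that detects duplicates by payload digest. The in-memory index `seen`, the function names, the example URIs, and the base32 SHA-1 digest format are illustrative assumptions, not part of the proposal itself; mandatory fields such as WARC-Record-ID and Content-Length are omitted for brevity.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Label a payload with a base32-encoded SHA-1 digest (assumed format)."""
    return "sha1:" + base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


def describe_capture(seen: dict, uri: str, date: str, payload: bytes) -> dict:
    """Return header fields for a capture: 'revisit' if the payload was seen before."""
    digest = payload_digest(payload)
    if digest in seen:
        original_uri, original_date = seen[digest]
        return {
            "WARC-Type": "revisit",
            "WARC-Target-URI": uri,
            "WARC-Date": date,
            "WARC-Payload-Digest": digest,
            # The two proposed fields point back at the original capture.
            "WARC-Refers-To-Target-URI": original_uri,
            "WARC-Refers-To-Date": original_date,
        }
    # First time this payload is seen: remember where and when it was captured.
    seen[digest] = (uri, date)
    return {
        "WARC-Type": "response",
        "WARC-Target-URI": uri,
        "WARC-Date": date,
        "WARC-Payload-Digest": digest,
    }


seen = {}
first = describe_capture(seen, "http://example.org/logo.png",
                         "2015-04-01T10:00:00Z", b"PNG bytes")
second = describe_capture(seen, "http://example.org/img/logo.png",
                          "2015-04-28T09:30:00Z", b"PNG bytes")
for name, value in second.items():
    print(f"{name}: {value}")
```

With the original URI and date carried in the revisit record, an access tool can locate the earlier capture even when the duplicate was found under a different URL.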

13 Revision needs: WARC for data mining
• WAT: Web Archive Transformation
  - Specified by Internet Archive to store metadata extracted from WARC files
  - Metadata (HTML headers, HTML metadata, links…) recorded in metadata records with a JSON structure
• WET: WARC Encapsulated Text
  - Designed by Common Crawl
  - Contains only text content extracted from WARC files
• Official recommendation as an informative appendix?
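
A minimal sketch of the WAT idea, assuming the goal is to extract a page's title, meta tags, and links and serialize them as JSON for the block of a metadata record. The JSON keys used here (`uri`, `title`, `meta`, `links`) are hypothetical and do not reproduce the actual WAT schema.

```python
import json
from html.parser import HTMLParser


class LinkAndMetaExtractor(HTMLParser):
    """Collect the title, named meta tags, and anchor targets of an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.metas = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.metas[attrs["name"]] = attrs["content"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(target_uri: str, html: str) -> str:
    """Return a JSON string describing the page (illustrative layout only)."""
    parser = LinkAndMetaExtractor()
    parser.feed(html)
    parser.close()
    return json.dumps({
        "uri": target_uri,
        "title": parser.title.strip(),
        "meta": parser.metas,
        "links": parser.links,
    }, sort_keys=True)


page = ('<html><head><title>Home</title><meta name="author" content="BnF">'
        '</head><body><a href="/about">About</a></body></html>')
wat_like = extract_metadata("http://example.org/", page)
print(wat_like)
```

A WET-style derivative would follow the same pattern but keep only the text passed to `handle_data`.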

14 Revision needs: open questions
• Is the WARC format suited for non-web material?
• Is the WARC format suited for server-side archiving?
• How to improve the use of unique IDs?

16 Next steps
• Set up a working group: who’s in?
  - Should we share the work?
• What tools?
  - Using the IIPC GitHub?
• Agenda?
  - Phone calls?