Transparent Format Migration of Preserved Web Content. D. S. H. Rosenthal, T. Lipkis, T. S. Robertson, S. Morabito. D-Lib Magazine, 11(1), 2005


Transparent Format Migration of Preserved Web Content
D. S. H. Rosenthal, T. Lipkis, T. S. Robertson, S. Morabito. D-Lib Magazine, 11(1), 2005
Slides by Frank McCown, Old Dominion University, March 17, 2005

Format Migration
- What is it?
  - Conversion of an older digital object (DO) format to a current format
- What other major digital preservation strategy could be used?
  - Emulation: the original DO format is preserved and presented to the user
- When should a DO be migrated to a new format?
  - Format change does not imply obsolescence

Format Obsolescence of Web Content
- A Web format is obsolete when widely used browsers can no longer present the content
- Backwards compatibility of browsers is a must
  - HTML 4 vs. XHTML
- Old Web formats die slowly
  - How many can you think of?
- Emulation is difficult to implement
  - Find older browser, original plug-in, etc.

Migration of Obsolete Formats
- Three migration points
  - Migration on ingest: convert all incoming objects into a selected format before preserving
  - Batch migration: convert all preserved objects into a new format when the preserved format is perceived to be obsolete
  - Migration on access: convert a preserved object into a new format on-the-fly when requested by a user

Migration Issues
- Keep the original format in case the conversion tool is later found to have a bug or to have lost vital information when converting
- The conversion tool should be preserved, both to document the original format and in case a bug is found in the tool
- Choose the migration format wisely: it can significantly reduce the need for, and cost of, future migrations

The LOCKSS System
- LOCKSS: Lots Of Copies Keep Stuff Safe™
- Developed at Stanford University
- Open source, P2P software
- Used by libraries to ensure web-accessible content (e-journals and open access material) remains available at all times
- Each peer collects material to preserve by crawling the publisher’s web site
- Peers continually perform content consistency checks and repair content when needed
- Preserved material is transparently presented to the user if the publisher’s copy is not available (using a web proxy)
- Currently used by 80 libraries worldwide

LOCKSS Format Migration
- Plug-in format converter registers its input/output MIME types
  - IANA MIME types
- LOCKSS web proxy uses plug-in converters to perform on-the-fly conversion of obsolete formats (migration on access)
- Converters are preserved along with web content among peers
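The idea of converters registering their input and output MIME types can be sketched as a small lookup table. This is only an illustration, not the LOCKSS plug-in API: the class name `ConverterRegistry`, its methods, and the placeholder converter `fake_gif_to_png` are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A converter takes the bytes of one format and returns bytes of another.
Converter = Callable[[bytes], bytes]


class ConverterRegistry:
    """Hypothetical registry keyed by (input MIME type, output MIME type)."""

    def __init__(self) -> None:
        self._converters: Dict[Tuple[str, str], Converter] = {}

    def register(self, input_type: str, output_type: str,
                 converter: Converter) -> None:
        # MIME type names are case-insensitive, so normalize to lower case.
        key = (input_type.lower(), output_type.lower())
        self._converters[key] = converter

    def outputs_for(self, input_type: str) -> List[str]:
        """List every output type some converter can produce from input_type."""
        wanted = input_type.lower()
        return sorted(out for (inp, out) in self._converters if inp == wanted)

    def find(self, input_type: str, output_type: str) -> Optional[Converter]:
        """Return the matching converter, or None if none is registered."""
        return self._converters.get((input_type.lower(), output_type.lower()))


def fake_gif_to_png(data: bytes) -> bytes:
    # Placeholder only: a real converter would decode the GIF and encode a PNG.
    return b"PNG:" + data


registry = ConverterRegistry()
registry.register("image/gif", "image/png", fake_gif_to_png)
```

A proxy built this way could ask `registry.outputs_for("image/gif")` to learn which replacement formats are available before deciding how to answer a request.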

Proof of Concept
- Convert “obsolete” GIF images to PNG
- Proxy Web server prevents MIME type image/gif from matching any Accept: header
- The mismatch prompts conversion, so content is delivered using the original URL but with Content-Type: image/png
- Images from Fig 1 and 2 at

HTTP Format Negotiation
- A browser can tell a web server that a format is obsolete by telling it not to send that format
- HTTP/1.1 defines how web servers and client browsers negotiate the format, language, and encoding of web content
- The browser sends its request with an Accept: header listing the acceptable MIME types of content format

Format Negotiation Examples
- Accept: text/plain;q=0.5, text/xml;q=0.8, text/html
  “I prefer text/html first, text/xml second, and finally text/plain.”
- Accept: */*;q=0.1
  “If you can’t give me what I want, give me what you have.”
- Accept: image/*, image/gif;q=0
  “Send me any kind of image except GIFs.”
- NOTE: q=0 semantics are not actually defined in HTTP/1.1
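The three examples can be checked with a short sketch that parses an Accept: header and looks up the quality value for a given MIME type. The function names `parse_accept` and `quality` are made up for illustration. The sketch assumes that the most specific matching media range wins, and it reads q=0 as "do not send this type", which is the interpretation the examples rely on.

```python
from typing import List, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Split an Accept header value into (media range, q) pairs."""
    ranges = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_range = parts[0].lower()
        if not media_range:
            continue
        q = 1.0  # a range without an explicit q parameter defaults to 1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    pass  # ignore a malformed q and keep the default
        ranges.append((media_range, q))
    return ranges


def quality(mime_type: str, header: str) -> float:
    """Return q for the most specific range matching mime_type, else 0.0."""
    mime_type = mime_type.lower()
    main_type = mime_type.split("/")[0]
    best_specificity, best_q = -1, 0.0
    for media_range, q in parse_accept(header):
        if media_range == mime_type:
            specificity = 2          # exact match, e.g. image/gif
        elif media_range == main_type + "/*":
            specificity = 1          # subtype wildcard, e.g. image/*
        elif media_range == "*/*":
            specificity = 0          # full wildcard
        else:
            continue
        if specificity > best_specificity:
            best_specificity, best_q = specificity, q
    return best_q


text_prefs = "text/plain;q=0.5, text/xml;q=0.8, text/html"
no_gifs = "image/*, image/gif;q=0"
```

With these definitions, `quality("image/gif", no_gifs)` evaluates to 0.0 while `quality("image/png", no_gifs)` evaluates to 1.0, so a server following this reading would send any image type except GIF.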

Format Negotiation Illustration
[Figure: Browser, LOCKSS Proxy with GIF to PNG Converter, and Web Server]
- HTTP Request: Accept: */*;q=0.1, image/gif;q=0
  “I’ll take whatever you have except obsolete GIF images.”
- “All I have are GIFs. I’ll convert them to a format the browser can handle.”
- HTTP Response: Content-Type: image/png
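The exchange in the illustration can be sketched as a single decision function on the proxy side. This is a simplified model and not LOCKSS code: the names `accept_quality`, `CONVERTERS`, and `serve` are hypothetical, the converter is a placeholder that does no real image conversion, and q=0 is again read as "do not send this type".

```python
from typing import Callable, Dict, Optional, Tuple


def accept_quality(mime_type: str, accept_header: str) -> float:
    """q value of the most specific media range matching mime_type (0.0 if none)."""
    main_type = mime_type.split("/")[0]
    ranked = {mime_type: 2, main_type + "/*": 1, "*/*": 0}
    best_rank, best_q = -1, 0.0
    for item in accept_header.split(","):
        parts = [p.strip().lower() for p in item.split(";")]
        if parts[0] not in ranked:
            continue
        q = 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                q = float(param[2:])
        if ranked[parts[0]] > best_rank:
            best_rank, best_q = ranked[parts[0]], q
    return best_q


# Hypothetical converter table: (stored type, delivered type) -> function.
CONVERTERS: Dict[Tuple[str, str], Callable[[bytes], bytes]] = {
    ("image/gif", "image/png"): lambda data: b"PNG-PLACEHOLDER:" + data,
}


def serve(stored_type: str, body: bytes,
          accept_header: str) -> Optional[Tuple[str, bytes]]:
    """Return (Content-Type, body) to send, or None if nothing is acceptable."""
    if accept_quality(stored_type, accept_header) > 0:
        return stored_type, body  # the preserved format is still acceptable
    # Otherwise pick the acceptable output type the browser likes best.
    best = None
    for (inp, out), convert in CONVERTERS.items():
        q = accept_quality(out, accept_header)
        if inp == stored_type and q > 0 and (best is None or q > best[0]):
            best = (q, out, convert)
    if best is None:
        return None  # a real proxy might answer 406 Not Acceptable here
    return best[1], best[2](body)


reply = serve("image/gif", b"GIF89a", "*/*;q=0.1, image/gif;q=0")
```

For the request shown in the illustration, `reply` carries the content type image/png, while a browser that still lists image/gif as acceptable receives the preserved GIF unchanged.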

Future Work for LOCKSS
- Replace the proof-of-concept implementation with a complete implementation, including an API for plug-in converters
- Use a format migration service like TOM
- Use JHOVE format metadata extraction and validation technology to improve the quality of format metadata

TOM (Typed Object Model)
- Came from John Ockerbloom’s Ph.D. thesis at Carnegie Mellon
- Currently managed by developers at the Univ of Pennsylvania Library, led by Ockerbloom
- Addresses the problem that the growing number of new and obsolete data formats makes using digital information problematic
- TOM makes it possible to
  - Explain a data format
  - Interpret the format for proper data extraction
  - Convert the format into other formats

TOM
- Two components
  - Data model that describes data formats and the operations that can be performed on them
  - Networked software that supports the description and operations of the data formats
- Figure from

TOM Applications
- TOM example broker: bin/typebrowse/showtype?broker=tom%2elibrary%2eupenn%2eedu&
- TOM Conversion Service
  - Could be used by LOCKSS for format migration
- Fred (Format Registry Demonstration)

JHOVE
- JSTOR/Harvard Object Validation Environment
- Provides functions to perform format-specific identification, validation, and characterization of digital objects
  - Identification: What is the format of my digital object?
  - Validation: Is my digital object really of type X?
  - Characterization: What are the significant properties of my digital object of type X?
- GIF example
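The three questions can be made concrete with a toy sketch for GIF and PNG. This is not JHOVE and does not use its API. The functions `identify`, `validate`, and `characterize_gif` are hypothetical, and the "validation" here only compares file signatures, which is far weaker than a real format validator.

```python
import struct
from typing import Dict, Optional, Union

# File signatures ("magic numbers") found at the start of GIF and PNG files.
SIGNATURES = {
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
    b"\x89PNG\r\n\x1a\n": "image/png",
}


def identify(data: bytes) -> Optional[str]:
    """Identification: guess the MIME type from the leading bytes."""
    for magic, mime_type in SIGNATURES.items():
        if data.startswith(magic):
            return mime_type
    return None


def validate(data: bytes, claimed_type: str) -> bool:
    """Validation (signature check only): does the data match the claimed type?"""
    return identify(data) == claimed_type


def characterize_gif(data: bytes) -> Dict[str, Union[str, int]]:
    """Characterization: read the version and logical screen size of a GIF."""
    if identify(data) != "image/gif" or len(data) < 10:
        raise ValueError("not a GIF or header too short")
    # After the 6-byte signature come width and height as little-endian uint16.
    width, height = struct.unpack("<HH", data[6:10])
    return {"version": data[3:6].decode("ascii"), "width": width, "height": height}


# A hand-built GIF header: signature, 4x3 logical screen, three zero bytes.
sample = b"GIF89a" + struct.pack("<HH", 4, 3) + b"\x00\x00\x00"
properties = characterize_gif(sample)
```

Here `identify(sample)` reports image/gif, and `validate(sample, "image/png")` is false, which is the kind of mismatch a repository would want to catch before storing an object under the wrong format label.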

JHOVE Use in Repository
- Figure from
- Submission Information Package (SIP) - OAIS

JHOVE and LOCKSS
- JHOVE generates reliable format metadata
- LOCKSS can use JHOVE to extract quality metadata about the contents of its repository
  - What if the object to store is not valid?
- It may be easier to write a conversion tool using JHOVE to supply format metadata

Conclusion
- The goal is to ensure obsolete formats will not make current LOCKSS content inaccessible
- Migration on access can be done transparently to the user
- A format migration service like TOM can be used to perform conversions
- Use of JHOVE would improve the quality of LOCKSS content metadata